
# Table of Contents

>## 1. Introduction to Probabilistic Machine Learning
* 1.1. Mathematical Models
* 1.2. Linear in the Parameters Regression
* 1.3. Likelihood and Noise
* 1.4. Probability Basics
* 1.5. Bayesian Inference and Prediction in Finite Regression Models

>## 2. Gaussian Process
* 2.1. Gaussian Distribution
* 2.2. Gaussian Process
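* 2.3. Non-Parametric Gaussian Process Models
* 2.4. GP Marginal Likelihood and Hyperparameters
* 2.5. Linear in the Parameters Models and GP
* 2.6. Finite and Infinite Basis GPs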

# 1. Introduction to Probabilistic Machine Learning

## 1.1. Mathematical Models
*Essentially, all models are wrong, but some are useful* - George E. P. Box

* **Aim to:**

>1. Make predictions
>2. Generalise from observations (inter/extrapolation)
>3. Understand and interpret statistical relationships
>4. Generate more data, from a similar distribution

* **Originate from:**

>1. **First Principles** (e.g. Newtonian mechanics)
>2. **Observations** (data)
>  * $\Rightarrow$ ***Machine Learning***: relies significantly on data

* **Rely on:**

>1. **Knowledge** (expressed through ***priors***)
>2. **Assumptions** (e.g. conditional independence)
>3. **Simplifying assumptions** (if they are 'good enough')

* **Terminology:**

><img src="images/image01.png" width=300>
>
>* $y$: observations
>* $x$: unobserved (hidden, or latent) variables, whose number grows with the data
>* $A$: parameter 1, for transitions
>* $C$: parameter 2, for emissions (the number of parameters is fixed)

## 1.2. Linear in the Parameters Regression

* **Model**

>$$f_w(x)=\sum^{M}_{j=0}{w_j\Phi_j(x)} \;\;\; \text{where} \;\;\; \Phi_j(x)=x^j$$
>
>$$\widehat{y}=\Phi \mathbf{w}$$

* **Least Squares Fit**

>$$\text{cost:} \;\;\; \text{E}(\mathbf{w})=(y-\widehat{y})^T(y-\widehat{y})=(y-\Phi \mathbf{w})^T(y-\Phi \mathbf{w})$$

>$$\frac{\partial \text{E}(\mathbf{w})}{\partial \mathbf{w}}=0 \;\;\; \Rightarrow \;\;\; \widehat{\mathbf{w}}=(\Phi^T\Phi)^{-1}\Phi^Ty$$
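
As a minimal sketch of the least squares fit above, assuming NumPy and the polynomial basis $\Phi_j(x)=x^j$, the weights can be found by solving the normal equations. The helper names `design_matrix` and `least_squares_fit` and the toy data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def design_matrix(x, M):
    """Return Phi with Phi[n, j] = x[n]**j for j = 0..M (polynomial basis)."""
    return np.vander(x, M + 1, increasing=True)

def least_squares_fit(x, y, M):
    """Solve the normal equations (Phi^T Phi) w = Phi^T y for w."""
    Phi = design_matrix(x, M)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Hypothetical toy data: noiseless samples of y = 1 + 2x - 3x^2
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2

w_hat = least_squares_fit(x, y, M=2)      # estimated weights
y_hat = design_matrix(x, 2) @ w_hat       # fitted values, y_hat = Phi w
```

Solving the linear system avoids forming the explicit inverse $(\Phi^T\Phi)^{-1}$; for an ill-conditioned $\Phi$, `np.linalg.lstsq` is usually the safer choice.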

* **Overfitting**

><img src="images/image02.png" width=400>

## 1.3. Likelihood and Noise

><img src="images/image03.png" width=400>

* **Gaussian Noise Assumption:**

>$$\boldsymbol{\epsilon}\; \sim \; \mathcal{N}(\boldsymbol{\epsilon};\textbf{0},\sigma^2_{noise} \textbf{I})$$

* **Maximum Likelihood Estimate:**

>$$p(\mathbf{y}|\mathbf{f}, {\sigma}^2_{noise})=\mathcal{N}(\mathbf{y};\mathbf{f},\sigma^2_{noise}\mathbf{I})=\left( \frac{1}{\sqrt{2\pi \sigma^2_{noise}}} \right)^N \exp{\left( -\frac{||\mathbf{y}-\mathbf{f}||^2}{2\sigma^2_{noise}}\right)}$$

>* **ML solution = Least squares solution**
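
The equivalence can be seen by taking the logarithm of the likelihood above with $\mathbf{f}=\Phi\mathbf{w}$. This is a short worked step that uses only the symbols already defined in Sections 1.2 and 1.3:

>$$\log p(\mathbf{y}|\mathbf{f},\sigma^2_{noise}) = -\frac{N}{2}\log(2\pi\sigma^2_{noise}) - \frac{||\mathbf{y}-\Phi\mathbf{w}||^2}{2\sigma^2_{noise}} = -\frac{N}{2}\log(2\pi\sigma^2_{noise}) - \frac{\text{E}(\mathbf{w})}{2\sigma^2_{noise}}$$

The first term does not depend on $\mathbf{w}$, so maximising the log likelihood over $\mathbf{w}$ is the same as minimising the least squares cost $\text{E}(\mathbf{w})$.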

## 1.4. Probability Basics
* **Sum Rule**

>$$p(A)=\sum_B{p(A,B)}\;\;\;\text{or}\;\;\;p(A)=\int_B{p(A,B)dB}$$

* **Product Rule**

>$$p(A,B)=p(A|B)p(B)$$

* **Bayes' Rule:**

>$$p(A|B)=\frac{p(A,B)}{p(B)}=\frac{p(B|A)p(A)}{p(B)}$$
>
>* $p(A)$: **marginal**
>* $p(B|A)$: **conditional**
>* $p(A,B)$: **joint**
>* If $A$ and $B$ are independent, $p(A,B)=p(A)p(B)$
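
The three rules can be checked numerically on a small discrete example. The sketch below assumes NumPy; the joint table `p_AB` is made up purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(A, B) over two binary variables
# (rows index A, columns index B); the numbers are illustrative only.
p_AB = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Sum rule: marginalise out the other variable
p_A = p_AB.sum(axis=1)              # p(A) = sum_B p(A, B)
p_B = p_AB.sum(axis=0)              # p(B) = sum_A p(A, B)

# Product rule rearranged: conditional = joint / marginal
p_A_given_B = p_AB / p_B            # column b holds p(A | B=b)
p_B_given_A = p_AB / p_A[:, None]   # row a holds p(B | A=a)

# Bayes' rule: p(A|B) = p(B|A) p(A) / p(B)
p_A_given_B_bayes = p_B_given_A * p_A[:, None] / p_B
```

Both routes to $p(A|B)$ give the same table, and each column of it sums to one.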

## 1.5. Bayesian Inference and Prediction in Finite Regression Models

* **Posterior:**

>$$p(\mathbf{w}|\mathbf{x},\mathbf{y},\mathcal{M})=\frac{p(\mathbf{w}|\mathcal{M})p(\mathbf{y}|\mathbf{x},\mathbf{w},\mathcal{M})}{p(\mathbf{y}|\mathbf{x},\mathcal{M})}=\mathcal{N}(\mathbf{w;μ,\Sigma})$$
>$$\;$$
>* **Gaussian Prior**: $p(\mathbf{w}|\mathcal{M})=\mathcal{N}(\mathbf{w};\textbf{0},\sigma^2_\mathbf{w}\mathbf{I})$
>$$\;$$
>* **Gaussian Likelihood**: $p(\mathbf{y}|\mathbf{x},\mathbf{w},\mathcal{M})=\mathcal{N}(\mathbf{y};\mathbf{Φw},\sigma^2_{noise}\mathbf{I})$
>$$\;$$
>* **Marginal Likelihood**: $p(\mathbf{y}|\mathbf{x},\mathcal{M})=\int{p(\mathbf{w}|\mathbf{x},\mathcal{M})p(\mathbf{y}|\mathbf{x},\mathbf{w},\mathcal{M})\text{d}\mathbf{w}}$
>$$\;$$
>$$\mathbf{μ}=\left( \mathbf{Φ}^T\mathbf{Φ} + \frac{\sigma^2_{noise}}{\sigma^2_{\mathbf{w}}}\mathbf{I} \right)^{-1}\mathbf{Φ}^T\mathbf{y} \;\;\;,\;\;\; \mathbf{Σ}=\left( \sigma^{-2}_{noise}\mathbf{Φ}^T\mathbf{Φ}+\sigma^{-2}_{\mathbf{w}}\mathbf{I}\right)^{-1}$$
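
A minimal sketch of the posterior computation, assuming NumPy and a polynomial design matrix. The function name `posterior_weights`, the toy data, and the chosen variances are illustrative assumptions.

```python
import numpy as np

def posterior_weights(Phi, y, sigma2_w, sigma2_noise):
    """Return (mu, Sigma) of N(w; mu, Sigma) for the prior N(0, sigma2_w I)."""
    n_basis = Phi.shape[1]
    # Sigma = (Phi^T Phi / sigma2_noise + I / sigma2_w)^{-1}
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2_noise + np.eye(n_basis) / sigma2_w)
    # mu = Sigma Phi^T y / sigma2_noise, algebraically equal to the formula for mu
    mu = Sigma @ Phi.T @ y / sigma2_noise
    return mu, Sigma

# Hypothetical toy data: a quadratic with additive Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 3, increasing=True)
y = Phi @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.standard_normal(30)

mu, Sigma = posterior_weights(Phi, y, sigma2_w=1.0, sigma2_noise=0.01)
```

As $\sigma^2_{\mathbf{w}}$ grows the prior becomes flat, and $\mathbf{μ}$ approaches the least squares solution from Section 1.2.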


* **Bayesian Prediction:**

>\begin{align}
p(y_*|x_*,\mathbf{x},\mathbf{y},\mathcal{M})&=\int{p(y_*,\mathbf{w}|\mathbf{x},\mathbf{y},x_*,\mathcal{M})\text{d}\mathbf{w}}\\
&= \int{p(y_*|\mathbf{w},x_*,\mathcal{M})p(\mathbf{w}|\mathbf{x},\mathbf{y},\mathcal{M})\text{d}\mathbf{w}}\\
&=\mathcal{N}(y_*;\mathbf{Φ}(x_*)^T\mathbf{μ},\mathbf{Φ}(x_*)^T\mathbf{Σ}\mathbf{Φ}(x_*)+\sigma^2_{noise})
\end{align}
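
The predictive mean and variance can be sketched as follows, assuming NumPy and the polynomial basis from Section 1.2. The function name `predict` and the toy data are illustrative assumptions.

```python
import numpy as np

def predict(x_star, x, y, M, sigma2_w, sigma2_noise):
    """Predictive mean and variance of y_* at a scalar test input x_star."""
    Phi = np.vander(x, M + 1, increasing=True)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2_noise + np.eye(M + 1) / sigma2_w)
    mu = Sigma @ Phi.T @ y / sigma2_noise
    phi_star = x_star ** np.arange(M + 1)              # Phi(x_*), length M+1
    mean = phi_star @ mu                               # Phi(x_*)^T mu
    var = phi_star @ Sigma @ phi_star + sigma2_noise   # weight uncertainty + noise
    return mean, var

# Hypothetical toy data: noiseless samples of y = 2x, fitted with a line (M = 1)
x = np.linspace(-1.0, 1.0, 20)
y = 2.0 * x
mean_in, var_in = predict(0.5, x, y, M=1, sigma2_w=100.0, sigma2_noise=0.01)
mean_out, var_out = predict(5.0, x, y, M=1, sigma2_w=100.0, sigma2_noise=0.01)
```

The variance never drops below $\sigma^2_{noise}$, and in this example it is larger at the extrapolated input `5.0` than at the interpolated input `0.5`.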

* **Evidence:** (marginal likelihood, used to select between models)

>\begin{align}
p(\mathcal{M}|\mathbf{x,y}) \propto p(\mathbf{y|\mathcal{M},x}) &= \int{p(\mathbf{w}|\mathbf{x},\mathcal{M})p(\mathbf{y}|\mathbf{x},\mathbf{w},\mathcal{M})\text{d}\mathbf{w}}\\
&=\mathcal{N}(\mathbf{y;0,\sigma^2_{w}ΦΦ}^T+\sigma^2_{noise}\mathbf{I})
\end{align}
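
A sketch of how the evidence could be evaluated to compare polynomial orders, assuming NumPy. The function name `log_evidence`, the toy data, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def log_evidence(x, y, M, sigma2_w, sigma2_noise):
    """log N(y; 0, sigma2_w Phi Phi^T + sigma2_noise I) for a degree-M polynomial."""
    N = len(y)
    Phi = np.vander(x, M + 1, increasing=True)
    C = sigma2_w * Phi @ Phi.T + sigma2_noise * np.eye(N)
    _, logdet = np.linalg.slogdet(C)          # log det C, computed stably
    quad = y @ np.linalg.solve(C, y)          # y^T C^{-1} y
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

# Hypothetical toy data: a noisy quadratic
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(30)

# Log evidence for model orders M = 0..5 (higher means better supported)
log_ev = {M: log_evidence(x, y, M, sigma2_w=10.0, sigma2_noise=0.01)
          for M in range(6)}
```

Working with the log evidence and `slogdet` avoids numerical underflow, since the evidence itself is typically a very small number.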


# 2. Gaussian Process

* **Basic Idea:**

>* In a parametric model, the model is represented using **parameters**
>* But the parameters are a **nuisance**
>* The aim is to work directly in the **space of functions**
>  * **Step 1.** Set up a model in terms of parameters
>  * **Step 2.** Marginalise out the parameters

## 2.1. Gaussian Distribution

* **Univariate:**

>$$p(x|\mu,\sigma^2)=(2\pi\sigma^2)^{-1/2}\exp{\left( -\frac{1}{2\sigma^2} (x-\mu)^2 \right)}$$

* **Multivariate:**

>$$p(\mathbf{x|μ,Σ})=\det{(2\pi\mathbf{Σ})}^{-1/2}\exp{ \left( -\frac{1}{2}(\mathbf{x-μ})^T\mathbf{Σ}^{-1}(\mathbf{x-μ})\right) }$$

* **Conditionals and Marginals are also Gaussian:**

><img src="images/image04.png" width=600>

* **Algebra:**

>$$p(\mathbf{x,y})=p \left( \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \right) = \mathcal{N} \left( \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} , \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^T & \mathbf{C} \end{bmatrix} \right)$$
>$$\;$$
>$$\Rightarrow p(\mathbf{x})=\mathcal{N}(\mathbf{a,A})$$
>$$\;$$
>$$\Rightarrow p(\mathbf{x}|\mathbf{y})=\mathcal{N}(\mathbf{a+BC}^{-1}(\mathbf{y-b}),\mathbf{A-BC}^{-1}\mathbf{B}^T)$$
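
The conditioning formula translates directly into code. The sketch below assumes NumPy; the function name `condition_gaussian` and the numbers in the example are illustrative assumptions.

```python
import numpy as np

def condition_gaussian(a, b, A, B, C, y):
    """Mean and covariance of p(x | y) for the joint Gaussian with blocks A, B, C."""
    gain = np.linalg.solve(C, B.T).T      # B C^{-1}, valid because C is symmetric
    mean = a + gain @ (y - b)
    cov = A - gain @ B.T
    return mean, cov

# Hypothetical 2-D example: unit variances, covariance 0.8 between x and y
a, b = np.array([0.0]), np.array([0.0])
A, B, C = np.array([[1.0]]), np.array([[0.8]]), np.array([[1.0]])

# Observing y = 1 shifts the mean of x to 0.8 and shrinks its variance to 0.36
mean, cov = condition_gaussian(a, b, A, B, C, y=np.array([1.0]))
```

The marginal needs no computation at all: $p(\mathbf{x})$ is read off the joint as the blocks $\mathbf{a}$ and $\mathbf{A}$.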

## 2.2. Gaussian Process

* **Definition:**

>* A **Gaussian Process** is a collection of random variables, any finite number of which have (consistent) Gaussian distributions
>* It is fully specified by the **mean function** $m(x)$ and **covariance function** $k(x,x')$
>
>$$f \; \sim \; \mathcal{GP}(m,k)$$
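
Because any finite set of function values is jointly Gaussian, a GP prior can be sampled on a grid of inputs. The sketch below assumes NumPy, a zero mean function, and a squared-exponential covariance function; the kernel choice, the `jitter` term, and the function names are assumptions made for illustration.

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') (an assumed choice of kernel)."""
    d = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_prior(x, n_samples, rng, jitter=1e-6):
    """Draw n_samples functions from GP(0, k), evaluated at the inputs x."""
    K = sq_exp_kernel(x, x) + jitter * np.eye(len(x))   # jitter for stability
    L = np.linalg.cholesky(K)                           # K = L L^T
    z = rng.standard_normal((len(x), n_samples))
    return L @ z                                        # each column ~ N(0, K)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
samples = sample_gp_prior(x, n_samples=3, rng=rng)
```

The "consistent" in the definition means that the covariance of any subset of the inputs is the corresponding sub-block of the full covariance matrix, which holds by construction when every entry comes from the same $k(x,x')$.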

* **Marginalisation Property:**

>$$p(\mathbf{x})=\int{p(\mathbf{x,y})\text{d}\mathbf{y}}=\mathcal{N}(\mathbf{a,A})$$

* **Sequential Generation:**

>$$p(f_n,f_{<n})= \mathcal{N} \left( \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} , \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^T & \mathbf{C} \end{bmatrix} \right)$$
>$$\;$$

>$$p(f_n|f_{<n})=\mathcal{N}(\mathbf{a+BC}^{-1}(f_{<n}-\mathbf{b}),\mathbf{A-BC}^{-1}\mathbf{B}^T)$$
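
Sequential generation can be sketched by drawing each $f_n$ from its Gaussian conditional given the values already drawn. The code assumes NumPy, a zero mean function (so $\mathbf{a}=\mathbf{b}=\mathbf{0}$), and a squared-exponential kernel; the kernel, the `jitter` term, and the function names are illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') (an assumed choice of kernel)."""
    d = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_sequentially(x, rng, jitter=1e-6):
    """Draw f_1..f_N one at a time from p(f_n | f_<n) under a zero-mean GP."""
    f = np.zeros(len(x))
    for n in range(len(x)):
        A = sq_exp_kernel(x[n:n + 1], x[n:n + 1])[0, 0] + jitter   # var of f_n
        if n == 0:
            mean, var = 0.0, A                  # nothing to condition on yet
        else:
            B = sq_exp_kernel(x[n:n + 1], x[:n])[0]                # cov(f_n, f_<n)
            C = sq_exp_kernel(x[:n], x[:n]) + jitter * np.eye(n)   # cov of f_<n
            w = np.linalg.solve(C, B)           # C^{-1} B^T
            mean = w @ f[:n]                    # B C^{-1} f_<n
            var = A - w @ B                     # A - B C^{-1} B^T
        f[n] = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal()
    return f

x = np.linspace(-3.0, 3.0, 30)
f = sample_sequentially(x, np.random.default_rng(0))
```

This produces a draw from the same joint distribution as sampling all values at once, but it re-solves a growing linear system at every step, so it is meant to illustrate the idea rather than to be efficient.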

## 2.3. Non-Parametric Gaussian Process Models

* In a non-parametric model, the **parameters** are the function itself
* **Gaussian Likelihood:**

>$$p(\mathbf{y}|\mathbf{x},f,\mathcal{M}_i) \;\sim\; \mathcal{N}(\mathbf{f},\sigma^2_{noise}\mathbf{I})$$

* **Gaussian Process Prior:**

>$$p(f|\mathcal{M}_i) \;\sim\; \mathcal{GP}(m \equiv 0,k)$$

* **Gaussian Process Posterior:**

>$$p(f|\mathbf{x},\mathbf{y},\mathcal{M}_i) \;\sim\; \mathcal{GP}(m_{post},k_{post})$$
>$$\;$$
>\begin{align}
m_{post}(x)&=\mathbf{k}(x,\mathbf{x})[K(\mathbf{x},\mathbf{x})+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{y}\\
k_{post}(x,x')&=k(x,x')-\mathbf{k}(x,\mathbf{x})[K(\mathbf{x},\mathbf{x})+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{k}(\mathbf{x},x')
\end{align}
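
A minimal sketch of the posterior mean and covariance functions, assuming NumPy and a squared-exponential kernel. The kernel choice, the function names, and the toy data are illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') (an assumed choice of kernel)."""
    d = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_test, x, y, sigma2_noise, kernel=sq_exp_kernel):
    """Evaluate m_post and k_post on the test inputs x_test."""
    K_noisy = kernel(x, x) + sigma2_noise * np.eye(len(x))   # K + sigma^2 I
    k_star = kernel(x_test, x)                               # rows are k(x_*, x)
    m_post = k_star @ np.linalg.solve(K_noisy, y)
    k_post = kernel(x_test, x_test) - k_star @ np.linalg.solve(K_noisy, k_star.T)
    return m_post, k_post

# Hypothetical toy data: three nearly noise-free observations of sin(x)
x = np.array([-1.0, 0.0, 1.0])
y = np.sin(x)
x_test = np.array([-1.0, 0.0, 1.0, 10.0])
m_post, k_post = gp_posterior(x_test, x, y, sigma2_noise=1e-4)
```

At the training inputs the posterior mean is close to the observations and the posterior variance is close to zero; far from the data (here at `x = 10`) the posterior reverts to the prior.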

* **Gaussian Predictive:**

>$$p(y_*|x_*,\mathbf{x},\mathbf{y},\mathcal{M}_i) \;\sim\; \mathcal{N}\left(\mathbf{k}(x_*,\mathbf{x})^T[K+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{y},\;k(x_*,x_*)+\sigma^2_{noise}-\mathbf{k}(x_*,\mathbf{x})^T[K+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{k}(x_*,\mathbf{x})\right)$$

* **Mean:** Linear in 2 Ways (in the observations $y_n$, and in the kernel functions $k(x_*,x_n)$)

>$$\mathbf{k}(x_*,\mathbf{x})^T[K(\mathbf{x},\mathbf{x})+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{y}=\sum^{N}_{n=1}{\beta_n y_n}=\sum^{N}_{n=1}{\alpha_nk(x_*,x_n)}$$
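
The two readings of the mean correspond to $\boldsymbol{\alpha}=[K+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{y}$ and $\boldsymbol{\beta}=[K+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{k}(x_*,\mathbf{x})$. The sketch below checks numerically that both sums agree, assuming NumPy and a squared-exponential kernel; the kernel and the data values are illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') (an assumed choice of kernel)."""
    d = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Hypothetical training data and test input
x = np.array([-2.0, -0.5, 0.3, 1.7])
y = np.array([0.1, -0.4, 0.9, 0.2])
x_star = 0.8
sigma2_noise = 0.1

K_noisy = sq_exp_kernel(x, x) + sigma2_noise * np.eye(len(x))
k_star = sq_exp_kernel(np.array([x_star]), x)[0]      # k(x_*, x_n) for each n

alpha = np.linalg.solve(K_noisy, y)        # depends on the data, not on x_*
beta = np.linalg.solve(K_noisy, k_star)    # depends on x_*, not on y

mean_kernel_sum = np.sum(alpha * k_star)   # sum_n alpha_n k(x_*, x_n)
mean_smoother = np.sum(beta * y)           # sum_n beta_n y_n
```

Since $\boldsymbol{\alpha}$ does not depend on $x_*$, it can be computed once and reused for every test input.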

* **Variance:** Difference between 2 Terms (for the latent function value $f(x_*)$, before adding $\sigma^2_{noise}$)

>$$k(x_*,x_*)-\mathbf{k}(x_*,\mathbf{x})^T[K+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{k}(x_*,\mathbf{x})$$
>* 1st term: **prior variance**
>* 2nd term: **how much the data $\mathbf{x}$ has explained**
>* **NOTE:** the variance is independent of the observed outputs $\mathbf{y}$
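
The sketch below evaluates the two terms separately, assuming NumPy and a squared-exponential kernel. The kernel, the function name `variance_terms`, and the input locations are illustrative assumptions. Note that the function takes no $\mathbf{y}$ argument at all.

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') (an assumed choice of kernel)."""
    d = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def variance_terms(x_star, x, sigma2_noise):
    """Return (prior variance, explained variance, posterior variance) at x_star."""
    xs = np.array([x_star])
    K_noisy = sq_exp_kernel(x, x) + sigma2_noise * np.eye(len(x))
    k_star = sq_exp_kernel(xs, x)[0]
    prior = sq_exp_kernel(xs, xs)[0, 0]                    # 1st term
    explained = k_star @ np.linalg.solve(K_noisy, k_star)  # 2nd term
    return prior, explained, prior - explained

x = np.array([-1.0, 0.0, 1.0])     # hypothetical training inputs
prior_near, explained_near, var_near = variance_terms(0.1, x, sigma2_noise=0.1)
prior_far, explained_far, var_far = variance_terms(5.0, x, sigma2_noise=0.1)
```

Near the training inputs the explained term is large and the variance is small; far from them the explained term vanishes and the variance returns to the prior value.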


## 2.4. GP Marginal Likelihood and Hyperparameters

## 2.5. Linear in the Parameters Models and GP

## 2.6. Finite and Infinite Basis GPs