# 2. Gaussian Process

## 2.1. Introduction

* **Posterior Distribution over Functions:**

>* Parametric model: the function is represented through **parameters**, which are nuisance variables (not of interest in themselves)
>* The aim is to work directly in the **space of functions**
>  * **Step 1.** Set up a model in terms of parameters
>  * **Step 2.** Marginalise out the parameters

>$$p(\mathbf{f}|\mathbf{y})=\frac{p(\mathbf{y}|\mathbf{f})p(\mathbf{f})}{p(\mathbf{y})}$$

>* $p(\mathbf{f})$: the prior; consider all the functions generated from the prior
>* $p(\mathbf{y}|\mathbf{f})$: the likelihood; keep those functions that fit the data

* **Gaussian Distribution - PDF**

>\begin{align}
p(x|\mu,\sigma^2)&=(2\pi\sigma^2)^{-1/2}\exp{\left( -\frac{1}{2\sigma^2} (x-\mu)^2 \right)}\\
p(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})&=\det{(2\pi\boldsymbol{\Sigma})}^{-1/2}\exp{ \left( -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) }
\end{align}

>* **Conditionals and Marginals of joint Gaussian are also Gaussian**

* **Algebra**

>$$p(\mathbf{x,y})=p \left( \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \right) = \mathcal{N} \left( \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} , \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^T & \mathbf{C} \end{bmatrix} \right)$$
>$$\;$$
>$$\Rightarrow p(\mathbf{x})=\mathcal{N}(\mathbf{a,A})$$
>$$\;$$
>$$\Rightarrow p(\mathbf{x}|\mathbf{y})=\mathcal{N}(\mathbf{a+BC}^{-1}(\mathbf{y-b}),\mathbf{A-BC}^{-1}\mathbf{B}^T)$$
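
The conditioning rule can be checked numerically. The following is a minimal NumPy sketch; the helper name `condition_gaussian` and the example numbers are illustrative assumptions, not part of the notes. With scalar blocks $a=1$, $b=2$, $A=2$, $B=0.8$, $C=1$ and an observed $y=3$, the rule gives mean $1+0.8\,(3-2)=1.8$ and variance $2-0.8^2=1.36$.

```python
import numpy as np

def condition_gaussian(a, b, A, B, C, y):
    """Mean and covariance of p(x | y) for the joint Gaussian with blocks a, b, A, B, C."""
    # Use linear solves instead of forming C^{-1} explicitly.
    mean = a + B @ np.linalg.solve(C, y - b)
    cov = A - B @ np.linalg.solve(C, B.T)
    return mean, cov

# Illustrative 1-D blocks (assumed values).
a, b = np.array([1.0]), np.array([2.0])
A, B, C = np.array([[2.0]]), np.array([[0.8]]), np.array([[1.0]])

mean, cov = condition_gaussian(a, b, A, B, C, y=np.array([3.0]))
print(mean, cov)
```

The same function works for vector-valued $\mathbf{x}$ and $\mathbf{y}$, provided the blocks have matching shapes.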

## 2.2. Gaussian Process

* **Definition**

>* A **Gaussian Process** is a collection of random variables, any finite number of which have (consistent) Gaussian distributions
>* It is fully specified by the **mean function** $m(x)$ and **covariance function** $k(x,x')$
>
>$$f \;\sim\; \mathcal{GP}(m,k)$$
>
>* $m(x)$: mean function, function on $\mathcal{X}$
>* $k(x,x')$: covariance function, function on $\mathcal{X}\times\mathcal{X}$

* **Marginalisation Property**

>$$p(\mathbf{x,y}) = \mathcal{N} \left( \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} , \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^T & \mathbf{C} \end{bmatrix} \right) \;\Rightarrow\;
p(\mathbf{x})=\mathcal{N}(\mathbf{a,A})$$

* **Joint Generation**

>* Generate a random sample from a $D$-dimensional joint Gaussian with mean $\mathbf{m}$ and covariance $\mathbf{K}$

>>`z=randn(D,1);`

>>`y=chol(K)'*z + m;`

>* `chol(K)`: ***Cholesky factor***, $\mathbf{R}$ such that $\mathbf{R}^T \mathbf{R} = \mathbf{K}$
>* $\mathbb{E}[(\mathbf{y-m})(\mathbf{y-m})^T]=\mathbb{E}[\mathbf{R}^T\mathbf{zz}^T\mathbf{R}]=\mathbf{R}^T\mathbb{E}[\mathbf{zz}^T]\mathbf{R}=\mathbf{R}^T\mathbf{IR}=\mathbf{K}$
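
For readers working in Python, a rough NumPy equivalent of the two MATLAB lines might look as follows. The mean `m`, the covariance `K`, and the sample count are assumed example values. Note that `np.linalg.cholesky` returns the lower-triangular factor $\mathbf{L}=\mathbf{R}^T$, so no transpose is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mean and covariance (assumed values, K is positive definite).
D = 3
m = np.array([0.0, 1.0, -1.0])
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])

L = np.linalg.cholesky(K)              # lower triangular, L @ L.T == K
Z = rng.standard_normal((D, 100_000))  # each column is one z ~ N(0, I)
Y = L @ Z + m[:, None]                 # each column is one sample y ~ N(m, K)

emp_mean = Y.mean(axis=1)
emp_cov = np.cov(Y)                    # rows are variables, columns are samples
print(emp_mean)
print(emp_cov)
```

With this many samples the empirical mean and covariance should land close to `m` and `K`, which is the numerical counterpart of the expectation computed above.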

* **Sequential Generation**

>$$p(f_n,f_{<n})= \mathcal{N} \left( \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} , \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^T & \mathbf{C} \end{bmatrix} \right)$$
>$$\;$$

>$$p(f_n|f_{<n})=\mathcal{N}(\mathbf{a+BC}^{-1}(f_{<n}-\mathbf{b}),\mathbf{A-BC}^{-1}\mathbf{B}^T)$$
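
Sequential generation can be sketched by drawing one function value at a time from the conditional above. The example below assumes a zero-mean GP with a squared exponential covariance on five inputs; the helper `se_kernel` and all values are illustrative. Using the same standard normal draws `z`, the sequential sample coincides with the joint (Cholesky) sample, since the Cholesky factorisation performs exactly this chain of conditionals.

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 5)
K = se_kernel(x, x)
z = rng.standard_normal(len(x))

f_seq = np.zeros(len(x))
for n in range(len(x)):
    A = K[n, n]        # prior variance of f_n
    B = K[n, :n]       # covariance between f_n and f_{<n}
    C = K[:n, :n]      # covariance of f_{<n}
    if n == 0:
        mean, var = 0.0, A
    else:
        w = np.linalg.solve(C, B)   # C^{-1} B^T (C is symmetric)
        mean = w @ f_seq[:n]        # B C^{-1} f_{<n}, with zero prior mean
        var = A - w @ B             # A - B C^{-1} B^T
    f_seq[n] = mean + np.sqrt(var) * z[n]

# Joint generation with the same z, for comparison.
f_joint = np.linalg.cholesky(K) @ z
print(f_seq)
print(f_joint)
```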

## 2.3. Non-Parametric Gaussian Process Models

* In a non-parametric model, the **parameters** are the function itself
* **Gaussian Likelihood**

>$$p(\mathbf{y}|\mathbf{x},f,\mathcal{M}_i) \;\sim\; \mathcal{N}(\mathbf{f},\sigma^2_{noise}\mathbf{I})$$

* **Gaussian Process Prior**

>$$p(f|\mathcal{M}_i) \;\sim\; \mathcal{GP}(m \equiv 0,k)$$

* **Gaussian Process Posterior**

>$$p(f|\mathbf{x},\mathbf{y},\mathcal{M}_i) \;\sim\; \mathcal{GP}(m_{post},k_{post})$$
>$$\;$$
>\begin{align}
m_{post}(x)&=\mathbf{k}(x,\mathbf{x})[\mathbf{K}(\mathbf{x},\mathbf{x})+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{y}\\
k_{post}(x,x')&=k(x,x')-\mathbf{k}(x,\mathbf{x})[\mathbf{K}(\mathbf{x},\mathbf{x})+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{k}(\mathbf{x},x')
\end{align}
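
The posterior mean and covariance translate almost line by line into code. The following is a minimal sketch, assuming a squared exponential covariance and made-up training data; `se_kernel`, `gp_posterior`, and the default `noise_var=0.1` are illustrative choices rather than anything prescribed by the notes.

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.1, lengthscale=1.0):
    """Posterior mean and covariance at x_test under a zero-mean GP prior."""
    K = se_kernel(x_train, x_train, lengthscale)     # K(x, x)
    K_s = se_kernel(x_test, x_train, lengthscale)    # k(x_test, x)
    K_ss = se_kernel(x_test, x_test, lengthscale)    # k(x_test, x_test')
    Ky = K + noise_var * np.eye(len(x_train))        # K + sigma^2 I
    mean = K_s @ np.linalg.solve(Ky, y_train)
    cov = K_ss - K_s @ np.linalg.solve(Ky, K_s.T)
    return mean, cov

# Illustrative data (assumed).
x_train = np.array([-1.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 7)

mean, cov = gp_posterior(x_train, y_train, x_test)
print(mean)
print(np.diag(cov))   # predictive variance of f at each test input
```

Calling `gp_posterior` again with different `y_train` values changes `mean` but leaves `cov` untouched, because the targets never enter the covariance expression.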

* **Gaussian Predictive**

><img src="images/image2_01.png" width=550>

* **Mean** (linear in 2 ways)

><img src="images/image2_02.png" width=600>

* **Variance** (difference between 2 terms)

><img src="images/image2_03.png" width=430>

>* 1st term: **prior variance**
>* 2nd term: **how much of that variance the training inputs $\mathbf{x}$ have explained**
>* **NOTE:** the variance is independent of the observed outputs $\mathbf{y}$


## 2.4. GP Marginal Likelihood and Hyperparameters

* **Log Marginal Likelihood:**

><img src="images/image2_04.png" width=550>

>* First term: ***data fit***
>* Second term: ***complexity penalty*** $\rightarrow$ Occam's Razor
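
The two terms can be computed separately to see how they trade off. The sketch below assumes the figure shows the standard zero-mean expression $\log p(\mathbf{y}|\mathbf{x}) = -\frac{1}{2}\mathbf{y}^T[\mathbf{K}+\sigma^2_{noise}\mathbf{I}]^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K}+\sigma^2_{noise}\mathbf{I}| - \frac{n}{2}\log 2\pi$; the kernel, data, and function names are illustrative assumptions.

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def log_marginal_likelihood(x, y, lengthscale=1.0, noise_var=0.1):
    """Return (total, data_fit, complexity, constant) for a zero-mean GP."""
    n = len(x)
    Ky = se_kernel(x, x, lengthscale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)                            # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Ky^{-1} y
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))              # equals -0.5 * log|Ky|
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant, data_fit, complexity, constant

# Illustrative data (assumed).
x = np.linspace(0.0, 5.0, 8)
y = np.sin(x)
for ell in (0.1, 1.0, 10.0):
    total, fit, penalty, _ = log_marginal_likelihood(x, y, lengthscale=ell)
    print(f"l={ell:5.1f}  total={total:8.3f}  data fit={fit:8.3f}  complexity={penalty:8.3f}")
```

Printing the terms for several length scales shows the data-fit and complexity terms changing separately, so the total reflects a balance between them rather than fit alone.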

* **Hyperparameters:**

>$$k(x,x')=\exp \left( -\frac{(x-x')^2}{2l^2} \right)$$

>* $l$: characteristic length scale
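
A short sketch of how the length scale controls correlation between two inputs one unit apart; the function name `se_kernel` and the chosen values of $l$ are illustrative.

```python
import math

def se_kernel(x, x_prime, lengthscale):
    """Squared exponential covariance with characteristic length scale l."""
    return math.exp(-((x - x_prime) ** 2) / (2.0 * lengthscale ** 2))

# Covariance between x = 0 and x' = 1 for a short, medium, and long length scale.
for lengthscale in (0.5, 1.0, 2.0):
    print(lengthscale, round(se_kernel(0.0, 1.0, lengthscale), 4))
# covariances printed: 0.1353, 0.6065, 0.8825
```

A larger $l$ keeps distant inputs strongly correlated, which corresponds to smoother, more slowly varying functions.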

* **Learning in GP:**

>1. Find the form of the covariance function
>2. Find any unknown (hyper)parameters $\theta$ 
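
Step 2 is commonly done by maximising the log marginal likelihood over $\theta$. The sketch below is one simple way to do that, a grid search over the length scale only, assuming a squared exponential covariance, the standard zero-mean marginal likelihood, a fixed noise variance, and synthetic data; gradient-based optimisation is the more usual choice in practice.

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def log_marginal_likelihood(x, y, lengthscale, noise_var):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise."""
    n = len(x)
    Ky = se_kernel(x, x, lengthscale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))

# Synthetic data (assumed): a noisy sine wave.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(len(x))

grid = np.logspace(-1, 1, 21)   # candidate length scales from 0.1 to 10
scores = np.array([log_marginal_likelihood(x, y, ell, noise_var=0.01) for ell in grid])
best = grid[np.argmax(scores)]
print(f"best length scale on the grid: {best:.3f}")
```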

## 2.5. Linear in the Parameters Models and GP

* **Q.** Does every GP correspond to a linear-in-the-parameters model?

>* **A.** Yes, but not necessarily a finite one (Mercer's theorem)

* **Computational complexity:**

>* **GP:** $\mathcal{O}(N^3)$ vs **Linear model:** $\mathcal{O}(NM^2)$
>* $N$: number of training points
>* $M$: number of basis functions

* **Linear Model with Gaussian Random Parameters:**

><img src="images/image2_05.png" width=380>

>* **Mean fn.**

><img src="images/image2_06.png" width=500>

>* **Covariance fn.**

><img src="images/image2_07.png" width=500>

* **Finite Linear Model with Gaussian Priors on the Weights:**

><img src="images/image2_08.png" width=380>

>* **Mean fn.**

><img src="images/image2_09.png" width=500>

>* **Covariance fn.**

><img src="images/image2_10.png" width=500>
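
The correspondence between a finite linear model and a GP can be checked by simulation. The formulas live in the figures, so this sketch assumes the usual form $f(x)=\sum_m w_m\phi_m(x)$ with $\mathbf{w}\sim\mathcal{N}(\mathbf{0},\mathbf{A})$, which gives a zero mean function and $k(x,x')=\boldsymbol{\phi}(x)^T\mathbf{A}\,\boldsymbol{\phi}(x')$. The polynomial basis in `features` and the matrix `A` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    """Hypothetical basis functions phi(x) = (1, x, x^2)."""
    return np.stack([np.ones_like(x), x, x ** 2], axis=1)

x = np.array([-1.0, 0.0, 0.5, 2.0])
Phi = features(x)                    # N x M design matrix
A = np.diag([1.0, 0.5, 0.25])        # assumed prior covariance of the weights

# Covariance function implied by the weight prior, evaluated at the inputs.
K = Phi @ A @ Phi.T

# Monte Carlo check: sample weights and map them to function values.
W = rng.multivariate_normal(np.zeros(3), A, size=200_000)   # samples x M
F = W @ Phi.T                                               # samples x N
emp_mean = F.mean(axis=0)
emp_cov = np.cov(F, rowvar=False)
print(emp_mean)
print(np.round(emp_cov - K, 2))
```

The empirical mean should be near zero and the empirical covariance near `K`, up to Monte Carlo error.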

## 2.6. Finite and Infinite Basis GPs

* **GP with Squared Exponential Covariance fn.:**
  * $\rightarrow$ corresponds to an infinite linear-in-the-parameters model with Gaussian bumps everywhere

><img src="images/image2_11.png" width=500>

>* **Mean fn.**

><img src="images/image2_12.png" width=500>

>* **Covariance fn.**

><img src="images/image2_13.png" width=500>

## 2.7. Covariance Functions

* **ARD (Automatic Relevance Determination):**
  * Used for feature/variable selection

><img src="images/image2_14.png" width=500>

* **RQ (Rational Quadratic):**

><img src="images/image2_15.png" width=430>

>* $r=x-x'$
>* $l>0$
>* $\alpha \rightarrow \infty$: RQ becomes SE
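
The limit can be checked numerically. The sketch assumes the figure shows the standard rational quadratic form $k_{RQ}(r)=\left(1+\frac{r^2}{2\alpha l^2}\right)^{-\alpha}$ and compares it with the SE covariance $\exp\left(-\frac{r^2}{2l^2}\right)$; function names and test values are illustrative.

```python
import math

def rq_kernel(r, lengthscale, alpha):
    """Rational quadratic covariance (assumed standard form)."""
    return (1.0 + r ** 2 / (2.0 * alpha * lengthscale ** 2)) ** (-alpha)

def se_kernel(r, lengthscale):
    """Squared exponential covariance as a function of r = x - x'."""
    return math.exp(-r ** 2 / (2.0 * lengthscale ** 2))

r, lengthscale = 1.0, 1.0
alphas = (1.0, 10.0, 100.0, 1e4, 1e6)
gaps = [abs(rq_kernel(r, lengthscale, a) - se_kernel(r, lengthscale)) for a in alphas]
for a, gap in zip(alphas, gaps):
    print(f"alpha={a:>9.0f}  |RQ - SE| = {gap:.2e}")
```

The gap shrinks steadily as $\alpha$ grows, which follows from $(1+c/\alpha)^{-\alpha}\rightarrow e^{-c}$.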

* **Matérn:**

><img src="images/image2_16.png" width=400>

>* $\text{K}_{\nu}$: modified Bessel fn. of second kind of order $\nu$
>* $l$: characteristic length scale
>* $\lceil \nu - 1 \rceil$ times differentiable (degree of smoothness), e.g. once for $\nu=3/2$ and twice for $\nu=5/2$

><img src="images/image2_17.png" width=430>

* **Periodic:**

><img src="images/image2_18.png" width=330>

>* Map the inputs: $u=(\sin(x),\cos(x))^T$
>* Measure the distance in $u$ space
>* Combine with the SE covariance fn.
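
The three steps can be written out directly. Since $\|u-u'\|^2 = 2-2\cos(x-x') = 4\sin^2\left(\frac{x-x'}{2}\right)$, applying the SE covariance in $u$ space gives $\exp\left(-\frac{2\sin^2((x-x')/2)}{l^2}\right)$. The sketch below compares both routes; the function names are illustrative, and the figure's own parameterisation may differ (for example in the period).

```python
import math

def periodic_via_mapping(x, x_prime, lengthscale):
    """SE covariance applied to the mapped inputs u = (sin x, cos x)."""
    u = (math.sin(x), math.cos(x))
    v = (math.sin(x_prime), math.cos(x_prime))
    sq_dist = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))

def periodic_closed_form(x, x_prime, lengthscale):
    """Same covariance, using |u - u'|^2 = 4 sin^2((x - x') / 2)."""
    return math.exp(-2.0 * math.sin((x - x_prime) / 2.0) ** 2 / lengthscale ** 2)

for x, x_prime in [(0.0, 0.3), (1.0, 4.0), (-2.0, 5.5)]:
    print(periodic_via_mapping(x, x_prime, 1.0), periodic_closed_form(x, x_prime, 1.0))

# Shifting one input by a full period (2*pi) leaves the covariance unchanged.
print(periodic_closed_form(0.2, 1.0, 1.0), periodic_closed_form(0.2, 1.0 + 2.0 * math.pi, 1.0))
```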

* **Composite Covariance fn.:**

>* Covariance functions have to be positive definite
>* Compose by **sums**, **products**, or **other combinations** (e.g. $g(x)k(x,x')g(x')$)
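
A quick numerical sanity check of these composition rules: the Gram matrix of each composite should have no negative eigenvalues beyond rounding error. The base kernels (SE and linear), the scaling function $g(x)=\cos(x)$, and the inputs are hypothetical choices for illustration.

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def linear_kernel(x1, x2):
    """Linear covariance k(x, x') = x x'."""
    return x1[:, None] * x2[None, :]

x = np.linspace(-2.0, 2.0, 6)
K_se = se_kernel(x, x)
K_lin = linear_kernel(x, x)
g = np.cos(x)                        # hypothetical scaling function g(x)

composites = {
    "sum": K_se + K_lin,
    "product": K_se * K_lin,         # elementwise product of Gram matrices
    "g(x) k(x,x') g(x')": g[:, None] * K_se * g[None, :],
}
min_eigs = {}
for name, K in composites.items():
    min_eigs[name] = np.linalg.eigvalsh(K).min()
    print(f"{name:>20s}: smallest eigenvalue = {min_eigs[name]:.2e}")
```

A check on one set of inputs does not prove positive definiteness, but a clearly negative eigenvalue would show that a proposed combination is not a valid covariance function.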
