# Gaussian Processes

## Introduction
We want to learn an unknown function $f$ such that $y _ { i } = f \left( \mathbf { x } _ { i } \right)$, where the observations are possibly corrupted by noise.

The optimal approach is to infer a distribution over functions given the data: $p ( f | \mathbf { X } , \mathbf { y } )$

We then use this distribution to make predictions given new inputs:

$$p \left( y _ { * } | \mathbf { x } _ { * } , \mathbf { X } , \mathbf { y } \right) = \int p \left( y _ { * } | f , \mathbf { x } _ { * } \right) p ( f | \mathbf { X } , \mathbf { y } ) d f$$

A **Gaussian process (GP)** defines a prior over functions, which can be converted into a posterior over functions once we have seen some data.

We only need to be able to define a distribution over the function's values at a finite, but arbitrary set of points $\mathbf { x } _ { 1 } , \dots , \mathbf { x } _ { N }$. 

A GP assumes that $p \left( f \left( \mathbf { x } _ { 1 } \right) , \ldots , f \left( \mathbf { x } _ { N } \right) \right)$ is jointly Gaussian with some mean $\mu(\mathbf{x})$ and covariance $\Sigma(\mathbf{x})$ given by $\Sigma _ { i j } = \kappa \left( \mathbf { x } _ { i } , \mathbf { x } _ { j } \right)$, where $\kappa$ is a positive definite kernel function. 

The key idea here is that if $x_i$ and $x_j$ are deemed by the kernel to be similar, then we expect the outputs of the function at those points to be similar too.

![](../images/15.GP.png)

In the regression setting, all these computations can be done in closed form in $O(N^3)$ time. In the classification setting, we must use approximations, such as a Gaussian approximation, since the posterior is no longer exactly Gaussian. GPs can be thought of as a Bayesian alternative to kernel methods (such as SVMs).

## GPs for regression

Let the prior on the regression function be a GP:
$$f ( \mathbf { x } ) \sim G P \left( m ( \mathbf { x } ) , \kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) \right)$$

where $m(\mathbf{x})$ is the mean function and $\kappa(\mathbf{x}, \mathbf{x}')$ is the kernel or covariance function:

$$\begin{aligned} m ( \mathbf { x } ) & = \mathbb { E } [ f ( \mathbf { x } ) ] \\ \kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) & = \mathbb { E } \left[ ( f ( \mathbf { x } ) - m ( \mathbf { x } ) ) \left( f \left( \mathbf { x } ^ { \prime } \right) - m \left( \mathbf { x } ^ { \prime } \right) \right) ^ { T } \right] \end{aligned}$$

For any finite set of points, this process defines a joint Gaussian prior:

$$p ( \mathbf { f } | \mathbf { X } ) = \mathcal { N } ( \mathbf { f } | \boldsymbol { \mu } , \mathbf { K } )$$

where $K _ { i j } = \kappa \left( \mathbf { x } _ { i } , \mathbf { x } _ { j } \right)$ and $\boldsymbol { \mu } = \left( m \left( \mathbf { x } _ { 1 } \right) , \ldots , m \left( \mathbf { x } _ { N } \right) \right)$

It is common to use a zero mean function, $m(\mathbf{x}) = 0$, since the GP is flexible enough to model the mean arbitrarily well.
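To make the finite-dimensional view concrete, here is a minimal sketch that draws a few functions from a zero-mean GP prior evaluated on a grid. It assumes 1-D inputs and a squared exponential kernel as an illustrative choice; the names `se_kernel`, `length_scale`, and `sigma_f` are not from the text above.

```python
import numpy as np

def se_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    """Squared exponential kernel between 1-D input arrays A (N,) and B (M,).
    Illustrative kernel choice; any positive definite kernel would do."""
    sq_dist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(0)
X = np.linspace(-5.0, 5.0, 50)   # finite set of input points
mu = np.zeros_like(X)            # zero mean function, m(x) = 0
K = se_kernel(X, X)              # K_ij = kappa(x_i, x_j)

# A small jitter on the diagonal guards against round-off making K
# numerically indefinite.
K_jitter = K + 1e-8 * np.eye(len(X))

# Each row is one draw of (f(x_1), ..., f(x_N)) from N(mu, K).
prior_samples = rng.multivariate_normal(mu, K_jitter, size=3)
print(prior_samples.shape)
```

Each row of `prior_samples` is one function evaluated on the grid; plotting the rows against `X` gives smooth curves whose wiggliness is set by `length_scale`.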

### Predictions using noise-free observations:

Suppose we have a training set $\mathcal { D } = \left\{ \left( \mathbf { x } _ { i } , f _ { i } \right) , i = 1 : N \right\}$, where $f _ { i } = f \left( \mathbf { x } _ { i } \right)$ is the noise-free observation of the function evaluated at $\mathbf{x}_i$. Given a test set $\mathbf { X } _ { * }$ of size $N _ { * } \times D$, we want to predict the function outputs $\mathbf{f}_*$.

Joint distribution:
$$\left( \begin{array} { l } { \mathbf { f } } \\ { \mathbf { f } _ { * } } \end{array} \right) \sim \mathcal { N } \left( \left( \begin{array} { c } { \boldsymbol { \mu } } \\ { \boldsymbol { \mu } _ { * } } \end{array} \right) , \left( \begin{array} { c c } { \mathbf { K } } & { \mathbf { K } _ { * } } \\ { \mathbf { K } _ { * } ^ { T } } & { \mathbf { K } _ { * * } } \end{array} \right) \right)$$

where $\mathbf { K } = \kappa ( \mathbf { X } , \mathbf { X } ) \text { is } N \times N , \mathbf { K } _ { * } = \kappa \left( \mathbf { X } , \mathbf { X } _ { * } \right) \text { is } N \times N _ { * } , \text { and } \mathbf { K } _ { * * } = \kappa \left( \mathbf { X } _ { * } , \mathbf { X } _ { * } \right) \text { is } N _ { * } \times N _ { * }$

Posterior predictive density: 
$$\begin{aligned} p \left( \mathbf { f } _ { \mathbf {* } } | \mathbf { X } _ { * } , \mathbf { X } , \mathbf { f } \right) & = \mathcal { N } \left( \mathbf { f } _ { * } | \boldsymbol { \mu } _ { * } , \mathbf { \Sigma } _ { * } \right) \\ \boldsymbol { \mu } _ { * } & = \boldsymbol { \mu } \left( \mathbf { X } _ { * } \right) + \mathbf { K } _ { * } ^ { T } \mathbf { K } ^ { - 1 } ( \mathbf { f } - \boldsymbol { \mu } ( \mathbf { X } ) ) \\ \mathbf { \Sigma } _ { * } & = \mathbf { K } _ { * * } - \mathbf { K } _ { * } ^ { T } \mathbf { K } ^ { - 1 } \mathbf { K } _ { * } \end{aligned}$$

For example, with the squared exponential kernel (also called the Gaussian kernel) in one dimension:
$$\kappa \left( x , x ^ { \prime } \right) = \sigma _ { f } ^ { 2 } \exp \left( - \frac { 1 } { 2 \ell ^ { 2 } } \left( x - x ^ { \prime } \right) ^ { 2 } \right)$$

where $\ell$ controls the horizontal length scale over which the function varies, and $\sigma_f^2$ controls the vertical variation.

![](../images/15.GP_example.png)
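The noise-free posterior formulas can be sketched in a few lines of NumPy. This is a minimal illustration assuming 1-D inputs, a zero mean function, and the squared exponential kernel; the function names and the small `jitter` term (added only for numerical stability) are assumptions, not part of the derivation.

```python
import numpy as np

def se_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    """Squared exponential kernel between 1-D input arrays A (N,) and B (M,)."""
    sq_dist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior_noise_free(X, f, X_star, jitter=1e-10, **kernel_args):
    """Posterior mean and covariance of f_* under a zero mean function."""
    K = se_kernel(X, X, **kernel_args) + jitter * np.eye(len(X))  # N x N
    K_s = se_kernel(X, X_star, **kernel_args)                     # N x N_*
    K_ss = se_kernel(X_star, X_star, **kernel_args)               # N_* x N_*
    # Solve linear systems instead of forming K^{-1} explicitly.
    mu_star = K_s.T @ np.linalg.solve(K, f)
    Sigma_star = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mu_star, Sigma_star

X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
f = np.sin(X)                      # noise-free observations
X_star = np.array([-2.0, 0.5])     # one training input, one new input
mu_star, Sigma_star = gp_posterior_noise_free(X, f, X_star)
print(mu_star, np.diag(Sigma_star))
```

At a test point that coincides with a training input, the posterior mean reproduces the observed value and the variance collapses to (numerically) zero; away from the data the variance grows back toward the prior variance $\sigma_f^2$.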

### Predictions using noisy observations:
$$y = f ( \mathbf { x } ) + \epsilon , \text { where } \epsilon \sim \mathcal { N } \left( 0 , \sigma _ { y } ^ { 2 } \right)$$

Covariance of the observed noisy response:
$$\operatorname { cov } \left[ y _ { p } , y _ { q } \right] = \kappa \left( \mathbf { x } _ { p } , \mathbf { x } _ { q } \right) + \sigma _ { y } ^ { 2 } \delta _ { p q }$$

where $\delta _ { p q } = \mathbb { I } ( p = q )$. In other words:
$$\operatorname { cov } [ \mathbf { y } | \mathbf { X } ] = \mathbf { K } + \sigma _ { y } ^ { 2 } \mathbf { I } _ { N } \triangleq \mathbf { K } _ { y }$$

The joint density of the observed data and the latent, noise-free function values at the test points (assuming a zero mean function) is:
$$\left( \begin{array} { l } { \mathbf { y } } \\ { \mathbf { f } _ { * } } \end{array} \right) \sim \mathcal { N } \left( \mathbf { 0 } , \left( \begin{array} { c c } { \mathbf { K } _ { y } } & { \mathbf { K } _ { * } } \\ { \mathbf { K } _ { * } ^ { T } } & { \mathbf { K } _ { * * } } \end{array} \right) \right)$$

Posterior predictive density:
$$\begin{aligned} p \left( \mathbf { f } _ { * } | \mathbf { X } _ { * } , \mathbf { X } , \mathbf { y } \right) & = \mathcal { N } \left( \mathbf { f } _ { * } | \boldsymbol { \mu } _ { * } , \mathbf { \Sigma } _ { * } \right) \\ \boldsymbol { \mu } _ { * } & = \mathbf { K } _ { * } ^ { T } \mathbf { K } _ { y } ^ { - 1 } \mathbf { y } \\ \mathbf { \Sigma } _ { * } & = \mathbf { K } _ { * * } - \mathbf { K } _ { * } ^ { T } \mathbf { K } _ { y } ^ { - 1 } \mathbf { K } _ { * } \end{aligned}$$
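The only change from the noise-free case is that $\mathbf{K}$ is replaced by $\mathbf{K}_y = \mathbf{K} + \sigma_y^2 \mathbf{I}_N$. A minimal sketch, again assuming 1-D inputs, a zero mean, and a squared exponential kernel (all names here are illustrative):

```python
import numpy as np

def se_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    """Squared exponential kernel between 1-D input arrays A (N,) and B (M,)."""
    sq_dist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior_noisy(X, y, X_star, sigma_y=0.1, **kernel_args):
    """Posterior over the latent f_* given noisy targets y (zero mean prior)."""
    K_y = se_kernel(X, X, **kernel_args) + sigma_y**2 * np.eye(len(X))
    K_s = se_kernel(X, X_star, **kernel_args)
    K_ss = se_kernel(X_star, X_star, **kernel_args)
    mu_star = K_s.T @ np.linalg.solve(K_y, y)
    Sigma_star = K_ss - K_s.T @ np.linalg.solve(K_y, K_s)
    return mu_star, Sigma_star

rng = np.random.default_rng(1)
X = np.linspace(-3.0, 3.0, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)   # noisy observations
X_star = np.array([0.0, 10.0])                       # inside and far outside the data
mu_star, Sigma_star = gp_posterior_noisy(X, y, X_star, sigma_y=0.1)
print(mu_star, np.diag(Sigma_star))
```

Far from the training inputs (here $x_* = 10$), the cross-covariances vanish, so the prediction reverts to the prior: mean close to 0 and variance close to $\sigma_f^2$.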

In the case of a single test input:
$$p \left( f _ { * } | \mathbf { x } _ { * } , \mathbf { X } , \mathbf { y } \right) = \mathcal { N } \left( f _ { * } | \mathbf { k } _ { * } ^ { T } \mathbf { K } _ { y } ^ { - 1 } \mathbf { y } , k _ { * * } - \mathbf { k } _ { * } ^ { T } \mathbf { K } _ { y } ^ { - 1 } \mathbf { k } _ { * } \right)$$

$\text { where } \mathbf { k } _ { * } = \left[ \kappa \left( \mathbf { x } _ { * } , \mathbf { x } _ { 1 } \right) , \ldots , \kappa \left( \mathbf { x } _ { * } , \mathbf { x } _ { N } \right) \right] \text { and } k _ { * * } = \kappa \left( \mathbf { x } _ { * } , \mathbf { x } _ { * } \right)$

The posterior mean can be written as a weighted sum of kernel functions centered on the training points:
$$\overline { f } _ { * } = \mathbf { k } _ { * } ^ { T } \mathbf { K } _ { y } ^ { - 1 } \mathbf { y } = \sum _ { i = 1 } ^ { N } \alpha _ { i } \kappa \left( \mathbf { x } _ { i } , \mathbf { x } _ { * } \right)$$ 

where $\boldsymbol { \alpha } = \mathbf { K } _ { y } ^ { - 1 } \mathbf { y }$
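As a quick numerical check of this identity, the sketch below computes the posterior mean both in matrix form and as the explicit sum over training points. The toy data and the squared exponential kernel are assumptions made for illustration.

```python
import numpy as np

def se_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    """Squared exponential kernel between 1-D input arrays A (N,) and B (M,)."""
    sq_dist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length_scale**2)

# Toy data, chosen arbitrarily for the check.
X = np.array([-2.0, -1.0, 0.5, 2.0])
y = np.array([0.3, -0.4, 1.1, 0.2])
sigma_y = 0.2
x_star = 0.0

K_y = se_kernel(X, X) + sigma_y**2 * np.eye(len(X))
alpha = np.linalg.solve(K_y, y)                    # alpha = K_y^{-1} y

k_star = se_kernel(X, np.array([x_star]))[:, 0]    # entries kappa(x_i, x_*)

mean_matrix_form = k_star @ np.linalg.solve(K_y, y)
mean_sum_form = sum(alpha[i] * k_star[i] for i in range(len(X)))
print(mean_matrix_form, mean_sum_form)
```

Note that `alpha` depends only on the training data, so it can be computed once and reused for every test input.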

With the squared exponential kernel, the covariance of the noisy observations is:
$$\kappa _ { y } \left( x _ { p } , x _ { q } \right) = \sigma _ { f } ^ { 2 } \exp \left( - \frac { 1 } { 2 \ell ^ { 2 } } \left( x _ { p } - x _ { q } \right) ^ { 2 } \right) + \sigma _ { y } ^ { 2 } \delta _ { p q }$$

where $\ell$ controls the horizontal length scale over which the function varies, $\sigma_f^2$ controls the vertical variation, and $\sigma_y^2$ is the noise variance.

### Estimating the kernel parameters:
Marginal likelihood (marginalizing out the latent Gaussian vector $\mathbf{f}$):
$$p ( \mathbf { y } | \mathbf { X } ) = \int p ( \mathbf { y } | \mathbf { f } , \mathbf { X } ) p ( \mathbf { f } | \mathbf { X } ) d \mathbf { f }$$

$\text { since } p ( \mathbf { f } | \mathbf { X } ) = \mathcal { N } ( \mathbf { f } | \mathbf { 0 } , \mathbf { K } ) , \text { and } p ( \mathbf { y } | \mathbf { f } ) = \prod _ { i } \mathcal { N } \left( y _ { i } | f _ { i } , \sigma _ { y } ^ { 2 } \right)$

so we have: 
$$\log p ( \mathbf { y } | \mathbf { X } ) = \log \mathcal { N } ( \mathbf { y } | \mathbf { 0 } , \mathbf { K } _ { y } ) = - \frac { 1 } { 2 } \mathbf { y } ^ { T } \mathbf { K } _ { y } ^ { - 1 } \mathbf { y } - \frac { 1 } { 2 } \log \left| \mathbf { K } _ { y } \right| - \frac { N } { 2 } \log ( 2 \pi )$$

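The first term is a data-fit term, the second is a model-complexity term, and the third is just a constant. Since this expression is differentiable in the kernel parameters, we can maximize it with a gradient-based optimizer to estimate $\ell$, $\sigma_f$, and $\sigma_y$.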
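The three terms of the log marginal likelihood can be computed directly. The sketch below assumes a squared exponential kernel and synthetic 1-D data, and, to stay dependency-free, uses a coarse grid search over $\ell$ as a simple stand-in for a gradient-based optimizer; the function and variable names are illustrative.

```python
import numpy as np

def se_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    """Squared exponential kernel between 1-D input arrays A (N,) and B (M,)."""
    sq_dist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def log_marginal_likelihood(X, y, length_scale, sigma_f, sigma_y):
    """log N(y | 0, K_y), split into data-fit, complexity, and constant terms."""
    N = len(X)
    K_y = se_kernel(X, X, length_scale, sigma_f) + sigma_y**2 * np.eye(N)
    data_fit = -0.5 * y @ np.linalg.solve(K_y, y)
    _, logdet = np.linalg.slogdet(K_y)       # stable log-determinant
    complexity = -0.5 * logdet
    constant = -0.5 * N * np.log(2.0 * np.pi)
    return data_fit + complexity + constant

rng = np.random.default_rng(2)
X = np.linspace(-3.0, 3.0, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)

# Grid search over the length scale only, with sigma_f and sigma_y held fixed.
# This is a stand-in for gradient-based optimization, not a replacement for it.
candidates = np.array([0.05, 0.1, 0.3, 1.0, 3.0, 10.0])
scores = np.array(
    [log_marginal_likelihood(X, y, l, 1.0, 0.1) for l in candidates]
)
best_length_scale = candidates[np.argmax(scores)]
print(best_length_scale)
```

Very short length scales fit the data closely but pay a large complexity penalty, while very long ones are simple but fit poorly; the marginal likelihood trades these off automatically.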

### GP Regression Algorithm
![](../images/15.GP_Algo.png)
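A common, numerically stable way to implement GP regression is to use a Cholesky factorization of $\mathbf{K}_y$ rather than inverting it. The sketch below follows that standard recipe for a single test input; it is not claimed to match the figure line for line, and the squared exponential kernel and all function names are assumptions.

```python
import numpy as np

def se_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    """Squared exponential kernel between 1-D input arrays A (N,) and B (M,)."""
    sq_dist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_regression_cholesky(X, y, x_star, sigma_y, **kernel_args):
    """Predictive mean, variance of f_*, and log marginal likelihood."""
    N = len(X)
    K_y = se_kernel(X, X, **kernel_args) + sigma_y**2 * np.eye(N)
    L = np.linalg.cholesky(K_y)                  # K_y = L L^T, O(N^3)

    # alpha = K_y^{-1} y via two solves with the triangular factor.
    # (np.linalg.solve does not exploit the triangular structure; a
    # dedicated routine such as scipy.linalg.solve_triangular would.)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    x_star_arr = np.array([x_star])
    k_star = se_kernel(X, x_star_arr, **kernel_args)[:, 0]
    k_ss = se_kernel(x_star_arr, x_star_arr, **kernel_args)[0, 0]

    mean = k_star @ alpha                        # k_*^T K_y^{-1} y
    v = np.linalg.solve(L, k_star)
    var = k_ss - v @ v                           # k_** - k_*^T K_y^{-1} k_*

    # log|K_y| = 2 * sum(log L_ii), so -0.5 * log|K_y| = -sum(log L_ii).
    log_ml = (-0.5 * y @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * N * np.log(2.0 * np.pi))
    return mean, var, log_ml

X = np.array([-1.0, 0.0, 2.0])
y = np.array([0.2, 0.9, -0.3])
mean, var, log_ml = gp_regression_cholesky(X, y, x_star=0.5, sigma_y=0.1)
print(mean, var, log_ml)
```

The Cholesky factor is computed once and reused for the mean, the variance, and the log-determinant, which is why this form is usually preferred over calling a general matrix inverse.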