<img src=../figures/Brown_logo.svg width=50%>

## Data-Driven Design & Analyses of Structures & Materials (3dasm)

## Lecture 13

### Miguel A. Bessa | <a href = "mailto: miguel_bessa@brown.edu">miguel_bessa@brown.edu</a>  | Associate Professor

**What:** A lecture of the "3dasm" course

**Where:** This notebook comes from this [repository](https://github.com/bessagroup/3dasm_course)

**Reference for entire course:** Murphy, Kevin P. *Probabilistic machine learning: an introduction*. MIT press, 2022. Available online [here](https://probml.github.io/pml-book/book1.html)

**How:** We try to follow Murphy's book closely, but the sequence of Chapters and Sections is different. The intention is to use notebooks as an introduction to the topic and Murphy's book as a resource.
* If working offline: Go through this notebook and read the book.
* If attending class in person: listen to me (!) but also go through the notebook on your laptop at the same time. Read the book.
* If attending lectures remotely: listen to me (!) via Zoom and (ideally) use two screens where you have the notebook open in 1 screen and you see the lectures on the other. Read the book.

**Optional reference (the "bible" by the "bishop"... pun intended 😆) :** Bishop, Christopher M. *Pattern recognition and machine learning*. Springer Verlag, 2006.

**References/resources to create this notebook:**
* This simple tutorial is still based on a script I created for this article: https://imechanica.org/node/23957
* It follows from some examples provided by the scikit-learn user guide, which seem to have originated from Mathieu Blondel, Jake Vanderplas, Vincent Dubourg, and Jan Hendrik Metzen.

Apologies in advance if I missed some reference used in this notebook. Please contact me if that is the case, and I will gladly include it here.

## **OPTION 1**. Run this notebook **locally on your computer**:
1. Confirm that you have the '3dasm' mamba (or conda) environment (see Lecture 1).
2. Go to the 3dasm_course folder on your computer and pull the latest updates of the [repository](https://github.com/bessagroup/3dasm_course):
```
git pull
```
    - Note: if you can't pull the repo due to conflicts (and you can't handle these conflicts), use this command (with **caution**!) and your repo becomes the same as the one online:
        ```
        git reset --hard origin/main
        ```
3. Open a command window and launch Jupyter Notebook (it will open in your web browser):
```
jupyter notebook
```
4. Open the notebook of this Lecture and choose the '3dasm' kernel.

## **OPTION 2**. Use **Google's Colab** (no installation required, but times out if idle):

1. Go to https://colab.research.google.com
2. Log in
3. File > Open notebook
4. Click on GitHub (no need to log in or authorize anything)
5. Paste the git link: https://github.com/bessagroup/3dasm_course
6. Click search and then click on the notebook for this Lecture.

In [1]:
# Basic plotting tools needed in Python.

import matplotlib.pyplot as plt # import plotting tools to create figures
import numpy as np # import numpy to handle a lot of things!
from IPython.display import display, Math # to print with Latex math

%config InlineBackend.figure_format = "retina" # render higher resolution images in the notebook
plt.rcParams["figure.figsize"] = (8,4) # rescale figure size appropriately for slides

## Outline for today

* Introduction to Gaussian Processes
    - Using Scikit-learn for Gaussian Process Regression (noiseless and noisy datasets)

**Reading material**: This notebook + (GPs in Section 17.2 of book)

**Optional reading material (GPs bible)**: Rasmussen, Carl E. *Gaussian Processes for Machine Learning*. MIT press, 2006. Available online [here](https://gaussianprocess.org/gpml/chapters/RW.pdf)

## Gaussian processes

Gaussian processes are a type of machine learning algorithm categorized as a kernel method.

Kernel machines can be very powerful, but also have important limitations...

### Understanding Gaussian Processes

We will derive Gaussian Processes (GPs) in weight-space, instead of function-space (the results are the same).

* This way, we will see that Gaussian Processes are a very natural generalization of Bayesian linear regression!

## Recap of PPD for Bayesian linear regression

Last lecture we found the PPD of Bayesian linear regression to be:

$$\require{color}
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y}\right) \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x}^*)\right)
$$

with the previously determined covariance: $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}$

This long expression of the PPD is **often simplified** by considering a prior with zero mean, i.e. $\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w = \mathbf{0}$:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y} \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x}^*)\right)
$$

We will consider this simplification, without loss of generality (it makes the expressions slightly shorter).

Furthermore, we will also abbreviate the PPD expression by using the notation:

$\boldsymbol{\phi}^*=\boldsymbol{\phi}(\mathbf{x}^*)$

from which the PPD can be rewritten as:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y} \,,\, \sigma^2 + {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^*\right)
$$

with the previously determined covariance: $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}$

Also recall that $\boldsymbol{\Phi}$ is where we group all $N$ evaluations of the basis functions into the $N\times M$ matrix:

$$
\boldsymbol{\Phi} = \begin{bmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \\
\end{bmatrix}
$$

where $N$ is the number of training points and $M$ is the number of basis functions.

Moreover, if we define the matrix:

$\mathbf{K}=\boldsymbol{\Phi} \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w\boldsymbol{\Phi}^T$

We can show that the PPD can be rewritten as:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T(\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1}\mathbf{y} \,,\, \sigma^2 + {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* - {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1} \boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* \right)
$$

I know... You don't think that this simplified anything... Be patient!

### Equivalency of these two PPD expressions for Bayesian linear regression

**The notes below** show that the PPD we found in the last lecture for Bayesian linear regression:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y} \,,\, \sigma^2 + {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^*\right)
$$

with the previously determined covariance: $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}$

is equivalent to the following expression (shown in 2 lines due to length):

$$\begin{align}
{\color{orange}p(y^*| \mathbf{x}^*, \mathcal{D})} = \mathcal{N}(y^* \mid & {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T(\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1}\mathbf{y} \,, \\
& \sigma^2 + {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* - {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1} \boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* )
\end{align}
$$

where $\mathbf{K}=\boldsymbol{\Phi} \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w\boldsymbol{\Phi}^T$.

### Proof of PPD mean equivalency

We want to show that the two expressions for the **mean** of the PPD are equivalent, i.e.:

$$
\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T(\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1} \equiv \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \frac{1}{\sigma^2}\boldsymbol{\Phi}^T
$$

If we multiply from the left by $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w^{-1}$, and from the right by $(\mathbf{K}+\sigma^2\mathbf{I}_N)$, we get:

$$
\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w^{-1}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T \equiv \frac{1}{\sigma^2}\boldsymbol{\Phi}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)
$$

Focusing on the LHS and knowing that: $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w^{-1} = \left[ \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1} \right]^{-1} = \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}$

$$
\left(\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right) \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T \equiv \frac{1}{\sigma^2}\boldsymbol{\Phi}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)
$$

$$
\underbrace{\boldsymbol{\Phi}^T}_{\boldsymbol{\Phi}^T\frac{1}{\sigma^2}\sigma^2\mathbf{I}_N} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\underbrace{\boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T}_{\mathbf{K}}  \equiv \frac{1}{\sigma^2}\boldsymbol{\Phi}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)
\quad \text{q.e.d.}$$
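
The mean identity can also be checked numerically. The snippet below is a minimal sketch, not part of the original derivation: the polynomial basis, the random prior covariance, the noise level and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 3          # number of training points and basis functions (assumed)
sigma2 = 0.25        # noise variance sigma^2 (assumed)

x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, M, increasing=True)   # N x M design matrix with columns [1, x, x^2]

# Random symmetric positive-definite prior covariance (M x M), for illustration only
A = rng.normal(size=(M, M))
Sigma_prior = A @ A.T + M * np.eye(M)

# Weight-space form: Sigma_post * Phi^T / sigma^2
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_prior) + Phi.T @ Phi / sigma2)
weight_space_form = Sigma_post @ Phi.T / sigma2

# Kernel form: Sigma_prior * Phi^T * (K + sigma^2 I_N)^{-1}
K = Phi @ Sigma_prior @ Phi.T
kernel_form = Sigma_prior @ Phi.T @ np.linalg.inv(K + sigma2 * np.eye(N))

mean_identity_holds = np.allclose(weight_space_form, kernel_form)
print(mean_identity_holds)
```

Both sides are $M \times N$ matrices; multiplying either one by ${\boldsymbol{\phi}^*}^T$ on the left and $\mathbf{y}$ on the right gives the PPD mean.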

### Proof of PPD variance equivalency

We want to show that the two expressions for the **variance** of the PPD are equivalent, i.e.:

$$
\sigma^2 + {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* - {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1} \boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* \equiv \sigma^2 + {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^*
$$

We can prove this using the matrix inversion lemma (a.k.a. Woodbury identity), defined for matrices $\mathbf{Z}$, $\mathbf{U}$, $\mathbf{W}$, and $\mathbf{V}$:

$$
\left(\mathbf{Z}+\mathbf{U}\mathbf{W}\mathbf{V}^T\right)^{-1} = \mathbf{Z}^{-1} - \mathbf{Z}^{-1}\mathbf{U}\left(\mathbf{W}^{-1} + \mathbf{V}^T\mathbf{Z}^{-1}\mathbf{U} \right)^{-1}\mathbf{V}^T\mathbf{Z}^{-1}
$$

So, if we consider $\mathbf{Z}^{-1}=\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w$, $\mathbf{W}^{-1}=\sigma^2\mathbf{I}_N$ and $\mathbf{V}=\mathbf{U}=\boldsymbol{\Phi}^T$ then we get:

$$
\left(\underbrace{\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}}_{\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w^{-1}}\right)^{-1} = \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w - \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T \left(\sigma^2\mathbf{I}_N + \underbrace{\boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T}_{\mathbf{K}} \right)^{-1} \boldsymbol{\Phi} \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w
$$

Thus, if we multiply on the left by ${\boldsymbol{\phi}^*}^T$ and on the right by $\boldsymbol{\phi}^*$ for both sides we get:

$$
{\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^*\equiv {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* - {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1} \boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^*
\quad \text{q.e.d.}$$
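
As a sanity check, the matrix inversion lemma and the resulting variance identity can be verified numerically. This is only a sketch under assumed values: the random design matrix, prior covariance, noise variance and test feature vector below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 4          # training points and basis functions (assumed)
sigma2 = 0.1         # noise variance (assumed)

Phi = rng.normal(size=(N, M))            # stand-in for the N x M design matrix
B = rng.normal(size=(M, M))
Sigma_prior = B @ B.T + M * np.eye(M)    # symmetric positive-definite prior covariance

# Left-hand side of the lemma: the posterior covariance of the weights
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_prior) + Phi.T @ Phi / sigma2)

# Right-hand side of the lemma, written with K = Phi Sigma_prior Phi^T
K = Phi @ Sigma_prior @ Phi.T
woodbury_rhs = Sigma_prior - Sigma_prior @ Phi.T @ np.linalg.solve(
    K + sigma2 * np.eye(N), Phi @ Sigma_prior
)

# Variance of the PPD at a hypothetical test point, in both forms
phi_star = rng.normal(size=M)
var_weight_space = sigma2 + phi_star @ Sigma_post @ phi_star
var_kernel_form = sigma2 + phi_star @ woodbury_rhs @ phi_star

print(np.allclose(Sigma_post, woodbury_rhs), np.isclose(var_weight_space, var_kernel_form))
```

Note that the weight-space form inverts an $M \times M$ matrix, while the kernel form inverts an $N \times N$ matrix.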


## This result is incredibly powerful!

Rewriting the PPD for Bayesian linear regression into this form:

$$\begin{align}
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}( y^* \mid &
{\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T(\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1}\mathbf{y} \,, \\
& \, \sigma^2 + {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* - {\boldsymbol{\phi}^*}^T \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1} \boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* )\\
\end{align}
$$

where $\mathbf{K}=\boldsymbol{\Phi} \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w\boldsymbol{\Phi}^T$,

is very interesting because the following **three terms of the same form** arise:

$$
\boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T \quad , \quad {\boldsymbol{\phi}^*}^T\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T \quad , \quad {\boldsymbol{\phi}^*}^T\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^*
$$

**<font color=blue>Let's pause for a minute and observe the PPD expression carefully...</font>**

For simplicity of argument (but without loss of generality), consider for the time being that $\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w = \overset{\scriptscriptstyle <}{\sigma}_w^2 \mathbf{I}_M$ (recall that the prior covariance of the $M$ weights is an $M\times M$ matrix).

In this case, each of the above terms reduces to a very simple form:

$$
\boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T = \overset{\scriptscriptstyle <}{\sigma}_w^2\boldsymbol{\Phi} \boldsymbol{\Phi}^T
$$

$$
{\boldsymbol{\phi}^*}^T\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T = \overset{\scriptscriptstyle <}{\sigma}_w^2 {\boldsymbol{\phi}^*}^T \boldsymbol{\Phi}^T
$$

$$
{\boldsymbol{\phi}^*}^T\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^* = \overset{\scriptscriptstyle <}{\sigma}_w^2 {\boldsymbol{\phi}^*}^T \boldsymbol{\phi}^*
$$

Here we see that the prior covariance simply acts as a "coefficient" (a hyperparameter!) that scales products of basis functions. This holds even if we had not simplified the covariance matrix to be a scaled identity; the simplification just makes the argument easier to follow.

Therefore, we have to analyze what it means to have pairs of basis functions being multiplied together.

Let's start with the term $\overset{\scriptscriptstyle <}{\sigma}_w^2\boldsymbol{\Phi} \boldsymbol{\Phi}^T$. The $N\times M$ basis function matrix is:

$$
\boldsymbol{\Phi} = \begin{bmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \\
\end{bmatrix}
$$

So, the matrix multiplication $\overset{\scriptscriptstyle <}{\sigma}_w^2\boldsymbol{\Phi} \boldsymbol{\Phi}^T$ creates an $N \times N$ matrix (let's call it $\mathbf{A}=\overset{\scriptscriptstyle <}{\sigma}_w^2\boldsymbol{\Phi} \boldsymbol{\Phi}^T$ for now) where each element is given by:

$$
A_{ij} = \overset{\scriptscriptstyle <}{\sigma}_w^2\phi_m(\mathbf{x}_i)\phi_m(\mathbf{x}_j)
$$

Note that I am using Einstein's notation where repeated indices indicate a summation, i.e. $\phi_m(\mathbf{x}_i)\phi_m(\mathbf{x}_j) = \sum_{m=0}^{M-1} \phi_m(\mathbf{x}_i)\phi_m(\mathbf{x}_j)$.

Similarly, the remaining two terms have the same construction.

Focusing on the term $\overset{\scriptscriptstyle <}{\sigma}_w^2{\boldsymbol{\phi}^*}^T \boldsymbol{\Phi}^T$, it leads to a $1\times N$ vector (let's call it $\mathbf{b} = \overset{\scriptscriptstyle <}{\sigma}_w^2{\boldsymbol{\phi}^*}^T \boldsymbol{\Phi}^T$ for now) whose elements are:

$$
b_j = \overset{\scriptscriptstyle <}{\sigma}_w^2 \phi_m(\mathbf{x}^*)\phi_m(\mathbf{x}_j) \quad \text{where } j=1,..., N \text{ and } m=0,\ldots,M-1
$$

and where you should recall that $\mathbf{x}^*$ highlights that this is a point not seen in training (point where you want to make a prediction).

And, finally, the last term $\overset{\scriptscriptstyle <}{\sigma}_w^2 {\boldsymbol{\phi}^*}^T \boldsymbol{\phi}^*$ leads to a scalar (let's call it $c=\overset{\scriptscriptstyle <}{\sigma}_w^2 {\boldsymbol{\phi}^*}^T \boldsymbol{\phi}^*$ for now) given by:

$$
c = \overset{\scriptscriptstyle <}{\sigma}_w^2 \phi_m(\mathbf{x}^*)\phi_m(\mathbf{x}^*) \quad \text{where } m=0,\ldots,M-1
$$
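
The three quantities $\mathbf{A}$, $\mathbf{b}$ and $c$ map directly onto `np.einsum`, which implements the same repeated-index summation. The sketch below assumes a polynomial basis with $M=3$, a prior variance of 2, and a handful of made-up input points; none of these values come from the lecture.

```python
import numpy as np

sigma_w2 = 2.0                               # assumed prior variance of the weights
M = 3                                        # assumed number of basis functions
x_train = np.array([-1.0, 0.0, 0.5, 2.0])    # N = 4 made-up training inputs
x_star = 1.5                                 # made-up prediction point

def phi(x):
    """Polynomial basis [1, x, ..., x^(M-1)] evaluated at each entry of x."""
    return np.vander(np.atleast_1d(x), M, increasing=True)

Phi = phi(x_train)          # N x M matrix of basis function evaluations
phi_star = phi(x_star)[0]   # length-M vector for the prediction point

# The repeated index m is summed over, exactly as in Einstein's notation
A = sigma_w2 * np.einsum("im,jm->ij", Phi, Phi)      # N x N matrix
b = sigma_w2 * np.einsum("m,jm->j", phi_star, Phi)   # N entries
c = sigma_w2 * np.einsum("m,m->", phi_star, phi_star)  # scalar

print(A.shape, b.shape, float(c))
```

The same numbers come out of the plain matrix products `sigma_w2 * Phi @ Phi.T`, `sigma_w2 * phi_star @ Phi.T` and `sigma_w2 * phi_star @ phi_star`.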


## Hello kernels!

So, we are seeing that all these terms that are in the PPD of Bayesian linear regression:

$$
\boldsymbol{\Phi}\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T \quad , \quad {\boldsymbol{\phi}^*}^T\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T \quad , \quad {\boldsymbol{\phi}^*}^T\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}^*
$$

are associated with a <font color='red'>single mathematical entity</font>:

$$
k(\mathbf{x}_i,\mathbf{x}_j) = \sigma_{ij}^2\phi_m(\mathbf{x}_i)\phi_m(\mathbf{x}_j)  \quad \text{where } m=0,\ldots,M-1
$$
    
which is called a **kernel function** (a.k.a. Mercer kernel).

For example, if we consider $N$ input points $x_n$ **in one-dimension**, if we choose a quadratic polynomial basis function without the bias term, $\boldsymbol{\phi}(x_n)=[x_n , x_n^2]$, and assume the hyperparameter $\sigma_{ij}=1$, then we construct a quadratic kernel:

$$
k(x_i,x_j) = \phi_m(x_i)\phi_m(x_j) = x_i x_j + x_i^2 x_j^2
$$

for every pair of points $x_i$ and $x_j$.
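
For the feature map $\boldsymbol{\phi}(x)=[x, x^2]$ the inner product works out to $x_i x_j + x_i^2 x_j^2$. A short sketch (with made-up input points and hypothetical function names) confirms that evaluating the kernel in closed form gives the same matrix as building the features explicitly:

```python
import numpy as np

def phi(x):
    """Quadratic basis without bias term: phi(x) = [x, x^2] for each entry of x."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x**2], axis=-1)   # shape (N, 2)

def quadratic_kernel(xi, xj):
    """Closed-form kernel for the feature map above: x_i x_j + x_i^2 x_j^2."""
    return xi * xj + (xi * xj) ** 2

x = np.array([-1.0, 0.5, 2.0])            # made-up 1D input points

K_from_features = phi(x) @ phi(x).T                       # inner products of features
K_closed_form = quadratic_kernel(x[:, None], x[None, :])  # kernel on every pair

print(np.allclose(K_from_features, K_closed_form))
```

The closed form never builds $\boldsymbol{\phi}$; this is what makes kernels practical when the number of basis functions grows.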

At this point you may say: "What the hell, Miguel? Why do we care???"

* I admit it, if you just use the same basis functions (polynomial with a given degree, i.e. a finite $M$), then writing Bayesian linear regression using the "kernel" formalism can be a waste of time...

BUT, here's something that will blow your mind 🤯:

* Ask yourself this question: do you know what happens if we make $M \rightarrow \infty$ ?

We can demonstrate that if $M \rightarrow \infty$ then the kernel function becomes a **Gaussian kernel** (a.k.a. **Radial Basis Function** or RBF kernel, a.k.a. **squared exponential kernel** or SE kernel):

$$
k(\mathbf{x}_i,\mathbf{x}_j) = s^2 \exp\left( - \frac{||\mathbf{x}_i-\mathbf{x}_j||^2}{2 l^2} \right)
$$

where $l$ is called the length scale of the kernel, and $s^2$ is often interpreted as the overall variance. However, there are more general versions of the RBF kernel (e.g. with anisotropic length scales).

Note that from our derivation, we can understand that $l$ can be related to the variance of the prior if we assume $\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w=\overset{\scriptscriptstyle <}{\sigma}_w^2 \mathbf{I}$. But you don't need to think about it that way. Just think about $l$ as a hyperparameter (just like $\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w$ was a matrix of hyperparameters).
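
A minimal NumPy sketch of the RBF kernel is shown below. The function name, argument names and input points are assumptions for illustration; inputs are expected as 2D arrays with one point per row.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_variance=1.0):
    """RBF kernel matrix between the rows of X1 (N1 x D) and X2 (N2 x D).

    length_scale plays the role of l and signal_variance the role of s^2.
    """
    diff = X1[:, None, :] - X2[None, :, :]        # pairwise differences, N1 x N2 x D
    sq_dist = np.sum(diff**2, axis=-1)            # squared Euclidean distances
    return signal_variance * np.exp(-0.5 * sq_dist / length_scale**2)

X = np.array([[0.0], [1.0], [3.0]])               # three made-up 1D points
K = rbf_kernel(X, X, length_scale=2.0, signal_variance=1.5)
print(K)
```

The diagonal equals $s^2$ (each point is perfectly correlated with itself), and the off-diagonal entries decay with distance at a rate set by $l$.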

* (Want to see the proof for this?? Check the notes below!)

#### Note: Proof that RBF comes from inner product of Infinite Polynomial Basis Functions

The mathematical notation for a kernel (disregarding hyperparameters) is:

$
k(\mathbf{x}_i,\mathbf{x}_j) = \langle \phi(\mathbf{x}_i),  \phi(\mathbf{x}_j) \rangle
$

where the angle brackets $\langle \cdot, \cdot \rangle$ denote the inner product between the two feature vectors $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$, i.e. the basis functions evaluated at each point.

Focusing on the RBF kernel we can write the following (without loss of generality, consider $l=1$):

$$\begin{align}
k(\mathbf{x}_i,\mathbf{x}_j) &= \exp\left[ - \frac{||\mathbf{x}_i-\mathbf{x}_j||^2}{2} \right]\\
&= \exp\left[ - \frac{1}{2} \langle \mathbf{x}_i-\mathbf{x}_j, \mathbf{x}_i-\mathbf{x}_j  \rangle \right]\\
&= \exp\left[ - \frac{1}{2} \left(\langle \mathbf{x}_i, \mathbf{x}_i-\mathbf{x}_j  \rangle - \langle \mathbf{x}_j, \mathbf{x}_i-\mathbf{x}_j  \rangle \right) \right]\\
&= \exp\left[ - \frac{1}{2} \left(\langle \mathbf{x}_i, \mathbf{x}_i  \rangle - \langle \mathbf{x}_i, \mathbf{x}_j  \rangle - \langle \mathbf{x}_j, \mathbf{x}_i  \rangle + \langle \mathbf{x}_j, \mathbf{x}_j  \rangle \right) \right]\\
&= \exp\left[ - \frac{1}{2} \left(||\mathbf{x}_i||^2 - 2\langle \mathbf{x}_j, \mathbf{x}_i  \rangle  +||\mathbf{x}_j||^2 \right) \right]\\
&= \underbrace{\exp\left[ - \frac{1}{2}||\mathbf{x}_i||^2 - \frac{1}{2}||\mathbf{x}_j||^2  \right]}_{\text{Constant}} \exp\left[ \langle \mathbf{x}_j, \mathbf{x}_i  \rangle \right]\\
&= C e^{\langle \mathbf{x}_j, \mathbf{x}_i  \rangle}\\
&= C \sum_{m=0}^{\infty}\frac{\langle \mathbf{x}_j, \mathbf{x}_i  \rangle^m}{m!} \quad \quad \text{Taylor expansion of $e^x$} \\
\end{align}
$$

Concluding that the RBF kernel function **results from an infinite sum over polynomial kernels** (inner-product of polynomial basis functions)!
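
To see this convergence numerically, we can truncate the Taylor series after a finite number of terms and compare with the exact RBF kernel (with $l=1$). This is only an illustrative sketch: the two points and the function names are made up, and the prefactor $C$ is evaluated from the norms of the two points.

```python
import numpy as np
from math import factorial

def truncated_rbf(xi, xj, n_terms):
    """RBF kernel (l = 1) with the exponential series truncated to n_terms terms."""
    inner = float(np.dot(xi, xj))
    C = np.exp(-0.5 * np.dot(xi, xi) - 0.5 * np.dot(xj, xj))
    series = sum(inner**m / factorial(m) for m in range(n_terms))
    return C * series

def exact_rbf(xi, xj):
    """Exact RBF kernel with l = 1 and s = 1."""
    d = xi - xj
    return np.exp(-0.5 * np.dot(d, d))

xi = np.array([0.3, -1.2])   # made-up points in 2D
xj = np.array([1.0, 0.4])

n_terms_list = (1, 2, 4, 8, 16)
errors = [abs(truncated_rbf(xi, xj, n) - exact_rbf(xi, xj)) for n in n_terms_list]
for n, err in zip(n_terms_list, errors):
    print(f"{n:2d} terms: absolute error = {err:.2e}")
```

Each additional term adds a polynomial kernel of one degree higher, and the error shrinks quickly as more terms are included.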

## Gaussian process regression

In conclusion, we found that the Gaussian process method is simply Bayesian linear regression written in terms of kernel functions, instead of an explicit (finite) set of basis functions.

So, we fully derived the simplest PPD for a Gaussian process:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left( y^* \mid 
{\mathbf{k}^*}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1}\mathbf{y} \,,\, \sigma^2 + k(\mathbf{x}^*, \mathbf{x}^*) - {\mathbf{k}^*}^T (\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1} \mathbf{k}^*\right)
$$

where we assume **homoscedastic noise** $\sigma^2$, i.e. constant noise (uncertainty) in the data.

* The homoscedastic noise that we consider here is a type of uncertainty that is called **aleatoric uncertainty** (a.k.a. data uncertainty) because even if we had infinite amounts of data we could not reduce the uncertainty below $\sigma^2$ at every point.

* The remaining terms in the PPD variance are the **epistemic uncertainty** (a.k.a. model uncertainty): they quantify how uncertain the model is about its own predictions, and they shrink as we observe more data near $\mathbf{x}^*$.

For completeness, let's also explicitly write the kernel function:

$$
k(\mathbf{x}^*, \mathbf{x}^*) \quad \text{where $k$ is a chosen kernel function}
$$

the kernel vector:

$$
{\mathbf{k}^*}^T = [k(\mathbf{x}_1, \mathbf{x}^*), k(\mathbf{x}_2, \mathbf{x}^*), \ldots, k(\mathbf{x}_N, \mathbf{x}^*) ] \quad \text{where $N$ is the number of training points}
$$


and the kernel matrix, which is usually called the **covariance matrix**:

$$
\mathbf{K} = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N) \\
k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \cdots & k(\mathbf{x}_2, \mathbf{x}_N) \\
\vdots & \vdots & \ddots & \vdots \\
k(\mathbf{x}_N, \mathbf{x}_1) & k(\mathbf{x}_N, \mathbf{x}_2) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N) \\
\end{bmatrix}
$$
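
Putting the kernel function, kernel vector and kernel matrix together, the PPD can be computed in a few lines of NumPy. The sketch below assumes an RBF kernel, a made-up noiseless sine dataset and hypothetical function names; it uses a Cholesky factorization instead of an explicit inverse, which is a common numerical choice rather than something prescribed by the derivation.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_variance=1.0):
    """RBF kernel matrix between the rows of X1 (N1 x D) and X2 (N2 x D)."""
    diff = X1[:, None, :] - X2[None, :, :]
    return signal_variance * np.exp(-0.5 * np.sum(diff**2, axis=-1) / length_scale**2)

def gp_ppd(X_train, y_train, X_star, noise_variance, **kernel_args):
    """Mean and variance of the GP PPD at each row of X_star (zero prior mean)."""
    N = X_train.shape[0]
    K = rbf_kernel(X_train, X_train, **kernel_args)           # N x N kernel matrix
    K_star = rbf_kernel(X_train, X_star, **kernel_args)       # column j is k* for X_star[j]
    k_ss = np.diag(rbf_kernel(X_star, X_star, **kernel_args)) # k(x*, x*) for each x*

    L = np.linalg.cholesky(K + noise_variance * np.eye(N))    # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train)) # (K + sigma^2 I)^{-1} y
    mean = K_star.T @ alpha

    V = np.linalg.solve(L, K_star)                            # L^{-1} k*
    variance = noise_variance + k_ss - np.sum(V**2, axis=0)
    return mean, variance

X_train = np.linspace(0.0, 5.0, 6).reshape(-1, 1)   # made-up training inputs
y_train = np.sin(X_train).ravel()                   # made-up training outputs
X_star = np.array([[2.5], [10.0]])                  # one point inside, one far outside

mean, variance = gp_ppd(X_train, y_train, X_star, noise_variance=0.01)
print(mean, variance)
```

Far from the training data (here at $x^*=10$) the kernel vector is nearly zero, so the mean falls back to the zero prior mean and the variance approaches $\sigma^2 + k(\mathbf{x}^*, \mathbf{x}^*)$.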

## Kernels and more kernels...

The freedom that we get from choosing kernels is the best asset of Gaussian processes. The kernel choice can be viewed as a hyperparameter itself (that has additional hyperparameters).

The overwhelming majority of people simply use the RBF kernel with two hyperparameters that we introduced above:

$$
k(\mathbf{x}_i,\mathbf{x}_j) = s^2 \exp\left( - \frac{||\mathbf{x}_i-\mathbf{x}_j||^2}{2 l^2} \right)
$$

But there are many more kernels and they can be very useful!

Here are some important basic kernels:

* ARD kernel (generalization of RBF kernel)
* Matern kernel (less smooth than RBF, as it has limited degree of differentiability)
* Periodic kernel (captures repeating structure of the data)

You can see their expressions in Murphy's book (Section 17.1).

You can see interesting plots of these kernels [here](https://distill.pub/2019/visual-exploration-gaussian-processes/#MultipleKernels).

* In fact, the complete [distill.pub article](https://distill.pub/2019/visual-exploration-gaussian-processes) where those plots are included is a great resource because it is interactive. Highly recommended!

Kernels are very interesting entities...

Not every function is a kernel! They have to obey some rules (e.g. the kernel matrix must be symmetric and positive semi-definite for any set of input points), but there's a lot of flexibility in defining a kernel.

One of the most interesting aspects of kernels is that given two valid kernels $k_1(\mathbf{x}_i, \mathbf{x}_j)$ and $k_2(\mathbf{x}_i, \mathbf{x}_j)$, we can create a new kernel using any of the following methods:

$
k(\mathbf{x}_i, \mathbf{x}_j) = c k_1(\mathbf{x}_i, \mathbf{x}_j), \quad \text{for any constant $c>0$}
$

$
k(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i) k_1(\mathbf{x}_i, \mathbf{x}_j)f(\mathbf{x}_j), \quad \text{for any function $f$}
$

$
k(\mathbf{x}_i, \mathbf{x}_j) = q \left[k_1(\mathbf{x}_i, \mathbf{x}_j)\right], \quad \text{for any polynomial $q$ with non-negative coefficients}
$

$
k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left[k_1(\mathbf{x}_i, \mathbf{x}_j)\right]
$

$
k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{A} \mathbf{x}_j, \quad \text{for any positive semi-definite matrix $\mathbf{A}$}
$

Even more interestingly, kernels can also be combined using **addition** or **multiplication**:

$
k(\mathbf{x}_i, \mathbf{x}_j) = k_1(\mathbf{x}_i, \mathbf{x}_j) + k_2(\mathbf{x}_i, \mathbf{x}_j)
$

* Adding two positive-definite kernels together always results in another positive definite kernel. This is a way to get a disjunction of the individual properties of each kernel.

$
k(\mathbf{x}_i, \mathbf{x}_j) = k_1(\mathbf{x}_i, \mathbf{x}_j) \times k_2(\mathbf{x}_i, \mathbf{x}_j)
$

* Multiplying two positive-definite kernels together always results in another positive definite kernel. This is a way to get a conjunction of the individual properties of each kernel.
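
In scikit-learn, kernel objects can be combined with `+` and `*`, which mirrors the addition and multiplication rules above. The sketch below uses kernels from `sklearn.gaussian_process.kernels`; the hyperparameter values and input points are arbitrary assumptions.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, ConstantKernel

X = np.linspace(0.0, 5.0, 8).reshape(-1, 1)   # made-up 1D input points

# Scaled RBF kernel (s^2 = 2.0, l = 1.5) and a periodic kernel, values chosen arbitrarily
k_rbf = ConstantKernel(2.0) * RBF(length_scale=1.5)
k_per = ExpSineSquared(length_scale=1.0, periodicity=2.0)

k_sum = k_rbf + k_per     # "either property": smooth trend plus periodic component
k_prod = k_rbf * k_per    # "both properties": periodicity that decays with distance

# Calling a kernel object on X returns its kernel (Gram) matrix
K_sum = k_sum(X)
K_prod = k_prod(X)

# The combined kernel matrices remain positive semi-definite (up to round-off)
min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
print(min_eig_sum, min_eig_prod)
```

The sum and product matrices are just the element-wise sum and product of the individual kernel matrices.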


A great resource discussing the creation of non-trivial kernels is in [Duvenaud's webpage](https://www.cs.toronto.edu/~duvenaud/cookbook/). Highly recommended!

Now that we know all of this, we can write a general expression for the PPD of a Gaussian process:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left( y^* \mid \mu^* + 
{\mathbf{k}^*}^T (\mathbf{K}+\mathbf{R})^{-1}\mathbf{y} \,,\, {\sigma^*}^2 + k(\mathbf{x}^*, \mathbf{x}^*) - {\mathbf{k}^*}^T (\mathbf{K}+\mathbf{R})^{-1} \mathbf{k}^*\right)
$$

* where we added a mean function $\mu^* = \mu(\mathbf{x}^*)$, instead of assuming it to be zero (like we did in the beginning of the lecture, for conciseness)
* where the noise $\sigma^*=\sigma(\mathbf{x}^*)$ is also allowed to change for each point, i.e. we are assuming **heteroscedastic noise**
* and where the noise matrix $\mathbf{R}$ is (usually) diagonal but where each entry is the noise level at data point $\mathbf{x}_i$ (this at least can be measured at every training point):

    $R_{ij} = \sigma_i^2\delta_{ij}$

    where $\sigma_i \equiv \sigma(\mathbf{x}_i)$ is the noise assumed (or measured) at each training point $\mathbf{x}_i$, and $\delta_{ij}$ is the Kronecker delta (the entries of the identity matrix).
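
To illustrate the role of the diagonal noise matrix $\mathbf{R}$, the sketch below builds it from per-point noise levels and evaluates the PPD at the training points themselves, with zero mean function. The dataset, the noise values (one deliberately noisy point in the middle) and the function names are all assumptions made for this example.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_variance=1.0):
    """RBF kernel matrix between the rows of X1 (N1 x D) and X2 (N2 x D)."""
    diff = X1[:, None, :] - X2[None, :, :]
    return signal_variance * np.exp(-0.5 * np.sum(diff**2, axis=-1) / length_scale**2)

X_train = np.arange(5.0).reshape(-1, 1)                # made-up inputs 0, 1, 2, 3, 4
y_train = np.sin(X_train).ravel()                      # made-up outputs
noise_std = np.array([0.05, 0.05, 0.5, 0.05, 0.05])    # assumed sigma_i, third point is noisy

R = np.diag(noise_std**2)                              # R_ij = sigma_i^2 delta_ij

X_star = X_train                                       # predict at the training inputs
K = rbf_kernel(X_train, X_train)
K_star = rbf_kernel(X_train, X_star)                   # column j is k* for X_star[j]
k_ss = np.diag(rbf_kernel(X_star, X_star))

mean = K_star.T @ np.linalg.solve(K + R, y_train)
# Variance of the PPD without the sigma*^2 term (i.e. the epistemic part only)
latent_variance = k_ss - np.sum(K_star * np.linalg.solve(K + R, K_star), axis=0)

print(latent_variance)
```

The noisy third point is trusted less, so the remaining uncertainty there is larger than at its neighbors, even though all five inputs were observed.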

## Summary of Gaussian Processes

In fact, the previous PPD expression for Gaussian processes can be generalized by absorbing terms into the kernel (because the kernels have all those fancy properties we just talked about!):

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left( y^* \mid \mu^* + {\mathbf{k}^*}^T \mathbf{K}^{-1}\mathbf{y} \,,\, k(\mathbf{x}^*, \mathbf{x}^*) - {\mathbf{k}^*}^T \mathbf{K}^{-1} \mathbf{k}^*\right)
$$

* where $\mu^*$ <font color='red'>is usually zero</font> (it depends on the prior mean $\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w$, but often we choose the prior mean as zero, leading to $\mu^*=0$).
* and where the noise (uncertainty) assumed in the observation distribution (previously, we assumed it to be constant: $\sigma^2$) can be incorporated in the kernel (and can even change at each input point!).

    For example, by adding a White Kernel to an RBF kernel:

    $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_i^2 \delta_{ij} + s^2 \exp\left( - \frac{||\mathbf{x}_i-\mathbf{x}_j||^2}{2 l^2} \right)$
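
As a preview of what this looks like in scikit-learn, the sketch below adds a `WhiteKernel` to a scaled RBF kernel and fits a `GaussianProcessRegressor`. The noisy sine dataset and the initial hyperparameter values are made up. Note that scikit-learn's `WhiteKernel` has a single `noise_level`, so this example corresponds to the homoscedastic case (the same $\sigma^2$ at every point).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
X_train = np.linspace(0.0, 10.0, 25).reshape(-1, 1)                   # made-up inputs
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.2, size=25)    # noisy outputs

# s^2 * RBF(l) + noise term; the values below are only initial guesses
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2, random_state=0)
gpr.fit(X_train, y_train)      # hyperparameters are optimized during the fit

X_star = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
mean, std = gpr.predict(X_star, return_std=True)

print(gpr.kernel_)             # kernel with the optimized hyperparameters
```

Because the noise lives inside the kernel, the optimized `noise_level` is an estimate of the data noise learned together with $s^2$ and $l$.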

### You will explore these and other things in Homework 5

In the next two lectures and in Homework 5 you are going to visualize and explore some of the effects of the hyperparameter and kernel choices.

Have fun!