<img src=docs/tudelft_logo.jpg width=50%>

## Data-driven Design and Analyses of Structures and Materials (3dasm)

## Lecture 13

### Miguel A. Bessa | <a href = "mailto: M.A.Bessa@tudelft.nl">M.A.Bessa@tudelft.nl</a>  | Associate Professor

**What:** A lecture of the "3dasm" course

**Where:** This notebook comes from this [repository](https://github.com/bessagroup/3dasm_course)

**Reference for entire course:** Murphy, Kevin P. *Probabilistic machine learning: an introduction*. MIT press, 2022. Available online [here](https://probml.github.io/pml-book/book1.html)

**How:** We try to follow Murphy's book closely, but the sequence of Chapters and Sections is different. The intention is to use notebooks as an introduction to the topic and Murphy's book as a resource.
* If working offline: Go through this notebook and read the book.
* If attending class in person: listen to me (!) but also go through the notebook on your laptop at the same time. Read the book.
* If attending lectures remotely: listen to me (!) via Zoom and (ideally) use two screens, with the notebook open on one screen and the lecture on the other. Read the book.

**Optional reference (the "bible" by the "bishop"... pun intended 😆) :** Bishop, Christopher M. *Pattern recognition and machine learning*. Springer Verlag, 2006.

**References/resources to create this notebook:**
* Chapter 11 of Murphy's book.

Apologies in advance if I missed some reference used in this notebook. Please contact me if that is the case, and I will gladly include it here.

## **OPTION 1**. Run this notebook **locally on your computer**:
1. Confirm that you have the 3dasm conda environment (see Lecture 1).

2. Go to the 3dasm_course folder on your computer and pull the latest updates of the [repository](https://github.com/bessagroup/3dasm_course):
```
git pull
```
3. Open a command window and launch Jupyter Notebook (it will open in your internet browser):
```
conda activate 3dasm
jupyter notebook
```
4. Open notebook of this Lecture.

## **OPTION 2**. Use **Google's Colab** (no installation required, but times out if idle):

1. go to https://colab.research.google.com
2. login
3. File > Open notebook
4. click on Github (no need to login or authorize anything)
5. paste the git link: https://github.com/bessagroup/3dasm_course
6. click search and then click on the notebook for this Lecture.

```python
# Basic plotting tools needed in Python.

import matplotlib.pyplot as plt # import plotting tools to create figures
import numpy as np # import numpy to handle a lot of things!
from IPython.display import display, Math # to print with Latex math

%config InlineBackend.figure_format = "retina" # render higher resolution images in the notebook
#plt.style.use("seaborn") # style for plotting that comes from seaborn
plt.rcParams["figure.figsize"] = (8,4) # rescale figure size appropriately for slides
```

## Outline for today

* Derivation of different Linear Regression models
    - Picking up where we left off in Lecture 8.

**Reading material**: This notebook + Chapter 11 of the book.

## Recap of Lectures 8 and 9

Recall our view of Linear regression models from a Bayesian perspective: it's all about the choice of **likelihood** and **prior**!

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Uniform    | Point estimate | Least Squares regression  | 11.2.2  |
| Gaussian  | Gaussian    | Point estimate | Ridge regression   | 11.3  |
| Gaussian  | Laplace    | Point estimate | Lasso regression  | 11.4  |
| Student-$t$  | Uniform    | Point estimate | Robust regression   | 11.6.1  |
| Laplace  | Uniform    | Point estimate | Robust regression   | 11.6.2  |
| Gaussian  | Gaussian    | Gaussian | Bayesian linear regression   | 11.7 |

Let's continue along the lines of the <font color='red'>Homework</font> of Lecture 8, and derive a few of these models for the multidimensional case.

We are now totally prepared to derive any ML model in any dimension!

In Lecture 8 and its Homework we derived linear regression models using 1D input $x$, 1D output $y$, and a polynomial basis function $\boldsymbol{\phi}(x)$.

We will quickly recap what we did then, and then show how this generalizes to multidimensional inputs $\mathbf{x}$ and for any kind of basis function $\boldsymbol{\phi}(\mathbf{x})$.

* Note: without loss of generality, we will keep considering a single output $y$.

## Linear Least Squares: Linear regression with Gaussian likelihood, Uniform prior and posterior via Point estimate

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Uniform    | Point estimate | Least Squares regression  | 11.2.2  |

This model assumes a Gaussian observation distribution with constant variance and "linear" mean (recall: linear in the unknowns $\mathbf{z}$). If considering 1D input $x$ and 1D output $y$ the model is written as:
1. Gaussian observation distribution: $p(y|x, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(x), \sigma_{y|z}^2 = \sigma^2)$

where $\mathbf{z} = (\mathbf{w}, \sigma)$ are all the unknown model parameters (hidden rv's).

2. Uniform prior distribution for each hidden rv in $\mathbf{z}$: $p(\mathbf{z}) \propto 1$

3. MLE point estimate for posterior: $\hat{\mathbf{z}}_{\text{mle}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{i=1}^{N}\log{ p(y=y_i|x=x_i, \mathbf{z})}\right]$

Final prediction is given by the <font color='orange'>PPD</font>: $\require{color}
{\color{orange}p(y|x, \mathcal{D})} = \int p(y|x,\mathbf{z}) \delta(\mathbf{z}-\hat{\mathbf{z}}) dz = p(y|x, \mathbf{z}=\hat{\mathbf{z}})$

#### Notes

Compared to the previous lectures, pay attention to the following updates in the notation of our 1D linear regression model:

1. We are explicitly including the input $x$ in the probability densities, as we will no longer fix $x$ to a particular value like we did up to now

2. We are now considering more than one unknown rv and grouping them in the vector $\mathbf{z}$.

### Recall the car stopping distance problem

<img src="docs/reaction-braking-stopping.svg" title="Car stopping distance" width="25%" align="right">

Let's focus (again) on our favorite problem, but now we will not keep the velocity of the car $x$ fixed.

If we knew the "ground truth" of this problem, then it would be given by:

$\require{color}y = {\color{red}z_1}\cdot x + {\color{red}z_2}\cdot x^2$

- $y$ is the **output**: the car stopping distance (in meters)
- ${\color{red}z_1}$ is a hidden variable: an <a title="random variable">rv</a> representing the driver's reaction time (in seconds)
- ${\color{red}z_2}$ is another hidden variable: an <a title="random variable">rv</a> that depends on the coefficient of friction, the inclination of the road, the weather, etc. (in m$^{-1}$s$^{2}$).
- $x$ is the **input**: constant car velocity (in m/s).

where $z_1 \sim \mathcal{N}(\mu_{z_1}=1.5,\sigma_{z_1}^2=0.5^2)$, and $z_2 \sim \mathcal{N}(\mu_{z_2}=0.1,\sigma_{z_2}^2=0.01^2)$.
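If you want to play with this problem yourself, here is a minimal sketch of how a dataset could be sampled from the ground truth above. The number of observations `N`, the velocity range and the random seed are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily, for reproducibility

N = 50                                  # number of observations (assumed)
x = rng.uniform(5.0, 40.0, size=N)      # car velocities in m/s (assumed range)
z1 = rng.normal(1.5, 0.5, size=N)       # reaction time samples, N(1.5, 0.5^2)
z2 = rng.normal(0.1, 0.01, size=N)      # friction-related samples, N(0.1, 0.01^2)
y = z1 * x + z2 * x**2                  # stopping distances in meters

print(x.shape, y.shape)
```

Each observation draws its own $z_1$ and $z_2$, so the scatter in $y$ comes entirely from the hidden variables.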

Unsurprisingly, in Exercise 1 of Lecture 9 we saw that a linear model with a **quadratic polynomial basis function** predicts the stopping distance for this problem very well:

1. Gaussian observation distribution: $p(y|x, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(x), \sigma_{y|z}^2 = \sigma^2)$

where $\mathbf{z} = (\mathbf{w}, \sigma)$ are all the hidden rv's of the model, i.e. the model parameters.
* the vector $\mathbf{w} = [w_0, w_1, w_2 ..., w_{M-1}]^T$ includes the **bias** term $w_0$ and the remaining **weights** $w_m$ with $m=1,..., M-1$.
* the vector $\boldsymbol{\phi}(x) = [1, x, x^2, ..., x^{M-1}]^T$ includes the **basis functions**, which now correspond to a polynomial of degree $M-1$.  When <font color='red'>$M=3$</font> we have a quadratic polynomial basis (3 unknowns).

2. Uniform prior distribution for each hidden rv in $\mathbf{z}$: $p(\mathbf{z}) \propto 1$

3. MLE point estimate for posterior: $\hat{\mathbf{z}}_{\text{mle}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{ p(y=y_n|x_n, \mathbf{z})}\right]$

For other problems, the polynomial degree $M-1$ of the basis functions may need to be different.

* For example, also in Lecture 9 we saw that for a problem whose ground truth is $x\sin{x}$ then the polynomial basis function needs to have a higher degree. However, even then the approximation is not brilliant because the ground truth is not really a polynomial!

There are other basis functions that can be adopted. For example, spline basis functions (Section 11.5 in the book), among many other possibilities (kernels!).

As we also mentioned, as long as the basis functions $\boldsymbol{\phi}(x)$ do not depend on any rv $\mathbf{z}$ and the mean of the observation distribution is defined linearly as a function of the rv's, then we still have a linear regression model.

But now let's consider problems that still have only one output $y$ but that can have multiple inputs $\mathbf{x} = [x_1, x_2, ..., x_D]^T$ where $x_d$ is feature $d$ and where $d=1, ..., D$.

In this case, we can write the multidimensional linear regression model as:

1. Gaussian observation distribution: $p(y|\mathbf{x}, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma_{y|z}^2 = \sigma^2)$

where $\mathbf{z} = (\mathbf{w}, \sigma)$ are all the hidden rv's of the model, i.e. the model parameters.
* the vector $\mathbf{w} = [w_0, w_1, w_2 ..., w_{M-1}]^T$ includes the **bias** term $w_0$ and the remaining **weights** $w_m$ with $m=1,..., M-1$.
* and the basis functions remain a vector, but now each element acts on the vector $\mathbf{x}$, which has $D$ features: $\boldsymbol{\phi}(\mathbf{x}) = [\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \phi_2(\mathbf{x}) ..., \phi_{M-1}(\mathbf{x})]^T$

and where the remaining choices for the linear regression model remain the same:

2. Uniform prior distribution for each hidden rv in $\mathbf{z}$: $p(\mathbf{z}) \propto 1$

3. MLE point estimate for posterior: $\hat{\mathbf{z}}_{\text{mle}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{ p(y=y_n|\mathbf{x} = \mathbf{x}_n, \mathbf{z})}\right]$

Final prediction is given by the <font color='orange'>PPD</font>: 

$$\require{color}
{\color{orange}p(y|\mathbf{x}, \mathcal{D})} = \int p(y|\mathbf{x},\mathbf{z}) \delta(\mathbf{z}-\hat{\mathbf{z}}) d\mathbf{z} = p(y|\mathbf{x}, \mathbf{z}=\hat{\mathbf{z}})$$

Therefore, we are capable of predicting the PPD by discovering the unknowns $\mathbf{z}$ via the point estimate of the posterior, which requires solving the $\mathrm{argmin}$ of the negative log likelihood.

Now, let's focus on estimating the unknowns $\mathbf{z}$ via the MLE point estimate of the posterior (maximum likelihood estimation).

As we saw in Lecture 8, finding the MLE is the same as finding the location of the minimum of the negative log likelihood.

Since our observation distribution is a Gaussian (univariate in $y$, with a mean that depends on the multidimensional input $\mathbf{x}$),

$p(y|\mathbf{x}, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma_{y|z}^2 = \sigma^2)$

then the likelihood is given by (Lecture 5 but now with vectors):

$$
\begin{align}
p(y=\mathcal{D}_y | \mathbf{x}=\mathcal{D}_x, \mathbf{z}) &= \prod_{n=1}^{N} p(y=y_n|\mathbf{x}=\mathbf{x}_n, \mathbf{z}) \\
&= p(y=y_1|\mathbf{x}=\mathbf{x}_1, \mathbf{z})p(y=y_2|\mathbf{x}=\mathbf{x}_2, \mathbf{z}) \cdots p(y=y_N|\mathbf{x}=\mathbf{x}_N, \mathbf{z})
\end{align}
$$

which we already know is also a multivariate Gaussian (unnormalized).

But, since we are not going fully Bayesian, the only thing we need to estimate is the location of the maximum of the likelihood (point estimate!):

$$\begin{align}
\hat{\mathbf{z}}_{\text{mle}} &= \underset{z}{\mathrm{argmin}}\left[\text{NLL}(\mathbf{z})\right]
\\
&= \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{ p(y=y_n|\mathbf{x}=\mathbf{x}_n, \mathbf{z})}\right]
\end{align}
$$

In Lecture 9 we allowed scikit-learn to find the minimum for us! But today we will actually determine this minimum...

You already did this in the Homework of Lecture 8 for the 1D case with a linear polynomial basis and fixing $x$. The multivariate case for a general basis function and for different $\mathbf{x}$ is just as easy! Especially when considering the variance of the observation distribution to be the same everywhere!

$$
\begin{align}
\hat{\mathbf{z}}_{\text{mle}} &= \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{ p(y=y_n|\mathbf{x}=\mathbf{x}_n, \mathbf{z})}\right] \\
&= \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{\left( \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}\left[y_n-\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\right]^2\right\}\right)}\right]\\
&= \underset{z}{\mathrm{argmin}}\left[\frac{N}{2}\log{\left(2\pi \sigma^2\right)}+\frac{1}{2 \sigma^2}\sum_{n=1}^{N}\left[y_n-\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\right]^2 \right]\\
\end{align}
$$

where we recall that the unknowns are $\mathbf{z} = (\mathbf{w}, \sigma)$.

To find the location of the minimum we need to take the gradient of $\text{NLL}(\mathbf{z})$ wrt $\mathbf{z}$ and set it equal to zero:

$$
\nabla_{\mathbf{z}} \text{NLL}(\mathbf{z}) = \mathbf{0}
$$

which can be written as,

$$
\begin{bmatrix}
\frac{\partial \text{NLL}(\mathbf{z})}{\partial w_0}\\
\frac{\partial \text{NLL}(\mathbf{z})}{\partial w_1}\\
\vdots \\
\frac{\partial \text{NLL}(\mathbf{z})}{\partial w_{M-1}}\\
\frac{\partial \text{NLL}(\mathbf{z})}{\partial \sigma^2}\\
\end{bmatrix} =
\begin{bmatrix}0\\
0\\
\vdots \\
0\\
0\\
\end{bmatrix}
$$

We can first solve this system of equations wrt $\mathbf{w}$, and then solve wrt $\sigma^2$.

Then, solving first for the weights $\mathbf{w}$:

$$
\nabla_{\mathbf{w}} \text{NLL}(\mathbf{w}, \sigma^2) = \mathbf{0}
$$

we note that,

$$\begin{align}
\nabla_{\mathbf{w}} \text{NLL}(\mathbf{w}, \sigma^2) = \mathbf{0} \\
\nabla_{\mathbf{w}} \left[\frac{N}{2}\log{\left(2\pi \sigma^2\right)}+\frac{1}{2 \sigma^2}\sum_{n=1}^{N}\left[y_n-\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\right]^2 \right] = \mathbf{0} \\
\nabla_{\mathbf{w}} \left[\underbrace{\frac{1}{2}\sum_{n=1}^{N}\left[y_n-\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\right]^2}_{\text{RSS}(\mathbf{w})} \right] = \mathbf{0}
\end{align}
$$

Note: in Statistics the term in the argument is called **residual sum of squares**.

We can rewrite the above expression in a simpler form:

$$\begin{align}
\nabla_{\mathbf{w}} \left[\frac{1}{2}\sum_{n=1}^{N}\left[y_n-\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\right]^2 \right] = \mathbf{0} \\
\nabla_{\mathbf{w}} \left[\frac{1}{2}\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) \right] = \mathbf{0}
\end{align}
$$

where we group all output measurements $y_n$ into a $N\times 1$ vector $\mathbf{y}$ and where we group all $N$ evaluations of the basis functions into the $N\times M$ matrix:

$$
\boldsymbol{\Phi} = \begin{bmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \\
\end{bmatrix}
$$
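To make the matrix $\boldsymbol{\Phi}$ concrete, here is a small sketch of how it could be assembled with NumPy. The helper names `polynomial_design_matrix` and `quadratic_design_matrix_2d` are hypothetical, and the 2D quadratic basis is just one possible choice of $\boldsymbol{\phi}(\mathbf{x})$ for $D=2$.

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Return the N x M matrix with entries Phi[n, m] = x_n**m (1D input)."""
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, N=M, increasing=True)

def quadratic_design_matrix_2d(X):
    """Hypothetical basis for D=2 inputs: [1, x1, x2, x1^2, x1*x2, x2^2]."""
    X = np.asarray(X, dtype=float)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

x = np.array([1.0, 2.0, 3.0, 4.0])
Phi_1d = polynomial_design_matrix(x, M=3)    # columns: 1, x, x^2

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Phi_2d = quadratic_design_matrix_2d(X)       # N=3 rows, M=6 columns

print(Phi_1d)
print(Phi_2d)
```

Each row of $\boldsymbol{\Phi}$ is one data point and each column is one basis function, regardless of the input dimension $D$.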

Setting the gradient wrt all $\mathbf{w}$ to zero gives,

$$\begin{align}
\nabla_{\mathbf{w}} \left[\frac{1}{2}\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) \right] = \mathbf{0}\\
\frac{1}{2} \left[ \left( \boldsymbol{\Phi}^T\boldsymbol{\Phi}+\boldsymbol{\Phi}^T\boldsymbol{\Phi} \right)\mathbf{w} -\boldsymbol{\Phi}^T\mathbf{y}-\boldsymbol{\Phi}^T\mathbf{y} \right] = \mathbf{0}
\end{align}
$$

where we used the identity $ \frac{\partial \mathbf{x}^T\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \left( \mathbf{A}+\mathbf{A}^T\right)\mathbf{x}$. (See Section 7.8 of Murphy's book if you need to revise matrix calculus).
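If the step above feels too fast, one way to fill in the intermediate algebra is to expand the quadratic form first and then differentiate term by term (using that $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ is symmetric):

$$
\begin{align}
\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) &= \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} - 2\,\mathbf{y}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{y}^T\mathbf{y} \\
\nabla_{\mathbf{w}}\left[\mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w}\right] &= 2\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} \\
\nabla_{\mathbf{w}}\left[2\,\mathbf{y}^T\boldsymbol{\Phi}\mathbf{w}\right] &= 2\,\boldsymbol{\Phi}^T\mathbf{y} \\
\Rightarrow \quad \boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} &= \boldsymbol{\Phi}^T\mathbf{y}
\end{align}
$$

The last line is the set of so-called normal equations.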

From which we reach the MLE prediction for the weights:

$$
\hat{\mathbf{w}}_{\text{mle}} = \left(\boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}
$$

We conclude that our point estimate for the posterior (MLE) is: $\hat{\mathbf{w}}_{\text{mle}} = \left(\boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$

where the quantity

$$\boldsymbol{\Phi}^{\dagger} = \left(\boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T $$

is known as the Moore-Penrose pseudo-inverse of the matrix $\boldsymbol{\Phi}$. It can be regarded as a generalization of the notion of matrix inverse to **nonsquare matrices**. In the special case of $\boldsymbol{\Phi}$ being square and invertible, then using the property $\left( \mathbf{A}\mathbf{B}\right)^{-1}=\mathbf{B}^{-1}\mathbf{A}^{-1}$ we see that $\boldsymbol{\Phi}^{\dagger}=\boldsymbol{\Phi}^{-1}$.
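A quick numerical sanity check of both statements can be sketched as follows. The two small matrices are arbitrary examples chosen for illustration: the tall one has full column rank and the square one is invertible.

```python
import numpy as np

# Tall matrix (N > M) with full column rank: pinv matches (Phi^T Phi)^{-1} Phi^T.
Phi_tall = np.vander(np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5]), N=3, increasing=True)
pinv_formula = np.linalg.inv(Phi_tall.T @ Phi_tall) @ Phi_tall.T
print(np.allclose(pinv_formula, np.linalg.pinv(Phi_tall)))

# Square invertible matrix: the same formula coincides with the ordinary inverse.
Phi_square = np.array([[2.0, 1.0, 0.0],
                       [1.0, 3.0, 1.0],
                       [0.0, 1.0, 4.0]])
pinv_square = np.linalg.inv(Phi_square.T @ Phi_square) @ Phi_square.T
print(np.allclose(pinv_square, np.linalg.inv(Phi_square)))
```

In practice we would call `np.linalg.pinv` (or a least-squares solver) directly instead of forming $\left(\boldsymbol{\Phi}^T \boldsymbol{\Phi}\right)^{-1}$ explicitly.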

Also note that we could have calculated the bias term $w_0$ separately (which is convenient because for other models the bias usually has a uniform prior, unlike the remaining weights). If we do that we obtain:

$$
\hat{w}_0 = \bar{y}-\sum_{m=1}^{M-1} w_m \bar{\phi}_m
$$

where we defined $\bar{y} = \frac{1}{N}\sum_{n=1}^N y_n$ and $\bar{\phi}_m = \frac{1}{N}\sum_{n=1}^{N} \phi_m(\mathbf{x}_n)$.
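We can check this relation numerically. The sketch below fits a quadratic basis to synthetic data (the data-generating function, noise level and seed are assumptions) and compares the fitted bias with the value recovered from the means.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0, size=30)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.1, size=30)  # synthetic data (assumed)

Phi = np.vander(x, N=3, increasing=True)          # columns: 1, x, x^2
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares weights

phi_bar = Phi[:, 1:].mean(axis=0)                 # means of the non-constant basis functions
w0_from_means = y.mean() - w_hat[1:] @ phi_bar    # bias recovered from the means

print(w_hat[0], w0_from_means)
```

The two numbers agree because the first normal equation forces the residuals to sum to zero.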

Having found the solution for all $\mathbf{w}$, we just need to find one last unknown from the point estimate of the posterior:

$$
\nabla_{\mathbf{\sigma^2}} \text{NLL}(\mathbf{w}, \sigma^2) = 0
$$

which is particularly simple:

$$
\hat{\sigma}^2_{\text{mle}} = \frac{1}{N}\sum_{n=1}^{N} \left[y_n -\hat{\mathbf{w}}^T_{\text{mle}}\boldsymbol{\phi}(\mathbf{x}_n)\right]^2
$$
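For completeness, here is a sketch of the intermediate step. Differentiating the NLL wrt $\sigma^2$, with the weights fixed at $\hat{\mathbf{w}}_{\text{mle}}$, gives:

$$
\begin{align}
\frac{\partial \text{NLL}}{\partial \sigma^2} &= \frac{N}{2\sigma^2} - \frac{1}{2\sigma^4}\sum_{n=1}^{N}\left[y_n - \hat{\mathbf{w}}^T_{\text{mle}}\boldsymbol{\phi}(\mathbf{x}_n)\right]^2 = 0 \\
\Rightarrow \quad N\sigma^2 &= \sum_{n=1}^{N}\left[y_n - \hat{\mathbf{w}}^T_{\text{mle}}\boldsymbol{\phi}(\mathbf{x}_n)\right]^2
\end{align}
$$

so the MLE of the variance $\sigma^2$ is simply the mean of the squared residuals.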

In summary, the MLE point estimate of the posterior leads to the following estimation of parameters:

$$
\hat{\mathbf{w}}_{\text{mle}} = \left(\boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}
$$

$$
\hat{\sigma}^2_{\text{mle}} = \frac{1}{N}\sum_{n=1}^{N} \left[y_n -\hat{\mathbf{w}}^T_{\text{mle}}\boldsymbol{\phi}(\mathbf{x}_n)\right]^2
$$

where the Moore-Penrose pseudo inverse needs to be calculated $\boldsymbol{\Phi}^{\dagger} = \left(\boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T $.

This calculation can be done efficiently by many libraries, including Numpy.

* For example, scikit-learn uses a solver based on SVD (Singular Value Decomposition: the most common dimensionality reduction method) which is efficient when $N > M$ (overdetermined system). Book section 7.5 has an excellent summary of SVD, if you are curious.

* Of course, if $N = M$ (and $\boldsymbol{\Phi}$ is invertible) then there is a unique solution (you know that from the Midterm!) and the error on the training set becomes zero (linear regression becomes fully interpolatory).
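The whole MLE recipe fits in a few lines of NumPy. This is a minimal sketch on synthetic car stopping distance data (the sample size, velocity range and seed are assumptions), using `np.linalg.lstsq`, which relies on an SVD-based LAPACK routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic car stopping distance data (sample size and velocity range are assumptions).
N = 200
x = rng.uniform(5.0, 40.0, size=N)
z1 = rng.normal(1.5, 0.5, size=N)
z2 = rng.normal(0.1, 0.01, size=N)
y = z1 * x + z2 * x**2

Phi = np.vander(x, N=3, increasing=True)          # quadratic basis: 1, x, x^2

# MLE of the weights via an SVD-based least-squares solver.
w_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Same estimate from the normal equations (less stable if Phi is ill-conditioned).
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# MLE of the noise variance: mean of the squared residuals.
residuals = y - Phi @ w_mle
sigma2_mle = np.mean(residuals**2)

print("w_mle      =", w_mle)
print("w_normal   =", w_normal)
print("sigma2_mle =", sigma2_mle)
```

Both routes give the same weights here. The least-squares solver is usually the safer choice because it never forms $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$, whose condition number is the square of that of $\boldsymbol{\Phi}$.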

## Ridge regression: Linear regression with Gaussian likelihood, Gaussian prior and posterior via Point estimate

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Gaussian    | Point estimate | Ridge regression   | 11.3  |

1. Gaussian observation distribution: $p(y|\mathbf{x}, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma_{y|z}^2 = \sigma^2)$

where $\mathbf{z} = (\mathbf{w}, \sigma)$ are all the unknown model parameters (hidden rv's).

2. But using a Gaussian prior for the weights $\mathbf{w}$: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}| \mathbf{0}, \overset{\scriptscriptstyle <}{\sigma}_w^2 \mathbf{I})$

3. MAP point estimate for posterior: $\hat{\mathbf{z}}_{\text{map}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{ p(y=y_n|\mathbf{x}=\mathbf{x}_n, \mathbf{z})} - \log{p(\mathbf{w})}\right]$

Final prediction is given by the <font color='orange'>PPD</font>: $\require{color}
{\color{orange}p(y|\mathbf{x}, \mathcal{D})} = \int p(y|\mathbf{x}, \mathbf{z}) \delta(\mathbf{z}-\hat{\mathbf{z}}_{\text{map}}) d\mathbf{z} = p(y|\mathbf{x}, \mathbf{z}=\hat{\mathbf{z}}_{\text{map}})$

#### Note on the choice of prior for linear regression

The Gaussian prior is usually imposed only on the weights. The bias and variance term in the observation distribution still have a Uniform prior because they do not contribute to overfitting.

You can see this from the expressions for $\hat{w}_{0, \text{mle}}$ and $\sigma^2_{\text{mle}}$, as they act on the global mean and MSE (mean squared error) of the residuals, respectively.

Computing the MAP estimate is very similar to what we did for the MLE:

$$
\begin{align}
\hat{\mathbf{w}}_{\text{map}} &= \underset{w}{\mathrm{argmin}}\left[\frac{1}{2\sigma^2}\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \frac{1}{2\overset{\scriptscriptstyle <}{\sigma}_w^2}\mathbf{w}^T\mathbf{w}\right] \\
&= \underset{w}{\mathrm{argmin}}\left[2\,\text{RSS}(\mathbf{w}) + \lambda ||\mathbf{w}||_2^2\right]
\end{align}
$$

where $\lambda = \frac{\sigma^2}{\overset{\scriptscriptstyle <}{\sigma}_w^2}$ is proportional to the strength of the prior, and

$||\mathbf{w}||_2^2 = \sum_{m=1}^{M-1} |w_m|^2 = \mathbf{w}^T\mathbf{w}$

is the squared $l_2$ norm of the vector $\mathbf{w}$ (the sum starts at $m=1$ when the bias $w_0$ is left out of the penalty). Thus, we are penalizing weights that become too large in magnitude. In ML literature this is usually called $l_2$ **regularization** or **weight decay**, and is very widely used.


In the midterm, you will experience the difference between Linear Least Squares and Ridge regression, reporting on the influence of the prior strength for the latter.

Then, solving the MAP first for the weights $\mathbf{w}$, as we did for the MLE:

$$\begin{align}
\nabla_{\mathbf{w}} \left[\frac{1}{2\sigma^2}\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \frac{1}{2\overset{\scriptscriptstyle <}{\sigma}_w^2}\mathbf{w}^T\mathbf{w}\right] &= \mathbf{0} \\
\nabla_{\mathbf{w}} \left[\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \lambda\mathbf{w}^T\mathbf{w}\right] &= \mathbf{0}
\end{align}
$$

from which we determine the MAP estimate for the weights $\mathbf{w}$ as:

$$
\hat{\mathbf{w}}_{\text{map}} = \left( \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda \mathbf{I}_M \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T + \lambda \mathbf{I}_M \right)^{-1} \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n) y_n \right)
$$

Once again, this can be solved using SVD or other methods to ensure that the Moore-Penrose pseudo-inverse is calculated properly.
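As an illustration, here is a minimal sketch of the closed-form Ridge estimate. The helper name `ridge_map`, the synthetic data and the list of $\lambda$ values are assumptions. Following the formula with $\lambda \mathbf{I}_M$, this sketch penalizes all $M$ weights, bias included.

```python
import numpy as np

def ridge_map(Phi, y, lam):
    """MAP weights (Phi^T Phi + lam * I_M)^{-1} Phi^T y, solved without forming the inverse."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=15)
y = x * np.sin(3.0 * x) + rng.normal(0.0, 0.05, size=15)   # synthetic targets (assumed)
Phi = np.vander(x, N=8, increasing=True)                    # degree-7 polynomial basis

for lam in [1e-3, 1e-1, 10.0]:
    w = ridge_map(Phi, y, lam)
    print(f"lambda = {lam:7.3f}   ||w||_2 = {np.linalg.norm(w):.4f}")
```

The norm of the weights shrinks as $\lambda$ grows, which is exactly the "weight decay" effect of the Gaussian prior.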

## Lasso regression: Linear regression with Gaussian likelihood, Laplace prior and posterior via Point estimate

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Laplace    | Point estimate | Lasso regression  | 11.4  |

1. Gaussian observation distribution: $p(y|\mathbf{x}, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma_{y|z}^2 = \sigma^2)$

where $\mathbf{z} = (\mathbf{w}, \sigma)$ are all the unknown model parameters (hidden rv's).

2. But using a **Laplace** prior for the weights $\mathbf{w}$: $p(\mathbf{w}) = \prod_{m=1}^{M-1}\text{Lap}\left(w_m| 0, 1/\overset{\scriptscriptstyle <}{\lambda}_w\right) \propto \prod_{m=1}^{M-1} \exp{\left[ -\overset{\scriptscriptstyle <}{\lambda}_w |w_m|\right]}$

3. MAP point estimate for posterior: $\hat{\mathbf{z}}_{\text{map}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{ p(y=y_n|\mathbf{x}=\mathbf{x}_n, \mathbf{z})} - \log{p(\mathbf{w})}\right]$

Final prediction is given by the <font color='orange'>PPD</font>: $\require{color}
{\color{orange}p(y|\mathbf{x}, \mathcal{D})} = \int p(y|\mathbf{x}, \mathbf{z}) \delta(\mathbf{z}-\hat{\mathbf{z}}_{\text{map}}) d\mathbf{z} = p(y|\mathbf{x}, \mathbf{z}=\hat{\mathbf{z}}_{\text{map}})$

#### Note about the sparsity parameter $\overset{\scriptscriptstyle <}{\lambda}_w$

Please note that the $\overset{\scriptscriptstyle <}{\lambda}_w$ parameter defining the strength of the Laplace prior is different from the $\lambda$ parameter defined in Ridge regression.

#### Note about number of parameters $M$ and number of input dimensions $D$

If $M=D$ the method is called Lasso, but if there are more parameters than input variables $M>D$ then it is called Group Lasso (Section 11.4.7).

* The next cells describe Lasso, which introduces sparsity by driving the weight associated with a particular variable towards zero.

* Group Lasso leads to a sparsity of more than one parameter that is associated to a given variable (which is an interesting way to induce sparsity in overparameterized models such as Artificial Neural Networks). It is derived in a very similar manner.

Computing the MAP estimate:

$$
\begin{align}
\hat{\mathbf{w}}_{\text{map}} &= \underset{w}{\mathrm{argmin}}\left[\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \overset{\scriptscriptstyle <}{\lambda}_w||\mathbf{w}||_1\right]
\end{align}
$$

where $||\mathbf{w}||_1 = \sum_{m=1}^{M-1}|w_m|$ is called the $l_1$ norm of $\mathbf{w}$. In ML literature this is called $l_1$ regularization. 

* Calculating the MAP for Lasso is not done the same way as for Ridge because the term $||\mathbf{w}||_1$ is not differentiable whenever $w_m = 0$. In this case, the solution is found using hard- or soft-thresholding (See Section 11.4.3 in the book) because the (sub)gradient becomes a piecewise function.

* A more important point is to see that **different types of prior distributions** introduce **different regularizations** on the weights, alleviating overfitting in a different manner.
    - For example, in the case of Lasso, since the Laplace prior puts more density around the mean (which is zero here) than the Gaussian prior, then it has a tendency to lead the weights to zero, i.e. it introduces **sparsity** when estimating the weights via MAP. The book has a beautiful discussion about this.
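To see soft-thresholding at work, here is a minimal sketch that minimizes the Lasso objective with proximal gradient steps (the ISTA algorithm). This is one simple solver among several and not necessarily what a library such as scikit-learn uses. The function names, the synthetic data and the value of the penalty `lam` are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Phi, y, lam, n_iter=5000):
    """Minimize ||Phi w - y||^2 + lam * ||w||_1 with proximal gradient steps (ISTA)."""
    L = 2.0 * np.linalg.norm(Phi, ord=2) ** 2      # Lipschitz constant of the smooth term's gradient
    step = 1.0 / L
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ w - y)         # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(4)
N, M = 100, 10
Phi = rng.normal(size=(N, M))                      # generic features (assumed, no bias column)
w_true = np.zeros(M)
w_true[[1, 4]] = [3.0, -2.0]                       # only two weights are truly nonzero
y = Phi @ w_true + rng.normal(0.0, 0.1, size=N)

w_lasso = lasso_ista(Phi, y, lam=20.0)
print(np.round(w_lasso, 3))
```

The thresholding step sets small entries exactly to zero, which is how the $l_1$ penalty produces sparse weights, while the surviving weights are slightly shrunk towards zero.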

## Bayesian linear regression: Linear regression with Gaussian likelihood, Gaussian prior and Gaussian posterior (Bayesian solution)

As we saw in the beginning of the Lecture, there are many more models we can define! The book covers quite a few!

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Gaussian    | Gaussian | Bayesian linear regression   | 11.7 |

1. Gaussian observation distribution (with known variance): $p(y|\mathbf{x}, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma_{y|z}^2 = \sigma^2)$

where the unknown model parameters (hidden rv's) are now only the weights, $\mathbf{z} = \mathbf{w}$, because the variance $\sigma^2$ is assumed to be known.

2. Gaussian prior for the weights $\mathbf{w}$: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w)$

3. Gaussian posterior (obtained from Bayes rule)

Final prediction is given by the <font color='orange'>PPD</font>: $\require{color}
{\color{orange}p(y|\mathbf{x}, \mathcal{D})} = \int p(y|\mathbf{x},\mathbf{z}) p(\mathbf{z}|\mathcal{D}) d\mathbf{z}$

At this point you may notice that we have already derived this model in Lecture 7.

The only differences are that now we have multiple weights $\mathbf{w}$ and multidimensional inputs $\mathbf{x}$, and that we allow the inputs to take different values.

Yet, the derivation is the same! We just need to bold the letters. Let's do it:

The likelihood is a (multivariate) Gaussian distribution (product of univariate Gaussians, one evaluated at each data point):

$$
p(\mathcal{D}|\mathbf{w}, \sigma^2) = \prod_{n=1}^N \mathcal{N}(y_n | \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \sigma^2) = \mathcal{N}(\mathbf{y} | \boldsymbol{\Phi}\mathbf{w}, \sigma^2 \mathbf{I}_N)
$$

where $\mathbf{I}_N$ is the $N\times N$ identity matrix, as defined previously.

To calculate the posterior, we again use the product of Gaussians rule (in Lecture 5, in the cell after the Homework, we also defined this rule for multivariate Gaussians!):

$$
p(\mathbf{w}| \boldsymbol{\Phi}, \mathbf{y}, \sigma^2) \propto \mathcal{N}(\mathbf{y} | \boldsymbol{\Phi}\mathbf{w}, \sigma^2 \mathbf{I}_N)    \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w) = \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w)
$$

where the mean and covariance of the posterior are given by:

$$
\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w = \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y}\right)
$$

$$
\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}
$$

Often, we use a prior with zero mean $\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w =\mathbf{0}$ and diagonal covariance $\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w = \overset{\scriptscriptstyle <}{\sigma}_w^2 \mathbf{I}_M$ like the prior we used in Ridge regression. In this case, the posterior mean becomes the same as the MAP estimate obtained from Ridge regression: $\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w =\left( \lambda \mathbf{I}_M  + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$.

#### Note on the reduction of the posterior mean for Bayesian linear regression to the MAP estimate of Ridge regression 

If we use a prior with zero mean $\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w =\mathbf{0}$ and diagonal covariance $\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w = \overset{\scriptscriptstyle <}{\sigma}_w^2 \mathbf{I}_M$ then the posterior mean becomes

$$\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w = \frac{1}{\sigma^2} \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T\mathbf{y}
$$

which is the same as the Ridge regression estimate when we define $\lambda = \frac{\sigma^2}{\overset{\scriptscriptstyle <}{\sigma}_w^2}$,

$\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w =\left( \lambda \mathbf{I}_M  + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$
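This equivalence is easy to verify numerically. The sketch below builds the posterior from the prior precision $\frac{1}{\overset{\scriptscriptstyle <}{\sigma}_w^2}\mathbf{I}_M$ and the data term $\frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}$, and compares its mean with the Ridge estimate. The synthetic data and the values of $\sigma$ and $\overset{\scriptscriptstyle <}{\sigma}_w$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 40, 4
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, N=M, increasing=True)
y = Phi @ np.array([0.5, -1.0, 2.0, 0.3]) + rng.normal(0.0, 0.2, size=N)  # synthetic data (assumed)

sigma2 = 0.2**2        # known noise variance
sigma2_w = 1.5**2      # prior variance of each weight (assumed value)

# Posterior covariance and mean for a zero-mean prior N(0, sigma2_w * I_M).
Sigma_post = np.linalg.inv(np.eye(M) / sigma2_w + Phi.T @ Phi / sigma2)
mu_post = Sigma_post @ (Phi.T @ y) / sigma2

# Ridge MAP estimate with lambda = sigma^2 / sigma_w^2.
lam = sigma2 / sigma2_w
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

print(mu_post)
print(w_ridge)
```

The two vectors coincide, but the Bayesian solution also gives us `Sigma_post`, i.e. the uncertainty on the weights that the point estimate throws away.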

Having determined the posterior, we can determine what we really want: the PPD.

$$
\begin{align}
{\color{orange}p(y|\mathbf{x}, \mathcal{D})} &= \int p(y|\mathbf{x},\mathbf{z}) p(\mathbf{z}|\mathcal{D}) d\mathbf{z} \\
p(y|\mathbf{x}, \mathcal{D}, \sigma^2) &= \int p(y|\mathbf{x},\mathbf{w}, \sigma^2) p(\mathbf{w}|\mathcal{D}) d\mathbf{w} \\
&= \int \mathcal{N}(y | \boldsymbol{\phi}(\mathbf{x})^T\mathbf{w}, \sigma^2)    \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w) d\mathbf{w} \\
&= \mathcal{N}\left(y \mid \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}^T_w \boldsymbol{\phi}(\mathbf{x}) \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x})^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x})\right)
\end{align}
$$

where we recall the meaning of each term:

* ($\mathbf{x}$, $y$) is the point where we want to make a prediction
* $\boldsymbol{\phi}(\mathbf{x})$ is an $M\times 1$ vector of basis functions
* $\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w$ is the $M\times 1$ vector with the mean of the posterior for the weights
* and $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w$ is the $M \times M$ covariance matrix of the posterior for the weights.

#### Note about integration of the PPD

The book uses the following notation:

$$\begin{align}
p(y|\mathbf{x}, \mathcal{D}, \sigma^2) &= \int p(y|\mathbf{x},\mathbf{w}, \sigma^2) p(\mathbf{w}|\mathcal{D}) d\mathbf{w}
\end{align}
$$

We could be more explicit and use the following notation:

$$\begin{align}
p(y|\mathbf{x}, \mathcal{D}, \sigma^2) &= \int p(y|\mathbf{x},\mathbf{w}, \sigma^2) p(\mathbf{w}|\mathcal{D}) d^{M}\mathbf{w} \\
&= \int\int\cdots \int p(y|\mathbf{x},\mathbf{w}, \sigma^2) p(\mathbf{w}|\mathcal{D}) dw_0 dw_1 \cdots dw_{M-1}
\end{align}
$$

where it is clear that we are integrating for all variables (not integrating to get a vector).

However, despite this notation being more precise, it is also less appealing. Just make sure you realize the type of integral that we are calculating when we are finding the PPD.

Observing the obtained PPD,

$$
\begin{align}
p(y|\mathbf{x}, \mathcal{D}, \sigma^2) &= \mathcal{N}\left(y \mid \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}^T_w \boldsymbol{\phi}(\mathbf{x}) \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x})^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x})\right)
\end{align}
$$

we see something very interesting: the variance of the PPD at a point $\mathbf{x}$ after seeing $N$ data points depends on two terms:

1. the variance of the observation noise, $\sigma^2$ that we defined to be constant

2. and the variance in the parameters obtained by the posterior $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w$

This means that the predicted uncertainty increases when $\mathbf{x}$ is located far from the training data $\mathcal{D}$, just like we want it to be! We are less certain about points **away** from our observations (training data).
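Here is a minimal sketch of this behavior for a quadratic polynomial basis. The training data (clustered in $[0, 1]$), the noise level and the prior variance are assumptions, and `ppd` is a hypothetical helper that evaluates the mean and variance of the PPD at a new input.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 20, 3
x_train = rng.uniform(0.0, 1.0, size=N)                        # training inputs in [0, 1]
Phi = np.vander(x_train, N=M, increasing=True)
y_train = 1.0 + 2.0 * x_train + rng.normal(0.0, 0.1, size=N)   # synthetic targets (assumed)

sigma2 = 0.1**2                                                # known noise variance
sigma2_w = 10.0**2                                             # prior variance (assumed value)
Sigma_post = np.linalg.inv(np.eye(M) / sigma2_w + Phi.T @ Phi / sigma2)
mu_post = Sigma_post @ (Phi.T @ y_train) / sigma2

def ppd(x_new):
    """Mean and variance of the posterior predictive distribution at a scalar x_new."""
    phi = np.vander(np.atleast_1d(float(x_new)), N=M, increasing=True)[0]
    mean = mu_post @ phi
    var = sigma2 + phi @ Sigma_post @ phi                      # noise term + parameter term
    return mean, var

for x_new in [0.5, 1.5, 3.0]:
    mean, var = ppd(x_new)
    print(f"x = {x_new:3.1f}   mean = {mean:7.3f}   std = {np.sqrt(var):.3f}")
```

The noise term $\sigma^2$ is the same everywhere, so the growth of the predicted standard deviation outside $[0, 1]$ comes entirely from the term $\boldsymbol{\phi}(\mathbf{x})^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x})$.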

We have seen this happening in Gaussian processes too... You will see that Gaussian processes are not too different from Bayesian linear regression...

## Other linear regression models

As we saw in the beginning of the Lecture, there are many more models we can define! The book covers quite a few!

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Uniform    | Point estimate | Least Squares regression  | 11.2.2  |
| Gaussian  | Gaussian    | Point estimate | Ridge regression   | 11.3  |
| Gaussian  | Laplace    | Point estimate | Lasso regression  | 11.4  |
| Gaussian  | Gaussian$\times$Laplace    | Point estimate | Elastic net  | 11.4.8  |
| Student-$t$  | Uniform    | Point estimate | Robust regression   | 11.6.1  |
| Laplace  | Uniform    | Point estimate | Robust regression   | 11.6.2  |
| Gaussian  | Gaussian    | Gaussian | Bayesian linear regression   | 11.7 |

For example, the "Elastic net" is literally the combination of Ridge regression and Lasso by defining a prior that depends on both a Gaussian and a Laplace distribution. This model was proposed in 2005. Five years later the "Bayesian Elastic Net" was also proposed, where the posterior is calculated in a Bayesian way (just like what we did for Bayesian linear regression).
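As a sketch of what this means for the MAP estimate, combining both penalties gives an objective of the following form, where $\lambda_1$ and $\lambda_2$ are assumed symbols for the strengths of the Laplace and Gaussian parts of the prior, respectively:

$$
\hat{\mathbf{w}}_{\text{map}} = \underset{w}{\mathrm{argmin}}\left[\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \lambda_2||\mathbf{w}||_2^2 + \lambda_1||\mathbf{w}||_1\right]
$$

Setting $\lambda_1 = 0$ recovers Ridge regression, and setting $\lambda_2 = 0$ recovers Lasso.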

## In summary

Almost every ML model is derived following 4 steps:

1. Define the Observation distribution and compute the likelihood.

2. Define the prior and its parameters.

3. Compute the posterior to estimate the unknown parameters (whether via a Bayesian approach or via a Point estimate).

4. Compute the PPD.

Done.

If you only care about the mean of the PPD (the mean of your prediction), then you can compute it directly without even thinking about uncertainty... But then, make sure you characterize the quality of your predictions using an appropriate **error metric** and use strategies like cross-validation, like you did for your Midterm Project!
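As a reminder of what that looks like in practice, here is a minimal sketch of $k$-fold cross-validation for Ridge regression written directly in NumPy. The helper name `kfold_mse`, the synthetic data and the candidate $\lambda$ values are assumptions. In a real project you would probably rely on scikit-learn's tools instead.

```python
import numpy as np

def kfold_mse(Phi, y, lam, k=5, seed=0):
    """Average held-out mean squared error of ridge regression over k folds."""
    N, M = Phi.shape
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)                       # indices not in the held-out fold
        w = np.linalg.solve(Phi[train].T @ Phi[train] + lam * np.eye(M),
                            Phi[train].T @ y[train])
        errors.append(np.mean((y[fold] - Phi[fold] @ w) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(7)
x = rng.uniform(-1.0, 1.0, size=40)
y = x * np.sin(3.0 * x) + rng.normal(0.0, 0.05, size=40)      # synthetic data (assumed)
Phi = np.vander(x, N=8, increasing=True)                      # degree-7 polynomial basis

for lam in [1e-6, 1e-3, 1e-1, 10.0]:
    print(f"lambda = {lam:8.1e}   CV MSE = {kfold_mse(Phi, y, lam):.5f}")
```

The error metric here is the mean squared error on the held-out folds, which gives a simple way to compare different prior strengths without looking at the PPD variance.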

### See you next class

Have fun!