<img src=../figures/Brown_logo.svg width=50%>

## Data-Driven Design & Analyses of Structures & Materials (3dasm)

## Lecture 12

### Miguel A. Bessa | <a href = "mailto: miguel_bessa@brown.edu">miguel_bessa@brown.edu</a>  | Associate Professor

**What:** A lecture of the "3dasm" course

**Where:** This notebook comes from this [repository](https://github.com/bessagroup/3dasm_course)

**Reference for entire course:** Murphy, Kevin P. *Probabilistic machine learning: an introduction*. MIT press, 2022. Available online [here](https://probml.github.io/pml-book/book1.html)

**How:** We try to follow Murphy's book closely, but the sequence of Chapters and Sections is different. The intention is to use notebooks as an introduction to the topic and Murphy's book as a resource.
* If working offline: Go through this notebook and read the book.
* If attending class in person: listen to me (!) but also go through the notebook in your laptop at the same time. Read the book.
* If attending lectures remotely: listen to me (!) via Zoom and (ideally) use two screens where you have the notebook open in 1 screen and you see the lectures on the other. Read the book.

**Optional reference (the "bible" by the "bishop"... pun intended 😆) :** Bishop, Christopher M. *Pattern recognition and machine learning*. Springer Verlag, 2006.

**References/resources to create this notebook:**
* Chapter 11 of Murphy's book.

Apologies in advance if I missed some reference used in this notebook. Please contact me if that is the case, and I will gladly include it here.

## **OPTION 1**. Run this notebook **locally on your computer**:
1. Confirm that you have the '3dasm' mamba (or conda) environment (see Lecture 1).
2. Go to the 3dasm_course folder on your computer and pull the latest updates of the [repository](https://github.com/bessagroup/3dasm_course):
```
git pull
```
    - Note: if you can't pull the repo due to conflicts (and you can't handle these conflicts), use this command (with **caution**!) and your repo becomes the same as the one online:
        ```
        git reset --hard origin/main
        ```
3. Open a command window and launch Jupyter Notebook (it will open in your internet browser):
```
jupyter notebook
```
4. Open the notebook of this Lecture and choose the '3dasm' kernel.

## **OPTION 2**. Use **Google's Colab** (no installation required, but times out if idle):

1. Go to https://colab.research.google.com
2. Log in
3. File > Open notebook
4. Click on GitHub (no need to log in or authorize anything)
5. Paste the git link: https://github.com/bessagroup/3dasm_course
6. Click search and then click on the notebook for this Lecture.

## Outline for today

* Derivation of a few more Linear Regression models:
    - Ridge regression
    - Lasso regression
    - Bayesian linear regression

**Reading material**: This notebook + Chapter 11 of the book.

## Summary of important Linear Regression Models

Recall our view of Linear regression models from a Bayesian perspective: it's all about the choice of **likelihood** and **prior**!

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Uniform    | Point estimate | Least Squares regression  | 11.2.2  |
| Gaussian  | Gaussian    | Point estimate | Ridge regression   | 11.3  |
| Gaussian  | Laplace    | Point estimate | Lasso regression  | 11.4  |
| Student-$t$  | Uniform    | Point estimate | Robust regression   | 11.6.1  |
| Laplace  | Uniform    | Point estimate | Robust regression   | 11.6.2  |
| Gaussian  | Gaussian    | Gaussian | Bayesian linear regression   | 11.7 |

Last Lecture we covered Least Squares regression. Today we will cover Ridge regression, Lasso regression and Bayesian linear regression.

## Ridge regression: Linear regression with Gaussian likelihood, Gaussian prior and posterior via Point estimate

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Gaussian    | Point estimate | Ridge regression   | 11.3  |

1. Gaussian observation distribution: $p(y|\mathbf{x}, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma_{y|z}^2 = \sigma^2)$

    where $\mathbf{z} = \mathbf{w}$ are all the unknown model parameters (hidden rv's), but where $\sigma$ is a **specified** value (i.e. $\sigma$ is chosen by us, unlike in the last Lecture where we had considered it an unknown).

2. But using a Gaussian prior for the weights $\mathbf{w}$: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}| \mathbf{0}, \overset{\scriptscriptstyle <}{\sigma}_w^2 \mathbf{I})$ where $\overset{\scriptscriptstyle <}{\sigma}_w$ is **specified** by us.

3. MAP point estimate for posterior: $\hat{\mathbf{z}}_{\text{map}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{i=1}^{N}\log{ p(y=y_i|\mathbf{z})} - \log{p(\mathbf{w})}\right]$,

Final prediction is given by the <font color='orange'>PPD</font>: $\require{color}{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \int p(y^*|\mathbf{x}^*,\mathbf{z}) \delta(\mathbf{z}-\hat{\mathbf{z}}_{\text{map}}) dz = p(y^*|\mathbf{x}^*, \mathbf{z}=\hat{\mathbf{z}}_{\text{map}}, \mathcal{D})$

### Important concepts for all ML models: parameters vs hyperparameters

At this point we have to highlight something very important that we have referred to multiple times, but never used the formal nomenclature:

* The difference between **parameters** and **hyperparameters**

**Parameters** of an ML model: these are the unknown variables $\mathbf{z}$, i.e. the rv's that are hidden (not explicitly visible in our data). Our goal when making a prediction from the PPD is to:
*  Marginalize these unknown parameters $\mathbf{z}$, if using Bayesian inference (i.e. integrating them out)  
* **OR** estimate the value of these parameters $\mathbf{z}$ using a Point estimate $\hat{\mathbf{z}}$ (if doing deterministic inference, i.e. the common machine learning models without uncertainty estimation)

**Hyperparameters** of an ML model: these are the variables that we are specifying or assuming for our model and that are not going to be updated after training, i.e. they do not change when we determine the PPD.

* For example, for the above-mentioned Ridge regression model we defined **3 hyperparameters**:
    * The prior parameters $\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w$ (that we set to $\mathbf{0}$) and $\overset{\scriptscriptstyle <}{\sigma}_w$ (that we can set to any constant value $\mathrm{cte}$)
    * **AND** the variance of the observation distribution $\sigma^2$ (that you set to whatever particular value you want)

### A note about the prior

Usually the Gaussian prior is imposed only on the weights (not on the bias term, i.e. the $w_0$ term):

* This is actually very common. The bias term in the observation distribution usually has a Uniform prior because it does not contribute to overfitting.

You can also see this in the previous Lecture from the expressions for $\hat{w}_{0, \text{mle}}$ and $\sigma^2_{\text{mle}}$, as they act on the global mean and MSE (mean squared error) of the residuals, respectively.

Coming back to our Ridge regression model, computing the MAP estimate is very similar to what we did in the last Lecture, leading to:

$$
\begin{align}
\hat{\mathbf{w}}_{\text{map}} &= \underset{w}{\mathrm{argmin}}\left[\frac{1}{2\sigma^2}\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \frac{1}{2\overset{\scriptscriptstyle <}{\sigma}_w^2}\mathbf{w}^T\mathbf{w}\right] \\
&= \underset{w}{\mathrm{argmin}}\left[\text{RSS}(\mathbf{w}) + \alpha ||\mathbf{w}||_2^2\right]
\end{align}
$$

where $\alpha = \frac{\sigma^2}{\overset{\scriptscriptstyle <}{\sigma}_w^2}$ is proportional to the strength of the prior (depends on two hyperparameters), and

$||\mathbf{w}||_2 = \sqrt{\sum_{m=1}^{M-1} |w_m|^2} = \sqrt{\mathbf{w}^T\mathbf{w}}$

is called the $l_2$ norm of the vector $\mathbf{w}$. Thus, we are penalizing weights that become too large in magnitude. In ML literature this is usually called $l_2$ **regularization** or **weight decay**, and is very widely used.
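The step from the first to the second line of the MAP expression above can be made explicit. As a short sketch (writing $\text{RSS}(\mathbf{w})$ for the residual sum of squares): multiplying the objective by the positive constant $2\sigma^2$ does not change its minimizer, so

$$
\begin{align}
\hat{\mathbf{w}}_{\text{map}} &= \underset{w}{\mathrm{argmin}}\; 2\sigma^2\left[\frac{1}{2\sigma^2}\text{RSS}(\mathbf{w}) + \frac{1}{2\overset{\scriptscriptstyle <}{\sigma}_w^2}\mathbf{w}^T\mathbf{w}\right] \\
&= \underset{w}{\mathrm{argmin}}\left[\text{RSS}(\mathbf{w}) + \frac{\sigma^2}{\overset{\scriptscriptstyle <}{\sigma}_w^2}\mathbf{w}^T\mathbf{w}\right], \qquad \text{RSS}(\mathbf{w}) = \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)
\end{align}
$$

A narrow prior (small $\overset{\scriptscriptstyle <}{\sigma}_w$) or a noisy observation model (large $\sigma$) therefore both increase the penalty on large weights.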


In Homework 4, you explore the difference between Linear Least Squares and Ridge regression, reporting on the influence of the prior strength for the latter.

Then, solving the MAP first for the weights $\mathbf{w}$, as we did for the MLE:

$$\begin{align}
\nabla_{\mathbf{w}} \left[\frac{1}{2\sigma^2}\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \frac{1}{2\overset{\scriptscriptstyle <}{\sigma}_w^2}\mathbf{w}^T\mathbf{w}\right] &= \mathbf{0} \\
\nabla_{\mathbf{w}} \left[\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \alpha\mathbf{w}^T\mathbf{w}\right] &= \mathbf{0}
\end{align}
$$

from which we determine the MAP estimate for the weights $\mathbf{w}$ as:

$$
\hat{\mathbf{w}}_{\text{map}} = \left( \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha \mathbf{I}_M \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T + \alpha \mathbf{I}_M \right)^{-1} \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n) y_n \right)
$$
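The intermediate algebra between the gradient condition and this closed form can be sketched as follows, using the standard identity $\nabla_{\mathbf{w}}\left(\mathbf{w}^T\mathbf{A}\mathbf{w}\right) = 2\mathbf{A}\mathbf{w}$ for symmetric $\mathbf{A}$:

$$
\begin{align}
\nabla_{\mathbf{w}} \left[\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \alpha\mathbf{w}^T\mathbf{w}\right] &= 2\boldsymbol{\Phi}^T\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + 2\alpha\mathbf{w} = \mathbf{0} \\
\Rightarrow \quad \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha \mathbf{I}_M\right)\mathbf{w} &= \boldsymbol{\Phi}^T\mathbf{y}
\end{align}
$$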

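Once again, this linear system can be solved using SVD or other numerically stable methods rather than forming the inverse explicitly. Note that for $\alpha > 0$ the matrix $\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha \mathbf{I}_M$ is positive definite, so it is always invertible and the Moore-Penrose pseudo-inverse is only needed in the limit $\alpha \to 0$ (Least Squares).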
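The closed-form MAP estimate above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data (noisy samples of a sine), the cubic polynomial basis and the values of $\sigma$ and $\overset{\scriptscriptstyle <}{\sigma}_w$ are made up, and for simplicity the penalty is applied to all weights, including the bias $w_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: noisy samples of a sine, cubic polynomial basis
N, M = 20, 4
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)
Phi = np.vander(x, M, increasing=True)  # columns: 1, x, x^2, x^3

sigma = 0.1      # hyperparameter: std of the observation noise (assumed)
sigma_w = 1.0    # hyperparameter: std of the Gaussian prior (assumed)
alpha = sigma**2 / sigma_w**2

# Ridge (MAP): solve (Phi^T Phi + alpha I) w = Phi^T y
# Simplification: the bias w_0 is penalized too
A = Phi.T @ Phi + alpha * np.eye(M)
w_map = np.linalg.solve(A, Phi.T @ y)

# Least Squares (MLE) for comparison
w_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print("w_map:", w_map)
print("w_mle:", w_mle)
print("norms:", np.linalg.norm(w_map), np.linalg.norm(w_mle))
```

Increasing `alpha` (a stronger prior) should shrink the norm of `w_map` further towards zero, which is the "weight decay" effect described above.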

## Lasso regression: Linear regression with Gaussian likelihood, Laplace prior and posterior via Point estimate

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Laplace    | Point estimate | Lasso regression  | 11.4  |

1. Gaussian observation distribution: $p(y|\mathbf{x}, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma_{y|z}^2 = \sigma^2)$

where $\mathbf{z} = \mathbf{w}$ are all the unknown model parameters (hidden rv's), and $\sigma$ is a hyperparameter.

2. But using a **Laplace** prior for the weights $\mathbf{w}$: $p(\mathbf{w}) = \prod_{m=1}^{M-1}\text{Lap}\left(w_m| 0, 1/\overset{\scriptscriptstyle <}{\lambda}_w\right) \propto \prod_{m=1}^{M-1} \exp{\left[ -\overset{\scriptscriptstyle <}{\lambda}_w |w_m|\right]}$

3. MAP point estimate for posterior: $\hat{\mathbf{z}}_{\text{map}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{i=1}^{N}\log{ p(y=y_i|\mathbf{z})} - \log{p(\mathbf{w})}\right]$

Final prediction is given by the <font color='orange'>PPD</font>: ${\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \int p(y^*|\mathbf{x}^*,\mathbf{z}) \delta(\mathbf{z}-\hat{\mathbf{z}}_{\text{map}}) dz = p(y^*|\mathbf{x}^*, \mathbf{z}=\hat{\mathbf{z}}_{\text{map}}, \mathcal{D})$

#### Note about the sparsity parameter $\overset{\scriptscriptstyle <}{\lambda}_w$

Please note that the $\overset{\scriptscriptstyle <}{\lambda}_w$ hyperparameter defining the strength of the Laplace prior is different from the $\alpha$ hyperparameter defined in Ridge regression.

#### Note about number of parameters $M$ and number of input dimensions $D$

If $M=D$ the method is called Lasso, but if there are more parameters than input variables $M>D$ then it is called Group Lasso (Section 11.4.7).

* The next cells describe Lasso, which introduces sparsity by driving the weights associated with some of the variables to exactly zero.

* Group Lasso leads to a sparsity of more than one parameter that is associated to a given variable (which is an interesting way to induce sparsity in overparameterized models such as Artificial Neural Networks). It is derived in a very similar manner.

Computing the MAP estimate:

$$
\begin{align}
\hat{\mathbf{w}}_{\text{map}} &= \underset{w}{\mathrm{argmin}}\left[\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \overset{\scriptscriptstyle <}{\lambda}_w||\mathbf{w}||_1\right]
\end{align}
$$

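where $||\mathbf{w}||_1 = \sum_{m=1}^{M-1}|w_m|$ is called the $l_1$ norm of $\mathbf{w}$. In ML literature this is called $l_1$ regularization. As in Ridge regression, the objective was multiplied by $2\sigma^2$, and the resulting constant factor in front of the penalty is absorbed into $\overset{\scriptscriptstyle <}{\lambda}_w$ to keep the notation light.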

* Calculating the MAP for Lasso is not done the same way as for Ridge because the term $||\mathbf{w}||_1$ is not differentiable whenever $w_m = 0$. In this case, the solution is found using hard- or soft-thresholding (See Section 11.4.3 in the book) because the gradient becomes a branch function.

* A more important point is to see that **different types of prior distributions** introduce **different regularizations** on the weights, alleviating overfitting in different ways.
    - For example, in the case of Lasso, the Laplace prior puts more density around the mean (which is zero here) than the Gaussian prior, so it tends to drive the weights to zero, i.e. it introduces **sparsity** when estimating the weights via MAP. The book has a beautiful discussion about this.
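To make the soft-thresholding idea concrete, here is a minimal sketch of one possible solver: cyclic coordinate descent on the objective $\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \lambda||\mathbf{w}||_1$. It is an illustration, not necessarily the algorithm used in the book or in the Homework. The function names (`soft_threshold`, `lasso_cd`), the synthetic data and the value of `lam` are assumptions, and all weights are penalized (there is no separate bias term).

```python
import numpy as np

def soft_threshold(rho, t):
    """Soft-thresholding operator: sign(rho) * max(|rho| - t, 0)."""
    return np.sign(rho) * np.maximum(np.abs(rho) - t, 0.0)

def lasso_cd(Phi, y, lam, n_sweeps=500):
    """Coordinate descent for min_w ||Phi w - y||^2 + lam * ||w||_1."""
    N, M = Phi.shape
    w = np.zeros(M)
    col_sq = np.sum(Phi**2, axis=0)            # phi_j^T phi_j for each column
    for _ in range(n_sweeps):
        for j in range(M):
            # residual with the contribution of w_j removed
            r_j = y - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ r_j
            # 1D minimizer: threshold lam/2 because the RSS is not halved
            w[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return w

rng = np.random.default_rng(1)
N, M = 50, 6
Phi = rng.standard_normal((N, M))
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])  # hypothetical sparse weights
y = Phi @ w_true + 0.1 * rng.standard_normal(N)

lam = 5.0
w_lasso = lasso_cd(Phi, y, lam)
print(np.round(w_lasso, 3))
```

Each coordinate update is exact: with the other weights fixed, the objective in $w_j$ is a parabola plus $\lambda|w_j|$, and its minimizer is either zero (when the data "pull" $|\rho|$ is below $\lambda/2$) or a shrunken version of the Least Squares value. That is where the exact zeros come from.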

## Bayesian linear regression: Linear regression with Gaussian likelihood, Gaussian prior and Gaussian posterior (Bayesian solution)


| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Gaussian    | Gaussian | Bayesian linear regression   | 11.7 |

1. Gaussian observation distribution (with **known** variance): $p(y|\mathbf{x}, \mathbf{z}) = \mathcal{N}(y| \mu_{y|z} = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma_{y|z}^2 = \sigma^2)$
    
    where $\mathbf{z} = \mathbf{w}$ are all the unknown model parameters (hidden rv's), and where $\sigma$ is **not** treated as an unknown, i.e. it is a **hyperparameter**.

2. Gaussian prior for the weights $\mathbf{w}$: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w)$

3. Gaussian posterior (obtained from Bayes rule)

Final prediction is given by the <font color='orange'>PPD</font>: $\require{color}
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \int p(y^*|\mathbf{x}^*,\mathbf{z}) p(\mathbf{z}|\mathcal{D}) dz$

At this point you may notice that we have already derived this model in Lecture 7.

The only differences are that now we have multiple weights $\mathbf{z}$, multidimensional inputs $\mathbf{x}$ and that we allow them to have different values.

Yet, the derivation is the same! We just need to bold the letters. Let's do it:

The likelihood is a (multivariate) Gaussian distribution (product of univariate Gaussians, one evaluated at each data point):

$$
p(\mathcal{D}|\mathbf{w}, \sigma^2) = \prod_{n=1}^N \mathcal{N}(y_n | \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \sigma^2) = \mathcal{N}(\mathbf{y} | \boldsymbol{\Phi}\mathbf{w}, \sigma^2 \mathbf{I}_N)
$$

where $\mathbf{I}_N$ is the $N\times N$ identity matrix, as defined previously.
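As a quick numerical sanity check of this identity, the sketch below evaluates the log-likelihood in both ways: as a sum of $N$ univariate Gaussian log-densities, and as a single $N$-dimensional Gaussian with covariance $\sigma^2 \mathbf{I}_N$. The feature matrix, weights and noise level are arbitrary, made-up values.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 8, 3
Phi = rng.standard_normal((N, M))   # hypothetical feature matrix
w = rng.standard_normal(M)          # hypothetical weights
sigma = 0.5
y = Phi @ w + sigma * rng.standard_normal(N)

resid = y - Phi @ w

# Sum of N univariate Gaussian log-densities
loglik_sum = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                    - 0.5 * resid**2 / sigma**2)

# One N-dimensional Gaussian with covariance sigma^2 I_N
cov = sigma**2 * np.eye(N)
_, logdet = np.linalg.slogdet(cov)
loglik_mvn = -0.5 * (N * np.log(2 * np.pi) + logdet
                     + resid @ np.linalg.solve(cov, resid))

print(loglik_sum, loglik_mvn)
```

The two numbers should agree up to floating-point error, because a diagonal covariance makes the joint density factorize over the data points.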

To calculate the posterior, we also use the product of Gaussians rule (in Lecture 5 we also defined this rule for multivariate Gaussians!):

$$
p(\mathbf{w}| \boldsymbol{\Phi}, \mathbf{y}, \sigma^2) \propto \mathcal{N}(\mathbf{y} | \boldsymbol{\Phi}\mathbf{w}, \sigma^2 \mathbf{I}_N)    \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w) = \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w)
$$

where the mean and covariance of the posterior are given by:

$$
\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w = \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y}\right)
$$

$$
\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}
$$

**Two additional notes** about this result (it's recommended to go through these notes):
1. The above result for the posterior is similar to what we did before in Lecture 5, but it involves a little more algebra. If you are curious, I derived it for you (see **NOTE 1** below).
2. It's easy to reduce the posterior mean for Bayesian linear regression to the MAP estimate of Ridge regression (see **NOTE 2** below)
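The posterior mean and covariance given above translate almost literally into NumPy. The following is a minimal sketch, assuming a hypothetical helper name `blr_posterior`, a simple basis $\boldsymbol{\phi}(x) = [1, x]^T$, made-up data and a standard Normal prior on the weights.

```python
import numpy as np

def blr_posterior(Phi, y, sigma, mu_prior, Sigma_prior):
    """Posterior mean and covariance of the weights (Gaussian prior, known sigma)."""
    prior_prec = np.linalg.inv(Sigma_prior)
    post_prec = prior_prec + Phi.T @ Phi / sigma**2
    Sigma_post = np.linalg.inv(post_prec)
    mu_post = Sigma_post @ (prior_prec @ mu_prior + Phi.T @ y / sigma**2)
    return mu_post, Sigma_post

rng = np.random.default_rng(3)
N = 30
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])     # basis: [1, x]
w_true = np.array([0.5, -2.0])             # hypothetical "true" weights
sigma = 0.2                                # hyperparameter (assumed known)
y = Phi @ w_true + sigma * rng.standard_normal(N)

mu_post, Sigma_post = blr_posterior(Phi, y, sigma, np.zeros(2), np.eye(2))
print("posterior mean:", mu_post)
print("posterior covariance:\n", Sigma_post)
```

For larger problems one would typically avoid the explicit inverses and solve linear systems instead (e.g. via a Cholesky factorization); the explicit form is kept here only to mirror the equations.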

#### NOTE 1: Calculation of the posterior in Bayesian linear regression 

$$
p(\mathbf{w}| \boldsymbol{\Phi}, \mathbf{y}, \sigma^2) \propto \mathcal{N}(\mathbf{y} | \boldsymbol{\Phi}\mathbf{w}, \sigma^2 \mathbf{I}_N)    \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w)
$$

where we are ignoring the marginal likelihood (denominator in Bayes' rule) because we already showed many times that it is only a constant that scales the numerator to normalize it.

Now, as we did several times before, calculating the posterior involves a product of Gaussian distributions in $\mathbf{w}$. Therefore, before we calculate the posterior, we need to:

- Express the likelihood in $\mathbf{w}$, instead of $\mathbf{y}$.

We have done this before for a simpler case, and found that the likelihood written in $\mathbf{w}$ is a (non-normalized) Gaussian:

$$
\mathcal{N}(\mathbf{y} | \boldsymbol{\Phi}\mathbf{w}, \sigma^2 \mathbf{I}_N) \propto \mathcal{N}(\mathbf{w} | \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1 )
$$

We can determine this mean of the likelihood $\boldsymbol{\mu}_1$ and its covariance matrix $\boldsymbol{\Sigma}_1$ by rewriting the likelihood accordingly. We do this by first considering the Gaussian likelihood in terms of $\mathbf{y}$:

$$\mathcal{N}(\mathbf{y} | \boldsymbol{\Phi}\mathbf{w}, \sigma^2 \mathbf{I}_N) = \frac{1}{\sqrt{(2\pi)^N \det\left(\sigma^2\mathbf{I}_N\right)}} \exp\left\{ -\frac{1}{2}\left(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}\right)^T\left(\frac{1}{\sigma^2}\mathbf{I}_N\right)\left(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}\right)\right\}$$

and then by focusing on the **exponent** term only, which we can rewrite as:

$$
-\frac{1}{2}\left(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}\right)^T\left(\frac{1}{\sigma^2}\mathbf{I}_N\right)\left(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}\right) = -\frac{1}{2\sigma^2} \left( \mathbf{y}^T\mathbf{y} - \mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{y} - \mathbf{y}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} \right)
$$

However, if we want to get a Gaussian as a function of $\mathbf{w}$ whose mean is $\boldsymbol{\mu}_1$, then we need to do a "magic" trick by considering:

$$\mathbf{y} = \boldsymbol{\Phi}\boldsymbol{\mu}_1$$

Strictly speaking, this only needs to hold in the least-squares sense, i.e. $\boldsymbol{\Phi}^T\mathbf{y} = \boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{\mu}_1$. The terms that depend on $\mathbf{w}$ then match exactly, while replacing $\mathbf{y}^T\mathbf{y}$ by $\boldsymbol{\mu}_1^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{\mu}_1$ only changes a constant that does not depend on $\mathbf{w}$ (and that is absorbed by the normalization). With this, we can rewrite the previous exponent as:

$$\begin{align}
-\frac{1}{2\sigma^2} \left( \mathbf{y}^T\mathbf{y} - \mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{y} - \mathbf{y}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} \right) =&
-\frac{1}{2\sigma^2} \left( \boldsymbol{\mu}_1^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_1^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} \right) + \text{const}\\
=& -\frac{1}{2}\left(\mathbf{w}-\boldsymbol{\mu}_1\right)^T\left(\frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)\left(\mathbf{w}-\boldsymbol{\mu}_1\right) + \text{const}
\end{align}
$$

We thus reach the conclusion concerning the likelihood that we were already expecting:

$$\mathcal{N}(\mathbf{y} | \boldsymbol{\Phi}\mathbf{w}, \sigma^2 \mathbf{I}_N) \propto \mathcal{N}(\mathbf{w} | \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1 )$$

where the mean and covariance are given respectively by:

$\boldsymbol{\mu}_1 = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{y} = \hat{\mathbf{w}}_{\text{mle}}$ (the Least Squares solution)

$\boldsymbol{\Sigma}_1 = \left(\frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}$

As we said, the scaling factor does not matter because the marginal likelihood will ensure that after we multiply this likelihood by the prior it leads to a normalized pdf...

- In other words, we don't even need to explicitly calculate the scaling factor of the likelihood, neither do we need to calculate the marginal likelihood. But if this bothers you, you can do it! It's basically the same calculation that we did in Lecture 5 (and other lectures) 😉

Anyway, now that we finally rewrote the likelihood as a function of $\mathbf{w}$ we can finally calculate the product of MVN pdf's that defines the posterior:

$$
p(\mathbf{w}| \boldsymbol{\Phi}, \mathbf{y}, \sigma^2) \propto \mathcal{N}(\mathbf{w} | \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1 )   \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w)
$$

that is obtained from the rule of products of MVNs (introduced in Lecture 5), leading to:

$$
p(\mathbf{w}| \boldsymbol{\Phi}, \mathbf{y}, \sigma^2) \propto \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w)
$$

where

$\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w = \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w\left(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 + \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1}\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w \right) = \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \left( \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi} \boldsymbol{\mu}_1 + \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1}\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w\right) = \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \left( \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y} + \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1}\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w\right)$

$\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \left( \boldsymbol{\Sigma}_1^{-1}+\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1}\right)^{-1} = \left( \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} \right)^{-1}$
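The result of NOTE 1 can be checked numerically. The sketch below (with arbitrary, made-up data and prior) computes the posterior in two ways: through the product-of-Gaussians rule using $\boldsymbol{\mu}_1$ and $\boldsymbol{\Sigma}_1$, and directly from the final expressions in terms of $\boldsymbol{\Phi}$ and $\mathbf{y}$. Note that `y` is random here, so it does not lie in the column space of `Phi`.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 25, 3
Phi = rng.standard_normal((N, M))
y = rng.standard_normal(N)
sigma = 0.3
mu_prior = np.array([0.1, -0.2, 0.3])      # hypothetical prior mean
Sigma_prior = np.diag([1.0, 2.0, 0.5])     # hypothetical prior covariance
prior_prec = np.linalg.inv(Sigma_prior)

# Likelihood rewritten as an (unnormalized) Gaussian in w
mu_1 = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
Sigma_1 = sigma**2 * np.linalg.inv(Phi.T @ Phi)
prec_1 = np.linalg.inv(Sigma_1)

# Product-of-Gaussians rule
Sigma_post = np.linalg.inv(prec_1 + prior_prec)
mu_post = Sigma_post @ (prec_1 @ mu_1 + prior_prec @ mu_prior)

# Direct formulas in terms of Phi and y
Sigma_direct = np.linalg.inv(prior_prec + Phi.T @ Phi / sigma**2)
mu_direct = Sigma_direct @ (prior_prec @ mu_prior + Phi.T @ y / sigma**2)

print(np.max(np.abs(mu_post - mu_direct)))
print(np.max(np.abs(Sigma_post - Sigma_direct)))
```

Both differences should be at the level of floating-point round-off.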

#### NOTE 2: On the reduction of the posterior mean for Bayesian linear regression to the MAP estimate of Ridge regression 

If we use a prior with zero mean $\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w =\mathbf{0}$ and diagonal covariance $\overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w = \overset{\scriptscriptstyle <}{\sigma}_w^2 \mathbf{I}_M$ then the posterior mean becomes

$$\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w = \frac{1}{\sigma^2} \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\Phi}^T\mathbf{y}
$$

which is the same as the Ridge regression estimate when we define $\alpha = \frac{\sigma^2}{\overset{\scriptscriptstyle <}{\sigma}_w^2}$,

$\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w =\left( \alpha \mathbf{I}_M  + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$
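The reduction follows from the general posterior with $\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w = \mathbf{0}$: the posterior covariance becomes $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \sigma^2\left(\alpha \mathbf{I}_M + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}$, and the factor $\sigma^2$ cancels the $1/\sigma^2$ that multiplies $\boldsymbol{\Phi}^T\mathbf{y}$. A small numerical check, with made-up data and hyperparameter values, might look like this:

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 40, 4
Phi = rng.standard_normal((N, M))
y = rng.standard_normal(N)
sigma, sigma_w = 0.5, 2.0          # assumed hyperparameter values
alpha = sigma**2 / sigma_w**2

# Bayesian linear regression with prior N(0, sigma_w^2 I)
Sigma_post = np.linalg.inv(np.eye(M) / sigma_w**2 + Phi.T @ Phi / sigma**2)
mu_post = Sigma_post @ Phi.T @ y / sigma**2

# Ridge regression MAP estimate
w_ridge = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(M), Phi.T @ y)

print(np.max(np.abs(mu_post - w_ridge)))
```

The printed difference should be at round-off level. The Bayesian model gives the same point prediction as Ridge regression, and additionally the covariance `Sigma_post` that Ridge regression discards.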

Having determined the posterior, we can determine what we really want: the PPD.

$$
\begin{align}
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} &= \int p(y^*|\mathbf{x}^*,\mathbf{z}) p(\mathbf{z}|\mathcal{D}) d\mathbf{z} \\
p(y^*|\mathbf{x}^*, \mathcal{D}, \sigma^2) &= \int p(y^*|\mathbf{x}^*,\mathbf{w}, \sigma^2) p(\mathbf{w}|\mathcal{D}) d\mathbf{w} \\
&= \int \mathcal{N}(y^* | \boldsymbol{\phi}(\mathbf{x}^*)^T\mathbf{w}, \sigma^2)    \mathcal{N}(\mathbf{w}| \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w, \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w) d\mathbf{w}
\end{align}
$$

We calculate this integral like we did in previous lectures, i.e. again by making explicit the observation distribution as a function of $\mathbf{w}$ and then do the product of the two MVNs. This time, I really don't think we need to show these steps again... So, I will just skip the same procedure and give the final result:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w  \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x}^*)\right)
$$

#### Review the notation about integrating the PPD for multiple unknowns

The book uses the following notation:

$$\begin{align}
p(y^*|x^*, \mathcal{D}, \sigma^2) &= \int p(y^*|x^*,\mathbf{w}, \sigma^2) p(\mathbf{w}|\mathcal{D}) d\mathbf{w}
\end{align}
$$

We could be more explicit and use the following notation:

$$\begin{align}
p(y^*|x^*, \mathcal{D}, \sigma^2) &= \int p(y^*|x^*,\mathbf{w}, \sigma^2) p(\mathbf{w}|\mathcal{D}) d^{M}\mathbf{w} \\
&= \int\int\cdots \int p(y^*|x^*,\mathbf{w}, \sigma^2) p(\mathbf{w}|\mathcal{D}) dw_0 dw_1 \cdots dw_{M-1}
\end{align}
$$

where it is clear that we are integrating over all $M$ weights, so the result is a scalar density in $y^*$ (not a vector).

However, despite this notation being more precise, it is also less appealing. Just make sure you realize the type of integral that we are calculating when we are finding the PPD.
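One way to build intuition for this multidimensional integral is to approximate it by Monte Carlo: draw many weight vectors from the posterior, draw one $y^*$ for each of them from the observation distribution, and compare the sample moments with the closed-form Gaussian PPD. The posterior mean and covariance, the feature vector $\boldsymbol{\phi}(\mathbf{x}^*)$ and $\sigma$ below are hypothetical numbers chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical posterior over M = 2 weights and a test feature vector
mu_post = np.array([0.5, -1.0])
Sigma_post = np.array([[0.04, 0.01],
                       [0.01, 0.09]])
phi_star = np.array([1.0, 2.0])    # phi(x*) for basis [1, x] at x* = 2
sigma = 0.3

# Monte Carlo: sample w ~ posterior, then y* ~ N(phi*^T w, sigma^2)
S = 200_000
w_samples = rng.multivariate_normal(mu_post, Sigma_post, size=S)
y_samples = w_samples @ phi_star + sigma * rng.standard_normal(S)

# Closed-form moments of the Gaussian PPD
mean_exact = phi_star @ mu_post
var_exact = sigma**2 + phi_star @ Sigma_post @ phi_star

print("mean:", y_samples.mean(), "vs", mean_exact)
print("var: ", y_samples.var(), "vs", var_exact)
```

The sampling route integrates over both weights at once without ever writing the integral. It is unnecessary for this model, but the same idea carries over to models where the PPD has no closed form.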

We conclude that the PPD of the Bayesian linear regression model is:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w  \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x}^*)\right)
$$

where the mean and covariance of the posterior are given by:

$$
\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w = \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y}\right)
$$

$$
\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}
$$

And where we recall the meaning of each term:

* ($\mathbf{x}^*$, $y^*$) is the point where we want to make a prediction
* $\boldsymbol{\phi}(\mathbf{x}^*)$ is the $M\times 1$ vector of basis functions evaluated at $\mathbf{x}^*$
* $\overset{\scriptscriptstyle >}{\boldsymbol{\mu}}_w$ is the $M\times 1$ vector with the mean of the posterior for the weights
* and $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w$ is the $M \times M$ covariance matrix of the posterior for the weights.

$$
\begin{align}
p(y^*|\mathbf{x}^*, \mathcal{D}, \sigma^2) &= \mathcal{N}\left(y^* \mid \overset{\scriptscriptstyle >}{\boldsymbol{\mu}}^T_w \boldsymbol{\phi}(\mathbf{x}^*) \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x}^*)\right)
\end{align}
$$

The above PPD for Bayesian linear regression is quite interesting because:

1. The PPD does not depend on the weights $\mathbf{w}$. This is in contrast with the other models we saw for linear regression that depended on point estimates, instead of following the full Bayesian treatment.

2. The variance of the PPD at a point $\mathbf{x}^*$ after seeing $N$ data points depends on two terms: (1) the variance of the observation noise, $\sigma^2$ that we defined to be constant; and (2) the variance in the parameters obtained by the posterior $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w$.

This means that the predicted uncertainty increases when $\mathbf{x}^*$ is located far from the training data $\mathcal{D}$, just like we want it to be! We are less certain about points **away** from our observations (training data).
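The growth of the predictive uncertainty away from the data can be seen in a small experiment. The sketch below assumes a linear basis $\boldsymbol{\phi}(x) = [1, x]^T$, a zero-mean isotropic prior and made-up training data in $[0, 1]$. The helper name `features` and all numerical values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 20
x = rng.uniform(0.0, 1.0, size=N)                    # training inputs in [0, 1]
sigma, sigma_w = 0.1, 1.0                            # assumed hyperparameters
y = 1.0 + 2.0 * x + sigma * rng.standard_normal(N)   # hypothetical linear trend

def features(x):
    """Basis phi(x) = [1, x], returned with one row per input."""
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x), x])

Phi = features(x)
Sigma_post = np.linalg.inv(np.eye(2) / sigma_w**2 + Phi.T @ Phi / sigma**2)
mu_post = Sigma_post @ Phi.T @ y / sigma**2

x_star = np.array([0.5, 2.0, 5.0])                   # inside, outside, far outside
Phi_star = features(x_star)
ppd_mean = Phi_star @ mu_post
# sigma^2 + phi*^T Sigma_post phi*, evaluated row by row
ppd_var = sigma**2 + np.sum((Phi_star @ Sigma_post) * Phi_star, axis=1)

for xs, m, v in zip(x_star, ppd_mean, ppd_var):
    print(f"x* = {xs:3.1f}  mean = {m:6.3f}  std = {np.sqrt(v):.3f}")
```

With this basis the second variance term is a quadratic in $x^*$ whose minimum lies near the center of the training inputs, so the predicted standard deviation never drops below $\sigma$ and grows as $x^*$ moves away from the data.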

You will see that more advanced methods such as Gaussian processes are not too different from Bayesian linear regression...

Finally, for convenience, let's write the PPD of Bayesian linear regression including all the terms to get the complete (and rather long!) expression:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} \overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y}\right) \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x}^*)\right)
$$

with the previously determined covariance: $\overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w = \left( \overset{\scriptscriptstyle <}{\boldsymbol{\Sigma}}_w^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}$

This long expression of the PPD is (very!) **often simplified** by considering a prior with zero mean, i.e. $\overset{\scriptscriptstyle <}{\boldsymbol{\mu}}_w = \mathbf{0}$:

$$
{\color{orange}p(y^*|\mathbf{x}^*, \mathcal{D})} = \mathcal{N}\left(y^* \mid \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{y} \,,\, \sigma^2 + \boldsymbol{\phi}(\mathbf{x}^*)^T \overset{\scriptscriptstyle >}{\boldsymbol{\Sigma}}_w \boldsymbol{\phi}(\mathbf{x}^*)\right)
$$

## Other linear regression models

As we saw in the beginning of the Lecture, there are many more linear regression models we can define! The book covers quite a few!

| Likelihood | Prior (on the weights)    | Posterior      | Name of the model | Book section  |
|---        |---         |---             |---              |---            |
| Gaussian  | Uniform    | Point estimate | Least Squares regression  | 11.2.2  |
| Gaussian  | Gaussian    | Point estimate | Ridge regression   | 11.3  |
| Gaussian  | Laplace    | Point estimate | Lasso regression  | 11.4  |
| Gaussian  | Gaussian$\times$Laplace    | Point estimate | Elastic net  | 11.4.8  |
| Student-$t$  | Uniform    | Point estimate | Robust regression   | 11.6.1  |
| Laplace  | Uniform    | Point estimate | Robust regression   | 11.6.2  |
| Gaussian  | Gaussian    | Gaussian | Bayesian linear regression   | 11.7 |

For example, the "Elastic net" is literally the combination of Ridge regression and Lasso by defining a prior that depends on both a Gaussian and a Laplace distribution. This model was proposed in 2005. Five years later the "Bayesian Elastic Net" was also proposed, where the posterior is calculated in a Bayesian way (just like what we did for Bayesian linear regression).
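Written as a MAP estimate in the same form used above for Ridge and Lasso, the Elastic net objective simply carries both penalties. The symbols $\lambda_1$ and $\lambda_2$ for the two penalty strengths are introduced here only for illustration:

$$
\hat{\mathbf{w}}_{\text{map}} = \underset{w}{\mathrm{argmin}}\left[\left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right)^T \left(\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\right) + \lambda_2||\mathbf{w}||_2^2 + \lambda_1||\mathbf{w}||_1\right]
$$

Setting $\lambda_1 = 0$ recovers Ridge regression (with $\lambda_2$ playing the role of $\alpha$), and setting $\lambda_2 = 0$ recovers Lasso.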

## Summary

Although we only considered Linear Regression Models until now, we are starting to realize something very important: 

**ML models can be derived from Bayes' rule by following the same 4 steps**!

1. Define the Observation distribution and compute the likelihood.

2. Define the prior and its parameters.

3. Compute the posterior to estimate the unknown parameters (whether via a Bayesian approach or via a Point estimate).

4. Compute the PPD.

Done.

If you only care about the mean of the PPD (the mean of your prediction), then you can compute it directly without even thinking about uncertainty... But then, make sure you characterize the quality of your predictions using an appropriate **error metric** and use strategies like **cross-validation**, as you will explore in Homework 4!
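As an illustration of that last point, here is a minimal sketch of K-fold cross-validation used to compare a few values of the Ridge hyperparameter $\alpha$ with the mean squared error as the error metric. The helper names (`ridge_fit`, `kfold_mse`), the synthetic data, the degree-9 polynomial basis and the grid of `alphas` are all assumptions made for this example; it is not the Homework solution.

```python
import numpy as np

def ridge_fit(Phi, y, alpha):
    """Ridge MAP estimate: solve (Phi^T Phi + alpha I) w = Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(M), Phi.T @ y)

def kfold_mse(Phi, y, alpha, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(Phi[train], y[train], alpha)
        errors.append(np.mean((Phi[test] @ w - y[test]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(8)
N = 40
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)   # hypothetical data
Phi = np.vander(x, 10, increasing=True)                # degree-9 polynomial basis

alphas = [1e-6, 1e-4, 1e-2, 1e-1, 1.0, 10.0]
scores = {a: kfold_mse(Phi, y, a) for a in alphas}
best_alpha = min(scores, key=scores.get)
print(scores)
print("best alpha:", best_alpha)
```

The same fold split (fixed `seed`) is reused for every `alpha`, so the scores are directly comparable. In practice the selected model would then be refit on all the training data and evaluated on a separate test set.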

### See you next class

Have fun!