<img src=../figures/Brown_logo.svg width=50%>

## Data-Driven Design & Analyses of Structures & Materials (3dasm)

## Lecture 25

### Miguel A. Bessa | <a href = "mailto: miguel_bessa@brown.edu">miguel_bessa@brown.edu</a>  | Associate Professor

**What:** A lecture of the "3dasm" course

**Where:** This notebook comes from this [repository](https://github.com/bessagroup/3dasm_course)

**Reference for entire course:** Murphy, Kevin P. *Probabilistic machine learning: an introduction*. MIT press, 2022. Available online [here](https://probml.github.io/pml-book/book1.html)

**How:** We try to follow Murphy's book closely, but the sequence of Chapters and Sections is different. The intention is to use notebooks as an introduction to the topic and Murphy's book as a resource.
* If working offline: Go through this notebook and read the book.
* If attending class in person: listen to me (!) but also go through the notebook on your laptop at the same time. Read the book.
* If attending lectures remotely: listen to me (!) via Zoom and (ideally) use two screens where you have the notebook open in 1 screen and you see the lectures on the other. Read the book.

**Optional reference (the "bible" by the "bishop"... pun intended 😆) :** Bishop, Christopher M. *Pattern recognition and machine learning*. Springer Verlag, 2006.

**References/resources to create this notebook:**
* This simple tutorial is still based on a script I created for this article: https://imechanica.org/node/23957
* It follows from some examples provided by the scikit-learn user guide, which seem to have originated from Mathieu Blondel, Jake Vanderplas, Vincent Dubourg, and Jan Hendrik Metzen.

Apologies in advance if I missed some reference used in this notebook. Please contact me if that is the case, and I will gladly include it here.

## **OPTION 1**. Run this notebook **locally on your computer**:
1. Confirm that you have the '3dasm' mamba (or conda) environment (see Lecture 1).
2. Go to the 3dasm_course folder on your computer and pull the latest updates of the [repository](https://github.com/bessagroup/3dasm_course):
```
git pull
```
    - Note: if you can't pull the repo due to conflicts (and you can't handle these conflicts), use this command (with **caution**!) and your repo becomes the same as the one online:
        ```
        git reset --hard origin/main
        ```
3. Open a command window and launch Jupyter Notebook (it will open in your internet browser):
```
jupyter notebook
```
4. Open the notebook of this Lecture and choose the '3dasm' kernel.

## **OPTION 2**. Use **Google's Colab** (no installation required, but times out if idle):

1. Go to https://colab.research.google.com
2. Log in
3. File > Open notebook
4. Click on GitHub (no need to log in or authorize anything)
5. Paste the git link: https://github.com/bessagroup/3dasm_course
6. Click search and then click on the notebook for this Lecture.

```python
# Basic plotting tools needed in Python.

import matplotlib.pyplot as plt # import plotting tools to create figures
import numpy as np # import numpy to handle a lot of things!
from IPython.display import display, Math # to print with Latex math

%config InlineBackend.figure_format = "retina" # render higher resolution images in the notebook
plt.rcParams["figure.figsize"] = (8,4) # rescale figure size appropriately for slides
```

## Outline for today

* Training Artificial Neural Networks

**Reading material**: This notebook + (ANNs in Chapter 13)

### Recap of last lecture

* We learned that the common point estimates of ANNs (MLE or MAP) are obtained by minimizing a loss function.

* For example, if we assume a Gaussian observation distribution we obtain the $\ell_2$ loss function (with or without regularization, depending on the prior distribution).

* For the uniform prior we arrived at the following result:

$$
\hat{\mathbf{z}}_{\text{mle}} = \underset{\mathbf{z}}{\mathrm{argmin}}\left\{\mathcal{L}\left[(\mathbf{x},\mathbf{y}), \mathbf{z} \right] \right\} = \underset{\mathbf{z}}{\mathrm{argmin}}\left[\frac{1}{2}\sum_{n=1}^{N}\left[\mathbf{y}_n-\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})\right]^T\left[\mathbf{y}_n-\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})\right] \right]
$$

Therefore, **we need to minimize the loss function**: $\mathcal{L}\left[(\mathbf{x},\mathbf{y}), \mathbf{z} \right] = \frac{1}{2} ||\mathbf{Y}-\mathbf{f}(\mathbf{X},\mathbf{z}) ||_2^2$
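As a small illustration, the loss above can be evaluated in NumPy. This is only a minimal sketch: the helper name `l2_loss` and the toy arrays are made up for illustration and are not part of the course code.

```python
import numpy as np

def l2_loss(Y, Y_pred):
    """Half the sum of squared residuals, i.e. 0.5 * ||Y - Y_pred||_2^2."""
    residual = Y - Y_pred
    return 0.5 * np.sum(residual ** 2)

# Toy targets and toy predictions (hypothetical values, N = 3 points, 1 output each)
Y = np.array([[1.0], [2.0], [3.0]])
Y_pred = np.array([[1.5], [2.0], [2.0]])

loss_value = l2_loss(Y, Y_pred)
print(loss_value)  # residuals are -0.5, 0 and 1, so the loss is 0.5 * 1.25 = 0.625
```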

## Minimizing the loss function with gradient-based optimizers

There are many optimization algorithms that can be used to minimize the loss function.

* The most common choices involve the use of first-order optimizers, i.e. optimizers that use the gradient of the loss function.
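To make "using the gradient" concrete, here is a hedged sketch of plain gradient descent, the simplest first-order optimizer. The quadratic toy loss, the target `z_star`, the learning rate and the number of steps are all assumptions chosen for illustration; for an ANN the gradient would come from the network instead.

```python
import numpy as np

# Hypothetical toy problem: L(z) = 0.5 * ||z - z_star||^2, whose minimizer is z_star
z_star = np.array([1.0, -2.0])

def loss(z):
    return 0.5 * np.sum((z - z_star) ** 2)

def grad(z):
    # Analytical gradient of the toy loss
    return z - z_star

z = np.zeros(2)          # initial guess
learning_rate = 0.1      # step size (assumed value)
for _ in range(200):
    z = z - learning_rate * grad(z)  # move against the gradient

print(z, loss(z))
```

Each step shrinks the distance to the minimizer by a factor of 0.9 here, so 200 steps are more than enough for this toy problem.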

## How to calculate the gradient of the loss function?

The mean function of an ANN with $L$ hidden layers is then a composition of $L+1$ functions:

$\mathbf{f} = \mathbf{f}_{L+1} \circ \mathbf{f}_L \circ \ldots \circ \mathbf{f}_{2} \circ \mathbf{f}_1 $

And if we denote $f_{L+2} = \mathcal{L}$ as the loss function, then we can write the loss function as follows:

$\mathcal{L} = f_{L+2} \circ \mathbf{f}_{L+1} \circ \mathbf{f}_L \circ \ldots \circ \mathbf{f}_{l} \circ \ldots \circ \mathbf{f}_{2} \circ \mathbf{f}_1 $

Where $l=1,\ldots, L$ are the hidden layers.

The next slide makes it less abstract, by considering the example of an ANN with **1 hidden layer** and the $\ell_2$ loss (i.e. regression case).

### Neurons of ANN with **1 hidden layer** and using $\ell_2$ loss (i.e. regression case).

* Layer 0 (input layer): $\mathbf{n}_0 \equiv \mathbf{x}$

* Layer $L=1$ (first hidden layer):

$$\begin{align}
\mathbf{n}_L &= \mathbf{n}_1 \\
             &= \mathbf{f}_L(\mathbf{n}_{L-1}, \mathbf{z}_{L-1}) \\
             &= \mathbf{f}_1(\mathbf{n}_0, \mathbf{z}_0) \\
             &= \mathbf{f}_1(\mathbf{b}_0 + \mathbf{W}_0\mathbf{n}_0)
\end{align}
$$

* Layer $L+1=2$ (output layer):

$$\begin{align}
\mathbf{n}_{L+1} &= \mathbf{n}_2 \equiv \mathbf{y} \\
           &= \mathbf{f}_{L+1}(\mathbf{n}_{L}, \mathbf{z}_{L}) = \mathbf{f}_2(\mathbf{n}_1, \mathbf{z}_1) \\ 
           &= \mathbf{f}_2(\mathbf{b}_1 + \mathbf{W}_1\mathbf{n}_1) \\
           &= \mathbf{b}_1 + \mathbf{W}_1\mathbf{n}_1
\end{align}
$$

(The last line results from using a linear activation function for the **output layer**, i.e. layer $L+1$, when we are solving a regression problem)

* Layer $L+2$ can be considered the loss function:

$$\begin{align}
\mathcal{L} &= f_{L+2}(\mathbf{n}_{L+1}, \mathbf{y})\\
            &= f_3(\mathbf{n}_2, \mathbf{y}) \\
            &= \frac{1}{2}|| \mathbf{y} - \mathbf{n}_2  ||_2^2
\end{align}
$$

where, again, we are considering here the $\ell_2$ loss because we are focusing on a regression case. If it were a classification problem we would consider the cross-entropy loss (see Lecture 21).
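The layer-by-layer definitions above can be sketched in NumPy for a single input point. This is an illustrative sketch, not the course implementation: the choice of `tanh` as the hidden activation $\mathbf{f}_1$, the function names and the toy weights are all assumptions.

```python
import numpy as np

def forward(x, W0, b0, W1, b1):
    """Forward pass through an ANN with 1 hidden layer."""
    n0 = x                       # layer 0: input
    n1 = np.tanh(b0 + W0 @ n0)   # layer 1: hidden layer (tanh is an assumed activation)
    n2 = b1 + W1 @ n1            # layer 2: output layer with linear activation
    return n0, n1, n2

def l2_loss(y, n2):
    """'Layer 3': the l2 loss between the target y and the prediction n2."""
    return 0.5 * np.sum((y - n2) ** 2)

# Hypothetical toy weights: 2 inputs, 2 hidden neurons, 1 output
W0 = np.array([[1.0, -1.0], [0.5, 0.5]])
b0 = np.zeros(2)
W1 = np.array([[1.0, 2.0]])
b1 = np.array([0.1])

x = np.array([1.0, 1.0])
y = np.array([1.0])

n0, n1, n2 = forward(x, W0, b0, W1, b1)
loss_value = l2_loss(y, n2)
print(n1, n2, loss_value)
```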

Now we can calculate the derivatives of the loss function with respect to (wrt) the unknowns $\mathbf{z}$ very easily!

In order to calculate the derivatives of the loss function:

$\mathcal{L} = f_{L+2} \circ \mathbf{f}_{L+1} \circ \mathbf{f}_L \circ \ldots \circ \mathbf{f}_{2} \circ \mathbf{f}_1 $

we just need to use the chain rule!

Let's start by taking the derivative of the loss wrt the unknowns of the last layer, i.e. $\mathbf{z}_L$:

$$\begin{align}
\nabla_{\mathbf{z}_{L}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_L} &= \frac{\partial f_{L+2}}{\partial \mathbf{z}_{L}} \\
&= \frac{\partial f_{L+2}}{\partial \mathbf{f}_{L+1}} \frac{\partial \mathbf{f}_{L+1}}{\partial \mathbf{z}_{L}}
\end{align}
$$

Which is usually presented as: $\frac{\partial \mathcal{L}}{\partial \mathbf{z}_L} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}_{L+1}} \frac{\partial \mathbf{n}_{L+1}}{\partial \mathbf{z}_{L}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{z}_{L}}$

but we should not stop here! Remember, we want all the derivatives of the loss wrt all the unknowns $\mathbf{z}$.
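Before moving on, it may help to write the last-layer result explicitly for the 1-hidden-layer example with the $\ell_2$ loss. The following is a worked sketch for a single training point, assuming column vectors and $\mathbf{z}_1 = \{\mathbf{W}_1, \mathbf{b}_1\}$, with $\mathbf{n}_2 = \mathbf{b}_1 + \mathbf{W}_1\mathbf{n}_1$ and $\mathcal{L} = \frac{1}{2}|| \mathbf{y} - \mathbf{n}_2 ||_2^2$ as defined earlier:

$$\begin{align}
\frac{\partial \mathcal{L}}{\partial \mathbf{n}_2} &= -(\mathbf{y} - \mathbf{n}_2) = \mathbf{n}_2 - \mathbf{y} \\
\frac{\partial \mathcal{L}}{\partial \mathbf{b}_1} &= \mathbf{n}_2 - \mathbf{y} \\
\frac{\partial \mathcal{L}}{\partial \mathbf{W}_1} &= (\mathbf{n}_2 - \mathbf{y})\,\mathbf{n}_1^T
\end{align}
$$

The last two lines follow because $\mathbf{n}_2$ is linear in $\mathbf{b}_1$ and $\mathbf{W}_1$.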

So, taking the derivative of the loss wrt the unknowns of the next hidden layer, i.e. wrt $\mathbf{z}_{L-1}$:

$$\begin{align}
\nabla_{\mathbf{z}_{L-1}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L-1}} &= \frac{\partial f_{L+2}}{\partial \mathbf{z}_{L-1}} \\
&= \frac{\partial f_{L+2}}{\partial \mathbf{f}_{L+1}} \frac{\partial \mathbf{f}_{L+1}}{\partial \mathbf{f}_{L}} \frac{\partial \mathbf{f}_{L}}{\partial \mathbf{z}_{L-1}}
\end{align}
$$

Which is usually presented as:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L-1}} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}_{L+1}} \frac{\partial \mathbf{n}_{L+1}}{\partial \mathbf{n}_{L}} \frac{\partial \mathbf{n}_{L}}{\partial \mathbf{z}_{L-1}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{n}_{L}} \frac{\partial \mathbf{n}_{L}}{\partial \mathbf{z}_{L-1}}
$$

but we should not stop here!
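For the 1-hidden-layer example, this second step can also be written out explicitly. This is a worked sketch under stated assumptions: a single training point, $\mathbf{z}_0 = \{\mathbf{W}_0, \mathbf{b}_0\}$, an elementwise activation $\mathbf{f}_1$ with derivative $\mathbf{f}_1'$, and two helper symbols that are not used elsewhere in these notes, the pre-activation $\mathbf{a}_1 = \mathbf{b}_0 + \mathbf{W}_0\mathbf{n}_0$ and $\boldsymbol{\delta}_1$:

$$\begin{align}
\frac{\partial \mathcal{L}}{\partial \mathbf{n}_1} &= \mathbf{W}_1^T(\mathbf{n}_2 - \mathbf{y}) \\
\boldsymbol{\delta}_1 \equiv \frac{\partial \mathcal{L}}{\partial \mathbf{a}_1} &= \mathbf{f}_1'(\mathbf{a}_1) \odot \left[\mathbf{W}_1^T(\mathbf{n}_2 - \mathbf{y})\right] \\
\frac{\partial \mathcal{L}}{\partial \mathbf{b}_0} &= \boldsymbol{\delta}_1 \\
\frac{\partial \mathcal{L}}{\partial \mathbf{W}_0} &= \boldsymbol{\delta}_1\,\mathbf{n}_0^T
\end{align}
$$

where $\odot$ denotes the elementwise product.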

So, taking the derivative of the loss wrt $\mathbf{z}_{l}$:

$$\begin{align}
\nabla_{\mathbf{z}_{l}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{l}} &= \frac{\partial f_{L+2}}{\partial \mathbf{z}_{l}} \\
&= \frac{\partial f_{L+2}}{\partial \mathbf{f}_{L+1}} \frac{\partial \mathbf{f}_{L+1}}{\partial \mathbf{f}_{L}} \cdots \frac{\partial \mathbf{f}_{l+2}}{\partial \mathbf{f}_{l+1}}\frac{\partial \mathbf{f}_{l+1}}{\partial \mathbf{z}_{l}}
\end{align}
$$

Which is usually presented as:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_{l}} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}_{L+1}} \frac{\partial \mathbf{n}_{L+1}}{\partial \mathbf{n}_{L}} \cdots \frac{\partial \mathbf{n}_{l+2}}{\partial \mathbf{n}_{l+1}}\frac{\partial \mathbf{n}_{l+1}}{\partial \mathbf{z}_{l}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{n}_{L}} \cdots \frac{\partial \mathbf{n}_{l+2}}{\partial \mathbf{n}_{l+1}}\frac{\partial \mathbf{n}_{l+1}}{\partial \mathbf{z}_{l}}
$$

but we should not stop here! We need to keep going until we reach the first layer of the network!

So, taking the derivative of the loss wrt $\mathbf{z}_{0}$:

$$\begin{align}
\nabla_{\mathbf{z}_{0}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{0}} &= \frac{\partial f_{L+2}}{\partial \mathbf{z}_{0}} \\
&= \frac{\partial f_{L+2}}{\partial \mathbf{f}_{L+1}} \frac{\partial \mathbf{f}_{L+1}}{\partial \mathbf{f}_{L}} \cdots \frac{\partial \mathbf{f}_{l+2}}{\partial \mathbf{f}_{l+1}}\frac{\partial \mathbf{f}_{l+1}}{\partial \mathbf{f}_{l}} \cdots \frac{\partial \mathbf{f}_{2}}{\partial \mathbf{f}_{1}}\frac{\partial \mathbf{f}_{1}}{\partial \mathbf{z}_{0}}
\end{align}
$$

Which is usually presented as:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_{0}} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}_{L+1}} \frac{\partial \mathbf{n}_{L+1}}{\partial \mathbf{n}_{L}} \cdots \frac{\partial \mathbf{n}_{l+2}}{\partial \mathbf{n}_{l+1}}\frac{\partial \mathbf{n}_{l+1}}{\partial \mathbf{n}_{l}} \cdots \frac{\partial \mathbf{n}_{2}}{\partial \mathbf{n}_{1}}\frac{\partial \mathbf{n}_{1}}{\partial \mathbf{z}_{0}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{n}_{L}} \cdots \frac{\partial \mathbf{n}_{l+2}}{\partial \mathbf{n}_{l+1}}\frac{\partial \mathbf{n}_{l+1}}{\partial \mathbf{n}_{l}} \cdots \frac{\partial \mathbf{n}_{2}}{\partial \mathbf{n}_{1}}\frac{\partial \mathbf{n}_{1}}{\partial \mathbf{z}_{0}}
$$

* And with this we have finally computed all the derivatives!

This is what we call **Backpropagation**.
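The chain-rule steps above can be sketched in NumPy for the 1-hidden-layer regression example and checked against finite differences. This is a minimal sketch under stated assumptions: a single training point, a `tanh` hidden activation (an assumed choice), and hypothetical function names and random toy values that are not part of the course code.

```python
import numpy as np

def forward(x, W0, b0, W1, b1):
    n1 = np.tanh(b0 + W0 @ x)   # hidden layer (assumed tanh activation)
    n2 = b1 + W1 @ n1           # linear output layer
    return n1, n2

def loss(x, y, W0, b0, W1, b1):
    _, n2 = forward(x, W0, b0, W1, b1)
    return 0.5 * np.sum((y - n2) ** 2)

def backprop(x, y, W0, b0, W1, b1):
    """Gradients of the l2 loss wrt all weights and biases via the chain rule."""
    n1, n2 = forward(x, W0, b0, W1, b1)
    dL_dn2 = n2 - y                      # dL/dn2
    grad_W1 = np.outer(dL_dn2, n1)       # dL/dW1
    grad_b1 = dL_dn2                     # dL/db1
    dL_dn1 = W1.T @ dL_dn2               # propagate back through the output layer
    delta1 = (1.0 - n1 ** 2) * dL_dn1    # tanh'(a) = 1 - tanh(a)^2
    grad_W0 = np.outer(delta1, x)        # dL/dW0
    grad_b0 = delta1                     # dL/db0
    return grad_W0, grad_b0, grad_W1, grad_b1

# Random toy problem: 3 inputs, 4 hidden neurons, 2 outputs
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W0, b0 = rng.normal(size=(4, 3)), rng.normal(size=4)
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=2)

grad_W0, grad_b0, grad_W1, grad_b1 = backprop(x, y, W0, b0, W1, b1)

# Central finite-difference check on one entry of W0 and one entry of W1
eps = 1e-6
W0_plus, W0_minus = W0.copy(), W0.copy()
W0_plus[1, 2] += eps
W0_minus[1, 2] -= eps
fd_W0 = (loss(x, y, W0_plus, b0, W1, b1) - loss(x, y, W0_minus, b0, W1, b1)) / (2 * eps)

W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 3] += eps
W1_minus[0, 3] -= eps
fd_W1 = (loss(x, y, W0, b0, W1_plus, b1) - loss(x, y, W0, b0, W1_minus, b1)) / (2 * eps)

print(fd_W0, grad_W0[1, 2])
print(fd_W1, grad_W1[0, 3])
```

Note that the backward pass reuses `dL_dn2` when computing the first-layer gradients, which is what makes backpropagation cheaper than differentiating each parameter independently.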

The partial derivatives of the neuron output with respect to the previous neuron outputs are organized in a matrix called the **Jacobian**:

$$
\mathbf{J}_{\mathbf{f}_{l+1}}(\mathbf{n}_l) = \frac{\partial \mathbf{f}_{l+1}(\mathbf{n}_{l})}{\partial \mathbf{n}_l} = \begin{bmatrix}\frac{\partial f_{1, l+1}}{\partial n_{1,l}} & \cdots & \frac{\partial f_{1, l+1}}{\partial n_{d_l,l}}\\
\vdots & \ddots & \vdots\\
\frac{\partial f_{d_{l+1}, l+1}}{\partial n_{1,l}} & \cdots & \frac{\partial f_{d_{l+1}, l+1}}{\partial n_{d_l,l}}\end{bmatrix} =
\begin{bmatrix}\nabla f_{1,l+1}(\mathbf{n}_l)^T\\
\vdots\\
\nabla f_{d_{l+1}, l+1}(\mathbf{n}_l)^T
\end{bmatrix}=
\begin{bmatrix}
\frac{\partial \mathbf{f}_{l+1}}{\partial n_{1,l}}, & \cdots & , \frac{\partial \mathbf{f}_{l+1}}{\partial n_{d_l,l}}
\end{bmatrix}
$$

where $d_{l+1}$ and $d_l$ are the number of neurons in layer $l+1$ and $l$, respectively.
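As a hedged illustration of this definition, the sketch below builds the Jacobian of a single layer analytically and compares it column by column with finite differences. The `tanh` activation, the function names and the random toy values are assumptions made for this example.

```python
import numpy as np

def layer(n_prev, W, b):
    """One layer: n_next = tanh(b + W n_prev) (tanh is an assumed activation)."""
    return np.tanh(b + W @ n_prev)

def layer_jacobian(n_prev, W, b):
    """Analytical Jacobian d n_next / d n_prev, equal to diag(1 - n_next^2) @ W."""
    n_next = layer(n_prev, W, b)
    return (1.0 - n_next ** 2)[:, None] * W

rng = np.random.default_rng(1)
d_l, d_next = 3, 2                      # neurons in layer l and in layer l+1
W = rng.normal(size=(d_next, d_l))
b = rng.normal(size=d_next)
n_l = rng.normal(size=d_l)

J = layer_jacobian(n_l, W, b)           # shape (d_next, d_l)

# Column j of the Jacobian is the derivative of all outputs wrt input n_{j,l}
eps = 1e-6
J_fd = np.zeros_like(J)
for j in range(d_l):
    step = np.zeros(d_l)
    step[j] = eps
    J_fd[:, j] = (layer(n_l + step, W, b) - layer(n_l - step, W, b)) / (2 * eps)

print(J.shape)
print(np.max(np.abs(J - J_fd)))
```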

So, doing backpropagation (chain rule) using Jacobians becomes:

$$
\mathbf{J}_{\mathbf{f}}(\mathbf{n}_0) = \mathbf{J}_{\mathbf{y}}(\mathbf{x}) = \mathbf{J}_{\mathbf{f}_{L+1}}(\mathbf{n}_{L}) \mathbf{J}_{\mathbf{f}_{L}}(\mathbf{n}_{L-1}) \cdots \mathbf{J}_{\mathbf{f}_{l+1}}(\mathbf{n}_{l}) \cdots \mathbf{J}_{\mathbf{f}_{1}}(\mathbf{n}_{0})
$$
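The product of Jacobians can be checked numerically for the 1-hidden-layer case. The sketch below is illustrative only: it assumes a `tanh` hidden layer and a linear output layer, and the names and random toy values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
W0, b0 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer parameters
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=2)   # output layer parameters
x = rng.normal(size=3)

def network(x):
    n1 = np.tanh(b0 + W0 @ x)   # assumed tanh hidden activation
    return b1 + W1 @ n1         # linear output layer

# Jacobian of each layer wrt its input
n1 = np.tanh(b0 + W0 @ x)
J_f1 = (1.0 - n1 ** 2)[:, None] * W0   # shape (4, 3): diag(1 - n1^2) @ W0
J_f2 = W1                              # shape (2, 4): the output layer is linear

# Chain rule: Jacobian of the output wrt the input is the product of layer Jacobians
J_chain = J_f2 @ J_f1                  # shape (2, 3)

# Central finite differences, one input direction at a time
eps = 1e-6
J_fd = np.column_stack(
    [(network(x + eps * e) - network(x - eps * e)) / (2 * eps) for e in np.eye(3)]
)

print(np.max(np.abs(J_chain - J_fd)))
```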

## Backpropagation

* Backpropagation is simply the computation of the gradients (derivatives) of the loss wrt the unknowns $\mathbf{z}$.

    * Note: remember, "loss function" just means NLL (when using a uniform prior) or negative log joint likelihood (when using any other prior).

* These gradients are then provided to **a gradient-based optimizer** (e.g. in Lecture 23 we used a variant of gradient descent, called ADAM), that will then provide a point estimate (the weights and biases that minimize the error; or, alternatively, that maximize the joint likelihood).
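To show how gradients feed into an optimizer, here is a hedged sketch of the standard Adam update rule applied to a toy quadratic loss. The function name `adam_minimize`, the toy loss and all hyperparameter values are assumptions for illustration; for an ANN, `grad` would instead return the gradients computed by backpropagation.

```python
import numpy as np

def adam_minimize(grad, z0, learning_rate=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=500):
    """Minimal Adam loop: grad(z) must return the gradient of the loss at z."""
    z = np.array(z0, dtype=float)
    m = np.zeros_like(z)   # running average of the gradients
    v = np.zeros_like(z)   # running average of the squared gradients
    for t in range(1, n_steps + 1):
        g = grad(z)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)   # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        z = z - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return z

# Hypothetical toy loss: L(z) = 0.5 * ||z - z_star||^2
z_star = np.array([1.0, -2.0])
toy_loss = lambda z: 0.5 * np.sum((z - z_star) ** 2)
toy_grad = lambda z: z - z_star

z_init = np.zeros(2)
z_final = adam_minimize(toy_grad, z_init)
initial_loss, final_loss = toy_loss(z_init), toy_loss(z_final)
print(initial_loss, final_loss)
```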

In the last part of the course, we will introduce the basics of **optimization**!

* Optimization is the key to deterministic machine learning because it is what enables us to find the point estimate for the unknowns (here: weights and biases) by minimizing the NLL (or negative log joint likelihood).

**Note**: optimization is (obviously!) a separate discipline on its own! We will only have 3 classes about this topic! The main goal is to understand why we typically choose a particular algorithm for finding point estimates of particular models:

- For example, L-BFGS is a common choice to find the point estimate of the hyperparameters of GPs, but we use simpler optimizers such as ADAM and Stochastic Gradient Descent to find the point estimate of the parameters of an ANN.

### See you next class

Have fun!