<img src=../figures/Brown_logo.svg width=50%>

## Data-Driven Design & Analyses of Structures & Materials (3dasm)

## Lecture 24

### Miguel A. Bessa | <a href = "mailto: miguel_bessa@brown.edu">miguel_bessa@brown.edu</a>  | Associate Professor

**What:** A lecture of the "3dasm" course

**Where:** This notebook comes from this [repository](https://github.com/bessagroup/3dasm_course)

**Reference for entire course:** Murphy, Kevin P. *Probabilistic machine learning: an introduction*. MIT press, 2022. Available online [here](https://probml.github.io/pml-book/book1.html)

**How:** We try to follow Murphy's book closely, but the sequence of Chapters and Sections is different. The intention is to use notebooks as an introduction to the topic and Murphy's book as a resource.
* If working offline: Go through this notebook and read the book.
* If attending class in person: listen to me (!) but also go through the notebook on your laptop at the same time. Read the book.
* If attending lectures remotely: listen to me (!) via Zoom and (ideally) use two screens where you have the notebook open in 1 screen and you see the lectures on the other. Read the book.

**Optional reference (the "bible" by the "bishop"... pun intended 😆) :** Bishop, Christopher M. *Pattern recognition and machine learning*. Springer Verlag, 2006.

**References/resources to create this notebook:**
* This simple tutorial is still based on a script I created for this article: https://imechanica.org/node/23957
* It follows from some examples provided by the scikit-learn user guide, which seem to have originated from Mathieu Blondel, Jake Vanderplas, Vincent Dubourg, and Jan Hendrik Metzen.

Apologies in advance if I missed some reference used in this notebook. Please contact me if that is the case, and I will gladly include it here.

## **OPTION 1**. Run this notebook **locally on your computer**:
1. Confirm that you have the '3dasm' mamba (or conda) environment (see Lecture 1).
2. Go to the 3dasm_course folder on your computer and pull the latest updates of the [repository](https://github.com/bessagroup/3dasm_course):
```
git pull
```
    - Note: if you can't pull the repo due to conflicts (and you can't handle these conflicts), use this command (with **caution**!) and your repo becomes the same as the one online:
        ```
        git reset --hard origin/main
        ```
3. Open a command window and launch Jupyter Notebook (it will open in your web browser):
```
jupyter notebook
```
4. Open the notebook of this Lecture and choose the '3dasm' kernel.

## **OPTION 2**. Use **Google's Colab** (no installation required, but times out if idle):

1. Go to https://colab.research.google.com
2. Log in
3. File > Open notebook
4. Click on GitHub (no need to log in or authorize anything)
5. Paste the git link: https://github.com/bessagroup/3dasm_course
6. Click search and then click on the notebook for this Lecture.

In [11]:
# Basic plotting tools needed in Python.

import matplotlib.pyplot as plt # import plotting tools to create figures
import numpy as np # import numpy to handle a lot of things!
from IPython.display import display, Math # to print with Latex math

%config InlineBackend.figure_format = "retina" # render higher resolution images in the notebook
plt.rcParams["figure.figsize"] = (8,4) # rescale figure size appropriately for slides

## Outline for today

* Training Artificial Neural Networks

**Reading material**: This notebook + (ANNs in Chapter 13)

Today focuses on explaining how we train artificial neural networks.

However, before we start with that, let's see what happens when we consider a feedforward neural network with **linear activation functions**!

* Draw on the board a simple feedforward ANN with 2 hidden layers for multidimensional inputs (2) and outputs (2), considering the first hidden layer with 3 neurons and the second hidden layer with 2 neurons.

We can write the neurons of a neural network with $L$ hidden layers in the following way:

$x_i \equiv n_{i,0} \qquad n_{k,l} = f_{k,l}(b_{k,l-1}+W_{jk,l-1}n_{j,l-1}) \qquad n_{q,L+1}\equiv y_q$

* Layer $0$ is the **input layer**:
    * $n_{i,0}$ are the input neurons, which are equal to the inputs (features) $x_i$, where $i = 1, \ldots, d_0$ with $d_0 \equiv d_\text{in}$ being the input dimension (dimension of $\mathbf{x}$);
* **Hidden layer** $l$:
    * $n_{k,l}$ are the hidden neurons, where $k = 1, \ldots, d_l$ with $d_l$ being the number of neurons in layer $l$, and where $f_{k,l}$ is the **activation function** for neuron $k$ of layer $l$ (**Note:** Usually the same activation function is used for all neurons in the same layer, so $f_{1,l}=f_{2,l}=\ldots=f_{d_l,l}$).
* Layer $L+1$ is the **output layer**:
    * $n_{q,L+1}$ are the output neurons, which are equal to the outputs (targets) $y_q$, where $q = 1, \ldots, d_{L+1}$ with $d_{L+1}\equiv d_\text{out}$ being the output dimension (dimension of $\mathbf{y}$).
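As a minimal sketch of the layer recursion above, the forward pass can be written in a few lines of NumPy. The function name `forward`, the layer sizes, and the random parameters are illustrative assumptions, not part of the lecture. Each weight matrix is stored with shape $(d_l, d_{l-1})$ so that `W @ n` maps layer $l-1$ to layer $l$.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Evaluate n_l = f_l(b_{l-1} + W_{l-1} n_{l-1}) layer by layer."""
    n = x  # layer 0: the input neurons equal the features
    for W, b, f in zip(weights, biases, activations):
        n = f(b + W @ n)
    return n  # layer L+1: the output neurons

rng = np.random.default_rng(0)
sizes = [2, 3, 2, 2]  # d_in = 2, hidden layers with 3 and 2 neurons, d_out = 2
weights = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
biases = [rng.normal(size=sizes[l + 1]) for l in range(len(sizes) - 1)]
identity = lambda u: u
activations = [np.tanh, np.tanh, identity]  # tanh hidden layers, linear output layer

x = np.array([0.5, -1.0])
y = forward(x, weights, biases, activations)
print(y.shape)  # (2,)
```

The layer sizes match the network suggested for the board drawing: 2 inputs, hidden layers with 3 and 2 neurons, and 2 outputs.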

## A note about Linear activation functions

$x_i \equiv n_{i,0} \qquad n_{k,l} = f_{k,l}(b_{k,l-1}+W_{jk,l-1}n_{j,l-1}) \qquad n_{q,L+1}\equiv y_q$

So, if we use **linear activation functions**, $f(u) = u$, each neuron is simply given by:

$n_{k,l} = b_{k,l-1}+W_{jk,l-1}n_{j,l-1} \qquad \text{or in tensor notation:} \qquad \mathbf{n}_l = \mathbf{b}_{l-1} + \mathbf{W}_{l-1} \mathbf{n}_{l-1}$

### Example of ANN with linear activation functions (why it is not used!)

For example, an ANN with $L=2$ hidden layers and linear activation functions has the following neurons:

* Layer 0 (input layer): $\mathbf{n}_0 = \mathbf{x}$

* Layer 1 (first hidden layer): $\mathbf{n}_1 = \mathbf{b}_0 + \mathbf{W}_0\mathbf{n}_0 = \mathbf{b}_0 + \mathbf{W}_0 \mathbf{x}$

* Layer 2 (second hidden layer): $\mathbf{n}_2 = \mathbf{b}_1 + \mathbf{W}_1\mathbf{n}_1 \Rightarrow$

    $\Rightarrow \mathbf{n}_2 = \mathbf{b}_1 + \mathbf{W}_1(\mathbf{b}_0 + \mathbf{W}_0 \mathbf{x}) = (\mathbf{b}_1+\mathbf{W}_1\mathbf{b}_0)+ \mathbf{W}_1\mathbf{W}_0\mathbf{x}$

* Layer $L+1=3$ (output layer): $\mathbf{y} = \mathbf{n}_3 = \mathbf{b}_2 + \mathbf{W}_2\mathbf{n}_2 \Rightarrow$

    $\Rightarrow \mathbf{y} = \mathbf{b}_2 + \mathbf{W}_2(\mathbf{b}_1+\mathbf{W}_1\mathbf{b}_0 + \mathbf{W}_1\mathbf{W}_0\mathbf{x}) = \underbrace{(\mathbf{b}_2+\mathbf{W}_2\mathbf{b}_1+\mathbf{W}_2\mathbf{W}_1\mathbf{b}_0)}_{\tilde{\mathbf{b}}}+ \underbrace{\mathbf{W}_2\mathbf{W}_1\mathbf{W}_0}_{\tilde{\mathbf{W}}}\mathbf{x}$
    
Conclusion: only using **linear activation functions** leads to a linear prediction!

For this reason, only having linear activation functions is **not commonly considered**!
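The collapse into a single linear model can be checked numerically. The sketch below uses arbitrary random weights and biases (an assumption made only for illustration) and compares the layer-by-layer evaluation with the equivalent model $\tilde{\mathbf{b}} + \tilde{\mathbf{W}}\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Shapes chosen so that W @ n maps one layer to the next: 2 -> 3 -> 2 -> 2
W0, b0 = rng.normal(size=(3, 2)), rng.normal(size=3)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)
W2, b2 = rng.normal(size=(2, 2)), rng.normal(size=2)

x = rng.normal(size=2)

# Layer-by-layer evaluation with the linear activation f(u) = u
n1 = b0 + W0 @ x
n2 = b1 + W1 @ n1
y_layers = b2 + W2 @ n2

# Equivalent single linear model
W_tilde = W2 @ W1 @ W0
b_tilde = b2 + W2 @ b1 + W2 @ W1 @ b0
y_single = b_tilde + W_tilde @ x

print(np.allclose(y_layers, y_single))  # True
```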

However, please note that as soon as you consider ANY nonlinear activation function, the output prediction becomes nonlinear:

$\require{color}\mathbf{y} = {\color{blue}\mathbf{f}_3} \left\{ {\color{red}\mathbf{W}_2 \mathbf{f}_2}\left[{\color{green}\mathbf{W}_1\mathbf{f}_1}\left( \mathbf{W}_0 \mathbf{x} + \mathbf{b}_0\right)+{\color{green}\mathbf{b}_1}\right] + {\color{red}\mathbf{b}_2}\right\}$ 

Simple rules of thumb when training neural networks:

* For regression problems: use the linear activation function for the **output layer** (why??)

* For classification problems: use the sigmoid (binary case) or softmax (multiclass case) activation function for the **output layer** (why??)

* For predicting a smooth map: use smooth (differentiable) activation functions in the hidden layers.

* Do not start with a huge neural network with many hidden layers and neurons! In fact, it's good to start training simple models and gradually increase complexity, as you need it.
    * For example, start by training a linear model, then a kernel machine, and only then a neural network.
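As a hint for the output-layer rules above, here is a small sketch of the two classification activations. The function names and the score values are hypothetical. It shows that softmax turns arbitrary real-valued scores into probabilities, and that for two classes it reduces to the sigmoid.

```python
import numpy as np

def softmax(u):
    """Map real-valued scores to probabilities that sum to 1."""
    e = np.exp(u - np.max(u))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(u):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-u))

scores = np.array([2.0, -1.0, 0.5])  # hypothetical pre-activations of the output layer
probs = softmax(scores)
print(probs, probs.sum())

# For two classes, softmax over the scores [u, 0] gives sigmoid(u) for the first class
u = 0.7
print(np.isclose(softmax(np.array([u, 0.0]))[0], sigmoid(u)))  # True
```

A linear output layer, in contrast, leaves the scores unbounded, which is what a real-valued regression target requires.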

Let's explore ANNs with a popular interactive demo: https://playground.tensorflow.org.

* Consider the simple classification problem of the "**Circle**" dataset in that demo:

    1. Considering **ReLu** activation function and the same **2 hidden layers** with **4 neurons each** (all other hyperparameters are visible at the top of the page): [click here](https://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.74415&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)
    
    2. Considering **Tanh** activation function and the same 2 hidden layers with 4 neurons each (all other hyperparameters are visible at the top of the page): [click here](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.38062&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)

    3. Considering **Linear** activation function and the same 2 hidden layers with 4 neurons each (all other hyperparameters are visible at the top of the page): [click here](https://playground.tensorflow.org/#activation=linear&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.38062&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)
    
    4. Still considering **ReLu** activation function but now with **6 hidden layers** with **8 neurons** each (all other hyperparameters are visible at the top of the page): [click here](https://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=8,8,8,8,8,8&seed=0.74136&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)

    5. Considering **Tanh** activation function and the same 6 hidden layers with 8 neurons each (all other hyperparameters are visible at the top of the page): [click here](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=8,8,8,8,8,8&seed=0.74136&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)

    

There are many characteristics of the training of ANNs that you can understand empirically by exploring the [interactive demo for both regression and classification](https://playground.tensorflow.org):

* Even the ReLU activation function is capable of nonlinear predictions, although each ReLU is only piecewise linear! It is also cheap to evaluate and often faster to train than smooth activation functions such as Tanh.

* Tanh activation function leads to smooth predictions because it is infinitely differentiable.

* More complicated neural networks are not necessarily better. You do not know a priori how many layers and how many neurons you need to train a good model. It depends on the data characteristics and quantity!

* Neural networks with more layers and neurons (more parameters) are slower to train! Each epoch takes longer, as you can see by playing with the above demos.
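To make "more parameters" concrete, the sketch below counts the weights and biases of two fully connected networks like the ones described above. The helper `count_parameters` is a hypothetical name, and the sizes assume 2 input features and 1 output neuron.

```python
def count_parameters(sizes):
    """Number of weights and biases of a fully connected network with the given layer sizes."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(sizes[:-1], sizes[1:]))

small = [2, 4, 4, 1]               # 2 inputs, 2 hidden layers of 4 neurons, 1 output
large = [2, 8, 8, 8, 8, 8, 8, 1]   # 2 inputs, 6 hidden layers of 8 neurons, 1 output
print(count_parameters(small))  # 37
print(count_parameters(large))  # 393
```

The larger network has roughly ten times more unknowns, so each gradient evaluation is correspondingly more expensive.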

## Understanding the training of ANNs

ANN for **classification** problems is defined similarly to what we did for Logistic regression, but using a nonlinear function to model the mean:

1. Categorical observation distribution: $p(\mathbf{y}|\mathbf{x}, \mathbf{z}) = \text{Cat}\left(\mathbf{y}| \boldsymbol{\rho} = \mathbf{f}(\mathbf{x};\, \mathbf{z}) \right)$

    where $\mathbf{z} \equiv \{\mathbf{W}, \mathbf{b}\}$ are all the unknown model parameters (weights and biases of every layer) and where $\mathbf{f}(\mathbf{x};\, \mathbf{z})$ is nonlinear in the weights due to the choice of activation functions (the activation function in the **output layer** is the **softmax activation**, just like in logistic regression).

2. Uniform prior distribution for each hidden rv $\mathbf{z}$: $p(\mathbf{z}) \propto 1$
    
    * Note: we can also use other priors! For example a Gaussian prior ($\ell_2$ regularization) or a Laplace prior ($\ell_1$ regularization).

3. MLE point estimate for posterior: $\hat{\mathbf{z}}_{\text{mle}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{i=1}^{N}\log{ p(\mathbf{y}=\mathbf{y}_i|\mathbf{x}=\mathbf{x}_i, \mathbf{z})}\right]$
    * Note: if we use a different prior, then it becomes the MAP estimate.

Final prediction is given by the <font color='orange'>PPD</font>: $\require{color}
{\color{orange}p(\mathbf{y}^*|\mathbf{x}^*, \mathcal{D})} = \int p(\mathbf{y}^*|\mathbf{x}^*,\mathbf{z}) \delta(\mathbf{z}-\hat{\mathbf{z}}) d\mathbf{z} = p(\mathbf{y}^*|\mathbf{x}^*, \mathbf{z}=\hat{\mathbf{z}})$
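As a minimal sketch of the NLL in step 3 for the categorical case, assume the network outputs (after the softmax) are already available as a matrix of class probabilities. The probabilities and labels below are made-up values for illustration only.

```python
import numpy as np

def categorical_nll(Y_onehot, P):
    """Sum over data points of -log p(y_n | x_n, z) for a categorical likelihood."""
    return -np.sum(Y_onehot * np.log(P))

# Hypothetical softmax outputs f(x_n; z) for N = 3 points and 3 classes (rows sum to 1)
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
# One-hot encoded observed classes
Y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

nll = categorical_nll(Y_onehot, P)
print(np.isclose(nll, -(np.log(0.7) + np.log(0.8) + np.log(0.4))))  # True

# With a point estimate, the predicted class is the most probable one
print(np.argmax(P, axis=1))  # [0 1 2]
```

Only the probability assigned to the observed class of each point enters the sum, which is why this NLL is also known as the cross-entropy loss.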

ANN for **regression** problems is defined similarly to what we did for Linear regression, but using a nonlinear function to model the mean:

1. Gaussian observation distribution (**ignore variance**): $p(\mathbf{y}|\mathbf{x}, \mathbf{z}) = \mathcal{N}(\mathbf{y}| \boldsymbol{\mu}_{y|z} = \mathbf{f}(\mathbf{x};\, \mathbf{z}), \sigma_{y|z}^2 = \sigma^2)$

    where $\mathbf{z} \equiv \{\mathbf{W}, \mathbf{b}\}$ are all the unknown model parameters (weights and biases of every layer) and where $\mathbf{f}(\mathbf{x};\, \mathbf{z})$ is nonlinear in the weights due to the choice of activation functions (the activation function in the **output layer** is the **linear activation**, just like in linear regression).

2. Uniform prior distribution for each hidden rv $\mathbf{z}$: $p(\mathbf{z}) \propto 1$
    
    * Note: we can also use other priors! For example a Gaussian prior ($\ell_2$ regularization) or a Laplace prior ($\ell_1$ regularization).

3. MLE point estimate for posterior: $\hat{\mathbf{z}}_{\text{mle}} = \underset{z}{\mathrm{argmin}}\left[-\sum_{i=1}^{N}\log{ p(\mathbf{y}=\mathbf{y}_i|\mathbf{x}=\mathbf{x}_i, \mathbf{z})}\right]$
    * Note: if we use a different prior, then it becomes the MAP estimate.

Final prediction is given by the <font color='orange'>PPD</font>: $\require{color}
{\color{orange}p(\mathbf{y}^*|\mathbf{x}^*, \mathcal{D})} = \int p(\mathbf{y}^*|\mathbf{x}^*,\mathbf{z}) \delta(\mathbf{z}-\hat{\mathbf{z}}) d\mathbf{z} = p(\mathbf{y}^*|\mathbf{x}^*, \mathbf{z}=\hat{\mathbf{z}})$

In both cases (classification and regression), we need to calculate the point estimate of the unknowns $\mathbf{z}$.

Without loss of generality, we will focus on the **regression** case and assume uniform prior, i.e. no "regularization", as in Lecture 11:

$$\begin{align}
\hat{\mathbf{z}}_{\text{mle}} &= \underset{z}{\mathrm{argmin}}\left[\text{NLL}(\mathbf{z})\right]
\\
&= \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{ p(\mathbf{y}=\mathbf{y}_n|\mathbf{x}=\mathbf{x}_n, \mathbf{z})}\right]
\end{align}
$$

$$
\begin{align}
\hat{\mathbf{z}}_{\text{mle}} &= \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{ p(\mathbf{y}=\mathbf{y}_n|\mathbf{x}=\mathbf{x}_n, \mathbf{z})}\right] \\
&= \underset{z}{\mathrm{argmin}}\left[-\sum_{n=1}^{N}\log{\left( \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}\left[\mathbf{y}_n-\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})\right]^T\left[\mathbf{y}_n-\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})\right]\right\}\right)}\right]\\
&= \underset{z}{\mathrm{argmin}}\left[\frac{N}{2}\log{\left(2\pi \sigma^2\right)}+\frac{1}{2 \sigma^2}\sum_{n=1}^{N}\left[\mathbf{y}_n-\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})\right]^T\left[\mathbf{y}_n-\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})\right] \right]\\
\end{align}
$$

where we recall that the unknowns are $\mathbf{z}$ (the weights and biases of the network), and where we can ignore the variance term (because we assume it to be constant, i.e. we will not solve for it).

Therefore, the point estimate of the ANN for regression becomes:

$$
\hat{\mathbf{z}}_{\text{mle}} = \underset{z}{\mathrm{argmin}}\left[\frac{1}{2}\sum_{n=1}^{N}\left[\mathbf{y}_n-\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})\right]^T\left[\mathbf{y}_n-\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})\right] \right]
$$

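This is called the $\ell_2$ loss function (it equals the mean squared error loss up to the constant factor $N/2$), and we denote it as follows, with $\mathbf{X}$ and $\mathbf{Y}$ collecting the $N$ training inputs and outputs:

$$
\mathcal{L}\left[(\mathbf{X},\mathbf{Y}), \mathbf{z} \right] = \frac{1}{2} ||\mathbf{Y}-\mathbf{f}(\mathbf{X};\, \mathbf{z}) ||_2^2
$$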
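A quick numerical check of this simplification, as a sketch with made-up scalar targets and predictions: the Gaussian NLL differs from the $\ell_2$ loss only by an additive constant and the positive factor $1/\sigma^2$, so both are minimized by the same $\mathbf{z}$.

```python
import numpy as np

def l2_loss(Y, F):
    """0.5 * sum of squared residuals between targets Y and predictions F."""
    return 0.5 * np.sum((Y - F) ** 2)

def gaussian_nll(Y, F, sigma2):
    """NLL of scalar targets under N(y | f, sigma2), summed over the N points."""
    N = Y.size
    return 0.5 * N * np.log(2 * np.pi * sigma2) + l2_loss(Y, F) / sigma2

rng = np.random.default_rng(2)
Y = rng.normal(size=20)               # hypothetical targets
F_a = Y + 0.1 * rng.normal(size=20)   # predictions of a good model
F_b = Y + 1.0 * rng.normal(size=20)   # predictions of a worse model
sigma2 = 0.3                          # assumed constant variance

const = 0.5 * Y.size * np.log(2 * np.pi * sigma2)
print(np.isclose(gaussian_nll(Y, F_a, sigma2) - const, l2_loss(Y, F_a) / sigma2))  # True

# Both criteria rank the two models in the same order
print((gaussian_nll(Y, F_a, sigma2) < gaussian_nll(Y, F_b, sigma2))
      == (l2_loss(Y, F_a) < l2_loss(Y, F_b)))  # True
```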

* Note: similarly, we can calculate the point estimate for the **classification** case as in Lecture 21, obtaining the **cross-entropy** loss function, but where the prediction now depends on a nonlinear function: $\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})$.

Conclusion: as in the other ML models, finding the point estimate means that we need to minimize the loss function (finding minimum location of the NLL or of the negative log joint likelihood if the prior is not uniform).

* However, there is a **major** problem now: this function is no longer convex because the mean $\mathbf{f}(\mathbf{x}_n;\, \mathbf{z})$ is a composition of nonlinear activation functions...

So, how do we solve this optimization problem??

* We do the same thing as before! We use optimization algorithms that find the minimum of the loss function!

* As before, we calculate the gradient of the loss function (gradient of the NLL) and then provide the loss function and its gradient to an optimizer that will then find the optimum for us!

In fact, from the user point of view, we don't care if the loss function is not convex... We use an optimizer, cross our fingers, and hope that we land in a "decent" optimum (which almost never is the "global" optimum, contrary to what happened in linear models).
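One way to see the loss of convexity is through a symmetry of the network: swapping two hidden neurons changes the parameter vector but not the prediction. The sketch below uses a hypothetical tiny network (1 input, 2 tanh hidden neurons, linear output, no biases) and synthetic data generated by the network itself. A convex function could not be larger at the midpoint of two minimizers than at the minimizers themselves, yet here it is.

```python
import numpy as np

def predict(x, w, v):
    """1 hidden layer with 2 tanh neurons, linear output, no biases."""
    return np.tanh(np.outer(x, w)) @ v

def loss(x, y, w, v):
    """l2 loss: 0.5 * sum of squared residuals."""
    return 0.5 * np.sum((y - predict(x, w, v)) ** 2)

x = np.linspace(-2.0, 2.0, 21)
w_a, v_a = np.array([2.0, 0.5]), np.array([1.0, -1.0])
y = predict(x, w_a, v_a)  # synthetic targets generated by the network itself

# Swapping the two hidden neurons: different parameters, same function
w_b, v_b = w_a[::-1], v_a[::-1]

# Midpoint between the two parameter vectors
w_m, v_m = 0.5 * (w_a + w_b), 0.5 * (v_a + v_b)

loss_a, loss_b, loss_m = loss(x, y, w_a, v_a), loss(x, y, w_b, v_b), loss(x, y, w_m, v_m)
print(loss_a, loss_b)  # both zero (up to round-off)
print(loss_m)          # strictly positive, so the loss is not convex
```

At the midpoint the output weights average to zero, so the network predicts zero everywhere and the loss jumps to half the sum of squared targets.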

The mean function of an ANN with $L$ hidden layers is then a composition of $L+1$ functions:

$\mathbf{f} = \mathbf{f}_{L+1} \circ \mathbf{f}_L \circ \ldots \circ \mathbf{f}_{2} \circ \mathbf{f}_1 $

And if we denote $f_{L+2} = \mathcal{L}$ as the loss function, then we can write the loss function as follows:

$\mathcal{L} = f_{L+2} \circ \mathbf{f}_{L+1} \circ \mathbf{f}_L \circ \ldots \circ \mathbf{f}_{l} \circ \ldots \circ \mathbf{f}_{2} \circ \mathbf{f}_1 $

Where $l=1,\ldots, L$ are the hidden layers.

The next slide makes it less abstract, by considering the example of an ANN with **1 hidden layer** and the $\ell_2$ loss (i.e. regression case).

### Neurons of ANN with **1 hidden layer** and using $\ell_2$ loss (i.e. regression case).

* Layer 0 (input layer): $\mathbf{n}_0 \equiv \mathbf{x}$

* Layer $L=1$ (first hidden layer):

$$\begin{align}
\mathbf{n}_L &= \mathbf{n}_1 \\
             &= \mathbf{f}_L(\mathbf{n}_{L-1}, \mathbf{z}_{L-1}) \\
             &= \mathbf{f}_1(\mathbf{n}_0, \mathbf{z}_0) \\
             &= \mathbf{f}_1(\mathbf{b}_0 + \mathbf{W}_0\mathbf{n}_0)
\end{align}
$$

* Layer $L+1=2$ (output layer):

$$\begin{align}
\mathbf{n}_{L+1} &= \mathbf{n}_2 \equiv \mathbf{y} \\
           &= \mathbf{f}_{L+1}(\mathbf{n}_{L}, \mathbf{z}_{L}) = \mathbf{f}_2(\mathbf{n}_1, \mathbf{z}_1) \\ 
           &= \mathbf{f}_2(\mathbf{b}_1 + \mathbf{W}_1\mathbf{n}_1) \\
           &= \mathbf{b}_1 + \mathbf{W}_1\mathbf{n}_1
\end{align}
$$

    because in regression the output layer uses a linear activation function

* And finally the loss function can be considered Layer $L+2$ (here we consider $\ell_2$ loss because it is a regression case):

$$\begin{align}
\mathcal{L} &= f_{L+2}(\mathbf{n}_{L+1}, \mathbf{y})\\
            &= f_3(\mathbf{n}_2, \mathbf{y}) \\
            &= \frac{1}{2}|| \mathbf{y} - \mathbf{n}_2  ||_2^2
\end{align}
$$

where, again, this loss expression (MSE or $\ell_2$ loss) is used because we are considering regression. If it were a classification problem we would have the cross-entropy loss (see Lecture 21).

Now we can calculate the derivatives of the loss function wrt the unknowns $\mathbf{z}$ very easily!

In order to calculate the derivatives of the loss function:

$\mathcal{L} = f_{L+2} \circ \mathbf{f}_{L+1} \circ \mathbf{f}_L \circ \ldots \circ \mathbf{f}_{2} \circ \mathbf{f}_1 $

we just need to use the chain rule!

For example, taking the derivative of the loss wrt the unknowns of the last layer, i.e. $\mathbf{z}_L$:

$$\begin{align}
\nabla_{\mathbf{z}_{L}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_L} &= \frac{\partial f_{L+2}}{\partial \mathbf{z}_{L}} \\
&= \frac{\partial f_{L+2}}{\partial \mathbf{f}_{L+1}} \frac{\partial \mathbf{f}_{L+1}}{\partial \mathbf{z}_{L}}
\end{align}
$$

Which is usually presented as:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_L} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}_{L+1}} \frac{\partial \mathbf{n}_{L+1}}{\partial \mathbf{z}_{L}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{z}_{L}}
$$

but we should not stop here! Remember, we want the derivatives of the loss wrt all the unknowns $\mathbf{z}$.

So, taking the derivative of the loss wrt the unknowns of the previous layer, i.e. wrt $\mathbf{z}_{L-1}$:

$$\begin{align}
\nabla_{\mathbf{z}_{L-1}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L-1}} &= \frac{\partial f_{L+2}}{\partial \mathbf{z}_{L-1}} \\
&= \frac{\partial f_{L+2}}{\partial \mathbf{f}_{L+1}} \frac{\partial \mathbf{f}_{L+1}}{\partial \mathbf{f}_{L}} \frac{\partial \mathbf{f}_{L}}{\partial \mathbf{z}_{L-1}}
\end{align}
$$

Which is usually presented as:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L-1}} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}_{L+1}} \frac{\partial \mathbf{n}_{L+1}}{\partial \mathbf{n}_{L}} \frac{\partial \mathbf{n}_{L}}{\partial \mathbf{z}_{L-1}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{n}_{L}} \frac{\partial \mathbf{n}_{L}}{\partial \mathbf{z}_{L-1}}
$$

but we should not stop here!

So, taking the derivative of the loss wrt $\mathbf{z}_{l}$:

$$\begin{align}
\nabla_{\mathbf{z}_{l}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{l}} &= \frac{\partial f_{L+2}}{\partial \mathbf{z}_{l}} \\
&= \frac{\partial f_{L+2}}{\partial \mathbf{f}_{L+1}} \frac{\partial \mathbf{f}_{L+1}}{\partial \mathbf{f}_{L}} \cdots \frac{\partial \mathbf{f}_{l+2}}{\partial \mathbf{f}_{l+1}}\frac{\partial \mathbf{f}_{l+1}}{\partial \mathbf{z}_{l}}
\end{align}
$$

Which is usually presented as:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_{l}} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}_{L+1}} \frac{\partial \mathbf{n}_{L+1}}{\partial \mathbf{n}_{L}} \cdots \frac{\partial \mathbf{n}_{l+2}}{\partial \mathbf{n}_{l+1}}\frac{\partial \mathbf{n}_{l+1}}{\partial \mathbf{z}_{l}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{n}_{L}} \cdots \frac{\partial \mathbf{n}_{l+2}}{\partial \mathbf{n}_{l+1}}\frac{\partial \mathbf{n}_{l+1}}{\partial \mathbf{z}_{l}}
$$

but we should not stop here! We need to keep going until we reach the first layer of the network!

So, taking the derivative of the loss wrt $\mathbf{z}_{0}$:

$$\begin{align}
\nabla_{\mathbf{z}_{0}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{0}} &= \frac{\partial f_{L+2}}{\partial \mathbf{z}_{0}} \\
&= \frac{\partial f_{L+2}}{\partial \mathbf{f}_{L+1}} \frac{\partial \mathbf{f}_{L+1}}{\partial \mathbf{f}_{L}} \cdots \frac{\partial \mathbf{f}_{l+2}}{\partial \mathbf{f}_{l+1}}\frac{\partial \mathbf{f}_{l+1}}{\partial \mathbf{f}_{l}} \cdots \frac{\partial \mathbf{f}_{2}}{\partial \mathbf{f}_{1}}\frac{\partial \mathbf{f}_{1}}{\partial \mathbf{z}_{0}} 
\end{align}
$$

Which is usually presented as:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_{0}} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}_{L+1}} \frac{\partial \mathbf{n}_{L+1}}{\partial \mathbf{n}_{L}} \cdots \frac{\partial \mathbf{n}_{l+2}}{\partial \mathbf{n}_{l+1}}\frac{\partial \mathbf{n}_{l+1}}{\partial \mathbf{n}_{l}} \cdots \frac{\partial \mathbf{n}_{2}}{\partial \mathbf{n}_{1}}\frac{\partial \mathbf{n}_{1}}{\partial \mathbf{z}_{0}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{n}_{L}} \cdots \frac{\partial \mathbf{n}_{l+2}}{\partial \mathbf{n}_{l+1}}\frac{\partial \mathbf{n}_{l+1}}{\partial \mathbf{n}_{l}} \cdots \frac{\partial \mathbf{n}_{2}}{\partial \mathbf{n}_{1}}\frac{\partial \mathbf{n}_{1}}{\partial \mathbf{z}_{0}} 
$$

* And with this we have finally computed all the derivatives!

This is what we call **Backpropagation**.
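As a worked instance of the chain rule, consider again the ANN with **1 hidden layer** and the $\ell_2$ loss. The symbols $\mathbf{a}_0$ (pre-activation of the hidden layer) and $\boldsymbol{\delta}_l$ (derivative of the loss wrt a layer) are shorthand introduced here for illustration only, $\mathbf{y}$ denotes the target, and $\odot$ is the elementwise product:

$$\begin{align}
\mathbf{a}_0 &= \mathbf{b}_0 + \mathbf{W}_0\mathbf{x}, \qquad \mathbf{n}_1 = \mathbf{f}_1(\mathbf{a}_0), \qquad \mathbf{n}_2 = \mathbf{b}_1 + \mathbf{W}_1\mathbf{n}_1, \qquad \mathcal{L} = \frac{1}{2}|| \mathbf{y} - \mathbf{n}_2 ||_2^2 \\
\boldsymbol{\delta}_2 &\equiv \frac{\partial \mathcal{L}}{\partial \mathbf{n}_2} = \mathbf{n}_2 - \mathbf{y} \\
\frac{\partial \mathcal{L}}{\partial \mathbf{b}_1} &= \boldsymbol{\delta}_2, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{W}_1} = \boldsymbol{\delta}_2\,\mathbf{n}_1^T \\
\boldsymbol{\delta}_1 &\equiv \frac{\partial \mathcal{L}}{\partial \mathbf{a}_0} = \left(\mathbf{W}_1^T\boldsymbol{\delta}_2\right) \odot \mathbf{f}_1'(\mathbf{a}_0) \\
\frac{\partial \mathcal{L}}{\partial \mathbf{b}_0} &= \boldsymbol{\delta}_1, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{W}_0} = \boldsymbol{\delta}_1\,\mathbf{x}^T
\end{align}
$$

Note that $\boldsymbol{\delta}_2$ is computed once and reused for every derivative closer to the input, which is why the derivatives are evaluated from the output layer backwards.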

## Backpropagation

* Backpropagation is simply the computation of the gradients (derivatives) of the loss wrt the unknowns $\mathbf{z}$.

    * Note: remember, "loss function" just means NLL (when using a uniform prior) or negative log joint likelihood (when using any other prior).

* These gradients are then provided to **a gradient-based optimizer** (variants of gradient descent, such as ADAM that we used last lecture), that will then provide a point estimate (the weights and biases that minimize the error; or, alternatively, that maximize the joint likelihood).
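A minimal NumPy sketch of these two steps, for an ANN with 1 hidden layer, a tanh hidden activation, and the $\ell_2$ loss on a single data point. The function names, the random data, and the learning rate are illustrative assumptions. Plain gradient descent is used instead of ADAM only to keep the example short. The backpropagated gradient is first checked against a finite difference and then passed to the optimizer loop.

```python
import numpy as np

def forward(x, W0, b0, W1, b1):
    a0 = b0 + W0 @ x      # pre-activation of the hidden layer
    n1 = np.tanh(a0)      # hidden neurons
    n2 = b1 + W1 @ n1     # output neurons (linear activation)
    return a0, n1, n2

def loss_and_grads(x, y, W0, b0, W1, b1):
    a0, n1, n2 = forward(x, W0, b0, W1, b1)
    loss = 0.5 * np.sum((y - n2) ** 2)
    # Backward pass: chain rule from the loss back to the first layer
    d2 = n2 - y                           # dL/dn2
    gW1, gb1 = np.outer(d2, n1), d2       # dL/dW1, dL/db1
    d1 = (W1.T @ d2) * (1.0 - n1 ** 2)    # dL/da0, using tanh'(a) = 1 - tanh(a)^2
    gW0, gb0 = np.outer(d1, x), d1        # dL/dW0, dL/db0
    return loss, (gW0, gb0, gW1, gb1)

rng = np.random.default_rng(3)
x, y = rng.normal(size=2), rng.normal(size=2)   # one hypothetical data point
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(2, 3)), rng.normal(size=2)]

# Check one backpropagated entry against a central finite difference
_, grads = loss_and_grads(x, y, *params)
bp = grads[0][0, 0]
eps = 1e-6
W0_plus, W0_minus = params[0].copy(), params[0].copy()
W0_plus[0, 0] += eps
W0_minus[0, 0] -= eps
fd = (loss_and_grads(x, y, W0_plus, *params[1:])[0]
      - loss_and_grads(x, y, W0_minus, *params[1:])[0]) / (2 * eps)
print(np.isclose(fd, bp, atol=1e-6))

# A few steps of plain gradient descent (hypothetical learning rate)
lr = 0.02
losses = []
for _ in range(200):
    loss, grads = loss_and_grads(x, y, *params)
    losses.append(loss)
    params = [p - lr * g for p, g in zip(params, grads)]
print(losses[0], losses[-1])
```

In practice the gradients are rarely coded by hand: automatic differentiation libraries perform the same backward pass for arbitrary architectures.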

In the last part of the course, we will introduce the basics of **optimization**!

* Optimization is the key to deterministic machine learning because it is what enables us to find the point estimate for the unknowns (here: weights and biases) by minimizing the NLL (or negative log joint likelihood).

**Note**: optimization is (obviously!) a separate discipline on its own! We will only have 3 classes about this topic! The main goal is to understand why we typically choose a particular algorithm for finding point estimates of particular models:

* For example, L-BFGS is a common choice to find the point estimate of the hyperparameters of GPs, but we use ADAM and Stochastic Gradient Descent to find the point estimate of the parameters of an ANN.

### See you next class

Have fun!