# Exercise on Backpropagation (only for students who want to practice with backpropagation)

### Course on Deep Learning for System Identification
### Authors: Dario Piga, Marco Forgione
### Lugano, March 7th, 2024

Let us consider the 4-layer network shown in the figure below:


![4-layer network](figures/forward.png)

Consider the following conditions:

- $x \in R^2$ and $y \in R$. Both $x$ and $y$ are given and equal to $x = [1, \ 2]^T$ and $y = 1$.
- Layer 1 is a linear layer, mapping $z^1=x$ into $z^2 \in R^3$ as follows: $z^2 = W_1 z^1 + b_1$, where $W_1$ is a matrix of size 3x2 and $b_1$ is a vector of size 3x1.
- Layer 2 is a nonlinear layer made of sigmoid activation functions, mapping each element of $z^2$ into the corresponding element of $z^3 \in R^3$ as follows:
$z^3_j = \sigma(z^2_j)$, where $\sigma$ is the sigmoid function defined as $$\sigma(z)=\frac{e^{z}}{1+e^z}$$
- Layer 3 is a linear layer, mapping $z^3$ into $z^4 \in R$ as follows: $z^4 = W_3 z^3 + b_3$, where $W_3$ is a matrix of size 1x3 and $b_3$ is a scalar.
- Layer 4 simply takes $z^4$ as input and constructs the quadratic scalar loss $\ell = z^5 = (z^4 - y)^2$.

The weights $W_1$, $W_3$ and the biases $b_1$, $b_3$ are the parameters of the network.
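The backward pass through Layer 2 needs the derivative of the sigmoid. Below is a minimal sketch in plain Python, using the standard identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$. The function names `sigmoid` and `sigmoid_prime` are illustrative and not part of the course material.

```python
import math


def sigmoid(z):
    """Sigmoid as defined in the text: e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))


def sigmoid_prime(z):
    """Derivative of the sigmoid, written as sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Quick sanity check at z = 0, where sigma = 1/2 and sigma' = 1/4
print(sigmoid(0.0), sigmoid_prime(0.0))
```

Expressing the derivative through $\sigma(z)$ itself is convenient here because the forward pass already stores $z^3 = \sigma(z^2)$, so no extra exponentials are needed in the backward pass.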

Exercise: Apply the backpropagation algorithm to compute the gradient of the loss (namely, $z^5$) w.r.t. the parameters $W_1$, $b_1$, $W_3$ and $b_3$, evaluated at the following values of $W_1$, $b_1$, $W_3$ and $b_3$:

$$
W_1 = \left[ \begin{array}{ll}
0 & 1 \\
1 & -2 \\
-2 & 1
\end{array} \right], \ \ \ b_1 = \left[ \begin{array}{l}
1  \\
1  \\
-1
\end{array} \right],  \ \ \ W_3 = \left[ \begin{array}{lll}
1 & 1 & 1
\end{array} \right], \ \ \ b_3 = 1.
$$

To keep the notation compact, we stack $W_1$ and $b_1$ in the parameter $\theta_1$. Similarly, we stack $W_3$ and $b_3$ in the parameter $\theta_3$.
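To check the forward-pass numbers obtained by hand, the sketch below evaluates the network at the given values in plain Python. It assumes one possible stacking convention, $\theta_1 = [W_1 \mid b_1]$ applied to the augmented input $[x^T, 1]^T$; the text does not fix a convention, so `theta1` and `x_aug` are illustrative names only. The loss is taken as $(z^4 - y)^2$, consistent with the PyTorch cell at the end.

```python
import math

# Values given in the exercise
x = [1.0, 2.0]
y = 1.0
W1 = [[0.0, 1.0], [1.0, -2.0], [-2.0, 1.0]]
b1 = [1.0, 1.0, -1.0]
W3 = [1.0, 1.0, 1.0]  # the single row of the 1x3 matrix W3
b3 = 1.0


def sigmoid(z):
    return math.exp(z) / (1.0 + math.exp(z))


# Assumed stacking: theta_1 = [W1 | b1], one row per output unit of Layer 1
theta1 = [row + [b] for row, b in zip(W1, b1)]
x_aug = x + [1.0]  # input augmented with a constant 1 for the bias

# Forward pass, layer by layer
z2 = [sum(t * v for t, v in zip(row, x_aug)) for row in theta1]  # Layer 1
z3 = [sigmoid(v) for v in z2]                                    # Layer 2
z4 = sum(w * v for w, v in zip(W3, z3)) + b3                     # Layer 3
z5 = (z4 - y) ** 2                                               # Layer 4 (loss)

print("z2 =", z2)
print("z3 =", z3)
print("z4 =", z4)
print("z5 =", z5)
```

With this stacking, $\theta_1 [x^T, 1]^T$ is the same as $W_1 x + b_1$, so the choice only affects how the gradient entries are arranged.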


**For your practice, we invite you to solve the exercise with "pen and paper", with the aid of a calculator (or Python itself) just for algebraic computations, such as matrix multiplication, evaluation of the output of a sigmoid function, etc.**

**At the end of the exercise, you can compare your solution with the one computed through the backward function in PyTorch.**

**Hint 1:**

Apply the **forward pass** according to the diagram in the cell above.

Apply the **backward pass** according to the following diagram:

![4-layer network](figures/backward.png)
 
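where each element $\delta_i^L$ of $\delta^L$ (the derivative of the loss with respect to $z_i^L$) is computed recursively according to the following formula seen in the lesson, in which the sum runs over the elements of $z^{L+1}$:

\begin{align*}
\delta_i^L = \sum_j \delta_j^{L+1} \frac{\partial z_j^{L+1}}{\partial z_i^{L}}
\end{align*}

Thus, you need to compute the partial derivatives $\frac{\partial z_j^{L+1}}{\partial z_i^{L}}$ of each layer and then obtain $\delta^L$ recursively from $\delta^{L+1}$, starting from the last layer.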
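The recursion in Hint 1 can be traced numerically with the sketch below, intended only as a check of intermediate values after attempting the exercise by hand. It assumes $\delta^L$ denotes the derivative of the loss with respect to $z^L$ and that the loss is $z^5 = (z^4 - y)^2$ (no factor $1/2$), consistent with the PyTorch cell at the end. Variable names such as `delta4` are illustrative.

```python
import math


def sigmoid(z):
    return math.exp(z) / (1.0 + math.exp(z))


# Values given in the exercise
x, y = [1.0, 2.0], 1.0
W1 = [[0.0, 1.0], [1.0, -2.0], [-2.0, 1.0]]
b1 = [1.0, 1.0, -1.0]
W3 = [1.0, 1.0, 1.0]
b3 = 1.0

# Forward pass (the local derivatives depend on z^3 and z^4)
z2 = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W1, b1)]
z3 = [sigmoid(v) for v in z2]
z4 = sum(w * v for w, v in zip(W3, z3)) + b3

# Backward pass: delta^L obtained from delta^{L+1}
delta5 = 1.0                                               # loss = z^5
delta4 = delta5 * 2.0 * (z4 - y)                           # z^5 = (z^4 - y)^2
delta3 = [delta4 * w for w in W3]                          # dz^4/dz^3_i = W3[i]
delta2 = [d * s * (1.0 - s) for d, s in zip(delta3, z3)]   # sigma'(z^2_i) = z^3_i (1 - z^3_i)
delta1 = [sum(W1[j][i] * delta2[j] for j in range(3))      # dz^2_j/dz^1_i = W1[j][i]
          for i in range(2)]

for name, value in [("delta4", delta4), ("delta3", delta3),
                    ("delta2", delta2), ("delta1", delta1)]:
    print(name, "=", value)
```

Note that `delta1` is the gradient with respect to the input $x$; it is not needed for the parameter gradients but completes the recursion.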


**Hint 2:**

Once the derivatives $\delta^L$ (with $L=1,2,3,4$) are computed, the partial derivatives $\frac{\partial \ell}{\partial \theta_1}$ and $\frac{\partial \ell}{\partial \theta_3}$ can be computed according to the general formula seen in the lesson:

$$\frac{\partial \ell}{\partial \theta_L} = \sum_i \delta_i^{L+1} \frac{\partial z_i^{L+1}}{\partial \theta_L}.$$

Thus, you also need to compute $\frac{\partial z_i^{L+1}}{\partial \theta_{L}}$ for all $i=1,\ldots,n_{L+1}$ where $n_{L+1}$ is the size of $z^{L+1}$.
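For the two linear layers, the derivatives requested in Hint 2 have a simple closed form. The following is a sketch, assuming that $\delta^{L+1}$ and $z^L$ are treated as column vectors and that the layer reads $z^{L+1} = W_L z^L + b_L$. The indices $j, k$ are introduced here only for illustration.

$$
\frac{\partial z_i^{L+1}}{\partial (W_L)_{jk}} =
\begin{cases} z_k^L & \text{if } i = j \\ 0 & \text{otherwise} \end{cases},
\qquad
\frac{\partial z_i^{L+1}}{\partial (b_L)_{j}} =
\begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}
$$

Substituting into the general formula, only the term $i = j$ survives in the sum:

$$
\frac{\partial \ell}{\partial (W_L)_{jk}} = \delta_j^{L+1} z_k^L
\ \ \Rightarrow \ \
\frac{\partial \ell}{\partial W_L} = \delta^{L+1} (z^L)^T,
\qquad
\frac{\partial \ell}{\partial b_L} = \delta^{L+1}.
$$

In this exercise, this gives $\frac{\partial \ell}{\partial W_1} = \delta^2 x^T$ (size 3x2), $\frac{\partial \ell}{\partial b_1} = \delta^2$, $\frac{\partial \ell}{\partial W_3} = \delta^4 (z^3)^T$ (size 1x3) and $\frac{\partial \ell}{\partial b_3} = \delta^4$.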

## Solution check

To verify your results, run the cell below, which computes the gradients using automatic differentiation in PyTorch.

# Solution using PyTorch

import torch


# Define your variables here. Do not forget to set requires_grad=True for W1, b1, W3, b3!

W1 = torch.tensor([[0, 1], [1, -2], [-2, 1]], dtype=torch.float, requires_grad=True)
b1 = torch.tensor([[1], [1], [-1]], dtype=torch.float, requires_grad=True)

W3 = torch.tensor([[1.0, 1, 1]], dtype=torch.float, requires_grad=True)
b3 = torch.tensor([1.0], dtype=torch.float, requires_grad=True)

x = torch.tensor([1.0, 2.0]).reshape(-1, 1)
y = torch.tensor([1.0]).reshape(-1, 1)


# reset gradients (a no-op for freshly created tensors, kept for clarity)
W1.grad = None
b1.grad = None
W3.grad = None
b3.grad = None

# forward pass

z1 = x
z2 = torch.matmul(W1, z1) + b1
z3 = torch.sigmoid(z2)
z4 = torch.matmul(W3, z3) + b3
z5 = torch.norm(z4 - y)**2  # quadratic loss: squared norm of the error z4 - y

# backward pass
z5.backward()

# print results
print(W1.grad)
print(b1.grad)
print(W3.grad)
print(b3.grad)

tensor([[0.1211, 0.2423],
        [0.2815, 0.5631],
        [0.5272, 1.0544]])
tensor([[0.1211],
        [0.2815],
        [0.5272]])
tensor([[2.5543, 0.3196, 0.7211]])
tensor([2.6814])
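As an additional, independent check that does not rely on automatic differentiation, individual gradient entries can be approximated by central finite differences. The sketch below is plain Python; the helper `loss`, the step `eps` and the choice of which entries to perturb are illustrative assumptions, not part of the course material.

```python
import math


def loss(W1, b1, W3, b3, x=(1.0, 2.0), y=1.0):
    """Forward pass of the exercise network; returns the scalar loss z^5."""
    z2 = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W1, b1)]
    z3 = [math.exp(v) / (1.0 + math.exp(v)) for v in z2]
    z4 = sum(w * v for w, v in zip(W3, z3)) + b3
    return (z4 - y) ** 2


W1 = [[0.0, 1.0], [1.0, -2.0], [-2.0, 1.0]]
b1 = [1.0, 1.0, -1.0]
W3 = [1.0, 1.0, 1.0]
b3 = 1.0
eps = 1e-6  # perturbation size (an arbitrary but common choice)

# Central difference with respect to b3
g_b3 = (loss(W1, b1, W3, b3 + eps) - loss(W1, b1, W3, b3 - eps)) / (2 * eps)

# Central difference with respect to the bottom-right entry of W1
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[2][1] += eps
W1_minus[2][1] -= eps
g_W1_21 = (loss(W1_plus, b1, W3, b3) - loss(W1_minus, b1, W3, b3)) / (2 * eps)

print(g_b3, g_W1_21)
```

The two printed values should agree, up to small numerical error, with the entry of `b3.grad` and the last entry of `W1.grad` printed above.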
