# Backpropagation
For a long time it was not clear how to train multi-layer networks on a given data set. Single-layer perceptrons had a simple learning rule that was guaranteed to converge to a solution on linearly separable data, but the rule could not be extended to networks with more than one layer. The AI community struggled with this problem for more than 30 years (in a period known as the "AI winter"), until in 1986 Rumelhart et al. introduced the **backpropagation algorithm** in their groundbreaking [paper](https://www.nature.com/articles/323533a0).

## The Three Phases of the Algorithm
The backpropagation algorithm consists of three phases:
- **Forward pass:** In this phase we feed the inputs through the network, make a prediction and measure its error with respect to the true label.
- **Backward pass:** We propagate the gradients of the error with respect to each one of the weights backward from the output layer to the input layer.
- **Gradient descent step:** We slightly tweak the connection weights in the network by taking a step in the opposite direction of the error gradients.

<div style="align:center">
    <img src="media/mlp.png" width=800>
</div>

<hr>

## Forward Pass
In the forward pass, we propagate the inputs in a forward direction, layer-by-layer, until the output is generated. The activation of neuron $i$ in layer $l$ is computed using the following equation:

$$a_i^{l} = f(z_i^{l}) = f(\sum_{j} w_{ij}^{l} a_j^{l-1} + b_i^{l})$$

where $f$ is the activation function, $z_i^{l}$ is the net input of neuron $i$ in layer $l$, $w_{ij}^{l}$ is the connection weight between neuron $j$ in layer $l - 1$ and neuron $i$ in layer $l$, and $b_i^{l}$ is the bias of neuron $i$ in layer $l$.

To simplify the derivation of the learning algorithm, we will treat the bias $b_i^{l}$ as if it were the weight $w_{i0}^{l}$ of an extra neuron $a_0^{l-1}$ that has a constant value of $1$. This enables us to write the above equation as follows:

$$a_i^{l} = f(z_i^{l}) = f(\sum_{j} w_{ij}^{l} a_j^{l-1})$$
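The forward pass with the bias folded into the weights can be sketched in a few lines of plain Python. This is only an illustration: the function names, the nested-list layout `weights[l][i][j]` (with column `j = 0` holding the bias weight), the choice of sigmoid for $f$, and the weight values of the small 2-2-1 network are all assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, x, f=sigmoid):
    """Return the activations of every layer, input layer included.

    weights[l][i][j] is the weight from neuron j of the previous layer to
    neuron i of the current layer; column j = 0 is the bias weight.
    """
    activations = [list(x)]
    for W in weights:
        prev = [1.0] + activations[-1]  # prepend the constant neuron a_0 = 1
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, prev)) for row in W]
        activations.append([f(z_i) for z_i in z])
    return activations

# Hypothetical 2-2-1 network; the weight values are arbitrary.
weights = [
    [[0.1, 0.4, -0.3], [-0.2, 0.5, 0.2]],  # hidden layer: bias + 2 inputs per neuron
    [[0.3, -0.6, 0.8]],                    # output layer: bias + 2 hidden activations
]
activations = forward(weights, [1.0, 0.5])
print(activations[-1])  # network output for this input
```

Keeping the activations of every layer, rather than only the output, is deliberate: the backward pass needs them to compute the gradients.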

<hr>

## Backward Pass
In the backward pass we propagate the gradients of the error from the output layer back to the input layer.

### Definition of the Error and Loss Functions
We first define the error of the network on the training set with respect to its weights. Let’s denote by $w$ the vector that contains all the weights of the network. Assume that we have $n$ training samples $\{(x_i, y_i)\}$, $i = 1, \ldots, n$, and the output of the network on sample $i$ is $o_i$. Then the error of the network with respect to $w$ is:

$$E(w) = \sum_{i=1}^n J(y_i, o_i)$$

where $J(y, o)$ is the loss function. The specific loss function that we use depends on the task the network is trying to accomplish:

1. For regression problems, we use the **squared loss** function:

$$J(y, o) = (y - o)^2$$

2. For binary classification problems, we use **log loss** (also known as the binary cross-entropy loss):

$$J(y, o) = -y \log(o) - (1 - y) \log(1 - o)$$

3. For multi-class classification problems, we use the **cross-entropy** loss function:

$$J_{CE} (y, o) = - \sum_{i=1}^k y_i \log{o_i}$$

where $k$ is the number of classes.
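As a rough illustration, the three loss functions and the total error $E(w)$ can be written directly from the formulas above. The function names are made up for this sketch, and it assumes the predicted probabilities lie strictly between 0 and 1 (a practical implementation would usually clip them to avoid `log(0)`).

```python
import math

def squared_loss(y, o):
    """Regression: (y - o)^2."""
    return (y - o) ** 2

def log_loss(y, o):
    """Binary classification: y is 0 or 1, o is the predicted probability of class 1."""
    return -y * math.log(o) - (1 - y) * math.log(1 - o)

def cross_entropy(y, o):
    """Multi-class: y is a one-hot list, o is a list of predicted probabilities."""
    # Terms with y_i = 0 contribute nothing, so they are skipped.
    return -sum(y_i * math.log(o_i) for y_i, o_i in zip(y, o) if y_i > 0)

def total_error(loss, labels, outputs):
    """E(w): the sum of the per-sample losses over the training set."""
    return sum(loss(y, o) for y, o in zip(labels, outputs))

print(squared_loss(3.0, 2.5))
print(log_loss(1, 0.8))
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
print(total_error(squared_loss, [1.0, 2.0], [0.5, 2.0]))
```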

Our goal is to find the weights $w$ that minimize $E(w).$ Unfortunately, this function is non-convex because of the non-linear activations of the hidden neurons. This means that it may have multiple local minima:

<div style="align:center">
    <img src="media/non-convex.png" width=500>
</div>

### Finding the Gradients of the Error
In order to use gradient descent, we need to compute the partial derivatives of $E(w)$ with respect to each one of the weights in the network:

$$\frac{\partial{E}}{\partial{w_{ij}^{l}}}$$

To simplify the mathematical derivation, we will assume that we have only one training example and find the partial derivatives of the error with respect to that example:

$$\frac{\partial{E}}{\partial{w_{ij}^{l}}} = \frac{\partial{J(y, o)}}{\partial{w_{ij}^{l}}}$$

where $y$ is the label of this example and $o$ is the output of the network for that example. The extension to $n$ training samples is straightforward, since the derivative of the sum of functions is just the sum of their derivatives.

Computing the partial derivatives with respect to the weights in the hidden layers is not trivial, since those weights don’t directly affect the output (and hence the error). To address this problem, we will use the chain rule of derivatives to establish a relationship between the gradients of the error in a given layer and the gradients in the subsequent layer.

### The Delta Terms
We first note that $E$ depends on the weight $w_{ij}^{l}$ only via the net input $z_i^{l}$ of neuron $i$ in layer $l$. Therefore, we can apply the chain rule of derivatives to the gradient of $E$ with respect to this weight:

$$\frac{\partial{E}}{\partial{w_{ij}^{l}}} = \frac{\partial{J(y, o)}}{\partial{z_i^{l}}} \times \frac{\partial{z_i^{l}}}{\partial{w_{ij}^{l}}}$$

The second derivative on the right side of the equation is:

$$\frac{\partial{z_i^{l}}}{\partial{w_{ij}^{l}}} = \frac{\partial{\sum_{k} w_{ik}^{l} a_k^{l-1}}}{\partial{w_{ij}^{l}}} = a_j^{l-1}$$

Therefore, we can write:

$$\frac{\partial{E}}{\partial{w_{ij}^{l}}} = \frac{\partial{J(y, o)}}{\partial{z_i^{l}}} a_j^{l-1} = \delta_i^{l} a_j^{l-1}$$

The variable $\delta_i^{l} = \frac{\partial{J(y, o)}}{\partial{z_i^{l}}}$ is called the delta term of neuron $i$ in layer $l$, or delta for short.
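The factorization of the gradient into a delta term times an input activation can be checked numerically. The sketch below assumes a single sigmoid neuron with squared loss and arbitrary weights and inputs (all names and values are hypothetical). It estimates the delta $\partial J / \partial z$ by finite differences, then compares the product of that delta and each $a_j^{l-1}$ against a direct finite-difference estimate of $\partial J / \partial w_{ij}$.

```python
import math

def loss_from_z(z, y):
    """Squared loss of a sigmoid neuron, written as a function of its net input z."""
    o = 1.0 / (1.0 + math.exp(-z))
    return (y - o) ** 2

def net_input(w, a_prev):
    return sum(w_j * a_j for w_j, a_j in zip(w, a_prev))

def numeric_derivative(g, x, eps=1e-6):
    """Central finite-difference approximation of g'(x)."""
    return (g(x + eps) - g(x - eps)) / (2 * eps)

a_prev = [1.0, 0.3, -0.8]  # activations of the previous layer (a_0 = 1 is the bias input)
w = [0.2, -0.5, 0.7]       # weights into the neuron, arbitrary values
y = 1.0                    # label

z = net_input(w, a_prev)
delta = numeric_derivative(lambda t: loss_from_z(t, y), z)  # dJ/dz

direct = []     # dJ/dw_j estimated by perturbing w_j directly
via_delta = []  # delta * a_j, as the chain rule predicts
for j in range(len(w)):
    def loss_of_w_j(w_j, j=j):
        w_mod = w[:j] + [w_j] + w[j + 1:]
        return loss_from_z(net_input(w_mod, a_prev), y)
    direct.append(numeric_derivative(loss_of_w_j, w[j]))
    via_delta.append(delta * a_prev[j])

for j, (d, v) in enumerate(zip(direct, via_delta)):
    print(f"w_{j}: direct={d:.6f}  delta*a={v:.6f}")
```

The two columns should agree up to the finite-difference error, which is what the chain rule predicts.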

### The Delta Rule
The delta rule establishes the relationship between the delta terms in layer $l$ and the delta terms in layer $l + 1$. To derive the delta rule, we again use the chain rule of derivatives. The loss function depends on the net input of neuron $i$ only via the net inputs of all the neurons it is connected to in layer $l + 1$. Therefore we can write:

$$\delta_i^{l} = \frac{\partial{J(y,o)}}{\partial{z_i^{l}}} = \sum_{j} (\frac{\partial{J(y,o)}}{\partial{z_j^{l+1}}} \frac{\partial{z_j^{l+1}}}{\partial{z_i^{l}}})$$

where the index $j$ in the sum goes over all the neurons in layer $l + 1$ that neuron $i$ in layer $l$ is connected to.

Once again we use the chain rule to decompose the second partial derivative inside the brackets:

$$\delta_i^{l} = \sum_{j} (\frac{\partial{J(y,o)}}{\partial{z_j^{l+1}}} \frac{\partial{z_j^{l+1}}}{\partial{a_i^{l}}} \frac{\partial{a_i^{l}}}{\partial{z_i^{l}}})$$

The first partial derivative inside the brackets is just the delta of neuron $j$ in layer $l + 1$, therefore we can write:

$$\delta_i^{l} = \sum_{j} (\delta_j^{l+1} \frac{\partial{z_j^{l+1}}}{\partial{a_i^{l}}} \frac{\partial{a_i^{l}}}{\partial{z_i^{l}}})$$

The second partial derivative is easy to compute:

$$\frac{\partial{z_j^{l+1}}}{\partial{a_i^{l}}} = \frac{\partial{(\sum_{k} w_{jk}^{l+1} a_k^{l})}}{\partial{a_i^{l}}} = w_{ji}^{l+1}$$

Therefore, we get:

$$\delta_i^{l} = \sum_{j} (\delta_j^{l+1} w_{ji}^{l+1} \frac{\partial{a_i^{l}}}{\partial{z_i^{l}}}) = \frac{\partial{a_i^{l}}}{\partial{z_i^{l}}} \sum_{j} (w_{ji}^{l+1} \delta_j^{l+1})$$

But $a_i^{l} = f(z_i^{l})$, where $f$ is the activation function. Hence, the partial derivative outside the sum is just the derivative of the activation function $f'(x)$ for $x = z_i^{l}$.

Therefore we can write:

$$\delta_i^{l} = f'(z_i^{l}) \sum_{j} (w_{ji}^{l+1} \delta_{j}^{l+1})$$

This equation, known as the **delta rule**, shows the relationship between the deltas in layer $l$ and the deltas in layer $l + 1$. More specifically, each delta in layer $l$ is a linear combination of the deltas in layer $l + 1$, whose coefficients are the connection weights between the two layers, scaled by the derivative of the activation function at $z_i^{l}$. The delta rule allows us to compute all the delta terms (and thus all the gradients of the error) recursively, starting from the deltas in the output layer and going back layer-by-layer until we reach the input layer.
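One backward step of the delta rule might look as follows in plain Python. This is a sketch under assumptions: the function name and argument layout are invented for the example, `W_next[j][i]` holds the weight $w_{ji}^{l+1}$ without a bias column, and the numeric values are arbitrary.

```python
def backpropagate_deltas(W_next, delta_next, f_prime_z):
    """Compute delta_i^l = f'(z_i^l) * sum_j w_ji^{l+1} * delta_j^{l+1}.

    W_next[j][i]  : weight from neuron i in layer l to neuron j in layer l+1
    delta_next[j] : delta of neuron j in layer l+1
    f_prime_z[i]  : derivative of the activation function at z_i^l
    """
    return [
        f_prime_z[i] * sum(W_next[j][i] * delta_next[j] for j in range(len(delta_next)))
        for i in range(len(f_prime_z))
    ]

# Hypothetical layer sizes: 2 neurons in layer l, 3 neurons in layer l+1.
W_next = [[0.5, -1.0], [2.0, 0.25], [0.0, 1.0]]
delta_next = [0.1, -0.2, 0.4]
f_prime_z = [0.25, 0.5]

deltas = backpropagate_deltas(W_next, delta_next, f_prime_z)
print(deltas)
```

For the first neuron the weighted sum is $0.5 \cdot 0.1 + 2.0 \cdot (-0.2) + 0 \cdot 0.4 = -0.35$, which is then scaled by $f'(z) = 0.25$.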

The following diagram illustrates the flow of the error information:

<div style="align:center">
    <img src="media/error_flow.png" width=500>
</div>

For specific activation functions, we can derive more explicit equations for the delta rule. For example, if we use the sigmoid function then:

$$a_i^{l} = \sigma(z_i^{l}) = \frac{1}{1+e^{-z_i^{l}}}$$

The derivative of the sigmoid function has a simple form:

$$\sigma'(x) = \sigma(x)(1-\sigma(x))$$

Hence:

$$\sigma'(z_i^{l}) = \sigma(z_i^{l})(1-\sigma(z_i^{l})) = a_i^{l}(1-a_i^{l})$$

Then the delta rule for the sigmoid function gets the following form:

$$\delta_i^{l} = a_i^{l}(1-a_i^{l}) \sum_{j} (w_{ji}^{l+1} \delta_j^{l+1})$$
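The identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$ is easy to sanity-check against a finite-difference approximation. The snippet below is a small sketch with hypothetical function names and arbitrarily chosen test points.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(g, x, eps=1e-6):
    """Central finite-difference approximation of g'(x)."""
    return (g(x + eps) - g(x - eps)) / (2 * eps)

points = [-2.0, 0.0, 1.5]
differences = [abs(sigmoid_prime(x) - numeric_derivative(sigmoid, x)) for x in points]
for x, diff in zip(points, differences):
    print(f"x={x:+.1f}  sigma'(x)={sigmoid_prime(x):.6f}  |analytic - numeric|={diff:.2e}")
```

In practice this form is convenient because the activations $a_i^{l}$ are already available from the forward pass, so no extra exponentials need to be evaluated during the backward pass.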

### The Deltas in the Output Layer
The final piece of the puzzle is the delta terms in the output layer, which are the first ones that we need to compute.

The deltas in the output layer depend both on the loss function and the activation function used in the output neurons:

$$\delta^{L} = \frac{\partial{J(y,o)}}{\partial{z^{L}}} = \frac{\partial{J(y,o)}}{\partial{o}} \frac{\partial{o}}{\partial{z^{L}}} = \frac{\partial{J(y,o)}}{\partial{o}} \frac{\partial{f(z^{L})}}{\partial{z^{L}}} = \frac{\partial{J(y,o)}}{\partial{o}} f'(z^{L})$$

where $f$ is the activation function used to compute the output.

Let's now derive more specific delta terms for each type of learning task:

1. In regression problems, the activation function we use in the output is the identity function $f(x) = x$, whose derivative is $1$, and the loss function is the squared loss. Therefore the delta is:

$$\delta^{L} = \frac{\partial{(y-o)^2}}{\partial{o}} = -2(y-o) = 2(o-y)$$

2. In binary classification problems, the activation function we use is sigmoid and the loss function is log loss, therefore we get:

$$\delta^{L} = \frac{\partial{(-y \log(o) - (1-y) \log(1-o))}}{\partial{o}} \sigma'(z^{L})$$

$$ = (- \frac{y}{o} + \frac{1-y}{1-o}) \sigma(z^{L}) (1-\sigma(z^{L}))$$

$$ = (- \frac{y}{o} + \frac{1-y}{1-o}) o(1-o) = -y(1-o)+(1-y)o = o-y$$

3. In multiclass classification problems, we have $k$ output neurons (where $k$ is the number of classes) and we use the softmax activation and the cross-entropy loss. As in the previous case, the delta term of the $i$th output neuron is surprisingly simple:

$$\delta_i^{L} = o_i - y_i$$
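The result $o - y$ for the classification cases can be verified numerically by differentiating the loss with respect to the net inputs of the output layer. The sketch below does this for sigmoid with log loss and for softmax with cross-entropy. The helper names, net inputs and labels are arbitrary choices for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(y, o):
    return -y * math.log(o) - (1 - y) * math.log(1 - o)

def cross_entropy(y, o):
    return -sum(y_i * math.log(o_i) for y_i, o_i in zip(y, o))

def numeric_derivative(g, x, eps=1e-6):
    """Central finite-difference approximation of g'(x)."""
    return (g(x + eps) - g(x - eps)) / (2 * eps)

# Binary classification: sigmoid output with log loss.
z_bin, y_bin = 0.7, 1.0
delta_bin_analytic = sigmoid(z_bin) - y_bin
delta_bin_numeric = numeric_derivative(lambda t: log_loss(y_bin, sigmoid(t)), z_bin)
print(delta_bin_analytic, delta_bin_numeric)

# Multi-class classification: softmax output with cross-entropy loss.
z = [1.0, -0.5, 2.0]
y = [0.0, 0.0, 1.0]  # one-hot label
o = softmax(z)
delta_analytic = [o_i - y_i for o_i, y_i in zip(o, y)]
delta_numeric = []
for i in range(len(z)):
    def loss_of_z_i(t, i=i):
        return cross_entropy(y, softmax(z[:i] + [t] + z[i + 1:]))
    delta_numeric.append(numeric_derivative(loss_of_z_i, z[i]))
print(delta_analytic)
print(delta_numeric)
```

Note that in the softmax case every net input $z_i^{L}$ affects all $k$ outputs, so the numerical check perturbs one net input at a time and recomputes the full softmax.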

<hr>

## Gradient Descent
Once we finish computing all the delta terms, we can use gradient descent to update the weights. In gradient descent, we take small steps in the opposite direction of the gradient (i.e., in the direction of the steepest descent) in order to get closer to the minimum error:

<div style="align:center">
    <img src="media/gradient_descent.png" width=500>
</div>

Remember that the partial derivative of the error function with respect to each weight is:

$$\frac{\partial{E}}{\partial{w_{ij}^{l}}} = \delta_i^{l} a_j^{l-1}$$

Therefore, we can write the gradient descent update rule as follows:

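$$w_{ij}^{l} \leftarrow w_{ij}^{l} - \alpha \delta_i^{l} a_j^{l-1}$$

where $\alpha$ is the learning rate, a small positive constant that controls the size of each step.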
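Putting the three phases together, a minimal training loop could look like the sketch below. It is an illustration under several assumptions rather than a reference implementation: a 2-3-1 network with sigmoid activations in every layer, log loss, the XOR data set, full-batch updates, a learning rate of `alpha = 0.1`, and random initial weights drawn with a fixed seed. All function and variable names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, x):
    """Forward pass; column 0 of every weight row is the bias weight."""
    activations = [list(x)]
    for W in weights:
        prev = [1.0] + activations[-1]
        activations.append([sigmoid(sum(w * a for w, a in zip(row, prev))) for row in W])
    return activations

def gradients(weights, x, y):
    """Backward pass for one example: dJ/dw with the same shape as weights."""
    activations = forward(weights, x)
    # Output layer with sigmoid + log loss: delta = o - y.
    delta = [o - y_i for o, y_i in zip(activations[-1], y)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        prev = [1.0] + activations[l]
        grads[l] = [[d * a for a in prev] for d in delta]  # delta_i * a_j
        if l > 0:
            # Delta rule with the sigmoid derivative a(1 - a); index i + 1 skips the bias column.
            delta = [
                a * (1.0 - a) * sum(weights[l][j][i + 1] * delta[j] for j in range(len(delta)))
                for i, a in enumerate(activations[l])
            ]
    return grads

def total_error(weights, data):
    error = 0.0
    for x, y in data:
        o = forward(weights, x)[-1][0]
        error += -y[0] * math.log(o) - (1 - y[0]) * math.log(1 - o)
    return error

random.seed(0)
sizes = [2, 3, 1]
weights = [
    [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_out)]
    for n_in, n_out in zip(sizes, sizes[1:])
]
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR
alpha = 0.1

initial_error = total_error(weights, data)
for epoch in range(2000):
    # Sum the per-example gradients, since E(w) is the sum of the per-example losses.
    total = [[[0.0] * len(row) for row in W] for W in weights]
    for x, y in data:
        for l, G in enumerate(gradients(weights, x, y)):
            for i, row in enumerate(G):
                for j, g in enumerate(row):
                    total[l][i][j] += g
    # Gradient descent step: w <- w - alpha * dE/dw.
    for l, W in enumerate(weights):
        for i, row in enumerate(W):
            for j in range(len(row)):
                row[j] -= alpha * total[l][i][j]
final_error = total_error(weights, data)
print(initial_error, final_error)
```

The explicit loops mirror the index notation of the derivation. A practical implementation would normally express the same computations as matrix operations and might use mini-batches instead of the full training set.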