# Backpropagation

Backpropagation is fundamental to how neural networks learn. At a high level, we take the error at the output and propagate it backwards through the network, using it to update the weights of each layer.

To update the weights into the hidden layers using gradient descent, you need to know how much each hidden unit contributed to the error in the final output. Since the output of a layer is determined by the weights between layers, the error contributed by each unit is scaled by the weights as it passes forward through the network. Since we know the error at the output, we can use the same weights to work backwards and assign errors to the hidden layers.

For example, in the output layer, you have an error term $\delta_k^o$ attributed to each output unit $k$. The error term attributed to hidden unit $j$ is then the sum of the output error terms, each scaled by the weight between hidden unit $j$ and output unit $k$, multiplied by the gradient of the activation function at the hidden unit's input $h_j$:

$$
\delta_j^h = \sum_k W_{jk} \delta_k^o f'(h_j)
$$
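
As a rough sketch, the hidden error terms above could be computed in plain Python as follows. The function name `hidden_error_terms`, the nested-list layout `W[j][k]`, the choice of a sigmoid activation, and all numeric values are assumptions made for illustration only.

```python
import math

def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid: f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def hidden_error_terms(W, delta_o, h):
    """Compute delta_j^h = sum_k W[j][k] * delta_o[k] * f'(h[j]).

    W[j][k]    : weight from hidden unit j to output unit k (assumed layout)
    delta_o[k] : error term of output unit k
    h[j]       : input to hidden unit j (before the activation)
    """
    return [
        sum(W[j][k] * delta_o[k] for k in range(len(delta_o))) * sigmoid_prime(h[j])
        for j in range(len(h))
    ]

# Hypothetical values: two hidden units, two output units
W = [[0.1, -0.3],
     [0.4, 0.2]]
delta_o = [0.05, -0.02]
h = [0.0, 0.5]

delta_h = hidden_error_terms(W, delta_o, h)
print(delta_h)
```

Because $f'(h_j)$ does not depend on $k$, it is factored out of the sum here, which gives the same result as the formula.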

Then, the gradient descent step is the same as before, just with the new errors:
$$
\Delta w_{ij} = \eta \delta_j^h x_i
$$
where $w_{ij}$ are the weights between the inputs and the hidden layer, $x_i$ are the input unit values, and $\eta$ is the step size. The same form holds for any layer: the weight step equals the step size times the error term of the layer's output units times the values of the inputs to that layer.
$$
\Delta w_{pq} = \eta \delta_{output} V_{in}
$$
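
The weight step can be sketched in the same way. This is a minimal illustration, assuming the weights are stored as `steps[i][j]` for input `i` and unit `j`; the helper name `weight_steps` and the numbers are hypothetical.

```python
def weight_steps(learnrate, delta, inputs):
    """Return steps[i][j] = learnrate * delta[j] * inputs[i].

    learnrate : step size (eta)
    delta     : error terms of the units the weights feed into
    inputs    : values of the inputs to that layer
    """
    return [[learnrate * d * x for d in delta] for x in inputs]

# Hypothetical values: three inputs feeding two hidden units
learnrate = 0.5
delta_h = [0.01, -0.02]
x = [0.5, 0.1, -0.2]

steps = weight_steps(learnrate, delta_h, x)
for row in steps:
    print(row)
```

Each row of `steps` holds the updates for the weights leaving one input, so the result has the same shape as the weight matrix it would be added to.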

Here is a worked example for a small network with one hidden unit and one output unit, both with sigmoid activations. <img src="./backprop-network.png" width=200 height=400/>



Assume we're trying to fit some binary data and the target is $y = 1$. We'll start with the forward pass, first calculating the **input** to the hidden unit:

$h = \sum_i w_i x_i = 0.1 \times 0.4 + 0.3 \times (-0.2) = -0.02$

And the **output** is:

$a = f(h) = \text{sigmoid}(-0.02) = 0.495$

Using this as the input to the output unit, the output of the network is:

$\hat{y} = f(W \cdot a) = \text{sigmoid}(0.1 \times 0.495) = 0.512$
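
The forward pass above can be checked with a few lines of Python. This is only a sketch: which of the two numbers in each product is the input and which is the weight is an assumption here (the sum is the same either way), and the variable names are chosen for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

x = [0.1, 0.3]    # assumed input values
w = [0.4, -0.2]   # assumed input-to-hidden weights
W = 0.1           # hidden-to-output weight

# Input to the hidden unit: weighted sum of the inputs
h = sum(w_i * x_i for w_i, x_i in zip(w, x))

# Output of the hidden unit
a = sigmoid(h)

# Output of the network
y_hat = sigmoid(W * a)

print(round(h, 3), round(a, 3), round(y_hat, 3))  # -0.02 0.495 0.512
```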

With the network output, we can start the backwards pass to calculate the weight updates for both layers. Using the fact that for the sigmoid function $f'(W \cdot a) = f(W \cdot a)(1 - f(W \cdot a))$, the error term for the output unit is:

$\delta^o = (y - \hat{y})f'(W \cdot a) = (1 - 0.512) \times 0.512 \times (1 - 0.512) = 0.122$

Now we need to calculate the error term for the hidden unit with backpropagation. Here we'll scale the error term from the output unit by the weight $W$ connecting it to the hidden unit. In general, the hidden unit error term is $\delta_j^h = \sum_k W_{jk} \delta_k^o f'(h_j)$, but since we have one hidden unit and one output unit, the sum reduces to a single term:

$\delta^h = W \delta^o f'(h) = 0.1 \times 0.122 \times 0.495 \times (1 - 0.495) = 0.003$
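
The backward pass for this example can be sketched as follows. The forward pass is repeated so the snippet runs on its own; as before, the split of the numbers into inputs `x` and weights `w` is an assumption, and the variable names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

x = [0.1, 0.3]    # assumed input values
w = [0.4, -0.2]   # assumed input-to-hidden weights
W = 0.1           # hidden-to-output weight
y = 1             # target

# Forward pass
h = sum(w_i * x_i for w_i, x_i in zip(w, x))
a = sigmoid(h)
y_hat = sigmoid(W * a)

# Output error term: (y - y_hat) * f'(W * a), with f' = f * (1 - f) for the sigmoid
delta_o = (y - y_hat) * y_hat * (1 - y_hat)

# Hidden error term: output error scaled by W, times f'(h) = a * (1 - a)
delta_h = W * delta_o * a * (1 - a)

print(round(delta_o, 3), round(delta_h, 3))  # 0.122 0.003
```

Note how much smaller the hidden error term is than the output error term: it is scaled down both by the weight $W$ and by the sigmoid derivative, which is at most 0.25.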