# Calculus of Backpropagation

Let's start with an extremely simple network in which each layer has just one neuron.
This particular network is determined by 3 weights and 3 biases, and our goal is to
understand how sensitive the Cost Function is to these variables.

$$ C(w_1, b_1, w_2, b_2, w_3, b_3) $$

Let's label the activations of the neurons as:

$ a^{(L-3)} \to a^{(L-2)} \to a^{(L-1)} \to a^{(L)} \to y $

where $ y $ is the desired output:

![extremely simple nn](img/extreamly_simple_nn.jpg)

So the cost of this simple network for a single training example is:

$ C_0(\dots) = (a^{(L)} - y)^2 $

where:

$ a^{(L)} = \sigma(z^{(L)}) $

$ z^{(L)} = w^{(L)}a^{(L-1)}+b^{(L)} $
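The three definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the numeric values are hypothetical, and the sigmoid is assumed as the activation function.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, the activation assumed in these notes."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, a_prev):
    """Single neuron: z = w * a_prev + b, a = sigmoid(z). Returns (z, a)."""
    z = w * a_prev + b
    return z, sigmoid(z)

def cost(a, y):
    """Squared-error cost C_0 for one training example."""
    return (a - y) ** 2

# Hypothetical values, chosen only for illustration.
w_L, b_L, a_prev, y = 0.5, -0.2, 0.8, 1.0
z_L, a_L = forward(w_L, b_L, a_prev)
c0 = cost(a_L, y)
print(z_L, a_L, c0)
```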

All of these are just numbers, and we can think of each one as having its own number line.
Our first goal is to understand how sensitive the Cost Function is to small changes
in $ w^{(L)} $. In other words, what is the derivative of $ C $ with respect to $ w^{(L)} $?

$$ \frac{\partial C}{\partial w^{(L)}} $$

Think of the $ \partial w^{(L)} $ term as a tiny nudge to $ w^{(L)} $, like a change of 0.01,
and of the $ \partial C $ term as the resulting nudge to the cost. What we want is their ratio.
Conceptually, this tiny nudge to $ w^{(L)} $ causes a nudge to $ z^{(L)} $, which in turn causes
a nudge to $ a^{(L)} $, which directly influences the cost. Multiplying those
3 ratios together gives us the sensitivity of $ C $ to small changes in $ w^{(L)} $.

![chain rule](img/chain_rule.jpg)
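The "nudge" picture can be tried out numerically. The sketch below assumes a sigmoid activation and hypothetical values for the weight, bias, previous activation and target. It nudges the weight by a small `dw`, measures the three ratios separately, and compares their product with the overall ratio of the cost change to the weight change.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, chosen only for illustration.
w, b, a_prev, y = 0.5, -0.2, 0.8, 1.0
dw = 1e-6  # the tiny nudge to the weight

# Quantities before the nudge.
z0 = w * a_prev + b
a0 = sigmoid(z0)
c0 = (a0 - y) ** 2

# Quantities after the nudge.
z1 = (w + dw) * a_prev + b
a1 = sigmoid(z1)
c1 = (a1 - y) ** 2

# The three ratios from the chain rule picture.
dz_dw = (z1 - z0) / dw          # nudge to z per nudge to w
da_dz = (a1 - a0) / (z1 - z0)   # nudge to a per nudge to z
dc_da = (c1 - c0) / (a1 - a0)   # nudge to C per nudge to a

product = dz_dw * da_dz * dc_da
direct = (c1 - c0) / dw         # nudge to C per nudge to w, measured directly
print(product, direct)
```

The two printed numbers agree up to floating-point rounding, because the intermediate nudges cancel when the ratios are multiplied.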

Now we have to find the derivatives for each.
Derivative of:

$ C_0 = (a^{(L)} - y)^{2} $

is:

$ \frac{\partial C_0}{\partial a^{(L)}} = 2(a^{(L)} - y) \frac{\partial}{\partial a^{(L)}} [a^{(L)} - y] = 2(a^{(L)} - y) $

This means that its size is proportional to the difference between the network's output and the
value we want it to be ($ y $). So if the output is very different from $ y $, even slight changes
stand to have a big impact on the Cost Function.

Derivative of:

$$ a^{(L)} = \sigma(z^{(L)}) $$

For simplicity we assume the activation function is a sigmoid. It could just as well be a
different function, for example a ReLU, and then this step would use the derivative of that
function instead. For the sigmoid we get:

$$ \frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)}) = \frac{e^{-z^{(L)}}}{\left(1 + e^{-z^{(L)}}\right)^{2}} = \sigma(z^{(L)})\left(1 - \sigma(z^{(L)})\right) $$
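As a quick sanity check, the sigmoid derivative can be written as $ \sigma(z)(1 - \sigma(z)) $ and compared against a finite-difference estimate. The helper names below are hypothetical. The ReLU derivative is included only to show what would change with a different activation, and its value at exactly zero is a convention chosen here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, written as sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    """Derivative of ReLU; the value at z == 0 is set to 0 by convention here."""
    return 1.0 if z > 0 else 0.0

z = 0.7   # arbitrary test point
h = 1e-6  # step for the central finite difference
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(sigmoid_prime(z), numeric)
```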

Derivative of:

$$ z^{(L)} = b^{(L)} + a^{(L-1)}w^{(L)} $$

is just:

$$ \frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)} $$

Note that putting it all together gives the derivative with respect to $ w^{(L)} $
of the cost for a single training example only:

$$ \frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \cdot \sigma'(z^{(L)}) \cdot 2(a^{(L)} - y) $$
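The closed-form expression is easy to check against a numerical derivative. The following is a minimal sketch assuming a sigmoid activation; the function names and the sample values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, a_prev, y):
    """C_0 for one training example of the single-neuron layer."""
    return (sigmoid(w * a_prev + b) - y) ** 2

def dC0_dw(w, b, a_prev, y):
    """Analytic derivative: a_prev * sigma'(z) * 2 * (a - y)."""
    a = sigmoid(w * a_prev + b)
    return a_prev * a * (1.0 - a) * 2.0 * (a - y)

# Hypothetical values, chosen only for illustration.
w, b, a_prev, y = 0.5, -0.2, 0.8, 1.0
analytic = dC0_dw(w, b, a_prev, y)

# Central finite difference as an independent check.
h = 1e-6
numeric = (cost(w + h, b, a_prev, y) - cost(w - h, b, a_prev, y)) / (2 * h)
print(analytic, numeric)
```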

Since the full Cost Function involves averaging the costs across many training examples,
its derivative requires averaging this expression over all training examples.

$$ \frac{\partial C}{\partial w^{(L)}} = \frac{1}{n}\sum_{k=0}^{n-1}\frac{\partial C_k}{\partial w^{(L)}} $$
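In code, the average is just a loop over the training examples. The sketch below assumes a sigmoid activation and a made-up list of `(a_prev, y)` pairs standing in for the training set; all names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def average_cost(w, b, examples):
    """Full cost C: the mean of the per-example costs C_k."""
    return sum((sigmoid(w * a_prev + b) - y) ** 2
               for a_prev, y in examples) / len(examples)

def dC_dw(w, b, examples):
    """Mean of the per-example derivatives dC_k/dw."""
    total = 0.0
    for a_prev, y in examples:
        a = sigmoid(w * a_prev + b)
        total += a_prev * a * (1.0 - a) * 2.0 * (a - y)
    return total / len(examples)

# Made-up training examples: (previous activation, desired output).
examples = [(0.8, 1.0), (0.1, 0.0), (0.5, 1.0)]
avg_grad = dC_dw(0.5, -0.2, examples)
print(avg_grad)
```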

And that is just one component of the gradient vector, which itself is built from the partial
derivatives of the Cost Function with respect to all the weights and biases:

$$ \nabla C = \begin{bmatrix} \frac{\partial C}{\partial w^{(1)}} \\ \frac{\partial C}{\partial b^{(1)}} \\ \dots \\
\frac{\partial C}{\partial w^{(L)}} \\ \frac{\partial C}{\partial b^{(L)}} \end{bmatrix} $$

Now we need to do the same for the sensitivity to the biases. We just need to change out
the $ \frac{\partial C}{\partial w^{(L)}} $ term for a $ \frac{\partial C}{\partial b^{(L)}} $
term.

$$ z^{(L)} = b^{(L)} + a^{(L-1)}w^{(L)} $$

From this formula, the derivative of $ z^{(L)} $ with respect to $ b^{(L)} $ is just 1:

$$ \frac{\partial C_0}{\partial b^{(L)}} = \frac{\partial z^{(L)}}{\partial b^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}} = 1 \cdot \sigma'(z^{(L)}) \cdot 2(a^{(L)} - y) $$
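The weight and bias derivatives share the factor $ \sigma'(z^{(L)}) \cdot 2(a^{(L)} - y) $, so it only needs to be computed once. A small sketch, assuming a sigmoid activation; the name `delta` for the shared factor and the sample values are choices made here, not something from the text above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def last_layer_grads(w, b, a_prev, y):
    """Return (dC_0/dw, dC_0/db) for the single-neuron last layer."""
    a = sigmoid(w * a_prev + b)
    delta = a * (1.0 - a) * 2.0 * (a - y)  # shared factor: sigma'(z) * 2(a - y)
    grad_w = a_prev * delta                # dz/dw = a_prev
    grad_b = 1.0 * delta                   # dz/db = 1
    return grad_w, grad_b

# Hypothetical values, chosen only for illustration.
w, b, a_prev, y = 0.5, -0.2, 0.8, 1.0
grad_w, grad_b = last_layer_grads(w, b, a_prev, y)
print(grad_w, grad_b)
```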

This is also where the idea of propagating backwards comes in. We can see how sensitive the Cost Function
is to the activation of the previous layer. Namely, the first derivative in the chain rule expression is now $ \frac{\partial z^{(L)}}{\partial a^{(L-1)}} $, the sensitivity of $ z^{(L)} $ to the previous activation,
and it comes out to be the weight $ w^{(L)} $. Even though we can't directly influence that previous layer's
activation, it's helpful to keep track of it, because now we can just keep iterating the same chain rule
idea backwards to see how sensitive the Cost Function is to previous weights and previous biases.
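The iteration can be sketched for the whole one-neuron-per-layer chain. This is a minimal illustration assuming sigmoid activations throughout. The list-based layout, the function names and the numeric values are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(ws, bs, x):
    """Return the list of activations, starting with the input x."""
    acts = [x]
    for w, b in zip(ws, bs):
        acts.append(sigmoid(w * acts[-1] + b))
    return acts

def cost(ws, bs, x, y):
    return (forward(ws, bs, x)[-1] - y) ** 2

def backward(ws, bs, x, y):
    """Walk the chain rule from the last layer back to the first."""
    acts = forward(ws, bs, x)
    grads_w = [0.0] * len(ws)
    grads_b = [0.0] * len(bs)
    dC_da = 2.0 * (acts[-1] - y)           # sensitivity to the output activation
    for l in reversed(range(len(ws))):
        a = acts[l + 1]
        delta = dC_da * a * (1.0 - a)      # dC/dz, using sigma'(z) = a(1 - a)
        grads_w[l] = acts[l] * delta       # dz/dw is the previous activation
        grads_b[l] = delta                 # dz/db is 1
        dC_da = ws[l] * delta              # dz/da_prev is the weight
    return grads_w, grads_b

# Hypothetical 3-weight, 3-bias network and one training example.
ws, bs = [0.5, -1.2, 0.7], [-0.2, 0.3, 0.1]
x, y = 0.8, 1.0
grads_w, grads_b = backward(ws, bs, x, y)
print(grads_w, grads_b)
```

Each pass through the loop reuses the sensitivity to the activation computed one layer later, which is why that quantity is worth tracking even though the activation itself cannot be adjusted directly.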

When we scale up to a more complex, realistic example, it does not get much more complicated.
We just need to add and keep track of a few more indices. Rather than the activation of some
layer being just $ a^{(L)} $, it's going to have a subscript indicating which neuron of that layer it is.

$ a_{0}^{(L-1)} ;  a_{1}^{(L-1)} ; a_{2}^{(L-1)} $

are all activations from neurons in the $ L-1 $ layer.

Let's use the letter $ k $ to index the layer $ L-1 $ and the letter $ j $ to index the layer $ L $.
For the cost we again look at the desired output, but this time we add up the squares of the
differences between the last layer's activations and the desired outputs.

$$ C_{0} = \sum_{j=0}^{n_{L}-1} (a_{j}^{(L)} - y_{j})^{2} $$

Let's call the weight of the edge connecting the $ k $-th neuron to the $ j $-th neuron:

$ w_{jk}^{(L)} $

![multiple layered nn indexed](img/multiple_layer_nn_indexing.jpg)

The relevant weighted sum is now:

$$ z_{j}^{(L)} = w_{j0}^{(L)}a_{0}^{(L-1)} + w_{j1}^{(L)}a_{1}^{(L-1)} + w_{j2}^{(L)}a_{2}^{(L-1)} + b_{j}^{(L)} $$

The activation of the last layer is just the activation function applied to $ z $:

$$ a_{j}^{(L)} = \sigma(z_{j}^{(L)}) $$

The chain rule derivative expression looks essentially the same:

$$ \frac{\partial C_0}{\partial w_{jk}^{(L)}} = \frac{\partial z_{j}^{(L)}}{\partial w_{jk}^{(L)}} \frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}} \frac{\partial C_{0}}{\partial a_{j}^{(L)}} $$

What changes here is the derivative with respect to one of the activations in layer $ L-1 $.
The difference is that this neuron influences the Cost Function through multiple
paths. On one hand it influences $ a_{0}^{(L)} $, which plays a role in the Cost Function, but
it also influences $ a_{1}^{(L)} $, which also plays a role in the Cost Function, and you
have to add those up:

$$ \frac{\partial C_0}{\partial a_{k}^{(L-1)}} = \sum_{j=0}^{n_{L}-1} \frac{\partial z_{j}^{(L)}}{\partial a_{k}^{(L-1)}} \frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}} \frac{\partial C_{0}}{\partial a_{j}^{(L)}} $$
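The indexed formulas translate directly into nested loops. Below is a sketch for one fully connected last layer, assuming sigmoid activations and plain Python lists, with `W[j][k]` as the weight from neuron `k` in layer $ L-1 $ to neuron `j` in layer $ L $. The layer sizes, names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(W, b, a_prev, y):
    """C_0: sum of squared differences over the output neurons."""
    total = 0.0
    for j in range(len(W)):
        z_j = sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
        total += (sigmoid(z_j) - y[j]) ** 2
    return total

def last_layer_backward(W, b, a_prev, y):
    n_out, n_in = len(W), len(a_prev)
    z = [sum(W[j][k] * a_prev[k] for k in range(n_in)) + b[j] for j in range(n_out)]
    a = [sigmoid(z_j) for z_j in z]
    # Per output neuron j: sigma'(z_j) * 2(a_j - y_j).
    delta = [a[j] * (1.0 - a[j]) * 2.0 * (a[j] - y[j]) for j in range(n_out)]
    # dC_0/dw_jk = a_k^(L-1) * delta_j (a single path through neuron j).
    grad_W = [[a_prev[k] * delta[j] for k in range(n_in)] for j in range(n_out)]
    grad_b = list(delta)
    # dC_0/da_k^(L-1): sum over every output neuron j that a_k feeds into.
    grad_a_prev = [sum(W[j][k] * delta[j] for j in range(n_out)) for k in range(n_in)]
    return grad_W, grad_b, grad_a_prev

# Hypothetical layer: 3 neurons in layer L-1, 2 neurons in layer L.
W = [[0.2, -0.5, 0.1], [0.4, 0.3, -0.7]]
b = [0.1, -0.1]
a_prev = [0.9, 0.3, 0.6]
y = [1.0, 0.0]
grad_W, grad_b, grad_a_prev = last_layer_backward(W, b, a_prev, y)
print(grad_a_prev)
```

The sum over `j` in `grad_a_prev` is the only structural difference from the single-neuron case.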

Once you know how sensitive the Cost Function is to the activations in this second-to-last layer,
you can just repeat the process for all the weights and biases feeding into that layer.

These chain rule expressions give you the derivatives that determine each component of the
gradient, which helps minimize the cost of the network by repeatedly stepping downhill.
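"Stepping downhill" can be illustrated with plain gradient descent on the single-neuron layer. This sketch assumes a sigmoid activation, a single made-up training example, and a hypothetical learning rate of 0.5. None of these values come from the text above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_grads(w, b, a_prev, y):
    """Return (C_0, dC_0/dw, dC_0/db) for one training example."""
    a = sigmoid(w * a_prev + b)
    delta = a * (1.0 - a) * 2.0 * (a - y)
    return (a - y) ** 2, a_prev * delta, delta

# Hypothetical starting point, training example and learning rate.
w, b = 0.5, -0.2
a_prev, y = 0.8, 1.0
learning_rate = 0.5

costs = []
for _ in range(100):
    c, grad_w, grad_b = cost_and_grads(w, b, a_prev, y)
    costs.append(c)
    # Move each parameter a small step against its gradient component.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(costs[0], costs[-1])
```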

![summary](img/summary.jpg)