# Back Prop Derivation

This notebook contains a mathematical derivation of the back propagation algorithm.
The derivation includes a vectorized implementation and accounts for back propagation
over multiple training samples.

For a full theoretical deep dive into back propagation, refer to [this article](http://neuralnetworksanddeeplearning.com/chap2.html).

## Intermediary Errors

The back propagation algorithm can be summarized in 4 equations. The equations utilize an intermediary error, $\delta^l_j$, the derivative of the cost with respect to the weighted input $z^l_j$ of node *j* in layer *l*, which makes the remaining derivatives easier to compute.

The intermediary errors are computed as follows:

$$\delta^L_j = \frac {\partial C} { \partial a^L_j } \sigma ' (z^L_j) $$

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma ' (z^l) $$

* Capital *C* denotes the overall cost function for the network

* Capital *L* denotes the last layer of the neural network while lowercase *l* denotes an arbitrary layer

* Subscript *j* denotes the jth node in a particular layer, *l*.

* $\odot$ denotes the Hadamard product, which is an element-wise product of two vectors.
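
The two definitions above can be sketched in NumPy. This is a minimal illustration, not the notebook's implementation: it assumes a sigmoid activation and a quadratic cost $C = \frac{1}{2} \lVert a^L - y \rVert^2$, and every array value and name (`z_prev`, `z_last`, `w_last`, `y`) is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical values: last layer L has 2 nodes, layer L-1 has 3 nodes.
z_prev = np.array([0.5, -1.0, 2.0])       # z^{L-1}
z_last = np.array([0.3, -0.7])            # z^L
w_last = np.array([[0.1, -0.2, 0.4],
                   [0.7, 0.5, -0.3]])     # w^L, shape (2, 3)
y = np.array([1.0, 0.0])                  # target output (assumed)

a_last = sigmoid(z_last)
grad_a = a_last - y                       # dC/da^L for the assumed quadratic cost

# Error of the last layer: element-wise (Hadamard) product.
delta_last = grad_a * sigmoid_prime(z_last)

# Error of layer L-1: propagate delta^L back through w^L, then Hadamard product.
delta_prev = (w_last.T @ delta_last) * sigmoid_prime(z_prev)
```

In NumPy the `*` operator on two arrays of equal shape is the element-wise product, so it plays the role of $\odot$, while `@` is the matrix product.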

## The Four Main Equations

The four main equations for computing back propagation are used to calculate the partial derivatives of all parameters in the neural network.


**1: Error derivative of last layer:**

$$\delta^L = \nabla_a C \odot \sigma ' (z^L) $$

**2: Error derivative of an arbitrary layer l:**

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma ' (z^l) $$

**3: Partial derivative of cost with respect to bias j in layer l**

$$ \frac {\partial C} {\partial b^l_j} = \delta^l_j $$

**4: Partial derivative of cost with respect to weight j,k in layer l**

$$ \frac {\partial C} {\partial w^l_{jk} } = a^{l-1}_k \delta ^l_j$$
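
As a rough sketch of how the four equations fit together for a single training sample, the function below runs a forward pass and then applies equations 1 to 4 from the last layer backwards. It assumes sigmoid activations and a quadratic cost $C = \frac{1}{2} \lVert a^L - y \rVert^2$; the function name `backprop`, the layer sizes, and the random parameters are all hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Gradients of the assumed quadratic cost for one sample (x, y)."""
    # Forward pass: store every weighted input z^l and activation a^l.
    a = x
    activations = [a]
    zs = []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)

    # Equation 1: error of the last layer (a^L - y is dC/da^L here).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta                                  # Equation 3
    grad_w[-1] = np.outer(delta, activations[-2])       # Equation 4

    # Equation 2: move the error back one layer at a time.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta                              # Equation 3
        grad_w[-l] = np.outer(delta, activations[-l - 1])  # Equation 4
    return grad_w, grad_b

# Hypothetical 3-4-2 network with random parameters.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
x = rng.standard_normal(sizes[0])
y = np.array([1.0, 0.0])

grad_w, grad_b = backprop(weights, biases, x, y)
```

`np.outer(delta, a_prev)` builds the whole matrix with entries $a^{l-1}_k \delta^l_j$ at position $(j, k)$, so each gradient has the same shape as the weight matrix it belongs to.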

### Deriving Equation 1

$$\delta^L = \nabla_a C \odot \sigma ' (z^L) $$

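We want to show that $\delta^L_j = \frac {\partial C} { \partial a^L_j } \sigma ' (z^L_j)$.

The error of node *j* is the derivative of the cost with respect to its weighted input, so by the chain rule $\delta^L_j = \frac {\partial C} {\partial z_j^L} = \sum_k \frac {\partial C} {\partial a_k^L } \frac { \partial a_k^L } {\partial z_j^L} $ where *k* represents the *kth* node in layer L.

Note that when $j \neq k$, the expression $ \frac { \partial a_k^L } {\partial z_j^L} $ becomes 0 since $a_k^L$ is a function of only $z_k^L$.

Only the $k = j$ term survives, so the above equation reduces to $\delta^L_j = \frac {\partial C} {\partial a_j^L } \frac { \partial a_j^L } {\partial z_j^L} = \frac {\partial C} {\partial a_j^L } \sigma ' (z^L_j)$, since $a_j^L = \sigma(z_j^L)$.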

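We can vectorize this equation as follows: $\delta^L = (\delta_1^L, \delta_2^L, ..., \delta_{n_L}^L) = \nabla_a C \odot \sigma ' (z^L)$ where $n_L$ is the number of nodes in layer *L* and $\nabla_a C$ is the vector of partial derivatives $\frac {\partial C} {\partial a_j^L}$.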
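
The step where the sum over *k* collapses can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a sigmoid activation, and the vectors `z` and `grad_a` are hypothetical values standing in for $z^L$ and $\nabla_a C$. It approximates every $\frac{\partial a_k^L}{\partial z_j^L}$ by central differences and compares the full sum with the Hadamard form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([0.3, -0.7, 1.2])         # hypothetical z^L
grad_a = np.array([0.2, -0.5, 0.1])    # hypothetical dC/da^L
eps = 1e-6

# jacobian[k, j] approximates d a_k / d z_j by central differences.
jacobian = np.zeros((z.size, z.size))
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    jacobian[:, j] = (sigmoid(z + step) - sigmoid(z - step)) / (2 * eps)

# Full chain-rule sum over k for every j.
delta_sum = jacobian.T @ grad_a

# Collapsed form: only the k = j terms remain.
delta_hadamard = grad_a * sigmoid_prime(z)

print(np.round(jacobian, 6))           # off-diagonal entries are zero
print(np.allclose(delta_sum, delta_hadamard, atol=1e-8))
```

Because the activation is applied element-wise, the Jacobian is diagonal, which is why the matrix-vector product reduces to an element-wise product.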




### Deriving Equation 2

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma ' (z^l) $$
