# Back Prop Derivation

This notebook contains a mathematical derivation of the back propagation algorithm.
The derivation includes a vectorized implementation and accounts for back propagation
over multiple training samples.

For a full theoretical deep dive into back propagation, please refer to [this article](http://neuralnetworksanddeeplearning.com/chap2.html).

## Intermediary Error Derivatives

The back propagation algorithm can be summarized in four equations. The equations utilize an intermediary error derivative, $\delta$, which makes the remaining derivatives easier to compute.

The intermediary errors are defined as follows:

$$\delta^L_j = \frac {\partial C} { \partial z^L_j } = \frac {\partial C} { \partial a^L_j } \sigma ' (z^L_j) $$

$$\delta^l_j = \frac {\partial C} { \partial z^l_j } $$

* Capital *C* denotes the overall cost function for the network

* Capital *L* denotes the last layer of the neural network while lowercase *l* denotes an arbitrary layer

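* Subscript *j* denotes the *jth* node in a particular layer *l*

* $z^l_j$ denotes the weighted input to node *j* in layer *l*, and $a^l_j = \sigma(z^l_j)$ denotes its activation, where $\sigma$ is the activation function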

* $\odot$ denotes the Hadamard product, which is an element-wise product of two vectors.
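
In NumPy (an assumed choice of library; the notebook does not prescribe one), the Hadamard product of two equally shaped arrays is the `*` operator, which is distinct from the dot product `@`. A minimal sketch with made-up values:

```python
import numpy as np

# Hypothetical example vectors, chosen only for illustration.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

hadamard = u * v   # element-wise product: [4., 10., 18.]
dot = u @ v        # inner product: 4 + 10 + 18 = 32.0
```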

## The Four Main Equations

The four main equations for computing back propagation are used to calculate the partial derivatives of all parameters in the neural network.


**1: Error derivative of last layer:**

$$\delta^L = \nabla_a C \odot \sigma ' (z^L) $$

**2: Error derivative of arbitrary layer l:**

$$ \delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma '(z^l) $$

**3: Partial derivative of cost with respect to bias j in layer l**

$$ \frac {\partial C} {\partial b^l_j} = \delta^l_j $$

**4: Partial derivative of cost with respect to weight j,k in layer l**

$$ \frac {\partial C} {\partial w^l_{jk} } = a^{l-1}_k \delta ^l_j$$
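
The four equations can be combined into a single backward pass. The sketch below is one possible arrangement, not a definitive implementation: it assumes a sigmoid activation and a quadratic cost $C = \frac{1}{2} \lVert a^L - y \rVert^2$ (neither is fixed by the equations above), and the names `backprop`, `weights`, and `biases` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Gradients of C = 0.5 * ||a^L - y||^2 for one sample (assumed cost).

    weights[i] has shape (n_out, n_in); biases[i] has shape (n_out,).
    Returns two lists, (grad_W, grad_b), shaped like weights and biases.
    """
    # Forward pass: remember every weighted input z^l and activation a^l.
    activations = [x]
    zs = []
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))

    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)

    # Equation 1: error derivative of the last layer.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    for i in range(len(weights) - 1, -1, -1):
        grad_b[i] = delta                              # Equation 3
        grad_W[i] = np.outer(delta, activations[i])    # Equation 4
        if i > 0:
            # Equation 2: propagate the error one layer back.
            delta = (weights[i].T @ delta) * sigmoid_prime(zs[i - 1])
    return grad_W, grad_b
```

Each entry of `grad_W` has the same shape as the corresponding weight matrix, so for layer sizes 4, 3, 2 the returned weight gradients have shapes (3, 4) and (2, 3).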

### Deriving Equation 1

$$\delta^L = \nabla_a C \odot \sigma ' (z^L) $$

By definition, $\delta^L_j = \frac { \partial C } { \partial z^L_j} $

Applying the chain rule over the activations of layer *L*, $\delta^L_j = \sum_k \frac {\partial C} {\partial a_k^L } \frac { \partial a_k^L } {\partial z_j^L} $ where *k* ranges over the nodes in layer *L*.

Note that when $j \neq k$, the expression $ \frac { \partial a_k^L } {\partial z_j^L} $ becomes 0 since $a_k^L$ is a function of only $z_k^L$

Only the $k = j$ term survives, so the above equation reduces to $\delta^L_j = \frac {\partial C} {\partial a_j^L } \frac { \partial a_j^L } {\partial z_j^L} = \frac {\partial C} {\partial a_j^L } \sigma ' (z^L_j) $

We can vectorize this equation by stacking the components, $\delta^L = (\delta_1^L, \delta_2^L, ..., \delta_{n_L}^L) $ where $n_L$ is the number of nodes in layer *L*. This gives $\delta^L = \nabla_a C \odot \sigma ' (z^L)$, where $\nabla_a C$ is the vector of partial derivatives $\frac {\partial C} {\partial a_j^L}$
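
Equation 1 can be sanity-checked numerically by comparing it against finite differences of the cost with respect to $z^L$. The sketch below assumes a sigmoid activation and a quadratic cost, which the derivation itself does not require; the input values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(z_L, y):
    # Assumed quadratic cost, written as a function of the weighted inputs z^L.
    return 0.5 * np.sum((sigmoid(z_L) - y) ** 2)

# Hypothetical values for a 3-node output layer.
z_L = np.array([0.3, -1.2, 0.8])
y = np.array([1.0, 0.0, 0.0])

# Equation 1: grad_a C is (a^L - y); sigma'(z) is a * (1 - a) for a sigmoid.
a_L = sigmoid(z_L)
delta_L = (a_L - y) * a_L * (1.0 - a_L)

# Central finite differences approximate dC/dz^L_j one component at a time.
eps = 1e-6
numeric = np.array([
    (cost(z_L + eps * e, y) - cost(z_L - eps * e, y)) / (2 * eps)
    for e in np.eye(len(z_L))
])

max_abs_diff = np.max(np.abs(delta_L - numeric))
```

If the derivation holds, `max_abs_diff` should be close to zero, limited only by floating-point error in the finite differences.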




### Deriving Equation 2

$$\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma ' (z^l) $$

By definition, $\delta^l_j = \frac { \partial C } { \partial z^l_j } $

Applying the chain rule over the weighted inputs of layer $l + 1$, we can derive: $ \delta^l_j = \sum_k \frac {\partial C} {\partial z^{l+1}_k} \frac {\partial z^{l+1}_k} {\partial z^l_j} = \sum_k \delta^{l+1}_k \frac {\partial z^{l+1}_k} {\partial z^l_j} $

Note that $z_k^{l+1} = (w_k^{l+1})^T a^l + b_k^{l+1} = \sum_i W^{l+1}_{ki} a^l_i + b_k^{l+1}$ where $w_k^{l+1}$ is a vector containing the *kth* row of weight matrix $W^{l+1}$. These are all the weights feeding into the *kth* node in layer $l + 1$.

Also note that $\frac {\partial a_i^l} {\partial z_j^l}$ is 0 when $i \neq j$, otherwise it is the derivative of our activation function, $\sigma '(z_j^l)$.

Putting this all together, we get $\frac {\partial z_k^{l+1}} {\partial z_j^l} = W^{l+1}_{kj} \sigma '(z_j^l)$

And $ \delta_j^{l} = \sum_k \delta_k^{l+1} W^{l+1}_{kj} \sigma '(z_j^l) $

The sum $\sum_k W^{l+1}_{kj} \delta_k^{l+1}$ is the *jth* component of $(W^{l+1})^T \delta^{l+1}$, so this expression can be vectorized to $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma ' (z^l) $
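
To make the vectorization step concrete, the sketch below evaluates the index form of Equation 2 with explicit loops and compares it with the matrix form. It assumes a sigmoid activation and uses random placeholder values; `W_next` stands for $W^{l+1}$, with rows indexing nodes in layer $l + 1$ and columns indexing nodes in layer $l$.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n_l, n_next = 4, 3                             # hypothetical layer widths
W_next = rng.standard_normal((n_next, n_l))    # W^{l+1}
delta_next = rng.standard_normal(n_next)       # delta^{l+1}
z_l = rng.standard_normal(n_l)                 # z^l

# Index form: delta_j = sum_k delta_next[k] * W_next[k, j] * sigma'(z_j)
delta_loop = np.zeros(n_l)
for j in range(n_l):
    for k in range(n_next):
        delta_loop[j] += delta_next[k] * W_next[k, j] * sigmoid_prime(z_l[j])

# Vectorized form: transpose-multiply, then take the Hadamard product.
delta_vec = (W_next.T @ delta_next) * sigmoid_prime(z_l)
```

The transpose appears because the sum runs over the row index of $W^{l+1}$, while the result is indexed by its column index.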


### Deriving Equation 3

$$ \frac {\partial C} {\partial b^l_j} = \delta^l_j $$

By definition, $\delta^l_j = \frac { \partial C } { \partial z^l_j } $

And by definition, $z_j^l = (w_j^l)^T a^{l-1} + b_j^l$ where $w_j^l$ is a vector containing the *jth* row of weight matrix $W^l$

So $\frac { \partial z^l_j } { \partial b^l_j } = 1$

And $\frac { \partial C } { \partial b^l_j } = \delta^l_j \frac { \partial z^l_j } { \partial b^l_j } = \delta^l_j $

### Deriving Equation 4

$$ \frac {\partial C} {\partial w^l_{jk} } = a^{l-1}_k \delta ^l_j$$

By definition, $\delta^l_j = \frac { \partial C } { \partial z^l_j } $

And by definition, $z_j^l = (w_j^l)^T a^{l-1} + b_j^l = \sum_k w^l_{jk} a^{l-1}_k + b_j^l$ where $w_j^l$ is a vector containing the *jth* row of weight matrix $W^l$

So $\frac { \partial z^l_j } { \partial w^l_{jk} } = a^{l-1}_k$

And $\frac { \partial C } { \partial w^l_{jk} } = \delta^l_j \frac { \partial z^l_j } { \partial w^l_{jk} } = a^{l-1}_k \delta^l_j $
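
Equations 3 and 4 can be checked the same way on a single layer. The sketch below assumes a sigmoid activation, a quadratic cost, and that the layer in question is the output layer, so that $\delta^l$ can be computed directly; all values and the indices `j` and `k` are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, a_prev, y):
    # Single sigmoid layer with an assumed quadratic cost.
    return 0.5 * np.sum((sigmoid(W @ a_prev + b) - y) ** 2)

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 3))     # n_l = 2 nodes, n_{l-1} = 3 inputs
b = rng.standard_normal(2)
a_prev = rng.standard_normal(3)     # a^{l-1}
y = np.array([0.0, 1.0])

a = sigmoid(W @ a_prev + b)
delta = (a - y) * a * (1.0 - a)     # delta^l for this (output) layer

j, k = 1, 2                         # arbitrary indices to check
eps = 1e-6

# Equation 3: dC/db_j should equal delta_j.
db = np.zeros_like(b)
db[j] = eps
numeric_b = (cost(W, b + db, a_prev, y) - cost(W, b - db, a_prev, y)) / (2 * eps)

# Equation 4: dC/dw_jk should equal a^{l-1}_k * delta_j.
dW = np.zeros_like(W)
dW[j, k] = eps
numeric_w = (cost(W + dW, b, a_prev, y) - cost(W - dW, b, a_prev, y)) / (2 * eps)
```

Here `numeric_b` should agree with `delta[j]` and `numeric_w` with `a_prev[k] * delta[j]`, up to finite-difference error.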


## Vectorizing the Equations

We want the equations to be vectorized so that they can be evaluated across many samples and many parameters with as few explicit loops as possible.

We can further vectorize equations 3 and 4.

### Vectorizing Equation 3

$$ \frac {\partial C} {\partial b^l_j} = \delta^l_j $$

There isn't much extra work to do here. Since $b^l$ and $\delta^l$ both have one entry per node in layer *l*, the error derivative $\delta^l$ is itself the gradient of the cost with respect to the whole bias vector, so all biases in a layer can be updated simultaneously:

$$ \nabla_{b^l} C = \delta^l $$
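
When several training samples are processed at once, one common convention (an assumption here, not something the derivation above fixes) is to stack the per-sample error vectors as columns of a matrix `Delta` of shape $n_l \times m$ and to define the total cost as the sum of the per-sample costs. Under that assumption, the bias gradient is the row-wise sum of `Delta`:

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, m = 3, 5                            # hypothetical layer width and batch size
Delta = rng.standard_normal((n_l, m))    # column i holds delta^l for sample i

# Sum the per-sample bias gradients (Equation 3) across the batch axis.
grad_b = Delta.sum(axis=1)

# Same result with an explicit loop over samples.
grad_b_loop = np.zeros(n_l)
for i in range(m):
    grad_b_loop += Delta[:, i]
```

If the cost is instead defined as the mean over samples, the sum would be divided by `m`.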

### Vectorizing Equation 4

$$ \frac {\partial C} {\partial w^l_{jk} } = a^{l-1}_k \delta ^l_j$$

We want a gradient with the same dimensions as the layer's weight matrix, $ \mathbb{R}^{n_l \times n_{l-1}}$, so that it can be used to update all weights in the layer at once.

This can be accomplished using an outer product of the vectors $\delta^l \in \mathbb{R}^{n_l} $ and $a^{l-1} \in \mathbb{R}^{n_{l-1}}$:

$$\nabla_{W^l} C = \delta^l (a^{l-1})^T$$
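
In NumPy the outer product is available as `np.outer`. The sketch below uses random placeholder values to show the single-sample case and, under the assumption that samples are stacked as columns and the total cost is the sum of per-sample costs, a batched version in which one matrix product sums the per-sample outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, n_prev, m = 3, 4, 5                 # hypothetical sizes

# Single sample: the outer product has the same shape as W^l.
delta = rng.standard_normal(n_l)
a_prev = rng.standard_normal(n_prev)
grad_W = np.outer(delta, a_prev)         # grad_W[j, k] == delta[j] * a_prev[k]

# Batch of m samples stacked as columns (assumed layout, summed cost).
Delta = rng.standard_normal((n_l, m))
A_prev = rng.standard_normal((n_prev, m))
grad_W_batch = Delta @ A_prev.T          # sums the m per-sample outer products

# Reference computation with an explicit loop over samples.
grad_W_loop = sum(np.outer(Delta[:, i], A_prev[:, i]) for i in range(m))
```

Both `grad_W` and `grad_W_batch` have shape `(n_l, n_prev)`, matching the weight matrix.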

