In [10]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

### A Matrix-Based Approach to Computing the Output of a Neural Network

To begin, let's describe some notation:
1. $w_{jk}^l$ is the weight from neuron $k$ in the $(l-1)^{th}$ layer to neuron $j$ in the $l^{th}$ layer
2. $b_j^l$ is the bias of neuron $j$ in layer $l$
3. $a_j^l$ is the activation of neuron $j$ in layer $l$

With this notation, the activation of neuron $j$ in layer $l$ is
$$
a_j^l = \sigma\Big(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\Big)
$$

In matrix form, with $\sigma(\cdot)$ applied elementwise and $\mathbf{z}^l$ denoting the vector of weighted inputs to layer $l$,
$$
\mathbf{a}^l = \sigma(\mathbf{z}^l) = \sigma\big(\mathbf{W}^l\mathbf{a}^{l-1} + \mathbf{b}^l\big)
$$
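As a quick check that the component form and the matrix form agree, here is a minimal sketch for a single layer. The layer sizes, the random values, and the helper name `sigmoid` are illustrative assumptions, not part of the notation above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)   # fixed seed; the values are illustrative only

# Hypothetical layer: 3 neurons in layer l-1 feeding 2 neurons in layer l.
W = rng.standard_normal((2, 3))  # W[j, k] is the weight from neuron k to neuron j
b = rng.standard_normal(2)       # b[j] is the bias of neuron j
a_prev = rng.random(3)           # activations a^{l-1}

# Component form: a_j = sigma(sum_k w_jk * a_k + b_j)
a_loop = np.array([
    sigmoid(sum(W[j, k] * a_prev[k] for k in range(3)) + b[j])
    for j in range(2)
])

# Matrix form: z = W a_prev + b, then a = sigma(z)
z = W @ a_prev + b
a_vec = sigmoid(z)

print(np.allclose(a_loop, a_vec))
```

Storing the weights so that row $j$ of $\mathbf{W}^l$ holds the weights into neuron $j$ is what lets the sum over $k$ become a single matrix-vector product.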

Revisiting our cost function, we write
$$
C = \frac{1}{2n}\sum_x\|\mathbf{y}(x) - \mathbf{a}^L(x)\|^2
$$

For backpropagation to apply, we need two things from the cost function. First, we must be able to write it as an average of cost functions for individual data points, $C = \frac{1}{n}\sum_x C_x$, which we can do in this case. Second, we must be able to write it as a function of the network output $\mathbf{a}^L$, which we can also do in this case. For a given input $x$, dropping the subscript where the input is clear from context, the cost is:
$$
C_x = C = \frac{1}{2}\|\mathbf{y} - \mathbf{a}^L\|^2 = \frac{1}{2}\sum_j(y_j - a_j^L)^2
$$
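The two forms of the cost can be compared numerically. The sketch below uses made-up targets and outputs for two training examples; the function names `quadratic_cost_single` and `quadratic_cost` are assumptions chosen for illustration.

```python
import numpy as np

def quadratic_cost_single(y, a_L):
    """C_x = 0.5 * ||y - a^L||^2 for one training example."""
    return 0.5 * np.sum((y - a_L) ** 2)

def quadratic_cost(Y, A_L):
    """C = (1/n) * sum_x C_x, where each row of Y and A_L is one example."""
    return np.mean([quadratic_cost_single(y, a) for y, a in zip(Y, A_L)])

# Made-up targets and network outputs for n = 2 examples.
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
A_L = np.array([[0.8, 0.1], [0.3, 0.6]])

c0 = quadratic_cost_single(Y[0], A_L[0])  # 0.5 * (0.2**2 + 0.1**2)
c1 = quadratic_cost_single(Y[1], A_L[1])  # 0.5 * (0.3**2 + 0.4**2)
C = quadratic_cost(Y, A_L)                # average of c0 and c1

# Same value from the 1/(2n) form of the cost.
C_direct = np.sum((Y - A_L) ** 2) / (2 * len(Y))
print(c0, c1, C, C_direct)
```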

Below is an example of an operation that we will use repeatedly, called the *Hadamard* or *Schur* product and written $\mathbf{a}\odot\mathbf{b}$. It multiplies two vectors of the same shape element by element, producing a vector rather than the scalar a dot product would give:

In [9]:
a = np.array([2, 4])
b = np.array([5, 3])
a * b  # elementwise (Hadamard) product, not a dot product

array([10, 12])

### The Equations of Backpropagation

We begin by introducing the concept of an *error* $\delta_j^l$ for the $j^{th}$ neuron in layer $l$.

$$
\delta_j^l = \frac{{\partial C}}{{\partial z_j^l}}
$$



**BP Equation 1**: error at the output layer
$$
\delta_j^L = \frac{\partial C}{\partial a_j^L}\sigma'(z_j^L)
$$

The weighted input $z_j^L$ to neuron $j$ in layer $L$ is calculated while computing the behavior of the network in the forward pass. It is passed through the derivative of the sigmoid function, $\sigma'$, and the result multiplies the derivative of the cost function with respect to the activation $a_j^L$ of neuron $j$ in layer $L$. With our cost function from above for a single input data point,

$$
\frac{\partial C}{\partial a_j^L} = (a_j^L - y_j)
$$

In matrix notation we have the following,
$$
\delta^L = \nabla_a{C}\odot\sigma'(z^L) = (a^L-y)\odot\sigma'(z^L)
$$
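One way to sanity-check the output-layer error is to compare it against a finite-difference estimate of $\partial C/\partial z_j^L$. The sketch below assumes the quadratic cost for a single example, for which $\partial C/\partial a_j^L = a_j^L - y_j$, and a sigmoid activation. The values of `z_L` and `y` and the helper names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(z, y):
    """Quadratic cost for one example, as a function of the weighted inputs z^L."""
    return 0.5 * np.sum((y - sigmoid(z)) ** 2)

# Made-up weighted inputs and targets for a 3-neuron output layer.
z_L = np.array([0.5, -1.2, 2.0])
y = np.array([1.0, 0.0, 1.0])

# BP Equation 1 in matrix form: delta^L = (a^L - y) (Hadamard) sigma'(z^L)
a_L = sigmoid(z_L)
delta_L = (a_L - y) * sigmoid_prime(z_L)

# Central finite differences of C with respect to each z_j^L.
eps = 1e-6
numeric = np.array([
    (cost(z_L + eps * e, y) - cost(z_L - eps * e, y)) / (2 * eps)
    for e in np.eye(len(z_L))
])

print(np.allclose(delta_L, numeric, atol=1e-6))
```

The sign of each entry shows which way the weighted input should move: an output below its target gives a negative $\delta_j^L$, so increasing $z_j^L$ lowers the cost.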