# Neural Networks and Deep Learning Notes

Notes and equations from [neuralnetworksanddeeplearning.com](http://neuralnetworksanddeeplearning.com/)

# Chapter 2

- Backpropagation purpose: compute the gradient of the cost function to use in gradient descent.
- The partial derivatives $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ tell us how quickly the cost changes w.r.t. changes in the weights and biases.

## Warm up: a fast matrix-based approach to computing the output from a neural network.

### Notation

- $w^l_{jk}$ denotes the weight for the connection from the $k$th neuron in the $(l - 1)$th layer to the $j$th neuron in the $l$th layer. In code, this is expressed as a list of matrices: `w[l][j][k]` is the weight from neuron `k` in layer `l-1` to neuron `j` in layer `l`.

![](http://neuralnetworksanddeeplearning.com/images/tikz16.png)

- $b^l_j$ denotes the bias of the $j$th neuron in the $l$th layer. In code, this is expressed as a list of column vectors: `b[l][j]` is the bias of neuron `j` in layer `l`.
- $a^l_j$ denotes the activation of the $j$th neuron in the $l$th layer. In code this is expressed as a list of column vectors: `a[l][j]` is the activation for neuron `j` in layer `l`.

- The value of $a^l_j$ is computed: $a^l_j = \sigma ( \sum_k w^l_{jk} \cdot a^{l-1}_k + b^l_j)$. Notice we are using the activation from the layer $l - 1$ to compute the activation at layer $l$. 
- The same values can also be vectorized by expressing the activations and biases as vectors and the weights as a matrix: $a^l = \sigma (w^l \cdot a^{l-1} + b^l)$.
- $z^l$ is the weighted input to the activation function in layer $l$, and it uses the activation of layer $l - 1$: $z^l = w^l \cdot a^{l-1} + b^l$.
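
The vectorized rule $a^l = \sigma(w^l \cdot a^{l-1} + b^l)$ can be sketched in NumPy roughly as follows. The 3-4-2 layer sizes, the random parameters, and the `feedforward` helper name are assumptions made for illustration; they are not taken from the book's code.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(weights, biases, a):
    """Apply a^l = sigmoid(w^l a^{l-1} + b^l) layer by layer.

    weights[i] has shape (n_out, n_in) and biases[i] has shape (n_out, 1),
    so list index i corresponds to layer l = i + 2 in the notation above.
    """
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b   # weighted input z^l
        a = sigmoid(z)         # activation a^l
    return a

# Hypothetical 3-4-2 network with random parameters (illustration only).
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]

x = rng.standard_normal((3, 1))   # input activation a^1 as a column vector
output = feedforward(weights, biases, x)
print(output.shape)
```

Keeping every activation as a column vector means `np.dot(w, a) + b` needs no reshaping between layers.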


## Necessary assumptions about the cost function

- These notes use the quadratic cost function $C = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|^2$, where $y(x)$ is the true output and $a^L(x)$ is the output activation given the current weights and biases.
- Assumption 1: the cost function can be written as an average of the individual sample costs: $C = \frac{1}{n} \sum_x C_x$, $C_x = \frac{1}{2} \|y - a^L\|^2$.
- Assumption 2: the cost function can be written as a function of the outputs: $C = C(a^L)$. For a single sample, $C = \frac{1}{2} \|y - a^L\|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2$.
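
As a small numeric check of the "average of per-sample costs" form, here is a sketch with two made-up samples. The helper names `sample_cost` and `total_cost` and the numbers are assumptions for illustration.

```python
import numpy as np

def sample_cost(a_L, y):
    """Per-sample quadratic cost C_x = 1/2 * sum_j (y_j - a^L_j)^2."""
    return 0.5 * np.sum((y - a_L) ** 2)

def total_cost(outputs, targets):
    """Average of the per-sample costs: C = (1/n) * sum_x C_x."""
    return np.mean([sample_cost(a, y) for a, y in zip(outputs, targets)])

# Two made-up samples, each with two output neurons.
outputs = [np.array([[0.8], [0.2]]), np.array([[0.4], [0.6]])]
targets = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]

print(sample_cost(outputs[0], targets[0]))   # cost of the first sample
print(total_cost(outputs, targets))          # average over both samples
```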

## Hadamard product

- Elementwise multiplication of two vectors of the same dimensions.

\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right] 
  \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}


In [10]:
# Hadamard example: np.multiply (equivalently a * b) multiplies elementwise
# when both arrays have the same shape.
import numpy as np

a = np.array([[1],[2]])
b = np.array([[3],[4]])
c = np.multiply(a,b)
print(c)

# Is the Hadamard product the same as the dot product with one matrix transposed?
# No: np.dot(a, b.T) is the outer product, a 2x2 matrix.
a = np.array([[1],[2]])
b = np.array([[3],[4]])
c = np.dot(a,b.transpose())
print(c)

[[3]
 [8]]
[[3 4]
 [6 8]]


## Four fundamental equations behind backpropagation

### Preliminary: measure of error $\delta^l_j$ 

- $\delta^l_j$ is the error in the $j$th neuron in the $l$th layer. Backprop gives us a procedure to compute this error, which is then used to compute the partial derivatives w.r.t. the weights and biases that make up the gradient.
- $\delta^l_j = \frac{\partial C}{\partial z^l_j}$: the error at neuron $j$ in layer $l$ is the partial derivative of $C$ w.r.t. the weighted input to that neuron ($z^l_j$), "a measure of the error in the neuron".

### 1. Error in the output layer $\delta^L$

Element-wise form:

$\begin{eqnarray} 
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
\tag{BP1}\end{eqnarray}$

Vectorized form (for the quadratic cost, where $\nabla_a C = a^L - y$):

$\begin{eqnarray} 
  \delta^L = (a^L-y) \odot \sigma'(z^L).
\tag{30}\end{eqnarray}$


- $\frac{\partial C}{\partial a^L_j}$ measures how fast the cost is changing as a function of the $j$th output activation. If $C$ doesn't heavily depend on a particular neuron, then the error $\delta^L_j$ will be small.
- $\sigma'(z^L_j)$ measures how fast the activation function $\sigma$ is changing at $z^L_j$. 
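
A minimal numeric sketch of BP1 for the quadratic cost, using made-up values for $z^L$ and $y$. The helper `sigmoid_deriv` uses the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Made-up weighted inputs and targets for a 2-neuron output layer.
z_L = np.array([[0.0], [2.0]])
y = np.array([[1.0], [0.0]])

a_L = sigmoid(z_L)
grad_a = a_L - y                        # dC/da^L for the quadratic cost
delta_L = grad_a * sigmoid_deriv(z_L)   # BP1: elementwise (Hadamard) product
print(delta_L)
```

The second neuron has the larger output error, but $\sigma'(2)$ is smaller than $\sigma'(0)$, so its $\delta$ is scaled down more strongly.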

### 2. $\delta^l$ in terms of the error in the next layer $\delta^{l+1}$

To find the error of any individual layer:

$\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}$

- This equation gets applied to each layer, starting at the penultimate layer and moving backwards through the network.
- Use BP1 to get $\delta^L$, then BP2 to get $\delta^{L-1}$, $\delta^{L-2}$, ...
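
One BP2 step can be sketched as below. The shapes (3 neurons in layer $l$, 2 neurons in layer $l+1$) and all numeric values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical shapes: layer l has 3 neurons, layer l+1 has 2 neurons.
w_next = np.array([[0.1, -0.2, 0.3],
                   [0.4, 0.5, -0.6]])    # w^{l+1}, shape (2, 3)
delta_next = np.array([[0.5], [-1.0]])   # delta^{l+1}, shape (2, 1)
z_l = np.zeros((3, 1))                   # z^l; sigma'(0) = 0.25

# BP2: pull the error back through the transposed weights, then scale
# elementwise by the activation derivative at layer l.
delta_l = np.dot(w_next.T, delta_next) * sigmoid_deriv(z_l)
print(delta_l)
```

Transposing $w^{l+1}$ is what turns an error vector sized for layer $l+1$ into one sized for layer $l$.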

### 3. Rate of change of the cost with respect to any bias in the network

Element-wise form:

$\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j.
\tag{BP3}\end{eqnarray}$

Vectorized form:

$\begin{eqnarray}
  \frac{\partial C}{\partial b} = \delta,
\tag{31}\end{eqnarray}$

### 4. Rate of change of the cost with respect to any weight in the network

Element-wise form:

$\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}$

- Rate of change of $C$ w.r.t. the weight from neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$.
- $a^{l-1}_k$ is the activation of neuron $k$ in the previous layer $l-1$.
- $\delta^l_j$ is the error at neuron $j$ in the current layer $l$.
- "Incoming activation times outgoing error".
- Because this is a product, a weight fed by a low-activation neuron ($a^{l-1}_k \approx 0$) gets a small gradient even if the error is high. This is a potential pitfall: such weights "learn slowly".

Simplified form:

$\begin{eqnarray}  \frac{\partial
    C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}$
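
For a whole layer, BP3 and BP4 reduce to a copy and an outer product. The sketch below uses made-up values; the zero entry in `a_prev` is chosen to show the "learning slowly" effect.

```python
import numpy as np

a_prev = np.array([[0.5], [0.0], [1.0]])   # a^{l-1}: 3 neurons (made-up values)
delta = np.array([[0.2], [-0.4]])          # delta^l: 2 neurons (made-up values)

grad_b = delta                       # BP3: dC/db^l_j = delta^l_j
grad_w = np.dot(delta, a_prev.T)     # BP4: grad_w[j][k] = a_prev[k] * delta[j]
print(grad_w)
```

The middle column of `grad_w` is zero because the second incoming activation is zero, so those weights receive no update from this sample regardless of the error.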

## Proofs of the four fundamental equations

TODO

## Backpropagation algorithm summarized

1. **Input**: The raw features for a single training sample expressed as the activation $a^1$.
2. **Feedforward**: For $l = 2,3,...,L$, compute and store the weighted inputs $z^l = w^l a^{l - 1} + b^l$ and the activations $a^l = \sigma(z^l)$.
3. **Output error**: Compute the output error $\delta^L = (a^L-y) \odot \sigma'(z^L)$.
4. **Backpropagate the error**: Going backwards $l = L - 1, L - 2, ..., 2$, compute each $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l})$.
5. **Compute, store gradients**: At each layer, store the gradients for weight: $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and for bias: $\frac{\partial C}{\partial b^l_j} = \delta^l_j$. These can be stored as lists of matrices of the same size as the bias and weight matrices.
6. **Return**: return the stored gradient values $\nabla C$, which will be used to update the weights and biases for the next iteration.
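
One way to sanity-check the steps above is to compare a backprop gradient with a finite-difference estimate. The following is a minimal sketch under assumptions: a made-up 2-3-2 network with random parameters, the quadratic cost, and a free-standing `backprop` function written for this check (it is not the class method shown later in these notes).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(weights, biases, x, y):
    """Quadratic cost C_x for a single sample."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(weights, biases, x, y):
    """Steps 1-6 for a single sample; returns (grad_b, grad_w)."""
    A, Z = [x], []
    for w, b in zip(weights, biases):            # feedforward
        Z.append(np.dot(w, A[-1]) + b)
        A.append(sigmoid(Z[-1]))
    delta = (A[-1] - y) * sigmoid_deriv(Z[-1])   # BP1
    grad_b = [None] * len(biases)
    grad_w = [None] * len(weights)
    grad_b[-1] = delta                           # BP3
    grad_w[-1] = np.dot(delta, A[-2].T)          # BP4
    for i in range(len(weights) - 2, -1, -1):    # walk backwards
        delta = np.dot(weights[i + 1].T, delta) * sigmoid_deriv(Z[i])  # BP2
        grad_b[i] = delta
        grad_w[i] = np.dot(delta, A[i].T)
    return grad_b, grad_w

# Hypothetical 2-3-2 network (illustration only).
rng = np.random.default_rng(1)
sizes = [2, 3, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((2, 1))
y = np.array([[1.0], [0.0]])

grad_b, grad_w = backprop(weights, biases, x, y)

# Central finite difference on a single weight in the first weight matrix.
eps = 1e-6
layer, j, k = 0, 1, 0
original = weights[layer][j, k]
weights[layer][j, k] = original + eps
c_plus = cost(weights, biases, x, y)
weights[layer][j, k] = original - eps
c_minus = cost(weights, biases, x, y)
weights[layer][j, k] = original              # restore the weight
numeric = (c_plus - c_minus) / (2 * eps)

print(numeric, grad_w[layer][j, k])          # the two values should agree closely
```

The finite-difference estimate needs two forward passes per parameter, whereas backprop produces every partial derivative from one forward and one backward pass.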

## Exercises

1) Suppose we modify a single neuron in a feedforward network so that the output is given by $f(\sum_j w_jx_j + b)$ where $f$ is some function other than the sigmoid. How should we modify backprop for this case?

You would need to be able to compute the derivative $f'$ of the new activation function and use it in place of $\sigma'$ for that neuron's component in equations BP1 and BP2. The feedforward step for that neuron would use $f$ as well.

2) Suppose we replace the usual non-linear $\sigma$ function with $\sigma(z) = z$ throughout the network. Rewrite the backpropagation algorithm for this case.

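Not rewriting the entire algorithm, but I believe this is also a matter of modifying every place that uses the derivative of $\sigma$. Since $\sigma(z) = z$ gives $\sigma'(z) = 1$, the Hadamard factors drop out: BP1 becomes $\delta^L = \nabla_a C$, BP2 becomes $\delta^l = (w^{l+1})^T \delta^{l+1}$, and the feedforward step reduces to $a^l = z^l$.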

## Gradient Descent with Backprop

This is a summary of how backpropagation determines the updates $\Delta w$ and $\Delta b$ that are applied at each iteration of gradient descent.

1. Input a set of training examples (`update_by_mini_batch()` function call).
2. For each training example $x$:
    - Set input activation $a^{x,1}$.
    - `back_prop()` function call.
    - Feedforward: for each $l = 2,3,...,L$ compute the weighted input to the next layer: $z^{x,l} = w^la^{x,l-1} + b^l$ and the activation at that layer: $a^{x,l} = \sigma(z^{x,l})$.
    - Output error: Compute the error at the output layer: $\delta^{x,L} = \nabla_aC_x \odot \sigma'(z^{x,L})$, which for the quadratic cost is $(a^{x,L}-y) \odot \sigma'(z^{x,L})$.
    - Backpropagate the error: For each $l = L -1, L -2, ..., 2$ compute $\delta^{x,l} = ((w^{l+1})^T\delta^{x,l+1}) \odot \sigma'(z^{x,l})$. 

3. Gradient descent: use the computed errors to update the weight and bias values:
    - For each $l = L, L - 1,...,2$ update weights and biases according to rules:
    
    $w^l \rightarrow w^l - \frac{\eta}{m}\sum_x \delta^{x,l}(a^{x,l-1})^T$
    
    $b^l \rightarrow b^l - \frac{\eta}{m}\sum_x \delta^{x,l}$
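
The update rules above can be sketched as a standalone function. The name `apply_mini_batch_update` and the tiny single-layer example are assumptions for illustration; `batch_grads` is assumed to hold one `(grad_b, grad_w)` pair per training example, in the layer-by-layer list format that `back_prop()` returns.

```python
import numpy as np

def apply_mini_batch_update(weights, biases, batch_grads, eta):
    """One gradient descent step averaged over a mini-batch of size m."""
    m = len(batch_grads)
    sum_b = [np.zeros(b.shape) for b in biases]
    sum_w = [np.zeros(w.shape) for w in weights]
    # Accumulate the per-example gradients layer by layer.
    for grad_b, grad_w in batch_grads:
        sum_b = [sb + gb for sb, gb in zip(sum_b, grad_b)]
        sum_w = [sw + gw for sw, gw in zip(sum_w, grad_w)]
    # w^l -> w^l - (eta / m) * sum_x delta^{x,l} (a^{x,l-1})^T, same for b^l.
    new_weights = [w - (eta / m) * sw for w, sw in zip(weights, sum_w)]
    new_biases = [b - (eta / m) * sb for b, sb in zip(biases, sum_b)]
    return new_weights, new_biases

# Made-up example: one layer with 2 inputs and 1 output, 2 samples in the batch.
weights = [np.array([[1.0, 2.0]])]
biases = [np.array([[0.5]])]
batch_grads = [
    ([np.array([[0.2]])], [np.array([[0.1, 0.0]])]),
    ([np.array([[0.4]])], [np.array([[0.3, 0.2]])]),
]
new_weights, new_biases = apply_mini_batch_update(weights, biases, batch_grads, eta=1.0)
print(new_weights[0], new_biases[0])
```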
    

## Back-propagation code annotated with the equations.


In [11]:
def back_prop(self, x, y):
        '''Execute back propagation for a single (x,y) (input, output)
        training pair. Return a tuple of the gradients (grad_b, grad_w)
        where grad_b and grad_w are layer-by-layer lists of numpy arrays.
        
        grad_w[l][j][k] is the gradient for the weight from neuron k in layer
        l - 1 to neuron j in layer l.
        
        grad_b[l][j] is the gradient for the bias for neuron j in layer l.
        
        '''
        
        # Both gradients initialized as zeros.
        grad_b = [np.zeros(b.shape) for b in self.biases]
        grad_w = [np.zeros(w.shape) for w in self.weights]
        
        # Duplicate forward prop, tracking a and z to compute deltas.
        a = x   # First activation is the input.
        A = [a] # Store the activations.
        Z = []  # Store the weighted inputs to activation.
        
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, a) + b # Compute and store the weighted input to the activation.
            Z.append(z)
            a = sigmoid(z)       # Compute and store the sigmoid activation.
            A.append(a)
            
        # Backward pass, compute the error first.
        delta = quadratic_cost_deriv(A[-1], y) * sigmoid_deriv(Z[-1]) # Eq. BP1.
        grad_b[-1] = delta # Eq. BP3. Bias gradient at the last layer is the output delta.
        grad_w[-1] = np.dot(delta, A[-2].transpose()) # Eq. BP4.
        
        # Python negative indexing used to loop backward through layers.
        for l in range(2, self.num_layers):
            sd = sigmoid_deriv(Z[-l]) 
            delta = np.dot(self.weights[-l + 1].transpose(), delta) * sd # Eq. BP2.
            grad_b[-l] = delta # Eq. BP3.
            grad_w[-l] = np.dot(delta, A[-l - 1].transpose()) # Eq. BP4.
        
        return grad_b, grad_w
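
`back_prop()` calls three helpers, `sigmoid`, `sigmoid_deriv`, and `quadratic_cost_deriv`, that are not shown in these notes. A minimal sketch of what they could look like, assuming the logistic sigmoid and the quadratic cost used throughout (the bodies below are assumptions, only the names come from the code above):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def quadratic_cost_deriv(output_activations, y):
    """dC_x/da^L for the quadratic cost C_x = 1/2 * ||y - a^L||^2."""
    return output_activations - y

# Quick check with made-up values.
z = np.array([[0.0], [1.0]])
print(sigmoid(z))
print(sigmoid_deriv(z))
print(quadratic_cost_deriv(sigmoid(z), np.array([[1.0], [0.0]])))
```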