# Neural Network Backpropagation
__MATH 3480__ - Dr. Michael Olson

Reading:
* Geron, Chapter 11
* Brunton, Chapter 6

## Preliminaries
* Create a separate virtual environment
  * Some of the packages TensorFlow depends on don't play nicely with other packages
* Install TensorFlow

## A conceptual description

To describe how we train a neural network, let's use a concrete example: the MNIST dataset of 70,000 handwritten digits.

![MNIST Dataset](https://upload.wikimedia.org/wikipedia/commons/f/f7/MnistExamplesModified.png)

Each image has a resolution of 28x28, which is 784 pixels, so our input layer has 784 nodes. The output layer has 10 nodes, one for each digit, and the value of each node is the probability that the image shows the digit that node represents.

Between the input and output layers, let's add 2 hidden layers with 16 nodes each. The following is a step-through of the training process.
1. Initially, all the weights connecting nodes from one layer to the next are random
2. If we run the first image through the network, we get one probability for each of the 10 digits (this is called a __forward pass__)
    * With this first run, the results will be very poor
3. Compare these results with the true value
    * This gives an indication if the result needs to increase or decrease
    * A variety of loss functions can be used for this comparison
4. Look at the previous layer
    * If any nodes are pushing the output value to the correct result, we note that we want to strengthen it
    * If any nodes are pushing the output value to the incorrect result, we note that we want to weaken it
5. Knowing how we want to change the 2nd-to-last layer, follow the same process to determine how the 3rd-to-last layer must change. Keep doing this until we reach the 1st layer in the network. (This is called a __reverse pass__, also known as a backward pass.)
6. Run this for all the images in a training set
    * The average of the changes we calculated in (4) and (5) indicates how much each weight needs to change
    * It is at this point that we actually nudge the weights

After this, repeat the whole process multiple times. Eventually, the changes to each weight become minor. At that point, training has converged.
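
As a rough illustration of steps 1 and 2, here is a minimal NumPy sketch of a forward pass through the 784-16-16-10 network described above. The sigmoid hidden activation, the softmax output, the zero initial biases, and the random stand-in image are all assumptions made for this sketch; they are not specified in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - z.max())
    return e / e.sum()

# Layer sizes from the text: 784 inputs, two hidden layers of 16, 10 outputs.
sizes = [784, 16, 16, 10]

# Step 1: random initial weights (biases start at zero, an assumption).
weights = [rng.normal(0.0, 0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    """Step 2: push one flattened image through every layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(W @ a + b)
    # Assumed softmax output so the 10 values form a probability distribution.
    return softmax(weights[-1] @ a + biases[-1])

x = rng.random(784)   # stand-in for a flattened 28x28 image, not real MNIST data
probs = forward(x)
print(probs)
```

With untrained random weights the 10 probabilities carry no information about the image, which is why the first results are poor.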

### Stochastic Gradient Descent
There is a problem, however: with 70,000 images, each weight update takes a lot of time. A solution is to take a random sample of images (a mini-batch) and run only that through the network. After one forward and reverse pass on the mini-batch, we update the weights, then draw another mini-batch and repeat.

Each step is less accurate than one computed from the entire training dataset, but overall this gets us there faster.
* Using the entire training dataset gives the exact gradient of the cost over the training set, so every step points directly downhill, but each step is slow to compute
* Using a smaller random sample, the step may not point in exactly the right direction (like a ball bouncing around as it goes down the mountain). However, the steps are faster and we get to the result in less time.

This process of using random samples to take noisy steps toward the solution is known as __stochastic gradient descent__.
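
The mini-batch loop can be sketched on a toy problem. The example below is not a neural network: it fits a small linear model with a squared-error cost, chosen only because the gradient is one line. The data, `true_w`, `learning_rate`, and `batch_size` are all hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 1,000 samples, 3 features, known true weights.
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)
learning_rate = 0.1
batch_size = 32

for step in range(500):
    # Draw a random mini-batch instead of using all 1,000 samples.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error over the mini-batch only.
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size
    # Update the weights right away, then move on to the next mini-batch.
    w -= learning_rate * grad

print(w)
```

Each update looks at only 32 samples, so it is cheap, yet the weights still approach `true_w` after enough steps.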

## The math
Let $(L)$ be the superscript index for layer $L$, so $a_j^{(L)}$ is the activation value of the jth element in the Lth layer.

To keep this simple, suppose we have 4 layers with a single node each.
* The cost function is $C = (a^{(L)} - y)^2$
* The activation is $a^{(L)} = \sigma(z^{(L)})$
* The term inside the activation function is a linear function of the previous activation (the same form as a linear regression), $z^{(L)} = w^{(L)}a^{(L-1)}+b^{(L)}$

We want to know how much to change the weight $w^{(L)}$. So, we use the chain rule:
$$\frac{\partial C}{\partial w^{(L)}} = \frac{\partial C}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial z^{(L)}}{\partial w^{(L)}}$$
* $\frac{\partial C}{\partial a^{(L)}} = 2(a^{(L)} - y)$
* $\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)})$
* $\frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}$
$$\frac{\partial C}{\partial w^{(L)}} = 2(a^{(L)}-y)\sigma'(z^{(L)})a^{(L-1)}$$

If instead we want to know how much to change the bias $b^{(L)}$, we do the same thing:
$$\frac{\partial C}{\partial b^{(L)}} = \frac{\partial C}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial z^{(L)}}{\partial b^{(L)}}$$
* $\frac{\partial C}{\partial a^{(L)}} = 2(a^{(L)} - y)$
* $\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)})$
* $\frac{\partial z^{(L)}}{\partial b^{(L)}} = 1$
$$\frac{\partial C}{\partial b^{(L)}} = 2(a^{(L)}-y)\sigma'(z^{(L)})$$

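Finally, we want to know how sensitive the cost is to the previous activation. We cannot set $a^{(L-1)}$ directly, but this derivative is what carries the error back to layer $L-1$, where the same three steps are repeated.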
$$\frac{\partial C}{\partial a^{(L-1)}} = \frac{\partial C}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial z^{(L)}}{\partial a^{(L-1)}}$$
* $\frac{\partial C}{\partial a^{(L)}} = 2(a^{(L)} - y)$
* $\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)})$
* $\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = w^{(L)}$
$$\frac{\partial C}{\partial a^{(L-1)}} = 2(a^{(L)}-y)\sigma'(z^{(L)})w^{(L)}$$
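
One way to sanity-check these three derivatives is to compare them against finite differences. The sketch below does this for a single node, using $z^{(L)} = w^{(L)}a^{(L-1)}+b^{(L)}$ so that $\partial z^{(L)}/\partial w^{(L)} = a^{(L-1)}$. It assumes $\sigma$ is the logistic sigmoid, for which $\sigma'(z) = \sigma(z)(1-\sigma(z))$; the numeric values of `w`, `b`, `a_prev`, and `y` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, a_prev, y):
    # C = (a - y)^2 with a = sigma(w * a_prev + b)
    return (sigmoid(w * a_prev + b) - y) ** 2

# Arbitrary values chosen for illustration.
w, b, a_prev, y = 0.7, -0.2, 0.9, 1.0

z = w * a_prev + b
a = sigmoid(z)
sigma_prime = a * (1.0 - a)            # derivative of the sigmoid at z
common = 2.0 * (a - y) * sigma_prime   # factor shared by all three derivatives

# Chain-rule results.
dC_dw = common * a_prev
dC_db = common
dC_da_prev = common * w

# Central finite differences for comparison.
eps = 1e-6
num_dw = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)
num_db = (cost(w, b + eps, a_prev, y) - cost(w, b - eps, a_prev, y)) / (2 * eps)
num_da_prev = (cost(w, b, a_prev + eps, y) - cost(w, b, a_prev - eps, y)) / (2 * eps)

print(dC_dw, num_dw)
print(dC_db, num_db)
print(dC_da_prev, num_da_prev)
```

Each printed pair should agree to many decimal places. A mismatch usually means a factor in the chain rule was taken from the wrong layer.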

We do this for all layers. The total gradient is then:
$$\nabla C = \begin{bmatrix} \frac{\partial C}{\partial w^{(1)}} \\ \frac{\partial C}{\partial b^{(1)}} \\ \vdots \\ \frac{\partial C}{\partial w^{(L)}} \\ \frac{\partial C}{\partial b^{(L)}} \end{bmatrix}$$

We then apply gradient descent, nudging every weight and bias a small step in the direction of $-\nabla C$.
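
Putting the pieces together, the following sketch trains the one-node-per-layer chain on a single example. It assumes that "4 layers" means an input plus three layers with their own weight and bias, that $\sigma$ is the logistic sigmoid, and that the starting weights, the training pair `x, y`, and `learning_rate` are arbitrary values standing in for a random initialization and real data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layout: input a[0], then three layers with one weight and bias each.
w = np.array([0.5, -0.5, 0.5])   # arbitrary starting values
b = np.zeros(3)
x, y = 0.5, 0.8                  # hypothetical training example
learning_rate = 1.0

def forward(w, b, x):
    """Forward pass: return the activations of all 4 layers."""
    activations = [x]
    for wl, bl in zip(w, b):
        activations.append(sigmoid(wl * activations[-1] + bl))
    return activations

costs = []
for step in range(500):
    a = forward(w, b, x)
    costs.append((a[-1] - y) ** 2)

    # Reverse pass: start from dC/da at the output and walk back layer by layer.
    dC_da = 2.0 * (a[-1] - y)
    grad_w = np.zeros(3)
    grad_b = np.zeros(3)
    for l in reversed(range(3)):
        delta = dC_da * a[l + 1] * (1.0 - a[l + 1])   # dC/dz for this layer
        grad_w[l] = delta * a[l]                      # dz/dw is the previous activation
        grad_b[l] = delta                             # dz/db is 1
        dC_da = delta * w[l]                          # passed back to the previous layer

    # Gradient descent: step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(costs[0], costs[-1])
```

The inner loop reuses `dC_da` from the layer after it, which is the reason the derivative with respect to the previous activation is computed at all.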

Now, we do this for layers with more than one node. If we use $j$ as the index for nodes in layer $L$ and $k$ as the index for nodes in layer $L-1$,
$$C = \sum_{j=0}^{n_L - 1} (a_j^{(L)} - y_j)^2$$
$$\frac{\partial C}{\partial w_{jk}^{(L)}} = \frac{\partial C}{\partial a_{j}^{(L)}}\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}\frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}}$$
$$\frac{\partial C}{\partial a_k^{(L-1)}} = \sum_{j=0}^{n_L - 1}\frac{\partial C}{\partial a_j^{(L)}}\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}\frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}}$$
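
For layers with several nodes, these sums can be written as matrix operations. The sketch below is one way to do that in NumPy, assuming $z_j^{(L)} = \sum_k w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)}$ and a logistic sigmoid for $\sigma$. The layer sizes and random values are hypothetical, and a finite-difference check is included for the gradient with respect to the previous activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, a_prev, y):
    # C = sum over j of (a_j - y_j)^2
    return np.sum((sigmoid(W @ a_prev + b) - y) ** 2)

rng = np.random.default_rng(0)
n_prev, n_L = 4, 3                  # hypothetical layer sizes
a_prev = rng.random(n_prev)         # activations of layer L-1, indexed by k
W = rng.normal(size=(n_L, n_prev))  # W[j, k] connects node k (layer L-1) to node j (layer L)
b = rng.normal(size=n_L)
y = rng.random(n_L)

a = sigmoid(W @ a_prev + b)
delta = 2.0 * (a - y) * a * (1.0 - a)   # dC/dz_j for every node j in layer L

grad_W = np.outer(delta, a_prev)   # dC/dw_jk = delta_j * a_k
grad_b = delta                     # dC/db_j  = delta_j
grad_a_prev = W.T @ delta          # dC/da_k  = sum over j of delta_j * w_jk

# Central finite differences for dC/da_k as a sanity check.
eps = 1e-6
numeric_grad_a_prev = np.zeros(n_prev)
for k in range(n_prev):
    step = np.zeros(n_prev)
    step[k] = eps
    numeric_grad_a_prev[k] = (cost(W, b, a_prev + step, y)
                              - cost(W, b, a_prev - step, y)) / (2 * eps)

print(grad_a_prev)
print(numeric_grad_a_prev)
```

The product `W.T @ delta` performs the sum over $j$: node $k$ in layer $L-1$ feeds every node in layer $L$, so its derivative collects one contribution from each of them.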