<h1> Neural Networks </h1>
<h3> Based on <a href="https://www.3blue1brown.com/lessons/neural-networks"> this tutorial </a> </h3>
<h2> Neural Network Intuition </h2>
<p> Neural networks are algorithms loosely inspired by the structure of the brain, designed to learn on their own how to solve problems that are hard to solve with classical logic-based algorithms. They are trained on a set of training data representative of the problem they will encounter, so that they can provide predictions (what will happen next given these conditions?) or classifications (which category does this data fall into?). An example of a problem that is extremely hard to solve with classical computing techniques but much easier for neural networks is recognizing hand-drawn digits. There are many different types of neural networks, but the one I will be describing in this guide is a "simple" vanilla neural network called a multilayer perceptron.</p>
<p> A neural network is made up of layers of neurons, which hold numbers. The number a neuron holds is called its "activation". The first layer is the input layer. When we feed an image of a handwritten digit to our input layer, it is normalized (so that each pixel has a value between 0.0 and 1.0, like our activation numbers) and reshaped so that there is one input neuron per pixel.</p>
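<p> As a minimal sketch of that normalize-and-reshape step: the tiny 3x3 image, the 0-255 pixel range, and the helper name <code>to_input_layer</code> are all assumptions made for illustration, not something the tutorial specifies.</p>

```python
# Hypothetical 3x3 grayscale "image"; real digit images are larger.
# Assumption: raw pixel values are integers from 0 (black) to 255 (white).
image = [
    [0, 128, 255],
    [64, 255, 32],
    [0, 16, 0],
]


def to_input_layer(pixels, max_value=255):
    """Flatten a 2D grid of pixels into one list of activations in [0.0, 1.0]."""
    return [value / max_value for row in pixels for value in row]


input_activations = to_input_layer(image)
print(len(input_activations))  # 9: one input neuron per pixel
print(input_activations[:3])
```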
<p> Next we look at the output layer. The output layer gives us our prediction. For example, we could have an image classification system that sorts images into "cat" or "dog". A single output neuron could output a number between 0.0 and 1.0: a 0.0 would be 100% certainty of a dog, a 0.5 would be not sure (50% dog, 50% cat), and a 1.0 would be 100% certainty of a cat. Another way to set up output neurons for classification is to have one output neuron per category. Each neuron then gives the likelihood that the input data belongs to its category (read as probabilities, these should add up to 1.0). In the example of handwritten digits, we would have 10 neurons in the output layer, representing the digits 0-9. </p>
<p> The last type of layer I will describe in this basic description is the hidden layer. Hidden layers are the intermediate layers between the input and the output, so-called "hidden" because we do not interact with them directly. </p>
<p> Each layer is connected to the previous and the next layer through weights. These weights are what the neural network optimizes to increase the accuracy of its predictions. The weight between two neurons A, B determines how much the activation of the neuron A in the first layer influences the neuron B in the subsequent layer.</p>
<p> So how would this sort of system detect handwritten digits? Well, when a brain detects handwritten digits, it does something sort of like splitting the digits into components (or "features" as they are called in AI), and then recognizing how these components are combined to form a digit. Different neurons in the last hidden layer should theoretically roughly correspond to these features, and as we train the neural network it will get better at learning these. </p>
<img src="./imgs/features.png">
<p> But how would we recognize these "components"? Well, a component is made of subcomponents, right? Like a circle is made of curves. The neurons in the previous layer would be trained to recognize these curves, and the weights (or correlation) between the subcomponent neurons in the previous layer and the component neuron in the next layer would be high, so when the subcomponent neurons are all activated, the neuron in the last hidden layer that recognizes the component would activate. This process goes all the way backwards in big multilayer neural networks (theoretically). A similar thing happens in voice recognition. The first layer recognizes distinct sounds, the second layer combines these to make syllables, the third layer combines these to make words, then sentences (again, heuristically). The weights from all the neurons connected to a given neuron correspond to whether or not the features they represent make up the feature that the neuron in the next layer represents. A positive weight means yes, a negative weight means no. </p>
<img src="./imgs/weights.png">
<p> It is common to pass the weighted sum of the incoming activations through a function (which we call the activation function) that squashes the result into a fixed range. For the handwritten digits example, we would use the sigmoid function, which maps any real number into the range (0,1). We define it like so: </p>
$$
\sigma (x) = \frac{1}{1 + e^{-x}}
$$
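<p> The definition above translates almost directly into code. This is a small sketch using only Python's standard <code>math</code> module; the sample inputs are arbitrary.</p>

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Large negative inputs land near 0, large positive inputs land near 1.
for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 4))
```

<p> An input of exactly 0 gives 0.5, the midpoint of the curve.</p>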
<p> But maybe we want the neuron to only activate sigmoidally above a certain threshold. This is where the bias comes in. Biases are another trainable parameter, one for each neuron after the input layer, that add a constant offset to the weighted sum before the activation function is applied. It is analogous to $b$ in the equation $y = mx + b$, i.e. it allows you to shift the model you have fitted to fit better. If we set our bias to $-10$, the activation will look like this:</p>
<img src="./imgs/bias.png">
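<p> To see the effect of the bias numerically, here is a rough sketch of a single neuron. The weights and inputs are hypothetical values picked only so that the weighted sums come out to 6 and 12; with a bias of -10, only the stronger input pushes the neuron towards 1.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron_activation(inputs, weights, bias):
    """Weighted sum of the inputs, shifted by the bias, then squashed by sigmoid."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical values chosen only for illustration.
weights = [4.0, 4.0, 4.0]
weak_input = [0.5, 0.5, 0.5]    # weighted sum = 6
strong_input = [1.0, 1.0, 1.0]  # weighted sum = 12

for bias in (0.0, -10.0):
    weak = neuron_activation(weak_input, weights, bias)
    strong = neuron_activation(strong_input, weights, bias)
    print(bias, round(weak, 3), round(strong, 3))
```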
<p> Put all together this is a massively complicated system. It would be really annoying (or impossible) to have to adjust these weights and biases by hand to make the neural network accurate. Luckily there is an algorithm that can do it automatically for us, called gradient descent. </p>

<h2> Gradient Descent </h2>
<p> The activation of the first neuron in the $(n+1)$th layer is given by the following formula, assuming the previous layer has $k+1$ neurons (indexed 0 to $k$), the network is fully connected (each neuron attaches to every neuron in the previous and next layers), and the bias is left out for brevity:</p>
<br>
$$
a^{n+1}_{0} = \sigma{(w_{n,0}a^{n}_{0} + w_{n,1}a^{n}_{1} + ... + w_{n,k}a^{n}_{k})}
$$
<br>
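<p> The same formula can be applied to every neuron in the next layer at once. The sketch below assumes the weights are stored as one row per neuron in the next layer (a layout chosen here for illustration) and, like the formula, leaves out the biases.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(prev_activations, weight_rows):
    """Compute every activation in the next layer.

    weight_rows[j][i] is the weight from neuron i in the previous layer
    to neuron j in the next layer. Biases are left out, as in the formula.
    """
    return [
        sigmoid(sum(w * a for w, a in zip(row, prev_activations)))
        for row in weight_rows
    ]


# Hypothetical layer: 3 neurons feeding 2 neurons.
prev = [0.0, 0.5, 1.0]
weights = [
    [1.0, -2.0, 0.5],
    [0.0, 0.0, 0.0],
]
next_activations = layer_forward(prev, weights)
print(next_activations)
```

<p> The second neuron has all-zero weights, so its weighted sum is 0 and its activation is 0.5.</p>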
<p> To train the network, we give it labelled data. Then, to test it, we give it labelled data of the same kind that it has never seen before. Training a neural network well requires a lot of nice, clean data.</p>
<img src="./imgs/train.png">
<p> We can define a function that measures how well the neural network performs, called a "cost function". This function exists in a high-dimensional space where the dimensionality is determined by the number of trainable parameters in the neural network. Training the neural network comes down to finding "minima" on the cost landscape, corresponding to the sets of weights and biases for which the neural network makes the fewest mistakes. </p>
<img src="./imgs/minima.png">
<p>Initially, we will set all the weights and biases to random numbers. This neural network will suck and will not predict anything useful. You can use the "cost" function to compute just how bad the guesses are.</p>
<img src="./imgs/computing_cost.png">
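<p> As a rough sketch of how the cost of one guess could be computed: this assumes a squared-error cost (the form used later on this page), and the three-category outputs below are made up for illustration.</p>

```python
def example_cost(outputs, targets):
    """Sum of squared differences between the network's outputs and the targets."""
    return sum((a - y) ** 2 for a, y in zip(outputs, targets))


# Hypothetical outputs for a 3-category classifier; the correct category is index 2.
target = [0.0, 0.0, 1.0]
bad_guess = [0.9, 0.4, 0.1]
good_guess = [0.1, 0.0, 0.9]

cost_bad = example_cost(bad_guess, target)
cost_good = example_cost(good_guess, target)
print(cost_bad, cost_good)  # the bad guess has the much larger cost
```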
<p>Remember that a certain point on the cost function surface corresponds to a certain configuration of the neural network's weights and biases. In reality this will be a function with many, many dimensions. If we want to improve our place in this cost "landscape", how should we change the weights? Well, we could change them a little at a time, going in the direction of the steepest downward "slope" from the point we are at. A "slope" is easy to picture for a surface in 3D, but here we are dealing with $n \gg 3$ dimensions. This is where we can use calculus. The gradient of a function at a given point gives us the direction in which the function increases the fastest. For a function of $n$ variables, the gradient looks like this:</p>
<br>
$$
\nabla C = \left(\frac{\partial C}{\partial x_{1}},\frac{\partial C}{\partial x_{2}},\dots,\frac{\partial C}{\partial x_{n}}\right)
$$
<br>
<p>
Now we want the direction of steepest <em>descent</em>, which will be the negative of that direction. So to minimize the cost function from a given point, we simply have to move a little in the following direction:</p>
$$
- \nabla C
$$
<img src="./imgs/grad_desc.png">
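<p> Here is a minimal sketch of gradient descent on a toy two-variable cost rather than a real network. The bowl-shaped function, the step size (<code>learning_rate</code>) and the number of steps are all assumptions chosen for illustration; the point is only the update rule of repeatedly moving a little in the direction $-\nabla C$.</p>

```python
def cost(x, y):
    """A toy bowl-shaped cost with its minimum at (3, -1)."""
    return (x - 3.0) ** 2 + (y + 1.0) ** 2


def gradient(x, y):
    """Partial derivatives of the toy cost with respect to x and y."""
    return 2.0 * (x - 3.0), 2.0 * (y + 1.0)


def gradient_descent(x, y, learning_rate=0.1, steps=100):
    """Repeatedly take a small step against the gradient."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


x_min, y_min = gradient_descent(0.0, 0.0)
print(round(x_min, 4), round(y_min, 4))  # close to the minimum at (3, -1)
```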
<p>To sum it up, here's a good quote from the 3Blue1Brown article linked at the top of the page: "When we refer to a network “learning”, we mean changing the weights and biases to minimize a cost function, which improves the network's performance on the training data."</p>
<p> Now the cost function depends on the weights and biases of the model, so when we calculate the gradient, these are the variables that we take partial derivatives with respect to.</p>
<br>
$$
\nabla C=(\frac{\partial C}{\partial w^{1}},\frac{\partial C}{\partial b^{1}},...,\frac{\partial C}{\partial w^{L}},\frac{\partial C}{\partial b^{L}})
$$
<br>

<h2> Backpropagation: How the gradient is calculated. </h2>
<p> So how do we calculate the partial derivatives that make up this gradient? Enter backpropagation, which is basically just the chain rule. </p>
<p> Take a simple neural network as a base example: </p>
<img src="./imgs/simple_network.png">
<p>Then we can define the cost as the squared difference between the desired output and the output of the corresponding output layer neuron (here the only output layer neuron).</p>
<img src="./imgs/output_layer_cost.png">
<p>We can write down an equation for this final activation</p>
<br>
$$
a^{L} = \sigma({w^{L}a^{L-1}+b^{L}})
$$
<p>Here $w^{L}$ is the weight from the previous neuron to the last one, $a^{L-1}$ is the activation of the previous neuron, and $b^{L}$ is the bias of the final neuron. These are superscripts, NOT exponents! We can simplify this a bit by introducing a variable:</p>
<br>
$$
z^{L} =w^{L}a^{L-1}+b^{L}
$$
<br>
$$
a^{L} = \sigma(z^{L})
$$
<br>
<p>Now we can start to calculate </p> 
<br>
$$
\nabla C=(\frac{\partial C}{\partial w^{1}},\frac{\partial C}{\partial b^{1}},...,\frac{\partial C}{\partial w^{L}},\frac{\partial C}{\partial b^{L}})
$$
<br>
<p> We can use this dependency diagram and the chain rule for dependent variables to calculate the derivative with respect to $w^{L}$ as an example. </p>
<p>
<img src="./imgs/diffs.png">
<br>
$$
\frac{\partial C_{0}}{\partial w^{L}} = \frac{\partial z^{L}}{\partial w^{L}}\frac{\partial a^{L}}{\partial z^{L}}\frac{\partial C_{0}}{\partial a^{L}}
$$
<br>
$$
z^{L} =w^{L}a^{L-1}+b^{L} \implies \frac{\partial z^{L}}{\partial w^{L}} = a^{L-1}
$$
<br>
$$
a^{L} = \sigma(z^{L}) \implies \frac{\partial a^{L}}{\partial z^{L}} = \sigma'(z^{L})
$$
<br>
$$
C_{0} = (a^{L} - y)^{2} \implies \frac{\partial C_{0}}{\partial a^{L}} = 2(a^{L} - y)
$$
<br>
$$
\frac{\partial C_{0}}{\partial w^{L}}=2a^{L-1} \sigma'(z^{L})(a^{L} - y)
$$
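<p> One way to build confidence in this result is to compare it against a numerical estimate of the slope. The sketch below does that for arbitrary, made-up values of $w^{L}$, $b^{L}$, $a^{L-1}$ and $y$, and relies on the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which is not derived on this page.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def cost(w, b, a_prev, y):
    """C0 = (a - y)^2 with a = sigmoid(w * a_prev + b)."""
    return (sigmoid(w * a_prev + b) - y) ** 2


def dcost_dw(w, b, a_prev, y):
    """Chain-rule result: 2 * a_prev * sigma'(z) * (a - y)."""
    z = w * a_prev + b
    a = sigmoid(z)
    return 2.0 * a_prev * sigmoid_prime(z) * (a - y)


# Hypothetical values for one training example.
w, b, a_prev, y = 0.7, -0.3, 0.6, 1.0
eps = 1e-6
numeric = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)
analytic = dcost_dw(w, b, a_prev, y)
print(analytic, numeric)  # the two values should agree closely
```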

<p> Now note that this was just the cost for one piece of training data. The full cost function for the network is the average over all of the training samples.</p>
$$
C=\frac{1}{n}\sum_{k=0}^{n-1} C_{k}
$$
So the derivative of total C with respect to a weight will also be an average:
$$
\frac{\partial C}{\partial w^{L}}=\frac{1}{n}\sum_{k=0}^{n-1} \frac{\partial C_{k}}{\partial w^{L}}
$$
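<p> In code, this averaging is just a mean over the per-sample derivatives. The sketch below assumes the single-neuron setup from above and a made-up training set of (previous activation, desired output) pairs; the function names are hypothetical.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_dcost_dw(w, b, a_prev, y):
    """dC_k/dw for one training sample: 2 * a_prev * sigma'(z) * (a - y)."""
    z = w * a_prev + b
    a = sigmoid(z)
    return 2.0 * a_prev * a * (1.0 - a) * (a - y)  # sigma'(z) = a(1 - a)


def average_dcost_dw(w, b, samples):
    """Average the per-sample derivatives to get dC/dw for the whole set."""
    total = sum(sample_dcost_dw(w, b, a_prev, y) for a_prev, y in samples)
    return total / len(samples)


# Hypothetical training set of (previous activation, desired output) pairs.
samples = [(0.2, 0.0), (0.9, 1.0), (0.5, 1.0), (0.1, 0.0)]
print(average_dcost_dw(0.4, -0.1, samples))
```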
<p> Now that we have the derivative with respect to $w^{L}$, we can follow a similar process to calculate the derivatives for the other entries of the gradient. </p>
<p> Starting with the derivative for the bias of the last neuron, $\large{\frac{\partial C_{0}}{\partial b^{L}}}$, we use the derivative dependence tree again:</p>
<br>
$$
\frac{\partial C_{0}}{\partial b^{L}} = \frac{\partial z^{L}}{\partial b^{L}}\frac{\partial a^{L}}{\partial z^{L}}\frac{\partial C_{0}}{\partial a^{L}}
$$
<br>
$$
z^{L} =w^{L}a^{L-1}+b^{L} \implies \frac{\partial z^{L}}{\partial b^{L}} = 1
$$
<br>
$$
a^{L} = \sigma(z^{L}) \implies \frac{\partial a^{L}}{\partial z^{L}} = \sigma'(z^{L})
$$
<br>
$$
C_{0} = (a^{L} - y)^{2} \implies \frac{\partial C_{0}}{\partial a^{L}} = 2(a^{L} - y)
$$
<br>
$$
\frac{\partial C_{0}}{\partial b^{L}}=2\sigma'(z^{L})(a^{L} - y)
$$
All other weights and biases are earlier on in the network, meaning that their effect on $C_{0}$ is less direct. Let's figure out how the previous neuron's activation changes the cost by using our tree diagram to calculate $\large{\frac{\partial C_{0}}{\partial a^{L-1}}}$
<br>
$$
\frac{\partial C_{0}}{\partial a^{L-1}} = \frac{\partial z^{L}}{\partial a^{L-1}} \frac{\partial a^{L}}{\partial z^{L}} \frac{\partial C_{0}}{\partial a^{L}}
$$
<br>
$$
z^{L} = w^{L}a^{L-1} + b^{L} \implies \frac{\partial z^{L}}{\partial a^{L-1}} = w^{L}
$$
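<p> Combining this with the other two factors, which were already worked out for the weight derivative, the full expression would be:</p>
$$
\frac{\partial C_{0}}{\partial a^{L-1}} = 2 w^{L} \sigma'(z^{L})(a^{L} - y)
$$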
<p> This makes sense: how strongly the penultimate neuron's activation influences the last neuron is proportional to the weight connecting them. On its own this isn't super helpful, because we cannot set an activation directly. However, we can change the weights and biases feeding this neuron and calculate how these change the cost function; we just have to add a layer to our tree diagram.</p>
<img src="./imgs/two_diffs.png">
<p> This is where the idea of "backpropagation" comes in. To calculate the derivatives for weights and biases earlier in the network, we must "backpropagate" through the derivative dependence tree. It would look like this: </p>
<img src="./imgs/big_chain.png">
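<p> The backwards walk through the tree can be sketched in code for a chain with one neuron per layer. This is an illustrative sketch, not a definitive implementation: the function names and the three-layer example values are assumptions, and it uses the standard identity $\sigma'(z) = a(1 - a)$ where $a = \sigma(z)$.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, weights, biases):
    """Return the activations of every layer, starting with the input x."""
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w * activations[-1] + b))
    return activations


def backprop(x, y, weights, biases):
    """Gradient of C0 = (a_L - y)^2 for a chain with one neuron per layer."""
    activations = forward(x, weights, biases)
    grad_w = [0.0] * len(weights)
    grad_b = [0.0] * len(biases)
    # Start at the output: dC0/da_L = 2(a_L - y).
    dcost_da = 2.0 * (activations[-1] - y)
    for layer in reversed(range(len(weights))):
        a = activations[layer + 1]
        a_prev = activations[layer]
        dcost_dz = a * (1.0 - a) * dcost_da   # multiply by sigma'(z) = a(1 - a)
        grad_w[layer] = a_prev * dcost_dz     # dz/dw = a_prev
        grad_b[layer] = dcost_dz              # dz/db = 1
        dcost_da = weights[layer] * dcost_dz  # dz/da_prev = w; pass back one layer
    return grad_w, grad_b


# Hypothetical three-layer chain and a single training example.
weights = [0.5, -1.2, 0.8]
biases = [0.1, 0.0, -0.4]
grad_w, grad_b = backprop(0.9, 1.0, weights, biases)
print(grad_w)
print(grad_b)
```

<p> Each pass through the loop reuses the derivative handed back from the layer after it, which is why the work is done from the output towards the input.</p>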
<h2> More Complicated Networks </h2>
<p> The technique we used on the "simple" network with one neuron per layer can be extended to layers with multiple neurons. We just need to include an additional subscript to indicate which neuron in the layer we mean. </p>
<img src="./imgs/multi_neuron.png">
<p> Use a "sum-of-squares" method to calculate the cost function for an output layer with multiple neurons / predictions. </p>
<img src="./imgs/big_cost.png">
<p> Furthermore, call the weight connecting the kth neuron in the (L-1)th layer to the jth neuron in the Lth layer $w_{jk}^{L}$. </p>
<img src="./imgs/multi_weights.png">
<p> The weighted sum for the jth neuron in the Lth layer would then take the following form (here for a previous layer with three neurons): </p>
$$
z_{j}^{L} = w_{j0}^{L} a_{0}^{L-1} + w_{j1}^{L} a_{1}^{L-1} + w_{j2}^{L} a_{2}^{L-1} + b_{j}^{L}
$$
<p> The chain rule expression for the sensitivity of the cost function to a particular weight takes a similar form. </p>
$$
\frac{\partial C_{0}}{\partial w_{jk}^{L}} = \frac{\partial z_{j}^{L}}{\partial w_{jk}^{L}} \frac{\partial a_{j}^{L}}{\partial z_{j}^{L}} \frac{\partial C_{0}}{\partial a_{j}^{L}}
$$
<p> Something that does change is the derivative of the cost with respect to the activations in an earlier layer, because the activation of one neuron influences the activations of all the neurons in the next layer. We account for this by summing the contributions over every neuron in the next layer (as this is the layer the neuron is connected to). That's all there is to it. We can recursively repeat this process for every weight and bias in the neural net. </p>
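<p> Written out in the notation above, and assuming $n_{L}$ (a symbol introduced here for illustration) is the number of neurons in layer $L$, that sum would look like this:</p>
$$
\frac{\partial C_{0}}{\partial a_{k}^{L-1}} = \sum_{j=0}^{n_{L}-1} \frac{\partial z_{j}^{L}}{\partial a_{k}^{L-1}} \frac{\partial a_{j}^{L}}{\partial z_{j}^{L}} \frac{\partial C_{0}}{\partial a_{j}^{L}} = \sum_{j=0}^{n_{L}-1} w_{jk}^{L} \sigma'(z_{j}^{L}) \frac{\partial C_{0}}{\partial a_{j}^{L}}
$$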
<img src="./imgs/multi_weight.png">
<img src="./imgs/in_a_nutshell.png">
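<p> Putting the pieces together, here is a rough sketch of backpropagation for a small fully connected network. It is an illustration under stated assumptions rather than a definitive implementation: the 2-3-2 layout, the hand-picked parameters, and the function names are all hypothetical, the cost is the sum-of-squares form, and $\sigma'(z) = a(1 - a)$ is used for the sigmoid's derivative.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, layers):
    """layers is a list of (weights, biases); weights[j][k] links neuron k
    of the previous layer to neuron j of this layer."""
    activations = [inputs]
    for weights, biases in layers:
        prev = activations[-1]
        activations.append([
            sigmoid(sum(w * a for w, a in zip(row, prev)) + b)
            for row, b in zip(weights, biases)
        ])
    return activations


def backprop(inputs, targets, layers):
    """Return one (grad_weights, grad_biases) pair per layer for a single sample."""
    activations = forward(inputs, layers)
    # Sum-of-squares cost: dC0/da_j = 2(a_j - y_j) at the output layer.
    dcost_da = [2.0 * (a - y) for a, y in zip(activations[-1], targets)]
    grads = []
    for index in reversed(range(len(layers))):
        weights, _ = layers[index]
        prev, out = activations[index], activations[index + 1]
        # Multiply by sigma'(z_j) = a_j (1 - a_j) to get dC0/dz_j.
        dcost_dz = [a * (1.0 - a) * d for a, d in zip(out, dcost_da)]
        grad_w = [[dz * a_k for a_k in prev] for dz in dcost_dz]
        grad_b = list(dcost_dz)
        grads.append((grad_w, grad_b))
        # Each previous neuron k feeds every neuron j, so sum over j.
        dcost_da = [
            sum(weights[j][k] * dcost_dz[j] for j in range(len(weights)))
            for k in range(len(prev))
        ]
    grads.reverse()
    return grads


# Hypothetical 2-3-2 network with hand-picked parameters.
layers = [
    ([[0.2, -0.5], [0.7, 0.1], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.5, -0.6, 0.3], [-0.2, 0.4, 0.9]], [0.05, -0.05]),
]
grads = backprop([0.6, 0.9], [1.0, 0.0], layers)
print(grads[0][0])  # gradient for the first layer's weights
```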