# Artificial Neural Networks

Artificial Neural Networks (ANNs) are a MASSIVE topic and I will never be able to cover all of it, so this is a trimmed-down version that covers enough to give you some background and make Deep Q-Learning make sense in the next section.  

A basic overview is that an artificial neural network is loosely based on the biological neural network in our brains. Our brains are built from a very complex web of interconnected neurons. To get an idea of the scope, the human brain is estimated to contain about $10^{11}$ neurons, each connected to about $10^4$ other neurons. The fastest neuron switching times are known to be around $10^{-3}$ seconds, while computers have switching speeds of around $10^{-10}$ seconds. Even so, humans are able to make decisions extremely quickly. For example, it takes about $10^{-1}$ seconds to visually recognize your mother. At $10^{-3}$ seconds per step, that leaves room for no more than a few hundred sequential neuron firings, which has led many to speculate that the process is highly parallel. One difference to keep in mind: while a neuron in a biological neural network outputs a complex time series of spikes, a unit in an ANN returns a single constant value.

**Function Approximation**  
As stated in the second notebook, one way to scale up to larger problems is function approximation. Function approximation means selecting a function that matches your target as closely as possible. An ANN can be used as a function approximator. 

**Perceptron**  
A perceptron is a unit that takes in real-valued inputs (think the observations from CartPole), calculates a linear combination of them, and outputs a value. As a simple example, imagine you have 3 inputs. Those 3 inputs are multiplied by 3 weights, and the sum of those products is returned as the output. It will look something like this:  
$$\sum_{i=1}^{3} X_i * W_i$$  
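
Here is a minimal sketch of that weighted sum in Python. The function name `perceptron_output` and the input and weight values are made up for illustration only.

```python
def perceptron_output(inputs, weights):
    """Return the linear combination sum(X_i * W_i) of inputs and weights."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


# Hypothetical values, chosen only to show the arithmetic.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]

# 1.0 * 0.5 + 2.0 * -1.0 + 3.0 * 0.25
weighted_sum = perceptron_output(inputs, weights)
print(weighted_sum)
```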

**Perceptron Rule**  
The perceptron rule alters the weights ($W_i$) until they create a threshold that separates the TRUE values from the FALSE values. Once we find the weights, we can draw a line that splits the graph into two half planes, one for each class. Data that can be split this way is called linearly separable. It has been proven that, when the data is linearly separable, the weights can be found in a finite number of iterations (good news).

**Weight Update Walkthrough**  
To train the perceptron you keep altering your value of $W$ until you have a satisfactory threshold. A quick note: I am going to ignore the learning rate to make this easier, but it would normally be part of the $\Delta W_i$ equation.  

To update the weights: $$W_i = W_i + \Delta W_i$$
To find $\Delta W_i$: $$\Delta W_i = (y - y')X_i$$  
where $y$ is your target,  
$y'$ is your output, determined by: $$y' = \left( \sum_i W_i X_i \geq 0 \right)$$
and $X_i$ is the input value.  

Using binary:  
  
| $y$ | $y'$ | $y - y'$ |
|----------|----------|----------|
| 0 | 0 | 0 |
| 0 | 1 | -1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
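
The update rule above can be sketched in a few lines of Python. This is only a rough illustration, not a reference implementation: the function names are my own, the learning rate is ignored (as in the walkthrough), and I am assuming a constant first input of 1 so that one of the weights can act as the threshold.

```python
def predict(weights, inputs):
    """y' = 1 if the weighted sum is >= 0, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0


def train_perceptron(samples, n_inputs, max_epochs=100):
    """Apply W_i = W_i + (y - y') * X_i until a full pass makes no mistakes."""
    weights = [0.0] * n_inputs
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, inputs)  # -1, 0, or 1
            if error != 0:
                mistakes += 1
                weights = [w + error * x for w, x in zip(weights, inputs)]
        if mistakes == 0:
            break
    return weights


# AND data. The leading 1 in each input is the assumed constant (bias) input.
and_samples = [
    ([1, 0, 0], 0),
    ([1, 1, 0], 0),
    ([1, 0, 1], 0),
    ([1, 1, 1], 1),
]

weights = train_perceptron(and_samples, n_inputs=3)
predictions = [predict(weights, inputs) for inputs, _ in and_samples]
print(weights)
print(predictions)
```

Notice that the weights only change when $y - y'$ is -1 or 1, which matches the two middle rows of the table.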



**Example [AND]**  
To start, I am going to cover a VERY simple demo in order to understand how a neuron works. In this demo, I am going to work through how the system would determine whether your 2 variables satisfy AND, that is, whether $X_1$ = 1 and $X_2$ = 1. We do this by trying to create a graph that is linearly separable, meaning you can draw a line between the TRUE values and the FALSE values.  

Using this formula: $\sum X_i * W_i = \theta$, where $X$ is your variable, $W$ is your weight, and $\theta$ is your threshold. Your $X$ values are either 0 or 1 and your weights can be anything. In our example we will have $X_1 * W_1$ and $X_2 * W_2$ and some $\theta$ value that we will use as a threshold to determine whether the result is TRUE or FALSE.  

Our possible X values are {0,0}, {1,0}, {0,1}, and {1,1}  
We need to set the weights ($W_1$ and $W_2$) and threshold ($\theta$) so that we will only get TRUE for the pair that passes AND ({1,1}).  
For this problem we will set the threshold at 0.75.

Here are the equations:  
$0 * W_1 + 0 * W_2 < 0.75$ (FALSE)  
$1 * W_1 + 0 * W_2 < 0.75$ (FALSE)  
$0 * W_1 + 1 * W_2 < 0.75$ (FALSE)  
$1 * W_1 + 1 * W_2 > 0.75$ (TRUE)  

We can solve for this and find that $W_1$ = 0.5 and $W_2$ = 0.5 will satisfy all the equations. So, our weights for this would be 0.5 and 0.5.  
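
If you want to double check those weights, here is a quick sketch that plugs every input pair into the formula. The helper name `fires` is hypothetical. The weights and threshold are the ones from this example.

```python
def fires(inputs, weights, threshold):
    """Return True when the weighted sum is above the threshold."""
    return sum(x * w for x, w in zip(inputs, weights)) > threshold


weights = [0.5, 0.5]
threshold = 0.75

# Evaluate all four possible input pairs for AND.
results = {
    pair: fires(pair, weights, threshold)
    for pair in [(0, 0), (1, 0), (0, 1), (1, 1)]
}

for pair, is_true in results.items():
    print(pair, "TRUE" if is_true else "FALSE")
```

Only the pair (1, 1) should come out TRUE, since 0.5 + 0.5 = 1.0 is the only sum above 0.75.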

Let's put this on a graph. We want to graph each of the points and then draw the line that separates the TRUE and FALSE sections of the graph.  

First, find where the line crosses each axis:  
$X_1$ axis: we set $X_2$ to 0 and solve $X_1 * 0.5 + 0 * 0.5 = 0.75$, which gives $X_1 = 1.5$  
$X_2$ axis: we set $X_1$ to 0 and solve $0 * 0.5 + X_2 * 0.5 = 0.75$, which gives $X_2 = 1.5$  

Second, connect the points with a line to show the half plane  

This will show that anything above the line is TRUE and anything below it is FALSE. If you plot {0,0}, {1,0}, and {0,1} you can clearly see that they are on the green side, whereas {1,1} is on the blue side.  
![AND Graph](Graph-And.png)  

You can try to do the same with OR, NOT, and XOR (spoiler: XOR is not linearly separable, so no single line will work), but since we don't use the perceptron rule in ANNs I won't cover them here. There is plenty of information out there about it.

**Stochastic Gradient Descent**  
Since this can get pretty deep in math pretty quickly, I will keep this at a high level and let you go out and get as deep into it as you want.  

Stochastic gradient descent (SGD) is more robust than the perceptron rule because it does not depend on the data being cleanly separable by a straight line (threshold). Instead of looking for TRUE/FALSE, you are trying to make the output match your answer as closely as possible. Just know that gradient descent converges to a local optimum (given a suitably small learning rate), and if you actually want to learn anything about machine learning or reinforcement learning you should give yourself some time and dig into SGD.  

**Perceptron vs Stochastic Gradient Descent**  
Just to show the difference, I am going to compare the 2 learning rules, this time including the learning rate $\gamma$.  
Perceptron: $$\Delta W_i = \gamma(y-y')X_i$$  
Stochastic Gradient Descent: $$\Delta W_i = \gamma(y-a)X_i$$ where $$a = \sum X_iW_i$$  
With the perceptron rule you compare against the thresholded output $y'$ (1 or 0) because you are trying to linearly separate the data. With SGD you compare against the raw weighted sum $a$ because you are trying to minimize the error. Also, if you care about the math, $y'$ is not differentiable, while $a$ is, which is what gradient descent needs.
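
To make the comparison concrete, here is a small sketch of both updates for a single training sample. The function names and the numbers are made up for illustration, and I am assuming the same $y' = (\sum W_i X_i \geq 0)$ threshold as in the walkthrough above.

```python
def weighted_sum(weights, inputs):
    """a = sum(X_i * W_i)"""
    return sum(w * x for w, x in zip(weights, inputs))


def perceptron_delta(weights, inputs, target, learning_rate):
    """Delta W_i = gamma * (y - y') * X_i, with y' the thresholded output."""
    y_pred = 1 if weighted_sum(weights, inputs) >= 0 else 0
    return [learning_rate * (target - y_pred) * x for x in inputs]


def sgd_delta(weights, inputs, target, learning_rate):
    """Delta W_i = gamma * (y - a) * X_i, with a the raw weighted sum."""
    a = weighted_sum(weights, inputs)
    return [learning_rate * (target - a) * x for x in inputs]


# Hypothetical sample: the weighted sum is 0.5 and the target is 1.
weights = [0.25, 0.25]
inputs = [1.0, 1.0]
target = 1
learning_rate = 0.5

p_delta = perceptron_delta(weights, inputs, target, learning_rate)
s_delta = sgd_delta(weights, inputs, target, learning_rate)
print("perceptron:", p_delta)
print("sgd:", s_delta)
```

In this sample the thresholded output already matches the target, so the perceptron rule leaves the weights alone. The SGD rule still nudges them because the raw sum (0.5) is not yet equal to the target (1).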

**Artificial Neural Network Layers**

Here is a diagram from Wikipedia: ![Image of ANN Layer](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/200px-Colored_neural_network.svg.png)  

With an ANN you have an input layer (in CartPole it would be the 4 observations) that connects to your hidden layer. There can be multiple hidden layers, but the more layers you add, the slower the training and the higher the chance of overfitting your training data. After all the hidden layer processing you have the output layer (in CartPole this would be your Left/Right action). 

One thing to notice is the connections between neurons. It isn't 1 to 1 across layers. Each neuron will connect with each neuron in the next layer.
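
Here is a rough sketch of what that full connectivity looks like in plain Python. Everything specific in it is an assumption on my part: the 3 hidden neurons, the random starting weights, the sigmoid activation, the made-up observation values, and the convention that index 0 means Left and index 1 means Right.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def random_layer(n_in, n_out, rng):
    """One weight per (input, neuron) pair: every input reaches every neuron."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense_layer(inputs, weights, biases, activation):
    """Each neuron takes a weighted sum over ALL inputs, then applies activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


rng = random.Random(0)
hidden_w, hidden_b = random_layer(4, 3, rng)  # 4 observations -> 3 hidden neurons
out_w, out_b = random_layer(3, 2, rng)        # 3 hidden neurons -> 2 actions

observation = [0.02, -0.01, 0.03, 0.04]  # made-up CartPole-style observation
hidden = dense_layer(observation, hidden_w, hidden_b, sigmoid)
scores = dense_layer(hidden, out_w, out_b, identity)
action = scores.index(max(scores))  # assumed: 0 = Left, 1 = Right
print(hidden, scores, action)
```

With 4 inputs and 3 hidden neurons you get 4 x 3 = 12 connections in the first layer, which is the "each neuron connects with each neuron" idea from the diagram.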

**Backpropagation**  
Backpropagation is the process used to update the weights of the neurons inside the ANN. During training, you compare your output with the correct answer and assign an error. That error is then distributed back through each of the layers, and the weights of each neuron are adjusted in proportion to how much they contributed to it. With enough training, you will start to approach a good model (though not necessarily the optimal one).
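
To give a feel for how the error gets pushed backwards, here is a tiny sketch with 2 inputs, 2 hidden neurons, and 1 output. It is a toy under several assumptions of mine (sigmoid activations, a squared error, no bias terms, hand-picked starting weights, and a single training sample), not how the later notebooks do it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def backprop_step(x, target, w_hidden, w_out, learning_rate):
    hidden, output = forward(x, w_hidden, w_out)
    loss = 0.5 * (target - output) ** 2

    # Error signal at the output neuron (chain rule through the sigmoid).
    delta_out = (output - target) * output * (1.0 - output)
    # Share of that error assigned to each hidden neuron via its outgoing weight.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]

    # Move every weight a small step against its gradient.
    new_w_out = [w - learning_rate * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [
        [w - learning_rate * d * xi for w, xi in zip(row, x)]
        for row, d in zip(w_hidden, delta_hidden)
    ]
    return new_w_hidden, new_w_out, loss


# Hand-picked starting point, for illustration only.
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.2, -0.1]
x, target = [1.0, 0.0], 1.0

losses = []
for _ in range(50):
    w_hidden, w_out, loss = backprop_step(x, target, w_hidden, w_out, learning_rate=0.5)
    losses.append(loss)

print(losses[0], losses[-1])
```

The error recorded on the last step should be smaller than on the first, which is the whole point: each pass sends the error back and adjusts the weights a little.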

I don't have any full demos in this section because we will cover this in the next 2 notebooks.  

The internet is your friend if you want to go crazy and dig into backpropagation, gradient descent, etc. Hopefully, when you are all done you can at least understand enough to hold your own and do your own experiments.