# Artificial Neural Networks: Learning

## The Learning Process

When initializing the _Neural Network_ model, the weights are assigned small random values close to 0, because the model does not yet know how the input variables affect the output. Once initialized, each neuron learns by:

* Calculating the prediction value $\hat y$ using an assigned activation function. This is known as __Forward Propagation__.

<img src="cost.png" width="700px;" alt="Flow from input layer to output layer." />

* Comparing the prediction value $\hat y$ to the actual value $y$ and calculating the difference through a cost function.

$$\large C = \frac{(\hat y - y)^2}{2} $$

* Using the cost function to modify the synapse weights so that the cost value decreases. This is known as __Backpropagation__. The neurons that reduce the cost value will have a greater weighted synapse, while the neurons that increase the cost value will have a lower weighted synapse.

<img src="cost-back.png" width="700px;" alt="Backpropagation" />
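
The three steps above can be sketched for a single neuron. This is a minimal illustration, not a full network: the sigmoid activation, the learning rate of 0.5, and the example inputs and target are all assumptions chosen for the sketch, since the text does not fix any of them.

```python
import math
import random


def sigmoid(z):
    """Activation function (assumed here; the text does not name one)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, inputs):
    """Forward propagation: weighted sum of the inputs passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(z)


def cost(y_hat, y):
    """Squared-error cost C = (y_hat - y)^2 / 2."""
    return (y_hat - y) ** 2 / 2


def backward(weights, inputs, y, learning_rate=0.5):
    """One backpropagation step: move each weight against its cost gradient."""
    y_hat = forward(weights, inputs)
    # Chain rule: dC/dw_i = (y_hat - y) * sigmoid'(z) * x_i,
    # where sigmoid'(z) = y_hat * (1 - y_hat).
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)
    return [w - learning_rate * delta * x for w, x in zip(weights, inputs)]


random.seed(0)
inputs, target = [1.0, 0.5], 1.0  # hypothetical datapoint
weights = [random.uniform(-0.1, 0.1) for _ in inputs]  # small random values close to 0

cost_before = cost(forward(weights, inputs), target)
weights = backward(weights, inputs, target)
cost_after = cost(forward(weights, inputs), target)
print(cost_before, cost_after)  # the cost is lower after the update
```

A real network repeats this update across many neurons and layers, but the pattern of predict, measure, and adjust is the same.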

The model will perform the steps above for $n$ epochs to reduce the value of the cost function. One epoch is one complete presentation of the training data, meaning the model has learned from the entire training dataset once. Reaching a cost of 0 is not necessarily the end goal, as overfitting may occur.

## Gradient Descent

<img src="gradientDescent.png" width="600px;" alt="Moving in the direction of descent to minimize the cost function." />

_Gradient Descent_ is an optimization algorithm used to minimize the value of some function by repeatedly moving in the direction of "steepest descent". It is an efficient way to find better weight values in a neural network and minimize the value of the cost function, especially because a brute-force search over every combination of weights would take years to compute for most _Neural Network_ models with today's computational power.
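
As a rough illustration of "moving in the direction of steepest descent", the sketch below minimizes a hypothetical one-variable cost $C(w) = (w - 3)^2$. The function name `gradient_descent`, the learning rate, and the step count are assumptions made for this example only.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    w = start
    for _ in range(steps):
        # The gradient points uphill, so subtracting it moves downhill.
        w -= learning_rate * gradient(w)
    return w


# Hypothetical convex cost C(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(w_min)  # very close to 3, the minimum of the cost
```

The learning rate controls the step size: too small and convergence is slow, too large and the steps can overshoot the minimum.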

### Batch Gradient Descent

<img src="batch.png" width="700px;" alt="Entire dataset is used to calculate the cost value." />

_Batch Gradient Descent_ is an application of gradient descent where the cost function is calculated over the entire training dataset, so the weights are updated once per epoch. All of the weights are then adjusted based on this aggregate cost value. It is effective when dealing with convex cost functions like the one seen above and is a deterministic algorithm: the same starting weights and data always produce the same result. The cost function for batch gradient descent is:

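$$\large C = \sum_{d = 1}^{D} \frac{(\hat y_d - y_d)^2}{2}, \qquad \hat y_d = \phi\left(\sum_{i = 1}^{I} (x_{d,i} \times w_i)\right)$$

__Where:__

* __C__: Cost Value
* __d__: Current Datapoint
* __D__: Total # of Datapoints
* __ŷ<sub>d</sub>__: Prediction Value for Datapoint d
* __y<sub>d</sub>__: Actual Value for Datapoint d
* __φ__: Activation Function
* __i__: Current Input
* __I__: Total # of Inputs
* __x<sub>d,i</sub>__: Input Value i of Datapoint d
* __w<sub>i</sub>__: Corresponding Weight Value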
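
A minimal sketch of batch gradient descent follows, assuming a single neuron with an identity activation (so the cost stays convex) and the squared-error cost from earlier in this document summed over every datapoint. The toy dataset, learning rate, and function names are hypothetical.

```python
def predict(weights, x):
    """Weighted sum of the inputs (identity activation assumed)."""
    return sum(w * xi for w, xi in zip(weights, x))


def batch_cost(weights, data):
    """Squared-error cost summed over every datapoint."""
    return sum((predict(weights, x) - y) ** 2 / 2 for x, y in data)


def batch_gradient_descent(data, weights, learning_rate=0.5, epochs=200):
    """Update the weights once per epoch using the whole dataset."""
    weights = list(weights)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in data:  # accumulate the gradient over all datapoints
            error = predict(weights, x) - y
            for i, xi in enumerate(x):
                grads[i] += error * xi
        # Average the gradient so the step size does not depend on dataset size.
        weights = [w - learning_rate * g / len(data) for w, g in zip(weights, grads)]
    return weights


# Hypothetical dataset generated from y = 2 * x1 - 1 * x2.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
start = [0.0, 0.0]  # zeros keep this sketch reproducible
learned = batch_gradient_descent(data, start)
print(learned, batch_cost(learned, data))
```

Because every update uses the same full dataset, running the function again from the same starting weights gives the same result, which is the deterministic behavior described above.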

### Stochastic Gradient Descent

<img src="invalid.png" width="700px;" alt="Polynomial function that has been wrongly predicted due to Batch Gradient Descent" />

There are two cases where _Batch Gradient Descent_ isn't very useful. One is when the cost function is a higher-degree polynomial rather than a convex curve, and the other is when the cost surface has three or more dimensions because there are more than 2 input parameters; the latter is very likely to happen. The problem with these cases is that the cost function can have many local minima, so an algorithm that always travels in the direction of "steepest descent" may settle at a low cost value that isn't the ideal (global) minimum.

<img src="sochastic.png" width="700px;" alt="The neural network learns from each datapoint" />

The _Stochastic Gradient Descent_ algorithm addresses the issues faced by _Batch Gradient Descent_ by computing a cost value and performing backpropagation for each datapoint, in random order. Instead of using all the datapoints at once to perform backpropagation, the _Neural Network_ learns from each datapoint individually. This is not only faster per update, because only one datapoint needs to be loaded into memory at a time, but also has a better chance of finding the ideal cost value, because the variance between datapoints causes larger fluctuations in the cost function that can carry the weights out of a poor local minimum.
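
The per-datapoint update can be sketched as follows, assuming a single neuron with an identity activation and a squared-error cost. The dataset, learning rate, seed, and function names are hypothetical, and the toy cost here is convex, so the sketch shows the update pattern rather than an escape from a local minimum.

```python
import random


def predict(weights, x):
    """Weighted sum of the inputs (identity activation assumed)."""
    return sum(w * xi for w, xi in zip(weights, x))


def stochastic_gradient_descent(data, weights, learning_rate=0.1, epochs=200, seed=0):
    """Update the weights after every single datapoint, visited in random order."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    weights = list(weights)
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)  # new random order each epoch
        for x, y in order:
            error = predict(weights, x) - y
            # Gradient of (y_hat - y)^2 / 2 for this one datapoint is error * x_i.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights


# Hypothetical dataset generated from y = 2 * x1 - 1 * x2.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
learned = stochastic_gradient_descent(data, [0.0, 0.0])
print(learned)
```

Each epoch now performs one update per datapoint instead of one update in total, and the random ordering is what makes the algorithm stochastic.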

### Mini Batch Gradient Descent 

A combination of _Batch Gradient Descent_ and _Stochastic Gradient Descent_, _Mini Batch Gradient Descent_ creates a cost value from a randomly picked set of datapoints instead of just a single datapoint or the entire dataset.
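
A minimal sketch of the mini-batch variant is shown below, again assuming a single neuron with an identity activation and a squared-error cost. The batch size of 2, the learning rate, the seed, and the dataset are illustrative assumptions.

```python
import random


def predict(weights, x):
    """Weighted sum of the inputs (identity activation assumed)."""
    return sum(w * xi for w, xi in zip(weights, x))


def mini_batch_gradient_descent(data, weights, batch_size=2, learning_rate=0.3,
                                epochs=200, seed=0):
    """Update the weights once per randomly drawn batch of datapoints."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    weights = list(weights)
    for _ in range(epochs):
        shuffled = list(data)
        rng.shuffle(shuffled)  # random batches: shuffle, then slice
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            grads = [0.0] * len(weights)
            for x, y in batch:  # accumulate the gradient over this batch only
                error = predict(weights, x) - y
                for i, xi in enumerate(x):
                    grads[i] += error * xi
            weights = [w - learning_rate * g / len(batch)
                       for w, g in zip(weights, grads)]
    return weights


# Hypothetical dataset generated from y = 2 * x1 - 1 * x2.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
learned = mini_batch_gradient_descent(data, [0.0, 0.0])
print(learned)
```

Setting `batch_size` to 1 reduces this sketch to the per-datapoint update, and setting it to `len(data)` reduces it to one update per epoch over the whole dataset.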