# Training Neural Networks

## Training Optimization

We now know how to train a neural network, but training can often fail. Common reasons:

1. The architecture is poorly chosen
2. The data is noisy
3. The model takes far too long to run and needs to be faster
    
## Testing

First, which of these two models looks better based on intuition?

![b](Images/b1.png)

The model on the right makes no mistakes on this data, while the one on the left does worse. We cannot simply assume the right one is better, so we must check both against a ***testing set***: data held back from training.

![b](Images/b2.png)

A testing set is a more reliable judge than intuition. When two models perform about equally well, go with the simpler one.
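A minimal sketch of holding out a testing set, assuming the data lives in NumPy arrays. The function name `train_test_split`, the 20% fraction, and the toy arrays are illustrative choices, not something these notes prescribe.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))              # random ordering of row indices
    n_test = int(round(len(X) * test_fraction))  # how many rows to hold out
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (assumed): 10 points with 2 features each
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)
```

The model is trained only on `X_train` and judged only on `X_test`, so the test score reflects data the model has never seen.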

## Overfitting and Underfitting

***Overfitting*** - a modeling error that occurs when a function is fit too closely to a limited set of data points. (Too specific)

***Underfitting*** - occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. (Too general)

Use a test set to determine the level of fit. Below is a visualization:

![b](Images/b3.png)

So what do we do if we get it wrong? Typically we err on the side of a model that overfits, then apply techniques that rein in the extreme fit.


## Stopping too Early

Typically the number of epochs determines the level of fit: too few epochs underfits, too many overfits.

![b](Images/b6.png)

As a rule of thumb, plotting the training and testing error tells us how well the model fits:

1. Training and Testing Error LARGE: Underfit
2. Training and Testing Error SMALL: Just Right
3. Training Error TINY and Testing Error MEDIUM: Overfit
4. Training Error TINY and Testing Error LARGE: Overfit
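The rule of thumb above can be sketched as a small function. The thresholds `small` and `gap` are hypothetical values picked for illustration; what counts as "large" or "tiny" depends on the problem.

```python
def diagnose_fit(train_error, test_error, small=0.1, gap=0.1):
    """Rough diagnosis from the two errors (thresholds are assumptions)."""
    if train_error > small:
        # The model cannot even fit the data it was trained on
        return "underfit"
    if test_error - train_error > gap:
        # Tiny training error, but clearly worse on unseen data
        return "overfit"
    return "just right"

print(diagnose_fit(0.40, 0.45))  # both large
print(diagnose_fit(0.05, 0.08))  # both small
print(diagnose_fit(0.01, 0.30))  # tiny training error, large testing error
```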

![b](Images/b5.png)

***Early Stopping***: Algorithm that stops gradient descent once the testing error begins to increase, even if the training error is still falling.
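A minimal sketch of early stopping over a list of per-epoch testing errors. The `patience` parameter (how many non-improving epochs to tolerate before giving up) and the error values are assumptions added for illustration; a real training loop would compute the error after each epoch instead of reading it from a list.

```python
def early_stopping_epoch(test_errors, patience=2):
    """Return the index of the epoch whose weights should be kept."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            # Testing error is still improving: remember this epoch
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training
            break
    return best_epoch

# Made-up testing errors: they fall, bottom out, then rise again
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(errors))  # -> 3
```

Tolerating a couple of bad epochs avoids stopping on a single noisy measurement.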

## Regularization

Consider this case: two models that describe the same line.

![b](Images/b7.png)

Which has the smaller error? Answer: $10x_1 + 10x_2$. Why? Its predictions are more extreme, but that extra confidence is a sign of overfitting.

![b](Images/b8.png)

So why is model A better? To perform gradient descent we need derivatives large enough to have an impact. Model B is too certain: its sigmoid outputs sit very close to 0 and 1, where the sigmoid curve is nearly flat, so its derivatives are close to zero.
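A short sketch of why large coefficients leave gradient descent with almost nothing to work with. The example point (1, 1) and the small-coefficient weights (1, 1) are assumptions chosen for illustration; only the large-coefficient model $10x_1 + 10x_2$ comes from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

point = np.array([1.0, 1.0])  # assumed example point
models = {
    "x1 + x2": np.array([1.0, 1.0]),        # assumed small-coefficient model
    "10x1 + 10x2": np.array([10.0, 10.0]),  # large-coefficient model
}
for name, weights in models.items():
    z = weights @ point
    print(f"{name}: prediction={sigmoid(z):.6f}, slope={sigmoid_slope(z):.2e}")
```

The large-coefficient model predicts almost exactly 1, but its slope is many orders of magnitude smaller, so a gradient step barely changes it.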

![b](Images/b9.png)

In general, LARGE COEFFICIENTS = OVERFITTING. We need to penalize large weights by adding a term to the end of the error function. There are two options: L1 or L2 regularization.

![b](Images/b10.png)

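L1 - Good for feature selection, because it tends to push unimportant weights all the way to zero

L2 - Normally better for training models. Why? (1,0) is penalized more than (0.5,0.5), so it favors small, evenly spread weights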
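A quick check of the (1,0) versus (0.5,0.5) comparison, assuming the usual definitions: the L1 penalty sums absolute weights and the L2 penalty sums squared weights. The strength parameter `lam` is a hypothetical name for the regularization constant.

```python
import numpy as np

def l1_penalty(weights, lam=1.0):
    """lam * sum of absolute weights."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam=1.0):
    """lam * sum of squared weights."""
    return lam * np.sum(weights ** 2)

sparse = np.array([1.0, 0.0])
spread = np.array([0.5, 0.5])

print(l1_penalty(sparse), l1_penalty(spread))  # 1.0 1.0
print(l2_penalty(sparse), l2_penalty(spread))  # 1.0 0.5
```

L1 scores both weight vectors the same, so it has no objection to the sparse one; L2 charges the sparse vector twice as much, which nudges training toward evenly spread weights.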

## Dropout Layer

Analogy: If we train only our chest muscles from Monday to Sunday, our chest will be strong but the rest of the body will fall behind. Thus, we often need to strengthen the whole body with full-body workouts or by dedicating a day to each significant muscle group.

The same idea applies to perceptrons whose weights come to dominate the output. When one perceptron does most of the training at the expense of the others, we need to switch it off momentarily so the others can catch up. The simplest way is to switch nodes off at random, each with a fixed probability.

PROBABILITY EACH NODE WILL BE DROPPED = 0.2

Thus for each epoch, each node has a 20% chance of being ignored.
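A minimal sketch of a dropout mask with a drop probability of 0.2. Rescaling the surviving activations by `1 / (1 - drop_prob)` is a common convention (often called inverted dropout) and is an assumption here, not something these notes specify; deep learning libraries ship their own dropout layers.

```python
import numpy as np

def dropout(activations, drop_prob=0.2, rng=None):
    """Zero each activation with probability drop_prob, rescale the rest."""
    rng = np.random.default_rng() if rng is None else rng
    # True where the node is kept, False where it is ignored this pass
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Rescale so the expected total activation stays the same (assumed convention)
    return activations * keep_mask / (1.0 - drop_prob), keep_mask

rng = np.random.default_rng(0)
activations = np.ones(10_000)
dropped, keep_mask = dropout(activations, drop_prob=0.2, rng=rng)
print("fraction dropped:", 1.0 - keep_mask.mean())  # close to 0.2
```

A fresh mask is drawn on every call, so a different random set of nodes sits out each time. Dropout is only applied during training; at test time all nodes are used.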

## Random Restart

Gradient descent often gets stuck at a local minimum and treats it as the lowest point. What do we do then?

![b](Images/b11.png)

We use something called ***random restart***: start from a few different random places and run gradient descent from each. This gives us a higher probability of reaching the absolute minimum.
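A sketch of random restart on a one-dimensional error curve. The function $f(x) = x^4 - 3x^2 + x$ is a made-up example with a local minimum near $x \approx 1.13$ and a lower, absolute minimum near $x \approx -1.30$; the learning rate, step count, and number of restarts are likewise assumptions.

```python
import numpy as np

def f(x):
    """Toy error curve with two minima (assumed for illustration)."""
    return x**4 - 3 * x**2 + x

def f_slope(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=500):
    for _ in range(steps):
        x = x - learning_rate * f_slope(x)
    return x

def random_restart(n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    starts = rng.uniform(-2.0, 2.0, size=n_restarts)  # random starting places
    candidates = [gradient_descent(x0) for x0 in starts]
    return min(candidates, key=f)  # keep the lowest point found

stuck = gradient_descent(1.5)  # a single run from an unlucky start
best = random_restart()
print("single run:", stuck, f(stuck))
print("with restarts:", best, f(best))
```

The single run settles in the shallow valley on the right, while the restarts are very likely to include at least one start that rolls into the deeper valley on the left.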

![b](Images/b12.png)


## Vanishing Gradient

Another problem is the vanishing gradient, which comes from the sigmoid function. The curve gets very flat on the far left and right, so the derivative there is close to 0.
The problem gets worse as more layers are added to a neural network: backpropagation multiplies these derivatives together, and the product of many tiny values is an even tinier value.
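A rough sketch of how quickly the gradient shrinks with depth. It multiplies one sigmoid derivative per layer and ignores the weights entirely, which is a simplifying assumption. Even in the best case, where every derivative takes its maximum value of 0.25 (at an input of 0), the product collapses fast.

```python
import numpy as np

def sigmoid_slope(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def gradient_scale(n_layers, z=0.0):
    """Product of one sigmoid derivative per layer (weights ignored)."""
    return sigmoid_slope(z) ** n_layers

for n_layers in (1, 5, 10, 20):
    print(n_layers, gradient_scale(n_layers))
```

With inputs away from 0 the derivatives are smaller still, so real networks can do worse than this bound.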

![b](Images/b13.png)

## Batch vs. Stochastic Gradient Descent

***Batch Gradient Descent*** - every step runs all of the data through the network, computes the error, and backpropagates to update the weights of every layer. Because each step uses all the data, the gradient is accurate but the computation is immense.

![b](Images/b14.png)

Do we need to plug in all our data every time? If the data is well distributed, a small subset gives a pretty good idea of what the gradient would be. It is less accurate but faster.

***Stochastic Gradient Descent*** - take small subsets of the data, run each through the neural network, and calculate the gradient. Update the weights based on this small sample.

We can do this by splitting the data into batches, e.g. 6 of 24 points form one batch. Run the batch through the NN and backpropagate. Repeat for the 3 other batches. The result is 4 steps with a small improvement each time, which in practice is preferred over one large, expensive step.
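A minimal sketch of the 24-point, 4-batch example, using a one-weight linear model in place of a neural network to keep it short. The synthetic data (true slope of 3), learning rate, and number of epochs are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 1))                        # 24 points, 1 feature
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=24)  # assumed true slope: 3

def gradient(w, X_batch, y_batch):
    """Gradient of the mean squared error for the model y = w * x."""
    residual = X_batch[:, 0] * w - y_batch
    return 2.0 * np.mean(residual * X_batch[:, 0])

w, learning_rate, batch_size = 0.0, 0.1, 6
steps = 0
for epoch in range(50):
    order = rng.permutation(len(X))  # reshuffle the points every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]  # 6 of the 24 points
        w -= learning_rate * gradient(w, X[batch], y[batch])
        steps += 1  # 4 weight updates per epoch

print(steps, w)
```

Each epoch makes 4 cheap updates instead of 1 expensive one. Setting `batch_size = 24` would turn the same loop into batch gradient descent.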

![b](Images/b15.png)


## Learning Rate Decay

The larger the learning rate, the greater the chance of stepping over the minimum error.

The best learning rates are the ones that adapt:

1. If steep: long steps
2. If flat: small steps
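One simple way to get long steps early (when the error surface is usually steep) and small steps later (as it flattens near a minimum) is to shrink the learning rate every epoch. The sketch below assumes exponential decay; the function name and the constants are illustrative, and other schedules exist.

```python
def decayed_learning_rate(initial_rate, decay_rate, epoch):
    """Exponential decay: multiply the rate by decay_rate once per epoch."""
    return initial_rate * decay_rate ** epoch

# Assumed settings: start at 0.1 and halve the rate every epoch
for epoch in range(5):
    print(epoch, decayed_learning_rate(0.1, 0.5, epoch))
```

A decay rate closer to 1 (e.g. 0.95) shrinks the steps more gently, which is the more typical choice for long training runs.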