# Training neural networks

## Underfitting vs. overfitting

- Underfitting is error due to bias
- Overfitting is error due to variance
- It's really hard to find the right architecture for a neural network (balance between being overly simplistic vs. overly complicated). Best to err on the side of overly complicated models and introduce controls to prevent overfitting.

## Techniques to control overfitting

### Early stopping

- The model complexity graph plots the test and training error versus the number of epochs, which can be considered as a measure of complexity.
- When comparing the errors on training vs. test data on the model complexity graph, the error on the training set keeps decreasing with more epochs.
- The error on the test data is initially large due to the randomised weights, but then decreases as long as the model generalises well until it reaches the minimum point - the Goldilocks spot. After this point, the model starts overfitting and starts memorizing the training data.
- **Early stopping** determines the number of epochs required to reach the Goldilocks point.
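
As a rough illustration of the idea above, the sketch below scans a list of per-epoch test errors and returns the epoch with the lowest error. The `patience` parameter and the error values are assumptions made up for this example, not part of the notes.

```python
def early_stopping_epoch(test_errors, patience=2):
    """Return the epoch with the lowest test error, giving up once the
    error has failed to improve for `patience` consecutive epochs."""
    best_epoch = 0
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # the error keeps rising: assume overfitting has started
    return best_epoch

# Hypothetical test errors: falling at first, then rising again.
test_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(test_errors))  # 3
```

In a real training loop you would also keep a copy of the weights from the best epoch, so the model can be rolled back to the Goldilocks point.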

### Regularization

- Large weights cause the sigmoid to be very steep, so it saturates for most inputs and the gradients will generally be close to zero. The model is too certain about its predictions. Also, if this model misclassifies a point, it will produce a large error. This makes it hard to do gradient descent.
- However, if a model with large weights classifies correctly, it will produce small error. Therefore, in order to incentivize the procedure to prefer smaller weights, you need to penalize large coefficients. You need to add terms to the error function which grow large for large weights:
    - L1: Add sum of absolute values of weights (times a constant lambda)
        - Better for feature selection as it produces sparse vectors.
    - L2: Add sum of squared values of weights (times a constant lambda)
        - Normally better for training models
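
A minimal sketch of the two penalty terms, assuming the weights are given as a flat list and `lam` stands for the constant lambda. The function names and the example values are chosen for illustration only.

```python
def l1_penalty(weights, lam):
    """L1 term: lambda times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 term: lambda times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)

small_weights = [1.0, -1.0]
large_weights = [10.0, -10.0]
lam = 0.1

# The penalty is added to the data error, so large weights cost more.
print(l1_penalty(small_weights, lam), l1_penalty(large_weights, lam))
print(l2_penalty(small_weights, lam), l2_penalty(large_weights, lam))
```

Scaling the weights by 10 multiplies the L1 term by 10 but the L2 term by 100, which shows how much more strongly L2 punishes large weights.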

### Dropout

- What you see a lot when training neural networks is that one part of the network has large weights and dominates the training, whereas other parts have smaller weights and don't play a large part in it.
- To prevent this, in each epoch you can randomly remove a node or nodes from the network to engage other nodes in training. This method is called dropout.
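
The sketch below shows one simple way to switch nodes off at random, assuming a layer's activations are a plain list and `drop_prob` is the probability of removing each node. Both names are hypothetical. Practical implementations usually also rescale the surviving activations, which is left out here to keep the example short.

```python
import random

def apply_dropout(activations, drop_prob, rng):
    """Zero each activation independently with probability `drop_prob`."""
    mask = [0.0 if rng.random() < drop_prob else 1.0 for _ in activations]
    dropped = [a * m for a, m in zip(activations, mask)]
    return dropped, mask

rng = random.Random(0)  # seeded so the example is repeatable
hidden = [0.8, 0.1, 0.5, 0.9]  # made-up activations of one hidden layer
dropped, mask = apply_dropout(hidden, drop_prob=0.5, rng=rng)
print(mask)
print(dropped)
```

A fresh mask would be drawn for every pass during training, and dropout is switched off when the trained network is used for prediction.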

### Random restart

- Random restart is a way to avoid ending up in a local minimum rather than the global minimum.
- You randomly start gradient descent from different starting points and see which track leads to the lowest error.
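
To make this concrete, here is a small sketch on a one-dimensional toy function with one local and one global minimum. The function `f`, the learning rate, and the number of steps are assumptions picked for the example.

```python
import random

def f(x):
    """Toy error function: local minimum near x = 1.13, global near x = -1.30."""
    return x**4 - 3 * x**2 + x

def f_grad(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=500):
    for _ in range(steps):
        x -= learning_rate * f_grad(x)
    return x

def random_restart(starts):
    """Run gradient descent from every start and keep the lowest-error result."""
    candidates = [gradient_descent(x0) for x0 in starts]
    return min(candidates, key=f)

rng = random.Random(0)
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
best = random_restart(starts)
print(round(best, 2), round(f(best), 2))
```

A single run started at `x = 2.0` settles in the local minimum near `1.13`. Adding more starting points makes it more likely that at least one run reaches the lower minimum near `-1.30`.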

### Vanishing gradient

- The sigmoid is very flat at both ends, so its derivative is close to zero there. Therefore, the gradients are very small and lead to slow learning. This is further compounded with more layers, since backpropagation multiplies these small derivatives together.
- The best way to fix this is to change the activation function, e.g.:
    - Hyperbolic tangent function
    - Rectified Linear Unit (ReLU)
- You can use different activation functions across layers, depending on what the output unit requires. E.g. if the output still needs to be a probability, you would keep a sigmoid for the output layer but use a different function for the hidden layers.
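
The effect can be checked numerically. The sketch below compares the derivatives of the three activation functions and multiplies the best-case sigmoid derivative across a hypothetical stack of 10 layers. The layer count is an assumption for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# The sigmoid derivative peaks at 0.25 (at x = 0) and shrinks towards both ends.
print(sigmoid_grad(0.0), sigmoid_grad(10.0))

# Even in the best case, 10 stacked sigmoid layers shrink the signal a lot.
layers = 10
print(sigmoid_grad(0.0) ** layers)  # 0.25 ** 10, roughly 9.5e-07

# tanh peaks at 1.0, and ReLU stays at 1.0 for any positive input.
print(tanh_grad(0.0), relu_grad(10.0))
```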

### Batch vs stochastic gradient descent

- So far, in each epoch we would feed all data points forward and backward through the network to update the weights a single time. This can be very computationally demanding.
- Alternatively you can apply **stochastic gradient descent**:
    - Take small subsets (batches) of data, run them through the network, calculate the gradient of the error function for this batch and continue with the following batch.
    - To ensure all batches are considered, every epoch will have sub epochs, one for each batch.
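
A minimal sketch of the batching loop, assuming a one-weight model `y = w * x` instead of a full network so the example stays short. The data, batch size, and learning rate are made up for illustration.

```python
import random

def iterate_batches(data, batch_size, rng):
    """Yield the data in shuffled batches so every point is used once per epoch."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

def train(data, batch_size, epochs, learning_rate, rng):
    w = 0.0
    updates = 0
    for _ in range(epochs):
        for batch in iterate_batches(data, batch_size, rng):
            # Gradient of the mean squared error of y = w * x on this batch only.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
            updates += 1
    return w, updates

# Ten hypothetical points on the line y = 3x.
data = [(k / 10, 3 * k / 10) for k in range(1, 11)]
w, updates = train(data, batch_size=2, epochs=50, learning_rate=0.1,
                   rng=random.Random(0))
print(round(w, 3), updates)
```

With 10 points and batches of 2, each epoch performs 5 weight updates instead of 1, and each update only needs the gradient of 2 points.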

### Learning rate decay

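- Lower learning rates are generally safer than bigger ones. With a bigger learning rate you risk overshooting the minimum and never converging, whereas a lower rate converges more slowly but more reliably.
- Learning rate decay combines both: start with a larger learning rate and reduce it as training progresses.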
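
The overshooting effect is easy to reproduce on the toy error function `f(x) = x**2`. In the sketch below, the three schedules and their constants are assumptions chosen to make the behaviour visible, not recommended values.

```python
def descend(lr_schedule, steps, x0=1.0):
    """Gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = x0
    for step in range(steps):
        x -= lr_schedule(step) * 2 * x
    return x

def constant_large(step):
    return 1.1  # too big: every step overshoots the minimum at x = 0

def constant_small(step):
    return 0.1

def decayed(step):
    return 1.1 / (1 + step)  # starts big, shrinks as training progresses

print(abs(descend(constant_large, 20)))  # grows with every step
print(abs(descend(constant_small, 20)))  # shrinks steadily towards 0
print(abs(descend(decayed, 20)))         # overshoots once, then settles
```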

### Momentum

- Momentum attempts to solve the local minimum problem by letting the previous steps carry the descent over the hump next to a local minimum.
- Momentum determines the size of the next step from the sizes of the previous steps, each scaled by a parameter beta (between 0 and 1) raised to the power of the number of steps back, so that the weight of a step decreases the further back in time it lies.
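
The weighted sum described above can be computed with a running total, as in the sketch below. It assumes the raw steps are given as a list of gradient values. The function name and the inputs are made up for illustration.

```python
def momentum_steps(gradients, beta):
    """For each time t, return g_t + beta * g_(t-1) + beta**2 * g_(t-2) + ..."""
    velocity = 0.0
    steps = []
    for g in gradients:
        # Multiplying the running total by beta ages every earlier step by one power.
        velocity = beta * velocity + g
        steps.append(velocity)
    return steps

# A single push followed by zero gradients: its influence fades as beta ** k.
print(momentum_steps([1.0, 0.0, 0.0], beta=0.5))  # [1.0, 0.5, 0.25]

# Repeated pushes in the same direction build up a larger step.
print(momentum_steps([1.0, 1.0, 1.0], beta=0.5))  # [1.0, 1.5, 1.75]
```

The second call shows why momentum helps on a flat stretch or in a shallow local minimum: the step stays large even when the current gradient alone would be small.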