## Variations of Gradient Descent

In the Gradient Descent example we drew a single $x$ and $y$ from the dataset and applied the update rule:

$$ \theta_{n+1} = \theta_{n} - \alpha \nabla_{\theta} L(x,y;\theta_n) $$

This is called **Stochastic Gradient Descent (SGD)**. It updates the parameters after every single example, so updates are frequent, but each gradient is only a noisy estimate of the gradient over the whole dataset. The resulting updates are not very accurate and may cause large, chaotic jumps in the loss landscape.

SGD is also computationally inefficient because it cannot exploit the ability of GPUs to calculate the gradients of many examples in parallel.

The other extreme would be to average the gradients of all $N$ examples $\{x_1,...,x_N\}$ in the dataset before updating the parameters (we write $N$ for the dataset size to keep it distinct from the step index $n$):

$$
\theta_{n+1} = \theta_{n} - \alpha \frac{1}{N} \sum_{i=1}^N \nabla_\theta L(x_i, y_i;\theta_{n})
$$

This is called **Batch Gradient Descent**. It calculates an accurate gradient but is very slow because it has to iterate over the whole dataset for every single parameter update.

The compromise is to draw a random batch $x = [x_1,...,x_m]$ of size $m$ from the dataset and update the parameters with the average gradient:

$$
\theta_{n+1} = \theta_{n} - \alpha \frac{1}{m} \sum_{i=1}^m \nabla_\theta L(x_i, y_i;\theta_{n})
$$

This is called **Mini-Batch Gradient Descent**.

The parameter $m$ is usually called the **batch size**.
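The three variants differ only in the batch size, which the following minimal NumPy sketch tries to illustrate. It assumes a linear model with a mean-squared-error loss and toy data; the function name `minibatch_gradient_descent` and all hyper-parameter values are hypothetical choices for illustration. It also shuffles the dataset once per epoch and walks through it in batches, which is a common practical stand-in for drawing a random batch at every step.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size, alpha=0.1, epochs=50, seed=0):
    """Fit y ~ X @ theta with a squared-error loss (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_examples)  # reshuffle once per epoch
        for start in range(0, n_examples, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            # Average gradient of 0.5 * (x @ theta - y)**2 over the batch
            grad = X_b.T @ (X_b @ theta - y_b) / len(idx)
            theta = theta - alpha * grad
    return theta

# Toy data (assumption): y = 2*x0 - 3*x1 plus a little noise
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(256, 2))
y = X @ np.array([2.0, -3.0]) + 0.01 * data_rng.normal(size=256)

theta_sgd = minibatch_gradient_descent(X, y, batch_size=1)      # SGD
theta_mini = minibatch_gradient_descent(X, y, batch_size=32)    # mini-batch
theta_batch = minibatch_gradient_descent(X, y, batch_size=256)  # batch GD
print(theta_sgd, theta_mini, theta_batch)
```

With `batch_size=1` the loop performs 256 noisy updates per epoch, with `batch_size=256` it performs exactly one accurate update per epoch, and `batch_size=32` sits in between.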

## Learning rate

The learning rate $\alpha$ controls how far each update moves the model parameters in the direction of the negative gradient of the loss function.

$\alpha$ is one of the most important hyper-parameters.

If $\alpha$ is set too small, gradient descent will progress slowly.

If $\alpha$ is set too large, gradient descent may fail to converge.

This is a cartoonish depiction of the effects of different learning rates:

<img src="images/cartoon_loss.png" height="300" width="400"/>

 * With a low learning rate (blue line) the loss decreases slowly because the parameter updates are very small.
 * High learning rates (green line) reduce the loss faster at first, but they get stuck at worse loss values. This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a deep spot in the optimization landscape.
 * With a very high learning rate (yellow line) the loss may even grow exponentially.
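The slow and the diverging case can be reproduced on a toy problem. The sketch below assumes the one-dimensional loss $L(\theta) = \theta^2$ and three hand-picked learning rates; the function name `run_gradient_descent` is hypothetical. Such a simple convex loss cannot show the "stuck at a worse loss" behaviour of the green line, which needs a noisier, more rugged landscape.

```python
def run_gradient_descent(alpha, steps=20, theta0=1.0):
    """Minimize L(theta) = theta**2 and return the loss after each step."""
    theta = theta0
    losses = []
    for _ in range(steps):
        grad = 2.0 * theta            # dL/dtheta
        theta = theta - alpha * grad  # plain gradient descent update
        losses.append(theta ** 2)
    return losses

for alpha in (0.01, 0.4, 1.1):
    losses = run_gradient_descent(alpha)
    print(f"alpha={alpha}: loss after {len(losses)} steps = {losses[-1]:.3g}")
```

Each update multiplies $\theta$ by $(1 - 2\alpha)$, so the loss shrinks slowly for $\alpha = 0.01$, shrinks quickly for $\alpha = 0.4$, and grows at every step once $|1 - 2\alpha| > 1$, as for $\alpha = 1.1$.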

One simple solution is to **anneal** the learning rate over time. Starting with a high learning rate helps to speed up the learning progress early on and reducing it over time helps to settle down into deeper but narrower parts of the loss function.
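Two common ways to anneal the learning rate are step decay and exponential decay. The following is a minimal sketch; the function names and the default values for `drop`, `epochs_per_drop` and `k` are assumptions chosen for illustration, not recommended settings.

```python
import math

def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(alpha0, epoch, k=0.05):
    """Shrink the learning rate smoothly: alpha0 * exp(-k * epoch)."""
    return alpha0 * math.exp(-k * epoch)

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(0.1, epoch), round(exponential_decay(0.1, epoch), 4))
```

In a training loop the schedule would be evaluated at the start of every epoch and the result used as $\alpha$ in the update rule.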

Another solution is to use more sophisticated optimization algorithms that adapt an individual learning rate for each model parameter.

Examples are:

 * Adam
 * RMSProp
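To give an idea of what "individual learning rates" means, here is a minimal NumPy sketch of a single Adam update as it is commonly described, with the usual default values for `beta1`, `beta2` and `eps`. The function name `adam_step` and the toy loss $L(\theta) = \sum_i \theta_i^2$ are assumptions for illustration; in practice you would use the optimizer that ships with your deep learning framework.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    # Each parameter is scaled by its own gradient magnitude estimate
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize sum(theta**2), whose gradient is 2 * theta
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.05)
print(theta)
```

The division by `np.sqrt(v_hat)` is what gives every parameter its own effective step size: parameters with consistently large gradients take relatively smaller steps.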


## Train/Test split

You always need to split up your dataset into a **train dataset** and a **test dataset** and keep them separate.

The train dataset is used to actually train the model.

The test dataset provides the gold standard used to evaluate the model. It is only used once a model is completely trained.

The problem is that during training the model might become too sensitive to the specific structure of the training data.

The model might "memorize" the training data but perform poorly on unseen data. This is called **overfitting**.

Therefore the model performance is always evaluated on the test dataset that the model has not seen during training.

When splitting up the dataset it is important that both splits have a similar distribution of target values.

The **split ratio** between the train and test dataset is usually between 70/30 and 90/10, depending on the overall dataset size.
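One way to keep the distribution of target values similar in both splits is a stratified split, which samples the test examples separately for each class. Below is a minimal NumPy sketch for classification labels; the function name `stratified_split` and the toy data are assumptions for illustration.

```python
import numpy as np

def stratified_split(X, y, test_fraction=0.2, seed=0):
    """Split so that each class keeps roughly the same share in train and test."""
    rng = np.random.default_rng(seed)
    test_idx = []
    for label in np.unique(y):
        label_idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_fraction * len(label_idx)))
        test_idx.extend(label_idx[:n_test])
    test_mask = np.zeros(len(y), dtype=bool)
    test_mask[np.array(test_idx, dtype=int)] = True
    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]

# Toy data (assumption): 100 examples, 80% class 0 and 20% class 1
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 80 + [1] * 20)
X_train, X_test, y_train, y_test = stratified_split(X, y, test_fraction=0.2)
print(len(y_train), len(y_test), y_train.mean(), y_test.mean())
```

If scikit-learn is available, `sklearn.model_selection.train_test_split` with its `stratify` argument does the same job.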

## Overfitting / Underfitting

We want a model to **generalize** well, that is, to perform well on unseen data.

We track training and evaluation accuracy, but evaluation accuracy is the primary measure for model performance.

**Overfitting** occurs when a model becomes too sensitive to the specific structure of the training data and does not perform well on unseen data.

**Underfitting** occurs when a model cannot adequately capture the underlying structure of the data. Such a model will tend to have poor predictive performance.

Here is an example of overfitting / underfitting:

<img src="images/overfitting.png" height="200" width="300"/>

The example shows a **binary classification problem** that categorizes points as blue or red.

The black and green lines are decision boundaries for two different models on the train dataset.

The model with the green line has a tendency to overfit. It does a very good job in separating the classes but is too dependent on the train data and is likely to have a higher error rate on unseen data.

The model with the black line has a lower train accuracy but is likely to generalize better.
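The same trade-off can be reproduced numerically with a regression toy problem. The sketch below assumes noisy samples of a sine curve and fits polynomials of degree 1, 3 and 9 with `np.polyfit`; the data, the noise level and the degrees are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (toy data)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The train error can only go down as the degree grows, because each larger polynomial contains the smaller ones. The test error typically does not follow: a straight line is too simple for a sine curve (underfitting), while a high degree polynomial tends to chase the noise in the 15 training points (overfitting).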

It is a good idea to plot train and test accuracy over time during model training:

<img src="images/accuracies.jpg" height="200" width="300"/>

The large gap between the train accuracy and the blue test accuracy is a clear sign of overfitting.

Another sign of overfitting is that the test accuracy gets worse later on.

Possible solutions:

 * reduce the number of parameters in the model
 * increase the size of the train dataset by collecting more data
 * increase regularization:
     * add dropout layers or increase dropout rate
     * stronger L1/L2 weight penalty


In general **regularization** is a set of methods that make it harder for the model to fit, and thereby memorize, the training data.
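As one concrete instance, an L2 weight penalty adds $\frac{\lambda}{2} \lVert\theta\rVert^2$ to the loss, which contributes $\lambda \theta$ to the gradient and pulls the weights towards zero. The following minimal sketch assumes a linear model trained with batch gradient descent on a small, noisy toy dataset; the function name `fit_linear` and the value of `l2` are hypothetical.

```python
import numpy as np

def fit_linear(X, y, l2=0.0, alpha=0.1, steps=500):
    """Batch gradient descent on squared error plus 0.5 * l2 * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        data_grad = X.T @ (X @ theta - y) / len(y)
        penalty_grad = l2 * theta  # gradient of the L2 penalty term
        theta = theta - alpha * (data_grad + penalty_grad)
    return theta

# Toy data (assumption): 20 examples, 10 features, only feature 0 matters
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
y = X[:, 0] + 0.5 * rng.normal(size=20)

theta_plain = fit_linear(X, y)
theta_l2 = fit_linear(X, y, l2=1.0)
print(np.linalg.norm(theta_plain), np.linalg.norm(theta_l2))
```

The penalized model ends up with a smaller weight norm, so it has less freedom to fit the noise in the few training examples.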

The other case is when the test accuracy tracks the train accuracy very well. 

This can be a sign of underfitting, especially when both accuracies stay low.

The model capacity is not high enough and learning opportunities are wasted. 

The solution is to make the model larger: 

 * increase the number of parameters in existing layers
 * add more layers

In practice you can balance over- and underfitting through experiments. 

You actually want to see a little bit of overfitting.

Stop training early when test accuracy starts to drop and use the model checkpoint with the best performance.
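A minimal sketch of this early stopping logic, assuming the accuracy is measured once per epoch. The function name `early_stopping_epoch`, the `patience` parameter and the accuracy values are hypothetical; a real training loop would save a model checkpoint at the marked line.

```python
def early_stopping_epoch(accuracies, patience=2):
    """Return (best_epoch, best_accuracy), stopping once there has been
    no improvement for `patience` consecutive epochs."""
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch  # save a checkpoint here
        elif epoch - best_epoch >= patience:
            break                              # stop training early
    return best_epoch, best_acc

# Hypothetical accuracy per epoch
history = [0.61, 0.70, 0.76, 0.79, 0.78, 0.77, 0.80]
best_epoch, best_acc = early_stopping_epoch(history, patience=2)
print(best_epoch, best_acc)
```

The `patience` value trades off wasted epochs against the risk of stopping during a temporary dip: in this toy history the loop stops before it reaches the last epoch, which would have been slightly better.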
