What we have so far:

 * a **dataset** with examples $x$ and associated target values $y$
 * a **model** $f(x)$ that can make predictions given some $x$
 * a **loss function** $L$ that measures how good the model's predictions are

The next step is to find model **parameters** that result in a low loss over the dataset.

Remember the MNIST classification example:

$$
\begin{align}
z & = \mathrm{relu}(W_1 \cdot x + b_1) \\
f(x) & = \mathrm{softmax}(W_2 \cdot z + b_2)
\end{align}
$$

In this expression, $W_1$, $W_2$, $b_1$ and $b_2$ are the parameters of the model. 

For convenience we will denote the parameters of a model as $\theta$.

In our example $\theta = \{W_1, W_2, b_1, b_2 \}$.

Initially the parameters are filled with random values. 

Obviously the model will not predict anything useful in this state.
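To make this concrete, here is a minimal NumPy sketch of the model above with randomly initialized parameters. The layer sizes (784 inputs, 128 hidden units, 10 classes), the initialization scale and the random input vector are assumptions chosen for illustration, not values taken from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 784 input pixels, 128 hidden units, 10 classes.
n_in, n_hidden, n_out = 784, 128, 10

# theta = {W1, b1, W2, b2}, filled with small random values.
theta = {
    "W1": rng.normal(scale=0.01, size=(n_hidden, n_in)),
    "b1": rng.normal(scale=0.01, size=n_hidden),
    "W2": rng.normal(scale=0.01, size=(n_out, n_hidden)),
    "b2": rng.normal(scale=0.01, size=n_out),
}

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return e / e.sum()

def model(x, theta):
    z = relu(theta["W1"] @ x + theta["b1"])
    return softmax(theta["W2"] @ z + theta["b2"])

x = rng.random(n_in)  # stand-in for one flattened 28x28 image
probs = model(x, theta)
print(probs.round(3))
```

With parameters this small, the predicted class probabilities come out close to uniform, which is another way of saying the untrained model has no preference for any class.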

**Optimization** refers to the task of minimizing the loss function $L(x, y, \theta)$ by altering $\theta$.

We'll take advantage of the fact that all model operations are differentiable.

We'll compute the **gradient** of the loss function with respect to the model's parameters. The gradient tells us how to update the parameters to decrease the loss.

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

def f(x): return x**2 + 3 * x

def df_dx(x): return 2 * x + 3

x = np.arange(-5, 2, 0.01)
plt.xlabel('x')
plt.plot(x, f(x), label='f(x)')
plt.plot(x, df_dx(x), label='df(x)/dx')
plt.axhline(linewidth=1, color='black')
plt.legend()
plt.show()

When a function $f(x)$ has multiple inputs $x = [x_1,...,x_n]$, each input contributes to the value of the function, so a single derivative no longer describes how the function changes.

Instead we compute derivatives with respect to one input at a time. These are called **partial derivatives**, but we will not go into more detail at this point.
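As a quick illustration, partial derivatives can be approximated numerically by nudging one input at a time. The function `g` and the helper `numerical_gradient` below are hypothetical names made up for this sketch.

```python
import numpy as np

def g(x):
    # Hypothetical two-input function: g(x1, x2) = x1**2 + 3 * x1 * x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def numerical_gradient(func, x, eps=1e-6):
    """Central-difference estimate of all partial derivatives of func at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps  # change only input i, keep the others fixed
        grad[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad

point = np.array([1.0, 2.0])
grad = numerical_gradient(g, point)
# Analytically: dg/dx1 = 2*x1 + 3*x2 = 8 and dg/dx2 = 3*x1 = 3 at this point.
print(grad)
```

The vector of all partial derivatives is the gradient used in the rest of this section.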

## Model optimization

Now we can return to the initial question: how do we find model parameters $\theta$ that minimize the loss function $L$ of a model $f(x)$?

The loss function takes as input $x$, $y$ and $\theta$, so we can write it as $L(x, y, \theta)$.

During optimization we update only the parameters $\theta$. $x$ and $y$ are observations and cannot be changed.

To minimize the loss function we use **gradient descent** with the following update rule:

$$ \theta_{n+1} = \theta_{n} - \alpha \nabla_{\theta} L(x,y;\theta_n) $$

To train the model we perform the following steps:

 1. Initialize $\theta_0$ with small random values
 2. Draw $x$ and $y$ from the dataset
 3. Calculate $\hat y = f(x)$
 4. Calculate the gradient of the loss function w.r.t. $\theta_n$
 5. Update all parameters according to the update rule $\theta_{n+1} = \theta_{n} - \alpha \nabla_\theta L(x, y;\theta_{n})$
 6. Repeat from step 2 until the change in loss stays below some threshold
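
The steps above can be sketched for a very small model. Everything specific here is an assumption for illustration: the model is a line $f(x) = w x + b$, the loss is the mean squared error, the dataset is synthetic, and each iteration uses the whole dataset rather than a sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset (assumed for illustration): y = 2x + 1 plus small noise.
xs = rng.uniform(-1, 1, size=200)
ys = 2 * xs + 1 + rng.normal(scale=0.05, size=200)

def f(x, theta):
    w, b = theta
    return w * x + b

def loss(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def grad_loss(x, y, y_hat):
    # Partial derivatives of the mean squared error w.r.t. w and b.
    err = y_hat - y
    return np.array([np.mean(2 * err * x), np.mean(2 * err)])

alpha = 0.1   # learning rate
tol = 1e-10   # threshold on the change in loss

theta = rng.normal(scale=0.01, size=2)      # step 1: small random values
prev_loss = np.inf
for n in range(10_000):
    x, y = xs, ys                           # step 2: here, the full dataset
    y_hat = f(x, theta)                     # step 3: predictions
    grad = grad_loss(x, y, y_hat)           # step 4: gradient w.r.t. theta
    theta = theta - alpha * grad            # step 5: update rule
    cur_loss = loss(f(x, theta), y)
    if abs(prev_loss - cur_loss) < tol:     # step 6: stop when loss stalls
        break
    prev_loss = cur_loss

print(n, theta, cur_loss)
```

The learned `theta` should end up close to the values used to generate the data, up to the noise in the dataset.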
 

## Learning rate

The learning rate $\alpha$ scales the gradient in the update rule, so it controls how far the model parameters move at each step.

$\alpha$ is one of the most important hyper-parameters.

If $\alpha$ is set too small, gradient descent progress will be slow.

If $\alpha$ is set too large, gradient descent may fail to converge.

This is a cartoonish depiction of the effects of different learning rates:

<img src="images/cartoon_loss.png" height="300" width="400"/>

 * With a low learning rate (blue line) the loss will decrease slowly because the parameter updates are very small
 * High learning rates (green line) will decrease the loss faster, but they get stuck at worse values of loss. This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a deep spot in the optimization landscape.
 * With a very high learning rate (yellow line) the loss may even grow exponentially
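
Two of these regimes can be reproduced on the function $f(x) = x^2 + 3x$ plotted earlier, whose minimum is at $x = -1.5$. This is only a sketch: the starting point, step count and the three learning rates are arbitrary choices, and a one-dimensional convex function cannot show the "stuck at a worse loss" behaviour, only the slow and the diverging case.

```python
def df_dx(x):
    # Derivative of f(x) = x**2 + 3 * x
    return 2 * x + 3

def gradient_descent(alpha, x0=2.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - alpha * df_dx(x)
    return x

# The minimum of f is at x = -1.5.
for alpha in (0.01, 0.4, 1.1):
    print(alpha, gradient_descent(alpha))
```

After 20 steps the smallest learning rate is still far from the minimum, the middle one has essentially reached it, and the largest one has moved further away with every step.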

One simple solution is to **anneal** the learning rate over time. Starting with a high learning rate helps to speed up the learning progress early on and reducing it over time helps to settle down into deeper but narrower parts of the loss function.
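
One possible annealing schedule is exponential decay, where the learning rate is multiplied by a constant factor after every step. The sketch below applies it to $f(x) = x^2 + 3x$ (minimum at $x = -1.5$); the function name `annealed_learning_rate` and the values of `alpha0` and `decay` are assumptions for illustration.

```python
def df_dx(x):
    # Derivative of f(x) = x**2 + 3 * x
    return 2 * x + 3

def annealed_learning_rate(step, alpha0=0.9, decay=0.95):
    # Exponential decay: start at alpha0 and shrink by a constant factor per step.
    return alpha0 * decay ** step

x = 2.0
for step in range(100):
    alpha = annealed_learning_rate(step)
    x = x - alpha * df_dx(x)

print(annealed_learning_rate(0), annealed_learning_rate(99), x)
```

The early steps are large and overshoot the minimum, while the later, smaller steps let the iterate settle.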

Another solution would be to use more sophisticated optimization algorithms that calculate individual learning rates for each model parameter. 

Examples are:

 * Adam
 * RMSProp
 
 <img src="images/optimizer.gif" height="400" width="600"/>
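
To show what an individual learning rate per parameter means, here is a simplified sketch of the RMSProp update. The loss $\theta_0^2 + 100\,\theta_1^2$ is a made-up example in which the two parameters have very differently scaled gradients, and the hyper-parameter values are common defaults assumed here, not values from this document.

```python
import numpy as np

def grad(theta):
    # Gradient of the hypothetical loss theta[0]**2 + 100 * theta[1]**2
    return np.array([2.0 * theta[0], 200.0 * theta[1]])

def rmsprop(theta, alpha=0.01, decay=0.9, eps=1e-8, steps=2000):
    theta = np.asarray(theta, dtype=float).copy()
    avg_sq = np.zeros_like(theta)  # running average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        avg_sq = decay * avg_sq + (1 - decay) * g ** 2
        # Each parameter's step is divided by its own gradient magnitude.
        theta -= alpha * g / (np.sqrt(avg_sq) + eps)
    return theta

theta = rmsprop([1.0, 1.0])
print(theta)
```

Because each step is normalized per parameter, both parameters approach the minimum at a similar pace even though their gradients differ by a factor of 100.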


## Train/Test split

You always need to split up your dataset into a **train dataset** and a **test dataset** and keep them separate.

The train dataset is used to actually train the model.

The test dataset provides the gold standard used to evaluate the model. It is only used once a model is completely trained.

The problem is that during training the model might become too sensitive to the specific structure of the training data.

The model might "memorize" the training data but perform poorly on unseen data. This is called **overfitting**.

Therefore the model performance is always evaluated on the test dataset that the model has not seen during training.

When splitting up the dataset it is important that both splits have a similar distribution of target values.

The **split ratio** between the train and test dataset is usually 70/30 to 90/10, depending on the overall dataset size.

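Often a third split, the **validation set**, is held out to monitor the model during training and to tune hyper-parameters, which helps to detect overfitting without touching the test dataset.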
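
A split that keeps the distribution of target values similar in both parts is often called a stratified split. Below is a minimal NumPy sketch of the idea; `stratified_split` is a hypothetical helper written for this example and the labels are made up.

```python
import numpy as np

def stratified_split(y, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class keeps
    roughly the same share in both splits."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Shuffle the indices of this class, then cut off the test share.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_ratio * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Made-up labels: 80 examples of class 0 and 20 examples of class 1.
y = np.array([0] * 80 + [1] * 20)
train_idx, test_idx = stratified_split(y, test_ratio=0.2)

print(len(train_idx), len(test_idx))
print(y[train_idx].mean(), y[test_idx].mean())  # share of class 1 in each split
```

The returned indices are grouped by class, so in practice they would be shuffled again before training. Libraries such as scikit-learn offer ready-made utilities for this kind of split.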

## Overfitting / Underfitting

We want a model to **generalize** well, which means it should perform well on unseen data.

We track training and evaluation accuracy, but evaluation accuracy is the primary measure for model performance.

**Overfitting** occurs when a model becomes too sensitive to the specific structure of the training data and does not perform well on unseen data.

**Underfitting** occurs when a model cannot adequately capture the underlying structure of the data. Such a model will tend to have poor predictive performance.

Here is an example of overfitting / underfitting:

<img src="images/overfitting.png" height="200" width="300"/>

The example shows a **binary classification problem** that categorizes points as blue or red.

The black and green lines are decision boundaries for two different models on the train dataset.

The model with the green line has a tendency to overfit. It does a very good job of separating the classes but depends too heavily on the train data and is likely to have a higher error rate on unseen data.

The model with the black line has a lower train accuracy but is likely to generalize better.
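
The same trade-off can be explored numerically. The sketch below swaps the classification problem for a small regression toy, which is an assumption made purely for illustration: polynomials of increasing degree are fitted to a few noisy samples of a sine curve, and the error is measured on the train points and on fresh points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    # Made-up data: noisy samples of one period of a sine curve.
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 9):
    p = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    test_mse = np.mean((p(x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(degree, train_mse, test_mse)
```

The train error can only go down as the degree grows. The error on fresh points is the number to watch: a straight line is too simple for this data, while a high degree starts to follow the noise in the 15 train points.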

It is a good idea to plot train and test accuracy over time during model training:

<img src="images/accuracies.jpg" height="200" width="300"/>

The large gap between train accuracy and the blue test accuracy is a clear sign of overfitting.

Another sign of overfitting is that the test accuracy gets worse later on.

Possible solutions:

 * reduce the number of parameters in the model
 * increase the size of the train dataset by collecting more data
 * increase regularization:
     * add dropout layers or increase dropout rate
     * stronger L1/L2 weight penalty


In general **regularization** is a set of methods that constrain the model and make it harder to fit the training data exactly.
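
As one concrete case, an L2 weight penalty adds $\lambda \sum W^2$ to the loss, which contributes $2 \lambda W$ to the gradient and therefore pulls every weight towards zero. The sketch below assumes this common form of the penalty; the weight matrix, `lam` and `alpha` are made-up values.

```python
import numpy as np

def l2_penalty(W, lam):
    # Extra loss term: lam times the sum of squared weights.
    return lam * np.sum(W ** 2)

def l2_penalty_grad(W, lam):
    # Gradient of the penalty with respect to W.
    return 2 * lam * W

W = np.array([[1.0, -2.0], [0.5, 0.0]])
lam = 0.01
alpha = 0.1

penalty = l2_penalty(W, lam)
# One gradient descent step using only the penalty term shrinks all weights.
W_new = W - alpha * l2_penalty_grad(W, lam)

print(penalty)
print(W_new)
```

In a real training step this gradient is added to the gradient of the data loss, so the weights only grow where the data justifies it.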

The other case is when the test accuracy tracks the train accuracy very well. 

This is a sign of underfitting.

The model capacity is not high enough and learning opportunities are wasted. 

The solution is to make the model larger: 

 * increasing the number of parameters in existing layers
 * add more layers

In practice you can balance over- and underfitting through experiments. 

You actually want to see a little bit of overfitting.

Stop training early when test accuracy starts to drop and use the model checkpoint with the best performance.
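
Early stopping can be sketched as a small helper that watches the held-out accuracy after each epoch. The function name `best_checkpoint`, the `patience` parameter and the accuracy values are all made up for illustration; a real training loop would also save the model parameters at the best epoch.

```python
def best_checkpoint(accuracies, patience=2):
    """Return (best_epoch, best_accuracy), stopping once the accuracy
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_acc = 0, float("-inf")
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_acc

# Made-up held-out accuracies, one value per epoch.
history = [0.80, 0.85, 0.88, 0.90, 0.89, 0.88, 0.91]
result = best_checkpoint(history)
print(result)
```

With `patience=2` the loop stops after two epochs without improvement and reports epoch 3, so the later value of 0.91 is never reached. A larger `patience` trades longer training for a smaller chance of stopping too early.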


