# Deep Learning course - LAB 4

## A tour of the optimizers in PyTorch



### Recap from previous Lab

* We experimented with building a Multilayer Perceptron (MLP) trained on the MNIST dataset using _vanilla_ Stochastic Gradient Descent (SGD) and constructing a training loop that lets us track loss and accuracy as training goes on
* We saw how to analyze the parameters and gradients of this MLP as training progresses
* We explored how to add regularization to our network and loss function to increase generalization or speed up the training

### Agenda for today

* Today we will be taking a quick tour of the `torch.optim` library, having a look at some optimizers which are more advanced than vanilla SGD
* In addition, we will explore how to change the hyperparameters (chiefly, the learning rate) of the optimizer while training is under way


In [1]:
import os  # needed by train_model to create the checkpoint folder and build the checkpoint path
import torch
from scripts.architectures import MLP # I have pasted the code for the MLP with regularization in this script, no need to redefine it
from scripts.train_utils import AverageMeter, accuracy
from scripts import mnist

### Exploring optimizers in PyTorch

PT optimizers can be found in the `torch.optim` library.

We'll take a look at some of those, namely:

* SGD with momentum
* RMSProp
* Adam

If you're a fan of optimizers, you can yourself have a look at the plethora of optimizers in the `optim` library on the [official docs](https://pytorch.org/docs/stable/optim.html).

#### SGD w/ momentum

Adding a momentum term helps SGD optimize faster in situations where the optimum lies in _valleys_ that are much steeper along some directions than along others.

![](img/sgd_momentum.jpg)

*Image from Deep Learning book (Goodfellow et al.) - chapter 8.3.2*

The parameters are updated via the following quantity $\nu$ (the _velocity_), which accumulates past gradients:

$\mathbf{\nu} \leftarrow m\cdot\mathbf{\nu} + \text{lr} \cdot \mathbf{G}$

$\mathbf{\Theta} \leftarrow \mathbf{\Theta} - \mathbf{\nu}$

where $\mathbf{G}$ is the gradient and $m$ is the momentum coefficient (usually chosen close to 1, e.g. 0.9).

In PT, momentum does not require a separate optimizer: it is built into `torch.optim.SGD`, which accepts a `momentum` argument.
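
To make the update rule concrete, here is a minimal pure-Python sketch (no PyTorch involved) that applies the two equations above to a toy quadratic _valley_, much steeper along `y` than along `x`. The objective, the function names `f`, `grad` and `sgd_momentum`, and the chosen step count are illustrative assumptions, not part of the course code. Note also that `torch.optim.SGD` uses a closely related but slightly different formulation, in which the learning rate multiplies the velocity rather than the gradient.

```python
def f(theta):
    """Toy objective: a valley that is 25x steeper along y than along x."""
    x, y = theta
    return 0.5 * (x ** 2 + 25.0 * y ** 2)


def grad(theta):
    """Gradient of the toy objective f."""
    x, y = theta
    return [x, 25.0 * y]


def sgd_momentum(theta, lr, m, steps):
    """Apply nu <- m * nu + lr * G, then theta <- theta - nu, for `steps` iterations."""
    nu = [0.0, 0.0]
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta)
        nu = [m * v + lr * gi for v, gi in zip(nu, g)]
        theta = [t - v for t, v in zip(theta, nu)]
    return theta


start = [10.0, 1.0]
plain = sgd_momentum(start, lr=0.02, m=0.0, steps=50)  # m = 0 recovers vanilla gradient descent
heavy = sgd_momentum(start, lr=0.02, m=0.9, steps=50)

print(f"loss without momentum: {f(plain):.4f}")
print(f"loss with momentum:    {f(heavy):.4f}")
```

With the same learning rate, the run with momentum should end at a markedly lower loss, because the velocity keeps building up along the shallow `x` direction where plain gradient descent crawls.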

In [65]:
lr = .1
wd = 5e-4
momentum = .9

num_epochs = 5

loss_fn = torch.nn.CrossEntropyLoss()

model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd, momentum=momentum)

Let us also recover the training and testing routines we defined last lab (without the trajectory):

In [2]:
def train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance): # note: I've added a generic performance to replace accuracy
    for X, y in dataloader:
        # 1. reset the gradients previously accumulated by the optimizer
        #    this will avoid re-using gradients from previous loops
        optimizer.zero_grad() 
        # 2. get the predictions from the current state of the model
        #    this is the forward pass
        y_hat = model(X)
        # 3. calculate the loss on the current mini-batch
        loss = loss_fn(y_hat, y)
        # 4. execute the backward pass given the current loss
        loss.backward()
        # 5. update the value of the params
        optimizer.step()
        # 6. calculate the accuracy for this mini-batch
        acc = performance(y_hat, y)
        # 7. update the loss and accuracy AverageMeter
        loss_meter.update(val=loss.item(), n=X.shape[0])
        performance_meter.update(val=acc, n=X.shape[0])


def train_model(model, dataloader, loss_fn, optimizer, num_epochs, checkpoint_loc=None, checkpoint_name="checkpoint.pt", performance=accuracy):

    # create the folder for the checkpoints (if it's not None)
    if checkpoint_loc is not None:
        os.makedirs(checkpoint_loc, exist_ok=True)
    
    model.train()

    # epoch loop
    for epoch in range(num_epochs):
        loss_meter = AverageMeter()
        performance_meter = AverageMeter()

        train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance)

        print(f"Epoch {epoch+1} completed. Loss - total: {loss_meter.sum} - average: {loss_meter.avg}; Performance: {performance_meter.avg}")

        # produce checkpoint dictionary -- but only if the name and folder of the checkpoint are not None
        if checkpoint_name is not None and checkpoint_loc is not None:
            checkpoint_dict = {
                "parameters": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch
            }
            torch.save(checkpoint_dict, os.path.join(checkpoint_loc, checkpoint_name))

    return loss_meter.sum, performance_meter.avg

def test_model(model, dataloader, performance=accuracy, loss_fn=None):
    # create an AverageMeter for the loss if passed
    if loss_fn is not None:
        loss_meter = AverageMeter()
    
    performance_meter = AverageMeter()

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            y_hat = model(X)
            loss = loss_fn(y_hat, y) if loss_fn is not None else None
            acc = performance(y_hat, y)
            if loss_fn is not None:
                loss_meter.update(loss.item(), X.shape[0])
            performance_meter.update(acc, X.shape[0])
    # get final performances
    fin_loss = loss_meter.sum if loss_fn is not None else None
    fin_perf = performance_meter.avg
    print(f"TESTING - loss {fin_loss if fin_loss is not None else '--'} - performance {fin_perf}")
    return fin_loss, fin_perf

and recover the data

In [67]:
trainloader, testloader, _, _ = mnist.get_data()

Let's train the network

In [14]:
train_model(model, trainloader, loss_fn, optimizer, num_epochs)

Epoch 1 completed. Loss - total: 26499.092242240906 - average: 0.44165153737068175; Performance: 0.8647500000317891
Epoch 2 completed. Loss - total: 14802.4463763237 - average: 0.246707439605395; Performance: 0.9258000000317892
Epoch 3 completed. Loss - total: 13043.869870185852 - average: 0.2173978311697642; Performance: 0.9346833333333333
Epoch 4 completed. Loss - total: 11759.672103881836 - average: 0.19599453506469727; Performance: 0.9414
Epoch 5 completed. Loss - total: 11193.425158977509 - average: 0.18655708598295848; Performance: 0.9434833333015442



By adding the momentum term, we already see a small increase in training accuracy. Let's evaluate on the test set:

In [17]:
test_model(model, testloader)

TESTING - loss -- - performance 0.9569333333333333


(None, 0.9569333333333333)

#### RMSProp and the LR sensitivity "dilemma"

With RMSProp we tackle a problem of SGD and SGD with momentum: a single learning rate is applied to all parameters, although the loss can be far more _sensitive_ along some _directions_ than along others (read: parameters, since each parameter of the model represents a dimension of the optimization space).

RMSProp tries to tackle this issue by introducing an _adaptive rule_ for updating the learning rate parameter-wise in each step. 
In particular:
* it keeps track of the _history_ of the squared gradient via an exponentially decaying running average: 
  $\mathbf{R} \leftarrow \rho\mathbf{R} + (1-\rho) \mathbf{G}^2$
  * (the _decay_ is controlled by a hyperparameter $\rho \in (0,1)$, usually 0.9)
* The parameter update is 
  * directly proportional to the learning rate
  * directly proportional to the gradient for this step
  * inversely proportional to the square root of the running average of the squared gradient
    * i.e., the direct effect of the gradient is _mitigated_ by dividing it by the root of the accumulated average of squared gradients

The formula for the update is:

$\mathbf{\Theta} \leftarrow \mathbf{\Theta} - \frac{\text{lr}}{\sqrt{\epsilon + \mathbf{R}}} \odot \mathbf{G}$

where:
* $\epsilon$ is a small constant for numerical stability
* $\mathbf{R}$ is the running average of the squared gradient (which depends upon $\rho$)
* $\mathbf{G}$ is the gradient for the current step
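
As a rough illustration of the adaptive rule, the sketch below implements one RMSProp step in pure Python, in its descent form (the parameters move against the gradient), and applies it to two coordinates whose gradients differ by four orders of magnitude. The function name `rmsprop_step` and the toy gradient values are assumptions made for this example. Keep in mind that `torch.optim.RMSprop` calls the decay `alpha` (default 0.99) and adds `eps` after taking the square root, so its numbers differ slightly.

```python
import math


def rmsprop_step(theta, grad, R, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp step on lists of floats; returns the new parameters and the new running average."""
    # running average of the squared gradient
    R = [rho * r + (1 - rho) * g * g for r, g in zip(R, grad)]
    # per-parameter step: the gradient is rescaled by the root of its own running average
    theta = [t - lr / math.sqrt(eps + r) * g for t, r, g in zip(theta, R, grad)]
    return theta, R


theta = [0.0, 0.0]
R = [0.0, 0.0]
grad = [100.0, 0.01]  # two directions with wildly different gradient magnitudes

new_theta, R = rmsprop_step(theta, grad, R)
step_sizes = [abs(n - t) for n, t in zip(new_theta, theta)]
print(step_sizes)
```

Both coordinates move by nearly the same amount (about `lr / sqrt(1 - rho)` on this first step), whereas plain SGD would move the first one 10,000 times further than the second.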

In [24]:
model = MLP() # always remember to reinstantiate the net between tries
rmsprop = torch.optim.RMSprop(model.parameters()) # let's use the default hyperparams (lr=.01, alpha=.99, eps=1e-8); alpha plays the role of rho

In [25]:
train_model(model, trainloader, loss_fn, rmsprop, num_epochs)
test_model(model, testloader)

Epoch 1 completed. Loss - total: 25307.13468837738 - average: 0.42178557813962303; Performance: 0.8705166666984558
Epoch 2 completed. Loss - total: 16272.398222923279 - average: 0.27120663704872133; Performance: 0.9179166666984558
Epoch 3 completed. Loss - total: 14396.38200044632 - average: 0.23993970000743867; Performance: 0.9285000000317891
Epoch 4 completed. Loss - total: 13521.132495880127 - average: 0.22535220826466879; Performance: 0.9339333333651225
Epoch 5 completed. Loss - total: 12783.330872535706 - average: 0.21305551454226176; Performance: 0.9366666666666666
TESTING - loss -- - performance 0.9390000000317892


(None, 0.9390000000317892)

#### ADAM

Adam is an extension of RMSProp that adds momentum-like mechanics as well.

Instead of a single momentum term, though, it keeps two running averages:

$\mathbf{M} \leftarrow \beta_1 \mathbf{M} + (1 - \beta_1)\mathbf{G}$

$\mathbf{V} \leftarrow \beta_2 \mathbf{V} + (1 - \beta_2)\mathbf{G}^2$

which are then rescaled by a so-called *bias correction*:

$\hat{\mathbf{M}} = \mathbf{M}~/~(1-\beta_1^t)$

$\hat{\mathbf{V}} = \mathbf{V}~/~(1-\beta_2^t)$

where $t$ is the training iteration.

The two terms are running averages of the gradient and of its square, respectively. The correction compensates for their initialization at zero, which biases them towards zero during the first iterations.

These terms are then incorporated into the parameter update formula:

$\mathbf{\Theta} \leftarrow \mathbf{\Theta} - \frac{\text{lr}}{\sqrt{\hat{\mathbf{V}}} + \epsilon} \odot \hat{\mathbf{M}}$

Notice the similarities between Adam and RMSProp.
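
For reference, here is a small scalar sketch of one Adam step, following the formulation of the original Adam paper (Kingma & Ba): two running averages, bias correction, and an update that uses the corrected first moment in place of the raw gradient. The function name `adam_step` and the toy gradients are assumptions for this example; the default values mirror those of `torch.optim.Adam`.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a scalar parameter; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * g        # running average of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


first_steps = []
for g in (100.0, 0.01):
    theta, m, v = adam_step(theta=0.0, g=g, m=0.0, v=0.0, t=1)
    first_steps.append(-theta)  # magnitude of the very first update

print(first_steps)
```

Thanks to the bias correction, the very first update has magnitude close to `lr` whatever the scale of the gradient. Without it, `m` and `v` would start near zero and distort the first steps.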

In [4]:
model = MLP()
adam = torch.optim.Adam(model.parameters()) # we keep the default hyperparameters

In [34]:
train_model(model, trainloader, loss_fn, adam, num_epochs)
test_model(model, testloader)

Epoch 1 completed. Loss - total: 48566.915053367615 - average: 0.8094485842227935; Performance: 0.8016000000317891
Epoch 2 completed. Loss - total: 18961.944045066833 - average: 0.3160324007511139; Performance: 0.9128166666666667
Epoch 3 completed. Loss - total: 15120.141127586365 - average: 0.2520023521264394; Performance: 0.9276666666348775
Epoch 4 completed. Loss - total: 13185.002179145813 - average: 0.21975003631909687; Performance: 0.9350666666348775
Epoch 5 completed. Loss - total: 12003.2468252182 - average: 0.2000541137536367; Performance: 0.9411
TESTING - loss -- - performance 0.9559833333651224


(None, 0.9559833333651224)

The literature is loaded with SGD variants for optimization: Adagrad, AdaMax, Nadam, AdamW... You can use any of them in your exercises (provided you can explain the concept behind it during the exam).

#### Additional reading

If you're interested in SGD variants, you may check out [this blog post](https://ruder.io/optimizing-gradient-descent/index.html) which, in my opinion, does a good job in summarising and presenting recent work in the field.

## Modifying the optimizer hyperparameters

One thing we might want to do is modify the hyperparameters of our optimizer mid-training.

The hyperparameters of the optimizer are stored:
* in its `state_dict`
* under its `param_groups`

We will see how to work with the latter.


In [14]:
print(type(optimizer.param_groups))
print(len(optimizer.param_groups))

<class 'list'>
1


The `param_groups` represent groups of parameters for which given conditions apply.

Here we have only one group, corresponding to the params of our MLP network.

Let us see how this group is composed:

In [17]:
print(type(optimizer.param_groups[0]))
print(optimizer.param_groups[0].keys())

<class 'dict'>
dict_keys(['params', 'lr', 'momentum', 'dampening', 'weight_decay', 'nesterov'])


To the surprise of no-one, the parameters of the MLP are stored under the `params` key.

The other keys represent the _conditions_ that apply to this parameter group.

Changing one of these hyperparameters can be done as follows:

In [None]:
optimizer.param_groups[0]["momentum"] = 0.8 # -> from now on, the momentum will be decreased a little bit

That said, it is better to be general: if we want a global update for that optimizer, we should apply it to all groups.

In [18]:
for pg in optimizer.param_groups:
    pg["momentum"] = .8


Let us suppose we wish to use different hyperparameters (e.g., momentum, learning rate, or weight decay) for different layers:

In [45]:
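optimizer_diff = torch.optim.SGD(
    [
        # `model` is the MLP instantiated above; slicing assumes `layers` is an indexable container of modules
        {"params": model.layers[:6].parameters()},
        {"params": model.layers[6:].parameters()}
    ],
    lr=.1, weight_decay=5e-4, momentum=.9
)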

We have split the params of our MLP in two groups (the first 6 layers and the remaining ones). Let's check that we have >1 `param_group`

In [46]:
len(optimizer_diff.param_groups)

2

Now, suppose we want a different weight decay in the second group: we only need to set it.

In [47]:
optimizer_diff.param_groups[1]["weight_decay"] = 1e-3

In [51]:
_ = [[print(hyp, "\t", val) for hyp, val in pg.items() if hyp!="params"] for pg in optimizer_diff.param_groups]

lr 	 0.1
momentum 	 0.9
dampening 	 0
weight_decay 	 0.0005
nesterov 	 False
lr 	 0.1
momentum 	 0.9
dampening 	 0
weight_decay 	 0.001
nesterov 	 False


We also could've done this like so:

In [61]:
optimizer_diff = torch.optim.SGD(
    [
        # per-group values override the defaults passed as keyword arguments below
        {"params": model.layers[:6].parameters(), "weight_decay": 5e-4},
        {"params": model.layers[6:].parameters(), "weight_decay": 1e-3}
    ],
    momentum=.9, lr=.1
)

_ = [[print(hyp, "\t", val) for hyp, val in pg.items() if hyp!="params"] for pg in optimizer_diff.param_groups]

weight_decay 	 0.0005
lr 	 0.1
momentum 	 0.9
dampening 	 0
nesterov 	 False
weight_decay 	 0.001
lr 	 0.1
momentum 	 0.9
dampening 	 0
nesterov 	 False


### The Learning Rate dilemma in Deep Learning

The dilemma stems from the static nature of the learning rate (LR):
* if the LR is too high, we'll notice a sharp increase in accuracy followed by a relatively quick plateau corresponding to sub-optimal solutions.
  * this is because we'll likely overshoot local optima, since our step in the parameter space is too large
* if the LR is too low, training will be excruciatingly slow and we'll likely get stuck in very bad local optima, because the step in the parameter space is too small to get out of these _valleys_

An _ideal_ solution would be to keep a _high enough_ LR until we find a _good enough_ portion of the parameter space, then progressively decrease the LR in order to carefully explore these areas for good optima.

Changing the learning rate mid-training goes by a variety of names: **learning rate decay**, **learning rate annealing**, **learning rate scheduling**...

The simplest idea to implement this is a **stepwise** learning rate annealing:

![](https://miro.medium.com/max/864/1*VQkTnjr2VJOz0R2m4hDucQ.jpeg)

*picture from [towardsdatascience.com](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1?gi=aea0860a2f14).*

In our MLP trained with SGD + momentum, we wish to train for 15 epochs and divide the LR by 10 **before** epochs 7 and 12.
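
One possible way to express this schedule, before wiring it into the training loop, is sketched below. The helpers `stepwise_lr` and `set_lr` are hypothetical names, and `_FakeOptimizer` is only a stand-in so the snippet runs without PyTorch; it mimics the one thing we rely on, namely that an optimizer exposes `param_groups` as a list of dicts with an `"lr"` key. Epochs are 0-based here, so "before epochs 7 and 12" becomes milestones 6 and 11.

```python
def stepwise_lr(base_lr, epoch, milestones=(6, 11), gamma=0.1):
    """LR to use at the given 0-based epoch: multiplied by gamma once per milestone already reached."""
    n_drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_drops


def set_lr(optimizer, lr):
    """Write the same LR into every parameter group of the optimizer."""
    for pg in optimizer.param_groups:
        pg["lr"] = lr


class _FakeOptimizer:
    """Stand-in exposing `param_groups` like a torch.optim optimizer (illustration only)."""

    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]


opt = _FakeOptimizer(lr=0.1)
schedule = []
for epoch in range(15):
    set_lr(opt, stepwise_lr(0.1, epoch))  # this call would sit at the top of the epoch loop
    schedule.append(opt.param_groups[0]["lr"])

for epoch, lr in enumerate(schedule, start=1):
    print(f"Epoch {epoch:2d} --- learning rate {lr:.5f}")
```

Computing the LR from the epoch index, rather than multiplying the current value in place, keeps the schedule correct even when training is resumed from a checkpoint.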

Let us recover our training loop and update it accordingly:

In [None]:
def train_model(model, dataloader, loss_fn, optimizer, num_epochs, checkpoint_loc=None, checkpoint_name="checkpoint.pt", performance=accuracy):

    # create the folder for the checkpoints (if it's not None)
    if checkpoint_loc is not None:
        os.makedirs(checkpoint_loc, exist_ok=True)
    
    model.train()

    # epoch loop
    for epoch in range(num_epochs):
        ### UPDATE HERE THE LOOP ###

        ############################

        loss_meter = AverageMeter()
        performance_meter = AverageMeter()

        train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance)

        print(f"Epoch {epoch+1} completed. Loss - total: {loss_meter.sum} - average: {loss_meter.avg}; Performance: {performance_meter.avg}")

        # produce checkpoint dictionary -- but only if the name and folder of the checkpoint are not None
        if checkpoint_name is not None and checkpoint_loc is not None:
            checkpoint_dict = {
                "parameters": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch
            }
            torch.save(checkpoint_dict, os.path.join(checkpoint_loc, checkpoint_name))

    return loss_meter.sum, performance_meter.avg

PyTorch offers a tool that complements the optimizer: the **`lr_scheduler`**.

The closest thing to the schedule above is **`StepLR`**, which multiplies the LR by `gamma` every `step_size` calls to `scheduler.step()` (typically one call per epoch).
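
To see how a fixed `step_size` can still produce drops before epochs 7 and 12, the pure-Python sketch below mimics a StepLR-like rule and counts scheduler steps the way the training loop of this lab does, i.e. skipping the step while `epoch < epoch_start_scheduler`. The helper names `step_lr` and `lr_per_epoch` are assumptions for this example, and the closed form is a simplified model of the scheduler, not a call into PyTorch.

```python
def step_lr(base_lr, n_steps, step_size=5, gamma=0.1):
    """LR after `n_steps` scheduler steps under a StepLR-like rule."""
    return base_lr * gamma ** (n_steps // step_size)


def lr_per_epoch(num_epochs, base_lr=0.1, step_size=5, gamma=0.1, epoch_start_scheduler=1):
    """LR used during each (0-based) epoch when the scheduler is stepped at the end of
    every epoch whose index is >= epoch_start_scheduler."""
    lrs, n_steps = [], 0
    for epoch in range(num_epochs):
        lrs.append(step_lr(base_lr, n_steps, step_size, gamma))
        if epoch >= epoch_start_scheduler:
            n_steps += 1
    return lrs


lrs = lr_per_epoch(15)
drops = [i + 1 for i in range(1, len(lrs)) if lrs[i] < lrs[i - 1]]  # 1-based epochs
print(drops)
```

Skipping the first scheduler step delays every drop by one epoch, which is why `step_size=5` yields drops before epochs 7 and 12 rather than 6 and 11.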

In [78]:
model = MLP()

In [79]:
optimizer = torch.optim.SGD(model.parameters(), lr=.1, weight_decay=5e-4, momentum=.9)

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=.1)

In [80]:
def train_model(model, dataloader, loss_fn, optimizer, num_epochs, checkpoint_loc=None, checkpoint_name="checkpoint.pt", performance=accuracy, lr_scheduler=None, epoch_start_scheduler=1):
    # added lr_scheduler

    # create the folder for the checkpoints (if it's not None)
    if checkpoint_loc is not None:
        os.makedirs(checkpoint_loc, exist_ok=True)
    
    model.train()

    # epoch loop
    for epoch in range(num_epochs):

        loss_meter = AverageMeter()
        performance_meter = AverageMeter()

        # added print for LR
        print(f"Epoch {epoch+1} --- learning rate {optimizer.param_groups[0]['lr']:.5f}")

        train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance)

        print(f"Epoch {epoch+1} completed. Loss - total: {loss_meter.sum} - average: {loss_meter.avg}; Performance: {performance_meter.avg}")

        # produce checkpoint dictionary -- but only if the name and folder of the checkpoint are not None
        if checkpoint_name is not None and checkpoint_loc is not None:
            checkpoint_dict = {
                "parameters": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch
            }
            torch.save(checkpoint_dict, os.path.join(checkpoint_loc, checkpoint_name))
        
        if lr_scheduler is not None:
            if epoch >= epoch_start_scheduler:
                lr_scheduler.step()

    return loss_meter.sum, performance_meter.avg

In [81]:
train_model(model, trainloader, loss_fn, optimizer, 15, lr_scheduler=scheduler)

Epoch 1 --- learning rate 0.10000
Epoch 1 completed. Loss - total: 26423.32963657379 - average: 0.44038882727622986; Performance: 0.8645166666984558
Epoch 2 --- learning rate 0.10000
Epoch 2 completed. Loss - total: 14970.883371829987 - average: 0.2495147228638331; Performance: 0.9253499999682109
Epoch 3 --- learning rate 0.10000
Epoch 3 completed. Loss - total: 13198.945890903473 - average: 0.2199824315150579; Performance: 0.93385
Epoch 4 --- learning rate 0.10000
Epoch 4 completed. Loss - total: 12233.711651325226 - average: 0.20389519418875376; Performance: 0.9381333333651225
Epoch 5 --- learning rate 0.10000
Epoch 5 completed. Loss - total: 11651.040762901306 - average: 0.19418401271502178; Performance: 0.9421666666666667
Epoch 6 --- learning rate 0.10000
Epoch 6 completed. Loss - total: 11336.593544483185 - average: 0.18894322574138642; Performance: 0.9433333333333334
Epoch 7 --- learning rate 0.01000
Epoch 7 completed. Loss - total: 9255.401719331741 - average: 0.1542566953221956

(7536.799172639847, 0.9616666666348775)

In [82]:
test_model(model, testloader)

TESTING - loss -- - performance 0.9732333333015442


(None, 0.9732333333015442)

#### Other techniques

1. Exponential Annealing

![](https://miro.medium.com/max/432/1*iSZv0xuVCsCCK7Z4UiXf2g.jpeg)

*picture from [towardsdatascience.com](https://towardsdatascience.com).*

2. Cosine Annealing

![](https://miro.medium.com/max/1266/1*2NAuh6DbcrrMv4Voq5yG9A.png)

*picture from [towardsdatascience.com](https://towardsdatascience.com).*

3. Triangular Annealing

![](img/lr_tri.jpg)

*picture from [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/pdf/1801.06146.pdf).*
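
The first two schedules have simple closed forms, sketched below in pure Python. The function names, the decay factor `gamma=0.9` and the horizon `T` are illustrative assumptions. In PyTorch, the corresponding schedulers are `torch.optim.lr_scheduler.ExponentialLR` and `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math


def exponential_lr(base_lr, t, gamma=0.9):
    """Exponential annealing: the LR is multiplied by gamma at every step t."""
    return base_lr * gamma ** t


def cosine_lr(base_lr, t, T, min_lr=0.0):
    """Cosine annealing from base_lr (at t = 0) down to min_lr (at t = T)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / T))


T = 10
exp_schedule = [exponential_lr(0.1, t) for t in range(T + 1)]
cos_schedule = [cosine_lr(0.1, t, T) for t in range(T + 1)]

for t, (e, c) in enumerate(zip(exp_schedule, cos_schedule)):
    print(f"t={t:2d}  exponential={e:.5f}  cosine={c:.5f}")
```

The exponential schedule drops quickly at first and then flattens, while the cosine schedule stays close to the initial LR for a while, decays fastest around the midpoint, and flattens again near the end.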


#### Warm-up

Warm-up is a technique centered on the idea that, before the actual training starts, the network has to be _warmed up_ with some iterations of training at an ever-increasing LR, until we hit the target LR $\eta$.

A simple implementation of this (which resembles the ascending phase of the triangular schedule above) could be to:
* warm up for $U$ iterations
* increase the LR by $\frac{\eta}{U}$ at each iteration.

So, at iteration $u\in\{1,\dots,U\}$, the LR is $u\frac{\eta}{U}$.

Hence, the triangular annealing above could be thought of as a composition of

1. Linear warm-up, and
2. Linear annealing
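
This composition can be sketched in a few lines of pure Python. The function names and the choice of annealing linearly down to 0 at a final iteration `T` are assumptions made for the example; the slanted triangular schedule from the paper cited above uses a different parametrization.

```python
def linear_warmup_lr(eta, u, U):
    """LR at warm-up iteration u in {1, ..., U}: u * eta / U."""
    return u * eta / U


def triangular_lr(eta, t, U, T):
    """Linear warm-up for the first U iterations, then linear annealing down to 0 at iteration T."""
    if t <= U:
        return linear_warmup_lr(eta, t, U)
    return eta * (T - t) / (T - U)


schedule = [triangular_lr(0.1, t, U=4, T=12) for t in range(1, 13)]
for t, lr in enumerate(schedule, start=1):
    print(f"iteration {t:2d} --- learning rate {lr:.5f}")
```

The LR peaks at the target value $\eta$ exactly at iteration $U$, then decreases linearly over the remaining iterations.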

#### LR schedule cycling

The aforementioned schedules can be cycled multiple times during the same training, giving rise to shapes like the following:

![](https://miro.medium.com/max/890/1*xaQVSxG_13E7ZhwPPvPNhw.png)

*picture from [towardsdatascience.com](https://towardsdatascience.com)*

![](https://pyimagesearch.com/wp-content/uploads/2019/07/keras_clr_triangular.png)

*picture from [pyimagesearch.com](https://www.pyimagesearch.com/2019/07/29/cyclical-learning-rates-with-keras-and-deep-learning/)*

This, coupled with $E_{opt}$ early stopping, might actually give us multiple end points from which to choose our best model. Each time the LR gets "bumped up", we get a "fresh restart" from a possibly more favorable initialization, in the hope of getting closer and closer to a good local optimum.

On to something a bit more complex:

![](https://www.pyimagesearch.com/wp-content/uploads/2019/07/keras_clr_exp_range.png)

*picture from [pyimagesearch.com](https://www.pyimagesearch.com/2019/07/29/cyclical-learning-rates-with-keras-and-deep-learning/)*

Here the maximum LR is decayed as well, following an exponential envelope. We can obtain a similar figure with cosine annealing.
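
A cyclical triangular schedule of this kind can be sketched as follows, using the formulation commonly given for cyclical learning rates: the LR oscillates between `base_lr` and `max_lr`, with `step_size` iterations per half-cycle, and an optional factor `gamma ** it` that shrinks the amplitude over time. The function name and the numeric values are assumptions for this example. PyTorch ships a ready-made `torch.optim.lr_scheduler.CyclicLR` with `triangular`, `triangular2` and `exp_range` modes.

```python
import math


def cyclical_lr(it, base_lr, max_lr, step_size, gamma=1.0):
    """Triangular cyclical LR at iteration `it`; gamma < 1 decays the amplitude of the cycles."""
    cycle = math.floor(1 + it / (2 * step_size))   # index of the current cycle (1-based)
    x = abs(it / step_size - 2 * cycle + 1)        # distance from the peak of the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * gamma ** it


plain = [cyclical_lr(it, 0.001, 0.006, step_size=4) for it in range(17)]
decayed = [cyclical_lr(it, 0.001, 0.006, step_size=4, gamma=0.9) for it in range(17)]

for it, (p, d) in enumerate(zip(plain, decayed)):
    print(f"it={it:2d}  triangular={p:.5f}  decayed={d:.5f}")
```

With `gamma=1.0` every peak reaches `max_lr`; with `gamma < 1` each successive peak is lower, which reproduces the decaying envelope of the last figure.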

Further watch (a bit older, from 2018): [2](https://www.youtube.com/watch?v=kbe_tNGoBHI)

Further read, an argument proposing an alternative to LR annealing: [Don't Decay the Learning Rate, Increase the Batch Size](https://arxiv.org/abs/1711.00489)


#### Homework

1. Now that you have all the tools to train an MLP with high performance on MNIST, try reaching 0-loss on the training data (with a small epsilon -- don't worry if you overfit!).
The implementation is completely up to you. You just need to keep it an MLP without using fancy layers (e.g., keep the `Linear` layers, don't go into `Conv1d` or something like this). You are free to use any LR scheduler or optimizer, any one of batchnorm/groupnorm, regularization methods... If you use something we haven't seen during lectures, please motivate your choice and explain (as briefly as possible) how it works.
2. Try reaching 0-loss on the training data with **permuted labels**. Assess the model on the test data (without permuted labels) and comment. Help yourself with [3](https://arxiv.org/abs/1611.03530).
*Tip*: To permute the labels, act on the `trainset.targets` with an appropriate torch function.
Then, you can pass this "permuted" `Dataset` to a `DataLoader` like so: `trainloader_permuted = torch.utils.data.DataLoader(trainset_permuted, batch_size=batch_size_train, shuffle=True)`. You can now use this `DataLoader` inside the training function.
Additional view: ["The statistical significance perfect linear separation", by Jared Tanner (Oxford U.)](https://www.youtube.com/watch?v=vl2QsVWEqdA).

### References

[1](https://www.deeplearningbook.org/) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

[2](https://www.youtube.com/watch?v=kbe_tNGoBHI) State-of-the-art Learning Rate Schedules. Apache MXNet. YouTube.

[3](https://arxiv.org/abs/1611.03530) Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization.
