# Deep Learning course - LAB 4

## A tour of the optimizers in PyTorch



### Recap from previous Lab

* We built a Multilayer Perceptron (MLP), trained it on the MNIST dataset using _vanilla_ Stochastic Gradient Descent (SGD), and wrote a training loop that lets us track loss and accuracy as training goes on
* We saw how to analyze the parameters and gradients of this MLP during training
* We explored how to add regularization to our network and loss function to increase generalization or speed up the training

### Agenda for today

* Today we will be taking a quick tour of the `torch.optim` library, having a look at some optimizers which are more advanced than vanilla SGD
* In addition, we will explore how to adjust the hyperparameters (chiefly, the learning rate) of the optimizer during training

# TODO

talk about the effect of the minibatch size as a regularizer

talk about gradient clipping

In [6]:
import os  # used by train_model to create the checkpoint folder
import torch
from scripts.architectures import MLP # I have pasted the code for the MLP with regularization in this script, no need to redefine it
from scripts.train_utils import AverageMeter, accuracy
from scripts import mnist

### Exploring optimizers in PyTorch

PT optimizers can be found in the `torch.optim` library.

We'll take a look at some of those, namely:

* SGD with momentum
* RMSProp
* Adam

If you're a fan of optimizers, you can browse the many others available in the `optim` library in the [official docs](https://pytorch.org/docs/stable/optim.html).

#### SGD w/ momentum

In PT, SGD with momentum is provided by the same `torch.optim.SGD` class as vanilla SGD: one of its arguments is `momentum` (0 by default, which gives vanilla SGD). Good values for momentum are usually high (close to 1, e.g. 0.9).

![](img/sgd_momentum.jpg)

*Image from Deep Learning book (Goodfellow et al.) - chapter 8.3.2*
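As a rough illustration of what the `momentum` argument does, the sketch below applies the update rule documented for `torch.optim.SGD` (velocity `v = momentum * v + grad`, then `param = param - lr * v`) to a one-dimensional toy problem, $f(x) = x^2$. The function name `sgd_momentum_1d` and the toy objective are assumptions made for this example; they are not part of the lab scripts.

```python
def sgd_momentum_1d(x0, lr, momentum, steps):
    """Minimise f(x) = x**2 with SGD + momentum; returns the list of iterates."""
    x, v = x0, 0.0
    xs = [x]
    for _ in range(steps):
        grad = 2.0 * x             # gradient of x**2
        v = momentum * v + grad    # velocity: decaying sum of past gradients
        x = x - lr * v             # step along the velocity, not the raw gradient
        xs.append(x)
    return xs


plain = sgd_momentum_1d(x0=1.0, lr=0.01, momentum=0.0, steps=50)
heavy = sgd_momentum_1d(x0=1.0, lr=0.01, momentum=0.9, steps=50)
print(f"|x| after 50 steps, no momentum:  {abs(plain[-1]):.4f}")
print(f"|x| after 50 steps, momentum=0.9: {abs(heavy[-1]):.4f}")
```

With this small learning rate, the run with momentum ends up noticeably closer to the minimum at 0 in the same number of steps, because consecutive gradients point in the same direction and accumulate in the velocity. This is a toy setting only; it says nothing about how the MLP will behave.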

In [13]:
lr = .1
wd = 5e-4
momentum = .9

num_epochs = 5

loss_fn = torch.nn.CrossEntropyLoss()

model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd, momentum=momentum)

Let us also recover the training and testing routines we defined last lab (without the trajectory):

In [15]:
def train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance): # note: I've added a generic performance to replace accuracy
    for X, y in dataloader:
        # 1. reset the gradients previously accumulated by the optimizer
        #    this will avoid re-using gradients from previous loops
        optimizer.zero_grad() 
        # 2. get the predictions from the current state of the model
        #    this is the forward pass
        y_hat = model(X)
        # 3. calculate the loss on the current mini-batch
        loss = loss_fn(y_hat, y)
        # 4. execute the backward pass given the current loss
        loss.backward()
        # 5. update the value of the params
        optimizer.step()
        # 6. calculate the accuracy for this mini-batch
        acc = performance(y_hat, y)
        # 7. update the loss and accuracy AverageMeter
        loss_meter.update(val=loss.item(), n=X.shape[0])
        performance_meter.update(val=acc, n=X.shape[0])


def train_model(model, dataloader, loss_fn, optimizer, num_epochs, checkpoint_loc=None, checkpoint_name="checkpoint.pt", performance=accuracy): # note: I've added a generic performance to replace accuracy; the trajectory tracking from last lab has been dropped

    # create the folder for the checkpoints (if it's not None)
    if checkpoint_loc is not None:
        os.makedirs(checkpoint_loc, exist_ok=True)
    
    model.train()

    # epoch loop
    for epoch in range(num_epochs):
        loss_meter = AverageMeter()
        performance_meter = AverageMeter()

        train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance)

        print(f"Epoch {epoch+1} completed. Loss - total: {loss_meter.sum} - average: {loss_meter.avg}; Performance: {performance_meter.avg}")

        # produce checkpoint dictionary -- but only if the name and folder of the checkpoint are not None
        if checkpoint_name is not None and checkpoint_loc is not None:
            checkpoint_dict = {
                "parameters": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch
            }
            torch.save(checkpoint_dict, os.path.join(checkpoint_loc, checkpoint_name))

    return loss_meter.sum, performance_meter.avg

def test_model(model, dataloader, performance=accuracy, loss_fn=None):
    # create an AverageMeter for the loss if passed
    if loss_fn is not None:
        loss_meter = AverageMeter()
    
    performance_meter = AverageMeter()

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            y_hat = model(X)
            loss = loss_fn(y_hat, y) if loss_fn is not None else None
            acc = performance(y_hat, y)
            if loss_fn is not None:
                loss_meter.update(loss.item(), X.shape[0])
            performance_meter.update(acc, X.shape[0])
    # get final performances
    fin_loss = loss_meter.sum if loss_fn is not None else None
    fin_perf = performance_meter.avg
    print(f"TESTING - loss {fin_loss if fin_loss is not None else '--'} - performance {fin_perf}")
    return fin_loss, fin_perf

and recover the data

In [11]:
trainloader, testloader, _, _ = mnist.get_data()

Let's train the network

In [14]:
train_model(model, trainloader, loss_fn, optimizer, num_epochs)

Epoch 1 completed. Loss - total: 26499.092242240906 - average: 0.44165153737068175; Performance: 0.8647500000317891
Epoch 2 completed. Loss - total: 14802.4463763237 - average: 0.246707439605395; Performance: 0.9258000000317892
Epoch 3 completed. Loss - total: 13043.869870185852 - average: 0.2173978311697642; Performance: 0.9346833333333333
Epoch 4 completed. Loss - total: 11759.672103881836 - average: 0.19599453506469727; Performance: 0.9414
Epoch 5 completed. Loss - total: 11193.425158977509 - average: 0.18655708598295848; Performance: 0.9434833333015442



By adding the momentum term, we already saw a small increase in training accuracy. Let's test:

In [17]:
test_model(model, testloader)

TESTING - loss -- - performance 0.9569333333333333


(None, 0.9569333333333333)

#### RMSProp and the LR sensitivity "dilemma"

With RMSProp we want to tackle a problem of SGD and SGD with momentum: a single learning rate is shared by all parameters, yet the loss can be much more _sensitive_ along some _directions_ than along others (read: parameters, since each parameter of the model represents a dimension in the optimization space).

RMSProp tries to tackle this issue by introducing an _adaptive rule_ for updating the learning rate parameter-wise in each step. 
In particular:
* it keeps track of the _history_ of the squared gradient via an exponentially decaying running average 
  * (the _decay_ is controlled by a hyperparameter $\rho \in (0,1)$)
* The parameter update is 
  * directly proportional to the learning rate
  * directly proportional to the gradient for this step
  * inversely proportional to the square root of the running average of the squared gradient
    * i.e., the direct effect of the gradient is _mitigated_ by dividing it by the root of this accumulated average

The formula for the update is:

$ \theta_{t+1} = \theta_t - \frac{\text{lr}}{\sqrt{\epsilon + \mathbf{R}(\rho)}} \odot \mathbf{G}$

where:
* $\epsilon$ is a small constant for numerical stability
* $\mathbf{R}(\rho)$ is the running average of the squared gradient (which depends upon $\rho$)
* $\mathbf{G}$ is the gradient for the current step
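
To make the rule concrete, here is a minimal sketch in plain Python for a single scalar parameter. It assumes the usual running-average update $\mathbf{R} \leftarrow \rho \mathbf{R} + (1 - \rho)\, \mathbf{G} \odot \mathbf{G}$ and subtracts the rescaled gradient from the parameter, as in gradient descent. The function `rmsprop_step` is a hypothetical helper written for this example. Note also that `torch.optim.RMSprop` calls the decay `alpha` and adds `eps` after taking the square root rather than inside it, so this sketch is not a line-by-line copy of the PT implementation.

```python
import math


def rmsprop_step(theta, grad, r, lr=0.01, rho=0.99, eps=1e-8):
    """One RMSProp update for a scalar parameter; returns (new_theta, new_r)."""
    # exponentially decaying running average of the squared gradient
    r = rho * r + (1.0 - rho) * grad ** 2
    # the gradient is rescaled by the root of the running average
    theta = theta - lr / math.sqrt(eps + r) * grad
    return theta, r


# Two parameters whose gradients differ by four orders of magnitude
theta_a, r_a = rmsprop_step(theta=1.0, grad=100.0, r=0.0)
theta_b, r_b = rmsprop_step(theta=1.0, grad=0.01, r=0.0)
print(f"update with grad=100:  {1.0 - theta_a:.4f}")
print(f"update with grad=0.01: {1.0 - theta_b:.4f}")
```

Both parameters move by a similar amount even though their gradients differ by a factor of $10^4$: the division by the running average evens out the step size across directions.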

In [24]:
model = MLP() # always remember to reinstantiate the net between tries
rmsprop = torch.optim.RMSprop(model.parameters()) # let's use the default hyperparams (lr=.01, alpha=.99, eps=1e-8); alpha plays the role of rho

In [25]:
train_model(model, trainloader, loss_fn, rmsprop, num_epochs)
test_model(model, testloader)

Epoch 1 completed. Loss - total: 25307.13468837738 - average: 0.42178557813962303; Performance: 0.8705166666984558
Epoch 2 completed. Loss - total: 16272.398222923279 - average: 0.27120663704872133; Performance: 0.9179166666984558
Epoch 3 completed. Loss - total: 14396.38200044632 - average: 0.23993970000743867; Performance: 0.9285000000317891
Epoch 4 completed. Loss - total: 13521.132495880127 - average: 0.22535220826466879; Performance: 0.9339333333651225
Epoch 5 completed. Loss - total: 12783.330872535706 - average: 0.21305551454226176; Performance: 0.9366666666666666
TESTING - loss -- - performance 0.9390000000317892


(None, 0.9390000000317892)

#### ADAM

In [32]:
model = MLP()
adam = torch.optim.Adam(model.parameters())

In [34]:
train_model(model, trainloader, loss_fn, adam, num_epochs)
test_model(model, testloader)

Epoch 1 completed. Loss - total: 48566.915053367615 - average: 0.8094485842227935; Performance: 0.8016000000317891
Epoch 2 completed. Loss - total: 18961.944045066833 - average: 0.3160324007511139; Performance: 0.9128166666666667
Epoch 3 completed. Loss - total: 15120.141127586365 - average: 0.2520023521264394; Performance: 0.9276666666348775
Epoch 4 completed. Loss - total: 13185.002179145813 - average: 0.21975003631909687; Performance: 0.9350666666348775
Epoch 5 completed. Loss - total: 12003.2468252182 - average: 0.2000541137536367; Performance: 0.9411
TESTING - loss -- - performance 0.9559833333651224


(None, 0.9559833333651224)

A remaining issue is the static nature of the learning rate (LR):
* if the LR is too high, we'll notice a sharp increase in accuracy followed by a relatively quick plateau corresponding to non-optimal solutions.
  * this is because we'll likely miss local optima because our step in the parameter space is too large
* if the LR is too low, training will be excruciatingly slow and we'll likely get stuck in very bad local optima, being unable to get out of them because the step in the parameter space is too small to get out of these _valleys_
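
The two regimes can be seen on a toy problem with plain gradient descent on $f(x) = x^2$. This is only a sketch: the function `gradient_descent` and the three LR values are assumptions chosen for illustration, and a convex toy function has no bad local optima, so it only shows overshooting and slow progress.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Run plain gradient descent on f(x) = x**2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2.0 * x   # gradient of x**2 is 2x
    return abs(x)


for lr in (1.1, 0.4, 0.001):
    print(f"lr={lr}: |x| after 20 steps = {gradient_descent(lr):.3e}")
```

With the largest LR each step overshoots the minimum and the iterate moves away from it; with the smallest LR the iterate has barely moved after 20 steps; the intermediate value converges quickly.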

An _ideal_ solution would be to keep a _high enough_ LR until we find a _good enough_ portion of the parameter space, then progressively decrease the LR in order to carefully explore these areas for good optima.
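
One possible way to approximate this behaviour in PT is a learning rate scheduler from `torch.optim.lr_scheduler`. The sketch below uses `StepLR`, which multiplies the LR by `gamma` every `step_size` epochs. The `Linear` layer is a stand-in for the MLP so that the snippet runs on its own, and the values of `step_size` and `gamma` are arbitrary assumptions, not tuned recommendations.

```python
import torch

# Stand-in for the MLP, only here to give the optimizer some parameters
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# multiply the LR by gamma every step_size epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    # LR in use during this epoch
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... one epoch of training would go here ...
    optimizer.step()   # placeholder: in real training this is called per mini-batch
    scheduler.step()   # advance the schedule once per epoch

print(lrs)
```

The LR stays at its initial value for the first two epochs and is then divided by 10 every two epochs. In a real run, `scheduler.step()` would be called once per epoch, after the mini-batch loop has finished.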

### References

[1](https://www.deeplearningbook.org/) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.