### Recap from previous Lab

* We "closed the loop" on our first MultiLayer Perceptron (MLP), exploring how the training routine is implemented in PyTorch (PT):

    * we saw how to use built-in loss functions in PT and we learned how to construct custom losses based upon tensor methods
    * moreover, we also saw how to use vanilla Stochastic Gradient Descent (SGD) in conjunction with backpropagation to update the parameters of our MLP

### Agenda for today

* The main topic of our lecture is **regularization**
* First of all, though, we will implement a framework for monitoring the parameters during training (the so-called *trajectory*), a simple research exercise
* On to regularization, we will see various ways to infuse regularization into our MLP training, still with an eye on the trajectory:

  * L2 regularization (aka *weight decay*)
  * dropout
  * normalization layers
  

### Examining parameters mid-training – the trajectory

We already covered how to save the "snapshot" of the parameters via the `state_dict` during the previous Lab.
We can use the same method to recover the trajectory during our training, although a more useful alternative is `model.named_parameters()`

In [None]:
import torch
import os
from torch import nn
from matplotlib import pyplot as plt


from scripts import mnist
from scripts.train_utils import accuracy, AverageMeter

Let us quickly recover the stuff we implemented during Lab 2

In [None]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.flat = nn.Flatten()
        self.h1 = nn.Linear(28*28, 16)
        self.h2 = nn.Linear(16, 32)
        self.h3 = nn.Linear(32, 24)
        self.out = nn.Linear(24, 10)
    
    def forward(self, X, activ_hidden=nn.functional.relu):
        out = self.flat(X)
        out = activ_hidden(self.h1(out))
        out = activ_hidden(self.h2(out))
        out = activ_hidden(self.h3(out))
        out = self.out(out)
        return out

Let's see what `model.named_parameters()` is and how to use it:

In [None]:
model = MLP()
for name, param in model.named_parameters():
    print(name, "\t", param.shape)

We can now add a small piece of code to our `train_epoch` from last Lab to implement the tracking of the trajectory: more specifically, we're interested in the L2-norm of the parameters during each training iteration.

NB:
* Each **epoch** is composed of **training iterations**:
  * a training iteration coincides with the forward/backward pass on a single batch
  * an epoch is completed when the whole of the dataset has been seen by the network during training. After an epoch has ended, we reshuffle the batches and begin a new epoch
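To make the relation between epochs and training iterations concrete, here is a small sketch. The helper `iterations_per_epoch` is hypothetical (it is not part of the Lab scripts), and the figure of 60,000 samples assumes the standard MNIST training split, which may not be what `mnist.get_data` returns.

```python
import math

def iterations_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of training iterations (batches) needed to see the dataset once."""
    if drop_last:
        # an incomplete final batch is discarded
        return num_samples // batch_size
    # an incomplete final batch still counts as one iteration
    return math.ceil(num_samples / batch_size)

print(iterations_per_epoch(60_000, 256))                  # 235
print(iterations_per_epoch(60_000, 256, drop_last=True))  # 234
```

Since we record the trajectory once per iteration, its length after training is the number of epochs times this quantity.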

As we can see above, `named_parameters()` lazily yields `(name, tensor)` pairs (it is a generator), so we cannot compute a norm on it directly.
What we need to do is:
* "flatten" all the tensors into a single tensor (a vector)
  * first, we must create this tensor, and to do so we have to know the number of parameters that we will need to fit into it
* calculate the norm of this vector

Equivalently, we can sum the squared entries of each parameter tensor and take the square root of the total.
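As a sanity check, the flattening route can be compared with simply accumulating the squared entries tensor by tensor. The sketch below uses a toy network (made up for illustration, not the Lab's `MLP`) and `torch.nn.utils.parameters_to_vector` to do the flattening.

```python
import torch
from torch import nn
from torch.nn.utils import parameters_to_vector

torch.manual_seed(0)
# toy network, only for illustration
toy = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# total number of scalar parameters: (4*3 + 3) + (3*2 + 2) = 23
num_params = sum(p.numel() for p in toy.parameters())

# route 1: flatten everything into one vector, then take its L2 norm
flat = parameters_to_vector(toy.parameters()).detach()
norm_flat = flat.norm(p=2)

# route 2: sum the squared entries of each tensor, then take the square root
norm_sum = sum((p.detach() ** 2).sum() for p in toy.parameters()).sqrt()

print(num_params, tuple(flat.shape))
print(norm_flat.item(), norm_sum.item())
```

The two values should agree up to floating-point error. The second route avoids allocating the flattened vector.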

In addition to the norm of the parameters, we're interested in the **norm of the gradients**.

Gradients of a layer may be accessed via `tensor.grad` where `tensor` is the tensor associated to the parameters of a given layer.

Let's try to call it now on our `named_parameters`:

In [None]:
for name, param in model.named_parameters():
    print(name, "\t", param.grad)

We see that all the gradients are `None`

**Q**: who knows why?

In [None]:
def get_params_and_gradients_norm(named_parameters):
    square_norms_params = []
    square_norms_grads = []

    for _, param in named_parameters:

        # Q: what is this and why did I write it here?
        if param.requires_grad:
            square_norms_params.append((param ** 2).sum())
            square_norms_grads.append((param.grad ** 2).sum())
    
    norm_params = sum(square_norms_params).sqrt().item()
    norm_grads = sum(square_norms_grads).sqrt().item()

    return norm_params, norm_grads

In [None]:
def train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance, trajectory, device): # note: I've added a generic performance to replace accuracy and the device
    for X, y in dataloader:
        # TRANSFER X AND y TO GPU IF SPECIFIED
        X = X.to(device)
        y = y.to(device)
        # ... like last time
        optimizer.zero_grad() 
        y_hat = model(X)
        loss = loss_fn(y_hat, y)
        loss.backward()
        optimizer.step()
        acc = performance(y_hat, y)
        loss_meter.update(val=loss.item(), n=X.shape[0])
        performance_meter.update(val=acc, n=X.shape[0])

        if trajectory is not None:

            params_norm, gradients_norm = get_params_and_gradients_norm(model.named_parameters())
            trajectory["parameters"].append(params_norm)
            trajectory["gradients"].append(gradients_norm)

def train_model(model, dataloader, loss_fn, optimizer, num_epochs, checkpoint_loc=None, checkpoint_name="checkpoint.pt", performance=accuracy, trajectory=None, device=None): # note: I've added a generic performance to replace accuracy and an object where to store the trajectory and the device on which to run our training

    # establish device
    if device is None:
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(f"Training on {device}")

    # create the folder for the checkpoints (if it's not None)
    if checkpoint_loc is not None:
        os.makedirs(checkpoint_loc, exist_ok=True)
    
    model.to(device)
    model.train()

    # epoch loop
    for epoch in range(num_epochs):
        loss_meter = AverageMeter()
        performance_meter = AverageMeter()

        train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance, trajectory, device)

        print(f"Epoch {epoch+1} completed. Loss - total: {loss_meter.sum} - average: {loss_meter.avg}; Performance: {performance_meter.avg}")

        # produce checkpoint dictionary -- but only if the name and folder of the checkpoint are not None
        if checkpoint_name is not None and checkpoint_loc is not None:
            checkpoint_dict = {
                "parameters": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch
            }
            torch.save(checkpoint_dict, os.path.join(checkpoint_loc, checkpoint_name))

    return loss_meter.sum, performance_meter.avg, trajectory

def test_model(model, dataloader, performance=accuracy, loss_fn=None, device=None):
    # establish device
    if device is None:
        device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # create an AverageMeter for the loss if passed
    if loss_fn is not None:
        loss_meter = AverageMeter()
    
    performance_meter = AverageMeter()

    model.to(device)
    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X = X.to(device)
            y = y.to(device)

            y_hat = model(X)
            loss = loss_fn(y_hat, y) if loss_fn is not None else None
            acc = performance(y_hat, y)
            if loss_fn is not None:
                loss_meter.update(loss.item(), X.shape[0])
            performance_meter.update(acc, X.shape[0])
    # get final performances
    fin_loss = loss_meter.sum if loss_fn is not None else None
    fin_perf = performance_meter.avg
    print(f"TESTING - loss {fin_loss if fin_loss is not None else '--'} - performance {fin_perf}")
    return fin_loss, fin_perf

In [None]:
minibatch_size_train = 256
minibatch_size_test = 512

trainloader, testloader, trainset, testset = mnist.get_data(batch_size_train=minibatch_size_train, batch_size_test=minibatch_size_test)

In [None]:
learn_rate = 0.1
num_epochs = 5

model = MLP()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learn_rate)

trajectory = {"parameters": [], "gradients": []}

Train the MLP:

In [None]:
train_loss, train_acc, trajectory = train_model(model, trainloader, loss_fn, optimizer, num_epochs, trajectory=trajectory)

Now we can plot the norms

In [None]:
def plot_trajectory(trajectory, ylim=(0,9)):
    fig, (ax1, ax2) = plt.subplots(1,2, sharey=True)
    ax1.set_ylim(*ylim)
    ax1.plot(trajectory["parameters"])
    ax1.set_title("Norm of parameters")
    ax2.plot(trajectory["gradients"])
    ax2.set_title("Norm of gradients")
    plt.show()

In [None]:
plot_trajectory(trajectory)

We can see a similar trend to what [3](https://arxiv.org/abs/2002.10365) observed:

1. in the very first iterations, gradients are very large and parameters quickly grow
2. gradients quickly diminish, while parameter growth tends to slow down
3. gradients reach a "stationary state" (with some noise added), while parameters grow even slower

5 epochs of training is very little. If you want (_not a homework_), you may try increasing the number of epochs and see what happens.

Of course, the analysis in [3](https://arxiv.org/abs/2002.10365) is much finer than ours and considers many more quantities and experiments.

### Regularization

Regularization in DL comes in the form of different tools. We can have:

1. Penalty terms in loss functions (e.g. L1 and L2 norm regularization) which introduce bias in our parameters by actively reducing the magnitude of some weights:
    * L1 norm regularization is also called LASSO regularization
    * L2 norm regularization is also called Ridge regularization or **weight decay**
    * they were originally introduced for linear regression models as a way to infuse *inductive bias* into estimators that were otherwise designed to be completely unbiased on the training data
2. Normalization layers, which normalize the incoming information s.t. its mean is zero and its standard deviation is one. They come in different flavors:
    * batch normalization or batchnorm (the most common technique)
    * group normalization or groupnorm
    * there are more possibilities, for additional info on these, please check [this lecture by Aaron Defazio, NYU](https://atcold.github.io/pytorch-Deep-Learning/en/week05/05-2/)
3. Dropout, a technique [patented by Google](https://patents.google.com/patent/US9406017B2/en) which consists in randomly *dropping* some neurons from a given layer to prevent overfitting.
4. Early stopping, which we'll see later on during this Lab.

#### Weight decay or L2 norm/Ridge regularization

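Weight Decay (WD) is a simple technique which *appends* a penalty term to the loss function equation. The term is based upon the squared L2 norm of the weights.

Given our original loss function $\mathcal{L}_0 (\hat{y}, y)$ and our parameter vector $\Theta$, our new loss will be:

$\mathcal{L}_0 (\hat{y}, y) + \frac{\lambda}{2} \cdot \vert\vert \Theta \vert\vert_2^2$

where $\vert\vert x \vert\vert_2^2 = \sum_{j=1}^d x_j^2$. The factor $\frac{1}{2}$ is a convention: it makes the gradient of the penalty equal to $\lambda \cdot \Theta$, which is the quantity that `weight_decay` adds to the gradient in `torch.optim.SGD`.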

The parameter $\lambda$ (also called weight decay) controls the strength of the regularization. If $\lambda$ is too high, the model will not concentrate well enough on the original objective ($\mathcal{L}_0$), hence it will not perform well. Usually, good values for $\lambda$ fall into the interval $[1\cdot 10^{-4}, 5\cdot 10^{-4}]$.

In PT, instead of inserting our penalty term in the loss function, we specify the weight decay parameter in our optimizer:

In [None]:
weight_decay = 5e-2
optimizer = torch.optim.SGD(model.parameters(), lr=learn_rate, weight_decay=weight_decay)
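A quick way to convince ourselves of what `weight_decay` does in `torch.optim.SGD` is to compare it against an explicit penalty added to the loss. The sketch below is not part of the Lab code: it uses a toy linear model and random data (both made up for illustration), and it assumes the penalty is $\frac{\lambda}{2} \vert\vert \Theta \vert\vert_2^2$ computed over all parameters, biases included.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
lam, lr = 5e-2, 0.1
X, y = torch.randn(8, 4), torch.randint(0, 3, (8,))  # random toy data
loss_fn = nn.CrossEntropyLoss()

model_a = nn.Linear(4, 3)
model_b = copy.deepcopy(model_a)  # identical starting point

# (a) weight decay handled by the optimizer
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=lam)
opt_a.zero_grad()
loss_fn(model_a(X), y).backward()
opt_a.step()

# (b) explicit penalty (lam / 2) * ||theta||_2^2 added to the loss
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
opt_b.zero_grad()
penalty = sum((p ** 2).sum() for p in model_b.parameters())
(loss_fn(model_b(X), y) + 0.5 * lam * penalty).backward()
opt_b.step()

# after one step the two models should hold the same parameters
same = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

Note that this equivalence is specific to plain SGD. With other optimizers, the `weight_decay` argument may not correspond exactly to a penalty in the loss.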

#### L1 norm regularization

L1 norm regularization is analogous to weight decay. The equation is:

$\mathcal{L}_0 (\hat{y}, y) + \lambda \cdot \vert\vert \Theta \vert\vert_1$

where $\vert\vert x \vert\vert_1 = \sum_{j=1}^d \vert x_j \vert$

Unlike weight decay, to my knowledge PT does not provide a built-in for L1 regularization. You need to define a custom loss function for this task (**homework**).

#### Batchnorm

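Batch Normalization is not really a regularization technique. It operates in such a way that, for each feature, the mean and standard deviation of the incoming batches of data are approximately 0 and 1 respectively.

During training, each batch is normalized with its own mean and variance. At the same time, the layer keeps one vector for the mean (a running mean) and one for the variance (a running variance) as estimates for the whole dataset, and at evaluation time it normalizes w.r.t. these. The running statistics are stored alongside the parameters of the network (in PT, as *buffers*): they are not adjusted via backprop, but they get adjusted each time the layer *sees* another batch of data during training. By default, the layer also has a learnable scale and shift (`weight` and `bias`), which are regular parameters trained via backprop.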

![](https://miro.medium.com/max/474/1*QQ2Q5rVBtLv7b3yGhO0flg.png)

*picture from [towardsdatascience.com](https://towardsdatascience.com/batch-normalisation-explained-5f4bd9de5feb)*

When the network is evaluated on test data, the running mean and variance must not be adjusted, hence PT has implemented a "switch", which we saw during the previous Lab, to tell the network when to adjust and when not to adjust these two statistics. The switch is triggered via `model.train()` and `model.eval()` (or equivalently `model.train(False)`).
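The effect of the switch can be observed on a single `nn.BatchNorm1d` layer. This is a minimal sketch on random data (made up for illustration): it checks that the running mean moves in training mode and stays frozen in evaluation mode.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=3)
x = torch.randn(64, 3) * 4 + 10  # toy batch, far from mean 0 / std 1

print([name for name, _ in bn.named_parameters()])  # ['weight', 'bias']
print([name for name, _ in bn.named_buffers()])     # ['running_mean', 'running_var', 'num_batches_tracked']

bn.train()
out_train = bn(x)                        # normalized with the statistics of this batch
mean_after_train = bn.running_mean.clone()

bn.eval()
out_eval = bn(x)                         # normalized with the running statistics
frozen = torch.equal(bn.running_mean, mean_after_train)

print(out_train.mean(dim=0))  # close to 0 for every feature
print(mean_after_train)       # no longer the initial zeros
print(frozen)                 # the eval-mode pass did not touch the running mean
```

After a single training batch the running estimates are still far from the true statistics, so `out_eval` is not yet well normalized. The estimates improve as more batches are seen.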

In PT, batch normalization is available as a regular layer within the `torch.nn` library:

In [None]:
class MLP_BN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 16),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=16), # we specify the dimensionality of the incoming data
            nn.Linear(16, 32),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=32),
            nn.Linear(32, 24),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=24),
            nn.Linear(24, 10)
        )


    def forward(self, X):
        return self.layers(X)

**Q** (for the most skilled students): why didn't we apply batchnorm for the first layer?

By peeking at the PT docs, we can see that the batchnorm layers actually have many more hyperparameters which we could play with if we wanted to:

![](img/bn_docs.jpg)

*from [PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html)*

In addition to what we have said so far, there is still some debate in the DL community on whether batchnorm or other normalization techniques help optimization. The claim in the original paper [1](https://arxiv.org/abs/1502.03167) of "reducing internal covariate shift" was disputed in later works such as [2](https://arxiv.org/abs/1805.11604), which claims instead that batchnorm "makes the optimization landscape significantly smoother". Another thing to consider is that, since the data is distributed in a small interval around 0, numerical stability also improves.

#### Dropout

Dropout acts by removing (i.e. *zeroing-out*) a random subset of the neurons in a given layer for each forward pass.

It has one hyperparameter ($p$), which is the fraction of neurons to be dropped out.

During training, each time a layer with dropout produces an output, a fraction $p$ of that output gets discarded. This helps in such a way that co-dependence between neurons gets *forgotten* by the network. To say it in simple terms, it forces each neuron to be independent from the output of other neurons within the same layer.

For the same reason as in batchnorm, since dropout has to apply only during training, we must remember to activate the switch `model.eval()` when testing our network.

In PT, we find Dropout as a module of `torch.nn`. Instead of placing it *before* the layer (as in batchnorm), we place it *after* the layer (the reason being, the layer produces an output, a portion $p$ of that output gets discarded).
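The behaviour of `nn.Dropout` in the two modes can be checked on a vector of ones. This is a minimal sketch on made-up input. One detail not mentioned above: during training, PT rescales the surviving entries by $\frac{1}{1-p}$, so that the expected value of the output matches the input.

```python
import torch
from torch import nn

torch.manual_seed(0)
p = 0.2
drop = nn.Dropout(p=p)
x = torch.ones(1000)  # toy "layer output"

drop.train()
out_train = drop(x)
frac_zero = (out_train == 0).float().mean().item()  # roughly p
survivors = out_train[out_train != 0]                # each equal to 1 / (1 - p) = 1.25

drop.eval()
out_eval = drop(x)                                   # dropout is the identity in eval mode

print(frac_zero)
print(survivors.unique())
print(torch.equal(out_eval, x))
```

The zeroed positions are drawn anew at every forward pass, so two training-mode calls on the same input generally give different outputs.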

In [None]:
class MLP_BN_Drop(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 16),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=16),
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Dropout(p=.2), # we add a dropout here: it acts on the output of the previous layer (with 32 neurons)

            nn.BatchNorm1d(num_features=32),
            nn.Linear(32, 24),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=24),
            nn.Linear(24, 10)
        )


    def forward(self, X):
        return self.layers(X)

In [None]:
model = MLP()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learn_rate, weight_decay=weight_decay)

trajectory_reg = {"parameters": [], "gradients": []}

In [None]:
train_loss, train_acc, trajectory_reg = train_model(model, trainloader, loss_fn, optimizer, num_epochs, trajectory=trajectory_reg)

In [None]:
plot_trajectory(trajectory_reg)

#### HOMEWORK - Early stopping

Early stopping is yet another example of regularization technique which relies a lot on practical and experimental observations rather than any supporting theory.

It is based upon the concept of **validation**, which is an assessment mode additional to *testing*. Actually, what we have so far described as testing is technically a validation.
* a validation dataset may be obtained as the result of a random split of the original training dataset
* a testing dataset should instead be obtained from a model deployed "in the wild" and should consist of data unseen (by both the model and its architect) during the training and design phases.

In a normal academic setting it's very hard to obtain a proper testing dataset, so usually the meanings of testing and validation get mixed up a little bit.

Anyway, early stopping requires us to assess the model at each epoch to get a proxy for the testing performance(s) (**validation step**). That should give us an idea of how the model **learns to generalize** (if it ever does...) during training.

The *theoretical trend* (1990s), which is pretty much absent in modern Deep Learning due to a lot of modern factors, is the following (figures from [4](https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf)):

![](img/generalization.jpg)

Already in that period, different behaviors were observed:

![](img/generalization_ugly.jpg)

In some of my experiments, this happens (blue=training, orange=validation):

![](img/generalization_fmnist1.jpg)

![](img/generalization_fmnist2.jpg)

(red=training, blue=validation)

![](img/plot_acc_09.png)

As we observe the curves for training and validation performance, we may notice some trends:
* there usually is an intersection between the two curves which marks the moment when the network starts to **overfit** the training data.
    * it might happen that, after that moment, the validation performance stays roughly the same (_white noise_), or that it drops and never recovers
    * it might happen, instead, that the validation performance stays a few points below the training performance but keeps on growing
    * it might happen, finally, that the validation performance peaks a few epochs later and then decreases
    * other situations may apply depending on the dataset, the optimizer, the presence of regularization, and a lot of other factors.
    
A trick which is very often applied is to track the validation performance during training and retain the model with the highest validation performance.
**Note**: it may not be the best strategy as the validation dataset may not be representative of the data manifold (!).
In the main reference for early stopping ([4](https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf)), it is indicated as $E_{\text{opt}}$.

**Homework**: implement "early stopping" in the $E_{\text{opt}}$ specification, using the test data as validation (since we don't know yet how to create additional `DataLoaders` and operate random splitting).

**Homework for the bravest ones**: read [4](https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf) and try implementing at least one of the techniques there specified (besides $E_{\text{opt}}$, of course). 



**References**

[1](https://arxiv.org/abs/1502.03167) Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456). PMLR.

[2](https://arxiv.org/abs/1805.11604) Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization?. arXiv preprint arXiv:1805.11604.

[3](https://arxiv.org/abs/2002.10365) Frankle, J., Schwab, D. J., & Morcos, A. S. (2020). The early phase of neural network training. arXiv preprint arXiv:2002.10365.

[4](https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf) Prechelt, L. (1998). Early stopping-but when?. In Neural Networks: Tricks of the trade (pp. 55-69). Springer, Berlin, Heidelberg.

#### Homework recap

1. Implement L1 norm regularization as a custom loss function
2. Implement early stopping in the $E_{\text{opt}}$ specification
3. Implement early stopping in one of the additional specifications as of [4](https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf) 