### Homework

1. Now that you have all the tools to train an MLP with high performance on MNIST, try reaching 0-loss on the training data (with a small epsilon, e.g. 99.99% training performance -- don't worry if you overfit!).
The implementation is completely up to you. You just need to keep it an MLP without using fancy layers (e.g., keep the `Linear` layers, don't use `Conv1d` or something like this, don't use attention). You are free to use any LR scheduler or optimizer, any one of batchnorm/groupnorm, regularization methods... If you use something we haven't seen during lectures, please motivate your choice and explain (as briefly as possible) how it works.

In [1]:
%reset -f

In [2]:
import torch
import os
import sys
from torch import nn
from matplotlib import pyplot as plt

from scripts import mnist
from scripts.train_utils import accuracy, AverageMeter
from scripts import architectures

- **Training time until convergence:** There seems to be a sweet spot. If the batch size is very small (e.g. 8), this time goes up. If the batch size is huge, it is also higher than the minimum.

- **Training time per epoch:** Bigger computes faster (is efficient)

- **Resulting model quality:** The lower the better due to better generalization (?)

In [3]:
# load the data
minibatch_size_train = 256
minibatch_size_test = 512

trainloader, testloader, trainset, testset = mnist.get_data(batch_size_train=minibatch_size_test, 
                                                            batch_size_test=minibatch_size_test)

In [4]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 256),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=256),
            nn.Linear(256, 512),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=512),
            nn.Linear(512, 256),
            nn.ReLU(),
            
            nn.BatchNorm1d(num_features=256),
            nn.Linear(256, 128),
            nn.ReLU(),
            
            nn.BatchNorm1d(num_features=128),
            nn.Linear(128, 64),
            nn.ReLU(),
            
            nn.BatchNorm1d(num_features=64),
            nn.Linear(64, 32),
            nn.ReLU(),       

            nn.BatchNorm1d(num_features=32),
            nn.Linear(32, 10)
        )

    def forward(self, X):
        return self.layers(X)

In [5]:
def train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, accuracy_meter):
    for X, y in dataloader:
        optimizer.zero_grad()
        y_hat = model(X)
        loss = loss_fn(y_hat, y)
        loss.backward()
        optimizer.step()
        acc = accuracy(y_hat, y)
        loss_meter.update(val=loss.item(), n=X.shape[0])
        accuracy_meter.update(val=acc, n=X.shape[0])
        
def train_model(model, dataloader, loss_fn, optimizer, num_epochs,lr_scheduler=None, epoch_start_scheduler=1):
    model.train()
    for epoch in range(num_epochs):
        loss_meter = AverageMeter()
        accuracy_meter = AverageMeter()
        train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, accuracy_meter)
        # now with loss meter we can print both the cumulative value and the average value
        print(f"Epoch {epoch+1} completed. Loss - total: {loss_meter.sum} - average: {loss_meter.avg}; Accuracy: {accuracy_meter.avg}")
    # we also return the stats for the final epoch of training
    
        if lr_scheduler is not None:
            if epoch >= epoch_start_scheduler:
                lr_scheduler.step()
    return loss_meter.sum, accuracy_meter.avg

In [6]:
%time
learn_rate = 0.1 # for SGD
num_epochs = 20
loss_fn = nn.CrossEntropyLoss()
model = MLP()
adam = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.StepLR(adam, step_size=5, gamma=.1)

train_model(model, trainloader, loss_fn, adam, num_epochs,scheduler)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.25 µs
Epoch 1 completed. Loss - total: 25049.58820915222 - average: 0.4174931368192037; Accuracy: 0.9212333333333333
Epoch 2 completed. Loss - total: 7171.282043457031 - average: 0.11952136739095053; Accuracy: 0.9711166666666666
Epoch 3 completed. Loss - total: 4859.559731960297 - average: 0.08099266219933828; Accuracy: 0.9783166666666666
Epoch 4 completed. Loss - total: 3597.719836473465 - average: 0.05996199727455775; Accuracy: 0.9833
Epoch 5 completed. Loss - total: 2773.6652530431747 - average: 0.046227754217386244; Accuracy: 0.9862
Epoch 6 completed. Loss - total: 2214.2340394556522 - average: 0.036903900657594205; Accuracy: 0.9892333333333333
Epoch 7 completed. Loss - total: 1082.5240494012833 - average: 0.018042067490021386; Accuracy: 0.9953
Epoch 8 completed. Loss - total: 686.4664028435946 - average: 0.01144110671405991; Accuracy: 0.9979333333333333
Epoch 9 completed. Loss - total: 572.324460029602 - average: 0.00953874

(331.5814508795738, 0.9994833333333333)

2. Try reaching 0-loss on the training data with **permuted labels**. Assess the model on the test data (without permuted labels) and comment. Help yourself with [3](https://arxiv.org/abs/1611.03530).
*Tip*: To permute the labels, act on the `trainset.targets` with an appropriate torch function.
Then, you can pass this "permuted" `Dataset` to a `DataLoader` like so: `trainloader_permuted = torch.utils.data.DataLoader(trainset_permuted, batch_size=batch_size_train, shuffle=True)`. You can now use this `DataLoader` inside the training function.
Additional view for motivating this exercise: ["The statistical significance perfect linear separation", by Jared Tanner (Oxford U.)](https://www.youtube.com/watch?v=vl2QsVWEqdA).

In [24]:
trainset.targets

tensor([5, 0, 4,  ..., 5, 6, 8])