### Homework

1. Now that you have all the tools to train an MLP with high performance on MNIST, try reaching 0-loss on the training data (with a small epsilon, e.g. 99.99% training performance -- don't worry if you overfit!).
The implementation is completely up to you. You just need to keep it an MLP without using fancy layers (e.g., keep the `Linear` layers, don't use `Conv1d` or something like this, don't use attention). You are free to use any LR scheduler or optimizer, any one of batchnorm/groupnorm, regularization methods... If you use something we haven't seen during lectures, please motivate your choice and explain (as briefly as possible) how it works.

In [130]:
%reset -f

In [131]:
import torch
import os
import sys
from torch import nn
from matplotlib import pyplot as plt

from scripts import mnist
from scripts.train_utils import accuracy, AverageMeter
from scripts import architectures

torch.manual_seed(42)

<torch._C.Generator at 0x7ff9a8175b50>

- **Training time until convergence:** There seems to be a sweet spot. If the batch size is very small (e.g. 8), this time goes up. If the batch size is huge, it is also higher than the minimum.

- **Training time per epoch:** Bigger computes faster (is efficient)

- **Resulting model quality:** The lower the better due to better generalization (?)

In [132]:
# load the data
minibatch_size_train = 256
minibatch_size_test = 512

trainloader, testloader, trainset, testset = mnist.get_data(batch_size_train=minibatch_size_test, 
                                                            batch_size_test=minibatch_size_test)

In [133]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 256),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=256),
            nn.Linear(256, 512),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=512),
            nn.Linear(512, 256),
            nn.ReLU(),
            
            nn.BatchNorm1d(num_features=256),
            nn.Linear(256, 128),
            nn.ReLU(),
            
            nn.BatchNorm1d(num_features=128),
            nn.Linear(128, 64),
            nn.ReLU(),
            
            nn.BatchNorm1d(num_features=64),
            nn.Linear(64, 32),
            nn.ReLU(),       

            nn.BatchNorm1d(num_features=32),
            nn.Linear(32, 10)
        )

    def forward(self, X):
        return self.layers(X)

In [137]:
def train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, accuracy_meter):
    for X, y in dataloader:
        optimizer.zero_grad()
        y_hat = model(X)
        loss = loss_fn(y_hat, y)
        loss.backward()
        optimizer.step()
        acc = accuracy(y_hat, y)
        loss_meter.update(val=loss.item(), n=X.shape[0])
        accuracy_meter.update(val=acc, n=X.shape[0])
        
def train_model(model, dataloader, loss_fn, optimizer, num_epochs,lr_scheduler=None, epoch_start_scheduler=1):
    model.train()
    for epoch in range(num_epochs):
        loss_meter = AverageMeter()
        accuracy_meter = AverageMeter()
        train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, accuracy_meter)
        # now with loss meter we can print both the cumulative value and the average value
        print(f"Epoch {epoch+1} completed. Loss - total: {loss_meter.sum} - average: {loss_meter.avg}; Accuracy: {accuracy_meter.avg}")
    # we also return the stats for the final epoch of training
    
        if lr_scheduler is not None:
            if epoch >= epoch_start_scheduler:
                lr_scheduler.step()
    return loss_meter.sum, accuracy_meter.avg


def test_model(model, dataloader, performance=accuracy, loss_fn=None):
    # create an AverageMeter for the loss if passed
    if loss_fn is not None:
        loss_meter = AverageMeter()
    
    performance_meter = AverageMeter()

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            y_hat = model(X)
            loss = loss_fn(y_hat, y) if loss_fn is not None else None
            acc = performance(y_hat, y)
            if loss_fn is not None:
                loss_meter.update(loss.item(), X.shape[0])
            performance_meter.update(acc, X.shape[0])
    # get final performances
    fin_loss = loss_meter.sum if loss_fn is not None else None
    fin_perf = performance_meter.avg
    print(f"TESTING - loss {fin_loss if fin_loss is not None else '--'} - performance {fin_perf}")
    return fin_loss, fin_perf

In [6]:
%time
learn_rate = 0.1 # for SGD
num_epochs = 20
loss_fn = nn.CrossEntropyLoss()
model = MLP()
adam = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.StepLR(adam, step_size=5, gamma=.1)

train_model(model, trainloader, loss_fn, adam, num_epochs,scheduler)

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 9.06 µs
Epoch 1 completed. Loss - total: 25245.082615852356 - average: 0.4207513769308726; Accuracy: 0.91925
Epoch 2 completed. Loss - total: 6942.592653989792 - average: 0.11570987756649653; Accuracy: 0.9726
Epoch 3 completed. Loss - total: 4816.2511620521545 - average: 0.08027085270086924; Accuracy: 0.9780166666666666
Epoch 4 completed. Loss - total: 3532.179852962494 - average: 0.05886966421604156; Accuracy: 0.9840166666666667
Epoch 5 completed. Loss - total: 2648.7057873010635 - average: 0.04414509645501773; Accuracy: 0.9872833333333333
Epoch 6 completed. Loss - total: 2216.1498260498047 - average: 0.03693583043416341; Accuracy: 0.9890333333333333
Epoch 7 completed. Loss - total: 1064.156699001789 - average: 0.01773594498336315; Accuracy: 0.9952666666666666
Epoch 8 completed. Loss - total: 618.9858611226082 - average: 0.010316431018710137; Accuracy: 0.9980833333333333
Epoch 9 completed. Loss - total: 496.1282648444176 - averag

(315.8218548297882, 0.9995)

In [157]:
test_model(model, testloader)

TESTING - loss -- - performance 0.9987166666666667


(None, 0.9987166666666667)

**Note: Since it is said that we can neglect the overfitting, none of the regularization techniques weren't used.**

2. Try reaching 0-loss on the training data with **permuted labels**. Assess the model on the test data (without permuted labels) and comment. Help yourself with [3](https://arxiv.org/abs/1611.03530).
*Tip*: To permute the labels, act on the `trainset.targets` with an appropriate torch function.
Then, you can pass this "permuted" `Dataset` to a `DataLoader` like so: `trainloader_permuted = torch.utils.data.DataLoader(trainset_permuted, batch_size=batch_size_train, shuffle=True)`. You can now use this `DataLoader` inside the training function.
Additional view for motivating this exercise: ["The statistical significance perfect linear separation", by Jared Tanner (Oxford U.)](https://www.youtube.com/watch?v=vl2QsVWEqdA).

In [141]:
trainloaderToPerm, testloader, trainsetToPerm, testset = mnist.get_data(batch_size_train=1024, 
                                                            batch_size_test=1024)

indexes = torch.randperm(trainsetToPerm.targets.shape[0])
trainset_target_permuted = trainsetToPerm.targets[indexes]
trainsetToPerm.targets = trainset_target_permuted

trainloader_permuted = torch.utils.data.DataLoader(trainsetToPerm, batch_size=1024, shuffle=True)

In [144]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 1500),
            nn.Tanh(),

            nn.BatchNorm1d(num_features=1500),
            nn.Linear(1500, 750),
            nn.Tanh(),

            nn.BatchNorm1d(num_features=750),
            nn.Linear(750, 400),
            nn.Tanh(),
            
            nn.BatchNorm1d(num_features=400),
            nn.Linear(400, 10),

        )

    def forward(self, X):
        return self.layers(X)

In [146]:
%time
learn_rate = 0.1 # for SGD
num_epochs = 80
loss_fn = nn.CrossEntropyLoss()
model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(adam, step_size=5, gamma=.2)

train_model(model, trainloader_permuted, loss_fn, optimizer, num_epochs, scheduler)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.01 µs
Epoch 1 completed. Loss - total: 141442.35330963135 - average: 2.3573725551605222; Accuracy: 0.1004
Epoch 2 completed. Loss - total: 138155.65364837646 - average: 2.302594227472941; Accuracy: 0.1223
Epoch 3 completed. Loss - total: 137127.17332458496 - average: 2.2854528887430825; Accuracy: 0.13505
Epoch 4 completed. Loss - total: 136427.59621429443 - average: 2.2737932702382406; Accuracy: 0.14665
Epoch 5 completed. Loss - total: 135899.9328765869 - average: 2.2649988812764486; Accuracy: 0.15066666666666667
Epoch 6 completed. Loss - total: 135113.18854522705 - average: 2.251886475753784; Accuracy: 0.16115
Epoch 7 completed. Loss - total: 134509.1104888916 - average: 2.241818508148193; Accuracy: 0.16918333333333332
Epoch 8 completed. Loss - total: 133746.01634979248 - average: 2.229100272496541; Accuracy: 0.17728333333333332
Epoch 9 completed. Loss - total: 132826.2497024536 - average: 2.213770828374227; Accuracy: 0.1859166

Epoch 76 completed. Loss - total: 1803.0588864088058 - average: 0.030050981440146764; Accuracy: 0.9948833333333333
Epoch 77 completed. Loss - total: 1641.8966326713562 - average: 0.027364943877855936; Accuracy: 0.9952333333333333
Epoch 78 completed. Loss - total: 1364.9683384895325 - average: 0.022749472308158873; Accuracy: 0.9968333333333333
Epoch 79 completed. Loss - total: 1276.4597100615501 - average: 0.021274328501025834; Accuracy: 0.9972833333333333
Epoch 80 completed. Loss - total: 1224.0443016290665 - average: 0.020400738360484443; Accuracy: 0.9972666666666666


(1224.0443016290665, 0.9972666666666666)