# Deep Learning Homework \#04 (*normal MNIST*)
### Deep Learning Course $\in$ DSSC @ UniTS (Spring 2021)  

#### Submitted by [Emanuele Ballarin](mailto:emanuele@ballarin.cc)  

### Preliminaries:

#### Imports:

We start off by importing all the libraries, modules, classes and functions we are going to use *today*...

In [1]:
# Type hints
from torch import Tensor

# Just to force-load MKL (if available)
import numpy as np

# Mathematical functions
from math import sqrt as msqrt

# Neural networks and friends
import torch as th
from torch.nn import Sequential, BatchNorm1d, Linear, LogSoftmax, Dropout
import torch.nn.functional as F

# Optimization and scheduling
from torch.optim.lr_scheduler import StepLR, MultiStepLR

# Bespoke Modules / Functions / Optimizers
from ebtorch.logging import AverageMeter
from ebtorch.nn import Mish, mishlayer_init
from ebtorch.optim import Lookahead
from madgrad.madgrad import MADGRAD as MadGrad

# Model summarization
from torchinfo import summary

# Dataset handling for PyTorch
import os
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor, Normalize, Compose, Lambda

### Request 1:

Now that you have all the tools to train an MLP with high performance on MNIST, try reaching 0-loss (or 100% accuracy) on the training data (with a small epsilon, e.g. 99.99% training performance -- don't worry if you overfit!). The implementation is completely up to you. You just need to keep it an MLP without using fancy layers (e.g., keep the Linear layers, don't use `Conv1d` or something like this, don't use attention). You are free to use any LR scheduler or optimizer, any one of batchnorm/groupnorm, regularization methods... If you use something we haven't seen during lectures, please motivate your choice and explain (as briefly as possible) how it works.

#### Comment:

Nothing fancy here... just some DataSets/Loaders.

In [2]:
def spawn_mnist_loaders(
    data_root="datasets/",
    batch_size_train=256,
    batch_size_test=512,
    cuda_accel=False,
    **kwargs
):

    os.makedirs(data_root, exist_ok=True)

    transforms = Compose(
        [
            ToTensor(),
            Normalize((0.1307,), (0.3081,)),  # usual normalization constants for MNIST
            Lambda(lambda x: th.flatten(x)),
        ]
    )

    trainset = MNIST(data_root, train=True, transform=transforms, download=True)
    testset = MNIST(data_root, train=False, transform=transforms, download=True)

    cuda_args = {}
    if cuda_accel:
        cuda_args = {"num_workers": 1, "pin_memory": True}

    trainloader = DataLoader(
        trainset, batch_size=batch_size_train, shuffle=True, **cuda_args
    )
    testloader = DataLoader(
        testset, batch_size=batch_size_test, shuffle=False, **cuda_args
    )
    tontrloader = DataLoader(   # tontr == test on train
        trainset, batch_size=batch_size_test, shuffle=False, **cuda_args
    )

    return trainloader, testloader, tontrloader

#### Comment:

Nothing fancy here, too. A standard training/testing loop.

In [3]:
train_acc_avgmeter = AverageMeter("Training Loss")

def train_epoch(
    model, device, train_loader, loss_fn, optimizer, epoch, print_every_nep, inner_scheduler=None, quiet=False,
):
    train_acc_avgmeter.reset()
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        if inner_scheduler is not None:
            inner_scheduler.step()
        
        train_acc_avgmeter.update(loss.item())

        if not quiet and batch_idx % print_every_nep == 0:
            print(
                "Train Epoch: {} [{}/{} ({:.0f}%)]\tAvg. loss: {:.6f}".format(
                    epoch,
                    batch_idx * len(data),
                    len(train_loader.dataset),
                    100.0 * batch_idx / len(train_loader),
                    train_acc_avgmeter.avg
                )
            )


def test(model, device, test_loader, loss_fn, quiet=False):
    model.eval()
    test_loss = 0
    correct = 0
    with th.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += loss_fn(
                output, target, reduction="sum"
            ).item()  # sum up batch loss
            pred = output.argmax(
                dim=1, keepdim=True
            )  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    ltlds = len(test_loader.dataset)

    test_loss /= ltlds
    
    if not quiet:
        print(
            "Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)".format(
                test_loss,
                correct,
                ltlds,
                100.0 * correct / ltlds,
            )
        )
    
    return test_loss, correct / ltlds

#### Comment:

The *usual* specification of the device. Please note that the `MADGRAD` optimizer (used later on) requires a working *CUDA-capable* GPU and may not work otherwise. You may be interested in using `RAdam` instead.

In [4]:
device = th.device("cuda" if th.cuda.is_available() else "cpu")

#### Comment:

The choice of a $512$-element *batch size* is unusual. Here, the rationale may be summarized by noticing the peculiar relationship between a *large batch size* (both as a regularizer and as a specific training device) and the deeper nature of the learned features.  

Small batch sizes ($<128$, for the dataset of interest), on the one hand:

- Carry a considerable amount of *data sampling noise*, which smooths the loss landscape and eases optimization with non-accelerated methods (e.g. SGD);

- Carry a higher *signal-to-data* ratio, due to limited feature and gradient competition inside the batch;

- Usually produce better generalization on both *testing* and *unseen* data. This may also be true for *synthetic whole-manifold* and *noise-corrupted* data;

- Produce more unstable *training-set* loss and accuracy results;

- Require longer, or otherwise more heavily regularized, training for the same level of *training-set accuracy*;

- Put the main training focus on learning *useful data-point features*, as opposed to pure decision boundaries.


On the other hand, larger batch sizes ($>128$, for the dataset of interest):

- Carry a reduced amount of *data sampling noise*, making it easier for an optimizer to find a single direction that decreases (even if only by a small amount) the whole-dataset loss. This regime may be particularly appealing for *accelerated* descent methods;

- May carry a reduced *signal-to-data* ratio, also due to gradient competition inside the batch, thus requiring either more finely-tuned *learning rate* and *loss function* choices or specifically-crafted, robust-to-initialization optimizers;

- May easily overfit the *training set* without significant generalization, especially without additional regularization devices;

- Usually produce an approximation of *constantly-decreasing test-set loss* seen in *full-dataset GD*;

- May speed up training, with no guarantee, however, on how low the loss will be at convergence;

- Put the main training focus on differences among data-points and decision boundaries, sometimes regardless of the most reasonable similarity-based representation.


As we can see, for our specific goal, the latter may be a favourable regime, provided that optimization is carried out thoughtfully.
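
The claim about *data sampling noise* can be illustrated with a small, self-contained experiment. The sketch below does not use MNIST or the network of this notebook: it assumes a synthetic one-dimensional regression problem (the data, the helper names `minibatch_gradient` and `gradient_std`, and the batch sizes $16$ and $512$ are all illustrative choices) and simply measures how much the mini-batch gradient estimate fluctuates across random batches.

```python
import random
import statistics


def minibatch_gradient(data, w, batch_size, rng):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over one random mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum((w * x - y) * x for x, y in batch) / batch_size


def gradient_std(data, w, batch_size, n_draws=500, seed=0):
    """Standard deviation of the mini-batch gradient across random batches."""
    rng = random.Random(seed)
    grads = [minibatch_gradient(data, w, batch_size, rng) for _ in range(n_draws)]
    return statistics.stdev(grads)


# Synthetic data (an assumption of this sketch): y = 3x + Gaussian noise
_rng = random.Random(42)
data = []
for _ in range(4096):
    x = _rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + _rng.gauss(0.0, 1.0)))

std_small = gradient_std(data, w=0.0, batch_size=16)
std_large = gradient_std(data, w=0.0, batch_size=512)

print(f"Std of gradient estimate, batch size 16:  {std_small:.4f}")
print(f"Std of gradient estimate, batch size 512: {std_large:.4f}")
```

On such a toy problem the larger batch gives a visibly less noisy gradient estimate. Whether this translates into the generalization behaviour listed above is a separate, empirical question that the sketch does not address.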

In [5]:
# Hyperparameters & co.

minibatch_size_train: int = 512 # I know it's high; I just want a "little" more stability
minibatch_size_test: int = 512

nrepochs = 12   # This number is tuned to the minimum necessary for stable convergence to the result

lossfn = F.nll_loss

In [6]:
train_loader, test_loader, test_on_train_loader = spawn_mnist_loaders(
    batch_size_train=minibatch_size_train,
    batch_size_test=minibatch_size_test,
    cuda_accel=(device.type == "cuda"),  # compare the device type: a th.device is not a plain string
)

#### Comment:

The network is a good compromise between number of parameters, training performance, and a still *easily-steerable* parameter space.  

In particular, though only $3$-layered, the MLP exploits *hyperfeaturization* (the choice to include hidden layers larger than the input size; still not close to $(\# \text{inputs} \times \# \text{outputs})$) and presents in any case a relatively large (for the problem) number of parameters.  

**About the optimizer**:

- The `Lookahead` optimizer: a recent approach that applies *inner-loop optimization* to *neural network optimizers*. It keeps a copy of the network weights (the *slow weights*), lets an auxiliary optimizer update a working copy (the *fast weights*) $k$ times according to the loss-minimization criterion, and finally updates the *slow weights* *once*, with a step in the direction of the sum of the *fast-weight* updates. Such an optimization schedule is very robust to noise and to *ruggedness-induced jitter*.  

- The `MADGRAD` optimizer, used (with a perhaps overkill attitude) as the *inner-loop optimizer* of the above, is a new, *doubly-averaged* adaptive (as in *AdaGrad*), *explicitly-momentumized* (as in *SGD with momentum*) optimizer by Facebook AI Research. It may be seen (with much approximation) as an attempt to combine *Adam-like* convergence speed and adaptiveness (successfully!), *SGD-like* generalizability (requires care!), and a tunable *momentum* (it's part of the design).
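
The slow/fast weight mechanics described for `Lookahead` can be sketched in a few lines of plain Python. This is a minimal illustration, not the `ebtorch` implementation: the inner optimizer is plain gradient descent (a stand-in for `MADGRAD`), the parameter is a single scalar, and the function name `lookahead_minimize`, the interpolation factor `alpha=0.5` and the toy objective are assumptions made for the example.

```python
def lookahead_minimize(grad_fn, w0, inner_lr=0.1, k=4, alpha=0.5, outer_steps=50):
    """Minimal Lookahead wrapped around plain gradient descent (scalar parameter)."""
    slow = w0
    for _ in range(outer_steps):
        fast = slow  # fast weights start from the current slow weights
        for _ in range(k):
            # Inner optimizer step (plain GD here, standing in for MADGRAD)
            fast -= inner_lr * grad_fn(fast)
        # Single slow-weight update: move a fraction alpha along the
        # accumulated displacement of the k fast steps
        slow += alpha * (fast - slow)
    return slow


# Toy objective (illustrative only): f(w) = (w - 2)^2, with gradient 2 * (w - 2)
w_star = lookahead_minimize(lambda w: 2.0 * (w - 2.0), w0=-5.0)
print(f"Minimizer found: {w_star:.6f}")
```

Here `k=4` mirrors the `la_steps=4` setting used in this notebook; everything else is chosen only to keep the example short.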

In [7]:
model = Sequential(
    # -> Input is here <-

    # POST-INPUT BLOCK:
    Linear(in_features=28*28, out_features=1500, bias=True),    # Hyperfeaturize ~2*input
    Mish(),

    # HIDDEN BLOCK:
    BatchNorm1d(num_features=1500, affine=True),
    Linear(in_features=1500, out_features=500, bias=True),      # Compress ~0.64*input
    Mish(),

    # PRE-OUTPUT BLOCK:
    BatchNorm1d(num_features=500, affine=True),
    Linear(in_features=500, out_features=10, bias=True),        # To output
    LogSoftmax(dim=1)

    # -> Output is here <-
        ).to(device)

if device.type == "cpu":  # compare the device type: a th.device is not a plain string
    raise RuntimeError("The MADGRAD optimizer won't work without a GPU. You may want to use RAdam instead! ;)")

base_optimizer = MadGrad(model.parameters(), lr=0.00017)
optimizer      = Lookahead(base_optimizer, la_steps=4)
scheduler      = MultiStepLR(optimizer, milestones=[10, 11], gamma=0.4) # Just to dampen jitter in case perfect accuracy is reached
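
#### Comment:

To make the effect of the `MultiStepLR` settings above explicit, the following sketch computes the learning rate in force during each of the $12$ training epochs. It assumes, as in the training loop of this notebook, that `scheduler.step()` is called once at the end of every epoch; the helper `multistep_lr` is a hypothetical name that reproduces the scheduler's closed-form rule rather than calling PyTorch.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, n_scheduler_steps):
    """Learning rate after a given number of scheduler steps (closed form)."""
    n_decays = bisect_right(sorted(milestones), n_scheduler_steps)
    return base_lr * gamma ** n_decays


base_lr, milestones, gamma = 0.00017, [10, 11], 0.4

# Epoch e is trained after (e - 1) scheduler steps have been taken
lrs = {
    epoch: multistep_lr(base_lr, milestones, gamma, epoch - 1)
    for epoch in range(1, 13)
}

for epoch, lr in lrs.items():
    print(f"Epoch {epoch:2d}: lr = {lr:.3e}")
```

Under these assumptions, only the last two epochs run with a reduced learning rate (first scaled by $0.4$, then by $0.4^2 = 0.16$), which is consistent with the intent of dampening late-training jitter.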

#### Comment:

The *weight-initialization* scheme tries to spread weights evenly, in such a way as to fully exploit the three different *regimes* offered by the `Mish` activation function (asymptotically zero, negative, asymptotically linear), with a *fan-in/fan-out*-inspired approach (like the one in *Xavier* or *Kaiming* initialization).
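
The three regimes mentioned above can be checked numerically. The sketch below uses the standard definition $\text{Mish}(x) = x \tanh(\text{softplus}(x))$ in plain Python and, for comparison only, the *Xavier* (Glorot) uniform bound $\sqrt{6 / (\text{fan}_{in} + \text{fan}_{out})}$. The latter is a stand-in to show what a fan-in/fan-out rule looks like: it is not the rule implemented by `mishlayer_init`, whose internals are not reproduced here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def xavier_uniform_bound(fan_in, fan_out):
    """Half-width of the Xavier/Glorot uniform initialization interval."""
    return math.sqrt(6.0 / (fan_in + fan_out))


# The three regimes of Mish
print(f"mish(-20) = {mish(-20.0):.2e}   (asymptotically zero)")
print(f"mish(-1)  = {mish(-1.0):.4f}   (negative)")
print(f"mish(20)  = {mish(20.0):.4f}   (asymptotically linear)")

# Illustrative Xavier bounds for the three Linear layers of the model
for fan_in, fan_out in [(784, 1500), (1500, 500), (500, 10)]:
    bound = xavier_uniform_bound(fan_in, fan_out)
    print(f"{fan_in:4d} -> {fan_out:4d}: bound = {bound:.5f}")
```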

In [8]:
# Initialize weights and biases in the proper way ;)
for layr in model:
    mishlayer_init(layr)

In [9]:
summary(model)

Layer (type:depth-idx)                   Param #
├─Linear: 1-1                            1,177,500
├─Mish: 1-2                              --
├─BatchNorm1d: 1-3                       3,000
├─Linear: 1-4                            750,500
├─Mish: 1-5                              --
├─BatchNorm1d: 1-6                       1,000
├─Linear: 1-7                            5,010
├─LogSoftmax: 1-8                        --
Total params: 1,937,010
Trainable params: 1,937,010
Non-trainable params: 0
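
#### Comment:

The parameter counts reported by `summary` can be double-checked by hand: a `Linear` layer with bias holds $n_{in} \cdot n_{out} + n_{out}$ parameters, and an affine `BatchNorm1d` holds one scale and one shift per feature. The helper names below are illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift of an affine batch-normalization layer."""
    return 2 * n_features


counts = [
    linear_params(28 * 28, 1500),  # Linear: 1-1
    batchnorm_params(1500),        # BatchNorm1d: 1-3
    linear_params(1500, 500),      # Linear: 1-4
    batchnorm_params(500),         # BatchNorm1d: 1-6
    linear_params(500, 10),        # Linear: 1-7
]
total = sum(counts)

print(counts)
print(f"Total trainable parameters: {total:,}")
```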

In [10]:
for epoch in range(1, nrepochs + 1):

    # Training
    print("TRAINING...")
    train_epoch(
        model, device, train_loader, lossfn, optimizer, epoch, print_every_nep=15, inner_scheduler=None, quiet=False,
    )

    # Tweaks for the Lookahead optimizer (before testing)
    if isinstance(optimizer, Lookahead):
        optimizer._backup_and_load_cache()  # I.e.: use slow weights for testing -->

    # Testing: on training and testing set
    print("\nON TRAINING SET:")
    _ = test(model, device, test_on_train_loader, lossfn, quiet=False)
    print("\nON TEST SET:")
    _ = test(model, device, test_loader, lossfn, quiet=False)
    print("\n\n")

    # Tweaks for the Lookahead optimizer (after testing)
    if isinstance(optimizer, Lookahead):
        optimizer._clear_and_load_backup()  # <-- I.e.: use slow weights for testing
    
    # Scheduling step (outer)
    scheduler.step()

TRAINING...

ON TRAINING SET:
Average loss: 0.0841, Accuracy: 58680/60000 (98%)

ON TEST SET:
Average loss: 0.1060, Accuracy: 9687/10000 (97%)



TRAINING...

ON TRAINING SET:
Average loss: 0.0367, Accuracy: 59567/60000 (99%)

ON TEST SET:
Average loss: 0.0756, Accuracy: 9774/10000 (98%)



TRAINING...

ON TRAINING SET:
Average loss: 0.0179, Accuracy: 59859/60000 (100%)

ON TEST SET:
Average loss: 0.0675, Accuracy: 9783/10000 (98%)



TRAINING...

ON TRAINING SET:
Average loss: 0.0091, Accuracy: 59961/60000 (100%)

ON TEST SET:
Average loss: 0.0610, Accuracy: 9807/10000 (98%)



TRAINING...

ON TRAINING SET:
Average loss: 0.0058, Accuracy: 59965/60000 (100%)

ON TEST SET:
Average loss: 0.0612, Accuracy: 9795/10000 (98%)



TRAINING...

ON TRAINING SET:
Average loss: 0.0029, Accuracy: 59989/60000 (100%)

ON TEST SET:
Average loss: 0.0588, Accuracy: 9821/10000 (98%)



TRAINING...

ON TRAINING SET:
Average loss: 0.0019, Accuracy: 59996/60000 (100%)

ON TEST SET:
Average loss: 0.0569, Acc