# Training Models on Classical Datasets

In this notebook, the classes and utils from the `loop` package are used to train simple models on various "classical" computer vision datasets. It is intended to be used as a benchmark showing the quality of the training loop.

## Imports

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
from functools import partial
import sys
from pathlib import Path
import warnings

try:
    old_path
except NameError:
    old_path = sys.path.copy()
    sys.path = [Path.cwd().parent.as_posix()] + old_path
    
warnings.simplefilter('ignore', category=Warning)

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import torch
from torch import nn
from torch import optim
from torch.nn import functional as F
from torchvision import models
from torchvision import transforms as T
from torchvision.datasets import MNIST, CIFAR10
from torchvision.models import resnet18, resnet34

from loop import find_lr, train, train_classifier, make_phases
from loop import callbacks as C
from loop import plot
from loop.config import defaults
from loop.optimizers.adamw import AdamW
from loop.schedule import OneCycleSchedule
from loop.torch_helpers.modules import tiny_classifier, FineTunedModel
from loop.torch_helpers.utils import freeze_status

In [4]:
def data_path(name, root=Path('~/data')):
    return root.expanduser()/name

In [5]:
defaults.device = torch.device('cuda:1')
mnist_stats = ([0.15]*3, [0.15]*3)
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

## MNIST

The first benchmark runs the training loop against the ubiquitous MNIST dataset. Here is a [leaderboard](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html) with the best results on this dataset. The first entries look like this:

|Error|Method|Venue|
|-----|------|-----|
|0.21%|Regularization of Neural Networks using DropConnect|ICML 2013|
|0.23%|Multi-column Deep Neural Networks for Image Classification|CVPR 2012|
|0.23%|APAC: Augmented PAttern Classification with Neural Networks|arXiv 2015|
|0.24%|Batch-normalized Maxout Network in Network|arXiv 2015|

In other words, the best published results correspond to roughly `99.8%` accuracy (error rates between 0.21% and 0.24%). Let's see if we can get close to these values.
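
The accuracy figures follow directly from the error column of the table. A minimal sketch of that arithmetic, using only the four error rates listed above:

```python
# Error rates (in percent) copied from the leaderboard table above.
errors = [0.21, 0.23, 0.23, 0.24]

# Accuracy is simply the complement of the error rate.
accuracies = [round(100.0 - e, 2) for e in errors]
print(accuracies)
```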

We're going to use the `torchvision` package to simplify data loading and preparation. Next, we create the training phases to track metrics and wrap the datasets into data loaders. Finally, we create a list of callbacks and initialize the optimizer.


In [5]:
root = data_path('mnist')

train_ds = MNIST(
    root, train=True, download=True,
    transform=T.Compose([T.Resize(32),
                         T.RandomAffine(5, translate=(0.05, 0.05), scale=(0.9, 1.2)),
                         T.ToTensor(), 
                         T.Normalize(*mnist_stats)]))

valid_ds = MNIST(
    root, train=False, download=True,
    transform=T.Compose([T.Resize(32),
                         T.ToTensor(), 
                         T.Normalize(*mnist_stats)]))

phases = make_phases(train_ds, valid_ds, batch_size=1024, num_workers=(12, 4))

cb = C.CallbacksGroup([
    C.Accuracy(),
    C.RollingLoss(),
    C.History(),
    C.MemoryUsage(),
    C.StreamLogger(),
    C.ProgressBar(),
    C.Scheduler(
        OneCycleSchedule(
            t=len(phases[0].loader) * 3,
            linear_pct=0.3,
            eta_max=1.0,
            div_factor=25
        ),
        mode='batch',
        params_conf=[
            {'name': 'lr'},
            {'name': 'weight_decay', 'inverse': True}
        ]
    )
])

model = tiny_classifier(n_channels=1, n_out=10)
opt = AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)

In [6]:
train(model, opt, phases, cb, epochs=3)

Epoch:    1 | train_loss=0.9978, train_accuracy=0.6029, valid_loss=0.1254, valid_accuracy=0.9617
Epoch:    2 | train_loss=0.5518, train_accuracy=0.8689, valid_loss=0.0949, valid_accuracy=0.9788
Epoch:    3 | train_loss=0.4207, train_accuracy=0.8946, valid_loss=0.1034, valid_accuracy=0.9765




We're getting more than `97%` accuracy after only 3 epochs, training the network from scratch with a very simple architecture. This suggests that our training loop works correctly.
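
For reference, the schedule configured in the callbacks above can be pictured as a linear warm-up followed by a decay. The sketch below is only an illustration under stated assumptions: the function `one_cycle` is hypothetical, the cosine-shaped decay is a guess, and the actual `OneCycleSchedule` in `loop` may differ. How `loop` combines this value with the optimizer's base `lr` is not shown here. The parameter names `linear_pct`, `eta_max`, and `div_factor` are taken from the cell above.

```python
import math


def one_cycle(step, total, eta_max=1.0, linear_pct=0.3, div_factor=25):
    """Hypothetical one-cycle value: linear warm-up, then cosine decay."""
    eta_min = eta_max / div_factor
    warmup = int(total * linear_pct)
    if step < warmup:
        # Linear ramp from eta_min up to eta_max.
        return eta_min + (eta_max - eta_min) * step / warmup
    # Cosine decay from eta_max back down to eta_min.
    progress = (step - warmup) / max(1, total - warmup)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))


# MNIST has 60,000 training images; with batch_size=1024 that is 59 batches.
batches_per_epoch = math.ceil(60_000 / 1024)
total_steps = batches_per_epoch * 3  # three epochs, as in `t` above

start = one_cycle(0, total_steps)
peak = one_cycle(int(total_steps * 0.3), total_steps)
end = one_cycle(total_steps, total_steps)
```

Under these assumptions the value starts at `eta_max / div_factor`, peaks at `eta_max` after 30% of the steps, and returns to the starting value at the end of training.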

## MNIST: Pretrained and Augmented

Could we improve our results with a pretrained model and some data augmentation? Let's give it a try.

But first, we need to prepare our samples for the pretrained model. Most pretrained networks are trained on the ImageNet dataset, which consists of color images, i.e., images with 3 channels in the first dimension. However, the MNIST dataset contains grayscale images with a single channel. To make them compatible with the models we're going to try, we need to convert these 1-channel images into 3-channel images. The most straightforward way to do so is to repeat the original channel three times. The `ExpandChannels` class does exactly this.

In [7]:
class ExpandChannels:
    
    def __init__(self, num_of_channels=3):
        self.nc = num_of_channels
    
    def __call__(self, x):
        return x.expand((self.nc,) + x.shape[1:])

Also, this time we don't create the callbacks explicitly. Instead, we use the convenience function `train_classifier`, which adds the required callbacks for us.

In [20]:
image_size = 128

train_ds = MNIST(
    root, train=True, download=True,
    transform=T.Compose([T.Resize(image_size),
                         T.Pad(4, padding_mode='reflect'),
                         T.RandomAffine(5, translate=(0.05, 0.05), scale=(0.9, 1.2)),
                         T.RandomResizedCrop(image_size, scale=(0.8, 1.2)),
                         T.ToTensor(), 
                         ExpandChannels(3),
                         T.Normalize(*imagenet_stats)]))

valid_ds = MNIST(
    root, train=False, download=True,
    transform=T.Compose([T.Resize(image_size),
                         T.ToTensor(), 
                         ExpandChannels(3),
                         T.Normalize(*imagenet_stats)]))

model = FineTunedModel(n_out=10, arch=resnet18)
opt = AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)

Notice that by default, the fine-tuned model "freezes" backbone layers, and trains custom top layers only:

In [21]:
freeze_status(model)

Layers status
--------------------------------------------------------------------------------
Conv2d                                                                  [frozen]
BatchNorm2d                                                             [frozen]
ReLU                                                                    [-     ]
MaxPool2d                                                               [-     ]
Conv2d                                                                  [frozen]
BatchNorm2d                                                             [frozen]
ReLU                                                                    [-     ]
Conv2d                                                                  [frozen]
BatchNorm2d                                                             [frozen]
Conv2d                                                                  [frozen]
BatchNorm2d                                                             [frozen]
ReLU          

This time we also change the scheduling function from the one-cycle policy to cosine annealing to see whether a different schedule gives a better result.
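
The cell below requests the `cos_anneal` schedule with `eta_min=0.1`. As a rough picture of what cosine annealing does, here is a minimal sketch of the standard formula. The function `cosine_anneal` is hypothetical, `eta_max=1.0` is an assumed starting value, and whether `loop` anneals over the whole run or restarts each epoch is not stated in this notebook.

```python
import math


def cosine_anneal(step, total, eta_min=0.1, eta_max=1.0):
    """Standard cosine annealing: eta_max at step 0, eta_min at step == total."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * step / total))


# 60,000 MNIST training images with batch_size=512 give 118 batches per epoch.
batches_per_epoch = math.ceil(60_000 / 512)
total_steps = batches_per_epoch * 3  # assumption: one annealing period over 3 epochs

first = cosine_anneal(0, total_steps)
middle = cosine_anneal(total_steps // 2, total_steps)
last = cosine_anneal(total_steps, total_steps)
```

The value decreases monotonically from `eta_max` to `eta_min` and passes through their midpoint halfway through the period.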

In [22]:
results = train_classifier(model, opt, (train_ds, valid_ds), 
                           epochs=3, batch_size=512,
                           schedule_params={
                               'schedule': 'cos_anneal', 
                               'schedule_config': {'eta_min': 0.1}},
                           num_workers=(12, 4))

Epoch:    1 | train_loss=0.4205, train_accuracy=0.8464, valid_loss=0.3566, valid_accuracy=0.9334
Epoch:    2 | train_loss=0.3206, train_accuracy=0.8944, valid_loss=0.2490, valid_accuracy=0.9440
Epoch:    3 | train_loss=0.2961, train_accuracy=0.9215, valid_loss=0.2726, valid_accuracy=0.9480




The results are promising but not as good as those of our simple model. Let's see what we can get if we unfreeze the backbone layers and train the model with differential learning rates.

In [None]:
pd.DataFrame(results['callbacks']['scheduler'].parameter_history('lr', 'weight_decay')).plot()

In [24]:
torch.save(model, 'frozen.model')

In [25]:
model = torch.load('frozen.model')

In [28]:
model.freeze_backbone(False)
opt = AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-4, 'weight_decay': 1e-5},
    {'params': model.top.parameters(), 'lr': 1e-3, 'weight_decay': 1e-3}
])
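
Per-group options such as `lr` and `weight_decay` override the optimizer-wide defaults, and options that a group does not specify fall back to those defaults. The sketch below imitates this with plain dictionaries. The helper `resolve_group` and the default values are hypothetical. It mirrors the `setdefault` behaviour of `torch.optim.Optimizer`, and it is an assumption that the `AdamW` from `loop` behaves the same way.

```python
def resolve_group(group, defaults):
    """Hypothetical helper: fill missing per-group options from the defaults."""
    resolved = dict(group)
    for name, value in defaults.items():
        resolved.setdefault(name, value)
    return resolved


defaults = {'lr': 1e-3, 'weight_decay': 1e-2}  # illustrative values only

backbone = resolve_group({'lr': 1e-4, 'weight_decay': 1e-5}, defaults)
top = resolve_group({'lr': 1e-3}, defaults)

# A misspelled key is kept as an extra entry, so the real option
# silently falls back to the default.
misspelled = resolve_group({'lr': 1e-4, 'weigth_decay': 1e-5}, defaults)
```

Because unknown keys raise no error in this scheme, it is worth double-checking option names in each parameter group.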

In [29]:
results = train_classifier(model, opt, (train_ds, valid_ds), 
                           epochs=5, batch_size=512,
                           schedule_params={'schedule': 'one_cycle'},
                           num_workers=(12, 4))

Epoch:    1 | train_loss=0.0541, train_accuracy=0.9795, valid_loss=0.0212, valid_accuracy=0.9931
Epoch:    2 | train_loss=0.0303, train_accuracy=0.9906, valid_loss=0.0262, valid_accuracy=0.9901
Epoch:    3 | train_loss=0.0253, train_accuracy=0.9931, valid_loss=0.0226, valid_accuracy=0.9932
Epoch:    4 | train_loss=0.0186, train_accuracy=0.9949, valid_loss=0.0196, valid_accuracy=0.9952
Epoch:    5 | train_loss=0.0155, train_accuracy=0.9956, valid_loss=0.0184, valid_accuracy=0.9942




We're getting even closer to the top entries of the leaderboard! It seems that our training loop and schedulers are heading in the right direction.

## CIFAR10

The CIFAR10 dataset is a more challenging task, though modern deep learning architectures can achieve quite good results on it as well. Therefore, it is also a good candidate for checking how well our code performs.

The process is the same as for MNIST:

1. Create datasets and data loaders
2. Initialize optimizer and schedulers
3. Start the training process 

In [9]:
root = data_path('cifar10')

image_size = 224

train_ds = CIFAR10(
    root, train=True, download=True,
    transform=T.Compose([T.Resize(image_size),
                         T.Pad(8, padding_mode='reflect'),
                         T.RandomAffine(5, translate=(0.05, 0.05), scale=(0.8, 1.2)),
                         T.RandomResizedCrop(image_size, scale=(0.8, 1.1)),
                         T.RandomHorizontalFlip(),
                         T.ToTensor(), 
                         T.Normalize(*imagenet_stats)]))

valid_ds = CIFAR10(
    root, train=False, download=True,
    transform=T.Compose([T.Resize(image_size),
                         T.ToTensor(), 
                         T.Normalize(*imagenet_stats)]))

model = FineTunedModel(n_out=10, arch=resnet34)
opt = AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)

Files already downloaded and verified
Files already downloaded and verified


In [10]:
lrs, losses = find_lr(model, opt, train_ds, batch_size=1024, loss_fn=F.cross_entropy)

In [None]:
plot.lr_loss_curve(lrs, losses, zoom=(1e-7, 1e-2))
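
The `zoom` argument presumably restricts the plotted curve to learning rates inside the given interval. A minimal sketch of such a filter follows. The helper `zoom_curve` is hypothetical and not part of the `loop` API, and the sample values are illustrative rather than output of `find_lr`.

```python
def zoom_curve(lrs, losses, zoom):
    """Hypothetical helper: keep (lr, loss) pairs with min_lr <= lr <= max_lr."""
    min_lr, max_lr = zoom
    kept = [(x, y) for x, y in zip(lrs, losses) if min_lr <= x <= max_lr]
    if not kept:
        # Nothing falls inside the range, so there is nothing to plot.
        return [], []
    xs, ys = zip(*kept)
    return list(xs), list(ys)


# Illustrative values only.
lrs = [1e-7, 1e-5, 1e-3, 1e-1]
losses = [2.30, 2.25, 1.90, 4.10]

xs, ys = zoom_curve(lrs, losses, zoom=(1e-7, 1e-2))
```

If the recorded learning rates are spaced coarsely, a narrow zoom range can leave very few points, so it helps to check how many pairs survive the filter before plotting.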


In [18]:
results = train_classifier(model, opt, (train_ds, valid_ds), 
                           epochs=10, batch_size=512,
                           schedule_params={
                               'schedule': 'one_cycle', 
                               'schedule_config': {'eta_min': 0.1}},
                           num_workers=(12, 4))

Epoch:    1 | train_loss=0.9926, train_accuracy=0.6384, valid_loss=0.8086, valid_accuracy=0.7701
Epoch:    2 | train_loss=0.8926, train_accuracy=0.7034, valid_loss=0.7647, valid_accuracy=0.7501
Epoch:    3 | train_loss=0.8665, train_accuracy=0.7110, valid_loss=0.7338, valid_accuracy=0.7580
Epoch:    4 | train_loss=0.8440, train_accuracy=0.7149, valid_loss=0.7053, valid_accuracy=0.7795
Epoch:    5 | train_loss=0.8387, train_accuracy=0.7183, valid_loss=0.7036, valid_accuracy=0.7830
Epoch:    6 | train_loss=0.8407, train_accuracy=0.7171, valid_loss=0.7718, valid_accuracy=0.7303


Exception ignored in: <function Image.__del__ at 0x7f25ed365378>
Traceback (most recent call last):
  File "/home/ck/anaconda3/envs/fastai/lib/python3.7/site-packages/PIL/Image.py", line 601, in __del__
    def __del__(self):
KeyboardInterrupt


KeyboardInterrupt: 