<a href="https://colab.research.google.com/github/davidcpage/Imagenette-experiments/blob/master/Reducing_variance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the [previous notebook](https://github.com/davidcpage/Imagenette-experiments/blob/master/Imagenette_v2.ipynb), we demonstrated Imagenette training of an xresnet model based on https://github.com/lessw2020/Ranger-Mish-ImageWoof-5 using both the fastai and fastai2 codebases. We implemented a faster dataloader using Nvidia DALI and made some changes to the fastai2 model so that it agrees with the v1 version from the repo.

In today's notebook, we are going to focus on reducing the variance of the validation accuracy to make it easier to compare training setups. The baseline 5 epoch xresnet18 training from last time achieves a mean Imagenette validation accuracy of around 88.3% with a std dev of about 0.7%. Our plan is to move some examples from the training set to make a larger validation set and to experiment with a smoothed version of the 0-1 accuracy metric.

### Setup

Install fastai2 and DALI. You may need to restart afterwards.

In [0]:
!python -m pip install typeguard
!python -m pip install --upgrade pillow fastprogress
!python -m pip install git+https://github.com/fastai/fastai2

!python -m pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/10.0 nvidia-dali

RANGER = 'https://raw.githubusercontent.com/lessw2020/Ranger-Mish-ImageWoof-5/master/ranger.py'
MXRESNET = 'https://raw.githubusercontent.com/lessw2020/Ranger-Mish-ImageWoof-5/master/mxresnet.py'
UTILS = 'https://raw.githubusercontent.com/davidcpage/Imagenette-experiments/master/utils.py'

!wget $RANGER -O ranger.py
!wget $MXRESNET -O mxresnet.py
!wget $UTILS -O utils.py

### Main

Imports, device setup and dataset download:

In [0]:
from functools import partial
import numpy as np
import torch
import torch.nn as nn
import fastai, fastai.vision
import fastai2, fastai2.callback.all

import ranger

data_dir = fastai.datasets.untar_data(fastai.datasets.URLs.IMAGENETTE_320)
device = torch.device(torch.cuda.current_device())

torch.backends.cudnn.benchmark = True

We have moved the main functionality implemented last time into a script so that we can reuse it here:

In [0]:
from utils import DALIDataLoader, Chain, MockV1DataBunch, imagenet_train_graph, imagenet_valid_graph, fit_flat_cos

Let's use this to build fastai and fastai2 compatible (DALI) dataloaders:

In [0]:
size = 128
bs = 64
random_aspect_ratio = (3/4, 4/3)
random_area = (0.35, 1.)
val_xtra_size = 32

train_dl = lambda folder, bs, seed=-1: (
        DALIDataLoader(imagenet_train_graph(folder, size, random_aspect_ratio, random_area), bs, drop_last=True, device=device, seed=seed))
valid_dl = lambda folder, bs: Chain(
        DALIDataLoader(imagenet_valid_graph(folder, size, val_xtra_size), bs, drop_last=False, device=device),
        DALIDataLoader(imagenet_valid_graph(folder, size, val_xtra_size, mirror=1), bs, drop_last=False, device=device),
    )

data_v1 = lambda data_dir=data_dir, bs=bs: MockV1DataBunch(train_dl(data_dir/'train', bs), valid_dl(data_dir/'val', bs))
data_v2 = lambda data_dir=data_dir, bs=bs: fastai2.basics.DataBunch(train_dl(data_dir/'train', bs), valid_dl(data_dir/'val', bs))

Let's recap by comparing a simple training run using the v1 and v2 codebases. The v2 version is basically a repeat of what we had at the end of last time, and we expect that there are still differences from v1. Since we tied out the model between codebases, we are going to use the more flexible v2 model from now on, even for v1 training. For what it's worth, we've also implemented a slightly faster version of the Mish activation function (compared to the v1 version and the jitted version in v2).

In [0]:
from utils import xresnet18, MishJit
model = partial(xresnet18, c_out=10, sa=1, sym=0, act_cls=MishJit)

v1 training:

In [5]:
learner_v1 = partial(
    fastai.basic_train.Learner, wd=1e-2, bn_wd=False, true_wd=True,
    opt_func=partial(ranger.Ranger, betas=(0.95, 0.99), eps=1e-6),
    metrics=(fastai.metrics.accuracy,),
    loss_func=fastai.layers.LabelSmoothingCrossEntropy())

learn = fit_flat_cos(learner_v1(data_v1(), model()).to_fp16(), n_epoch=5, lr=4e-3, pct_start=0.72)

epoch,train_loss,valid_loss,accuracy,time
0,1.412312,1.486379,0.605,00:18
1,1.196931,1.027858,0.81,00:17
2,1.106371,0.979379,0.807,00:17
3,1.016692,0.896012,0.856,00:17
4,0.906713,0.838668,0.884,00:17


v2 training:

In [6]:
RangerWrapper = lambda *args, **kwargs: fastai2.basics.OptimWrapper(ranger.Ranger(*args, **kwargs))
learner_v2 = partial(
    fastai2.basics.Learner, lr=4e-3,
    opt_func=partial(RangerWrapper, betas=(0.95, 0.99), eps=1e-6),
    metrics=(fastai2.metrics.accuracy,),
    loss_func=fastai2.basics.LabelSmoothingCrossEntropy())

learn = learner_v2(data_v2(), model()).to_fp16().fit_flat_cos(n_epoch=5, wd=1e-2, pct_start=0.72)

epoch,train_loss,valid_loss,accuracy,time
0,1.422901,1.150089,0.758,00:18
1,1.191645,1.050608,0.781,00:19
2,1.095837,1.194106,0.72,00:19
3,1.024172,1.012769,0.819,00:19
4,0.912284,0.84172,0.881,00:19


Note that training takes a couple of seconds more per epoch on fastai2, which needs investigating at some point.

### Dataset

Here is a utility to make a new dataset split with 250 validation examples per class instead of the original 50. The extra 200 examples per class are moved over from the training set, so the new training set has 10894 examples, down from the original 12894.

In [0]:
from pathlib import Path
import shutil
import os

listdir = lambda dir: sorted(os.listdir(dir)) #deterministic ordering...

def make_new_split(data_dir, new_data_dir, val_examples_per_class=250, seed=1234):
    #keep the original 50 validation examples in each class
    #and move over 'val_examples_per_class'-50 more from the train set
    new_data_dir = Path(new_data_dir)
    num_move = val_examples_per_class - 50  
    assert num_move > 0 
    rng = np.random.RandomState(seed)

    shutil.copytree(data_dir, new_data_dir)
    train_dir, val_dir = Path(new_data_dir)/'train', Path(new_data_dir)/'val'
    for k in listdir(train_dir):
        files = listdir(train_dir/k)
        for f in rng.choice(files, num_move, replace=False):
            shutil.move(str(train_dir/k/f), val_dir/k)

In [0]:
new_data_dir = Path(str(data_dir)+'-new')
if not new_data_dir.exists(): 
    make_new_split(data_dir, new_data_dir)
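As a quick sanity check on the new split, one could count the files in each class folder. The sketch below is an illustration rather than part of the original notebook: `count_per_class` and `summarize_split` are hypothetical helpers, and they assume the `train/<class>/` and `val/<class>/` folder layout that `make_new_split` works with.

```python
from pathlib import Path

def count_per_class(split_dir):
    # Hypothetical helper: map each class folder name to its number of files.
    split_dir = Path(split_dir)
    return {d.name: sum(1 for f in d.iterdir() if f.is_file())
            for d in sorted(split_dir.iterdir()) if d.is_dir()}

def summarize_split(data_dir):
    # Total number of files under the 'train' and 'val' subfolders.
    data_dir = Path(data_dir)
    return {split: sum(count_per_class(data_dir/split).values())
            for split in ('train', 'val')}
```

If the original validation folders each hold 50 images, as stated above, `count_per_class(new_data_dir/'val')` should report 250 for every class, and `summarize_split(new_data_dir)` should show the training total reduced by 2000.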

In [9]:
learn = fit_flat_cos(learner_v1(data_v1(new_data_dir), model()).to_fp16(), n_epoch=5, lr=4e-3, pct_start=0.72)

epoch,train_loss,valid_loss,accuracy,time
0,1.517955,1.452,0.6298,00:17
1,1.269426,1.308483,0.6724,00:17
2,1.153134,1.090687,0.7592,00:17
3,1.087146,1.02977,0.7932,00:17
4,0.958357,0.953012,0.8252,00:17


Validation accuracy is reduced by roughly 5 percentage points, which is not surprising: the new training set is smaller, and the new validation set may differ in difficulty just by chance.
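To get a rough feel for how much a measured accuracy can move just by chance, one can treat each validation example as an independent coin flip whose success probability is the true accuracy. This is a simplifying assumption, not a model of the actual run-to-run noise (which also depends on training randomness, and the validation loader above chains a mirrored copy of each image), but it shows the expected 1/sqrt(n) scaling. `binomial_std_err` is a hypothetical helper introduced for illustration, and 85% is an arbitrary example accuracy.

```python
import math

def binomial_std_err(acc, n):
    # Standard error of an accuracy estimate from n independent examples,
    # under the (assumed) binomial model: sqrt(p * (1 - p) / n).
    return math.sqrt(acc * (1.0 - acc) / n)

# 10 classes x 50 images (original split) vs 10 classes x 250 images (new split)
for n in (500, 2500):
    print(n, f"{100 * binomial_std_err(0.85, n):.2f}% std err at 85% accuracy")
```

Under this model, a five times larger validation set shrinks the sampling error by a factor of sqrt(5), or about 2.2, whatever the underlying accuracy.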

### Smoothed accuracy

Before we launch a set of training runs to test the variance of validation accuracy for the new dataset, let's try one more thing. Validation noise comes largely from examples on which the model is not sure which class to predict. Small changes in output class probabilities can lead to a change of predicted (argmax) class and thus of model accuracy. The situation is improved by a larger validation set, which averages over the noise, but we can potentially improve things further by smoothing the decision boundary using a soft(arg)max.

This is a typical bias/variance tradeoff: we can reduce the variance of the accuracy metric at the expense of introducing a controlled amount of bias. It is likely that the bias more or less cancels out when we compare similar training settings, in which case the reduction in variance would be a net win. In any case, it is cheap to add such a smoothed accuracy as an additional validation metric, and we can decide later whether we want to use it:

In [0]:
def smoothed_acc(logits, targets, beta=3.): #replace argmax with soft(arg)max
    #mean softmax probability (logits scaled by beta) assigned to the target class
    probs = nn.functional.softmax(logits*beta, dim=-1)
    return torch.mean(probs[torch.arange(targets.size(0), device=targets.device), targets])

metrics = [fastai.metrics.accuracy, smoothed_acc]
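To see what `beta` does, the sketch below evaluates a standalone copy of the metric on a few made-up logits. The name `smoothed_acc_demo` and the toy tensors are illustrative assumptions, not part of the notebook. With `beta=1` the metric is simply the mean softmax probability of the correct class; as `beta` grows it approaches the ordinary 0-1 accuracy, provided there are no exact ties in the logits.

```python
import torch
import torch.nn.functional as F

def smoothed_acc_demo(logits, targets, beta=3.):
    # Mean softmax probability assigned to the target class, with the
    # logits scaled by beta before the softmax.
    probs = F.softmax(logits.float() * beta, dim=-1)
    return probs.gather(1, targets.unsqueeze(1)).mean()

# Toy batch (made up): confidently right, narrowly right, narrowly wrong.
logits = torch.tensor([[4.0, 0.0, 0.0],
                       [1.0, 0.8, 0.0],
                       [0.5, 0.7, 0.0]])
targets = torch.tensor([0, 0, 0])

hard_acc = (logits.argmax(dim=-1) == targets).float().mean()
for beta in (1., 3., 100.):
    print(beta, smoothed_acc_demo(logits, targets, beta).item(), hard_acc.item())
```

Here the hard accuracy is 2/3. The two narrow cases are exactly the kind of example that flips the argmax under small perturbations; at moderate `beta` each contributes a value near 0.5 instead of a full 0 or 1, which is where both the variance reduction and the bias come from.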

In [11]:
learn = fit_flat_cos(learner_v1(data_v1(new_data_dir), model(), metrics=metrics).to_fp16(), n_epoch=5, lr=4e-3, pct_start=0.72)

epoch,train_loss,valid_loss,accuracy,smoothed_acc,time
0,1.489646,1.520813,0.5962,0.578618,00:17
1,1.255452,1.317987,0.6654,0.644284,00:16
2,1.162168,1.093264,0.7648,0.742068,00:17
3,1.081428,1.028166,0.7916,0.777946,00:17
4,0.954942,0.931859,0.8358,0.822356,00:17


In this run, the smoothed accuracy metric ends up roughly 1.5 percentage points lower than the true 0-1 accuracy. Let's launch a set of training runs to measure the variance of the validation accuracy for our new dataset and to see if the smoothed accuracy metric improves things further:


In [0]:
results = (fit_flat_cos(learner_v1(data_v1(new_data_dir), model(), metrics=metrics).to_fp16(), n_epoch=5, lr=4e-3, pct_start=0.72) for _ in range(10))
acc, acc_smoothed = zip(*[[x.item() for x in learn.recorder.metrics[-1]] for learn in results])
np.mean(acc), np.std(acc), np.mean(acc_smoothed), np.std(acc_smoothed)

The new dataset has a validation accuracy of ~83.5% with a std dev of ~0.35%, about half the ~0.7% std dev of the original baseline, whilst the smoothed accuracy is ~82.1% with a std dev of ~0.25%. Note that the smoothed accuracy is just an alternative metric and doesn't affect training. It remains to be seen whether the bias that this introduces is consistent across different training settings and/or whether a different value of beta would be more appropriate.
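One detail worth keeping in mind when reading std devs computed this way: `np.std` defaults to the population formula (`ddof=0`). With only 10 runs, the sample estimate (`ddof=1`) is slightly larger, and the standard error of the mean indicates how well the mean itself is pinned down. A minimal sketch follows; `summarize` is a hypothetical helper, and the accuracies are placeholder values for illustration only, not the results of the runs above.

```python
import numpy as np

def summarize(values):
    # Mean, population std (np.std default), sample std and standard error of the mean.
    values = np.asarray(values, dtype=float)
    n = len(values)
    sample_std = values.std(ddof=1)
    return {'mean': values.mean(), 'std_pop': values.std(),
            'std_sample': sample_std, 'sem': sample_std / np.sqrt(n)}

# Placeholder values, NOT measured results.
fake_acc = [0.83, 0.84, 0.835, 0.838, 0.831, 0.836, 0.829, 0.84, 0.833, 0.837]
print(summarize(fake_acc))
```

For n runs the two std estimates differ by a factor of sqrt(n/(n-1)), which is about 1.05 for n=10, so the choice does not change the qualitative comparison between metrics.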