# Bag of Tricks
> Refinements and improvements to CNNs for image classification

[Bag of Tricks for Image Classification with Convolutional Neural Networks](https://arxiv.org/abs/1812.01187)

Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li

    Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the literature, however, most refinements are either briefly mentioned as implementation details or only visible in source code. In this paper, we will examine a collection of such refinements and empirically evaluate their impact on the final model accuracy through ablation study. We will show that, by combining these refinements together, we are able to improve various CNN models significantly. For example, we raise ResNet-50's top-1 validation accuracy from 75.3% to 79.29% on ImageNet. We will also demonstrate that improvement on image classification accuracy leads to better transfer learning performance in other application domains such as object detection and semantic segmentation.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
#export
from exp.nb_09 import *

## Setup

### Data to DataBunch

In [3]:
bs = 64

In [4]:
path = datasets.untar_data(datasets.URLs.IMAGENETTE_160) # downloads and returns a path to folder
tfms = [make_rgb, ResizeFixed(128), to_byte_tensor, to_float_tensor] # transforms to be applied to images

il = ImageList.from_files(path, tfms=tfms) # Imagelist from files
sd = SplitData.split_by_func(il, partial(grandparent_splitter, valid_name="val")) # Splitdata by function
ll = label_by_func(sd, parent_labeler, proc_y=CategoryProcesser()) # label the data by parent folder
data = ll.to_databunch(bs, c_in=3, c_out=10)

### Callbacks

In [5]:
callbacks = [partial(AvgStatsCallback, accuracy), 
             CudaCallback,
             partial(BatchTransformXCallback, norm_imagenette)]

## Model Arch

Instead of starting with a single large 7x7 conv, we use three 3x3 convs. Rather than jumping directly from 3 channels to 64, we increase the channel count progressively:

In [6]:
nfs = [64,64,128,256]

In [7]:
#export
def prev_pow_2(x): return 2**math.floor(math.log2(x))

In [8]:
2**math.floor(math.log2(3))

2

In [9]:
#export
def get_cnn_layers(data, nfs, layer, **kwargs):
    
    def f(ni, nf, stride=2): 
        return layer(ni, nf, ks=3, stride=stride, **kwargs)
    
    l1 = data.c_in
    l2 = prev_pow_2(l1*3*3)
    
    layers = [f(l1,   l2,   stride=1),
              f(l2,   l2*2, stride=2),
              f(l2*2, l2*4, stride=2)]
    nfs = [l2*4] + nfs
    
    layers += [f(nfs[i], nfs[i+1]) for i in range(len(nfs)-1)]
    layers += [nn.AdaptiveAvgPool2d(1), Lambda(flatten), nn.Linear(nfs[-1], data.c_out)]
    
    return layers
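
To make the stem widths concrete, here is a small standalone sketch of the channel arithmetic used in `get_cnn_layers`. The helper `stem_channels` is a hypothetical name introduced only for illustration; it repeats the `prev_pow_2(c_in*3*3)` rule from the cell above and nothing else.

```python
import math

def prev_pow_2(x):
    # Largest power of two that is <= x (same definition as in the notebook)
    return 2 ** math.floor(math.log2(x))

def stem_channels(c_in, ks=3):
    # Hypothetical helper: channel counts entering and leaving the three stem convs.
    # A ks x ks kernel over c_in channels sees c_in*ks*ks inputs per output pixel,
    # and the first conv's width is rounded down to a power of two.
    l2 = prev_pow_2(c_in * ks * ks)
    return [c_in, l2, l2 * 2, l2 * 4]

print(stem_channels(3))  # [3, 16, 32, 64]
```

For RGB input this gives 3 → 16 → 32 → 64, so the stem reaches 64 channels in three steps instead of one.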

In [10]:
#export
def get_cnn_model(data, nfs, layer, **kwargs):
    return nn.Sequential(*get_cnn_layers(data, nfs, layer, **kwargs))

def get_learn_run(data, nfs, layer, lr, cbs=None, opt_func=None, uniform=False, **kwargs):
    model = get_cnn_model(data, nfs, layer, **kwargs)
    init_cnn(model, uniform=uniform)
    return get_runner(model, data, lr=lr, cbs=cbs, opt_func=opt_func)

In [11]:
sched = combine_scheds([0.3, 0.7], cos_1cycle_anneal(0.1, 0.3, 0.05))
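
As a rough illustration of what this schedule looks like, the sketch below builds a two-phase cosine schedule in plain Python. It assumes that `cos_1cycle_anneal(0.1, 0.3, 0.05)` means "warm up from 0.1 to 0.3, then anneal to 0.05" and that `combine_scheds([0.3, 0.7], ...)` gives the first phase 30% of training. `sched_cos` and `two_phase_cos` are stand-in names written for this example, not the helpers imported from `exp.nb_09`.

```python
import math

def sched_cos(start, end, pos):
    # Cosine interpolation: returns start at pos=0 and end at pos=1
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def two_phase_cos(pcts, start, high, end):
    # Hypothetical stand-in for the combined schedule (an assumption, see above)
    assert abs(sum(pcts) - 1.0) < 1e-9
    split = pcts[0]
    def _inner(pos):
        # pos is the fraction of training completed, in [0, 1]
        if pos < split:
            return sched_cos(start, high, pos / split)
        return sched_cos(high, end, (pos - split) / (1 - split))
    return _inner

sched_sketch = two_phase_cos([0.3, 0.7], 0.1, 0.3, 0.05)
for pos in (0.0, 0.15, 0.3, 0.65, 1.0):
    print(f"{pos:.2f} -> {sched_sketch(pos):.4f}")
```

Under these assumptions the learning rate starts at 0.1, peaks at 0.3 after 30% of training, and ends at 0.05.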

In [12]:
learn, run = get_learn_run(data, nfs, conv_layer, lr=0.2, cbs=callbacks+[partial(ParamScheduler, 'lr', sched)])

In [13]:
run.fit(1, learn)

train: [1.730062820123561, tensor(0.4031, device='cuda:0')]
valid: [1.3776530155254778, tensor(0.5465, device='cuda:0')]


It would be helpful to have a function that prints a summary of the model's layers along with the shape of each layer's output activations.

We can build one with Hooks: register a hook on each module, send one batch through the model, and print what comes out at every stage:

In [14]:
#export
def model_summary(run, learn, data, find_all=False):
    xb, yb = get_batch(data.valid_dl, run)
    device = next(learn.model.parameters()).device
    xb, yb = xb.to(device), yb.to(device)
    hf = lambda hook,mod,inp,outp: print(f'{mod}\nOutput:{outp.shape}\n')
    mods = find_mods(learn.model, is_lin_layer) if find_all else learn.model.children()
    with Hooks(mods, hf) as hook: learn.model(xb)

In [15]:
model_summary(run, learn, data)

Sequential(
  (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (1): GeneralRelu()
  (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
Output:torch.Size([128, 16, 128, 128])

Sequential(
  (0): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (1): GeneralRelu()
  (2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
Output:torch.Size([128, 32, 64, 64])

Sequential(
  (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (1): GeneralRelu()
  (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
Output:torch.Size([128, 64, 32, 32])

Sequential(
  (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (1): GeneralRelu()
  (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
Output:torch.Size([128, 64, 16, 16])

Sequential(
  (0): 
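
The spatial sizes in the summary above can be checked by hand. The sketch below applies the standard convolution output-size formula with the kernel size, stride, and padding shown in the printed layers; `conv_out_size` is a hypothetical helper written for this check, not part of the notebook's library.

```python
def conv_out_size(n, ks=3, stride=1, padding=1):
    # Standard conv output size: floor((n + 2*padding - ks) / stride) + 1
    return (n + 2 * padding - ks) // stride + 1

size = 128     # images are resized to 128x128 by ResizeFixed(128)
sizes = []
for stride in (1, 2, 2, 2):   # strides of the first four conv layers in the summary
    size = conv_out_size(size, stride=stride)
    sizes.append(size)

print(sizes)  # [128, 64, 32, 16]
```

Each stride-2 conv halves the height and width, which matches the shapes printed by `model_summary`.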

### Training

In [16]:
%time run.fit(5, learn)

train: [1.3123442697882564, tensor(0.5594, device='cuda:0')]
valid: [1.5668453921178345, tensor(0.5029, device='cuda:0')]
train: [1.2176354460938326, tensor(0.5961, device='cuda:0')]
valid: [1.275376816281847, tensor(0.5735, device='cuda:0')]
train: [0.8694145451605239, tensor(0.7151, device='cuda:0')]
valid: [1.2902619924363057, tensor(0.5967, device='cuda:0')]
train: [0.48882599634333085, tensor(0.8495, device='cuda:0')]
valid: [1.1328984623805733, tensor(0.6566, device='cuda:0')]
train: [0.2372632797648907, tensor(0.9434, device='cuda:0')]
valid: [1.1539058767914012, tensor(0.6596, device='cuda:0')]
Wall time: 1min 16s


### LSUV Init

Let's try our bag of tricks model with the LSUV initialization technique.

In [17]:
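def lsuv_module(mod, xb, mdl):
    # Hook the module; lsuv_append_stat is expected to store the output mean and std on the hook
    h = Hook(mod, lsuv_append_stat)
    
    with torch.no_grad():
        if mod.bias is not None:
            # Each mdl(xb) call refreshes h.mean; shift the bias until the output mean is close to 0
            while mdl(xb) is not None and abs(h.mean) > 1e-3: mod.bias -= h.mean
        # Each mdl(xb) call refreshes h.std; rescale the weights until the output std is close to 1
        while mdl(xb) is not None and abs(h.std-1) > 1e-3: mod.weight.data /= h.std

    h.remove()
    return h.mean, h.std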

In [18]:
xb, yb = get_batch(data.valid_dl, run)

mods = find_mods(learn.model, is_lin_layer)
for m in mods: lsuv_module(m, xb, learn.model)
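
To see what the two loops in `lsuv_module` do, here is a toy version in plain Python. It is only a sketch under simplifying assumptions: the "layer" is a single scalar weight and bias with no nonlinearity, and `forward` and `lsuv_toy` are hypothetical names, not the notebook's helpers.

```python
import random
import statistics

def forward(xs, w, b):
    # Toy linear layer: one scalar weight and bias applied to every input
    return [w * x + b for x in xs]

def lsuv_toy(xs, w, b, tol=1e-3, max_iter=50):
    # Loop 1: shift the bias until the output mean is close to 0
    for _ in range(max_iter):
        mean = statistics.fmean(forward(xs, w, b))
        if abs(mean) <= tol:
            break
        b -= mean
    # Loop 2: rescale the weight until the output std is close to 1
    for _ in range(max_iter):
        std = statistics.stdev(forward(xs, w, b))
        if abs(std - 1) <= tol:
            break
        w /= std
    out = forward(xs, w, b)
    return w, b, statistics.fmean(out), statistics.stdev(out)

random.seed(0)
xs = [random.gauss(2.0, 3.0) for _ in range(1000)]
w, b, mean, std = lsuv_toy(xs, w=0.1, b=0.5)
print(f"mean={mean:.4f} std={std:.4f}")
```

As in `lsuv_module`, the weight is rescaled after the bias has been fixed, so the final standard deviation ends up close to 1 while the mean can drift away from 0 again.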

In [19]:
%time run.fit(5, learn)

train: [0.4784115642161263, tensor(0.8919, device='cuda:0')]
valid: [1.4449001044984076, tensor(0.5753, device='cuda:0')]
train: [0.6862207949130056, tensor(0.7698, device='cuda:0')]
valid: [1.4064058767914012, tensor(0.5992, device='cuda:0')]
train: [0.3569653328047444, tensor(0.8857, device='cuda:0')]
valid: [1.3306855841958598, tensor(0.6334, device='cuda:0')]
train: [0.10578015885719189, tensor(0.9776, device='cuda:0')]
valid: [1.2797096437101911, tensor(0.6548, device='cuda:0')]
train: [0.036571728690950406, tensor(0.9977, device='cuda:0')]
valid: [1.2824068222531848, tensor(0.6555, device='cuda:0')]
Wall time: 1min 16s


In [20]:
nb_auto_export()
