In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [None]:
#export
from exp.nb_07 import *

## ConvNet

Get the MNIST data and build a CNN.

In [None]:
x_train,y_train,x_valid,y_valid = get_data()

x_train,x_valid = normalize_to(x_train,x_valid)
train_ds,valid_ds = Dataset(x_train, y_train),Dataset(x_valid, y_valid)

nh,bs = 50,512
c = y_train.max().item()+1
loss_func = F.cross_entropy

data = DataBunch(*get_dls(train_ds, valid_ds, bs), c)

In [None]:
mnist_view = view_tfm(1,28,28)
cbfs = [Recorder,
        partial(AvgStatsCallback,accuracy),
        CudaCallback,
        partial(BatchTransformXCallback, mnist_view)]

In [None]:
nfs = [8,16,32,64,64]

In [None]:
def conv_layer(ni, nf, ks=3, stride=2, **kwargs):
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=True),
        GeneralRelu(**kwargs))

In [None]:
learn,run = get_learn_run(nfs, data, 0.6, conv_layer, cbs=cbfs)

Now we're going to look at [All You Need is a Good Init](https://arxiv.org/pdf/1511.06422.pdf), which introduces *Layer-wise Sequential Unit-Variance* (*LSUV*) initialization. We initialize our neural net with the usual technique, then pass a batch through the model and check the outputs of the linear and convolutional layers. For each of those layers, we subtract the observed mean of the activations from the initial bias and divide the weights by the observed standard deviation. That way the activations start out normalized, with a mean close to 0 and a standard deviation close to 1.

We repeat this process until we are satisfied with the mean/variance we observe.
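
To make the arithmetic concrete, here is a minimal sketch of that loop on a single toy linear layer, written in plain Python so it runs without PyTorch or a GPU. It is not the notebook's implementation: the sizes, the tolerance, and the names `forward` and `stats` are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical toy sizes; the inputs are deliberately off-centre and the
# weights deliberately too large, so the layer starts out badly scaled.
n_in, n_out, n_samples = 8, 4, 256
xb = [[random.gauss(0.5, 1.0) for _ in range(n_in)] for _ in range(n_samples)]
weight = [[random.gauss(0.0, 2.0) for _ in range(n_in)] for _ in range(n_out)]
bias = [0.0] * n_out

def forward(xb, weight, bias):
    """Flat list of every output of the linear layer on the batch."""
    return [sum(w * x for w, x in zip(row, sample)) + b
            for sample in xb
            for row, b in zip(weight, bias)]

def stats(acts):
    """Mean and (sample) standard deviation over all activations."""
    return statistics.fmean(acts), statistics.stdev(acts)

before = stats(forward(xb, weight, bias))

# Repeat "centre with the bias, rescale the weights" until both statistics
# are within tolerance (capped, so the loop always terminates).
for _ in range(10):
    mean, std = stats(forward(xb, weight, bias))
    if abs(mean) < 1e-3 and abs(std - 1) < 1e-3:
        break
    bias = [b - mean for b in bias]
    weight = [[w / std for w in row] for row in weight]

after = stats(forward(xb, weight, bias))
print("before:", before)
print("after: ", after)
```

A single pass is not enough here: dividing the weights moves the mean away from 0 again, so a further pass is needed to re-centre it, which is why the process is repeated.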

Let's start by looking at a baseline:

In [None]:
run.fit(2, learn)

Now we recreate the model and try again with LSUV:

In [None]:
learn,run = get_learn_run(nfs, data, 0.6, conv_layer, cbs=cbfs)

In [None]:
def get_batch(dl, run):
    run.xb,run.yb = next(iter(dl))
    for cb in run.cbs: cb.set_runner(run)
    run('begin_batch')
    return run.xb,run.yb

In [None]:
xb,yb = get_batch(data.train_dl, run)

We only want the outputs of convolutional or linear layers.

In [None]:
def find_modules(m, cond):
    if cond(m): return [m]
    return sum([find_modules(o,cond) for o in m.children()], [])

In [None]:
def is_lin_layer(l):
    lin_layers = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)
    return isinstance(l, lin_layers)
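
To see what the recursion in `find_modules` returns, the sketch below runs the same function on a small stand-in tree. The `Node` class and its `kind` field are hypothetical substitutes for `nn.Module` and the `isinstance` check in `is_lin_layer`; they are only there so the example runs without PyTorch.

```python
class Node:
    """Hypothetical stand-in for nn.Module: a name, a kind, and child nodes."""
    def __init__(self, name, kind, children=()):
        self.name, self.kind, self._children = name, kind, list(children)

    def children(self):
        return iter(self._children)

def find_modules(m, cond):
    # Same logic as above: stop at the first match on each branch,
    # otherwise concatenate whatever the children return.
    if cond(m): return [m]
    return sum([find_modules(o, cond) for o in m.children()], [])

model = Node('model', 'sequential', [
    Node('block0', 'sequential', [Node('conv0', 'conv'), Node('relu0', 'relu')]),
    Node('block1', 'sequential', [Node('conv1', 'conv'), Node('relu1', 'relu')]),
    Node('pool', 'pool'),
    Node('head', 'linear'),
])

found = [n.name for n in find_modules(model, lambda n: n.kind in ('conv', 'linear'))]
print(found)  # ['conv0', 'conv1', 'head']
```

The matches come back in depth-first order, which is also the order the layers run in for a sequential model, so processing the returned list front to back handles the layers from input to output.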

In [None]:
mods = find_modules(learn.model, is_lin_layer)

In [None]:
mods

In [None]:
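def append_mean(hook, mod, inp, outp):
    # Despite the name, nothing is appended: the hook just stores the mean
    # and std of the layer's most recent output.
    d = outp.data
    hook.mean,hook.std = d.mean().item(),d.std().item()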

In [None]:
mdl = learn.model.cuda()

In [None]:
with Hooks(mods,append_mean) as hooks:
    mdl(xb)
    for hook in hooks: print(hook.mean,hook.std)

The idea is then to adjust each layer's bias and weights so that its activations have a mean closer to 0 and a std closer to 1.

In [None]:
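def lsuv_module(m, xb):
    # Note: relies on the global `mdl` defined above to run the forward pass.
    h = Hook(m, append_mean)

    # Shift the bias until the layer's output mean is within 1e-3 of 0.
    # `mdl(xb) is not None` is always true; it is there to rerun the forward
    # pass so the hook refreshes h.mean and h.std before each check.
    if getattr(m, 'bias', None) is not None:
        while mdl(xb) is not None and abs(h.mean) > 1e-3:
            m.bias.data -= h.mean

    # Rescale the weights until the output std is within 1e-3 of 1.
    # This moves the mean again, so the returned mean is generally not exactly 0.
    while mdl(xb) is not None and abs(h.std-1) > 1e-3:
        m.weight.data /= h.std

    h.remove()
    return h.mean,h.std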

In [None]:
for m in mods: print(lsuv_module(m, xb))

Training then starts from a better footing.

In [None]:
%time run.fit(2, learn)

In this case, it's not clear that LSUV has helped much. That may be because it's not a very deep network, and we were also fairly careful to initialize it properly in previous notebooks. However, LSUV might be particularly useful for more complex and deeper architectures that are hard to initialize to get unit variance at the last layer.
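
One way to see why depth could matter: if every layer multiplies the variance of its input by a roughly constant factor, that factor compounds with depth. The sketch below is only that arithmetic, with a hypothetical per-layer gain and no nonlinearities or biases; it is not a measurement on this network.

```python
def output_variance(per_layer_gain, depth, input_variance=1.0):
    """Variance after `depth` layers if each one scales the variance by the same gain."""
    return input_variance * per_layer_gain ** depth

# A gain that is only 10% off in either direction, at two depths.
for depth in (5, 50):
    print(depth, output_variance(0.9, depth), output_variance(1.1, depth))
```

Under these assumptions a 10% error per layer is mild at depth 5 but changes the variance by orders of magnitude at depth 50. That is the situation in which correcting each layer from observed statistics, as LSUV does, may pay off.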