In [303]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [304]:
#export
from exp.nb_07 import *

## Layerwise Sequential Unit Variance (LSUV)

In [305]:
x_train, y_train, x_valid, y_valid = get_data()

x_train, x_valid = normalize_to(x_train, x_valid)

In [306]:
train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)

In [307]:
nh, bs = 50, 512
c = y_train.max().item() + 1
loss_func = F.cross_entropy

In [308]:
data = DataBunch(*get_dls(train_ds, valid_ds, bs), c)

In [309]:
mnist_view = view_tfm(1, 28, 28)

In [310]:
cbfs = [
    Recorder,
    partial(AvgStatsCallback, accuracy),
    CudaCallback,
    partial(BatchTransformXCallback, mnist_view),
]

In [311]:
nfs = [8, 16, 32, 64, 64]

In [312]:
class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=0., **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=True)
        self.relu = GeneralRelu(sub=sub, **kwargs)
        
    def forward(self, x):
        return self.relu(self.conv(x))
    
    @property
    def bias(self): return -self.relu.sub
    
    @bias.setter
    def bias(self, v): 
        self.relu.sub = -v
    
    @property
    def weight(self):
        return self.conv.weight

In [313]:
learn, run = get_learn_run(nfs, data, 0.5, ConvLayer, cbs=cbfs)

The paper **All You Need is a Good Init**:

* Introduces Layer-wise Sequential Unit-Variance (LSUV).
* Initialize the network with the usual technique, then pass a batch through the model and check the outputs of the linear and convolutional layers.
* Rescale the weights according to the variance actually observed on the activations, and subtract the observed mean from the initial bias. That way the activations stay normalized.
* Repeat this process until the mean and variance are close enough to 0 and 1.
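
The steps above can be sketched on a single toy layer in plain Python. This is a minimal illustration under stated assumptions, not the paper's or fastai's implementation: `toy_layer`, `lsuv_toy`, and the one-dimensional "layer" `w * x + b` are hypothetical names introduced only for this example.

```python
import random
import statistics

def toy_layer(xs, w, b):
    # Hypothetical 1-D "layer": the same affine map applied to every item in the batch.
    return [w * x + b for x in xs]

def lsuv_toy(xs, w, b, tol=1e-3, max_iter=10):
    # Repeat: measure the batch statistics, rescale the weight, shift the bias.
    for _ in range(max_iter):
        out = toy_layer(xs, w, b)
        mean, std = statistics.fmean(out), statistics.stdev(out)
        if abs(mean) < tol and abs(std - 1) < tol:
            break
        w /= std   # rescale the weight by the observed std
        b -= mean  # subtract the observed mean from the bias
    return w, b

random.seed(0)
xs = [random.gauss(2.0, 3.0) for _ in range(1000)]  # a badly scaled input batch
w, b = lsuv_toy(xs, w=0.5, b=0.1)
out = toy_layer(xs, w, b)
print(statistics.fmean(out), statistics.stdev(out))
```

A single pass is generally not enough, because rescaling the weight also moves the mean; that is why the procedure is repeated until both statistics are within tolerance.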

**Let's get a baseline:**

In [314]:
run.fit(2, learn)

train: [1.42551203125, tensor(0.5098, device='cuda:0')]
valid: [0.2484116455078125, tensor(0.9214, device='cuda:0')]
train: [0.2205297265625, tensor(0.9318, device='cuda:0')]
valid: [0.117218310546875, tensor(0.9649, device='cuda:0')]


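We first recreate the learner so that we start again from untrained weights. Then we define a helper function that gets one batch from a given dataloader and runs the callbacks' `begin_batch` step to preprocess it: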

In [329]:
learn,run = get_learn_run(nfs, data, 0.5, ConvLayer, cbs=cbfs)

In [330]:
#export
def get_batch(dl, run):
    run.xb, run.yb = next(iter(dl))
    for cb in run.cbs: cb.set_runner(run)
    run('begin_batch')
    return run.xb, run.yb

In [331]:
xb, yb = get_batch(data.train_dl, run)

Find all convolutional and linear layers. Modules in PyTorch can form a tree structure, so we have to recurse.

In [332]:
#export
def find_modules(m, cond):
    if cond(m): return [m]
    return sum([find_modules(o, cond) for o in m.children()], [])
        
def is_lin_layer(l):
    # Convolutional and fully connected layers, i.e. the layers that carry weights.
    lin_layers = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)
    return isinstance(l, lin_layers)
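
To see how the recursion walks a module tree, here is a small sketch that needs no PyTorch. `Node` is a hypothetical stand-in for `nn.Module` that only provides `children()`, and `find_nodes` repeats the logic of `find_modules` under a different name so the example is self-contained.

```python
class Node:
    # Hypothetical stand-in for nn.Module: just a name and a list of child nodes.
    def __init__(self, name, *kids):
        self.name, self.kids = name, list(kids)

    def children(self):
        return iter(self.kids)

def find_nodes(m, cond):
    # Same recursion as find_modules: a matching node is returned as-is,
    # otherwise the results from all children are concatenated.
    if cond(m): return [m]
    return sum([find_nodes(o, cond) for o in m.children()], [])

tree = Node("model",
            Node("block", Node("conv1"), Node("relu")),
            Node("conv2"),
            Node("head", Node("pool"), Node("linear")))

found = [n.name for n in find_nodes(tree, lambda n: n.name.startswith("conv"))]
print(found)  # ['conv1', 'conv2']
```

Note that the search does not descend into a node once it matches. This is why searching for `ConvLayer` returns the wrapper modules rather than the `nn.Conv2d` layers inside them.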

In [333]:
mods = find_modules(learn.model, lambda o: isinstance(o, ConvLayer))

In [334]:
mods

[ConvLayer(
   (conv): Conv2d(1, 8, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
   (relu): GeneralRelu()
 ), ConvLayer(
   (conv): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ), ConvLayer(
   (conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ), ConvLayer(
   (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ), ConvLayer(
   (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 )]

Helper function that grabs the mean and std of the output of a hooked layer:

In [335]:
def append_stat(hook, mod, inp, outp):
    d = outp.data
    hook.mean, hook.std = d.mean().item(), d.std().item()

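PyTorch's `nn.Module.register_forward_hook(hook)` registers a function that is called as `hook(module, input, output)` after every forward pass. The `Hook` class registers it with `self.hook = m.register_forward_hook(partial(f, self))`, so `f` (here `append_stat`) receives the `Hook` object itself as an extra first argument and can store the statistics on it.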
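
The hook plumbing described above can be mimicked in a few lines of plain Python. This is only a sketch: `ToyModule`, `ToyHook`, and `record_mean` are hypothetical names, and the toy deliberately simplifies PyTorch (which passes the inputs as a tuple and returns a handle object with a `.remove()` method).

```python
from functools import partial

class ToyModule:
    # Hypothetical stand-in for nn.Module that supports forward hooks.
    def __init__(self):
        self._hooks = []

    def register_forward_hook(self, hook):
        self._hooks.append(hook)
        return lambda: self._hooks.remove(hook)  # toy replacement for the handle

    def forward(self, x):
        return [2 * v for v in x]

    def __call__(self, x):
        out = self.forward(x)
        for hook in self._hooks:
            hook(self, x, out)  # called as hook(module, input, output)
        return out

class ToyHook:
    def __init__(self, m, f):
        # partial(f, self) prepends this object, so f is called as
        # f(hook, module, input, output).
        self.remove = m.register_forward_hook(partial(f, self))

def record_mean(hook, mod, inp, outp):
    hook.mean = sum(outp) / len(outp)  # store the statistic on the hook object

m = ToyModule()
h = ToyHook(m, record_mean)
m([1.0, 2.0, 3.0])
print(h.mean)  # 4.0
h.remove()
```

Storing the statistics on the hook object rather than in a global keeps one set of numbers per hooked layer, which is what lets the notebook loop over `hooks` and print a mean and std for each layer.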

In [336]:
??nn.Module.register_forward_hook

In [337]:
??Hook

In [338]:
model = learn.model.cuda()

Let's look at the means and the stds of the conv layers of our model:

In [339]:
with Hooks(mods, append_stat) as hooks:
    model(xb)
    for hook in hooks:
        print(hook.mean, hook.std)

0.3720621168613434 0.8526896238327026
0.28457432985305786 0.6383782029151917
0.3149712085723877 0.5545002818107605
0.3181527853012085 0.48824334144592285
0.22887423634529114 0.3585847020149231


The means are not 0 and the stds are not 1. We first rescale the weights to make the stds 1, then adjust the bias to make the means 0 (threshold 1e-3 for both).

Here, `model(xb) is not None` is always true. It is only there so that the batch is passed through the network on every iteration, which recomputes the activations and updates the hook statistics until we fall below the threshold.

In [340]:
#export
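def lsuv_module(m, xb):
    # Note: relies on the notebook-level `model` defined above.
    h = Hook(m, append_stat)

    # Each `model(xb)` call runs a forward pass, which refreshes h.mean and h.std.
    while model(xb) is not None and abs(h.std - 1) > 1e-3: m.weight.data /= h.std
    while model(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean

    h.remove()
    return h.mean, h.std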

Execute that initialization on all the conv layers in order:

In [341]:
for m in mods:
    print(lsuv_module(m, xb))

(2.2503794383510467e-08, 0.9999998807907104)
(3.740495557735812e-08, 0.9999999403953552)
(6.577465683221817e-09, 1.0000001192092896)
(-3.003515303134918e-08, 1.0)
(-1.1175870895385742e-08, 0.9999999403953552)


The fastai course fixes the biases first and then the weights. In that order, rescaling the weights afterwards disturbs the means that were just corrected, so here we rescale the weights first and correct the bias last.
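
The effect of the ordering can be illustrated with a toy unit in plain Python. This is a sketch under simplifying assumptions: `stats` is a hypothetical helper that models a layer as `relu(w * x) - sub`, ignoring the convolution, its own bias, and any extra `GeneralRelu` options.

```python
import random
import statistics

def stats(xs, w, sub):
    # Toy ConvLayer-like unit: ReLU of a scaled input, then a shift by -sub.
    out = [max(w * x, 0.0) - sub for x in xs]
    return statistics.fmean(out), statistics.stdev(out)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Order A: bias first, then weights.
w, sub = 0.3, 0.0
mean, _ = stats(xs, w, sub)
sub += mean                      # the mean is now ~0
_, std = stats(xs, w, sub)
w /= std                         # the std is now ~1, but the mean moves again
mean_a, std_a = stats(xs, w, sub)

# Order B: weights first, then bias.
w, sub = 0.3, 0.0
_, std = stats(xs, w, sub)
w /= std                         # the std is now ~1
mean, _ = stats(xs, w, sub)
sub += mean                      # a constant shift leaves the std untouched
mean_b, std_b = stats(xs, w, sub)

print(f"bias then weights: mean={mean_a:.3f} std={std_a:.3f}")
print(f"weights then bias: mean={mean_b:.3f} std={std_b:.3f}")
```

In this toy, both orders reach unit std, but only the second also keeps the mean at zero, because subtracting a constant changes the mean without changing the spread.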

**Let's train again and compare to the baseline:**

In [342]:
run.fit(2, learn)

train: [0.4337325390625, tensor(0.8606, device='cuda:0')]
valid: [0.1249064697265625, tensor(0.9631, device='cuda:0')]
train: [0.12725626953125, tensor(0.9613, device='cuda:0')]
valid: [0.092294921875, tensor(0.9713, device='cuda:0')]


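With LSUV we start with a better loss and accuracy (validation accuracy 0.9631 versus 0.9214 after the first epoch) and end with a better loss and accuracy (0.9713 versus 0.9649 after the second epoch) than the baseline trained without LSUV.

LSUV is especially useful for more complex and deeper architectures, where it is hard to choose an initialization that still gives unit variance at the last layer.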

## Export

In [344]:
!python notebook2script.py 07a_lsuv.ipynb

Converted 07a_lsuv.ipynb to exp/nb_07a.py
