In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Does nn.Conv2d init work well?

[Jump_to lesson 9 video](https://course.fast.ai/videos/?lesson=9&t=21)

In [2]:
#export
from exp.nb_02 import *

def get_data():
    path = datasets.download_data(MNIST_URL, ext='.gz')
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (x_train,y_train,x_valid,y_valid))

def normalize(x, m, s): return (x-m)/s

We are testing out the `math.sqrt(5)` part of the following function, because it doesn't make intuitive sense.

In [3]:
torch.nn.modules.conv._ConvNd.reset_parameters??

Signature: torch.nn.modules.conv._ConvNd.reset_parameters(self)
Docstring: <no docstring>
Source:   
    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)

In [4]:
x_train,y_train,x_valid,y_valid = get_data()
train_mean,train_std = x_train.mean(),x_train.std()
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

In [5]:
x_train.shape

torch.Size([50000, 784])

Reshape the training and validation data so that they can be fed into a ConvNet. Each image is currently a flat vector of 784 values, but a conv layer needs a (channels, height, width) input, so we view each one as a 1-channel, 28 x 28 image.

In [6]:
x_train = x_train.view(-1,1,28,28)
x_valid = x_valid.view(-1,1,28,28)
x_train.shape,x_valid.shape

(torch.Size([50000, 1, 28, 28]), torch.Size([10000, 1, 28, 28]))

In [7]:
n,*_ = x_train.shape
c = y_train.max()+1
nh = 32
n,c

(50000, tensor(10))

Create a Conv2d layer with 1 input channel, `nh` output channels (the number of hidden units), and a 5x5 kernel.

In [8]:
l1 = nn.Conv2d(1, nh, 5)

In [9]:
x = x_valid[:100]

In [10]:
x.shape

torch.Size([100, 1, 28, 28])

Create a function for the mean and std, since we'll be using this a lot.

A good habit: once you notice you've done the same thing at least twice, go back and put it in a function.

In [11]:
def stats(x): return x.mean(),x.std()

A Conv2d layer contains a weight parameter, and bias. They are both tensors.

In [12]:
l1.weight.shape

torch.Size([32, 1, 5, 5])

The weight has 32 output filters (because that's the number of hidden units), 1 input channel, and a 5x5 kernel.

In [13]:
stats(l1.weight),stats(l1.bias)

((tensor(0.0008, grad_fn=<MeanBackward0>),
  tensor(0.1161, grad_fn=<StdBackward0>)),
 (tensor(-0.0033, grad_fn=<MeanBackward0>),
  tensor(0.1083, grad_fn=<StdBackward0>)))

So we know that this function `torch.nn.modules.conv._ConvNd.reset_parameters` was called to initialize.

```python
    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)
```

The bias is initialized with uniform random numbers between `-bound` and `+bound`, where `bound = 1 / math.sqrt(fan_in)`.<br>
The weights are initialized with kaiming uniform, using `a=math.sqrt(5)`.
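
To see what `a=math.sqrt(5)` does to the weight range, the arithmetic can be worked through without PyTorch. This is only a sketch: it assumes the kaiming uniform bound is `sqrt(3) * gain / sqrt(fan_in)` with `gain = sqrt(2 / (1 + a**2))` (the formula used later in this notebook), and `default_conv_bounds` is a hypothetical helper name, not a PyTorch function.

```python
import math

def default_conv_bounds(fan_in, a=math.sqrt(5)):
    """Hypothetical helper: uniform bounds implied by reset_parameters."""
    gain = math.sqrt(2.0 / (1 + a**2))   # leaky-ReLU gain for leak `a`
    std = gain / math.sqrt(fan_in)       # target std of the weights
    weight_bound = math.sqrt(3.0) * std  # U(-b, b) has std b / sqrt(3)
    bias_bound = 1 / math.sqrt(fan_in)   # same expression as in reset_parameters
    return weight_bound, bias_bound, std

# fan_in = 1 input channel * 5 * 5 kernel elements = 25 for l1
weight_bound, bias_bound, weight_std = default_conv_bounds(25)
print(weight_bound, bias_bound, weight_std)
```

Under these assumptions both bounds come out at about 0.2, that is `1/sqrt(fan_in)`, and the implied weight std of about 0.115 agrees with the 0.1161 measured above.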

So let's put our input through the layer and get the stats of the output (`t`). We want a mean of 0 and a std of 1.

In [14]:
t = l1(x)

In [15]:
stats(t)

(tensor(-0.0007, grad_fn=<MeanBackward0>),
 tensor(0.6611, grad_fn=<StdBackward0>))

Mean is close to 0, but std is not where we want.

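Let's use the normal kaiming init (`kaiming_normal_`), which is designed for a layer followed by a leaky ReLU. `a` is the "leak" amount (the negative slope). With `a=1.` the leaky ReLU is just the identity, which matches this test because `l1(x)` has no activation after it.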

In [16]:
init.kaiming_normal_(l1.weight, a=1.)
stats(l1(x))

(tensor(-0.0009, grad_fn=<MeanBackward0>),
 tensor(1.0673, grad_fn=<StdBackward0>))

This gives us good results: the std is now close to 1.

In [17]:
import torch.nn.functional as F

Let's make a function that passes our input through the conv layer and then a leaky ReLU with the given "leak" amount.

In [18]:
def f1(x,a=0): return F.leaky_relu(l1(x),a)   # a=0 means a regular relu

In [19]:
init.kaiming_normal_(l1.weight, a=0)  
stats(f1(x))

(tensor(0.5271, grad_fn=<MeanBackward0>),
 tensor(1.0030, grad_fn=<StdBackward0>))

Mean is not 0, std is 1
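
The positive mean is expected, because a ReLU zeroes out every negative value. A small sketch of that effect, using standard normal samples from `random.gauss` as a stand-in for the conv outputs (an assumption for illustration; the real activations are not exactly normal):

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Stand-in for pre-activation values: zero mean, unit std (an assumption).
pre = [random.gauss(0.0, 1.0) for _ in range(100_000)]
post = [max(0.0, v) for v in pre]  # plain ReLU

def mean(xs):
    return sum(xs) / len(xs)

relu_mean = mean(post)
expected = 1 / math.sqrt(2 * math.pi)  # E[max(0, Z)] for Z ~ N(0, 1)
print(mean(pre), relu_mean, expected)
```

The sample mean after the ReLU should land near the theoretical value of about 0.4, even though the mean before it is close to 0.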

When we reinitialize with the standard PyTorch init (below), the mean and std are nowhere near 0 and 1.

In [20]:
l1 = nn.Conv2d(1, nh, 5)
stats(f1(x))

(tensor(0.2198, grad_fn=<MeanBackward0>),
 tensor(0.3903, grad_fn=<StdBackward0>))

In [21]:
l1.weight.shape

torch.Size([32, 1, 5, 5])

To explore this, we're going to write our own kaiming init function.

Explanation: https://youtu.be/AcA8HAYh7IE?t=543

Convolution and matrix multiplication are essentially one and the same, with some weight tying and some zeros. So, to calculate the total number of multiplications and additions going on in a convolutional layer, we basically take the kernel size (5x5) and multiply it by the number of filters (32).

To do this, we grab any single kernel from the weight tensor, `weight[0,0]`, which is 5x5, and count its elements with `numel()`. That count is the receptive field size: simply the number of elements in the kernel.

In [22]:
# receptive field size

rec_fs = l1.weight[0,0].numel()
rec_fs

25

Grab the shape (num filters out, num in)

In [23]:
nf,ni,*_ = l1.weight.shape
nf,ni

(32, 1)

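The lesson gives no explanation of why we calculate the fan-in and fan-out this way. Roughly: each output activation is a sum over `ni*rec_fs` input values (the fan-in), and each input value feeds into about `nf*rec_fs` output values (the fan-out).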

In [24]:
fan_in  = ni*rec_fs
fan_out = nf*rec_fs
fan_in,fan_out

(25, 800)
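
The same calculation can be sketched for any conv weight shape. `fans` is a hypothetical helper written for this note; it assumes the weight is laid out as `(out_channels, in_channels, *kernel_size)`, as in the `l1.weight.shape` output above.

```python
import math

def fans(shape):
    """Hypothetical helper: (fan_in, fan_out) for a conv weight shape
    laid out as (out_channels, in_channels, *kernel_size)."""
    nf, ni, *kernel = shape
    rec_fs = math.prod(kernel)  # receptive field size, e.g. 5*5 = 25
    return ni * rec_fs, nf * rec_fs

print(fans((32, 1, 5, 5)))  # the layer in this notebook -> (25, 800)
print(fans((16, 8, 3, 3)))  # a 3x3 conv from 8 to 16 channels -> (72, 144)
```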

In [25]:
# formula for kaiming init, a is the leak (0 = standard relu)

def gain(a): return math.sqrt(2.0 / (1 + a**2))

Try the gain for different values of `a`, especially `math.sqrt(5)`:

In [26]:
gain(1),gain(0),gain(0.01),gain(0.1),gain(math.sqrt(5.))

(1.0,
 1.4142135623730951,
 1.4141428569978354,
 1.4071950894605838,
 0.5773502691896257)

Keep in mind that they're using kaiming_uniform, not kaiming_normal. 

This is what those distributions look like...
![](images/dist.png)

Let's grab a bunch of random numbers from a uniform distribution on (-1, 1) and get their std, which ends up being about `1/math.sqrt(3)`.

In [27]:
torch.zeros(10000).uniform_(-1,1).std()

tensor(0.5781)

In [28]:
1/math.sqrt(3.)

0.5773502691896258
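
The `1/math.sqrt(3)` figure follows from the variance of a uniform distribution. A short derivation, where $b$ is the bound of a symmetric uniform distribution and $\sigma$ is a target std (both symbols are introduced here for illustration):

$$
\operatorname{Var}\big[U(-b, b)\big] = \frac{\big(b - (-b)\big)^2}{12} = \frac{b^2}{3}
\quad\Longrightarrow\quad
\operatorname{std}\big[U(-b, b)\big] = \frac{b}{\sqrt{3}}
$$

With $b = 1$ this gives $1/\sqrt{3} \approx 0.577$. Conversely, to hit a target std $\sigma$ with a uniform distribution, the bound has to be $b = \sqrt{3}\,\sigma$.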

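Here's the previous code in a function. The bound is `math.sqrt(3.) * std` so that the uniformly sampled weights end up with the target std.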

In [29]:
def kaiming2(x,a, use_fan_out=False):
    nf,ni,*_ = x.shape
    rec_fs = x[0,0].shape.numel()
    fan = nf*rec_fs if use_fan_out else ni*rec_fs
    std = gain(a) / math.sqrt(fan)
    bound = math.sqrt(3.) * std
    x.data.uniform_(-bound,bound)

In [30]:
kaiming2(l1.weight, a=0);
stats(f1(x))

(tensor(0.5072, grad_fn=<MeanBackward0>),
 tensor(0.9606, grad_fn=<StdBackward0>))

You get a std (and so a variance) of about 1.

In [31]:
kaiming2(l1.weight, a=math.sqrt(5.))
stats(f1(x))

(tensor(0.2074, grad_fn=<MeanBackward0>),
 tensor(0.3624, grad_fn=<StdBackward0>))

Now we're getting about the same std as the PyTorch default, ~0.4.

Let's throw some inputs through a quick convnet and see the results.

In [32]:
class Flatten(nn.Module):
    def forward(self,x): return x.view(-1)

In [33]:
m = nn.Sequential(
    nn.Conv2d(1,8, 5,stride=2,padding=2), nn.ReLU(),
    nn.Conv2d(8,16,3,stride=2,padding=1), nn.ReLU(),
    nn.Conv2d(16,32,3,stride=2,padding=1), nn.ReLU(),
    nn.Conv2d(32,1,3,stride=2,padding=1),
    nn.AdaptiveAvgPool2d(1),
    Flatten(),
)

In [34]:
y = y_valid[:100].float()

In [35]:
t = m(x)
stats(t)

(tensor(0.0481, grad_fn=<MeanBackward0>),
 tensor(0.0116, grad_fn=<StdBackward0>))

So we know the input has std 1, the first hidden layer has about 0.4, and now the output of the last layer is down to about 0.01.
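
One rough way to see why the signal shrinks layer by layer: compared with a ReLU-appropriate init (`a=0`), the default `a=math.sqrt(5)` scales the weight std by `gain(math.sqrt(5)) / gain(0)`. The sketch below compounds that ratio over the four conv layers of the Sequential model above. It is a back-of-the-envelope estimate only: it ignores the biases, the padding, and the average pooling, so it should not be expected to reproduce the measured 0.01. `gain` is redefined here so that the block stands alone.

```python
import math

def gain(a):
    return math.sqrt(2.0 / (1 + a**2))

# Ratio of the default weight std to the a=0 (plain ReLU) kaiming std.
per_layer = gain(math.sqrt(5.0)) / gain(0.0)  # = 1/sqrt(6), about 0.41

n_conv_layers = 4  # the Sequential model above has four Conv2d layers
relative_scale = per_layer ** n_conv_layers   # = 1/36, about 0.028
print(per_layer, relative_scale)
```

Under these simplifications, the default init leaves the output roughly 36 times smaller than a kaiming init with `a=0` would, which is in line with the direction of the drop seen here.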

In [36]:
l = mse(t,y)
l.backward()

In [37]:
stats(m[0].weight.grad)

(tensor(-0.0004), tensor(0.0346))

Stats for the gradients have std 0.03, nowhere near 1

In [38]:
init.kaiming_uniform_??

Signature:
init.kaiming_uniform_(
    tensor,
    a=0,
    mode='fan_in',
    nonlinearity='leaky_relu',
)
Source:   
def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    r"""Fills the input `Tensor` with values according to the method
    described in `Delving deep into rectifiers: Surpassing human-level
    performance on ImageNet classification` - He, K. et al. (2015), using a

`kaiming_uniform_` multiplies the std by `math.sqrt(3)` to get the bound of the uniform distribution.

In [39]:
for l in m:
    if isinstance(l,nn.Conv2d):
        init.kaiming_uniform_(l.weight)
        l.bias.data.zero_()

In [40]:
t = m(x)
stats(t)

(tensor(0.6480, grad_fn=<MeanBackward0>),
 tensor(0.2855, grad_fn=<StdBackward0>))

It's not 1, but closer

In [41]:
l = mse(t,y)
l.backward()
stats(m[0].weight.grad)

(tensor(-0.2117), tensor(0.4082))

Not 1, but closer

This is the starting point for the research

## Export

In [43]:
!./notebook2script.py 02a_Lesson09_why_sqrt5.ipynb

Converted 02a_Lesson09_why_sqrt5.ipynb to exp/nb_02a.py
