## A - Note on the network initialization

    Thomas Moreau <thomas.moreau@inria.fr>
    Alexandre Gramfort <alexandre.gramfort@inria.fr>

In this notebook, we consider the properties of different initialization schemes for the parameters of the network. We will use some random data `x, y`, generated below:

In [None]:
import torch

n_samples, n_features = 1000, 100

x = torch.randn(n_samples, n_features)
y = torch.randn(n_samples)

We consider one linear layer that conserves the input dimensionality.  
We initialize the weights and the bias randomly with normal distributions:

In [None]:
n_hidden = 100

def lin(x, w, b):
    return x @ w + b

w1 = torch.randn(n_features, n_hidden)
b1 = torch.randn(n_hidden)
l1 = lin(x, w1, b1)
l1.shape

There is a problem with the way our model was initialized, however.  
To understand it, we need to look at the standard deviation (std) of `l1` compared to the one of the input `x`:

In [None]:
x.std(), l1.std()

The standard deviation, which measures how far our activations spread around the mean, went from 1 to about 10.
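
The factor of roughly 10 can be checked by hand. Assuming independent, zero-mean, unit-variance entries (which is what `torch.randn` produces), each output is a sum of `n_features` products plus one bias term, so its variance is about `n_features + 1 = 101`. The sketch below compares this prediction with the empirical value; the names `predicted_std` and `empirical_std` are only illustrative.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed so the check is reproducible

n_samples, n_features, n_hidden = 1000, 100, 100

x = torch.randn(n_samples, n_features)
w = torch.randn(n_features, n_hidden)
b = torch.randn(n_hidden)

# Each output is a sum of n_features independent products x_i * w_ij
# (variance 1 each) plus one bias term (variance 1).
predicted_std = math.sqrt(n_features * 1.0 + 1.0)
empirical_std = (x @ w + b).std().item()

print(f"predicted: {predicted_std:.2f}, empirical: {empirical_std:.2f}")
```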

This is a really big problem because that's with just one layer. Modern neural nets can have hundreds of layers, so if each of them multiplies the scale of our activations by 10, by the end of the last layer we won't have numbers representable by a computer.

Indeed, if we apply just 50 layers with this initialization to `x`, we get:

In [None]:
x = torch.randn(n_samples, n_features)
for i in range(50):
    w1 = torch.randn(n_features, n_hidden)
    b1 = torch.randn(n_hidden)
    x = lin(x, w1, b1)
    
print(f"std(X) = {x.std().item():.2e}")
x[0:5,0:5]

The result is `nan`s everywhere. So maybe the scale of our matrix was too big, and we need smaller weights? But if the weights are too small, we have the opposite problem: the scale of our activations goes from 1 to 0.1 at each layer, and after 50 layers we are left with zeros everywhere:

In [None]:
x = torch.randn(n_samples, n_features)
for i in range(50):
    w1 = 1e-2 * torch.randn(n_features, n_hidden)
    b1 = torch.zeros(n_hidden)
    x = lin(x, w1, b1)
    
print(f"std(X) = {x.std().item():.2e}")
x[:5, :5]

We set `b1` to `0` here; otherwise the output would not be 0 everywhere but would instead be almost the same for every input, since the bias would dominate.
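
Both failure modes come from the limits of `float32`. As a rough sketch, assume the std is multiplied by a constant gain at each layer (about 10 for unit-variance weights, about 0.1 for weights scaled by `1e-2`). The predicted value after 50 layers then falls outside what `float32` can store. The dictionary `stored` is only an illustrative name.

```python
import torch

n_layers = 50
info = torch.finfo(torch.float32)

stored = {}
for gain in (10.0, 0.1):
    predicted = gain ** n_layers  # computed in float64 by Python
    # Casting the float64 value to float32 shows what the tensor would hold.
    as_float32 = torch.tensor(predicted, dtype=torch.float64).to(torch.float32).item()
    stored[gain] = as_float32
    print(f"gain={gain}: predicted std ~ {predicted:.1e}, as float32 -> {as_float32}")

print(f"float32 max = {info.max:.2e}, smallest normal = {info.tiny:.2e}")
```

In the exploding case the entries first overflow to `inf`; the `nan`s appear once a matrix product has to add `inf` and `-inf`.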

The quantity that drives this phenomenon is the operator norm of the weight matrix $\|W_1\|_2$, which corresponds to its largest singular value. To control it, we have to scale our weights so that this norm stays close to $1$ and the activations neither explode to $\infty$ nor vanish to $0$. Using random matrix theory, this can easily be done, as illustrated by Xavier Glorot and Yoshua Bengio in ["Understanding the Difficulty of Training Deep Feedforward Neural Networks"](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). The right scale for a given layer is $1/\sqrt{n_{in}}$, where $n_{in}$ represents the number of inputs of the layer.

In [None]:
import numpy as np

x = torch.randn(n_samples, n_features)
for i in range(50):
    w1 = torch.randn(n_features, n_hidden) / np.sqrt(n_features)
    b1 = torch.zeros(n_hidden)
    x = lin(x, w1, b1)

print(f"std(X) = {x.std().item()}")
x[:5, :5]

Finally some numbers that are neither zeros nor `nan`s and a reasonable std!
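
To make the role of the operator norm concrete, the sketch below computes it with `torch.linalg.matrix_norm(..., ord=2)` and checks that it matches the largest singular value returned by `torch.linalg.svdvals`. It also computes the root-mean-square singular value, which is what the std of the activations actually tracks. The variable names are illustrative, and the matrix is assumed square and Gaussian as in the cells above.

```python
import torch

torch.manual_seed(0)

n = 100
w = torch.randn(n, n)

# The operator (spectral) norm equals the largest singular value.
op_norm = torch.linalg.matrix_norm(w, ord=2).item()
largest_sv = torch.linalg.svdvals(w).max().item()

# Root-mean-square singular value: Frobenius norm divided by sqrt(n).
rms_sv = (torch.linalg.matrix_norm(w) / n ** 0.5).item()

# Rescaling the weights rescales every singular value by the same factor.
scaled_norm = torch.linalg.matrix_norm(w / n ** 0.5, ord=2).item()

print(f"unscaled: norm = {op_norm:.2f}, largest sv = {largest_sv:.2f}, rms sv = {rms_sv:.2f}")
print(f"scaled by 1/sqrt(n): norm = {scaled_norm:.2f}")
```

For a square Gaussian matrix the scaled operator norm comes out around 2 rather than exactly 1, while the root-mean-square gain is close to 1, so "close to 1" is best read as "of order 1".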

If you play a little bit with the scale by setting `eps` to a value in $\{-1, 1\}$, you'll notice that even a slight deviation from $1/\sqrt{n_{in}}$ gets you either very small or very large numbers, so initializing the weights properly is extremely important.

In [None]:
eps = -1
x = torch.randn(n_samples, n_features)
for i in range(50):
    w1 = torch.randn(n_features, n_hidden) / (np.sqrt(n_features) + eps)
    b1 = torch.zeros(n_hidden)
    x = lin(x, w1, b1)

print(f"std(X) = {x.std().item():.2f}")
x[:5, :5]
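
The sensitivity to `eps` follows from simple arithmetic. Assuming zero bias, each layer multiplies the std by about $\sqrt{n_{in}} / (\sqrt{n_{in}} + \epsilon)$, and this gain is compounded 50 times. The helper `predicted_std` below is a hypothetical name for that back-of-the-envelope estimate, not part of the notebook.

```python
import math

n_features, n_layers = 100, 50

def predicted_std(eps):
    # Per-layer gain on the std when weights are drawn as
    # randn / (sqrt(n_features) + eps) and the bias is zero.
    gain = math.sqrt(n_features) / (math.sqrt(n_features) + eps)
    return gain ** n_layers

for eps in (-1, 0, 1):
    print(f"eps={eps:+d}: predicted std after {n_layers} layers ~ {predicted_std(eps):.2e}")
```

A change of 10% in the scale thus turns into a factor of roughly 190 (for `eps = -1`) or 0.0085 (for `eps = 1`) after 50 layers.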

Very good. Now we need to go through a ReLU:

In [None]:
def relu(x):
    return x.clamp_min(0.)


# Redefine fresh `x, y`
x = torch.randn(n_samples, n_features)
y = torch.randn(n_samples)


w1 = torch.randn(n_features, n_hidden) / np.sqrt(n_features)
b1 = torch.randn(n_hidden)
l1 = lin(x, w1, b1)
l2 = relu(l1)
l2.mean(), l2.std()

And we're back to square one: the mean of our activations has gone up (which is understandable since we removed the negatives) and the std has gone down. So like before, after a few layers we will probably end up with zeros:

In [None]:
x = torch.randn(n_samples, n_features)

for i in range(50):
    w1 = torch.randn(n_features, n_hidden) / np.sqrt(n_features, dtype=np.float32)
    b1 = torch.zeros(n_hidden)
    x = relu(lin(x, w1, b1))

print(f"std(X) = {x.std().item():.2e}")
x[0:5,0:5]
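
The shrinkage can be traced to one property of the ReLU: for a zero-mean, symmetric input it discards half of the second moment. The sketch below checks this on standard normal samples, used here as a stand-in for the pre-activations, and derives the shrinkage this implies over 50 layers when the weights have variance $1/n_{in}$.

```python
import math
import torch

torch.manual_seed(0)

z = torch.randn(1_000_000)  # stand-in for zero-mean pre-activations
a = z.clamp_min(0.0)        # ReLU

second_moment_ratio = (a.pow(2).mean() / z.pow(2).mean()).item()
print(f"E[relu(z)^2] / E[z^2] = {second_moment_ratio:.3f}")  # close to 0.5
print(f"mean = {a.mean().item():.3f}, std = {a.std().item():.3f}")

# With weights of variance 1/n_in, each layer therefore shrinks the
# root-mean-square activation by about sqrt(1/2).
print(f"predicted shrinkage after 50 layers ~ {math.sqrt(0.5) ** 50:.1e}")
```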

This means our initialization wasn't right. Why? At the time Glorot and Bengio wrote their article, the popular activation in a neural net was the hyperbolic tangent (tanh, which is the one they used), and that initialization doesn't account for our ReLU. Fortunately, someone else has done the math for us and computed the right scale to use. In ["Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"](https://arxiv.org/abs/1502.01852) (written by the team that later introduced the ResNet), Kaiming He et al. show that we should use the following scale instead: $\sqrt{2 / n_{in}}$, where $n_{in}$ is the number of inputs of the layer. Let's see what this gives us:

In [None]:
x = torch.randn(n_samples, n_features)

for i in range(50):
    w1 = torch.randn(n_features, n_hidden) * np.sqrt(2 / n_features, dtype=np.float32)
    b1 = torch.zeros(n_hidden)
    x = relu(lin(x, w1, b1))

print(f"std(X) = {x.std().item():.2f}")
x[0:5,0:5]

That's better: our numbers aren't all zeroed this time.  
This initialization is named *Kaiming initialization* or *He initialization*.
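
PyTorch ships this scheme as `torch.nn.init.kaiming_normal_`. The sketch below applies it to an `nn.Linear` layer and compares the resulting weight std with $\sqrt{2 / n_{in}}$. Note that `nn.Linear` stores its weight with shape `(n_out, n_in)`, the transpose of the `w1` used in the cells above; the layer sizes are illustrative.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

n_in, n_out = 100, 100
layer = nn.Linear(n_in, n_out)

# mode="fan_in" makes the std depend on n_in, and nonlinearity="relu"
# selects the sqrt(2) gain, giving std = sqrt(2 / n_in).
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.zeros_(layer.bias)

expected_std = math.sqrt(2 / n_in)
actual_std = layer.weight.std().item()
print(f"expected std: {expected_std:.4f}, actual std: {actual_std:.4f}")
```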

Note that these considerations are linked to the Lipschitz constants of the layers and of the network, which have received significant attention in the recent literature.

## Conclusion

Here are a few things to remember:

- A neural net is basically a bunch of matrix multiplications with nonlinearities in between.
- When subclassing `nn.Module`, we have to call the superclass `__init__` method in our `__init__` method and we have to define a `forward` function that takes an input and returns the desired result.
- The backward pass is the chain rule applied multiple times, computing the gradients from the output of our model and going back, one layer at a time.
- Properly initializing a neural net is crucial to get training started. Kaiming initialization should be used when we have ReLU nonlinearities.
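
As a minimal sketch tying the `nn.Module` and initialization points together, the class below subclasses `nn.Module`, calls the superclass `__init__`, defines `forward`, and applies Kaiming initialization to every layer. The class name `DeepReLUNet` and its sizes are hypothetical, chosen to mirror the 50-layer experiments above.

```python
import torch
from torch import nn

class DeepReLUNet(nn.Module):
    """Hypothetical stack of equal-width linear layers with ReLU in between."""

    def __init__(self, n_features, n_layers):
        super().__init__()  # required before registering any submodule
        self.layers = nn.ModuleList(
            [nn.Linear(n_features, n_features) for _ in range(n_layers)]
        )
        for layer in self.layers:
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

torch.manual_seed(0)
model = DeepReLUNet(n_features=100, n_layers=50)
with torch.no_grad():
    out = model(torch.randn(1000, 100))
print(out.shape, f"std = {out.std().item():.2f}")
```

The output std varies from one random draw to the next, but it should stay finite and of a reasonable order of magnitude rather than collapsing to zero.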