# Neural Network Weight Initialization

Consider a multilayer perceptron (MLP) neural network in which layer $i$ computes the function $y_i(x) = \sigma(W_i x + b_i)$, where $x$ is the input $n_i$-vector, $W_i$ is an $m_i \times n_i$ weight matrix, $b_i$ is a bias $m_i$-vector and $\sigma : \mathbb{R}^{m_i} \to \mathbb{R}^{m_i}$ is some elementwise activation function.
We can then represent the entire network as repeated application of the layer function:

$$
o(x) = (y_N \circ y_{N-1} \circ \dots \circ y_2 \circ y_1)(x)
$$

Assume each component of the input is drawn independently from a standard normal distribution, $x_k \sim \mathcal{N}(0, 1)$, and that the bias vectors are initialized to zero.

## Identity activation function

With an identity activation function $\sigma(x) = x$, the appropriate initialization for the weight matrices is $W_i \sim \mathcal{N}(0, \frac{1}{n_i})$ (that is, variance $\frac{1}{n_i}$), where $n_i$ is the size of the $i$th-layer input vector.
This can be achieved by sampling directly from that distribution, or by sampling from a standard normal and then dividing the weights by $\sqrt{n_i}$.
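
The choice of variance $\frac{1}{n_i}$ can be motivated with a short calculation. This is only a sketch, under the assumptions that the weights are independent of the inputs and of each other, that both weights and inputs have zero mean, that all input components $x_k$ share the same variance, and that the bias is zero. The symbol $z_j$ (the $j$th component of $W_i x$) is introduced here for illustration.

$$
\begin{aligned}
z_j &= \sum_{k=1}^{n_i} (W_i)_{jk} \, x_k \\
\operatorname{Var}(z_j) &= \sum_{k=1}^{n_i} \operatorname{Var}\big((W_i)_{jk}\big) \operatorname{Var}(x_k) = n_i \operatorname{Var}\big((W_i)_{jk}\big) \operatorname{Var}(x_k)
\end{aligned}
$$

Setting $\operatorname{Var}\big((W_i)_{jk}\big) = \frac{1}{n_i}$ gives $\operatorname{Var}(z_j) = \operatorname{Var}(x_k)$, so with an identity activation each layer passes the variance of its input through unchanged in expectation.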

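This is sometimes called 'Xavier' or 'Glorot' initialization, after Xavier Glorot, first author of the 2010 paper <a href="http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi">Understanding the difficulty of training deep feedforward neural networks</a>. The paper also proposes a 'normalized' variant with variance $\frac{2}{n_i + n_{i+1}}$, which takes the backward pass into account as well.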

In [1]:

# Get imports
import math
import random
import torch


In [2]:

# Use identity activation function
sigma = lambda x: x

# Initialize a random deep network
num_layers = 200
random_layer_size = lambda: random.randint(10, 1024)
x = torch.randn(random_layer_size())
weights = []
biases = []
layer_input_size = len(x)
for layer in range(num_layers):
    layer_output_size = random_layer_size()
    weights.append(
        torch.randn(
            layer_output_size,
            layer_input_size
        ) / math.sqrt(layer_input_size)
    )
    biases.append(torch.zeros(layer_output_size))
    layer_input_size = layer_output_size

# Forward pass through the network
y = x
for w, b in zip(weights, biases):
    y = sigma(w @ y + b)

# Check the distribution of the output vector
print(x.mean(), x.std())
print(y.mean(), y.std())


tensor(0.0088) tensor(0.9887)
tensor(-0.0213) tensor(1.6131)



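Without the division by $\sqrt{n_i}$ (the `/ math.sqrt(layer_input_size)` factor in the code above), each layer multiplies the variance of its input by roughly $n_i$, so the values in the network grow very large; a weight scale that is too small instead drives them toward 0. Either extreme slows training.
With Xavier initialization, each layer preserves the variance of its input in expectation, so the output stays on the same order of magnitude as the input regardless of the depth of the network: here the standard deviation is about 0.99 at the input and 1.61 after 200 layers.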
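
To make the effect of the scaling concrete, the following minimal sketch runs the same identity-activation forward pass with and without the $\frac{1}{\sqrt{n_i}}$ factor. The helper `output_std`, the fixed width of 256 and the depth of 20 are assumptions chosen for illustration (a shallower network than the one above, so that the unscaled values stay within float32 range); they are not part of the original experiment.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed so the comparison is repeatable


def output_std(scale_weights, num_layers=20, width=256):
    """Std of the output of a deep identity-activation network.

    Hypothetical helper: every layer has the same width, biases are zero.
    """
    y = torch.randn(width)
    for _ in range(num_layers):
        w = torch.randn(width, width)
        if scale_weights:
            # Scale by 1/sqrt(n_i), as in the cell above
            w = w / math.sqrt(width)
        y = w @ y
    return y.std().item()


scaled_std = output_std(scale_weights=True)
unscaled_std = output_std(scale_weights=False)
print(scaled_std, unscaled_std)
```

With the scaling, the printed standard deviation should stay on the order of 1; without it, each layer multiplies the standard deviation by roughly $\sqrt{256} = 16$, so the second number should be many orders of magnitude larger.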



## ReLU activation function

With the Rectified Linear Unit (ReLU) activation function $\sigma(x) = \max(0, x)$, the appropriate initialization for weight matrices is $W_i \sim \mathcal{N}(0, \frac{2}{n_i})$ (that is, variance $\frac{2}{n_i}$), where $n_i$ is the size of the $i$th-layer input vector.
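
The extra factor of 2 can be motivated as follows. This is a sketch of the usual argument, assuming zero biases, zero-mean weights that are independent of the inputs, and a previous-layer pre-activation $z'$ whose distribution is symmetric about zero; the symbols $z'$ and $z_j$ are introduced here for illustration. Because ReLU zeroes out the negative half of a symmetric distribution, the layer input $x_k = \max(0, z'_k)$ has half the second moment of $z'_k$:

$$
\begin{aligned}
\mathbb{E}\big[x_k^2\big] &= \mathbb{E}\big[\max(0, z'_k)^2\big] = \tfrac{1}{2} \operatorname{Var}(z'_k) \\
\operatorname{Var}(z_j) &= n_i \operatorname{Var}\big((W_i)_{jk}\big) \, \mathbb{E}\big[x_k^2\big] = \tfrac{n_i}{2} \operatorname{Var}\big((W_i)_{jk}\big) \operatorname{Var}(z'_k)
\end{aligned}
$$

Setting $\operatorname{Var}\big((W_i)_{jk}\big) = \frac{2}{n_i}$ gives $\operatorname{Var}(z_j) = \operatorname{Var}(z'_k)$, so the variance of the pre-activations is preserved from layer to layer in expectation.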

This is sometimes called 'Kaiming' or 'He' initialization, after Kaiming He, first author of the 2015 paper <a href="https://arxiv.org/pdf/1502.01852.pdf">Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification</a>.
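
PyTorch ships this scheme as `torch.nn.init.kaiming_normal_`. The sketch below compares it with the manual scaling of a standard normal by $\sqrt{2 / n_i}$; the layer sizes `fan_in` and `fan_out` are arbitrary values chosen for illustration, and the comparison is between sample standard deviations, not the individual weights.

```python
import math
import torch

torch.manual_seed(0)

# Hypothetical layer sizes, chosen only for this example
fan_in, fan_out = 256, 512

# Built-in initializer: for a 2-D tensor of shape (out, in), mode="fan_in"
# uses the second dimension, giving std = sqrt(2 / fan_in) for ReLU
w_builtin = torch.empty(fan_out, fan_in)
torch.nn.init.kaiming_normal_(w_builtin, mode="fan_in", nonlinearity="relu")

# Manual version: standard normal scaled by sqrt(2 / fan_in)
w_manual = torch.randn(fan_out, fan_in) * math.sqrt(2 / fan_in)

target_std = math.sqrt(2 / fan_in)
print(target_std, w_builtin.std().item(), w_manual.std().item())
```

Both sample standard deviations should land close to the target $\sqrt{2 / n_i}$, which suggests the two approaches are interchangeable for a plain weight matrix like this one.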


In [3]:

# Use ReLU activation function
sigma = torch.relu

# Initialize a random deep network
num_layers = 200
random_layer_size = lambda: random.randint(10, 1024)
x = torch.randn(random_layer_size())
weights = []
biases = []
layer_input_size = len(x)
for layer in range(num_layers):
    layer_output_size = random_layer_size()
    weights.append(
        torch.randn(
            layer_output_size,
            layer_input_size
        ) * math.sqrt(2) / math.sqrt(layer_input_size)
    )
    biases.append(torch.zeros(layer_output_size))
    layer_input_size = layer_output_size

# Forward pass through the network
y = x
for w, b in zip(weights, biases):
    y = sigma(w @ y + b)

# Check the distribution of the output vector
print(x.mean(), x.std())
print(y.mean(), y.std())


tensor(0.0279) tensor(0.9784)
tensor(0.0316) tensor(0.0482)
