### Layer Normalization

Ba, Kiros, and Hinton, 2016

- Batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a
mean and variance which are then used to normalize the summed input to that
neuron on each training case. 

- This significantly reduces training time in feedforward neural networks. However, the effect of batch normalization depends
on the mini-batch size, and it is not obvious how to apply it to recurrent neural networks.

- Layer normalization transposes batch normalization: it computes the mean and
variance used for normalization from all of the summed inputs to the neurons in
a layer on a single training case. As in batch normalization, each neuron also
gets its own adaptive bias and gain, which are applied after the normalization
but before the non-linearity. Unlike batch normalization, layer normalization
performs exactly the same computation at training and test times.

- It is also straightforward to apply to recurrent neural networks by computing the
normalization statistics separately at each time step. 

- Layer normalization is very
effective at stabilizing the hidden state dynamics in recurrent networks. 

- Empirically, the authors report that layer normalization can substantially
reduce training time compared with previously published techniques.
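
The difference between the two schemes comes down to which axis the statistics are taken over. The sketch below is only an illustration under assumptions: the tensor sizes, the names `gain`, `bias`, and `eps`, and the value `1e-5` (the default in `nn.LayerNorm`) are choices made here, not details from the notes above. It computes both kinds of statistics by hand, compares the hand-written layer normalization with `nn.LayerNorm`, and checks that the result does not change between training and evaluation mode or with the batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6)  # (batch, features): summed inputs to one layer

# Batch-norm-style statistics: one mean/variance per neuron, taken over the batch.
bn_mean = x.mean(dim=0, keepdim=True)                # shape (1, 6)
bn_var = x.var(dim=0, unbiased=False, keepdim=True)

# Layer-norm statistics: one mean/variance per training case, taken over the neurons.
ln_mean = x.mean(dim=1, keepdim=True)                # shape (4, 1)
ln_var = x.var(dim=1, unbiased=False, keepdim=True)

# Per-neuron gain and bias, applied after normalization (identity at initialization).
eps = 1e-5
gain = torch.ones(6)
bias = torch.zeros(6)
manual = (x - ln_mean) / torch.sqrt(ln_var + eps) * gain + bias

# Compare against the built-in module.
layer_norm = nn.LayerNorm(6)
reference = layer_norm(x)
matches = torch.allclose(manual, reference, atol=1e-5)

# Same computation in training and evaluation mode.
layer_norm.train()
out_train = layer_norm(x)
layer_norm.eval()
out_eval = layer_norm(x)
same_in_both_modes = torch.allclose(out_train, out_eval)

# No dependence on the mini-batch: one case alone gives the same row.
batch_independent = torch.allclose(layer_norm(x[:1]), reference[:1], atol=1e-5)

print(matches, same_in_both_modes, batch_independent)
```

Because every statistic is computed within a single training case, nothing like a running average has to be carried from training to test time.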

```python
import torch
import torch.nn as nn

batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
# Normalize over the last dimension (embedding_dim)
layer_norm = nn.LayerNorm(embedding_dim)
# Activate module
layer_norm(embedding)
```


Output (truncated):

```
tensor([[[ 0.8631, -0.5524, -0.6833,  1.6680, -0.7220, -1.5516, -0.3392,
           0.9271,  1.1228, -0.7325],
         [-0.5624, -0.5215, -0.8575,  0.6416, -1.2002,  2.0684,  0.6174,
          -1.2592,  0.5748,  0.4987],
         [ 0.1391, -0.3788, -0.4666, -0.8340,  0.7700, -0.5245, -0.7082,
           1.3214, -1.3289,  2.0105],
         [-1.8712,  0.1779,  0.0526,  1.2808, -0.1356, -0.9665, -1.0203,
           0.7354,  1.4914,  0.2555],
         [-0.0586,  0.0164,  1.7841,  0.4567,  0.2305, -1.7034,  1.3561,
          -0.8539, -1.0187, -0.2093]],

        [[ 0.8672, -1.1752, -0.6380,  0.8092, -0.5298, -1.3975, -0.0243,
           0.1947,  2.1173, -0.2236],
         [-0.6731,  2.1555, -1.0356,  0.1076, -0.0157, -0.2555, -0.8735,
           0.5578,  1.1731, -1.1406],
         [-0.3722,  0.9445,  1.7047, -0.5864,  1.5843, -1.2029, -0.0111,
          -0.6455, -1.1222, -0.2933],
         [-1.0012,  0.2532,  0.5297,  0.5100, -0.4123, -1.7262, -0.4312,
          -0.2290,  0.2884,  2.2184],
```
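
For the recurrent case, the normalization statistics are computed separately at each time step. The following is a minimal sketch of that idea, not the paper's reference implementation: the class name `LayerNormRNNCell`, the `hidden_size` of 16, the `tanh` non-linearity, and the zero initial state are all assumptions made for illustration. It reuses the `(batch, sentence_length, embedding_dim)` shapes from the example above.

```python
import torch
import torch.nn as nn


class LayerNormRNNCell(nn.Module):
    """Hypothetical vanilla RNN cell with layer normalization on the summed inputs."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # The linear maps carry no bias: LayerNorm supplies the per-neuron gain and bias.
        self.input_proj = nn.Linear(input_size, hidden_size, bias=False)
        self.hidden_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x_t, h_prev):
        # Summed inputs at this time step.
        a_t = self.input_proj(x_t) + self.hidden_proj(h_prev)
        # Statistics come from a_t alone: per training case, per time step.
        return torch.tanh(self.norm(a_t))


torch.manual_seed(0)
batch, sentence_length, embedding_dim, hidden_size = 20, 5, 10, 16
embedding = torch.randn(batch, sentence_length, embedding_dim)

cell = LayerNormRNNCell(embedding_dim, hidden_size)
h = torch.zeros(batch, hidden_size)  # assumed initial hidden state
states = []
for t in range(sentence_length):
    h = cell(embedding[:, t, :], h)
    states.append(h)

hidden_states = torch.stack(states, dim=1)  # (batch, sentence_length, hidden_size)
print(hidden_states.shape)
```

Since the statistics depend only on the current summed inputs, the same cell works for any sequence length and any batch size, including a batch of one.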

Dropout

Embeddings