### Layer Normalization

- Layer normalization addresses the problem of covariate shift and speeds up neural network training. Unlike batch normalization, it also applies readily to sequential (recurrent) networks.

Ba, Kiros, and Hinton, 2016

- Batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a
mean and variance which are then used to normalize the summed input to that
neuron on each training case. 

- This significantly reduces the training time in feedforward neural networks. However, the effect of batch normalization is dependent
on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. 

- Layer normalization transposes batch normalization: the mean and variance used
for normalization are computed from all of the summed inputs to the neurons in a
layer on a single training case. As in batch normalization, each neuron gets its
own adaptive bias and gain, which are applied after the normalization but before
the non-linearity. Unlike batch normalization, layer normalization performs
exactly the same computation at training and test times.

- It is also straightforward to apply to recurrent neural networks by computing the
normalization statistics separately at each time step. 

- Layer normalization is very
effective at stabilizing the hidden state dynamics in recurrent networks. 

- Layer normalization can substantially reduce the training time
compared with previously published techniques.
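The difference between the two schemes comes down to which axis the statistics are taken over. The following is a minimal sketch in plain tensor operations, not the paper's implementation; the tensor `x`, its shape, and the value of `eps` are assumptions made for illustration, and the learnable gain and bias are left out for brevity.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 6)  # hypothetical summed inputs: 4 training cases, 6 neurons
eps = 1e-5             # small constant for numerical stability (assumed value)

# Batch-norm-style statistics: one mean/variance per neuron, taken over the mini-batch
bn_mean = x.mean(dim=0, keepdim=True)
bn_var = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + eps)

# Layer-norm-style statistics: one mean/variance per training case, taken over the layer
ln_mean = x.mean(dim=1, keepdim=True)
ln_var = x.var(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + eps)

# Normalizing one training case on its own gives the same result as normalizing
# it inside the mini-batch, so layer norm does not depend on the batch size
single = x[:1]
single_ln = (single - single.mean(dim=1, keepdim=True)) / torch.sqrt(
    single.var(dim=1, unbiased=False, keepdim=True) + eps
)
same_alone = torch.allclose(single_ln, x_ln[:1])
```

Because the layer-norm statistics involve only one training case, the same code path works for a batch of size one, which is part of why the method carries over to recurrent networks.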

In [2]:
import torch.nn as nn
import torch

In [None]:

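# Toy NLP batch: 20 sentences, 5 tokens each, 10-dimensional embeddings
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
# Normalize over the last dimension (the embedding features of each token)
layer_norm = nn.LayerNorm(embedding_dim)
# Apply the module; the output has the same shape as the input: (20, 5, 10)
layer_norm(embedding)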
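As a sanity check, the module's output can be reproduced by hand. This sketch assumes the default `nn.LayerNorm` settings (learnable `weight` initialized to 1 and `bias` to 0). It also illustrates two points from the notes above: the statistics are computed separately for every position in the sequence, and the module behaves identically in training and evaluation modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
layer_norm = nn.LayerNorm(embedding_dim)

# Manual version: statistics over the last dimension, i.e. per sample and per position
mean = embedding.mean(dim=-1, keepdim=True)
var = embedding.var(dim=-1, unbiased=False, keepdim=True)
normalized = (embedding - mean) / torch.sqrt(var + layer_norm.eps)
# weight is the learnable gain, bias the learnable shift, one of each per feature
manual = normalized * layer_norm.weight + layer_norm.bias

matches_module = torch.allclose(manual, layer_norm(embedding), atol=1e-5)

# The module keeps no running statistics, so train and eval modes agree
layer_norm.train()
out_train = layer_norm(embedding)
layer_norm.eval()
out_eval = layer_norm(embedding)
same_in_both_modes = torch.allclose(out_train, out_eval)
```

Both `matches_module` and `same_in_both_modes` should evaluate to `True`.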


### Dropout

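- Typically applied to the output of hidden (non-output) layers.
- During training, randomly zeroes elements of the input tensor, each independently with probability p, using samples from a Bernoulli distribution. A fresh mask is drawn on every forward call, and the surviving elements are scaled by 1/(1-p).
- Not applied in evaluation mode (`model.eval()`), where the module passes its input through unchanged.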

In [None]:
nn.Dropout(p=0.25)
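The difference between the two modes is easy to see on a tensor of ones. This is a small sketch; the input size of 1000 and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.25)
x = torch.ones(1000)

# Training mode: roughly a fraction p of the elements is zeroed
dropout.train()
y_train = dropout(x)
zero_fraction = (y_train == 0).float().mean().item()
# Survivors are scaled by 1 / (1 - p) so the expected activation is unchanged
survivor_value = y_train[y_train != 0][0].item()

# Evaluation mode: the module is the identity
dropout.eval()
y_eval = dropout(x)
unchanged_in_eval = torch.equal(y_eval, x)
```

Here `zero_fraction` should land near 0.25, and `survivor_value` should be close to 1/0.75.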

### Embeddings

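- An embedding is an efficient alternative to a single linear layer when the input consists of a large number of discrete features (for example, word indices): instead of multiplying a one-hot vector by a weight matrix, it looks up the corresponding row directly.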
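The equivalence with a linear layer can be checked directly. In this sketch the vocabulary size, vector size, and indices are hypothetical values chosen for illustration. The lookup returns the same vectors as multiplying one-hot inputs by the embedding's weight matrix, without ever materializing the one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_embeddings, embedding_dim = 1000, 8  # hypothetical vocabulary size and vector size
embedding = nn.Embedding(num_embeddings, embedding_dim)
indices = torch.tensor([3, 17, 999])

# Lookup: select rows of the weight matrix by index
looked_up = embedding(indices)

# Equivalent dense computation: one-hot vectors times the same weight matrix
one_hot = F.one_hot(indices, num_classes=num_embeddings).float()
via_matmul = one_hot @ embedding.weight

same_result = torch.allclose(looked_up, via_matmul)
```

The dense route multiplies a 3 x 1000 matrix by a 1000 x 8 matrix to pick out three rows, which is the work the lookup avoids.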