# Layer Normalization

This notebook is a hands-on code walkthrough of how "add & layer normalization" is defined.

Layer Normalization
- Why? The activations of neurons in a neural network vary widely and can take large positive or negative values. Layer normalization rescales them to a mean of 0 and a standard deviation of 1, which stabilizes training and tends to improve model performance.

How yHat is usually calculated:

$$ yHat = w^T * X + B $$

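How the output is calculated with Layer Normalization. In the formula below, $\mu$ and $\sigma$ are the mean and standard deviation of the values being normalized, and $\gamma$ (scale) and $\beta$ (shift) are "learnable" parameters of the model.

$$ yPred = \gamma \biggl({\frac{yHat - \mu}{\sigma}} \biggr) + \beta $$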
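As a rough worked example of the formula above, the sketch below normalizes a single three-value vector using only the Python standard library. The values, the initial `gamma = 1` and `beta = 0`, and the small `eps` added under the square root for numerical safety are illustrative assumptions, chosen to match the values used later in this notebook.

```python
import math

# Illustrative pre-normalization activations for one word (assumed values).
y_hat = [0.2, 0.1, 0.3]
gamma, beta = 1.0, 0.0   # scale and shift at their usual initial values
eps = 1e-5               # small constant to avoid division by zero

mu = sum(y_hat) / len(y_hat)                          # mean of the vector
var = sum((v - mu) ** 2 for v in y_hat) / len(y_hat)  # biased variance (divide by N)
sigma = math.sqrt(var + eps)                          # standard deviation

y_pred = [gamma * (v - mu) / sigma + beta for v in y_hat]

# Roughly: mu = 0.2, sigma = 0.0817, y_pred = [0.0, -1.2238, 1.2238]
print(round(mu, 4), round(sigma, 4), [round(v, 4) for v in y_pred])
```

With `gamma = 1` and `beta = 0` the output is simply the standardized vector; training can later move both parameters away from these values.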

In [2]:
import math
import numpy as np
from IPython.display import Image, display
from matplotlib import pyplot as plt
import torch 
import torch.nn as nn

In [4]:
# Layer normalization is applied to the output of the multi-head attention step,
# which produces one embedding vector per word. The code below builds a small
# fixed tensor with a batch size of 1 (the batch axis is kept to show how batched
# inputs are handled), 2 words, and an embedding dimension of 3 for each word.

inputs = torch.Tensor([[[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]]])
batch_size, number_of_words, embeddings = inputs.size()
inputs = inputs.reshape(number_of_words, batch_size, embeddings)
inputs.size()

torch.Size([2, 1, 3])
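One caveat on the `reshape` call above: it gives the intended layout here only because the batch size is 1. The sketch below uses a hypothetical batch of 2 sentences to suggest why `permute` is usually the safer way to swap the batch and word axes. The tensor `x` and its sizes are assumptions made up for illustration.

```python
import torch

# Hypothetical input: 2 sentences, 3 words each, embedding size 4.
x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)  # (batch, words, embedding)

reshaped = x.reshape(3, 2, 4)   # re-reads the same memory order; rows get reassigned
permuted = x.permute(1, 0, 2)   # swaps the first two axes; each row stays with its word

# After a true axis swap, index [word 0, sentence 1] should hold x[1, 0].
print(torch.equal(permuted[0, 1], x[1, 0]))  # True
print(torch.equal(reshaped[0, 1], x[1, 0]))  # False

# With a batch size of 1 the two operations happen to agree.
single = x[:1]  # shape (1, 3, 4)
print(torch.equal(single.reshape(3, 1, 4), single.permute(1, 0, 2)))  # True
```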

In [7]:
# Normalize over the last two dimensions, here (batch, embedding) = (1, 3)
parameter_shape = inputs.size()[-2:]

# Define gamma (scale) and beta (shift), the trainable parameters of layer normalization
gamma = nn.Parameter(torch.ones(parameter_shape))
beta = nn.Parameter(torch.zeros(parameter_shape))

In [8]:
gamma.size(), beta.size()

(torch.Size([1, 3]), torch.Size([1, 3]))

In [6]:
dims = [-(i+1) for i in range(len(parameter_shape))]
dims

[-1, -2]

In [9]:
mean = inputs.mean(dim=dims, keepdim=True)
mean.size()

torch.Size([2, 1, 1])

In [21]:
mean

tensor([[[0.2000]],

        [[0.2333]]])

In [10]:
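# Biased variance: the mean of the squared deviations (divide by N, not N - 1)
var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
# A small epsilon keeps the square root away from zero
epsilon = 1e-5
std = (var + epsilon).sqrt()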
std

tensor([[[0.0817]],

        [[0.1886]]])
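The variance above is the biased estimate: it divides by N rather than N - 1. As a quick sanity check, and assuming the same input values as in this notebook, the sketch below compares the manual computation with PyTorch's built-in `Tensor.var` using `unbiased=False`.

```python
import torch

# Same values as the notebook's example, already in (words, batch, embedding) layout.
inputs = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])
dims = (-1, -2)

mean = inputs.mean(dim=dims, keepdim=True)
manual_var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)

# unbiased=False asks for the divide-by-N estimate used in layer normalization.
builtin_var = inputs.var(dim=dims, unbiased=False, keepdim=True)

print(torch.allclose(manual_var, builtin_var))  # True
```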

In [12]:
yPred = (inputs - mean) / std
yPred

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]])

# Combining everything above into a reusable class

In [17]:
class LayerNormalization:
    
    def __init__(self, parameter_shape, eps=1e-5):
        self.parameters_shape = parameter_shape
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(parameter_shape))
        self.beta = nn.Parameter(torch.zeros(parameter_shape))

    def forward(self, inputs):
        # Use the argument passed in, not a global variable of the same name
        dims = [-(i+1) for i in range(len(self.parameters_shape))]
        mean = inputs.mean(dim=dims, keepdim=True)
        print("mean is: ", mean)
        var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
        std = (var + self.eps).sqrt()
        print("std is: ", std)
        yPred = (inputs - mean) / std
        print("yPred is: ", yPred)
        # Scale and shift with the learnable parameters
        out = self.gamma * yPred + self.beta
        print("out is: ", out)
        return out

### Execution Sample 1

With a batch size of 1

In [18]:
inputs = torch.Tensor([[[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]]])
batch_size, number_of_words, embeddings = inputs.size()
inputs = inputs.reshape(number_of_words, batch_size, embeddings)
inputs.size()

torch.Size([2, 1, 3])

In [20]:
ln = LayerNormalization(inputs.size()[-2:])
out = ln.forward(inputs)

mean is:  tensor([[[0.2000]],

        [[0.2333]]])
std is:  tensor([[[0.0817]],

        [[0.1886]]])
yPred is:  tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]])
out is:  tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>)
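To build some confidence in the numbers above, the manual computation can be compared with PyTorch's built-in `torch.nn.functional.layer_norm`. The sketch below is a minimal check, not part of the original notebook. The helper name `manual_layer_norm` is hypothetical, and it omits `gamma` and `beta`, which is equivalent to leaving them at their initial values of 1 and 0.

```python
import torch
import torch.nn.functional as F

def manual_layer_norm(x, normalized_shape, eps=1e-5):
    """Normalize x over its trailing dimensions, following the notebook's steps."""
    dims = [-(i + 1) for i in range(len(normalized_shape))]
    mean = x.mean(dim=dims, keepdim=True)
    var = ((x - mean) ** 2).mean(dim=dims, keepdim=True)  # biased variance
    return (x - mean) / (var + eps).sqrt()

inputs = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])  # shape (2, 1, 3)
shape = tuple(inputs.size()[-2:])                               # (1, 3)

ours = manual_layer_norm(inputs, shape)
reference = F.layer_norm(inputs, shape, eps=1e-5)  # no weight/bias: plain normalization

print(torch.allclose(ours, reference, atol=1e-5))  # True
```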


### Execution Sample 2

With a batch size of 3

In [23]:
batch_size = 3
sentence_length = 5
embedding_dim = 8

inputs = torch.randn(sentence_length, batch_size, embedding_dim)
inputs.size()

torch.Size([5, 3, 8])

In [24]:
inputs

tensor([[[-0.0384,  0.5514, -1.7740, -0.8103, -0.3624, -0.0891,  1.1995,
           0.1242],
         [-0.3968,  0.6609, -0.2462,  0.8967,  0.0881,  0.7868,  0.5733,
           1.0558],
         [-0.0509, -0.5829,  0.3879, -0.2013,  0.0669, -0.2171,  3.5800,
          -0.8869]],

        [[ 1.8947, -0.0449,  0.1689, -1.0628, -0.0268,  1.4316, -1.1890,
           0.4034],
         [-0.1751,  0.1311, -0.8976,  0.3440,  0.0097,  0.7959, -1.1326,
          -0.1315],
         [ 1.0377,  0.3952,  2.8899, -1.3676, -0.8985,  0.4896, -0.3504,
           1.3439]],

        [[ 0.6586, -0.4266, -0.9415,  0.4833,  1.3271,  0.1943,  0.5293,
          -0.5305],
         [ 0.4919,  0.8674,  1.4671, -1.2163, -0.0985,  0.9991, -0.1465,
          -0.4534],
         [ 1.1036,  0.1054,  0.2652, -2.4811,  1.2961,  0.2491, -1.5688,
           0.0838]],

        [[-1.3049,  0.4179,  0.9155, -1.6621, -0.0244, -0.5019,  0.6881,
           1.0417],
         [-0.7131,  0.7899, -1.0207, -0.2845, -0.3761, -0.1282, 

In [None]:
ln = LayerNormalization(inputs.size()[-2:])
out = ln.forward(inputs)
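Two things are worth noting about this call. With `inputs.size()[-2:]` and a batch size of 3, the statistics are pooled over both the batch and embedding axes, whereas layer normalization in Transformers is typically computed per word over the embedding axis only. Also, `LayerNormalization` is a plain Python class, so `gamma` and `beta` are not registered with a parent `nn.Module`. The sketch below shows one possible variant that addresses both points; the class name `LayerNormModule` is hypothetical, and the comparison with `nn.LayerNorm` is only a sanity check.

```python
import torch
import torch.nn as nn

class LayerNormModule(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, embedding_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        # Registered as parameters because this class inherits from nn.Module.
        self.gamma = nn.Parameter(torch.ones(embedding_dim))
        self.beta = nn.Parameter(torch.zeros(embedding_dim))

    def forward(self, x):
        # Statistics are computed per word, over the embedding axis only.
        mean = x.mean(dim=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
        return self.gamma * (x - mean) / (var + self.eps).sqrt() + self.beta

torch.manual_seed(0)
x = torch.randn(5, 3, 8)          # (sentence_length, batch_size, embedding_dim)
ln = LayerNormModule(8)
out = ln(x)

print(out.shape)                   # torch.Size([5, 3, 8])
print(len(list(ln.parameters())))  # 2
print(torch.allclose(out, nn.LayerNorm(8)(x), atol=1e-5))  # True
```

Because each word is normalized independently here, the result for one sentence does not depend on which other sentences share its batch.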