# Layer Normalization Step
It's important to note that this step also has a residual connection from the word embedding + positional encoding layer. Therefore, the layer norm will look something like this:
```
normalized = layerNorm(x_i + out)
```

Where `x_i` is the word embedding + positional encoding output and `out` is the output of the self-attention layer. The residual connection adds the two together before the result is normalized.
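As a minimal sketch of this residual step in PyTorch, assuming random stand-in tensors for both inputs (the names `emb_dim` and `attn_out` and the shapes are illustrative assumptions, not taken from the actual model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb_dim = 3  # hypothetical embedding size, chosen for illustration
x_i = torch.randn(1, 2, emb_dim)       # stand-in for embedding + positional encoding
attn_out = torch.randn(1, 2, emb_dim)  # stand-in for the self-attention output

layer_norm = nn.LayerNorm(emb_dim)

# residual connection: add the two tensors, then normalize the sum
normalized = layer_norm(x_i + attn_out)
print(normalized.shape)
```

The shape is unchanged by the normalization; only the values along the embedding dimension are rescaled.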

## What is Normalization? And why do we use it?
It's easiest to think of layer normalization as normalizing each example's outputs across the features of a layer, as opposed to batch normalization, which normalizes each feature across the examples in a batch.

The idea behind normalization in general is to keep the outputs relatively stable so they don't drift toward very large positive or negative values, which can destabilize training and prevent the model from learning.
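The difference between the two methods comes down to which axis the statistics are computed over. A small sketch, assuming a toy 2-D input of 4 examples with 3 features each (the data and variable names are hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical toy data: 4 examples (batch) with 3 features each
x = torch.randn(4, 3) * 5 + 2

layer_norm = nn.LayerNorm(3)    # statistics per example, across its 3 features
batch_norm = nn.BatchNorm1d(3)  # statistics per feature, across the 4 examples (training mode)

ln_out = layer_norm(x)
bn_out = batch_norm(x)

print(ln_out.mean(dim=1))  # close to 0 for every example (row)
print(bn_out.mean(dim=0))  # close to 0 for every feature (column)
```

Because layer norm never looks across examples, its output for one example does not depend on what else is in the batch.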

In [1]:
# let's work through an example
import torch
import torch.nn as nn

In [2]:
inputs = torch.Tensor([[0.2, 0.1, 0.2], [0.5, 0.1, 0.1]])
inputs.shape

torch.Size([2, 3])

In [3]:
# let's add a batch dimension
inputs = inputs[None, :, :]
inputs.shape

torch.Size([1, 2, 3])

In [5]:
batch, seq, emb = inputs.size()

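# now let's move the sequence dimension first: (batch, seq, emb) -> (seq, batch, emb)
# permute swaps the axes; a plain reshape would only be equivalent because batch == 1
inputs = inputs.permute(1, 0, 2)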
inputs.shape

torch.Size([2, 1, 3])

In [7]:
# now we need to calculate and use layernorm across not only the embedding dimension
# but also across the batch dimension, so the parameters take the shape of the last two dimensions
parameter_shape = inputs.size()[-2:]
parameter_shape

torch.Size([1, 3])

In [9]:
# gamma (scale) and beta (shift) are learnable parameters for each layer that uses layerNorm
gamma = nn.Parameter(torch.ones(parameter_shape))  # scale, initialized to one
beta = nn.Parameter(torch.zeros(parameter_shape))  # shift, initialized to zero
gamma.shape, beta.shape

(torch.Size([1, 3]), torch.Size([1, 3]))

In [10]:
# we are applying layer norm over the last two dimensions
dims = [-1, -2]

In [11]:
# calculate the mean over the last two dimensions
mean = inputs.mean(dim=dims, keepdim=True)
mean.shape

torch.Size([2, 1, 1])

In [17]:
# calculate the std (Tensor.std applies Bessel's correction by default, dividing by N - 1)
std = inputs.std(dim=dims, keepdim=True)
std.shape

torch.Size([2, 1, 1])

In [18]:
mean, std

(tensor([[[0.1667]],
 
         [[0.2333]]]),
 tensor([[[0.0577]],
 
         [[0.2309]]]))

In [19]:
# then to calculate the layer norm output
y = (inputs - mean) / std
y

tensor([[[ 0.5774, -1.1547,  0.5774]],

        [[ 1.1547, -0.5774, -0.5774]]])

In [20]:
out = gamma * y + beta
out

tensor([[[ 0.5774, -1.1547,  0.5774]],

        [[ 1.1547, -0.5774, -0.5774]]], grad_fn=<AddBackward0>)

In [27]:
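# let's cheat and just do this using PyTorch
# note: nn.LayerNorm uses the biased variance and adds eps, so its values differ from the manual result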
layernorm = nn.LayerNorm(parameter_shape)

out = layernorm(inputs)

In [28]:
out

tensor([[[ 0.7055, -1.4110,  0.7055]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<NativeLayerNormBackward0>)
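The built-in result differs from the manual one because `Tensor.std` applies Bessel's correction by default (it divides by N - 1), whereas `nn.LayerNorm` uses the biased variance (divides by N) and adds a small `eps` (default `1e-5`) inside the square root. A sketch that reproduces the built-in output by hand, rebuilding the same `inputs` tensor so that it runs on its own (the names `manual` and `reference` are illustrative):

```python
import torch
import torch.nn as nn

# same values as the walkthrough, already in shape (seq, batch, emb) = (2, 1, 3)
inputs = torch.tensor([[[0.2, 0.1, 0.2]], [[0.5, 0.1, 0.1]]])
dims = [-1, -2]

layernorm = nn.LayerNorm(inputs.size()[-2:])

mean = inputs.mean(dim=dims, keepdim=True)
# biased variance (divide by N), unlike the default of Tensor.std
var = inputs.var(dim=dims, unbiased=False, keepdim=True)

# add eps inside the square root, as nn.LayerNorm does
manual = (inputs - mean) / torch.sqrt(var + layernorm.eps)
reference = layernorm(inputs)

print(manual)
print(torch.allclose(manual, reference, atol=1e-5))
```

A freshly created `nn.LayerNorm` has its weight set to ones and its bias set to zeros, so no explicit `gamma` or `beta` is needed for the comparison.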