# Layer Normalisation

<img src="images/transformer-architecture-4.png" width="500">

## Transformer Neural Network Architecture Overview (Continued)

Similar to the Transformer architecture overview section in the [positional encoding notebook](3-Positional_Encoding_in_Transformer_Neural_Network.ipynb), we will take a quick walkthrough of the architecture once again before we discuss layer normalisation.

Let's say we have the input sequence *My name is John* and we want to translate it from English to French. We first pad the input sequence to the maximum sequence length. Each word is then represented as a one-hot encoded vector.

**Note:** Technically, the one-hot encoded vectors would not represent whole words but word pieces called byte pair encodings. For the sake of simplicity, we will treat them as one-hot encoded word vectors.

Because these are one-hot encoded vectors, each one has the same length as the vocabulary size, that is, the number of all possible words that could occur. We then transform these one-hot encoded vectors into $512$-dimensional word vectors to form matrix $X$. Because all words are passed into the Transformer in parallel, there is no inherent sense of ordering. However, the order of words in an English sentence matters. So, we pass in some [positional encoding](3-Positional_Encoding_in_Transformer_Neural_Network.ipynb) to encode the word order.
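The step from one-hot vectors to $512$-dimensional word vectors can be sketched as a matrix multiplication, which is equivalent to looking up rows of an embedding matrix. This is only an illustration: the vocabulary size, the token ids and the random `embedding` matrix below are made up, and a real model would learn the embedding.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, d_model = 10, 512           # toy vocabulary size (assumption); 512 as in the text
token_ids = torch.tensor([3, 7, 1, 4])  # hypothetical ids for "My name is John"

# Stand-in for a learned embedding matrix (random here, for illustration only)
embedding = torch.randn(vocab_size, d_model)

# One row per word, with a single 1 at the position of the word's id
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # (4, vocab_size)

X_matmul = one_hot @ embedding   # multiply one-hot vectors by the embedding matrix
X_lookup = embedding[token_ids]  # the same result via direct row lookup

print(X_matmul.shape)
print(torch.allclose(X_matmul, X_lookup))
```

Both routes give the same matrix $X$, which is why libraries usually implement this step as a lookup rather than an explicit multiplication.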

We then add the input to the encoding to get the positionally encoded vectors, which form matrix $X^{1}$. From here, the multi-head attention unit kicks off: each positionally encoded vector is turned into three vectors, the query, key and value, each of which is a $512$-dimensional vector. So, we end up with $3 \times$ the maximum sequence length vectors in total, that is, three vectors for every position in the padded input sequence.
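One common way to obtain the query, key and value vectors is a single linear layer whose output is cut into three equal chunks. The sketch below assumes that approach; the name `qkv_layer`, the sequence length of 4 and the random stand-in for $X^{1}$ are illustrative choices, not taken from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

max_seq_len, d_model = 4, 512           # toy sequence length (assumption); 512 as in the text
x1 = torch.randn(max_seq_len, d_model)  # random stand-in for the matrix X^1

# Hypothetical projection layer: one linear map producing q, k and v at once
qkv_layer = nn.Linear(d_model, 3 * d_model)

qkv = qkv_layer(x1)             # (max_seq_len, 3 * 512)
q, k, v = qkv.chunk(3, dim=-1)  # three tensors of shape (max_seq_len, 512)

print(qkv.shape, q.shape, k.shape, v.shape)
```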

We now split each of these query, key and value vectors into 8 parts, and each part (highlighted in yellow) becomes the vector for one attention head (there are 8 attention heads in the main paper). Each of these heads is then passed into an attention block, $\text{ATTN}$. An attention block multiplies the head's query and key vectors and applies scaling and masking (masking only for the decoder) to form the attention blocks, $a_{i}$, which have size maximum sequence length $\times$ maximum sequence length. These attention blocks, $a_{i}$, tell us how much attention each word should pay to the other words.

We then multiply each attention block by the corresponding head's value vectors. In the end, we get 8 individual matrices of size maximum sequence length $\times$ $64$ each. We concatenate all 8 of them, and the final output size will be maximum sequence length $\times$ $512$ (because $64 \times 8 \text{ heads} = 512 \text{ output dimensions}$).
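The head splitting, attention and concatenation steps can be sketched as follows. This is a minimal illustration for the encoder (no masking), with random stand-ins for the query, key and value matrices. The softmax over the scaled scores is not mentioned in the text above but is part of the standard scaled dot-product attention formulation, and the helper `split_heads` is a hypothetical name.

```python
import math
import torch

torch.manual_seed(0)

max_seq_len, d_model, num_heads = 4, 512, 8
head_dim = d_model // num_heads  # 512 / 8 = 64 dimensions per head

# Random stand-ins for the query, key and value matrices
q = torch.randn(max_seq_len, d_model)
k = torch.randn(max_seq_len, d_model)
v = torch.randn(max_seq_len, d_model)


def split_heads(t):
    # (seq, d_model) -> (heads, seq, head_dim)
    return t.reshape(max_seq_len, num_heads, head_dim).permute(1, 0, 2)


qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

# Multiply queries by keys and scale: one (seq x seq) block per head
scores = qh @ kh.transpose(-2, -1) / math.sqrt(head_dim)  # (8, seq, seq)
attention = torch.softmax(scores, dim=-1)                 # each row sums to 1

values = attention @ vh                                   # (8, seq, 64)

# Concatenate the 8 heads back into (seq, 512)
out = values.permute(1, 0, 2).reshape(max_seq_len, d_model)

print(attention.shape, values.shape, out.shape)
```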

This output matrix sits just before the normalisation layer. The normalisation layer takes in this output as well as a residual/skip connection carrying the matrix obtained after positional encoding. These residual connections ensure that a stronger information signal flows through deep networks. This is needed because, as we keep backpropagating through many layers, the gradient updates can shrink towards zero and the model stops learning. This issue is famously known as vanishing gradients. To prevent it, we feed stronger signals from the input into different parts of the network.
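As a rough sketch of this step, the residual connection is an element-wise addition followed by normalisation. The example below uses PyTorch's built-in `nn.LayerNorm` over the $512$ embedding dimensions, with random tensors standing in for the positionally encoded input and the attention output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

max_seq_len, d_model = 4, 512
x1 = torch.randn(max_seq_len, d_model)        # stand-in for the positionally encoded input
attn_out = torch.randn(max_seq_len, d_model)  # stand-in for the multi-head attention output

norm = nn.LayerNorm(d_model)

# Residual/skip connection: add the input back, then normalise
out = norm(x1 + attn_out)

row_mean = out.mean(dim=-1)
row_std = ((out - out.mean(dim=-1, keepdim=True)) ** 2).mean(dim=-1).sqrt()
print(out.shape)
print(row_mean)  # close to 0 for every position
print(row_std)   # close to 1 for every position
```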

![diagram](images/layer-normalisation.png)

## What is Layer Normalisation? Why perform Layer Normalisation?

Activations of neurons will be a wide range of positive and negative values. Normalisation encapsulates these values within a much smaller range and typically centres around zero. What this allows for is much more stable training during the back propagation phase and when we perform a gradient update step, we are taking much more even and consistent steps — so it is now easier to learn and hence, faster to train till the model gets to the optimal parameter values.

Layer normalisation is one strategy for applying normalisation in a neural network. In this case, we ensure the activation values of every neuron in every layer are normalised such that all the activation values in a layer are centred with unit variance (i.e. a mean of $0$ and a standard deviation of $1$).

![diagram](images/formula.png)

To understand layer normalisation in more detail, let's say X, Y, Z and O are the activation vectors of the layers shown in the image above. In typical neural network fashion, we'd apply some activation function, $f$, to the weights, $W$, multiplied by some vector, $X$, plus a bias, $b$ (see first equation in image above). This is without any kind of normalisation.

To perform normalisation, we'd subtract the mean, $\mu$, of the activation values from the output of the activation function above and divide the result by the standard deviation, $\sigma$, of the activation values. We would also scale and shift the result with learnable parameters, $\gamma$ and $\beta$ (see second equation in image above).
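Since the two equations live in the image, here is a written-out sketch of what they plausibly look like, assuming the standard layer normalisation form. The intermediate symbol $A$ for the un-normalised activations is introduced here only for illustration.

\begin{align}
A = f(W X + b) \nonumber \\
\text{out} = \gamma \cdot \frac{A - \mu}{\sigma} + \beta \nonumber
\end{align}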

As we keep getting more and more inputs to the network above over time, and we keep performing the back propagation step, the learnable parameters, $\gamma$ and $\beta$ are going to change and be learned in order to optimize the objective of the loss function.
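To see that $\gamma$ and $\beta$ really are updated by back propagation, here is a small sketch using PyTorch's built-in `nn.LayerNorm`, where $\gamma$ is stored as `weight` and $\beta$ as `bias`. The target values, loss and learning rate are made up purely for illustration.

```python
import torch
import torch.nn as nn

norm = nn.LayerNorm(3)  # gamma -> norm.weight (ones), beta -> norm.bias (zeros)
optimizer = torch.optim.SGD(norm.parameters(), lr=0.1)

x = torch.tensor([[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]])
target = torch.tensor([[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]])  # made-up targets

gamma_before = norm.weight.detach().clone()
beta_before = norm.bias.detach().clone()

# One training step: forward pass, loss, back propagation, parameter update
loss = ((norm(x) - target) ** 2).mean()
loss.backward()
optimizer.step()

print("gamma before:", gamma_before, "after:", norm.weight.detach())
print("beta before: ", beta_before, "after:", norm.bias.detach())
```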

## How to perform Layer Normalisation?

Let's say we have two input vectors:

\begin{align}

\begin{bmatrix}
0.2 & 0.1 & 0.3\\
0.5 & 0.1 & 0.1 
\end{bmatrix} \rightarrow 
\text{2 words and 3 dimensions}

\nonumber
 
\end{align}

Now, we want to perform some normalisation — specifically, layer normalisation for the matrix above. To do this, we compute the mean and standard deviation across the layer:

\begin{align}

\mu_{11} = \frac{1}{3}[0.2 + 0.1 + 0.3] = 0.2 \nonumber

\newline\nonumber
\newline\nonumber
 
\mu_{21} = \frac{1}{3}[0.5 + 0.1 + 0.1] = 0.233 \nonumber

\end{align}

We can now use these $\mu$ values to compute the standard deviations:

\begin{align}

\sigma_{11} = \sqrt{\frac{1}{3}[(0.2 - 0.2)^{2} +(0.1 - 0.2)^{2} + (0.3 - 0.2)^{2}]} = \sqrt{\frac{1}{3}[0.0 + 0.01 + 0.01]} = 0.08164 \nonumber

\newline\nonumber
\newline\nonumber
 
\sigma_{21} = \sqrt{\frac{1}{3}[(0.5 - 0.233)^{2} +(0.1 - 0.233)^{2} + (0.1 - 0.233)^{2}]} = \sqrt{\frac{1}{3}[0.071289 + 0.017689 + 0.017689]} = 0.1885 \nonumber

\end{align}

Now, we have the matrices for the means and standard deviations:

\begin{align}

\mu = \begin{bmatrix}
\mu_{11}\\
\mu_{21}
\end{bmatrix} = \begin{bmatrix}
0.2\\
0.233 
\end{bmatrix} \nonumber

\newline\nonumber
\newline\nonumber
 
\sigma = \begin{bmatrix}
\sigma_{11}\\
\sigma_{21}
\end{bmatrix} = \begin{bmatrix}
0.08164\\
0.1885
\end{bmatrix} \nonumber

\end{align}

Now, we can subtract the mean and divide by the standard deviation:

\begin{align}

Y = \frac{X - \mu}{\sigma} = \begin{bmatrix}
\frac{0.2 - 0.2}{0.08164} & \frac{0.1 - 0.2}{0.08164} & \frac{0.3 - 0.2}{0.08164}\\
\frac{0.5 - 0.233}{0.1885} & \frac{0.1 - 0.233}{0.1885} & \frac{0.1 - 0.233}{0.1885}
\end{bmatrix} = \begin{bmatrix}
0.0 & -1.2248 & 1.2248\\
1.414 & -0.707 & -0.707
\end{bmatrix} \nonumber

\end{align}

\begin{align}

\text{out} = \gamma \cdot Y + \beta \nonumber

\end{align}

\begin{align}

\gamma, \beta \in \mathbb{R}^{3} \nonumber

\end{align}

What you'll notice is that if $\gamma$ is set to $1$ and $\beta$ is set to $0$, $\text{out}$ will be the same as $Y$. You will also notice that for every row of $Y$, the mean, $\mu$, is $0$ and the standard deviation, $\sigma$, is close to $1$. These values are therefore much more tractable, and training becomes much more stable.
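The hand calculation can be double-checked with a few lines of plain Python. This sketch uses only the standard library, takes the population standard deviation (dividing by $N$, as in the formulas above) and leaves out the small epsilon term that practical implementations add.

```python
from statistics import fmean, pstdev

X = [[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]]  # 2 words, 3 dimensions

Y = []
for row in X:
    mu = fmean(row)      # mean of the row
    sigma = pstdev(row)  # population standard deviation (divides by N)
    Y.append([(x - mu) / sigma for x in row])

# Each normalised row should have mean 0 and standard deviation 1
row_means = [fmean(row) for row in Y]
row_stds = [pstdev(row) for row in Y]

for row in Y:
    print([round(value, 4) for value in row])
print(row_means, row_stds)
```

The small differences from the hand-computed matrix come from rounding $\mu$ and $\sigma$ part-way through the manual calculation.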

In [1]:
import torch
import torch.nn as nn

You'll notice here that a batch dimension, `B`, has been added to the exact same input as in our example above. In practice, during training, we would typically have a batch dimension because it helps parallelise training, which makes training faster.

`B` is the batch size, `S` is the number of words and `E` is the embedding size.

In [2]:
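inputs = torch.tensor([[[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]]])
B, S, E = inputs.size()  # [1, 2, 3]
# Swap the batch and sequence axes: (B, S, E) -> (S, B, E).
# reshape(S, B, E) only gives the same result here because B = 1.
inputs = inputs.permute(1, 0, 2)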

inputs.size()

torch.Size([2, 1, 3])

Because we now have this batch dimension, the normalisation in this notebook is computed not just across the embedding dimension of the layer but also across the batch dimension. In this case, the batch size is $1$, so it makes no difference to the result.
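For comparison, PyTorch's built-in `torch.nn.functional.layer_norm` lets us choose which trailing dimensions are normalised through its `normalized_shape` argument. The sketch below, with made-up sizes and random data, contrasts normalising over the embedding dimension only with normalising over the batch and embedding dimensions together, as this notebook does. With a batch size of $1$ the two coincide; with larger batches they generally do not.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

S, B, E = 2, 3, 4  # made-up sizes: sequence length, batch size, embedding size
x = torch.randn(S, B, E)

# Statistics computed separately for every (position, batch) vector of size E
per_token = F.layer_norm(x, normalized_shape=(E,))

# Statistics shared across the batch and embedding dimensions (this notebook's choice)
per_position = F.layer_norm(x, normalized_shape=(B, E))

print(torch.allclose(per_token, per_position))  # differs when B > 1

# With a batch of one, both choices give the same numbers
single = x[:, :1, :].contiguous()  # shape (S, 1, E)
same_a = F.layer_norm(single, normalized_shape=(E,))
same_b = F.layer_norm(single, normalized_shape=(1, E))
print(torch.allclose(same_a, same_b, atol=1e-6))
```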

In [3]:
parameters_shape = inputs.size()[-2:]  # [1, 3]
gamma = nn.Parameter(torch.ones(parameters_shape))
beta = nn.Parameter(torch.zeros(parameters_shape))

gamma.size(), beta.size()

(torch.Size([1, 3]), torch.Size([1, 3]))

We initialise `gamma` to all ones and `beta` to all zeros. Then, we work out the dimensions over which we want to compute layer normalisation: the batch dimension as well as the embedding dimension.

In [4]:
dims = [-(i + 1) for i in range(len(parameters_shape))]

dims

[-1, -2]

`[-1, -2]` simply refers to the last two dimensions, which in this case are the embedding dimension and the batch dimension. Now, we take the mean across these dimensions.

In [5]:
mean = inputs.mean(
    dim=dims, keepdim=True
)  # Taking the mean across the batch and embedding dimensions

print(mean.size())
mean

torch.Size([2, 1, 1])


tensor([[[0.2000]],

        [[0.2333]]])

In [6]:
var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
epsilon = 1e-5  # For numerical stability
std = (var + epsilon).sqrt()

print(std.size())
std

torch.Size([2, 1, 1])


tensor([[[0.0817]],

        [[0.1886]]])

In [7]:
y = (inputs - mean) / std

print(y.size())
y

torch.Size([2, 1, 3])


tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]])

In [8]:
out = gamma * y + beta

print(out.size())
out

torch.Size([2, 1, 3])


tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>)
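As a sanity check, the manual steps above can be compared against PyTorch's built-in `torch.nn.functional.layer_norm`. This sketch rebuilds the same `(S, B, E) = (2, 1, 3)` input and, like the cells above, normalises over the last two dimensions with `eps = 1e-5` and no learnable scale or shift.

```python
import torch
import torch.nn.functional as F

# Same example as above, already in (S, B, E) = (2, 1, 3) layout
inputs = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])

# Manual layer normalisation over the batch and embedding dimensions
dims = [-1, -2]
mean = inputs.mean(dim=dims, keepdim=True)
var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
manual = (inputs - mean) / (var + 1e-5).sqrt()

# Built-in version with the same normalised shape and epsilon
reference = F.layer_norm(inputs, normalized_shape=(1, 3), eps=1e-5)

max_abs_diff = (manual - reference).abs().max().item()
print(manual)
print(max_abs_diff)  # should be tiny (floating-point noise)
```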

## Layer Normalisation Class

In [9]:
import torch
import torch.nn as nn


class LayerNormalisation(nn.Module):
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape = parameters_shape
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(self.parameters_shape))
        self.beta = nn.Parameter(torch.zeros(self.parameters_shape))

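    def forward(self, inputs):
        # Normalise over the trailing dimensions covered by parameters_shape
        # (e.g. [-1, -2] when parameters_shape has two dimensions)
        dims = [-(i + 1) for i in range(len(self.parameters_shape))]
        mean = inputs.mean(dim=dims, keepdim=True)
        print(f"Mean \n({mean.shape}): \n{mean}")

        # Population variance; eps keeps the square root away from zero
        var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
        std = (var + self.eps).sqrt()
        print(f"Standard Deviation \n({std.shape}): \n{std}")

        # Centre and scale to zero mean and (roughly) unit variance
        y = (inputs - mean) / std
        print(f"y \n({y.shape}): \n{y}")

        # Apply the learnable scale (gamma) and shift (beta)
        out = self.gamma * y + self.beta
        print(f"out \n({out.shape}): \n{out}")

        return out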

In [10]:
batch_size = 3
sentence_length = 5
embedding_dim = 8
inputs = torch.randn(sentence_length, batch_size, embedding_dim)

print(f"inputs \n({inputs.size()}) \n{inputs}")

inputs 
(torch.Size([5, 3, 8])) 
tensor([[[ 0.8309, -1.4533,  1.6044, -0.5960,  0.7909,  1.0906, -0.0633,
          -1.0680],
         [ 1.7214,  0.0474,  0.0555, -2.1137, -0.8513, -0.4475,  0.7492,
          -0.9332],
         [-0.5913, -1.0556,  0.4406, -0.4345,  0.9959, -1.5633,  0.4680,
           0.3614]],

        [[-0.8853,  0.3268,  1.1512, -0.0673, -1.1856,  1.2611, -0.9168,
           1.5467],
         [-0.8143, -0.8634,  1.2919,  0.7863,  1.8642,  0.6072,  0.7174,
           0.1674],
         [ 0.0502, -1.3253, -0.9979,  0.3378, -1.2231,  0.2320, -1.0998,
           0.4230]],

        [[ 0.7130,  1.3595,  0.0569,  2.2373, -1.5809, -0.7621,  0.4687,
          -0.6500],
         [-0.2873, -0.4107, -1.8128,  2.2755,  1.0251,  0.3668, -0.5424,
          -1.5654],
         [ 0.4478, -1.3341,  0.5928, -0.5933, -1.9637, -0.3187,  0.6185,
           0.0860]],

        [[-0.6113, -1.2800,  1.4437,  0.2136, -0.3700, -2.3754, -1.2255,
          -1.3245],
         [-0.2706,  0.0489,  1.

In [11]:
parameters_shape = inputs.size()[-2:]
layer_norm = LayerNormalisation(parameters_shape)

In [12]:
out = layer_norm(inputs)

Mean 
(torch.Size([5, 1, 1])): 
tensor([[[-0.0839]],

        [[ 0.0577]],

        [[-0.0656]],

        [[ 0.0037]],

        [[-0.1786]]])
Standard Deviation 
(torch.Size([5, 1, 1])): 
tensor([[[0.9989]],

        [[0.9631]],

        [[1.1303]],

        [[1.1641]],

        [[0.8420]]])
y 
(torch.Size([5, 3, 8])): 
tensor([[[ 0.9158, -1.3709,  1.6902, -0.5126,  0.8758,  1.1758,  0.0207,
          -0.9852],
         [ 1.8073,  0.1315,  0.1396, -2.0319, -0.7682, -0.3639,  0.8341,
          -0.8501],
         [-0.5079, -0.9727,  0.5251, -0.3509,  1.0811, -1.4810,  0.5525,
           0.4458]],

        [[-0.9792,  0.2795,  1.1355, -0.1297, -1.2910,  1.2496, -1.0119,
           1.5461],
         [-0.9054, -0.9565,  1.2816,  0.7565,  1.8758,  0.5706,  0.6850,
           0.1139],
         [-0.0077, -1.4360, -1.0961,  0.2909, -1.3300,  0.1810, -1.2019,
           0.3794]],

        [[ 0.6888,  1.2608,  0.1083,  2.0374, -1.3407, -0.6163,  0.4727,
          -0.5170],
         [-0.1962, -0.3