In [1]:
import numpy as np
import torch as t

rng = np.random.default_rng()

## Batch Norm
Batch norm takes a batch of $m$ instances with $n$ features and calculates the mean and standard deviation of each feature across the batch, i.e., down each column of the $m \times n$ input matrix. It then "normalizes" the input by subtracting the mean and dividing by the standard deviation. Finally, it scales the normalized matrix by a learned parameter $\gamma$ and shifts it by another learned parameter $\beta$ (in practice, one value of each per feature).

Let's take our input matrix, with one instance per row -
$$
X = \begin{bmatrix}
\leftarrow \mathbf x_1 \rightarrow \\
\leftarrow \mathbf x_2 \rightarrow \\
\leftarrow \mathbf x_3 \rightarrow \\
\leftarrow \mathbf x_4 \rightarrow \\
\end{bmatrix}
$$

Now calculate the mean and standard deviation across the rows, giving one value per column -
$$
\mathbf \mu = mean(\mathbf x_1, \mathbf x_2, \mathbf x_3, \mathbf x_4) \\
\mathbf \sigma = std(\mathbf x_1, \mathbf x_2, \mathbf x_3, \mathbf x_4)
$$

Here both mean and standard deviation are also vectors with the same number of elements as the number of columns in the input matrix.

Now normalize the input matrix -
$$
\overline X = \begin{bmatrix}
\leftarrow (\mathbf x_1 - \mathbf \mu) / \mathbf \sigma \rightarrow \\
\leftarrow (\mathbf x_2 - \mathbf \mu) / \mathbf \sigma \rightarrow \\
\leftarrow (\mathbf x_3 - \mathbf \mu) / \mathbf \sigma \rightarrow \\
\leftarrow (\mathbf x_4 - \mathbf \mu) / \mathbf \sigma \rightarrow \\
\end{bmatrix}
$$

Now scale and shift with the learned params -
$$
X' = \overline X \gamma + \beta
$$

Initially $\gamma = 1$ and $\beta = 0$. Gradually the network learns how much to scale and shift the normalized matrix.
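
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a library implementation: the function name `batch_norm`, its arguments, and the small input matrix are assumptions made for this example, `gamma` and `beta` are passed in by hand rather than learned, and no numerical-stability term is added to the standard deviation.

```python
import numpy as np

def batch_norm(x, gamma, beta):
    """Normalize each column of x across the batch, then scale and shift."""
    mu = x.mean(axis=0)        # one mean per feature, shape (n,)
    sigma = x.std(axis=0)      # one population std per feature, shape (n,)
    x_bar = (x - mu) / sigma   # broadcasting applies mu and sigma to every row
    return x_bar * gamma + beta

# Hypothetical batch: 4 instances (rows), 3 features (columns).
x = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 13.0]])

# With the initial values gamma = 1 and beta = 0 the output is just x_bar.
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0))   # approximately 0 for every feature
print(out.std(axis=0))    # approximately 1 for every feature

# A different scale and shift moves the per-feature statistics accordingly.
out2 = batch_norm(x, gamma=np.full(3, 2.0), beta=np.full(3, 5.0))
print(out2.mean(axis=0))  # approximately 5 for every feature
print(out2.std(axis=0))   # approximately 2 for every feature
```

After normalization every feature has mean 0 and standard deviation 1 across the batch, so $\gamma$ and $\beta$ directly set the standard deviation and mean of the output.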

![batchnorm](./batchnorm.png)

Let's see this with a concrete example -
$$
X = \begin{bmatrix}
0.87717015 & 0.7769747 \\
0.12235527 & 0.6907834 \\
0.6839817 & 0.23128869 \\
0.56366396 & 0.3721697 \\
\end{bmatrix}
$$

Here is the mean of each column (i.e., the mean taken across the rows) -
$$
\mu_0 = (0.87717015 + 0.12235527 + 0.6839817 + 0.56366396) / 4 = 0.5617928 \\
\mu_1 = (0.7769747 + 0.6907834 + 0.23128869 + 0.3721697) / 4 = 0.5178041
$$

We can get this by specifying `axis=0` in the `np.mean` function.

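The standard deviation of each column can be calculated similarly with `np.std` and `axis=0`, which by default uses the population formula (dividing by the number of rows).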
$$
\sigma_0 = 0.27726424 \\
\sigma_1 = 0.22382565
$$

$$
X = \begin{bmatrix}
0.87717015 & 0.7769747 \\
0.12235527 & 0.6907834 \\
0.6839817 & 0.23128869 \\
0.56366396 & 0.3721697 \\
\end{bmatrix} \\
\mu = \begin{bmatrix}
0.5617928 & 0.5178041
\end{bmatrix} \\
\sigma = \begin{bmatrix}
0.27726424 & 0.22382565
\end{bmatrix}
$$

Now normalize every row with these per-column statistics -
$$
\overline X = \begin{bmatrix}
(0.87717015 - 0.5617928) / 0.27726424 & (0.7769747 - 0.5178041) / 0.22382565 \\
(0.12235527 - 0.5617928) / 0.27726424 & (0.6907834 - 0.5178041) / 0.22382565 \\
(0.6839817 - 0.5617928) / 0.27726424 & (0.23128869 - 0.5178041) / 0.22382565 \\
(0.56366396 - 0.5617928) / 0.27726424 & (0.3721697 - 0.5178041) / 0.22382565 \\
\end{bmatrix}
$$

In [2]:
x = np.array([
    [0.87717015, 0.7769747 ],
    [0.12235527, 0.6907834 ],
    [0.6839817 , 0.23128869],
    [0.56366396, 0.3721697 ]], dtype=np.float32)

In [3]:
print((0.87717015 - 0.5617928) / 0.27726424)
print((0.7769747 - 0.5178041) / 0.22382565)

print((0.12235527 - 0.5617928) / 0.27726424)
print((0.6907834 - 0.5178041) / 0.22382565)

print((0.6839817 - 0.5617928) / 0.27726424)
print((0.23128869 - 0.5178041) / 0.22382565)

print((0.56366396 - 0.5617928) / 0.27726424)
print((0.3721697 - 0.5178041) / 0.22382565)

1.1374613257014319
1.1579128665548388
-1.584905179261487
0.7728305491350078
0.44069476828313686
-1.28008300210454
0.0067486524767852605
-0.6506600114866192


In [4]:
x_mean = np.mean(x, axis=0)
x_std = np.std(x, axis=0)
print(x_mean, x_std)

[0.5617928 0.5178041] [0.27726424 0.22382565]


In [5]:
x_norm = (x - x_mean) / x_std
x_norm

array([[ 1.1374613 ,  1.1579129 ],
       [-1.5849051 ,  0.77283055],
       [ 0.44069487, -1.2800831 ],
       [ 0.00674868, -0.6506599 ]], dtype=float32)

In [6]:
t.nn.BatchNorm1d(x.shape[-1])(t.from_numpy(x))

tensor([[ 1.1374,  1.1578],
        [-1.5848,  0.7728],
        [ 0.4407, -1.2800],
        [ 0.0067, -0.6506]], grad_fn=<NativeBatchNormBackward0>)
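
The PyTorch output differs from the NumPy result in the fourth decimal place (for example 1.1578 versus 1.1579). A likely explanation is that `BatchNorm1d` adds a small constant `eps` (default `1e-5`) to the variance before taking the square root. The sketch below applies that adjustment by hand in NumPy; the variable names are chosen for this example only, and the `torch_out` values are copied from the output above.

```python
import numpy as np

x = np.array([
    [0.87717015, 0.7769747],
    [0.12235527, 0.6907834],
    [0.6839817, 0.23128869],
    [0.56366396, 0.3721697]])

eps = 1e-5  # default value of the eps argument of torch.nn.BatchNorm1d

x_mean = x.mean(axis=0)
x_var = x.var(axis=0)  # population variance of each column (divides by m)

# Same normalization as before, but with eps added under the square root.
x_norm_eps = (x - x_mean) / np.sqrt(x_var + eps)
print(np.round(x_norm_eps, 4))

# Values printed by BatchNorm1d above, for comparison.
torch_out = np.array([
    [1.1374, 1.1578],
    [-1.5848, 0.7728],
    [0.4407, -1.2800],
    [0.0067, -0.6506]])
print(np.abs(x_norm_eps - torch_out).max())  # largest remaining difference
```

Because `eps` is tiny compared with the column variances here, the effect is only visible in the last printed digit.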

## Layer Norm
Instead of normalizing each feature across the batch, layer norm normalizes the features of each instance using the mean and standard deviation of that instance alone. It does not depend on the batch at all. The learned scale and shift work as before.

Let's take our input matrix, again with one instance per row -
$$
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & x_{14} \\
x_{21} & x_{22} & x_{23} & x_{24} \\
\end{bmatrix}
$$

Now the mean and standard deviation are calculated for each instance independently -
$$
\mu_1 = mean(x_{11}, x_{12}, x_{13}, x_{14}) \\
\sigma_1 = std(x_{11}, x_{12}, x_{13}, x_{14}) \\
$$

$$
\mu_2 = mean(x_{21}, x_{22}, x_{23}, x_{24}) \\
\sigma_2 = std(x_{21}, x_{22}, x_{23}, x_{24})
$$

Here the mean and standard deviation for each instance are scalars.

Now normalize the input matrix -
$$
\overline X = \begin{bmatrix}
(x_{11} - \mu_1)/\sigma_1 & (x_{12} - \mu_1)/\sigma_1 & (x_{13} - \mu_1)/\sigma_1 & (x_{14} - \mu_1)/\sigma_1 \\
(x_{21} - \mu_2)/\sigma_2 & (x_{22} - \mu_2)/\sigma_2 & (x_{23} - \mu_2)/\sigma_2 & (x_{24} - \mu_2)/\sigma_2 \\
\end{bmatrix}
$$

And then scale and shift as before -
$$
X' = \overline X \gamma + \beta
$$

![layernorm](./layernorm.png)
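
The layer norm steps above can be sketched in NumPy in the same way. Again this is only an illustration under stated assumptions: the function name `layer_norm` and the input values are made up for this example, `gamma` and `beta` are supplied by hand as one value per feature, and no numerical-stability term is used.

```python
import numpy as np

def layer_norm(x, gamma, beta):
    """Normalize each row of x with its own mean and std, then scale and shift."""
    mu = x.mean(axis=1, keepdims=True)    # one mean per instance, shape (m, 1)
    sigma = x.std(axis=1, keepdims=True)  # one population std per instance, shape (m, 1)
    x_bar = (x - mu) / sigma              # broadcasting applies them along each row
    return x_bar * gamma + beta

# Hypothetical batch: 2 instances (rows), 4 features (columns).
x = np.array([[1.0, 2.0, 3.0, 6.0],
              [10.0, 20.0, 40.0, 50.0]])
gamma, beta = np.ones(4), np.zeros(4)

out = layer_norm(x, gamma, beta)
print(out.mean(axis=1))  # approximately 0 for every instance
print(out.std(axis=1))   # approximately 1 for every instance

# Normalizing the first instance on its own gives the same row as in the batch,
# because no statistic is shared between instances.
alone = layer_norm(x[:1], gamma, beta)
print(np.allclose(alone[0], out[0]))
```

Using `keepdims=True` keeps the statistics as column vectors, so they broadcast along each row without an explicit `reshape`.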

In [2]:
x = rng.random((2, 4)).astype(np.float32)
x

array([[0.76992553, 0.00166408, 0.5785207 , 0.7359749 ],
       [0.55730516, 0.5911572 , 0.5388567 , 0.5622644 ]], dtype=float32)

In [3]:
x_mean = np.mean(x, axis=1)
x_std = np.std(x, axis=1)
print(x_mean, x_std)

[0.52152133 0.5623959 ] [0.30870515 0.0187566 ]


In [4]:
x_norm = (x - x_mean.reshape(-1, 1)) / x_std.reshape(-1, 1)
x_norm

array([[ 0.8046649 , -1.6839927 ,  0.18464021,  0.6946874 ],
       [-0.2714092 ,  1.5333978 , -1.2549816 , -0.00701022]],
      dtype=float32)

In [6]:
t.nn.LayerNorm(4)(t.from_numpy(x))

tensor([[ 0.8046, -1.6839,  0.1846,  0.6947],
        [-0.2676,  1.5121, -1.2375, -0.0069]],
       grad_fn=<NativeLayerNormBackward0>)
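
The first row matches the NumPy result, but the second row is noticeably different (for example 1.5121 versus 1.5334). A plausible reason is the `eps` term (default `1e-5`) that `LayerNorm` adds to the variance: the second instance has a very small standard deviation (about 0.0188), so `eps` is no longer negligible next to its variance. The sketch below repeats the calculation with `eps` included; the variable names are chosen for this example only, and the `torch_out` values are copied from the output above.

```python
import numpy as np

x = np.array([
    [0.76992553, 0.00166408, 0.5785207, 0.7359749],
    [0.55730516, 0.5911572, 0.5388567, 0.5622644]])

eps = 1e-5  # default value of the eps argument of torch.nn.LayerNorm

x_mean = x.mean(axis=1, keepdims=True)
x_var = x.var(axis=1, keepdims=True)  # population variance of each row

# Same normalization as before, but with eps added under the square root.
x_norm_eps = (x - x_mean) / np.sqrt(x_var + eps)
print(np.round(x_norm_eps, 4))

# Size of eps relative to each instance's variance: much larger for row 2.
eps_ratio = eps / x_var.ravel()
print(eps_ratio)

# Values printed by LayerNorm above, for comparison.
torch_out = np.array([
    [0.8046, -1.6839, 0.1846, 0.6947],
    [-0.2676, 1.5121, -1.2375, -0.0069]])
print(np.abs(x_norm_eps - torch_out).max())  # largest remaining difference
```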