# LayerNormalization

## Resources
1. [ML Explained](https://mlexplained.com/2018/01/13/weight-normalization-and-layer-normalization-explained-normalization-in-deep-learning-part-2/)

## BatchNorm Limitations
BatchNorm computes the mean and variance of each mini-batch and normalizes each feature according to the mini-batch statistics. 

Ideally, we would normalize the inputs to a layer with the mean and variance of the entire dataset. However, recomputing these global statistics for every parameter update is computationally expensive, so BatchNorm estimates them from the current mini-batch. These estimates are noisy: they deviate from the global statistics and vary from batch to batch, and the deviation tends to grow as the batch gets smaller. The batch size therefore has to be chosen with care.
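The effect of batch size on the quality of the estimate can be checked numerically. The following is a minimal sketch, assuming a synthetic Gaussian dataset; the names `data` and `batch_mean_error` and the chosen batch sizes are illustrative and not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 10,000 samples with 4 features each.
data = rng.normal(loc=2.0, scale=3.0, size=(10_000, 4))
global_mean = data.mean(axis=0)


def batch_mean_error(batch_size, num_batches=200):
    """Average absolute gap between mini-batch means and the global mean."""
    errors = []
    for _ in range(num_batches):
        idx = rng.choice(len(data), size=batch_size, replace=False)
        batch_mean = data[idx].mean(axis=0)
        errors.append(np.abs(batch_mean - global_mean).mean())
    return float(np.mean(errors))


small_batch_error = batch_mean_error(batch_size=4)
large_batch_error = batch_mean_error(batch_size=256)
print(f"batch size 4:   {small_batch_error:.3f}")
print(f"batch size 256: {large_batch_error:.3f}")
```

With this setup the small batches should give a noticeably larger average error than the large ones, which is the sensitivity to batch size described above.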

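BatchNorm is also difficult to apply to RNNs: it would need separate statistics for every time step, and sequence lengths can differ from sample to sample.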

## Explanation

### General LayerNormalization
Given a mini-batch of $m$ inputs, $B=\{x_{1},x_{2},\dots,x_{m}\}$, each sample $x_{i}$ contains $K$ elements, i.e. the flattened $x_{i}$ has length $K$. Compute the mean and the variance of each sample in the mini-batch separately. For sample $x_{i}$, whose flattened form is $\{x_{i,1},x_{i,2},\dots,x_{i,K}\}$, we obtain its mean $\mu_{i}$ and variance $\sigma^2_{i}$, normalize every element with them, and then scale and shift the result with the parameters $\gamma$ and $\beta$.

$$\mu_i = \frac{1}{K} \sum_{k=1}^{K} x_{i,k}$$

$$\sigma_i^2 = \frac{1}{K} \sum_{k=1}^{K} (x_{i,k} - \mu_i)^2$$

$$\hat{x}_{i,k} = \frac{x_{i,k}-\mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$

$$y_i = \gamma \hat{x}_{i} + \beta \equiv {\text{LN}}_{\gamma, \beta} (x_i)$$

***Note:*** All statistics are computed per sample, so the output for one sample does not depend on the other samples in the mini-batch.
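The four equations above can be sketched in NumPy as follows. This is an illustrative implementation, not a reference one: the function name `layer_norm`, the default `eps=1e-5`, and the choice of length-$K$ vectors for $\gamma$ and $\beta$ are assumptions (scalars would broadcast the same way).

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize a batch x of shape (m, K), one sample (row) at a time."""
    mu = x.mean(axis=1, keepdims=True)    # per-sample mean, shape (m, 1)
    var = x.var(axis=1, keepdims=True)    # per-sample variance with 1/K, shape (m, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # scale and shift


x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
gamma = np.ones(4)   # assumed initial scale
beta = np.zeros(4)   # assumed initial shift
y = layer_norm(x, gamma, beta)
print(y)
```

Although the second row is ten times larger than the first, both rows come out almost identical after normalization, because each row is standardized with its own mean and variance.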

### LayerNormalization in Convolution
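Here normalization is done per channel within a single sample: the statistics are taken over the spatial dimensions only. (Statistics computed per sample and per channel are more commonly called instance normalization; layer normalization as usually defined pools over $H$, $W$ and $C$ together.)
Assume the input tensor has shape $[m,H,W,C]$. For each sample $i$ and each channel $c \in \{1,2,\dots,C\}$: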

$$\mu_{i,c} = \frac{1}{HW} \sum_{j=1}^{H} \sum_{k=1}^{W} x_{i,j,k,c}$$

$$\sigma_{i,c}^2 = \frac{1}{HW} \sum_{j=1}^{H} \sum_{k=1}^{W} (x_{i,j,k,c} - \mu_{i,c})^2$$

$$\hat{x}_{i,j,k,c} = \frac{x_{i,j,k,c}-\mu_{i,c}}{\sqrt{\sigma_{i,c}^2 + \epsilon}}$$

$$y_{i,:,:,c} = \gamma_c \hat{x}_{i,:,:,c} + \beta_c \equiv {\text{LN}}_{\gamma_c, \beta_c} (x_{i,:,:,c})$$
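The per-channel equations above translate to a reduction over the two spatial axes. The sketch below assumes the channels-last layout $[m,H,W,C]$ from the text; the function name `per_channel_norm`, the default `eps`, and the random test input are illustrative choices.

```python
import numpy as np


def per_channel_norm(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (m, H, W, C) over H and W, per sample and channel."""
    mu = x.mean(axis=(1, 2), keepdims=True)    # shape (m, 1, 1, C)
    var = x.var(axis=(1, 2), keepdims=True)    # shape (m, 1, 1, C)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta have shape (C,) and broadcast over the last axis.
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 5, 3))              # m=2, H=W=5, C=3
y = per_channel_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=(1, 2)))                     # close to 0 for every (sample, channel)
```

Keeping the reduced axes with `keepdims=True` lets the statistics broadcast back against the input without any explicit reshaping.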

![difference](https://i1.wp.com/mlexplained.com/wp-content/uploads/2018/01/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-01-11-11.48.12.png?resize=768%2C448&ssl=1)

In batch normalization, the statistics are computed for each feature across the batch and are shared by every example in the batch. In contrast, in layer normalization, the statistics are computed for each example across its features and are independent of the other examples.
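The difference comes down to which axis is reduced. As a small sketch, assuming a batch stored as an array of shape `(examples, features)` with made-up sizes, the snippet below compares the two means and checks that perturbing the other examples affects only the batch statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))     # 8 examples, 16 features (illustrative sizes)

batch_mean = x.mean(axis=0)      # BatchNorm: one mean per feature, shape (16,)
layer_mean = x.mean(axis=1)      # LayerNorm: one mean per example, shape (8,)

# Perturb every example except the first one.
x_shifted = x.copy()
x_shifted[1:] += 100.0

# The layer statistic of example 0 is untouched; the batch statistics move.
layer_mean_unchanged = bool(np.isclose(x_shifted.mean(axis=1)[0], layer_mean[0]))
batch_mean_changed = not np.allclose(x_shifted.mean(axis=0), batch_mean)
print(layer_mean_unchanged, batch_mean_changed)
```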