### **Layer Normalization (LN)**

**Where?**  
- Used in **RNNs**, **Transformers**, **NLP models**, and **fully connected networks**

**How?**  
- LN normalizes **across features in a single sample**, **not across the batch**
- Formula:

  $$
  \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
  $$

  where:
  - $\mu$, $\sigma^2$: mean and variance over the features of one sample
  - $\epsilon$: small constant added for numerical stability

**Key Properties:**  
- **Independent of batch size**
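- Uses the current sample's statistics at both training and inference time, so no running averages are needed
- Learnable **scale ($\gamma$)** and **shift ($\beta$)**, applied per feature after normalization: $y_i = \gamma_i \hat{x}_i + \beta_i$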
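
The formula and the learnable scale/shift can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name `layer_norm`, the input values, and the default `eps=1e-5` are assumptions chosen for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last axis, followed by scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)      # per-sample mean
    var = x.var(axis=-1, keepdims=True)      # per-sample (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized features
    return gamma * x_hat + beta              # learnable scale and shift

# Two samples with very different magnitudes (illustrative values)
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])

# gamma = 1, beta = 0 leaves the normalized values untouched
identity = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# gamma = 2, beta = 1 rescales and shifts every normalized feature
shifted = layer_norm(x, gamma=np.full(4, 2.0), beta=np.full(4, 1.0))

print("identity mean/std per sample:", identity.mean(axis=-1), identity.std(axis=-1))
print("shifted  mean/std per sample:", shifted.mean(axis=-1), shifted.std(axis=-1))
```

With the identity parameters each sample should come out with mean close to 0 and standard deviation close to 1; with $\gamma = 2$, $\beta = 1$ the per-sample mean moves to about 1 and the standard deviation to about 2.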

**Pros:**
- Works well with variable batch sizes
- Stable for sequential models (RNNs, Transformers)

**Cons:**
- May converge slightly more slowly than BN in CNNs

---


## **Numeric Example: Batch Normalization (BN) vs. Layer Normalization (LN)**

We'll use a batch of 3 samples, each with 4 features:

$$
X = 
\begin{bmatrix}
5 & 4 & 7 & 2 \\
1 & 6 & 2 & 3 \\
4 & 8 & 1 & 9
\end{bmatrix}
$$

---

## 1. **Batch Normalization (BN)**

BN normalizes **each feature across the batch**, i.e. down each column of $X$.
That means for each column (feature), we compute:

$$
\mu_j = \frac{1}{N} \sum_{i=1}^N x_{ij}, 
\quad
\sigma_j^2 = \frac{1}{N} \sum_{i=1}^N (x_{ij} - \mu_j)^2
$$

where $N=3$ (batch size).

### Example:

* Feature 1 (col 0): \[5, 1, 4] → mean = (5+1+4)/3 = 3.33, var = 2.89
* Feature 2 (col 1): \[4, 6, 8] → mean = 6, var = 2.67
* Feature 3 (col 2): \[7, 2, 1] → mean = 3.33, var = 6.89
* Feature 4 (col 3): \[2, 3, 9] → mean = 4.67, var = 9.56

Then normalize each entry:

$$
x'_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}
$$
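
As a quick check of the formula, here is the arithmetic for the first sample's first feature ($x_{11} = 5$), using the statistics of feature 1 listed above and treating $\epsilon$ as negligible (an assumption made only to keep the numbers short):

$$
x'_{11} = \frac{5 - 3.33}{\sqrt{2.89 + \epsilon}} \approx \frac{1.67}{1.70} \approx 0.98
$$

The mean and variance here come from the whole column, so changing any other sample in the batch would change this value.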

---

## 2. **Layer Normalization (LN)**

LN normalizes **each sample across all of its features**, i.e. along each row of $X$.
That means for each row (sample), we compute:

$$
\mu_i = \frac{1}{F} \sum_{j=1}^F x_{ij}, 
\quad
\sigma_i^2 = \frac{1}{F} \sum_{j=1}^F (x_{ij} - \mu_i)^2
$$

where $F=4$ (features).

### Example:

* Row 1: \[5, 4, 7, 2] → mean = 4.5, var = 3.25
* Row 2: \[1, 6, 2, 3] → mean = 3.0, var = 3.5
* Row 3: \[4, 8, 1, 9] → mean = 5.5, var = 10.25

Then normalize each entry:

$$
x'_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}
$$
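
As a quick check of the formula, here is the arithmetic for the first sample's first feature ($x_{11} = 5$), using the statistics of row 1 listed above and treating $\epsilon$ as negligible (an assumption made only to keep the numbers short):

$$
x'_{11} = \frac{5 - 4.5}{\sqrt{3.25 + \epsilon}} \approx \frac{0.5}{1.80} \approx 0.28
$$

The mean and variance here come from the sample's own row, so the other samples in the batch play no role in this value.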

---

## Python Code (Numeric Example)

```python
import numpy as np

X = np.array([[5, 4, 7, 2],
              [1, 6, 2, 3],
              [4, 8, 1, 9]], dtype=float)

# Batch Normalization (per feature across batch)
mu_batch = X.mean(axis=0)
var_batch = X.var(axis=0)
bn = (X - mu_batch) / np.sqrt(var_batch + 1e-5)

# Layer Normalization (per sample across features)
mu_layer = X.mean(axis=1, keepdims=True)
var_layer = X.var(axis=1, keepdims=True)
ln = (X - mu_layer) / np.sqrt(var_layer + 1e-5)

print("Original X:\n", X)
print("\nBatchNorm (per feature across batch):\n", np.round(bn, 3))
print("\nLayerNorm (per row across features):\n", np.round(ln, 3))
```

---

## Output (rounded)

```
Original X:
 [[5. 4. 7. 2.]
 [1. 6. 2. 3.]
 [4. 8. 1. 9.]]

BatchNorm (per feature across batch):
 [[ 0.981 -1.225  1.397 -0.863]
 [-1.373  0.    -0.508 -0.539]
 [ 0.392  1.225 -0.889  1.402]]

LayerNorm (per row across features):
 [[ 0.277 -0.277  1.387 -1.387]
 [-1.069  1.604 -0.535  0.   ]
 [-0.469  0.781 -1.406  1.093]]
```
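
One way to sanity-check results like these is to look at which axis ends up with zero mean. The sketch below recomputes both normalizations on the same matrix and compares column means against row means; the variable names and the `eps` value are assumptions carried over from the example, not part of any library API.

```python
import numpy as np

X = np.array([[5, 4, 7, 2],
              [1, 6, 2, 3],
              [4, 8, 1, 9]], dtype=float)
eps = 1e-5

# Batch norm: statistics per column (feature), shared by all rows
bn = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)
# Layer norm: statistics per row (sample), shared by all columns
ln = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)

# BN output: every column has mean ~0, rows generally do not
print("BN column means:", np.round(bn.mean(axis=0), 3))
print("BN row means:   ", np.round(bn.mean(axis=1), 3))

# LN output: every row has mean ~0, columns generally do not
print("LN row means:   ", np.round(ln.mean(axis=1), 3))
print("LN column means:", np.round(ln.mean(axis=0), 3))
```

If the BN column means or the LN row means are not close to zero, the normalization was probably applied along the wrong axis.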

---

 **Key difference**:

* **BN** normalizes **column-wise (feature-wise across batch)** → all rows share the same feature stats.
* **LN** normalizes **row-wise (within each sample)** → each row is independent of the batch.
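
The batch dependence can be demonstrated with a small experiment: replace one sample in the batch and see whether the normalized values of a different, untouched sample move. This is a minimal sketch; the helper names `batch_norm` and `layer_norm` and the replacement row of zeros are assumptions chosen for illustration.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Statistics per feature (column), computed across the batch
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # Statistics per sample (row), computed across its features
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

X = np.array([[5, 4, 7, 2],
              [1, 6, 2, 3],
              [4, 8, 1, 9]], dtype=float)

# Same batch, but with the third sample replaced by zeros
X_changed = X.copy()
X_changed[2] = 0.0

# Compare the normalized first sample before and after the change
bn_row0_same = np.allclose(batch_norm(X)[0], batch_norm(X_changed)[0])
ln_row0_same = np.allclose(layer_norm(X)[0], layer_norm(X_changed)[0])

print("BN output for sample 1 unchanged:", bn_row0_same)  # False
print("LN output for sample 1 unchanged:", ln_row0_same)  # True
```

The first sample was never modified, yet its BN output shifts because the column statistics changed, while its LN output stays identical.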

---


## PyTorch Code (Same Example)

```python
import torch
import torch.nn as nn

# Input batch: 3 samples, 4 features each
X = torch.tensor([[5., 4., 7., 2.],
                  [1., 6., 2., 3.],
                  [4., 8., 1., 9.]])

# BatchNorm1d: normalizes over batch dimension (N) for each feature
bn = nn.BatchNorm1d(num_features=4, affine=False, track_running_stats=False)

# LayerNorm: normalizes per sample (over features)
ln = nn.LayerNorm(normalized_shape=4, elementwise_affine=False)

print("Original X:\n", X)

print("\nBatchNorm output:\n", torch.round(bn(X) * 1000) / 1000)  # rounded
print("\nLayerNorm output:\n", torch.round(ln(X) * 1000) / 1000)
```

## PyTorch Output (rounded)

```
Original X:
 tensor([[5., 4., 7., 2.],
        [1., 6., 2., 3.],
        [4., 8., 1., 9.]])

BatchNorm output:
 tensor([[ 0.9810, -1.2250,  1.3970, -0.8630],
        [-1.3730, -0.0000, -0.5080, -0.5390],
        [ 0.3920,  1.2250, -0.8890,  1.4020]])

LayerNorm output:
 tensor([[ 0.2770, -0.2770,  1.3870, -1.3870],
        [-1.0690,  1.6040, -0.5350,  0.0000],
        [-0.4690,  0.7810, -1.4060,  1.0930]])
```
