### 🚀 Normalization in Neural Networks

Normalization helps keep the values inside a neural network at a **stable scale**, which tends to make training:

- faster
- more stable
- more resistant to exploding/vanishing gradients
- smoother and more predictable

Without normalization, networks can:

- **explode** (values grow too large)  
- **vanish** (values shrink too small)  
- **oscillate** (jump around during training)  
- **learn very slowly**  

Normalization gives each layer inputs with a more **consistent distribution**, which usually makes the network easier to optimize.

---

## ✅ Why Normalization Is Needed

### 1. **Internal Covariate Shift**

Internal Covariate Shift refers to the phenomenon where the **distribution of activations inside the network keeps changing** as the model learns.

**Why it happens:**

- Layer 1 outputs activations `α₁`
- Layer 2 uses `α₁` as input
- When Layer 1 updates during training, the distribution of `α₁` shifts
- Layer 2 must constantly adapt to these new inputs

This slows learning because **every layer keeps chasing a moving target**.

Normalization (BatchNorm, LayerNorm, etc.) keeps the activations consistent, reducing this shift.

**Effect:**  
✅ Faster convergence  
✅ More stable training  
✅ Often higher accuracy  

---

### 2. **Activation Drift**

Activation drift means activations or gradients gradually start drifting toward:

- **very large values** → exploding gradients
- **very small values** → vanishing gradients

Both issues break learning:

- exploding gradients → unstable updates
- vanishing gradients → slow or zero learning

Normalization helps by keeping activations centered and scaled (e.g., mean ≈ 0, variance ≈ 1).

**Effect:**  
✅ Gradients become stable  
✅ Training becomes smoother  
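
The drift described above can be seen in a small experiment. This is a minimal sketch, not a benchmark: the helper `run_stack`, the 10-layer depth, the width of 64, and the weight scale of 0.5 are all assumptions chosen so that un-normalized activations grow quickly.

```python
import torch

torch.manual_seed(0)

def run_stack(x, weights, normalize):
    """Pass x through a stack of linear maps, optionally normalizing after each one."""
    for w in weights:
        x = x @ w
        if normalize:
            # Center and scale each row over its features (mean 0, variance 1)
            mean = x.mean(dim=-1, keepdim=True)
            var = x.var(dim=-1, keepdim=True, unbiased=False)
            x = (x - mean) / torch.sqrt(var + 1e-5)
    return x

# Hypothetical setup: 10 layers of width 64 with deliberately oversized weights
weights = [torch.randn(64, 64) * 0.5 for _ in range(10)]
x = torch.randn(8, 64)

plain_std = run_stack(x, weights, normalize=False).std().item()
normed_std = run_stack(x, weights, normalize=True).std().item()
print(f"std without normalization: {plain_std:.3e}")
print(f"std with normalization:    {normed_std:.3f}")
```

Without normalization the activation scale grows by a roughly constant factor per layer, while the normalized stack stays close to unit scale.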

---

### 3. **Distribution Drift (General Case)**

Distribution Drift happens when the **statistical distribution of data** changes over time.

Examples:

- Input data distribution changes  
- Training set and deployment data differ  
- Internal activations drift as weights update  

When distributions drift:

- model predictions degrade  
- training becomes inconsistent  
- the model becomes harder to optimize  

Normalization reduces internal drift and improves model robustness.

---

## ✅ What Normalization Actually Does (Intuition)

Normalization layers typically:

1. Compute statistics (mean, variance, norms, etc.)  
2. Use them to **scale and center** the activations  
3. Optionally apply **learnable parameters**  
   - gamma (γ) → scale
   - beta (β) → shift

This ensures activations are:

- not too large  
- not too small  
- well-conditioned for optimization  
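
The three steps listed above can be sketched in a few lines. This is an illustrative sketch under assumptions: the function name `normalize_features`, the choice to reduce over the last dimension, and the example values of `gamma` and `beta` are not from any library.

```python
import torch

def normalize_features(x, gamma, beta, eps=1e-5):
    # 1. Compute statistics over the last (feature) dimension
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    # 2. Center and scale to roughly zero mean and unit variance
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # 3. Apply the learnable scale (gamma) and shift (beta)
    return gamma * x_hat + beta

x = torch.tensor([[10.0, 20.0, 30.0, 40.0]])
gamma = torch.full((4,), 2.0)   # example scale (assumed value)
beta = torch.full((4,), 0.5)    # example shift (assumed value)
y = normalize_features(x, gamma, beta)
print(y.mean().item(), y.std(unbiased=False).item())
```

With these values the output has mean close to `beta` (0.5) and standard deviation close to `gamma` (2.0), which shows how the learnable parameters let the network choose its own scale after normalization.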

---

## ✅ Types of Normalization (Short Overview)

### **1. Batch Normalization (BatchNorm)**
- Normalizes each feature (or channel) using statistics computed across the batch dimension  
- Works very well in CNNs  
- Not ideal for small batch sizes or LLMs

### **2. Layer Normalization (LayerNorm)**
- Normalizes across features within a single token  
- Used in Transformers and LLMs  
- Works with batch size = 1  
- No dependency on batch statistics

### **3. RMSNorm**
- Variant of LayerNorm that skips mean subtraction and rescales by the root mean square of the features  
- Common in newer LLMs (e.g., LLaMA, Mistral)
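
A minimal sketch of the idea, assuming a single learnable scale and no shift. The class name `SimpleRMSNorm` and the `eps` value are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn

class SimpleRMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable scale (gamma)

    def forward(self, x):
        # Root mean square over the feature dimension; the mean is not subtracted
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

torch.manual_seed(0)
x = torch.rand(2, 5, 7)                    # (batch_size, context_len, embed_dim)
y = SimpleRMSNorm(7)(x)
out_rms = y.pow(2).mean(dim=-1).sqrt()     # RMS of every token after normalization
print(y.shape, out_rms)
```

After normalization every token has a root mean square close to 1, while its mean is left unconstrained.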

### **4. GroupNorm / InstanceNorm**
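- GroupNorm normalizes over groups of channels within each sample  
- InstanceNorm normalizes each channel of each sample separately  
- Used mostly in computer vision architectures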
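
Both layers are available in PyTorch. The sketch below is only an illustration: the tensor shape and group counts are assumed values. It also checks that GroupNorm with one group per channel behaves like InstanceNorm at initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(2, 6, 12, 12)  # (batch_size, channels, height, width)

group_norm = nn.GroupNorm(num_groups=3, num_channels=6)          # 3 groups of 2 channels
instance_norm = nn.InstanceNorm2d(6)                             # every channel on its own
per_channel_groups = nn.GroupNorm(num_groups=6, num_channels=6)  # one group per channel

gn_out = group_norm(x)
same = torch.allclose(per_channel_groups(x), instance_norm(x), atol=1e-5)
print(gn_out.shape, same)
```

Neither layer uses statistics from other samples, so both behave the same way for any batch size.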

---

## ✅ Why LLMs Prefer LayerNorm (instead of BatchNorm)

- Batch size at inference is often 1 → there are no useful batch statistics to compute  
- Sequence lengths vary → batch statistics become inconsistent  
- Transformers work token-wise → LayerNorm fits naturally  

LayerNorm keeps token representations stable, regardless of batch arrangement.
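
A small check of this batch independence, written as a sketch: the feature size of 7 and the two hypothetical batches `batch_a` and `batch_b` are assumptions, and `BatchNorm1d` is left in its default training mode so that it uses batch statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_norm = nn.LayerNorm(7)
batch_norm = nn.BatchNorm1d(7)  # training mode by default, so it uses batch statistics

sample = torch.rand(1, 7)
# The same sample placed in two batches with very different companions
batch_a = torch.cat([sample, torch.rand(3, 7)])
batch_b = torch.cat([sample, torch.rand(3, 7) * 10])

ln_same = torch.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0], atol=1e-6)
bn_same = torch.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0], atol=1e-6)
print(f"LayerNorm output unchanged: {ln_same}")
print(f"BatchNorm output unchanged: {bn_same}")
```

The LayerNorm output for `sample` is identical in both batches, while the BatchNorm output changes with the other rows in the batch.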

---

## ✅ Summary

Normalization solves several major problems:

- ✅ internal covariate shift
- ✅ activation drift
- ✅ distribution drift
- ✅ exploding/vanishing gradients
- ✅ slow/unstable training

By keeping activations stable, normalization enables deep networks, especially Transformers and LLMs, to train efficiently and reliably.



### ✅ How Batch Normalization Works (Step-by-Step)

For a given activation \( x \) with values \( x_1, \dots, x_m \) across a mini-batch of size \( m \) during training:

---

### **Step 1: Compute Batch Mean**

$$
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
$$

---

### **Step 2: Compute Batch Variance**

$$
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
$$

---

### **Step 3: Normalize**

$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$

(ε is a small constant that avoids division by zero)
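
The three steps can be traced on a tiny batch. This is a worked sketch with assumed values: a batch of m = 4 samples with 2 features each, and `eps = 1e-5`.

```python
import torch

x = torch.tensor([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])  # m = 4 samples, 2 features
eps = 1e-5

mu_B = x.mean(dim=0)                              # Step 1: batch mean per feature (2.5 and 25.0)
sigma2_B = ((x - mu_B) ** 2).mean(dim=0)          # Step 2: batch variance, dividing by m (1.25 and 125.0)
x_hat = (x - mu_B) / torch.sqrt(sigma2_B + eps)   # Step 3: normalize

print(mu_B)
print(sigma2_B)
print(x_hat)
```

Both columns end up with mean 0 and variance close to 1, even though the second feature started out ten times larger than the first.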



#### **Batch Normalization for Neural Nets**

In [3]:
import torch
hidden_layer1 = torch.nn.Linear(128, 30)
hidden_layer1

Linear(in_features=128, out_features=30, bias=True)

In [5]:
# (batch_size, input_features)
input_batch = torch.rand(2, 128)

outs = hidden_layer1(input_batch)
print(outs.shape)
outs

torch.Size([2, 30])


tensor([[ 0.3119, -0.5689, -0.2118, -0.3213, -0.3049, -0.2963,  0.1679,  0.1045,
         -0.4935, -0.3426,  0.1636,  0.3444,  0.4053, -0.3914, -0.0169, -0.4242,
          0.2364, -0.0529,  0.1380,  0.2431, -0.1235, -0.1448,  0.4815, -0.0974,
          0.2168, -0.1358,  0.1592,  0.0519,  0.2798,  0.0597],
        [ 0.6261, -0.3463,  0.0785, -0.2130, -0.3671, -0.1707,  0.2661,  0.5566,
         -0.5229, -0.1492, -0.0131,  0.4890,  0.2283, -0.0953, -0.2422, -0.0629,
          0.3819, -0.3854, -0.0651,  0.0377, -0.0174,  0.2092,  0.6284, -0.0103,
          0.4670,  0.3811,  0.0269,  0.1123,  0.3464,  0.2016]],
       grad_fn=<AddmmBackward0>)

In [None]:
# Now we'll normalize the layer outputs:
# compute the mean and variance over the last dimension (dim=-1),
# then map every element to (x - mean) / var^0.5
# Note: reducing over dim=-1 gives one statistic per sample (row), which is what LayerNorm does;
# BatchNorm on a (batch_size, hidden_size) tensor would reduce over dim=0 instead
import torch

def normalize(input):
    # dim=-1 reduces over the last dimension; keepdim=True keeps that dimension with size 1
    mean_without_keep_dim = torch.mean(input, dim=-1)
    mean_with_keep_dim = torch.mean(input, dim=-1, keepdim=True)
    print(f"Mean without Keepdim : {mean_without_keep_dim.shape}")
    print(f"Mean with Keepdim : {mean_with_keep_dim.shape}")

In [14]:
normalize(outs)

Mean without Keepdim : torch.Size([2])
Mean with Keepdim : torch.Size([2, 1])


In [20]:
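# first attempt at a full normalization function (keepdim is left out here)
def normalize(input):
    print(f"Input shape : {input.shape}")
    mean = torch.mean(input, dim=-1)
    # unbiased=False divides by N, matching the variance formula in Step 2
    var = torch.var(input, dim=-1, unbiased=False)
    print(f"Mean shape : {mean.shape}")
    # the small constant guards against division by zero
    inputs = (input - mean) / ((var)**(0.5) + 0.000001)
    return inputs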

In [22]:
# This raises an error: mean has shape (2,), which cannot broadcast against input of shape (2, 30).
# That's why we need the keepdim=True parameter.
normalize(outs).shape

Input shape : torch.Size([2, 30])
Mean shape : torch.Size([2])


RuntimeError: The size of tensor a (30) must match the size of tensor b (2) at non-singleton dimension 1

In [None]:
# normalization with keepdim=True
def normalize(input):
    print(f"Input shape : {input.shape}")
    mean = torch.mean(input, dim=-1, keepdim=True)
    # unbiased=False divides by N, matching the variance formula in Step 2
    var = torch.var(input, dim=-1, keepdim=True, unbiased=False)
    print(f"Mean shape : {mean.shape}")
    # the small constant guards against division by zero
    inputs = (input - mean) / ((var)**(0.5) + 0.000001)
    return inputs

In [25]:
# Now each row is normalized over its 30 features, and the output keeps the input shape
normalize(outs).shape

Input shape : torch.Size([2, 30])
Mean shape : torch.Size([2, 1])


torch.Size([2, 30])
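
The `normalize` function above reduces over `dim=-1`, so each sample is normalized over its own 30 features. For comparison, here is a minimal sketch that reduces over the batch dimension (`dim=0`) instead, which is the reduction BatchNorm uses for `(batch_size, hidden_size)` inputs. The helper name `batch_normalize`, the tensor `acts`, and the batch size of 4 are assumptions made for illustration; the check relies on `torch.nn.BatchNorm1d` being in training mode with its initial scale of 1 and shift of 0.

```python
import torch

def batch_normalize(x, eps=1e-5):
    # Reduce over dim=0 (the batch), giving one mean and variance per feature
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

torch.manual_seed(0)
acts = torch.rand(4, 30)                      # (batch_size, hidden_size)
reference = torch.nn.BatchNorm1d(30)(acts)    # freshly built layer: training mode, gamma=1, beta=0
matches = torch.allclose(batch_normalize(acts), reference, atol=1e-5)
print(matches)
```

Here the statistics have shape `(1, 30)`: one mean and one variance per feature, shared by all samples in the batch.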

#### **Batch Normalization for CNNs**

In [48]:
# CNN feature maps have shape (batch_size, channels, height, width)
cnn_output = torch.rand(2, 3, 12, 12)
print(f"Batch size : {cnn_output.shape[0]}")
print(f"Channel size : {cnn_output.shape[1]}")
print(f"Height of channel : {cnn_output.shape[2]}")
print(f"Width of channel : {cnn_output.shape[3]}")

Batch size : 2
Channel size : 3
Height of channel : 12
Width of channel : 12


In [72]:
# BatchNorm for CNNs computes one mean and variance per channel, pooled over the batch and all spatial positions

def normalization_cnn(input):
    # 1) Reduce over batch, height, and width (dims 0, 2, 3), leaving one statistic per channel
    mean = input.mean(dim=(0, 2, 3), keepdim=True)
    # BatchNorm normalizes with the biased batch variance (divide by N), not the unbiased estimator (divide by N-1), hence unbiased=False
    var  = input.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    print(mean.shape)

    # 2) Normalize (1e-5 matches the default eps of BatchNorm2d)
    input_hat = (input - mean) / torch.sqrt(var + 0.00001)
    return input_hat

In [73]:
normalization_cnn(cnn_output).shape

torch.Size([1, 3, 1, 1])


torch.Size([2, 3, 12, 12])

In [74]:
# For CNNs, PyTorch provides this functionality as torch.nn.BatchNorm2d(num_channels)

batch_norm_layer = torch.nn.BatchNorm2d(3)
output = batch_norm_layer(cnn_output)
output.shape

torch.Size([2, 3, 12, 12])

In [78]:
# Check whether my function matches PyTorch's BatchNorm2d (training mode, with gamma=1 and beta=0 at initialization)
if torch.allclose(output, normalization_cnn(cnn_output), atol=1e-5):
    print("Yes")
else:
    print("No")

torch.Size([1, 3, 1, 1])
Yes


#### **Layer Normalization**

Let's say we have a batch of 2 sentences with 5 tokens each and `embed_dim = 7`.

In [80]:
# batch_size, context_len, embed_dim
inputs = torch.rand(2, 5, 7)

In [81]:
# Here we don't use batch normalization; each token is normalized independently over its embedding dimension
def layer_norm(input):
    mean = torch.mean(input, dim=-1, keepdim=True)
    # unbiased=False divides by N, which is what torch.nn.LayerNorm does
    var = torch.var(input, dim=-1, keepdim=True, unbiased=False)
    inputs = (input - mean) / torch.sqrt(var + 0.0001)
    return inputs

In [82]:
out = layer_norm(inputs)
out.shape

torch.Size([2, 5, 7])

#### **Layer Normalization Class**

In [85]:
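import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self):
        super().__init__()
        # small constant that avoids division by zero
        self.eps = 0.0001

    def forward(self, x):
        # statistics are computed per token, over the embedding dimension
        mean = torch.mean(x, dim=-1, keepdim=True)
        # unbiased=False divides by N, which is what torch.nn.LayerNorm does
        var = torch.var(x, dim=-1, keepdim=True, unbiased=False)
        inputs = (x - mean) / torch.sqrt(var + self.eps)
        return inputs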

In [86]:
# batch_size, context_len, embed_dim
inputs = torch.rand(2, 5, 7)
layer_norm1 = LayerNormalization()
outs = layer_norm1(inputs)
outs.shape

torch.Size([2, 5, 7])
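
The class above leaves out the learnable scale and shift (gamma and beta) mentioned earlier. Below is a sketch of one way to add them, under assumptions: the class name `AffineLayerNorm`, its `embed_dim` argument, and the default `eps=1e-5` are illustrative choices picked so the result can be compared against `torch.nn.LayerNorm`.

```python
import torch
import torch.nn as nn

class AffineLayerNorm(nn.Module):
    def __init__(self, embed_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(embed_dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(embed_dim))   # learnable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance (divide by N)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

torch.manual_seed(0)
tokens = torch.rand(2, 5, 7)                 # (batch_size, context_len, embed_dim)
custom = AffineLayerNorm(7)(tokens)
builtin = nn.LayerNorm(7)(tokens)
matches = torch.allclose(custom, builtin, atol=1e-5)
print(custom.shape, matches)
```

At initialization gamma is 1 and beta is 0, so the layer starts as plain normalization and the two outputs agree; during training the optimizer can move both parameters away from those values.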