# Day 11 — "Normalization Layers: BatchNorm, LayerNorm & Gradient Stability"

Normalization layers re-center and rescale activations so each layer sees inputs in a stable, predictable range. That helps keep gradients well-scaled and training fast.

## 1. Core Intuition

- In deep nets, activation scales drift from layer to layer and as weights update, so gradients can explode or vanish.
- Normalization recenters/rescales before each layer, like leveling the ground before building.
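The drift described above can be seen in a small NumPy experiment. This is only a sketch under stated assumptions: a plain stack of linear + ReLU layers with Gaussian weights, no normalization, and a hypothetical helper named `activation_stds` that is not part of the course code.

```python
import numpy as np


def activation_stds(weight_scale, depth=20, width=256, batch=512, seed=0):
    """Std of the activations after each layer of an un-normalized ReLU stack.

    Hypothetical helper for illustration; weights are N(0, weight_scale^2 / width).
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    stds = []
    for _ in range(depth):
        W = weight_scale * rng.standard_normal((width, width)) / np.sqrt(width)
        h = np.maximum(h @ W, 0.0)  # linear layer followed by ReLU
        stds.append(float(h.std()))
    return stds


for scale in (1.0, np.sqrt(2.0), 2.0):
    stds = activation_stds(scale)
    print(f"weight scale {scale:.2f}: layer 1 std = {stds[0]:.3g}, "
          f"layer 20 std = {stds[-1]:.3g}")
```

With a weight scale below roughly √2 the activations shrink layer by layer, and above it they grow, so without normalization only a narrow band of initializations keeps the scale steady.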

## 2. What Normalization Does

- Shifts the mean to 0 and scales the variance to 1 along the chosen axes, then applies a learnable scale and shift.
- Stabilizes gradient flow, smooths the loss surface, allows larger learning rates, reduces sensitivity to initialization, and makes deeper models trainable.
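As a rough illustration of the points above, the sketch below standardizes badly scaled features and then checks that rescaling the input barely changes the result, which is one way to see why normalized networks are less sensitive to initialization. The helper `standardize` and the chosen numbers are assumptions for this example only.

```python
import numpy as np


def standardize(x, axis, eps=1e-5):
    """Subtract the mean and divide by the std along `axis` (illustrative helper)."""
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
x = 3.0 + 10.0 * rng.standard_normal((8, 64))  # off-center, large-scale features

xhat = standardize(x, axis=1)
print("row means:", xhat.mean(axis=1).round(6))
print("row variances:", xhat.var(axis=1).round(4))

# Multiplying the input by 100 leaves the standardized output almost unchanged.
xhat_scaled = standardize(100.0 * x, axis=1)
print("max difference after rescaling:", np.abs(xhat - xhat_scaled).max())
```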

## 3. Batch Normalization

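For each feature, compute statistics over the batch dimension, standardize, then apply a learnable scale and shift:

- `mu = mean(x, batch)`
- `var = var(x, batch)`
- `xhat = (x - mu) / sqrt(var + eps)`
- `y = gamma * xhat + beta`

Here `eps` is a small constant for numerical stability, and `gamma` and `beta` are learned per feature. BatchNorm works best for CNNs; it suffers when batches are tiny or vary in size, which is common with RNNs and Transformers.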
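The formulas above translate almost line for line into NumPy. This is a minimal training-mode sketch for a `(batch, features)` array; the name `batchnorm_forward_sketch` is made up here to avoid confusion with the repository's own `batchnorm_forward`, whose signature may differ, and running statistics for inference are left out.

```python
import numpy as np


def batchnorm_forward_sketch(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over axis 0 of a (batch, features) array."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature (biased) variance over the batch
    xhat = (x - mu) / np.sqrt(var + eps)   # standardize each feature
    return gamma * xhat + beta             # learnable scale and shift


rng = np.random.default_rng(0)
x = 5.0 + 2.0 * rng.standard_normal((32, 4))  # features with mean 5, std 2

y = batchnorm_forward_sketch(x, gamma=np.ones(4), beta=np.zeros(4))
print("per-feature mean:", y.mean(axis=0).round(6))
print("per-feature std: ", y.std(axis=0).round(4))
```

With `gamma = 1` and `beta = 0` each feature column comes out with mean close to 0 and standard deviation close to 1; other values of `gamma` and `beta` let the network move away from that if it helps.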

## 4. Layer Normalization

Normalize across the features of each sample, so the statistics do not depend on the other samples in the batch. This makes LayerNorm batch-size agnostic and well suited to Transformers, LSTMs, and attention blocks.
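A short sketch can make the batch-size independence concrete. The function name `layernorm_forward_sketch` is hypothetical (it is not the repository's `layernorm_forward`), and the check simply normalizes one sample alone and as part of a larger batch.

```python
import numpy as np


def layernorm_forward_sketch(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis of a (batch, features) array."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean over features
    var = x.var(axis=-1, keepdims=True)    # per-sample variance over features
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
gamma, beta = np.ones(5), np.zeros(5)

in_batch = layernorm_forward_sketch(x, gamma, beta)      # sample 0 inside a batch of 4
alone = layernorm_forward_sketch(x[:1], gamma, beta)     # sample 0 as a batch of 1

print("same result either way:", np.allclose(in_batch[0], alone[0]))
```

Because every statistic is computed within a single row, the output for a sample is the same whether the batch holds 1 or 1,000 examples.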

## 5. Python Implementation — BatchNorm vs LayerNorm

`days/day11/code/normalization.py` implements the BatchNorm and LayerNorm forward passes used below.

```python
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np


def find_repo_root(marker: str = "days") -> Path:
    """Walk up from the working directory until a folder containing `marker` is found."""
    path = Path.cwd()
    while path != path.parent:
        if (path / marker).exists():
            return path
        path = path.parent
    raise RuntimeError("Run this notebook from inside the repository tree.")


# Make the repository importable so that `days.day11...` resolves.
REPO_ROOT = find_repo_root()
if str(REPO_ROOT) not in sys.path:
    sys.path.append(str(REPO_ROOT))

from days.day11.code.normalization import batchnorm_forward, layernorm_forward

# 4 samples x 5 features of standard-normal noise.
x = np.random.randn(4, 5)
print('BatchNorm:\n', batchnorm_forward(x))
print('LayerNorm:\n', layernorm_forward(x))
```



## 6. Visualization — Distribution Stabilization

`days/day11/code/visualizations.py` animates drifting activation distributions alongside their BatchNorm-stabilized counterparts.

```python
from days.day11.code.visualizations import anim_batchnorm_distribution

# Rendering the GIF is slow, so it is off by default.
RUN_ANIMATIONS = False

if RUN_ANIMATIONS:
    gif = anim_batchnorm_distribution()
    print('Saved animation →', gif)
else:
    print('Set RUN_ANIMATIONS = True to regenerate Day 11 GIFs in days/day11/outputs/.')
```


## 7. How Normalization Helps Gradients

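- Prevents exploding activations by holding their variance near 1.
- Keeps pre-activations in the responsive range of saturating nonlinearities such as sigmoid and tanh, so fewer gradients vanish.
- Smooths curvature, enables larger learning rates, and reduces internal covariate shift (the change in a layer's input distribution as earlier layers update).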
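The vanishing-gradient point can be checked numerically for a single tanh layer. The sketch below assumes pre-activations whose scale has drifted to a standard deviation of 8 and compares the average local derivative `1 - tanh(z)^2` before and after a BatchNorm-style standardization; it looks only at the activation's own derivative, not at the full backward pass through the normalizer.

```python
import numpy as np


def mean_tanh_grad(z):
    """Average local derivative of tanh, d tanh(z)/dz = 1 - tanh(z)^2."""
    return float(np.mean(1.0 - np.tanh(z) ** 2))


rng = np.random.default_rng(0)
z = 8.0 * rng.standard_normal((256, 128))  # drifted pre-activations (assumed scale)

# BatchNorm-style standardization: per feature, over the batch axis.
z_norm = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + 1e-5)

raw_grad = mean_tanh_grad(z)
norm_grad = mean_tanh_grad(z_norm)
print(f"mean tanh'(z) without normalization: {raw_grad:.3f}")
print(f"mean tanh'(z) with normalization:    {norm_grad:.3f}")
```

Most un-normalized units sit in the flat tails of tanh, where the derivative is near zero; after standardization far more of them fall in the steep central region.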

## 9. Which Architecture Uses Which Normalizer

| Architecture | Normalization |
| --- | --- |
| ResNet / MobileNet | BatchNorm |
| Transformers / GPT / ViT | LayerNorm / RMSNorm |
| Style transfer | InstanceNorm |
| 3D CNNs / tiny batches | GroupNorm |
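For image-shaped tensors, the normalizers in the table differ mainly in which axes the mean and variance are computed over. The sketch below shows this for an `(N, C, H, W)` array; the helpers `normalize` and `group_norm` are illustrative names, the learnable scale and shift are omitted, and LayerNorm is applied over `(C, H, W)` here, whereas in Transformers it usually runs over the last feature axis only.

```python
import numpy as np


def normalize(x, axes, eps=1e-5):
    """Standardize x using statistics computed over `axes` (no learnable affine)."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def group_norm(x, num_groups, eps=1e-5):
    """Split channels into groups and normalize each group per sample."""
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    grouped = normalize(grouped, axes=(2, 3, 4), eps=eps)
    return grouped.reshape(n, c, h, w)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 4, 4))  # (N, C, H, W)

batch_norm = normalize(x, axes=(0, 2, 3))   # per channel, over batch and space
layer_norm = normalize(x, axes=(1, 2, 3))   # per sample, over channels and space
instance_norm = normalize(x, axes=(2, 3))   # per sample and channel, over space
grouped_norm = group_norm(x, num_groups=3)  # per sample, over each group of 2 channels

# GroupNorm interpolates between the other two per-sample schemes.
print(np.allclose(group_norm(x, num_groups=1), layer_norm))
print(np.allclose(group_norm(x, num_groups=6), instance_norm))
```

Only BatchNorm mixes information across samples, which is why the other three remain usable when batches are tiny.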

## 10. Mini Exercises

1. Train a small CNN with/without BatchNorm; compare gradient norms.
2. Shrink batch size (32→2→1) to see BatchNorm instability.
3. Swap BatchNorm for LayerNorm in a CNN and measure accuracy.
4. Try GroupNorm/InstanceNorm when batches are tiny.
5. Compare gradient norms in a deep MLP with vs without normalization.
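As a possible starting point for exercise 2, the sketch below measures how noisy the batch statistics become as the batch shrinks. It assumes a single fully connected feature drawn from a normal distribution with mean 5 and standard deviation 2; `batch_stat_spread` is a hypothetical helper, and a real experiment would track these quantities inside a training loop.

```python
import numpy as np

rng = np.random.default_rng(0)


def batch_stat_spread(batch_size, trials=2000):
    """Std of the batch mean and of the batch variance across many random batches."""
    batches = rng.normal(loc=5.0, scale=2.0, size=(trials, batch_size))
    return float(batches.mean(axis=1).std()), float(batches.var(axis=1).std())


for bs in (32, 2, 1):
    mean_spread, var_spread = batch_stat_spread(bs)
    print(f"batch={bs:>2}  spread of batch mean={mean_spread:.3f}  "
          f"spread of batch variance={var_spread:.3f}")

# With one sample the batch variance is exactly 0, so the normalized value is 0
# and the input's information is lost.
single = rng.normal(loc=5.0, scale=2.0, size=1)
xhat = (single - single.mean()) / np.sqrt(single.var() + 1e-5)
print("normalized single sample:", xhat)
```

The batch mean gets noisier as the batch shrinks, and at batch size 1 the per-feature statistics of a fully connected layer degenerate entirely, which is the instability the exercise asks you to observe.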

## 11. Key Takeaways

| Point | Meaning |
| --- | --- |
| Normalization stabilizes optimization | Keeps activations/gradients in safe ranges. |
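| BatchNorm | Per-feature statistics computed across the batch; great for CNNs. |
| LayerNorm/RMSNorm | Per-sample statistics computed across features; great for Transformers/RNNs. |
| Internal covariate shift | The change in a layer's input distribution during training; reduced when normalization holds that distribution steady. |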
| Enables depth/speed | Higher learning rates & deeper nets. |

> Normalization reshapes the terrain of learning—turning jagged mountains into smooth hills.