# Normalization Methods in Deep Learning

| **Norm Type**                                           | **Stats Computed Over**                                       | **Learned Params**   | **Best Use Cases**                                  | **Pros**                                                                                                         | **Cons / Limitations**                                                                                                               |
| ------------------------------------------------------- | ------------------------------------------------------------- | -------------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| **BatchNorm (BN)**                                      | Across **batch + spatial dims** (for each feature channel)    | γ (scale), β (shift) | CNNs with **moderate-to-large batch sizes**         | - Stabilizes training<br>- Allows higher LR<br>- Regularizes (adds noise)<br>- Default for ResNets, EfficientNet | - Degrades at small batch sizes (noisy batch statistics)<br>- Bad for variable-length sequences<br>- Train/test mismatch (batch statistics in training, running averages at inference)<br>- Cross-device synchronization adds overhead in distributed training |
| **LayerNorm (LN)**                                      | Across **features of a single sample** (independent of batch) | γ, β                 | NLP (Transformers), RNNs, Vision Transformers (ViT) | - Works with batch size = 1<br>- No train/test mismatch<br>- Standard in Transformers                            | - Less effective in CNNs<br>- Sometimes slower: statistics are computed per sample at inference too, so it cannot be folded into adjacent weights the way BN can |
| **InstanceNorm (IN)**                                   | Across **spatial dims per channel per sample**                | γ, β                 | Style transfer, GANs (image generation)             | - Invariant to global contrast/illumination<br>- Normalizes each image independently of the rest of the batch    | - Removes global contrast info, which can hurt classification and other recognition tasks                                            |
| **GroupNorm (GN)**                                      | Across **spatial dims + each group of channels within a sample** | γ, β              | CNNs with **small batch sizes**                     | - Works with any batch size<br>- Stable for small-batch/medical imaging<br>- No train/test mismatch              | - Requires tuning of group count<br>- Not as fast as BN with big batches                                                             |
| **LayerScale / WeightNorm / RMSNorm (other variants)**  | Varies: WeightNorm reparameterizes weights, RMSNorm uses the RMS of activations, LayerScale is a learned per-channel scale with no statistics | Usually scale-only   | Transformers, lightweight models                    | - Even simpler than LN<br>- Lower overhead                                                                       | - Less studied than BN/LN<br>- Not as universal                                                                                      |
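
The "Stats Computed Over" column can be made concrete with a small NumPy sketch. It assumes an image-style activation tensor of shape `(N, C, H, W)` and omits the learned γ and β, so it illustrates only which axes each method reduces over, not a full layer. The helper name `normalize` and the group count `G = 3` are illustrative choices, not part of any library.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std, both computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 6, 4, 4))  # (N, C, H, W)
N, C, H, W = x.shape

# BatchNorm: one mean/variance per channel, shared across batch and space.
batch_norm = normalize(x, axes=(0, 2, 3))

# LayerNorm (image-style): one mean/variance per sample, over all features.
layer_norm = normalize(x, axes=(1, 2, 3))

# InstanceNorm: one mean/variance per (sample, channel), over space only.
instance_norm = normalize(x, axes=(2, 3))

# GroupNorm: split channels into G groups, then reduce over each group
# together with the spatial dims. G is an assumed value and must divide C.
G = 3
grouped = x.reshape(N, G, C // G, H, W)
group_norm = normalize(grouped, axes=(2, 3, 4)).reshape(N, C, H, W)
```

Only the BatchNorm line reduces over axis 0, which is why it is the only method in the table whose output for one sample depends on the other samples in the batch. In Transformers, LayerNorm is usually applied over the last (feature) dimension of each token rather than over `(C, H, W)`.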

---

## Which is more common today?

* **Vision (CNNs):**

  * BatchNorm still very common (ResNet, EfficientNet).
  * GroupNorm is popular when batch size is small (detection, segmentation).
* **NLP / Transformers:**

  * LayerNorm is the **default choice** (BERT, GPT, ViT, etc.).
* **Generative / Style Transfer:**

  * InstanceNorm is often used (CycleGAN, style transfer networks).
* **New research directions:**

  * Some models avoid normalization entirely (e.g., *Normalizer-Free ResNets*); others, including some ViT variants, replace LayerNorm with the simpler RMSNorm.
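
To show how RMSNorm differs from LayerNorm, here is a minimal NumPy sketch of both, applied over the last (feature) dimension as is typical in Transformers. The function names, tensor shape, and `eps` value are assumptions for illustration, not any particular library's API.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Center and scale each feature vector, then apply learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """Divide each feature vector by its root mean square; no centering, no shift."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 16))  # (batch, sequence, features)
gamma = np.ones(16)   # learned scale, initialized to 1
beta = np.zeros(16)   # learned shift, used by LayerNorm only

ln_out = layer_norm(tokens, gamma, beta)
rms_out = rms_norm(tokens, gamma)
```

RMSNorm skips the mean subtraction and the shift parameter, which is where its lower overhead and "scale-only" parameterization come from.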

---

## Rule of Thumb

* **If CNN + large batch** → BatchNorm.
* **If CNN + small batch (like segmentation)** → GroupNorm.
* **If Transformer / NLP / sequence** → LayerNorm.
* **If Style transfer / image generation** → InstanceNorm.
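
The rule of thumb above can be written down as a small lookup function. This is only a sketch: the function name `suggest_norm`, the architecture labels, and the default threshold of 16 for "small batch" are all assumptions made for illustration, and the right cutoff depends on the model and hardware.

```python
def suggest_norm(architecture: str, batch_size: int,
                 small_batch_threshold: int = 16) -> str:
    """Return the normalization layer suggested by the rule of thumb.

    `small_batch_threshold` is an assumed cutoff, not a fixed constant.
    """
    arch = architecture.lower()
    if arch in {"transformer", "nlp", "sequence"}:
        return "LayerNorm"
    if arch in {"style_transfer", "image_generation"}:
        return "InstanceNorm"
    if arch == "cnn":
        # BatchNorm statistics get noisy when the batch is small.
        if batch_size < small_batch_threshold:
            return "GroupNorm"
        return "BatchNorm"
    raise ValueError(f"Unknown architecture: {architecture!r}")

print(suggest_norm("cnn", batch_size=256))
print(suggest_norm("cnn", batch_size=2))
print(suggest_norm("transformer", batch_size=32))
```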