###  **Batch Normalization (BN)**

**Where?**  
- Commonly used in **CNNs** (Convolutional Neural Networks)  
- Typically applied **between linear/convolution and activation layers**

**How?**  
- BN normalizes each feature **across the batch dimension**
- Formula:  
  $$
  \hat{x}^{(k)} = \frac{x^{(k)} - \mu^{(k)}}{\sqrt{(\sigma^{(k)})^2 + \epsilon}}
  $$  
  where:
  - $x^{(k)}$ is the k-th feature of one sample
  - $\mu^{(k)}$, $\sigma^{(k)}$: mean and standard deviation of that feature over the batch
  - $\epsilon$: small constant for numerical stability
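
To make the formula concrete, here is a minimal sketch in plain Python (no framework). The function name `batch_norm`, the toy data, and the default `eps=1e-5` are assumptions for illustration; the variance is the biased one (divide by batch size), and the learnable scale and shift are left out.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) across the batch (rows).

    `batch` is a list of samples; each sample is a list of feature values.
    Illustrative helper, not a library API.
    """
    normalized_columns = []
    for column in zip(*batch):            # one column = one feature k
        n = len(column)
        mu = sum(column) / n              # mean over the batch
        var = sum((v - mu) ** 2 for v in column) / n  # biased variance
        normalized_columns.append(
            [(v - mu) / math.sqrt(var + eps) for v in column]
        )
    # Transpose back to a list of samples.
    return [list(row) for row in zip(*normalized_columns)]

# Toy batch: 4 samples, 2 features on very different scales.
x = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
x_hat = batch_norm(x)
```

After normalization both columns hold nearly the same values (about -1.34, -0.45, 0.45, 1.34), because each feature is rescaled using its own batch mean and variance.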

**Key Properties:**  
- Uses **batch statistics** during training (mean and variance per feature across samples)
- Has learnable **scale ($\gamma$)** and **shift ($\beta$)**, applied after normalization: $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$
- During inference, uses **running estimates** of mean/variance
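
A rough sketch of how the training and inference paths can differ, again in plain Python. The class `BatchNorm1D`, the `momentum=0.1` default, and the choice to track the biased variance are all assumptions made for this example; real frameworks differ in details.

```python
import math

class BatchNorm1D:
    """Toy batch norm over lists of samples (hypothetical class, not a library API)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum             # weight given to the newest batch
        self.gamma = [1.0] * num_features    # scale (would be learned; fixed here)
        self.beta = [0.0] * num_features     # shift (would be learned; fixed here)
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def forward(self, batch, training):
        n = len(batch)
        out = [list(sample) for sample in batch]
        for k in range(len(self.gamma)):
            if training:
                # Training: use statistics of the current batch.
                column = [sample[k] for sample in batch]
                mu = sum(column) / n
                var = sum((v - mu) ** 2 for v in column) / n
                # Update the running estimates with an exponential moving average.
                m = self.momentum
                self.running_mean[k] = (1 - m) * self.running_mean[k] + m * mu
                self.running_var[k] = (1 - m) * self.running_var[k] + m * var
            else:
                # Inference: use the stored running estimates instead.
                mu, var = self.running_mean[k], self.running_var[k]
            inv_std = 1.0 / math.sqrt(var + self.eps)
            for i in range(n):
                out[i][k] = self.gamma[k] * (batch[i][k] - mu) * inv_std + self.beta[k]
        return out

bn = BatchNorm1D(num_features=1)
for _ in range(200):                         # "training" on the same toy batch
    bn.forward([[2.0], [4.0], [6.0]], training=True)

# Inference on a single sample works because no batch statistics are needed.
y = bn.forward([[4.0]], training=False)
```

After enough updates the running mean settles near 4.0, so the single inference sample 4.0 maps to roughly 0.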

**Pros:**
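- Was introduced to reduce **internal covariate shift**, though later work attributes much of its benefit to a smoother optimization landscape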
- Speeds up convergence
- Often improves generalization

**Cons:**
- Depends on batch size (small batches give noisy mean/variance estimates and can destabilize training)
- Not ideal for **RNNs** or **online/streaming data**
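
The extreme case is a batch of one sample. In this small sketch (plain Python, with a hypothetical `batch_norm` helper using batch statistics only), every feature equals its own batch mean, so the normalized output is all zeros and the input information is lost.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Training-mode batch norm over a list of samples (illustrative helper)."""
    normalized_columns = []
    for column in zip(*batch):
        n = len(column)
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n
        normalized_columns.append(
            [(v - mu) / math.sqrt(var + eps) for v in column]
        )
    return [list(row) for row in zip(*normalized_columns)]

# A "batch" containing a single sample with three features.
single = batch_norm([[3.0, -7.0, 42.0]])
all_zero = all(value == 0.0 for value in single[0])
```

Here `all_zero` is `True`: with one sample, the batch mean is the sample itself, which is why streaming or batch-size-1 settings are a poor fit for batch statistics.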

---

### **Layer Normalization (LN)**

**Where?**  
- Used in **RNNs**, **Transformers**, **NLP models**, and **fully connected networks**

**How?**  
- LN normalizes **across features in a single sample**, **not across the batch**
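- Formula:  
  $$
  \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
  $$  
  where:
  - $x_i$ is the i-th feature of one sample
  - $\mu$, $\sigma$: mean and standard deviation over the features of that sample
  - $\epsilon$: small constant for numerical stability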
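
The same formula as a minimal plain-Python sketch. The name `layer_norm`, the toy input, and `eps=1e-5` are assumptions for illustration; the learnable scale and shift are omitted.

```python
import math

def layer_norm(sample, eps=1e-5):
    """Normalize one sample across its own features (illustrative helper)."""
    n = len(sample)
    mu = sum(sample) / n                            # mean over features
    var = sum((v - mu) ** 2 for v in sample) / n    # biased variance over features
    return [(v - mu) / math.sqrt(var + eps) for v in sample]

# One sample with four features; no batch is involved.
x = [1.0, 2.0, 3.0, 4.0]
x_hat = layer_norm(x)
```

The output has approximately zero mean and unit variance across the four features, and the function never looks at any other sample.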

**Key Properties:**  
- **Independent of batch size**
- Uses the current sample's statistics in both training and inference (no running estimates)
- Learnable **scale ($\gamma$)** and **shift ($\beta$)**
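
A small comparison that illustrates the batch independence. Both helpers below are hypothetical plain-Python sketches (training-mode statistics, no scale or shift), and the two toy batches are made up for the example.

```python
import math

def layer_norm(sample, eps=1e-5):
    """Normalize one sample across its features (illustrative helper)."""
    n = len(sample)
    mu = sum(sample) / n
    var = sum((v - mu) ** 2 for v in sample) / n
    return [(v - mu) / math.sqrt(var + eps) for v in sample]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature across the batch (illustrative helper)."""
    normalized_columns = []
    for column in zip(*batch):
        n = len(column)
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n
        normalized_columns.append(
            [(v - mu) / math.sqrt(var + eps) for v in column]
        )
    return [list(row) for row in zip(*normalized_columns)]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [0.0, 0.0, 0.0]]       # same sample, different companions
batch_b = [sample, [10.0, -10.0, 5.0]]

# LayerNorm output for `sample` in each batch.
ln_a = [layer_norm(s) for s in batch_a][0]
ln_b = [layer_norm(s) for s in batch_b][0]

# BatchNorm output for `sample` in each batch.
bn_a = batch_norm(batch_a)[0]
bn_b = batch_norm(batch_b)[0]
```

`ln_a` and `ln_b` are identical, while `bn_a` and `bn_b` differ, since batch statistics change with whatever else is in the batch.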

**Pros:**
- Works well with variable batch sizes
- Stable for sequential models (RNNs, Transformers)

**Cons:**
- May converge slightly more slowly than BN in CNNs

---

###  **Quick Comparison**

| Feature                  | BatchNorm                        | LayerNorm                        |
|--------------------------|----------------------------------|----------------------------------|
| Normalizes over          | Batch axis (per feature/channel) | Feature axis only (per sample)   |
| Use case                 | CNNs                             | RNNs, Transformers, NLP          |
| Batch size sensitive     | ✅ Yes                           | ❌ No                            |
| Inference mode           | Uses running stats               | Uses current sample              |
| Sequential data          | ❌ Problematic                   | ✅ Well suited                   |

---

###  Summary

- Use **BatchNorm** in CNNs with reasonably large batch sizes.
- Use **LayerNorm** in NLP, RNNs, or attention-based models like Transformers.
- For small batches or online learning, **LayerNorm** is more stable.
