## Batch Normalization (BatchNorm)

Batch Normalization is a layer that:

1. **Normalizes** activations (zero mean, unit variance) **per feature, within each mini-batch**
2. Helps **speed up training** and **stabilize learning**; its original motivation was reducing **internal covariate shift**

It’s commonly placed **after a linear or conv layer** and before the activation (such as ReLU).

---

### **What BatchNorm Does**

For each mini-batch during training, BatchNorm:

1. **Normalizes activations** so that they have zero mean and unit variance:

   $$
   \hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2 + \epsilon}}
   $$

   where $\mu_{\text{batch}}$ and $\sigma_{\text{batch}}^2$ are the mean and variance of that feature, computed over the current batch, and $\epsilon$ is a small constant that avoids division by zero.

2. **Learns two trainable parameters** per feature:

   * **Scale ($\gamma$)**: allows the network to stretch/compress normalized activations.
   * **Shift ($\beta$)**: allows the network to shift normalized activations.

   So the final output is:

   $$
   y = \gamma \hat{x} + \beta
   $$
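The two steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `batch_norm_forward`, the list-of-lists layout, and the value `eps=1e-5` are assumptions made for the example.

```python
import math

def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a batch of feature vectors.

    batch: list of samples, each a list of floats (one value per feature).
    gamma, beta: one scale and one shift per feature.
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [sample[j] for sample in batch]
        mu = sum(column) / n
        # Variance over the batch (divide by n), matching the formula above.
        var = sum((v - mu) ** 2 for v in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            x_hat = (batch[i][j] - mu) * inv_std       # step 1: normalize
            out[i][j] = gamma[j] * x_hat + beta[j]     # step 2: scale and shift
    return out

# Two features on very different scales (hypothetical values).
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
y = batch_norm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in y:
    print([round(v, 2) for v in row])
```

With $\gamma=1$ and $\beta=0$, both features end up on the same scale even though the second one started ten times larger.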

---

### **What is Learned**

* **Not the mean and variance** (those are computed dynamically during training).
* Instead, the learnable parameters are:

  * $\gamma$ (scaling factor)
  * $\beta$ (shifting factor)

These parameters allow BatchNorm to preserve the representational power of the network (otherwise normalization would restrict activations too much).

During inference, instead of using the batch mean/variance, **running averages** (moving averages of mean and variance accumulated during training) are used.
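One common way to maintain these running averages is an exponential moving average. The sketch below assumes that update rule and a momentum of 0.1 (the convention used by PyTorch's defaults); the per-batch statistics are made-up numbers.

```python
def update_running(running, batch_stat, momentum=0.1):
    """Exponential moving average: keep (1 - momentum) of the old value."""
    return (1.0 - momentum) * running + momentum * batch_stat

running_mean, running_var = 0.0, 1.0            # assumed starting values
batch_means = [2.4, 2.6, 2.5, 2.45, 2.55] * 20  # hypothetical per-batch means
batch_vars = [1.2, 1.3, 1.25, 1.22, 1.28] * 20  # hypothetical per-batch variances

for m, v in zip(batch_means, batch_vars):
    running_mean = update_running(running_mean, m)
    running_var = update_running(running_var, v)

print(round(running_mean, 2), round(running_var, 2))
```

After enough batches the starting values are forgotten and the running statistics settle near the typical batch statistics, which is what inference then uses.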

---



### **How γ (scale) and β (shift) are learned**

* They are **regular trainable parameters** of the network, just like weights and biases in Linear/Conv layers.
* They start with some initialization (typically $\gamma=1, \beta=0$).
* During backpropagation:

  * Gradients of the loss with respect to $\gamma$ and $\beta$ are computed.
  * The optimizer (SGD, Adam, etc.) updates them, exactly the same way it updates weights.

So the network "decides" the optimal scaling and shifting for each feature/channel by minimizing the training loss.
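A small sketch of that update loop for a single feature. The targets, the mean-squared-error loss, and the learning rate are assumptions chosen so the result is easy to check; a real network would get its gradients from whatever loss it is trained on.

```python
import math

x = [1.0, 2.0, 3.0, 4.0]
mu = sum(x) / len(x)
var = sum((v - mu) ** 2 for v in x) / len(x)
x_hat = [(v - mu) / math.sqrt(var + 1e-5) for v in x]

# Hypothetical targets: the output the loss "wants" for this feature.
target = [2.0 * v + 3.0 for v in x_hat]

gamma, beta = 1.0, 0.0   # typical initialization
lr = 0.1                 # assumed learning rate

for step in range(200):
    y = [gamma * v + beta for v in x_hat]
    # Mean squared error: L = mean((y - target)^2), so dL/dy_i = 2 (y_i - t_i) / n
    dL_dy = [2.0 * (yi - ti) / len(y) for yi, ti in zip(y, target)]
    # Chain rule: dy/dgamma = x_hat and dy/dbeta = 1
    grad_gamma = sum(g * v for g, v in zip(dL_dy, x_hat))
    grad_beta = sum(dL_dy)
    # Plain SGD step, the same rule a weight or bias would follow
    gamma -= lr * grad_gamma
    beta -= lr * grad_beta

print(round(gamma, 3), round(beta, 3))
```

Starting from $\gamma=1, \beta=0$, gradient descent moves the parameters to $\gamma \approx 2$ and $\beta \approx 3$, the values that generated the targets.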

---

### **How they are used**

After normalizing each activation $x$ with batch statistics:

$$
\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2 + \epsilon}}
$$

BatchNorm outputs:

$$
y = \gamma \hat{x} + \beta
$$

* $\gamma$ stretches or compresses the normalized activation distribution.
* $\beta$ shifts it up or down.

If the network wants the output to be centered at 5 instead of 0, it will learn $\beta \approx 5$.
If it wants a standard deviation larger or smaller than 1, it adjusts $\gamma$ (the output standard deviation is $|\gamma|$).
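A short derivation makes the roles of the two parameters explicit. It assumes $\hat{x}$ has exactly zero mean and unit variance over the batch, which holds up to the small effect of $\epsilon$:

$$
\mathbb{E}[y] = \gamma\,\mathbb{E}[\hat{x}] + \beta = \beta,
\qquad
\mathrm{Var}[y] = \gamma^2\,\mathrm{Var}[\hat{x}] = \gamma^2
$$

So $\beta$ is the mean of the output and $|\gamma|$ is its standard deviation.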

---

### **Are they used in evaluation (inference)?**

Yes, but there’s a key difference between training and evaluation:

* **Training mode**:

  * Normalization uses the **current batch’s mean & variance**.
  * Then applies $\gamma$ and $\beta$.

* **Evaluation (inference) mode**:

  * Normalization uses the **running (moving average) mean & variance** accumulated during training.
  * Then applies the same learned $\gamma$ and $\beta$.

So in inference you do **not depend on the current batch**, but you still apply the learned scale and shift:

$$
y = \gamma \cdot \frac{x - \mu_{\text{running}}}{\sqrt{\sigma_{\text{running}}^2 + \epsilon}} + \beta
$$
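Because every quantity in this formula is fixed after training, inference-mode BatchNorm reduces to a per-feature affine map $y = a x + b$. The sketch below checks that numerically; the helper name `fold_batchnorm` and all parameter values are hypothetical.

```python
import math

def fold_batchnorm(gamma, beta, running_mean, running_var, eps=1e-5):
    """Collapse inference-mode BatchNorm into y = a * x + b for one feature."""
    a = gamma / math.sqrt(running_var + eps)
    b = beta - a * running_mean
    return a, b

# Hypothetical learned parameters and running statistics.
gamma, beta = 2.0, 3.0
running_mean, running_var = 2.5, 1.25
eps = 1e-5

a, b = fold_batchnorm(gamma, beta, running_mean, running_var, eps)

x = 4.0
direct = gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta
folded = a * x + b
print(abs(direct - folded) < 1e-12)
```

This is why the result for one input no longer depends on which other inputs happen to be in the batch.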

---

### **Why keep γ and β at inference?**

Without $\gamma,\beta$, all features would always be locked to mean=0 and variance=1.
This would restrict the representational power of the network.

$\gamma$ and $\beta$ restore flexibility while keeping training stable.
Think of them as a **"learned reparametrization"** step.

---

A common point of confusion about BatchNorm is worth spelling out. The layer appears to work against itself:

* Step 1: BatchNorm **forces each feature to mean 0 and std 1** over the batch.
* Step 2: It then applies $\gamma$ (scale) and $\beta$ (shift), which **can undo this**.

So why bother?

---

### The Key Idea

We don’t *want* every layer’s output to always be exactly mean=0, std=1.
We just want the optimization to be **stable** and not blow up due to wildly different scales across layers.

So the logic is:

1. **Normalize first** → remove crazy fluctuations in mean/variance between batches.

   * This makes gradients more stable.
   * Prevents exploding/vanishing activations.
   * Allows higher learning rates.

2. **Then reintroduce flexibility with $\gamma$ and $\beta$** → let the network **choose** the best scale and offset for each feature if pure normalization is too restrictive.

   * Example: maybe ReLU works best if the distribution is shifted positive.
   * Or maybe a feature is only useful if it’s amplified.

So:

* The normalization step ensures **training stability**.
* The $\gamma,\beta$ step ensures **representational power is not lost**.

---

### What Do We Actually Learn?

We learn **the "preferred" mean and variance of activations** for each feature.

* Without $\gamma,\beta$, every feature is always constrained to mean=0, var=1 → too rigid.
* With $\gamma,\beta$, the network can move features to **where they’re most useful**.

Formally:

$$
y = \gamma \hat{x} + \beta
$$

* $\gamma$ tells us how “wide” the distribution should be (its standard deviation is $|\gamma|$).
* $\beta$ tells us where the center of the distribution should be (its mean).

The network learns these values automatically to minimize loss.

---

### Analogy

Think of normalization like **pressing reset**: every layer gets a clean, well-behaved distribution (mean 0, var 1).
But sometimes the model actually wants “not centered at 0” — e.g., a ReLU neuron benefits if most values are slightly positive.

So $\beta$ moves the “reset” center.
And $\gamma$ lets the feature spread out more or less.

---

### Example

Imagine after normalization we always get values roughly in $[-1, 1]$.

* If the ReLU that follows zeroes out all negatives, roughly half of the values are discarded.
* But if $\beta=+1$, the distribution shifts to $[0, 2]$, and now *everything survives ReLU*.
* If a feature is weak, $\gamma>1$ amplifies it.
* If a feature is noisy, $\gamma<1$ compresses it.
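The effect of the shift on ReLU can be checked with a few made-up numbers. The four activations below are hypothetical values in $[-1, 1]$, and $\gamma$ is held at 1 so that only $\beta$ varies.

```python
def relu(v):
    return max(0.0, v)

x_hat = [-1.0, -0.5, 0.5, 1.0]   # hypothetical normalized activations in [-1, 1]

zeroed_counts = {}
for beta in (0.0, 1.0):
    y = [v + beta for v in x_hat]          # gamma fixed at 1 for simplicity
    after = [relu(v) for v in y]
    # Count the values that were negative and therefore clipped to zero.
    zeroed_counts[beta] = sum(1 for v in y if v < 0.0)
    print(f"beta={beta}: clipped {zeroed_counts[beta]} of {len(y)} -> {after}")
```

With $\beta=0$ two of the four values are clipped; with $\beta=1$ none are.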

---




### Different Behavior in Train/Test in BatchNorm

```python
model.train()  # training mode (BN uses batch stats, Dropout active)
...
model.eval()   # inference mode (BN uses running averages, Dropout disabled)
```
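To see what that mode switch changes, here is a toy single-feature layer written in plain Python. The class `TinyBatchNorm` is hypothetical and only mimics the `train()` / `eval()` toggle; the momentum and starting values are assumptions.

```python
import math

class TinyBatchNorm:
    """Single-feature BatchNorm sketch with train/eval modes (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def __call__(self, xs):
        if self.training:
            mean = sum(xs) / len(xs)
            var = sum((v - mean) ** 2 for v in xs) / len(xs)
            # Track statistics for later use at inference time.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta for v in xs]

bn = TinyBatchNorm()
bn.train()
for _ in range(100):
    bn([1.0, 2.0, 3.0, 4.0])      # the same hypothetical batch, repeatedly

bn.eval()
alone = bn([4.0])[0]              # a batch of one works in eval mode
in_batch = bn([4.0, 100.0])[0]    # batch-mates no longer affect the result
print(alone == in_batch)
```

In eval mode the output for `4.0` is the same whether or not an outlier shares its batch; in training mode the outlier would change the batch statistics and therefore the output.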

## Tiny BatchNorm Example

### Step 1: Raw activations (from previous layer)

$$
x = [1, 2, 3, 4]
$$

---

### Step 2: Normalize (mean=0, std=1)

* Mean:

$$
\mu = \frac{1+2+3+4}{4} = 2.5
$$

* Variance:

$$
\sigma^2 = \frac{(1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2}{4} = 1.25
$$

* Std:

$$
\sigma = \sqrt{1.25} \approx 1.118
$$

Now normalize each:

$$
\hat{x}_i = \frac{x_i - \mu}{\sigma}
$$

$$
\hat{x} = \left[ \frac{1-2.5}{1.118}, \frac{2-2.5}{1.118}, \frac{3-2.5}{1.118}, \frac{4-2.5}{1.118} \right]
$$

$$
\hat{x} \approx [-1.34, -0.45, 0.45, 1.34]
$$

Now mean = 0, std = 1.

---

### Step 3: Apply learned γ (scale) and β (shift)

Suppose the network has learned:

$$
\gamma = 2, \quad \beta = 3
$$

Then:

$$
y = \gamma \hat{x} + \beta
$$

$$
y = 2 \cdot [-1.34, -0.45, 0.45, 1.34] + 3
$$

$$
y \approx [0.32, 2.10, 3.90, 5.68]
$$

---

### Step 4: Interpretation

* Normalization gave us a **stable baseline** distribution (mean=0, std=1).
* But the network “decided” that this feature works best if it is **shifted upward** (β=3) and **stretched wider** (γ=2).
* Now the values are centered at 3 with a standard deviation of 2.

This is what the network *learns*: the “best” mean and variance for each feature, while enjoying the stability of normalization.

---
**Summary of the toy example**:

* Start: $[1,2,3,4]$
* Normalize: $[-1.34, -0.45, 0.45, 1.34]$
* Apply γ=2, β=3 → $[0.32, 2.10, 3.90, 5.68]$
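The same numbers can be reproduced in a few lines of Python. This sketch uses the standard-library `statistics` module and omits $\epsilon$, as the hand calculation does.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0]
gamma, beta = 2.0, 3.0

mu = statistics.fmean(x)
var = statistics.pvariance(x)          # population variance (divide by N)
sigma = var ** 0.5

x_hat = [(v - mu) / sigma for v in x]  # epsilon omitted, as in the hand calculation
y = [gamma * v + beta for v in x_hat]

print(mu, var, round(sigma, 3))
print([round(v, 2) for v in x_hat])
print([round(v, 2) for v in y])
```

Carrying full precision gives 2.11 and 3.89 for the two middle outputs; the hand-computed 2.10 and 3.90 come from rounding $\hat{x}$ to two decimals before scaling.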

---



## When to Use BatchNorm

BatchNorm is useful when:

* You have **deep feedforward networks** (MLPs, CNNs) that are hard to train.
* You want **faster convergence** with higher learning rates.
* You want **regularization** (BN adds some noise due to batch statistics, reducing overfitting).
* You’re working with **image models (CNNs)** — historically it was a game-changer (ResNet, Inception, etc.).

It’s less critical in:

* Very **shallow networks** (where activations don’t blow up/vanish).
* Very **small batch sizes** (BN becomes unstable because mean/variance estimates are noisy).
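The small-batch problem can be illustrated with a quick simulation. It assumes activations drawn from a standard normal distribution and measures how much the batch mean itself fluctuates from batch to batch; the batch sizes and trial count are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def batch_mean_spread(batch_size, trials=2000):
    """Std. dev. of the batch mean when activations are drawn from N(0, 1)."""
    means = []
    for _ in range(trials):
        batch = [random.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.pstdev(means)

spread = {b: batch_mean_spread(b) for b in (2, 8, 64)}
for b, s in spread.items():
    print(f"batch size {b:>2}: batch-mean std ~ {s:.3f}")
```

The spread shrinks roughly as $1/\sqrt{B}$ for batch size $B$, so a batch of 2 gives a far noisier estimate of the mean than a batch of 64.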

---

## Drawbacks of BatchNorm

1. **Dependency on batch size**

   * If the batch is too small (say 1–4), the estimated mean/variance is noisy → training unstable.
   * In some domains (NLP, RL), we often use small batches, so BN struggles.

2. **Training vs inference mismatch**

   * During training → batch mean/variance.
   * During inference → running averages.
   * If not estimated well, performance can drop.

3. **Extra complexity**

   * Slight computational overhead.
   * Harder to use in some architectures (RNNs, transformers with variable-length sequences).

4. **Not always portable**

   * In distributed training, synchronizing batch stats across GPUs can be tricky.

---

## Is it still common and popular?

* **Yes, but… with competition.**
* BatchNorm is still extremely common in **CNNs for vision** (e.g., ResNet, EfficientNet still use BN).
* But in **NLP and Transformers**, **LayerNorm** became the standard (BN doesn’t work well with variable-length sequences and small batches).
* In **GANs**, people often prefer **InstanceNorm** or **GroupNorm**.
* In very **small batch regimes**, **GroupNorm** or **LayerNorm** is more stable.
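The practical difference between BatchNorm and LayerNorm is the axis they normalize over. The sketch below is a simplified comparison under stated assumptions: it omits $\gamma$ and $\beta$, and the function names and data are made up for illustration.

```python
import math

def normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]

sample = [1.0, 2.0, 6.0]
batch_a = [sample, [0.0, 0.0, 0.0]]
batch_b = [sample, [9.0, -3.0, 4.0]]

# LayerNorm: the first sample's output ignores its batch-mates.
print(layer_norm(batch_a)[0] == layer_norm(batch_b)[0])
# BatchNorm: the first sample's output changes with its batch-mates.
print(batch_norm(batch_a)[0] == batch_norm(batch_b)[0])
```

Since LayerNorm never looks across the batch, its output is unaffected by batch size, which is one reason it suits small-batch and variable-length settings.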

### Trends:

* **2015–2018**: BatchNorm was revolutionary → default in almost every model.
* **2019–2023**: Alternatives (LayerNorm, GroupNorm) became popular depending on the architecture.
* **Now (2025)**:

  * **CNNs → BatchNorm is still default**.
  * **Transformers (vision & NLP) → LayerNorm dominates**.
  * **Special cases (GANs, style transfer, medical imaging with small batches) → InstanceNorm / GroupNorm**.

---

## Summary

* **When to use**: CNNs, medium/large batch training, deep models that need stable training.
* **Drawbacks**: Needs sufficiently large batches, behaves differently in training and inference, not great for sequential/transformer models.
* **Popularity**: Still very popular in CNNs, but LayerNorm has taken over in Transformers, and GroupNorm/InstanceNorm are preferred in small-batch or special cases.

---

