##  What is Batch Normalization (BatchNorm)?

Batch Normalization is a layer that:

1. **Normalizes** activations (zero mean, unit variance) **within each batch**
2. Helps **speed up training**, reduce **internal covariate shift**, and **stabilize learning**

It’s commonly used **after a linear or conv layer**, before activation (like ReLU).

---

##  When Is BatchNorm Used?

| Phase       | Is BatchNorm used? | How it behaves                                              |
|-------------|---------------------|--------------------------------------------------------------|
| **Training**    | ✅ Yes                | Uses **mean and variance of current batch** for normalization. Also updates running statistics (running mean & variance). |
| **Validation**  | ✅ Yes (with `.eval()`) | Uses the **running (moving average) statistics** collected during training. |
| **Test**        | ✅ Yes (with `.eval()`) | Same as validation – uses running stats from training.     |

---

###  `model.train()` ➜ Training Mode

In this mode, **BatchNorm**:

- Normalizes using **batch statistics** (mean & variance of the current mini-batch).
- Updates its **running averages** (for later use during evaluation/test).

```python
model.train()
# x -> BatchNorm -> Normalize with batch mean/var
#                  Update running_mean and running_var
```
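As a rough illustration of the two effects listed above, the sketch below runs one training-mode forward pass through `nn.BatchNorm1d`. The layer width, batch size, and input distribution are arbitrary choices made for this example. It relies on PyTorch's documented defaults (`momentum=0.1`, running mean initialized to 0, running variance initialized to 1) and on the fact that PyTorch stores the *unbiased* batch variance in its running estimate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)                 # 3 features; sizes here are arbitrary
x = torch.randn(32, 3) * 2.0 + 5.0     # batch with mean ~5 and std ~2 per feature

bn.train()
y = bn(x)

# The output is normalized with the statistics of this very batch.
print(y.mean(dim=0))                   # close to 0 for every feature
print(y.var(dim=0, unbiased=False))    # close to 1 for every feature

# The running statistics moved one step toward the batch statistics.
print(bn.running_mean)                 # roughly 0.1 * batch mean
print(bn.running_var)                  # roughly 0.9 * 1 + 0.1 * unbiased batch variance
```

After this single step the running mean is only about a tenth of the way to the batch mean, which is why the running statistics need many training batches before they are useful.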

---

###  `model.eval()` ➜ Evaluation (Validation/Test) Mode

In this mode, **BatchNorm**:

- Normalizes using the **stored running statistics**.
- **Does NOT** update running stats.
- Gives the same output for a given input, regardless of which other samples are in the batch.

```python
model.eval()
# x -> BatchNorm -> Normalize with running_mean/var (from training)
```
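A small check of the eval-mode behavior described above, again assuming `nn.BatchNorm1d` with arbitrary sizes chosen for illustration: the same sample is passed alone and together with unrelated samples, and the running mean is compared before and after.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)
bn.train()
bn(torch.randn(64, 3) * 2.0 + 5.0)     # one training step to populate the running stats

bn.eval()
mean_before = bn.running_mean.clone()

sample = torch.randn(1, 3)
others = torch.randn(7, 3) * 10.0      # unrelated samples with a very different scale

with torch.no_grad():
    out_alone = bn(sample)                               # a batch of size 1 is fine in eval mode
    out_in_batch = bn(torch.cat([sample, others]))[:1]   # same sample, different neighbours

same_output = torch.allclose(out_alone, out_in_batch, atol=1e-6)
stats_unchanged = torch.equal(bn.running_mean, mean_before)
print(same_output, stats_unchanged)
```

Both checks should come out true: the neighbours do not influence the sample's output, and the eval-mode forward passes leave the running mean untouched.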

---

###  Why This Matters

With a small batch size, batch statistics can vary a lot from batch to batch. During training that noise is tolerable, but during validation or test we want stable and **consistent behavior**, so we use the running averages instead.

If you forget to switch to `.eval()`, BatchNorm keeps normalizing with the mean/variance of each (possibly tiny) validation/test batch and keeps updating its running statistics with them. A sample's prediction then depends on which other samples share its batch, and the reported metrics become unreliable.
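The batch dependence can be seen with plain NumPy. In this sketch the helper `normalize`, the two batches, and the running statistics are all made-up values for illustration; the same sample is normalized once with batch statistics (what happens if `.eval()` is forgotten) and once with fixed running statistics.

```python
import numpy as np

eps = 1e-5

def normalize(x, mean, var):
    """Plain normalization step, without the learnable scale and shift."""
    return (x - mean) / np.sqrt(var + eps)

sample = 2.0
batch_a = np.array([sample, 1.0, 3.0])     # the sample sits at the batch mean
batch_b = np.array([sample, 10.0, 12.0])   # same sample, very different neighbours

# Training-mode behavior: statistics come from whatever batch the sample lands in.
out_a = normalize(batch_a, batch_a.mean(), batch_a.var())[0]
out_b = normalize(batch_b, batch_b.mean(), batch_b.var())[0]

# Eval-mode behavior: fixed (hypothetical) running statistics.
running_mean, running_var = 2.5, 1.25
eval_a = normalize(batch_a, running_mean, running_var)[0]
eval_b = normalize(batch_b, running_mean, running_var)[0]

print("batch stats:  ", out_a, out_b)      # two different values for the same sample
print("running stats:", eval_a, eval_b)    # the same value both times
```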

---

## ✅ Summary

| Mode         | Use BatchNorm? | Uses What Stats?            | Updates Running Stats? |
|--------------|----------------|------------------------------|-------------------------|
| Training     | Yes            | Current batch mean/variance | ✅ Yes                  |
| Validation   | Yes            | Running mean/variance       | ❌ No                   |
| Testing      | Yes            | Running mean/variance       | ❌ No                   |

---


##  **BatchNorm – Equation (Training Phase)**

Given an input batch $ x = [x_1, x_2, ..., x_m] $, BatchNorm does the following:

1. **Compute batch mean and variance**:

   $$
   \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i,\quad
   \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
   $$

2. **Normalize each input**:

   $$
   \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
   $$

3. **Scale and shift** (learnable parameters $\gamma$, $\beta$):

   $$
   y_i = \gamma \hat{x}_i + \beta
   $$

  During training, $ \mu_B $ and $ \sigma_B^2 $ come **from the current mini-batch**, and we **update the running estimates**.
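The three steps can be sketched in NumPy for a batch of shape `(m, d)`, where each of the `d` features is normalized independently. The function name `batchnorm_train` and the test data are assumptions made for this example, not part of any library.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode forward pass. x: (m, d); gamma, beta: (d,)."""
    mu_B = x.mean(axis=0)                        # step 1: per-feature batch mean
    var_B = x.var(axis=0)                        # step 1: biased variance (divides by m)
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)    # step 2: normalize
    y = gamma * x_hat + beta                     # step 3: scale and shift
    return y, mu_B, var_B

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))  # 8 samples, 4 features
y, mu_B, var_B = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=0))   # approximately 0 for every feature
print(y.var(axis=0))    # approximately 1 for every feature
```

The function also returns `mu_B` and `var_B` because those are the values that feed the running-statistics update.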

---

##  Running Statistics (for Evaluation)

To make the model deterministic during validation/testing, we maintain a **running average** of the batch statistics:

- Running mean:

  $$
  \mu_{running} \leftarrow (1 - \alpha) \cdot \mu_{running} + \alpha \cdot \mu_B
  $$

- Running variance:

  $$
  \sigma^2_{running} \leftarrow (1 - \alpha) \cdot \sigma^2_{running} + \alpha \cdot \sigma_B^2
  $$

Where:
- $ \alpha $ is a momentum term (e.g., 0.1)
- These are **not learned** by gradient descent; they are updated as a side effect of each training-mode forward pass
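To get a feel for how the update rule behaves over many steps, the sketch below applies it to the batch means of synthetic data. The helper `update_running`, the true mean of 5.0, and the batch size are assumptions chosen for illustration.

```python
import numpy as np

def update_running(running, batch_stat, alpha=0.1):
    """One exponential-moving-average step, as in the formulas above."""
    return (1 - alpha) * running + alpha * batch_stat

rng = np.random.default_rng(0)
true_mean = 5.0
mu_running = 0.0          # a common initial value for the running mean

for step in range(100):
    batch = rng.normal(loc=true_mean, scale=2.0, size=64)
    mu_running = update_running(mu_running, batch.mean())

# Weight still carried by the initial value after 100 steps: (1 - alpha)^100
initial_weight = (1 - 0.1) ** 100

print(mu_running)         # close to the true mean of 5.0
print(initial_weight)     # on the order of 1e-5
```

Because old contributions decay geometrically, the running estimate mostly reflects the most recent few dozen batches when $\alpha = 0.1$.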

---

##  **During Validation / Testing**

We do not use the current batch statistics. Instead:

$$
\hat{x}_i = \frac{x_i - \mu_{running}}{\sqrt{\sigma^2_{running} + \epsilon}}
$$

$$
y_i = \gamma \hat{x}_i + \beta
$$

> ❌ No updates to mean/variance  
> ✅ Use consistent stats collected during training
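Since all four quantities are fixed at evaluation time, the two formulas collapse into a single affine map $y_i = a \cdot x_i + b$ with $a = \gamma / \sqrt{\sigma^2_{running} + \epsilon}$ and $b = \beta - a \cdot \mu_{running}$. A minimal NumPy sketch, where `batchnorm_eval` and all numeric values are hypothetical:

```python
import numpy as np

def batchnorm_eval(x, mu_running, var_running, gamma, beta, eps=1e-5):
    """Eval-mode forward pass: stored statistics only, nothing is updated."""
    x_hat = (x - mu_running) / np.sqrt(var_running + eps)
    return gamma * x_hat + beta

# Hypothetical stored statistics and learned parameters.
eps = 1e-5
mu_running, var_running = 0.25, 1.025
gamma, beta = 1.5, -0.5
x = np.array([1.0, 2.0, 3.0, 4.0])

y = batchnorm_eval(x, mu_running, var_running, gamma, beta, eps)

# The same computation written as one scale and one shift.
a = gamma / np.sqrt(var_running + eps)
b = beta - a * mu_running
matches = np.allclose(y, a * x + b)
print(matches)   # True
```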

---

## 🔍 Side-by-Side Comparison

| Step               | Training Mode                              | Eval Mode                        |
|--------------------|---------------------------------------------|-----------------------------------|
| Normalization mean | Batch mean $\mu_B$                          | Running mean $\mu_{running}$     |
| Normalization var  | Batch variance $\sigma_B^2$                 | Running var $\sigma^2_{running}$ |
| Updates stats?     | ✅ Yes (for running mean/var)               | ❌ No                             |
| Behavior           | Noisy (batch-dependent)                     | Stable (global stats)            |

---

## 🧪 Example (Python-style, simplified)

Here’s a small NumPy example to simulate this behavior:

```python
import numpy as np

# Simulated inputs: batch of size 4
x_batch = np.array([1.0, 2.0, 3.0, 4.0])

# --- Training phase ---
mu_B = x_batch.mean()
var_B = x_batch.var()
epsilon = 1e-5
x_hat = (x_batch - mu_B) / np.sqrt(var_B + epsilon)

gamma = 1.0
beta = 0.0
y_batch_train = gamma * x_hat + beta

# Running statistics update
mu_running = 0.0
var_running = 1.0
momentum = 0.1
mu_running = (1 - momentum) * mu_running + momentum * mu_B
var_running = (1 - momentum) * var_running + momentum * var_B

# --- Evaluation phase ---
x_hat_eval = (x_batch - mu_running) / np.sqrt(var_running + epsilon)
y_batch_eval = gamma * x_hat_eval + beta

print("Training output:", y_batch_train)
print("Evaluation output:", y_batch_eval)
```

### Sample Output (rounded to 4 decimals):
```
Training output: [-1.3416 -0.4472  0.4472  1.3416]
Evaluation output: [0.7408 1.7285 2.7162 3.7040]
```

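You can see the **difference in output** due to different normalization stats: training uses the batch values (mean 2.5, variance 1.25), while evaluation uses the running estimates, which after a single update are still close to their initial values (mean 0.25, variance 1.025).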
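The gap shrinks as the running statistics accumulate more updates. As a sketch, the loop below repeats the update with the same batch (a simplification; real training sees a different batch each step) and reports how far the evaluation output is from the training output.

```python
import numpy as np

x_batch = np.array([1.0, 2.0, 3.0, 4.0])
eps, momentum = 1e-5, 0.1

mu_B, var_B = x_batch.mean(), x_batch.var()
y_train = (x_batch - mu_B) / np.sqrt(var_B + eps)

mu_running, var_running = 0.0, 1.0
for step in range(1, 101):
    mu_running = (1 - momentum) * mu_running + momentum * mu_B
    var_running = (1 - momentum) * var_running + momentum * var_B
    if step in (1, 10, 100):
        y_eval = (x_batch - mu_running) / np.sqrt(var_running + eps)
        gap = np.abs(y_eval - y_train).max()
        print(f"after {step:3d} updates: max |eval - train| = {gap:.4f}")
```

After 100 updates the running statistics have essentially converged to the batch statistics, so the two outputs nearly coincide.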

---

## Batch Normalization

We want the activations to stay roughly unit Gaussian (zero mean, unit variance), so we explicitly normalize them. Note that this normalizes the activations of a layer, not its weights.

BatchNorm is usually inserted after fully connected or convolutional layers, and before the nonlinearity.

You want the network to be able to control how much saturation it has, so the normalized values are scaled by $\gamma$ and shifted by $\beta$. Both of them are learnable.
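One way to see why the learnable scale and shift matter: if the network learned $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, BatchNorm would reproduce its input exactly, so normalization never removes representational power. A small NumPy check with made-up input values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
eps = 1e-5

mu_B, var_B = x.mean(), x.var()
x_hat = (x - mu_B) / np.sqrt(var_B + eps)

# gamma = 1, beta = 0 keeps the normalized (roughly unit Gaussian) values.
y_default = 1.0 * x_hat + 0.0

# These particular values of gamma and beta undo the normalization.
gamma = np.sqrt(var_B + eps)
beta = mu_B
y_identity = gamma * x_hat + beta

recovers_input = np.allclose(y_identity, x)
print(recovers_input)   # True
```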

[1](https://youtu.be/wEoyxE0GP2M?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&t=2933),
[2](https://arxiv.org/abs/1502.03167),
[3](https://www.youtube.com/watch?v=DtEq44FTPM4), [4](https://www.youtube.com/watch?v=dXB-KQYkzNU), [5](https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c)

## Batch Norm with PyTorch

https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html
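A minimal usage sketch of `nn.BatchNorm2d` in the usual conv → BatchNorm → ReLU order. The channel counts, kernel size, and input shape are arbitrary, and `bias=False` on the convolution is a common convention rather than a requirement (the BatchNorm shift $\beta$ already plays that role). For 4D input of shape `(N, C, H, W)`, statistics are computed per channel over the `N`, `H`, and `W` dimensions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(num_features=8),    # num_features = number of channels
    nn.ReLU(),
)

x = torch.randn(4, 3, 16, 16)          # (N, C, H, W)

block.train()
pre_act = block[1](block[0](x))        # conv + BatchNorm output, before the ReLU

# One mean/variance pair per channel, computed over N, H and W.
print(pre_act.mean(dim=(0, 2, 3)))     # close to 0 for each of the 8 channels
print(block[1].running_mean.shape)     # torch.Size([8])

block.eval()                           # switch to running statistics for validation/test
with torch.no_grad():
    out = block(x)
print(out.shape)                       # torch.Size([4, 8, 16, 16])
```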
