##  **Group Normalization (GroupNorm)**

* **How it works**: Divides channels into groups, then normalizes within each group.
  Example: 32 channels split into 8 groups of 4 → normalize per group.
* **Formula**: Same as BN, except that the mean and variance are computed per sample over each group's channels (and spatial positions) instead of across the batch.
* **Best for**:

  * CNNs when batch size is **small** (e.g., segmentation tasks with high-res images).
  * Often used in detection/segmentation networks (Mask R-CNN, etc.).
* **Strengths**:

  * Doesn’t depend on batch size.
  * Bridges BN and LN: statistics are computed per sample (like LN), while γ, β stay per channel (like BN).
* **Weaknesses**:

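  * Slightly slower than BN at inference: BN can use stored running statistics (and be folded into the preceding convolution), whereas GroupNorm recomputes its statistics for every input.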

**Usage in PyTorch**:
`nn.GroupNorm(num_groups, num_channels)`
Example: `nn.GroupNorm(32, 128)` → 128 channels divided into 32 groups of 4 channels each. `num_channels` must be divisible by `num_groups`.
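
To make the "doesn't depend on batch size" point concrete, here is a minimal NumPy sketch of what GroupNorm computes on an `(N, C, H, W)` tensor. The helper name `group_norm_ref` and the shapes are illustrative assumptions, not part of any library, and the learnable γ, β are left out.

```python
import numpy as np

def group_norm_ref(x, num_groups, eps=1e-5):
    """Reference GroupNorm without affine parameters (illustrative helper).

    x: array of shape (N, C, H, W); C must be divisible by num_groups.
    """
    n, c, h, w = x.shape
    assert c % num_groups == 0, "num_channels must be divisible by num_groups"
    # Gather each sample's groups: (N, G, C//G * H * W)
    grouped = x.reshape(n, num_groups, -1)
    mean = grouped.mean(axis=2, keepdims=True)
    var = grouped.var(axis=2, keepdims=True)  # biased variance (divide by count)
    normed = (grouped - mean) / np.sqrt(var + eps)
    return normed.reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32, 8, 8))              # 4 samples, 32 channels

y_batch = group_norm_ref(x, num_groups=8)       # whole batch at once
y_single = group_norm_ref(x[:1], num_groups=8)  # first sample on its own

# Statistics never mix samples, so the first sample's output does not change.
print(np.allclose(y_batch[:1], y_single))
```

Because the mean and variance are taken along the last axis of the `(N, G, -1)` view, no value from another sample ever enters a group's statistics, which is why a batch of 1 behaves the same as a batch of 4.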

---



Worked example: take a batch of 3 samples (rows), each with 4 features/channels (columns):

$$
X=
\begin{bmatrix}
5 & 4 & 7 & 2\\
1 & 6 & 2 & 3\\
4 & 8 & 1 & 9
\end{bmatrix}
$$

We’ll use **num\_groups = 2** (so each sample’s 4 features are split into 2 groups of size 2):

* Group 1: features $[0,1]$
* Group 2: features $[2,3]$

GroupNorm normalizes **per sample, per group**. For sample $i$ and group $g$ with feature index set $G_g$:

$$
\mu_{i,g}=\frac{1}{|G_g|}\sum_{c\in G_g}x_{i,c},\quad
\sigma^2_{i,g}=\frac{1}{|G_g|}\sum_{c\in G_g}(x_{i,c}-\mu_{i,g})^2,\quad
\hat x_{i,c}=\frac{x_{i,c}-\mu_{i,g}}{\sqrt{\sigma^2_{i,g}+\epsilon}}\ \text{ for } c\in G_g
$$
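
The formulas above can be sketched in plain Python for a single sample. The helper name `group_norm_row` is a hypothetical choice for illustration, and ε is set to 0 here only because every group in $X$ has non-zero variance; in practice a small positive ε is needed.

```python
import math

def group_norm_row(row, num_groups, eps=0.0):
    """Normalize one sample's features group by group (no γ, β).

    Returns (stats, normalized), where stats holds one (mean, var) per group.
    eps=0.0 is an assumption that only works when no group has zero variance.
    """
    size = len(row) // num_groups          # features per group
    stats, normalized = [], []
    for g in range(num_groups):
        group = row[g * size:(g + 1) * size]
        mean = sum(group) / size
        var = sum((v - mean) ** 2 for v in group) / size   # biased variance
        stats.append((mean, var))
        normalized += [(v - mean) / math.sqrt(var + eps) for v in group]
    return stats, normalized

X = [[5, 4, 7, 2],
     [1, 6, 2, 3],
     [4, 8, 1, 9]]

for row in X:
    stats, normalized = group_norm_row(row, num_groups=2)
    print(row, "->", stats, normalized)
```

Each printed line shows the per-group `(mean, var)` pairs followed by the normalized row, so the intermediate statistics can be compared one by one.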

### Hand calculation (ε≈0)

**Row 1:** $[5,4,7,2]$

* G1 $[5,4]$: mean $4.5$, var $0.25$ → normalized $[+1,-1]$
* G2 $[7,2]$: mean $4.5$, var $6.25$ → normalized $[+1,-1]$
  → Row1: $[1,-1,1,-1]$

**Row 2:** $[1,6,2,3]$

* G1 $[1,6]$: mean $3.5$, var $6.25$ → $[-1,+1]$
* G2 $[2,3]$: mean $2.5$, var $0.25$ → $[-1,+1]$
  → Row2: $[-1,1,-1,1]$

**Row 3:** $[4,8,1,9]$

* G1 $[4,8]$: mean $6$, var $4$ → $[-1,+1]$
* G2 $[1,9]$: mean $5$, var $16$ → $[-1,+1]$
  → Row3: $[-1,1,-1,1]$

**Result (no affine $\gamma,\beta$):**

$$
\begin{bmatrix}
 1 & -1 &  1 & -1\\
-1 &  1 & -1 &  1\\
-1 &  1 & -1 &  1
\end{bmatrix}
$$

> With groups of size 2, each pair of distinct values always normalizes to $\{-1,+1\}$ (up to tiny ε), regardless of how far apart the two values are.
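
A short derivation shows why, assuming a group holds two values $a \neq b$ and ε is negligible:

$$
\mu=\frac{a+b}{2},\quad
\sigma^2=\frac{1}{2}\left[\left(\frac{a-b}{2}\right)^2+\left(\frac{b-a}{2}\right)^2\right]=\frac{(a-b)^2}{4},\quad
\hat a=\frac{(a-b)/2}{|a-b|/2}=\operatorname{sign}(a-b),\quad
\hat b=-\hat a
$$

The magnitude of the gap cancels out, so only the ordering of the pair survives. When $a=b$ the numerator is zero and both outputs are 0.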

---

## PyTorch code (matches the math)

```python
import torch
import torch.nn as nn

# (N, C) batch with 3 samples and 4 features/channels
X = torch.tensor([[5., 4., 7., 2.],
                  [1., 6., 2., 3.],
                  [4., 8., 1., 9.]])

# GroupNorm expects (N, C, *). We'll add a dummy spatial dim and remove it after.
gn = nn.GroupNorm(num_groups=2, num_channels=4, affine=False)  # no gamma/beta to see pure norm

# Add dummy length dimension, apply GN, then squeeze back
X_gn = gn(X.unsqueeze(-1)).squeeze(-1)

print("Original X:\n", X)
print("\nGroupNorm (G=2, no affine):\n", torch.round(X_gn * 1000) / 1000)
```

### (Optional) With learnable γ, β

```python
gn_affine = nn.GroupNorm(num_groups=2, num_channels=4, affine=True)
with torch.no_grad():
    gn_affine.weight[:] = torch.tensor([1.0, 1.0, 2.0, 0.5])  # γ per channel
    gn_affine.bias[:]   = torch.tensor([0.0, 0.0, 1.0, -1.0]) # β per channel

X_gn_affine = gn_affine(X.unsqueeze(-1)).squeeze(-1)
print("\nGroupNorm (G=2, with affine γ,β):\n", torch.round(X_gn_affine * 1000) / 1000)
```
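
As a worked check, assuming the usual per-channel affine step $y_{i,c}=\gamma_c\,\hat x_{i,c}+\beta_c$ and the hand-calculated $\hat x$ from above, the affine output should come out as:

$$
\gamma=[1,\,1,\,2,\,0.5],\quad \beta=[0,\,0,\,1,\,-1]
\;\Rightarrow\;
Y=
\begin{bmatrix}
 1 & -1 &  3 & -1.5\\
-1 &  1 & -1 & -0.5\\
-1 &  1 & -1 & -0.5
\end{bmatrix}
$$

For example, the third entry of row 1 is $2\cdot(+1)+1=3$ and the fourth is $0.5\cdot(-1)-1=-1.5$.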

---

### Notes

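* **num\_groups=1** ⇒ the same normalization as LayerNorm over features (per sample).
* **num\_groups=C (here 4)** ⇒ one group per channel, which is what InstanceNorm computes: each channel is normalized over its own *spatial* dims. With vectors (no spatial extent) every group holds a single value, so the variance is zero and every output collapses to 0 before γ, β. For vector features use fewer groups than channels; reserve num\_groups=C for conv features with spatial dims.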
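
Both notes can be checked numerically on the same 3×4 matrix. This is a sketch rather than a proof: it assumes the default `eps=1e-5` on both layers, and the tolerances are loose assumptions chosen to absorb float32 rounding.

```python
import torch
import torch.nn as nn

X = torch.tensor([[5., 4., 7., 2.],
                  [1., 6., 2., 3.],
                  [4., 8., 1., 9.]])
X3 = X.unsqueeze(-1)  # (N, C, 1): dummy trailing dimension for GroupNorm

# num_groups=1: one group spanning all 4 features of a sample
gn_one = nn.GroupNorm(num_groups=1, num_channels=4, affine=False)
ln = nn.LayerNorm(4, elementwise_affine=False)
same_as_layernorm = torch.allclose(gn_one(X3).squeeze(-1), ln(X), atol=1e-5)

# num_groups=C: every group holds a single value, so x - mean is 0
gn_each = nn.GroupNorm(num_groups=4, num_channels=4, affine=False)
collapsed = gn_each(X3).squeeze(-1)
all_zero = torch.allclose(collapsed, torch.zeros_like(collapsed), atol=1e-3)

print("G=1 matches LayerNorm:", same_as_layernorm)
print("G=C collapses to zeros:", all_zero)
```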