# Day 12 — "Weight Initialization: Keeping Signal & Gradient Balanced"

Initialization keeps information flowing. With poorly scaled initial weights, activations and gradients in a deep network shrink toward zero or grow without bound as depth increases.

## 1. Core Intuition

- Forward and backward passes behave like long pipelines: every layer rescales whatever passes through it.
- Poor initialization compounds those rescalings layer by layer, so activations and gradients either vanish or blow up.
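The compounding effect can be seen in a toy model where each "layer" simply multiplies a scalar signal by a fixed gain. This is only a sketch: `propagate` is a hypothetical helper, and a real network multiplies by random matrices rather than a constant.

```python
def propagate(gain: float, depth: int) -> float:
    """Push a unit signal through `depth` layers that each scale it by `gain`."""
    signal = 1.0
    for _ in range(depth):
        signal *= gain
    return signal


depth = 50
for gain in (0.9, 1.0, 1.1):
    # A gain slightly below 1 shrinks the signal; slightly above 1 inflates it.
    print(f"gain={gain}: signal after {depth} layers = {propagate(gain, depth):.3e}")
```

A per-layer gain only 10% away from 1 already changes the signal by orders of magnitude after 50 layers, which is why the per-layer scale has to be set carefully.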

## 2. Mathematical Goal

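Maintain `Var(x_{l+1}) ≈ Var(x_l)` and `Var(δ_l) ≈ Var(δ_{l+1})`, where `x_l` is the activation and `δ_l` the backpropagated gradient at layer `l`, so that neither activations nor gradients collapse or explode with depth.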
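A short derivation shows what these two conditions demand of the weights. It is a sketch under simplifying assumptions that are not stated in the text above: a purely linear layer (the activation function is ignored), weights drawn i.i.d. with zero mean and independent of the inputs, and zero-mean inputs and gradients.

$$
x_{l+1,j} = \sum_{i=1}^{n_{\text{in}}} W_{ji}\, x_{l,i}
\quad\Longrightarrow\quad
\mathrm{Var}(x_{l+1}) = n_{\text{in}}\, \mathrm{Var}(W)\, \mathrm{Var}(x_l)
$$

$$
\delta_{l,i} = \sum_{j=1}^{n_{\text{out}}} W_{ji}\, \delta_{l+1,j}
\quad\Longrightarrow\quad
\mathrm{Var}(\delta_l) = n_{\text{out}}\, \mathrm{Var}(W)\, \mathrm{Var}(\delta_{l+1})
$$

The forward condition therefore asks for $\mathrm{Var}(W) = 1/n_{\text{in}}$ and the backward condition for $\mathrm{Var}(W) = 1/n_{\text{out}}$. Both can hold exactly only when $n_{\text{in}} = n_{\text{out}}$; otherwise a compromise between the two is needed.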

## 3. Xavier (Glorot) Initialization

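- For tanh/sigmoid nets.
- `W ∼ N(0, 2/(n_in+n_out))`, where the second argument is the variance, or the uniform variant `W ∼ U(-a, a)` with `a = sqrt(6/(n_in+n_out))`.
- Balances forward and backward variance for saturating activations, as long as they start in their near-linear regime around zero.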
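Both Xavier variants can be sampled in a few lines of NumPy. This is a minimal sketch: the function names `xavier_normal` and `xavier_uniform` are illustrative and are not taken from `days/day12/code/initialization.py`.

```python
import numpy as np


def xavier_normal(n_in: int, n_out: int, rng: np.random.Generator) -> np.ndarray:
    """Gaussian weights with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))


def xavier_uniform(n_in: int, n_out: int, rng: np.random.Generator) -> np.ndarray:
    """Uniform weights on [-a, a]; a is chosen so the variance a^2 / 3 matches."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))


rng = np.random.default_rng(0)
n_in, n_out = 256, 128
target_var = 2.0 / (n_in + n_out)

W_normal = xavier_normal(n_in, n_out, rng)
W_uniform = xavier_uniform(n_in, n_out, rng)

print(f"target variance  = {target_var:.6f}")
print(f"normal variant   = {W_normal.var():.6f}")
print(f"uniform variant  = {W_uniform.var():.6f}")
```

The two variants differ only in the shape of the distribution. Their variances agree because a uniform distribution on `[-a, a]` has variance `a²/3`.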

## 4. He (Kaiming) Initialization

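- For ReLU nets, where roughly half the activations are zero at initialization.
- `W ∼ N(0, 2/n_in)`, where the second argument is the variance, or the uniform variant `W ∼ U(-a, a)` with `a = sqrt(6/n_in)`.
- The factor of 2 compensates for the signal that ReLU zeroes out, so the forward signal stays alive through depth.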
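The effect of the factor of 2 can be checked by pushing random inputs through a stack of ReLU layers. The sketch below is not the repository's `VarianceSimulator`, so its numbers will not match that class's output. `relu_signal_power` is a hypothetical helper that tracks the mean squared activation after each layer, assuming square layers and Gaussian inputs.

```python
import numpy as np


def relu_signal_power(weight_var, depth=20, width=512, batch=256, seed=0):
    """Mean squared activation after each of `depth` linear + ReLU layers.

    `weight_var(n_in, n_out)` returns the variance used to sample the weights.
    """
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(width, batch))  # unit-variance inputs
    powers = []
    for _ in range(depth):
        std = np.sqrt(weight_var(width, width))
        W = rng.normal(0.0, std, size=(width, width))
        h = np.maximum(W @ h, 0.0)  # linear layer followed by ReLU
        powers.append(float(np.mean(h ** 2)))
    return powers


he = relu_signal_power(lambda n_in, n_out: 2.0 / n_in)
xavier = relu_signal_power(lambda n_in, n_out: 2.0 / (n_in + n_out))

print(f"He     after 20 ReLU layers: {he[-1]:.3e}")
print(f"Xavier after 20 ReLU layers: {xavier[-1]:.3e}")
```

With square layers, Xavier's variance is `1/n_in`, half of He's, so under ReLU the signal power roughly halves at every layer, while He keeps it on the order of 1.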

## 5. Python Simulation — Variance Propagation

`days/day12/code/initialization.py` provides utilities to inspect variance stability.

```python
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np


def find_repo_root(marker: str = "days") -> Path:
    """Walk upward from the working directory until a folder containing `marker` is found."""
    path = Path.cwd()
    while path != path.parent:
        if (path / marker).exists():
            return path
        path = path.parent
    raise RuntimeError("Run this notebook from inside the repository tree.")


# Make the repository importable so `days.day12...` resolves.
REPO_ROOT = find_repo_root()
if str(REPO_ROOT) not in sys.path:
    sys.path.append(str(REPO_ROOT))

from days.day12.code.initialization import VarianceSimulator

# Propagate a signal through 20 layers and report the final variance per scheme.
sim = VarianceSimulator(layers=20, width=128, seed=42)
for init in ("xavier", "he", "bad"):
    variances = sim.run(init)
    print(f"{init} final variance = {variances[-1]:.4f}")
```


```text
xavier final variance = 0.0231
he final variance = 0.2639
bad final variance = 1100063550125093781839691019976704.0000
```


## 6. Visualization — Variance Across Layers

`days/day12/code/visualizations.py` animates how Xavier/He compare to bad init.

```python
from days.day12.code.visualizations import anim_variance_evolution

# Rendering the GIF is slow, so it is skipped unless explicitly enabled.
RUN_ANIMATIONS = False

if RUN_ANIMATIONS:
    gif = anim_variance_evolution()
    print('Saved variance animation →', gif)
else:
    print('Set RUN_ANIMATIONS = True to regenerate Day 12 GIFs in days/day12/outputs/.')
```


```text
Set RUN_ANIMATIONS = True to regenerate Day 12 GIFs in days/day12/outputs/.
```


## 8. Why Initialization Matters

- Prevents vanishing/exploding gradients.
- Lets deep nets converge quickly; interacts with normalization and learning rate.
- Reduces sensitivity to bad hyperparameters.

## 9. Initialization by Architecture

| Architecture | Initialization |
| --- | --- |
| ReLU CNNs / ResNets | He (Kaiming) |
| tanh/sigmoid MLPs | Xavier |
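| Transformers | Xavier uniform (used alongside LayerNorm/RMSNorm) |
| RNNs | Orthogonal init for the recurrent weights |
| ResNets (final BN in each residual block) | Zero-gamma trick (initialize the BN scale γ to 0) |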
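The orthogonal entry for RNNs can be illustrated with NumPy. The sketch below makes two assumptions for illustration: the recurrence is purely linear (no nonlinearity or inputs), and the helper names `orthogonal_init` and `norm_after_steps` are hypothetical.

```python
import numpy as np


def orthogonal_init(n: int, rng: np.random.Generator) -> np.ndarray:
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    # Flip column signs so the result does not depend on QR's sign convention.
    return Q * np.sign(np.diag(R))


def norm_after_steps(W: np.ndarray, steps: int, rng: np.random.Generator) -> float:
    """Ratio ||W^steps v|| / ||v|| for a random vector v."""
    v = rng.normal(size=W.shape[0])
    start = np.linalg.norm(v)
    for _ in range(steps):
        v = W @ v  # one step of a linear recurrence
    return float(np.linalg.norm(v) / start)


rng = np.random.default_rng(0)
n, steps = 64, 50

W_orth = orthogonal_init(n, rng)
W_gauss = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))

orth_ratio = norm_after_steps(W_orth, steps, rng)
gauss_ratio = norm_after_steps(W_gauss, steps, rng)

print(f"orthogonal: norm ratio after {steps} steps = {orth_ratio:.6f}")
print(f"gaussian:   norm ratio after {steps} steps = {gauss_ratio:.6f}")
```

An orthogonal matrix preserves vector norms exactly, so repeated application neither shrinks nor amplifies the hidden state. A Gaussian matrix of comparable scale gives no such guarantee.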

## 10. Mini Exercises

1. Replace Xavier with small constants; plot gradient norms across 20 layers.
2. Train an MNIST classifier with constant init to observe collapse.
3. Use orthogonal init for an RNN and compare it to random Gaussian init.
4. Compare training curves: He vs Xavier for ReLU nets.
5. Visualize eigenvalues of weight matrices for different initializations.

## 11. Key Takeaways

| Point | Meaning |
| --- | --- |
| Initialization controls signal flow | Keeps activations and gradients balanced. |
| Xavier | Stable tanh/sigmoid nets. |
| He | Lets deep ReLU nets learn. |
| Poor init | Can stall or destabilize training from the first step; later fixes rarely compensate fully. |
| Init + normalization | Foundation of modern deep learning stability. |

> Initialization is the first decision that determines whether a network can learn at all.