# Day 9 — "Optimization Pathologies: Vanishing & Exploding Gradients"

Backpropagation through many layers multiplies the layers' Jacobians. Per-layer factors slightly below 1 shrink the gradient and factors slightly above 1 amplify it. The effect compounds exponentially with depth, so gradients vanish or explode.

## 1. Core Intuition

- Each layer behaves like a scaling tunnel for gradients.
- Multiplying many Jacobians compounds small deviations from a scale of 1.
- Deep stacks without mitigations (normalization, residual connections, careful initialization) are prone to gradient death or explosion.

## 2. Mathematical Reason

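In the scalar case, `g_N = a^N g_0`, where `g_0` is the starting gradient and `a` is the per-layer factor: |a|<1 → vanishing, |a|>1 → exploding. In the matrix case, `a^N` becomes a product of N layer Jacobians, and the largest singular value of that product bounds how much the gradient norm can grow. If every singular value of the product is small, the gradient vanishes in every direction.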
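To make the scalar and matrix cases concrete, here is a minimal NumPy sketch. It does not use the repository's `GradientEvolution` class. The function names `scalar_gradient` and `product_spectral_norm`, the width of 64, and the Gaussian weight scale are all assumptions chosen for illustration, and the random matrices only stand in for real layer Jacobians.

```python
import numpy as np


def scalar_gradient(a: float, n_layers: int, g0: float = 1.0) -> float:
    """Return g_N = a**N * g_0 for a constant per-layer factor a."""
    return (a ** n_layers) * g0


def product_spectral_norm(scale: float, n_layers: int, width: int = 64, seed: int = 0) -> float:
    """Largest singular value of a product of random Gaussian matrices.

    Each matrix has entries drawn from N(0, scale**2 / width). This is an
    illustrative stand-in for layer Jacobians, not a trained network.
    """
    rng = np.random.default_rng(seed)
    product = np.eye(width)
    for _ in range(n_layers):
        jacobian = rng.normal(0.0, scale / np.sqrt(width), size=(width, width))
        product = jacobian @ product
    # ord=2 on a 2-D array gives the spectral norm (largest singular value).
    return float(np.linalg.norm(product, ord=2))


# Scalar case: the same three factors used later in this notebook.
for a in (0.7, 1.0, 1.3):
    print(f"a={a}: g_30 = {scalar_gradient(a, 30):.4e}")

# Matrix case: a small scale shrinks the product, a large scale blows it up.
for scale in (0.5, 2.0):
    norm = product_spectral_norm(scale, 30)
    print(f"scale={scale}: spectral norm of 30-layer product = {norm:.4e}")
```

The spectral norm is an upper bound on amplification: for any gradient `g`, the norm of the product applied to `g` is at most that value times the norm of `g`.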

## 3. Python Simulation — Scalar Multiplication & Jacobian Norms

`days/day09/code/gradient_pathologies.py` simulates gradient evolution and random Jacobian norms.

```python
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np


def find_repo_root(marker: str = "days") -> Path:
    """Walk up from the current directory until a folder containing `marker` is found."""
    path = Path.cwd()
    while path != path.parent:
        if (path / marker).exists():
            return path
        path = path.parent
    raise RuntimeError("Run this notebook from inside the repository tree.")


# Make the repository root importable so the `days` package resolves.
REPO_ROOT = find_repo_root()
if str(REPO_ROOT) not in sys.path:
    sys.path.append(str(REPO_ROOT))

from days.day09.code.gradient_pathologies import GradientEvolution, jacobian_singular_values

# Scalar case: follow the gradient over 30 steps for three per-layer factors.
values = GradientEvolution(factors=[0.7, 1.0, 1.3], steps=30).simulate()
for a, traj in values.items():
    print(f"a={a}: final gradient = {traj[-1]:.4e}")

# Matrix case: singular values of random Jacobians.
norms = jacobian_singular_values(seed=42)
print("Average singular value (He init):", np.mean(norms))
```


## 4. Visualization — Gradient Evolution Animation

`days/day09/code/visualizations.py` animates vanishing/stable/exploding cases.

```python
from days.day09.code.visualizations import anim_gradient_evolution

RUN_ANIMATIONS = False

if RUN_ANIMATIONS:
    gif = anim_gradient_evolution()
    print('Saved animation →', gif)
else:
    print('Set RUN_ANIMATIONS = True to regenerate Day 9 GIFs in days/day09/outputs/.')
```


## 5. When & Why These Pathologies Occur

- **Vanishing**: saturating activations (sigmoid/tanh), deep plain nets without skip connections, small weights, RNNs unrolled over long sequences.
- **Exploding**: large weights, big learning rates, long RNN sequences, Jacobian singular values > 1.
- **Effects**: early layers stop learning (vanishing) or training becomes unstable (exploding).
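The sigmoid case in the list above can be checked numerically. The sigmoid derivative peaks at 0.25 (at a pre-activation of 0), so even in the best case each sigmoid layer scales the gradient by at most a quarter before weights are taken into account. The sketch below is a simplification: the helper `backprop_scale` is a hypothetical name, it ignores the weight matrices, and it assumes the same pre-activation value in every layer.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x: np.ndarray) -> np.ndarray:
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_scale(depth: int, pre_activation: float = 0.0) -> float:
    """Product of sigmoid derivatives across `depth` layers.

    Simplifying assumptions: weights are ignored and every layer sees the
    same pre-activation value.
    """
    return float(sigmoid_grad(np.array(pre_activation)) ** depth)


for depth in (5, 10, 20):
    best_case = backprop_scale(depth, pre_activation=0.0)  # derivative = 0.25
    saturated = backprop_scale(depth, pre_activation=4.0)  # deep in the flat tail
    print(f"depth={depth}: best case {best_case:.3e}, saturated {saturated:.3e}")
```

In the saturated regime the per-layer factor is far below 0.25, which is why early layers of deep sigmoid networks stop receiving a usable signal.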

## 6. Modern Solutions

1. ReLU/variants to prevent saturation.
2. BatchNorm/LayerNorm to stabilize activations.
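3. Residual connections: the block Jacobian is `J = I + J_f`, so the identity term gives gradients a direct path even when `J_f` is small.
4. Proper initialization (Xavier/He) to keep activation and gradient magnitudes roughly constant across layers, i.e. an effective per-layer gain ≈ 1.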
5. Gradient clipping for explosions.
6. LSTMs/GRUs and attention scaling in sequence models.
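Two of the remedies above, residual connections (item 3) and gradient clipping (item 5), are easy to sketch in NumPy. The names `backward_norm` and `clip_by_global_norm`, the width of 32, and the scale of 0.1 are assumptions made for illustration. The random `J_f` matrices are deliberately small: with a large `J_f`, a residual stack can still explode, so this is a sketch of the mechanism rather than a guarantee.

```python
from __future__ import annotations

import numpy as np


def backward_norm(depth: int, residual: bool, width: int = 32,
                  scale: float = 0.1, seed: int = 0) -> float:
    """Norm of a unit gradient after passing back through `depth` layers.

    Each layer Jacobian is J_f (plain) or I + J_f (residual), where J_f is a
    small random Gaussian matrix standing in for a real block Jacobian.
    """
    rng = np.random.default_rng(seed)
    grad = np.ones(width) / np.sqrt(width)  # unit-norm starting gradient
    for _ in range(depth):
        j_f = rng.normal(0.0, scale / np.sqrt(width), size=(width, width))
        jacobian = np.eye(width) + j_f if residual else j_f
        grad = jacobian.T @ grad  # backprop multiplies by the transpose
    return float(np.linalg.norm(grad))


def clip_by_global_norm(grads: list[np.ndarray],
                        max_norm: float) -> tuple[list[np.ndarray], float]:
    """Rescale all gradients so their combined L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = float(np.sqrt(sum(float(np.sum(g ** 2)) for g in grads)))
    if total_norm <= max_norm:
        return [g.copy() for g in grads], total_norm
    factor = max_norm / total_norm
    return [g * factor for g in grads], total_norm


print("plain, 20 layers:   ", f"{backward_norm(20, residual=False):.3e}")
print("residual, 20 layers:", f"{backward_norm(20, residual=True):.3e}")

clipped, before = clip_by_global_norm([np.array([3.0, 0.0]), np.array([4.0])], max_norm=1.0)
print("norm before clipping:", before, "clipped:", clipped)
```

Clipping by the global norm keeps the direction of the update and only shortens it, which is why it is commonly preferred over clipping each component independently.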


## 7. Mini Exercises

1. Simulate different factors (0.5, 0.9, 1.05, 1.5).
2. Replace ReLU with sigmoid in a toy MLP and inspect gradient norms.
3. Visualize singular value distributions for random weight matrices.
4. Implement gradient clipping in the exploding simulation.
5. Build a multi-layer linear network and measure gradient norms vs depth.


## 8. Key Takeaways

| Concept | Meaning |
| --- | --- |
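| Vanishing gradients | Jacobian products with singular values < 1 shrink gradients toward zero; early layers stop learning. |
| Exploding gradients | Jacobian products with singular values > 1 blow gradients up; training becomes unstable. |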
| Jacobian products | Root cause; gradient = chain of local linear maps. |
| Solutions | ReLU, normalization, residuals, proper init, clipping, gated RNNs. |

> Gradient flow is the bloodstream of learning—keep it within healthy ranges.