# Day 4 — "Momentum and Nesterov": The Ball with Inertia

Momentum gives our optimization marble memory: instead of zigzagging with every gradient, it carries speed through valleys, smoothing the ride toward lower loss.

## 1. Core Intuition

- Vanilla gradient descent = cautious step-by-step walking.
- Momentum = sliding downhill with inertia — it remembers consistent downhill directions.
- Nesterov momentum measures the slope at the point the velocity is about to carry it to, so it corrects overshoot one step earlier.
- Result: faster, more stable training on ill-conditioned or noisy loss surfaces.

## 2. Mathematical Story — Adding Memory

| Concept | Formula | Meaning |
| --- | --- | --- |
| Vanilla update | `θ_{t+1} = θ_t - η ∇L(θ_t)` | Move opposite the gradient. |
| Momentum velocity | `v_{t+1} = β v_t - η ∇L(θ_t)` | Exponentially decaying sum of past gradient steps (same convention as the Nesterov row). |
| Momentum update | `θ_{t+1} = θ_t + v_{t+1}` | Apply the velocity to parameters. |
| Nesterov lookahead | `v_{t+1} = β v_t - η ∇L(θ_t + β v_t)` | Measure slope after the momentum step. |
| β parameter | `β ∈ [0,1)` | Memory strength (0 = vanilla GD, 0.9 common). |
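
To make the table concrete, here is a minimal standalone sketch of the three updates. It is not the code in `momentum_methods.py`: the surface (`CURVATURE`), the helpers `grad` and `run`, and the learning rate and step count are all assumptions chosen for illustration. All three rules use the velocity convention `v ← βv − η∇L`, with the step size folded into the velocity.

```python
import numpy as np

# Hypothetical test surface: L(theta) = 0.5 * (1*x^2 + 100*y^2), an elongated bowl.
CURVATURE = np.array([1.0, 100.0])


def grad(theta: np.ndarray) -> np.ndarray:
    """Gradient of the quadratic bowl."""
    return CURVATURE * theta


def run(method: str, theta0, lr: float = 0.01, beta: float = 0.9, steps: int = 100) -> np.ndarray:
    """Return the (steps + 1, 2) path traced by 'gd', 'momentum', or 'nesterov'."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        if method == "gd":
            v = -lr * grad(theta)                       # no memory of earlier steps
        elif method == "momentum":
            v = beta * v - lr * grad(theta)             # slope at the current point
        elif method == "nesterov":
            v = beta * v - lr * grad(theta + beta * v)  # slope at the look-ahead point
        else:
            raise ValueError(f"unknown method: {method}")
        theta = theta + v
        path.append(theta.copy())
    return np.array(path)


paths = {m: run(m, [2.5, -2.0]) for m in ("gd", "momentum", "nesterov")}
for name, path in paths.items():
    print(f"{name:>9}: final distance to minimum = {np.linalg.norm(path[-1]):.4f}")
```

On this bowl the learning rate is capped by the steep `y` direction, so plain gradient descent crawls along the shallow `x` direction. The two momentum variants build up speed there, which is the effect the table describes.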

## 3. Python Implementation — Momentum vs Vanilla

Helper functions live in `days/day04/code/momentum_methods.py`.

```python
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np


def find_repo_root(marker: str = "days") -> Path:
    """Walk upward from the working directory until a folder containing `marker` is found."""
    path = Path.cwd()
    while path != path.parent:
        if (path / marker).exists():
            return path
        path = path.parent
    raise RuntimeError("Run this notebook from inside the repository tree.")


# Make the `days` package importable regardless of where the notebook was launched.
REPO_ROOT = find_repo_root()
if str(REPO_ROOT) not in sys.path:
    sys.path.append(str(REPO_ROOT))

from days.day04.code.momentum_methods import Bowl, OptimizerConfig, gradient_descent, momentum, nesterov

# Same surface, starting point, and hyperparameters for all three optimizers.
bowl = Bowl()
config = OptimizerConfig(lr=0.15, beta=0.9, steps=12)
init = [2.5, -2.0]

gd_path = gradient_descent(init, bowl, lr=config.lr, steps=config.steps)
mom_path = momentum(init, bowl, config)
nag_path = nesterov(init, bowl, config)

# Compare the last three positions of each path.
print("GD tail:", gd_path[-3:])
print("Momentum tail:", mom_path[-3:])
print("Nesterov tail:", nag_path[-3:])
```


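Momentum (and especially Nesterov) should get close to the origin in fewer steps and with less zigzagging than plain gradient descent. The velocity carries past gradients forward, so steps along a consistently downhill direction grow longer, while components that flip sign from step to step partly cancel.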
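
A quick way to see why steps grow longer: under a constant gradient, the classical velocity update `v ← βv − ηg` settles at `−ηg / (1 − β)`. The sketch below checks that numerically. The slope, step size, and the helper name `terminal_velocity_ratio` are assumptions made for illustration.

```python
def terminal_velocity_ratio(beta: float, steps: int = 2000) -> float:
    """Ratio of the settled momentum step to the plain GD step under a constant gradient."""
    lr, g = 0.1, 1.0  # hypothetical step size and constant slope
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * g  # classical momentum velocity update
    return v / (-lr * g)


for beta in (0.0, 0.5, 0.9, 0.99):
    ratio = terminal_velocity_ratio(beta)
    print(f"beta={beta:<4} step is {ratio:7.2f}x the GD step, limit 1/(1-beta) = {1 / (1 - beta):7.2f}")
```

With β = 0.9 the settled step is about ten times the plain gradient step, which is why the same learning rate feels much more aggressive once momentum is switched on.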

## 4. Visualization — The Marble with Inertia

`days/day04/code/visualizations.py` animates: (1) GD vs momentum, (2) β sweeps, (3) momentum vs Nesterov.

```python
from days.day04.code.visualizations import (
    anim_gd_vs_momentum,
    anim_momentum_beta,
    anim_momentum_vs_nesterov,
)

RUN_ANIMATIONS = False

if RUN_ANIMATIONS:
    assets = [
        anim_gd_vs_momentum(),
        anim_momentum_beta(),
        anim_momentum_vs_nesterov(),
    ]
    for asset in assets:
        print(f"Saved asset → {asset}")
else:
    print("Set RUN_ANIMATIONS = True to regenerate GIFs in days/day04/outputs/.")
```


## 5. Deep Learning & Computer Vision Connections

| Concept | Real-Life Usage |
| --- | --- |
| Momentum | Standard companion to SGD when training CNNs; damps noisy minibatch gradients. |
| Nesterov | Optional variant of SGD momentum in CV training recipes (classification, segmentation); often gives slightly smoother convergence. |
| β parameter | Typically 0.9 or 0.99; controls memory length. |
| Effect on loss | Faster convergence, less zigzag, more tolerance for ill-conditioned curvature. |
| Analogy | Heavy ball rolling through valleys, ignoring small bumps. |
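
"Memory length" in the table can be made concrete. A gradient from `n` steps ago enters the velocity with weight `β^n`, so one rough measure is how many steps it takes for that weight to fall below `1/e`. The helper `memory_horizon` is a hypothetical name, and the `1/e` threshold is one common convention among several.

```python
import math


def memory_horizon(beta: float) -> int:
    """Smallest n with beta**n <= 1/e: steps until an old gradient's weight drops to about 37%."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    n = 1
    while beta ** n > math.exp(-1):
        n += 1
    return n


for beta in (0.5, 0.9, 0.99):
    print(f"beta={beta}: weight falls below 1/e after {memory_horizon(beta)} steps")
```

The horizon scales roughly like `1 / (1 − β)`, so moving from 0.9 to 0.99 lengthens the memory about tenfold.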

## 6. Mini Exercises

1. Try β = 0, 0.5, 0.9, 0.99; note overshoot vs sluggishness.
2. Inject random noise into gradients to mimic SGD — watch momentum filter it.
3. Modify the surface to have multiple minima and study whether momentum carries the marble past shallow local minima.
4. Extend the code to animate Nesterov vs vanilla momentum with different β.
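
As a possible starting point for exercise 2, the sketch below feeds a noisy but constant gradient through an exponential moving average and compares the spread before and after. It deliberately uses the normalized form `m ← βm + (1 − β)g`, so the raw and smoothed values share one scale. The signal level, noise level, seed, and variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_GRAD = 1.0   # hypothetical constant "signal"
NOISE_STD = 1.0   # hypothetical minibatch noise level
BETA = 0.9
N_STEPS = 5000
BURN_IN = 100     # skip the warm-up while the average forgets its zero start

noisy = TRUE_GRAD + NOISE_STD * rng.standard_normal(N_STEPS)

smoothed = np.empty(N_STEPS)
m = 0.0
for t, g in enumerate(noisy):
    m = BETA * m + (1.0 - BETA) * g  # exponential moving average of the gradients
    smoothed[t] = m

raw_std = noisy[BURN_IN:].std()
ema_std = smoothed[BURN_IN:].std()
print(f"raw gradient std:      {raw_std:.3f}")
print(f"smoothed gradient std: {ema_std:.3f}")
print(f"theoretical ratio:     {np.sqrt((1 - BETA) / (1 + BETA)):.3f}")
```

For independent noise the spread shrinks by a factor of about `sqrt((1 − β) / (1 + β))` while the mean stays near the true gradient. To finish the exercise, add the same kind of noise to the gradient of the day's surface and compare the resulting paths.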

## 7. Key Takeaways

| Concept | Meaning |
| --- | --- |
| Momentum | Adds memory — move in the averaged gradient direction. |
| β (beta) | Memory strength; high β smooths updates (common 0.9). |
| Nesterov | Look ahead before pushing, leading to refined corrections. |
| Benefit | Faster and more stable convergence. |
| Analogy | A ball with inertia rolling through curved valleys. |

> Momentum helps optimization remember its direction — it stops hesitating and starts flowing.