# Concepts: Optimizers & Learning Dynamics

Understanding how neural networks find optimal weights.

This notebook covers **optimization algorithms** — the engines that power training.

---

## Why Optimizers Matter

Training a neural network means finding weights that minimize the loss function. **Optimizers** determine *how* we navigate the loss landscape.

The wrong optimizer (or wrong settings) can lead to:
- Training that never converges
- Getting stuck in poor local minima
- Painfully slow training
- Unstable, oscillating loss

---

## Gradient Descent Recap

All the optimizers in this notebook are variations of **gradient descent**:

$$w_{t+1} = w_t - \eta \cdot \nabla \mathcal{L}(w_t)$$

Where:
- $w_t$ = current weights
- $\eta$ = learning rate
- $\nabla \mathcal{L}$ = gradient of the loss

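The challenge: vanilla gradient descent applies one global learning rate to every weight and has no memory of past steps, so it zigzags across steep directions while crawling along shallow ones.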
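
To make the update rule concrete, here is a minimal sketch on a one-dimensional quadratic, $\mathcal{L}(w) = (w - 3)^2$. The loss, the helper name `gradient_descent`, and the two learning rates are illustrative assumptions, not part of the notebook's later cells.

```python
def gradient_descent(w0, lr, steps):
    """Run vanilla gradient descent on L(w) = (w - 3)**2."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # dL/dw
        w = w - lr * grad    # w_{t+1} = w_t - eta * grad
    return w

w_small = gradient_descent(w0=0.0, lr=0.1, steps=50)   # shrinks the error each step
w_large = gradient_descent(w0=0.0, lr=1.1, steps=50)   # overshoots further each step
print(f"lr=0.1: w = {w_small:.4f}")
print(f"lr=1.1: w = {w_large:.4e}")
```

With `lr=0.1` each step multiplies the distance to the minimum by 0.8, so the iterate settles near 3. With `lr=1.1` the multiplier is -1.2, so the iterate flips sides and moves further away every step.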

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Visualize the problem: a "ravine" loss surface
def loss_surface(x, y):
    return 0.1 * x**2 + y**2  # Elongated bowl

x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = loss_surface(X, Y)

fig, ax = plt.subplots(figsize=(8, 6))
contour = ax.contour(X, Y, Z, levels=20, cmap='viridis')
ax.set_xlabel('Weight 1')
ax.set_ylabel('Weight 2')
ax.set_title('Loss Surface: The "Ravine" Problem\n(Vanilla GD oscillates in the steep direction)', fontsize=12)
ax.set_aspect('equal')
plt.colorbar(contour, label='Loss')
plt.show()

---

## SGD: Stochastic Gradient Descent

Instead of computing gradients on the entire dataset, use **mini-batches**.

**Pros:**
- Faster iterations
- Gradient noise can help escape poor local minima
- Works with large datasets

**Cons:**
- Noisy updates
- Sensitive to learning rate
- Can oscillate in ravines
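
As a rough illustration of the mini-batch idea, the sketch below fits a line to synthetic data, computing each gradient from 32 samples rather than the full dataset. The data, the names `x_data` and `y_data`, and the hyperparameters are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = 2*x + 1 plus a little noise
x_data = rng.uniform(-1, 1, size=1000)
y_data = 2.0 * x_data + 1.0 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0            # parameters to learn
lr, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(len(x_data))           # reshuffle every epoch
    for start in range(0, len(x_data), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x_data[idx], y_data[idx]
        err = (w * xb + b) - yb                    # residuals on this mini-batch only
        # Gradients of the mean squared error over the mini-batch
        grad_w = 2 * np.mean(err * xb)
        grad_b = 2 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")  # should land near the true values 2 and 1
```

Each update sees only a noisy estimate of the true gradient, which is why the parameters jitter around the optimum instead of settling exactly on it.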

### SGD with Momentum

Add "velocity" to smooth out updates:

$$v_t = \beta v_{t-1} + \nabla \mathcal{L}(w_t)$$
$$w_{t+1} = w_t - \eta \cdot v_t$$

Where $\beta$ (typically 0.9) controls how much history to remember.

**Intuition:** Like a ball rolling downhill — it builds up speed in consistent directions and dampens oscillations.
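
One consequence of this formulation is worth checking numerically: if the gradient stays constant, the velocity approaches $\nabla \mathcal{L} / (1 - \beta)$, so $\beta = 0.9$ can amplify the effective step by up to 10x. The sketch below assumes a constant gradient of 1.0 purely for illustration.

```python
beta = 0.9
grad = 1.0        # assume the gradient stays constant
velocity = 0.0

for t in range(1, 51):
    velocity = beta * velocity + grad   # v_t = beta * v_{t-1} + grad
    if t in (1, 5, 10, 50):
        print(f"step {t:2d}: velocity = {velocity:.3f}")

# Limit of the geometric series 1 + beta + beta^2 + ...
limit = grad / (1 - beta)
print(f"limit: {limit:.3f}")
```

This is one reason a learning rate that works for plain SGD may need to be lowered once momentum is added.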

In [None]:
# Compare SGD vs SGD with Momentum
def gradient(x, y):
    return np.array([0.2 * x, 2 * y])  # Gradient of 0.1*x^2 + y^2

def simulate_sgd(start, lr, steps, momentum=0.0):
    pos = np.array(start, dtype=float)
    velocity = np.zeros(2)
    history = [pos.copy()]
    for _ in range(steps):
        grad = gradient(pos[0], pos[1])
        velocity = momentum * velocity + grad
        pos = pos - lr * velocity
        history.append(pos.copy())
    return np.array(history)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, (momentum, title) in zip(axes, [(0.0, 'SGD (no momentum)'), (0.9, 'SGD with Momentum (β=0.9)')]):
    ax.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.7)
    path = simulate_sgd([4, 4], lr=0.3, steps=30, momentum=momentum)
    ax.plot(path[:, 0], path[:, 1], 'ro-', markersize=4, linewidth=1.5, label='Optimization path')
    ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start')
    ax.plot(0, 0, 'r*', markersize=15, label='Optimum')
    ax.set_xlabel('Weight 1')
    ax.set_ylabel('Weight 2')
    ax.set_title(title, fontsize=12)
    ax.legend()
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)

plt.tight_layout()
plt.show()

---

## RMSprop: Adaptive Learning Rates

**Key idea:** Adapt the learning rate for each parameter based on recent gradient magnitudes.

$$s_t = \beta s_{t-1} + (1-\beta)(\nabla \mathcal{L})^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{s_t + \epsilon}} \nabla \mathcal{L}$$

**Effect:** 
- Large gradients → smaller effective learning rate
- Small gradients → larger effective learning rate

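This addresses the ravine problem: the steep direction accumulates a large $s_t$ and gets a smaller step, while the shallow direction gets a relatively larger one.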
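
A minimal sketch of the RMSprop update on the ravine surface $0.1x^2 + y^2$ used in this notebook follows. The helper names `ravine_grad` and `simulate_rmsprop` and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def ravine_grad(pos):
    """Gradient of 0.1*x^2 + y^2."""
    return np.array([0.2 * pos[0], 2.0 * pos[1]])

def simulate_rmsprop(start, lr=0.1, beta=0.9, eps=1e-7, steps=100):
    pos = np.array(start, dtype=float)
    s = np.zeros(2)                      # running average of squared gradients
    history = [pos.copy()]
    for _ in range(steps):
        g = ravine_grad(pos)
        s = beta * s + (1 - beta) * g**2
        pos = pos - lr * g / np.sqrt(s + eps)   # per-parameter step size
        history.append(pos.copy())
    return np.array(history)

path = simulate_rmsprop([4.0, 4.0])
first_step = path[0] - path[1]
print("first step per coordinate:", first_step)
print("final position:", path[-1])
```

Although the gradient along the second weight is 10x larger than along the first, the normalization makes the first step almost identical in both coordinates, so the path heads towards the optimum instead of bouncing across the ravine.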

---

## Adam: The Best of Both Worlds

**Adam** (Adaptive Moment Estimation) combines:
- Momentum (first moment)
- RMSprop (second moment)

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla \mathcal{L}$$ (momentum)
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla \mathcal{L})^2$$ (RMSprop)

With bias correction:
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

Update:
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

**Default hyperparameters (Keras):** $\eta=0.001$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-7}$ (the original Adam paper uses $\epsilon=10^{-8}$)
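
To see what bias correction buys, the sketch below applies a single Adam step to two parameters with very different gradient scales. The function `adam_step` and the gradient values are hypothetical and written only to mirror the equations above.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for step t (t starts at 1). Returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2       # second moment (RMSprop-style)
    m_hat = m / (1 - beta1**t)                  # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w0 = np.array([1.0, 1.0])
grad = np.array([0.01, 100.0])                  # wildly different gradient scales
w1, m1, v1 = adam_step(w0, grad, m=np.zeros(2), v=np.zeros(2), t=1)

print("raw m after step 1:", m1)        # only 10% of the gradient: biased towards 0
print("parameter change:  ", w0 - w1)   # close to lr for both parameters
```

Because $m_0 = v_0 = 0$, the raw moments start out far too small. Dividing by $1-\beta^t$ undoes that, and on the first step the update magnitude is roughly $\eta$ for every parameter regardless of gradient scale.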

### Why Adam is the Default

| Property | SGD | SGD+Momentum | RMSprop | Adam |
|----------|-----|--------------|---------|------|
| Adaptive LR per param | ✗ | ✗ | ✓ | ✓ |
| Momentum | ✗ | ✓ | ✗ | ✓ |
| Bias correction | N/A | N/A | ✗ | ✓ |
| Works out-of-the-box | ✗ | ~ | ~ | ✓ |

**Adam is robust:** It usually works well with default settings, making it the go-to choice for most problems.

In [None]:
# Visual comparison of optimizers
fig, ax = plt.subplots(figsize=(10, 8))

# Contour plot
ax.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.5)

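# Adam needs its own update rule: moment estimates plus bias correction
def simulate_adam(start, lr, steps, beta1=0.9, beta2=0.999, eps=1e-7):
    pos = np.array(start, dtype=float)
    m = np.zeros(2)  # first moment (momentum)
    v = np.zeros(2)  # second moment (squared gradients)
    history = [pos.copy()]
    for t in range(1, steps + 1):
        grad = gradient(pos[0], pos[1])
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        pos = pos - lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(pos.copy())
    return np.array(history)

# Each entry: (simulator, keyword arguments, line color)
optimizers = {
    'SGD': (simulate_sgd, {'lr': 0.15, 'momentum': 0.0}, 'red'),
    'SGD+Momentum': (simulate_sgd, {'lr': 0.15, 'momentum': 0.9}, 'blue'),
    'Adam': (simulate_adam, {'lr': 0.5}, 'green'),
}

for name, (simulate, kwargs, color) in optimizers.items():
    path = simulate([4, 4], steps=25, **kwargs)
    ax.plot(path[:, 0], path[:, 1], 'o-', color=color,
            markersize=3, linewidth=1.5, label=name, alpha=0.8)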

ax.plot(4, 4, 'ko', markersize=12, label='Start')
ax.plot(0, 0, 'k*', markersize=15, label='Optimum')
ax.set_xlabel('Weight 1', fontsize=12)
ax.set_ylabel('Weight 2', fontsize=12)
ax.set_title('Optimizer Comparison on a Ravine Loss Surface', fontsize=14)
ax.legend(loc='upper right')
ax.set_xlim(-2, 5)
ax.set_ylim(-2, 5)
plt.show()

---

## Learning Rate Schedules

A fixed learning rate isn't always optimal. **Schedules** adjust the learning rate during training.

### Common Schedules

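| Schedule | Formula | Use Case |
|----------|---------|----------|
| **Step decay** | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/N \rfloor}$ (multiply by $\gamma$ every $N$ epochs) | Simple, predictable |
| **Exponential** | $\eta_t = \eta_0 \cdot e^{-kt}$ | Smooth decay |
| **Cosine annealing** | $\eta_t = \tfrac{1}{2}\eta_0 \left(1 + \cos(\pi t / T)\right)$ over $T$ epochs | Smooth decay to near zero; common in modern training recipes |
| **Reduce on plateau** | No closed form: multiply by a factor when `val_loss` stalls | Adaptive, practical |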
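
Reduce on plateau is the one schedule here that depends on training feedback rather than on the epoch number alone. A simplified sketch of the logic follows. The function name and the loss values are hypothetical, and real implementations (such as the Keras callback) add options like `min_delta`, `cooldown`, and `min_lr` that are omitted here.

```python
def reduce_on_plateau(val_losses, initial_lr=0.01, factor=0.5, patience=3):
    """Return the learning rate in effect at each epoch, given validation losses."""
    lr = initial_lr
    best = float("inf")
    wait = 0                      # epochs since the last improvement
    lrs = []
    for loss in val_losses:
        lrs.append(lr)            # lr used during this epoch
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:  # stalled for `patience` epochs: cut the lr
                lr *= factor
                wait = 0
    return lrs

# Hypothetical validation losses: improve, stall, then improve again
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6]
lrs = reduce_on_plateau(losses)
print(lrs)
```

The learning rate is halved only after three consecutive epochs without a new best loss, so it stays at 0.01 for the first six epochs and drops to 0.005 afterwards.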

In [None]:
# Visualize learning rate schedules
epochs = np.arange(100)
initial_lr = 0.01

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Step decay
step_lr = initial_lr * (0.5 ** (epochs // 30))
axes[0, 0].plot(epochs, step_lr, 'b-', linewidth=2)
axes[0, 0].set_title('Step Decay\n(Halve every 30 epochs)')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Learning Rate')
axes[0, 0].grid(True, alpha=0.3)

# Exponential decay
exp_lr = initial_lr * np.exp(-0.03 * epochs)
axes[0, 1].plot(epochs, exp_lr, 'g-', linewidth=2)
axes[0, 1].set_title('Exponential Decay')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Learning Rate')
axes[0, 1].grid(True, alpha=0.3)

# Cosine annealing
cosine_lr = initial_lr * 0.5 * (1 + np.cos(np.pi * epochs / 100))
axes[1, 0].plot(epochs, cosine_lr, 'r-', linewidth=2)
axes[1, 0].set_title('Cosine Annealing')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Learning Rate')
axes[1, 0].grid(True, alpha=0.3)

# Warmup + decay
warmup_epochs = 10
warmup_lr = np.where(epochs < warmup_epochs, 
                     initial_lr * epochs / warmup_epochs,
                     initial_lr * np.exp(-0.03 * (epochs - warmup_epochs)))
axes[1, 1].plot(epochs, warmup_lr, 'm-', linewidth=2)
axes[1, 1].set_title('Warmup + Exponential Decay')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('Learning Rate')
axes[1, 1].axvline(x=warmup_epochs, color='gray', linestyle='--', alpha=0.5)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Choosing an Optimizer: Practical Guide

```
┌─────────────────────────────────────────────────────────────┐
│                   OPTIMIZER DECISION TREE                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Start with Adam (lr=0.001)                                │
│         │                                                   │
│         ▼                                                   │
│  Does it converge?                                         │
│    │         │                                              │
│   YES       NO                                              │
│    │         │                                              │
│    ▼         ▼                                              │
│  Done!    Try lower lr (0.0001)                            │
│              │                                              │
│              ▼                                              │
│         Still no? Try SGD + Momentum                       │
│         (better generalization sometimes)                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Quick Reference

| Situation | Recommendation |
|-----------|----------------|
| **Starting out** | Adam, lr=0.001 |
| **Computer vision** | SGD+Momentum often reaches better final accuracy |
| **NLP / Transformers** | Adam or AdamW |
| **Training is unstable** | Lower learning rate, add warmup |
| **Stuck on a plateau** | Try the `ReduceLROnPlateau` callback |

---

## In Keras

```python
from tensorflow import keras  # or `import keras` with standalone Keras 3

# `model`, `X_train`, and `y_train` are assumed to be defined elsewhere.

# Default Adam
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Adam with custom learning rate
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0001),
    loss='categorical_crossentropy'
)

# SGD with momentum
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss='categorical_crossentropy'
)

# Learning rate schedule with callback
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5
)
# val_loss only exists when validation data is provided
model.fit(X_train, y_train, validation_split=0.2, callbacks=[reduce_lr])
```
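
Epoch-based schedules such as warmup followed by cosine annealing can be written as a plain function of the epoch index. The sketch below is one possible version. The name `warmup_cosine` and its default values are assumptions, and the warmup starts at `base_lr / warmup_epochs` rather than zero so that the first epoch still makes progress.

```python
import math

def warmup_cosine(epoch, lr=None, base_lr=0.001, warmup_epochs=5, total_epochs=50):
    """Linear warmup followed by cosine annealing. `lr` (the current rate) is ignored."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))

# With Keras available, the function could be passed as:
#   keras.callbacks.LearningRateScheduler(warmup_cosine)
print([round(warmup_cosine(e), 6) for e in (0, 4, 5, 27, 49)])
```

The `(epoch, lr)` signature matches what `LearningRateScheduler` passes to its schedule function, so the same function can be inspected in isolation before being handed to `model.fit`.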

---

## Key Takeaways

1. **Start with Adam** — it works well out-of-the-box for most problems
2. **Learning rate is the most important hyperparameter** — tune it first
3. **Momentum helps** with noisy gradients and ravine-shaped loss surfaces
4. **Adaptive optimizers** (RMSprop, Adam) adjust learning rates per-parameter
5. **Learning rate schedules** can improve final performance
6. **SGD+Momentum** can generalize better for computer vision (but harder to tune)

**Next:** Apply these concepts in the labs. When training isn't working, come back here to diagnose optimizer issues.