[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/03-calculus/notebooks/03-gradient-descent.ipynb)

# Lesson 3: Gradient Descent — Walking Downhill

*"For twenty years I have walked this invisible landscape. Each step guided not by sight but by feel—the subtle shift of ground beneath my feet telling me: uphill, downhill, plateau. I cannot see the valley where the Tower falls, but I can feel my way toward it, one step at a time. This is optimization. This is patience. This is war."*  
— The Colonel, Year 20 of the Siege

---

## The Core Algorithm

**Gradient descent** is the workhorse of machine learning optimization. It's beautifully simple:

1. Start somewhere
2. Compute the gradient (which way is uphill?)
3. Take a step in the opposite direction (downhill)
4. Repeat until you reach a minimum

The update rule:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla L(\theta)$$

where:
- $\theta$ = parameters (the knobs we're tuning)
- $\alpha$ = learning rate (step size)
- $\nabla L$ = gradient of loss (points uphill)
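
As a quick illustration of the update rule, here is a single step worked through in code. The loss $L(\theta) = \theta^2$, the starting value, and the learning rate are made-up values chosen for easy arithmetic; they are not part of the siege data.

```python
# One gradient descent step on the illustrative loss L(theta) = theta**2.
# The function, starting point, and learning rate are assumptions for this sketch.

def grad_L(theta):
    """Derivative of L(theta) = theta**2."""
    return 2 * theta

theta_old = 4.0   # current parameter value
alpha = 0.25      # learning rate (step size)

# theta_new = theta_old - alpha * gradient
theta_new = theta_old - alpha * grad_L(theta_old)

print(f"gradient at theta_old: {grad_L(theta_old)}")   # 8.0
print(f"theta_new: {theta_new}")                        # 4.0 - 0.25 * 8.0 = 2.0
```

The gradient is positive (uphill is to the right), so the step moves $\theta$ to the left, toward the minimum at zero.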

---

## Learning Objectives

By the end of this lesson, you will:
1. Implement gradient descent from scratch
2. Understand the role of the learning rate
3. Visualize the optimization process
4. Diagnose common problems: too large/small learning rate, local minima
5. Connect gradient descent to the Colonel's siege strategy

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.animation as animation
from IPython.display import HTML

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load the siege data
siege = pd.read_csv(BASE_URL + "siege_progress.csv")
stratagem = pd.read_csv(BASE_URL + "stratagem_details.csv")
expedition = pd.read_csv(BASE_URL + "expedition_outcomes.csv")

print(f"Loaded {len(siege)} months of siege records")
print(f"Loaded {len(stratagem)} stratagem attempts")
print(f"Loaded {len(expedition)} expedition records")

## Part 1: The Simplest Gradient Descent

Let's start with a simple 1D example. We want to minimize:

$$L(x) = (x - 5)^2$$

The minimum is at $x = 5$ (where the loss is zero). The gradient is:

$$\frac{dL}{dx} = 2(x - 5)$$

Let's watch gradient descent find the minimum:

In [None]:
def loss_1d(x):
    """Simple quadratic loss function."""
    return (x - 5)**2

def gradient_1d(x):
    """Gradient of the loss function."""
    return 2 * (x - 5)

def gradient_descent_1d(start, learning_rate, num_steps):
    """
    Perform gradient descent in 1D.
    
    Returns:
        path: list of x values visited
        losses: list of loss values at each step
    """
    x = start
    path = [x]
    losses = [loss_1d(x)]
    
    for step in range(num_steps):
        grad = gradient_1d(x)
        x = x - learning_rate * grad  # The key update!
        path.append(x)
        losses.append(loss_1d(x))
    
    return path, losses

# Run gradient descent
start = 0
learning_rate = 0.1
num_steps = 30

path, losses = gradient_descent_1d(start, learning_rate, num_steps)

# Display first few steps
print("Gradient Descent Trace:")
print("=" * 60)
print(f"{'Step':>6} | {'x':>10} | {'Loss':>12} | {'Gradient':>12}")
print("-" * 60)
for i in range(min(10, len(path))):
    grad = gradient_1d(path[i])
    print(f"{i:>6} | {path[i]:>10.4f} | {losses[i]:>12.4f} | {grad:>12.4f}")
print(f"{'...':>6}")
print(f"{len(path)-1:>6} | {path[-1]:>10.4f} | {losses[-1]:>12.4f} | {gradient_1d(path[-1]):>12.4f}")

In [None]:
# Visualize the gradient descent process
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Show path on loss function
x_range = np.linspace(-2, 10, 200)
y_range = [loss_1d(x) for x in x_range]

axes[0].plot(x_range, y_range, 'b-', linewidth=2, label='Loss function')
axes[0].plot(path, losses, 'ro-', markersize=6, linewidth=1.5, label='Gradient descent path')
axes[0].plot(path[0], losses[0], 'gs', markersize=12, label='Start')
axes[0].plot(path[-1], losses[-1], 'g*', markersize=15, label='End')
axes[0].axvline(5, color='green', linestyle='--', alpha=0.5, label='True minimum')
axes[0].set_xlabel('x', fontsize=11)
axes[0].set_ylabel('Loss', fontsize=11)
axes[0].set_title('Gradient Descent on a Quadratic Loss', fontsize=12)
axes[0].legend()

# Right: Show loss over steps
axes[1].plot(losses, 'b-', linewidth=2)
axes[1].set_xlabel('Step', fontsize=11)
axes[1].set_ylabel('Loss', fontsize=11)
axes[1].set_title('Loss Over Time (Convergence Curve)', fontsize=12)
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

print(f"Started at x = {start}, ended at x = {path[-1]:.6f}")
print(f"True minimum is at x = 5")
print(f"Final loss: {losses[-1]:.10f}")

## Part 2: The Learning Rate — The Colonel's Courage

*"How bold should each step be? Too timid, and I grow old before reaching my goal. Too bold, and I overshoot, oscillating wildly, perhaps never converging. The learning rate is the measure of my courage—or my recklessness."*  
— The Colonel

The **learning rate** ($\alpha$) controls how big each step is:
- **Too small**: Slow convergence, might take forever
- **Too large**: Overshooting, oscillation, might diverge
- **Just right**: Fast and stable convergence

Let's see these regimes across five learning rates:

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.1, 0.5, 0.95, 1.05]
labels = ['Too small (0.01)', 'Good (0.1)', 'Faster (0.5)', 'Edge of stability (0.95)', 'Diverges (1.05)']
colors = ['blue', 'green', 'orange', 'red', 'purple']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Paths on loss function
x_range = np.linspace(-5, 15, 200)
y_range = [loss_1d(x) for x in x_range]
axes[0].plot(x_range, y_range, 'k-', linewidth=2, alpha=0.3, label='Loss')

# Run each learning rate and draw it on both panels
for lr, label, color in zip(learning_rates, labels, colors):
    path, losses = gradient_descent_1d(start=0, learning_rate=lr, num_steps=40)
    
    # Clip for visualization if diverging
    losses_clipped = np.clip(losses, 0, 200)
    path_clipped = np.clip(path, -5, 15)
    
    axes[0].plot(path_clipped[:20], losses_clipped[:20], 'o-', color=color, 
                markersize=4, linewidth=1.5, alpha=0.7, label=label)
    axes[1].plot(losses_clipped, color=color, linewidth=2, label=label)

axes[0].axvline(5, color='green', linestyle='--', alpha=0.5)
axes[0].set_xlabel('x', fontsize=11)
axes[0].set_ylabel('Loss', fontsize=11)
axes[0].set_title('Effect of Learning Rate on Optimization Path', fontsize=12)
axes[0].legend(fontsize=9)
axes[0].set_ylim(-5, 100)

axes[1].set_xlabel('Step', fontsize=11)
axes[1].set_ylabel('Loss', fontsize=11)
axes[1].set_title('Convergence Curves for Different Learning Rates', fontsize=12)
axes[1].legend(fontsize=9)
axes[1].set_ylim(0, 100)

plt.tight_layout()
plt.show()

print("Observations:")
print("- Learning rate 0.01: Converges, but very slowly")
print("- Learning rate 0.1: Good balance of speed and stability")
print("- Learning rate 0.5: Lands exactly on the minimum in one step for this quadratic")
print("- Learning rate 0.95: Oscillates but eventually converges")
print("- Learning rate 1.05: DIVERGES! Loss explodes instead of decreasing")
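
Why the threshold sits at 1.0 for this loss: substituting the gradient $2(x - 5)$ into the update gives $x_{\text{new}} - 5 = (1 - 2\alpha)(x_{\text{old}} - 5)$, so the distance to the minimum is multiplied by $1 - 2\alpha$ at every step. A minimal sketch that tabulates this factor for the five learning rates above (the helper names `contraction_factor` and `classify` are made up for this illustration):

```python
# For L(x) = (x - 5)**2 the update x <- x - alpha * 2 * (x - 5) rescales the
# error (x - 5) by the factor (1 - 2 * alpha) on every step.

def contraction_factor(alpha):
    """Per-step multiplier of the distance to the minimum (hypothetical helper)."""
    return 1 - 2 * alpha

def classify(alpha):
    """Describe the behaviour implied by the contraction factor."""
    f = contraction_factor(alpha)
    if abs(f) >= 1:
        return "diverges" if abs(f) > 1 else "oscillates forever"
    if f < 0:
        return "converges, overshooting each step"
    return "converges without overshooting"

results = {}
for alpha in [0.01, 0.1, 0.5, 0.95, 1.05]:
    results[alpha] = classify(alpha)
    print(f"alpha = {alpha:<5} factor = {contraction_factor(alpha):+.2f}  -> {results[alpha]}")
```

In this sketch any $0 < \alpha < 1$ converges, and $\alpha = 0.5$ makes the factor zero, which is why that run reaches the minimum in a single step. The bound is specific to this quadratic; other losses have different thresholds.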

## Part 3: Gradient Descent in 2D — The Colonel's Multi-Front War

Now let's apply gradient descent to a 2D problem, where we optimize two parameters simultaneously.

*"I do not fight on a single front. Personnel and supplies, timing and tactics—all must be optimized together. The gradient tells me how to adjust each, and I step forward on all fronts at once."*  
— The Colonel

In [None]:
def loss_2d(x, y):
    """2D loss function: bowl-shaped with minimum at (3, 2)."""
    return (x - 3)**2 + 2*(y - 2)**2

def gradient_2d(x, y):
    """Gradient of the 2D loss function."""
    dL_dx = 2 * (x - 3)
    dL_dy = 4 * (y - 2)
    return np.array([dL_dx, dL_dy])

def gradient_descent_2d(start, learning_rate, num_steps):
    """
    Perform gradient descent in 2D.
    """
    pos = np.array(start, dtype=float)
    path = [pos.copy()]
    losses = [loss_2d(pos[0], pos[1])]
    
    for step in range(num_steps):
        grad = gradient_2d(pos[0], pos[1])
        pos = pos - learning_rate * grad
        path.append(pos.copy())
        losses.append(loss_2d(pos[0], pos[1]))
    
    return np.array(path), losses

# Run gradient descent from different starting points
starts = [(-2, -1), (8, 5), (0, 6), (6, -1)]
colors = ['blue', 'red', 'green', 'purple']

# Create meshgrid for visualization
x_range = np.linspace(-3, 9, 100)
y_range = np.linspace(-2, 7, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = loss_2d(X, Y)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Contour plot with paths
contour = axes[0].contour(X, Y, Z, levels=20, cmap='viridis')
axes[0].clabel(contour, inline=True, fontsize=8)

for start, color in zip(starts, colors):
    path, losses = gradient_descent_2d(start, learning_rate=0.15, num_steps=30)
    axes[0].plot(path[:, 0], path[:, 1], 'o-', color=color, 
                markersize=4, linewidth=1.5, label=f'Start: {start}')
    axes[0].plot(path[0, 0], path[0, 1], 's', color=color, markersize=10)
    axes[0].plot(path[-1, 0], path[-1, 1], '*', color=color, markersize=15)

axes[0].plot(3, 2, 'k*', markersize=20, label='Minimum (3, 2)')
axes[0].set_xlabel('x (Personnel)', fontsize=11)
axes[0].set_ylabel('y (Supplies)', fontsize=11)
axes[0].set_title('Gradient Descent Paths in 2D', fontsize=12)
axes[0].legend(fontsize=9)

# Right: Convergence curves
for start, color in zip(starts, colors):
    path, losses = gradient_descent_2d(start, learning_rate=0.15, num_steps=30)
    axes[1].plot(losses, color=color, linewidth=2, label=f'Start: {start}')

axes[1].set_xlabel('Step', fontsize=11)
axes[1].set_ylabel('Loss', fontsize=11)
axes[1].set_title('Convergence Curves', fontsize=12)
axes[1].set_yscale('log')
axes[1].legend(fontsize=9)

plt.tight_layout()
plt.show()
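
One way to read the curved paths: the two coordinates shrink at different rates. For this loss the update multiplies the error in $x$ by $1 - 2\alpha$ and the error in $y$ by $1 - 4\alpha$. The sketch below checks that numerically for the learning rate 0.15 used above; it redefines the gradient under a separate name (`grad_demo`, chosen for this illustration) so it runs on its own.

```python
import numpy as np

# Gradient of L(x, y) = (x - 3)**2 + 2 * (y - 2)**2, repeated here so the
# snippet is self-contained.
def grad_demo(pos):
    return np.array([2 * (pos[0] - 3), 4 * (pos[1] - 2)])

alpha = 0.15
minimum = np.array([3.0, 2.0])
pos = np.array([-2.0, -1.0])           # one of the starting points used above

error_before = pos - minimum
pos = pos - alpha * grad_demo(pos)     # a single gradient descent step
error_after = pos - minimum

ratios = error_after / error_before    # per-coordinate shrink factors
print(ratios)                          # approximately [0.7, 0.4]
```

Because the $y$ error shrinks faster, the paths first close in on $y = 2$ and then drift along $x$. For this loss the $y$ direction also sets the stability limit: $|1 - 4\alpha| < 1$ requires $\alpha < 0.5$.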

## Part 4: The Colonel's Real Data — Fitting a Model

Now let's use gradient descent to fit a real model to the Colonel's stratagem data. We'll predict `progress_delta` from the stratagem features.

For linear regression, we're minimizing the **mean squared error**:

$$L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w} \cdot \mathbf{x}_i)^2$$

The gradient with respect to weights $\mathbf{w}$ is:

$$\nabla L = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w} \cdot \mathbf{x}_i) \mathbf{x}_i$$
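
A common sanity check for a formula like this is to compare it against a finite-difference estimate. The sketch below does that on small synthetic data; `X_demo`, `y_demo`, and `w_demo` are random stand-ins invented for the check, not the stratagem data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))      # synthetic features (assumption)
y_demo = rng.normal(size=20)           # synthetic targets (assumption)
w_demo = rng.normal(size=3)            # arbitrary weights to test at

def mse(w):
    return np.mean((y_demo - X_demo @ w) ** 2)

def analytic_gradient(w):
    # -(2/n) * sum_i (y_i - w . x_i) * x_i, written in matrix form
    return -2 * X_demo.T @ (y_demo - X_demo @ w) / len(y_demo)

def numeric_gradient(w, eps=1e-6):
    # Central differences: perturb one weight at a time
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (mse(w + step) - mse(w - step)) / (2 * eps)
    return grad

max_diff = np.max(np.abs(analytic_gradient(w_demo) - numeric_gradient(w_demo)))
print(f"largest difference: {max_diff:.2e}")
```

If the two gradients disagree by more than rounding error, the usual culprits are a dropped minus sign or a missing factor of $2/n$.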

In [None]:
# Prepare data for linear regression
features = ['personnel_committed', 'supply_cost', 'risk_level', 'colonel_confidence']

# Normalize features for stable gradient descent
X = stratagem[features].values
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_norm = (X - X_mean) / X_std

# Add bias term
X_norm = np.column_stack([np.ones(len(X_norm)), X_norm])

y = stratagem['progress_delta'].values

print(f"Data shape: {X_norm.shape}")
print(f"Features: ['bias'] + {features}")
print(f"Target: progress_delta (mean={y.mean():.4f}, std={y.std():.4f})")

In [None]:
def mse_loss(w, X, y):
    """Mean squared error loss."""
    predictions = X @ w
    errors = y - predictions
    return np.mean(errors**2)

def mse_gradient(w, X, y):
    """Gradient of MSE with respect to weights."""
    predictions = X @ w
    errors = y - predictions
    gradient = -2 * X.T @ errors / len(y)
    return gradient

def gradient_descent_linear(X, y, learning_rate, num_steps):
    """
    Perform gradient descent for linear regression.
    """
    # Initialize weights randomly
    w = np.random.randn(X.shape[1]) * 0.01
    
    losses = [mse_loss(w, X, y)]
    weights_history = [w.copy()]
    
    for step in range(num_steps):
        grad = mse_gradient(w, X, y)
        w = w - learning_rate * grad
        losses.append(mse_loss(w, X, y))
        weights_history.append(w.copy())
    
    return w, losses, weights_history

# Run gradient descent
learning_rate = 0.1
num_steps = 200

w_final, losses, weights_history = gradient_descent_linear(X_norm, y, learning_rate, num_steps)

print("Training Complete!")
print("=" * 50)
print(f"Initial loss: {losses[0]:.6f}")
print(f"Final loss: {losses[-1]:.6f}")
print(f"Loss reduction: {(1 - losses[-1]/losses[0])*100:.2f}%")

In [None]:
# Visualize training
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Loss curve
axes[0].plot(losses, 'b-', linewidth=2)
axes[0].set_xlabel('Step', fontsize=11)
axes[0].set_ylabel('MSE Loss', fontsize=11)
axes[0].set_title('Training Loss Over Time', fontsize=12)
axes[0].grid(True, alpha=0.3)

# Right: Weight evolution
weights_history = np.array(weights_history)
labels = ['bias'] + features
for i, label in enumerate(labels):
    axes[1].plot(weights_history[:, i], linewidth=2, label=label)

axes[1].set_xlabel('Step', fontsize=11)
axes[1].set_ylabel('Weight Value', fontsize=11)
axes[1].set_title('Weight Evolution During Training', fontsize=12)
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Display final weights
print("\nFinal Weights (The Colonel's Learned Strategy):")
print("=" * 50)
for label, weight in zip(labels, w_final):
    direction = "↑ helps" if weight > 0 else "↓ hurts"
    print(f"{label:25}: {weight:>10.4f}  ({direction})")
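
Because MSE for a linear model is convex, gradient descent should approach the ordinary least-squares solution. Here is a sketch of that cross-check on synthetic data (the data, the "true" weights, the learning rate, and the step count are assumptions for illustration), using `np.linalg.lstsq` as the reference:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
feats_demo = rng.normal(size=(n, 2))
X_demo = np.column_stack([np.ones(n), feats_demo])      # bias column + 2 features
true_w = np.array([0.5, -1.0, 2.0])                      # made-up ground truth
y_demo = X_demo @ true_w + 0.1 * rng.normal(size=n)      # noisy targets

# Batch gradient descent on MSE, using the same update as in this section
w_gd = np.zeros(3)
for _ in range(500):
    grad = -2 * X_demo.T @ (y_demo - X_demo @ w_gd) / n
    w_gd = w_gd - 0.1 * grad

# Reference: closed-form least-squares fit
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print("gradient descent:", np.round(w_gd, 4))
print("least squares:   ", np.round(w_lstsq, 4))
```

Agreement between the two is a useful signal that the loop has converged; a visible gap usually means too few steps or too small a learning rate.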

## Part 5: Batch vs Stochastic Gradient Descent

So far, we've used **batch gradient descent**—computing the gradient using *all* data points at each step.

In practice, especially with large datasets, we often use:

1. **Stochastic Gradient Descent (SGD)**: Use one random sample per step
2. **Mini-batch Gradient Descent**: Use a small random batch per step
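
A mini-batch gradient is a noisy estimate of the full-batch gradient. The sketch below, on synthetic data invented for illustration, shows two things: individual mini-batch gradients scatter around the full gradient, and when the data is split into equal-sized batches their average recovers it (up to rounding).

```python
import numpy as np

rng = np.random.default_rng(2)
n, batch_size = 64, 8
X_demo = rng.normal(size=(n, 3))       # synthetic features (assumption)
y_demo = rng.normal(size=n)            # synthetic targets (assumption)
w_demo = np.zeros(3)                   # weights at which gradients are evaluated

def mse_grad(w, X_part, y_part):
    """MSE gradient computed on the given rows only."""
    return -2 * X_part.T @ (y_part - X_part @ w) / len(y_part)

full_grad = mse_grad(w_demo, X_demo, y_demo)

# Shuffle once, then cut into equal-sized mini-batches (64 / 8 = 8 batches)
order = rng.permutation(n)
batch_grads = np.array([
    mse_grad(w_demo, X_demo[order[i:i + batch_size]], y_demo[order[i:i + batch_size]])
    for i in range(0, n, batch_size)
])

print("full-batch gradient:     ", np.round(full_grad, 4))
print("mean of batch gradients: ", np.round(batch_grads.mean(axis=0), 4))
print("spread across batches:   ", np.round(batch_grads.std(axis=0), 4))
```

The spread row is the noise that a single small-batch step carries; larger batches shrink it at the cost of more computation per update.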

*"I cannot test every stratagem against the entire history of the siege. Instead, I sample—I try one approach, observe the result, and adjust. This is noisier, but faster. The gradient I estimate is imperfect, but good enough to make progress."*  
— The Colonel

In [None]:
def sgd_linear(X, y, learning_rate, num_epochs, batch_size=1):
    """
    Stochastic/Mini-batch gradient descent for linear regression.
    """
    n_samples = len(y)
    w = np.random.randn(X.shape[1]) * 0.01
    
    losses = [mse_loss(w, X, y)]
    
    for epoch in range(num_epochs):
        # Shuffle data
        indices = np.random.permutation(n_samples)
        
        for start_idx in range(0, n_samples, batch_size):
            batch_indices = indices[start_idx:start_idx + batch_size]
            X_batch = X[batch_indices]
            y_batch = y[batch_indices]
            
            # Compute gradient on batch
            grad = mse_gradient(w, X_batch, y_batch)
            w = w - learning_rate * grad
        
        losses.append(mse_loss(w, X, y))
    
    return w, losses

# Compare batch sizes
batch_sizes = [1, 8, 32, len(y)]  # SGD, mini-batch, mini-batch, full batch
labels = ['SGD (batch=1)', 'Mini-batch (8)', 'Mini-batch (32)', 'Full Batch']
colors = ['red', 'orange', 'blue', 'green']

fig, ax = plt.subplots(figsize=(10, 6))

for batch_size, label, color in zip(batch_sizes, labels, colors):
    w, losses = sgd_linear(X_norm, y, learning_rate=0.05, 
                           num_epochs=50, batch_size=batch_size)
    ax.plot(losses, color=color, linewidth=2, label=label, alpha=0.8)

ax.set_xlabel('Epoch', fontsize=11)
ax.set_ylabel('MSE Loss', fontsize=11)
ax.set_title('Comparison of Batch Sizes\n(Noise vs Speed Trade-off)', fontsize=12)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Observations:")
print("- SGD (batch=1): Noisy updates; in non-convex problems the noise can help escape local minima (this MSE loss is convex, so it has none)")
print("- Mini-batch: Balance between noise and stability")
print("- Full batch: Smooth but may be slower on large data")

## Part 6: Common Pitfalls

### Pitfall 1: Learning Rate Too Large — Divergence

In [None]:
# Demonstrate divergence with large learning rate
fig, ax = plt.subplots(figsize=(10, 5))

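for lr in [0.01, 0.1, 0.5, 1.0]:
    w, losses, _ = gradient_descent_linear(X_norm, y, learning_rate=lr, num_steps=100)
    losses = np.asarray(losses, dtype=float)
    # Divergence raises no exception in NumPy; it shows up as a growing, inf, or nan loss
    if not np.isfinite(losses[-1]) or losses[-1] > losses[0]:
        print(f"LR = {lr} diverged!")
    # Map inf/nan to the clip ceiling so the curve still plots
    losses_clipped = np.clip(np.nan_to_num(losses, nan=1.0, posinf=1.0), 0, 1)
    ax.plot(losses_clipped, linewidth=2, label=f'LR = {lr}')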

ax.set_xlabel('Step', fontsize=11)
ax.set_ylabel('MSE Loss (clipped)', fontsize=11)
ax.set_title('Learning Rate Effect: Too Large = Divergence', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Pitfall 2: Local Minima and Saddle Points

In non-convex landscapes, gradient descent can get stuck.

In [None]:
# Non-convex loss function with local minima
def non_convex_loss(x):
    return x**4 - 4*x**2 + x + 3

def non_convex_gradient(x):
    return 4*x**3 - 8*x + 1

# Visualize
x_range = np.linspace(-2.5, 2.5, 200)
y_range = [non_convex_loss(x) for x in x_range]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Loss landscape
axes[0].plot(x_range, y_range, 'b-', linewidth=2)

# Run GD from different starting points
starts = [-2.0, -0.5, 0.5, 2.0]
colors = ['red', 'green', 'orange', 'purple']

for start, color in zip(starts, colors):
    x = start
    path = [x]
    for _ in range(50):
        grad = non_convex_gradient(x)
        x = x - 0.01 * grad
        path.append(x)
    
    y_path = [non_convex_loss(p) for p in path]
    axes[0].plot(path, y_path, 'o-', color=color, markersize=3, 
                linewidth=1.5, alpha=0.7, label=f'Start: {start}')

axes[0].set_xlabel('x', fontsize=11)
axes[0].set_ylabel('Loss', fontsize=11)
axes[0].set_title('Non-Convex Landscape: Local Minima Trap', fontsize=12)
axes[0].legend()

# Right: Gradient showing multiple zeros
grad_range = [non_convex_gradient(x) for x in x_range]
axes[1].plot(x_range, grad_range, 'r-', linewidth=2)
axes[1].axhline(0, color='black', linestyle='--')
axes[1].set_xlabel('x', fontsize=11)
axes[1].set_ylabel('Gradient', fontsize=11)
axes[1].set_title('Gradient: Multiple Zeros = Multiple Critical Points', fontsize=12)

plt.tight_layout()
plt.show()

print("Starting from x = -2 (or -0.5) leads to the global minimum near x = -1.47.")
print("Starting from x = 2 (or 0.5) gets stuck in a shallower local minimum near x = 1.35.")
print("\nIn neural networks, this is why initialization and momentum matter!")
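
The critical points of this particular loss can also be located directly, which makes it easy to check which minimum is which. A minimal sketch using `np.roots` on the derivative $4x^3 - 8x + 1$, with the second derivative $12x^2 - 8$ to classify each point (the names `loss_demo` and `second_derivative` are chosen for this illustration):

```python
import numpy as np

def loss_demo(x):
    return x**4 - 4 * x**2 + x + 3      # same function as non_convex_loss

def second_derivative(x):
    return 12 * x**2 - 8

# Roots of the derivative 4x^3 + 0x^2 - 8x + 1 are the critical points
critical_points = np.sort(np.roots([4, 0, -8, 1]).real)

for x in critical_points:
    kind = "minimum" if second_derivative(x) > 0 else "maximum"
    print(f"x = {x:+.3f}  loss = {loss_demo(x):+.3f}  ({kind})")

minima = [x for x in critical_points if second_derivative(x) > 0]
global_min = min(minima, key=loss_demo)
print(f"global minimum near x = {global_min:.3f}")
```

For this function the left minimum (near $x \approx -1.47$) has the lower loss, so it is the global one; the right minimum (near $x \approx 1.35$) is only local. The local maximum between them (near $x \approx 0.13$) is the divide that decides where a given start ends up.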

## Part 7: The Colonel's Step Size Record

The stratagem data includes the Colonel's `step_size`—how aggressive each move was. Let's analyze how step size affected outcomes.

In [None]:
# Analyze relationship between step_size and outcomes
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Step size vs progress
colors_map = {'success': 'green', 'partial': 'orange', 'failure': 'gray', 'disaster': 'red'}
for outcome in colors_map:
    mask = stratagem['outcome_category'] == outcome
    axes[0].scatter(stratagem.loc[mask, 'step_size'], 
                   stratagem.loc[mask, 'progress_delta'],
                   c=colors_map[outcome], label=outcome, alpha=0.6, s=40)

axes[0].axhline(0, color='black', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Step Size (Colonel\'s Learning Rate)', fontsize=11)
axes[0].set_ylabel('Progress Delta', fontsize=11)
axes[0].set_title('Step Size vs Outcome', fontsize=12)
axes[0].legend()

# Right: Distribution of step sizes by outcome
step_by_outcome = stratagem.groupby('outcome_category')['step_size'].agg(['mean', 'std', 'count'])
step_by_outcome = step_by_outcome.loc[['success', 'partial', 'failure', 'disaster']]

axes[1].bar(step_by_outcome.index, step_by_outcome['mean'], 
           yerr=step_by_outcome['std'], capsize=5,
           color=[colors_map[o] for o in step_by_outcome.index], alpha=0.7)
axes[1].set_xlabel('Outcome Category', fontsize=11)
axes[1].set_ylabel('Average Step Size', fontsize=11)
axes[1].set_title('Average Step Size by Outcome', fontsize=12)

plt.tight_layout()
plt.show()

print("\nStep Size Statistics by Outcome:")
print(step_by_outcome.round(3))

---

## Exercises

### Exercise 1: Implement Momentum

Momentum can help gradient descent move faster through flat or narrow regions and roll past shallow local minima. The update becomes:

$$v_t = \beta v_{t-1} + \nabla L$$
$$\theta_t = \theta_{t-1} - \alpha v_t$$

where $\beta$ is the momentum coefficient (typically 0.9).
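
One consequence of this form of the update is worth seeing numerically: under a constant gradient $g$, the velocity builds up toward $g / (1 - \beta)$, so the effective step approaches $\alpha \, g / (1 - \beta)$. A small sketch with made-up numbers (the constant gradient is an idealization for illustration):

```python
# Velocity build-up under a constant gradient (an idealized assumption).
beta = 0.9        # momentum coefficient
g = 1.0           # constant gradient, chosen for easy arithmetic

v = 0.0
velocities = []
for t in range(100):
    v = beta * v + g          # v_t = beta * v_{t-1} + gradient
    velocities.append(v)

limit = g / (1 - beta)        # limit of the geometric series

print(f"v after 1 step:    {velocities[0]:.3f}")
print(f"v after 10 steps:  {velocities[9]:.3f}")
print(f"v after 100 steps: {velocities[-1]:.3f}")
print(f"limit g/(1-beta):  {limit:.3f}")
```

With $\beta = 0.9$ the limit is about ten times the raw gradient, so momentum covers ground roughly ten times faster along directions where the gradient keeps the same sign.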

In [None]:
# Exercise 1: Implement gradient descent with momentum

def gradient_descent_momentum(X, y, learning_rate, momentum, num_steps):
    """
    Gradient descent with momentum for linear regression.
    """
    w = np.random.randn(X.shape[1]) * 0.01
    v = np.zeros_like(w)  # Velocity
    
    losses = [mse_loss(w, X, y)]
    
    for step in range(num_steps):
        grad = mse_gradient(w, X, y)
        v = momentum * v + grad  # Update velocity
        w = w - learning_rate * v  # Update weights using velocity
        losses.append(mse_loss(w, X, y))
    
    return w, losses

# Compare with and without momentum
fig, ax = plt.subplots(figsize=(10, 5))

# Without momentum
w1, losses1, _ = gradient_descent_linear(X_norm, y, learning_rate=0.05, num_steps=100)
ax.plot(losses1, 'b-', linewidth=2, label='No momentum')

# With momentum
w2, losses2 = gradient_descent_momentum(X_norm, y, learning_rate=0.05, momentum=0.9, num_steps=100)
ax.plot(losses2, 'r-', linewidth=2, label='With momentum (0.9)')

ax.set_xlabel('Step', fontsize=11)
ax.set_ylabel('MSE Loss', fontsize=11)
ax.set_title('Effect of Momentum on Convergence', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Without momentum - Final loss: {losses1[-1]:.6f}")
print(f"With momentum - Final loss: {losses2[-1]:.6f}")

### Exercise 2: Learning Rate Schedule

Sometimes it helps to decrease the learning rate over time. Implement a learning rate schedule that starts large and decays.
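
For reference, an exponential schedule $\alpha_t = \alpha_0 e^{-kt}$ halves the learning rate every $\ln 2 / k$ steps. The sketch below evaluates such a schedule; the symbols $\alpha_0$ and $k$, the helper `lr_at`, and the sample values are illustrative choices.

```python
import numpy as np

alpha_0 = 0.3     # initial learning rate (illustrative value)
k = 0.03          # decay rate (illustrative value)

def lr_at(step):
    """Exponentially decayed learning rate at a given step."""
    return alpha_0 * np.exp(-k * step)

half_life = np.log(2) / k          # steps needed for the rate to halve

for step in [0, 25, 50, 99]:
    print(f"step {step:>3}: lr = {lr_at(step):.4f}")
print(f"half-life: {half_life:.1f} steps")
```

Thinking in half-lives makes the decay rate easier to choose: pick how many steps the early, bold phase should last and set $k$ accordingly.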

In [None]:
# Exercise 2: Learning rate scheduling

def gradient_descent_lr_decay(X, y, initial_lr, decay_rate, num_steps):
    """
    Gradient descent with exponential learning rate decay.
    lr(t) = initial_lr * exp(-decay_rate * t)
    """
    w = np.random.randn(X.shape[1]) * 0.01
    
    losses = [mse_loss(w, X, y)]
    learning_rates = []  # filled in the loop, one entry per step (avoids logging step 0 twice)
    
    for step in range(num_steps):
        # Decay learning rate
        lr = initial_lr * np.exp(-decay_rate * step)
        learning_rates.append(lr)
        
        grad = mse_gradient(w, X, y)
        w = w - lr * grad
        losses.append(mse_loss(w, X, y))
    
    return w, losses, learning_rates

# Compare constant vs decaying learning rate
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Constant LR
w1, losses1, _ = gradient_descent_linear(X_norm, y, learning_rate=0.1, num_steps=100)

# Decaying LR
w2, losses2, lrs = gradient_descent_lr_decay(X_norm, y, initial_lr=0.3, decay_rate=0.03, num_steps=100)

axes[0].plot(losses1, 'b-', linewidth=2, label='Constant LR = 0.1')
axes[0].plot(losses2, 'r-', linewidth=2, label='Decaying LR (0.3 → 0.015)')
axes[0].set_xlabel('Step', fontsize=11)
axes[0].set_ylabel('MSE Loss', fontsize=11)
axes[0].set_title('Learning Rate Schedule Effect', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(lrs, 'r-', linewidth=2)
axes[1].set_xlabel('Step', fontsize=11)
axes[1].set_ylabel('Learning Rate', fontsize=11)
axes[1].set_title('Decaying Learning Rate Schedule', fontsize=12)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Exercise 3: Apply GD to Expedition Data

Fit a linear model to predict expedition `profit_loss` from the features in the expedition data.

In [None]:
# Exercise 3: Expedition data

# Prepare expedition features
exp_features = ['party_size', 'initial_supply_score', 'risk_tolerance', 'weather_stability']

X_exp = expedition[exp_features].values
X_exp_mean = X_exp.mean(axis=0)
X_exp_std = X_exp.std(axis=0)
X_exp_norm = (X_exp - X_exp_mean) / X_exp_std
X_exp_norm = np.column_stack([np.ones(len(X_exp_norm)), X_exp_norm])

y_exp = expedition['profit_loss'].values

# Train model
w_exp, losses_exp, _ = gradient_descent_linear(X_exp_norm, y_exp, 
                                                learning_rate=0.1, num_steps=200)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(losses_exp, 'b-', linewidth=2)
axes[0].set_xlabel('Step', fontsize=11)
axes[0].set_ylabel('MSE Loss', fontsize=11)
axes[0].set_title('Training Loss for Expedition Profit Prediction', fontsize=12)
axes[0].grid(True, alpha=0.3)

# Predictions vs actual
predictions = X_exp_norm @ w_exp
axes[1].scatter(y_exp, predictions, alpha=0.5, s=30)
axes[1].plot([y_exp.min(), y_exp.max()], [y_exp.min(), y_exp.max()], 'r--', linewidth=2)
axes[1].set_xlabel('Actual Profit/Loss', fontsize=11)
axes[1].set_ylabel('Predicted Profit/Loss', fontsize=11)
axes[1].set_title('Predictions vs Actual', fontsize=12)

plt.tight_layout()
plt.show()

# Show coefficients
print("\nLearned Coefficients:")
for label, weight in zip(['bias'] + exp_features, w_exp):
    print(f"{label:25}: {weight:>10.4f}")

### Exercise 4: Gradient Descent Visualization Animation

Visualize gradient descent on a 2D surface step by step. The cell below uses static snapshots at selected steps in place of a true animation.

In [None]:
# Exercise 4: Animated gradient descent (static frames for notebook)

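# Generate path
path, losses = gradient_descent_2d(start=(-2, 5), learning_rate=0.1, num_steps=40)

# Rebuild the contour grid: X was reassigned to the stratagem feature matrix in Part 4,
# so the Part 3 meshgrid is no longer available under that name
grid_x, grid_y = np.meshgrid(np.linspace(-3, 9, 100), np.linspace(-2, 7, 100))
grid_z = loss_2d(grid_x, grid_y)

# Create multi-frame visualization
fig, axes = plt.subplots(2, 3, figsize=(14, 9))
axes = axes.flatten()

steps_to_show = [0, 5, 10, 20, 30, 40]

for ax, step in zip(axes, steps_to_show):
    # Plot contours
    contour = ax.contour(grid_x, grid_y, grid_z, levels=15, cmap='viridis', alpha=0.5)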
    
    # Plot path up to this step
    ax.plot(path[:step+1, 0], path[:step+1, 1], 'ro-', markersize=4, linewidth=1.5)
    
    # Mark current position
    ax.plot(path[step, 0], path[step, 1], 'r*', markersize=15)
    
    # Mark minimum
    ax.plot(3, 2, 'g*', markersize=15)
    
    ax.set_title(f'Step {step}\nLoss: {losses[step]:.4f}', fontsize=11)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_xlim(-3, 9)
    ax.set_ylim(-2, 7)

plt.tight_layout()
plt.show()

---

## Summary

| Concept | Key Insight | Colonel's Siege Example |
|---------|-------------|------------------------|
| **Gradient Descent** | Iteratively move opposite to gradient | Take steps toward the Tower based on what reduces loss |
| **Learning Rate** | Step size—too small = slow, too large = unstable | How bold is each strategic adjustment? |
| **Convergence** | Loss decreases over iterations | Progress accumulates over months and years |
| **Local Minima** | Getting stuck in suboptimal solutions | Settling for partial success when better exists |
| **SGD** | Use random samples for faster, noisier updates | Test one stratagem, adjust, repeat |
| **Momentum** | Accumulate velocity to escape local minima | Build on past successes; don't overcorrect |

---

## Key Takeaways

1. **Gradient descent is simple**: Move opposite to gradient, repeat.

2. **Learning rate is crucial**: Too small = slow; too large = unstable.

3. **SGD trades gradient accuracy for speed**: Small batches give noisier but cheaper, more frequent updates.

4. **Local minima are real**: Initialization determines which minimum you reach; momentum can help escape shallow ones.

5. **The algorithm is the same everywhere**: From linear regression to deep learning, it's all gradient descent.

---

## Next Lesson

In **Lesson 4: The Chain Rule and Backpropagation**, we'll see how gradients are computed efficiently for complex, nested functions—like neural networks. This is the magic that makes deep learning possible.

*"I understand now how to walk downhill. But the landscape is not simple—it is layers upon layers, each stratagem affecting the next in ways I cannot directly observe. To trace the sensitivity through this chain of causation, I need the chain rule."*  
— The Colonel