# Part 1.2: Calculus for Deep Learning

Calculus is essential for understanding how neural networks learn. The key insight: **learning = optimization**, and optimization requires derivatives.

## Learning Objectives
- [ ] Compute partial derivatives of multivariate functions
- [ ] Apply the chain rule to composite functions
- [ ] Understand gradients as directions of steepest ascent
- [ ] Implement gradient descent from scratch

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. Derivatives: The Core Concept

The **derivative** measures the rate of change of a function:

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

**Intuition**: The derivative tells you how much the output changes when you slightly change the input.

### Why This Matters for Deep Learning

If `loss = f(weights)`, then the derivative tells us:
- **Which direction** to change weights to reduce loss
- **How much** each weight affects the loss
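
As a small illustration, assume a hypothetical one-weight model `prediction = w * x` with a squared-error loss (the model, the values of `x` and `target`, and the function names are illustrative and not defined elsewhere in this notebook). The sign of the derivative suggests which way to move `w`, and its magnitude suggests how sensitive the loss is to `w`:

```python
# Hypothetical one-weight model: prediction = w * x, loss = (prediction - target)^2
x, target = 2.0, 6.0  # illustrative values; the loss is minimized at w = 3

def loss(w):
    return (w * x - target) ** 2

def dloss_dw(w):
    # Derivative of the loss with respect to w: 2 * (w*x - target) * x
    return 2 * (w * x - target) * x

for w in [1.0, 3.0, 5.0]:
    g = dloss_dw(w)
    if g < 0:
        advice = "increase w to reduce the loss"
    elif g > 0:
        advice = "decrease w to reduce the loss"
    else:
        advice = "already at the minimum"
    print(f"w = {w}: loss = {loss(w):.1f}, dloss/dw = {g:+.1f} -> {advice}")
```

A negative derivative means the loss falls as `w` grows, so the weight should increase; a positive derivative means the opposite.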

### Deep Dive: What Does a Derivative Really Mean?

The derivative answers a fundamental question: **"If I wiggle the input a tiny bit, how much does the output wiggle?"**

Think of it like this:
- You're turning a dial (input x)
- A meter responds (output f(x))
- The derivative tells you: "For each unit I turn the dial, how many units does the meter move?"

#### The Formal Definition Unpacked

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

| Component | Meaning |
|-----------|---------|
| $f(x + h)$ | Output after nudging input by tiny amount h |
| $f(x)$ | Original output |
| $f(x + h) - f(x)$ | Change in output (how much the meter moved) |
| $h$ | Change in input (how much we turned the dial) |
| $\frac{f(x+h) - f(x)}{h}$ | **Rate of change** = output change per unit input change |
| $\lim_{h \to 0}$ | Take h infinitesimally small (instantaneous rate) |
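
To make the limit concrete, here is a minimal sketch that evaluates the difference quotient from the table for shrinking values of `h`. The choice of $f(x) = x^2$, the point $x = 3$, and the list of step sizes are illustrative assumptions:

```python
# Forward difference quotient (f(x + h) - f(x)) / h for f(x) = x^2 at x = 3.
# The exact derivative is 2x = 6; the quotient approaches it as h shrinks.
def f(x):
    return x ** 2

x0 = 3.0
exact = 2 * x0

for h in [1.0, 0.1, 0.01, 0.001]:
    quotient = (f(x0 + h) - f(x0)) / h
    print(f"h = {h:<6} quotient = {quotient:.6f}  error = {abs(quotient - exact):.6f}")
```

For this particular function the quotient works out to $2x + h$, so the error is simply $h$ and shrinks in step with it.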

#### Why Do We Care About Rate of Change?

| Context | What the Derivative Tells Us |
|---------|------------------------------|
| **Physics** | Velocity = derivative of position. "How fast am I moving right now?" |
| **Economics** | Marginal cost = derivative of total cost. "Cost of making one more unit?" |
| **Machine Learning** | Gradient of loss = derivative of loss w.r.t. weights. "How does changing this weight affect the error?" |

In ML specifically: **Derivatives tell us which direction to adjust weights to reduce error.**

In [None]:
# Numerical derivative approximation
def numerical_derivative(f, x, h=1e-5):
    """Approximate f'(x) with a central (two-sided) finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = x^2
f = lambda x: x**2
f_prime_analytical = lambda x: 2*x  # We know this

x = 3.0
print(f"f(x) = x² at x = {x}")
print(f"Numerical derivative: {numerical_derivative(f, x):.6f}")
print(f"Analytical derivative: {f_prime_analytical(x):.6f}")

In [None]:
# Visualization: Tangent lines at multiple points
# Shows how the derivative (slope) changes across the function

def plot_multiple_tangents(f, f_prime, x_range, points, title):
    """
    Plot a function with tangent lines at multiple points.
    This visualizes how the derivative changes across the function.
    """
    x = np.linspace(x_range[0], x_range[1], 200)
    y = f(x)
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Left plot: Function with tangent lines
    axes[0].plot(x, y, 'b-', linewidth=2, label='f(x)')
    
    colors = plt.cm.Reds(np.linspace(0.3, 0.9, len(points)))
    
    for x0, color in zip(points, colors):
        y0 = f(x0)
        slope = f_prime(x0)
        
        # Tangent line: y = f(x0) + f'(x0)(x - x0)
        x_tangent = np.linspace(x0 - 1.5, x0 + 1.5, 50)
        y_tangent = y0 + slope * (x_tangent - x0)
        
        axes[0].plot(x_tangent, y_tangent, '--', color=color, linewidth=1.5, alpha=0.8)
        axes[0].scatter([x0], [y0], color=color, s=80, zorder=5)
        axes[0].annotate(f'slope={slope:.2f}', xy=(x0, y0), xytext=(x0+0.3, y0+0.5),
                        fontsize=9, color=color)
    
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('f(x)')
    axes[0].set_title(f'{title}\nTangent lines show instantaneous slope at each point')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    axes[0].set_ylim([min(y) - 1, max(y) + 2])
    
    # Right plot: The derivative function itself
    y_prime = f_prime(x)
    axes[1].plot(x, y_prime, 'r-', linewidth=2, label="f'(x) (derivative)")
    axes[1].axhline(y=0, color='k', linewidth=0.5)
    
    for x0, color in zip(points, colors):
        slope = f_prime(x0)
        axes[1].scatter([x0], [slope], color=color, s=80, zorder=5)
    
    axes[1].set_xlabel('x')
    axes[1].set_ylabel("f'(x)")
    axes[1].set_title("The Derivative Function\nShows the slope at every point")
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Example 1: f(x) = x^2 (parabola)
f = lambda x: x**2
f_prime = lambda x: 2*x
plot_multiple_tangents(f, f_prime, x_range=(-3, 3), 
                       points=[-2, -1, 0, 1, 2], 
                       title='f(x) = x²')

print("Key observations for f(x) = x²:")
print("- At x=0: slope is 0 (bottom of the parabola - minimum!)")
print("- Negative x: slope is negative (function decreasing)")
print("- Positive x: slope is positive (function increasing)")
print("- Slope magnitude increases as we move away from 0")

In [None]:
# Visualize derivative as slope of tangent line
def plot_tangent(f, f_prime, x0, title):
    """Plot function with tangent line at x0."""
    x = np.linspace(x0 - 2, x0 + 2, 100)
    y = f(x)
    
    # Tangent line: y = f(x0) + f'(x0)(x - x0)
    slope = f_prime(x0)
    tangent = f(x0) + slope * (x - x0)
    
    plt.figure(figsize=(10, 6))
    plt.plot(x, y, 'b-', linewidth=2, label='f(x)')
    plt.plot(x, tangent, 'r--', linewidth=2, label=f'Tangent (slope = {slope:.2f})')
    plt.scatter([x0], [f(x0)], color='red', s=100, zorder=5)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# f(x) = x² at different points
f = lambda x: x**2
f_prime = lambda x: 2*x

plot_tangent(f, f_prime, 1.5, "f(x) = x² with tangent at x = 1.5")

In [None]:
# Example 2: A more complex function - sine wave
f = lambda x: np.sin(x)
f_prime = lambda x: np.cos(x)
plot_multiple_tangents(f, f_prime, x_range=(-2*np.pi, 2*np.pi), 
                       points=[-np.pi, -np.pi/2, 0, np.pi/2, np.pi], 
                       title='f(x) = sin(x)')

print("\nKey observations for f(x) = sin(x):")
print("- At peaks/troughs (x = ±π/2): slope is 0 (maxima/minima)")
print("- At zero crossings (x = 0, ±π): slope is ±1 (steepest)")
print("- The derivative of sin(x) is cos(x) - shifted by π/2!")

### Common Derivatives

| Function | Derivative |
|----------|------------|
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln(x)$ | $1/x$ |
| $\sin(x)$ | $\cos(x)$ |
| $\cos(x)$ | $-\sin(x)$ |
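
The entries in this table can be spot-checked numerically. The sketch below assumes a helper named `central_difference` (a hypothetical name), picks $n = 3$ for the power rule, and uses an arbitrary positive test point so that $\ln(x)$ is defined:

```python
import numpy as np

def central_difference(f, x, h=1e-5):
    # Symmetric finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# (name, function, derivative from the table); n = 3 is an arbitrary choice for x^n
rules = [
    ("x^3",    lambda x: x ** 3, lambda x: 3 * x ** 2),
    ("exp(x)", np.exp,           np.exp),
    ("ln(x)",  np.log,           lambda x: 1 / x),
    ("sin(x)", np.sin,           np.cos),
    ("cos(x)", np.cos,           lambda x: -np.sin(x)),
]

x0 = 1.3  # arbitrary test point; must be positive so ln(x) is defined
for name, func, deriv in rules:
    approx = central_difference(func, x0)
    print(f"{name:>7}: numerical = {approx:.6f}, table = {deriv(x0):.6f}")
```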

### Activation Functions and Their Derivatives

These are critical for backpropagation!

In [None]:
# Sigmoid and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # Nice property!

# ReLU and its derivative
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
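    # np.asarray lets this accept Python scalars as well as arrays; the slope at x = 0 is taken as 0
    return (np.asarray(x) > 0).astype(float)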

# Tanh and its derivative
def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

# Plot them all
x = np.linspace(-5, 5, 200)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

activations = [
    (sigmoid, sigmoid_derivative, 'Sigmoid'),
    (relu, relu_derivative, 'ReLU'),
    (tanh, tanh_derivative, 'Tanh')
]

for i, (func, deriv, name) in enumerate(activations):
    # Function
    axes[0, i].plot(x, func(x), 'b-', linewidth=2)
    axes[0, i].set_title(f'{name}')
    axes[0, i].set_xlabel('x')
    axes[0, i].axhline(y=0, color='k', linewidth=0.5)
    axes[0, i].axvline(x=0, color='k', linewidth=0.5)
    axes[0, i].grid(True, alpha=0.3)
    
    # Derivative
    axes[1, i].plot(x, deriv(x), 'r-', linewidth=2)
    axes[1, i].set_title(f'{name} Derivative')
    axes[1, i].set_xlabel('x')
    axes[1, i].axhline(y=0, color='k', linewidth=0.5)
    axes[1, i].axvline(x=0, color='k', linewidth=0.5)
    axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key observations:")
print("- Sigmoid derivative max is 0.25 (contributes to vanishing gradients when many layers are chained)")
print("- ReLU derivative is 0 or 1 (no vanishing gradient for positive x)")
print("- Tanh derivative max is 1 (better than sigmoid)")

---

## 2. Partial Derivatives

For functions of multiple variables, **partial derivatives** measure the rate of change with respect to one variable while holding others constant.

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}$$

### Example

For $f(x, y) = x^2 + 3xy + y^2$:

$$\frac{\partial f}{\partial x} = 2x + 3y$$
$$\frac{\partial f}{\partial y} = 3x + 2y$$

### Deep Dive: Why Does the Gradient Point "Uphill"?

This is one of the most important insights in optimization. Let's build intuition for WHY the gradient points in the direction of steepest increase.

#### The Gradient as a "Which Way is Up?" Detector

Imagine you're standing on a hilly surface (the function) and want to find the steepest uphill direction. The gradient is like a compass that always points uphill.

**Mathematical Intuition:**

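The gradient $\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$, defined formally in Section 3, gives the direction in which the function increases **most rapidly** at a point.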

Think about it:
- $\frac{\partial f}{\partial x}$ tells you: "If I move in the x-direction, how fast does f increase?"
- $\frac{\partial f}{\partial y}$ tells you: "If I move in the y-direction, how fast does f increase?"
- The gradient combines these: "The optimal uphill direction is a blend of these, weighted by how steep each direction is"

#### Directional Derivatives: Movement in Any Direction

If you move in direction $\mathbf{u}$ (a unit vector), the rate of change is:

$$\frac{\partial f}{\partial \mathbf{u}} = \nabla f \cdot \mathbf{u} = |\nabla f| \cos(\theta)$$

Where $\theta$ is the angle between gradient and movement direction.

| Direction relative to gradient | $\cos(\theta)$ | Rate of change |
|-------------------------------|----------------|----------------|
| Same as gradient ($\theta = 0^\circ$) | 1 | Maximum increase |
| Perpendicular ($\theta = 90^\circ$) | 0 | No change (contour line) |
| Opposite ($\theta = 180^\circ$) | -1 | Maximum decrease |

**This is why gradient descent works!** Moving opposite to the gradient gives the fastest local rate of decrease.
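
The three rows of the table can be checked numerically. This sketch reuses the example function $f(x, y) = x^2 + 3xy + y^2$ at the point $(2, 3)$; the rotation approach, the step size `h`, and the variable names are assumptions made for illustration:

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 + 3 * x * y + y ** 2

def grad_f(p):
    x, y = p
    return np.array([2 * x + 3 * y, 3 * x + 2 * y])

point = np.array([2.0, 3.0])
grad = grad_f(point)                      # [13., 12.]
g_hat = grad / np.linalg.norm(grad)       # unit vector along the gradient
h = 1e-6

results = []
for degrees in [0, 90, 180]:
    # Rotate the unit gradient direction by the chosen angle to get u
    theta = np.radians(degrees)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    u = rotation @ g_hat
    predicted = grad @ u                                          # grad f . u
    measured = (f(point + h * u) - f(point - h * u)) / (2 * h)    # finite difference along u
    results.append((degrees, predicted, measured))
    print(f"theta = {degrees:>3} deg: grad.u = {predicted:+.4f}, measured = {measured:+.4f}")
```

The predicted and measured rates should agree closely, with the largest value along the gradient, roughly zero perpendicular to it, and the most negative value opposite to it.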

In [None]:
# Numerical partial derivatives
def partial_derivative(f, point, var_index, h=1e-5):
    """
    Compute partial derivative of f at point with respect to variable var_index.
    
    Args:
        f: Function taking array of variables
        point: Array of variable values
        var_index: Which variable to differentiate
        h: Step size
    """
    point = np.array(point, dtype=float)
    point_plus = point.copy()
    point_minus = point.copy()
    point_plus[var_index] += h
    point_minus[var_index] -= h
    return (f(point_plus) - f(point_minus)) / (2 * h)

# f(x, y) = x² + 3xy + y²
def f(p):
    x, y = p
    return x**2 + 3*x*y + y**2

# Analytical partial derivatives
def df_dx(x, y):
    return 2*x + 3*y

def df_dy(x, y):
    return 3*x + 2*y

# Test at point (2, 3)
point = [2, 3]
print(f"At point {point}:")
print(f"∂f/∂x numerical: {partial_derivative(f, point, 0):.6f}")
print(f"∂f/∂x analytical: {df_dx(*point):.6f}")
print(f"∂f/∂y numerical: {partial_derivative(f, point, 1):.6f}")
print(f"∂f/∂y analytical: {df_dy(*point):.6f}")

In [None]:
# Enhanced gradient field visualization
# Shows gradient as arrows pointing uphill, with different movement directions

def visualize_gradient_directions():
    """
    Visualization showing:
    1. Gradient field (arrows pointing uphill)
    2. How rate of change varies with direction
    3. Why opposite-to-gradient is the best descent direction
    """
    
    fig = plt.figure(figsize=(16, 5))
    
    # Define function: f(x,y) = x² + 0.5*y² (elliptical paraboloid)
    def f(x, y):
        return x**2 + 0.5*y**2
    
    def grad_f(x, y):
        return np.array([2*x, y])
    
    # Create grid
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    Z = f(X, Y)
    
    # Plot 1: 3D surface
    ax1 = fig.add_subplot(131, projection='3d')
    ax1.plot_surface(X, Y, Z, cmap=cm.viridis, alpha=0.7)
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.set_zlabel('f(x,y)')
    ax1.set_title('f(x,y) = x² + 0.5y²\n(Bowl-shaped surface)')
    
    # Plot 2: Contour with gradient field
    ax2 = fig.add_subplot(132)
    contour = ax2.contour(X, Y, Z, levels=15, cmap=cm.viridis)
    ax2.clabel(contour, inline=True, fontsize=8)
    
    # Sparse grid for gradient arrows
    x_sparse = np.linspace(-2.5, 2.5, 8)
    y_sparse = np.linspace(-2.5, 2.5, 8)
    X_s, Y_s = np.meshgrid(x_sparse, y_sparse)
    
    U = 2 * X_s  # ∂f/∂x
    V = Y_s      # ∂f/∂y
    
    # Normalize for visualization
    mag = np.sqrt(U**2 + V**2) + 1e-10
    U_norm = U / mag * 0.4
    V_norm = V / mag * 0.4
    
    ax2.quiver(X_s, Y_s, U_norm, V_norm, mag, cmap=cm.Reds, alpha=0.8)
    ax2.set_xlabel('x')
    ax2.set_ylabel('y')
    ax2.set_title('Gradient Field\nArrows point UPHILL')
    ax2.set_aspect('equal')
    ax2.plot([0], [0], 'k*', markersize=15, label='Minimum')
    ax2.legend()
    
    # Plot 3: Directional derivative at a specific point
    ax3 = fig.add_subplot(133)
    
    # Pick a point
    px, py = 2.0, 1.0
    grad = grad_f(px, py)
    grad_mag = np.linalg.norm(grad)
    
    # Compute directional derivative for all directions
    angles = np.linspace(0, 2*np.pi, 100)
    dir_derivs = []
    for theta in angles:
        direction = np.array([np.cos(theta), np.sin(theta)])
        dir_deriv = np.dot(grad, direction)
        dir_derivs.append(dir_deriv)
    
    # Plot directional derivative vs angle
    ax3.plot(np.degrees(angles), dir_derivs, 'b-', linewidth=2)
    ax3.axhline(y=0, color='k', linewidth=0.5)
    ax3.axhline(y=grad_mag, color='g', linestyle='--', label=f'Max = |grad| = {grad_mag:.2f}')
    ax3.axhline(y=-grad_mag, color='r', linestyle='--', label=f'Min = -|grad| = {-grad_mag:.2f}')
    
    # Mark special directions
    grad_angle = np.degrees(np.arctan2(grad[1], grad[0]))
    ax3.axvline(x=grad_angle, color='g', alpha=0.5)
    ax3.axvline(x=grad_angle + 180, color='r', alpha=0.5)
    
    ax3.set_xlabel('Direction (degrees)')
    ax3.set_ylabel('Rate of change')
    ax3.set_title(f'Directional Derivative at ({px}, {py})\nRate of change vs movement direction')
    ax3.legend(loc='lower right')
    ax3.set_xticks([0, 90, 180, 270, 360])
    ax3.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"At point ({px}, {py}):")
    print(f"  Gradient = {grad}")
    print(f"  Gradient magnitude = {grad_mag:.2f}")
    print(f"  Gradient direction = {grad_angle:.1f}°")
    print(f"\n  To INCREASE f fastest: move at {grad_angle:.1f}° (with the gradient)")
    print(f"  To DECREASE f fastest: move at {grad_angle + 180:.1f}° (against the gradient)")
    print(f"  To stay at same level: move at {grad_angle + 90:.1f}° or {grad_angle - 90:.1f}° (perpendicular)")

visualize_gradient_directions()

In [None]:
# Visualize a 2D function and its partial derivatives
def f_2d(x, y):
    return x**2 + y**2

# Create meshgrid
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z = f_2d(X, Y)

fig = plt.figure(figsize=(15, 5))

# 3D surface
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z, cmap=cm.viridis, alpha=0.8)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')
ax1.set_title('f(x,y) = x² + y²')

# Contour plot with gradient vectors
ax2 = fig.add_subplot(132)
contour = ax2.contour(X, Y, Z, levels=15, cmap=cm.viridis)
ax2.clabel(contour, inline=True, fontsize=8)

# Add gradient vectors at some points
points = [(-2, -2), (-2, 0), (0, 2), (1, 1), (2, -1)]
for px, py in points:
    grad_x = 2 * px  # ∂f/∂x = 2x
    grad_y = 2 * py  # ∂f/∂y = 2y
    ax2.arrow(px, py, grad_x*0.3, grad_y*0.3, head_width=0.15, head_length=0.1, fc='red', ec='red')

ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Contour plot with gradient vectors')
ax2.set_aspect('equal')

# Slice at y=1
ax3 = fig.add_subplot(133)
y_fixed = 1
z_slice = f_2d(x, y_fixed)
ax3.plot(x, z_slice, 'b-', linewidth=2)
ax3.set_xlabel('x')
ax3.set_ylabel('f(x, 1)')
ax3.set_title(f'Slice at y = {y_fixed}\n∂f/∂x = 2x')
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 3. The Gradient

The **gradient** is the vector of all partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

### Key Properties

1. **Direction**: Points in the direction of steepest **increase**
2. **Magnitude**: Tells how steep that increase is
3. **To minimize**: Move in the **opposite** direction (negative gradient)

### Deep Dive: Why the Chain Rule is CRITICAL for Deep Learning

The chain rule is not just another calculus rule - it's the **mathematical heart of backpropagation**. Without it, we couldn't train neural networks.

#### The Core Insight

When you have nested functions (f composed with g), changes **propagate** through the chain:

$$\text{small change in } x \rightarrow \text{change in } g(x) \rightarrow \text{change in } f(g(x))$$

The chain rule says: **multiply the rates of change at each step**.

#### Breaking Down the Formula

For $y = f(g(x))$, let's call $u = g(x)$ (the intermediate value):

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

| Component | Meaning | In neural network terms |
|-----------|---------|------------------------|
| $\frac{du}{dx}$ | How much does u change when x changes? | "Local gradient" of layer |
| $\frac{dy}{du}$ | How much does y change when u changes? | "Upstream gradient" from later layers |
| $\frac{dy}{dx}$ | How much does y change when x changes? | "Full gradient" through the network |

#### Visual: How Changes Propagate

```
Input x                        Output y
   |                              |
   v                              v
   x ----[g]----> u = g(x) ----[f]----> y = f(u)
   
   Δx    ->     Δu = (dg/dx)·Δx   ->   Δy = (df/du)·Δu
                                            = (df/du)·(dg/dx)·Δx
```

The change in x gets **amplified (or diminished)** at each step, and the total effect is the product!
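
A minimal sketch of that product, assuming a hypothetical chain that applies the sigmoid function repeatedly (the function `sigmoid_chain`, the depths, and the test point are illustrative choices). Each local derivative of the sigmoid is at most 0.25, so the product shrinks quickly as the chain gets longer:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_chain(x, depth):
    """Apply sigmoid `depth` times; return the output and d(output)/dx via the chain rule."""
    value, derivative = x, 1.0
    for _ in range(depth):
        s = sigmoid(value)
        derivative *= s * (1 - s)   # local derivative of this step, evaluated at its input
        value = s
    return value, derivative

x0, h = 0.5, 1e-6
for depth in [1, 2, 4, 8]:
    _, analytic = sigmoid_chain(x0, depth)
    # Central finite difference through the whole chain, for comparison
    numeric = (sigmoid_chain(x0 + h, depth)[0] - sigmoid_chain(x0 - h, depth)[0]) / (2 * h)
    print(f"depth {depth}: chain-rule product = {analytic:.3e}, finite difference = {numeric:.3e}")
```

The chain-rule product and the finite-difference estimate should agree at every depth, and both fall by roughly a factor of four or more per added step.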

In [None]:
# Step-by-step example with ACTUAL NUMBERS
# Let's trace through f(g(x)) = (2x + 1)³ at x = 2

print("=" * 60)
print("CHAIN RULE: Step-by-Step with Actual Numbers")
print("=" * 60)
print("\nFunction: y = (2x + 1)³")
print("This is f(g(x)) where g(x) = 2x + 1 and f(u) = u³")
print("\n" + "-" * 60)

x = 2
print(f"Evaluating at x = {x}")

# Step 1: Forward pass - compute intermediate and final values
u = 2*x + 1  # g(x)
y = u**3      # f(u)

print(f"\n1. FORWARD PASS:")
print(f"   u = g(x) = 2({x}) + 1 = {u}")
print(f"   y = f(u) = {u}³ = {y}")

# Step 2: Compute local derivatives
dg_dx = 2          # derivative of g(x) = 2x + 1 is 2
df_du = 3 * u**2   # derivative of f(u) = u³ is 3u²

print(f"\n2. LOCAL DERIVATIVES (at this point):")
print(f"   dg/dx = d(2x+1)/dx = 2")
print(f"   df/du = d(u³)/du = 3u² = 3({u})² = {df_du}")

# Step 3: Apply chain rule
dy_dx = df_du * dg_dx

print(f"\n3. CHAIN RULE:")
print(f"   dy/dx = (df/du) × (dg/dx)")
print(f"         = {df_du} × {dg_dx}")
print(f"         = {dy_dx}")

# Verify with the analytical derivative
# y = (2x+1)³, so dy/dx = 3(2x+1)² × 2 = 6(2x+1)²
dy_dx_analytical = 6 * (2*x + 1)**2
print(f"\n4. VERIFICATION:")
print(f"   Analytical formula: dy/dx = 6(2x+1)²")
print(f"   At x = {x}: dy/dx = 6({2*x+1})² = {dy_dx_analytical}")
print(f"   Match: {dy_dx == dy_dx_analytical}")

# What does this mean?
print(f"\n5. INTERPRETATION:")
print(f"   If we increase x by a tiny amount Δx = 0.001:")
print(f"   y will increase by approximately {dy_dx} × 0.001 = {dy_dx * 0.001}")

# Verify numerically
h = 0.001
y_original = (2*x + 1)**3
y_nudged = (2*(x + h) + 1)**3
actual_change = y_nudged - y_original
print(f"   Actual change: {y_nudged} - {y_original} = {actual_change:.6f}")
print(f"   Predicted change: {dy_dx * h:.6f}")

In [None]:
# Visualization: Chain Rule as Signal Propagation
# Shows how a small change propagates through composed functions

def visualize_chain_rule_propagation():
    """
    Visualize how changes propagate through composed functions.
    f(g(x)) = sin(x²) 
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    x = np.linspace(-2, 2, 200)
    
    # g(x) = x²
    g = x**2
    # f(u) = sin(u) where u = g(x)
    f = np.sin(g)
    
    # Plot g(x)
    axes[0].plot(x, g, 'b-', linewidth=2)
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('u = g(x) = x²')
    axes[0].set_title('Step 1: Inner function\ng(x) = x²')
    axes[0].grid(True, alpha=0.3)
    axes[0].axhline(y=0, color='k', linewidth=0.5)
    axes[0].axvline(x=0, color='k', linewidth=0.5)
    
    # Highlight a point
    x0 = 1.5
    u0 = x0**2
    axes[0].scatter([x0], [u0], color='red', s=100, zorder=5)
    axes[0].annotate(f'x={x0}\nu={u0:.2f}', xy=(x0, u0), xytext=(x0+0.3, u0-0.5), fontsize=10)
    
    # Plot f(u) = sin(u)
    u = np.linspace(0, 4, 200)
    axes[1].plot(u, np.sin(u), 'g-', linewidth=2)
    axes[1].set_xlabel('u')
    axes[1].set_ylabel('y = f(u) = sin(u)')
    axes[1].set_title('Step 2: Outer function\nf(u) = sin(u)')
    axes[1].grid(True, alpha=0.3)
    axes[1].axhline(y=0, color='k', linewidth=0.5)
    
    y0 = np.sin(u0)
    axes[1].scatter([u0], [y0], color='red', s=100, zorder=5)
    axes[1].annotate(f'u={u0:.2f}\ny={y0:.2f}', xy=(u0, y0), xytext=(u0+0.3, y0+0.2), fontsize=10)
    
    # Plot the composition f(g(x))
    axes[2].plot(x, f, 'm-', linewidth=2)
    axes[2].set_xlabel('x')
    axes[2].set_ylabel('y = f(g(x)) = sin(x²)')
    axes[2].set_title('Result: Composition\ny = sin(x²)')
    axes[2].grid(True, alpha=0.3)
    axes[2].axhline(y=0, color='k', linewidth=0.5)
    axes[2].axvline(x=0, color='k', linewidth=0.5)
    
    axes[2].scatter([x0], [y0], color='red', s=100, zorder=5)
    axes[2].annotate(f'x={x0}\ny={y0:.2f}', xy=(x0, y0), xytext=(x0+0.2, y0+0.3), fontsize=10)
    
    plt.tight_layout()
    plt.show()
    
    # Now show the gradient computation
    print("\n" + "=" * 60)
    print("CHAIN RULE COMPUTATION for y = sin(x²) at x = 1.5")
    print("=" * 60)
    
    # At x = 1.5
    x_val = 1.5
    u_val = x_val**2
    y_val = np.sin(u_val)
    
    # Local gradients
    dg_dx = 2 * x_val          # d(x²)/dx = 2x
    df_du = np.cos(u_val)       # d(sin(u))/du = cos(u)
    
    # Chain rule
    dy_dx = df_du * dg_dx
    
    print(f"\n1. Forward pass:")
    print(f"   x = {x_val}")
    print(f"   u = x² = {u_val}")
    print(f"   y = sin(u) = {y_val:.4f}")
    
    print(f"\n2. Backward pass (computing gradients):")
    print(f"   dy/du = cos(u) = cos({u_val}) = {df_du:.4f}")
    print(f"   du/dx = 2x = 2({x_val}) = {dg_dx}")
    
    print(f"\n3. Chain rule:")
    print(f"   dy/dx = (dy/du) × (du/dx)")
    print(f"         = {df_du:.4f} × {dg_dx}")
    print(f"         = {dy_dx:.4f}")
    
    # Verify numerically
    h = 1e-5
    numerical_grad = (np.sin((x_val + h)**2) - np.sin((x_val - h)**2)) / (2*h)
    print(f"\n4. Numerical verification: {numerical_grad:.4f}")

visualize_chain_rule_propagation()

### Computational Graphs: The Key to Backpropagation

A **computational graph** is a visual representation of how a function computes its output. Each node is an operation, and edges show data flow.

**Why does this matter?** Neural networks are just big computational graphs, and backpropagation is just the chain rule applied systematically through the graph!

#### Example: Computing gradients through a simple graph

Consider: $L = (wx + b - y)^2$ (squared error loss)

```
     w                              
     |                              
     v                              
x -->[ * ]--> z1 -->[ + ]--> z2 -->[ - ]--> z3 -->[ ² ]--> L
                      ^              ^
                      |              |
                      b              y (target)
```

**Forward pass (left to right):** Compute values at each node
**Backward pass (right to left):** Compute gradients using chain rule
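
To show how the two passes can be automated, here is a minimal sketch of a scalar computational graph. The `Node` class and `backward` function are hypothetical teaching constructs, not the API of any real library, and the input values are illustrative:

```python
class Node:
    """Hypothetical graph node: holds a value, its gradient, and links to its inputs."""
    def __init__(self, value, parents=()):
        self.value = float(value)
        self.grad = 0.0
        self.parents = parents  # tuple of (parent_node, local_gradient) pairs

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value, ((self, other.value), (other, self.value)))

    def square(self):
        return Node(self.value ** 2, ((self, 2.0 * self.value),))

def backward(output):
    """Backward pass: visit nodes from output to inputs, applying the chain rule."""
    order, seen = [], set()

    def visit(node):
        # Depth-first search that lists every node after all of its inputs
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0  # dL/dL = 1
    for node in reversed(order):
        for parent, local_gradient in node.parents:
            parent.grad += node.grad * local_gradient  # upstream gradient x local gradient

# Forward pass for L = (w*x + b - y)^2 with illustrative values
w, x, b, y = Node(3.0), Node(2.0), Node(1.0), Node(5.0)
L = (w * x + b - y).square()

backward(L)
print(f"L = {L.value}, dL/dw = {w.grad}, dL/db = {b.grad}, dL/dx = {x.grad}")
```

Each operation records only its own local gradients during the forward pass; the backward pass then multiplies them by the upstream gradient, which is the chain rule applied node by node.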

In [None]:
# Detailed walkthrough of forward and backward pass through computational graph
# L = (w*x + b - y)²

def computational_graph_example():
    """
    Step-by-step forward and backward pass through a computational graph.
    This mirrors the reverse-mode procedure that neural network libraries automate.
    """
    print("=" * 70)
    print("COMPUTATIONAL GRAPH: L = (w*x + b - y)²")
    print("=" * 70)
    
    # Input values
    x = 2.0   # input
    y = 5.0   # target (differs from the prediction w*x + b = 7, so the gradients are nonzero)
    w = 3.0   # weight
    b = 1.0   # bias
    
    print(f"\nInputs: x={x}, y={y}, w={w}, b={b}")
    print("\n" + "-" * 70)
    print("FORWARD PASS (compute values left to right)")
    print("-" * 70)
    
    # Forward pass - compute each node
    z1 = w * x          # multiplication
    print(f"z1 = w * x = {w} * {x} = {z1}")
    
    z2 = z1 + b         # addition
    print(f"z2 = z1 + b = {z1} + {b} = {z2}")
    
    z3 = z2 - y         # subtraction (error)
    print(f"z3 = z2 - y = {z2} - {y} = {z3}")
    
    L = z3 ** 2         # square (loss)
    print(f"L = z3² = {z3}² = {L}")
    
    print("\n" + "-" * 70)
    print("BACKWARD PASS (compute gradients right to left)")
    print("-" * 70)
    print("Starting from dL/dL = 1 (gradient of L with respect to itself)\n")
    
    # Backward pass - apply chain rule at each node
    
    # Start with gradient of loss w.r.t. itself
    dL_dL = 1
    print(f"dL/dL = {dL_dL}")
    
    # Node: L = z3² 
    # dL/dz3 = d(z3²)/dz3 = 2*z3
    dL_dz3 = dL_dL * (2 * z3)
    print(f"\nNode L = z3²:")
    print(f"  Local gradient: d(z3²)/dz3 = 2*z3 = 2*{z3} = {2*z3}")
    print(f"  dL/dz3 = dL/dL × d(z3²)/dz3 = {dL_dL} × {2*z3} = {dL_dz3}")
    
    # Node: z3 = z2 - y
    # dz3/dz2 = 1, dz3/dy = -1
    dL_dz2 = dL_dz3 * 1
    dL_dy = dL_dz3 * (-1)
    print(f"\nNode z3 = z2 - y:")
    print(f"  Local gradients: dz3/dz2 = 1, dz3/dy = -1")
    print(f"  dL/dz2 = dL/dz3 × 1 = {dL_dz3} × 1 = {dL_dz2}")
    print(f"  dL/dy = dL/dz3 × (-1) = {dL_dz3} × (-1) = {dL_dy}")
    
    # Node: z2 = z1 + b
    # dz2/dz1 = 1, dz2/db = 1
    dL_dz1 = dL_dz2 * 1
    dL_db = dL_dz2 * 1
    print(f"\nNode z2 = z1 + b:")
    print(f"  Local gradients: dz2/dz1 = 1, dz2/db = 1")
    print(f"  dL/dz1 = dL/dz2 × 1 = {dL_dz2} × 1 = {dL_dz1}")
    print(f"  dL/db = dL/dz2 × 1 = {dL_dz2} × 1 = {dL_db}")
    
    # Node: z1 = w * x
    # dz1/dw = x, dz1/dx = w
    dL_dw = dL_dz1 * x
    dL_dx = dL_dz1 * w
    print(f"\nNode z1 = w * x:")
    print(f"  Local gradients: dz1/dw = x = {x}, dz1/dx = w = {w}")
    print(f"  dL/dw = dL/dz1 × x = {dL_dz1} × {x} = {dL_dw}")
    print(f"  dL/dx = dL/dz1 × w = {dL_dz1} × {w} = {dL_dx}")
    
    print("\n" + "-" * 70)
    print("SUMMARY OF GRADIENTS")
    print("-" * 70)
    print(f"dL/dw = {dL_dw}  <- How much changing w affects L")
    print(f"dL/db = {dL_db}  <- How much changing b affects L")
    print(f"dL/dx = {dL_dx}  <- How much changing x affects L")
    
    print("\n" + "-" * 70)
    print("VERIFICATION with numerical gradients")
    print("-" * 70)
    h = 1e-5
    
    loss = lambda w, b, x, y: (w*x + b - y)**2
    
    dL_dw_num = (loss(w+h, b, x, y) - loss(w-h, b, x, y)) / (2*h)
    dL_db_num = (loss(w, b+h, x, y) - loss(w, b-h, x, y)) / (2*h)
    
    print(f"dL/dw: analytical = {dL_dw}, numerical = {dL_dw_num:.4f}")
    print(f"dL/db: analytical = {dL_db}, numerical = {dL_db_num:.4f}")
    
    return dL_dw, dL_db

dL_dw, dL_db = computational_graph_example()

### Deep Dive: Understanding Gradient Descent

Gradient descent is the **optimization engine** of deep learning. Let's build deep intuition.

#### The Core Idea in Plain English

You're lost on a foggy mountainside and want to reach the lowest valley. What do you do?
1. **Feel the slope** under your feet (compute gradient)
2. **Take a step downhill** (move opposite to gradient)
3. **Repeat** until you reach flat ground (the gradient is approximately zero), which may be a local valley rather than the lowest one

That's gradient descent!

#### The Update Rule Decoded

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla L(\theta)$$

| Component | Meaning | Analogy |
|-----------|---------|---------|
| $\theta$ | Parameters (weights) | Your position on the mountain |
| $\nabla L(\theta)$ | Gradient of loss | Which way is uphill |
| $-\nabla L(\theta)$ | Negative gradient | Which way is downhill |
| $\alpha$ | Learning rate | Size of your steps |
| $-\alpha \nabla L(\theta)$ | The actual step | Direction and distance you move |
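
The update rule can be sketched as a short loop. The bowl-shaped loss, the starting point, the learning rate, and the stopping tolerance below are all illustrative assumptions:

```python
import numpy as np

def loss(theta):
    # Hypothetical bowl-shaped loss with its minimum at theta = (1, -2)
    return (theta[0] - 1) ** 2 + (theta[1] + 2) ** 2

def grad_loss(theta):
    # Analytical gradient of the loss above
    return np.array([2 * (theta[0] - 1), 2 * (theta[1] + 2)])

theta = np.array([4.0, 3.0])   # arbitrary starting point
alpha = 0.1                    # learning rate (illustrative)

for step in range(200):
    grad = grad_loss(theta)
    if np.linalg.norm(grad) < 1e-6:    # "flat ground": stop when the slope is negligible
        break
    theta = theta - alpha * grad       # theta_new = theta_old - alpha * grad L(theta)

print(f"stopped after {step} steps at theta = {theta}, loss = {loss(theta):.2e}")
```

Stopping on a small gradient norm is one common choice; a fixed step budget or a threshold on the change in loss would work as well.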

#### The Learning Rate $\alpha$ is Critical

| Learning rate | What happens | Problem |
|---------------|--------------|---------|
| **Too small** | Tiny steps, very slow progress | Takes forever to converge |
| **Too large** | Big steps, overshoots minimum | Oscillates or diverges |
| **Just right** | Steady progress, converges | Sweet spot (hard to find!) |

This is why learning rate scheduling and adaptive optimizers such as Adam are important in practice.
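
A minimal sketch of the three regimes in the table, assuming the one-dimensional loss $f(\theta) = \theta^2$ and three hand-picked learning rates (the function name `run_gradient_descent` is hypothetical). For this loss each update multiplies $\theta$ by $(1 - 2\alpha)$, so the iterates shrink only when $0 < \alpha < 1$:

```python
def run_gradient_descent(alpha, theta0=1.0, n_steps=20):
    """Minimize f(theta) = theta^2 (gradient 2*theta) and return the visited values."""
    theta = theta0
    path = [theta]
    for _ in range(n_steps):
        theta = theta - alpha * 2 * theta   # update rule with gradient 2*theta
        path.append(theta)
    return path

for alpha in [0.01, 0.4, 1.1]:
    path = run_gradient_descent(alpha)
    print(f"alpha = {alpha:<4}: |theta| after 20 steps = {abs(path[-1]):.3e}")
```

With the smallest rate the iterate has barely moved after 20 steps, the middle rate lands very close to the minimum, and the largest rate overshoots further on every step.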

In [None]:
def compute_gradient(f, point, h=1e-5):
    """Compute gradient of f at point using numerical differentiation."""
    point = np.array(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(len(point)):
        grad[i] = partial_derivative(f, point, i, h)
    return grad

# Example: f(x, y) = x² + y²
def f(p):
    return p[0]**2 + p[1]**2

point = np.array([3.0, 4.0])
grad = compute_gradient(f, point)

print(f"At point {point}:")
print(f"f(point) = {f(point)}")
print(f"Gradient = {grad}")
print(f"Gradient magnitude = {np.linalg.norm(grad):.4f}")
print(f"\nTo decrease f, move in direction: {-grad}")

In [None]:
# Visualize gradient field
def f_2d(x, y):
    return x**2 + y**2

# Create grid
x = np.linspace(-3, 3, 15)
y = np.linspace(-3, 3, 15)
X, Y = np.meshgrid(x, y)

# Compute gradient at each point
U = 2 * X  # ∂f/∂x
V = 2 * Y  # ∂f/∂y

# Normalize for better visualization
magnitude = np.sqrt(U**2 + V**2)
U_norm = U / (magnitude + 1e-10)
V_norm = V / (magnitude + 1e-10)

plt.figure(figsize=(10, 8))

# Contour plot
x_fine = np.linspace(-3, 3, 100)
y_fine = np.linspace(-3, 3, 100)
X_fine, Y_fine = np.meshgrid(x_fine, y_fine)
Z_fine = f_2d(X_fine, Y_fine)
plt.contour(X_fine, Y_fine, Z_fine, levels=15, cmap=cm.viridis, alpha=0.5)

# Gradient vectors (pointing uphill)
plt.quiver(X, Y, U_norm, V_norm, magnitude, cmap=cm.Reds, alpha=0.8)

plt.xlabel('x')
plt.ylabel('y')
plt.title('Gradient Field of f(x,y) = x² + y²\nArrows point UPHILL (direction of steepest increase)')
plt.colorbar(label='Gradient magnitude')
plt.axis('equal')
plt.show()

print("Notice: Gradients point away from the minimum (origin)")
print("To minimize, we follow the NEGATIVE gradient (downhill)")

---

## 4. The Chain Rule

The **chain rule** is the foundation of backpropagation. It tells us how to differentiate composite functions.

### Single Variable

If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

### Intuition

If $x$ changes by a small amount $\Delta x$:
- $g$ changes by approximately $\frac{dg}{dx} \cdot \Delta x$
- This causes $f$ to change by approximately $\frac{df}{dg} \cdot \left(\frac{dg}{dx} \cdot \Delta x\right)$

The changes **multiply** through the chain!

In [None]:
# Example: y = (3x + 2)²
# Let g(x) = 3x + 2, f(g) = g²
# dy/dx = df/dg * dg/dx = 2g * 3 = 6(3x + 2)

def y(x):
    return (3*x + 2)**2

def dy_dx_analytical(x):
    return 6 * (3*x + 2)

x = 1.0
print(f"y = (3x + 2)² at x = {x}")
print(f"y({x}) = {y(x)}")
print(f"dy/dx numerical: {numerical_derivative(y, x):.6f}")
print(f"dy/dx analytical (chain rule): {dy_dx_analytical(x):.6f}")

In [None]:
# Comprehensive visualization of gradient descent behavior
# Shows the path, learning rate effects, and convergence

def visualize_gradient_descent_comprehensive():
    """
    Create a comprehensive visualization showing:
    1. 3D view of the loss surface with descent path
    2. Top-down view (contour) with path
    3. Loss over iterations
    4. Effect of different learning rates
    """
    
    # Define a nice loss landscape: f(x,y) = x² + 10*y² (elongated bowl)
    def f(p):
        return p[0]**2 + 10*p[1]**2
    
    def grad_f(p):
        return np.array([2*p[0], 20*p[1]])
    
    def gradient_descent(start, lr, n_steps):
        point = np.array(start, dtype=float)
        history = [point.copy()]
        for _ in range(n_steps):
            point = point - lr * grad_f(point)
            history.append(point.copy())
        return np.array(history)
    
    # Run GD with different learning rates
    start = [3.0, 1.0]
    n_steps = 30
    
    lr_small = 0.01
    lr_good = 0.05
    lr_large = 0.09
    lr_too_large = 0.11
    
    hist_small = gradient_descent(start, lr_small, n_steps)
    hist_good = gradient_descent(start, lr_good, n_steps)
    hist_large = gradient_descent(start, lr_large, n_steps)
    hist_too_large = gradient_descent(start, lr_too_large, n_steps)
    
    # Create figure
    fig = plt.figure(figsize=(16, 10))
    
    # Create grid for surface plots
    x = np.linspace(-4, 4, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)
    Z = X**2 + 10*Y**2
    
    # Plot 1: 3D surface with path (good learning rate)
    ax1 = fig.add_subplot(221, projection='3d')
    ax1.plot_surface(X, Y, Z, cmap=cm.viridis, alpha=0.6)
    
    # Add descent path on surface
    path_z = [f(p) for p in hist_good]
    ax1.plot(hist_good[:, 0], hist_good[:, 1], path_z, 'r.-', 
             markersize=8, linewidth=2, label='GD path')
    ax1.scatter([start[0]], [start[1]], [f(start)], color='green', s=100, marker='o')
    ax1.scatter([0], [0], [0], color='red', s=100, marker='*')
    
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.set_zlabel('f(x,y)')
    ax1.set_title('3D View: Gradient Descent Path\n(Learning rate = 0.05)')
    
    # Plot 2: Contour with all paths
    ax2 = fig.add_subplot(222)
    contour = ax2.contour(X, Y, Z, levels=20, cmap=cm.viridis)
    ax2.clabel(contour, inline=True, fontsize=8)
    
    ax2.plot(hist_small[:, 0], hist_small[:, 1], 'b.-', markersize=5, 
             linewidth=1.5, label=f'lr={lr_small} (too small)')
    ax2.plot(hist_good[:, 0], hist_good[:, 1], 'g.-', markersize=5, 
             linewidth=1.5, label=f'lr={lr_good} (good)')
    ax2.plot(hist_large[:, 0], hist_large[:, 1], 'orange', marker='.', markersize=5, 
             linewidth=1.5, label=f'lr={lr_large} (large)')
    ax2.plot(hist_too_large[:, 0], hist_too_large[:, 1], 'r.-', markersize=5, 
             linewidth=1.5, label=f'lr={lr_too_large} (too large)')
    
    ax2.scatter([start[0]], [start[1]], color='green', s=150, marker='o', zorder=5, label='Start')
    ax2.scatter([0], [0], color='red', s=150, marker='*', zorder=5, label='Minimum')
    
    ax2.set_xlabel('x')
    ax2.set_ylabel('y')
    ax2.set_title('Top View: Different Learning Rates')
    ax2.legend(loc='upper right', fontsize=9)
    ax2.set_aspect('equal')
    
    # Plot 3: Loss curves
    ax3 = fig.add_subplot(223)
    
    losses_small = [f(p) for p in hist_small]
    losses_good = [f(p) for p in hist_good]
    losses_large = [f(p) for p in hist_large]
    losses_too_large = [f(p) for p in hist_too_large]
    
    ax3.plot(losses_small, 'b-', linewidth=2, label=f'lr={lr_small}')
    ax3.plot(losses_good, 'g-', linewidth=2, label=f'lr={lr_good}')
    ax3.plot(losses_large, color='orange', linewidth=2, label=f'lr={lr_large}')
    ax3.plot(losses_too_large, 'r-', linewidth=2, label=f'lr={lr_too_large}')
    
    ax3.set_xlabel('Iteration')
    ax3.set_ylabel('Loss')
    ax3.set_title('Loss Over Time')
    ax3.legend()
    ax3.set_yscale('log')
    ax3.grid(True, alpha=0.3)
    
    # Plot 4: Zoomed in first few steps
    ax4 = fig.add_subplot(224)
    
    # Show step-by-step for good learning rate
    for i in range(min(8, len(hist_good)-1)):
        p1 = hist_good[i]
        p2 = hist_good[i+1]
        grad = grad_f(p1)
        
        # Point
        ax4.scatter([p1[0]], [p1[1]], color='blue', s=60, zorder=5)
        ax4.annotate(f'{i}', xy=(p1[0], p1[1]), xytext=(p1[0]+0.1, p1[1]+0.1), fontsize=9)
        
        # Negative gradient (scaled for visualization)
        ax4.arrow(p1[0], p1[1], -grad[0]*0.02, -grad[1]*0.02, 
                 head_width=0.05, head_length=0.02, fc='red', ec='red', alpha=0.5)
        
        # Actual step
        ax4.arrow(p1[0], p1[1], (p2[0]-p1[0])*0.95, (p2[1]-p1[1])*0.95,
                 head_width=0.05, head_length=0.02, fc='green', ec='green')
    
    contour2 = ax4.contour(X, Y, Z, levels=20, cmap=cm.viridis, alpha=0.5)
    ax4.set_xlabel('x')
    ax4.set_ylabel('y')
    ax4.set_title('Step-by-Step View (lr=0.05)\nGreen: actual steps, Red: negative gradient direction (scaled)')
    ax4.set_xlim([-1, 4])
    ax4.set_ylim([-0.5, 1.5])
    
    plt.tight_layout()
    plt.show()
    
    print("Key observations:")
    print(f"- lr={lr_small}: Very slow convergence, many steps needed")
    print(f"- lr={lr_good}: Good convergence, reaches minimum efficiently")
    print(f"- lr={lr_large}: Oscillates but eventually converges")
    print(f"- lr={lr_too_large}: Diverges along y, because |1 - 20*lr| > 1 for this bowl")

visualize_gradient_descent_comprehensive()

### The Local Minima Problem

Real loss surfaces are rarely simple bowls. They often have:
- **Local minima**: Points that look like minima locally but aren't the global best
- **Saddle points**: Points where gradient is zero but it's neither min nor max
- **Plateaus**: Flat regions where gradient is tiny

Neural networks have extremely complex loss landscapes. Fortunately:
1. In high dimensions, true local minima are rare (saddle points are more common)
2. Many local minima have similar loss values
3. Optimizers that use momentum or noisy gradients (SGD with momentum, Adam, etc.) can often escape shallow local minima
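The first point can be illustrated with a toy model. A critical point is a minimum only if the Hessian has all positive eigenvalues. The sketch below assumes, purely for illustration, that the Hessian at a critical point behaves like a random symmetric Gaussian matrix (real loss Hessians are not random in this way) and estimates how often every direction curves upward as the dimension grows. The function name `fraction_all_curving_up` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_all_curving_up(dim, n_trials=2000):
    """Fraction of random symmetric 'Hessians' whose eigenvalues are all positive."""
    count = 0
    for _ in range(n_trials):
        a = rng.standard_normal((dim, dim))
        hessian = (a + a.T) / 2          # symmetrize, like a real Hessian
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / n_trials

fractions = {dim: fraction_all_curving_up(dim) for dim in (1, 2, 4, 8)}
for dim, frac in fractions.items():
    print(f"dim={dim}: fraction that would be minima = {frac:.3f}")
```

Under this assumption the fraction drops quickly with dimension, so most critical points have at least one downhill direction and are saddle points.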

In [None]:
# Visualize local minima, saddle points, and the challenges they pose

def visualize_local_minima_problem():
    """
    Visualize a loss landscape with multiple local minima
    and show how gradient descent can get stuck.
    """
    
    # Create a function with multiple local minima
    # f(x) = sin(3x) + 0.1*x² (creates multiple valleys)
    def f_1d(x):
        return np.sin(3*x) + 0.1*x**2
    
    def df_1d(x):
        return 3*np.cos(3*x) + 0.2*x
    
    # 1D visualization
    fig, axes = plt.subplots(1, 3, figsize=(16, 4))
    
    x = np.linspace(-4, 4, 200)
    y = f_1d(x)
    
    axes[0].plot(x, y, 'b-', linewidth=2)
    axes[0].set_xlabel('x (parameter)')
    axes[0].set_ylabel('Loss')
    axes[0].set_title('Loss Landscape with Multiple Minima')
    axes[0].grid(True, alpha=0.3)
    
    # Mark local minima (where the derivative crosses zero from - to +)
    x_scan = np.linspace(-4, 4, 1000)
    d_scan = df_1d(x_scan)
    for i in np.where((d_scan[:-1] < 0) & (d_scan[1:] >= 0))[0]:
        xi = 0.5 * (x_scan[i] + x_scan[i + 1])
        axes[0].scatter([xi], [f_1d(xi)], color='red', s=100, marker='v', zorder=5)
    
    # Annotate the global minimum (x ≈ -0.51) and a neighboring local minimum (x ≈ 1.54)
    axes[0].annotate('Global\nminimum', xy=(-0.51, f_1d(-0.51)), xytext=(-3, 1),
                    arrowprops=dict(arrowstyle='->', color='green'), fontsize=10, color='green')
    axes[0].annotate('Local\nminimum', xy=(1.54, f_1d(1.54)), xytext=(1, 1.5),
                    arrowprops=dict(arrowstyle='->', color='red'), fontsize=10, color='red')
    
    # Run GD from different starting points
    def gd_1d(x0, lr=0.1, n_steps=50):
        x = x0
        history = [x]
        for _ in range(n_steps):
            x = x - lr * df_1d(x)
            history.append(x)
        return np.array(history)
    
    starts = [-3.5, -1.0, 1.5, 3.0]
    colors = ['green', 'red', 'orange', 'purple']
    
    axes[1].plot(x, y, 'b-', linewidth=2, alpha=0.5)
    for start, color in zip(starts, colors):
        hist = gd_1d(start, lr=0.05, n_steps=100)
        y_hist = f_1d(hist)
        axes[1].plot(hist, y_hist, '.-', color=color, markersize=4, 
                    linewidth=1, label=f'Start at x={start}')
        axes[1].scatter([start], [f_1d(start)], color=color, s=100, marker='o', zorder=5)
    
    axes[1].set_xlabel('x')
    axes[1].set_ylabel('Loss')
    axes[1].set_title('GD from Different Starting Points')
    axes[1].legend(fontsize=9)
    axes[1].grid(True, alpha=0.3)
    
    # Show final loss values
    final_losses = []
    for start in starts:
        hist = gd_1d(start, lr=0.05, n_steps=100)
        final_losses.append(f_1d(hist[-1]))
    
    axes[2].bar(range(len(starts)), final_losses, color=colors)
    axes[2].set_xticks(range(len(starts)))
    axes[2].set_xticklabels([f'x={s}' for s in starts])
    axes[2].set_xlabel('Starting point')
    axes[2].set_ylabel('Final loss')
    axes[2].set_title('Final Loss Depends on Start!\n(Different starting points = different results)')
    axes[2].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("Key insight: Gradient descent finds LOCAL minima, not necessarily GLOBAL minima.")
    print("The solution depends on where you start!")
    print("\nStrategies to address this:")
    print("1. Run from multiple random starting points")
    print("2. Use momentum to escape shallow local minima")
    print("3. Add noise (stochastic gradient descent)")
    print("4. Use learning rate schedules")

visualize_local_minima_problem()

In [None]:
# Visualize saddle points in 2D
# A saddle point is a critical point that is a minimum in one direction but maximum in another

def visualize_saddle_point():
    """
    Visualize a saddle point and why it's problematic for gradient descent.
    """
    # Classic saddle: f(x,y) = x² - y²
    def f_saddle(x, y):
        return x**2 - y**2
    
    def grad_saddle(p):
        return np.array([2*p[0], -2*p[1]])
    
    fig = plt.figure(figsize=(16, 5))
    
    # Create grid
    x = np.linspace(-2, 2, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)
    Z = f_saddle(X, Y)
    
    # 3D surface
    ax1 = fig.add_subplot(131, projection='3d')
    ax1.plot_surface(X, Y, Z, cmap=cm.coolwarm, alpha=0.8)
    ax1.scatter([0], [0], [0], color='black', s=200, marker='o')
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.set_zlabel('f(x,y)')
    ax1.set_title('Saddle Point: f(x,y) = x² - y²\nMinimum in x, Maximum in y')
    
    # Contour plot
    ax2 = fig.add_subplot(132)
    contour = ax2.contour(X, Y, Z, levels=20, cmap=cm.coolwarm)
    ax2.clabel(contour, inline=True, fontsize=8)
    ax2.scatter([0], [0], color='black', s=200, marker='o', label='Saddle point')
    
    # Draw gradient arrows around saddle point
    for px, py in [(0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]:
        grad = grad_saddle([px, py])
        ax2.arrow(px, py, -grad[0]*0.15, -grad[1]*0.15, head_width=0.05, 
                 head_length=0.02, fc='green', ec='green')
    
    ax2.set_xlabel('x')
    ax2.set_ylabel('y')
    ax2.set_title('Contour Plot\nGreen arrows: negative gradient direction')
    ax2.legend()
    ax2.set_aspect('equal')
    
    # 1D slices through saddle point
    ax3 = fig.add_subplot(133)
    x_slice = np.linspace(-2, 2, 100)
    ax3.plot(x_slice, x_slice**2, 'b-', linewidth=2, label='Slice at y=0: f = x² (minimum)')
    ax3.plot(x_slice, -x_slice**2, 'r-', linewidth=2, label='Slice at x=0: f = -y² (maximum)')
    ax3.axhline(y=0, color='k', linewidth=0.5)
    ax3.axvline(x=0, color='k', linewidth=0.5)
    ax3.scatter([0], [0], color='black', s=100, zorder=5)
    ax3.set_xlabel('x or y')
    ax3.set_ylabel('f')
    ax3.set_title('1D Slices Through Saddle\nSame point is min AND max!')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("At a saddle point:")
    print("- Gradient is zero (looks like a minimum/maximum)")
    print("- But it's a minimum in some directions, maximum in others")
    print("- GD can get stuck here if approaching from certain directions")
    print("\nIn high-dimensional neural network loss landscapes:")
    print("- Saddle points are MUCH more common than local minima")
    print("- This is because you need ALL directions to curve upward for a minimum")
    print("- Momentum helps escape saddle points by building up velocity")

visualize_saddle_point()

### Chain Rule in Neural Networks

Consider a simple network:

$$\text{Input } x \rightarrow z = wx + b \rightarrow a = \sigma(z) \rightarrow L = (a - y)^2$$

To find $\frac{\partial L}{\partial w}$, we apply the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$
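Each factor can be written out explicitly. A worked expansion, using the standard sigmoid identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$$\frac{\partial L}{\partial a} = 2(a - y), \qquad \frac{\partial a}{\partial z} = \sigma(z)\,(1 - \sigma(z)) = a(1 - a), \qquad \frac{\partial z}{\partial w} = x$$

Multiplying the three factors gives

$$\frac{\partial L}{\partial w} = 2(a - y) \cdot a(1 - a) \cdot x$$

The bias follows the same path except for the last factor, $\frac{\partial z}{\partial b} = 1$, so

$$\frac{\partial L}{\partial b} = 2(a - y) \cdot a(1 - a)$$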

In [None]:
# Computational graph example
# Forward pass: x -> z = wx + b -> a = sigmoid(z) -> L = (a - y)²

def forward_and_backward(x, y_true, w, b):
    """Compute forward pass and gradients using chain rule."""
    
    # Forward pass
    z = w * x + b
    a = sigmoid(z)
    L = (a - y_true)**2
    
    print("=== Forward Pass ===")
    print(f"x = {x}")
    print(f"z = w*x + b = {w}*{x} + {b} = {z}")
    print(f"a = sigmoid(z) = {a:.6f}")
    print(f"L = (a - y)² = ({a:.6f} - {y_true})² = {L:.6f}")
    
    # Backward pass (chain rule)
    print("\n=== Backward Pass (Chain Rule) ===")
    
    # dL/da
    dL_da = 2 * (a - y_true)
    print(f"∂L/∂a = 2(a - y) = {dL_da:.6f}")
    
    # da/dz (sigmoid derivative)
    da_dz = a * (1 - a)
    print(f"∂a/∂z = σ(z)(1 - σ(z)) = {da_dz:.6f}")
    
    # dz/dw
    dz_dw = x
    print(f"∂z/∂w = x = {dz_dw}")
    
    # dz/db
    dz_db = 1
    print(f"∂z/∂b = {dz_db}")
    
    # Chain rule
    dL_dz = dL_da * da_dz
    dL_dw = dL_dz * dz_dw
    dL_db = dL_dz * dz_db
    
    print(f"\n∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w = {dL_dw:.6f}")
    print(f"∂L/∂b = ∂L/∂a · ∂a/∂z · ∂z/∂b = {dL_db:.6f}")
    
    return L, dL_dw, dL_db

# Example
x = 2.0
y_true = 1.0
w = 0.5
b = 0.1

L, dL_dw, dL_db = forward_and_backward(x, y_true, w, b)

In [None]:
# Verify with numerical gradient
h = 1e-5

def loss(w, b, x=2.0, y=1.0):
    z = w * x + b
    a = sigmoid(z)
    return (a - y)**2

# Numerical gradients
dL_dw_numerical = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
dL_db_numerical = (loss(w, b + h) - loss(w, b - h)) / (2 * h)

print("Verification with numerical gradients:")
print(f"∂L/∂w: analytical = {dL_dw:.6f}, numerical = {dL_dw_numerical:.6f}")
print(f"∂L/∂b: analytical = {dL_db:.6f}, numerical = {dL_db_numerical:.6f}")

### Multivariate Chain Rule

When a variable affects the output through multiple paths:

$$\frac{\partial L}{\partial x} = \sum_{i} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial x}$$

This is why we **sum** gradients when a variable is used multiple times.
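A minimal sketch of the sum over paths, using a hypothetical function in which $x$ feeds two intermediate values, $y_1 = x^2$ and $y_2 = 3x$, with $L = y_1 \cdot y_2$. The names and the function are illustrative assumptions, not part of the material above.

```python
# Hypothetical example: x feeds two intermediate values, y1 and y2
def forward(x):
    y1 = x ** 2
    y2 = 3 * x
    return y1 * y2, y1, y2   # L = y1 * y2 = 3x^3

x = 2.0
L, y1, y2 = forward(x)

# Contribution of each path from L back to x
path_via_y1 = y2 * (2 * x)   # dL/dy1 * dy1/dx
path_via_y2 = y1 * 3         # dL/dy2 * dy2/dx
dL_dx = path_via_y1 + path_via_y2

# Central-difference check
h = 1e-6
dL_dx_numerical = (forward(x + h)[0] - forward(x - h)[0]) / (2 * h)

print(f"sum over paths: {dL_dx}")
print(f"numerical:      {dL_dx_numerical:.6f}")
```

Dropping either path would give the wrong derivative; only the sum matches $\frac{d}{dx}(3x^3) = 9x^2$.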

In [None]:
# Example: f(x) = x² + x, where x enters the graph along three paths

# Computational graph:
# x --> a = x  --\
#                 +--> c = a * b --> f = c + d
# x --> b = x  --/                      |
#                                       |
# x --> d = x  -------------------------/

# This is: f = x*x + x = x² + x
# df/dx = 2x + 1 (by calculus)

# But through the graph:
# df/dx = df/dc * dc/da * da/dx + df/dc * dc/db * db/dx + df/dd * dd/dx
#       = 1 * b * 1 + 1 * a * 1 + 1 * 1
#       = x + x + 1 = 2x + 1 ✓

x = 3.0
print(f"f(x) = x² + x at x = {x}")
print(f"f({x}) = {x**2 + x}")
print(f"df/dx (analytical) = 2x + 1 = {2*x + 1}")

# Through computational graph
a = x
b = x  
c = a * b  # = x²
d = x
f = c + d  # = x² + x

# Backward
df_dc = 1
df_dd = 1
dc_da = b  # = x
dc_db = a  # = x
da_dx = 1
db_dx = 1
dd_dx = 1

# Sum all paths from f to x
df_dx = df_dc * dc_da * da_dx + df_dc * dc_db * db_dx + df_dd * dd_dx
print(f"df/dx (computational graph) = {df_dx}")

---

## 5. Gradient Descent

**Gradient descent** is the optimization algorithm that powers deep learning:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla L(\theta_{\text{old}})$$

Where:
- $\theta$: Parameters (weights)
- $\alpha$: Learning rate (step size)
- $\nabla L$: Gradient of loss with respect to parameters
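As a worked instance of the update rule, take the one-dimensional loss $L(\theta) = (\theta - 3)^2$ with $\alpha = 0.1$ (illustrative choices). Since $\nabla L(\theta) = 2(\theta - 3)$, one update gives

$$\theta_{\text{new}} - 3 = (\theta_{\text{old}} - 3) - 0.1 \cdot 2(\theta_{\text{old}} - 3) = 0.8\,(\theta_{\text{old}} - 3)$$

so the distance to the minimum shrinks by a factor of $0.8$ on every step. Starting from $\theta = 10$:

$$10 \;\rightarrow\; 8.6 \;\rightarrow\; 7.48 \;\rightarrow\; \dots, \qquad \theta_k - 3 = 7 \cdot 0.8^k$$

The same calculation with a general $\alpha$ gives the factor $1 - 2\alpha$, which is why this particular loss converges only when $|1 - 2\alpha| < 1$.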

In [None]:
def gradient_descent_1d(f, df, x0, learning_rate=0.1, n_steps=50):
    """Gradient descent for 1D function."""
    x = x0
    history = [(x, f(x))]
    
    for i in range(n_steps):
        grad = df(x)
        x = x - learning_rate * grad
        history.append((x, f(x)))
        
    return x, history

# Minimize f(x) = (x - 3)²
f = lambda x: (x - 3)**2
df = lambda x: 2 * (x - 3)

x_final, history = gradient_descent_1d(f, df, x0=10.0, learning_rate=0.1, n_steps=30)

print(f"Minimum found at x = {x_final:.6f}")
print(f"f(x) = {f(x_final):.6f}")
print(f"True minimum at x = 3")

# Visualize
x_range = np.linspace(-2, 12, 100)
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(x_range, f(x_range), 'b-', linewidth=2, label='f(x) = (x-3)²')
xs, ys = zip(*history)
plt.scatter(xs, ys, c=range(len(xs)), cmap='Reds', s=50, zorder=5)
plt.plot(xs, ys, 'r--', alpha=0.5)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gradient Descent Path')
plt.legend()
plt.colorbar(label='Iteration')

plt.subplot(1, 2, 2)
plt.plot([h[1] for h in history], 'b-o')
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.title('Loss over Time')

plt.tight_layout()
plt.show()

### 2D Gradient Descent

In [None]:
def gradient_descent_2d(f, grad_f, start, learning_rate=0.1, n_steps=50):
    """Gradient descent for 2D function."""
    point = np.array(start, dtype=float)
    history = [point.copy()]
    
    for i in range(n_steps):
        grad = grad_f(point)
        point = point - learning_rate * grad
        history.append(point.copy())
        
    return point, np.array(history)

# Minimize f(x, y) = x² + y²
def f(p):
    return p[0]**2 + p[1]**2

def grad_f(p):
    return np.array([2*p[0], 2*p[1]])

start = [4.0, 3.0]
final, history = gradient_descent_2d(f, grad_f, start, learning_rate=0.1, n_steps=30)

print(f"Start: {start}")
print(f"Final: {final}")
print(f"f(final) = {f(final):.10f}")

# Visualize
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

plt.figure(figsize=(10, 8))
contours = plt.contour(X, Y, Z, levels=20, cmap=cm.viridis)
plt.plot(history[:, 0], history[:, 1], 'r.-', markersize=10, linewidth=2)
plt.scatter([start[0]], [start[1]], color='green', s=200, marker='o', label='Start', zorder=5)
plt.scatter([final[0]], [final[1]], color='red', s=200, marker='*', label='End', zorder=5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Gradient Descent on f(x,y) = x² + y²')
plt.legend()
# Pass the contour set explicitly so the colorbar reflects f(x,y), not the last scatter call
plt.colorbar(contours, label='f(x,y)')
plt.axis('equal')
plt.show()

### Calculus Concepts and Their ML Applications

| Calculus Concept | What it Means | ML Application |
|------------------|---------------|----------------|
| **Derivative** | Rate of change of output w.r.t. input | How loss changes when we change one weight |
| **Partial Derivative** | Rate of change w.r.t. one variable (others fixed) | Gradient component for one parameter |
| **Gradient** | Vector of all partial derivatives | Direction to update ALL weights at once |
| **Chain Rule** | Derivative of composed functions = product of derivatives | Backpropagation through network layers |
| **Gradient Descent** | Iteratively move opposite to gradient | Core training algorithm for neural networks |
| **Learning Rate** | Step size in gradient descent | Hyperparameter controlling training speed |
| **Local Minimum** | Point where gradient = 0 and function curves up | Where training might get stuck |
| **Saddle Point** | Point where gradient = 0 but not min or max | Common in high-dim; momentum helps escape |

### The Full Picture: How a Neural Network Learns

1. **Forward pass**: Input flows through network, computing activations layer by layer
2. **Loss computation**: Compare output to target, get a single number (the loss)
3. **Backward pass**: Use chain rule to compute gradient of loss w.r.t. every weight
4. **Parameter update**: Use gradient descent to update all weights
5. **Repeat**: Until loss is small enough

### Effect of Learning Rate

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.1, 0.5, 0.95]
colors = ['blue', 'green', 'orange', 'red']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Contour plot with paths
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

axes[0].contour(X, Y, Z, levels=20, cmap=cm.viridis, alpha=0.5)

for lr, color in zip(learning_rates, colors):
    final, history = gradient_descent_2d(f, grad_f, [4.0, 3.0], learning_rate=lr, n_steps=20)
    axes[0].plot(history[:, 0], history[:, 1], '.-', color=color, markersize=8, 
                 linewidth=2, label=f'lr={lr}')
    
    # Loss curve
    losses = [f(p) for p in history]
    axes[1].plot(losses, color=color, linewidth=2, label=f'lr={lr}')

axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Gradient Descent Paths')
axes[0].legend()
axes[0].axis('equal')

axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss')
axes[1].set_title('Loss over Time')
axes[1].legend()
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

print("Observations:")
print("- Too small (0.01): Slow convergence")
print("- Good (0.1): Steady progress")
print("- Larger (0.5): Reaches the minimum in a single step on this bowl (1 - 2*lr = 0)")
print("- Very large (0.95): Overshoots and flips sign every step; converges slowly (diverges for lr > 1)")

### A More Challenging Function: Rosenbrock

The Rosenbrock function is a classic optimization test:

$$f(x, y) = (1 - x)^2 + 100(y - x^2)^2$$

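The global minimum is at $(1, 1)$, where $f = 0$. The function is known for its narrow, curved valley: reaching the valley is easy, but following it to the minimum is slow for plain gradient descent.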
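The gradient follows from differentiating each term with the chain rule:

$$\frac{\partial f}{\partial x} = -2(1 - x) - 400\,x\,(y - x^2), \qquad \frac{\partial f}{\partial y} = 200\,(y - x^2)$$

Setting $\frac{\partial f}{\partial y} = 0$ forces $y = x^2$. Substituting into $\frac{\partial f}{\partial x} = 0$ leaves $-2(1 - x) = 0$, so $x = 1$ and $y = 1$, which confirms the location of the minimum.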

In [None]:
def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    dx = -2*(1 - x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

# Visualize the function
x = np.linspace(-2, 2, 200)
y = np.linspace(-1, 3, 200)
X, Y = np.meshgrid(x, y)
Z = (1 - X)**2 + 100 * (Y - X**2)**2

plt.figure(figsize=(10, 8))
contours = plt.contour(X, Y, Z, levels=np.logspace(0, 3, 30), cmap=cm.viridis)
plt.scatter([1], [1], color='red', s=200, marker='*', label='Minimum (1,1)', zorder=5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Rosenbrock Function\nNotice the narrow curved valley')
# Pass the contour set explicitly so the colorbar reflects f(x,y), not the scatter call
plt.colorbar(contours, label='f(x,y)')
plt.legend()
plt.show()

In [None]:
# Gradient descent on Rosenbrock (challenging!)
start = [-1.0, 1.0]
final, history = gradient_descent_2d(rosenbrock, rosenbrock_grad, start, 
                                      learning_rate=0.001, n_steps=5000)

print(f"Start: {start}")
print(f"Final: {final}")
print(f"f(final) = {rosenbrock(final):.6f}")
print(f"True minimum: (1, 1), f = 0")

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.contour(X, Y, Z, levels=np.logspace(0, 3, 30), cmap=cm.viridis, alpha=0.5)
plt.plot(history[::50, 0], history[::50, 1], 'r.-', markersize=5, linewidth=1)  # Every 50th point
plt.scatter([start[0]], [start[1]], color='green', s=100, marker='o', label='Start', zorder=5)
plt.scatter([final[0]], [final[1]], color='red', s=100, marker='*', label='End', zorder=5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Gradient Descent on Rosenbrock')
plt.legend()

plt.subplot(1, 2, 2)
losses = [rosenbrock(p) for p in history[::10]]
plt.plot(losses)
plt.xlabel('Iteration (x10)')
plt.ylabel('Loss')
plt.title('Loss over Time')
plt.yscale('log')

plt.tight_layout()
plt.show()

print("\nNote: Simple gradient descent struggles with this function!")
print("More advanced optimizers (Adam, etc.) handle this better.")

---

## 6. Putting It Together: Training a Linear Model

Let's train a simple linear regression model using gradient descent.
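Because the mean squared error of a linear model is a convex quadratic in $(w, b)$, gradient descent should approach the closed-form least-squares solution. The sketch below checks this on hypothetical toy data (separate from the dataset used in this section). The names `x_toy`, `y_toy`, `w_gd` and `b_gd` are illustrative, and `np.polyfit` with degree 1 is used as the least-squares reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data (separate from the dataset used in this section)
x_toy = rng.uniform(-3, 3, 50)
y_toy = 1.5 * x_toy - 0.5 + rng.normal(0, 0.3, 50)

# Gradient descent on MSE = mean((w*x + b - y)^2)
w_gd, b_gd = 0.0, 0.0
lr = 0.1
for _ in range(500):
    residual = w_gd * x_toy + b_gd - y_toy
    w_gd -= lr * 2 * np.mean(residual * x_toy)   # dMSE/dw
    b_gd -= lr * 2 * np.mean(residual)           # dMSE/db

# Closed-form least-squares fit for comparison (degree-1 polynomial)
w_ls, b_ls = np.polyfit(x_toy, y_toy, 1)

print(f"gradient descent: w = {w_gd:.4f}, b = {b_gd:.4f}")
print(f"least squares:    w = {w_ls:.4f}, b = {b_ls:.4f}")
```

Agreement between the two fits is a quick sanity check that the gradient formulas and the learning rate are reasonable.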

In [None]:
# Generate synthetic data
np.random.seed(42)
n_samples = 100

# True parameters
w_true = 2.5
b_true = 1.0

# Generate data: y = w*x + b + noise
X = np.random.uniform(-3, 3, n_samples)
y = w_true * X + b_true + np.random.normal(0, 0.5, n_samples)

plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Data')
plt.plot(X, w_true * X + b_true, 'r-', linewidth=2, label=f'True: y = {w_true}x + {b_true}')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression Dataset')
plt.legend()
plt.show()

In [None]:
def train_linear_regression(X, y, learning_rate=0.01, n_epochs=100):
    """
    Train linear regression using gradient descent.
    
    Model: y_pred = w * x + b
    Loss: MSE = (1/n) * sum((y_pred - y)^2)
    """
    n = len(X)
    
    # Initialize parameters
    w = 0.0
    b = 0.0
    
    history = {'loss': [], 'w': [], 'b': []}
    
    for epoch in range(n_epochs):
        # Forward pass
        y_pred = w * X + b
        
        # Compute loss (MSE)
        loss = np.mean((y_pred - y)**2)
        
        # Compute gradients
        # d(loss)/dw = (2/n) * sum((y_pred - y) * x)
        # d(loss)/db = (2/n) * sum(y_pred - y)
        dw = (2/n) * np.sum((y_pred - y) * X)
        db = (2/n) * np.sum(y_pred - y)
        
        # Update parameters
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record history
        history['loss'].append(loss)
        history['w'].append(w)
        history['b'].append(b)
        
        if epoch % 20 == 0:
            print(f"Epoch {epoch:3d}: loss = {loss:.4f}, w = {w:.4f}, b = {b:.4f}")
    
    return w, b, history

# Train
w_learned, b_learned, history = train_linear_regression(X, y, learning_rate=0.1, n_epochs=100)

print(f"\nLearned: w = {w_learned:.4f}, b = {b_learned:.4f}")
print(f"True:    w = {w_true:.4f}, b = {b_true:.4f}")

In [None]:
# Visualize training
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss curve
axes[0].plot(history['loss'])
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Training Loss')

# Parameter trajectory
axes[1].plot(history['w'], label='w')
axes[1].plot(history['b'], label='b')
axes[1].axhline(y=w_true, color='blue', linestyle='--', alpha=0.5, label=f'w_true={w_true}')
axes[1].axhline(y=b_true, color='orange', linestyle='--', alpha=0.5, label=f'b_true={b_true}')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Parameter Value')
axes[1].set_title('Parameter Convergence')
axes[1].legend()

# Final fit
axes[2].scatter(X, y, alpha=0.6, label='Data')
x_line = np.linspace(-3, 3, 100)
axes[2].plot(x_line, w_true * x_line + b_true, 'g-', linewidth=2, label=f'True: y = {w_true}x + {b_true}')
axes[2].plot(x_line, w_learned * x_line + b_learned, 'r--', linewidth=2, 
             label=f'Learned: y = {w_learned:.2f}x + {b_learned:.2f}')
axes[2].set_xlabel('x')
axes[2].set_ylabel('y')
axes[2].set_title('Final Fit')
axes[2].legend()

plt.tight_layout()
plt.show()

---

## Exercises

### Exercise 1: Implement Gradient Checking

Gradient checking is crucial for debugging backpropagation. Compare analytical gradients with numerical gradients.
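To see why the comparison is useful, the sketch below runs it on a hypothetical function with one correct and one deliberately wrong analytical gradient. The helper `numerical_gradient` and the test function are illustrative assumptions, written to be self-contained.

```python
import numpy as np

def numerical_gradient(func, point, h=1e-5):
    """Central-difference estimate of the gradient of func at point."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h
        grad[i] = (func(point + step) - func(point - step)) / (2 * h)
    return grad

# Hypothetical test function and two candidate gradients
def func(p):
    return p[0] ** 2 * p[1]

def grad_correct(p):
    return np.array([2 * p[0] * p[1], p[0] ** 2])

def grad_buggy(p):
    return np.array([2 * p[0] * p[1], 2 * p[0]])  # wrong second component

point = np.array([1.5, -2.0])
numeric = numerical_gradient(func, point)
err_correct = np.max(np.abs(grad_correct(point) - numeric))
err_buggy = np.max(np.abs(grad_buggy(point) - numeric))
print(f"max abs error (correct gradient): {err_correct:.2e}")
print(f"max abs error (buggy gradient):   {err_buggy:.2e}")
```

The correct gradient differs from the numerical estimate only by rounding error, while the buggy one is off by a large, obvious margin.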

In [None]:
def gradient_check(f, grad_f, point, h=1e-5, threshold=1e-5):
    """
    Compare analytical gradient with numerical gradient.
    Returns True if they match within threshold.
    """
    point = np.array(point, dtype=float)
    analytical_grad = grad_f(point)
    numerical_grad = compute_gradient(f, point, h)
    
    # Compute relative error
    diff = np.abs(analytical_grad - numerical_grad)
    denom = np.maximum(np.abs(analytical_grad) + np.abs(numerical_grad), 1e-10)
    relative_error = diff / denom
    
    print(f"Point: {point}")
    print(f"Analytical gradient: {analytical_grad}")
    print(f"Numerical gradient:  {numerical_grad}")
    print(f"Relative error: {relative_error}")
    print(f"Max relative error: {np.max(relative_error):.2e}")
    
    return np.all(relative_error < threshold)

# Test on f(x,y) = x³ + 2xy + y²
def f(p):
    x, y = p
    return x**3 + 2*x*y + y**2

def grad_f(p):
    x, y = p
    return np.array([3*x**2 + 2*y, 2*x + 2*y])

passed = gradient_check(f, grad_f, [2.0, 3.0])
print(f"\nGradient check passed: {passed}")

### Exercise 2: Implement Softmax and Its Gradient

Softmax is critical for classification. Implement it and its gradient.
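The softmax Jacobian has the closed form $J_{ij} = s_i(\delta_{ij} - s_j)$, which in matrix form is $\mathrm{diag}(s) - s s^\top$. The sketch below is one possible vectorized version checked against finite differences. The names `softmax_vec` and `softmax_jacobian_vec` are hypothetical and chosen to avoid clashing with the exercise functions.

```python
import numpy as np

def softmax_vec(x):
    shifted = x - np.max(x)      # shift for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def softmax_jacobian_vec(x):
    s = softmax_vec(x)
    # diag(s) - s s^T, i.e. J[i, j] = s_i * (delta_ij - s_j)
    return np.diag(s) - np.outer(s, s)

x = np.array([2.0, 1.0, 0.1])
J = softmax_jacobian_vec(x)

# Finite-difference Jacobian: column j holds d softmax / d x_j
h = 1e-6
J_num = np.zeros((x.size, x.size))
for j in range(x.size):
    step = np.zeros_like(x)
    step[j] = h
    J_num[:, j] = (softmax_vec(x + step) - softmax_vec(x - step)) / (2 * h)

print("max abs difference:", np.max(np.abs(J - J_num)))
print("column sums:", J.sum(axis=0))
```

Each column sums to zero because the softmax outputs always sum to 1, so any change in one input must be balanced across the outputs.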

In [None]:
def softmax(x):
    """
    Compute softmax: softmax(x)_i = exp(x_i) / sum(exp(x_j))
    Subtract max for numerical stability.
    """
    # Reference solution (try implementing it yourself first)
    x_shifted = x - np.max(x)  # For numerical stability
    exp_x = np.exp(x_shifted)
    return exp_x / np.sum(exp_x)

def softmax_jacobian(x):
    """
    Compute Jacobian of softmax.
    J[i,j] = d(softmax_i)/d(x_j)
    
    Formula: J[i,j] = softmax_i * (delta_ij - softmax_j)
    where delta_ij = 1 if i==j, else 0
    """
    s = softmax(x)
    n = len(s)
    jacobian = np.zeros((n, n))
    
    # Reference solution (try implementing it yourself first)
    for i in range(n):
        for j in range(n):
            if i == j:
                jacobian[i, j] = s[i] * (1 - s[j])
            else:
                jacobian[i, j] = -s[i] * s[j]
    
    return jacobian

# Test
x = np.array([2.0, 1.0, 0.1])
print(f"Input: {x}")
print(f"Softmax: {softmax(x)}")
print(f"Sum (should be 1): {softmax(x).sum():.6f}")
print(f"\nJacobian:\n{softmax_jacobian(x)}")

### Exercise 3: Gradient Descent with Momentum

Momentum helps accelerate gradient descent. Implement it!
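A rough way to see the acceleration is to count updates on an elongated bowl, $f(x, y) = x^2 + 10y^2$, where a small learning rate makes plain gradient descent crawl along the shallow $x$ direction. The sketch below is self-contained. The helper names, the tolerance, and the learning rate of 0.005 are illustrative assumptions.

```python
import numpy as np

def bowl_grad(p):
    # Gradient of the elongated bowl f(x, y) = x^2 + 10*y^2
    return np.array([2 * p[0], 20 * p[1]])

def bowl_loss(p):
    return p[0] ** 2 + 10 * p[1] ** 2

def steps_to_converge(lr, momentum, tol=1e-6, max_steps=5000):
    """Count updates until loss < tol (momentum=0 gives plain gradient descent)."""
    point = np.array([3.0, 1.0])
    velocity = np.zeros_like(point)
    for step in range(1, max_steps + 1):
        velocity = momentum * velocity - lr * bowl_grad(point)
        point = point + velocity
        if bowl_loss(point) < tol:
            return step
    return max_steps

steps_plain = steps_to_converge(lr=0.005, momentum=0.0)
steps_momentum = steps_to_converge(lr=0.005, momentum=0.9)
print(f"plain GD:      {steps_plain} steps")
print(f"with momentum: {steps_momentum} steps")
```

With the same learning rate, the velocity term accumulates consistent gradient directions, so the momentum run needs noticeably fewer updates on this problem.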

In [None]:
def gradient_descent_momentum(f, grad_f, start, learning_rate=0.01, momentum=0.9, n_steps=100):
    """
    Gradient descent with momentum.
    
    v = momentum * v - learning_rate * gradient
    x = x + v
    """
    point = np.array(start, dtype=float)
    velocity = np.zeros_like(point)
    history = [point.copy()]
    
    for i in range(n_steps):
        grad = grad_f(point)
        velocity = momentum * velocity - learning_rate * grad
        point = point + velocity
        history.append(point.copy())
        
    return point, np.array(history)

# Compare regular GD vs GD with momentum on Rosenbrock
start = [-1.0, 1.0]
n_steps = 1000

final_gd, history_gd = gradient_descent_2d(rosenbrock, rosenbrock_grad, start, 
                                            learning_rate=0.001, n_steps=n_steps)
final_mom, history_mom = gradient_descent_momentum(rosenbrock, rosenbrock_grad, start,
                                                    learning_rate=0.001, momentum=0.9, n_steps=n_steps)

print(f"Regular GD final loss: {rosenbrock(final_gd):.6f}")
print(f"Momentum GD final loss: {rosenbrock(final_mom):.6f}")

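# Visualize
# Rebuild the Rosenbrock grid: X was overwritten by the regression data in Section 6
x_grid = np.linspace(-2, 2, 200)
y_grid = np.linspace(-1, 3, 200)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)
Z_grid = (1 - X_grid)**2 + 100 * (Y_grid - X_grid**2)**2

plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
plt.contour(X_grid, Y_grid, Z_grid, levels=np.logspace(0, 3, 30), cmap=cm.viridis, alpha=0.5)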
plt.plot(history_gd[::20, 0], history_gd[::20, 1], 'b.-', markersize=3, label='Regular GD')
plt.plot(history_mom[::20, 0], history_mom[::20, 1], 'r.-', markersize=3, label='With Momentum')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Optimization Paths')
plt.legend()

plt.subplot(1, 2, 2)
losses_gd = [rosenbrock(p) for p in history_gd]
losses_mom = [rosenbrock(p) for p in history_mom]
plt.plot(losses_gd, 'b-', label='Regular GD')
plt.plot(losses_mom, 'r-', label='With Momentum')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss Comparison')
plt.yscale('log')
plt.legend()

plt.tight_layout()
plt.show()

---

## Summary

### Key Concepts

1. **Derivatives** measure rate of change - essential for optimization
2. **Partial derivatives** handle functions of multiple variables
3. **The gradient** points in the direction of steepest ascent
4. **Chain rule** lets us compute gradients through composed functions (backprop!)
5. **Gradient descent** minimizes loss by following the negative gradient

### Connection to Deep Learning

- **Forward pass**: Compute function values through the network
- **Loss**: Scalar measuring prediction quality
- **Backward pass**: Apply chain rule to compute gradients
- **Update**: Move parameters in negative gradient direction
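The four stages can be sketched end to end on a single sigmoid neuron with squared-error loss. The input, target and starting parameters match the single-neuron example from the chain rule section; the learning rate, step count and variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single training example and starting parameters (illustrative values)
x_in, target = 2.0, 1.0
w_demo, b_demo = 0.5, 0.1
lr = 0.5

losses = []
for step in range(200):
    # Forward pass
    z = w_demo * x_in + b_demo
    a = sigmoid_fn(z)
    # Loss
    loss_value = (a - target) ** 2
    losses.append(loss_value)
    # Backward pass (chain rule)
    dL_dz = 2 * (a - target) * a * (1 - a)
    grad_w = dL_dz * x_in
    grad_b = dL_dz
    # Update
    w_demo -= lr * grad_w
    b_demo -= lr * grad_b

print(f"initial loss: {losses[0]:.6f}")
print(f"final loss:   {losses[-1]:.6f}")
```

A full network repeats the same pattern, with the backward pass applying the chain rule through every layer instead of a single neuron.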

### Checklist
- [ ] I can compute derivatives of common functions
- [ ] I understand partial derivatives and gradients
- [ ] I can apply the chain rule to composite functions
- [ ] I can implement gradient descent from scratch
- [ ] I understand the effect of learning rate

---

## Next Steps

Continue to **Part 1.3: Probability & Statistics** where we'll cover:
- Probability distributions
- Bayes' theorem
- Maximum likelihood estimation
- Information theory (entropy, KL divergence)