# Part 1.2: Calculus for Deep Learning — The Cooking Edition

Calculus is essential for understanding how neural networks learn. The key insight: **learning = optimization**, and optimization requires derivatives.

But here's the thing: **cooks and bakers live and breathe calculus every time they step into the kitchen**. When a baker asks "how much faster will the bread rise if I increase the oven temperature by ten degrees?" — that's a derivative. When a chef asks "which ingredient should I adjust to improve the flavor the most?" — that's a gradient. And when a home cook iteratively tweaks a recipe between attempts to find the perfect version — that's gradient descent.

Every perfectly caramelized crust, every balanced seasoning, every ideal baking time is found through the same mathematical machinery that trains neural networks.

## Learning Objectives
- [ ] Compute partial derivatives of multivariate functions
- [ ] Apply the chain rule to composite functions
- [ ] Understand gradients as directions of steepest ascent
- [ ] Implement gradient descent from scratch

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. Derivatives: The Core Concept

The **derivative** measures the rate of change of a function:

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

**Intuition**: The derivative tells you how much the output changes when you slightly change the input.

**Cooking analogy:** The derivative is the thermometer of calculus. The rate at which your oven heats up is the derivative of temperature with respect to time: it tells you how fast the temperature is changing at any given instant. The speed at which bread dough rises is the derivative of dough volume with respect to time. The rate at which sugar caramelizes is the derivative of browning with respect to temperature. Every "rate of change" question in the kitchen is a derivative question.

### Why This Matters for Deep Learning

If `loss = f(weights)`, then the derivative tells us:
- **Which direction** to change weights to reduce loss
- **How much** each weight affects the loss
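
As a small illustration of these two points, the sketch below uses a hypothetical one-weight loss, `loss(w) = (w - 3)**2`. It is an assumption chosen for illustration, not a model from this notebook. The sign of the derivative indicates which way to move the weight, and its magnitude indicates how strongly the loss responds.

```python
# Hypothetical one-weight loss with its minimum at w = 3 (illustrative assumption).
def loss(w):
    return (w - 3.0) ** 2

def loss_derivative(w):
    # d/dw (w - 3)^2 = 2 (w - 3)
    return 2.0 * (w - 3.0)

for w in [1.0, 3.0, 5.0]:
    slope = loss_derivative(w)
    if slope > 0:
        advice = "decrease w to reduce loss"
    elif slope < 0:
        advice = "increase w to reduce loss"
    else:
        advice = "already at the minimum"
    print(f"w = {w}: loss = {loss(w):.1f}, d(loss)/dw = {slope:+.1f} -> {advice}")
```

A positive slope means the loss rises as `w` rises, so the weight should move down, and the reverse holds for a negative slope.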

### Deep Dive: What Does a Derivative Really Mean?

The derivative answers a fundamental question: **"If I wiggle the input a tiny bit, how much does the output wiggle?"**

Think of it like this:
- You're turning a dial (input x)
- A meter responds (output f(x))
- The derivative tells you: "For each unit I turn the dial, how many units does the meter move?"

**Cooking analogy:** Imagine you're adjusting the oven temperature while baking a soufflé. You nudge the temperature up by a degree or two (the "wiggle"). The derivative tells you: "Per degree of temperature change, how many minutes faster will the soufflé rise?" or "How much more browning will you get on top?" The derivative is the sensitivity of your dish to your adjustment.

#### The Formal Definition Unpacked

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

| Component | Meaning | Cooking Analogy |
|-----------|---------|-----------------|
| $f(x + h)$ | Output after nudging input by tiny amount h | Browning level after a small temperature increase |
| $f(x)$ | Original output | Browning level at current temperature |
| $f(x + h) - f(x)$ | Change in output (how much the meter moved) | Extra browning gained or lost from the change |
| $h$ | Change in input (how much we turned the dial) | Size of the temperature adjustment |
| $\frac{f(x+h) - f(x)}{h}$ | **Rate of change** = output change per unit input change | Browning gained per degree of oven temperature |
| $\lim_{h \to 0}$ | Take h infinitesimally small (instantaneous rate) | The exact sensitivity at this precise temperature |
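
To make the limit concrete, here is a minimal sketch that evaluates the difference quotient for shrinking values of `h`. It assumes `f(x) = x**2` purely because its exact derivative, `2x`, is known. The helper name `difference_quotient` is introduced here for illustration.

```python
# Difference quotient (f(x + h) - f(x)) / h for shrinking h.
# f(x) = x**2 is used because its exact derivative, 2x, is known.
def f(x):
    return x ** 2

def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h

x0 = 3.0
exact = 2 * x0
for h in [1.0, 0.1, 0.01, 0.001]:
    approx = difference_quotient(f, x0, h)
    print(f"h = {h:<6} quotient = {approx:.6f}  error = {abs(approx - exact):.6f}")
```

The error shrinks roughly in proportion to `h`, which is what taking the limit formalizes.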

#### Why Do We Care About Rate of Change?

| Context | What the Derivative Tells Us | Cooking Parallel |
|---------|------------------------------|------------------|
| **Physics** | Velocity = derivative of position. "How fast am I moving right now?" | How fast the oven temperature is climbing right now |
| **Economics** | Marginal cost = derivative of total cost. "Cost of making one more unit?" | Cost of adding one more gram of saffron to the dish |
| **Machine Learning** | Gradient of loss = derivative of loss w.r.t. weights. "How does changing this weight affect the error?" | "How does changing oven temperature affect the bread's crust quality?" |

In ML specifically: **Derivatives tell us which direction to adjust weights to reduce error.**

In cooking specifically: **Derivatives tell the chef which direction to adjust a recipe variable to improve the dish.**

In [None]:
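# Numerical derivative approximation
def numerical_derivative(f, x, h=1e-5):
    """Approximate f'(x) with a central (symmetric) finite difference.

    Uses (f(x + h) - f(x - h)) / (2h), which is typically more accurate than
    the one-sided quotient in the definition above for the same h.
    """
    return (f(x + h) - f(x - h)) / (2 * h)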

# Example: f(x) = x^2 — Think of x as oven temp, f(x) as browning intensity
oven_temp_curve = lambda x: x**2
oven_temp_derivative = lambda x: 2*x  # We know this analytically

temp = 3.0
print(f"f(x) = x² at x = {temp}")
print(f"  (Think: browning intensity as a function of oven temperature)")
print(f"Numerical derivative: {numerical_derivative(oven_temp_curve, temp):.6f}")
print(f"Analytical derivative: {oven_temp_derivative(temp):.6f}")

In [None]:
# Visualization: tangent lines at multiple points (static matplotlib figure)
# Shows how the derivative (slope) changes across the function
# Cooking context: Think of this as how the "sensitivity" of browning to 
# oven temperature changes depending on where you are in the temperature range

def plot_multiple_tangents(f, f_prime, x_range, points, title):
    """
    Plot a function with tangent lines at multiple points.
    This visualizes how the derivative changes across the function.
    """
    x = np.linspace(x_range[0], x_range[1], 200)
    y = f(x)
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Left plot: Function with tangent lines
    axes[0].plot(x, y, 'b-', linewidth=2, label='f(x)')
    
    colors = plt.cm.Reds(np.linspace(0.3, 0.9, len(points)))
    
    for x0, color in zip(points, colors):
        y0 = f(x0)
        slope = f_prime(x0)
        
        # Tangent line: y = f(x0) + f'(x0)(x - x0)
        x_tangent = np.linspace(x0 - 1.5, x0 + 1.5, 50)
        y_tangent = y0 + slope * (x_tangent - x0)
        
        axes[0].plot(x_tangent, y_tangent, '--', color=color, linewidth=1.5, alpha=0.8)
        axes[0].scatter([x0], [y0], color=color, s=80, zorder=5)
        axes[0].annotate(f'slope={slope:.2f}', xy=(x0, y0), xytext=(x0+0.3, y0+0.5),
                        fontsize=9, color=color)
    
    axes[0].set_xlabel('Recipe Parameter')
    axes[0].set_ylabel('Dish Quality')
    axes[0].set_title(f'{title}\nTangent lines show instantaneous rate of change')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    axes[0].set_ylim([min(y) - 1, max(y) + 2])
    
    # Right plot: The derivative function itself
    y_prime = f_prime(x)
    axes[1].plot(x, y_prime, 'r-', linewidth=2, label="f'(x) (derivative)")
    axes[1].axhline(y=0, color='k', linewidth=0.5)
    
    for x0, color in zip(points, colors):
        slope = f_prime(x0)
        axes[1].scatter([x0], [slope], color=color, s=80, zorder=5)
    
    axes[1].set_xlabel('Recipe Parameter')
    axes[1].set_ylabel("f'(x) — Sensitivity")
    axes[1].set_title("The Derivative Function\nShows the sensitivity at every point")
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Example 1: f(x) = x^2 (parabola) — like browning vs oven temperature
f = lambda x: x**2
f_prime = lambda x: 2*x
plot_multiple_tangents(f, f_prime, x_range=(-3, 3), 
                       points=[-2, -1, 0, 1, 2], 
                       title='f(x) = x² — Browning vs Oven Temperature')

print("Key observations for f(x) = x²:")
print("- At x=0: slope is 0 (bottom of the parabola - minimum!)")
print("  Cooking: At the ideal temperature, tiny changes barely affect browning")
print("- Negative x: slope is negative (function decreasing)")
print("- Positive x: slope is positive (function increasing)")
print("- Slope magnitude increases as we move away from 0")
print("  Cooking: The further from ideal temp, the more sensitive the dish becomes")

In [None]:
# Visualize derivative as slope of tangent line
def plot_tangent(f, f_prime, x0, title):
    """Plot function with tangent line at x0."""
    x = np.linspace(x0 - 2, x0 + 2, 100)
    y = f(x)
    
    # Tangent line: y = f(x0) + f'(x0)(x - x0)
    slope = f_prime(x0)
    tangent = f(x0) + slope * (x - x0)
    
    plt.figure(figsize=(10, 6))
    plt.plot(x, y, 'b-', linewidth=2, label='f(x)')
    plt.plot(x, tangent, 'r--', linewidth=2, label=f'Tangent (slope = {slope:.2f})')
    plt.scatter([x0], [f(x0)], color='red', s=100, zorder=5)
    plt.xlabel('Oven Temperature (scaled)')
    plt.ylabel('Browning Intensity')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# f(x) = x² at different points
f = lambda x: x**2
f_prime = lambda x: 2*x

plot_tangent(f, f_prime, 1.5, "Browning vs Oven Temp: f(x) = x² with tangent at x = 1.5\nSlope tells us how sensitive browning is to temperature changes at this point")

In [None]:
# Example 2: A more complex function - sine wave
# Cooking context: Oscillating quality, like bread dough rising and falling
# (volume cycles as yeast ferments, peaks, then over-proofs)
f = lambda x: np.sin(x)
f_prime = lambda x: np.cos(x)
plot_multiple_tangents(f, f_prime, x_range=(-2*np.pi, 2*np.pi), 
                       points=[-np.pi, -np.pi/2, 0, np.pi/2, np.pi], 
                       title='f(x) = sin(x) — Oscillating Dough Rise')

print("\nKey observations for f(x) = sin(x):")
print("- At peaks/troughs (x = +/-pi/2): slope is 0 (maxima/minima)")
print("  Cooking: When dough rise peaks or collapses, it's momentarily stable")
print("- At zero crossings (x = 0, +/-pi): slope is +/-1 (steepest)")
print("  Cooking: Dough volume is changing fastest during transitions")
print("- The derivative of sin(x) is cos(x), which is sin(x) shifted left by pi/2")

### Common Derivatives

| Function | Derivative | Cooking Analogy |
|----------|------------|-----------------|
| $x^n$ | $nx^{n-1}$ | Power-law relationships (caramelization rate scales with sugar concentration squared) |
| $e^x$ | $e^x$ | Exponential yeast growth — rate of rising proportional to current yeast population |
| $\ln(x)$ | $1/x$ | Diminishing returns — each extra pinch of salt matters less as you add more |
| $\sin(x)$ | $\cos(x)$ | Oscillating dough rise through proofing cycles |
| $\cos(x)$ | $-\sin(x)$ | Phase-shifted oscillations in oven temperature regulation |
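
The entries in this table can be spot-checked numerically. The sketch below compares each claimed derivative against a central finite difference at one arbitrary test point. The helper `central_difference` and the test point `x0 = 1.3` are assumptions made for this check, and `x^3` stands in for the general power rule.

```python
import math

def central_difference(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (label, function, derivative claimed by the table)
table = [
    ("x^3",    lambda x: x ** 3, lambda x: 3 * x ** 2),
    ("e^x",    math.exp,         math.exp),
    ("ln(x)",  math.log,         lambda x: 1 / x),
    ("sin(x)", math.sin,         math.cos),
    ("cos(x)", math.cos,         lambda x: -math.sin(x)),
]

x0 = 1.3  # arbitrary positive test point (ln needs x > 0)
for label, func, deriv in table:
    numeric = central_difference(func, x0)
    print(f"{label:7s} numeric = {numeric:.6f}  table = {deriv(x0):.6f}")
```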

### Activation Functions and Their Derivatives

These are critical for backpropagation!

**Cooking analogy:** Activation functions are like cooking response curves. A sigmoid is like caramelization: a gentle start, rapid change in the sweet spot, then saturation at burnt. ReLU is like a boiling threshold: nothing happens below 100 °C, then vigorous bubbling above it.

In [None]:
# Sigmoid and its derivative — like caramelization: gentle at first, rapid in sweet spot, saturates at burnt
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # Nice property!

# ReLU and its derivative — like a boiling threshold: nothing below 100C, vigorous above
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# Tanh and its derivative — like seasoning balance: symmetric, saturates at extremes
def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

# Plot them all
x = np.linspace(-5, 5, 200)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

activations = [
    (sigmoid, sigmoid_derivative, 'Sigmoid\n(Caramelization)'),
    (relu, relu_derivative, 'ReLU\n(Boiling Threshold)'),
    (tanh, tanh_derivative, 'Tanh\n(Seasoning Balance)')
]

for i, (func, deriv, name) in enumerate(activations):
    # Function
    axes[0, i].plot(x, func(x), 'b-', linewidth=2)
    axes[0, i].set_title(f'{name}')
    axes[0, i].set_xlabel('Input Signal')
    axes[0, i].axhline(y=0, color='k', linewidth=0.5)
    axes[0, i].axvline(x=0, color='k', linewidth=0.5)
    axes[0, i].grid(True, alpha=0.3)
    
    # Derivative
    axes[1, i].plot(x, deriv(x), 'r-', linewidth=2)
    axes[1, i].set_title(f'{name.split(chr(10))[0]} Derivative (Sensitivity)')
    axes[1, i].set_xlabel('Input Signal')
    axes[1, i].axhline(y=0, color='k', linewidth=0.5)
    axes[1, i].axvline(x=0, color='k', linewidth=0.5)
    axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key observations:")
print("- Sigmoid derivative max is 0.25 (small factors multiply across layers -> vanishing gradients)")
print("  Cooking: Like over-caramelized sugar — can't tell the difference anymore")
print("- ReLU derivative is 0 or 1 (no vanishing gradient for positive x)")
print("  Cooking: Clean on/off response — either boiling or not")
print("- Tanh derivative max is 1 (better than sigmoid)")
print("  Cooking: Better seasoning feel, but still saturates at extremes")

---

## 2. Partial Derivatives

For functions of multiple variables, **partial derivatives** measure the rate of change with respect to one variable while holding others constant.

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}$$

**Cooking analogy:** A recipe has many variables — flour amount, sugar amount, butter, eggs, oven temperature, baking time. A partial derivative answers the question: **"If I change ONLY the sugar amount (holding everything else fixed), how much does the sweetness change?"** This is exactly what experienced bakers do when perfecting a recipe — isolate one variable at a time to understand its individual effect.

### Example

For $f(x, y) = x^2 + 3xy + y^2$:

$$\frac{\partial f}{\partial x} = 2x + 3y$$
$$\frac{\partial f}{\partial y} = 3x + 2y$$
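
These results follow from differentiating term by term while treating the other variable as a constant:

$$\frac{\partial f}{\partial x} = \frac{\partial}{\partial x}\left(x^2\right) + \frac{\partial}{\partial x}\left(3xy\right) + \frac{\partial}{\partial x}\left(y^2\right) = 2x + 3y + 0$$

$$\frac{\partial f}{\partial y} = \frac{\partial}{\partial y}\left(x^2\right) + \frac{\partial}{\partial y}\left(3xy\right) + \frac{\partial}{\partial y}\left(y^2\right) = 0 + 3x + 2y$$

For instance, at the point $(x, y) = (2, 3)$:

$$\frac{\partial f}{\partial x} = 2(2) + 3(3) = 13, \qquad \frac{\partial f}{\partial y} = 3(2) + 2(3) = 12$$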

### Deep Dive: Why Does the Gradient Point "Uphill"?

This is one of the most important insights in optimization. Let's build intuition for WHY the gradient points in the direction of steepest increase.

#### The Gradient as a "Which Way is Up?" Detector

Imagine you're standing on a hilly surface (the function) and want to find the steepest uphill direction. The gradient is like a compass that always points uphill.

**Cooking analogy:** Think of the recipe quality landscape as a terrain map, where altitude represents how far the dish is from perfection (the error). The gradient at your current recipe is like your taste buds telling you which combination of ingredient changes would make the dish worse the fastest. To make the dish **better**, you go in the **opposite** direction — down the gradient, toward lower error. This is exactly what gradient descent does.

**Mathematical Intuition:**

The gradient $\nabla f$ at a point gives you the direction where the function increases **most rapidly**.

Think about it:
- $\frac{\partial f}{\partial x}$ tells you: "If I move in the x-direction, how fast does f increase?"
- $\frac{\partial f}{\partial y}$ tells you: "If I move in the y-direction, how fast does f increase?"
- The gradient combines these: "The optimal uphill direction is a blend of these, weighted by how steep each direction is"

**Cooking analogy:** If adding sugar (x) improves the flavor score by 0.3 per gram, and adding salt (y) improves it by 0.5 per gram, the gradient is $(0.3, 0.5)$ and it tells you: "The fastest way to raise the score is a blend of both, weighted by their individual sensitivities."
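
As a rough sketch of that blend, the snippet below takes the two sensitivities from the analogy (0.3 and 0.5, hypothetical numbers) and compares the first-order score change for a step of length 1 taken in three directions. The variable names are introduced here for illustration.

```python
import math

# Hypothetical sensitivities taken from the analogy:
# flavor score change per gram of sugar and per gram of salt.
gradient = [0.3, 0.5]

magnitude = math.hypot(*gradient)
direction = [g / magnitude for g in gradient]  # unit vector of steepest increase

print(f"gradient magnitude: {magnitude:.4f}")
print(f"unit direction (sugar, salt): ({direction[0]:.4f}, {direction[1]:.4f})")

# First-order estimate of the score change for a unit-length step in each direction.
for label, step in [("sugar only", [1.0, 0.0]),
                    ("salt only", [0.0, 1.0]),
                    ("along gradient", direction)]:
    gain = sum(g * s for g, s in zip(gradient, step))
    print(f"{label:15s} -> estimated score change {gain:.4f}")
```

The step along the gradient gives a larger estimated change than spending the whole step on either ingredient alone.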

#### Directional Derivatives: Movement in Any Direction

If you move in direction $\mathbf{u}$ (a unit vector), the rate of change is:

$$\frac{\partial f}{\partial \mathbf{u}} = \nabla f \cdot \mathbf{u} = |\nabla f| \cos(\theta)$$

Where $\theta$ is the angle between gradient and movement direction.

| Direction relative to gradient | $\cos(\theta)$ | Rate of change | Cooking Interpretation |
|-------------------------------|----------------|----------------|------------------------|
| Same as gradient ($\theta = 0^\circ$) | 1 | Maximum increase | Worst possible recipe change (max error increase) |
| Perpendicular ($\theta = 90^\circ$) | 0 | No change (contour line) | A trade-off change: different but equally good |
| Opposite ($\theta = 180^\circ$) | -1 | Maximum decrease | Best possible recipe change (max error decrease) |

**This is why gradient descent works!** Moving opposite to the gradient gives maximum decrease.
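
The three rows of the table can be checked numerically. This sketch assumes an illustrative function `f(x, y) = x**2 + 0.5 * y**2` and the point `(2, 1)`. It evaluates the directional derivative along, perpendicular to, and opposite the gradient, then samples random directions to see whether any of them decreases `f` faster than the opposite direction.

```python
import math
import random

def grad_f(x, y):
    # Gradient of the illustrative function f(x, y) = x**2 + 0.5 * y**2
    return (2 * x, y)

gx, gy = grad_f(2.0, 1.0)
grad_norm = math.hypot(gx, gy)

def directional_derivative(theta):
    """Rate of change when moving along the unit vector at angle theta (radians)."""
    return gx * math.cos(theta) + gy * math.sin(theta)

grad_angle = math.atan2(gy, gx)
for label, offset in [("along gradient", 0.0),
                      ("perpendicular", math.pi / 2),
                      ("opposite", math.pi)]:
    rate = directional_derivative(grad_angle + offset)
    print(f"{label:15s} rate = {rate:+.4f}")

# No sampled direction should decrease f faster than the opposite direction.
random.seed(0)
samples = [directional_derivative(random.uniform(0, 2 * math.pi)) for _ in range(1000)]
print(f"most negative sampled rate = {min(samples):+.4f}, -|grad| = {-grad_norm:+.4f}")
```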

In [None]:
# Numerical partial derivatives
def partial_derivative(f, point, var_index, h=1e-5):
    """
    Compute partial derivative of f at point with respect to variable var_index.
    
    Cooking context: This is like measuring the sensitivity of flavor score to ONE 
    ingredient while holding all others fixed.
    
    Args:
        f: Function taking array of variables (recipe ingredients)
        point: Array of variable values (current recipe)
        var_index: Which variable to differentiate (which ingredient to tweak)
        h: Step size (how big a tweak)
    """
    point = np.array(point, dtype=float)
    point_plus = point.copy()
    point_minus = point.copy()
    point_plus[var_index] += h
    point_minus[var_index] -= h
    return (f(point_plus) - f(point_minus)) / (2 * h)

# f(sugar_g, oven_temp) = sugar_g² + 3*sugar_g*oven_temp + oven_temp²
# Think: flavor score as a function of two recipe parameters
def flavor_model(p):
    sugar_g, oven_temp = p
    return sugar_g**2 + 3*sugar_g*oven_temp + oven_temp**2

# Analytical partial derivatives
def dflavor_d_sugar(sugar_g, oven_temp):
    return 2*sugar_g + 3*oven_temp

def dflavor_d_temp(sugar_g, oven_temp):
    return 3*sugar_g + 2*oven_temp

# Test at point (2, 3) — sugar_g=2, oven_temp=3
recipe_point = [2, 3]
print(f"At recipe point (sugar_g={recipe_point[0]}, oven_temp={recipe_point[1]}):")
print(f"  d(flavor)/d(sugar_g) numerical:  {partial_derivative(flavor_model, recipe_point, 0):.6f}")
print(f"  d(flavor)/d(sugar_g) analytical: {dflavor_d_sugar(*recipe_point):.6f}")
print(f"  d(flavor)/d(oven_temp) numerical:  {partial_derivative(flavor_model, recipe_point, 1):.6f}")
print(f"  d(flavor)/d(oven_temp) analytical: {dflavor_d_temp(*recipe_point):.6f}")
print(f"\nInterpretation: sugar_g has a bigger derivative ({dflavor_d_sugar(*recipe_point)}) than oven_temp ({dflavor_d_temp(*recipe_point)})")
print("=> Changing the sugar amount would affect flavor more at this recipe point")

In [None]:
# Gradient field visualization (static matplotlib figure)
# Shows gradient as arrows pointing uphill, with different movement directions
# Cooking: Gradient field over the recipe landscape — arrows show "which way makes the dish worse"

def visualize_gradient_directions():
    """
    Visualization showing:
    1. Gradient field (arrows pointing uphill / toward worse flavor)
    2. How rate of change varies with direction
    3. Why opposite-to-gradient is the best descent direction
    """
    
    fig = plt.figure(figsize=(16, 5))
    
    # Define function: f(x,y) = x² + 0.5*y² (elliptical paraboloid)
    # Cooking: Flavor error surface — sugar² + 0.5*temp²
    def f(x, y):
        return x**2 + 0.5*y**2
    
    def grad_f(x, y):
        return np.array([2*x, y])
    
    # Create grid
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    Z = f(X, Y)
    
    # Plot 1: 3D surface
    ax1 = fig.add_subplot(131, projection='3d')
    ax1.plot_surface(X, Y, Z, cmap=cm.viridis, alpha=0.7)
    ax1.set_xlabel('Sugar Amount')
    ax1.set_ylabel('Oven Temp')
    ax1.set_zlabel('Flavor Error')
    ax1.set_title('Flavor Error Surface\n(Bowl = perfect recipe at center)')
    
    # Plot 2: Contour with gradient field
    ax2 = fig.add_subplot(132)
    contour = ax2.contour(X, Y, Z, levels=15, cmap=cm.viridis)
    ax2.clabel(contour, inline=True, fontsize=8)
    
    # Sparse grid for gradient arrows
    x_sparse = np.linspace(-2.5, 2.5, 8)
    y_sparse = np.linspace(-2.5, 2.5, 8)
    X_s, Y_s = np.meshgrid(x_sparse, y_sparse)
    
    U = 2 * X_s  # df/d(sugar)
    V = Y_s      # df/d(temp)
    
    # Normalize for visualization
    mag = np.sqrt(U**2 + V**2) + 1e-10
    U_norm = U / mag * 0.4
    V_norm = V / mag * 0.4
    
    ax2.quiver(X_s, Y_s, U_norm, V_norm, mag, cmap=cm.Reds, alpha=0.8)
    ax2.set_xlabel('Sugar Amount')
    ax2.set_ylabel('Oven Temp')
    ax2.set_title('Gradient Field Over Recipe Space\nArrows point toward WORSE flavor')
    ax2.set_aspect('equal')
    ax2.plot([0], [0], 'k*', markersize=15, label='Perfect Recipe')
    ax2.legend()
    
    # Plot 3: Directional derivative at a specific point
    ax3 = fig.add_subplot(133)
    
    # Pick a point (current recipe)
    px, py = 2.0, 1.0
    grad = grad_f(px, py)
    grad_mag = np.linalg.norm(grad)
    
    # Compute directional derivative for all directions
    angles = np.linspace(0, 2*np.pi, 100)
    dir_derivs = []
    for theta in angles:
        direction = np.array([np.cos(theta), np.sin(theta)])
        dir_deriv = np.dot(grad, direction)
        dir_derivs.append(dir_deriv)
    
    # Plot directional derivative vs angle
    ax3.plot(np.degrees(angles), dir_derivs, 'b-', linewidth=2)
    ax3.axhline(y=0, color='k', linewidth=0.5)
    ax3.axhline(y=grad_mag, color='g', linestyle='--', label=f'Max = |grad| = {grad_mag:.2f}')
    ax3.axhline(y=-grad_mag, color='r', linestyle='--', label=f'Min = -|grad| = {-grad_mag:.2f}')
    
    # Mark special directions
    grad_angle = np.degrees(np.arctan2(grad[1], grad[0]))
    ax3.axvline(x=grad_angle, color='g', alpha=0.5)
    ax3.axvline(x=grad_angle + 180, color='r', alpha=0.5)
    
    ax3.set_xlabel('Recipe Change Direction (degrees)')
    ax3.set_ylabel('Rate of Flavor Error Change')
    ax3.set_title(f'Sensitivity vs Direction at Recipe ({px}, {py})\nWhich way to adjust?')
    ax3.legend(loc='lower right')
    ax3.set_xticks([0, 90, 180, 270, 360])
    ax3.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"At recipe (sugar={px}, temp={py}):")
    print(f"  Gradient = {grad}")
    print(f"  Gradient magnitude = {grad_mag:.2f}")
    print(f"  Gradient direction = {grad_angle:.1f} degrees")
    print(f"\n  To make dish WORSE fastest: change recipe at {grad_angle:.1f} degrees (with gradient)")
    print(f"  To make dish BETTER fastest: change recipe at {(grad_angle + 180) % 360:.1f} degrees (against gradient)")
    print(f"  To trade off (same quality): change at {(grad_angle + 90) % 360:.1f} degrees or {(grad_angle - 90) % 360:.1f} degrees")

visualize_gradient_directions()

In [None]:
# Visualize a 2D function and its partial derivatives
# Cooking: Flavor error as a function of two recipe parameters
def flavor_error_surface(sugar_amount, oven_temp):
    return sugar_amount**2 + oven_temp**2

# Create meshgrid
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z = flavor_error_surface(X, Y)

fig = plt.figure(figsize=(15, 5))

# 3D surface
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z, cmap=cm.viridis, alpha=0.8)
ax1.set_xlabel('Sugar Amount')
ax1.set_ylabel('Oven Temp')
ax1.set_zlabel('Flavor Error')
ax1.set_title('Error = sugar² + temp²')

# Contour plot with gradient vectors
ax2 = fig.add_subplot(132)
contour = ax2.contour(X, Y, Z, levels=15, cmap=cm.viridis)
ax2.clabel(contour, inline=True, fontsize=8)

# Add gradient vectors at some recipe points
recipe_points = [(-2, -2), (-2, 0), (0, 2), (1, 1), (2, -1)]
for px, py in recipe_points:
    grad_x = 2 * px  # d(error)/d(sugar) = 2*sugar
    grad_y = 2 * py  # d(error)/d(temp) = 2*temp
    ax2.arrow(px, py, grad_x*0.3, grad_y*0.3, head_width=0.15, head_length=0.1, fc='red', ec='red')

ax2.set_xlabel('Sugar Amount')
ax2.set_ylabel('Oven Temp')
ax2.set_title('Recipe Space with Gradient Vectors\n(Red arrows = direction of increasing error)')
ax2.set_aspect('equal')

# Slice at oven_temp=1 — what happens when we only vary sugar?
ax3 = fig.add_subplot(133)
temp_fixed = 1
z_slice = flavor_error_surface(x, temp_fixed)
ax3.plot(x, z_slice, 'b-', linewidth=2)
ax3.set_xlabel('Sugar Amount')
ax3.set_ylabel(f'Flavor Error (temp={temp_fixed})')
ax3.set_title(f'Partial View: Vary Sugar Only (temp={temp_fixed})\nd(error)/d(sugar) = 2*sugar')
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 3. The Gradient

The **gradient** is the vector of all partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

**Cooking analogy:** The gradient is the chef's **complete sensitivity report**. Instead of asking about one ingredient at a time, the gradient bundles ALL partial derivatives into a single vector: "Here's how sensitive the dish is to sugar, salt, butter, oven temp, baking time... all at once." The largest entries show which ingredients matter most at the current recipe, and the signs show which way each one pushes the result.
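
A minimal sketch of such a sensitivity report follows. The helper `numerical_gradient` and the three-ingredient `recipe_error` model are hypothetical, introduced only to show one partial derivative being collected per variable.

```python
def numerical_gradient(f, point, h=1e-5):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad

# Hypothetical three-ingredient error model (assumption for illustration):
# error = (sugar - 2)^2 + 3 * (salt - 1)^2 + 0.5 * (butter - 4)^2
def recipe_error(p):
    sugar, salt, butter = p
    return (sugar - 2) ** 2 + 3 * (salt - 1) ** 2 + 0.5 * (butter - 4) ** 2

current_recipe = [3.0, 2.0, 2.0]
grad = numerical_gradient(recipe_error, current_recipe)
for name, g in zip(["sugar", "salt", "butter"], grad):
    print(f"d(error)/d({name}) = {g:+.4f}")
```

In this toy model the salt entry is the largest in magnitude, so the error is most sensitive to salt at this recipe. The negative butter entry means adding butter would lower the error.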

### Key Properties

1. **Direction**: Points in the direction of steepest **increase** (toward worse flavor, when $f$ measures flavor error)
2. **Magnitude**: Tells how steep that increase is (how sensitive the dish is to changes)
3. **To minimize**: Move in the **opposite** direction (negative gradient = toward better flavor)
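
The third property is the whole algorithm in miniature. Below is a minimal gradient descent sketch under stated assumptions: the error surface `sugar**2 + 0.5 * temp**2`, the starting recipe `(2.0, 1.0)`, and the `learning_rate` and `steps` values are all illustrative choices, not tuned settings.

```python
def grad_error(sugar, temp):
    # Gradient of the illustrative error surface sugar**2 + 0.5 * temp**2
    return 2 * sugar, temp

def gradient_descent(start, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient; returns the visited points."""
    sugar, temp = start
    path = [(sugar, temp)]
    for _ in range(steps):
        g_sugar, g_temp = grad_error(sugar, temp)
        sugar -= learning_rate * g_sugar
        temp -= learning_rate * g_temp
        path.append((sugar, temp))
    return path

path = gradient_descent(start=(2.0, 1.0))
for step in [0, 1, 5, 20, 50]:
    s, t = path[step]
    print(f"step {step:2d}: sugar = {s:.4f}, temp = {t:.4f}, error = {s**2 + 0.5 * t**2:.6f}")
```

Each update subtracts a small multiple of the gradient, so the recipe moves toward the bottom of the bowl at the origin. A learning rate that is too large can overshoot instead.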

### Deep Dive: Why the Chain Rule is CRITICAL for Deep Learning

The chain rule is not just another calculus rule - it's the **mathematical heart of backpropagation**. Without it, we couldn't train neural networks.

**Cooking analogy:** The chain rule describes **sequential dependencies** in a system. In cooking, oven temperature affects crust formation, which affects moisture retention, which affects interior texture, which affects overall quality. If you want to know "how does a 10-degree oven increase affect the overall quality?", you need to chain together the sensitivities at each link: (d_crust/d_temp) x (d_moisture/d_crust) x (d_texture/d_moisture) x (d_quality/d_texture). Each link multiplies the effect through the chain. That is the chain rule.
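
To sketch this multiplication of links in code, the example below uses a shortened three-stage chain (temperature to crust to moisture to quality). The stage functions are hypothetical formulas chosen so the derivatives are easy to verify, not measured kitchen data.

```python
import math

# Hypothetical stage functions (assumptions for illustration).
def crust(temp):
    return 0.5 * temp          # oven temperature -> crust formation

def moisture(c):
    return math.sqrt(c)        # crust -> moisture retention

def quality(m):
    return m ** 2 + 1.0        # moisture -> overall quality

# Derivative of each stage with respect to its own input.
def d_crust(temp):
    return 0.5

def d_moisture(c):
    return 0.5 / math.sqrt(c)

def d_quality(m):
    return 2.0 * m

temp = 8.0
c = crust(temp)
m = moisture(c)

# Chain rule: multiply the local sensitivities along the chain.
chain = d_quality(m) * d_moisture(c) * d_crust(temp)

# Independent check with a central difference on the full composition.
h = 1e-6
full = lambda t: quality(moisture(crust(t)))
numeric = (full(temp + h) - full(temp - h)) / (2 * h)
print(f"chain rule: {chain:.6f}, numerical: {numeric:.6f}")
```

Each local derivative is evaluated at the value that actually flows into its stage (`temp`, `c`, `m`), which is the same bookkeeping backpropagation performs.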

#### The Core Insight

When you have nested functions (f composed with g), changes **propagate** through the chain:

$$\text{small change in } x \rightarrow \text{change in } g(x) \rightarrow \text{change in } f(g(x))$$

The chain rule says: **multiply the rates of change at each step**.

#### Breaking Down the Formula

For $y = f(g(x))$, let's call $u = g(x)$ (the intermediate value):

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

| Component | Meaning | Neural Network Terms | Cooking Analogy |
|-----------|---------|----------------------|-----------------|
| $\frac{du}{dx}$ | How much does u change when x changes? | "Local gradient" of layer | How crust changes with oven temperature |
| $\frac{dy}{du}$ | How much does y change when u changes? | "Upstream gradient" from later layers | How overall quality changes with crust |
| $\frac{dy}{dx}$ | How much does y change when x changes? | "Full gradient" through the network | How overall quality changes with oven temperature |

#### Visual: How Changes Propagate

```
Input x                        Output y
   |                              |
   v                              v
   x ----[g]----> u = g(x) ----[f]----> y = f(u)
   
   Δx    ->     Δu = (dg/dx)·Δx   ->   Δy = (df/du)·Δu
                                            = (df/du)·(dg/dx)·Δx
```

The change in x gets **amplified (or diminished)** at each step, and the total effect is the product!

**Cooking analogy:** A small oven temperature change ($\Delta x$) produces a crust change, which produces a moisture change, which produces a texture change. If each step amplifies by a factor of 2, the total effect is $2 \times 2 \times 2 = 8$ times the original change. But if one link attenuates (say the crust insulates the interior), the effect might be only $2 \times 0.1 \times 2 = 0.4$ times. When many attenuating factors multiply together, the signal shrinks toward zero, which is the mechanism behind the **vanishing gradient problem** in deep networks.
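
The attenuation can be sketched with sigmoid activations, whose derivative never exceeds 0.25. The snippet multiplies that best-case local derivative across several layers. Weights are deliberately ignored, a simplifying assumption that isolates the activation's contribution.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Best case for sigmoid: every layer sits at x = 0, where the derivative peaks at 0.25.
# Weights are ignored here (simplifying assumption) to isolate the activation's effect.
best_local = sigmoid_derivative(0.0)
for depth in [1, 5, 10, 20]:
    print(f"{depth:2d} layers: product of local derivatives = {best_local ** depth:.3e}")
```

Even in this best case the product shrinks geometrically with depth, so the earliest layers receive a very small gradient signal.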

In [None]:
# Step-by-step example with ACTUAL NUMBERS
# Cooking context: How oven temperature affects dish quality through a chain of dependencies
# Let's trace through f(g(x)) = (2x + 1)³ at x = 2

print("=" * 60)
print("CHAIN RULE: Step-by-Step with Actual Numbers")
print("=" * 60)
print("\nFunction: y = (2x + 1)³")
print("This is f(g(x)) where g(x) = 2x + 1 and f(u) = u³")
print("\nCooking context: x = oven temp setting, g(x) = crust formation,")
print("f(g(x)) = effect on dish quality (cubed relationship)")
print("\n" + "-" * 60)

x = 2
print(f"Evaluating at x = {x}")

# Step 1: Forward pass - compute intermediate and final values
u = 2*x + 1  # g(x)
y = u**3      # f(u)

print(f"\n1. FORWARD PASS:")
print(f"   u = g(x) = 2({x}) + 1 = {u}")
print(f"   y = f(u) = {u}³ = {y}")

# Step 2: Compute local derivatives
dg_dx = 2          # derivative of g(x) = 2x + 1 is 2
df_du = 3 * u**2   # derivative of f(u) = u³ is 3u²

print(f"\n2. LOCAL DERIVATIVES (at this point):")
print(f"   dg/dx = d(2x+1)/dx = 2")
print(f"   df/du = d(u³)/du = 3u² = 3({u})² = {df_du}")

# Step 3: Apply chain rule
dy_dx = df_du * dg_dx

print(f"\n3. CHAIN RULE:")
print(f"   dy/dx = (df/du) x (dg/dx)")
print(f"         = {df_du} x {dg_dx}")
print(f"         = {dy_dx}")

# Verify with the analytical derivative
# y = (2x+1)³, so dy/dx = 3(2x+1)² × 2 = 6(2x+1)²
dy_dx_analytical = 6 * (2*x + 1)**2
print(f"\n4. VERIFICATION:")
print(f"   Analytical formula: dy/dx = 6(2x+1)²")
print(f"   At x = {x}: dy/dx = 6({2*x+1})² = {dy_dx_analytical}")
print(f"   Match: {dy_dx == dy_dx_analytical}")

# What does this mean?
print(f"\n5. INTERPRETATION:")
print(f"   If we increase x by a tiny amount dx = 0.001:")
print(f"   y will increase by approximately {dy_dx} x 0.001 = {dy_dx * 0.001}")
print(f"   Cooking: A tiny oven temp tweak propagates and amplifies through the chain!")

# Verify numerically
h = 0.001
y_original = (2*x + 1)**3
y_nudged = (2*(x + h) + 1)**3
actual_change = y_nudged - y_original
print(f"   Actual change: {y_nudged} - {y_original} = {actual_change:.6f}")
print(f"   Predicted change: {dy_dx * h:.6f}")

In [None]:
# Visualization: Chain Rule as Signal Propagation
# Shows how a small change propagates through composed functions
# Cooking: Like tracing how an oven temperature change ripples through the baking process

def visualize_chain_rule_propagation():
    """
    Visualize how changes propagate through composed functions.
    f(g(x)) = sin(x²) 
    Cooking: Think of x as sugar amount, g(x) = caramelization level, f(g(x)) = flavor output
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    x = np.linspace(-2, 2, 200)
    
    # g(x) = x²  (caramelization response to sugar)
    g = x**2
    # f(u) = sin(u) where u = g(x)  (flavor oscillation from caramelization)
    f = np.sin(g)
    
    # Plot g(x)
    axes[0].plot(x, g, 'b-', linewidth=2)
    axes[0].set_xlabel('Sugar Amount (x)')
    axes[0].set_ylabel('Caramelization u = g(x) = x²')
    axes[0].set_title('Step 1: Inner function\nSugar -> Caramelization')
    axes[0].grid(True, alpha=0.3)
    axes[0].axhline(y=0, color='k', linewidth=0.5)
    axes[0].axvline(x=0, color='k', linewidth=0.5)
    
    # Highlight a point
    x0 = 1.5
    u0 = x0**2
    axes[0].scatter([x0], [u0], color='red', s=100, zorder=5)
    axes[0].annotate(f'x={x0}\nu={u0:.2f}', xy=(x0, u0), xytext=(x0+0.3, u0-0.5), fontsize=10)
    
    # Plot f(u) = sin(u)
    u = np.linspace(0, 4, 200)
    axes[1].plot(u, np.sin(u), 'g-', linewidth=2)
    axes[1].set_xlabel('Caramelization (u)')
    axes[1].set_ylabel('Flavor Output y = f(u) = sin(u)')
    axes[1].set_title('Step 2: Outer function\nCaramelization -> Flavor')
    axes[1].grid(True, alpha=0.3)
    axes[1].axhline(y=0, color='k', linewidth=0.5)
    
    y0 = np.sin(u0)
    axes[1].scatter([u0], [y0], color='red', s=100, zorder=5)
    axes[1].annotate(f'u={u0:.2f}\ny={y0:.2f}', xy=(u0, y0), xytext=(u0+0.3, y0+0.2), fontsize=10)
    
    # Plot the composition f(g(x))
    axes[2].plot(x, f, 'm-', linewidth=2)
    axes[2].set_xlabel('Sugar Amount (x)')
    axes[2].set_ylabel('Flavor Output y = sin(x²)')
    axes[2].set_title('Result: Full Chain\nSugar -> Flavor (composed)')
    axes[2].grid(True, alpha=0.3)
    axes[2].axhline(y=0, color='k', linewidth=0.5)
    axes[2].axvline(x=0, color='k', linewidth=0.5)
    
    axes[2].scatter([x0], [y0], color='red', s=100, zorder=5)
    axes[2].annotate(f'x={x0}\ny={y0:.2f}', xy=(x0, y0), xytext=(x0+0.2, y0+0.3), fontsize=10)
    
    plt.tight_layout()
    plt.show()
    
    # Now show the gradient computation
    print("\n" + "=" * 60)
    print("CHAIN RULE COMPUTATION for y = sin(x²) at x = 1.5")
    print("Cooking: How does a small sugar change affect final flavor?")
    print("=" * 60)
    
    # At x = 1.5
    x_val = 1.5
    u_val = x_val**2
    y_val = np.sin(u_val)
    
    # Local gradients
    dg_dx = 2 * x_val          # d(x²)/dx = 2x
    df_du = np.cos(u_val)       # d(sin(u))/du = cos(u)
    
    # Chain rule
    dy_dx = df_du * dg_dx
    
    print(f"\n1. Forward pass (sugar -> caramelization -> flavor):")
    print(f"   x (sugar amount) = {x_val}")
    print(f"   u (caramelization) = x² = {u_val}")
    print(f"   y (flavor) = sin(u) = {y_val:.4f}")
    
    print(f"\n2. Backward pass (computing sensitivities):")
    print(f"   dy/du = cos(u) = cos({u_val}) = {df_du:.4f}")
    print(f"   du/dx = 2x = 2({x_val}) = {dg_dx}")
    
    print(f"\n3. Chain rule (total sensitivity):")
    print(f"   dy/dx = (dy/du) x (du/dx)")
    print(f"         = {df_du:.4f} x {dg_dx}")
    print(f"         = {dy_dx:.4f}")
    
    # Verify numerically
    h = 1e-5
    numerical_grad = (np.sin((x_val + h)**2) - np.sin((x_val - h)**2)) / (2*h)
    print(f"\n4. Numerical verification: {numerical_grad:.4f}")

visualize_chain_rule_propagation()

### Computational Graphs: The Key to Backpropagation

A **computational graph** is a visual representation of how a function computes its output. Each node is an operation, and edges show data flow.

**Why does this matter?** Neural networks are just big computational graphs, and backpropagation is just the chain rule applied systematically through the graph!

**Cooking analogy:** A recipe itself is a computational graph. Inputs like flour, sugar, butter, and eggs flow through cooking operations (nodes) — mixing, heating, rising — to produce outputs like the final dish quality. When you want to know "how does the amount of butter affect the final texture?", you trace the graph backward — just like backpropagation.

#### Example: Computing gradients through a simple graph

Consider: $L = (wx + b - y)^2$ (squared error loss)

```
     w                              
     |                              
     v                              
x -->[ * ]--> z1 -->[ + ]--> z2 -->[ - ]--> z3 -->[ ² ]--> L
                      ^              ^
                      |              |
                      b              y (target)
```

- **Forward pass (left to right):** Compute values at each node
- **Backward pass (right to left):** Compute gradients using chain rule
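
The two passes can also be automated. What follows is a minimal sketch of reverse-mode differentiation for scalars, not a description of how any particular library is implemented. The `Value` class, its method names, and the input values (including a target of 5.0, chosen so the error is nonzero) are all hypothetical.

```python
class Value:
    """Hypothetical scalar graph node: holds a value, its gradient, and its parents."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad      # d(a+b)/da = 1
            other.grad += out.grad     # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def _backward():
            self.grad += out.grad      # d(a-b)/da = 1
            other.grad -= out.grad     # d(a-b)/db = -1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def square(self):
        out = Value(self.data ** 2, (self,))
        def _backward():
            self.grad += 2 * self.data * out.grad  # d(a²)/da = 2a
        out._backward = _backward
        return out

    def backward(self):
        # Order nodes so that every node comes after its parents,
        # then walk that order in reverse (output first).
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0  # dL/dL = 1
        for node in reversed(order):
            node._backward()


# Forward pass: L = (w*x + b - y)²
w, x, b, y = Value(3.0), Value(2.0), Value(1.0), Value(5.0)
L = ((w * x + b) - y).square()

# Backward pass: gradients accumulate on every input
L.backward()
print(f"L = {L.data}")
print(f"dL/dw = {w.grad}, dL/db = {b.grad}, dL/dx = {x.grad}, dL/dy = {y.grad}")
```

Each operation only knows its own local gradient. The `backward` method supplies the chain rule by visiting nodes from the output back to the inputs.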

In [None]:
# Detailed walkthrough of forward and backward pass through computational graph
# L = (w*x + b - y)²
# Cooking context: L = predicted flavor error, w = ingredient sensitivity, x = amount, b = baseline, y = target flavor

def computational_graph_example():
    """
    Step-by-step forward and backward pass through a computational graph.
    This is EXACTLY how neural network libraries compute gradients!
    
    Cooking parallel: Like tracing how ingredient amounts flow through a 
    recipe model to produce predicted flavor, then tracing back
    to find which ingredient matters most.
    """
    print("=" * 70)
    print("COMPUTATIONAL GRAPH: L = (w*x + b - y)²")
    print("Cooking: Predicted flavor error from a simple recipe model")
    print("=" * 70)
    
    # Input values
    x = 2.0   # input (e.g., sugar amount in grams)
    y = 5.0   # target (e.g., target sweetness score); differs from w*x + b = 7 so the gradients are nonzero
    w = 3.0   # weight (e.g., sweetness-per-gram coefficient)
    b = 1.0   # bias (e.g., baseline sweetness from other ingredients)
    
    print(f"\nInputs: x={x} (sugar grams), y={y} (target sweetness), w={w} (sensitivity), b={b} (baseline)")
    print("\n" + "-" * 70)
    print("FORWARD PASS (compute values left to right)")
    print("-" * 70)
    
    # Forward pass - compute each node
    z1 = w * x          # multiplication
    print(f"z1 = w * x = {w} * {x} = {z1}")
    
    z2 = z1 + b         # addition
    print(f"z2 = z1 + b = {z1} + {b} = {z2}")
    
    z3 = z2 - y         # subtraction (error)
    print(f"z3 = z2 - y = {z2} - {y} = {z3}")
    
    L = z3 ** 2         # square (loss)
    print(f"L = z3² = {z3}² = {L}")
    
    print("\n" + "-" * 70)
    print("BACKWARD PASS (compute gradients right to left)")
    print("-" * 70)
    print("Starting from dL/dL = 1 (gradient of L with respect to itself)\n")
    
    # Backward pass - apply chain rule at each node
    
    # Start with gradient of loss w.r.t. itself
    dL_dL = 1
    print(f"dL/dL = {dL_dL}")
    
    # Node: L = z3² 
    # dL/dz3 = d(z3²)/dz3 = 2*z3
    dL_dz3 = dL_dL * (2 * z3)
    print(f"\nNode L = z3²:")
    print(f"  Local gradient: d(z3²)/dz3 = 2*z3 = 2*{z3} = {2*z3}")
    print(f"  dL/dz3 = dL/dL x d(z3²)/dz3 = {dL_dL} x {2*z3} = {dL_dz3}")
    
    # Node: z3 = z2 - y
    # dz3/dz2 = 1, dz3/dy = -1
    dL_dz2 = dL_dz3 * 1
    dL_dy = dL_dz3 * (-1)
    print(f"\nNode z3 = z2 - y:")
    print(f"  Local gradients: dz3/dz2 = 1, dz3/dy = -1")
    print(f"  dL/dz2 = dL/dz3 x 1 = {dL_dz3} x 1 = {dL_dz2}")
    print(f"  dL/dy = dL/dz3 x (-1) = {dL_dz3} x (-1) = {dL_dy}")
    
    # Node: z2 = z1 + b
    # dz2/dz1 = 1, dz2/db = 1
    dL_dz1 = dL_dz2 * 1
    dL_db = dL_dz2 * 1
    print(f"\nNode z2 = z1 + b:")
    print(f"  Local gradients: dz2/dz1 = 1, dz2/db = 1")
    print(f"  dL/dz1 = dL/dz2 x 1 = {dL_dz2} x 1 = {dL_dz1}")
    print(f"  dL/db = dL/dz2 x 1 = {dL_dz2} x 1 = {dL_db}")
    
    # Node: z1 = w * x
    # dz1/dw = x, dz1/dx = w
    dL_dw = dL_dz1 * x
    dL_dx = dL_dz1 * w
    print(f"\nNode z1 = w * x:")
    print(f"  Local gradients: dz1/dw = x = {x}, dz1/dx = w = {w}")
    print(f"  dL/dw = dL/dz1 x x = {dL_dz1} x {x} = {dL_dw}")
    print(f"  dL/dx = dL/dz1 x w = {dL_dz1} x {w} = {dL_dx}")
    
    print("\n" + "-" * 70)
    print("SUMMARY OF GRADIENTS")
    print("-" * 70)
    print(f"dL/dw = {dL_dw}  <- How much changing sensitivity coefficient affects error")
    print(f"dL/db = {dL_db}  <- How much changing baseline sweetness affects error")
    print(f"dL/dx = {dL_dx}  <- How much changing sugar amount affects error")
    
    print("\n" + "-" * 70)
    print("VERIFICATION with numerical gradients")
    print("-" * 70)
    h = 1e-5
    
    loss = lambda w, b, x, y: (w*x + b - y)**2
    
    dL_dw_num = (loss(w+h, b, x, y) - loss(w-h, b, x, y)) / (2*h)
    dL_db_num = (loss(w, b+h, x, y) - loss(w, b-h, x, y)) / (2*h)
    
    print(f"dL/dw: analytical = {dL_dw}, numerical = {dL_dw_num:.4f}")
    print(f"dL/db: analytical = {dL_db}, numerical = {dL_db_num:.4f}")
    
    return dL_dw, dL_db

dL_dw, dL_db = computational_graph_example()

### Deep Dive: Understanding Gradient Descent

Gradient descent is the **optimization engine** of deep learning. Let's build deep intuition.

#### The Core Idea in Plain English

You're lost on a foggy mountainside and want to reach the lowest valley. What do you do?
1. **Feel the slope** under your feet (compute gradient)
2. **Take a step downhill** (move opposite to gradient)
3. **Repeat** until you reach flat ground (gradient is zero)

That's gradient descent!

**Cooking analogy:** Think of this as **iterative recipe perfection**. You bake your first attempt at a chocolate cake and taste it. Each baking attempt is a gradient descent step:
1. **Taste the cake** and assess the result (compute loss)
2. **Figure out which ingredients** to adjust to improve the most (compute gradient)
3. **Make the changes** in the direction of improvement (update parameters)
4. **Bake again** until you've found the perfect recipe (converged)

You can't test every possible ingredient combination — there are too many variables. Instead, you iteratively move "downhill" in recipe space toward the best flavor.

#### The Update Rule Decoded

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla L(\theta)$$

| Component | Meaning | Mountain Analogy | Cooking Analogy |
|-----------|---------|------------------|-----------------|
| $\theta$ | Parameters (weights) | Your position on the mountain | Current recipe |
| $\nabla L(\theta)$ | Gradient of loss | Which way is uphill | Which recipe direction makes the dish worse |
| $-\nabla L(\theta)$ | Negative gradient | Which way is downhill | Which recipe direction makes the dish better |
| $\alpha$ | Learning rate | Size of your steps | How aggressively you adjust the recipe |
| $-\alpha \nabla L(\theta)$ | The actual step | How far, and in which direction, you move | The actual recipe adjustment made |
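
As a minimal sketch of a single update, assume a bowl-shaped error `sugar² + temp²` and a learning rate of 0.1. Both are illustrative choices, and the names `loss_fn`, `grad_fn`, `theta_old`, and `alpha` are hypothetical.

```python
import numpy as np

def loss_fn(theta):
    """Illustrative error surface: sum of squares, minimized at the origin."""
    return float(np.sum(theta ** 2))

def grad_fn(theta):
    """Gradient of the sum of squares: points uphill, toward larger error."""
    return 2 * theta

theta_old = np.array([3.0, 4.0])   # current recipe (sugar, temp)
alpha = 0.1                        # learning rate

step = -alpha * grad_fn(theta_old)  # the actual step: downhill, scaled by alpha
theta_new = theta_old + step        # same as theta_old - alpha * gradient

print(f"gradient  = {grad_fn(theta_old)}")
print(f"step      = {step}")
print(f"theta_new = {theta_new}")
print(f"error: {loss_fn(theta_old)} -> {loss_fn(theta_new):.2f}")
```

One step moves the recipe from (3, 4) to (2.4, 3.2) and lowers the error from 25 to 16.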

#### The Learning Rate $\alpha$ is Critical

| Learning rate | What happens | Problem | Cooking Parallel |
|---------------|--------------|---------|------------------|
| **Too small** | Tiny steps, very slow progress | Takes forever to converge | Timid cook: adding salt one grain at a time, takes forever to season |
| **Too large** | Big steps, overshoots minimum | Oscillates or diverges | Heavy-handed cook: dumping in tablespoons of spice, dish oscillates between bland and overwhelming |
| **Just right** | Steady progress, converges | Sweet spot (hard to find!) | Experienced chef: right-sized adjustments that converge to perfect seasoning |

This sensitivity is why learning rate schedules and adaptive optimizers such as Adam are widely used in practice.
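
The three regimes in the table can be reproduced on the simplest possible error curve. This sketch assumes `f(x) = x²`, whose derivative is `2x`, so each update multiplies `x` by `(1 - 2*lr)`. The function name `run_gd` and the three learning rates are illustrative choices.

```python
def run_gd(lr, x0=5.0, n_steps=20):
    """Gradient descent on f(x) = x**2 starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * 2 * x   # equivalent to x *= (1 - 2*lr)
    return x

final = {lr: run_gd(lr) for lr in (0.01, 0.4, 1.1)}
for lr, x in final.items():
    print(f"lr={lr}: x after 20 steps = {x:.3e}")
```

For this particular function the iterates shrink only when `|1 - 2*lr| < 1`, that is, for `0 < lr < 1`. With `lr=0.01` the factor is 0.98 and progress is slow, with `lr=0.4` it is 0.2 and convergence is fast, and with `lr=1.1` it is -1.2, so `x` flips sign and grows on every step.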

In [None]:
def compute_gradient(f, point, h=1e-5):
    """Compute gradient of f at point using numerical differentiation.
    
    Cooking context: Compute the full sensitivity vector — how does flavor
    respond to changes in each recipe ingredient?
    """
    point = np.array(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(len(point)):
        grad[i] = partial_derivative(f, point, i, h)
    return grad

# Example: flavor_error(sugar, temp) = sugar² + temp²
# The perfect recipe is at (0, 0)
def flavor_error(p):
    return p[0]**2 + p[1]**2

recipe = np.array([3.0, 4.0])
grad = compute_gradient(flavor_error, recipe)

print(f"Current recipe (sugar={recipe[0]}, temp={recipe[1]}):")
print(f"  Flavor error = {flavor_error(recipe)}")
print(f"  Gradient (sensitivity) = {grad}")
print(f"  Gradient magnitude = {np.linalg.norm(grad):.4f}")
print(f"\nTo improve the recipe, adjust by: {-grad}")
print("(Move opposite to gradient = toward better flavor)")

In [None]:
# Visualize gradient field over the recipe landscape
# Cooking: Each arrow shows which direction makes the dish WORSE at that recipe point
def flavor_error_2d(sugar, temp):
    return sugar**2 + temp**2

# Create grid of recipe combinations
sugar_range = np.linspace(-3, 3, 15)
temp_range = np.linspace(-3, 3, 15)
S, T = np.meshgrid(sugar_range, temp_range)

# Compute gradient at each recipe point
U = 2 * S  # d(error)/d(sugar)
V = 2 * T  # d(error)/d(temp)

# Normalize for better visualization
magnitude = np.sqrt(U**2 + V**2)
U_norm = U / (magnitude + 1e-10)
V_norm = V / (magnitude + 1e-10)

plt.figure(figsize=(10, 8))

# Contour plot (iso-error lines)
sugar_fine = np.linspace(-3, 3, 100)
temp_fine = np.linspace(-3, 3, 100)
S_fine, T_fine = np.meshgrid(sugar_fine, temp_fine)
Z_fine = flavor_error_2d(S_fine, T_fine)
plt.contour(S_fine, T_fine, Z_fine, levels=15, cmap=cm.viridis, alpha=0.5)

# Gradient vectors (pointing toward worse flavor)
plt.quiver(S, T, U_norm, V_norm, magnitude, cmap=cm.Reds, alpha=0.8)

plt.xlabel('Sugar Amount')
plt.ylabel('Oven Temperature')
plt.title('Gradient Field: Recipe Sensitivity Landscape\nArrows point toward WORSE flavor — go OPPOSITE for better!')
plt.colorbar(label='Gradient magnitude (sensitivity)')
plt.axis('equal')
plt.show()

print("Notice: Gradients point away from the perfect recipe (origin)")
print("To find the best recipe, follow the NEGATIVE gradient (opposite direction)")

---

## 4. The Chain Rule

The **chain rule** is the foundation of backpropagation. It tells us how to differentiate composite functions.

### Single Variable

If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

### Intuition

If $x$ changes by a small amount $\Delta x$:
- $g$ changes by approximately $\frac{dg}{dx} \cdot \Delta x$
- This causes $f$ to change by approximately $\frac{df}{dg} \cdot \left(\frac{dg}{dx} \cdot \Delta x\right)$

The changes **multiply** through the chain!

**Cooking analogy:** The chain rule is how chefs trace the effect of a low-level change (like oven temperature) through a sequence of dependencies: oven temperature affects crust formation, which affects moisture retention, which affects interior texture, which affects overall quality. Each link has its own sensitivity, and the total effect is the product of all of them.
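
To make the product of sensitivities concrete, here is a small sketch with three links. The functions `crust`, `moisture`, and `quality` are made-up stand-ins with no culinary meaning. The point is only that the total derivative equals the product of the three local derivatives.

```python
import math

# Hypothetical three-link chain: temp -> crust -> moisture -> quality
def crust(temp):
    return 3 * temp + 1          # local derivative: 3

def moisture(c):
    return c ** 2                # local derivative: 2*c

def quality(m):
    return math.sin(m)           # local derivative: cos(m)

def full_chain(temp):
    return quality(moisture(crust(temp)))

temp = 0.2
c = crust(temp)
m = moisture(c)

# Chain rule: multiply the local sensitivities of every link
d_crust = 3.0
d_moisture = 2 * c
d_quality = math.cos(m)
analytic = d_quality * d_moisture * d_crust

# Central-difference check on the composed function
h = 1e-5
numeric = (full_chain(temp + h) - full_chain(temp - h)) / (2 * h)

print(f"local sensitivities: {d_crust}, {d_moisture:.4f}, {d_quality:.4f}")
print(f"product (chain rule): {analytic:.6f}")
print(f"numerical estimate:   {numeric:.6f}")
```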

In [None]:
# Example: y = (3x + 2)²
# Cooking: Like "flavor_score = (3*sugar_amount + baseline)²"
# Let g(x) = 3x + 2, f(g) = g²
# dy/dx = df/dg * dg/dx = 2g * 3 = 6(3x + 2)

def y(x):
    return (3*x + 2)**2

def dy_dx_analytical(x):
    return 6 * (3*x + 2)

sugar_amount = 1.0
print(f"y = (3x + 2)² at x = {sugar_amount}")
print(f"  (Cooking: Flavor score model with sugar_amount = {sugar_amount})")
print(f"y({sugar_amount}) = {y(sugar_amount)}")
print(f"dy/dx numerical:  {numerical_derivative(y, sugar_amount):.6f}")
print(f"dy/dx analytical: {dy_dx_analytical(sugar_amount):.6f}")
print(f"\nCooking interpretation: At sugar_amount={sugar_amount}, each gram of sugar changes")
print(f"flavor score by {dy_dx_analytical(sugar_amount):.1f} units (chain rule in action)")

In [None]:
# Comprehensive visualization of gradient descent behavior
# Cooking: Like watching a baker iterate on a recipe across multiple attempts
# Shows the path, learning rate effects, and convergence

def visualize_gradient_descent_comprehensive():
    """
    Create a comprehensive visualization showing:
    1. 3D view of the flavor error surface with recipe optimization path
    2. Top-down view (contour) with path
    3. Flavor error over iterations
    4. Effect of different learning rates (recipe adjustment aggressiveness)
    """
    
    # Define a flavor error landscape: f(sugar,temp) = sugar² + 10*temp² (elongated bowl)
    # Temperature is more sensitive than sugar — common in real baking
    def f(p):
        return p[0]**2 + 10*p[1]**2
    
    def grad_f(p):
        return np.array([2*p[0], 20*p[1]])
    
    def gradient_descent(start, lr, n_steps):
        point = np.array(start, dtype=float)
        history = [point.copy()]
        for _ in range(n_steps):
            point = point - lr * grad_f(point)
            history.append(point.copy())
        return np.array(history)
    
    # Run recipe optimization with different "aggressiveness" levels
    start = [3.0, 1.0]  # Initial recipe: sugar=3, temp=1
    n_steps = 30
    
    lr_conservative = 0.01   # Very cautious baker
    lr_good = 0.05           # Experienced baker
    lr_aggressive = 0.09     # Bold baker
    lr_reckless = 0.11       # Too aggressive: 20*lr > 2, so the temp coordinate diverges
    
    hist_conservative = gradient_descent(start, lr_conservative, n_steps)
    hist_good = gradient_descent(start, lr_good, n_steps)
    hist_aggressive = gradient_descent(start, lr_aggressive, n_steps)
    hist_reckless = gradient_descent(start, lr_reckless, n_steps)
    
    # Create figure
    fig = plt.figure(figsize=(16, 10))
    
    # Create grid for surface plots
    x = np.linspace(-4, 4, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)
    Z = X**2 + 10*Y**2
    
    # Plot 1: 3D surface with recipe optimization path
    ax1 = fig.add_subplot(221, projection='3d')
    ax1.plot_surface(X, Y, Z, cmap=cm.viridis, alpha=0.6)
    
    # Add optimization path on surface
    path_z = [f(p) for p in hist_good]
    ax1.plot(hist_good[:, 0], hist_good[:, 1], path_z, 'r.-', 
             markersize=8, linewidth=2, label='Recipe path')
    ax1.scatter([start[0]], [start[1]], [f(start)], color='green', s=100, marker='o')
    ax1.scatter([0], [0], [0], color='red', s=100, marker='*')
    
    ax1.set_xlabel('Sugar Amount')
    ax1.set_ylabel('Oven Temp')
    ax1.set_zlabel('Flavor Error')
    ax1.set_title('3D View: Recipe Optimization Path\n(Learning rate = 0.05)')
    
    # Plot 2: Contour with all paths
    ax2 = fig.add_subplot(222)
    contour = ax2.contour(X, Y, Z, levels=20, cmap=cm.viridis)
    ax2.clabel(contour, inline=True, fontsize=8)
    
    ax2.plot(hist_conservative[:, 0], hist_conservative[:, 1], 'b.-', markersize=5, 
             linewidth=1.5, label=f'lr={lr_conservative} (conservative)')
    ax2.plot(hist_good[:, 0], hist_good[:, 1], 'g.-', markersize=5, 
             linewidth=1.5, label=f'lr={lr_good} (experienced)')
    ax2.plot(hist_aggressive[:, 0], hist_aggressive[:, 1], 'orange', marker='.', markersize=5, 
             linewidth=1.5, label=f'lr={lr_aggressive} (aggressive)')
    ax2.plot(hist_reckless[:, 0], hist_reckless[:, 1], 'r.-', markersize=5, 
             linewidth=1.5, label=f'lr={lr_reckless} (reckless)')
    
    ax2.scatter([start[0]], [start[1]], color='green', s=150, marker='o', zorder=5, label='First Attempt')
    ax2.scatter([0], [0], color='red', s=150, marker='*', zorder=5, label='Perfect Recipe')
    
    ax2.set_xlabel('Sugar Amount')
    ax2.set_ylabel('Oven Temp')
    ax2.set_title('Top View: Different Baker Strategies')
    ax2.legend(loc='upper right', fontsize=9)
    ax2.set_aspect('equal')
    
    # Plot 3: Flavor error curves
    ax3 = fig.add_subplot(223)
    
    losses_conservative = [f(p) for p in hist_conservative]
    losses_good = [f(p) for p in hist_good]
    losses_aggressive = [f(p) for p in hist_aggressive]
    losses_reckless = [f(p) for p in hist_reckless]
    
    ax3.plot(losses_conservative, 'b-', linewidth=2, label=f'lr={lr_conservative}')
    ax3.plot(losses_good, 'g-', linewidth=2, label=f'lr={lr_good}')
    ax3.plot(losses_aggressive, color='orange', linewidth=2, label=f'lr={lr_aggressive}')
    ax3.plot(losses_reckless, 'r-', linewidth=2, label=f'lr={lr_reckless}')
    
    ax3.set_xlabel('Baking Attempt (Iteration)')
    ax3.set_ylabel('Flavor Error')
    ax3.set_title('Flavor Error Improvement Over Attempts')
    ax3.legend()
    ax3.set_yscale('log')
    ax3.grid(True, alpha=0.3)
    
    # Plot 4: Zoomed in first few steps
    ax4 = fig.add_subplot(224)
    
    # Show step-by-step for good learning rate
    for i in range(min(8, len(hist_good)-1)):
        p1 = hist_good[i]
        p2 = hist_good[i+1]
        grad = grad_f(p1)
        
        # Point
        ax4.scatter([p1[0]], [p1[1]], color='blue', s=60, zorder=5)
        ax4.annotate(f'{i}', xy=(p1[0], p1[1]), xytext=(p1[0]+0.1, p1[1]+0.1), fontsize=9)
        
        # Negative gradient direction (scaled for visualization)
        ax4.arrow(p1[0], p1[1], -grad[0]*0.02, -grad[1]*0.02, 
                 head_width=0.05, head_length=0.02, fc='red', ec='red', alpha=0.5)
        
        # Actual step
        ax4.arrow(p1[0], p1[1], (p2[0]-p1[0])*0.95, (p2[1]-p1[1])*0.95,
                 head_width=0.05, head_length=0.02, fc='green', ec='green')
    
    contour2 = ax4.contour(X, Y, Z, levels=20, cmap=cm.viridis, alpha=0.5)
    ax4.set_xlabel('Sugar Amount')
    ax4.set_ylabel('Oven Temp')
    ax4.set_title('Step-by-Step Recipe Changes (lr=0.05)\nGreen: actual changes, Red: negative gradient direction')
    ax4.set_xlim([-1, 4])
    ax4.set_ylim([-0.5, 1.5])
    
    plt.tight_layout()
    plt.show()
    
    print("Key observations (recipe optimization):")
    print(f"- lr={lr_conservative} (conservative): Too cautious — wastes baking attempts, slow to improve")
    print(f"- lr={lr_good} (experienced): Efficient — reaches perfect recipe quickly")
    print(f"- lr={lr_aggressive} (aggressive): Oscillates but finds the neighborhood")
    print(f"- lr={lr_reckless} (reckless): Diverges. 20*lr > 2, so each temp update overshoots by more than the last and the error keeps growing")

visualize_gradient_descent_comprehensive()

### The Local Minima Problem

Real loss surfaces are rarely simple bowls. They often have:
- **Local minima**: Points that look like minima locally but aren't the global best
- **Saddle points**: Points where gradient is zero but it's neither min nor max
- **Plateaus**: Flat regions where gradient is tiny

**Cooking analogy:** This is the classic "good enough" recipe trap. You've been tweaking your chocolate chip cookies for weeks and they taste great — the recipe is balanced, friends love them, and small tweaks make them worse. But there might be a completely different approach (say, browning the butter first, or using bread flour instead of all-purpose) that's actually better overall. You've found a **local minimum** — a recipe that's locally optimal but not globally. Getting out requires a bold, creative change (like momentum in gradient descent) or starting from a completely different base recipe.
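
The "start from a different base recipe" idea can be sketched as random restarts without the randomness: run plain gradient descent from several fixed starting points and keep the best result. The one-parameter landscape `sin(3x) + 0.1x²`, the starting points, and the helper name `descend` are illustrative assumptions.

```python
import math

def f(x):
    """Illustrative flavor-error landscape with several valleys."""
    return math.sin(3 * x) + 0.1 * x ** 2

def df(x):
    """Derivative of f."""
    return 3 * math.cos(3 * x) + 0.2 * x

def descend(x, lr=0.05, n_steps=200):
    """Plain gradient descent from a single starting point."""
    for _ in range(n_steps):
        x -= lr * df(x)
    return x

starts = [-3.5, -1.0, 1.5, 3.0]
finals = [descend(s) for s in starts]

for s, x_end in zip(starts, finals):
    print(f"start={s:5.1f} -> x={x_end:6.3f}, error={f(x_end):6.3f}")

best_x = min(finals, key=f)
print(f"best of all restarts: x={best_x:.3f}, error={f(best_x):.3f}")
```

Each start settles in a different valley, and only one of them is the deepest. Restarts do not guarantee finding the global minimum. They only raise the odds by sampling more basins.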

Neural networks have extremely complex loss landscapes. Fortunately:
1. In high dimensions, true local minima are rare (saddle points are more common)
2. Many local minima have similar loss values
3. Modern optimizers (Adam, etc.) can escape shallow local minima
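
The saddle-point behavior mentioned in the list above can be seen on the textbook saddle `x² - y²`. This is a sketch under simple assumptions: plain gradient descent, a fixed learning rate of 0.1, and a hypothetical helper named `descend`. Because this function is unbounded below, "escaping" here just means the error keeps falling along the `y` direction.

```python
import numpy as np

def grad_saddle(p):
    """Gradient of f(x, y) = x**2 - y**2."""
    return np.array([2 * p[0], -2 * p[1]])

def descend(start, lr=0.1, n_steps=100):
    p = np.array(start, dtype=float)
    for _ in range(n_steps):
        p = p - lr * grad_saddle(p)
    return p

# Start exactly on the line y = 0: the y-gradient is zero forever,
# so descent slides into the saddle point and stays there.
stuck = descend([1.0, 0.0])

# A tiny nudge off that line grows by a factor of 1.2 each step
# and eventually carries the iterate away from the saddle.
nudged = descend([1.0, 1e-6])

print(f"start on y=0:       final point = {stuck}")
print(f"start with y=1e-6:  final point = {nudged}")
```

This is one reason noise in the updates, or momentum, tends to help near saddle points: almost any perturbation off the exact approach line is amplified.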

In [None]:
# Visualize local minima, saddle points, and the challenges they pose
# Cooking: Multiple "good" recipes exist — but which is THE best?

def visualize_local_minima_problem():
    """
    Visualize a flavor landscape with multiple local minima
    and show how gradient descent can get stuck in a "good enough" recipe.
    """
    
    # Create a function with multiple local minima
    # f(x) = sin(3x) + 0.1*x² (creates multiple valleys, i.e. multiple "good" recipes)
    def f_1d(x):
        return np.sin(3*x) + 0.1*x**2
    
    def df_1d(x):
        return 3*np.cos(3*x) + 0.2*x
    
    # 1D visualization
    fig, axes = plt.subplots(1, 3, figsize=(16, 4))
    
    x = np.linspace(-4, 4, 200)
    y = f_1d(x)
    
    axes[0].plot(x, y, 'b-', linewidth=2)
    axes[0].set_xlabel('Recipe Parameter')
    axes[0].set_ylabel('Flavor Error')
    axes[0].set_title('Flavor Landscape with Multiple "Good" Recipes')
    axes[0].grid(True, alpha=0.3)
    
    # Mark local minima (where derivative crosses zero from - to +)
    for xi in np.linspace(-4, 4, 1000):
        if abs(df_1d(xi)) < 0.05 and f_1d(xi-0.01) > f_1d(xi) < f_1d(xi+0.01):
            axes[0].scatter([xi], [f_1d(xi)], color='red', s=100, marker='v', zorder=5)
    
    # Annotate the global minimum (x ≈ -0.51) and a higher local minimum (x ≈ 1.54)
    axes[0].annotate('Best\nrecipe', xy=(-0.51, f_1d(-0.51)), xytext=(-1.5, 2.0),
                    arrowprops=dict(arrowstyle='->', color='green'), fontsize=10, color='green')
    axes[0].annotate('"Good enough"\nrecipe', xy=(1.54, f_1d(1.54)), xytext=(1, 1.5),
                    arrowprops=dict(arrowstyle='->', color='red'), fontsize=10, color='red')
    
    # Run GD from different starting recipes
    def gd_1d(x0, lr=0.1, n_steps=50):
        x = x0
        history = [x]
        for _ in range(n_steps):
            x = x - lr * df_1d(x)
            history.append(x)
        return np.array(history)
    
    starts = [-3.5, -1.0, 1.5, 3.0]
    colors = ['green', 'red', 'orange', 'purple']
    
    axes[1].plot(x, y, 'b-', linewidth=2, alpha=0.5)
    for start, color in zip(starts, colors):
        hist = gd_1d(start, lr=0.05, n_steps=100)
        y_hist = f_1d(hist)
        axes[1].plot(hist, y_hist, '.-', color=color, markersize=4, 
                    linewidth=1, label=f'Start={start}')
        axes[1].scatter([start], [f_1d(start)], color=color, s=100, marker='o', zorder=5)
    
    axes[1].set_xlabel('Recipe Parameter')
    axes[1].set_ylabel('Flavor Error')
    axes[1].set_title('Optimization from Different Starting Recipes')
    axes[1].legend(fontsize=9)
    axes[1].grid(True, alpha=0.3)
    
    # Show final flavor errors
    final_errors = []
    for start in starts:
        hist = gd_1d(start, lr=0.05, n_steps=100)
        final_errors.append(f_1d(hist[-1]))
    
    axes[2].bar(range(len(starts)), final_errors, color=colors)
    axes[2].set_xticks(range(len(starts)))
    axes[2].set_xticklabels([f'Start={s}' for s in starts])
    axes[2].set_xlabel('Starting Recipe')
    axes[2].set_ylabel('Final Flavor Error')
    axes[2].set_title('Final Error Depends on Starting Recipe!\n(Different bases = different local optima)')
    axes[2].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("Key insight: Gradient descent finds LOCAL optima, not necessarily the GLOBAL optimum.")
    print("The final recipe depends on where you started!")
    print("\nCooking strategies to escape local minima:")
    print("1. Try multiple base recipes (multiple random starts)")
    print("2. Use momentum to 'carry through' past shallow local minima")
    print("3. Add randomness (SGD noise = trying random ingredient experiments)")
    print("4. Use learning rate schedules (big changes early, fine-tuning later)")

visualize_local_minima_problem()

In [None]:
# Visualize saddle points in 2D
# Cooking: A recipe where one ingredient is perfect but another is wrong —
# e.g., sugar is spot-on but oven temperature is way off

def visualize_saddle_point():
    """
    Visualize a saddle point and why it's problematic for gradient descent.
    Cooking: Like a recipe that's perfect for sweetness but terrible for texture.
    """
    # Classic saddle: f(x,y) = x² - y²
    def f_saddle(x, y):
        return x**2 - y**2
    
    def grad_saddle(p):
        return np.array([2*p[0], -2*p[1]])
    
    fig = plt.figure(figsize=(16, 5))
    
    # Create grid
    x = np.linspace(-2, 2, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)
    Z = f_saddle(X, Y)
    
    # 3D surface
    ax1 = fig.add_subplot(131, projection='3d')
    ax1.plot_surface(X, Y, Z, cmap=cm.coolwarm, alpha=0.8)
    ax1.scatter([0], [0], [0], color='black', s=200, marker='o')
    ax1.set_xlabel('Sugar Amount')
    ax1.set_ylabel('Oven Temp')
    ax1.set_zlabel('Flavor Error')
    ax1.set_title('Saddle Point: f = sugar² - temp²\nOptimal in one dimension, not the other')
    
    # Contour plot
    ax2 = fig.add_subplot(132)
    contour = ax2.contour(X, Y, Z, levels=20, cmap=cm.coolwarm)
    ax2.clabel(contour, inline=True, fontsize=8)
    ax2.scatter([0], [0], color='black', s=200, marker='o', label='Saddle point')
    
    # Draw gradient arrows around saddle point
    for px, py in [(0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]:
        grad = grad_saddle([px, py])
        ax2.arrow(px, py, -grad[0]*0.15, -grad[1]*0.15, head_width=0.05, 
                 head_length=0.02, fc='green', ec='green')
    
    ax2.set_xlabel('Sugar Amount')
    ax2.set_ylabel('Oven Temp')
    ax2.set_title('Recipe Space Contours\nGreen arrows: negative gradient direction')
    ax2.legend()
    ax2.set_aspect('equal')
    
    # 1D slices through saddle point
    ax3 = fig.add_subplot(133)
    x_slice = np.linspace(-2, 2, 100)
    ax3.plot(x_slice, x_slice**2, 'b-', linewidth=2, label='Vary sugar only: minimum')
    ax3.plot(x_slice, -x_slice**2, 'r-', linewidth=2, label='Vary temp only: maximum')
    ax3.axhline(y=0, color='k', linewidth=0.5)
    ax3.axvline(x=0, color='k', linewidth=0.5)
    ax3.scatter([0], [0], color='black', s=100, zorder=5)
    ax3.set_xlabel('Parameter Value')
    ax3.set_ylabel('Flavor Error')
    ax3.set_title('1D Slices Through Saddle\nSame point is min AND max!')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("At a saddle point:")
    print("- Gradient is zero (looks like an optimum)")
    print("- But it's a minimum in some directions, maximum in others")
    print("- GD can get stuck here if approaching from certain directions")
    print("\nCooking analogy: A recipe where sugar is perfect but oven temperature")
    print("is all wrong. The baker who only tastes for sweetness thinks")
    print("the recipe is optimized — but changing temperature would unlock better texture!")
    print("\nIn high-dimensional neural network loss landscapes:")
    print("- Saddle points are MUCH more common than local minima")
    print("- Momentum helps escape saddle points by building up velocity")

visualize_saddle_point()

### Chain Rule in Neural Networks

Consider a simple network:

$$\text{Input } x \rightarrow z = wx + b \rightarrow a = \sigma(z) \rightarrow L = (a - y)^2$$

To find $\frac{\partial L}{\partial w}$, we apply the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$

**Cooking analogy:** This is like tracing: "How does changing the butter amount (w) affect the batter consistency (z), which affects the cake rise (a), which affects the final quality score (L)?" Each link has its own sensitivity, and we multiply them all together to get the total effect of butter on the final dish.

In [None]:
# Computational graph example
# Forward pass: x -> z = wx + b -> a = sigmoid(z) -> L = (a - y)²
# Cooking: sugar_amount -> sweetness = w*sugar + baseline -> perceived_quality = sigmoid(sweetness) -> error

def forward_and_backward(x, y_true, w, b):
    """Compute forward pass and gradients using chain rule.
    
    Cooking parallel: Trace how an ingredient amount flows through a flavor
    model and back again to find sensitivities.
    """
    
    # Forward pass
    z = w * x + b
    a = sigmoid(z)
    L = (a - y_true)**2
    
    print("=== Forward Pass (Ingredient -> Flavor -> Quality Error) ===")
    print(f"x (sugar amount) = {x}")
    print(f"z = w*x + b = {w}*{x} + {b} = {z}")
    print(f"a = sigmoid(z) = {a:.6f}")
    print(f"L = (a - y_target)² = ({a:.6f} - {y_true})² = {L:.6f}")
    
    # Backward pass (chain rule)
    print("\n=== Backward Pass (Chain Rule: Tracing Sensitivities Back) ===")
    
    # dL/da
    dL_da = 2 * (a - y_true)
    print(f"dL/da = 2(a - y) = {dL_da:.6f}")
    
    # da/dz (sigmoid derivative)
    da_dz = a * (1 - a)
    print(f"da/dz = sigmoid(z)(1 - sigmoid(z)) = {da_dz:.6f}")
    
    # dz/dw
    dz_dw = x
    print(f"dz/dw = x = {dz_dw}")
    
    # dz/db
    dz_db = 1
    print(f"dz/db = 1")
    
    # Chain rule
    dL_dz = dL_da * da_dz
    dL_dw = dL_dz * dz_dw
    dL_db = dL_dz * dz_db
    
    print(f"\ndL/dw = dL/da * da/dz * dz/dw = {dL_dw:.6f}")
    print(f"dL/db = dL/da * da/dz * dz/db = {dL_db:.6f}")
    
    return L, dL_dw, dL_db

# Example
x = 2.0
y_true = 1.0
w = 0.5
b = 0.1

L, dL_dw, dL_db = forward_and_backward(x, y_true, w, b)

In [None]:
# Verify with numerical gradient — the "sanity check" every chef should do
h = 1e-5

def loss(w, b, x=2.0, y=1.0):
    z = w * x + b
    a = sigmoid(z)
    return (a - y)**2

# Numerical gradients
dL_dw_numerical = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
dL_db_numerical = (loss(w, b + h) - loss(w, b - h)) / (2 * h)

print("Verification with numerical gradients (the chef's double-check):")
print(f"dL/dw: analytical = {dL_dw:.6f}, numerical = {dL_dw_numerical:.6f}")
print(f"dL/db: analytical = {dL_db:.6f}, numerical = {dL_db_numerical:.6f}")

### Multivariate Chain Rule

When a variable affects the output through multiple paths:

$$\frac{\partial L}{\partial x} = \sum_{i} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial x}$$

This is why we **sum** gradients when a variable is used multiple times.

**Cooking analogy:** Butter affects the final dish through multiple paths simultaneously — it changes both the richness of flavor AND the texture of the crumb AND the browning of the crust. The total effect of butter on dish quality is the **sum** of its effects through each path. This is the multivariate chain rule in action.
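
As a small sketch of the sum over paths, take a made-up model in which butter `x` feeds two intermediate quantities, `y1 = x²` and `y2 = 3x`, and the result is their product. The variable names and formulas are illustrative only.

```python
x = 2.0

# Forward pass: x feeds two intermediate quantities
y1 = x ** 2        # path 1 (say, richness)
y2 = 3 * x         # path 2 (say, texture)
L = y1 * y2        # combined result, equal to 3x³

# Local gradients
dL_dy1 = y2        # d(y1*y2)/dy1
dL_dy2 = y1        # d(y1*y2)/dy2
dy1_dx = 2 * x
dy2_dx = 3.0

# Multivariate chain rule: add the contribution of each path
path1 = dL_dy1 * dy1_dx
path2 = dL_dy2 * dy2_dx
dL_dx = path1 + path2

# Direct differentiation of 3x³ for comparison
dL_dx_direct = 9 * x ** 2

print(f"path 1 contributes {path1}, path 2 contributes {path2}")
print(f"sum over paths = {dL_dx}, direct derivative = {dL_dx_direct}")
```

Using only one path would give 24 or 12 instead of the correct 36, which is why gradients are accumulated whenever a variable is used more than once.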

In [None]:
# Example: f(x) = x² + x, where x is used in more than one place
# Cooking: Butter (x) affects richness (the x*x path) AND texture (the direct x path)

# Computational graph:
# x --> a = x  --\
#                 +--> c = a * b --> f = c + d
# x --> b = x  --/                      |
#                                       |
# x --> d = x  -------------------------/

# This is: f = x*x + x = x² + x
# df/dx = 2x + 1 (by calculus)

# But through the graph:
# df/dx = df/dc * dc/da * da/dx + df/dc * dc/db * db/dx + df/dd * dd/dx
#       = 1 * b * 1 + 1 * a * 1 + 1 * 1
#       = x + x + 1 = 2x + 1

butter_amount = 3.0
print(f"f(x) = x² + x at x = {butter_amount}")
print(f"Cooking: Total dish quality effect when butter appears in multiple paths")
print(f"f({butter_amount}) = {butter_amount**2 + butter_amount}")
print(f"df/dx (analytical) = 2x + 1 = {2*butter_amount + 1}")

# Through computational graph — summing all paths
a = butter_amount
b = butter_amount  
c = a * b  # = x² (richness effect)
d = butter_amount  # (texture effect)
f = c + d  # = x² + x (total)

# Backward — sum gradients from all paths
df_dc = 1
df_dd = 1
dc_da = b  # = x
dc_db = a  # = x
da_dx = 1
db_dx = 1
dd_dx = 1

# Sum all paths from f to x
df_dx = df_dc * dc_da * da_dx + df_dc * dc_db * db_dx + df_dd * dd_dx
print(f"df/dx (computational graph, summing paths) = {df_dx}")
print(f"\nKey: We SUM the contributions from each path butter takes through the recipe")

---

## 5. Gradient Descent

**Gradient descent** is the optimization algorithm that powers deep learning:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla L(\theta)$$

Where:
- $\theta$: Parameters (weights)
- $\alpha$: Learning rate (step size)
- $\nabla L$: Gradient of loss with respect to parameters

**Cooking analogy:** This is the mathematical version of what every dedicated home cook does when perfecting a recipe. $\theta$ is the recipe (ingredient amounts, temperatures, times). $L(\theta)$ is how far the dish is from perfection. $\nabla L$ tells you which ingredients to change and by how much. $\alpha$ controls how aggressive those changes are. Each baking attempt is an iteration of gradient descent, moving the recipe toward the most delicious possible result.

In [None]:
def gradient_descent_1d(f, df, x0, learning_rate=0.1, n_steps=50):
    """Gradient descent for 1D function.
    
    Cooking: Iteratively adjust a single recipe parameter to minimize flavor error.
    """
    x = x0
    history = [(x, f(x))]
    
    for i in range(n_steps):
        grad = df(x)
        x = x - learning_rate * grad
        history.append((x, f(x)))
        
    return x, history

# Minimize f(x) = (x - 3)² — optimal baking temperature is at x=3
# Cooking: "Find the optimal oven temperature"
f = lambda x: (x - 3)**2
df = lambda x: 2 * (x - 3)

oven_temp_final, history = gradient_descent_1d(f, df, x0=10.0, learning_rate=0.1, n_steps=30)

print(f"Oven temperature optimization:")
print(f"  Starting value: 10.0")
print(f"  Optimal found:  {oven_temp_final:.6f}")
print(f"  Error at optimum: {f(oven_temp_final):.6f}")
print(f"  True optimum: x = 3")

# Visualize
x_range = np.linspace(-2, 12, 100)
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(x_range, f(x_range), 'b-', linewidth=2, label='Error = (oven_temp - 3)²')
xs, ys = zip(*history)
plt.scatter(xs, ys, c=range(len(xs)), cmap='Reds', s=50, zorder=5)
plt.plot(xs, ys, 'r--', alpha=0.5)
plt.xlabel('Oven Temperature (scaled)')
plt.ylabel('Baking Error')
plt.title('Gradient Descent: Finding Optimal Oven Temperature')
plt.legend()
plt.colorbar(label='Iteration (Baking Attempt)')

plt.subplot(1, 2, 2)
plt.plot([h[1] for h in history], 'b-o')
plt.xlabel('Baking Attempt (Iteration)')
plt.ylabel('Baking Error')
plt.title('Error Improvement Over Attempts')

plt.tight_layout()
plt.show()

### 2D Gradient Descent

Now let's optimize two recipe parameters simultaneously — this is where the gradient (not just the derivative) comes in. The gradient tells us the optimal combined direction to adjust **both** parameters at once.

In [None]:
def gradient_descent_2d(f, grad_f, start, learning_rate=0.1, n_steps=50):
    """Gradient descent for 2D function.
    
    Cooking: Simultaneously optimize two recipe parameters (e.g., sugar amount and oven temp).
    """
    point = np.array(start, dtype=float)
    history = [point.copy()]
    
    for i in range(n_steps):
        grad = grad_f(point)
        point = point - learning_rate * grad
        history.append(point.copy())
        
    return point, np.array(history)

# Minimize flavor_error(sugar, temp) = sugar² + temp²
# Perfect recipe at (0, 0)
def f(p):
    return p[0]**2 + p[1]**2

def grad_f(p):
    return np.array([2*p[0], 2*p[1]])

start = [4.0, 3.0]  # First attempt recipe
final, history = gradient_descent_2d(f, grad_f, start, learning_rate=0.1, n_steps=30)

print(f"First attempt recipe: sugar={start[0]}, temp={start[1]}")
print(f"Optimal recipe found: sugar={final[0]:.6f}, temp={final[1]:.6f}")
print(f"Final flavor error: {f(final):.10f}")

# Visualize the recipe optimization path
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

plt.figure(figsize=(10, 8))
contours = plt.contour(X, Y, Z, levels=20, cmap=cm.viridis)
plt.plot(history[:, 0], history[:, 1], 'r.-', markersize=10, linewidth=2)
plt.scatter([start[0]], [start[1]], color='green', s=200, marker='o', label='First Attempt', zorder=5)
plt.scatter([final[0]], [final[1]], color='red', s=200, marker='*', label='Perfect Recipe', zorder=5)
plt.xlabel('Sugar Amount')
plt.ylabel('Oven Temperature')
plt.title('Gradient Descent: Finding the Perfect Recipe\nflavor_error(sugar, temp) = sugar² + temp²')
plt.legend()
# Pass the contour set explicitly; otherwise the colorbar attaches to the last scatter call
plt.colorbar(contours, label='Flavor Error')
plt.axis('equal')
plt.show()

### Effect of Learning Rate

**Cooking analogy:** The learning rate is the cook's **aggressiveness dial**. How big a recipe change do you make between attempts? Too small and you waste dozens of baking sessions making imperceptible changes. Too large and the recipe oscillates between too sweet and too bland, never settling on the optimum.
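
To see where these regimes come from, consider the bowl-shaped error $f(p) = p_1^2 + p_2^2$. Its gradient is $2p$, so every gradient descent step multiplies the recipe by the constant factor $(1 - 2\alpha)$. The sketch below is illustrative only; the helper name `contraction_factor` is hypothetical and not part of the notebook.

```python
import numpy as np

def contraction_factor(learning_rate):
    """Factor applied to p at every step when minimizing f(p) = p1**2 + p2**2.

    The gradient is 2*p, so p_new = p - lr * 2*p = (1 - 2*lr) * p.
    """
    return 1.0 - 2.0 * learning_rate

for lr in [0.01, 0.1, 0.5, 0.95, 1.05]:
    k = contraction_factor(lr)
    if abs(k) >= 1:
        behavior = "diverges or never settles"
    elif k < 0:
        behavior = "converges, but flips sign every step"
    else:
        behavior = "converges without overshooting"
    print(f"lr={lr:<5} factor={k:+.2f}  {behavior}")
```

On this particular quadratic, any $0 < \alpha < 1$ converges, $\alpha = 0.5$ lands on the minimum in a single step, and values close to 1 overshoot back and forth. Real loss surfaces are less regular, so the thresholds differ, but the three regimes (too slow, well tuned, overshooting) carry over.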

### Calculus Concepts and Their ML Applications

| Calculus Concept | What it Means | ML Application | Cooking Parallel |
|------------------|---------------|----------------|------------------|
| **Derivative** | Rate of change of output w.r.t. input | How loss changes when we change one weight | How flavor changes when we adjust sugar amount |
| **Partial Derivative** | Rate of change w.r.t. one variable (others fixed) | Gradient component for one parameter | Effect of changing ONLY oven temperature on crust quality |
| **Gradient** | Vector of all partial derivatives | Direction to update ALL weights at once | Complete sensitivity report for all recipe ingredients |
| **Chain Rule** | Derivative of composed functions = product of derivatives | Backpropagation through network layers | How oven temp flows through crust, moisture, texture to affect quality |
| **Gradient Descent** | Iteratively move opposite to gradient | Core training algorithm for neural networks | Iterative recipe perfection across baking attempts |
| **Learning Rate** | Step size in gradient descent | Hyperparameter controlling training speed | How aggressively the recipe changes between attempts |
| **Local Minimum** | Point where gradient = 0 and function curves up | Where training might get stuck | A "good enough" recipe that isn't actually the best |
| **Saddle Point** | Point where gradient = 0 but not min or max | Common in high dimensions; momentum helps escape | Recipe optimal for sweetness but not for texture |

### The Full Picture: How a Neural Network Learns

1. **Forward pass**: Input flows through network, computing activations layer by layer
2. **Loss computation**: Compare output to target, get a single number (the loss)
3. **Backward pass**: Use chain rule to compute gradient of loss w.r.t. every weight
4. **Parameter update**: Use gradient descent to update all weights
5. **Repeat**: Until loss is small enough

**Cooking parallel:** This is exactly the recipe development workflow:
1. **Bake the dish** (forward pass through the cooking process)
2. **Taste and evaluate** (compute the loss — how far from perfection?)
3. **Analyze what to change** to improve (backward pass / gradient computation)
4. **Adjust the recipe** (parameter update via gradient descent)
5. **Bake again** (repeat until delicious)
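
The five steps map directly onto a short training loop. As a minimal sketch, assume a single sigmoid neuron with squared-error loss and the example values from the chain rule section (x = 2.0, y = 1.0, w = 0.5, b = 0.1). The learning rate of 0.5 and the 200 steps are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 2.0, 1.0        # one training example
w, b = 0.5, 0.1        # initial parameters
learning_rate = 0.5    # assumed value, chosen only for illustration

losses = []
for step in range(200):
    # 1. Forward pass
    z = w * x + b
    a = sigmoid(z)
    # 2. Loss computation
    loss = (a - y) ** 2
    losses.append(loss)
    # 3. Backward pass (chain rule): dL/dz = dL/da * da/dz
    dL_dz = 2 * (a - y) * a * (1 - a)
    dL_dw = dL_dz * x
    dL_db = dL_dz
    # 4. Parameter update
    w -= learning_rate * dL_dw
    b -= learning_rate * dL_db
    # 5. Repeat (next loop iteration)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

Real networks repeat the same loop with many weights and many examples; only the bookkeeping in steps 1 and 3 grows.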

In [None]:
# Compare different learning rates — the cook's aggressiveness dial
learning_rates = [0.01, 0.1, 0.5, 0.95]
colors = ['blue', 'green', 'orange', 'red']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Contour plot with paths
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

axes[0].contour(X, Y, Z, levels=20, cmap=cm.viridis, alpha=0.5)

for lr, color in zip(learning_rates, colors):
    final, history = gradient_descent_2d(f, grad_f, [4.0, 3.0], learning_rate=lr, n_steps=20)
    axes[0].plot(history[:, 0], history[:, 1], '.-', color=color, markersize=8, 
                 linewidth=2, label=f'lr={lr}')
    
    # Flavor error curve
    losses = [f(p) for p in history]
    axes[1].plot(losses, color=color, linewidth=2, label=f'lr={lr}')

axes[0].set_xlabel('Sugar Amount')
axes[0].set_ylabel('Oven Temperature')
axes[0].set_title('Recipe Optimization Paths\n(Different Cook Aggressiveness)')
axes[0].legend()
axes[0].axis('equal')

axes[1].set_xlabel('Baking Attempt (Iteration)')
axes[1].set_ylabel('Flavor Error')
axes[1].set_title('Flavor Error Convergence')
axes[1].legend()
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

print("Cooking observations on learning rate (recipe adjustment aggressiveness):")
print("- Too small (0.01): Conservative. Each attempt trims the error by only about 4%, wasting baking attempts")
print("- Good (0.1): Experienced. Steady progress toward the perfect recipe")
print("- Larger (0.5): Bold. On this simple bowl it lands on the optimum in one step (error 0, so it drops off the log plot);")
print("  on less regular landscapes a step this large can overshoot")
print("- Too large (0.95): Reckless. Every step overshoots to the opposite side, so the recipe swings back and forth and settles slowly")

### A More Challenging Function: Rosenbrock

The Rosenbrock function is a classic optimization test:

$$f(x, y) = (1 - x)^2 + 100(y - x^2)^2$$

Minimum at $(1, 1)$. Famous for its narrow, curved valley.

**Cooking analogy:** This is like a recipe landscape with a narrow "sweet spot" — a long, winding valley of decent results but only one truly optimal point. Think of it like perfecting a soufflé: there's a narrow corridor of ingredient ratios and temperatures that work, and the optimal point requires precise tuning of correlated parameters (egg whites and oven temperature that depend on each other). This is why simple gradient descent struggles and why adaptive optimizers (like Adam in ML, or experienced pastry chefs in the kitchen) are so valuable.
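
One way to quantify the "narrow valley" is to compare curvatures at the minimum. The sketch below writes out the Hessian of the Rosenbrock function by hand and looks at its eigenvalues at $(1, 1)$. The helper `rosenbrock_hessian` is introduced here for illustration and is not part of the notebook.

```python
import numpy as np

def rosenbrock_hessian(x, y):
    """Second derivatives of f(x, y) = (1 - x)**2 + 100*(y - x**2)**2."""
    f_xx = 2 - 400 * (y - x**2) + 800 * x**2
    f_xy = -400 * x
    f_yy = 200
    return np.array([[f_xx, f_xy], [f_xy, f_yy]], dtype=float)

H = rosenbrock_hessian(1.0, 1.0)
eigenvalues = np.linalg.eigvalsh(H)   # ascending order for symmetric matrices
condition_number = eigenvalues[-1] / eigenvalues[0]

print(f"Curvature along the valley floor: {eigenvalues[0]:.3f}")
print(f"Curvature across the valley:      {eigenvalues[-1]:.1f}")
print(f"Condition number:                 {condition_number:.0f}")
```

The curvature across the valley is roughly 2,500 times the curvature along it. The steep direction limits how large a stable learning rate can be, while progress along the flat floor scales with the small curvature, which is why plain gradient descent crawls here.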

In [None]:
def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    dx = -2*(1 - x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

# Visualize the Rosenbrock "soufflé" recipe landscape
x = np.linspace(-2, 2, 200)
y = np.linspace(-1, 3, 200)
X, Y = np.meshgrid(x, y)
Z = (1 - X)**2 + 100 * (Y - X**2)**2

plt.figure(figsize=(10, 8))
contours = plt.contour(X, Y, Z, levels=np.logspace(0, 3, 30), cmap=cm.viridis)
plt.scatter([1], [1], color='red', s=200, marker='*', label='Perfect Recipe (1,1)', zorder=5)
plt.xlabel('Egg White Ratio')
plt.ylabel('Oven Temperature')
plt.title('The Rosenbrock "Soufflé" Recipe Landscape\nNotice the narrow curved valley: hard to optimize!')
# Pass the contour set explicitly; otherwise the colorbar attaches to the scatter call
plt.colorbar(contours, label='Flavor Error')
plt.legend()
plt.show()

In [None]:
# Gradient descent on Rosenbrock (challenging! — like perfecting a soufflé)
start = [-1.0, 1.0]
final, history = gradient_descent_2d(rosenbrock, rosenbrock_grad, start, 
                                      learning_rate=0.001, n_steps=5000)

print(f"Starting recipe: {start}")
print(f"Final recipe:    [{final[0]:.4f}, {final[1]:.4f}]")
print(f"Error at final: {rosenbrock(final):.6f}")
print(f"True perfect recipe: (1, 1), error = 0")

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.contour(X, Y, Z, levels=np.logspace(0, 3, 30), cmap=cm.viridis, alpha=0.5)
plt.plot(history[::50, 0], history[::50, 1], 'r.-', markersize=5, linewidth=1)  # Every 50th point
plt.scatter([start[0]], [start[1]], color='green', s=100, marker='o', label='First Attempt', zorder=5)
plt.scatter([final[0]], [final[1]], color='red', s=100, marker='*', label='Final Recipe', zorder=5)
plt.xlabel('Egg White Ratio')
plt.ylabel('Oven Temperature')
plt.title('Recipe Optimization on Rosenbrock (Soufflé)')
plt.legend()

plt.subplot(1, 2, 2)
losses = [rosenbrock(p) for p in history[::10]]
plt.plot(losses)
plt.xlabel('Iteration (x10)')
plt.ylabel('Flavor Error')
plt.title('Error Over Optimization Steps')
plt.yscale('log')

plt.tight_layout()
plt.show()

print("\nNote: Simple gradient descent struggles with narrow valleys!")
print("Cooking: This is why experienced pastry chefs and advanced techniques are needed")
print("for delicate recipes like soufflés with tight ingredient windows.")
print("More advanced optimizers (Adam, etc.) handle this better.")

---

## 6. Putting It Together: Training a Linear Model

Let's train a simple linear regression model using gradient descent.

**Cooking scenario:** We're building a simple model to predict baking quality from oven temperature. The relationship is roughly linear: higher temperature means faster browning (up to a point). We'll use gradient descent to find the best-fit line — exactly how a baker might calibrate their temperature-to-browning model from practice bakes.

In [None]:
# Generate synthetic cooking data: browning score vs oven temperature
np.random.seed(42)
n_bakes = 100

# True parameters (in scaled units, for numerical convenience): each unit of centered
# oven temperature adds 2.5 to the browning score, and the baseline browning score is 1.0
temp_effect_true = 2.5    # scaled temperature-to-browning sensitivity
baseline_browning_true = 1.0  # scaled baseline browning score

# Generate data: browning = temp_effect * oven_temp + baseline + noise
oven_temp = np.random.uniform(-3, 3, n_bakes)  # centered temperature values
browning_score = temp_effect_true * oven_temp + baseline_browning_true + np.random.normal(0, 0.5, n_bakes)

plt.figure(figsize=(10, 6))
plt.scatter(oven_temp, browning_score, alpha=0.6, label='Practice bake data')
plt.plot(oven_temp, temp_effect_true * oven_temp + baseline_browning_true, 'r-', linewidth=2, 
         label=f'True model: browning = {temp_effect_true}*temp + {baseline_browning_true}')
plt.xlabel('Oven Temperature (centered)')
plt.ylabel('Browning Score (scaled)')
plt.title('Baking Data: Browning Score vs Oven Temperature\n(Each dot = one practice bake)')
plt.legend()
plt.show()

In [None]:
def train_browning_model(X, y, learning_rate=0.01, n_epochs=100):
    """
    Train a linear temperature-to-browning model using gradient descent.
    
    Model: predicted_browning = w * oven_temp + b
    Loss: MSE = (1/n) * sum((predicted_browning - actual_browning)^2)
    
    Cooking context: The baker wants to learn the temperature effect coefficient
    and baseline browning from practice bake data.
    """
    n = len(X)
    
    # Initialize parameters (start with no knowledge)
    w = 0.0  # temperature effect coefficient
    b = 0.0  # baseline browning
    
    history = {'loss': [], 'w': [], 'b': []}
    
    for epoch in range(n_epochs):
        # Forward pass: predict browning scores
        y_pred = w * X + b
        
        # Compute loss (MSE — how wrong are our predictions?)
        loss = np.mean((y_pred - y)**2)
        
        # Compute gradients (which direction improves the model?)
        # d(loss)/dw = (2/n) * sum((y_pred - y) * x)
        # d(loss)/db = (2/n) * sum(y_pred - y)
        dw = (2/n) * np.sum((y_pred - y) * X)
        db = (2/n) * np.sum(y_pred - y)
        
        # Update parameters (gradient descent step)
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record history
        history['loss'].append(loss)
        history['w'].append(w)
        history['b'].append(b)
        
        if epoch % 20 == 0:
            print(f"Epoch {epoch:3d}: loss = {loss:.4f}, temp_effect = {w:.4f}, baseline = {b:.4f}")
    
    return w, b, history

# Train the browning model
w_learned, b_learned, history = train_browning_model(oven_temp, browning_score, learning_rate=0.1, n_epochs=100)

print(f"\nLearned: temp_effect = {w_learned:.4f}, baseline = {b_learned:.4f}")
print(f"True:    temp_effect = {temp_effect_true:.4f}, baseline = {baseline_browning_true:.4f}")
print(f"\nThe model successfully learned the temperature-browning relationship from bake data!")

In [None]:
# Visualize the training process
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss curve
axes[0].plot(history['loss'])
axes[0].set_xlabel('Training Epoch')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Prediction Error Over Training')

# Parameter trajectory
axes[1].plot(history['w'], label='temp_effect (w)')
axes[1].plot(history['b'], label='baseline (b)')
axes[1].axhline(y=temp_effect_true, color='blue', linestyle='--', alpha=0.5, 
                label=f'true temp_effect={temp_effect_true}')
axes[1].axhline(y=baseline_browning_true, color='orange', linestyle='--', alpha=0.5, 
                label=f'true baseline={baseline_browning_true}')
axes[1].set_xlabel('Training Epoch')
axes[1].set_ylabel('Parameter Value')
axes[1].set_title('Parameter Convergence\n(Model learns the true values!)')
axes[1].legend()

# Final fit
axes[2].scatter(oven_temp, browning_score, alpha=0.6, label='Bake data')
x_line = np.linspace(-3, 3, 100)
axes[2].plot(x_line, temp_effect_true * x_line + baseline_browning_true, 'g-', linewidth=2, 
             label=f'True: {temp_effect_true}*temp + {baseline_browning_true}')
axes[2].plot(x_line, w_learned * x_line + b_learned, 'r--', linewidth=2, 
             label=f'Learned: {w_learned:.2f}*temp + {b_learned:.2f}')
axes[2].set_xlabel('Oven Temperature')
axes[2].set_ylabel('Browning Score')
axes[2].set_title('Final Model Fit\n(Red dashed = learned, Green = true)')
axes[2].legend()

plt.tight_layout()
plt.show()
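
As a cross-check, linear regression with MSE loss has a closed-form least-squares solution, so gradient descent should land on the same line. The sketch below regenerates the synthetic data with the same seed and compares a plain gradient descent loop against `np.polyfit`. It is self-contained and does not reuse the notebook's variables.

```python
import numpy as np

np.random.seed(42)
oven_temp = np.random.uniform(-3, 3, 100)
browning = 2.5 * oven_temp + 1.0 + np.random.normal(0, 0.5, 100)

# Closed-form least-squares fit of a degree-1 polynomial: returns [slope, intercept]
w_ls, b_ls = np.polyfit(oven_temp, browning, 1)

# Gradient descent on the same MSE loss (learning rate 0.1, 100 epochs)
w, b = 0.0, 0.0
for _ in range(100):
    residual = w * oven_temp + b - browning
    dw = 2 * np.mean(residual * oven_temp)
    db = 2 * np.mean(residual)
    w -= 0.1 * dw
    b -= 0.1 * db

print(f"least squares:    w = {w_ls:.4f}, b = {b_ls:.4f}")
print(f"gradient descent: w = {w:.4f}, b = {b:.4f}")
```

Both fits differ slightly from the true values 2.5 and 1.0 because of the noise in the data; what gradient descent recovers is the best-fit line for this sample, not the underlying parameters themselves.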

---

## Exercises

### Exercise 1: Implement Gradient Checking for a Caramelization Model

**Cooking scenario:** You've built an analytical model of sugar caramelization and computed its gradients by hand. Before trusting those gradients for recipe optimization, you need to **verify** them against numerical gradients. This is gradient checking — the baker's sanity check before making expensive ingredient changes based on model predictions.

Gradient checking is crucial for debugging backpropagation. Compare analytical gradients with numerical gradients.

In [None]:
def gradient_check(f, grad_f, point, h=1e-5, threshold=1e-5):
    """
    Compare analytical gradient with numerical gradient.
    Returns True if they match within threshold.
    
    Cooking context: Verify that your hand-derived caramelization model gradients 
    match numerical estimates before using them to make recipe decisions.
    """
    point = np.array(point, dtype=float)
    analytical_grad = grad_f(point)
    numerical_grad = compute_gradient(f, point, h)
    
    # Compute relative error
    diff = np.abs(analytical_grad - numerical_grad)
    denom = np.maximum(np.abs(analytical_grad) + np.abs(numerical_grad), 1e-10)
    relative_error = diff / denom
    
    print(f"Recipe point: {point}")
    print(f"Analytical gradient:  {analytical_grad}")
    print(f"Numerical gradient:   {numerical_grad}")
    print(f"Relative error: {relative_error}")
    print(f"Max relative error: {np.max(relative_error):.2e}")
    
    return np.all(relative_error < threshold)

# Test on a caramelization model: f(temp, sugar_conc) = temp³ + 2*temp*sugar_conc + sugar_conc²
# Cooking: How caramelization depends on temperature and sugar concentration
def caramelization_model(p):
    temp, sugar_conc = p
    return temp**3 + 2*temp*sugar_conc + sugar_conc**2

def caramelization_grad(p):
    temp, sugar_conc = p
    return np.array([3*temp**2 + 2*sugar_conc, 2*temp + 2*sugar_conc])

passed = gradient_check(caramelization_model, caramelization_grad, [2.0, 3.0])
print(f"\nGradient check passed: {passed}")
if passed:
    print("The analytical gradients are trustworthy for recipe optimization!")
else:
    print("Mismatch: re-derive the analytical gradients before using them.")
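
Gradient checking earns its keep when the hand-derived gradient is wrong. As an illustration, the sketch below checks a deliberately buggy gradient for the same caramelization model. It is self-contained: `numerical_gradient` is a hypothetical central-difference helper standing in for the notebook's `compute_gradient`, and `buggy_grad` drops a term on purpose.

```python
import numpy as np

def numerical_gradient(f, point, h=1e-5):
    """Central-difference estimate of the gradient of f at point."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h
        grad[i] = (f(point + step) - f(point - step)) / (2 * h)
    return grad

def model(p):
    temp, sugar_conc = p
    return temp**3 + 2 * temp * sugar_conc + sugar_conc**2

def correct_grad(p):
    temp, sugar_conc = p
    return np.array([3 * temp**2 + 2 * sugar_conc, 2 * temp + 2 * sugar_conc])

def buggy_grad(p):
    temp, sugar_conc = p
    # Bug on purpose: the 2*sugar_conc contribution from 2*temp*sugar_conc is missing
    return np.array([3 * temp**2, 2 * temp + 2 * sugar_conc])

def max_relative_error(analytical, numerical):
    denom = np.maximum(np.abs(analytical) + np.abs(numerical), 1e-10)
    return np.max(np.abs(analytical - numerical) / denom)

point = np.array([2.0, 3.0])
num = numerical_gradient(model, point)
err_correct = max_relative_error(correct_grad(point), num)
err_buggy = max_relative_error(buggy_grad(point), num)
print(f"correct gradient: max relative error = {err_correct:.2e}")
print(f"buggy gradient:   max relative error = {err_buggy:.2e}")
```

The correct gradient agrees with the numerical estimate to many digits, while the buggy one is off by a relative error of about 0.2, far above any sensible threshold.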

### Exercise 2: Implement Softmax and Its Gradient

**Cooking scenario:** Your recipe development team needs to convert raw "fitness scores" for different cooking methods (bake, broil, sauté) into **probabilities** — "What's the probability that each method is the optimal choice for this dish?" Softmax is the standard way to do this conversion, and it's critical for classification in ML.

Implement softmax and its Jacobian, the matrix that holds the derivative of every output probability with respect to every input score.

In [None]:
def softmax(x):
    """
    Compute softmax: softmax(x)_i = exp(x_i) / sum(exp(x_j))
    Subtract max for numerical stability.
    
    Cooking context: Convert cooking method fitness scores into selection probabilities.
    """
    # Reference solution (try implementing it yourself first)
    x_shifted = x - np.max(x)  # For numerical stability
    exp_x = np.exp(x_shifted)
    return exp_x / np.sum(exp_x)

def softmax_jacobian(x):
    """
    Compute Jacobian of softmax.
    J[i,j] = d(softmax_i)/d(x_j)
    
    Formula: J[i,j] = softmax_i * (delta_ij - softmax_j)
    where delta_ij = 1 if i==j, else 0
    
    Cooking context: How does changing the fitness score of one cooking method 
    affect the selection probability of every method?
    """
    s = softmax(x)
    n = len(s)
    jacobian = np.zeros((n, n))
    
    # Reference solution (try implementing it yourself first)
    for i in range(n):
        for j in range(n):
            if i == j:
                jacobian[i, j] = s[i] * (1 - s[j])
            else:
                jacobian[i, j] = -s[i] * s[j]
    
    return jacobian

# Test — cooking method fitness scores: [bake, broil, sauté]
method_scores = np.array([2.0, 1.0, 0.1])
print(f"Cooking method fitness scores: {method_scores}")
print(f"  (Bake=2.0, Broil=1.0, Sauté=0.1)")
print(f"\nSelection probabilities: {softmax(method_scores)}")
print(f"Sum (should be 1): {softmax(method_scores).sum():.6f}")
print(f"\nJacobian (how each score affects each probability):")
print(f"{softmax_jacobian(method_scores)}")
print(f"\nCooking interpretation: Baking has highest probability ({softmax(method_scores)[0]:.1%})")
print(f"because it has the highest fitness score.")
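
Two properties make handy sanity checks for a softmax Jacobian. The formula $J_{ij} = s_i(\delta_{ij} - s_j)$ is the matrix $\mathrm{diag}(s) - s s^\top$, and because the probabilities always sum to 1, every column of $J$ sums to 0. A self-contained sketch follows; the name `softmax_jacobian_vectorized` is introduced here for illustration.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))   # shift by max for numerical stability
    return exp_x / np.sum(exp_x)

def softmax_jacobian_vectorized(x):
    """J[i, j] = s_i * (delta_ij - s_j), written as diag(s) - outer(s, s)."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

scores = np.array([2.0, 1.0, 0.1])
J = softmax_jacobian_vectorized(scores)

# Numerical check of column 0: nudge the first score and watch all probabilities
h = 1e-6
nudge = np.array([h, 0.0, 0.0])
numerical_col0 = (softmax(scores + nudge) - softmax(scores - nudge)) / (2 * h)

print("Column sums (should be ~0):", J.sum(axis=0))
print("Max error vs numerical column 0:", np.max(np.abs(J[:, 0] - numerical_col0)))
```

The zero column sums have a direct cooking reading: raising the fitness score of one method can only shift probability between methods, never create or destroy it.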

### Exercise 3: Gradient Descent with Momentum

**Cooking scenario:** Standard gradient descent can get stuck in local minima or oscillate in narrow valleys. Momentum is like giving your recipe optimizer "inertia" — it builds up speed in consistent directions and carries through small bumps. In cooking terms, instead of reacting purely to the last bake's results, momentum lets you carry the "trend" from multiple attempts. If the dish has been getting better with more butter for 3 attempts straight, momentum says "keep going in that direction even if this one attempt was noisy."

Momentum helps accelerate gradient descent. Implement it!

In [None]:
def gradient_descent_momentum(f, grad_f, start, learning_rate=0.01, momentum=0.9, n_steps=100):
    """
    Gradient descent with momentum.
    
    v = momentum * v - learning_rate * gradient
    x = x + v
    
    Cooking context: The velocity term carries the "trend" from previous attempts,
    helping the optimizer build speed in consistent directions and carry past
    noisy bumps in the recipe landscape.
    """
    point = np.array(start, dtype=float)
    velocity = np.zeros_like(point)
    history = [point.copy()]
    
    for i in range(n_steps):
        grad = grad_f(point)
        velocity = momentum * velocity - learning_rate * grad
        point = point + velocity
        history.append(point.copy())
        
    return point, np.array(history)

# Compare regular GD vs GD with momentum on Rosenbrock (the "soufflé" landscape)
start = [-1.0, 1.0]
n_steps = 1000

final_gd, history_gd = gradient_descent_2d(rosenbrock, rosenbrock_grad, start, 
                                            learning_rate=0.001, n_steps=n_steps)
final_mom, history_mom = gradient_descent_momentum(rosenbrock, rosenbrock_grad, start,
                                                    learning_rate=0.001, momentum=0.9, n_steps=n_steps)

print(f"Regular GD final error:  {rosenbrock(final_gd):.6f}")
print(f"Momentum GD final error: {rosenbrock(final_mom):.6f}")
print(f"\nMomentum finds a better recipe in the same number of iterations!")

# Visualize
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
plt.contour(X, Y, Z, levels=np.logspace(0, 3, 30), cmap=cm.viridis, alpha=0.5)
plt.plot(history_gd[::20, 0], history_gd[::20, 1], 'b.-', markersize=3, label='Standard GD')
plt.plot(history_mom[::20, 0], history_mom[::20, 1], 'r.-', markersize=3, label='With Momentum')
plt.xlabel('Egg White Ratio')
plt.ylabel('Oven Temperature')
plt.title('Recipe Optimization Paths\n(Momentum carries through the narrow valley)')
plt.legend()

plt.subplot(1, 2, 2)
losses_gd = [rosenbrock(p) for p in history_gd]
losses_mom = [rosenbrock(p) for p in history_mom]
plt.plot(losses_gd, 'b-', label='Standard GD')
plt.plot(losses_mom, 'r-', label='With Momentum')
plt.xlabel('Iteration')
plt.ylabel('Flavor Error')
plt.title('Error: Standard GD vs Momentum\n(Momentum converges faster)')
plt.yscale('log')
plt.legend()

plt.tight_layout()
plt.show()
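
A rough way to see why momentum accelerates progress along a consistent direction: if the gradient stays constant at $g$, the velocity update $v \leftarrow \mu v - \alpha g$ settles at $v = -\alpha g / (1 - \mu)$. With $\mu = 0.9$, that is a step ten times larger than plain gradient descent takes. The sketch below checks this numerically under the constant-gradient assumption, which real loss surfaces only approximate along a valley floor.

```python
learning_rate = 0.001
momentum = 0.9
constant_grad = 2.0    # assumed constant gradient, for illustration only

velocity = 0.0
for _ in range(200):
    velocity = momentum * velocity - learning_rate * constant_grad

plain_step = -learning_rate * constant_grad
predicted = plain_step / (1 - momentum)

print(f"plain GD step:             {plain_step:.6f}")
print(f"momentum step (after 200): {velocity:.6f}")
print(f"predicted limit:           {predicted:.6f}")
```

Across the valley, where the gradient keeps changing sign, consecutive velocity contributions partly cancel instead of accumulating, so the larger effective step applies mainly along the valley floor.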

---

## Summary

### Key Concepts

| Concept | Mathematical Meaning | Cooking Parallel |
|---------|---------------------|------------------|
| **Derivatives** | Measure rate of change — essential for optimization | How browning changes when you adjust oven temperature; how fast dough rises is the derivative of its volume over time |
| **Partial derivatives** | Handle functions of multiple variables | Effect of changing ONE ingredient (e.g., sugar amount) while holding everything else fixed |
| **The gradient** | Points in the direction of steepest ascent | The chef's "sensitivity report" — which recipe ingredient to change to improve flavor fastest |
| **Chain rule** | Compute gradients through composed functions (backprop!) | How oven temp flows through crust formation, moisture retention, texture to affect quality (sequential dependencies) |
| **Gradient descent** | Minimize loss by following the negative gradient | Iteratively tweaking a recipe across baking attempts to minimize the gap from perfection |
| **Learning rate** | How big each optimization step is | How aggressively the cook adjusts the recipe (too big = overshoot, too small = slow progress) |
| **Local minima** | A point that's locally optimal but not globally | A "good enough" recipe that's not actually the best — getting stuck |
| **Momentum** | Build up "velocity" in consistent directions | Carrying the trend from multiple baking attempts to power through noise and shallow local minima |

### Connection to Deep Learning

- **Forward pass**: Compute function values through the network
- **Loss**: Scalar measuring prediction quality
- **Backward pass**: Apply chain rule to compute gradients
- **Update**: Move parameters in negative gradient direction

### Checklist
- [ ] I can compute derivatives of common functions
- [ ] I understand partial derivatives and gradients
- [ ] I can apply the chain rule to composite functions
- [ ] I can implement gradient descent from scratch
- [ ] I understand the effect of learning rate

---

## Next Steps

Continue to **Part 1.3: Probability & Statistics** where we'll cover:
- Probability distributions
- Bayes' theorem
- Maximum likelihood estimation
- Information theory (entropy, KL divergence)

**Cooking preview:** Probability and statistics are how cooks make decisions under uncertainty — "What's the probability this bread will rise properly at this altitude?", "Given the current rate of caramelization, when should I pull it from the oven?", "How confident are we that this recipe change actually improved the dish vs. random variation?" These are all probability and statistics questions.