# Calculus Challenge Arena

Welcome, challenger! This notebook tests your calculus knowledge through progressively harder challenges. Each concept is a **World** with multiple **Levels** and a **Boss Level**.

**Rules:**
- Try each challenge before looking at hints
- Hints cost points, so use them wisely!
- Solutions explain the concept AFTER you've struggled with it
- Bonus challenges are optional but earn extra points

## Progress Tracker

| Level | Challenge | Type | Points | Status |
|-------|-----------|------|--------|--------|
| 1.1 | The Slope Finder | Compute | /10 | ⬜ |
| 1.2 | Shape Predictor | Predict | /15 | ⬜ |
| 1.3 | Derivative Detective | Debug | /20 | ⬜ |
| 🏆 | Boss: Activation Function Analysis | Connect | /30 | ⬜ |
| 2.1 | Chain Reaction | Compute | /10 | ⬜ |
| 2.2 | The Missing Link | Debug | /15 | ⬜ |
| 🏆 | Boss: Backprop Simulator | Construct | /30 | ⬜ |
| 3.1 | Downhill Runner | Compute | /10 | ⬜ |
| 3.2 | Rate Tuner | Predict | /15 | ⬜ |
| 3.3 | Valley Finder | Optimize | /20 | ⬜ |
| 🏆 | Boss: The Optimizer | Construct | /30 | ⬜ |
| 👑 | Final Boss: Train a Model | Construct | /50 | ⬜ |
| | **Total** | | **/255** | |

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def numerical_derivative(f, x, h=1e-5):
    """Helper: compute derivative numerically."""
    return (f(x + h) - f(x - h)) / (2 * h)

def check_answer(yours, correct, tolerance=0.01):
    """Check if your answer is close enough."""
    if abs(yours - correct) < tolerance:
        print(f"\u2705 Correct! Your answer: {yours:.4f}, Expected: {correct:.4f}")
        return True
    else:
        print(f"\u274c Not quite. Your answer: {yours:.4f}, Expected: {correct:.4f}")
        return False

# World 1: Derivatives

## Level 1.1: The Slope Finder

**Type:** Compute | **Points:** 10 | **Difficulty:** Easy

### Mission

Find the derivative of f(x) = 3x² + 2x - 5 and evaluate it at x = 2.

### Intel

The **power rule** says: if f(x) = x^n, then f'(x) = n * x^(n-1).
For a sum: take the derivative of each term separately.
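
As a quick illustration of the power rule on a different polynomial, the sketch below differentiates term by term and compares the result with a centered-difference estimate. The function `g`, the point `x0`, and the helper `centered_difference` are made up for this example and are not part of the challenge.

```python
# Illustrative polynomial (hypothetical, not the challenge function):
# g(x) = 4x^3 - x^2 + 7
def g(x):
    return 4 * x**3 - x**2 + 7

# Power rule, term by term:
# d/dx[4x^3] = 12x^2, d/dx[-x^2] = -2x, d/dx[7] = 0
def g_prime(x):
    return 12 * x**2 - 2 * x

def centered_difference(func, x, h=1e-5):
    """Numerical slope estimate, used only as a sanity check."""
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 1.5
analytic = g_prime(x0)                 # 12 * 2.25 - 3 = 24.0
numeric = centered_difference(g, x0)
print(f"analytic: {analytic:.4f}, numeric: {numeric:.4f}")
```

If the two numbers agree to several decimal places, the hand-derived formula is very likely right.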

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

Break the function into three terms and find the derivative of each one separately.

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

- d/dx[3x²] = 3 * 2x = 6x
- d/dx[2x] = 2
- d/dx[-5] = 0

Now combine and plug in x = 2.

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

f'(x) = 6x + 2. At x = 2: f'(2) = 6(2) + 2 = 14.

</details>

In [None]:
# Level 1.1 - YOUR SOLUTION
# Write your derivative function and evaluate at x = 2

def f(x):
    return 3*x**2 + 2*x - 5

# TODO: Write the derivative of f here (a reference answer is filled in below)
def f_prime(x):
    return 6*x + 2  # YOUR ANSWER

answer = f_prime(2)
print(f"f'(2) = {answer}")
check_answer(answer, 14.0)

# Verify with numerical derivative
print(f"Numerical check: {numerical_derivative(f, 2):.4f}")

## Level 1.2: Shape Predictor

**Type:** Predict | **Points:** 15 | **Difficulty:** Medium

### Mission

Without computing, predict: where does f(x) = x³ - 3x have zero slope? Then verify by finding the derivative, setting it to zero, and plotting both the function and its derivative.

### Intel

A function has **zero slope** at points where f'(x) = 0. These are called **critical points**. At these points the function is either at a local maximum, a local minimum, or an inflection point. Think about the shape of a cubic curve: it rises, flattens, dips, flattens, then rises again.
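
Here is a minimal sketch of locating and classifying critical points for a different cubic. The function g(x) = x³ - 12x is a made-up example, and the second-derivative test used in `classify` is an extra technique not covered in this level.

```python
import numpy as np

# Illustrative cubic (hypothetical): g(x) = x^3 - 12x
# Power rule: g'(x) = 3x^2 - 12, and g''(x) = 6x
g_prime_coeffs = [3, 0, -12]   # coefficients of 3x^2 + 0x - 12

# np.roots returns the zeros of a polynomial given its coefficients
critical_points = np.sort(np.roots(g_prime_coeffs).real)

def classify(x):
    """Second-derivative test using g''(x) = 6x."""
    second = 6 * x
    if second > 0:
        return "local min"
    if second < 0:
        return "local max"
    return "inconclusive"

labels = [classify(x) for x in critical_points]
for x, label in zip(critical_points, labels):
    print(f"x = {x:+.1f}: {label}")
```

A positive second derivative means the curve bends upward (a valley), and a negative one means it bends downward (a peak).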

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

A cubic like x³ - 3x has an S-shape. It has one local max and one local min. Where do those occur?

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

f'(x) = 3x² - 3. Set this equal to zero: 3x² - 3 = 0. Can you solve for x?

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

3x² - 3 = 0 gives x² = 1, so x = +1 or x = -1. At x = -1 the function has a local max, and at x = 1 it has a local min.

</details>

In [None]:
# Level 1.2 - Verification
# f(x) = x³ - 3x, f'(x) = 3x² - 3
# Set f'(x) = 0: 3x² - 3 = 0 -> x² = 1 -> x = +/-1

f = lambda x: x**3 - 3*x
f_prime = lambda x: 3*x**2 - 3

x = np.linspace(-3, 3, 200)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(x, f(x), 'b-', linewidth=2)
ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax1.scatter([1, -1], [f(1), f(-1)], color='red', s=100, zorder=5)
ax1.set_title('f(x) = x\u00b3 - 3x')
ax1.grid(True, alpha=0.3)

ax2.plot(x, f_prime(x), 'r-', linewidth=2)
ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax2.scatter([1, -1], [0, 0], color='red', s=100, zorder=5)
ax2.set_title("f'(x) = 3x\u00b2 - 3 (zero at x = \u00b11)")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print("Critical points at x = -1 and x = 1")
print(f"f'(-1) = {f_prime(-1):.1f}, f'(1) = {f_prime(1):.1f}: both zero! \u2705")

## Level 1.3: Derivative Detective

**Type:** Debug | **Points:** 20 | **Difficulty:** Hard

### Mission

This code computes the derivative of sin(x) but has a bug. Find and fix it.

```python
# BUGGY CODE:
import numpy as np
def my_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h  # Something is wrong here...
```

The function gives a result, but it is significantly less accurate than it should be. Can you figure out why and fix it?

### Intel

There are multiple ways to approximate a derivative numerically. The most common are:
- **Forward difference:** (f(x+h) - f(x)) / h
- **Centered difference:** (f(x+h) - f(x-h)) / (2h)

One of these is significantly more accurate than the other. The error in the forward difference is O(h), while the centered difference has error O(h²).
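
To see what these error orders mean in practice, the sketch below measures both errors for a few step sizes. The test function exp(x) at x = 0 and the step sizes are assumptions chosen for illustration, not part of the challenge.

```python
import numpy as np

def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h

def centered_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

# Illustrative test function (an assumption for this sketch): exp(x) at x = 0,
# whose exact derivative is exp(0) = 1.
exact = 1.0
step_sizes = [1e-1, 1e-2, 1e-3]

forward_errors = [abs(forward_difference(np.exp, 0.0, h) - exact) for h in step_sizes]
centered_errors = [abs(centered_difference(np.exp, 0.0, h) - exact) for h in step_sizes]

for h, fe, ce in zip(step_sizes, forward_errors, centered_errors):
    print(f"h = {h:.0e}   forward error = {fe:.2e}   centered error = {ce:.2e}")
```

Each time h shrinks by a factor of 10, the forward error shrinks by roughly 10 and the centered error by roughly 100, which is what O(h) and O(h²) predict.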

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

The bug is about accuracy, not about getting a completely wrong answer. Compare the error of the buggy code vs. what a better method would give.

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

The forward difference only looks at one side of x. A centered difference looks at both sides and gives a much better approximation of the slope at x.

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

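Replace `(f(x + h) - f(x)) / h` with `(f(x + h) - f(x - h)) / (2 * h)`. This changes the truncation error from O(h) to O(h²). For h = 1e-5 that is roughly five orders of magnitude more accurate, though the exact factor depends on the function and on floating-point round-off.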

</details>

In [None]:
# Level 1.3 - The Bug: Forward difference vs centered difference

# Buggy version (forward difference)
def bad_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

# Fixed version (centered difference - much more accurate!)
def good_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.0
exact = np.cos(x0)  # derivative of sin(x) is cos(x)
bad = bad_derivative(np.sin, x0)
good = good_derivative(np.sin, x0)

print(f"Exact derivative of sin(x) at x={x0}: {exact:.10f}")
print(f"Forward difference (buggy):            {bad:.10f}  error: {abs(bad-exact):.2e}")
print(f"Centered difference (fixed):           {good:.10f}  error: {abs(good-exact):.2e}")
print(f"\nCentered difference is ~{abs(bad-exact)/abs(good-exact):.0f}x more accurate! \u2705")

## Boss Level: Activation Function Analysis

**Type:** Connect | **Points:** 30 | **Difficulty:** Boss

### Mission

The sigmoid, ReLU, and tanh are all "activation functions" used in neural networks. Your mission:

1. Compute and plot each function AND its derivative.
2. Then explain: why might a neural network prefer ReLU over sigmoid?

### Intel

- **Sigmoid:** sigma(x) = 1 / (1 + e^(-x)), outputs between 0 and 1
- **ReLU:** relu(x) = max(0, x), outputs 0 or positive
- **Tanh:** tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), outputs between -1 and 1

In neural networks, gradients flow backward through layers. If a derivative is consistently small, gradients shrink to near-zero. This is the **vanishing gradient problem**.
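
The shrinking effect is just repeated multiplication. The sketch below uses made-up per-layer factors (not the actual derivatives of any activation function) to show how quickly a gradient scales down with depth.

```python
# Hypothetical per-layer derivative magnitudes (illustrative values only)
per_layer_factors = [1.0, 0.9, 0.5]
depths = [1, 5, 10, 20]

def gradient_scale(factor, depth):
    """Size of a gradient after being multiplied by `factor` at each of `depth` layers."""
    return factor ** depth

scales = {c: [gradient_scale(c, n) for n in depths] for c in per_layer_factors}

for c, values in scales.items():
    row = ", ".join(f"{v:.2e}" for v in values)
    print(f"factor {c}: {row}")
```

A factor of exactly 1 leaves the gradient untouched at any depth, while any factor below 1 decays geometrically.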

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

Plot the derivatives of all three. What is the maximum value each derivative can reach? Which one lets the largest gradients through?

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

- Sigmoid derivative: sigma(x) * (1 - sigma(x)), max value = 0.25
- ReLU derivative: 0 when x < 0, 1 when x > 0
- Tanh derivative: 1 - tanh(x)², max value = 1.0

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

Sigmoid's derivative maxes at 0.25, so in every layer the gradient is multiplied by at most 0.25. After 10 layers: 0.25^10 is near zero. ReLU's derivative is either 0 or 1, so gradients pass through unchanged when active. This is why deep networks prefer ReLU.

</details>

In [None]:
# Boss Level 1 Solution: Activation Function Analysis
sigmoid = lambda x: 1 / (1 + np.exp(-x))
sigmoid_d = lambda x: sigmoid(x) * (1 - sigmoid(x))
relu = lambda x: np.maximum(0, x)
relu_d = lambda x: (x > 0).astype(float)
tanh_fn = lambda x: np.tanh(x)
tanh_d = lambda x: 1 - np.tanh(x)**2

x = np.linspace(-5, 5, 200)
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for ax, fn, name in zip(axes[0], [sigmoid, relu, tanh_fn], ['Sigmoid', 'ReLU', 'Tanh']):
    ax.plot(x, fn(x), 'b-', linewidth=2)
    ax.set_title(name, fontsize=13)
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='gray', alpha=0.3)

for ax, fn, name in zip(axes[1], [sigmoid_d, relu_d, tanh_d], ["Sigmoid'", "ReLU'", "Tanh'"]):
    ax.plot(x, fn(x), 'r-', linewidth=2)
    ax.set_title(f'{name} (derivative)', fontsize=13)
    ax.grid(True, alpha=0.3)

plt.suptitle('Activation Functions and Their Derivatives', fontsize=14)
plt.tight_layout()
plt.show()
print("Key insight: Sigmoid's derivative maxes at 0.25, so gradients shrink!")
print("ReLU's derivative is 0 or 1, so gradients pass through unchanged.")
print("This is why ReLU helps avoid the 'vanishing gradient' problem. \u2705")

# World 2: Chain Rule

## Level 2.1: Chain Reaction

**Type:** Compute | **Points:** 10 | **Difficulty:** Easy

### Mission

Find dy/dx for y = (2x + 1)³ at x = 2.

### Intel

The **chain rule** handles compositions: if y = f(g(x)), then dy/dx = f'(g(x)) * g'(x).

Think of it as "derivative of the outer, times derivative of the inner."
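
A minimal sketch of "outer times inner" on a different composition, with a numerical check. The function y = (5x - 2)⁴ and the helper `centered_difference` are hypothetical and chosen only for illustration.

```python
# Illustrative composition (hypothetical): y = (5x - 2)^4
# Outer: u^4 -> 4u^3, inner: u = 5x - 2 -> 5
def y(x):
    return (5 * x - 2) ** 4

def dy_dx(x):
    outer_derivative = 4 * (5 * x - 2) ** 3   # f'(g(x))
    inner_derivative = 5                      # g'(x)
    return outer_derivative * inner_derivative

def centered_difference(func, x, h=1e-5):
    """Numerical slope estimate, used only as a sanity check."""
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 1.0
analytic = dy_dx(x0)                   # 4 * 3^3 * 5 = 540
numeric = centered_difference(y, x0)
print(f"chain rule: {analytic}, numeric: {numeric:.4f}")
```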

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

Identify the outer function and the inner function. What is "outer" and what is "inner" in (2x + 1)³?

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

- Outer: u³, derivative = 3u²
- Inner: u = 2x + 1, derivative = 2
- Chain rule: 3(2x+1)² * 2 = 6(2x+1)²

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

dy/dx = 6(2x + 1)². At x = 2: 6(2*2 + 1)² = 6(5)² = 6 * 25 = 150.

</details>

In [None]:
# Level 2.1 Solution
# y = (2x + 1)³ -> outer: u³, inner: 2x + 1
# dy/dx = 3u² * 2 = 6(2x + 1)²
y = lambda x: (2*x + 1)**3
dy_dx = lambda x: 6 * (2*x + 1)**2

x_val = 2
print(f"y = (2x + 1)\u00b3 at x = {x_val}")
print(f"Chain rule: dy/dx = 6(2({x_val}) + 1)\u00b2 = {dy_dx(x_val)}")
print(f"Numerical:  {numerical_derivative(y, x_val):.4f}")
check_answer(dy_dx(x_val), numerical_derivative(y, x_val))

## Level 2.2: The Missing Link

**Type:** Debug | **Points:** 15 | **Difficulty:** Medium

### Mission

A student differentiated sin(x²) and got cos(x²). What did they forget? Find the correct derivative and verify numerically.

### Intel

When you have a function inside a function, like sin(something), you need the chain rule. The derivative is NOT just the derivative of the outer function. You must multiply by the derivative of the inner function too.

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

The student only took the derivative of the outer function (sin -> cos). What about the inner function (x²)?

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

Chain rule: d/dx[sin(x²)] = cos(x²) * d/dx[x²]. What is d/dx[x²]?

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

d/dx[sin(x²)] = cos(x²) * 2x. The student forgot to multiply by 2x (the derivative of the inner function x²).

</details>

In [None]:
# Level 2.2 Solution
# f(x) = sin(x²)
# WRONG: f'(x) = cos(x²), which forgets the chain rule!
# RIGHT: f'(x) = cos(x²) * 2x, which multiplies by the derivative of the inner function

f = lambda x: np.sin(x**2)
wrong_answer = lambda x: np.cos(x**2)
right_answer = lambda x: np.cos(x**2) * 2 * x

x0 = 1.5
exact = numerical_derivative(f, x0)
print(f"f(x) = sin(x\u00b2) at x = {x0}")
print(f"Wrong (forgot chain rule): {wrong_answer(x0):.6f}")
print(f"Right (with chain rule):   {right_answer(x0):.6f}")
print(f"Numerical verification:    {exact:.6f}")
print(f"\nThe student forgot to multiply by the derivative of the INNER function (2x)!")
check_answer(right_answer(x0), exact)

## Boss Level: Backprop Simulator

**Type:** Construct | **Points:** 30 | **Difficulty:** Boss

### Mission

Build a simple computational graph. Given L = (sigmoid(wx + b) - y)², compute dL/dw step by step using the chain rule. This is exactly what happens inside neural network training: **backpropagation** is just the chain rule applied to a computation graph!

Use w = 0.5, x = 2.0, b = 0.1, y = 1.0.

### Intel

The computation graph for L = (sigmoid(wx + b) - y)² has four steps:
1. z = wx + b (linear combination)
2. a = sigmoid(z) (activation)
3. e = a - y (error)
4. L = e² (squared loss)

To find dL/dw, apply the chain rule through each step: dL/dw = dL/de * de/da * da/dz * dz/dw.
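
The same forward-then-backward pattern can be sketched on a smaller graph with no activation. The loss L = (wx - y)² and all numeric values below are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
# Illustrative smaller graph (hypothetical values): L = (w*x - y)^2
w, x, y = 3.0, 2.0, 5.0

# Forward pass: store every intermediate value
p = w * x        # prediction: 6.0
e = p - y        # error: 1.0
L = e ** 2       # loss: 1.0

# Backward pass: one local derivative per forward step
dL_de = 2 * e    # d/de[e^2]
de_dp = 1.0      # d/dp[p - y]
dp_dw = x        # d/dw[w*x]

dL_dw = dL_de * de_dp * dp_dw   # 2 * 1 * 2 = 4.0

# Sanity check with a centered difference on the whole expression
def loss(w_val):
    return (w_val * x - y) ** 2

h = 1e-5
numeric = (loss(w + h) - loss(w - h)) / (2 * h)
print(f"backprop: {dL_dw}, numeric: {numeric:.6f}")
```

Each backward line needs only values already computed in the forward pass, which is why backpropagation stores intermediates.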

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

Work forward first to get all the values (z, a, e, L). Then work backward, computing each local derivative.

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

The local derivatives are:
- dL/de = 2e = 2(a - y)
- de/da = 1
- da/dz = a(1 - a) (sigmoid derivative)
- dz/dw = x

Multiply them all together.

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

Forward: z = 0.5*2 + 0.1 = 1.1, a = sigmoid(1.1) = 0.7503, L = (0.7503 - 1)² = 0.0624.
Backward: dL/da = 2(0.7503 - 1) = -0.4995, da/dz = 0.7503 * 0.2497 = 0.1874, dz/dw = 2.0.
dL/dw = -0.4995 * 0.1874 * 2.0 = -0.1872.

</details>

In [None]:
# Boss Level 2: Manual Backpropagation
# L = (sigmoid(wx + b) - y)²
w, x_val, b, y_val = 0.5, 2.0, 0.1, 1.0

# Forward pass
z = w * x_val + b              # z = wx + b
a = 1 / (1 + np.exp(-z))      # a = sigmoid(z)
L = (a - y_val)**2             # L = (a - y)²

print("Forward pass:")
print(f"  z = w*x + b = {w}*{x_val} + {b} = {z}")
print(f"  a = sigmoid({z:.1f}) = {a:.6f}")
print(f"  L = (a - y)\u00b2 = ({a:.6f} - {y_val})\u00b2 = {L:.6f}")

# Backward pass (chain rule!)
dL_da = 2 * (a - y_val)       # d/da[(a-y)²]
da_dz = a * (1 - a)           # sigmoid derivative
dz_dw = x_val                  # d/dw[wx + b] = x

dL_dw = dL_da * da_dz * dz_dw  # CHAIN RULE: multiply through!

print(f"\nBackward pass (chain rule):")
print(f"  dL/da = 2(a - y) = {dL_da:.6f}")
print(f"  da/dz = a(1-a) = {da_dz:.6f}")
print(f"  dz/dw = x = {dz_dw}")
print(f"  dL/dw = {dL_da:.6f} \u00d7 {da_dz:.6f} \u00d7 {dz_dw} = {dL_dw:.6f}")

# Verify numerically
def loss(w_val):
    z = w_val * x_val + b
    a = 1 / (1 + np.exp(-z))
    return (a - y_val)**2

numerical = numerical_derivative(loss, w)
print(f"\nNumerical verification: {numerical:.6f}")
check_answer(dL_dw, numerical)

# World 3: Gradient Descent

## Level 3.1: Downhill Runner

**Type:** Compute | **Points:** 10 | **Difficulty:** Easy

### Mission

Implement gradient descent to minimize f(x) = (x - 3)². Start at x = 10, use learning_rate = 0.2, and run for 20 steps. Where do you end up?

### Intel

Gradient descent updates a parameter by moving it in the direction opposite to the gradient (derivative): x_new = x_old - learning_rate * f'(x_old).

Think of it as rolling a ball downhill. The gradient tells you which way is "up," so you go the opposite direction. The learning rate controls how big your steps are.
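
A minimal sketch of the update rule on a different function. The function f(x) = x² + 4x, the starting point, and the learning rate are all made up for illustration.

```python
# Illustrative function (hypothetical): f(x) = x^2 + 4x, with minimum at x = -2
def f_prime(x):
    return 2 * x + 4

x = 3.0               # hypothetical starting point
learning_rate = 0.25  # hypothetical step size
trajectory = [x]

for _ in range(5):
    x = x - learning_rate * f_prime(x)   # step against the gradient
    trajectory.append(x)

print(trajectory)
```

With these particular values the distance to the minimum at x = -2 halves on every step: 5, 2.5, 1.25, and so on.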

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

What is f'(x) for f(x) = (x - 3)²? Use the power rule (or chain rule).

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

f'(x) = 2(x - 3). The update rule is: x = x - 0.2 * 2(x - 3) = x - 0.4(x - 3).

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

Starting at x = 10:
- Step 1: x = 10 - 0.4*(10-3) = 10 - 2.8 = 7.2
- Step 2: x = 7.2 - 0.4*(7.2-3) = 7.2 - 1.68 = 5.52
- ...continues approaching 3.0

</details>

In [None]:
# Level 3.1 Solution
f = lambda x: (x - 3)**2
df = lambda x: 2 * (x - 3)

x = 10.0
lr = 0.2
history = [x]
for step in range(20):
    x = x - lr * df(x)
    history.append(x)

print(f"Start: x = 10.0")
print(f"After 20 steps: x = {x:.6f} (target: 3.0)")
check_answer(x, 3.0, tolerance=0.01)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
t = np.linspace(-1, 12, 100)
plt.plot(t, (t-3)**2, 'b-', alpha=0.3)
plt.plot(history, [(h-3)**2 for h in history], 'ro-', markersize=4)
plt.title('Gradient Descent Path')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(history, 'g.-')
plt.axhline(y=3, color='r', linestyle='--', label='target')
plt.title('x value over steps')
plt.xlabel('Step')
plt.ylabel('x')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Level 3.2: Rate Tuner

**Type:** Predict | **Points:** 15 | **Difficulty:** Medium

### Mission

Predict what happens with learning rates 0.01, 0.5, and 1.1 on f(x) = x², starting at x = 5. Then verify with code and plots.

### Intel

The learning rate is the most important hyperparameter in gradient descent. Too small and you waste time. Too large and you overshoot the minimum and may even diverge (the values get larger instead of smaller). There is a sweet spot in between.
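
One way to make "too small" and "too large" measurable is to count steps until the gradient is nearly zero. The sketch below does this for a different function. The helper `steps_to_converge`, the function f(x) = 3x², and the three learning rates are assumptions for illustration only.

```python
def steps_to_converge(grad, x_start, lr, tol=1e-6, max_steps=10_000, blowup=1e6):
    """Return the number of steps until |grad(x)| < tol, or None if x blows up
    or the step budget runs out."""
    x = x_start
    for step in range(max_steps):
        if abs(grad(x)) < tol:
            return step
        if abs(x) > blowup:
            return None
        x = x - lr * grad(x)
    return None

# Illustrative function (hypothetical): f(x) = 3x^2, so f'(x) = 6x
def grad(x):
    return 6 * x

results = {lr: steps_to_converge(grad, x_start=1.0, lr=lr) for lr in (0.01, 0.1, 0.4)}
for lr, steps in results.items():
    outcome = "diverged" if steps is None else f"{steps} steps"
    print(f"lr = {lr}: {outcome}")
```

Note that the thresholds depend on the function: a learning rate that is safe for one curve can diverge on a steeper one.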

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

Think about what happens at each step. With lr = 0.01, the step size is 0.01 * 2x. With lr = 1.1, the step size is 1.1 * 2x. For x = 5, that is a step of 11: past zero and out the other side!

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

For f(x) = x², the update is x_new = x - lr * 2x = x(1 - 2*lr).
- lr = 0.01: x_new = 0.98x (slow convergence)
- lr = 0.5: x_new = 0 (instant convergence!)
- lr = 1.1: x_new = -1.2x (oscillates and grows!)

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

For f(x) = x², convergence requires |1 - 2*lr| < 1, which means 0 < lr < 1. So lr = 0.01 converges (slowly), lr = 0.5 converges (instantly), and lr = 1.1 diverges.

</details>

In [None]:
# Level 3.2 Solution: Learning Rate Effects
f = lambda x: x**2
df = lambda x: 2*x

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, lr, desc in zip(axes, [0.01, 0.5, 1.1], ['Too Slow', 'Just Right', 'Diverges!']):
    x = 5.0
    path = [x]
    for _ in range(30):
        x = x - lr * df(x)
        path.append(x)
        if abs(x) > 1e6:  # Divergence detection
            break
    
    t = np.linspace(-6, 6, 100)
    ax.plot(t, t**2, 'b-', alpha=0.3)
    path_clipped = [p for p in path if abs(p) < 10]
    ax.plot(path_clipped, [p**2 for p in path_clipped], 'ro-', markersize=3)
    ax.set_title(f'lr={lr}: {desc}', fontsize=12)
    ax.set_ylim(-2, 40)
    ax.grid(True, alpha=0.3)

plt.suptitle('Learning Rate Comparison', fontsize=14)
plt.tight_layout()
plt.show()
print("lr=0.01: Converges but very slowly")
print("lr=0.5:  Converges in one step! (perfect for x\u00b2)")
print("lr=1.1:  Diverges! Each step overshoots more than the last")

## Level 3.3: Valley Finder

**Type:** Optimize | **Points:** 20 | **Difficulty:** Hard

### Mission

Use 2D gradient descent to find the minimum of f(x, y) = x² + 3y². Start at (4, 3) with learning rate 0.1. Run 50 steps and plot the path on a contour plot.

### Intel

In 2D, the gradient is a vector of partial derivatives: grad f = [df/dx, df/dy]. You update both x and y simultaneously. The contour plot shows "level curves", which connect points with the same function value. The minimum is at the center of the contours.
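
The sketch below computes a gradient vector two ways for a different function and then takes one simultaneous update step. The function f(x, y) = x² + xy + 2y², the point, the learning rate, and the helper `numerical_gradient` are all hypothetical.

```python
import numpy as np

# Illustrative function (hypothetical): f(x, y) = x^2 + x*y + 2y^2
def f(p):
    x, y = p
    return x**2 + x * y + 2 * y**2

def analytic_gradient(p):
    x, y = p
    return np.array([2 * x + y,      # df/dx, treating y as constant
                     x + 4 * y])     # df/dy, treating x as constant

def numerical_gradient(func, p, h=1e-5):
    """Centered difference along each coordinate axis in turn."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

point = np.array([1.0, -2.0])
g_exact = analytic_gradient(point)          # [2*1 + (-2), 1 + 4*(-2)] = [0, -7]
g_numeric = numerical_gradient(f, point)
print(g_exact, g_numeric)

# One simultaneous update of both coordinates (hypothetical learning rate)
new_point = point - 0.1 * g_exact
print(new_point)
```

Because the update subtracts the whole gradient vector at once, both coordinates move in the same step, each by an amount proportional to its own partial derivative.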

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

What are the partial derivatives of f(x, y) = x² + 3y²? Take the derivative with respect to x (treating y as constant), then with respect to y (treating x as constant).

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

df/dx = 2x, df/dy = 6y. So the gradient is [2x, 6y]. Note that for equal coordinate values the y-component of the gradient is three times the x-component, so the function is steeper in y. This means y will converge faster.

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

Update: [x, y] = [x, y] - 0.1 * [2x, 6y].
Step 1: [4, 3] - 0.1 * [8, 18] = [3.2, 1.2].
The path curves toward the x-axis first (y converges faster), then slides along toward the origin.

</details>

In [None]:
# Level 3.3 Solution: 2D Gradient Descent
def f_2d(p):
    return p[0]**2 + 3*p[1]**2

def grad_2d(p):
    return np.array([2*p[0], 6*p[1]])

point = np.array([4.0, 3.0])
lr = 0.1
path = [point.copy()]

for _ in range(50):
    point = point - lr * grad_2d(point)
    path.append(point.copy())

path = np.array(path)
print(f"Start: ({path[0][0]:.1f}, {path[0][1]:.1f})")
print(f"End:   ({path[-1][0]:.4f}, {path[-1][1]:.4f})")
print(f"Minimum at: (0, 0), f = {f_2d(path[-1]):.6f}")

# Contour plot
x = np.linspace(-5, 5, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + 3*Y**2

plt.figure(figsize=(8, 6))
plt.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.5)
plt.plot(path[:, 0], path[:, 1], 'r.-', markersize=4, label='GD path')
plt.plot(4, 3, 'go', markersize=10, label='Start')
plt.plot(0, 0, 'r*', markersize=15, label='Minimum')
plt.title('2D Gradient Descent on f(x,y) = x\u00b2 + 3y\u00b2')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlabel('x')
plt.ylabel('y')
plt.tight_layout()
plt.show()

## Boss Level: The Optimizer

**Type:** Construct | **Points:** 30 | **Difficulty:** Boss

### Mission

Build a complete gradient descent optimizer that works for ANY 1D function. Your optimizer should:

1. Accept any function f(x) and compute its derivative numerically
2. Run gradient descent from a starting point
3. Track and plot the optimization path
4. Detect convergence (stop when the gradient is small enough)

Test it on f(x) = x⁴ - 3x² + 2, a function with multiple local minima!

### Intel

A robust optimizer needs:
- A numerical derivative function (you built one in Level 1.3!)
- A convergence criterion: stop when |f'(x)| < epsilon
- A maximum iteration count to prevent infinite loops
- History tracking for visualization

The function x⁴ - 3x² + 2 has two local minima. Depending on your starting point, gradient descent will find different ones!

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

Your optimizer function should take f, x_start, lr, and max_steps as arguments. Use the centered difference formula for the numerical derivative.

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

The convergence check: if abs(numerical_derivative(f, x)) < 1e-6, the gradient is essentially zero and you can stop. Return the full history so you can plot it.

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

```python
def optimize(f, x_start, lr=0.01, max_steps=1000, tol=1e-6):
    x = x_start
    history = [x]
    for i in range(max_steps):
        grad = numerical_derivative(f, x)
        if abs(grad) < tol:
            break
        x = x - lr * grad
        history.append(x)
    return x, history
```

</details>

In [None]:
# Boss Level 3 Solution: Universal 1D Optimizer

def optimize(f, x_start, lr=0.01, max_steps=1000, tol=1e-6):
    """Gradient descent optimizer for any 1D function."""
    x = x_start
    history = [x]
    for i in range(max_steps):
        grad = numerical_derivative(f, x)
        if abs(grad) < tol:
            print(f"Converged after {i} steps!")
            break
        x = x - lr * grad
        history.append(x)
    return x, history

# Test on f(x) = x⁴ - 3x² + 2
f = lambda x: x**4 - 3*x**2 + 2

# Try two starting points to find different minima
x_min1, hist1 = optimize(f, x_start=2.0, lr=0.01)
x_min2, hist2 = optimize(f, x_start=-2.0, lr=0.01)

print(f"\nStarting at x=2.0:  found minimum at x = {x_min1:.4f}, f(x) = {f(x_min1):.4f}")
print(f"Starting at x=-2.0: found minimum at x = {x_min2:.4f}, f(x) = {f(x_min2):.4f}")

# Visualization
t = np.linspace(-2.5, 2.5, 200)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(t, f(t), 'b-', linewidth=2, alpha=0.5)
ax1.plot(hist1, [f(x) for x in hist1], 'ro-', markersize=2, label=f'Start x=2.0')
ax1.plot(hist2, [f(x) for x in hist2], 'g^-', markersize=2, label=f'Start x=-2.0')
ax1.plot(x_min1, f(x_min1), 'r*', markersize=15)
ax1.plot(x_min2, f(x_min2), 'g*', markersize=15)
ax1.set_title('f(x) = x\u2074 - 3x\u00b2 + 2 with GD paths')
ax1.set_xlabel('x')
ax1.set_ylabel('f(x)')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(hist1, 'r.-', label='Start x=2.0')
ax2.plot(hist2, 'g.-', label='Start x=-2.0')
ax2.set_title('Convergence: x value over steps')
ax2.set_xlabel('Step')
ax2.set_ylabel('x')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print("\nKey insight: Same function, different starting points -> different minima!")
print("This is why initialization matters in neural network training. \u2705")

# Final Boss: Train a Model

**Type:** Construct | **Points:** 50 | **Difficulty:** Final Boss

### Mission

Train a linear model y = wx + b to fit noisy data using gradient descent. You must implement everything from scratch:

1. **Forward pass:** Compute predictions y_pred = w*x + b
2. **Loss computation:** Mean Squared Error = mean((y_pred - y_true)²)
3. **Gradient computation:** Use the chain rule to find dL/dw and dL/db
4. **Parameter update:** Apply gradient descent to update w and b

The true relationship is y = 2.5x + 1.0 with some noise. Can your optimizer recover these values?

### Intel

This is where everything comes together:
- **Derivatives** (World 1): You need to compute gradients of the loss
- **Chain rule** (World 2): The loss depends on predictions, which depend on w and b
- **Gradient descent** (World 3): You update parameters using the gradients

For MSE loss L = mean((wx + b - y)²):
- dL/dw = mean(2 * (wx + b - y) * x)
- dL/db = mean(2 * (wx + b - y))
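
Before trusting hand-derived gradients in a training loop, it can help to check them against finite differences. The sketch below does that for the two formulas above on a tiny made-up dataset. The data, the parameter values, and the helper `mse` are assumptions for illustration.

```python
import numpy as np

# Tiny made-up dataset and parameters (hypothetical values)
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
w, b = 1.0, 0.0

def mse(w_val, b_val):
    return np.mean((w_val * X + b_val - y) ** 2)

# Analytic gradients from the formulas above
error = w * X + b - y                 # [-1, -2, -3, -4]
dL_dw = np.mean(2 * error * X)        # mean([0, -4, -12, -24]) = -10
dL_db = np.mean(2 * error)            # mean([-2, -4, -6, -8]) = -5

# Centered-difference check, one parameter at a time
h = 1e-5
dL_dw_numeric = (mse(w + h, b) - mse(w - h, b)) / (2 * h)
dL_db_numeric = (mse(w, b + h) - mse(w, b - h)) / (2 * h)

print(f"dL/dw: analytic {dL_dw:.4f}, numeric {dL_dw_numeric:.4f}")
print(f"dL/db: analytic {dL_db:.4f}, numeric {dL_db_numeric:.4f}")
```

Both gradients are negative here, so a gradient descent step would increase w and b, moving the line up toward the data.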

### Hints

<details>
<summary>Hint 1 (Conceptual Nudge)</summary>

Initialize w = 0 and b = 0. Run the forward pass, compute the loss, then compute the gradients and update. Repeat for 100 epochs.

</details>

<details>
<summary>Hint 2 (Methodological Clue)</summary>

The gradients come from the chain rule applied to L = mean((wx + b - y)²):
- dL/dw = mean(2 * error * x) where error = (y_pred - y_true)
- dL/db = mean(2 * error)

Use lr = 0.01 for stable convergence.

</details>

<details>
<summary>Hint 3 (Detailed Walkthrough)</summary>

```python
for epoch in range(100):
    y_pred = w * X + b
    loss = np.mean((y_pred - y_true)**2)
    dL_dw = np.mean(2 * (y_pred - y_true) * X)
    dL_db = np.mean(2 * (y_pred - y_true))
    w = w - lr * dL_dw
    b = b - lr * dL_db
```

</details>

In [None]:
# Final Boss Solution: Train a Linear Model
np.random.seed(42)
X = np.random.uniform(-3, 3, 50)
y_true = 2.5 * X + 1.0 + np.random.normal(0, 0.5, 50)

# Initialize parameters
w, b = 0.0, 0.0
lr = 0.01
n_epochs = 100
losses = []

for epoch in range(n_epochs):
    # Forward pass
    y_pred = w * X + b
    
    # Loss (Mean Squared Error)
    loss = np.mean((y_pred - y_true)**2)
    losses.append(loss)
    
    # Gradients (chain rule!)
    dL_dw = np.mean(2 * (y_pred - y_true) * X)
    dL_db = np.mean(2 * (y_pred - y_true))
    
    # Update (gradient descent!)
    w = w - lr * dL_dw
    b = b - lr * dL_db

print(f"Learned: y = {w:.3f}x + {b:.3f}")
print(f"True:    y = 2.500x + 1.000")
print(f"Final loss: {losses[-1]:.4f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.scatter(X, y_true, alpha=0.5, label='Data')
x_line = np.linspace(-3, 3, 100)
ax1.plot(x_line, w*x_line + b, 'r-', linewidth=2, label=f'Learned: {w:.2f}x + {b:.2f}')
ax1.plot(x_line, 2.5*x_line + 1.0, 'g--', linewidth=2, label='True: 2.5x + 1.0')
ax1.legend()
ax1.set_title('Linear Regression via Gradient Descent')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.grid(True, alpha=0.3)

ax2.plot(losses)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss (MSE)')
ax2.set_title('Training Loss')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n--- FINAL BOSS CLEARED! ---")
print(f"You combined derivatives, chain rule, and gradient descent")
print(f"to train a model from scratch. This is the foundation of")
print(f"ALL neural network training. \u2705")

## Score Summary

Fill in your scores below!

### Scoring Rules
- **Full points:** Solved without any hints
- **-2 points:** Used Hint 1
- **-5 points:** Used Hint 2
- **-8 points:** Used Hint 3 (or looked at the solution)

| Level | Challenge | Max Points | Your Score |
|-------|-----------|-----------|------------|
| 1.1 | The Slope Finder | 10 | |
| 1.2 | Shape Predictor | 15 | |
| 1.3 | Derivative Detective | 20 | |
| Boss 1 | Activation Function Analysis | 30 | |
| 2.1 | Chain Reaction | 10 | |
| 2.2 | The Missing Link | 15 | |
| Boss 2 | Backprop Simulator | 30 | |
| 3.1 | Downhill Runner | 10 | |
| 3.2 | Rate Tuner | 15 | |
| 3.3 | Valley Finder | 20 | |
| Boss 3 | The Optimizer | 30 | |
| Final Boss | Train a Model | 50 | |
| | **Total** | **255** | |

### Achievement Levels

| Score Range | Title | What It Means |
|-------------|-------|---------------|
| 230-255 | Calculus Grand Master | You crushed it with minimal hints. Ready for advanced ML. |
| 180-229 | Calculus Warrior | Strong fundamentals. Review the hints you used; they highlight your growth areas. |
| 120-179 | Calculus Apprentice | Good start! Re-try the Boss Levels after reviewing World solutions. |
| 60-119 | Calculus Explorer | You are building the right instincts. Work through the hints carefully and try again. |
| 0-59 | Calculus Newcomer | No shame: everyone starts somewhere. Focus on World 1, master it, then move on. |

### What to Do Next

- **Scored 200+?** Try modifying the Final Boss to use a non-linear model (quadratic regression).
- **Struggled with chain rule?** Go back to World 2 and trace through the backprop example step by step.
- **Gradient descent clicked?** Explore momentum and the Adam optimizer; they build on exactly these ideas.
- **Want more?** Every deep learning framework (PyTorch, TensorFlow) does exactly what you did in the Final Boss, just at a much larger scale.