# Autograd: PyTorch's Gradient Engine

In this notebook, you'll build hands-on intuition for how PyTorch computes gradients automatically.

**What you'll do:**
- Compute gradients for a polynomial and verify by hand
- Reproduce a backprop network and compare `.grad` to manual calculation
- Discover the gradient accumulation trap and fix it
- Write a complete manual training step
- Use `detach()` to surgically stop gradient flow

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones — they reveal gaps in your mental model.

In [None]:
import torch
import matplotlib.pyplot as plt

# For nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

---

## Exercise 1: Polynomial Gradient (Guided)

Let's start simple. We'll compute the gradient of a polynomial and verify it by hand.

Given: $y = x^3 + 2x^2 - 5x + 1$

The derivative is: $\frac{dy}{dx} = 3x^2 + 4x - 5$

At $x = 3$: $\frac{dy}{dx} = 3(9) + 4(3) - 5 = 27 + 12 - 5 = 34$

**Before running the cell below, predict:** What will `x.grad` be?

In [None]:
# Create a tensor with requires_grad=True so PyTorch tracks operations on it
x = torch.tensor(3.0, requires_grad=True)

# Forward pass: compute the polynomial
y = x**3 + 2*x**2 - 5*x + 1

# Backward pass: compute dy/dx
y.backward()

# Check the gradient
print(f"y = x^3 + 2x^2 - 5x + 1")
print(f"At x = {x.item()}:")
print(f"  y            = {y.item()}")
print(f"  x.grad (PyTorch) = {x.grad.item()}")
print(f"  dy/dx (by hand)  = {3*3**2 + 4*3 - 5}")
print(f"  Match: {x.grad.item() == 3*3**2 + 4*3 - 5}")

**What just happened:**
- `requires_grad=True` tells PyTorch to record every operation on `x`
- When we called `y.backward()`, PyTorch walked the computation graph backwards
- It applied the chain rule at each step to compute $\frac{dy}{dx}$
- The result landed in `x.grad`

This is the same mechanism that computes gradients for millions of parameters in a neural network. The math is identical — just more chain rule steps.
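To see the recorded graph for yourself, one option is to inspect `grad_fn`, the attribute autograd attaches to any tensor produced by a tracked operation. The snippet below is a minimal sketch that rebuilds the Exercise 1 polynomial under illustrative names (`x_demo`, `y_demo` are not part of the exercise). The exact node names printed may vary between PyTorch versions.

```python
import torch

x_demo = torch.tensor(3.0, requires_grad=True)
y_demo = x_demo**3 + 2 * x_demo**2 - 5 * x_demo + 1

# Tensors you create directly are "leaves" and have no grad_fn;
# results of tracked operations carry a grad_fn node.
print("x_demo.is_leaf:", x_demo.is_leaf)
print("x_demo.grad_fn:", x_demo.grad_fn)
print("y_demo.grad_fn:", y_demo.grad_fn)

# next_functions links a node to the nodes that produced its inputs
# (the entry is None for inputs that need no gradient, such as the constant 1).
for parent, _ in y_demo.grad_fn.next_functions:
    print("  input node:", parent)
```

Following `next_functions` repeatedly is, in spirit, the walk that `backward()` performs.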

---

## Exercise 2: Reproduce a Backprop Network (Guided)

Now let's reproduce a simple 2-layer computation and compare `.grad` values to what we'd compute by hand.

Network:
```
x = 2.0
w1 = 3.0
w2 = -1.0
b = 0.5

h = x * w1          # hidden: 2 * 3 = 6
h_act = h + b        # add bias: 6 + 0.5 = 6.5
y = h_act * w2       # output: 6.5 * -1 = -6.5
```

**Manual gradients (chain rule):**

- $\frac{dy}{dw_2} = h_{act} = 6.5$
- $\frac{dy}{dh_{act}} = w_2 = -1$
- $\frac{dy}{db} = \frac{dy}{dh_{act}} \cdot \frac{dh_{act}}{db} = -1 \cdot 1 = -1$
- $\frac{dy}{dh} = \frac{dy}{dh_{act}} \cdot \frac{dh_{act}}{dh} = -1 \cdot 1 = -1$
- $\frac{dy}{dw_1} = \frac{dy}{dh} \cdot \frac{dh}{dw_1} = -1 \cdot x = -1 \cdot 2 = -2$
- $\frac{dy}{dx} = \frac{dy}{dh} \cdot \frac{dh}{dx} = -1 \cdot w_1 = -1 \cdot 3 = -3$

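**Before running: predict the `.grad` value of each of the four leaf tensors (`x`, `w1`, `w2`, `b`).** The intermediates `h` and `h_act` are not leaf tensors, so PyTorch does not keep their `.grad` by default.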

In [None]:
# Set up all values as tensors that track gradients
x = torch.tensor(2.0, requires_grad=True)
w1 = torch.tensor(3.0, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

# Forward pass (same computation as above)
h = x * w1           # hidden
h_act = h + b        # add bias
y = h_act * w2       # output

# Backward pass
y.backward()

# Compare PyTorch's gradients to our manual calculation
print("Forward pass:")
print(f"  h     = x * w1     = {h.item()}")
print(f"  h_act = h + b      = {h_act.item()}")
print(f"  y     = h_act * w2 = {y.item()}")
print()

# Manual expected values
expected = {
    'x':  -3.0,   # dy/dh * dh/dx = -1 * 3 = -3
    'w1': -2.0,   # dy/dh * dh/dw1 = -1 * 2 = -2
    'w2':  6.5,   # dy/dw2 = h_act = 6.5
    'b':  -1.0,   # dy/dh_act * dh_act/db = -1 * 1 = -1
}

print(f"{'Variable':<8} {'PyTorch .grad':>15} {'Manual calc':>15} {'Match':>8}")
print("-" * 50)
for name, tensor in [('x', x), ('w1', w1), ('w2', w2), ('b', b)]:
    grad = tensor.grad.item()
    manual = expected[name]
    match = abs(grad - manual) < 1e-6
    print(f"{name:<8} {grad:>15.4f} {manual:>15.4f} {str(match):>8}")

**Key insight:** Autograd is doing exactly the chain rule you'd do by hand. It's not magic; it's bookkeeping. Each `.grad` is $\frac{d(\text{output})}{d(\text{this variable})}$, computed by multiplying the local derivatives along each path from the output back to that variable and summing the results over all such paths.
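The manual calculation also lists $\frac{dy}{dh}$ and $\frac{dy}{dh_{act}}$, but `h` and `h_act` are intermediate tensors, so their `.grad` is not stored by default. The sketch below, assuming you want to inspect them, calls `retain_grad()` on both before the backward pass. The second half uses illustrative names (`t`, `u`) to show a variable that reaches the output along two paths.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
w1 = torch.tensor(3.0, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

h = x * w1
h_act = h + b
# Ask autograd to keep gradients for these non-leaf tensors
h.retain_grad()
h_act.retain_grad()
y = h_act * w2
y.backward()

print(f"dy/dh_act = {h_act.grad.item()}")  # -1.0 (equals w2)
print(f"dy/dh     = {h.grad.item()}")      # -1.0

# Two paths from u back to t: through t*t and through 3*t.
# Their contributions are summed: 2t + 3 = 7 at t = 2.
t = torch.tensor(2.0, requires_grad=True)
u = t * t + 3 * t
u.backward()
print(f"du/dt     = {t.grad.item()}")      # 7.0
```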

---

## Exercise 3: The Accumulation Trap (Guided)

Here's a subtle behavior that catches everyone. PyTorch **accumulates** gradients by default — calling `.backward()` again **adds** to existing `.grad` values instead of replacing them.

**Before running each cell, predict what `x.grad` will be.**

In [None]:
# Step 1: First backward pass
x = torch.tensor(3.0, requires_grad=True)

y = x ** 2    # dy/dx = 2x = 6
y.backward()

print(f"After first backward():")
print(f"  x.grad = {x.grad.item()}")
print(f"  Expected: 6.0 (2 * 3)")

In [None]:
# Step 2: Second backward pass WITHOUT zero_grad
# PREDICT: What will x.grad be?

y = x ** 2    # dy/dx = 2x = 6 again
y.backward()

print(f"After second backward() (no zero_grad):")
print(f"  x.grad = {x.grad.item()}")
print(f"  Expected: 12.0 (6 + 6, accumulated!)")
print()
print("The gradient ACCUMULATED. PyTorch added 6 to the existing 6.")
print("This is by design (useful for RNNs), but a trap if you forget.")

In [None]:
# Step 3: The fix — zero_grad before backward
x.grad.zero_()   # Reset gradient to 0 (in-place)

y = x ** 2
y.backward()

print(f"After zero_grad() + backward():")
print(f"  x.grad = {x.grad.item()}")
print(f"  Expected: 6.0 (fresh gradient, no accumulation)")
print()
print("RULE: Always zero gradients before each backward pass.")
print("In training loops, this is optimizer.zero_grad().")

**Why does PyTorch accumulate by default?**

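Accumulation is what lets a parameter that is used more than once (for example, RNN weights shared across time steps) collect the contribution from every use. It also lets you call `.backward()` several times, such as once per micro-batch, before a single parameter update. But for standard training, you must zero gradients each step or your updates will be wrong.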
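One common way to use accumulation deliberately is to split a batch into smaller chunks. The sketch below uses made-up data and illustrative names (`X_demo`, `y_demo`, `w_demo`) and assumes equally sized chunks; it checks that accumulating the gradients of two half-batch losses, each scaled by 1/2, gives the same gradient as one full-batch loss.

```python
import torch

torch.manual_seed(0)
X_demo = torch.randn(8, 3)
y_demo = torch.randn(8)
w_demo = torch.zeros(3, requires_grad=True)

# Reference: gradient of the MSE over the full batch
full_loss = ((X_demo @ w_demo - y_demo) ** 2).mean()
full_loss.backward()
full_grad = w_demo.grad.clone()
w_demo.grad.zero_()

# Accumulate over two chunks of 4; divide by the number of chunks
# so the sum of chunk means equals the full-batch mean.
for X_chunk, y_chunk in zip(X_demo.chunk(2), y_demo.chunk(2)):
    chunk_loss = ((X_chunk @ w_demo - y_chunk) ** 2).mean() / 2
    chunk_loss.backward()          # adds into w_demo.grad
accum_grad = w_demo.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```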

**The pattern you'll use in every training loop:**
```python
optimizer.zero_grad()   # Clear old gradients
loss.backward()         # Compute new gradients
optimizer.step()        # Update parameters
```

---

## Exercise 4: Manual Training Step (Supported)

Now put it all together. You'll write a single training step by hand — the same forward-backward-update loop that every deep learning model uses, but without an optimizer.

**Task:** Fit a linear model $\hat{y} = wx + b$ to data generated from $y = 2x + 1$.

Fill in the TODO sections. The comments tell you exactly what to do.

In [None]:
# Generate simple data: y = 2x + 1
torch.manual_seed(42)
X = torch.linspace(-3, 3, 20)
y_true = 2 * X + 1 + torch.randn(20) * 0.3   # true line + noise

# Initialize learnable parameters (both start at zero)
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

learning_rate = 0.01

print(f"Initial parameters: w = {w.item():.4f}, b = {b.item():.4f}")
print(f"Target:             w = 2.0, b = 1.0")
print(f"Learning rate: {learning_rate}")

In [None]:
# ============================================================
# YOUR TASK: Implement the training loop
# ============================================================

losses = []

for step in range(200):
    # --- Step 1: Forward pass ---
    # Compute predictions: y_pred = w * X + b
    y_pred = w * X + b          # TODO: uncomment or write this

    # --- Step 2: Compute loss ---
    # MSE loss: mean of (y_true - y_pred)^2
    loss = ((y_true - y_pred) ** 2).mean()   # TODO: uncomment or write this

    # --- Step 3: Backward pass ---
    # Compute gradients of loss w.r.t. w and b
    loss.backward()             # TODO: uncomment or write this

    # --- Step 4: Update parameters ---
    # Must use torch.no_grad() so the update itself isn't tracked
    with torch.no_grad():       # TODO: uncomment this block
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad

    # --- Step 5: Zero gradients ---
    # Without this, gradients accumulate (Exercise 3!)
    w.grad.zero_()              # TODO: uncomment or write this
    b.grad.zero_()              # TODO: uncomment or write this

    losses.append(loss.item())

    if (step + 1) % 40 == 0:
        print(f"Step {step+1:3d}: loss = {loss.item():.4f}, w = {w.item():.4f}, b = {b.item():.4f}")

print(f"\nFinal: w = {w.item():.4f} (target: 2.0), b = {b.item():.4f} (target: 1.0)")

In [None]:
# Visualize the result
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss curve
axes[0].plot(losses, linewidth=2)
axes[0].set_xlabel('Step')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Training Loss')
axes[0].grid(alpha=0.3)

# Data + learned line
with torch.no_grad():
    x_line = torch.linspace(-3, 3, 100)
    y_line = w * x_line + b

axes[1].scatter(X.numpy(), y_true.numpy(), alpha=0.7, label='Data')
axes[1].plot(x_line.numpy(), y_line.numpy(), 'r-', linewidth=2,
             label=f'Learned: y = {w.item():.2f}x + {b.item():.2f}')
axes[1].plot(x_line.numpy(), (2*x_line + 1).numpy(), 'g--', linewidth=2, alpha=0.5,
             label='True: y = 2x + 1')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].set_title('Learned Fit')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

**What you just built:**

Nearly every gradient-based training loop in deep learning follows this same pattern:

1. **Forward:** compute predictions from parameters
2. **Loss:** measure how wrong the predictions are
3. **Backward:** compute how each parameter contributed to the error
4. **Update:** nudge parameters in the direction that reduces error
5. **Zero:** clear old gradients so they don't accumulate

In practice, `torch.optim` handles steps 4 and 5 through `optimizer.step()` and `optimizer.zero_grad()`. But now you know what it's doing under the hood.
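For comparison, here is a sketch of the same fit with the manual update and zeroing replaced by `torch.optim.SGD`. The names `w_opt`, `b_opt`, and `losses_opt` are illustrative, and the data is regenerated so the block runs on its own. Plain SGD without momentum applies the same `param -= lr * grad` update as the hand-written version.

```python
import torch

torch.manual_seed(42)
X = torch.linspace(-3, 3, 20)
y_true = 2 * X + 1 + torch.randn(20) * 0.3

w_opt = torch.tensor(0.0, requires_grad=True)
b_opt = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w_opt, b_opt], lr=0.01)

losses_opt = []
for step in range(200):
    optimizer.zero_grad()                  # step 5, done up front
    y_pred = w_opt * X + b_opt             # step 1: forward
    loss = ((y_true - y_pred) ** 2).mean() # step 2: loss
    loss.backward()                        # step 3: backward
    optimizer.step()                       # step 4: update
    losses_opt.append(loss.item())

print(f"w = {w_opt.item():.4f}, b = {b_opt.item():.4f}")
```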

---

## Exercise 5: Stopping Gradients with `detach()` (Stretch)

Sometimes you want gradients to flow through part of a computation but not all of it. `detach()` returns a new tensor that shares the original's data (no copy is made) but is disconnected from the computation graph.

**Task:** Create a chain `a -> b -> c`, detach `b`, and verify that `a` gets no gradient.

This is used in practice for things like:
- Freezing pretrained layers
- Target networks in reinforcement learning
- Stop-gradient tricks in contrastive learning
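As a small illustration of the stop-gradient idea, the sketch below multiplies a tensor by a detached version of itself, so autograd treats one factor as a constant. The names `a_sg`, `y_sg`, `t`, and `d` are illustrative. The second half checks that `detach()` returns a tensor backed by the same memory rather than a copy.

```python
import torch

a_sg = torch.tensor(2.0, requires_grad=True)

# y = a * stopgrad(a): only the non-detached factor receives gradient
y_sg = a_sg * a_sg.detach()
y_sg.backward()
print(a_sg.grad)   # tensor(2.), whereas a * a would give 4.

# The detached tensor does not track gradients and shares storage
t = torch.ones(3, requires_grad=True)
d = t.detach()
print(d.requires_grad)                # False
print(d.data_ptr() == t.data_ptr())   # True
```

Because the storage is shared, an in-place change to the detached tensor also changes the original; call `.clone()` on it if you need an independent copy.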

In [None]:
# First: WITHOUT detach — gradients flow all the way back
a = torch.tensor(2.0, requires_grad=True)
b = a * 3       # b = 6, db/da = 3
c = b ** 2      # c = 36, dc/db = 2b = 12

c.backward()

print("WITHOUT detach:")
print(f"  a = {a.item()}, b = {b.item()}, c = {c.item()}")
print(f"  a.grad = {a.grad.item()}")
print(f"  Expected: dc/da = dc/db * db/da = 12 * 3 = 36")
print()

In [None]:
# Now: WITH detach — gradient flow is severed at b
a = torch.tensor(2.0, requires_grad=True)
b = a * 3           # b = 6
b_detached = b.detach()  # Same value (6), but no connection to a
c = b_detached ** 2      # c = 36, but c doesn't know about a

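# c has no grad_fn because nothing in its history requires grad, so
# c.backward() raises a RuntimeError rather than silently doing nothing.
print(f"c.requires_grad = {c.requires_grad}")
try:
    c.backward()
except RuntimeError as err:
    print(f"c.backward() raised RuntimeError: {err}")
print()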

print("WITH detach:")
print(f"  a = {a.item()}, b_detached = {b_detached.item()}, c = {c.item()}")
print(f"  a.grad = {a.grad}")
print(f"  Expected: None (gradient path was severed by detach)")
print()
print("detach() made b_detached a plain tensor with the same value as b,")
print("but with no memory of how it was computed. The chain rule has")
print("nowhere to go — so a.grad stays None.")

In [None]:
# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(12, 3))

# Without detach
axes[0].set_xlim(0, 10)
axes[0].set_ylim(0, 4)
axes[0].annotate('a', xy=(1, 2), fontsize=20, ha='center',
                  bbox=dict(boxstyle='round,pad=0.5', facecolor='#4a9eff', alpha=0.8))
axes[0].annotate('', xy=(3.5, 2), xytext=(2, 2),
                  arrowprops=dict(arrowstyle='->', color='white', lw=2))
axes[0].annotate('b = a*3', xy=(5, 2), fontsize=20, ha='center',
                  bbox=dict(boxstyle='round,pad=0.5', facecolor='#4a9eff', alpha=0.8))
axes[0].annotate('', xy=(7.5, 2), xytext=(6.5, 2),
                  arrowprops=dict(arrowstyle='->', color='white', lw=2))
axes[0].annotate('c = b^2', xy=(9, 2), fontsize=20, ha='center',
                  bbox=dict(boxstyle='round,pad=0.5', facecolor='#4a9eff', alpha=0.8))
axes[0].annotate('grad=36', xy=(1, 0.8), fontsize=14, ha='center', color='#00ff88')
axes[0].set_title('Without detach: gradients flow through', fontsize=14)
axes[0].axis('off')

# With detach
axes[1].set_xlim(0, 10)
axes[1].set_ylim(0, 4)
axes[1].annotate('a', xy=(1, 2), fontsize=20, ha='center',
                  bbox=dict(boxstyle='round,pad=0.5', facecolor='#666666', alpha=0.8))
axes[1].annotate('', xy=(3.5, 2), xytext=(2, 2),
                  arrowprops=dict(arrowstyle='->', color='red', lw=2, linestyle='dashed'))
axes[1].annotate('X', xy=(2.75, 2.5), fontsize=18, ha='center', color='red', fontweight='bold')
axes[1].annotate('b.detach()', xy=(5, 2), fontsize=18, ha='center',
                  bbox=dict(boxstyle='round,pad=0.5', facecolor='#4a9eff', alpha=0.8))
axes[1].annotate('', xy=(7.5, 2), xytext=(6.8, 2),
                  arrowprops=dict(arrowstyle='->', color='white', lw=2))
axes[1].annotate('c = b^2', xy=(9, 2), fontsize=20, ha='center',
                  bbox=dict(boxstyle='round,pad=0.5', facecolor='#4a9eff', alpha=0.8))
axes[1].annotate('grad=None', xy=(1, 0.8), fontsize=14, ha='center', color='red')
axes[1].set_title('With detach: gradient flow severed', fontsize=14)
axes[1].axis('off')

plt.tight_layout()
plt.show()

---

## Key Takeaways

1. **`requires_grad=True`** tells PyTorch to record operations on a tensor, building a computation graph
2. **`.backward()`** walks the graph backwards, applying the chain rule and storing the result in `.grad` for every leaf tensor that requires grad
3. **Gradients accumulate** by default, so always zero them before each backward pass (`optimizer.zero_grad()` in a training loop, or `tensor.grad.zero_()` on a bare tensor)
4. **`torch.no_grad()`** disables gradient tracking for operations inside its block — essential for parameter updates and inference
5. **`.detach()`** severs a tensor from the computation graph — the value is preserved but the gradient path is cut
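A quick check of takeaway 4, as a minimal sketch with illustrative names (`w_demo`, `tracked`, `untracked`): the same operation is recorded outside a `torch.no_grad()` block and not recorded inside it.

```python
import torch

w_demo = torch.tensor(1.5, requires_grad=True)

tracked = w_demo * 2              # recorded: result has a grad_fn
with torch.no_grad():
    untracked = w_demo * 2        # not recorded: no grad_fn

print(tracked.requires_grad, tracked.grad_fn is not None)      # True True
print(untracked.requires_grad, untracked.grad_fn is not None)  # False False
```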

**The training loop pattern** (you'll use this hundreds of times):
```python
optimizer.zero_grad()   # Clear old gradients
output = model(input)   # Forward pass
loss = criterion(output, target)  # Compute loss
loss.backward()         # Backward pass (compute gradients)
optimizer.step()        # Update parameters
```
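The pattern above leaves `model`, `criterion`, `input`, and `target` undefined. One way to fill them in, as a sketch under assumed choices (an `nn.Linear(1, 1)` model, `nn.MSELoss`, noiseless data from $y = 2x + 1$, and a learning rate of 0.05 picked for this toy problem):

```python
import torch
from torch import nn

torch.manual_seed(0)
inputs = torch.linspace(-3, 3, 20).unsqueeze(1)   # shape (20, 1)
targets = 2 * inputs + 1                          # noiseless y = 2x + 1

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(200):
    optimizer.zero_grad()               # Clear old gradients
    output = model(inputs)              # Forward pass
    loss = criterion(output, targets)   # Compute loss
    loss.backward()                     # Backward pass
    optimizer.step()                    # Update parameters

print(f"weight = {model.weight.item():.4f}, bias = {model.bias.item():.4f}")
```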

Autograd isn't magic. It's the chain rule, applied systematically. Now you've verified that with your own hands.