# Autograd — Automatic Differentiation

**Month 2, Week 1** — Sequence Models

Autograd is PyTorch's automatic differentiation engine. It computes the gradients that training uses to update a network's weights.

## Why This Matters

Training a neural network requires:
1. **Forward pass**: compute output from input
2. **Loss**: measure how wrong the output is
3. **Backward pass**: compute gradients (how the loss changes as each weight changes)
4. **Update**: adjust weights to reduce loss

Autograd handles step 3 automatically — you don't need to derive gradients by hand.

In [1]:
import torch
print(f"PyTorch {torch.__version__}")

PyTorch 2.9.1


---

## 1. The Computation Graph

When a tensor has `requires_grad=True`, PyTorch records every operation applied to it in a graph. It later walks that graph backward to compute gradients.

In [2]:
# Create a tensor that tracks gradients
x = torch.tensor([3.0], requires_grad=True)
print(f"x = {x}")
print(f"requires_grad: {x.requires_grad}")

x = tensor([3.], requires_grad=True)
requires_grad: True


In [3]:
# Operations create a computation graph
y = x ** 2       # y = x²
z = 2 * y + 3    # z = 2y + 3 = 2x² + 3

print(f"y = x² = {y.item()}")
print(f"z = 2x² + 3 = {z.item()}")
print(f"\nz.grad_fn: {z.grad_fn}")  # Shows the operation that created z

y = x² = 9.0
z = 2x² + 3 = 21.0

z.grad_fn: <AddBackward0 object at 0x108720f40>
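
To make the graph a little more concrete, here is a minimal sketch that walks backward from `z` through `grad_fn.next_functions`. The loop and the `chain` list are illustrative names that are not part of the original notebook, and the node names printed may differ between PyTorch versions.

```python
import torch

x = torch.tensor([3.0], requires_grad=True)
y = x ** 2
z = 2 * y + 3

# Leaf tensors are created directly by the user and have no grad_fn;
# results of operations are non-leaf and remember what produced them.
print(x.is_leaf, x.grad_fn)
print(y.is_leaf, type(y.grad_fn).__name__)

# Walk back from z, following the first recorded input at each node.
node = z.grad_fn
chain = []
while node is not None:
    chain.append(type(node).__name__)
    parents = [fn for fn, _ in node.next_functions if fn is not None]
    node = parents[0] if parents else None

print(" -> ".join(chain))
```

The last node in the chain is the one that writes into `x.grad` when `backward()` runs.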


---

## 2. Computing Gradients with backward()

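The chain rule: if z = 2y + 3 and y = x², then dz/dx = dz/dy × dy/dx = 2 × 2x = 4x. Calling `backward()` applies this rule through the graph and stores the result in `x.grad`.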

In [4]:
# Compute gradients
z.backward()

print(f"x = {x.item()}")
print(f"z = 2x² + 3")
print(f"dz/dx = 4x = {x.grad.item()}")
print(f"Expected at x=3: 4×3 = 12 ✓")

x = 3.0
z = 2x² + 3
dz/dx = 4x = 12.0
Expected at x=3: 4×3 = 12 ✓
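
One way to sanity-check a gradient like this is to compare it with a central finite difference. The sketch below assumes a helper `f` and a step size `eps` chosen for illustration, and it uses `float64` to keep rounding error small.

```python
import torch

def f(x):
    # Same function as above: z = 2x² + 3
    return 2 * x ** 2 + 3

x = torch.tensor([3.0], dtype=torch.float64, requires_grad=True)
f(x).backward()

# Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
eps = 1e-6
with torch.no_grad():
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)

print(f"autograd: {x.grad.item():.6f}")
print(f"numeric:  {numeric.item():.6f}")
```

Both values should be close to 12, the analytic result 4x at x = 3.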


---

## 3. Gradients with Multiple Variables

In neural networks, we have many weights. Each gets its own gradient.

In [5]:
# Two "weights"
w1 = torch.tensor([2.0], requires_grad=True)
w2 = torch.tensor([3.0], requires_grad=True)

# Forward pass: z = w1² + w1*w2
z = w1 ** 2 + w1 * w2
print(f"z = w1² + w1×w2 = {z.item()}")

# Backward pass
z.backward()

print(f"\ndz/dw1 = 2×w1 + w2 = {w1.grad.item()}")
print(f"Expected: 2×2 + 3 = 7 ✓")

print(f"\ndz/dw2 = w1 = {w2.grad.item()}")
print(f"Expected: 2 ✓")

z = w1² + w1×w2 = 10.0

dz/dw1 = 2×w1 + w2 = 7.0
Expected: 2×2 + 3 = 7 ✓

dz/dw2 = w1 = 2.0
Expected: 2 ✓
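
As an alternative to reading `.grad` after `backward()`, `torch.autograd.grad` returns the gradients directly as a tuple. This is a small sketch on the same function; the names `g1` and `g2` are illustrative.

```python
import torch

w1 = torch.tensor([2.0], requires_grad=True)
w2 = torch.tensor([3.0], requires_grad=True)
z = w1 ** 2 + w1 * w2

# Returns one gradient per input, in the same order as the inputs.
g1, g2 = torch.autograd.grad(z, (w1, w2))

print(f"dz/dw1 = {g1.item()}")
print(f"dz/dw2 = {g2.item()}")
print(f"w1.grad is still {w1.grad}")  # .grad is not populated by autograd.grad
```

Because nothing is written to `.grad`, there is nothing to zero afterwards.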


---

## 4. Gradient Accumulation (Important!)

Gradients **accumulate** by default: each call to `backward()` adds to `.grad` instead of overwriting it. Zero the gradients before each backward pass unless you want the sum.

In [6]:
x = torch.tensor([2.0], requires_grad=True)

# First backward
y = x ** 2
y.backward()
print(f"After 1st backward: x.grad = {x.grad.item()}")

# Second backward WITHOUT zeroing
y = x ** 2
y.backward()
print(f"After 2nd backward: x.grad = {x.grad.item()}  (accumulated!)")

# Correct way: zero gradients first
x.grad.zero_()
y = x ** 2
y.backward()
print(f"After zeroing + backward: x.grad = {x.grad.item()}  ✓")

After 1st backward: x.grad = 4.0
After 2nd backward: x.grad = 8.0  (accumulated!)
After zeroing + backward: x.grad = 4.0  ✓


**Key insight**: In training loops, call `optimizer.zero_grad()` before each `loss.backward()`.
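
Accumulation is sometimes used on purpose. The sketch below, with made-up data and an illustrative `batches` list, suggests that calling `backward()` on two mini-batches without zeroing gives the same gradient as one `backward()` on the combined loss.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
batches = [torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0])]

# Two backward calls without zeroing: the gradients add up in w.grad.
for xb in batches:
    loss = ((w * xb) ** 2).sum()
    loss.backward()
accumulated = w.grad.clone()

# Reset, then run a single backward pass on the combined loss.
w.grad = None
full = torch.cat(batches)
((w * full) ** 2).sum().backward()

print(f"accumulated over batches: {accumulated.item()}")
print(f"single combined backward: {w.grad.item()}")
```

Setting `w.grad = None` is another way to reset a gradient; the next `backward()` then allocates a fresh tensor.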

---

## 5. torch.no_grad() — Disable Tracking

During inference (prediction), we don't need gradients. Disabling tracking saves memory and speeds up computation, because PyTorch no longer stores intermediate results for a backward pass.

In [7]:
x = torch.tensor([2.0], requires_grad=True)

# With gradient tracking
y = x ** 2
print(f"With tracking: y.requires_grad = {y.requires_grad}")

# Without gradient tracking (e.g., at inference time)
with torch.no_grad():
    y_no_grad = x ** 2
    print(f"In no_grad: y.requires_grad = {y_no_grad.requires_grad}")

With tracking: y.requires_grad = True
In no_grad: y.requires_grad = False
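
A short sketch of what `torch.no_grad()` does and does not change: the result has no `grad_fn`, the input tensor keeps its `requires_grad` flag, and tracking resumes once the block exits. The `backward_failed` flag is only there for illustration.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

with torch.no_grad():
    y = x ** 2

print(f"y.grad_fn: {y.grad_fn}")                      # no graph was recorded
print(f"x.requires_grad: {x.requires_grad}")          # x itself is unchanged
print(f"tracking enabled: {torch.is_grad_enabled()}") # back on after the block

# A tensor computed under no_grad cannot be backpropagated through.
try:
    y.backward()
    backward_failed = False
except RuntimeError as err:
    backward_failed = True
    print(f"backward() failed: {err}")
```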


---

## 6. A Simple Neural Network Example

Let's see autograd in action with a tiny "network" that learns.

In [8]:
# Data: learn y = 2x + 1
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
Y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])  # y = 2x + 1

# Initialize weights (start at zero)
w = torch.tensor([[0.0]], requires_grad=True)  # slope
b = torch.tensor([[0.0]], requires_grad=True)  # intercept

learning_rate = 0.1

print("Learning y = 2x + 1")
print(f"Initial: w={w.item():.2f}, b={b.item():.2f}")
print()

Learning y = 2x + 1
Initial: w=0.00, b=0.00



In [9]:
# Training loop
for epoch in range(100):
    # Forward pass
    Y_pred = X @ w + b  # predictions
    
    # Loss (mean squared error)
    loss = ((Y_pred - Y) ** 2).mean()
    
    # Backward pass
    loss.backward()
    
    # Update weights (gradient descent)
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    
    # Zero gradients for next iteration
    w.grad.zero_()
    b.grad.zero_()
    
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d}: loss={loss.item():.4f}, w={w.item():.3f}, b={b.item():.3f}")

print(f"\nFinal: w={w.item():.3f}, b={b.item():.3f}")
print(f"Target: w=2.000, b=1.000")

Epoch   0: loss=41.0000, w=3.500, b=1.200
Epoch  20: loss=0.0041, w=2.052, b=0.849
Epoch  40: loss=0.0012, w=2.028, b=0.918
Epoch  60: loss=0.0004, w=2.015, b=0.955
Epoch  80: loss=0.0001, w=2.008, b=0.976

Final: w=2.005, b=0.986
Target: w=2.000, b=1.000
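
To see what `loss.backward()` computes in the first iteration, this sketch compares autograd's result with the mean-squared-error gradient written out by hand. The names `residual`, `manual_dw`, and `manual_db` are illustrative.

```python
import torch

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
Y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])
w = torch.tensor([[0.0]], requires_grad=True)
b = torch.tensor([[0.0]], requires_grad=True)

Y_pred = X @ w + b
loss = ((Y_pred - Y) ** 2).mean()
loss.backward()

# MSE by hand: dL/dw = mean(2 * (Y_pred - Y) * X), dL/db = mean(2 * (Y_pred - Y))
with torch.no_grad():
    residual = Y_pred - Y
    manual_dw = (2 * residual * X).mean()
    manual_db = (2 * residual).mean()

print(f"dL/dw: autograd={w.grad.item()}, manual={manual_dw.item()}")  # both -35.0
print(f"dL/db: autograd={b.grad.item()}, manual={manual_db.item()}")  # both -12.0
```

With a learning rate of 0.1, these gradients move `w` from 0 to 3.5 and `b` from 0 to 1.2, which matches the values printed for epoch 0.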


---

## 7. Vector/Matrix Gradients

In real networks, weights are matrices. Autograd handles this too.

In [10]:
# Weight matrix (2 inputs → 3 outputs)
W = torch.randn(2, 3, requires_grad=True)
x = torch.randn(1, 2)  # 1 sample, 2 features

# Forward: y = xW
y = x @ W
print(f"Input x shape: {x.shape}")
print(f"Weight W shape: {W.shape}")
print(f"Output y shape: {y.shape}")

# Sum the output (backward() without arguments needs a scalar)
loss = y.sum()
loss.backward()

print(f"\nW.grad shape: {W.grad.shape}")
print(f"W.grad:\n{W.grad}")

Input x shape: torch.Size([1, 2])
Weight W shape: torch.Size([2, 3])
Output y shape: torch.Size([1, 3])

W.grad shape: torch.Size([2, 3])
W.grad:
tensor([[-0.3314, -0.3314, -0.3314],
        [ 1.3739,  1.3739,  1.3739]])
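
Each row of `W.grad` repeats a single value because the derivative of `y.sum()` with respect to `W[i, j]` is just the input feature `x[0, i]`. The sketch below checks this and also shows `backward()` called on the non-scalar `y` with an explicit upstream gradient. The seed is arbitrary, so the numbers will differ from the output above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the run is repeatable
W = torch.randn(2, 3, requires_grad=True)
x = torch.randn(1, 2)

y = x @ W
y.sum().backward()
first = W.grad.clone()

# d(sum(y))/dW[i, j] = x[0, i], so row i of W.grad repeats x[0, i].
expected = x.T @ torch.ones(1, 3)
print(torch.allclose(first, expected))

# Equivalent: backward() on y itself, passing a gradient of ones.
W.grad = None
y = x @ W
y.backward(gradient=torch.ones_like(y))
print(torch.allclose(W.grad, expected))
```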


---

## 8. detach() — Stop Gradient Flow

Sometimes you want to use a tensor's value but not backprop through it.

In [11]:
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# detach() returns a tensor that shares y's data but is cut off from the graph
y_detached = y.detach()

print(f"y.requires_grad: {y.requires_grad}")
print(f"y_detached.requires_grad: {y_detached.requires_grad}")

# Use case: target in loss calculation shouldn't get gradients
# loss = (prediction - target.detach()) ** 2

y.requires_grad: True
y_detached.requires_grad: False
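
The effect on gradients is easier to see with numbers. In this sketch, multiplying `y` by itself gives x⁴, while multiplying `y` by its detached copy treats that copy as a constant. The names `full_grad` and `detached_grad` are illustrative.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Full graph: z = y * y = x^4, so dz/dx = 4x^3 = 32 at x = 2.
y = x ** 2
(y * y).backward()
full_grad = x.grad.clone()

# One factor detached: it acts as the constant 4, so dz/dx = 4 * 2x = 16.
x.grad = None
y = x ** 2
(y * y.detach()).backward()
detached_grad = x.grad.clone()

print(f"full graph: {full_grad.item()}")      # 32.0
print(f"with detach: {detached_grad.item()}") # 16.0
```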


---

## Summary

| Concept | Code | Purpose |
|---------|------|----------|
| Track gradients | `requires_grad=True` | Enable autograd for this tensor |
| Compute gradients | `.backward()` | Backprop through the graph |
| Access gradient | `.grad` | Get the computed gradient |
| Zero gradients | `.grad.zero_()` | Reset before next backward |
| Disable tracking | `with torch.no_grad():` | Skip graph building during inference; saves memory |
| Stop gradient | `.detach()` | Use a value without backpropagating through it |

## Training Loop Pattern

```python
for epoch in range(num_epochs):
    # 1. Forward pass
    predictions = model(X)
    loss = loss_fn(predictions, Y)
    
    # 2. Backward pass
    optimizer.zero_grad()  # Zero gradients
    loss.backward()        # Compute gradients
    
    # 3. Update weights
    optimizer.step()       # Apply gradients
```
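
The pattern above leaves `model`, `loss_fn`, and `optimizer` undefined. One way to fill them in, as a sketch, is to fit the same y = 2x + 1 data from section 6. The choices of `nn.Linear`, `nn.MSELoss`, `torch.optim.SGD`, the seed, and 500 epochs are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
Y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])  # y = 2x + 1

model = nn.Linear(1, 1)                                  # one weight, one bias
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
num_epochs = 500

for epoch in range(num_epochs):
    # 1. Forward pass
    predictions = model(X)
    loss = loss_fn(predictions, Y)

    # 2. Backward pass
    optimizer.zero_grad()
    loss.backward()

    # 3. Update weights
    optimizer.step()

print(f"w={model.weight.item():.3f}, b={model.bias.item():.3f}, loss={loss.item():.6f}")
```

Here `optimizer.step()` replaces the manual `w -= learning_rate * w.grad` update, and `optimizer.zero_grad()` replaces the per-tensor `.grad.zero_()` calls.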

## Next

Build a feedforward neural network using `torch.nn` and `torch.optim`.