<a href="https://colab.research.google.com/github/your-username/pytorch-for-deeplearning/blob/main/notebooks/03_automatic_differentiation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 3: Automatic Differentiation

This notebook covers PyTorch's automatic differentiation (autograd) system, the foundation of neural network training.

## Learning Objectives
- Understand gradient computation with `requires_grad`
- Explore the computational graph
- Learn backpropagation mechanics
- Handle gradient accumulation and zeroing
- Work with higher-order derivatives

## Setup and Installation

In [None]:
# Install PyTorch if it is missing, then import it
try:
    import torch
except ImportError:
    !pip install torch torchvision torchaudio
    import torch
print(f"PyTorch version: {torch.__version__}")

import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# graphviz is optional: install it first, then import it, so a missing
# package does not abort the whole setup cell
!pip install graphviz
try:
    from graphviz import Digraph
    print("Graphviz installed for computational graph visualization")
except ImportError:
    Digraph = None
    print("Could not import graphviz - graph visualization will be limited")

# Set device and random seed
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
torch.manual_seed(42)

## 1. Gradient Computation Basics

Understanding `requires_grad` and basic gradient computation.

In [None]:
print("=== Basic Gradient Computation ===")

# Create tensors with gradient tracking
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

print(f"x = {x}, requires_grad = {x.requires_grad}")
print(f"y = {y}, requires_grad = {y.requires_grad}")

# Perform computations
z = x**2 + 2*y + 1
print(f"\nz = x² + 2y + 1 = {z}")
print(f"z.requires_grad = {z.requires_grad}")
print(f"z.grad_fn = {z.grad_fn}")

# Compute gradients
z.backward()

print(f"\nAfter z.backward():")
print(f"∂z/∂x = {x.grad} (expected: 2x = {2*x.item()})")
print(f"∂z/∂y = {y.grad} (expected: 2)")

# Verify gradients manually
print(f"\nManual verification:")
print(f"∂z/∂x = ∂(x² + 2y + 1)/∂x = 2x = {2 * x.item()}")
print(f"∂z/∂y = ∂(x² + 2y + 1)/∂y = 2")
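
The manual verification above relies on knowing the derivative in closed form. When that is not convenient, a numerical check is a common alternative. The following is a minimal sketch in plain Python, assuming the same function and the same point (x = 2, y = 3); the helper name `central_difference` and the step size `h` are illustrative choices, not part of PyTorch.

```python
def z(x, y):
    """Same function as the cell above: z = x^2 + 2y + 1."""
    return x**2 + 2*y + 1

def central_difference(f, point, index, h=1e-5):
    """Approximate the partial derivative of f with respect to point[index]."""
    plus = list(point)
    minus = list(point)
    plus[index] += h
    minus[index] -= h
    return (f(*plus) - f(*minus)) / (2 * h)

dz_dx = central_difference(z, (2.0, 3.0), index=0)
dz_dy = central_difference(z, (2.0, 3.0), index=1)
print(f"dz/dx ≈ {dz_dx:.4f}")  # analytic value: 2x = 4
print(f"dz/dy ≈ {dz_dy:.4f}")  # analytic value: 2
```

Finite differences cost one pair of function evaluations per input, which is why autograd's single backward pass is preferred for anything beyond spot checks.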

## 2. Computational Graph

Understanding how PyTorch builds and uses the computational graph.

In [None]:
print("=== Computational Graph Exploration ===")

# Create a more complex computation
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
c = torch.tensor(3.0, requires_grad=True)

# Build computation graph: d = (a + b) * c
temp = a + b
d = temp * c

print(f"a = {a}")
print(f"b = {b}")
print(f"c = {c}")
print(f"temp = a + b = {temp}")
print(f"d = temp * c = {d}")

print(f"\nGrad functions:")
print(f"temp.grad_fn = {temp.grad_fn}")
print(f"d.grad_fn = {d.grad_fn}")

# Inspect the graph structure
print(f"\nGraph structure:")
print(f"d.grad_fn.next_functions = {d.grad_fn.next_functions}")

# Compute gradients
d.backward()

print(f"\nGradients:")
print(f"∂d/∂a = {a.grad} (expected: c = {c.item()})")
print(f"∂d/∂b = {b.grad} (expected: c = {c.item()})")
print(f"∂d/∂c = {c.grad} (expected: a+b = {a.item() + b.item()})")

In [None]:
# Function to visualize computational graph (simplified)
def visualize_graph_simple():
    """Create a simple visualization of computational graph"""
    print("\n=== Computational Graph Visualization ===")
    print("")
    print("  a(1.0)    b(2.0)")
    print("     \\      /")
    print("      \\    /")
    print("       \\  /")
    print("        +   ← AddBackward")
    print("        |")
    print("    temp(3.0)")
    print("        |")
    print("        |  c(3.0)")
    print("        |    /")
    print("        |   /")
    print("        |  /")
    print("        * ← MulBackward")
    print("        |")
    print("      d(9.0)")
    print("")
    print("Backward pass:")
    print("1. d.backward() starts with gradient 1.0")
    print("2. MulBackward: grad_temp = 1.0 * c, grad_c = 1.0 * temp")
    print("3. AddBackward: grad_a = grad_temp, grad_b = grad_temp")

visualize_graph_simple()
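
To make the backward pass less of a black box, here is a toy reverse-mode differentiator in plain Python for the same computation, d = (a + b) * c. It is only a sketch of the idea: the `Value` class and its attributes are hypothetical, and PyTorch's real engine is considerably more involved.

```python
class Value:
    """Hypothetical scalar node that records how it was produced."""

    def __init__(self, data, parents=(), backward_rule=None):
        self.data = data
        self.grad = 0.0
        self.parents = parents              # nodes this value was computed from
        self.backward_rule = backward_rule  # maps upstream grad -> grads for parents

    def __add__(self, other):
        # d(a+b)/da = 1 and d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), lambda g: (g, g))

    def __mul__(self, other):
        # d(a*b)/da = b and d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def backward(self):
        # Visit nodes in reverse topological order so each node's gradient
        # is complete before it is pushed to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # seed gradient: dd/dd = 1
        for node in reversed(order):
            if node.backward_rule is not None:
                for parent, g in zip(node.parents, node.backward_rule(node.grad)):
                    parent.grad += g  # accumulate, as autograd does

a, b, c = Value(1.0), Value(2.0), Value(3.0)
d = (a + b) * c
d.backward()
print(a.grad, b.grad, c.grad)  # 3.0 3.0 3.0
```

Each operation stores a small rule for turning the upstream gradient into gradients for its inputs, which plays the role of the `grad_fn` objects printed earlier.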

## 3. Backpropagation Mechanics

Understanding how gradients flow backwards through the computation graph.

In [None]:
print("=== Step-by-step Backpropagation ===")

# Create a simple neural network computation
# f(x, w, b) = σ(w * x + b) where σ is sigmoid
x = torch.tensor(0.5, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

print(f"Input: x = {x.item()}")
print(f"Weight: w = {w.item()}")
print(f"Bias: b = {b.item()}")

# Forward pass
linear = w * x + b
output = torch.sigmoid(linear)

print(f"\nForward pass:")
print(f"linear = w*x + b = {linear.item():.4f}")
print(f"output = σ(linear) = {output.item():.4f}")

# Backward pass
output.backward()

print(f"\nBackward pass gradients:")
print(f"∂output/∂b = {b.grad.item():.4f}")
print(f"∂output/∂w = {w.grad.item():.4f}")
print(f"∂output/∂x = {x.grad.item():.4f}")

# Manual gradient computation for verification
print(f"\nManual verification:")
sigmoid_grad = output.item() * (1 - output.item())  # d/dx σ(x) = σ(x)(1-σ(x))
print(f"σ'(linear) = σ(linear)(1-σ(linear)) = {sigmoid_grad:.4f}")
print(f"∂output/∂b = σ'(linear) * 1 = {sigmoid_grad:.4f}")
print(f"∂output/∂w = σ'(linear) * x = {sigmoid_grad * x.item():.4f}")
print(f"∂output/∂x = σ'(linear) * w = {sigmoid_grad * w.item():.4f}")
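
The manual verification computes σ'(linear) by hand. Autograd can also expose that intermediate gradient directly: non-leaf tensors such as `linear` normally discard their `.grad`, but calling `retain_grad()` before the backward pass keeps it. A short sketch, reusing the same values as above:

```python
import torch

x = torch.tensor(0.5, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

linear = w * x + b
linear.retain_grad()  # keep the gradient of this non-leaf tensor
output = torch.sigmoid(linear)
output.backward()

# linear = 0 here, so sigmoid'(linear) = 0.5 * (1 - 0.5) = 0.25
print(f"d(output)/d(linear) = {linear.grad.item():.4f}")
print(f"w.grad matches linear.grad * x: {torch.isclose(w.grad, linear.grad * x.detach()).item()}")
print(f"x.grad matches linear.grad * w: {torch.isclose(x.grad, linear.grad * w.detach()).item()}")
```

This makes the chain rule visible: the gradient arriving at `linear` is multiplied by the local derivatives of `w * x + b` to produce the gradients of `w`, `x`, and `b`.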

## 4. Gradient Management

Learning how to manage gradients: accumulation, zeroing, and detaching.

In [None]:
print("=== Gradient Accumulation ===")

# Gradients accumulate by default
x = torch.tensor(1.0, requires_grad=True)

# First computation and backward
y1 = x**2
y1.backward()
print(f"After first backward (y1 = x²): x.grad = {x.grad}")

# Second computation and backward (gradients accumulate!)
y2 = x**3
y2.backward()
print(f"After second backward (y2 = x³): x.grad = {x.grad}")
print(f"Expected: 2x + 3x² = {2*x.item()} + {3*x.item()**2} = {2*x.item() + 3*x.item()**2}")

# Zero gradients before next computation
x.grad.zero_()
print(f"After zeroing gradients: x.grad = {x.grad}")

# Third computation
y3 = 2 * x + 5
y3.backward()
print(f"After third backward (y3 = 2x + 5): x.grad = {x.grad}")
print(f"Expected: 2")
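
Accumulation is not only a pitfall; it can be used on purpose. Because gradients add up across `backward()` calls, the gradient of a summed loss over a large batch can be built from several smaller batches. The sketch below assumes a toy linear model with made-up random data; `loss_fn` and the batch sizes are illustrative.

```python
import torch

torch.manual_seed(0)
w = torch.tensor([1.0, -2.0], requires_grad=True)
data = torch.randn(8, 2)
target = torch.randn(8)

def loss_fn(inputs, targets):
    # Sum (not mean) of squared errors, so micro-batch losses add up exactly
    return ((inputs @ w - targets) ** 2).sum()

# Full-batch gradient in a single backward pass
loss_fn(data, target).backward()
full_grad = w.grad.clone()

# Reset, then accumulate over two micro-batches of four samples each
w.grad.zero_()
for start in (0, 4):
    loss_fn(data[start:start + 4], target[start:start + 4]).backward()
accumulated_grad = w.grad.clone()

print(torch.allclose(full_grad, accumulated_grad, atol=1e-5))
```

If the loss were a mean instead of a sum, each micro-batch loss would need to be scaled by its share of the full batch for the two results to agree.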

In [None]:
print("\n=== Detaching from Computational Graph ===")

x = torch.tensor(2.0, requires_grad=True)
y = x**2

print(f"y = x² = {y}")
print(f"y.requires_grad = {y.requires_grad}")

# Detach y from the graph
y_detached = y.detach()
print(f"\ny_detached = {y_detached}")
print(f"y_detached.requires_grad = {y_detached.requires_grad}")

# Using detached tensor
z1 = y * 2  # This maintains the graph
z2 = y_detached * 2  # This breaks the graph

print(f"\nz1 = y * 2, z1.requires_grad = {z1.requires_grad}")
print(f"z2 = y_detached * 2, z2.requires_grad = {z2.requires_grad}")

# Backward through z1 works
z1.backward()
print(f"\nAfter z1.backward(): x.grad = {x.grad}")

# Reset gradient
x.grad.zero_()

# z2 was built from the detached tensor, so it has no grad_fn and backward() raises
try:
    z2.backward()
    print(f"After z2.backward(): x.grad = {x.grad}")
except RuntimeError as e:
    print(f"Error with z2.backward(): {e}")
    print("z2 doesn't require gradients, so backward() fails")
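
A detached tensor behaves like a constant during differentiation. The following sketch, with illustrative variable names, shows the effect by differentiating x · x twice: once with both factors tracked, and once with one factor detached.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Both factors carry gradient: y = x * x, so dy/dx = 2x
y_full = x * x
(grad_full,) = torch.autograd.grad(y_full, x)

# The second factor is detached, so it acts as the constant 3.0: dy/dx = 3.0
y_stopped = x * x.detach()
(grad_stopped,) = torch.autograd.grad(y_stopped, x)

print(grad_full.item(), grad_stopped.item())  # 6.0 3.0
```

The forward values of `y_full` and `y_stopped` are identical; only the gradient path differs.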

## 5. Context Managers for Gradient Control

Using `torch.no_grad()` and `torch.inference_mode()` to turn gradient tracking off.

In [None]:
print("=== Context Managers ===")

x = torch.tensor(1.0, requires_grad=True)

# Normal computation with gradients
y1 = x**2
print(f"Normal: y1 = x² = {y1}, requires_grad = {y1.requires_grad}")

# Computation without gradients
with torch.no_grad():
    y2 = x**2
    print(f"No grad: y2 = x² = {y2}, requires_grad = {y2.requires_grad}")

# Inference mode (PyTorch 1.9+)
with torch.inference_mode():
    y3 = x**2
    print(f"Inference: y3 = x² = {y3}, requires_grad = {y3.requires_grad}")

print(f"\nMemory efficiency comparison:")
print(f"y1.grad_fn = {y1.grad_fn} (stores computation graph)")
print(f"y2.grad_fn = {y2.grad_fn} (no computation graph)")
print(f"y3.grad_fn = {y3.grad_fn} (no computation graph)")
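
Two related tools are worth knowing alongside the context managers above: `torch.enable_grad()` turns tracking back on inside a `no_grad` block, and `torch.no_grad()` can also be used as a function decorator. A brief sketch (the `predict` helper is hypothetical):

```python
import torch

x = torch.tensor(1.0, requires_grad=True)

with torch.no_grad():
    y_off = x ** 2              # tracking disabled
    with torch.enable_grad():
        y_on = x ** 2           # tracking re-enabled inside the outer block

print(y_off.requires_grad, y_on.requires_grad)  # False True

@torch.no_grad()
def predict(inp):
    """Hypothetical inference helper; nothing computed here is tracked."""
    return inp ** 2

print(predict(x).requires_grad)  # False
```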

## 6. Higher-Order Derivatives

Computing second and higher-order derivatives.

In [None]:
print("=== Higher-Order Derivatives ===")

# Function: f(x) = x³ - 2x² + x + 1
x = torch.tensor(2.0, requires_grad=True)
y = x**3 - 2*x**2 + x + 1

print(f"f(x) = x³ - 2x² + x + 1")
print(f"f({x.item()}) = {y.item()}")

# First derivative
grad_1 = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"\nFirst derivative: f'(x) = 3x² - 4x + 1")
print(f"f'({x.item()}) = {grad_1.item()}")
print(f"Expected: 3({x.item()})² - 4({x.item()}) + 1 = {3*x.item()**2 - 4*x.item() + 1}")

# Second derivative
grad_2 = torch.autograd.grad(grad_1, x, create_graph=True)[0]
print(f"\nSecond derivative: f''(x) = 6x - 4")
print(f"f''({x.item()}) = {grad_2.item()}")
print(f"Expected: 6({x.item()}) - 4 = {6*x.item() - 4}")

# Third derivative
grad_3 = torch.autograd.grad(grad_2, x, create_graph=True)[0]
print(f"\nThird derivative: f'''(x) = 6")
print(f"f'''({x.item()}) = {grad_3.item()}")
print(f"Expected: 6")

# Fourth derivative (should be zero)
grad_4 = torch.autograd.grad(grad_3, x)[0]
print(f"\nFourth derivative: f''''(x) = 0")
print(f"f''''({x.item()}) = {grad_4.item()}")
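
For functions of several variables, the second derivative is a matrix (the Hessian). Rather than nesting `torch.autograd.grad` calls by hand, one option is `torch.autograd.functional.hessian`. The function `f` below is an illustrative example, not taken from the cell above.

```python
import torch
from torch.autograd.functional import hessian

def f(v):
    # Illustrative function of two variables: f(v) = v0^2 * v1 + v1^3
    return v[0] ** 2 * v[1] + v[1] ** 3

point = torch.tensor([1.0, 2.0])
H = hessian(f, point)

# Analytic Hessian: [[2*v1, 2*v0], [2*v0, 6*v1]] = [[4, 2], [2, 12]] at (1, 2)
print(H)
```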

## 7. Gradient Flow Analysis

Understanding how gradients flow through different operations.

In [None]:
print("=== Gradient Flow Analysis ===")

def analyze_gradients(func, func_name, x_range=(-3, 3)):
    """Analyze gradient behavior of a function"""
    x = torch.linspace(x_range[0], x_range[1], 100, requires_grad=True)
    y = func(x)
    
    # Compute gradients
    grad_outputs = torch.ones_like(y)
    grads = torch.autograd.grad(outputs=y, inputs=x, 
                               grad_outputs=grad_outputs,
                               create_graph=True)[0]
    
    # Statistics
    grad_mean = grads.mean().item()
    grad_std = grads.std().item()
    grad_max = grads.max().item()
    grad_min = grads.min().item()
    zero_grads = (grads == 0).sum().item()
    
    print(f"\n{func_name}:")
    print(f"  Gradient mean: {grad_mean:.4f}")
    print(f"  Gradient std:  {grad_std:.4f}")
    print(f"  Gradient range: [{grad_min:.4f}, {grad_max:.4f}]")
    print(f"  Zero gradients: {zero_grads}")
    
    return x.detach(), y.detach(), grads.detach()

# Analyze different functions
functions = [
    (lambda x: torch.relu(x), "ReLU"),
    (lambda x: torch.sigmoid(x), "Sigmoid"),
    (lambda x: torch.tanh(x), "Tanh"),
    (lambda x: x**2, "Quadratic"),
    (lambda x: torch.sin(x), "Sine")
]

results = []
for func, name in functions:
    x, y, grad = analyze_gradients(func, name)
    results.append((x, y, grad, name))

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, (x, y, grad, name) in enumerate(results):
    if i < len(axes):
        ax = axes[i]
        ax.plot(x.numpy(), y.numpy(), label=f'{name}(x)', linewidth=2)
        ax.plot(x.numpy(), grad.numpy(), label=f"{name}'(x)", linewidth=2, alpha=0.7)
        ax.set_title(f'{name} and its Gradient')
        ax.legend()
        ax.grid(True, alpha=0.3)
        ax.axhline(y=0, color='k', linewidth=0.5)
        ax.axvline(x=0, color='k', linewidth=0.5)

# Remove empty subplot
if len(results) < len(axes):
    fig.delaxes(axes[-1])

plt.tight_layout()
plt.show()
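
The sigmoid curve above never has a slope greater than 0.25, which hints at why deep stacks of sigmoids suffer from vanishing gradients: by the chain rule, the overall derivative is a product of per-layer slopes. The following plain-Python sketch illustrates this under a simplifying assumption of a chain of bare sigmoids with no weights; `stacked_sigmoid_gradient` is a hypothetical helper.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def stacked_sigmoid_gradient(x, depth):
    """Derivative of sigmoid applied `depth` times, via the chain rule."""
    grad = 1.0
    h = x
    for _ in range(depth):
        h = sigmoid(h)
        grad *= h * (1.0 - h)  # sigmoid'(t) = sigmoid(t) * (1 - sigmoid(t))
    return grad

for depth in (1, 2, 5, 10):
    print(f"depth {depth:2d}: gradient = {stacked_sigmoid_gradient(0.0, depth):.3e}")
```

Since every factor is at most 0.25, the gradient after `depth` layers is bounded by 0.25 raised to that depth, so it shrinks at least geometrically.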

## 8. Practice Exercises

In [None]:
# Exercise 1: Implement gradient descent manually
print("Exercise 1: Manual Gradient Descent")
print("Minimizing f(x) = (x - 3)² + 1")

# Initialize
x = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1
num_steps = 20

print(f"Starting at x = {x.item():.4f}")
print(f"True minimum at x = 3")

losses = []
x_values = []

for step in range(num_steps):
    # Forward pass
    loss = (x - 3)**2 + 1
    losses.append(loss.item())
    x_values.append(x.item())
    
    # Backward pass
    loss.backward()
    
    # Update (no_grad to avoid building computation graph for update)
    with torch.no_grad():
        x -= learning_rate * x.grad
    
    # Zero gradients
    x.grad.zero_()
    
    if step % 5 == 0:
        print(f"Step {step:2d}: loss = {loss.item():.4f}, x after update = {x.item():.4f}")

final_loss = ((x - 3)**2 + 1).item()
print(f"\nFinal: x = {x.item():.4f}, true minimum at x = 3.0000")
print(f"Final loss = {final_loss:.6f}, true minimum = 1.0000")

# Plot convergence
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(x_values, losses, 'bo-', linewidth=2, markersize=6)
plt.axhline(y=1.0, color='r', linestyle='--', label='True minimum')
plt.xlabel('x')
plt.ylabel('Loss')
plt.title('Loss vs x')
plt.grid(True, alpha=0.3)
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(range(num_steps), losses, 'go-', linewidth=2, markersize=6)
plt.axhline(y=1.0, color='r', linestyle='--', label='True minimum')
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Loss vs Step')
plt.grid(True, alpha=0.3)
plt.legend()

plt.tight_layout()
plt.show()
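
The manual update in Exercise 1 is what `torch.optim.SGD` does in its plainest configuration (no momentum or weight decay). As a sketch, the same minimization can be written with an optimizer, assuming the same starting point, learning rate, and step count:

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)

for step in range(20):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = (x - 3) ** 2 + 1
    loss.backward()
    optimizer.step()             # x <- x - lr * x.grad

print(f"x after 20 steps: {x.item():.4f}")
```

Each step shrinks the distance to the minimum by a factor of 1 - 2 · 0.1 = 0.8, so after 20 steps x is close to, but not exactly, 3.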

In [None]:
# Exercise 2: Chain rule verification
print("\nExercise 2: Chain Rule Verification")
print("Computing ∂h/∂x where h = g(f(x)), f(x) = x², g(u) = sin(u)")

x = torch.tensor(1.0, requires_grad=True)

# Composite function: h(x) = sin(x²)
f_x = x**2  # f(x) = x²
h = torch.sin(f_x)  # h = sin(f(x))

print(f"x = {x.item()}")
print(f"f(x) = x² = {f_x.item()}")
print(f"h = sin(f(x)) = sin({f_x.item():.4f}) = {h.item():.4f}")

# Automatic differentiation
h.backward()
auto_grad = x.grad.item()

print(f"\nAutomatic differentiation: ∂h/∂x = {auto_grad:.6f}")

# Manual chain rule: ∂h/∂x = (∂h/∂f) * (∂f/∂x)
# ∂h/∂f = ∂sin(f)/∂f = cos(f)
# ∂f/∂x = ∂(x²)/∂x = 2x
dh_df = torch.cos(f_x).item()  # cos(x²)
df_dx = 2 * x.item()  # 2x
manual_grad = dh_df * df_dx

print(f"\nManual chain rule:")
print(f"∂h/∂f = cos(f) = cos({f_x.item():.4f}) = {dh_df:.6f}")
print(f"∂f/∂x = 2x = 2({x.item()}) = {df_dx:.6f}")
print(f"∂h/∂x = (∂h/∂f) * (∂f/∂x) = {dh_df:.6f} * {df_dx:.6f} = {manual_grad:.6f}")

print(f"\nVerification: {abs(auto_grad - manual_grad) < 1e-6}")
print(f"Difference: {abs(auto_grad - manual_grad):.2e}")
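
PyTorch also ships a utility for this kind of check: `torch.autograd.gradcheck` compares autograd's result against finite differences. The sketch below applies it to the same composite function; the test points and tolerances are illustrative choices.

```python
import torch

def h(t):
    return torch.sin(t ** 2)

# gradcheck relies on finite differences, so it expects float64 inputs
x = torch.tensor([1.0, 0.5], dtype=torch.float64, requires_grad=True)
passed = torch.autograd.gradcheck(h, (x,), eps=1e-6, atol=1e-4)
print(f"gradcheck passed: {passed}")
```

When the analytic and numerical gradients disagree, `gradcheck` raises an error by default rather than returning `False`.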

## Summary

In this notebook, we covered:

1. **Basic Gradients**: Understanding `requires_grad` and gradient computation
2. **Computational Graph**: How PyTorch builds and uses computation graphs
3. **Backpropagation**: Step-by-step gradient flow through operations
4. **Gradient Management**: Accumulation, zeroing, and detaching
5. **Context Managers**: Controlling gradient computation with `torch.no_grad()` and `torch.inference_mode()`
6. **Higher-Order Derivatives**: Computing second, third, and fourth derivatives
7. **Gradient Flow**: Analyzing how gradients behave in different functions
8. **Practice Exercises**: Implementing gradient descent from scratch and verifying the chain rule

### Key Concepts
- **Autograd**: PyTorch's automatic differentiation system
- **Dynamic Graphs**: Computation graphs are built on-the-fly
- **Gradient Accumulation**: Gradients add up unless explicitly zeroed
- **Chain Rule**: Automatic application for complex compositions
- **Memory Management**: `torch.no_grad()` and `torch.inference_mode()` skip graph construction, which saves memory during inference

### Best Practices
1. Zero gradients before each backward pass in training loops, unless you are accumulating them on purpose
2. Use `torch.no_grad()` for inference to save memory
3. Be careful with in-place operations that can break gradients
4. Use `detach()` when you need to stop gradient flow
5. Monitor gradient magnitudes to detect vanishing/exploding gradients
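
To illustrate the in-place caveat from the list above: some operations save tensors for their backward pass, and modifying such a tensor in place invalidates it. The sketch below uses `torch.sigmoid`, whose backward pass reuses its own output; the variable names are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward pass reuses its own output
y.mul_(2)              # in-place edit overwrites that saved output

try:
    y.sum().backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {str(err)[:80]}...")

# The out-of-place version leaves the saved tensor intact
y_safe = torch.sigmoid(x) * 2
y_safe.sum().backward()
print(x.grad)
```

Autograd detects the problem through a version counter on each tensor, so the failure surfaces as an error at `backward()` time instead of a silently wrong gradient.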

### Next Steps
- Apply autograd to build and train neural networks
- Learn about optimizers that use these gradients
- Understand gradient-based optimization algorithms
- Move on to the next notebook: Neural Network Building Blocks