# Week 1: PyTorch Basics - Building Neural Networks

**Bread Financial - AI for Data Scientists Academy**

---

## Learning Objectives

By the end of this session, you will be able to:

- Work with PyTorch tensors and perform basic tensor operations
- Understand automatic differentiation with autograd
- Build feedforward neural networks using `nn.Module`
- Implement a complete training loop that follows best practices
- Train a neural network to classify handwritten digits (MNIST dataset)
- Evaluate model performance and visualize predictions

## Prerequisites

Before starting this notebook, you should have:

- Python programming fundamentals
- Basic NumPy and Pandas knowledge
- Watched pre-class videos on: neural networks, backpropagation, activation functions

## Session Format

- **2-hour hands-on session**
- Instructor will demo key concepts (live coding)
- You will complete labs independently
- Solutions shared after class

---

## Important: GPU Setup for Google Colab

If you're running this notebook on Google Colab, **enable GPU acceleration** for faster training:

1. Click on **Runtime** in the top menu
2. Select **Change runtime type**
3. Under **Hardware accelerator**, select **T4 GPU**
4. Click **Save**

The notebook will work fine on CPU, but GPU makes training much faster!

---

## Section 0: Environment Setup

Let's start by setting up our environment and understanding what we're building today.

In [None]:
# Install required packages (run this first in Google Colab)
# If running locally with conda/venv, you may skip this cell

!pip install torch torchvision matplotlib numpy scikit-learn

In [None]:
# Import all necessary libraries
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms
import matplotlib.pyplot as plt
import numpy as np

# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Device configuration - automatically use GPU if available, otherwise CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if device.type == 'cuda':
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Set random seed for reproducibility
# This makes runs repeatable on the same setup; results can still differ
# slightly across hardware (CPU vs. GPU) and PyTorch versions
torch.manual_seed(42)

print("\n Environment setup complete!")

## What Are We Building Today?

### Real-World Context: Automated Check Processing

Imagine you're a data scientist at a bank. Thousands of checks arrive daily, and clerks manually type in the check amounts. This is slow, expensive, and error-prone.

**Your mission**: Build an AI system that automatically reads handwritten digits on checks.

Today, we'll start with the MNIST dataset: 70,000 images of handwritten digits (0-9), split into 60,000 training images and 10,000 test images. This is a classic dataset that simulates the digit recognition problem banks face.

Let's look at what we're working with:

In [None]:
# Load a sample of MNIST data to visualize
# Don't worry about the details yet - we'll explain everything later!
sample_data = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)

# Display 10 sample digits
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    image, label = sample_data[i]
    ax.imshow(image.squeeze(), cmap='gray')
    ax.set_title(f'Label: {label}', fontsize=14)
    ax.axis('off')

plt.suptitle('MNIST Handwritten Digits - Examples', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"\nDataset size: {len(sample_data):,} training images")
print(f"Image dimensions: 28x28 pixels (grayscale)")
print(f"Number of classes: 10 (digits 0-9)")
print(f"\n Goal: Build a neural network that achieves ~95% accuracy!")

---

# Topic 1: PyTorch Tensors & Operations

## Why PyTorch?

You might be wondering: "We already know NumPy, why learn PyTorch?"

**PyTorch offers three critical advantages:**

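1. **GPU Acceleration**: PyTorch tensors can run on GPUs, which can make large computations far faster (10-100x is a common rule of thumb; the actual speedup depends on the workload and hardware)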
2. **Automatic Differentiation**: PyTorch automatically computes gradients (derivatives) for us - essential for training neural networks
3. **Deep Learning Ecosystem**: Built-in layers, optimizers, and tools specifically designed for neural networks

## What is a Tensor?

A **tensor** is a multi-dimensional array - just like NumPy arrays, but optimized for deep learning:

- **Scalar** (0D tensor): A single number → `5`
- **Vector** (1D tensor): Array of numbers → `[1, 2, 3, 4]`
- **Matrix** (2D tensor): Table of numbers → `[[1, 2], [3, 4]]`
- **3D+ tensors**: Images (PyTorch stores them as channels × height × width), videos, etc.
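
The ranks listed above can be checked directly with `.ndim` and `.shape`. This is a minimal sketch with illustrative variable names; the image tensor assumes PyTorch's channels-first layout `(channels, height, width)`.

```python
import torch

scalar = torch.tensor(5)                  # 0D: a single number
vector = torch.tensor([1, 2, 3, 4])       # 1D: array of numbers
matrix = torch.tensor([[1, 2], [3, 4]])   # 2D: table of numbers
image = torch.zeros(1, 28, 28)            # 3D: one grayscale image (channels, height, width)

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix), ("image", image)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```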

### Example: Representing an MNIST Image

Each MNIST digit is a **2D tensor** of shape `(28, 28)` containing pixel intensities:

```python
# Conceptual example (illustrative values; "..." marks omitted entries, so this is not runnable as-is)
digit_image = torch.tensor([
    [0.0, 0.0, 0.5, 0.8, 0.8, 0.5, ...],  # Row 1 (28 pixels)
    [0.0, 0.2, 0.9, 1.0, 1.0, 0.9, ...],  # Row 2
    ...                                    # 28 rows total
])  # Shape: (28, 28)
```

---

## Demo: Tensor Basics

The instructor will demonstrate tensor creation and operations. **Pay attention to**:
- How to create tensors
- Basic operations (add, multiply, reshape)
- Moving tensors between CPU and GPU

In [None]:
# Demo: Creating Tensors

# From Python lists
tensor_1d = torch.tensor([1, 2, 3, 4, 5])
print("1D Tensor (vector):")
print(tensor_1d)
print(f"Shape: {tensor_1d.shape}\n")

# 2D tensor (matrix)
tensor_2d = torch.tensor([[1, 2, 3], [4, 5, 6]])
print("2D Tensor (matrix):")
print(tensor_2d)
print(f"Shape: {tensor_2d.shape}\n")

# Special tensors
zeros = torch.zeros(3, 4) # 3x4 matrix of zeros
ones = torch.ones(2, 3) # 2x3 matrix of ones
random = torch.randn(2, 2) # 2x2 matrix with random values (normal distribution)

print("Special tensors:")
print(f"Zeros:\n{zeros}\n")
print(f"Ones:\n{ones}\n")
print(f"Random:\n{random}\n")

In [None]:
# Demo: Tensor Operations

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

# Element-wise operations (performed on corresponding elements)
print("Element-wise operations:")
print(f"x + y = {x + y}") # [1+4, 2+5, 3+6] = [5, 7, 9]
print(f"x * y = {x * y}") # [1*4, 2*5, 3*6] = [4, 10, 18]
print(f"x ** 2 = {x ** 2}\n") # [1^2, 2^2, 3^2] = [1, 4, 9]

# Matrix operations
A = torch.tensor([[1, 2], [3, 4]])
B = torch.tensor([[5, 6], [7, 8]])

# Matrix multiplication (rows of A combined with columns of B; not element-wise)
C = torch.mm(A, B) # or A @ B
print("Matrix multiplication A @ B:")
print(C)

In [None]:
# Demo: Reshaping Tensors

# This is CRITICAL for neural networks!
# We often need to flatten images or change tensor dimensions

# Create a tensor representing an image (batch_size=1, channels=1, height=28, width=28)
image = torch.randn(1, 1, 28, 28)
print(f"Original image shape: {image.shape}")

# Flatten the image to a vector (needed for feedforward networks)
# 28 * 28 = 784 pixels
flattened = image.view(1, 784) # or image.reshape(1, 784)
print(f"Flattened image shape: {flattened.shape}")

# Alternative: use -1 to infer dimension automatically
flattened_auto = image.view(image.size(0), -1) # -1 means "figure out this dimension"
print(f"Auto-flattened shape: {flattened_auto.shape}")

print("\nKey insight: .view() changes a tensor's shape without copying data; .reshape() does the same when possible and copies only if it must")

In [None]:
# Demo: Device Management (CPU vs GPU)

# Create tensor on CPU (default)
cpu_tensor = torch.tensor([1, 2, 3])
print(f"CPU tensor device: {cpu_tensor.device}")

# Move tensor to GPU if available
if device.type == 'cuda':
    gpu_tensor = cpu_tensor.to(device)  # Copy to GPU
    print(f"GPU tensor device: {gpu_tensor.device}")
    
    # For operations to work, tensors must be on the SAME device
    # This would ERROR: cpu_tensor + gpu_tensor
    
    # Move back to CPU
    back_to_cpu = gpu_tensor.cpu()
    print(f"Back to CPU: {back_to_cpu.device}")
else:
    print("GPU not available, staying on CPU")

print("\n Best practice: Use .to(device) to automatically handle CPU/GPU")

---

## Lab 1: Tensor Operations Practice

Now it's your turn! Complete the following exercises to practice working with tensors.

### Exercise 1.1: Create Tensors

Create the following tensors:
1. A 1D tensor with values `[10, 20, 30, 40, 50]`
2. A 3x3 matrix of zeros
3. A 2x4 matrix of random values (use `torch.randn()`)
4. A tensor representing a batch of 5 MNIST images (shape should be `(5, 1, 28, 28)`)

**Hints:**
- Use `torch.tensor()` for specific values
- Use `torch.zeros()`, `torch.ones()`, `torch.randn()` for special tensors
- Check shapes with `.shape`

In [None]:
# Solution: Lab 1.1 - Create Tensors

# 1. Create 1D tensor
tensor_1 = torch.tensor([10, 20, 30, 40, 50])

# 2. Create 3x3 zeros
tensor_2 = torch.zeros(3, 3)

# 3. Create 2x4 random
tensor_3 = torch.randn(2, 4)

# 4. Create batch of MNIST-shaped tensors
tensor_4 = torch.randn(5, 1, 28, 28)

# Print shapes to verify
print(f"tensor_1 shape: {tensor_1.shape}")  # Expected: torch.Size([5])
print(f"tensor_2 shape: {tensor_2.shape}")  # Expected: torch.Size([3, 3])
print(f"tensor_3 shape: {tensor_3.shape}")  # Expected: torch.Size([2, 4])
print(f"tensor_4 shape: {tensor_4.shape}")  # Expected: torch.Size([5, 1, 28, 28])

### Exercise 1.2: Tensor Operations

Given two tensors `a` and `b`, perform the following operations:

1. Element-wise addition
2. Element-wise multiplication
3. Compute the mean of tensor `a`
4. Find the maximum value in tensor `b`

In [None]:
# Solution: Lab 1.2 - Tensor Operations

# Given tensors
a = torch.tensor([2.0, 4.0, 6.0, 8.0])
b = torch.tensor([1.0, 3.0, 5.0, 7.0])

# 1. Element-wise addition
addition = a + b  # Result: [3, 7, 11, 15]

# 2. Element-wise multiplication
multiplication = a * b  # Result: [2, 12, 30, 56]

# 3. Mean of a
mean_a = a.mean()  # or torch.mean(a) - Result: 5.0

# 4. Max of b
max_b = b.max()  # or torch.max(b) - Result: 7.0

print(f"a + b = {addition}")
print(f"a * b = {multiplication}")
print(f"Mean of a = {mean_a}")
print(f"Max of b = {max_b}")

### Exercise 1.3: Reshaping for Neural Networks

**Scenario**: You have a batch of 10 MNIST images (shape `(10, 1, 28, 28)`). To feed them into a feedforward neural network, you need to flatten each image into a vector of 784 pixels.

**Task**: Flatten the batch so the shape becomes `(10, 784)`.

**Hint**: Use `.view()` or `.reshape()` with size `(batch_size, -1)`

In [None]:
# Solution: Lab 1.3 - Reshaping for Neural Networks

# Create a batch of 10 random MNIST-like images
batch_images = torch.randn(10, 1, 28, 28)
print(f"Original shape: {batch_images.shape}")

# Flatten the images using .view() or .reshape()
# Method 1: Explicit dimensions
# flattened_batch = batch_images.view(10, 784)

# Method 2: Using -1 to auto-infer dimension (recommended)
flattened_batch = batch_images.view(batch_images.size(0), -1)

print(f"Flattened shape: {flattened_batch.shape}")
print(f"Expected shape: (10, 784)")

# Verify dimensions
if flattened_batch.shape == (10, 784):
    print("\n Correct! You successfully flattened the batch.")
else:
    print("\n Shape doesn't match. Try again!")

---

# Topic 2: Autograd & Computational Graphs

## Why Do We Need Automatic Differentiation?

Training a neural network requires computing **gradients** (derivatives) to know how to adjust weights:

1. **Forward pass**: Input → Network → Prediction → Loss
2. **Backward pass**: Compute gradients of loss with respect to all weights
3. **Weight update**: Adjust weights in the direction that reduces loss

Manually computing gradients for complex networks is extremely tedious and error-prone. **Autograd** does this automatically!
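
To make the three steps concrete, here is a minimal sketch of a single hand-written update on a hypothetical one-weight model (`prediction = w * x`). The values of `x`, `target`, and `learning_rate` are assumptions chosen so the arithmetic is easy to follow; in real training an optimizer performs step 3.

```python
import torch

# Hypothetical one-weight model: prediction = w * x, target = 6.0
w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)
target = torch.tensor(6.0)
learning_rate = 0.1  # assumed value for illustration

# 1. Forward pass: input -> prediction -> loss
loss = (w * x - target) ** 2          # (1*2 - 6)^2 = 16

# 2. Backward pass: d(loss)/dw = 2 * (w*x - target) * x = -16
loss.backward()

# 3. Weight update: step against the gradient (no graph tracking needed here)
with torch.no_grad():
    w -= learning_rate * w.grad       # 1.0 - 0.1 * (-16) = 2.6

new_loss = (w * x - target) ** 2      # (2.6*2 - 6)^2 = 0.64
print(f"loss before: {loss.item():.2f}, after: {new_loss.item():.2f}")
```

One step moves the loss from 16 to about 0.64, because the update pushes `w` toward the value (3.0) that fits the target exactly.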

## How Autograd Works

PyTorch builds a **computational graph** tracking all operations:

```
x (requires_grad=True) → multiply by W → add b → loss
          ↓                    ↓            ↓
           gradients computed automatically
```

When you call `.backward()`, PyTorch:
1. Traverses the graph backwards (chain rule)
2. Computes gradients for all tensors with `requires_grad=True`
3. Stores gradients in the `.grad` attribute
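
The sketch below walks through those three steps on a tiny graph. The tensor values are made up for illustration; the point is that only tensors created with `requires_grad=True` end up with a populated `.grad`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])                        # input data: no gradient tracking
W = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)   # tracked weight
b = torch.tensor(0.1, requires_grad=True)                # tracked bias

# Forward pass builds the graph: multiply by W -> add b -> loss
prediction = (x * W).sum() + b
loss = (prediction - 1.0) ** 2

# Backward pass: traverse the graph and fill in .grad for tracked tensors
loss.backward()

print(f"W.grad = {W.grad}")   # d(loss)/dW = 2 * (prediction - 1) * x
print(f"b.grad = {b.grad}")   # d(loss)/db = 2 * (prediction - 1)
print(f"x.grad = {x.grad}")   # None: x was never marked requires_grad=True
```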

---

## Demo: Autograd in Action

The instructor will demonstrate:
- How to enable gradient tracking
- Computing gradients with `.backward()`
- **Why we need `zero_grad()`** (critical for training!)

In [None]:
# Demo: Simple Gradient Computation

# Create a tensor and tell PyTorch to track operations on it
x = torch.tensor([2.0], requires_grad=True)
print(f"x = {x}")
print(f"requires_grad: {x.requires_grad}\n")

# Perform some operations
# Let's compute y = 3x^2 + 2x + 1
y = 3 * x**2 + 2 * x + 1
print(f"y = 3x² + 2x + 1 = {y}")
print(f"y requires_grad: {y.requires_grad}\n")

# Compute gradients (dy/dx)
# Mathematically: dy/dx = 6x + 2 = 6(2) + 2 = 14
y.backward() # This computes gradients!

print(f"Gradient dy/dx = {x.grad}")
print(f"Expected: 6x + 2 = 6(2) + 2 = 14")
print("\n Autograd computed the derivative automatically!")

In [None]:
# Demo: Why zero_grad() is CRITICAL

print("COMMON MISTAKE: Forgetting to zero gradients\n")

# Create tensor
x = torch.tensor([3.0], requires_grad=True)

# First computation
y1 = x ** 2 # y1 = 9, dy1/dx = 2x = 6
y1.backward()
print(f"First computation: x² = {y1.item():.1f}, gradient = {x.grad.item():.1f}")

# Second computation WITHOUT zeroing gradients
y2 = x ** 3 # y2 = 27, dy2/dx = 3x² = 27
y2.backward() # This ADDS to existing gradient!
print(f"Second computation (no zero): x³ = {y2.item():.1f}, gradient = {x.grad.item():.1f}")
print("Wrong! Gradient accumulated: 6 + 27 = 33\n")

# Correct way: Zero gradients before each new computation
x = torch.tensor([3.0], requires_grad=True)
y1 = x ** 2
y1.backward()
print(f"First: gradient = {x.grad.item():.1f}")

x.grad.zero_() # Zero the gradients!
y2 = x ** 3
y2.backward()
print(f"Second (with zero): gradient = {x.grad.item():.1f}")
print("Correct! Gradient = 27\n")

print("KEY TAKEAWAY: Always zero gradients before computing new ones!")
print("In training loops: optimizer.zero_grad() does this for all model parameters.")

---

## Lab 2: Autograd Practice

### Exercise 2.1: Compute Gradients

Given the function `f(x) = x³ - 2x² + 5x - 1`:

1. Create a tensor `x = 4.0` with gradient tracking enabled
2. Compute `f(x)`
3. Compute the gradient `df/dx`
4. Verify your answer (derivative: `f'(x) = 3x² - 4x + 5`, so `f'(4) = 3(16) - 4(4) + 5 = 48 - 16 + 5 = 37`)

**Hints:**
- Use `requires_grad=True`
- Call `.backward()` on the result
- Access gradient with `.grad`

In [None]:
# Solution: Lab 2.1 - Compute Gradients

# 1. Create x with gradient tracking
x = torch.tensor([4.0], requires_grad=True)

# 2. Compute f(x) = x³ - 2x² + 5x - 1
f = x**3 - 2*x**2 + 5*x - 1

# 3. Compute gradient
f.backward()

# 4. Print results and verify against the analytic derivative
print(f"f(4) = {f.item():.1f}")
print(f"f'(4) = {x.grad.item():.1f}")
print("Expected: 37.0")
if abs(x.grad.item() - 37.0) < 0.01:
    print("\nCorrect gradient!")
else:
    print("\nGradient doesn't match. Check your computation.")

# Explanation: 
# f(x) = x³ - 2x² + 5x - 1
# f'(x) = 3x² - 4x + 5
# f'(4) = 3(16) - 4(4) + 5 = 48 - 16 + 5 = 37

### Exercise 2.2: Understanding Gradient Accumulation

**Task**: Demonstrate the gradient accumulation problem and fix it.

1. Create `x = 2.0` with gradient tracking
2. Compute `y1 = 5x²` and get the gradient (should be 20)
3. WITHOUT zeroing, compute `y2 = 3x` and get the gradient
4. Observe the accumulated gradient
5. Now zero the gradient and recompute `y2` to get the correct gradient (should be 3)

In [None]:
# Solution: Lab 2.2 - Understanding Gradient Accumulation

# Step 1: Create x
x = torch.tensor([2.0], requires_grad=True)

# Step 2: Compute y1 and gradient
y1 = 5 * x**2  # y1 = 5x², dy1/dx = 10x = 20
y1.backward()
print(f"After y1 = 5x²: gradient = {x.grad.item()}")  # Should be 20

# Step 3: Compute y2 WITHOUT zeroing
y2 = 3 * x  # y2 = 3x, dy2/dx = 3
y2.backward()  # This ADDS to existing gradient!
print(f"After y2 = 3x (no zero): gradient = {x.grad.item()}")  # Should be 23 (20+3)
print("This is WRONG for dy2/dx: the gradients accumulated (20 + 3)\n")

# Step 4: Zero gradients
x.grad.zero_()

# Step 5: Recompute y2
y2 = 3 * x
y2.backward()
print(f"After y2 = 3x (with zero): gradient = {x.grad.item()}")  # Should be 3
print("This is CORRECT (3.0)")

# Key Takeaway: Always zero gradients before computing new ones!
# In training loops, optimizer.zero_grad() does this for all parameters

---

# Topic 3: Building Neural Networks with nn.Module

## What is nn.Module?

PyTorch provides `nn.Module` as the base class for all neural networks. To create a custom neural network:

1. **Subclass nn.Module**: `class MyNetwork(nn.Module):`
2. **Define layers in `__init__`**: Create layers as attributes (they're automatically registered)
3. **Define forward pass in `forward()`**: Specify how data flows through the network

### Key Rules:

- **Always call `super().__init__()`** first in your `__init__` method
- **Define layers as attributes** in `__init__` (not in `forward`)
- **Implement `forward()`** to define the computation
- **DON'T create new layers in `forward`** (they won't be registered!)
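
A short sketch of why the last rule matters. The class names `RegisteredNet` and `UnregisteredNet` are hypothetical, chosen only for this illustration: a layer assigned as an attribute in `__init__` shows up in `parameters()`, while a layer built inside `forward()` does not, so an optimizer would never train it.

```python
import torch
import torch.nn as nn

class RegisteredNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)      # attribute assignment registers the layer

    def forward(self, x):
        return self.fc(x)

class UnregisteredNet(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        fc = nn.Linear(4, 2)           # anti-pattern: a fresh random layer on every call
        return fc(x)

good, bad = RegisteredNet(), UnregisteredNet()
print(len(list(good.parameters())))   # weight and bias are visible to an optimizer
print(len(list(bad.parameters())))    # nothing for an optimizer to update
```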

### For MNIST:

We'll build a simple feedforward network:
- **Input**: 784 pixels (28×28 flattened)
- **Hidden Layer**: 128 neurons with ReLU activation
- **Output**: 10 neurons (one per digit 0-9)
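
As a back-of-the-envelope check on the size of this architecture, the sketch below counts parameters by hand. It assumes each linear layer has one weight per input-output pair plus one bias per output, which is how `nn.Linear` is laid out by default.

```python
# Each linear layer holds (inputs * outputs) weights plus (outputs) biases
layer_sizes = [(784, 128), (128, 10)]   # (inputs, outputs) for the hidden and output layers

total = 0
for n_in, n_out in layer_sizes:
    params = n_in * n_out + n_out
    total += params
    print(f"{n_in} -> {n_out}: {params:,} parameters")

print(f"Total: {total:,}")
```

Nearly all of the parameters sit in the first layer, because it connects every one of the 784 pixels to every hidden neuron.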

---

## Demo: Building a Simple Network

The instructor will demonstrate:
- How to subclass nn.Module
- Defining layers in `__init__`
- Implementing the `forward()` method
- Moving the model to GPU/CPU

In [None]:
# Demo: Building a Simple MNIST Classifier

class SimpleMNISTNet(nn.Module):
    def __init__(self):
        super().__init__()  # MUST call parent constructor
        
        # Define layers as attributes (registered automatically)
        self.fc1 = nn.Linear(784, 128)  # Input: 784 pixels, Output: 128 features
        self.fc2 = nn.Linear(128, 10)   # Input: 128 features, Output: 10 classes
    
    def forward(self, x):
        # Define forward pass computation
        x = x.view(x.size(0), -1)  # Flatten: (batch, 1, 28, 28) -> (batch, 784)
        x = torch.relu(self.fc1(x))  # Apply first layer + ReLU activation
        x = self.fc2(x)  # Apply second layer (no activation - raw logits)
        return x

# Create model and move to device
model = SimpleMNISTNet().to(device)
print(model)
print(f"\nModel is on: {next(model.parameters()).device}")

In [None]:
# Demo: Understanding Model Parameters

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Test with dummy data
dummy_input = torch.randn(5, 1, 28, 28).to(device) # Batch of 5 images
output = model(dummy_input)
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}") # Should be (5, 10)
print(f"\n Output has 10 values per image - one score for each digit (0-9)")

---

## Lab 3: Build Your Own Neural Network

**Task**: Create a deeper network with architecture **784 → 256 → 128 → 10**

Requirements:
1. Create a class called `DeepMNISTNet` that subclasses `nn.Module`
2. Define three linear layers:
   - `fc1`: 784 → 256
   - `fc2`: 256 → 128
   - `fc3`: 128 → 10
3. In the forward pass:
   - Flatten the input
   - Apply fc1, then ReLU
   - Apply fc2, then ReLU
   - Apply fc3 (no activation)
4. Create an instance and test it with dummy data

**Hints:**
- Follow the pattern from `SimpleMNISTNet`
- Use `torch.relu()` for activations
- Don't forget to call `super().__init__()`

In [None]:
# Solution: Lab 3 - Build Your Own Neural Network

class DeepMNISTNet(nn.Module):
    def __init__(self):
        super().__init__()  # MUST call parent constructor
        
        # Define three linear layers: 784 → 256 → 128 → 10
        self.fc1 = nn.Linear(784, 256)  # Input to first hidden layer
        self.fc2 = nn.Linear(256, 128)  # First to second hidden layer
        self.fc3 = nn.Linear(128, 10)   # Second hidden layer to output
    
    def forward(self, x):
        # Flatten input: (batch, 1, 28, 28) -> (batch, 784)
        x = x.view(x.size(0), -1)
        
        # First layer + ReLU activation
        x = torch.relu(self.fc1(x))
        
        # Second layer + ReLU activation
        x = torch.relu(self.fc2(x))
        
        # Output layer (no activation - raw logits for CrossEntropyLoss)
        x = self.fc3(x)
        
        return x

# Test your model
deep_model = DeepMNISTNet().to(device)
test_input = torch.randn(3, 1, 28, 28).to(device)
test_output = deep_model(test_input)

print(f"Input shape: {test_input.shape}")
print(f"Output shape: {test_output.shape}")  # Should be (3, 10)

# Count parameters
total_params = sum(p.numel() for p in deep_model.parameters())
print(f"\nTotal parameters: {total_params:,}")
print(" Model built successfully!")

---

# Topic 4: Complete Training Loop

## Components of Training

To train a neural network, we need:

1. **DataLoader**: Splits the dataset into batches and, if requested, shuffles it each epoch
2. **Loss Function**: Measures how wrong our predictions are
3. **Optimizer**: Updates weights to minimize loss
4. **Training Loop**: Repeats the process for multiple epochs
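
To see what a DataLoader does before MNIST is involved, here is a minimal sketch on a toy dataset. The 10 random "images" and the batch size of 4 are assumptions picked so the batching is easy to follow.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for MNIST: 10 random "images" with integer labels 0-9
features = torch.randn(10, 1, 28, 28)
targets = torch.arange(10)
toy_dataset = TensorDataset(features, targets)

toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

print(f"Number of batches: {len(toy_loader)}")   # ceil(10 / 4)
for images, labels in toy_loader:
    # The last batch is smaller because 10 is not a multiple of 4
    print(images.shape, labels.tolist())
```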

### The Training Loop Pattern

**CRITICAL PATTERN** - You'll use this for every neural network:

```python
for epoch in range(num_epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()              # 1. Zero gradients
        outputs = model(images)            # 2. Forward pass
        loss = criterion(outputs, labels)  # 3. Compute loss
        loss.backward()                    # 4. Backward pass (compute gradients)
        optimizer.step()                   # 5. Update weights
```

### For MNIST:

- **Loss Function**: `nn.CrossEntropyLoss()` (applies log-softmax internally, so the model should output raw logits; the standard choice for multi-class classification)
- **Optimizer**: `Adam` (adapts the learning rate per parameter; a dependable default for beginners)
- **Batch Size**: 64 (a common choice for MNIST)
- **Learning Rate**: 0.001 (good default for Adam)
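
The sketch below illustrates why the network's last layer has no activation: `nn.CrossEntropyLoss` already combines log-softmax with negative log-likelihood. The logits and labels are made-up values for a 3-class problem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])   # raw scores for 2 samples, 3 classes
labels = torch.tensor([0, 2])               # correct class index per sample

loss_ce = nn.CrossEntropyLoss()(logits, labels)

# Equivalent two-step version: log-softmax, then negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(f"CrossEntropyLoss:       {loss_ce.item():.4f}")
print(f"log_softmax + nll_loss: {loss_manual.item():.4f}")
```

Both lines print the same value, so adding your own softmax before this loss would apply it twice.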

---

## Demo: Full Training Loop

The instructor will demonstrate:
- Setting up DataLoaders
- Creating loss function and optimizer
- The complete training loop
- Tracking training progress

In [None]:
# Demo: Preparing MNIST Data

# Define transforms (convert to tensor and normalize)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])

# Load training and test datasets
train_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

test_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

# Create DataLoaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Training batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")

# Visualize a batch
images, labels = next(iter(train_loader))
print(f"\nBatch shape: {images.shape}")  # (64, 1, 28, 28)
print(f"Labels shape: {labels.shape}")  # (64,)

In [None]:
# Demo: Full Training Loop for MNIST

# 1. Create model, loss, optimizer
model = SimpleMNISTNet().to(device)
criterion = nn.CrossEntropyLoss()  # Combines log-softmax + negative log-likelihood
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# 2. Training loop
num_epochs = 5
for epoch in range(num_epochs):
    model.train()  # Set to training mode
    running_loss = 0.0
    
    for batch_idx, (images, labels) in enumerate(train_loader):
        # Move data to device
        images, labels = images.to(device), labels.to(device)
        
        # Training step pattern: zero → forward → loss → backward → step
        optimizer.zero_grad()        # 1. Zero gradients
        outputs = model(images)      # 2. Forward pass
        loss = criterion(outputs, labels)  # 3. Compute loss
        loss.backward()              # 4. Backward pass (compute gradients)
        optimizer.step()             # 5. Update weights
        
        running_loss += loss.item()
        
        # Print progress every 200 batches
        if batch_idx % 200 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Batch [{batch_idx}/{len(train_loader)}], Loss: {loss.item():.4f}')
    
    avg_loss = running_loss / len(train_loader)
    print(f'Epoch [{epoch+1}/{num_epochs}] Average Loss: {avg_loss:.4f}\n')

print(" Training complete!")

---

## Lab 4: Train Your Deep Network

**Tasks**:
1. Train your `DeepMNISTNet` for 5 epochs
2. Track loss values in a list and plot the training curve
3. Save the trained model to `'mnist_model.pth'`

**Requirements**:
- Use the same DataLoaders we created above
- Use `CrossEntropyLoss` and `Adam` optimizer (lr=0.001)
- Follow the training loop pattern: `zero_grad → forward → loss → backward → step`
- Store loss values and plot them

**Hints:**
- Create empty list: `loss_history = []`
- Append losses: `loss_history.append(avg_loss)`
- Plot: `plt.plot(loss_history)`
- Save model: `torch.save(model.state_dict(), 'mnist_model.pth')`
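
Saving is only half of the round trip. A minimal sketch of saving and reloading a `state_dict` follows; it uses a small `nn.Linear` as a stand-in for a trained model and a temporary file path, both chosen only for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

trained = nn.Linear(4, 2)                     # stand-in for a trained model
path = os.path.join(tempfile.mkdtemp(), "demo_model.pth")
torch.save(trained.state_dict(), path)        # saves the weights only, not the class

restored = nn.Linear(4, 2)                    # rebuild the same architecture first
restored.load_state_dict(torch.load(path))    # then copy the saved weights into it
restored.eval()                               # switch to evaluation mode before inference

print(torch.equal(trained.weight, restored.weight))
```

Because only the weights are stored, the class definition (here `nn.Linear`, in the lab `DeepMNISTNet`) must be available when loading.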

In [None]:
# Solution: Lab 4 - Train Your Deep Network

# 1. Create model, loss, optimizer
deep_model = DeepMNISTNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(deep_model.parameters(), lr=0.001)

# 2. Training loop
loss_history = []
num_epochs = 5

for epoch in range(num_epochs):
    deep_model.train()  # Set to training mode
    running_loss = 0.0
    
    for batch_idx, (images, labels) in enumerate(train_loader):
        # Move data to device
        images, labels = images.to(device), labels.to(device)
        
        # Training step pattern: zero_grad → forward → loss → backward → step
        optimizer.zero_grad()           # 1. Zero gradients
        outputs = deep_model(images)    # 2. Forward pass
        loss = criterion(outputs, labels)  # 3. Compute loss
        loss.backward()                 # 4. Backward pass (compute gradients)
        optimizer.step()                # 5. Update weights
        
        running_loss += loss.item()
    
    # Calculate average loss for this epoch
    avg_loss = running_loss / len(train_loader)
    loss_history.append(avg_loss)
    print(f'Epoch [{epoch+1}/{num_epochs}], Average Loss: {avg_loss:.4f}')

print("\n Training complete!")

# 3. Plot training loss
plt.plot(loss_history, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Average Loss')
plt.title('Training Loss Over Time')
plt.grid(True)
plt.show()

# 4. Save model
torch.save(deep_model.state_dict(), 'mnist_model.pth')
print("\n Model saved to 'mnist_model.pth'!")

---

# Topic 5: Evaluation & Analysis

## Evaluating Neural Networks

After training, we need to:
1. **Switch to evaluation mode**: `model.eval()` (turns off dropout; batch norm uses its stored running statistics)
2. **Disable gradient computation**: `torch.no_grad()` (saves memory and computation)
3. **Compute accuracy** on test data
4. **Analyze where the model fails** (confusion matrix)
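
Step 3 can be sketched on a handful of made-up logits: the predicted class is the index of the largest score in each row, and accuracy is the fraction of predictions that match the labels.

```python
import torch

logits = torch.tensor([[2.0, 1.0, 0.0],
                       [0.0, 3.0, 1.0],
                       [1.0, 0.0, 5.0],
                       [4.0, 0.0, 1.0]])    # 4 samples, 3 classes
labels = torch.tensor([0, 1, 2, 1])

# torch.max along dim=1 returns (max values, their indices); the indices are the predictions
_, predicted = torch.max(logits, 1)

correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)
print(f"Predicted: {predicted.tolist()}, accuracy: {accuracy:.1f}%")
```

The last sample is predicted as class 0 but labeled 1, so three of four are correct (75.0%).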

### Why model.eval() and torch.no_grad()?

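- `model.eval()`: Changes layer behavior for inference (dropout is turned off; batch norm uses running statistics). It does not turn off gradient tracking.
- `torch.no_grad()`: Skips building the computational graph. We're not training, so this saves memory and time.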
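
The two switches are independent, which a small sketch can show. It uses a standalone `nn.Dropout` layer and a single tracked tensor as stand-ins for a full model; the seed and dropout probability are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()
train_out = layer(x)      # some values zeroed at random, survivors scaled by 1/(1-p) = 2

layer.eval()
eval_out = layer(x)       # dropout is a no-op in evaluation mode

w = torch.tensor(3.0, requires_grad=True)
with torch.no_grad():
    y = w * 2             # no graph is recorded inside this block

print(f"train mode: {train_out.tolist()}")
print(f"eval mode:  {eval_out.tolist()}")
print(f"y.requires_grad: {y.requires_grad}")
```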

### Confusion Matrix

Shows which digits are confused with which:
- Diagonal: Correct predictions
- Off-diagonal: Mistakes (e.g., predicted 4 when true label was 9)
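
A tiny 3-class example makes the layout concrete. This is a NumPy-only sketch with made-up labels; it follows the same convention as sklearn's `confusion_matrix`, where rows are true labels and columns are predicted labels.

```python
import numpy as np

true_labels = np.array([0, 1, 2, 2, 1, 0])
pred_labels = np.array([0, 2, 2, 2, 1, 0])
num_classes = 3

# Rows are true labels, columns are predicted labels
cm = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(cm, (true_labels, pred_labels), 1)
print(cm)

# Per-class accuracy: diagonal (correct) divided by row totals (all true examples of that class)
per_class_acc = cm.diagonal() / cm.sum(axis=1)
print(per_class_acc)
```

The single off-diagonal entry, `cm[1, 2]`, records the one example whose true label was 1 but was predicted as 2.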

---

## Demo: Model Evaluation

The instructor will demonstrate:
- Computing test accuracy
- Creating confusion matrix
- Visualizing predictions and mistakes

In [None]:
# Demo: Evaluating the Model

model.eval()  # Set to evaluation mode (dropout off, batch norm uses running statistics)
correct = 0
total = 0

with torch.no_grad():  # Don't compute gradients during evaluation
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)  # Get class with highest score
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f'Test Accuracy: {accuracy:.2f}%')
print(f'\n Goal was ~95% accuracy. How did we do?')

In [None]:
# Demo: Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Collect all predictions
all_preds = []
all_labels = []

model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.numpy())

# Compute confusion matrix
cm = confusion_matrix(all_labels, all_preds)

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=range(10))
disp.plot(cmap='Blues', values_format='d')
plt.title('MNIST Confusion Matrix')
plt.show()

# Per-class accuracy
print("\nPer-digit accuracy:")
class_correct = cm.diagonal()
class_total = cm.sum(axis=1)
for digit in range(10):
    acc = 100 * class_correct[digit] / class_total[digit]
    print(f'Digit {digit}: {acc:.1f}% ({class_correct[digit]}/{class_total[digit]})')

In [None]:
# Demo: Visualize Predictions (Correct and Incorrect)

# Get one batch
images, labels = next(iter(test_loader))
images_on_device = images.to(device)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(images_on_device)
    _, predicted = torch.max(outputs, 1)

# Move back to CPU for visualization
predicted = predicted.cpu()

# Show first 10 examples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(images[i].squeeze(), cmap='gray')
    true_label = labels[i].item()
    pred_label = predicted[i].item()
    color = 'green' if true_label == pred_label else 'red'
    ax.set_title(f'True: {true_label}, Pred: {pred_label}', color=color)
    ax.axis('off')

plt.tight_layout()
plt.show()

print("\n Green = correct prediction, Red = incorrect prediction")

---

## Lab 5: Analyze Your Model

**Tasks**:
1. Evaluate your `DeepMNISTNet` on test data and compute accuracy
2. Generate a confusion matrix
3. Find and visualize 10 misclassified examples
4. Analyze which digit pairs are most confused (e.g., 4 vs 9, 5 vs 3)

**Requirements**:
- Use `model.eval()` and `torch.no_grad()`
- Collect all predictions and labels
- Use sklearn's `confusion_matrix`
- Identify where predicted != true

**Questions to answer**:
- What's your model's accuracy?
- Which digit is hardest to classify?
- Which pairs of digits are most confused?
- Can you identify patterns in the mistakes?

In [None]:
# Solution: Lab 5 - Analyze Your Model

# 1. Compute test accuracy
deep_model.eval()
correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = deep_model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%\n")

# 2. Generate confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

all_preds = []
all_labels = []

deep_model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = deep_model(images)
        _, predicted = torch.max(outputs, 1)
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.numpy())

cm = confusion_matrix(all_labels, all_preds)
disp = ConfusionMatrixDisplay(cm, display_labels=range(10))
disp.plot(cmap='Blues', values_format='d')
plt.title('Deep MNIST Model - Confusion Matrix')
plt.show()

# 3. Find misclassified examples
misclassified_images = []
misclassified_true = []
misclassified_pred = []

deep_model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        images_on_device = images.to(device)
        outputs = deep_model(images_on_device)
        _, predicted = torch.max(outputs, 1)
        predicted = predicted.cpu()
        
        # Find misclassified in this batch
        for i in range(len(labels)):
            if predicted[i] != labels[i]:
                misclassified_images.append(images[i])
                misclassified_true.append(labels[i].item())
                misclassified_pred.append(predicted[i].item())
                
                if len(misclassified_images) >= 10:
                    break
        
        if len(misclassified_images) >= 10:
            break

# 4. Visualize misclassified examples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.axis('off')
    if i >= len(misclassified_images):
        continue  # Fewer than 10 mistakes were found: leave remaining panels blank
    ax.imshow(misclassified_images[i].squeeze(), cmap='gray')
    ax.set_title(f'True: {misclassified_true[i]}, Pred: {misclassified_pred[i]}', color='red')
plt.suptitle('Misclassified Examples', fontsize=16, fontweight='bold')
plt.show()

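# 5. Analyze confusion - which digits are most commonly confused?
print("\nMost Common Confusions:")
# Zero the diagonal so only misclassifications remain, then rank the largest counts.
# Ranking (instead of a fixed count threshold) always reports the top pairs,
# even when an accurate model makes few mistakes.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
top_k = 5
for flat_idx in np.argsort(off_diag, axis=None)[::-1][:top_k]:
    i, j = np.unravel_index(flat_idx, off_diag.shape)
    if off_diag[i, j] > 0:
        print(f"  Digit {i} misclassified as {j}: {off_diag[i, j]} times")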

# Per-digit accuracy
print("\nPer-digit accuracy:")
class_correct = cm.diagonal()
class_total = cm.sum(axis=1)
for digit in range(10):
    acc = 100 * class_correct[digit] / class_total[digit]
    print(f'Digit {digit}: {acc:.1f}% ({class_correct[digit]}/{class_total[digit]})')

---

# Optional Labs (Complete at Home)

These labs are for students who finish early or want extra practice. They explore different aspects of PyTorch and neural networks.

## Optional Lab A: Iris Dataset Classification

The Iris dataset is a classic classification problem with only 4 features and 3 classes. Build a simple network to classify iris flowers.

**Challenge**: Achieve >95% accuracy on this smaller dataset.

**Steps**:
1. Load the Iris dataset from sklearn
2. Split into train/test and normalize features
3. Convert to PyTorch tensors
4. Build a network: 4 inputs → hidden layer(s) → 3 outputs
5. Train and evaluate

**Hints**:
- Use `sklearn.datasets.load_iris()`
- Network can be simpler than MNIST (e.g., 4 → 10 → 3)
- Fewer epochs needed (dataset is small)
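
With only 150 samples, the way the data is split matters. The sketch below is an optional variation on step 2. It passes `stratify=y` to `train_test_split` so every class is equally represented in the test set, and it fits the scaler on the training split only. The variable names (`X_tr`, `X_te`, and so on) are placeholders chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)

# stratify=y_all keeps the 3 classes balanced in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)
print(np.bincount(y_te))  # [10 10 10]

# Fit the scaler on training data only, then reuse its statistics for the test set
# so no information from the test set leaks into preprocessing
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for every feature
```

The test-set means will generally not be exactly zero after scaling, which is expected because the scaler never saw those samples.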

In [None]:
# Solution: Optional Lab A - Iris Classification

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

print(f"Training samples: {len(X_train_tensor)}")
print(f"Test samples: {len(X_test_tensor)}")
print(f"Features: {X_train_tensor.shape[1]}")
print(f"Classes: {len(set(y.tolist()))}")

# 1. Build network: 4 inputs -> hidden layer(s) -> 3 outputs
class IrisNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 16)   # Input to hidden layer
        self.fc2 = nn.Linear(16, 8)   # First to second hidden layer
        self.fc3 = nn.Linear(8, 3)    # Hidden to output layer
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)  # No activation for logits
        return x

# Create model
iris_model = IrisNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(iris_model.parameters(), lr=0.01)

# 2. Train the network
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    outputs = iris_model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    
    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if (epoch + 1) % 20 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# 3. Evaluate accuracy
iris_model.eval()
with torch.no_grad():
    outputs = iris_model(X_test_tensor)
    _, predicted = torch.max(outputs, 1)
    accuracy = (predicted == y_test_tensor).sum().item() / len(y_test_tensor)
    print(f'\nTest Accuracy: {accuracy * 100:.2f}%')
    
    if accuracy > 0.95:
        print("Goal achieved: >95% accuracy!")

---

# Congratulations!

You've completed Week 1: PyTorch Basics!

## What You've Learned

- PyTorch tensors and operations
- Automatic differentiation with autograd
- Building neural networks with `nn.Module`
- Complete training loops with proper patterns
- Model evaluation and analysis
- Real-world application: Handwritten digit recognition

## Key Takeaways

1. **Always call `optimizer.zero_grad()`** before computing new gradients
2. **Training loop pattern**: `zero_grad → forward → loss → backward → step`
3. **Use `.to(device)`** on both the model and each batch so the same code runs on CPU or GPU
4. **`model.eval()` and `torch.no_grad()`** for evaluation
5. **Confusion matrix** helps identify where your model struggles
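
The first four takeaways fit into one short script. The sketch below runs them on synthetic data (a made-up two-class problem, not MNIST), and all names such as `toy_model` and `toy_optimizer` are placeholders for this example only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy data: 64 samples, 4 features, label = 1 when the features sum to > 0
X_toy = torch.randn(64, 4)
y_toy = (X_toy.sum(dim=1) > 0).long()
X_toy, y_toy = X_toy.to(device), y_toy.to(device)  # Takeaway 3: data on the device

toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).to(device)
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)

toy_model.train()
losses = []
for step in range(50):
    toy_optimizer.zero_grad()             # Takeaway 1: clear old gradients
    logits = toy_model(X_toy)             # forward
    loss = toy_criterion(logits, y_toy)   # loss
    loss.backward()                       # backward
    toy_optimizer.step()                  # step
    losses.append(loss.item())

# Takeaway 4: evaluation mode, no gradient tracking
toy_model.eval()
with torch.no_grad():
    toy_preds = toy_model(X_toy).argmax(dim=1)
    toy_acc = (toy_preds == y_toy).float().mean().item()

print(f"First loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"Accuracy on the toy training data: {toy_acc * 100:.1f}%")
```

The accuracy here is measured on the training data itself, so it only confirms that the loop is wired correctly. It says nothing about generalization.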

## Next Steps

- **Week 2**: CNNs & RNNs - Transfer learning and sequence modeling
- **Practice**: Complete the optional Iris lab
- **Challenge**: Try to improve your MNIST model to >97% accuracy
  - Hint: Try adding more layers, dropout, or data augmentation
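
As a starting point for the dropout hint, here is a minimal sketch of a feedforward network with `nn.Dropout`. The class name `DropoutMNISTNet`, the layer sizes (784 → 256 → 128 → 10), and the dropout rate of 0.2 are assumptions chosen for illustration, so they will not necessarily match your `DeepMNISTNet`.

```python
import torch
import torch.nn as nn

class DropoutMNISTNet(nn.Module):  # hypothetical name and layer sizes
    def __init__(self, p=0.2):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(p)  # randomly zeroes activations during training

    def forward(self, x):
        x = self.flatten(x)
        x = self.dropout(torch.relu(self.fc1(x)))
        x = self.dropout(torch.relu(self.fc2(x)))
        return self.fc3(x)  # raw logits, as expected by nn.CrossEntropyLoss

sketch_model = DropoutMNISTNet()
dummy_batch = torch.randn(8, 1, 28, 28)  # random stand-in for 8 MNIST images

# In eval mode dropout is disabled, so repeated forward passes agree
sketch_model.eval()
with torch.no_grad():
    out_a = sketch_model(dummy_batch)
    out_b = sketch_model(dummy_batch)

print(out_a.shape)                 # torch.Size([8, 10])
print(torch.equal(out_a, out_b))   # True
```

Dropout is the reason `model.eval()` matters for this kind of network: in training mode the two forward passes above would generally differ, because a different random subset of activations is zeroed each time.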

## Resources

- [PyTorch Official Tutorials](https://pytorch.org/tutorials/)
- [PyTorch Documentation](https://pytorch.org/docs/)
- [MNIST Dataset Info](http://yann.lecun.com/exdb/mnist/)

---

**Great work! See you next week for CNNs and RNNs!**