# ML Lab 07: Train a Neural Network

You've been using scikit-learn for everything. It's great for tabular data and text.
But try feeding raw pixels into logistic regression and watch it struggle.

Images have **spatial structure** -- edges, textures, shapes -- that flat feature vectors
can't capture. In this lab, you'll build your first neural network in PyTorch, train it
on real images, see why CNNs beat feedforward networks, and learn to checkpoint your
training so a crash doesn't erase hours of work.

---

## Section 1: Why scikit-learn Isn't Enough

Let's load the **CIFAR-10** dataset: 60,000 tiny (32x32) color images across 10 classes.
We'll flatten each image into a vector of 3,072 numbers (32 x 32 x 3 channels) and
feed it to scikit-learn's logistic regression.

Spoiler: it will get around 40% accuracy. Better than random (10%), but terrible.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Use CPU explicitly (no GPU needed for this lab)
device = torch.device('cpu')
print(f"Using device: {device}")

# Load CIFAR-10
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform
)

CLASSES = ('airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')

print(f"Training images: {len(train_dataset)}")
print(f"Test images:     {len(test_dataset)}")
print(f"Image shape:     {train_dataset[0][0].shape} (channels, height, width)")
print(f"Classes:         {CLASSES}")

In [None]:
# Visualize some examples
fig, axes = plt.subplots(2, 8, figsize=(16, 4))
for i, ax in enumerate(axes.flat):
    image, label = train_dataset[i]
    # Undo normalization for display
    image = image * 0.5 + 0.5
    ax.imshow(image.permute(1, 2, 0).numpy())
    ax.set_title(CLASSES[label], fontsize=9)
    ax.axis('off')
plt.suptitle('CIFAR-10 Sample Images', fontsize=14)
plt.tight_layout()
plt.show()

print("These are 32x32 pixel images. Tiny, but enough to learn from.")

In [None]:
# Now try scikit-learn: flatten pixels -> logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Use pixel values scaled to [0, 1] (no mean/std normalization) for sklearn
raw_transform = transforms.Compose([transforms.ToTensor()])
raw_train = torchvision.datasets.CIFAR10(root='./data', train=True, download=False, transform=raw_transform)
raw_test = torchvision.datasets.CIFAR10(root='./data', train=False, download=False, transform=raw_transform)

# Flatten: (3, 32, 32) -> (3072,)
# Use a subset for speed (full dataset would take too long with sklearn)
n_train = 5000
n_test = 1000

X_train_flat = torch.stack([raw_train[i][0] for i in range(n_train)]).reshape(n_train, -1).numpy()
y_train_flat = np.array([raw_train[i][1] for i in range(n_train)])
X_test_flat = torch.stack([raw_test[i][0] for i in range(n_test)]).reshape(n_test, -1).numpy()
y_test_flat = np.array([raw_test[i][1] for i in range(n_test)])

print(f"Flattened shape: {X_train_flat.shape} (samples, pixels)")
print(f"Each image is now a flat vector of {X_train_flat.shape[1]} numbers.")
print("All spatial structure is destroyed.")
print("\nTraining logistic regression on raw pixels...")

clf = LogisticRegression(max_iter=1000, random_state=42, solver='saga')
clf.fit(X_train_flat, y_train_flat)

sklearn_acc = accuracy_score(y_test_flat, clf.predict(X_test_flat))
print(f"\nscikit-learn accuracy on CIFAR-10: {sklearn_acc:.3f}")
print(f"Random baseline (10 classes): 0.100")
print(f"\nConclusion: {sklearn_acc*100:.0f}% is better than random, but terrible.")
print("A linear model on flattened pixels can't learn spatial patterns.")
print("That's what neural networks are for.")

**Why did logistic regression fail?** It gives each pixel its own weight and simply adds
up the results. It has no notion that pixel (5,5) is *next to* pixel (5,6), so it can't
learn edges, textures, or shapes. It's like trying to understand a sentence by looking at
each letter in isolation.

Neural networks solve this by learning **intermediate representations** -- hidden layers
that combine raw inputs into increasingly abstract features.
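
The point about a linear model having no notion of pixel adjacency can be checked with a few lines of plain Python. The sketch below uses made-up pixel values and weights (not the trained `clf`): it scrambles the pixel order, scrambles the weights the same way, and gets the same score. In other words, a linear model would fit a consistently shuffled dataset just as well as the original.

```python
import math
import random

random.seed(0)

n_pixels = 3072  # 3 channels x 32 x 32, as in the flattened CIFAR-10 images

# Made-up stand-ins for one flattened image and one class's weight vector.
pixels = [random.random() for _ in range(n_pixels)]
weights = [random.uniform(-1, 1) for _ in range(n_pixels)]


def linear_score(x, w):
    """Weighted sum of inputs: the core of a linear classifier's score for one class."""
    return sum(xi * wi for xi, wi in zip(x, w))


# Scramble the pixel positions with one fixed permutation.
perm = list(range(n_pixels))
random.shuffle(perm)
shuffled_pixels = [pixels[i] for i in perm]
shuffled_weights = [weights[i] for i in perm]

original = linear_score(pixels, weights)
scrambled = linear_score(shuffled_pixels, shuffled_weights)

# Same score up to floating-point rounding: the pixel layout carries no meaning here.
print(math.isclose(original, scrambled, rel_tol=1e-9, abs_tol=1e-9))
```

A convolutional layer, by contrast, would behave very differently on the scrambled image, because its filters only look at neighboring pixels.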

---

## Section 2: Your First Neural Network

A **feedforward neural network** (also called a multi-layer perceptron) stacks multiple
layers of linear transformations with non-linear activations between them.

Our architecture:
```
Input (3072) -> Linear(512) -> ReLU -> Linear(256) -> ReLU -> Linear(10)
```

- **Input**: 3072 = 3 channels x 32 x 32 pixels (flattened)
- **Hidden layer 1**: 512 neurons with ReLU activation
- **Hidden layer 2**: 256 neurons with ReLU activation
- **Output**: 10 neurons (one per class, no activation -- CrossEntropyLoss handles softmax)
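
Why the output layer needs no activation becomes clearer once the loss is written out by hand. The following is a plain-Python sketch of what cross-entropy computes for a single example, not PyTorch's actual implementation; `softmax` and `cross_entropy` are illustrative helper names.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


# Equal scores for all 10 classes: the model is just guessing.
uniform_logits = [0.0] * 10
print(f"{cross_entropy(uniform_logits, target=3):.4f}")  # ln(10), about 2.3026

# A confident, correct prediction gives a loss close to zero.
confident_logits = [0.0] * 10
confident_logits[3] = 10.0
print(f"{cross_entropy(confident_logits, target=3):.4f}")
```

A practical consequence: with 10 classes, a training loss near 2.30 in the first few batches means the network is still guessing.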

In [None]:
import torch.nn as nn


class FeedforwardNet(nn.Module):
    """A simple 3-layer feedforward neural network."""

    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(3 * 32 * 32, 512),   # 3072 -> 512
            nn.ReLU(),
            nn.Linear(512, 256),             # 512 -> 256
            nn.ReLU(),
            nn.Linear(256, 10),              # 256 -> 10 (one per class)
        )

    def forward(self, x):
        x = self.flatten(x)  # (batch, 3, 32, 32) -> (batch, 3072)
        return self.layers(x)


model = FeedforwardNet().to(device)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters:     {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"\nThat's {total_params:,} numbers the network needs to learn.")
print("Compare this to logistic regression, which had ~30,000 coefficients.")

In [None]:
# Let's see what a forward pass looks like with random data
dummy_input = torch.randn(4, 3, 32, 32).to(device)  # batch of 4 images
dummy_output = model(dummy_input)

print(f"Input shape:  {dummy_input.shape}  (batch, channels, height, width)")
print(f"Output shape: {dummy_output.shape}  (batch, num_classes)")
print(f"\nRaw outputs (logits) for first image:")
print(f"  {dummy_output[0].detach().cpu().numpy()}")
print(f"\nThese are raw scores, not probabilities. The highest score is the prediction.")
print(f"Predicted class: {CLASSES[dummy_output[0].argmax().item()]}")
print(f"\n(Random weights = random predictions. We haven't trained yet.)")

---

## Section 3: The Training Loop

The training loop is the heart of deep learning. After clearing the gradients left over
from the previous batch, every iteration does four things:

1. **Forward pass**: Feed data through the network to get predictions
2. **Compute loss**: Measure how wrong the predictions are (CrossEntropyLoss)
3. **Backward pass**: Compute gradients (how to adjust each parameter to reduce loss)
4. **Optimizer step**: Update parameters using the gradients

```
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()              # Clear old gradients
        outputs = model(images)            # Forward pass
        loss = criterion(outputs, labels)  # Compute loss
        loss.backward()                    # Backward pass (compute gradients)
        optimizer.step()                   # Update weights
```

That's it. Almost every neural network trained with gradient descent follows this pattern.
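
To see the same four steps without any framework, here is a minimal sketch that fits a single weight `w` so that `w * x` matches `y`. The data, learning rate, and step count are made up for illustration, and the gradient is derived by hand rather than by autograd.

```python
# Toy data generated by y = 3 * x, so the ideal weight is 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0    # the single "parameter" of this model
lr = 0.01  # learning rate

for step in range(200):
    # 1. Forward pass: predictions from the current weight
    preds = [w * x for x in xs]
    # 2. Compute loss: mean squared error
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward pass: d(loss)/dw, worked out by hand
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Optimizer step: move against the gradient
    w -= lr * grad

print(f"w = {w:.4f}, loss = {loss:.6f}")
```

There is no `zero_grad()` here because `grad` is recomputed from scratch on every step. PyTorch instead accumulates gradients into each parameter's `.grad`, which is why the real loop clears them first.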

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm
import time

# Create data loaders (batch the data)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

print(f"Training batches: {len(train_loader)} (each with up to 64 images)")
print(f"Test batches:     {len(test_loader)}")

# Reset model with fresh weights
model = FeedforwardNet().to(device)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

print(f"\nLoss function: CrossEntropyLoss (combines LogSoftmax + NLLLoss)")
print(f"Optimizer:     SGD with lr=0.01")
print(f"\nReady to train!")

In [None]:
# The training loop -- this is where the magic happens
NUM_EPOCHS = 10

# Track metrics for plotting later
history = {
    'train_loss': [],
    'train_acc': [],
    'test_acc': [],
    'batch_losses': [],
}

start_time = time.time()

for epoch in range(NUM_EPOCHS):
    # --- Training phase ---
    model.train()  # Set model to training mode
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()               # Clear old gradients
        outputs = model(images)             # 1. Forward pass
        loss = criterion(outputs, labels)   # 2. Compute loss
        loss.backward()                     # 3. Backward pass
        optimizer.step()                    # 4. Optimizer step

        running_loss += loss.item()
        history['batch_losses'].append(loss.item())
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    train_loss = running_loss / len(train_loader)
    train_acc = correct / total
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)

    # --- Evaluation phase ---
    model.eval()  # Set model to evaluation mode
    test_correct = 0
    test_total = 0

    with torch.no_grad():  # No gradients needed for evaluation
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            test_total += labels.size(0)
            test_correct += predicted.eq(labels).sum().item()

    test_acc = test_correct / test_total
    history['test_acc'].append(test_acc)

    elapsed = time.time() - start_time
    print(f"Epoch {epoch+1:2d}/{NUM_EPOCHS} | "
          f"Loss: {train_loss:.3f} | "
          f"Train Acc: {train_acc*100:.1f}% | "
          f"Test Acc: {test_acc*100:.1f}% | "
          f"Time: {elapsed:.0f}s")

total_time = time.time() - start_time
print(f"\nTraining complete in {total_time:.0f} seconds.")
print(f"Final test accuracy: {history['test_acc'][-1]*100:.1f}%")
print(f"\nCompare to scikit-learn's {sklearn_acc*100:.0f}% -- already better!")

**What just happened?** The network started with random weights and gradually learned to
classify images by adjusting its ~1.7 million parameters through gradient descent.
Over the epochs, loss should trend down and accuracy up, even if an individual epoch wobbles.
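
The ~1.7 million figure can be checked by hand. This sketch assumes the `FeedforwardNet` layer sizes shown earlier (3072 -> 512 -> 256 -> 10); `linear_params` is an illustrative helper, not a PyTorch function.

```python
def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out


layer_sizes = [(3 * 32 * 32, 512), (512, 256), (256, 10)]
per_layer = [linear_params(n_in, n_out) for n_in, n_out in layer_sizes]

for (n_in, n_out), count in zip(layer_sizes, per_layer):
    print(f"Linear({n_in} -> {n_out}): {count:,}")
print(f"Total: {sum(per_layer):,}")  # 1,707,274

# Logistic regression on the same input is a single 3072 -> 10 linear layer.
print(f"Logistic regression: {linear_params(3072, 10):,}")  # 30,730
```

Almost all of the parameters sit in the first layer, which connects every one of the 3,072 input values to every one of the 512 hidden neurons.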

---

## Section 4: Watch It Learn

Numbers are good, but plots are better. Let's visualize the training process
and look for signs of overfitting.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Batch loss curve (every batch)
axes[0].plot(history['batch_losses'], alpha=0.3, color='blue', linewidth=0.5)
# Add smoothed version
window = 50
if len(history['batch_losses']) > window:
    smoothed = np.convolve(history['batch_losses'], np.ones(window)/window, mode='valid')
    axes[0].plot(range(window-1, len(history['batch_losses'])), smoothed, color='red', linewidth=2, label='Smoothed')
axes[0].set_xlabel('Batch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss (per batch)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Epoch loss
epochs = range(1, NUM_EPOCHS + 1)
axes[1].plot(epochs, history['train_loss'], 'o-', linewidth=2, markersize=6, label='Train Loss')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss')
axes[1].set_title('Training Loss (per epoch)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Train vs Test accuracy
axes[2].plot(epochs, [a*100 for a in history['train_acc']], 'o-', linewidth=2, markersize=6, label='Train Acc')
axes[2].plot(epochs, [a*100 for a in history['test_acc']], 's-', linewidth=2, markersize=6, label='Test Acc')
axes[2].axhline(y=sklearn_acc*100, color='gray', linestyle='--', alpha=0.5, label=f'sklearn ({sklearn_acc*100:.0f}%)')
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('Accuracy (%)')
axes[2].set_title('Train vs Test Accuracy')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

gap = history['train_acc'][-1] - history['test_acc'][-1]
print(f"Train-Test accuracy gap: {gap*100:.1f}%")
if gap > 0.1:
    print("The gap is large -- a sign of overfitting.")
    print("Check the plot: train accuracy keeps climbing while test accuracy plateaus.")
else:
    print("Gap is small -- not much overfitting yet.")

In [None]:
# Visualize some predictions
model.eval()
fig, axes = plt.subplots(3, 8, figsize=(18, 7))

# Get a batch of test images
test_iter = iter(test_loader)
images, labels = next(test_iter)
images, labels = images.to(device), labels.to(device)

with torch.no_grad():
    outputs = model(images)
    _, predicted = outputs.max(1)

for i, ax in enumerate(axes.flat):
    if i >= len(images):
        break
    # Undo normalization
    img = images[i].cpu() * 0.5 + 0.5
    ax.imshow(img.permute(1, 2, 0).numpy())

    true_label = CLASSES[labels[i].item()]
    pred_label = CLASSES[predicted[i].item()]
    correct = labels[i].item() == predicted[i].item()

    color = 'green' if correct else 'red'
    ax.set_title(f"P:{pred_label}\nT:{true_label}", fontsize=8, color=color)
    ax.axis('off')

plt.suptitle('Predictions (green=correct, red=wrong)', fontsize=14)
plt.tight_layout()
plt.show()

**Key observations:**
- The loss curve shows training is working (loss decreasing)
- If train accuracy keeps rising but test accuracy plateaus or drops, that's **overfitting**
- The feedforward network typically reaches around 50% on CIFAR-10. Decent for a first attempt, but we can do much better.
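
The overfitting check can also be done numerically rather than by eye. The helper below is a hypothetical sketch (the name `summarize_history` is not part of this lab), and the accuracies are made-up numbers for illustration, not results from the training run.

```python
def summarize_history(train_acc, test_acc):
    """Return (best_epoch, gaps): the 1-based epoch with the highest test
    accuracy, and the train-minus-test gap for every epoch."""
    gaps = [tr - te for tr, te in zip(train_acc, test_acc)]
    best_epoch = max(range(len(test_acc)), key=lambda i: test_acc[i]) + 1
    return best_epoch, gaps


# Made-up accuracies for illustration only.
example_train = [0.35, 0.45, 0.55, 0.65, 0.75]
example_test = [0.34, 0.43, 0.48, 0.49, 0.48]

best_epoch, gaps = summarize_history(example_train, example_test)
print(f"Best test accuracy at epoch {best_epoch}")
print("Gap per epoch:", [f"{g:.2f}" for g in gaps])
```

With the real metrics from the training cell, the call would be `summarize_history(history['train_acc'], history['test_acc'])`. A gap that widens after the best epoch is the overfitting signature described above.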

---

## Section 5: Improve with a CNN

The feedforward network flattens the image and ignores spatial relationships.
A **Convolutional Neural Network (CNN)** preserves spatial structure by:

1. **Convolving** small filters (3x3) across the image to detect local patterns
2. **Pooling** to reduce resolution while keeping important features
3. **Stacking** conv layers so later layers combine simple features into complex ones

```
Layer 1: edges, color blobs
Layer 2: textures, simple shapes (wheels, eyes)
Layer 3+: complex patterns (faces, cars)
```
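
To make "convolving a small filter across the image" concrete, here is a plain-Python sketch of the sliding-window operation on a tiny grayscale image. The filter is written by hand for illustration; a CNN learns its filter values during training. Like PyTorch's `Conv2d`, it computes a cross-correlation (the kernel is not flipped). `conv2d_valid` is an illustrative helper name.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 5x5 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# A hand-written vertical-edge detector.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

for row in conv2d_valid(image, kernel):
    print(row)  # each row is [3, 3, 0]
```

The response is strong where the window straddles the dark-to-bright boundary and zero where the window sees uniform brightness. The same nine weights detect the edge wherever it appears in the image.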

Our CNN architecture:
```
Conv2d(3->32) -> ReLU -> MaxPool -> Conv2d(32->64) -> ReLU -> MaxPool -> FC(64*8*8->256) -> FC(256->10)
```
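
The `64*8*8` in the diagram follows from simple shape arithmetic. The sketch below assumes 3x3 kernels with padding 1 and 2x2 max-pooling with stride 2, matching the `SimpleCNN` definition in the next cell; the helper names are illustrative.

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(c_in, c_out, kernel):
    """One kernel x kernel filter per (input, output) channel pair, plus one bias per output channel."""
    return c_in * c_out * kernel * kernel + c_out


size = 32
size = conv_out_size(size, kernel=3, padding=1)   # Conv2d(3->32):  32 -> 32
size = conv_out_size(size, kernel=2, stride=2)    # MaxPool:        32 -> 16
size = conv_out_size(size, kernel=3, padding=1)   # Conv2d(32->64): 16 -> 16
size = conv_out_size(size, kernel=2, stride=2)    # MaxPool:        16 -> 8
flat = 64 * size * size
print(f"Final feature map: 64 x {size} x {size} = {flat} values")

total = (conv_params(3, 32, 3) + conv_params(32, 64, 3)
         + (flat * 256 + 256) + (256 * 10 + 10))
print(f"CNN parameters: {total:,}")  # 1,070,794
```

Padding of 1 with a 3x3 kernel keeps the spatial size unchanged, so only the pooling layers shrink the feature maps.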

In [None]:
class SimpleCNN(nn.Module):
    """A simple CNN with 2 convolutional layers and 2 fully-connected layers."""

    def __init__(self):
        super().__init__()
        # Convolutional layers (preserve spatial structure)
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (3, 32, 32) -> (32, 32, 32)
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                            # (32, 32, 32) -> (32, 16, 16)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # (32, 16, 16) -> (64, 16, 16)
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                            # (64, 16, 16) -> (64, 8, 8)
        )
        # Fully-connected layers (classify from features)
        self.classifier = nn.Sequential(
            nn.Flatten(),                                  # (64, 8, 8) -> (4096)
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)


cnn_model = SimpleCNN().to(device)
print(cnn_model)

cnn_params = sum(p.numel() for p in cnn_model.parameters())
ff_params = sum(p.numel() for p in FeedforwardNet().parameters())
print(f"\nCNN parameters:         {cnn_params:,}")
print(f"Feedforward parameters: {ff_params:,}")
print(f"\nThe CNN has fewer parameters but is smarter about how it uses them.")
print(f"Convolutions share weights across spatial positions -- this is the key insight.")

In [None]:
# Train the CNN with the same setup
cnn_model = SimpleCNN().to(device)
cnn_criterion = nn.CrossEntropyLoss()
cnn_optimizer = torch.optim.SGD(cnn_model.parameters(), lr=0.01)

cnn_history = {
    'train_loss': [],
    'train_acc': [],
    'test_acc': [],
}

start_time = time.time()

for epoch in range(NUM_EPOCHS):
    # Training
    cnn_model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        cnn_optimizer.zero_grad()
        outputs = cnn_model(images)
        loss = cnn_criterion(outputs, labels)
        loss.backward()
        cnn_optimizer.step()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    train_loss = running_loss / len(train_loader)
    train_acc = correct / total
    cnn_history['train_loss'].append(train_loss)
    cnn_history['train_acc'].append(train_acc)

    # Evaluation
    cnn_model.eval()
    test_correct = 0
    test_total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = cnn_model(images)
            _, predicted = outputs.max(1)
            test_total += labels.size(0)
            test_correct += predicted.eq(labels).sum().item()

    test_acc = test_correct / test_total
    cnn_history['test_acc'].append(test_acc)

    elapsed = time.time() - start_time
    print(f"Epoch {epoch+1:2d}/{NUM_EPOCHS} | "
          f"Loss: {train_loss:.3f} | "
          f"Train Acc: {train_acc*100:.1f}% | "
          f"Test Acc: {test_acc*100:.1f}% | "
          f"Time: {elapsed:.0f}s")

total_time = time.time() - start_time
print(f"\nCNN training complete in {total_time:.0f} seconds.")

In [None]:
# Compare all three approaches
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

epochs = range(1, NUM_EPOCHS + 1)

# Test accuracy comparison
axes[0].plot(epochs, [a*100 for a in history['test_acc']], 'o-', linewidth=2, label='Feedforward NN')
axes[0].plot(epochs, [a*100 for a in cnn_history['test_acc']], 's-', linewidth=2, label='CNN')
axes[0].axhline(y=sklearn_acc*100, color='gray', linestyle='--', alpha=0.7, label=f'sklearn LogReg ({sklearn_acc*100:.0f}%)')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Test Accuracy (%)')
axes[0].set_title('Test Accuracy: sklearn vs Feedforward vs CNN')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Loss comparison
axes[1].plot(epochs, history['train_loss'], 'o-', linewidth=2, label='Feedforward NN')
axes[1].plot(epochs, cnn_history['train_loss'], 's-', linewidth=2, label='CNN')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Training Loss')
axes[1].set_title('Training Loss: Feedforward vs CNN')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFinal test accuracy comparison:")
print(f"  scikit-learn LogReg:  {sklearn_acc*100:.1f}%")
print(f"  Feedforward NN:       {history['test_acc'][-1]*100:.1f}%")
print(f"  CNN:                  {cnn_history['test_acc'][-1]*100:.1f}%")
print(f"\nCNNs learn spatial features -- edges, textures, shapes.")
print(f"That's why they're better at images.")

**Why does the CNN win?**

- **Weight sharing**: The same 3x3 filter is applied everywhere -- it doesn't need to relearn "edge" at each position
- **Local patterns**: Conv filters look at small neighborhoods of pixels, capturing spatial relationships
- **Hierarchical features**: Stacked conv layers build simple features into complex ones

The feedforward net sees a flat list of 3,072 numbers with no built-in notion of which pixels are neighbors. The CNN's structure assumes from the start that pixels near each other matter together.
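
A quick back-of-the-envelope calculation shows how much weight sharing saves. It compares the first conv layer (3 -> 32 channels, 3x3 filters) with a hypothetical fully-connected layer producing an output of the same size; no such dense layer appears in this lab.

```python
in_values = 3 * 32 * 32     # one input image
out_values = 32 * 32 * 32   # 32 feature maps of size 32x32

# Fully-connected: every output value gets its own weight for every input value.
dense_params = in_values * out_values + out_values

# Convolution: 32 filters of shape 3x3x3, reused at every spatial position.
conv_params = 3 * 32 * 3 * 3 + 32

print(f"Dense layer: {dense_params:,}")   # 100,696,064
print(f"Conv layer:  {conv_params:,}")    # 896
print(f"Ratio:       {dense_params // conv_params:,}x")
```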

---

## Section 6: Checkpointing

Training a neural network can take hours, days, or weeks. If your process crashes at
epoch 47 of 50, you don't want to start over.

**Checkpointing** saves the training state to disk so you can resume from where you left off.
This is similar in spirit to snapshots and **write-ahead logging** in databases -- you
persist state so you can recover from failures.

A checkpoint contains:
- Model weights (`state_dict`)
- Optimizer state (momentum buffers, current learning rate, etc.); a learning rate scheduler, if you use one, has its own `state_dict`
- Current epoch and loss
- Anything else you need to resume exactly
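
One detail the list leaves implicit is that a crash can also happen *while* the checkpoint file is being written. A common defensive pattern, sketched here with the standard library's `pickle` standing in for `torch.save`, is to write to a temporary file and then rename it over the old checkpoint. The helper name `save_atomically` and the file names are hypothetical.

```python
import os
import pickle
import tempfile


def save_atomically(state, path):
    """Write `state` to a temp file, then rename it into place.

    os.replace is atomic on a single filesystem, so `path` always holds
    either the previous complete checkpoint or the new one, never half a file.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    os.replace(tmp_path, path)


# Toy state; a real checkpoint would hold model and optimizer state_dicts.
state = {"epoch": 7, "train_loss": 1.23, "history": [2.1, 1.7, 1.23]}

with tempfile.TemporaryDirectory() as workdir:
    ckpt_path = os.path.join(workdir, "demo_checkpoint.pkl")
    save_atomically(state, ckpt_path)
    with open(ckpt_path, "rb") as f:
        restored = pickle.load(f)

print(restored == state)
```

With PyTorch, the same pattern would call `torch.save(checkpoint, tmp_path)` in place of `pickle.dump`.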

In [None]:
import os

# Save a checkpoint
checkpoint_path = 'cnn_checkpoint.pt'

checkpoint = {
    'epoch': NUM_EPOCHS,
    'model_state_dict': cnn_model.state_dict(),
    'optimizer_state_dict': cnn_optimizer.state_dict(),
    'train_loss': cnn_history['train_loss'][-1],
    'test_acc': cnn_history['test_acc'][-1],
    'history': cnn_history,
}

torch.save(checkpoint, checkpoint_path)

size_mb = os.path.getsize(checkpoint_path) / (1024 * 1024)
print(f"Checkpoint saved to: {checkpoint_path}")
print(f"File size: {size_mb:.2f} MB")
print(f"Contents:")
print(f"  - epoch: {checkpoint['epoch']}")
print(f"  - train_loss: {checkpoint['train_loss']:.4f}")
print(f"  - test_acc: {checkpoint['test_acc']*100:.1f}%")
print(f"  - model_state_dict: {len(checkpoint['model_state_dict'])} parameter tensors")
print(f"  - optimizer_state_dict: optimizer hyperparameters + any per-parameter state (e.g. momentum buffers)")

In [None]:
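# Load the checkpoint into a fresh model.
# weights_only=False allows arbitrary pickled Python objects,
# so only load checkpoint files you trust.
loaded_checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)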

# Create a fresh model and optimizer
restored_model = SimpleCNN().to(device)
restored_optimizer = torch.optim.SGD(restored_model.parameters(), lr=0.01)

# Load saved state
restored_model.load_state_dict(loaded_checkpoint['model_state_dict'])
restored_optimizer.load_state_dict(loaded_checkpoint['optimizer_state_dict'])
resume_epoch = loaded_checkpoint['epoch']

print(f"Checkpoint loaded! Resuming from epoch {resume_epoch}.")
print(f"Previous test accuracy: {loaded_checkpoint['test_acc']*100:.1f}%")

# Verify the restored model produces the same predictions
cnn_model.eval()
restored_model.eval()

test_images, test_labels = next(iter(test_loader))
test_images = test_images.to(device)

with torch.no_grad():
    original_preds = cnn_model(test_images).argmax(dim=1)
    restored_preds = restored_model(test_images).argmax(dim=1)

match = (original_preds == restored_preds).all().item()
print(f"\nPredictions match original model: {match}")
if match:
    print("The checkpoint restored the model's weights -- the predictions are identical.")

In [None]:
# Continue training from the checkpoint for 2 more epochs
print(f"Resuming training from epoch {resume_epoch}...")

restored_model.train()
for epoch in range(resume_epoch, resume_epoch + 2):
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        restored_optimizer.zero_grad()
        outputs = restored_model(images)
        loss = cnn_criterion(outputs, labels)
        loss.backward()
        restored_optimizer.step()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    # Quick test eval
    restored_model.eval()
    test_correct = 0
    test_total = 0
    with torch.no_grad():
        for imgs, lbls in test_loader:
            imgs, lbls = imgs.to(device), lbls.to(device)
            outs = restored_model(imgs)
            _, preds = outs.max(1)
            test_total += lbls.size(0)
            test_correct += preds.eq(lbls).sum().item()
    restored_model.train()

    print(f"Epoch {epoch+1} | "
          f"Loss: {running_loss/len(train_loader):.3f} | "
          f"Train Acc: {100.*correct/total:.1f}% | "
          f"Test Acc: {100.*test_correct/test_total:.1f}%")

print("\nTraining resumed seamlessly from the checkpoint.")
print("This is fault tolerance for training -- same concept as write-ahead logging.")

**Checkpointing best practices:**
- Save after every epoch (or every N batches for long epochs)
- Save the optimizer state too, not just the model (momentum buffers and adaptive learning rates matter once you move beyond plain SGD)
- Keep the last K checkpoints and delete older ones to save disk space
- In production, save to networked storage so you can resume on a different machine
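
The "keep the last K checkpoints" rule can be sketched with the standard library alone. The file naming scheme (`ckpt_epoch_<N>.pt`) and the helper `prune_checkpoints` are assumptions made for this example; the demo uses empty placeholder files instead of real checkpoints.

```python
import os
import tempfile


def prune_checkpoints(directory, keep_last=3, prefix="ckpt_epoch_", suffix=".pt"):
    """Delete all but the `keep_last` highest-numbered checkpoints; return the names kept."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")

    def epoch_of(name):
        return int(name[len(prefix):-len(suffix)])

    names = [n for n in os.listdir(directory)
             if n.startswith(prefix) and n.endswith(suffix)]
    names.sort(key=epoch_of)  # numeric sort, so epoch 10 comes after epoch 9
    for old in names[:-keep_last]:
        os.remove(os.path.join(directory, old))
    return names[-keep_last:]


with tempfile.TemporaryDirectory() as workdir:
    # Simulate one (empty) checkpoint file per epoch.
    for epoch in range(1, 6):
        open(os.path.join(workdir, f"ckpt_epoch_{epoch}.pt"), "wb").close()
    kept = prune_checkpoints(workdir, keep_last=3)
    remaining = sorted(os.listdir(workdir))

print(kept)
```

In a training loop, the call would go right after each save, so the directory never holds more than K files.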

---

## Section 7: Experiment Tracking (Simple)

When you start experimenting with different architectures, learning rates, and other
hyperparameters, you need to track what you tried and what worked.

For now, we'll use a simple JSON log file. After 20+ experiments, you'll want MLflow
or Weights & Biases. But the principle is the same: **log everything, compare later.**
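
One possible refinement of the JSON log, sketched below, is to append one JSON object per line as each run finishes, so a crash in the middle of a sweep does not lose the earlier results. The helper names, the `runs.jsonl` file name, and the logged numbers are placeholders for illustration, not results from this lab.

```python
import json
import os
import tempfile


def log_run(path, record):
    """Append one experiment record as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(path):
    """Read every logged record back into a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "runs.jsonl")
    # Placeholder numbers for illustration only.
    log_run(log_path, {"experiment": "demo_a", "lr": 0.01, "test_acc": 0.50})
    log_run(log_path, {"experiment": "demo_b", "lr": 0.05, "test_acc": 0.55})
    runs = load_runs(log_path)

best = max(runs, key=lambda r: r["test_acc"])
print(f"{len(runs)} runs logged; best: {best['experiment']}")
```

The list returned by `load_runs` can be passed straight to `pd.DataFrame` for sorting and comparison.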

In [None]:
import json
import pandas as pd


def run_experiment(name, model_class, lr, epochs, train_loader, test_loader):
    """Run a training experiment and return results."""
    model = model_class().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    start = time.time()

    # Train
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    # Evaluate
    model.eval()
    train_correct = 0
    train_total = 0
    with torch.no_grad():
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            _, predicted = model(images).max(1)
            train_total += labels.size(0)
            train_correct += predicted.eq(labels).sum().item()

    test_correct = 0
    test_total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            _, predicted = model(images).max(1)
            test_total += labels.size(0)
            test_correct += predicted.eq(labels).sum().item()

    train_time = time.time() - start
    n_params = sum(p.numel() for p in model.parameters())

    result = {
        'experiment': name,
        'architecture': model_class.__name__,
        'lr': lr,
        'epochs': epochs,
        'params': n_params,
        'train_acc': round(train_correct / train_total, 4),
        'test_acc': round(test_correct / test_total, 4),
        'train_time_s': round(train_time, 1),
    }
    return result


print("Experiment runner ready. Let's compare configurations.")

In [None]:
# Run several experiments
experiments = []

configs = [
    ('ff_lr0.01_ep5', FeedforwardNet, 0.01, 5),
    ('ff_lr0.05_ep5', FeedforwardNet, 0.05, 5),
    ('cnn_lr0.01_ep5', SimpleCNN, 0.01, 5),
    ('cnn_lr0.05_ep5', SimpleCNN, 0.05, 5),
]

for name, model_class, lr, epochs in configs:
    print(f"Running: {name}...")
    result = run_experiment(name, model_class, lr, epochs, train_loader, test_loader)
    experiments.append(result)
    print(f"  -> test_acc={result['test_acc']*100:.1f}% in {result['train_time_s']}s")

print(f"\nAll {len(experiments)} experiments complete!")

In [None]:
# Save experiments to JSON
experiments_path = 'experiment_log.json'
with open(experiments_path, 'w') as f:
    json.dump(experiments, f, indent=2)
print(f"Experiments saved to {experiments_path}")

# Load and compare in a DataFrame
df = pd.DataFrame(experiments)
df = df.sort_values('test_acc', ascending=False)
print(f"\n{'='*80}")
print(f"EXPERIMENT RESULTS (sorted by test accuracy)")
print(f"{'='*80}")
print(df[['experiment', 'architecture', 'lr', 'epochs', 'params',
          'train_acc', 'test_acc', 'train_time_s']].to_string(index=False))
print(f"\nBest experiment: {df.iloc[0]['experiment']} with {df.iloc[0]['test_acc']*100:.1f}% test accuracy")

In [None]:
# Visualize experiment comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of test accuracies
colors = ['#2196F3' if 'ff' in exp['experiment'] else '#FF9800' for exp in experiments]
axes[0].barh(range(len(experiments)), [e['test_acc']*100 for e in experiments], color=colors)
axes[0].set_yticks(range(len(experiments)))
axes[0].set_yticklabels([e['experiment'] for e in experiments])
axes[0].set_xlabel('Test Accuracy (%)')
axes[0].set_title('Test Accuracy by Experiment')
axes[0].axvline(x=sklearn_acc*100, color='gray', linestyle='--', alpha=0.5, label=f'sklearn ({sklearn_acc*100:.0f}%)')
axes[0].legend()

# Train vs test accuracy (overfitting check)
x_pos = range(len(experiments))
width = 0.35
axes[1].bar([x - width/2 for x in x_pos], [e['train_acc']*100 for e in experiments],
           width, label='Train Acc', alpha=0.8)
axes[1].bar([x + width/2 for x in x_pos], [e['test_acc']*100 for e in experiments],
           width, label='Test Acc', alpha=0.8)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels([e['experiment'] for e in experiments], rotation=45, ha='right')
axes[1].set_ylabel('Accuracy (%)')
axes[1].set_title('Train vs Test Accuracy (gap = overfitting)')
axes[1].legend()

plt.tight_layout()
plt.show()

print("Blue = Feedforward, Orange = CNN")
print("The gap between train and test bars shows overfitting.")

In [None]:
# Clean up checkpoint and experiment files
for path in ['cnn_checkpoint.pt', 'experiment_log.json']:
    if os.path.exists(path):
        os.remove(path)
        print(f"Cleaned up {path}")

print("\nDone! All temporary files removed.")

---

## Summary

You just built and trained neural networks from scratch. Here's what you now know:

| Concept | What You Learned |
|---------|------------------|
| **sklearn limits** | Flat feature vectors lose spatial structure -- bad for images |
| **Feedforward NN** | Stacked linear layers with ReLU activations; learns intermediate features |
| **Training loop** | forward -> loss -> backward -> step; the core of all deep learning |
| **Overfitting** | Train accuracy rises, test accuracy plateaus; watch the gap |
| **CNN** | Convolutional layers learn spatial patterns (edges, textures, shapes) |
| **Checkpointing** | Save model + optimizer state to resume training after failures |
| **Experiment tracking** | Log every run so you know what worked and what didn't |

### What's Next?

In **ML Lab 08**, you'll move from training models to running pre-trained large language
models (LLMs) on your own machine.