# Fashion-MNIST Project

**Module 2.3, Lesson 3 (Project)** | CourseAI

This is your second end-to-end project and the **final lesson in Series 2**. You will build, train, debug, and improve a Fashion-MNIST classifier independently, combining every skill from the past ten lessons.

**What you'll do:**
- Run a baseline model and diagnose the accuracy gap
- Train longer and observe the scissors (overfitting) pattern
- Add regularization (BatchNorm, Dropout, weight decay) and measure improvement
- Analyze per-class accuracy to understand what the model finds hard
- Build a complete pipeline with GPU, checkpointing, and early stopping

**Structure (decreasing scaffolding):**
1. **(Guided)** Data loading and baseline model: run it, observe results, diagnose the gap
2. **(Supported)** Experimentation: train longer, add regularization, try architectures
3. **(Supported)** Per-class analysis: computation provided, you interpret
4. **(Independent)** Full pipeline: GPU, checkpointing, early stopping, your best model

**For each exercise, PREDICT the output before running the cell.**

**Estimated time:** 30-45 minutes.

**Target accuracy:** ~88-90% with FC layers. Do not expect the ~97% you reached on MNIST; pushing well past 90% requires CNNs (Series 3).

---

## Setup

Run this cell to import everything and configure the environment.

In [None]:
import time
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np

# Reproducible results
torch.manual_seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

# For nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Checkpoint directory
os.makedirs('saved_models', exist_ok=True)

---

## Section 1: Load Fashion-MNIST (Guided)

Fashion-MNIST is a drop-in replacement for MNIST. Same image size (28x28 grayscale), same number of classes (10), same torchvision API. One word changes in your loading code.

**Before running, predict:** How will Fashion-MNIST images look compared to MNIST digits? Which classes do you think will be hardest to distinguish from each other?

Note the different normalization values: mean=0.2860, std=0.3530 (MNIST was 0.1307, 0.3081).
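
To make those numbers concrete, here is a small sketch of the per-pixel arithmetic that `transforms.Normalize` applies once `ToTensor()` has scaled pixels to [0, 1]. The helper `normalize_pixel` is a hypothetical name used only for illustration; it is not part of torchvision.

```python
# Illustrative only: the per-pixel arithmetic behind Normalize((mean,), (std,)).
# `normalize_pixel` is a hypothetical helper, not a torchvision function.
FASHION_MEAN, FASHION_STD = 0.2860, 0.3530
MNIST_MEAN, MNIST_STD = 0.1307, 0.3081


def normalize_pixel(value, mean, std):
    """Map a pixel in [0, 1] to (value - mean) / std."""
    return (value - mean) / std


for name, mean, std in [("MNIST", MNIST_MEAN, MNIST_STD),
                        ("Fashion-MNIST", FASHION_MEAN, FASHION_STD)]:
    black = normalize_pixel(0.0, mean, std)
    white = normalize_pixel(1.0, mean, std)
    print(f"{name:>14}: black -> {black:+.3f}, white -> {white:+.3f}")
```

Using the dataset's own statistics keeps the inputs roughly zero-mean with unit variance, which is why the MNIST constants are not reused here.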

In [None]:
# Fashion-MNIST normalization values (different from MNIST)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.2860,), (0.3530,))
])

train_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=False, download=True, transform=transform
)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False)

# Class names for display
class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

print(f'Training samples: {len(train_dataset)}')
print(f'Test samples:     {len(test_dataset)}')
print(f'Image shape:      {train_dataset[0][0].shape}')
print(f'Classes:          {len(class_names)}')

In [None]:
# Visualize: one row per class, 8 samples each
fig, axes = plt.subplots(10, 8, figsize=(10, 14))
fig.suptitle('Fashion-MNIST: 8 Samples Per Class', fontsize=14)

# Collect indices for each class
class_indices = {i: [] for i in range(10)}
for idx, (_, label) in enumerate(train_dataset):
    if len(class_indices[label]) < 8:
        class_indices[label].append(idx)
    if all(len(v) >= 8 for v in class_indices.values()):
        break

for class_idx in range(10):
    for col in range(8):
        ax = axes[class_idx, col]
        image, _ = train_dataset[class_indices[class_idx][col]]
        ax.imshow(image.squeeze(), cmap='gray')
        ax.axis('off')
        if col == 0:
            ax.set_title(class_names[class_idx], fontsize=7, loc='left')

plt.tight_layout()
plt.show()

print('\nNotice: T-shirt, Pullover, Coat, and Shirt look very similar as 28x28 silhouettes.')
print('These are the "hard" classes your model will struggle with.')

---

## Section 2: Baseline Model (Guided)

Start with the exact architecture from the MNIST project: Flatten -> Linear(784, 256) -> ReLU -> Linear(256, 128) -> ReLU -> Linear(128, 10). No regularization.

**Before running, predict:** The MNIST baseline hit ~97% accuracy. Will Fashion-MNIST be higher, lower, or similar? By how much?

This is your baseline: the number to beat.
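
As a sanity check on the architecture, the parameter count can be worked out by hand: each `Linear(in, out)` layer holds `in * out` weights plus `out` biases, while `Flatten` and `ReLU` have none. The sketch below does that arithmetic in plain Python; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# The three Linear layers of the baseline architecture described above.
layer_sizes = [(784, 256), (256, 128), (128, 10)]
per_layer = [linear_params(i, o) for i, o in layer_sizes]
total = sum(per_layer)

for (i, o), n in zip(layer_sizes, per_layer):
    print(f"Linear({i}, {o}): {n:,} parameters")
print(f"Total: {total:,}")
```

The total should agree with the count printed by the model definition cell; note that the first layer dominates because it is connected to all 784 input pixels.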

In [None]:
class BaselineModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.net(x)

print(f'Baseline parameters: {sum(p.numel() for p in BaselineModel().parameters()):,}')

In [None]:
def train_one_epoch(model, train_loader, optimizer, criterion):
    """Train for one epoch. Returns (avg_loss, accuracy)."""
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

    return running_loss / total, 100.0 * correct / total


def evaluate(model, test_loader):
    """Evaluate model. Returns (avg_loss, accuracy)."""
    model.eval()
    criterion = nn.CrossEntropyLoss()
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item() * images.size(0)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)

    return running_loss / total, 100.0 * correct / total

print('Helper functions defined.')

In [None]:
# Train the baseline for 5 epochs
baseline = BaselineModel().to(device)
baseline_optimizer = optim.Adam(baseline.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

baseline_history = {'train_loss': [], 'train_acc': [], 'test_loss': [], 'test_acc': []}

print('Training baseline for 5 epochs...')
print('=' * 70)

for epoch in range(5):
    train_loss, train_acc = train_one_epoch(baseline, train_loader, baseline_optimizer, criterion)
    test_loss, test_acc = evaluate(baseline, test_loader)

    baseline_history['train_loss'].append(train_loss)
    baseline_history['train_acc'].append(train_acc)
    baseline_history['test_loss'].append(test_loss)
    baseline_history['test_acc'].append(test_acc)

    print(f'Epoch {epoch+1}/5 | '
          f'Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | '
          f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}%')

print('=' * 70)
print(f'\nBaseline result: {baseline_history["test_acc"][-1]:.2f}% test accuracy')
print(f'Gap from MNIST (~97%): ~{97 - baseline_history["test_acc"][-1]:.0f} percentage points')
print(f'\nObservations:')
print(f'  - Training loss still decreasing? {baseline_history["train_loss"][-1] < baseline_history["train_loss"][-2]}')
print(f'  - Train/test gap: {baseline_history["train_acc"][-1] - baseline_history["test_acc"][-1]:.1f} points')

### Diagnose the Baseline

Before experimenting, answer these questions (from your debugging checklist):

1. The loss is still decreasing at epoch 5. What does that tell you?
2. Training accuracy is higher than test accuracy. What pattern is this?
3. What should you try first: more epochs, or a different model?

---

## Section 3: Experimentation (Supported)

You have a baseline (~87-88%). Now improve it through three experiments. The structure is given, but you write the code and observe the results.

### Experiment 1: Train Longer

The simplest thing to try: give the model more time to converge.

**Task:** Train the baseline architecture for 20 epochs instead of 5. Track train and test accuracy.

**Before running, predict:** Will the test accuracy keep improving for all 20 epochs, or will it plateau/get worse? What will the train/test gap look like?

**Observe:**
- Does test accuracy improve? By how much?
- Does the train/test gap get wider? (That is the scissors pattern: overfitting.)
- Does the loss plateau?
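
One way to make these observations measurable is to look at where test accuracy peaks and how large the gap is at the end. The sketch below uses made-up numbers purely for illustration (they are not results from this notebook), and `summarize_gap` is a hypothetical helper that reads the same kind of history dict the training cells build.

```python
def summarize_gap(history):
    """Return (best_epoch, best_test_acc, final_gap) from an accuracy history.

    Epochs are reported 1-indexed to match the printed training logs.
    """
    test_acc = history["test_acc"]
    train_acc = history["train_acc"]
    best_idx = max(range(len(test_acc)), key=lambda i: test_acc[i])
    final_gap = train_acc[-1] - test_acc[-1]
    return best_idx + 1, test_acc[best_idx], final_gap


# Made-up numbers for illustration only.
toy_history = {
    "train_acc": [85.0, 88.0, 90.0, 92.0, 94.0],
    "test_acc": [84.0, 86.5, 87.5, 87.4, 87.2],
}
best_epoch, best_acc, final_gap = summarize_gap(toy_history)
print(f"Test accuracy peaked at epoch {best_epoch} ({best_acc:.1f}%)")
print(f"Final train/test gap: {final_gap:.1f} points")
```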

In [None]:
# Experiment 1: Train longer (20 epochs)
# TODO: Create a fresh BaselineModel, optimizer, and train for 20 epochs
# Track train_loss, train_acc, test_loss, test_acc in a history dict
# Print results each epoch

exp1_model = BaselineModel().to(device)
exp1_optimizer = optim.Adam(exp1_model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

exp1_history = {'train_loss': [], 'train_acc': [], 'test_loss': [], 'test_acc': []}

print('Experiment 1: Training baseline for 20 epochs...')
print('=' * 70)

for epoch in range(20):
    train_loss, train_acc = train_one_epoch(exp1_model, train_loader, exp1_optimizer, criterion)
    test_loss, test_acc = evaluate(exp1_model, test_loader)

    exp1_history['train_loss'].append(train_loss)
    exp1_history['train_acc'].append(train_acc)
    exp1_history['test_loss'].append(test_loss)
    exp1_history['test_acc'].append(test_acc)

    print(f'Epoch {epoch+1:2d}/20 | '
          f'Train: {train_acc:.2f}% | Test: {test_acc:.2f}% | '
          f'Gap: {train_acc - test_acc:.1f}')

print('=' * 70)
print(f'Best test accuracy: {max(exp1_history["test_acc"]):.2f}%')
print(f'Final train/test gap: {exp1_history["train_acc"][-1] - exp1_history["test_acc"][-1]:.1f} points')

In [None]:
# Plot the scissors pattern
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

epochs_range = range(1, 21)

ax1.plot(epochs_range, exp1_history['train_loss'], 'o-', linewidth=2, markersize=3, label='Train')
ax1.plot(epochs_range, exp1_history['test_loss'], 's-', linewidth=2, markersize=3, label='Test')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Loss: Train vs Test (Scissors Pattern?)')
ax1.legend()
ax1.grid(alpha=0.3)

ax2.plot(epochs_range, exp1_history['train_acc'], 'o-', linewidth=2, markersize=3, label='Train')
ax2.plot(epochs_range, exp1_history['test_acc'], 's-', linewidth=2, markersize=3, label='Test')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.set_title('Accuracy: Train vs Test')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### Experiment 2: Add Regularization

The scissors are open: training accuracy is higher than test accuracy. Close them with:
- **BatchNorm1d**: normalizes activations, stabilizes training
- **Dropout(0.3)**: randomly zeros 30% of activations during training
- **weight_decay=0.01**: decoupled weight decay via AdamW (closely related to L2 regularization)
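
Dropout behaves differently in training and evaluation mode, which is one reason `model.train()` and `model.eval()` matter. A minimal sketch, assuming a standalone `nn.Dropout(0.3)` layer applied to a tensor of ones (the seed and tensor size are arbitrary choices for the demonstration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.3)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
drop.train()
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()
print(f"train mode: {zero_fraction:.1%} zeroed, "
      f"survivors scaled to {y_train.max().item():.4f}")

# Evaluation mode: dropout is the identity.
drop.eval()
y_eval = drop(x)
print(f"eval mode:  output equals input -> {torch.equal(y_eval, x)}")
```

Forgetting to call `model.eval()` before measuring test accuracy leaves dropout active and makes the evaluation noisy.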

**Task:** Build an `ImprovedModel` with the layer ordering: Linear -> BatchNorm -> ReLU -> Dropout.

Use `torch.optim.AdamW` with `weight_decay=0.01`.
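
To get a feel for how strong `weight_decay=0.01` is, recall that AdamW applies the decay directly to the weights: each optimizer step multiplies every parameter by `1 - lr * weight_decay` before the usual Adam update. The sketch below isolates only that shrink factor. It deliberately ignores the gradient update, and the step count assumes a batch size of 64 on 60,000 training images for 20 epochs.

```python
import math

lr = 1e-3
weight_decay = 0.01

# Multiplicative shrink applied to every weight on each optimizer step.
shrink_per_step = 1 - lr * weight_decay

steps_per_epoch = math.ceil(60_000 / 64)  # the last batch is partial
epochs = 20
total_steps = steps_per_epoch * epochs

# Cumulative shrink if the gradients contributed nothing at all.
total_shrink = shrink_per_step ** total_steps
print(f"shrink per step:  {shrink_per_step:.5f}")
print(f"optimizer steps:  {total_steps}")
print(f"cumulative factor after {epochs} epochs: {total_shrink:.3f}")
```

The per-step effect is tiny, but it compounds over thousands of steps, which is how a small coefficient can still pull unused weights toward zero.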

Train for 20 epochs. Compare to Experiment 1.

**Before running, predict:** Will regularization close the train/test gap? Will it improve test accuracy, or just reduce overfitting?

In [None]:
# TODO: Define your ImprovedModel
# Architecture: Flatten -> Linear(784,256) -> BN -> ReLU -> Dropout(0.3)
#                       -> Linear(256,128) -> BN -> ReLU -> Dropout(0.3)
#                       -> Linear(128, 10)

class ImprovedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        # TODO: Define self.net using nn.Sequential
        pass

    def forward(self, x):
        x = self.flatten(x)
        return self.net(x)

In [None]:
# Train the improved model for 20 epochs
exp2_model = ImprovedModel().to(device)
exp2_optimizer = optim.AdamW(exp2_model.parameters(), lr=1e-3, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

exp2_history = {'train_loss': [], 'train_acc': [], 'test_loss': [], 'test_acc': []}

print('Experiment 2: Training with regularization for 20 epochs...')
print('=' * 70)

for epoch in range(20):
    train_loss, train_acc = train_one_epoch(exp2_model, train_loader, exp2_optimizer, criterion)
    test_loss, test_acc = evaluate(exp2_model, test_loader)

    exp2_history['train_loss'].append(train_loss)
    exp2_history['train_acc'].append(train_acc)
    exp2_history['test_loss'].append(test_loss)
    exp2_history['test_acc'].append(test_acc)

    print(f'Epoch {epoch+1:2d}/20 | '
          f'Train: {train_acc:.2f}% | Test: {test_acc:.2f}% | '
          f'Gap: {train_acc - test_acc:.1f}')

print('=' * 70)
print(f'Best test accuracy: {max(exp2_history["test_acc"]):.2f}%')
print(f'Final train/test gap: {exp2_history["train_acc"][-1] - exp2_history["test_acc"][-1]:.1f} points')

<details>
<summary>💡 Solution</summary>

**The key insight:** The layer ordering matters: Linear -> BatchNorm -> ReLU -> Dropout. BatchNorm normalizes before activation, and Dropout regularizes after activation. No BatchNorm or Dropout on the output layer.

```python
class ImprovedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.net(x)
```

</details>

In [None]:
# Compare Experiment 1 vs Experiment 2
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

epochs_range = range(1, 21)

# Test accuracy comparison
ax1.plot(epochs_range, exp1_history['test_acc'], 'o-', linewidth=2, markersize=3, label='Baseline (no reg.)')
ax1.plot(epochs_range, exp2_history['test_acc'], 's-', linewidth=2, markersize=3, label='Improved (with reg.)')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Test Accuracy (%)')
ax1.set_title('Test Accuracy: Baseline vs Improved')
ax1.legend()
ax1.grid(alpha=0.3)

# Train/test gap comparison
exp1_gap = [t - v for t, v in zip(exp1_history['train_acc'], exp1_history['test_acc'])]
exp2_gap = [t - v for t, v in zip(exp2_history['train_acc'], exp2_history['test_acc'])]
ax2.plot(epochs_range, exp1_gap, 'o-', linewidth=2, markersize=3, label='Baseline gap')
ax2.plot(epochs_range, exp2_gap, 's-', linewidth=2, markersize=3, label='Improved gap')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Train - Test Accuracy (points)')
ax2.set_title('Overfitting Gap: Scissors Closing?')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f'\nBaseline best test acc: {max(exp1_history["test_acc"]):.2f}%')
print(f'Improved best test acc: {max(exp2_history["test_acc"]):.2f}%')
print(f'Improvement: {max(exp2_history["test_acc"]) - max(exp1_history["test_acc"]):+.2f} points')

### Experiment 3: Architecture Decisions (YOUR CHOICE)

This is fully independent. Try different architectures and compare. Some ideas:

- **More capacity:** 784 → 512 → 256 → 128 → 10
- **Deeper:** 784 → 256 → 256 → 128 → 64 → 10
- **Different dropout:** p=0.5 instead of 0.3
- **Different weight decay:** 0.001 instead of 0.01

**Negative experiment to try:** Build a very large model (784 → 1024 → 512 → 256 → 10) with **no regularization**. Watch the scissors open wide.

Run at least 2 different configurations. Use the training curves to compare.
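
To compare several configurations without rewriting a class each time, one option is a small factory function. This is a sketch rather than part of the course code: `make_mlp` and its arguments are hypothetical names, and it follows the Linear -> BatchNorm -> ReLU -> Dropout ordering from Experiment 2.

```python
import torch
import torch.nn as nn


def make_mlp(hidden_sizes, dropout=0.3, use_batchnorm=True,
             in_features=784, num_classes=10):
    """Build Flatten -> [Linear -> (BN) -> ReLU -> (Dropout)]* -> Linear."""
    layers = [nn.Flatten()]
    prev = in_features
    for size in hidden_sizes:
        layers.append(nn.Linear(prev, size))
        if use_batchnorm:
            layers.append(nn.BatchNorm1d(size))
        layers.append(nn.ReLU())
        if dropout > 0:
            layers.append(nn.Dropout(dropout))
        prev = size
    # Output layer: no activation, BatchNorm, or Dropout.
    layers.append(nn.Linear(prev, num_classes))
    return nn.Sequential(*layers)


wide = make_mlp([512, 256, 128])
big_unregularized = make_mlp([1024, 512, 256], dropout=0.0, use_batchnorm=False)

dummy = torch.randn(4, 1, 28, 28)  # a fake batch of four images
out_shape = wide(dummy).shape
big_params = sum(p.numel() for p in big_unregularized.parameters())
print(out_shape)
print(f"{big_params:,} parameters in the unregularized model")
```

Because the result is an ordinary `nn.Sequential`, it can be passed to `train_one_epoch` and `evaluate` in place of `BaselineModel()` after moving it to `device`.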

In [None]:
# YOUR CODE HERE
# Try at least 2 different architectures or hyperparameter settings.
# Track and compare their training curves.
#
# Remember:
# - Always use model.train() during training and model.eval() during evaluation
# - The layer ordering is: Linear -> BatchNorm -> ReLU -> Dropout
# - No activation, no dropout, no batchnorm on the output layer
# - Use AdamW with weight_decay to regularize the weights




---

## Section 4: Per-Class Analysis (Supported)

A single accuracy number hides important structure. Your model is not uniformly 89% accurate: some classes are easy and some are hard.

**Before running, predict:** Which Fashion-MNIST classes do you think the model finds easiest? Which hardest? Think about what 28x28 silhouettes look like.

The `per_class_accuracy` function is provided. **Your job:** run it on your best model, interpret the results, and identify the easy vs hard classes.

In [None]:
def per_class_accuracy(model, test_loader, class_names):
    """Compute and display accuracy for each class."""
    model.eval()
    correct = torch.zeros(10)
    total = torch.zeros(10)

    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            preds = torch.argmax(outputs, dim=1)

            for i in range(10):
                mask = labels == i
                total[i] += mask.sum().item()
                correct[i] += (preds[mask] == labels[mask]).sum().item()

    print('\nPer-class accuracy:')
    print('-' * 45)
    accs = []
    for i in range(10):
        acc = 100 * correct[i] / total[i]
        accs.append(acc.item())
        bar = '█' * int(acc / 2.5)
        print(f'{class_names[i]:>12s}: {acc:5.1f}%  {bar}')
    print('-' * 45)
    print(f'{"Overall":>12s}: {100 * correct.sum() / total.sum():5.1f}%')

    return accs

In [None]:
# Run per-class accuracy on your best model from the experiments above
# Replace 'exp2_model' with whichever model performed best
accs = per_class_accuracy(exp2_model, test_loader, class_names)

In [None]:
# Visualize per-class accuracy as a horizontal bar chart
fig, ax = plt.subplots(figsize=(10, 6))

colors = ['#ef4444' if a < 85 else '#f59e0b' if a < 90 else '#22c55e' for a in accs]
bars = ax.barh(class_names, accs, color=colors)
ax.set_xlabel('Accuracy (%)')
ax.set_title('Per-Class Accuracy on Fashion-MNIST')
ax.set_xlim(60, 100)
ax.axvline(x=90, color='white', linestyle='--', alpha=0.3, label='90%')
ax.grid(axis='x', alpha=0.3)

# Add accuracy labels on bars
for bar, acc in zip(bars, accs):
    ax.text(bar.get_width() + 0.5, bar.get_y() + bar.get_height()/2,
            f'{acc:.1f}%', va='center', fontsize=9)

plt.tight_layout()
plt.show()

# Identify easy and hard classes
easy = [(class_names[i], accs[i]) for i in range(10) if accs[i] >= 90]
hard = [(class_names[i], accs[i]) for i in range(10) if accs[i] < 85]

print(f'\nEasy classes (≥90%): {", ".join(f"{n} ({a:.1f}%)" for n, a in easy)}')
print(f'Hard classes (<85%): {", ".join(f"{n} ({a:.1f}%)" for n, a in hard)}')

### Interpret the Results

Answer these questions based on your per-class accuracy:

1. Which classes does the model find easiest? Why? (Think about what the silhouettes look like.)
2. Which classes are hardest? What do they have in common visually?
3. Why can't an FC network distinguish shirts from coats? (Hint: what happens when you flatten a 28x28 image?)

### Stretch Goal: Confusion Matrix

A confusion matrix shows exactly which classes get confused with which. The diagonal shows correct predictions; off-diagonal shows mistakes.
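
As a toy illustration of the idea (three made-up classes and hand-written labels, not Fashion-MNIST data), rows index the true class and columns index the predicted class:

```python
# Toy example with made-up labels; rows = true class, columns = predicted class.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 2, 0, 2, 2]

num_classes = 3
confusion = [[0] * num_classes for _ in range(num_classes)]
for t, p in zip(true_labels, pred_labels):
    confusion[t][p] += 1

for row in confusion:
    print(row)

# Correct predictions sit on the diagonal.
correct = sum(confusion[i][i] for i in range(num_classes))
print(f"accuracy from the diagonal: {correct / len(true_labels):.0%}")
```

Each off-diagonal entry names a specific mistake: `confusion[0][1]` counts class-0 samples that were predicted as class 1.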

In [None]:
# Stretch: Confusion matrix
# TODO: Build and display a confusion matrix
#
# Steps:
# 1. Collect all predictions and true labels from test_loader
# 2. Build a 10x10 matrix: confusion[true_class][predicted_class] = count
# 3. Display as a heatmap with plt.imshow()
#
# Hint: you can use torch.zeros(10, 10) for the matrix




<details>
<summary><strong>Hint: Confusion Matrix</strong> (click to expand)</summary>

```python
# Collect predictions
all_preds = []
all_labels = []

model = exp2_model  # or whichever model you want to analyze
model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = model(images)
        preds = torch.argmax(outputs, dim=1)
        all_preds.append(preds.cpu())
        all_labels.append(labels)

all_preds = torch.cat(all_preds)
all_labels = torch.cat(all_labels)

# Build confusion matrix
confusion = torch.zeros(10, 10, dtype=torch.int64)
for true, pred in zip(all_labels, all_preds):
    confusion[true][pred] += 1

# Plot
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(confusion.numpy(), cmap='Blues')
ax.set_xticks(range(10))
ax.set_yticks(range(10))
ax.set_xticklabels(class_names, rotation=45, ha='right', fontsize=8)
ax.set_yticklabels(class_names, fontsize=8)
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_title('Confusion Matrix')

# Add numbers
for i in range(10):
    for j in range(10):
        val = confusion[i, j].item()
        if val > 0:
            color = 'white' if val > confusion.max().item() * 0.5 else 'black'
            ax.text(j, i, str(val), ha='center', va='center',
                    fontsize=7, color=color)

plt.colorbar(im)
plt.tight_layout()
plt.show()
```

</details>

---

## Section 5: Full Pipeline (Independent)

Put everything together into a complete, production-ready training pipeline:

1. **Device detection**: GPU if available
2. **Data loading** with Fashion-MNIST transforms
3. **Your best model** with regularization
4. **Training loop** with GPU, checkpointing, and early stopping (patience=5)
5. **Restore best model** and run per-class analysis

**Optional:** Add mixed precision if on GPU.

This is the pattern you carry forward to every future project. Write it from scratch.
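
The early-stopping rule can be reasoned about separately from the model. A minimal sketch, assuming the criterion is "stop after `patience` consecutive epochs without a new best test accuracy"; the accuracy values are made up, and `run_with_early_stopping` is a hypothetical helper:

```python
def run_with_early_stopping(accuracies, patience=5):
    """Replay a sequence of per-epoch accuracies.

    Returns (best_epoch, best_acc, stopped_epoch), all epochs 1-indexed.
    stopped_epoch is None if the sequence ends before patience runs out.
    """
    best_acc = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            epochs_without_improvement = 0  # a checkpoint would be saved here
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return best_epoch, best_acc, epoch
    return best_epoch, best_acc, None


# Made-up accuracies for illustration only.
toy_acc = [85.0, 87.0, 88.5, 88.4, 88.3, 88.5, 88.2, 88.1, 88.0, 87.9]
result = run_with_early_stopping(toy_acc, patience=5)
print(result)
```

Because the comparison is a strict `>`, an epoch that only ties the best accuracy still counts against patience. In the real pipeline, the model restored at the end is the one from `best_epoch`, not from the epoch where training stopped.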

In [None]:
# YOUR CODE HERE
# Write the complete pipeline from scratch.
#
# Requirements:
#   - Device detection (GPU if available)
#   - Fashion-MNIST data loading with correct normalization
#   - Your best model architecture (with regularization)
#   - AdamW optimizer with weight_decay
#   - Training loop with:
#       - model.train() / model.eval() correctly
#       - Track train loss, train acc, test loss, test acc
#       - Save best model checkpoint by test accuracy
#       - Early stopping with patience=5
#   - After training: restore best model
#   - Per-class accuracy analysis
#
# Optional: mixed precision (autocast + GradScaler) if on GPU




In [None]:
# Plot the full training curves from your pipeline
# TODO: Plot train/test loss and accuracy curves
# Show where early stopping triggered (if it did)




In [None]:
# Final per-class analysis on your best model
# TODO: Load best checkpoint and run per_class_accuracy




<details>
<summary>💡 Solution</summary>

**The key insight:** This combines every skill from Series 2: data loading, model design, regularization, training loops, checkpointing, early stopping, and per-class analysis. The pattern is the same one from the GPU Training lesson, extended with early stopping.

```python
# 1. Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Training on: {device}')

# 2. Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.2860,), (0.3530,))
])
train_data = torchvision.datasets.FashionMNIST('./data', train=True, download=True, transform=transform)
test_data = torchvision.datasets.FashionMNIST('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=256)

# 3. Model
model = ImprovedModel().to(device)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

# 4. Training loop with early stopping
num_epochs = 30
patience = 5
best_acc = 0.0
patience_counter = 0
history = {'train_loss': [], 'train_acc': [], 'test_loss': [], 'test_acc': []}

start_time = time.time()

for epoch in range(num_epochs):
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, criterion)
    test_loss, test_acc = evaluate(model, test_loader)

    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['test_loss'].append(test_loss)
    history['test_acc'].append(test_acc)

    improved = ''
    if test_acc > best_acc:
        best_acc = test_acc
        patience_counter = 0
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'accuracy': test_acc,
        }, 'saved_models/fashion_best.pth')
        improved = ' <- best'
    else:
        patience_counter += 1

    print(f'Epoch {epoch+1:2d}/{num_epochs} | '
          f'Train: {train_acc:.2f}% | Test: {test_acc:.2f}% | '
          f'Patience: {patience_counter}/{patience}{improved}')

    if patience_counter >= patience:
        print(f'\nEarly stopping at epoch {epoch+1}.')
        break

elapsed = time.time() - start_time

# 5. Restore best model
best_ckpt = torch.load('saved_models/fashion_best.pth', map_location=device, weights_only=False)
model.load_state_dict(best_ckpt['model_state_dict'])
print(f'\nRestored best model from epoch {best_ckpt["epoch"] + 1}')
print(f'Best accuracy: {best_acc:.2f}%')
print(f'Training time: {elapsed:.1f}s')

# Per-class analysis
per_class_accuracy(model, test_loader, class_names)
```

</details>

---

## Final Summary

Collect all your results in one place.

In [None]:
# Summary table: fill in your results
print('=' * 65)
print(f'{"Experiment":<30} {"Best Test Acc":>15} {"Train/Test Gap":>15}')
print('-' * 65)
print(f'{"Baseline (5 epochs)":<30} {max(baseline_history["test_acc"]):>14.2f}% {baseline_history["train_acc"][-1] - baseline_history["test_acc"][-1]:>14.1f}')
print(f'{"Exp 1: Longer (20 epochs)":<30} {max(exp1_history["test_acc"]):>14.2f}% {exp1_history["train_acc"][-1] - exp1_history["test_acc"][-1]:>14.1f}')
print(f'{"Exp 2: Regularization":<30} {max(exp2_history["test_acc"]):>14.2f}% {exp2_history["train_acc"][-1] - exp2_history["test_acc"][-1]:>14.1f}')
# Add your Experiment 3 results here if you tracked them
print('=' * 65)
print()
print('Key observations:')
print('  - Training longer improves accuracy but widens the overfitting gap')
print('  - Regularization closes the gap and improves generalization')
print('  - Hard classes (Shirt, Coat, Pullover) share similar silhouettes')
print('  - FC networks flatten spatial structure; CNNs (Series 3) will help')

---

## Key Takeaways

1. **Training longer helps, but overfitting follows.** The scissors pattern (train acc rising, test acc plateauing) tells you when to stop adding epochs and start adding regularization.
2. **Regularization closes the scissors.** BatchNorm, Dropout, and weight decay each fight overfitting in different ways; combined, they let you train longer without the train and test curves diverging.
3. **Per-class accuracy reveals structure.** A single number hides the fact that T-shirt/Shirt/Pullover/Coat are hard (similar silhouettes) while Trouser/Sandal/Bag are easy (distinctive shapes).
4. **FC networks flatten spatial structure.** A shifted or rotated garment becomes an entirely different input vector; CNNs (Series 3) are built to exploit the structure that flattening throws away.
5. **The complete pipeline pattern** (device detection, training loop, checkpointing, early stopping, analysis) is what you carry forward to every future project.
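
To see why flattening discards spatial structure, consider a toy 28x28 image containing a single vertical bar, and the same bar shifted one pixel to the right. This sketch uses plain Python lists rather than real Fashion-MNIST images, and `make_bar_image` is a hypothetical helper.

```python
def make_bar_image(column, size=28):
    """Return a size x size image (nested lists) with one bright column."""
    return [[1.0 if c == column else 0.0 for c in range(size)]
            for _ in range(size)]


def flatten(image):
    """Row-major flatten, the same order nn.Flatten uses for a 2D image."""
    return [pixel for row in image for pixel in row]


original = flatten(make_bar_image(column=10))
shifted = flatten(make_bar_image(column=11))

# Dot product measures how much the two input vectors have in common.
overlap = sum(a * b for a, b in zip(original, shifted))
changed = sum(1 for a, b in zip(original, shifted) if a != b)
print(f"vector length: {len(original)}")
print(f"overlap (dot product): {overlap}")
print(f"positions that differ: {changed}")
```

To a fully connected layer these are two unrelated inputs: every weight that responded to the original bar now sees zero, even though a person would call the two images nearly identical.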

| Skill | Source Lesson |
|-------|-------------|
| Data loading with transforms | Datasets and DataLoaders |
| nn.Module model design | nn.Module |
| Training loop (forward, loss, backward, step) | The Training Loop |
| Cross-entropy loss, accuracy tracking | MNIST Project |
| model.train() / model.eval() | MNIST Project |
| BatchNorm, Dropout, weight decay | MNIST Project |
| Debugging checklist | Debugging and Visualization |
| Checkpointing and early stopping | Saving, Loading, and Checkpoints |
| GPU training | GPU Training |
| Per-class analysis | This project |

**You did not follow a tutorial. You made decisions, observed results, and adapted. That is machine learning.**

**Next:** Series 3, Convolutional Neural Networks. Your FC model tops out at ~89-90%. CNNs reach 93-95%. The remaining gap is what motivates the next series.

In [None]:
# Optional: clean up
# import shutil
# shutil.rmtree('saved_models', ignore_errors=True)
# print('Cleaned up saved_models/')