# Debugging Machine Learning Models in PyTorch
Welcome to the hands-on session for ML model debugging! In this notebook, you'll learn practical techniques to diagnose and fix issues in neural networks using PyTorch.

**Session Outline (75 min):**
1. Introduction to debugging ML models (5 min)
2. Setup & imports (5 min)
3. Visualizing data and model (10 min)
4. Forward/backward pass debugging (15 min)
5. Common pitfalls and how to fix them (15 min)
6. Practical debugging tools (10 min)
7. Guided exercise: fix a buggy model (10 min)
8. Wrap-up & Q&A (5 min)

## 1. Introduction to Debugging ML Models
Debugging is a critical skill for any machine learning practitioner. Even well-designed models can fail due to subtle bugs, data issues, or training instabilities.

In this session, you will learn:
- How to systematically diagnose problems in neural networks
- Common sources of errors in ML workflows
- Practical tools and techniques for debugging PyTorch models
- How to interpret model outputs and training signals to identify issues

By the end, you'll be able to approach ML model debugging with confidence and efficiency.

## 2. Setup & Imports
In this section, we'll set up the environment and import the necessary libraries for debugging PyTorch models.

In [5]:
# Setup: Import libraries and prepare data/model
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np

# Set random seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x14adb8b8bcd0>

The MNIST dataset consists of 28x28 grayscale images of handwritten digits (0-9).
- `train_dataset`: contains 60,000 training images and their labels.
- `test_dataset`: contains 10,000 test images and their labels.

Each image is converted to a tensor and normalized for better training stability.
Data loaders (`train_loader`, `test_loader`) provide batches of data for model training and evaluation.

In [None]:
# Download and prepare MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1000, shuffle=False)
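
Before training, it can help to confirm that a batch has the expected shape, value range, and label distribution. The sketch below is one possible check, not part of the original notebook: `summarize_batch` is a hypothetical helper, and the batch is synthetic random data standing in for a real batch from `train_loader`. The normalization line uses the same `(x - mean) / std` formula as `transforms.Normalize`, but with the synthetic batch's own statistics instead of the MNIST constants.

```python
import torch


def summarize_batch(data: torch.Tensor, target: torch.Tensor, num_classes: int = 10) -> dict:
    """Collect basic statistics that can reveal loading or normalization mistakes."""
    return {
        "data_shape": tuple(data.shape),
        "data_dtype": data.dtype,
        "mean": data.mean().item(),
        "std": data.std().item(),
        "has_nan": bool(torch.isnan(data).any()),
        "label_counts": torch.bincount(target, minlength=num_classes).tolist(),
    }


# Synthetic stand-in for one MNIST batch (assumption: real data would come from train_loader)
torch.manual_seed(0)
raw = torch.rand(64, 1, 28, 28)              # values in [0, 1], like the output of ToTensor()
labels = torch.randint(0, 10, (64,))
normalized = (raw - raw.mean()) / raw.std()  # (x - mean) / std, using this batch's own statistics

stats = summarize_batch(normalized, labels)
for key, value in stats.items():
    print(f"{key}: {value}")
```

After normalization the mean should be close to 0 and the standard deviation close to 1. Values far from that, or a label count dominated by one class, would be a reason to revisit the preprocessing.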

The model defined here is a simple Convolutional Neural Network (CNN) for classifying MNIST handwritten digit images. It consists of six convolutional layers, arranged in two blocks of three with dropout after each block, followed by two fully connected layers. The convolutional layers extract spatial features from the input images, while the fully connected layers perform classification based on these features. The final layer produces one score for each of the 10 digit classes (0-9). Architectures of this kind are common for image classification and serve as a reasonable baseline for MNIST.

In [None]:
# Define a simple CNN model for MNIST
class ConvNeuralNetwork(nn.Module):
    def __init__(self):
        super(ConvNeuralNetwork, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3) # -> (32, 26, 26)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=5) # -> (32, 22, 22)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5) # -> (32, 18, 18)
        self.dropout1 = nn.Dropout(0.4)

        self.conv4 = nn.Conv2d(32, 64, kernel_size=3) # -> (64, 16, 16)
        self.conv5 = nn.Conv2d(64, 64, kernel_size=5) # -> (64, 12, 12)
        self.conv6 = nn.Conv2d(64, 64, kernel_size=5) # -> (64, 8, 8)
        self.dropout2 = nn.Dropout(0.4)

        self.fc1 = nn.Linear(64 * 8 * 8, 50)
        self.fc2 = nn.Linear(50, 10)
        self.dropout3 = nn.Dropout(0.4)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = self.dropout1(x)
        x = F.relu(self.conv4(x))
        x = F.relu(self.conv5(x))
        x = F.relu(self.conv6(x))
        x = self.dropout2(x)
        x = x.view(-1, 64 * 8 * 8)
        x = F.relu(self.fc1(x))
        x = self.dropout3(x)
        x = F.softmax(self.fc2(x), dim=1)  # explicit dim: normalize over the 10 class scores
        return x

model = ConvNeuralNetwork()
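
The shape comments in the model definition can be checked by hand with the standard convolution output-size formula. The sketch below assumes stride 1, no padding, and no dilation, which are the `nn.Conv2d` defaults used by this model. `conv_out_size` is a hypothetical helper introduced only for this illustration.

```python
def conv_out_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution: floor((size + 2*padding - kernel_size) / stride) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


kernel_sizes = [3, 5, 5, 3, 5, 5]  # conv1 through conv6 in the model above
size = 28                          # MNIST images are 28x28
trace = []
for k in kernel_sizes:
    size = conv_out_size(size, k)
    trace.append(size)

print(trace)             # [26, 22, 18, 16, 12, 8]
print(64 * size * size)  # 4096, the input size expected by fc1
```

If the kernel sizes or the input resolution change, rerunning this trace shows quickly whether the `64 * 8 * 8` used in `fc1` and in `forward` still matches.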

## 3. Visualizing Data & Model

Understanding your data and model architecture is a crucial first step in debugging machine learning workflows. In this section, you'll learn how to:

- Visualize sample images and their corresponding labels from the MNIST training set
- Inspect the structure and layers of the convolutional neural network (CNN) model
- Examine model parameters to ensure correct initialization

These visualizations help verify that data is loaded correctly and the model is structured as intended before proceeding to training and debugging.

In [None]:
# Visualize sample inputs and labels from the training set
examples = enumerate(train_loader)
batch_idx, (example_data, example_targets) = next(examples)

plt.figure(figsize=(8, 3))
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.imshow(example_data[i][0], cmap='gray')
    plt.title(f"Label: {example_targets[i].item()}")
    plt.axis('off')
plt.tight_layout()
plt.show()

# Visualize model architecture
print("Model architecture:\n")
print(model)

# Visualize model parameters (layer names and shapes)
print("\nModel parameters:")
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

## 4. Forward/Backward Pass Debugging
Understanding how data flows through your model and how gradients are computed is essential for effective debugging.

In this section, you'll learn to:
- Inspect activations and outputs at each layer
- Check gradients to ensure proper learning
- Use hooks and manual inspection to debug the forward and backward passes
- Identify issues such as vanishing/exploding gradients or incorrect output shapes

We'll walk through practical examples using PyTorch's autograd and hooks.
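
As a minimal sketch of activation inspection with forward hooks, the example below registers a hook on every layer of a small stand-in network and records the shape, mean, and standard deviation of each output. `toy_model`, `make_hook`, and `activation_stats` are hypothetical names used only here. The same pattern should apply to the layers of `ConvNeuralNetwork`, though that is not shown.

```python
import torch
import torch.nn as nn

# Hypothetical small model used only for this illustration
toy_model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3),  # (1, 28, 28) -> (4, 26, 26)
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4 * 26 * 26, 10),
)

activation_stats = {}


def make_hook(name):
    """Build a forward hook that stores summary statistics of a layer's output."""
    def hook(module, inputs, output):
        activation_stats[name] = {
            "shape": tuple(output.shape),
            "mean": output.mean().item(),
            "std": output.std().item(),
        }
    return hook


# Register one hook per layer and keep the handles so the hooks can be removed later
handles = [
    layer.register_forward_hook(make_hook(f"{idx}:{layer.__class__.__name__}"))
    for idx, layer in enumerate(toy_model)
]

torch.manual_seed(0)
x = torch.randn(8, 1, 28, 28)  # random stand-in for a batch of images
out = toy_model(x)

for name, stats in activation_stats.items():
    print(name, stats)

# Remove the hooks so they do not slow down or clutter later runs
for handle in handles:
    handle.remove()
```

An unexpected shape points to an architecture mistake, while activations whose standard deviation collapses toward zero or grows very large across layers can hint at vanishing or exploding signals.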

In [None]:
# Training loop that records the average training and validation loss per epoch
import time

def train(model, optimizer, loss, num_epochs: int):
    """Train `model` and return per-epoch losses; `loss` is the loss function (criterion)."""
    loss_values = {"train": [], "val": []}
    for epoch in range(1, num_epochs + 1):
        start_time = time.time()

        # Training phase
        model.train()
        running_loss = 0.0
        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data)
            # One-hot encode the targets as floats (required for MSE; recent PyTorch
            # versions of nn.CrossEntropyLoss also accept class-probability targets)
            target_onehot = F.one_hot(target, num_classes=10).float()
            # Call the `loss` argument and keep the result in a separate variable
            batch_loss = loss(output, target_onehot)
            batch_loss.backward()
            optimizer.step()
            running_loss += batch_loss.item()
        loss_values["train"].append(running_loss / len(train_loader))

        # Validation phase (no dropout, no gradient tracking)
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for data, target in test_loader:
                output = model(data)
                target_onehot = F.one_hot(target, num_classes=10).float()
                val_loss += loss(output, target_onehot).item()
        loss_values["val"].append(val_loss / len(test_loader))
        elapsed_time = time.time() - start_time

        print(f"Epoch {epoch}: Average training loss = {loss_values['train'][-1]:.4f}, "
              f"Validation loss = {loss_values['val'][-1]:.4f}, Elapsed Time = {elapsed_time:.2f} s")
    return loss_values

To train a neural network in PyTorch, you need to define both an optimizer and a loss function. The optimizer updates the model parameters based on the computed gradients, while the loss function measures how well the model's predictions match the true labels.

- **Optimizer:** [torch.optim documentation](https://pytorch.org/docs/stable/optim.html)
- **Loss Function:** [torch.nn documentation](https://pytorch.org/docs/stable/nn.html#loss-functions)

In particular, the _Adam_ optimizer is a popular choice for training deep learning models. It combines the advantages of two other extensions of stochastic gradient descent: _AdaGrad_ and _RMSProp_. _Adam_ adapts the learning rate for each parameter, using running estimates of the first and second moments of the gradients to scale each update. Learn more about _Adam_ in the [Adam optimizer documentation](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).
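
To make the first- and second-moment estimates concrete, the sketch below applies the Adam update rule by hand for three steps and compares the result with `torch.optim.Adam`. It assumes the default settings without weight decay or AMSGrad, and a toy quadratic objective chosen only for illustration. The variable names (`m`, `v`, `manual`, `adam`) are not from the notebook.

```python
import torch

torch.manual_seed(0)
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

param = torch.randn(5, requires_grad=True)  # parameter updated by torch.optim.Adam
manual = param.detach().clone()             # copy updated by hand with the same rule

adam = torch.optim.Adam([param], lr=lr, betas=(beta1, beta2), eps=eps)

m = torch.zeros(5)  # first-moment estimate (running mean of gradients)
v = torch.zeros(5)  # second-moment estimate (running mean of squared gradients)

for t in range(1, 4):
    adam.zero_grad()
    objective = (param ** 2).sum()  # toy objective, gradient is 2 * param
    objective.backward()
    grad = param.grad.detach().clone()
    adam.step()

    # Manual Adam step using the same gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    manual = manual - lr * m_hat / (v_hat.sqrt() + eps)

print("Manual and library updates agree:", torch.allclose(manual, param.detach(), atol=1e-5))
```

One consequence visible in the formula: on the first step the update is roughly `lr * sign(grad)`, so the step size depends on `lr` rather than on the gradient magnitude. This is one reason a learning rate that works for plain SGD may be too large for Adam.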

In [2]:
# Set up optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Note: nn.CrossEntropyLoss expects raw logits, while forward() above already applies softmax
criterion = nn.CrossEntropyLoss()

losses = train(model, optimizer, loss=criterion, num_epochs=10)

### Model Validation

Validating your model is crucial to ensure it generalizes well to unseen data and does not simply memorize the training set. This notebook covers three methods for model validation:

1. **Plotting Training and Validation Losses:**  
    By visualizing the loss curves for both the training and validation sets over each epoch, you can monitor the learning process and detect issues such as overfitting (where validation loss increases while training loss decreases) or underfitting (both losses remain high). Consistent and decreasing validation loss indicates good generalization.

2. **Visualizing Predictions on Sample Data:**  
    Examining a few examples from the dataset along with their predicted labels and confidence scores helps you qualitatively assess model performance. This can reveal systematic errors, misclassifications, or areas where the model is uncertain, guiding further debugging and improvement.

3. **Extra Validation Metrics:**  
    Calculating metrics such as accuracy per label, confusion matrices, or heatmaps provides deeper insight into model performance across different classes. These metrics help identify if the model is biased toward certain labels or struggles with specific digits, enabling targeted improvements.

These validation techniques provide both quantitative and qualitative insights into your model's behavior during training.
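
As an illustration of the extra validation metrics, the sketch below builds a confusion matrix and per-class accuracy from predicted and true labels using only `torch`. `confusion_matrix` and `per_class_accuracy` are hypothetical helpers, and the labels are a tiny hand-made example rather than MNIST predictions.

```python
import torch


def confusion_matrix(preds: torch.Tensor, targets: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Count matrix where rows are true classes and columns are predicted classes."""
    flat_index = targets * num_classes + preds
    counts = torch.bincount(flat_index, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def per_class_accuracy(cm: torch.Tensor) -> torch.Tensor:
    """Fraction of correctly classified samples for each true class."""
    totals = cm.sum(dim=1).clamp(min=1)  # avoid division by zero for absent classes
    return cm.diagonal().float() / totals.float()


# Tiny hand-made example with 3 classes
targets = torch.tensor([0, 0, 1, 1, 2, 2])
preds = torch.tensor([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(preds, targets, num_classes=3)
print(cm)                      # rows: true class, columns: predicted class
print(per_class_accuracy(cm))  # tensor([0.5000, 1.0000, 0.5000])
```

For the MNIST model, `preds` would come from `output.argmax(dim=1)` collected over `test_loader`, with `num_classes=10`. Large off-diagonal entries show which digits are confused with each other.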

In [7]:
import matplotlib.pyplot as plt

plt.plot(range(1, len(losses["train"]) + 1), losses["train"], label="train")
plt.plot(range(1, len(losses["val"]) + 1), losses["val"], label="validation")
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.legend()

In [None]:
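# Visualize sample inputs with their true labels and the model's predictions
examples = enumerate(train_loader)
batch_idx, (example_data, example_targets) = next(examples)

# Run inference once for the displayed samples, without dropout or gradient tracking
model.eval()
with torch.no_grad():
    preds = model(example_data[:6])  # each row holds class probabilities (forward() applies softmax)

plt.figure(figsize=(8, 3))
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.imshow(example_data[i][0], cmap='gray')
    pred = preds[i]  # prediction for the sample being plotted
    label = int(pred.argmax())
    prob = 100 * pred[label].item()
    plt.title(f"Label: {example_targets[i].item()}\n Pred: {label} ({prob:.0f} %)")
    plt.axis('off')
plt.tight_layout()
plt.show()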

## 5. Common Pitfalls and How to Fix Them
Machine learning models often fail due to subtle bugs or data issues. Recognizing common pitfalls can save significant debugging time.

Key issues to watch for:
- Data leakage between train/test sets
- Incorrect loss function or output activation
- Poor data normalization or preprocessing
- Overfitting or underfitting
- Vanishing/exploding gradients
- Misaligned labels or targets

We'll demonstrate how to detect and address these problems in practice.
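
One of the pitfalls listed above, a mismatch between output activation and loss function, can be reproduced in a few lines. The sketch below uses hand-picked logits for a confidently correct prediction. It compares cross-entropy computed on raw logits with cross-entropy computed on values that were already passed through softmax. The numbers are illustrative and do not come from the notebook's model.

```python
import math

import torch
import torch.nn.functional as F

# A confidently correct prediction for class 3 out of 10 classes
logits = torch.zeros(1, 10)
logits[0, 3] = 20.0
target = torch.tensor([3])

# Intended usage: cross_entropy applies log-softmax internally, so it expects raw logits
loss_from_logits = F.cross_entropy(logits, target)

# Pitfall: softmax is applied twice, once explicitly and once inside cross_entropy
loss_from_probs = F.cross_entropy(F.softmax(logits, dim=1), target)

# Because probabilities lie in [0, 1], the doubled softmax can never become confident:
# with 10 classes the loss cannot drop below log(1 + 9 / e), roughly 1.46
loss_floor = math.log(1 + 9 / math.e)

print(f"Loss on logits:        {loss_from_logits.item():.6f}")
print(f"Loss on probabilities: {loss_from_probs.item():.6f}")
print(f"Theoretical floor:     {loss_floor:.6f}")
```

A training loss that plateaus near 1.46 on a 10-class problem is therefore a hint to check whether the model output is being normalized twice.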

## 6. Practical Debugging Tools
PyTorch and the Python ecosystem offer powerful tools for debugging ML models.

Recommended tools and techniques:
- `torch.autograd` for inspecting gradients and computation graphs
- Forward/backward hooks for monitoring activations and gradients
- TensorBoard for visualizing metrics and model graphs
- Matplotlib for plotting loss, accuracy, and predictions
- Printing shapes and values at key points in the model

We'll show how to use these tools to quickly identify and resolve issues.
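
As a sketch of gradient inspection with `torch.autograd`, the example below runs one backward pass on a small stand-in model and reports the gradient norm of every parameter, plus a check for NaN or infinite values. `gradient_report` and `all_gradients_finite` are hypothetical helpers, and the model and data are synthetic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gradient_report(model: nn.Module) -> dict:
    """Map each parameter name to the L2 norm of its gradient (None if no gradient yet)."""
    report = {}
    for name, param in model.named_parameters():
        report[name] = None if param.grad is None else param.grad.norm().item()
    return report


def all_gradients_finite(model: nn.Module) -> bool:
    """Return True if no existing gradient contains NaN or infinite values."""
    return all(
        torch.isfinite(p.grad).all().item()
        for p in model.parameters()
        if p.grad is not None
    )


# Hypothetical small model and random data, used only for this illustration
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 10))
x = torch.randn(32, 20)
y = torch.randint(0, 10, (32,))

batch_loss = F.cross_entropy(toy_model(x), y)
batch_loss.backward()

for name, norm in gradient_report(toy_model).items():
    print(f"{name}: grad norm = {norm:.4e}")
print("All gradients finite:", all_gradients_finite(toy_model))
```

Norms that shrink by orders of magnitude toward the input layers suggest vanishing gradients, very large norms suggest exploding gradients, and a `None` entry means a parameter is not connected to the loss.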

## 7. Guided Exercise: Fix a Buggy Model
Now it's your turn! Below is a model with intentional bugs. Try to identify and fix the issues using the debugging techniques we've covered.

Steps:
1. Run the code and observe any errors or unexpected outputs.
2. Use visualization, hooks, and print statements to diagnose the problem.
3. Fix the bugs and verify the model trains correctly.

Discuss your findings and solutions with your peers.
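
One way to approach step 3 is to check that the model can overfit a single small batch, since a correctly wired model and training loop should be able to memorize a handful of samples. The sketch below is an assumption about how such a check might look, using a hypothetical `tiny_model` and random data rather than the exercise model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in model and a single fixed batch of random "images"
tiny_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
batch_x = torch.randn(16, 1, 28, 28)
batch_y = torch.randint(0, 10, (16,))

opt = torch.optim.Adam(tiny_model.parameters(), lr=1e-3)
history = []
for step in range(200):
    opt.zero_grad()
    step_loss = F.cross_entropy(tiny_model(batch_x), batch_y)  # raw logits go into the loss
    step_loss.backward()
    opt.step()
    history.append(step_loss.item())

with torch.no_grad():
    accuracy = (tiny_model(batch_x).argmax(dim=1) == batch_y).float().mean().item()

print(f"Initial loss: {history[0]:.4f}")
print(f"Final loss:   {history[-1]:.4f}")
print(f"Accuracy on the memorized batch: {accuracy:.2f}")
```

If the loss on a single batch does not fall substantially, the problem is more likely in the model, the loss, or the optimizer setup than in the amount of data or the number of epochs.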

## 8. Wrap-up & Q&A
Congratulations on completing the debugging session!
- Review the key techniques and tools for debugging ML models
- Share your experiences and ask questions
- Explore further resources for advanced debugging and model analysis

Thank you for participating!