# Debugging and Visualization

This notebook gives you hands-on practice with PyTorch debugging and visualization tools:

1. **torchinfo**: Inspect model architecture and verify parameter counts
2. **Gradient monitoring**: Write a function to track gradient norms per layer
3. **TensorBoard**: Log training metrics and compare runs visually
4. **Debugging a broken script**: Apply the debugging checklist to find real bugs

These are the tools you reach for when something goes wrong. Practice them now so they're second nature when you need them.

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones: they reveal gaps in your mental model.

**Estimated time:** 30–45 minutes on Colab.

---

## Setup

Run these cells to install dependencies and import everything.

In [None]:
!pip install torchinfo

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from torchinfo import summary

# TensorBoard
%load_ext tensorboard
from torch.utils.tensorboard import SummaryWriter

# Reproducibility
torch.manual_seed(42)

# Use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

# For nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

## MNIST Model and Data (from MNIST Project)

This is the same model and data loading code from lesson 2-2-2. Run these cells so you have a working model to debug and visualize.

In [None]:
# Download and load MNIST
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])

train_dataset = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.MNIST(
    root='./data', train=False, download=True, transform=transform
)

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=64, shuffle=True
)
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=64, shuffle=False
)

print(f'Training samples: {len(train_dataset)}')
print(f'Test samples: {len(test_dataset)}')
print(f'Image shape: {train_dataset[0][0].shape}')

In [None]:
class MNISTClassifier(nn.Module):
    """The same MNIST model from your MNIST Project.
    
    Architecture:
        Flatten (784) -> Linear(784, 128) -> ReLU
        -> Linear(128, 64) -> ReLU -> Linear(64, 10)
    """
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = MNISTClassifier().to(device)
print(f'Model parameters: {sum(p.numel() for p in model.parameters()):,}')

---

## Exercise 1: Inspect with torchinfo (Guided)

Run `torchinfo.summary()` on your MNIST model. Verify the parameter count matches your manual calculation from the MNIST Project.

Recall the manual calculation:
- `fc1`: 784 x 128 + 128 = 100,480
- `fc2`: 128 x 64 + 64 = 8,256
- `fc3`: 64 x 10 + 10 = 650
- **Total**: 109,386

**Before running, predict:** What output shape will torchinfo show after the Flatten layer? After fc1? Where will most of the parameters be concentrated?

In [None]:
# Run torchinfo summary on MNISTClassifier
# input_size matches a single MNIST batch: (batch_size, channels, height, width)
summary(model, input_size=(1, 1, 28, 28))

In [None]:
# Verify: manual calculation vs torchinfo
manual_fc1 = 784 * 128 + 128
manual_fc2 = 128 * 64 + 64
manual_fc3 = 64 * 10 + 10
manual_total = manual_fc1 + manual_fc2 + manual_fc3

actual_total = sum(p.numel() for p in model.parameters())

print('Manual calculation:')
print(f'  fc1: {manual_fc1:>10,}')
print(f'  fc2: {manual_fc2:>10,}')
print(f'  fc3: {manual_fc3:>10,}')
print(f'  Total: {manual_total:>8,}')
print()
print(f'Actual total:  {actual_total:,}')
print(f'Match: {manual_total == actual_total}')

### What to notice

- torchinfo shows **output shape** at each layer, which is how you catch dimension bugs
- It shows **parameter count** per layer â€” use this to verify your architecture
- The `Flatten` layer has 0 parameters (it just reshapes)
- Most parameters live in `fc1` because its input is 784-dimensional
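
As a cross-check that does not depend on torchinfo, the per-layer counts can be tallied straight from `named_parameters()`. This is a minimal sketch: the `nn.Sequential` stand-in and the helper name `count_by_layer` are assumptions made for illustration, chosen to mirror the layer sizes of `MNISTClassifier`.

```python
from collections import OrderedDict

import torch.nn as nn

# Stand-in with the same layer sizes as MNISTClassifier (assumption for illustration).
model = nn.Sequential(OrderedDict([
    ('flatten', nn.Flatten()),
    ('fc1', nn.Linear(784, 128)),
    ('relu1', nn.ReLU()),
    ('fc2', nn.Linear(128, 64)),
    ('relu2', nn.ReLU()),
    ('fc3', nn.Linear(64, 10)),
]))

def count_by_layer(model):
    """Sum parameter counts per top-level layer name (hypothetical helper)."""
    counts = {}
    for name, param in model.named_parameters():
        layer = name.split('.')[0]          # 'fc1.weight' -> 'fc1'
        counts[layer] = counts.get(layer, 0) + param.numel()
    return counts

counts = count_by_layer(model)
for layer, n in counts.items():
    print(f'{layer}: {n:,}')
print(f'total: {sum(counts.values()):,}')
```

Layers without learnable tensors (`Flatten`, `ReLU`) never appear in `named_parameters()`, which is why only the three `Linear` layers show up.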

---


## Exercise 2: Find a Shape Bug with torchinfo (Guided)

Introduce a shape bug by removing `nn.Flatten()` from the model. Run torchinfo and identify the problem. Then fix it.

This simulates a common debugging scenario: the model crashes with a cryptic shape error, and you need to figure out where the mismatch happens.

**Before running, predict:** If you feed a `[1, 1, 28, 28]` tensor directly into `nn.Linear(784, 128)` without flattening, what error will you get? What shape does the Linear layer actually receive?

In [None]:
class BrokenModel(nn.Module):
    """Same as MNISTClassifier but missing nn.Flatten()."""
    def __init__(self):
        super().__init__()
        # Bug: no self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Bug: x goes straight into fc1 without flattening
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

broken_model = BrokenModel().to(device)

# Try running torchinfo on the broken model.
# It will show an error or unexpected shapes; that's the point.
try:
    summary(broken_model, input_size=(1, 1, 28, 28))
except Exception as e:
    print(f'Error: {e}')
    print()
    print('The model crashed because fc1 expects 784 features in the last dimension,')
    print('but received a 4D tensor of shape [1, 1, 28, 28], whose last dimension is 28.')
    print('The fix: add nn.Flatten() before the first Linear layer.')

In [None]:
class FixedModel(nn.Module):
    """BrokenModel with the Flatten fix applied."""
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()  # Fix: add flatten
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.flatten(x)  # Fix: flatten before fc1
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

fixed_model = FixedModel().to(device)

# Now torchinfo should work
print('Fixed model:')
summary(fixed_model, input_size=(1, 1, 28, 28))

### Takeaway

When you hit a shape error:
1. Run `torchinfo.summary()`: it shows the output shape at every layer
2. Find the layer where the shape goes wrong
3. The fix is usually a missing reshape, flatten, or wrong dimension argument

This is faster than reading PyTorch's error messages, which often point to the *symptom* (wrong size at a Linear layer) rather than the *cause* (missing Flatten).
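
To see why the error names the symptom, it helps to remember that `nn.Linear` acts on the last dimension only. The sketch below (variable names are illustrative, not taken from the notebook) feeds an unflattened image-shaped tensor to a `Linear(784, 128)` layer, captures the error, and then applies the flatten fix.

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(784, 128)
x = torch.randn(1, 1, 28, 28)               # one MNIST-shaped image

# nn.Linear multiplies along the LAST dimension, which is 28 here, not 784.
try:
    fc1(x)
    error_message = None
except RuntimeError as e:
    error_message = str(e)
print(f'last dim seen by fc1: {x.shape[-1]}')
print(f'error: {error_message}')

# Flattening everything after the batch dimension gives the expected 784 features.
flat = nn.Flatten()(x)
out = fc1(flat)
print(f'flattened: {tuple(flat.shape)} -> output: {tuple(out.shape)}')
```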

---


## Exercise 3: Gradient Norm Monitoring (Supported)

Write a `log_gradient_norms()` function that iterates over `model.named_parameters()` and prints `grad.norm()` for each parameter.

Then compare gradient norms between:
- A **healthy model** (default PyTorch initialization)
- A **poorly-initialized model** (weights set to 100.0)

This shows you what gradient health looks like, and what problems look like.

<details>
<summary>💡 Solution</summary>

The key insight is that `model.named_parameters()` gives you access to every learnable tensor and its name. After calling `loss.backward()`, each parameter's `.grad` attribute contains the gradient. The L2 norm of that gradient tells you the magnitude of the update signal.

```python
def log_gradient_norms(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f'{name:<30} {grad_norm:>12.6f}')
```

Healthy gradients are typically in the 0.01 to 10.0 range. If you see norms above 100 (exploding) or exactly 0.0 (dead neurons/saturated activations), something is wrong with your initialization or architecture.

</details>

In [None]:
def log_gradient_norms(model):
    """Print the gradient norm for each named parameter.
    
    Call this after loss.backward() to inspect gradient magnitudes.
    Healthy gradients are typically in the range 0.01 to 10.0.
    Very large (>100) or very small (<1e-6) gradients signal problems.
    """
    print(f'{"Layer":<30} {"Grad Norm":>12}')
    print('-' * 44)
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f'{name:<30} {grad_norm:>12.6f}')
        else:
            print(f'{name:<30} {"No grad":>12}')

In [None]:
# --- Healthy model (default initialization) ---
healthy_model = MNISTClassifier().to(device)

# One forward + backward pass
sample_input = torch.randn(1, 1, 28, 28).to(device)
sample_target = torch.tensor([3]).to(device)

output = healthy_model(sample_input)
loss = nn.CrossEntropyLoss()(output, sample_target)
loss.backward()

print('HEALTHY MODEL (default init)')
print('=' * 44)
log_gradient_norms(healthy_model)
print(f'\nLoss: {loss.item():.4f}')

In [None]:
# --- Poorly initialized model (weights set to 100.0) ---
bad_model = MNISTClassifier().to(device)

# Set every parameter (weights and biases) to an extreme value
with torch.no_grad():
    for param in bad_model.parameters():
        param.fill_(100.0)

# One forward + backward pass
output = bad_model(sample_input)
loss = nn.CrossEntropyLoss()(output, sample_target)
loss.backward()

print('POORLY INITIALIZED MODEL (all weights = 100.0)')
print('=' * 44)
log_gradient_norms(bad_model)
print(f'\nLoss: {loss.item():.4f}')

### What to notice

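- **Healthy model**: Gradient norms are moderate (roughly 0.01 to 10). The loss is a reasonable number (around 2.3, i.e. ln(10), for near-uniform predictions on 10 classes).
- **Bad model**: Gradient norms are either enormous (exploding) or zero (dead ReLUs, or gradients that cancel because every weight is identical). The loss may be NaN or extremely large, but it can also look deceptively normal when identical weights produce identical logits, so read the gradient norms rather than the loss alone.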
- This is exactly the kind of signal that tells you "something is wrong with initialization" before you waste time training.
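
The expected starting loss can be checked directly: a classifier that assigns equal probability to all 10 classes has a cross-entropy loss of ln(10). The snippet below is a small sanity check that uses all-zero logits as a stand-in for such an uninformed model.

```python
import math

import torch
import torch.nn as nn

# Equal logits -> softmax gives probability 1/10 to every class.
logits = torch.zeros(1, 10)
target = torch.tensor([3])

loss = nn.CrossEntropyLoss()(logits, target).item()
print(f'uniform-prediction loss: {loss:.4f}')
print(f'ln(10):                  {math.log(10):.4f}')
```

A first-batch loss far from this value is a hint to look at initialization or the data pipeline before training further.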

**Rule of thumb:** If gradient norms vary by more than 3–4 orders of magnitude across layers, or if any are exactly 0.0, investigate before training.
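
The rule of thumb can be turned into an automatic check. The sketch below is one possible way to do it: the helper name `gradient_norm_report` and its thresholds (`zero_tol`, `spread_limit`) are illustrative assumptions, not values prescribed by PyTorch, and the small `nn.Sequential` model is a stand-in for the notebook's classifier.

```python
import torch
import torch.nn as nn

def gradient_norm_report(model, zero_tol=0.0, spread_limit=1e4):
    """Return (norms, warnings) for a model whose gradients are populated.

    Hypothetical helper: zero_tol and spread_limit are illustrative thresholds.
    """
    norms = {name: p.grad.norm().item()
             for name, p in model.named_parameters() if p.grad is not None}
    warnings = [f'{name}: zero gradient' for name, n in norms.items() if n <= zero_tol]
    nonzero = [n for n in norms.values() if n > zero_tol]
    if nonzero and max(nonzero) / min(nonzero) > spread_limit:
        warnings.append(f'norms span more than {spread_limit:g}x across layers')
    return norms, warnings

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss = nn.CrossEntropyLoss()(model(torch.randn(8, 1, 28, 28)), torch.randint(0, 10, (8,)))
loss.backward()

norms, warnings = gradient_norm_report(model)
print(norms)
print(warnings or 'no warnings')
```

Returning the warnings as a list, rather than printing them, makes it easy to assert on them after the first batch of a real training run.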

---


## Exercise 4: TensorBoard Logging (Supported)

Add TensorBoard logging to your MNIST training loop. Train for 10 epochs, then open TensorBoard to examine the loss and accuracy curves.

TensorBoard gives you a persistent, interactive dashboard, which is much better than printing numbers in a cell.

<details>
<summary>💡 Solution</summary>

The key insight is that `SummaryWriter` creates a log directory, and `writer.add_scalar(tag, value, step)` logs a single data point. Use separate tags for different metrics (e.g., `'Loss/train'`, `'Accuracy/test'`) and the epoch number as the step.

```python
writer = SummaryWriter('runs/mnist')

for epoch in range(num_epochs):
    # ... training loop ...
    writer.add_scalar('Loss/train', avg_loss, epoch)
    writer.add_scalar('Accuracy/test', test_acc, epoch)

writer.close()
```

Always call `writer.close()` when done. Each experiment should use a separate log directory (e.g., `runs/mnist_v1`, `runs/mnist_v2`) so you can compare them in TensorBoard.

</details>

In [None]:
def evaluate_accuracy(model, loader):
    """Compute accuracy on a dataset."""
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total

In [None]:
# Training loop with TensorBoard logging
tb_model = MNISTClassifier().to(device)
optimizer = optim.Adam(tb_model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Create a TensorBoard writer
writer = SummaryWriter('runs/mnist')

num_epochs = 10

for epoch in range(num_epochs):
    # Training
    tb_model.train()
    running_loss = 0.0
    n_batches = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = tb_model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        n_batches += 1

    avg_loss = running_loss / n_batches
    test_acc = evaluate_accuracy(tb_model, test_loader)

    # Log to TensorBoard
    writer.add_scalar('Loss/train', avg_loss, epoch)
    writer.add_scalar('Accuracy/test', test_acc, epoch)

    print(f'Epoch {epoch+1:2d}/{num_epochs}  Loss: {avg_loss:.4f}  Test Acc: {test_acc:.2%}')

writer.close()
print('\nTraining complete. TensorBoard logs written to runs/mnist/')

In [None]:
# Open TensorBoard inline
# You should see Loss/train decreasing and Accuracy/test increasing.
%tensorboard --logdir runs

### What to notice in TensorBoard

- **Loss/train** should decrease smoothly over epochs
- **Accuracy/test** should increase and plateau
- If loss is noisy or increasing, something is wrong (learning rate too high, bug in the loop)
- TensorBoard keeps data across runs, which is useful for comparing experiments
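
Scalars other than loss and accuracy can go to the same dashboard. As a sketch, the gradient norms from Exercise 3 could be logged per layer with `add_scalar`. The helper name `log_grad_norms_to_writer`, the `GradNorm/` tag prefix, and the `RecordingWriter` stand-in (used only so the example runs without writing log files) are assumptions for illustration; in the notebook you would pass the real `SummaryWriter` instead.

```python
import torch
import torch.nn as nn

class RecordingWriter:
    """Stand-in exposing the same add_scalar(tag, value, step) call as SummaryWriter."""
    def __init__(self):
        self.records = []

    def add_scalar(self, tag, value, step):
        self.records.append((tag, value, step))

def log_grad_norms_to_writer(model, writer, step):
    """Log one scalar per parameter tensor; call after loss.backward()."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            writer.add_scalar(f'GradNorm/{name}', param.grad.norm().item(), step)

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))
writer = RecordingWriter()

for step in range(3):
    model.zero_grad()
    loss = nn.CrossEntropyLoss()(model(torch.randn(4, 1, 28, 28)), torch.randint(0, 10, (4,)))
    loss.backward()
    log_grad_norms_to_writer(model, writer, step)

print(len(writer.records), 'scalars logged')
print(writer.records[0][0])
```

Sharing a tag prefix groups the per-layer curves into one TensorBoard section, so a layer whose gradient collapses stands out next to the others.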

---


## Exercise 5: Learning Rate Comparison (Supported)

Train 3 runs with different learning rates: 0.001, 0.01, and 0.1. Compare them in TensorBoard.

Identify which learning rate is too high, too low, and just right.

<details>
<summary>💡 Solution</summary>

The key insight is using separate `SummaryWriter` directories for each run so TensorBoard can overlay them. The pattern is `SummaryWriter(f'runs/lr_{lr}')`.

```python
for lr in [0.001, 0.01, 0.1]:
    writer = SummaryWriter(f'runs/lr_{lr}')
    model = MNISTClassifier().to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    # ... same training loop, logging with writer ...
    writer.close()
```

You should see: lr=0.001 converges slowly but steadily, lr=0.01 converges quickly and cleanly, lr=0.1 may oscillate or fail to converge. Adam is more forgiving than SGD, so even lr=0.1 might work, but the loss curve will be noisier.

</details>

In [None]:
learning_rates = [0.001, 0.01, 0.1]
num_epochs = 10

for lr in learning_rates:
    print(f'\n{"=" * 50}')
    print(f'Training with lr={lr}')
    print(f'{"=" * 50}')

    lr_model = MNISTClassifier().to(device)
    optimizer = optim.Adam(lr_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # Each run gets its own TensorBoard log directory
    writer = SummaryWriter(f'runs/lr_{lr}')

    for epoch in range(num_epochs):
        lr_model.train()
        running_loss = 0.0
        n_batches = 0

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = lr_model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            n_batches += 1

        avg_loss = running_loss / n_batches
        test_acc = evaluate_accuracy(lr_model, test_loader)

        writer.add_scalar('Loss/train', avg_loss, epoch)
        writer.add_scalar('Accuracy/test', test_acc, epoch)

        print(f'  Epoch {epoch+1:2d}/{num_epochs}  Loss: {avg_loss:.4f}  Test Acc: {test_acc:.2%}')

    writer.close()

print('\nAll runs complete. Open TensorBoard to compare.')

In [None]:
# Open TensorBoard to compare all runs side by side
# Each run appears under its directory name: lr_0.001, lr_0.01, lr_0.1 (plus mnist from Exercise 4).
%tensorboard --logdir runs

### What to look for

- **lr=0.001**: Loss decreases steadily but slowly. Accuracy climbs gradually. This is the *too low* rate: it works but wastes compute.
- **lr=0.01**: Loss drops faster. Accuracy ramps up quickly. Likely the *just right* rate for this model.
- **lr=0.1**: Loss may spike, oscillate, or fail to converge. Accuracy may be erratic or stuck. This is the *too high* rate.

The exact behavior depends on the optimizer (Adam is more forgiving than SGD, and its default learning rate is 0.001, so that run may do better than the labels above suggest), but the pattern holds. Being able to **see** these curves side by side is why TensorBoard exists.
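
The three regimes are easier to recognize once you have seen them in a setting with no noise. The toy below is a stand-in, not the notebook's experiment: it runs plain gradient descent (not Adam) on f(w) = w², and the learning rates are chosen to show each regime for this particular function.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return |w| after each step."""
    history = []
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w**2
        w = w - lr * grad
        history.append(abs(w))
    return history

for lr in (0.01, 0.1, 1.1):
    final = gradient_descent(lr)[-1]
    print(f'lr={lr:<5} |w| after 20 steps: {final:.4f}')
```

With `lr=0.01` the iterate has covered only about a third of the distance to the minimum, with `lr=0.1` it is close to zero, and with `lr=1.1` every step overshoots so `|w|` grows. The same shapes (slow decay, fast decay, blow-up) are what to look for in the TensorBoard curves.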

---


## Exercise 6: Debug a Broken Training Script (Independent)

The script below has **3 intentional bugs**. Your job: use the debugging checklist to find and fix all three.

The bugs:
1. A **shape error**: the model processes input incorrectly
2. A **missing `model.eval()`**: evaluation runs in training mode
3. A **subtle data loading bug**: the training DataLoader is misconfigured

**Approach:**
- Read the script carefully
- Try running it: one bug will crash immediately
- Fix the crash, then look for the two silent bugs (they won't crash, but they hurt performance)
- Use `torchinfo.summary()` and print statements to help

<details>
<summary>💡 Solution</summary>

The three bugs target different categories of errors you'll encounter in practice:

**Bug 1 (Shape error):** The model is missing `nn.Flatten()`. The input is `[batch, 1, 28, 28]` but `nn.Linear(784, 128)` expects `[batch, 784]`. Fix: add `self.flatten = nn.Flatten()` and call it first in `forward()`. This one crashes immediately, making it the easiest to find.

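**Bug 2 (Silent correctness):** `buggy_evaluate()` never calls `model.eval()`. This affects models with Dropout or BatchNorm because they behave differently during training vs inference. This particular model has neither, so its accuracy numbers do not change, but the bug becomes real as soon as you add one of those layers. Fix: add `model.eval()` at the top of the evaluation function.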

**Bug 3 (Silent performance):** The training DataLoader uses `shuffle=False`. The model sees data in the same order every epoch, which can cause it to learn ordering patterns instead of generalizing. Fix: set `shuffle=True`.

```python
# Bug 1 fix
self.flatten = nn.Flatten()
# in forward: x = self.flatten(x)

# Bug 2 fix
def fixed_evaluate(model, loader):
    model.eval()  # Add this line
    ...

# Bug 3 fix
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
```

The shape error is easy because it crashes. The dangerous bugs are the silent ones, which is why systematic checklists matter.

</details>

In [None]:
# ===== BUGGY TRAINING SCRIPT =====
# There are 3 bugs in this script. Find and fix them all.
#
# Bug hints (read these AFTER you've tried finding them yourself):
#   1. The model is missing a critical layer for handling image input
#   2. The evaluation function doesn't switch the model to eval mode
#   3. The training DataLoader has a configuration that hurts learning


class BuggyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # BUG 1: No nn.Flatten(), so fc1 receives a 4D tensor instead of [batch, 784]
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        # BUG 1: x is [batch, 1, 28, 28] but fc1 expects [batch, 784]
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def buggy_evaluate(model, loader):
    """Evaluate accuracy on a dataset."""
    # BUG 2: Missing model.eval(), so dropout/batchnorm would behave incorrectly
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total


def buggy_train():
    buggy_model = BuggyClassifier().to(device)
    optimizer = optim.Adam(buggy_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # BUG 3: shuffle=False for training data
    # The model sees the same order every epoch, which can hurt generalization
    # and cause the loss to follow a predictable pattern rather than converging smoothly
    buggy_train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=64, shuffle=False
    )

    for epoch in range(5):
        buggy_model.train()
        running_loss = 0.0
        n_batches = 0

        for images, labels in buggy_train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = buggy_model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            n_batches += 1

        avg_loss = running_loss / n_batches
        acc = buggy_evaluate(buggy_model, test_loader)
        print(f'Epoch {epoch+1}/5  Loss: {avg_loss:.4f}  Test Acc: {acc:.2%}')


# This will crash; start debugging here
buggy_train()

### Fix the bugs below

Copy the buggy code into the cell below and fix all three bugs. Then run it to verify.

In [None]:
# ===== YOUR FIXED VERSION =====
# Copy the buggy code here and fix all three bugs:
#   1. Add nn.Flatten() to the model
#   2. Add model.eval() in the evaluate function
#   3. Set shuffle=True in the training DataLoader


class FixedClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()       # FIX 1: add flatten
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.flatten(x)               # FIX 1: flatten input
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def fixed_evaluate(model, loader):
    model.eval()                           # FIX 2: switch to eval mode
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total


def fixed_train():
    fixed_model = FixedClassifier().to(device)
    optimizer = optim.Adam(fixed_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # FIX 3: shuffle=True for training data
    fixed_train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=64, shuffle=True
    )

    for epoch in range(5):
        fixed_model.train()
        running_loss = 0.0
        n_batches = 0

        for images, labels in fixed_train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = fixed_model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            n_batches += 1

        avg_loss = running_loss / n_batches
        acc = fixed_evaluate(fixed_model, test_loader)
        print(f'Epoch {epoch+1}/5  Loss: {avg_loss:.4f}  Test Acc: {acc:.2%}')


fixed_train()

### Bug debrief

| Bug | Type | How to catch it |
|-----|------|----------------|
| Missing `nn.Flatten()` | Shape error | Crashes immediately. `torchinfo.summary()` would show the dimension mismatch. |
| Missing `model.eval()` | Silent correctness bug | Doesn't crash. Affects models with dropout or batch norm. Check your evaluation function template. |
| `shuffle=False` in training | Silent performance bug | Doesn't crash. Model still trains, but converges slower and generalizes worse. Always shuffle training data. |

The shape error is easy: it crashes. The dangerous bugs are the silent ones. That's why you need a **debugging checklist** that you run through systematically, not just when things crash.
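
To make the `model.eval()` bug visible, it helps to try it on a model that actually contains Dropout. The sketch below uses a hypothetical variant of the classifier with an added `nn.Dropout` layer (the notebook's model has none) and compares repeated forward passes in each mode. Note that `torch.no_grad()` alone does not switch Dropout off.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical variant of the classifier with a Dropout layer added.
model = nn.Sequential(
    nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(128, 10)
)
x = torch.randn(4, 1, 28, 28)

with torch.no_grad():
    model.train()                            # the mode buggy_evaluate leaves the model in
    train_a, train_b = model(x), model(x)
    model.eval()                             # the mode fixed_evaluate switches to
    eval_a, eval_b = model(x), model(x)

print('train mode, same input, same output:', torch.equal(train_a, train_b))
print('eval mode,  same input, same output:', torch.equal(eval_a, eval_b))
```

In training mode each pass draws a fresh dropout mask, so evaluation accuracy becomes noisy and pessimistic; in eval mode the output is deterministic.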

---

## Key Takeaways

1. **`torchinfo.summary()`** is your first tool when a model crashes or you want to verify architecture. It shows shapes and parameter counts at every layer.

2. **Gradient norms** tell you if training is healthy before you waste time on epochs. Extreme values (too large or zero) mean something is wrong with initialization or architecture.

3. **TensorBoard** replaces print statements with persistent, interactive dashboards. Use separate run directories to compare experiments (different learning rates, architectures, etc.).

4. **The most dangerous bugs are silent.** Missing `model.eval()`, wrong `shuffle` settings, and subtle data issues won't crash your code; they just produce worse results without explanation. A systematic debugging checklist catches these.

5. **Use these tools proactively**, not just when something breaks. Run `torchinfo` on every new model. Check gradient norms after the first batch. Log everything to TensorBoard. The 30 seconds of setup saves hours of confusion.
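
Parts of a debugging checklist can be automated. The sketch below shows two such checks for the silent bugs from Exercise 6. The helper names `is_shuffling` and `assert_eval_mode` are hypothetical, and the tiny random dataset exists only to make the example self-contained.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

def is_shuffling(loader):
    """True if the loader uses a RandomSampler, which is what shuffle=True selects."""
    return isinstance(loader.sampler, RandomSampler)

def assert_eval_mode(model):
    """Fail loudly if evaluation is about to run in training mode."""
    assert not model.training, 'call model.eval() before evaluating'

# Tiny random dataset, used only so the example runs on its own.
dataset = TensorDataset(torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,)))
good_loader = DataLoader(dataset, batch_size=4, shuffle=True)
bad_loader = DataLoader(dataset, batch_size=4, shuffle=False)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

print('good_loader shuffles:', is_shuffling(good_loader))
print('bad_loader shuffles: ', is_shuffling(bad_loader))

model.eval()
assert_eval_mode(model)      # passes silently once eval() has been called
```

Calling `assert_eval_mode(model)` as the first line of an evaluation function turns a silent bug into an immediate, readable failure.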