# Part 4.1: Convolutional Neural Networks (CNNs) -- The Formula 1 Edition

Images have **spatial structure** -- pixels near each other are related. Fully connected networks ignore this structure entirely, treating each pixel as an independent input. Convolutional Neural Networks exploit spatial patterns by learning local filters that slide across the image, dramatically reducing parameters while capturing the features that matter: edges, textures, shapes, and objects.

CNNs are the foundation of modern computer vision and have transformed everything from medical imaging to self-driving cars.

**F1 analogy:** Think of a CNN the way an F1 engineer scans telemetry data. You don't stare at every single data point from a 78-lap race simultaneously -- you slide a window across the trace, looking for local patterns: a braking spike here, a traction event there, a temperature anomaly in sector 2. Convolution filters are exactly this: small pattern-detectors that sweep across data, flagging where interesting things happen. Pooling is how you zoom out from thousands of data points per lap to a handful of key sector metrics. And the feature hierarchy CNNs learn -- edges, textures, parts, objects -- mirrors how telemetry analysis progresses from raw sensor spikes to driving-style classification.

---

## Learning Objectives

By the end of this notebook, you should be able to:

- [ ] Explain why fully connected networks are impractical for images and how CNNs solve this
- [ ] Perform 2D convolution by hand and understand padding, stride, and output size
- [ ] Describe how multiple filters create feature maps and what filters learn at different depths
- [ ] Explain the purpose of pooling layers and compare max vs average pooling
- [ ] Build a CNN in PyTorch using nn.Conv2d, nn.MaxPool2d, and linear layers
- [ ] Describe the key innovations in LeNet-5, VGG, and ResNet
- [ ] Implement skip connections and explain why they help training
- [ ] Train a CNN on a real dataset with data augmentation and evaluate with a confusion matrix

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

---

## 1. Why CNNs?

### Intuitive Explanation

Imagine you have a 256x256 color image. As a flat vector, that is 256 x 256 x 3 = **196,608 input values**. If the first hidden layer has just 1,000 neurons, you need 196,608 x 1,000 = **~197 million** parameters in the first layer alone. This is absurd for several reasons:

1. **Parameter explosion**: More parameters means more memory, more compute, and much more data needed to train
2. **No spatial awareness**: A fully connected layer treats pixel (0,0) and pixel (255,255) identically -- it does not know they are far apart
3. **No translation invariance**: If the network learns to detect a cat in the top-left corner, it cannot recognize the same cat in the bottom-right without learning entirely separate weights

CNNs solve all three problems with three key ideas:

| Key Idea | What It Means | Benefit | F1 Parallel |
|----------|--------------|---------|-------------|
| **Local connectivity** | Each neuron connects to only a small patch of the input | Massively fewer parameters | An engineer looks at a short window of telemetry, not the entire race at once |
| **Weight sharing** | The same filter is applied everywhere across the image | Learns one detector, uses it everywhere | The same "braking-zone detector" works at every corner on every circuit |
| **Translation invariance** | A feature produces the same filter response wherever it appears; the response simply moves with it | Recognizes objects regardless of position | A lockup signature looks the same whether it happens at Turn 1 or Turn 15 |

**F1 analogy:** Imagine analyzing telemetry from a 5.4 km track with sensors logging at 300 Hz -- over a lap of roughly 90 seconds, that is about 27,000 samples per channel, and a car streams many channels at once. A fully connected approach would try to learn separate weights for every single sample point. A CNN-style approach slides a small filter (say, a 50-sample window) across the trace, detecting patterns like braking events, throttle lifts, and traction spikes regardless of where on the circuit they occur.
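
To make weight sharing concrete, here is a minimal sketch on a toy 1D speed trace. The trace values, the 3-sample filter, and the helper names `make_trace` and `respond` are all made up for illustration. The same three weights flag the braking point wherever it occurs, and moving the braking point moves the response by the same number of samples.

```python
import numpy as np

# Hypothetical 3-sample "braking" filter: responds to a drop in speed.
brake_filter = np.array([1.0, 0.0, -1.0])

def make_trace(brake_at, length=20):
    """Toy speed trace: constant 300, dropping to 100 from index `brake_at` on."""
    trace = np.full(length, 300.0)
    trace[brake_at:] = 100.0
    return trace

def respond(trace):
    """Slide the shared filter over the trace (no padding, stride 1)."""
    return np.correlate(trace, brake_filter, mode="valid")

early = respond(make_trace(brake_at=5))
late = respond(make_trace(brake_at=12))

# The peak response moves with the braking point; the weights never change.
print(int(np.argmax(early)), int(np.argmax(late)))  # 3 10

# Shifting the input by 7 samples shifts the response by 7 samples.
print(np.array_equal(early[:-7], late[7:]))  # True
```

A fully connected layer would need a separate set of weights to recognize the drop at index 5 and at index 12; the shared filter handles both with three numbers.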

### Visualization: FC vs CNN Parameter Comparison

In [None]:
# Visualize the parameter explosion problem
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Parameter count comparison
image_sizes = [28, 64, 128, 256, 512]
hidden = 1000
fc_params = [s * s * 3 * hidden for s in image_sizes]
# CNN: 3 input channels, 32 filters, 3x3 kernel
cnn_params = [3 * 32 * 3 * 3 + 32 for _ in image_sizes]  # Same regardless of image size!

ax = axes[0]
x_pos = np.arange(len(image_sizes))
width = 0.35
bars1 = ax.bar(x_pos - width/2, [p / 1e6 for p in fc_params], width, 
               label='Fully Connected', color='red', alpha=0.7)
bars2 = ax.bar(x_pos + width/2, [p / 1e6 for p in cnn_params], width, 
               label='Conv Layer (3x3, 32 filters)', color='blue', alpha=0.7)
ax.set_xlabel('Input Image Size')
ax.set_ylabel('Parameters (millions)')
ax.set_title('Parameters in First Layer: FC vs CNN')
ax.set_xticks(x_pos)
ax.set_xticklabels([f'{s}x{s}x3' for s in image_sizes], rotation=15)
ax.legend()
ax.set_yscale('log')
ax.grid(True, alpha=0.3)

# Add value labels on FC bars
for bar, val in zip(bars1, fc_params):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() * 1.1,
            f'{val/1e6:.1f}M', ha='center', va='bottom', fontsize=8)

# Right: Connectivity diagram
ax = axes[1]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)

# Draw "image" grid (input)
for i in range(5):
    for j in range(5):
        color = 'lightblue' if not (1 <= i <= 3 and 1 <= j <= 3) else 'orange'
        alpha = 0.3 if color == 'lightblue' else 0.7
        rect = plt.Rectangle((0.5 + j * 0.7, 6.5 - i * 0.7), 0.6, 0.6, 
                              facecolor=color, edgecolor='black', alpha=alpha, linewidth=1)
        ax.add_patch(rect)

# Label
ax.text(2.3, 4.3, 'Input (5x5)', ha='center', fontsize=10, fontweight='bold')
ax.text(2.3, 3.8, 'Orange = local\nreceptive field', ha='center', fontsize=8, color='darkorange')

# Draw filter
for i in range(3):
    for j in range(3):
        rect = plt.Rectangle((6 + j * 0.7, 7.5 - i * 0.7), 0.6, 0.6,
                              facecolor='green', edgecolor='black', alpha=0.6, linewidth=1)
        ax.add_patch(rect)

ax.text(7.1, 5.7, 'Filter (3x3)', ha='center', fontsize=10, fontweight='bold')
ax.text(7.1, 5.2, '9 shared weights', ha='center', fontsize=8, color='green')

# Arrow from receptive field to filter
ax.annotate('', xy=(6, 7.8), xytext=(4.2, 7.8),
            arrowprops=dict(arrowstyle='->', color='black', lw=2))
ax.text(5.1, 8.2, 'Same filter\nslides across\nentire image', ha='center', fontsize=8)

# Draw output neuron
circle = plt.Circle((8.5, 3.5), 0.4, facecolor='red', edgecolor='black', alpha=0.7)
ax.add_patch(circle)
ax.text(8.5, 2.7, 'Output\nneuron', ha='center', fontsize=9)

# Arrow from filter to output
ax.annotate('', xy=(8.5, 3.9), xytext=(7.5, 5.7),
            arrowprops=dict(arrowstyle='->', color='black', lw=2))

ax.set_title('CNN: Local Connectivity + Weight Sharing')
ax.axis('off')

plt.tight_layout()
plt.show()

# Print the dramatic difference
print("Parameter comparison for 256x256x3 image:")
print(f"  Fully Connected (1000 neurons): {256*256*3*1000:>15,} parameters")
print(f"  Conv Layer (32 3x3 filters):    {3*32*3*3+32:>15,} parameters")
print(f"  Reduction factor:               {256*256*3*1000 / (3*32*3*3+32):>15,.0f}x fewer!")

### Deep Dive: Why Spatial Structure Matters

Consider how you actually look at an image. You do not process every pixel independently -- you see **local patterns**: edges, corners, textures. A vertical edge is defined by neighboring pixels being very different (dark on one side, light on the other). This is fundamentally a **local** property.

**F1 analogy:** Telemetry has the same kind of local structure. A braking zone is defined by a sharp drop in speed and a spike in brake pressure within a short window. A traction event is a brief burst of wheelspin followed by recovery. These are local patterns -- you detect them by looking at neighboring data points, not by comparing the start of the lap to the end.

#### Key Insight

CNNs mirror how biological vision works: simple cells in the visual cortex respond to edges at specific locations and orientations, then more complex cells combine these into higher-level features. CNNs learn a similar hierarchy automatically.

#### Common Misconceptions

| Misconception | Reality |
|---------------|--------|
| CNNs only work on images | They work on any data with spatial/temporal structure (audio, time series, graphs) -- including 1D telemetry traces |
| More parameters = better | With weight sharing, CNNs use far fewer parameters and often work *better*, because sharing acts as a regularizer |
| CNNs replace fully connected layers | CNNs typically end with FC layers for final classification |

---

## 2. Convolution Operation

### Intuitive Explanation

**Convolution** is a mathematical operation that combines two functions to produce a third. In CNNs, think of it as a **sliding window** operation:

1. Take a small filter (e.g., 3x3 grid of weights)
2. Place it on the top-left corner of the image
3. Multiply each filter weight by the corresponding pixel value
4. Sum all the products to get one output value
5. Slide the filter one position to the right and repeat
6. When you reach the right edge, move down one row and start again from the left

The result is a **feature map** -- a new, smaller image where each value tells you "how much does this local region match the filter pattern?"

**F1 analogy:** Picture scanning a speed trace from a race lap. You take a small template -- say, the signature shape of a heavy braking event (speed dropping sharply, then leveling) -- and slide it along the entire trace. At every position, you compute a similarity score. Where the trace matches the template, you get a big number. The output is a "braking event map" showing where on the circuit the driver braked hard. Different templates detect different events: throttle lifts, DRS activations, traction loss. Each template is a convolution filter.

### 2.1 1D Convolution: The Simplest Case

Before tackling 2D images, let us see convolution in 1D -- it is just a sliding dot product.

In [None]:
# 1D Convolution: sliding window visualization
signal = np.array([1, 3, 2, 5, 4, 1, 3, 2])
kernel = np.array([1, 0, -1])  # Edge detector

# Manual convolution
output_size = len(signal) - len(kernel) + 1
output = np.zeros(output_size)
for i in range(output_size):
    output[i] = np.sum(signal[i:i+len(kernel)] * kernel)

fig, axes = plt.subplots(3, 1, figsize=(12, 8))

# Signal
ax = axes[0]
ax.bar(range(len(signal)), signal, color='blue', alpha=0.7, edgecolor='black')
for i, v in enumerate(signal):
    ax.text(i, v + 0.15, str(v), ha='center', fontsize=12, fontweight='bold')
ax.set_title('Input Signal', fontsize=13)
ax.set_ylabel('Value')
ax.set_xticks(range(len(signal)))
ax.grid(True, alpha=0.3)

# Kernel
ax = axes[1]
colors_k = ['green' if v >= 0 else 'red' for v in kernel]
ax.bar(range(len(kernel)), kernel, color=colors_k, alpha=0.7, edgecolor='black')
for i, v in enumerate(kernel):
    ax.text(i, v + 0.1 * np.sign(v), str(v), ha='center', fontsize=12, fontweight='bold')
ax.set_title('Kernel [1, 0, -1] (detects changes/edges)', fontsize=13)
ax.set_ylabel('Value')
ax.set_xticks(range(len(kernel)))
ax.axhline(y=0, color='black', linewidth=0.5)
ax.grid(True, alpha=0.3)

# Output with step-by-step
ax = axes[2]
colors_out = ['green' if v >= 0 else 'red' for v in output]
ax.bar(range(len(output)), output, color=colors_out, alpha=0.7, edgecolor='black')
for i, v in enumerate(output):
    # Show computation
    s = signal[i:i+len(kernel)]
    comp = f'{s[0]}*1 + {s[1]}*0 + {s[2]}*(-1) = {int(v)}'
    ax.text(i, v + 0.3 * np.sign(v) if v != 0 else 0.3, comp, 
            ha='center', fontsize=8, rotation=0)
ax.set_title('Output (Convolution Result)', fontsize=13)
ax.set_ylabel('Value')
ax.set_xlabel('Position')
ax.set_xticks(range(len(output)))
ax.axhline(y=0, color='black', linewidth=0.5)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Positive output = signal is decreasing (left > right)")
print("Negative output = signal is increasing (left < right)")
print("Zero output = no change")
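
One detail worth checking: the loop above slides the kernel over the signal exactly as written, without flipping it. Strictly speaking that is cross-correlation, and it is the operation deep learning libraries such as PyTorch compute under the name "convolution". Textbook convolution flips the kernel first. The sketch below compares the two using NumPy's `np.correlate` and `np.convolve` on the same signal and kernel; for this antisymmetric kernel the only difference is the sign.

```python
import numpy as np

signal = np.array([1, 3, 2, 5, 4, 1, 3, 2])
kernel = np.array([1, 0, -1])

# Sliding dot product with the kernel as written (cross-correlation).
sliding = np.correlate(signal, kernel, mode="valid")

# Textbook convolution flips the kernel before sliding it.
flipped = np.convolve(signal, kernel, mode="valid")

print(sliding)  # signal[i] - signal[i+2]: [-1 -2 -2  4  1 -1]
print(flipped)  # signal[i+2] - signal[i]: [ 1  2  2 -4 -1  1]
```

Because a CNN learns its kernel weights, the flip makes no practical difference during training: the network would simply learn the mirrored filter.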

### 2.2 2D Convolution Step-by-Step

Now the same idea in 2D. The filter slides across both rows and columns of the image.

$$\text{Output}(i, j) = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} \text{Input}(i+m, j+n) \cdot \text{Kernel}(m, n)$$

#### Breaking down the formula:

| Component | Meaning | Typical Values | F1 Analogy |
|-----------|---------|----------------|------------|
| $K$ | Kernel size (height and width) | 3, 5, 7 | Size of the telemetry window you scan |
| $\text{Input}(i+m, j+n)$ | Pixel value at position (i+m, j+n) | 0 to 255 (images) | Sensor reading at a given sample |
| $\text{Kernel}(m, n)$ | Filter weight at position (m, n) | Learned during training | Template shape you are looking for |
| $\text{Output}(i, j)$ | Feature map value at position (i, j) | Any real number | Match strength at that position |

**What this means:** At each output position, we compute a weighted sum of the input pixels in the local neighborhood, using the kernel weights. If the local pattern matches the kernel, the output is large.

In [None]:
def conv2d_manual(image, kernel):
    """
    Perform 2D convolution manually (valid mode, no padding).
    
    Args:
        image: 2D numpy array (H x W)
        kernel: 2D numpy array (K x K)
    
    Returns:
        Output feature map
    """
    H, W = image.shape
    K = kernel.shape[0]
    out_h = H - K + 1
    out_w = W - K + 1
    output = np.zeros((out_h, out_w))
    
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i+K, j:j+K]
            output[i, j] = np.sum(patch * kernel)
    
    return output

# Step-by-step 2D convolution visualization
image = np.array([
    [1, 2, 0, 1, 3],
    [0, 1, 3, 2, 1],
    [1, 3, 1, 0, 2],
    [2, 1, 0, 3, 1],
    [0, 2, 1, 1, 0]
], dtype=float)

kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1]
], dtype=float)

output = conv2d_manual(image, kernel)

# Visualize the sliding window at different positions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

positions = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

for ax, (pi, pj) in zip(axes.flat, positions):
    # Show the full image
    ax.imshow(np.ones_like(image) * 0.9, cmap='gray', vmin=0, vmax=1, 
              extent=[-0.5, 4.5, 4.5, -0.5])
    
    # Highlight the receptive field
    for mi in range(3):
        for mj in range(3):
            r, c = pi + mi, pj + mj
            # Color based on kernel weight
            if kernel[mi, mj] > 0:
                color = 'lightgreen'
            elif kernel[mi, mj] < 0:
                color = 'lightsalmon'
            else:
                color = 'lightyellow'
            rect = plt.Rectangle((c - 0.5, r - 0.5), 1, 1, 
                                  facecolor=color, edgecolor='black', linewidth=2, alpha=0.7)
            ax.add_patch(rect)
    
    # Show all pixel values
    for i in range(5):
        for j in range(5):
            ax.text(j, i, f'{int(image[i, j])}', ha='center', va='center', fontsize=12)
    
    # Compute and show result
    patch = image[pi:pi+3, pj:pj+3]
    result = np.sum(patch * kernel)
    
    # Build computation string from the non-zero kernel weights
    terms = []
    for mi in range(3):
        for mj in range(3):
            if kernel[mi, mj] != 0:
                terms.append(f'{int(image[pi+mi, pj+mj])}*({int(kernel[mi, mj])})')
    
    ax.set_title(f'Position ({pi},{pj}): output = {int(result)}', fontsize=11, fontweight='bold')
    ax.set_xlabel(' + '.join(terms), fontsize=8)  # show the sum behind the output value
    ax.set_xlim(-0.5, 4.5)
    ax.set_ylim(4.5, -0.5)
    ax.set_xticks(range(5))
    ax.set_yticks(range(5))
    ax.grid(True, alpha=0.3)

plt.suptitle('2D Convolution: Sliding a Vertical Edge Detector [1,0,-1; 1,0,-1; 1,0,-1]', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Full output feature map:")
print(output.astype(int))
print("\nGreen = positive kernel weight, Red = negative, Yellow = zero")
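
The nested loops in `conv2d_manual` are easy to follow but slow on larger images. As a hedged alternative, the sketch below computes the same feature map in vectorized form. It assumes NumPy 1.20 or newer for `sliding_window_view`, and repeats the 5x5 image and vertical-edge kernel from above so that it runs on its own.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.array([
    [1, 2, 0, 1, 3],
    [0, 1, 3, 2, 1],
    [1, 3, 1, 0, 2],
    [2, 1, 0, 3, 1],
    [0, 2, 1, 1, 0],
], dtype=float)
kernel = np.array([[1, 0, -1]] * 3, dtype=float)

# All 3x3 patches at once: shape (out_h, out_w, K, K) = (3, 3, 3, 3).
patches = sliding_window_view(image, kernel.shape)

# Multiply every patch by the kernel and sum over the two kernel axes.
vectorized = np.einsum("ijmn,mn->ij", patches, kernel)

print(vectorized.astype(int))
# [[-2  3 -2]
#  [-1  0  0]
#  [ 1  2 -1]]
```

The top-left value can be checked by hand: the left column of the first patch sums to 1 + 0 + 1 = 2, the right column to 0 + 3 + 1 = 4, so the output is 2 - 4 = -2.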

### 2.3 Kernels as Feature Detectors

Different kernels detect different features. Here are classic examples applied to a synthetic test image.

**F1 analogy:** In telemetry analysis, you might have different "kernels" for different events: a `[1, 0, -1]` filter detects sudden changes (like a braking point), a `[1, 1, 1]` averaging filter smooths out noise to reveal underlying trends, and a `[-1, 2, -1]` filter highlights spikes (like a wheel lockup or a kerb strike). Each kernel is a specialized detector for a different type of on-track event.
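
The three telemetry kernels can be tried on a toy trace. This is a small sketch with made-up values: a flat signal, a one-sample spike, then a step down. The smoothing kernel is scaled by 1/3 here so that it averages rather than sums, and the dictionary names are illustrative only.

```python
import numpy as np

# Toy trace (made-up values): flat, one-sample spike at index 3, step down at index 7.
trace = np.array([5, 5, 5, 9, 5, 5, 5, 2, 2, 2], dtype=float)

detectors = {
    "change": np.array([1.0, 0.0, -1.0]),   # difference across the window
    "smooth": np.ones(3) / 3.0,             # local average
    "spike": np.array([-1.0, 2.0, -1.0]),   # center versus its two neighbors
}

responses = {}
for name, k in detectors.items():
    responses[name] = np.correlate(trace, k, mode="valid")
    print(f"{name:>6}: {np.round(responses[name], 2)}")

# The spike detector peaks at output index 2, the window centered on index 3.
print(int(np.argmax(responses["spike"])))  # 2
```

The spike detector gives its largest response on the window centered on the one-sample spike, while the change detector also responds to the step down near the end of the trace.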

In [None]:
# Create a simple test image with clear features
def make_test_image(size=64):
    """Create a test image with edges, corners, and gradients."""
    img = np.zeros((size, size))
    
    # Bright rectangle
    img[10:30, 10:30] = 1.0
    
    # Diagonal line (3 px wide); np.maximum keeps brighter pixels such as the rectangle
    for i in range(size):
        lo, hi = max(0, i - 1), min(size, i + 2)
        img[lo:hi, lo:hi] = np.maximum(img[lo:hi, lo:hi], 0.7)
    
    # Circle
    cy, cx = 45, 45
    for i in range(size):
        for j in range(size):
            if abs((i - cy)**2 + (j - cx)**2 - 100) < 40:
                img[i, j] = 1.0
    
    # Gradient region
    img[35:55, 5:25] = np.tile(np.linspace(0, 1, 20), (20, 1))
    
    return img

# Define classic kernels
kernels = {
    'Vertical Edge': np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),  # Sobel X
    'Horizontal Edge': np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),  # Sobel Y
    'Sharpen': np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    'Blur (Box)': np.ones((3, 3), dtype=float) / 9.0,
    'Emboss': np.array([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]], dtype=float),
    'Laplacian': np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float),
}

test_img = make_test_image()

# Show original + all kernel results
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# Original
axes[0, 0].imshow(test_img, cmap='gray')
axes[0, 0].set_title('Original Image', fontsize=11, fontweight='bold')
axes[0, 0].axis('off')

# Show kernel values in second position
ax = axes[0, 1]
ax.axis('off')
ax.set_title('Kernel Values', fontsize=11, fontweight='bold')
kernel_text = "Classic 3x3 kernels:\n\n"
for name, k in list(kernels.items())[:3]:
    kernel_text += f"{name}:\n{k}\n\n"
ax.text(0.05, 0.95, kernel_text, transform=ax.transAxes, fontsize=7,
        verticalalignment='top', fontfamily='monospace')

# Apply each kernel
for idx, (name, kernel) in enumerate(kernels.items()):
    row = (idx + 2) // 4
    col = (idx + 2) % 4
    result = conv2d_manual(test_img, kernel)
    limit = np.abs(result).max()  # symmetric color limits so zero maps to white
    axes[row, col].imshow(result, cmap='RdBu', vmin=-limit, vmax=limit)
    axes[row, col].set_title(name, fontsize=11, fontweight='bold')
    axes[row, col].axis('off')

plt.suptitle('Different Kernels Detect Different Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Interactive: Apply Different Kernels to a Simple Image

Let us apply a set of hand-built kernels to a striped test image and see what each one detects.

In [None]:
# Apply several hand-built kernels to one test image and compare the responses
# Create a striped test image (horizontal stripes, vertical stripes, a diagonal, a filled block)
img_size = 48
interactive_img = np.zeros((img_size, img_size))

# Add horizontal stripes
for i in range(0, img_size, 8):
    interactive_img[i:i+4, :24] = 1.0

# Add vertical stripes
for j in range(24, img_size, 8):
    interactive_img[:24, j:j+4] = 1.0

# Add diagonal (2 px wide) in the bottom-right quadrant
for i in range(24, img_size):
    interactive_img[i, i] = 1.0
    if i + 1 < img_size:
        interactive_img[i, i + 1] = 1.0

# Add a filled region
interactive_img[30:45, 28:43] = 0.8

# Test multiple kernel orientations
angles_kernels = {
    'Vertical\n[1,0,-1]': np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]]),
    'Horizontal\n[-1,-1,-1; 0,0,0; 1,1,1]': np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]]),
    'Diagonal /\n[0,0,1; 0,0,0; -1,0,0]': np.array([[0, 0, 1], [0, 0, 0], [-1, 0, 0]]),
    'Diagonal \\\n[1,0,0; 0,0,0; 0,0,-1]': np.array([[1, 0, 0], [0, 0, 0], [0, 0, -1]]),
    'Corner\n[1,-1; -1,1] (padded)': np.array([[1, -1, 0], [-1, 1, 0], [0, 0, 0]]),
    'Gaussian Blur\n(3x3)': np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0,
}

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

axes[0, 0].imshow(interactive_img, cmap='gray')
axes[0, 0].set_title('Original', fontsize=10, fontweight='bold')
axes[0, 0].axis('off')

# Empty slot
axes[0, 1].axis('off')
axes[0, 1].text(0.5, 0.5, 'Stripes respond\nto edge detectors\nin matching\norientation', 
                ha='center', va='center', fontsize=10, transform=axes[0, 1].transAxes)

for idx, (name, kern) in enumerate(angles_kernels.items()):
    row = (idx + 2) // 4
    col = (idx + 2) % 4
    result = conv2d_manual(interactive_img, kern)
    axes[row, col].imshow(np.abs(result), cmap='hot')
    axes[row, col].set_title(name, fontsize=8, fontweight='bold')
    axes[row, col].axis('off')

plt.suptitle('Different Kernel Orientations Detect Different Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### 2.4 Padding

Without padding, the output is smaller than the input (we lose border pixels). **Padding** adds extra pixels around the border to control the output size.

| Padding Type | Description | Output Size |
|-------------|-------------|-------------|
| **Valid (no padding)** | No padding at all | $W - K + 1$ (smaller than input) |
| **Same (zero padding)** | Pad with zeros so output = input size | $W$ (same as input; for stride 1 and odd $K$, use $P = (K - 1)/2$) |

**Why same padding matters:** Without it, every layer shrinks the feature map. After many layers, you would have nothing left!

**F1 analogy:** Imagine your telemetry trace starts at the pit exit and ends at the pit entry. Without padding, your braking-zone filter cannot fully analyze the very first and last corners because there is not enough data on either side. Padding is like extending the trace with zeros so you can properly analyze the edges of the data -- just as an engineer might pad the start/end of a session log to avoid edge artifacts.
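
To see how quickly unpadded layers eat into a feature map, here is a short sketch that tracks the width through a stack of stride-1 convolutions. The helper name `sizes_through_layers` and the 32-pixel starting width are illustrative choices. Each layer changes the width by $2P - (K - 1)$.

```python
def sizes_through_layers(width, kernel, padding, n_layers):
    """Feature-map width after each stride-1 convolution layer."""
    sizes = [width]
    for _ in range(n_layers):
        width = width - kernel + 2 * padding + 1
        sizes.append(width)
    return sizes

valid = sizes_through_layers(32, kernel=3, padding=0, n_layers=10)
same = sizes_through_layers(32, kernel=3, padding=1, n_layers=10)

print("valid:", valid)  # 32, 30, 28, ..., 12: two pixels lost per layer
print("same: ", same)   # stays at 32 in every layer
```

With 3x3 kernels and no padding, a 32-pixel-wide input would be used up entirely after 16 layers, which is why deep architectures usually pad.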

In [None]:
# Visualize padding
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

small_img = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25]
], dtype=float)

kern_3x3 = np.ones((3, 3)) / 9.0  # Average filter

# No padding (valid)
ax = axes[0]
out_valid = conv2d_manual(small_img, kern_3x3)
ax.imshow(np.ones((5, 5)) * 0.9, cmap='gray', vmin=0, vmax=1, extent=[-0.5, 4.5, 4.5, -0.5])
for i in range(5):
    for j in range(5):
        ax.text(j, i, f'{int(small_img[i,j])}', ha='center', va='center', fontsize=10)
# Highlight valid output region
rect = plt.Rectangle((0.5, 0.5), 3, 3, linewidth=3, edgecolor='red', facecolor='red', alpha=0.15)
ax.add_patch(rect)
ax.set_title(f'Valid (no padding)\nInput: 5x5, Output: {out_valid.shape[0]}x{out_valid.shape[1]}', fontsize=11)
ax.set_xticks(range(5))
ax.set_yticks(range(5))

# Same padding (pad=1)
ax = axes[1]
padded = np.pad(small_img, 1, mode='constant', constant_values=0)
ax.imshow(np.ones((7, 7)) * 0.9, cmap='gray', vmin=0, vmax=1, extent=[-0.5, 6.5, 6.5, -0.5])
for i in range(7):
    for j in range(7):
        val = int(padded[i, j])
        color = 'blue' if val > 0 else 'gray'
        ax.text(j, i, f'{val}', ha='center', va='center', fontsize=9, color=color)
# Highlight padding
for i in range(7):
    for j in range(7):
        if i == 0 or i == 6 or j == 0 or j == 6:
            rect = plt.Rectangle((j-0.5, i-0.5), 1, 1, facecolor='lightyellow', 
                                  edgecolor='orange', linewidth=1, alpha=0.5)
            ax.add_patch(rect)

out_same = conv2d_manual(padded, kern_3x3)
ax.set_title(f'Same padding (P=1)\nInput: 5x5 + pad, Output: {out_same.shape[0]}x{out_same.shape[1]}', fontsize=11)
ax.set_xticks(range(7))
ax.set_yticks(range(7))

# Show both outputs
ax = axes[2]
ax.text(0.5, 0.85, 'Output Comparison', ha='center', fontsize=12, fontweight='bold',
        transform=ax.transAxes)
ax.text(0.5, 0.7, f'Valid: {out_valid.shape[0]}x{out_valid.shape[1]} (shrinks!)', 
        ha='center', fontsize=11, color='red', transform=ax.transAxes)
ax.text(0.5, 0.55, f'Same:  {out_same.shape[0]}x{out_same.shape[1]} (preserved!)', 
        ha='center', fontsize=11, color='green', transform=ax.transAxes)
ax.text(0.5, 0.35, 'Formula for "same" padding:\nP = (K - 1) / 2\n\nFor 3x3 kernel: P = 1\nFor 5x5 kernel: P = 2', 
        ha='center', fontsize=10, transform=ax.transAxes)
ax.axis('off')

plt.suptitle('Padding: Controlling Output Size', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### 2.5 Stride

**Stride** controls how far the filter moves at each step. Stride=1 means move one pixel at a time. Stride=2 means skip every other position, producing an output roughly half the size along each dimension.

**F1 analogy:** Stride=1 is like analyzing telemetry at full resolution (every sample). Stride=2 is like downsampling to every other sample -- you lose some granularity but process roughly twice as fast. For a strategy overview you might use a large stride; for detailed corner analysis you want stride=1.

### Output Size Formula

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

#### Breaking down the formula:

| Component | Meaning | Example |
|-----------|---------|---------|
| $W$ | Input width (or height) | 32 |
| $K$ | Kernel size | 3 |
| $P$ | Padding | 1 |
| $S$ | Stride | 1 |
| $O$ | Output size | $\lfloor (32 - 3 + 2 \cdot 1) / 1 \rfloor + 1 = 32$ |

**What this means:** This formula tells you exactly how big your output feature map will be. It is essential for designing CNN architectures -- you need each layer's output to match the next layer's expected input.
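
As a sanity check on the formula, the sketch below compares it with a direct count of the positions where the kernel fits inside the padded input. The function names are illustrative. The floor matters whenever the stride does not divide $W - K + 2P$ evenly, as in the third configuration.

```python
def size_by_formula(W, K, P, S):
    """Closed form: floor((W - K + 2P) / S) + 1."""
    return (W - K + 2 * P) // S + 1

def size_by_counting(W, K, P, S):
    """Count the left-edge positions where the kernel fits in the padded input."""
    padded_width = W + 2 * P
    return len(range(0, padded_width - K + 1, S))

configs = [(32, 3, 1, 1), (7, 3, 0, 2), (28, 3, 1, 2), (224, 7, 3, 2)]
for W, K, P, S in configs:
    assert size_by_formula(W, K, P, S) == size_by_counting(W, K, P, S)
    print(f"W={W}, K={K}, P={P}, S={S} -> {size_by_formula(W, K, P, S)}")
# Outputs: 32, 3, 14, 112
```

For `W=28, K=3, P=1, S=2` the numerator is 27, which is not divisible by 2; the floor gives 13, so the output is 14 and the last padded column is never covered by the kernel.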

In [None]:
# Visualize stride effect
def output_size(W, K, P, S):
    """Compute convolution output size."""
    return (W - K + 2 * P) // S + 1

# Show stride 1 vs stride 2
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Stride 1
ax = axes[0]
grid = np.arange(1, 50).reshape(7, 7)
ax.imshow(np.ones((7, 7)) * 0.9, cmap='gray', vmin=0, vmax=1, extent=[-0.5, 6.5, 6.5, -0.5])
for i in range(7):
    for j in range(7):
        ax.text(j, i, f'{grid[i,j]}', ha='center', va='center', fontsize=9)

# Show stride=1 positions (all positions in first row)
colors_s1 = ['red', 'blue', 'green', 'orange', 'purple']
for pos, color in zip(range(5), colors_s1):
    rect = plt.Rectangle((pos-0.45, -0.45), 2.9, 2.9, linewidth=2, 
                          edgecolor=color, facecolor='none', linestyle='--', alpha=0.6)
    ax.add_patch(rect)
out_s1 = output_size(7, 3, 0, 1)
ax.set_title(f'Stride = 1\n7x7 input, 3x3 kernel -> {out_s1}x{out_s1} output', fontsize=11)
ax.set_xticks(range(7))
ax.set_yticks(range(7))

# Stride 2
ax = axes[1]
ax.imshow(np.ones((7, 7)) * 0.9, cmap='gray', vmin=0, vmax=1, extent=[-0.5, 6.5, 6.5, -0.5])
for i in range(7):
    for j in range(7):
        ax.text(j, i, f'{grid[i,j]}', ha='center', va='center', fontsize=9)

# Show stride=2 positions (first row)
for pos_idx, pos in enumerate([0, 2, 4]):
    color = colors_s1[pos_idx]
    rect = plt.Rectangle((pos-0.45, -0.45), 2.9, 2.9, linewidth=2, 
                          edgecolor=color, facecolor=color, alpha=0.15)
    ax.add_patch(rect)
out_s2 = output_size(7, 3, 0, 2)
ax.set_title(f'Stride = 2\n7x7 input, 3x3 kernel -> {out_s2}x{out_s2} output', fontsize=11)
ax.set_xticks(range(7))
ax.set_yticks(range(7))

# Output size calculator
ax = axes[2]
ax.axis('off')

# Show table of output sizes for common configurations
configs = [
    (28, 3, 0, 1), (28, 3, 1, 1), (28, 3, 1, 2), (28, 5, 2, 1),
    (32, 3, 1, 1), (32, 3, 1, 2), (224, 7, 3, 2), (224, 3, 1, 1),
]

table_text = "Output Size Calculator\n" + "=" * 40 + "\n"
table_text += f"{'W':>5} {'K':>3} {'P':>3} {'S':>3} {'Output':>8}\n"
table_text += "-" * 40 + "\n"
for W, K, P, S in configs:
    O = output_size(W, K, P, S)
    table_text += f"{W:>5} {K:>3} {P:>3} {S:>3} {O:>8}\n"

ax.text(0.1, 0.95, table_text, transform=ax.transAxes, fontsize=10,
        verticalalignment='top', fontfamily='monospace',
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.suptitle('Stride: How Far the Filter Moves Each Step', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---

## 3. Multiple Channels and Filters

### Intuitive Explanation

So far we have convolved a single grayscale image with a single filter. Real images have multiple **channels** (R, G, B), and we want to detect many different features. This leads to two extensions:

1. **Multi-channel input**: A 3x3 filter on an RGB image is actually 3x3x3 = 27 weights (one 3x3 slice per channel). The filter produces a weighted sum across ALL channels at each position.

2. **Multiple filters**: Each filter produces one feature map. If we use 32 filters, we get 32 feature maps -- each detecting a different pattern.

**The full picture:** A convolutional layer with $C_{in}$ input channels and $C_{out}$ filters has weights of shape $(C_{out}, C_{in}, K, K)$, plus one bias per filter, for $C_{out} \cdot (C_{in} \cdot K \cdot K + 1)$ parameters in total.

**F1 analogy:** An F1 car streams multiple telemetry channels simultaneously -- speed, throttle, brake pressure, steering angle, tire temperatures, fuel flow. A single "braking event" filter needs to look across all these channels at once (speed dropping AND brake pressure spiking AND throttle at zero). That is multi-channel convolution. And you want many filters: one for braking events, one for traction loss, one for DRS deployment, one for fuel-saving coasting. Each filter produces its own "event map" of the lap.
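
The shapes involved are easier to follow in code. Below is a minimal NumPy sketch of a multi-channel convolution with no padding and stride 1. The random input, the choice of 4 filters, and the zero biases are assumptions for illustration; this is a teaching loop, not how a library implements the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, K, H, W = 3, 4, 3, 5, 5

x = rng.normal(size=(C_in, H, W))               # one RGB-like input
weights = rng.normal(size=(C_out, C_in, K, K))  # one 3x3x3 filter per output map
bias = np.zeros(C_out)                          # one bias per filter

out = np.zeros((C_out, H - K + 1, W - K + 1))
for f in range(C_out):                  # each filter yields one feature map
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            patch = x[:, i:i + K, j:j + K]   # all channels at this position
            out[f, i, j] = np.sum(patch * weights[f]) + bias[f]

print(out.shape)                  # (4, 3, 3)
print(weights.size + bias.size)   # 4 * (3 * 3 * 3 + 1) = 112
```

Each output value sums over all input channels at once, so the number of output channels depends only on the number of filters, not on the number of input channels.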

In [None]:
# Visualization: How multi-channel convolution works
fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(-1, 10)  # leave room for the labels drawn just below y = 0
ax.axis('off')

# Draw RGB input (3 channels)
channel_colors = ['#FF6B6B', '#69DB7C', '#74C0FC']
channel_names = ['R', 'G', 'B']

for c, (color, name) in enumerate(zip(channel_colors, channel_names)):
    x_off = 0.5 + c * 0.3
    y_off = 6.5 - c * 0.3
    
    # Draw channel grid
    for i in range(5):
        for j in range(5):
            rect = plt.Rectangle((x_off + j * 0.4, y_off - i * 0.4), 0.38, 0.38,
                                  facecolor=color, edgecolor='black', alpha=0.6, linewidth=0.5)
            ax.add_patch(rect)
    
    ax.text(x_off + 1.0, y_off + 0.5, name, fontsize=10, fontweight='bold', color=color)

ax.text(1.5, 4.3, 'Input\n(H x W x 3)', ha='center', fontsize=11, fontweight='bold')

# Draw filter (3 channel slices)
for c, color in enumerate(channel_colors):
    x_off = 4.5 + c * 0.2
    y_off = 7.5 - c * 0.2
    
    for i in range(3):
        for j in range(3):
            rect = plt.Rectangle((x_off + j * 0.4, y_off - i * 0.4), 0.38, 0.38,
                                  facecolor=color, edgecolor='black', alpha=0.6, linewidth=0.5)
            ax.add_patch(rect)

ax.text(5.3, 6.0, 'One Filter\n(3 x 3 x 3)\n= 27 weights', ha='center', fontsize=10, fontweight='bold')

# Arrow
ax.annotate('', xy=(4.3, 6.5), xytext=(3.0, 6.5),
            arrowprops=dict(arrowstyle='->', color='black', lw=2))
ax.text(3.6, 7.0, 'convolve', ha='center', fontsize=9)

# Arrow to output
ax.annotate('', xy=(7.5, 6.5), xytext=(6.3, 6.5),
            arrowprops=dict(arrowstyle='->', color='black', lw=2))

# Single feature map output
for i in range(3):
    for j in range(3):
        rect = plt.Rectangle((7.7 + j * 0.5, 7.5 - i * 0.5), 0.48, 0.48,
                              facecolor='gold', edgecolor='black', alpha=0.7, linewidth=1)
        ax.add_patch(rect)
ax.text(8.5, 6.0, 'Feature Map\n(1 channel)', ha='center', fontsize=10, fontweight='bold')

# Now show multiple filters -> multiple feature maps
ax.text(7, 3.5, 'With N filters, we get N feature maps:', fontsize=12, fontweight='bold', ha='center')

filter_colors = ['#FF6B6B', '#69DB7C', '#74C0FC', '#FAB005']
for f in range(4):
    x_off = 2 + f * 3
    
    # Small filter icon
    for i in range(2):
        for j in range(2):
            rect = plt.Rectangle((x_off + j * 0.3, 2.5 - i * 0.3), 0.28, 0.28,
                                  facecolor=filter_colors[f], edgecolor='black', alpha=0.6, linewidth=0.5)
            ax.add_patch(rect)
    
    ax.text(x_off + 0.3, 1.7, f'Filter {f+1}', ha='center', fontsize=9, fontweight='bold')
    
    # Arrow down
    ax.annotate('', xy=(x_off + 0.3, 1.3), xytext=(x_off + 0.3, 1.6),
                arrowprops=dict(arrowstyle='->', color='black', lw=1.5))
    
    # Feature map
    for i in range(2):
        for j in range(2):
            rect = plt.Rectangle((x_off + j * 0.3, 0.5 - i * 0.3), 0.28, 0.28,
                                  facecolor=filter_colors[f], edgecolor='black', alpha=0.4, linewidth=0.5)
            ax.add_patch(rect)
    
    ax.text(x_off + 0.3, -0.2, f'Map {f+1}', ha='center', fontsize=9)

ax.text(7, -0.7, 'Output: H\' x W\' x N_filters', ha='center', fontsize=11, fontweight='bold')

ax.set_title('Multi-Channel Convolution: RGB Input with Multiple Filters', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Parameter count example
C_in, C_out, K = 3, 32, 3
params = C_out * (C_in * K * K + 1)  # +1 for bias per filter
print(f"Conv layer: {C_in} input channels, {C_out} filters, {K}x{K} kernels")
print(f"  Weight shape: ({C_out}, {C_in}, {K}, {K})")
print(f"  Parameters: {C_out} * ({C_in} * {K} * {K} + 1) = {params}")

### Deep Dive: What Filters Learn at Different Depths

One of the most fascinating discoveries in deep learning is the **feature hierarchy** that CNNs learn automatically. Early layers learn simple patterns, and deeper layers combine them into increasingly complex ones.

| Layer Depth | What Filters Detect | Example | F1 Telemetry Parallel |
|-------------|--------------------|---------|-----------------------|
| Layer 1 (shallow) | Edges, colors, simple gradients | Horizontal edge, red blob | Individual sensor spikes -- a brake pressure jump, a single temperature reading |
| Layer 2-3 | Textures, corners, simple shapes | Brick pattern, corner of a box | Short event patterns -- a braking zone (speed drop + brake spike), a traction event (wheelspin + TC intervention) |
| Layer 4-5 | Parts of objects | Eye, wheel, window | Corner profiles -- the full sequence of braking, turn-in, apex, exit for a single corner |
| Deep layers | Entire objects or scenes | Face, car, building | Driving style -- aggressive late braking vs smooth early braking across an entire sector |

#### Key Insight

This hierarchy emerges naturally from training -- nobody designs these filters by hand. The network discovers that edges are useful building blocks for textures, textures for parts, and parts for objects. Similarly, an F1 telemetry CNN would discover that individual sensor spikes compose into event patterns, events compose into corner profiles, and corner profiles compose into driving style signatures.

This is why **transfer learning** works: the early layers of a CNN trained on one task (e.g., ImageNet) learn general features (edges, textures) that are useful for almost any vision task.

In [None]:
# Simulate what filters learn at different depths
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# Layer 1: Simple edge detectors (hand-crafted examples of what networks learn)
layer1_filters = [
    np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]),    # Vertical edge
    np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]]),     # Horizontal edge
    np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]]),    # Spot detector
    np.array([[-1, 0, 1], [0, 0, 0], [1, 0, -1]]),      # Diagonal
]
layer1_names = ['Vertical\nEdge', 'Horizontal\nEdge', 'Spot\nDetector', 'Diagonal\nEdge']

for idx, (filt, name) in enumerate(zip(layer1_filters, layer1_names)):
    ax = axes[0, idx]
    im = ax.imshow(filt, cmap='RdBu', vmin=-2, vmax=2)
    ax.set_title(f'Layer 1: {name}', fontsize=10, fontweight='bold')
    for i in range(3):
        for j in range(3):
            ax.text(j, i, f'{filt[i,j]:+d}', ha='center', va='center', fontsize=12,
                    color='white' if abs(filt[i,j]) > 1 else 'black')
    ax.set_xticks([])
    ax.set_yticks([])

# Layer 2+: Show what happens when you compose filters (simulated)
# Generate synthetic "deeper" feature responses
np.random.seed(42)
test_img = make_test_image(64)

# Apply successive convolutions to show increasing complexity
layer_outputs = [test_img]
current = test_img
for filters in [layer1_filters[:2], layer1_filters[2:]]:
    responses = []
    for f in filters:
        resp = conv2d_manual(current, f)
        responses.append(np.abs(resp))
    # Combine responses (like what a deeper layer sees)
    min_shape = min(r.shape[0] for r in responses)
    combined = sum(r[:min_shape, :min_shape] for r in responses) / len(responses)
    current = combined
    layer_outputs.append(combined)

# Show progressive feature extraction
titles_bottom = ['Original\nImage', 'After Layer 1\n(edges)', 
                 'After Layer 2\n(combinations)', 'Feature\nHierarchy']
for idx in range(3):
    ax = axes[1, idx]
    ax.imshow(layer_outputs[idx], cmap='hot')
    ax.set_title(titles_bottom[idx], fontsize=10, fontweight='bold')
    ax.axis('off')

# Summary in last panel
ax = axes[1, 3]
ax.axis('off')
hierarchy = "Feature Hierarchy:\n\n"
hierarchy += "Layer 1: Edges\n      |\n"
hierarchy += "Layer 2: Textures\n      |\n"
hierarchy += "Layer 3: Parts\n      |\n"
hierarchy += "Layer 4: Objects\n      |\n"
hierarchy += "Output:  Classification"
ax.text(0.5, 0.5, hierarchy, ha='center', va='center', fontsize=11,
        fontfamily='monospace', transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.suptitle('What CNN Filters Learn at Different Depths', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---

## 4. Pooling Layers

### Intuitive Explanation

After convolution, we often want to **reduce the spatial dimensions** while keeping the most important information. Pooling does this by summarizing small regions of the feature map.

Think of it like creating a thumbnail of an image -- you lose fine detail but keep the big picture. This provides:

1. **Dimensionality reduction**: Fewer values to process in later layers
2. **Translation invariance (approximate)**: Small shifts in the input often leave the output unchanged, as long as the feature stays inside the same pooling window
3. **Larger receptive field**: Each neuron in later layers "sees" more of the original image

**F1 analogy:** Pooling is how you go from thousands of data points per sector to a handful of key metrics. Max pooling is like reporting the peak speed in each sector -- you keep the most extreme value. Average pooling is like reporting average speed through each sector. Global average pooling is like summarizing an entire lap into one number per metric (average speed, average tire temp, etc.). You lose the moment-by-moment detail but gain a compact summary that captures the essence of performance.
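The translation-invariance point is easiest to see on a tiny example. The sketch below is a simplified 1D version (a single telemetry-style trace rather than a 2D feature map), and `max_pool_1d` is a hypothetical helper written for this illustration, not a library function. It shows that a spike can move within a pooling window without changing the output, while a move across a window boundary does change it, so the invariance is only approximate.

```python
def max_pool_1d(values, window=2):
    """Non-overlapping max pooling over a flat list (stride equals window)."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]

# One strong activation (9) in an otherwise quiet trace
trace         = [0, 9, 0, 0, 0, 0]
shifted_left  = [9, 0, 0, 0, 0, 0]  # spike moved one step, same pooling window
shifted_right = [0, 0, 9, 0, 0, 0]  # spike moved one step, into the next window

print(max_pool_1d(trace))          # [9, 0, 0]
print(max_pool_1d(shifted_left))   # [9, 0, 0]  (unchanged)
print(max_pool_1d(shifted_right))  # [0, 9, 0]  (changed)
```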

In [None]:
# Visualize Max Pooling and Average Pooling
pool_input = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 7, 8],
    [4, 1, 3, 5]
], dtype=float)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Original
ax = axes[0]
ax.imshow(pool_input, cmap='YlOrRd', vmin=0, vmax=8, extent=[-0.5, 3.5, 3.5, -0.5])
for i in range(4):
    for j in range(4):
        ax.text(j, i, f'{int(pool_input[i,j])}', ha='center', va='center', fontsize=16, fontweight='bold')
# Draw 2x2 regions
for i in range(0, 4, 2):
    for j in range(0, 4, 2):
        colors_pool = ['#FF6B6B', '#69DB7C', '#74C0FC', '#FAB005']
        idx = (i // 2) * 2 + (j // 2)
        rect = plt.Rectangle((j-0.5, i-0.5), 2, 2, linewidth=3, 
                              edgecolor=colors_pool[idx], facecolor='none')
        ax.add_patch(rect)
ax.set_title('Input (4x4)', fontsize=12, fontweight='bold')
ax.set_xticks(range(4))
ax.set_yticks(range(4))

# Max pooling
ax = axes[1]
max_pool = np.array([
    [max(pool_input[0:2, 0:2].flat), max(pool_input[0:2, 2:4].flat)],
    [max(pool_input[2:4, 0:2].flat), max(pool_input[2:4, 2:4].flat)]
])
ax.imshow(max_pool, cmap='YlOrRd', vmin=0, vmax=8, extent=[-0.5, 1.5, 1.5, -0.5])
for i in range(2):
    for j in range(2):
        idx = i * 2 + j
        ax.text(j, i, f'{int(max_pool[i,j])}', ha='center', va='center', fontsize=20, fontweight='bold')
        rect = plt.Rectangle((j-0.5, i-0.5), 1, 1, linewidth=3, 
                              edgecolor=colors_pool[idx], facecolor='none')
        ax.add_patch(rect)

# Show which values were selected
ax.text(0, -0.8, 'max(1,3,5,6)=6', ha='center', fontsize=8, color=colors_pool[0])
ax.text(1, -0.8, 'max(2,4,1,2)=4', ha='center', fontsize=8, color=colors_pool[1])

ax.set_title('Max Pooling 2x2\n(keeps strongest activation)', fontsize=12, fontweight='bold')
ax.set_xticks(range(2))
ax.set_yticks(range(2))

# Average pooling
ax = axes[2]
avg_pool = np.array([
    [np.mean(pool_input[0:2, 0:2]), np.mean(pool_input[0:2, 2:4])],
    [np.mean(pool_input[2:4, 0:2]), np.mean(pool_input[2:4, 2:4])]
])
ax.imshow(avg_pool, cmap='YlOrRd', vmin=0, vmax=8, extent=[-0.5, 1.5, 1.5, -0.5])
for i in range(2):
    for j in range(2):
        idx = i * 2 + j
        ax.text(j, i, f'{avg_pool[i,j]:.1f}', ha='center', va='center', fontsize=18, fontweight='bold')
        rect = plt.Rectangle((j-0.5, i-0.5), 1, 1, linewidth=3, 
                              edgecolor=colors_pool[idx], facecolor='none')
        ax.add_patch(rect)

ax.set_title('Average Pooling 2x2\n(keeps average activation)', fontsize=12, fontweight='bold')
ax.set_xticks(range(2))
ax.set_yticks(range(2))

plt.suptitle('Pooling: Reducing Spatial Dimensions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Global Average Pooling (Modern Approach)

Modern architectures increasingly use **Global Average Pooling (GAP)** instead of flattening + fully connected layers. GAP takes the average of each entire feature map, producing one value per channel.

For example, if the final conv layer outputs 512 feature maps of size 7x7, GAP produces a vector of length 512 (averaging each 7x7 map into a single number).
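As a rough sketch of what GAP computes, the pure-Python snippet below averages each feature map down to one number, using a toy input of 3 channels of 2x2 in place of 512 channels of 7x7. It then compares classifier sizes for the 512 x 7 x 7 case. The helper name `global_average_pool` and the choice of 1000 output classes are assumptions made for illustration.

```python
def global_average_pool(feature_maps):
    """Average each 2D feature map (a list of rows) down to a single number."""
    pooled = []
    for fmap in feature_maps:
        total = sum(sum(row) for row in fmap)
        count = sum(len(row) for row in fmap)
        pooled.append(total / count)
    return pooled

# Toy input: 3 channels of 2x2
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
print(global_average_pool(maps))  # [4.0, 1.0, 2.0]

# Classifier size for 512 maps of 7x7, assuming 1000 output classes (illustrative)
channels, height, width, num_classes = 512, 7, 7, 1000
flatten_fc_params = channels * height * width * num_classes + num_classes
gap_fc_params = channels * num_classes + num_classes
print(f"Flatten + FC: {flatten_fc_params:,}")  # Flatten + FC: 25,089,000
print(f"GAP + FC:     {gap_fc_params:,}")      # GAP + FC:     513,000
```

Under these assumptions, the classifier after GAP is about 49 times smaller, because each 7x7 map contributes one input instead of 49.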

### Pooling Comparison Table

| Pooling Type | Operation | Parameters | Use Case | F1 Parallel |
|-------------|-----------|------------|----------|-------------|
| **Max Pooling** | Take maximum in window | None (no learnable params!) | Classic CNNs, preserves strong activations | Peak speed per sector, maximum g-force per corner |
| **Average Pooling** | Take mean in window | None | Smoother downsampling | Average sector speed, mean tire temperature per stint |
| **Global Average Pooling** | Average entire feature map | None | Modern replacement for FC layers | One summary stat per channel for the whole lap |
| **Strided Convolution** | Conv with stride > 1 | Learned | Modern alternative to pooling | Learned downsampling -- let the network decide what to keep |

### Why This Matters in Machine Learning

| Application | Pooling Strategy |
|-------------|-----------------|
| Image classification | Max pool between conv blocks, GAP before classifier |
| Object detection | Feature pyramid with different pooling scales |
| Semantic segmentation | Limit pooling, or pool and then upsample (output must be full resolution) |
| Modern architectures | Strided convolutions replacing pooling |

---

## 5. Building a CNN in PyTorch

### nn.Conv2d Parameters Explained

```python
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
```

| Parameter | Meaning | Example |
|-----------|---------|---------|
| `in_channels` | Number of input channels | 3 (RGB) or 1 (grayscale) |
| `out_channels` | Number of filters (output channels) | 32, 64, 128 |
| `kernel_size` | Size of the filter | 3 (means 3x3) |
| `stride` | Step size | 1 (default) or 2 (halves output) |
| `padding` | Zero-padding | 1 (for 'same' with 3x3 kernel) |

### nn.MaxPool2d

```python
nn.MaxPool2d(kernel_size, stride=None)
```

If `stride` is not specified, it defaults to `kernel_size`. So `MaxPool2d(2)` means 2x2 windows with stride 2, halving the spatial dimensions.
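To check layer shapes before building a model, the output-size formula $\lfloor (n + 2p - k) / s \rfloor + 1$ can be applied layer by layer. The sketch below assumes square inputs, a dilation of 1, and floor rounding. `conv_output_size` and `trace_sizes` are hypothetical helpers for this illustration, not PyTorch functions.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

def trace_sizes(size, conv_padding):
    """Sizes after conv3x3 -> pool2 -> conv3x3 -> pool2 for a square input."""
    sizes = [size]
    for _ in range(2):
        size = conv_output_size(size, kernel_size=3, stride=1, padding=conv_padding)
        sizes.append(size)
        size = conv_output_size(size, kernel_size=2, stride=2)  # like MaxPool2d(2)
        sizes.append(size)
    return sizes

print(trace_sizes(28, conv_padding=1))  # [28, 28, 14, 14, 7]
print(trace_sizes(28, conv_padding=0))  # [28, 26, 13, 11, 5]
```

With `padding=1`, a 3x3 convolution preserves the spatial size and only the pooling layers shrink it. With `padding=0`, each convolution also trims two pixels, which changes the size of the flattened vector fed to the first linear layer.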

In [None]:
# Building a simple CNN step by step
class SimpleCNN(nn.Module):
    """
    A simple CNN for 28x28 grayscale images (e.g., MNIST).
    
    Architecture:
        Conv(1->16, 3x3, padding=1) -> ReLU -> MaxPool(2x2)    [28x28 -> 14x14]
        Conv(16->32, 3x3, padding=1) -> ReLU -> MaxPool(2x2)   [14x14 -> 7x7]
        Flatten -> FC(1568->128) -> ReLU -> FC(128->10)
    """
    def __init__(self):
        super().__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        
        # Pooling
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Fully connected layers
        # After 2 pooling layers: 28 -> 14 -> 7, so 32 * 7 * 7 = 1568
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        # Block 1: Conv -> ReLU -> Pool
        x = self.pool(F.relu(self.conv1(x)))    # (B, 1, 28, 28) -> (B, 16, 14, 14)
        
        # Block 2: Conv -> ReLU -> Pool
        x = self.pool(F.relu(self.conv2(x)))    # (B, 16, 14, 14) -> (B, 32, 7, 7)
        
        # Flatten
        x = x.view(x.size(0), -1)              # (B, 32, 7, 7) -> (B, 1568)
        
        # Classifier
        x = F.relu(self.fc1(x))                 # (B, 1568) -> (B, 128)
        x = self.fc2(x)                         # (B, 128) -> (B, 10)
        
        return x

model = SimpleCNN()
print(model)

# Test with a dummy input
dummy = torch.randn(1, 1, 28, 28)
output = model(dummy)
print(f"\nInput shape:  {dummy.shape}")
print(f"Output shape: {output.shape}")

### Visualizing Feature Maps at Each Layer

In [None]:
# Create a simple digit-like image and visualize feature maps
def create_digit_image():
    """Create a simple '7' digit image."""
    img = np.zeros((28, 28))
    # Horizontal bar at top
    img[4:7, 6:22] = 1.0
    # Diagonal bar
    for i in range(20):
        j = 20 - i
        if 4 <= j <= 22 and 6 <= i + 6 <= 25:
            img[i + 6, max(6, j-1):j+2] = 1.0
    return img

digit_img = create_digit_image()

# Pass through the model and capture intermediate outputs
model.eval()
x = torch.tensor(digit_img, dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # Add batch and channel dims

# Get intermediate activations
with torch.no_grad():
    after_conv1 = F.relu(model.conv1(x))
    after_pool1 = model.pool(after_conv1)
    after_conv2 = F.relu(model.conv2(after_pool1))
    after_pool2 = model.pool(after_conv2)

# Plot feature maps
fig, axes = plt.subplots(3, 6, figsize=(15, 8))

# Original image
axes[0, 0].imshow(digit_img, cmap='gray')
axes[0, 0].set_title('Input (28x28)', fontsize=10, fontweight='bold')
axes[0, 0].axis('off')
for j in range(1, 6):
    axes[0, j].axis('off')
axes[0, 1].text(0.5, 0.5, f'After Conv1:\n16 feature maps\nSize: {after_conv1.shape[2]}x{after_conv1.shape[3]}',
                ha='center', va='center', fontsize=11, transform=axes[0, 1].transAxes)

# After conv1 - show first 6 feature maps
for j in range(6):
    if j < after_conv1.shape[1]:
        axes[1, j].imshow(after_conv1[0, j].numpy(), cmap='viridis')
        axes[1, j].set_title(f'Conv1 filter {j}', fontsize=8)
    axes[1, j].axis('off')

# After conv2 - show first 6 feature maps
for j in range(6):
    if j < after_conv2.shape[1]:
        axes[2, j].imshow(after_conv2[0, j].numpy(), cmap='viridis')
        axes[2, j].set_title(f'Conv2 filter {j}', fontsize=8)
    axes[2, j].axis('off')

axes[1, 0].set_ylabel('After Conv1\n(16 maps)', fontsize=10, fontweight='bold')
axes[2, 0].set_ylabel('After Conv2\n(32 maps)', fontsize=10, fontweight='bold')

plt.suptitle('Feature Maps at Each Layer (random weights)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Note: These feature maps use random (untrained) weights.")
print("After training, each feature map would highlight meaningful patterns.")

### Parameter Counting: CNN vs Fully Connected

In [None]:
# Parameter counting comparison
def count_parameters(model):
    """Count total and per-layer parameters."""
    total = 0
    layer_counts = {}
    for name, param in model.named_parameters():
        count = param.numel()
        total += count
        layer_counts[name] = count
    return total, layer_counts

# CNN parameters
cnn = SimpleCNN()
cnn_total, cnn_layers = count_parameters(cnn)

# Equivalent FC network for 28x28 images
class EquivalentFC(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

fc = EquivalentFC()
fc_total, fc_layers = count_parameters(fc)

# Print comparison
print("=" * 55)
print("CNN Parameter Breakdown:")
print("=" * 55)
for name, count in cnn_layers.items():
    print(f"  {name:25s}: {count:>8,}")
print(f"  {'TOTAL':25s}: {cnn_total:>8,}")

print(f"\n{'=' * 55}")
print("FC Network Parameter Breakdown:")
print("=" * 55)
for name, count in fc_layers.items():
    print(f"  {name:25s}: {count:>8,}")
print(f"  {'TOTAL':25s}: {fc_total:>8,}")

print(f"\n{'=' * 55}")
print(f"CNN parameters:  {cnn_total:>8,}")
print(f"FC parameters:   {fc_total:>8,}")
print(f"CNN/FC ratio:    {cnn_total/fc_total:.2f}x")
print(f"\nNote: The two totals are of similar size because of the FC layers")
print(f"at the end of the CNN. Most CNN params are in fc1 ({cnn_layers['fc1.weight']:,} weights).")
print(f"The conv layers themselves use very few: conv1={cnn_layers['conv1.weight']:,}, conv2={cnn_layers['conv2.weight']:,}")

---

## 6. Classic CNN Architectures

### Intuitive Explanation

The history of CNNs is a story of going deeper and finding clever ways to make depth work. Each architecture introduced a key innovation that changed the field.

**F1 analogy:** Think of it as the evolution of F1 car design. LeNet-5 is the early 1950s cars -- simple, functional, proved the concept. AlexNet is the ground-effect era -- a breakthrough in raw performance. VGG is the 1990s approach of refining a simple formula. And ResNet is the double-diffuser or blown-diffuser moment -- a clever trick (skip connections) that unlocked performance levels nobody thought were possible.

### Architecture Comparison Table

| Architecture | Year | Depth | Parameters | Key Innovation | Top-5 Error (ImageNet) |
|-------------|------|-------|-----------|----------------|----------------------|
| **LeNet-5** | 1998 | 5 layers | ~60K | First practical CNN | N/A (MNIST) |
| **AlexNet** | 2012 | 8 layers | 60M | ReLU, dropout, GPU training | 15.3% |
| **VGG-16** | 2014 | 16 layers | 138M | Small 3x3 filters throughout | 7.3% |
| **GoogLeNet** | 2014 | 22 layers | 6.8M | Inception modules (multi-scale) | 6.7% |
| **ResNet-50** | 2015 | 50 layers | 25.6M | Skip connections | 3.6% (ILSVRC 2015 ResNet ensemble) |

In [None]:
# Visualize architecture evolution
fig, ax = plt.subplots(figsize=(14, 6))

architectures = [
    ('LeNet-5\n(1998)', 5, 0.06, 'First CNN'),
    ('AlexNet\n(2012)', 8, 60, 'ReLU + GPU'),
    ('VGG-16\n(2014)', 16, 138, '3x3 filters'),
    ('GoogLeNet\n(2014)', 22, 6.8, 'Inception'),
    ('ResNet-50\n(2015)', 50, 25.6, 'Skip connections'),
    ('ResNet-152\n(2015)', 152, 60, 'Very deep'),
]

x_pos = range(len(architectures))
depths = [a[1] for a in architectures]
params = [a[2] for a in architectures]
names = [a[0] for a in architectures]
innovations = [a[3] for a in architectures]

# Bar chart for depth
bars = ax.bar(x_pos, depths, color=['#74C0FC', '#FF6B6B', '#69DB7C', '#FAB005', '#DA77F2', '#DA77F2'],
              alpha=0.8, edgecolor='black', linewidth=1)

# Add parameter count as text
for i, (bar, p, innov) in enumerate(zip(bars, params, innovations)):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 2,
            f'{p}M params', ha='center', va='bottom', fontsize=9, fontweight='bold')
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height()/2,
            innov, ha='center', va='center', fontsize=8, color='white', fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.2', facecolor='black', alpha=0.5))

ax.set_xticks(x_pos)
ax.set_xticklabels(names)
ax.set_ylabel('Number of Layers', fontsize=12)
ax.set_title('Evolution of CNN Architectures', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

### 6.1 LeNet-5 (1998): The Original CNN

Yann LeCun's **LeNet-5** was the first CNN to achieve practical success, reading handwritten digits on checks. Its architecture is simple by modern standards but established the Conv-Pool-Conv-Pool-FC pattern that dominated for years.

### 6.2 VGG (2014): Deeper is Better

VGG's key insight was elegantly simple: **use only 3x3 filters and go deeper**. Two stacked 3x3 conv layers have the same receptive field as one 5x5 layer, but with fewer parameters and more nonlinearity (two ReLU activations instead of one).
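The arithmetic behind this claim is short enough to check directly. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases. The channel count of 64 is an illustrative choice, and both helper functions are hypothetical names.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the field by k - 1
    return rf

channels = 64  # illustrative
two_3x3 = 2 * conv_weights(3, channels)
one_5x5 = conv_weights(5, channels)

print(stacked_receptive_field([3, 3]), stacked_receptive_field([5]))  # 5 5
print(f"{two_3x3:,} vs {one_5x5:,}")                                  # 73,728 vs 102,400
```

In general the comparison is $18C^2$ weights against $25C^2$, a saving of 28% for the same receptive field. The same reasoning gives three stacked 3x3 layers the receptive field of one 7x7 layer.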

### 6.3 ResNet (2015): Skip Connections Solve Everything

ResNet introduced one of the most influential architectural innovations in deep learning: **skip connections** (also called residual connections). Instead of learning the desired mapping $H(x)$ directly, each block learns $F(x) = H(x) - x$, the "residual," and adds the input back.

$$\text{output} = F(x) + x$$

**Why this is brilliant:** If a layer is not helpful, the network can easily learn $F(x) = 0$, making the block an identity function. In principle, this means adding more layers should not hurt -- at worst, the extra blocks can learn to do nothing.
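A toy version of this argument, as a sketch: the two functions below use scalar weights `w1` and `w2` as stand-ins for the two conv layers (a deliberate simplification, with no batch norm), and both function names are hypothetical. When the last weight is zero, the residual block passes its input through unchanged, while the plain block wipes it out. The identity holds exactly here because the input is non-negative, as it would be after a preceding ReLU.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def residual_block(x, w1, w2):
    """Toy residual block: relu(F(x) + x), with F(x) = w2 * relu(w1 * x)."""
    hidden = relu([w1 * v for v in x])
    residual = [w2 * h for h in hidden]
    return relu([r + v for r, v in zip(residual, x)])

def plain_block(x, w1, w2):
    """Same layers without the skip connection: relu(F(x))."""
    hidden = relu([w1 * v for v in x])
    return relu([w2 * h for h in hidden])

x = [0.5, 2.0, 3.0]  # non-negative, like activations after a ReLU

print(residual_block(x, w1=0.7, w2=0.0))  # [0.5, 2.0, 3.0]
print(plain_block(x, w1=0.7, w2=0.0))     # [0.0, 0.0, 0.0]
```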

**F1 analogy:** Skip connections are like having both the raw telemetry AND the processed telemetry available at every stage of analysis. Imagine your data pipeline has multiple processing stages: filtering, smoothing, feature extraction. Without skip connections, each stage only sees the output of the previous stage -- if any stage corrupts the signal, the information is lost forever. With skip connections, every stage also gets the original raw data. If your smoothing filter accidentally erases a brief but critical traction event, the raw trace still carries that information forward. This is exactly why modern F1 telemetry dashboards overlay raw and filtered data simultaneously.

In [None]:
# Visualize the skip connection concept
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Without skip connection (plain network)
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')

# Draw blocks
blocks = [(2, 6, 'Conv + BN\n+ ReLU'), (2, 4, 'Conv + BN'), (2, 2, 'ReLU')]
for x, y, label in blocks:
    rect = plt.Rectangle((x, y - 0.4), 6, 0.8, facecolor='lightblue', 
                          edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(5, y, label, ha='center', va='center', fontsize=10, fontweight='bold')

# Arrows between blocks
ax.annotate('', xy=(5, 5.6), xytext=(5, 5.0),
            arrowprops=dict(arrowstyle='<-', color='black', lw=2))
ax.annotate('', xy=(5, 3.6), xytext=(5, 3.0),
            arrowprops=dict(arrowstyle='<-', color='black', lw=2))

# Input/Output labels
ax.text(5, 7.2, 'Input x', ha='center', fontsize=11, fontweight='bold')
# Arrowheads point downward, in the direction of data flow
ax.annotate('', xy=(5, 6.4), xytext=(5, 6.8),
            arrowprops=dict(arrowstyle='->', color='black', lw=2))
ax.annotate('', xy=(5, 1.0), xytext=(5, 1.6),
            arrowprops=dict(arrowstyle='->', color='black', lw=2))
ax.text(5, 0.6, 'Output H(x)', ha='center', fontsize=11, fontweight='bold')

ax.set_title('Plain Block\nOutput = H(x)', fontsize=13, fontweight='bold')

# Right: With skip connection (residual block)
ax = axes[1]
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')

# Draw blocks
for x, y, label in blocks:
    rect = plt.Rectangle((x, y - 0.4), 5, 0.8, facecolor='lightgreen', 
                          edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(4.5, y, label, ha='center', va='center', fontsize=10, fontweight='bold')

# Arrows between blocks
ax.annotate('', xy=(4.5, 5.6), xytext=(4.5, 5.0),
            arrowprops=dict(arrowstyle='<-', color='black', lw=2))
ax.annotate('', xy=(4.5, 3.6), xytext=(4.5, 3.0),
            arrowprops=dict(arrowstyle='<-', color='black', lw=2))

# Input/Output
ax.text(4.5, 7.2, 'Input x', ha='center', fontsize=11, fontweight='bold')
ax.annotate('', xy=(4.5, 6.4), xytext=(4.5, 6.8),  # arrowhead points down into the first block
            arrowprops=dict(arrowstyle='->', color='black', lw=2))

# SKIP CONNECTION - the key part!
ax.annotate('', xy=(8.2, 2.0), xytext=(8.2, 6.0),
            arrowprops=dict(arrowstyle='->', color='red', lw=3, 
                           connectionstyle='arc3,rad=0'))
ax.text(9.0, 4.0, 'Skip\nconnection\n(identity)', ha='center', fontsize=9, 
        color='red', fontweight='bold')

# Plus symbol at output
circle = plt.Circle((7.5, 2.0), 0.25, facecolor='yellow', edgecolor='red', linewidth=2)
ax.add_patch(circle)
ax.text(7.5, 2.0, '+', ha='center', va='center', fontsize=16, fontweight='bold', color='red')

# Output
ax.annotate('', xy=(5, 1.0), xytext=(5, 1.6),  # arrowhead points down toward the output label
            arrowprops=dict(arrowstyle='->', color='black', lw=2))
ax.text(5, 0.6, 'Output = F(x) + x', ha='center', fontsize=11, fontweight='bold', color='red')

ax.set_title('Residual Block (ResNet)\nOutput = F(x) + x', fontsize=13, fontweight='bold')

plt.suptitle('Skip Connections: The Key Innovation of ResNet', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Implement a Mini-ResNet with Skip Connections

In [None]:
class ResidualBlock(nn.Module):
    """
    A single residual block with skip connection.
    
    output = ReLU(BN(Conv(ReLU(BN(Conv(x))))) + skip(x))
    
    If input and output channels differ, we use a 1x1 convolution 
    on the skip path to match dimensions.
    """
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, 
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, 
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        # Skip connection: if dimensions change, use 1x1 conv to match
        self.skip = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.skip = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, 
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
    
    def forward(self, x):
        identity = x
        
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        
        out += self.skip(identity)  # <-- The skip connection!
        out = F.relu(out)
        
        return out


class MiniResNet(nn.Module):
    """
    A small ResNet for 28x28 images (MNIST/FashionMNIST).
    
    Architecture:
        Conv(1->16) -> ResBlock(16->16) -> ResBlock(16->32, stride=2) 
        -> ResBlock(32->64, stride=2) -> GAP -> FC(64->10)
    """
    def __init__(self, num_classes=10):
        super().__init__()
        
        # Initial convolution
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(16)
        
        # Residual blocks
        self.layer1 = ResidualBlock(16, 16, stride=1)    # 28x28 -> 28x28
        self.layer2 = ResidualBlock(16, 32, stride=2)    # 28x28 -> 14x14
        self.layer3 = ResidualBlock(32, 64, stride=2)    # 14x14 -> 7x7
        
        # Global average pooling + classifier
        self.gap = nn.AdaptiveAvgPool2d(1)               # 7x7 -> 1x1
        self.fc = nn.Linear(64, num_classes)
    
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.gap(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Create and inspect the model
resnet = MiniResNet()
print(resnet)

# Test with dummy input
dummy = torch.randn(2, 1, 28, 28)
output = resnet(dummy)
print(f"\nInput shape:  {dummy.shape}")
print(f"Output shape: {output.shape}")

# Count parameters
total_params = sum(p.numel() for p in resnet.parameters())
print(f"Total parameters: {total_params:,}")

### Deep Dive: Why ResNets Work (Gradient Highways)

The fundamental problem with very deep networks is **vanishing gradients**. During backpropagation, gradients are multiplied through each layer. With many layers, these products can become astronomically small, so early layers learn almost nothing.

Skip connections create **gradient highways** -- shortcuts for gradients to flow backward without being multiplied through many layers. The gradient of the skip connection is simply 1 (the derivative of the identity function), so gradients always have a direct path back.
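The chain rule makes this concrete. As a sketch, assume a simplified block $y = x + F(x)$ that ignores the ReLU applied after the addition, and write $\mathcal{L}$ for the loss (a symbol introduced here for illustration). For scalars:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(1 + \frac{\partial F}{\partial x}\right)$$

Stacking $N$ such blocks, with $x_{l+1} = x_l + F_l(x_l)$, gives

$$\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_N}\prod_{l=0}^{N-1}\left(1 + \frac{\partial F_l}{\partial x_l}\right)$$

Expanding the product always leaves one term equal to $\frac{\partial \mathcal{L}}{\partial x_N}$ on its own, so part of the gradient reaches $x_0$ without passing through any $\frac{\partial F_l}{\partial x_l}$. In a plain network each factor is just $\frac{\partial F_l}{\partial x_l}$, and the product shrinks whenever those factors are small. For vector inputs, the $1$ becomes the identity matrix.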

**F1 analogy:** Think of vanishing gradients like trying to relay a radio message from the pit wall through 50 intermediate stations around the track. By the time it arrives, it is garbled beyond recognition. Skip connections are like giving the pit wall a direct radio link to every station -- the message arrives intact regardless of how many intermediate points there are.

#### Key Insight

Without skip connections, simply stacking more layers can make a network *worse*: in the ResNet paper, a 56-layer plain network had higher training error than a 20-layer one on CIFAR-10 (the degradation problem). With skip connections, a 152-layer ResNet outperformed the architectures that came before it on ImageNet. The difference comes from trainability, not capacity.

#### Common Misconceptions

| Misconception | Reality |
|---------------|--------|
| Skip connections add parameters | A pure identity skip adds zero parameters |
| ResNets learn residuals by choice | The architecture forces residual learning |
| Deeper always needs skip connections | For < 20 layers, skip connections help less |
| Skip connections are only for CNNs | They are used everywhere: Transformers, RNNs, MLPs |

In [None]:
# Demonstrate the gradient flow advantage of skip connections
def simulate_gradient_flow(n_layers, has_skip=False):
    """
    Simulate gradient magnitude through layers.
    
    Without skip: grad *= layer_factor at each layer
    With skip:    grad = grad * layer_factor + grad_skip (identity = 1)
    """
    np.random.seed(42)
    gradient = 1.0
    gradients = [gradient]
    
    for i in range(n_layers):
        # Each layer multiplies gradient by a factor < 1 (slight vanishing)
        layer_factor = np.random.uniform(0.7, 0.95)
        
        if has_skip and i % 2 == 1:  # Skip connection every 2 layers
            # Gradient flows through both the layer path AND the skip path
            gradient = gradient * layer_factor + gradients[-2] * 1.0
        else:
            gradient = gradient * layer_factor
        
        gradients.append(abs(gradient))
    
    return gradients

fig, ax = plt.subplots(figsize=(10, 6))

n_layers = 50

no_skip = simulate_gradient_flow(n_layers, has_skip=False)
with_skip = simulate_gradient_flow(n_layers, has_skip=True)

ax.plot(no_skip, 'r-', linewidth=2, label='Without skip connections', alpha=0.8)
ax.plot(with_skip, 'g-', linewidth=2, label='With skip connections', alpha=0.8)
ax.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='Initial gradient')

ax.set_xlabel('Layer (from output to input)', fontsize=12)
ax.set_ylabel('Gradient Magnitude', fontsize=12)
ax.set_title('Gradient Flow: Skip Connections Prevent Vanishing Gradients', fontsize=14)
ax.set_yscale('log')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

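print(f"After {n_layers} layers:")
print(f"  Without skip: gradient = {no_skip[-1]:.6f} (vanished!)")
# In this toy model the identity path makes the gradient grow rather than shrink;
# real networks keep that growth in check with normalization and careful initialization.
print(f"  With skip:    gradient = {with_skip[-1]:.4f} (did not vanish)")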
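
The simulation above is a toy version of a simple chain-rule identity. As a simplifying assumption for this sketch, write a residual block as $y = x + F(x)$ and ignore the final ReLU. Then the gradient of the loss $\mathcal{L}$ with respect to the block input is:

$$
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
= \frac{\partial \mathcal{L}}{\partial y} + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
$$

The first term reaches $x$ unchanged no matter how small $\partial F / \partial x$ becomes. A plain layer $y = F(x)$ keeps only the second term, so stacking $n$ such layers multiplies $n$ Jacobians together, and the product shrinks geometrically when each has norm below 1.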

---

## 7. Practical CNN Training

### 7.1 Data Augmentation

**Data augmentation** artificially expands the training set by applying random transformations to existing images. This is one of the most effective regularization techniques for CNNs.

| Transform | What It Does | When to Use | F1 Parallel |
|-----------|-------------|-------------|-------------|
| `RandomHorizontalFlip` | Mirror left-right | Most tasks (not text!) | Mirroring a clockwise circuit to simulate a counter-clockwise one |
| `RandomRotation` | Rotate by random angle | When orientation varies | Small variations in track camber or car roll angle |
| `RandomCrop` | Crop random region | Almost always | Analyzing a random subsection of a lap |
| `ColorJitter` | Change brightness/contrast | Color-invariant tasks | Simulating different weather/lighting conditions |
| `RandomAffine` | Scale, translate, shear | General robustness | Slight sensor calibration differences between sessions |
| `Normalize` | Standardize pixel values | Always (not augmentation, but essential) | Normalizing telemetry to a common scale across cars |
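
To make the table concrete, here is a minimal NumPy sketch of what three of these transforms do to a single-channel image. The functions `random_hflip`, `random_translate`, and `normalize` are hypothetical names written for illustration. They are not the torchvision implementations, which also handle interpolation, PIL images, and batches.

```python
import numpy as np


def random_hflip(img, rng, p=0.5):
    """Mirror the image left-right with probability p."""
    return img[:, ::-1] if rng.random() < p else img


def random_translate(img, rng, max_shift=2):
    """Shift by up to max_shift pixels along each axis, filling gaps with zeros."""
    h, w = img.shape
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(img)
    # Copy the region that is still inside the frame after the shift
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out


def normalize(img, mean=0.5, std=0.5):
    """Map pixel values from [0, 1] to [-1, 1] when mean = std = 0.5."""
    return (img - mean) / std


rng = np.random.default_rng(0)
img = np.arange(16, dtype=float).reshape(4, 4) / 15.0  # toy 4x4 "image" in [0, 1]
augmented = normalize(random_translate(random_hflip(img, rng), rng, max_shift=1))
print(augmented.shape)
```

Because the random draws change on every call, the same source image yields a different training example each epoch, which is where the regularization effect comes from.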

In [None]:
# Visualize data augmentation transforms
# Define transforms
train_transform = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load FashionMNIST
train_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=train_transform
)
test_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=False, download=True, transform=test_transform
)

# Class names for FashionMNIST
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Show augmented examples
raw_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transforms.ToTensor()
)

fig, axes = plt.subplots(3, 8, figsize=(16, 6))

# First row: original images
for j in range(8):
    img, label = raw_dataset[j]
    axes[0, j].imshow(img.squeeze(), cmap='gray')
    axes[0, j].set_title(class_names[label], fontsize=8)
    # Hide ticks instead of calling axis('off'), which would also hide the row label
    axes[0, j].set_xticks([])
    axes[0, j].set_yticks([])
axes[0, 0].set_ylabel('Original', fontsize=10, fontweight='bold')

# Second and third rows: augmented versions of the same images
for row in range(1, 3):
    for j in range(8):
        img, label = train_dataset[j]  # Random augmentation applied each time
        axes[row, j].imshow(img.squeeze(), cmap='gray')
        axes[row, j].set_xticks([])
        axes[row, j].set_yticks([])
    axes[row, 0].set_ylabel(f'Augmented {row}', fontsize=10, fontweight='bold')

plt.suptitle('Data Augmentation: Same Images with Random Transforms', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"Training samples: {len(train_dataset):,}")
print(f"Test samples:     {len(test_dataset):,}")
print(f"Classes:          {len(class_names)}")

### 7.2 Complete Training Pipeline

Let us train our MiniResNet on FashionMNIST with all the best practices from the previous notebook.

In [None]:
# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

# Initialize model, loss, optimizer, scheduler
torch.manual_seed(42)
model = MiniResNet(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

def train_one_epoch(model, loader, criterion, optimizer):
    """Train for one epoch and return average loss and accuracy."""
    model.train()
    total_loss, correct, total = 0, 0, 0
    
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item() * images.size(0)
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += images.size(0)
    
    return total_loss / total, correct / total

def evaluate(model, loader, criterion):
    """Evaluate model and return loss and accuracy."""
    model.eval()
    total_loss, correct, total = 0, 0, 0
    
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item() * images.size(0)
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()
            total += images.size(0)
    
    return total_loss / total, correct / total

# Training loop
num_epochs = 10
history = {'train_loss': [], 'test_loss': [], 'train_acc': [], 'test_acc': []}

print(f"Training MiniResNet on FashionMNIST ({sum(p.numel() for p in model.parameters()):,} parameters)")
print("=" * 65)

for epoch in range(num_epochs):
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer)
    test_loss, test_acc = evaluate(model, test_loader, criterion)
    scheduler.step()
    
    history['train_loss'].append(train_loss)
    history['test_loss'].append(test_loss)
    history['train_acc'].append(train_acc)
    history['test_acc'].append(test_acc)
    
    print(f"Epoch {epoch+1:2d}/{num_epochs}: "
          f"train_loss={train_loss:.4f}, train_acc={train_acc:.4f} | "
          f"test_loss={test_loss:.4f}, test_acc={test_acc:.4f}")

print(f"\nBest test accuracy: {max(history['test_acc']):.4f}")
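
The training cell above uses `CosineAnnealingLR` with `T_max=10`. The sketch below reproduces the closed-form cosine schedule in plain Python so the learning rate at each epoch can be inspected without running a training loop. `cosine_lr` is a hypothetical helper, and its defaults mirror the values in the cell above (`lr=0.001`, `T_max=10`, minimum learning rate 0).

```python
import math


def cosine_lr(epoch, base_lr=0.001, t_max=10, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


schedule = [cosine_lr(epoch) for epoch in range(11)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

The rate starts at the base value, falls slowly at first, drops fastest around the midpoint, and flattens out near zero. Since `scheduler.step()` runs after each epoch, epoch `e` (counting from 0) trains with `cosine_lr(e)`.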

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
ax = axes[0]
ax.plot(history['train_loss'], 'b-', linewidth=2, label='Train')
ax.plot(history['test_loss'], 'r-', linewidth=2, label='Test')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Training & Test Loss')
ax.legend()
ax.grid(True, alpha=0.3)

# Accuracy
ax = axes[1]
ax.plot(history['train_acc'], 'b-', linewidth=2, label='Train')
ax.plot(history['test_acc'], 'r-', linewidth=2, label='Test')
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
ax.set_title('Training & Test Accuracy')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0.7, 1.0)

plt.suptitle('MiniResNet Training on FashionMNIST', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### 7.3 Confusion Matrix and Per-Class Analysis

In [None]:
# Generate predictions for confusion matrix
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = model(images)
        _, predicted = outputs.max(1)
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.numpy())

all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

# Compute confusion matrix
n_classes = 10
confusion = np.zeros((n_classes, n_classes), dtype=int)
for true, pred in zip(all_labels, all_preds):
    confusion[true, pred] += 1

# Plot confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# Confusion matrix
ax = axes[0]
im = ax.imshow(confusion, cmap='Blues')
plt.colorbar(im, ax=ax, fraction=0.046)

# Add text annotations
for i in range(n_classes):
    for j in range(n_classes):
        val = confusion[i, j]
        color = 'white' if val > confusion.max() / 2 else 'black'
        ax.text(j, i, str(val), ha='center', va='center', fontsize=7, color=color)

ax.set_xticks(range(n_classes))
ax.set_yticks(range(n_classes))
ax.set_xticklabels(class_names, rotation=45, ha='right', fontsize=8)
ax.set_yticklabels(class_names, fontsize=8)
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_title('Confusion Matrix', fontsize=12, fontweight='bold')

# Per-class accuracy
ax = axes[1]
per_class_acc = confusion.diagonal() / confusion.sum(axis=1)
colors_bar = ['green' if acc > 0.9 else 'orange' if acc > 0.8 else 'red' for acc in per_class_acc]
bars = ax.barh(range(n_classes), per_class_acc, color=colors_bar, alpha=0.7, edgecolor='black')
ax.set_yticks(range(n_classes))
ax.set_yticklabels(class_names, fontsize=9)
ax.set_xlabel('Accuracy')
ax.set_title('Per-Class Accuracy', fontsize=12, fontweight='bold')
ax.set_xlim(0, 1)
ax.grid(True, alpha=0.3, axis='x')

# Add value labels
for bar, acc in zip(bars, per_class_acc):
    ax.text(acc + 0.01, bar.get_y() + bar.get_height()/2, f'{acc:.1%}', 
            va='center', fontsize=9)

plt.tight_layout()
plt.show()

# Print summary
overall_acc = np.sum(all_preds == all_labels) / len(all_labels)
print(f"Overall test accuracy: {overall_acc:.4f}")
# Find the most confused pair (largest off-diagonal entry).
# Work on a copy so the original confusion matrix stays intact for later cells.
off_diagonal = confusion.copy()
np.fill_diagonal(off_diagonal, 0)
max_idx = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
print(f"\nMost confused pair: {class_names[max_idx[0]]} mistaken for {class_names[max_idx[1]]} "
      f"({off_diagonal[max_idx]} times)")
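
The per-class accuracy plotted above is the diagonal divided by the row sums, which is the same quantity as per-class recall. Dividing by the column sums instead gives precision. The sketch below shows both on a small made-up matrix; the values in `cm` are hypothetical and are not results from the notebook's model.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
])

recall = cm.diagonal() / cm.sum(axis=1)     # share of each true class that was found
precision = cm.diagonal() / cm.sum(axis=0)  # share of each predicted class that was right
f1 = 2 * precision * recall / (precision + recall)

for k in range(cm.shape[0]):
    print(f"class {k}: precision={precision[k]:.2f}, recall={recall[k]:.2f}, f1={f1[k]:.2f}")
```

A class can have high recall but low precision when the model over-predicts it, so looking at both directions of the matrix gives a fuller picture than the row-wise view alone.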

In [None]:
# Show sample predictions with confidence
model.eval()
fig, axes = plt.subplots(2, 5, figsize=(15, 6))

# Get a batch of test images
test_iter = iter(test_loader)
images, labels = next(test_iter)

for idx in range(10):
    ax = axes[idx // 5, idx % 5]
    
    img = images[idx]
    true_label = labels[idx].item()
    
    # Get prediction
    with torch.no_grad():
        output = model(img.unsqueeze(0).to(device))
        probs = F.softmax(output, dim=1)
        pred_label = probs.argmax().item()
        confidence = probs.max().item()
    
    # Display
    ax.imshow(img.squeeze(), cmap='gray')
    
    color = 'green' if pred_label == true_label else 'red'
    ax.set_title(f'True: {class_names[true_label]}\nPred: {class_names[pred_label]} ({confidence:.0%})',
                 fontsize=8, color=color, fontweight='bold')
    ax.axis('off')

plt.suptitle('Sample Predictions (green=correct, red=wrong)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
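
The confidence shown in each title is the largest softmax probability. A minimal NumPy sketch of that computation follows. The `logits` values are hypothetical, and `softmax` here is a hand-written helper for illustration rather than the `F.softmax` call used in the cell above.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max()  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()


logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
pred_label = int(probs.argmax())
confidence = float(probs.max())
print(pred_label, round(confidence, 3))
```

Treat this number as a relative score rather than a calibrated probability: a network can be confidently wrong, which is why the red titles in the grid are worth inspecting.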

---

## Exercises

### Exercise 1: Manual Convolution

Implement 2D convolution with padding and stride support.

**F1 scenario:** Imagine you are building a telemetry analysis tool from scratch. Your first task is to implement the core operation -- sliding a pattern-detection filter across a 2D data grid (e.g., a time-frequency representation of engine audio). Get the padding and stride right so your output dimensions are exactly what you expect.

In [None]:
# EXERCISE 1: Implement 2D convolution with padding and stride
def conv2d_full(image, kernel, padding=0, stride=1):
    """
    Perform 2D convolution with padding and stride.
    
    Args:
        image: 2D numpy array (H x W)
        kernel: 2D numpy array (K x K)
        padding: number of zero-padding pixels
        stride: step size
    
    Returns:
        Output feature map as numpy array
    """
    # TODO: Implement this!
    # Step 1: Pad the image with zeros if padding > 0
    # Hint: Use np.pad(image, padding, mode='constant', constant_values=0)
    
    # Step 2: Compute output dimensions using the formula
    # Hint: out_h = (H_padded - K) // stride + 1
    
    # Step 3: Slide the kernel with the given stride and compute dot products
    
    pass  # Replace with your implementation

# Test cases
test_image = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
], dtype=float)

test_kernel = np.array([
    [1, 0],
    [0, -1]
], dtype=float)

# Test 1: No padding, stride 1
result1 = conv2d_full(test_image, test_kernel, padding=0, stride=1)
expected1 = np.array([[-5, -5, -5], [-5, -5, -5], [-5, -5, -5]])
print("Test 1 (pad=0, stride=1):")
print(f"  Your result:\n{result1}")
print(f"  Expected:\n{expected1}")
if result1 is not None:
    print(f"  Correct: {np.allclose(result1, expected1)}")

# Test 2: Padding 1, stride 1
result2 = conv2d_full(test_image, test_kernel, padding=1, stride=1)
print(f"\nTest 2 (pad=1, stride=1):")
print(f"  Output shape: {result2.shape if result2 is not None else 'None'}")
print(f"  Expected shape: (5, 5)")

# Test 3: No padding, stride 2
result3 = conv2d_full(test_image, test_kernel, padding=0, stride=2)
expected3 = np.array([[-5, -5], [-5, -5]])
print(f"\nTest 3 (pad=0, stride=2):")
print(f"  Your result:\n{result3}")
print(f"  Expected:\n{expected3}")
if result3 is not None:
    print(f"  Correct: {np.allclose(result3, expected3)}")

### Exercise 2: Build a Custom CNN

Design a CNN for FashionMNIST and beat 88% test accuracy.

**F1 scenario:** Think of each FashionMNIST image as a simplified telemetry snapshot -- a 28x28 grid encoding some pattern. Your task is to design a CNN architecture (choosing filter counts, kernel sizes, pooling strategy) that can reliably classify these patterns, much like an engineer would design a pipeline to classify different types of on-track events from telemetry spectrograms.

In [None]:
# EXERCISE 2: Build your own CNN
class MyCNN(nn.Module):
    """
    Design a CNN that achieves > 88% accuracy on FashionMNIST.
    
    Requirements:
    - Input: (batch, 1, 28, 28) grayscale images
    - Output: (batch, 10) class logits
    - Use at least 2 conv layers
    - Use at least 1 pooling layer
    - Use BatchNorm
    - Keep parameters under 500K
    
    Hints:
    - Start with 16 or 32 filters, double at each block
    - Use padding=1 with 3x3 kernels for 'same' convolutions
    - Use MaxPool2d(2) to halve spatial dimensions
    - Remember to flatten before the FC layers
    - Use dropout for regularization
    """
    def __init__(self):
        super().__init__()
        # TODO: Design your architecture!
        # Hint: Conv -> BN -> ReLU -> Pool -> Conv -> BN -> ReLU -> Pool -> FC
        
        pass  # Replace with your implementation
    
    def forward(self, x):
        # TODO: Implement forward pass
        
        pass  # Replace with your implementation

# Test your architecture
# my_model = MyCNN()
# dummy = torch.randn(2, 1, 28, 28)
# out = my_model(dummy)
# print(f"Output shape: {out.shape}")
# total_params = sum(p.numel() for p in my_model.parameters())
# print(f"Total parameters: {total_params:,}")
# assert out.shape == (2, 10), "Output shape should be (batch, 10)"
# assert total_params < 500000, f"Too many parameters: {total_params}"
# print("Architecture check passed!")
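
Before building a model, the 500K parameter budget can be estimated on paper. The sketch below is plain Python with hypothetical helper names, and the layer plan is one arbitrary example that follows the hints above. It is not a reference solution, and no accuracy is claimed for it.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Weights of a k x k convolution plus one optional bias per output channel."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def batchnorm_params(channels):
    """Learnable scale and shift per channel (running statistics are buffers, not parameters)."""
    return 2 * channels


def linear_params(n_in, n_out, bias=True):
    return n_in * n_out + (n_out if bias else 0)


# Hypothetical plan: two conv blocks (16 then 32 filters), each followed by 2x2 max pooling
plan = [
    ("conv1", conv2d_params(1, 16, 3)),     # 28x28 stays 28x28 with padding=1
    ("bn1", batchnorm_params(16)),
    ("conv2", conv2d_params(16, 32, 3)),    # runs on 14x14 after the first pool
    ("bn2", batchnorm_params(32)),
    ("fc", linear_params(32 * 7 * 7, 10)),  # 7x7 maps after the second pool
]

for name, n in plan:
    print(f"{name:6s} {n:>7,}")
total = sum(n for _, n in plan)
print(f"total  {total:>7,}")
```

In a plan like this most of the budget sits in the fully connected layer, so the flattened feature size is the first thing to check when a design exceeds the limit.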

### Exercise 3: Implement a Residual Block from Scratch

Implement a residual block and verify the skip connection works correctly.

**F1 scenario:** Build the "raw + processed telemetry" pipeline. Your residual block should process the input through two convolutional layers (the processed path) while also forwarding the raw input directly to the output. Verify that when the convolutional path outputs zero, the block reduces to ReLU(x), so non-negative inputs pass through unchanged -- just like a telemetry system that defaults to showing raw data when no processing is applied.

In [None]:
# EXERCISE 3: Implement a residual block
class MyResidualBlock(nn.Module):
    """
    Implement a residual block:
        output = ReLU( BN(Conv(ReLU(BN(Conv(x))))) + x )
    
    If in_channels != out_channels, use a 1x1 conv on the skip path.
    
    Args:
        in_channels: Number of input channels
        out_channels: Number of output channels
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # TODO: Define layers
        # Hint: You need conv1, bn1, conv2, bn2
        # Hint: If in_channels != out_channels, add a skip projection
        
        pass  # Replace with your implementation
    
    def forward(self, x):
        # TODO: Implement forward with skip connection
        # Hint: identity = x (or projected x)
        # Hint: out = conv -> bn -> relu -> conv -> bn
        # Hint: out = relu(out + identity)
        
        pass  # Replace with your implementation

# Test
# block = MyResidualBlock(16, 16)
# x = torch.randn(2, 16, 14, 14)
# out = block(x)
# print(f"Same channels:  input {x.shape} -> output {out.shape}")
# assert out.shape == x.shape, "Output shape should match input when channels are the same"

# block2 = MyResidualBlock(16, 32)
# out2 = block2(x)
# print(f"Diff channels:  input {x.shape} -> output {out2.shape}")
# assert out2.shape == (2, 32, 14, 14), "Output should have 32 channels"

# # Verify skip connection: with every parameter zeroed, the conv path outputs 0,
# # so the block should reduce to relu(0 + x) = relu(x)
# block_zero = MyResidualBlock(16, 16)
# with torch.no_grad():
#     for param in block_zero.parameters():
#         param.zero_()
# out_zero = block_zero(x)
# assert torch.allclose(out_zero, torch.relu(x)), "With zero weights, output should equal relu(x)"
# print("Zero weights test passed: output equals relu(x)")
# print("All tests passed!")

---

## Summary

### Key Concepts

**Why CNNs:**
- Fully connected networks waste parameters on images -- no spatial awareness
- CNNs exploit local connectivity, weight sharing, and translation equivariance (pooling then adds approximate translation invariance)
- Dramatically fewer parameters while capturing spatial structure
- **F1 parallel:** Just as an engineer scans telemetry with a sliding window rather than staring at every data point, CNNs slide learned filters across input data

**Convolution Operation:**
- A sliding window dot product between a filter and the input
- Kernels detect features: edges, textures, patterns
- Padding preserves spatial dimensions; stride reduces them
- Output size: $O = \lfloor(W - K + 2P) / S\rfloor + 1$
- **F1 parallel:** Each filter is a pattern template (braking signature, traction event) swept across the telemetry trace
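
The output-size formula above translates directly into a one-line helper. `conv_output_size` is a hypothetical name for illustration. The first three calls use the same settings as the test cases in Exercise 1.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output width for input width w, kernel size k, padding p, stride s."""
    return (w - k + 2 * p) // s + 1


print(conv_output_size(4, 2))        # 4x4 input, 2x2 kernel, no padding: 3
print(conv_output_size(4, 2, p=1))   # padding 1: 5
print(conv_output_size(4, 2, s=2))   # stride 2: 2
print(conv_output_size(28, 3, p=1))  # 3x3 kernel with padding 1 keeps 28
```

The same helper applies to the height dimension, and to pooling layers when `k` is the pool size and `s` is the pool stride.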

**Multiple Channels & Filters:**
- Each filter slides across all input channels and produces one feature map
- More filters = more features detected at each location
- CNNs learn a hierarchy: edges -> textures -> parts -> objects
- **F1 parallel:** Multi-channel convolution combines speed, brake, throttle, and steering simultaneously; the hierarchy goes from raw sensor spikes to event patterns to driving-style classification
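
A minimal NumPy sketch of the multi-channel case follows. The shapes are arbitrary and the values are random. It uses explicit loops for readability, not the optimized routine a framework would run.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))           # input: 3 channels, 5x5
filters = rng.standard_normal((4, 3, 3, 3))  # 4 filters, each spanning all 3 channels with 3x3 kernels

out_h = out_w = 5 - 3 + 1  # no padding, stride 1
feature_maps = np.zeros((4, out_h, out_w))

for f in range(4):
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + 3, j:j + 3]  # 3x3 window across every input channel
            # One filter collapses the whole 3x3x3 patch into a single number
            feature_maps[f, i, j] = np.sum(patch * filters[f])

print(feature_maps.shape)  # (4, 3, 3): one feature map per filter
```

The number of input channels fixes the depth of every filter, while the number of filters fixes the number of output channels.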

**Pooling:**
- Max pooling keeps strongest activations; average pooling keeps means
- Global average pooling replaces the large flatten-and-FC head in many modern architectures, usually leaving a single linear classifier on top
- No learnable parameters -- purely downsampling
- **F1 parallel:** Summarizing a sector into key metrics -- peak speed (max pool), average pace (avg pool), one number per metric for the whole lap (GAP)
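
The three pooling variants can be compared on one small feature map. This is a NumPy sketch with made-up values, using a reshape trick that only works for non-overlapping windows that divide the input evenly.

```python
import numpy as np

x = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
], dtype=float)

# Split the 4x4 map into non-overlapping 2x2 windows: windows[a, c] is the block
# covering rows 2a..2a+1 and columns 2c..2c+1
windows = x.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = windows.max(axis=(2, 3))   # strongest activation per window
avg_pooled = windows.mean(axis=(2, 3))  # mean activation per window
gap = x.mean()                          # global average pooling: one number per channel

print(max_pooled)  # [[6. 2.] [2. 9.]]
print(avg_pooled)  # [[3.5 1. ] [1.  6. ]]
print(gap)         # 2.875
```

None of these operations has weights to learn, which matches the "purely downsampling" point above.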

**Classic Architectures:**
- LeNet-5 (1998): Conv-Pool pattern, one of the first practical CNNs
- VGG (2014): simple 3x3 filters stacked deep
- ResNet (2015): skip connections enable hundreds of layers
- **F1 parallel:** Skip connections = having both raw and processed telemetry available at every analysis stage, so no information is ever permanently lost

**Practical Training:**
- Data augmentation is critical for generalization
- BatchNorm, proper initialization, and learning rate scheduling
- Confusion matrices reveal per-class strengths and weaknesses

### Connection to Deep Learning

| Concept | Application | F1 Parallel |
|---------|------------|-------------|
| Convolution | Feature extraction from images, audio, and sequences | Scanning telemetry for braking zones, traction events |
| Pooling | Dimensionality reduction, translation invariance | Summarizing sectors into key metrics |
| Skip connections | Transformers, U-Nets, any deep architecture | Raw + processed telemetry overlay |
| Feature hierarchy | Transfer learning (reuse early layers) | Sensor spikes -> events -> corner profiles -> driving style |
| Data augmentation | Standard practice for all vision tasks | Simulating varied conditions (weather, calibration) |
| Global average pooling | Modern classifier heads (fewer params than FC) | One summary stat per channel for a whole lap |
| Confusion matrix | Diagnosing model weaknesses, class imbalance | Identifying which event types the model misclassifies |

### Checklist

- [ ] I can explain why CNNs are better than FC networks for images
- [ ] I can compute 2D convolution by hand and calculate output sizes
- [ ] I understand how multiple filters create feature maps
- [ ] I know the difference between max pooling, average pooling, and GAP
- [ ] I can build a CNN in PyTorch with Conv2d, MaxPool2d, and BatchNorm
- [ ] I can explain skip connections and why they enable deeper networks
- [ ] I can train a CNN with data augmentation and evaluate with a confusion matrix
- [ ] I understand the key innovations of LeNet, VGG, and ResNet

---

## Next Steps

With CNNs under your belt, you are ready to tackle more advanced topics:

1. **Recurrent Neural Networks (RNNs)**: Processing sequential data like text, time series, and audio -- or in F1 terms, modeling lap-by-lap tire degradation and fuel burn over a race distance
2. **Transfer Learning**: Using pretrained CNNs (ResNet, EfficientNet) as feature extractors for new tasks with minimal data
3. **Object Detection & Segmentation**: Going beyond classification to localize and segment objects (YOLO, Mask R-CNN)
4. **Transformers for Vision**: Vision Transformers (ViT) that apply attention mechanisms to images, rivaling CNNs

**Practical next steps:**
- Train on CIFAR-10 (32x32 color images, 10 classes) -- a step up from FashionMNIST
- Try transfer learning with `torchvision.models.resnet18(weights="DEFAULT")` (the older `pretrained=True` argument is deprecated in recent torchvision releases)
- Experiment with different augmentation strategies and measure their impact
- Visualize learned filters of a trained network to see what it detects at each layer
- Implement a deeper ResNet and compare with the MiniResNet from this notebook