# Notebook 3: Learning Patterns - The Convolutional Neural Network (CNN)

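In the previous notebook, we built a Multilayer Perceptron (MLP) that classified MNIST digits by flattening the entire image into a single vector. While this works, flattening throws away the image's 2D layout: the MLP has no built-in notion of which pixels are neighbors, which is quite unlike how our visual system processes images.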

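Unlike an MLP, which connects every pixel to every hidden unit, a CNN slides a small window (called a kernel or filter) across the image to look for local patterns like edges, corners, and textures. Because the same filter weights are reused at every position, CNNs need far fewer parameters and are better suited for vision tasks: they can recognize the same pattern whether it appears in the top-left or bottom-right of an image. This property is commonly called **translation invariance** (strictly speaking, convolution is translation *equivariant*, meaning a shifted input produces a shifted feature map, and pooling then adds a degree of invariance), and it's one of the key reasons CNNs revolutionized computer vision.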
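
The claim that a filter responds to the same pattern wherever it appears can be checked numerically. The sketch below is only an illustration: the toy 8×8 image, the single randomly initialized filter, and the variable names are all made up here. It moves a small pattern two pixels to the right and checks that the filter's response moves by the same amount.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One 3x3 filter; padding=1 keeps the 8x8 spatial size.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# A blank 8x8 image with a small 2x2 pattern near the top-left.
image = torch.zeros(1, 1, 8, 8)
image[0, 0, 2:4, 2:4] = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# The same image with the pattern moved 2 pixels to the right.
# (The pattern stays away from the borders, so nothing wraps around.)
shifted_image = torch.roll(image, shifts=2, dims=3)

with torch.no_grad():
    response = conv(image)
    shifted_response = conv(shifted_image)

# If the filter treats every location the same way, the response to the
# shifted image should equal the original response shifted by 2 pixels.
same_after_shift = torch.allclose(
    shifted_response, torch.roll(response, shifts=2, dims=3), atol=1e-6
)
print(same_after_shift)
```

The check relies on the pattern staying clear of the image border; near the edges, zero padding changes what the filter sees.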


## The Core Layers of a CNN

### nn.Conv2d - The Convolutional Layer

The convolutional layer is the heart of a CNN. It applies a set of learnable filters (also called kernels) to the input image.

**Key arguments:**
- **`in_channels`**: Number of input channels (e.g., 1 for grayscale, 3 for RGB)
- **`out_channels`**: Number of filters/kernels to apply (this becomes the number of output channels)
- **`kernel_size`**: Size of the sliding window (e.g., 3 means a 3×3 filter)
- **`padding`**: Adds zeros around the image border. `padding=1` with `kernel_size=3` (and the default `stride=1`) keeps the output height and width the same as the input

Each filter learns to detect a specific pattern (like vertical edges, horizontal edges, or curves). When you specify `out_channels=32`, you're creating 32 different filters, each learning to detect a different pattern.
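
To make the arguments concrete, here is a minimal sketch that builds the first layer used later in this notebook and inspects its learnable tensors. The random input `x` is a stand-in for a single grayscale image, not real data.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

# One fake grayscale image: (batch, channels, height, width).
x = torch.randn(1, 1, 28, 28)
out = conv(x)

# 32 filters, each covering 1 input channel with a 3x3 window.
print(conv.weight.shape)  # torch.Size([32, 1, 3, 3])
# One bias value per filter.
print(conv.bias.shape)    # torch.Size([32])
# 32 output channels; padding=1 keeps the 28x28 spatial size.
print(out.shape)          # torch.Size([1, 32, 28, 28])

# Total learnable parameters: 32 * 1 * 3 * 3 weights + 32 biases = 320.
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)           # 320
```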

### nn.MaxPool2d - The Pooling Layer

Pooling layers downsample the feature maps, reducing both computational cost and spatial dimensions. 

**Purpose:**
- **Downsampling**: Reduces the height and width of feature maps (typically by a factor of 2)
- **Translation invariance**: Makes the learned patterns more robust to small shifts in position
- **Efficiency**: Shrinks the feature maps, so later convolutions do less computation and the first linear layer after flattening needs fewer parameters

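`nn.MaxPool2d(2)` applies a 2×2 window that takes the maximum value and moves in steps of 2 (the stride defaults to the kernel size), cutting the height and width in half. For example, a 28×28 feature map becomes 14×14 after max pooling with kernel size 2. The pooling layer itself has no learnable parameters.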
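
A tiny worked example may help here. The 4×4 input below is made up for illustration; each non-overlapping 2×2 block is replaced by its largest value.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2)

# A single 4x4 feature map holding the values 0..15, row by row.
x = torch.arange(16.0).reshape(1, 1, 4, 4)
pooled = pool(x)

print(x[0, 0])
# tensor([[ 0.,  1.,  2.,  3.],
#         [ 4.,  5.,  6.,  7.],
#         [ 8.,  9., 10., 11.],
#         [12., 13., 14., 15.]])

print(pooled[0, 0])
# tensor([[ 5.,  7.],
#         [13., 15.]])
```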


## Tracing the Shapes Through the CNN

Understanding how tensor shapes transform through each layer is **critical** for debugging CNNs. Let's walk through the data flow step-by-step and predict the shape at each stage.

**Input**: A batch of images with shape `(64, 1, 28, 28)`.
- 64 = batch size
- 1 = channels (grayscale)
- 28×28 = image dimensions

**After `nn.Conv2d(1, 32, kernel_size=3, padding=1)`**: 
- The number of channels becomes 32 (we have 32 filters)
- With `padding=1` and `kernel_size=3`, the spatial dimensions stay the same
- Shape: `(64, 32, 28, 28)`

**After `nn.ReLU()`**: 
- No shape change (element-wise operation)
- Shape: `(64, 32, 28, 28)`

**After `nn.MaxPool2d(2)`**: 
- This cuts the height and width in half
- Channels remain 32
- Shape: `(64, 32, 14, 14)`

**After `nn.Conv2d(32, 64, kernel_size=3, padding=1)`**: 
- The number of channels becomes 64 (we now have 64 filters)
- With padding, spatial dimensions stay the same
- Shape: `(64, 64, 14, 14)`

**After `nn.ReLU()`**: 
- No shape change
- Shape: `(64, 64, 14, 14)`

**After `nn.MaxPool2d(2)`**: 
- Height and width are cut in half again
- Channels remain 64
- Shape: `(64, 64, 7, 7)`

**After `nn.Flatten()`**: 
- Flattens all dimensions except the batch dimension
- Shape: `(64, 64 * 7 * 7)` = `(64, 3136)`

**After `nn.Linear(3136, 128)`**: 
- Fully connected layer: 3136 → 128
- Shape: `(64, 128)`

**After `nn.ReLU()`**: 
- No shape change
- Shape: `(64, 128)`

**After `nn.Linear(128, 10)`**: 
- Final output layer: 128 → 10 (one for each digit class)
- Shape: `(64, 10)`
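
The spatial sizes in this walkthrough follow the usual output-size rule for convolution and pooling layers, `out = floor((in + 2 * padding - kernel_size) / stride) + 1`. The helper below is a hypothetical convenience function (not part of PyTorch) that applies this rule, assuming no dilation and the default floor rounding.

```python
def output_size(size, kernel_size, stride=1, padding=0):
    """Height or width after a conv/pool layer (no dilation, floor rounding)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 28
size = output_size(size, kernel_size=3, padding=1)  # conv1 -> 28
after_conv1 = size
size = output_size(size, kernel_size=2, stride=2)   # pool1 -> 14
after_pool1 = size
size = output_size(size, kernel_size=3, padding=1)  # conv2 -> 14
after_conv2 = size
size = output_size(size, kernel_size=2, stride=2)   # pool2 -> 7
after_pool2 = size

print(after_conv1, after_pool1, after_conv2, after_pool2)  # 28 14 14 7
```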

### Calculating in_features for the First Linear Layer

After the convolutional and pooling layers, we have shape `(64, 64, 7, 7)`. When we flatten this (excluding the batch dimension), we get:
- `64 * 7 * 7 = 3136` features

This is why `nn.Linear(3136, 128)` uses `in_features=3136`: it's the product of the channel, height, and width dimensions that `nn.Flatten()` collapses into one.
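
Instead of doing this arithmetic by hand, one common trick is to push a dummy input through the convolutional part of the network and read off the flattened size. The sketch below assumes the same two conv blocks described above; the names `features` and `dummy` are illustrative.

```python
import torch
import torch.nn as nn

# The convolutional part of the network, without the linear layers.
features = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# A single all-zero image is enough; only the shape matters.
with torch.no_grad():
    dummy = torch.zeros(1, 1, 28, 28)
    feature_map = features(dummy)

# Flatten everything except the batch dimension and count the features.
in_features = feature_map.flatten(start_dim=1).shape[1]
print(feature_map.shape)  # torch.Size([1, 64, 7, 7])
print(in_features)        # 3136
```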


## Training the CNN

The beauty of PyTorch is its modularity. Our training logic—the loss function, the optimizer, and the training loop—remains exactly the same. We just need to pass our new CNN model to it.

This demonstrates one of PyTorch's core design principles: **models are interchangeable**. You can swap out an MLP for a CNN, a ResNet, or any other architecture, and the training code stays the same, as long as the new model accepts the same input shape and produces the same output shape. The only thing that changes is the model architecture itself.
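
As a small illustration of this point, the sketch below runs one identical training step on two different architectures. Everything in it is a stand-in: the tiny `mlp` and `cnn` models, the `train_step` helper, and the random batch are invented for this example and are not the models trained in this notebook.

```python
import torch
import torch.nn as nn

def train_step(model, X, y, loss_fn, optimizer):
    """Run one optimization step and return the loss as a float."""
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Two different architectures with the same input and output shapes.
mlp = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10)
)
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(8 * 14 * 14, 10),
)

# A random batch standing in for MNIST images and labels.
X = torch.randn(16, 1, 28, 28)
y = torch.randint(0, 10, (16,))
loss_fn = nn.CrossEntropyLoss()

# The training step is identical; only the model passed in differs.
losses = {}
for name, model in [("mlp", mlp), ("cnn", cnn)]:
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    losses[name] = train_step(model, X, y, loss_fn, optimizer)
    print(f"{name}: loss = {losses[name]:.4f}")
```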


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Load the data (same as Notebook 2)
transform = transforms.ToTensor()

train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

test_dataset = datasets.MNIST(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

batch_size = 64
train_dataloader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False
)

# Define the CNN
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First convolutional block
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        
        # Second convolutional block
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        
        # Fully connected layers
        self.flatten = nn.Flatten()
        self.linear1 = nn.Linear(64 * 7 * 7, 128)  # 64 channels * 7 * 7 = 3136
        self.relu3 = nn.ReLU()
        self.linear2 = nn.Linear(128, 10)
    
    def forward(self, x):
        print(f"Input shape: {x.shape}")
        
        x = self.conv1(x)
        print(f"After conv1: {x.shape}")
        
        x = self.relu1(x)
        print(f"After relu1: {x.shape}")
        
        x = self.pool1(x)
        print(f"After pool1: {x.shape}")
        
        x = self.conv2(x)
        print(f"After conv2: {x.shape}")
        
        x = self.relu2(x)
        print(f"After relu2: {x.shape}")
        
        x = self.pool2(x)
        print(f"After pool2: {x.shape}")
        
        x = self.flatten(x)
        print(f"After flatten: {x.shape}")
        
        x = self.linear1(x)
        print(f"After linear1: {x.shape}")
        
        x = self.relu3(x)
        print(f"After relu3: {x.shape}")
        
        x = self.linear2(x)
        print(f"After linear2 (output): {x.shape}")
        
        return x

# Instantiate the model
model = SimpleCNN()
print("Model architecture:")
print(model)
print("\n" + "="*50)
print("Testing forward pass with one batch:")
print("="*50)

# Get one batch to test
X, y = next(iter(train_dataloader))
_ = model(X)


In [None]:
# Create a new CNN model (without the print statements for cleaner training)
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First convolutional block
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        
        # Second convolutional block
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        
        # Fully connected layers
        self.flatten = nn.Flatten()
        self.linear1 = nn.Linear(64 * 7 * 7, 128)
        self.relu3 = nn.ReLU()
        self.linear2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        x = self.flatten(x)
        x = self.linear1(x)
        x = self.relu3(x)
        x = self.linear2(x)
        return x

# Instantiate the model
model = SimpleCNN()

# Instantiate loss function
loss_fn = nn.CrossEntropyLoss()

# Create a new Adam optimizer for the CNN's parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Reuse the training function from Notebook 2
def train(dataloader, model, loss_fn, optimizer, epochs=5):
    """
    Train the model for a specified number of epochs.
    
    Args:
        dataloader: DataLoader providing batches of training data
        model: The neural network model
        loss_fn: Loss function
        optimizer: Optimizer for updating weights
        epochs: Number of training epochs
    """
    model.train()  # Set model to training mode
    
    for epoch in range(epochs):
        total_loss = 0.0
        num_batches = 0
        
        for batch_idx, (X, y) in enumerate(dataloader):
            # Step 1: Forward pass
            pred = model(X)
            
            # Step 2: Calculate loss
            loss = loss_fn(pred, y)
            
            # Step 3: Backpropagation
            loss.backward()
            
            # Step 4: Update weights
            optimizer.step()
            
            # Step 5: Zero gradients so they don't accumulate into the next batch
            optimizer.zero_grad()
            
            total_loss += loss.item()
            num_batches += 1
            
            # Print progress every 100 batches
            if (batch_idx + 1) % 100 == 0:
                avg_loss = total_loss / num_batches
                print(f'Epoch {epoch + 1}/{epochs}, Batch {batch_idx + 1}/{len(dataloader)}, Loss: {avg_loss:.4f}')
        
        # Print average loss for the epoch
        avg_loss = total_loss / num_batches
        print(f'Epoch {epoch + 1}/{epochs} completed. Average Loss: {avg_loss:.4f}\n')

# Train the CNN
print("Starting CNN training...\n")
train(train_dataloader, model, loss_fn, optimizer, epochs=5)
print("Training completed!")
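
The notebook loads `test_dataloader` but the code above never uses it. A natural next step is to measure accuracy on held-out data. The sketch below is one possible way to do that, not a function from Notebook 2: `evaluate` is a hypothetical helper, and the random tensors and `tiny_model` exist only so the block runs on its own without downloading MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(dataloader, model):
    """Return the fraction of samples whose top-scoring class matches the label."""
    model.eval()  # Set model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():  # No gradients needed for evaluation
        for X, y in dataloader:
            pred = model(X)
            correct += (pred.argmax(dim=1) == y).sum().item()
            total += y.size(0)
    return correct / total

# Smoke test on random data (stand-ins, not MNIST).
fake_images = torch.randn(32, 1, 28, 28)
fake_labels = torch.randint(0, 10, (32,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=8)
tiny_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

accuracy = evaluate(fake_loader, tiny_model)
print(f"Accuracy: {accuracy:.2%}")
```

With the objects defined in this notebook, the call would be `evaluate(test_dataloader, model)`.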
