# Tutorial 5-2: The Modern CNN Architect – "From LeNet to ResNet"

**Course:** CSEN 342: Deep Learning  
**Topic:** CNN Architectures, VGG Blocks, and Receptive Fields

## Objective
Deep Learning is not just about stacking random layers. It is about identifying and repeating successful **design patterns**.

In this tutorial, we will trace the evolution of CNN architectures:
1.  **LeNet-5 (1998):** The classic "Sandwich" architecture (Conv-Pool-Conv-Pool).
2.  **VGG (2014):** The birth of the "Modular Block" design pattern (stacks of $3\times3$ convolutions).
3.  **Receptive Field:** We will calculate exactly how much of the input image a neuron "sees," explaining why deeper networks are better at recognizing global shapes.

---

## Part 1: The Classic (LeNet-5)

Yann LeCun's LeNet-5 (1998) was designed for handwritten digit recognition (MNIST). It established the standard pattern: **Convolution $\to$ Activation $\to$ Pooling**.

**Architecture from Lecture Slide 8:**
* Input: $32\times32$ grayscale image
* C1: Conv (6 filters, $5\times5$, stride 1)
* S2: Avg Pool ($2\times2$, stride 2) *Note: Modern versions often use MaxPool*
* C3: Conv (16 filters, $5\times5$, stride 1)
* S4: Avg Pool ($2\times2$, stride 2)
* C5: Fully Connected (120 units)
* F6: Fully Connected (84 units)
* Output: 10 units (Digits 0-9)
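
The layer sizes in this list follow from the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$. As a minimal sketch (the helper `conv_out_size` is a hypothetical name introduced here, not part of PyTorch), the spatial size can be traced through the feature extractor:

```python
def conv_out_size(n: int, k: int, s: int = 1, p: int = 0) -> int:
    """Output width of a conv or pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# (name, kernel, stride) for the LeNet-5 feature extractor; no padding anywhere
lenet_feature_layers = [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1), ("S4", 2, 2)]

size = 32
sizes = [size]
for name, k, s in lenet_feature_layers:
    size = conv_out_size(size, k, s)
    sizes.append(size)
    print(f"{name}: {size}x{size}")
```

The final $5\times5$ map with 16 channels is what gives the 400 inputs to C5.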

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5 layout with modern substitutions: ReLU activations and max pooling."""

    def __init__(self):
        super().__init__()
        # 1. Feature Extractor (The "Sandwich")
        self.features = nn.Sequential(
            # C1: 32x32 -> 28x28 (a 5x5 filter without padding trims 4 pixels per dimension)
            nn.Conv2d(1, 6, kernel_size=5),
            nn.ReLU(),
            # S2: 28x28 -> 14x14 (MaxPool here; the architecture listed above uses AvgPool)
            nn.MaxPool2d(kernel_size=2, stride=2),

            # C3: 14x14 -> 10x10
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            # S4: 10x10 -> 5x5
            nn.MaxPool2d(kernel_size=2, stride=2)
        )

        # 2. Classifier (The MLP Head)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # Input size: 16 channels * 5 * 5 pixels = 400 features
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Verify dimensions with a single 32x32 grayscale image (batch size 1)
dummy_input = torch.randn(1, 1, 32, 32)
model = LeNet5()
out = model(dummy_input)
print(f"Input Shape: {dummy_input.shape}")
print(f"Output Shape: {out.shape} (Expected: torch.Size([1, 10]))")
```
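
To get a feel for where the weights of this model live, the parameter counts can be worked out by hand. The sketch below assumes the layer sizes used in `LeNet5` above, with a bias on every layer; `conv_params` and `linear_params` are hypothetical helper names used only for illustration.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights (k*k*c_in per filter) plus one bias per filter."""
    return (k * k * c_in + 1) * c_out

def linear_params(n_in: int, n_out: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return (n_in + 1) * n_out

lenet_param_counts = {
    "C1": conv_params(1, 6, 5),
    "C3": conv_params(6, 16, 5),
    "C5": linear_params(16 * 5 * 5, 120),
    "F6": linear_params(120, 84),
    "Output": linear_params(84, 10),
}
total_params = sum(lenet_param_counts.values())

for name, n in lenet_param_counts.items():
    print(f"{name:<7} {n:>6}")
print(f"{'Total':<7} {total_params:>6}")
```

Most of the weights sit in the first fully connected layer rather than in the convolutions. The total should agree with `sum(p.numel() for p in model.parameters())` evaluated on the model above.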

---

## Part 2: The Modular Architect (VGG)

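In 2014, the VGG network introduced a key insight: **Small filters ($3\times3$) stacked deeply are better than large filters.** Two stacked $3\times3$ convolutions cover the same $5\times5$ region as a single $5\times5$ convolution, but with fewer weights and an extra nonlinearity in between.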
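
One way to make this concrete is to count weights. The sketch below assumes every layer has the same number of input and output channels `c` and ignores biases; `stack_weights` is a hypothetical helper name.

```python
def stack_weights(k: int, n_layers: int, c: int) -> int:
    """Weights in n_layers stacked k x k convs with c input and c output channels (no biases)."""
    return n_layers * k * k * c * c

c = 64
two_3x3 = stack_weights(3, 2, c)    # two 3x3 layers cover a 5x5 region
one_5x5 = stack_weights(5, 1, c)
three_3x3 = stack_weights(3, 3, c)  # three 3x3 layers cover a 7x7 region
one_7x7 = stack_weights(7, 1, c)

print(f"two 3x3:   {two_3x3:>7}  vs one 5x5: {one_5x5:>7}")
print(f"three 3x3: {three_3x3:>7}  vs one 7x7: {one_7x7:>7}")
```

For equal channel width, the stacked version needs $18c^2$ weights instead of $25c^2$, and $27c^2$ instead of $49c^2$.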

Instead of specifying every layer by hand, VGG is assembled from repeated **VGG Blocks**: a sequence of convolutions followed by one pooling layer. This allows us to write code that *generates* the network.

**A VGG Block:**
1.  Conv3x3 (padding=1 to keep size)
2.  ReLU
3.  Conv3x3 (padding=1)
4.  ReLU
5.  MaxPool (halves the size)

```python
def vgg_block(in_channels, out_channels, num_convs):
    """Build num_convs (Conv3x3 + ReLU) pairs followed by one 2x2 max pool."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        # Later convs in this block take out_channels as their input
        in_channels = out_channels

    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class MiniVGG(nn.Module):
    def __init__(self, in_channels=3, num_classes=10):
        super().__init__()

        # Modular Construction
        self.features = nn.Sequential(
            vgg_block(in_channels, 32, 2),  # Block 1: 2 convs, 32 filters
            vgg_block(32, 64, 2),           # Block 2: 2 convs, 64 filters
            vgg_block(64, 128, 2)           # Block 3: 2 convs, 128 filters
        )

        # Flattened size, hard-coded for 32x32 inputs (other sizes need a different value)
        # Input 32x32 -> Pool -> 16x16 -> Pool -> 8x8 -> Pool -> 4x4
        self.flat_dim = 128 * 4 * 4

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(self.flat_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.5),  # VGG applies dropout in its fully connected layers
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Verify Dimensions for CIFAR-10 input (3 channels)
dummy_cifar = torch.randn(1, 3, 32, 32)
vgg_model = MiniVGG()
out_vgg = vgg_model(dummy_cifar)
print(f"VGG Output Shape: {out_vgg.shape} (Expected: torch.Size([1, 10]))")
```
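
The flattened size used in `MiniVGG` is specific to $32\times32$ inputs. One common way to avoid hard-coding it, shown here as a sketch rather than as part of the tutorial's model, is to push a dummy tensor through the feature extractor once and read off the shape. `infer_flat_dim` is a hypothetical helper, and the `features` stack below is a simplified stand-in with one convolution per block (the spatial sizes match MiniVGG's, since only the pooling layers change them).

```python
import torch
import torch.nn as nn

def infer_flat_dim(features: nn.Module, in_channels: int = 3, size: int = 32) -> int:
    """Run one zero tensor through `features` and count the flattened outputs."""
    with torch.no_grad():
        dummy = torch.zeros(1, in_channels, size, size)
        return features(dummy).flatten(1).shape[1]

# Stand-in feature extractor: three conv + pool stages ending in 128 channels
features = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
)

flat_dim = infer_flat_dim(features)
print(f"Flattened size for 32x32 input: {flat_dim}")
```

Under these assumptions the result matches the constant `128 * 4 * 4`, and the same call with a different `size` gives the value needed for other input resolutions.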

---

## Part 3: The Receptive Field (Why deeper is better)

Why do we stack layers? Why not just use one giant convolution?

**The Concept:**
Each neuron in a feature map only looks at a small patch of the previous layer. However, that patch looked at a patch of the layer before it. As we go deeper, the "Receptive Field" (the area of the original image that influences the neuron) grows.

* **Layer 1 ($3\times3$):** Sees $3\times3$ pixels.
* **Layer 2 ($3\times3$):** Sees $5\times5$ pixels (each of its $3\times3$ inputs sees its own $3\times3$ patch, and neighboring patches overlap).
* **Layer 3 ($3\times3$):** Sees $7\times7$ pixels.
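
These numbers follow a simple recurrence. Writing $r_\ell$ for the receptive field after layer $\ell$, $k_\ell$ for its kernel size, and $s_i$ for the stride of layer $i$ (notation introduced here for illustration), with $r_0 = 1$:

$$
r_\ell = r_{\ell-1} + (k_\ell - 1)\prod_{i=1}^{\ell-1} s_i
$$

For three stride-1 $3\times3$ layers the product is always 1, so $r_1 = 1 + 2 = 3$, $r_2 = 3 + 2 = 5$, and $r_3 = 5 + 2 = 7$. In general, $n$ such layers give $r_n = 1 + 2n$. Once a stride-2 pooling layer appears, every later layer adds twice as much.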

Let's tabulate this growth layer by layer.

```python
def calculate_rf(layers_description):
    """Print the receptive field after each (name, kernel, stride) layer."""
    # Formula for RF growth: RF_prev + (Kernel - 1) * Stride_total,
    # where Stride_total is the product of the strides of all earlier layers
    rf = 1
    stride_total = 1

    print(f"{'Layer':<15} | {'Kernel':<8} | {'Stride':<8} | {'Receptive Field'}")
    print("-"*55)

    for name, k, s in layers_description:
        # Grow the RF first, then fold this layer's stride into the running product
        added_rf = (k - 1) * stride_total
        rf += added_rf
        stride_total *= s
        print(f"{name:<15} | {k:<8} | {s:<8} | {rf}")

# Define LeNet-like structure
lenet_layers = [
    ("Conv1", 5, 1),
    ("Pool1", 2, 2),
    ("Conv2", 5, 1),
    ("Pool2", 2, 2)
]

# Define VGG-like structure (stacking 3x3s)
vgg_layers = [
    ("Conv1_1", 3, 1),
    ("Conv1_2", 3, 1),
    ("Pool1", 2, 2),
    ("Conv2_1", 3, 1),
    ("Conv2_2", 3, 1),
    ("Pool2", 2, 2),
    ("Conv3_1", 3, 1),
    ("Conv3_2", 3, 1)
]

print("LeNet Receptive Field Growth:")
calculate_rf(lenet_layers)

print("\nVGG Receptive Field Growth:")
calculate_rf(vgg_layers)
```
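
Because `calculate_rf` only prints, it can be handy to have a variant that returns the numbers so they can be compared directly. The following is a sketch using the same update rule; `receptive_fields` is a hypothetical name, and the layer lists are redefined so the block runs on its own.

```python
def receptive_fields(layers):
    """Return (name, receptive_field) after each (name, kernel, stride) layer."""
    rf, jump = 1, 1  # jump = product of the strides of all earlier layers
    history = []
    for name, k, s in layers:
        rf += (k - 1) * jump
        jump *= s
        history.append((name, rf))
    return history

lenet = [("Conv1", 5, 1), ("Pool1", 2, 2), ("Conv2", 5, 1), ("Pool2", 2, 2)]
vgg = [
    ("Conv1_1", 3, 1), ("Conv1_2", 3, 1), ("Pool1", 2, 2),
    ("Conv2_1", 3, 1), ("Conv2_2", 3, 1), ("Pool2", 2, 2),
    ("Conv3_1", 3, 1), ("Conv3_2", 3, 1),
]

lenet_rf = receptive_fields(lenet)
vgg_rf = receptive_fields(vgg)

print("LeNet RF after Pool2:", dict(lenet_rf)["Pool2"])
print("VGG RF after Pool2:  ", dict(vgg_rf)["Pool2"])
print("VGG RF after Conv3_2:", vgg_rf[-1][1])
```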

### Conclusion
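Compare the two tables. After two conv-pool stages, both networks reach the same $16\times16$ receptive field, because two stacked $3\times3$ convolutions cover the same area as one $5\times5$; for the same channel width, VGG gets there with fewer weights and more nonlinearities. The third VGG block then pushes the receptive field to $32\times32$, so a single neuron "sees" the entire CIFAR-10 image (often more than the object itself!). This allows late layers to recognize global concepts like "Cat" or "Car," while early layers only recognize local edges.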