# Tutorial 5-1: Building Blocks of Vision – "Convolutions from Scratch"

**Course:** CSEN 342: Deep Learning  
**Topic:** Convolutional Neural Networks (CNNs), Kernels, Stride, Padding, and Pooling

## Objective
Before using `nn.Conv2d` as a black box, we must understand exactly what happens inside. In this tutorial, we will:
1.  **Implement Conv2d:** Write a naive 2D convolution function using nested loops to grasp the "sliding window" mechanics.
2.  **Visualize Kernels:** Manually define edge detection filters (Sobel) and apply them to an image to see "features" emerge.
3.  **Calculate Dimensions:** Write a utility to compute output shapes given input size, kernel size, stride, and padding (a classic exam topic!).
4.  **Implement Pooling:** Write a naive Max Pooling function to understand downsampling.

---

## Part 1: The Mechanics of Convolution

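A convolution slides a small matrix (the **kernel** or filter) over a larger input matrix (the **image**) and, at each position, computes the dot product between the kernel and the patch it covers. Strictly speaking, deep learning libraries such as PyTorch compute a cross-correlation: the kernel is not flipped as it would be in a mathematical convolution.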

**Key Parameters:**
* **Kernel Size ($K$):** The width/height of the filter.
* **Stride ($S$):** How many pixels the window moves each step.
* **Padding ($P$):** Zero-pixels added around the border.
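
To make stride and padding concrete, the sketch below lists where each window starts along one axis. The helper name `window_starts` is hypothetical and used only for illustration; the indices refer to positions in the padded input.

```python
def window_starts(length, kernel_size, stride=1, padding=0):
    """Return the start index of every window along one axis of the padded input."""
    padded_length = length + 2 * padding
    n_windows = (padded_length - kernel_size) // stride + 1
    return [i * stride for i in range(n_windows)]

print(window_starts(7, 3, stride=1))             # [0, 1, 2, 3, 4]
print(window_starts(7, 3, stride=2))             # [0, 2, 4]
print(window_starts(7, 3, stride=1, padding=1))  # [0, 1, 2, 3, 4, 5, 6]
```

A larger stride skips positions and shrinks the output, while padding adds positions at the border and enlarges it.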

Let's implement a naive version for a single channel.

In [None]:
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def naive_conv2d(input_tensor, kernel, stride=1, padding=0):
    """
    Args:
        input_tensor: 2D tensor (H, W)
        kernel: 2D tensor (K, K)
        stride: int
        padding: int
    Returns:
        output: 2D tensor
    """
    # 1. Apply Padding
    if padding > 0:
        # Pad format: (left, right, top, bottom)
        input_tensor = F.pad(input_tensor, (padding, padding, padding, padding))
    
    H_in, W_in = input_tensor.shape
    K_h, K_w = kernel.shape
    
    # 2. Calculate Output Dimensions
    # Formula: floor((W - K) / S) + 1, where W is the already-padded input size
    H_out = (H_in - K_h) // stride + 1
    W_out = (W_in - K_w) // stride + 1
    
    output = torch.zeros((H_out, W_out))
    
    # 3. Sliding Window Loop
    for i in range(H_out):
        for j in range(W_out):
            # Determine the window on the input
            row_start = i * stride
            row_end = row_start + K_h
            col_start = j * stride
            col_end = col_start + K_w
            
            # Extract patch
            patch = input_tensor[row_start:row_end, col_start:col_end]
            
            # Element-wise multiply and sum (Dot Product)
            output[i, j] = torch.sum(patch * kernel)
            
    return output

print("Naive Convolution Defined.")

### 1.1 Verification
Let's compare our naive implementation against PyTorch's optimized `F.conv2d`.

In [None]:
# Create random inputs
input_test = torch.randn(5, 5)
kernel_test = torch.randn(3, 3)

# Our Run
out_naive = naive_conv2d(input_test, kernel_test, stride=1, padding=0)

# PyTorch Run (Requires expanding dimensions for Batch and Channel)
# Input: (Batch=1, Channel=1, H, W)
# Kernel: (OutChan=1, InChan=1, K, K)
out_pytorch = F.conv2d(input_test.view(1, 1, 5, 5), kernel_test.view(1, 1, 3, 3), stride=1, padding=0)

# Compare
diff = torch.abs(out_naive - out_pytorch.squeeze()).sum()
print(f"Difference between Naive and PyTorch: {diff.item():.6f}")
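
The comparison above covers a single configuration (stride 1, no padding). A minimal sketch of the same kind of check across several stride and padding values follows. So that the block runs on its own, it uses a loop-free variant built on `F.unfold` rather than `naive_conv2d`; the name `unfold_conv2d` is hypothetical and not part of the tutorial.

```python
import itertools
import torch
import torch.nn.functional as F

def unfold_conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2D cross-correlation via F.unfold (no Python loops)."""
    H, W = image.shape
    K_h, K_w = kernel.shape
    # F.unfold expects (N, C, H, W) and returns (N, C*K_h*K_w, L):
    # one flattened patch per column, L = number of window positions.
    patches = F.unfold(image.reshape(1, 1, H, W), kernel_size=(K_h, K_w),
                       stride=stride, padding=padding)
    H_out = (H + 2 * padding - K_h) // stride + 1
    W_out = (W + 2 * padding - K_w) // stride + 1
    # Dot product of the flattened kernel with every patch column
    out = kernel.reshape(1, -1) @ patches[0]
    return out.reshape(H_out, W_out)

torch.manual_seed(0)
image = torch.randn(7, 7)
kernel = torch.randn(3, 3)

all_match = True
for stride, padding in itertools.product([1, 2], [0, 1]):
    ours = unfold_conv2d(image, kernel, stride=stride, padding=padding)
    ref = F.conv2d(image.reshape(1, 1, 7, 7), kernel.reshape(1, 1, 3, 3),
                   stride=stride, padding=padding)[0, 0]
    ok = torch.allclose(ours, ref, atol=1e-5)
    all_match = all_match and ok
    print(f"stride={stride}, padding={padding}: shape={tuple(ours.shape)}, match={ok}")
```

Comparing with `torch.allclose` and a small tolerance is safer than expecting an exact zero difference, since the two implementations may sum the products in a different order.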

---

## Part 2: Visualizing Kernels (Edge Detection)

In a CNN, the kernels are learned. But before deep learning, computer vision engineers manually designed kernels to detect edges. Let's see this in action.

We will use the **Sobel Filter**, which approximates the derivative of the image intensity.

In [None]:
# Define Sobel Kernels
sobel_x = torch.tensor([
    [-1., 0., 1.],
    [-2., 0., 2.],
    [-1., 0., 1.]
])

sobel_y = torch.tensor([
    [-1., -2., -1.],
    [ 0.,  0.,  0.],
    [ 1.,  2.,  1.]
])

# Load a sample image; fall back to a synthetic pattern if it is unavailable
try:
    # If the download fails, Image.open raises and we land in the except branch
    !wget -q -O sample.jpg https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg
    img_pil = Image.open("sample.jpg").convert("L") # Convert to grayscale
    img_pil = img_pil.resize((200, 200))
    img_tensor = torch.tensor(np.array(img_pil)).float()
except Exception:
    print("Could not download image, using synthetic pattern.")
    img_tensor = torch.zeros(100, 100)
    img_tensor[20:80, 20:80] = 255.0 # A white box in the middle

# Apply Convolutions
edges_x = naive_conv2d(img_tensor, sobel_x)
edges_y = naive_conv2d(img_tensor, sobel_y)

# Combine (Magnitude of gradient)
edges_mag = torch.sqrt(edges_x**2 + edges_y**2)

# Visualize
fig, axs = plt.subplots(1, 4, figsize=(16, 4))
axs[0].imshow(img_tensor, cmap='gray'); axs[0].set_title("Original")
axs[1].imshow(edges_x, cmap='gray'); axs[1].set_title("Vertical Edges (Sobel X)")
axs[2].imshow(edges_y, cmap='gray'); axs[2].set_title("Horizontal Edges (Sobel Y)")
axs[3].imshow(edges_mag, cmap='gray'); axs[3].set_title("Edge Magnitude")
plt.show()

### Discussion
Notice how `Sobel X` lights up vertical lines (where pixel values change from left to right) and `Sobel Y` lights up horizontal lines (where pixel values change from top to bottom).

**In a CNN:** The first layer filters often learn to look like these Sobel filters or Gabor filters automatically to detect boundaries!
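
The directional behaviour described above can be checked on a tiny synthetic image. The sketch below builds a 5x5 image with a single vertical edge and applies both kernels with `F.conv2d`; the image and the helper name `apply_kernel` are made up for illustration.

```python
import torch
import torch.nn.functional as F

sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
sobel_y = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]])

# 5x5 image: dark (0) in the left two columns, bright (1) in the right three
img = torch.zeros(5, 5)
img[:, 2:] = 1.0

def apply_kernel(kernel):
    # F.conv2d expects (N, C, H, W) input and (out_C, in_C, K, K) weights
    return F.conv2d(img.reshape(1, 1, 5, 5), kernel.reshape(1, 1, 3, 3))[0, 0]

gx = apply_kernel(sobel_x)
gy = apply_kernel(sobel_y)
print(gx)  # every row is [4., 4., 0.]: strong response where the window straddles the edge
print(gy)  # all zeros: nothing changes from top to bottom
```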

---

## Part 3: The Output Size Calculator

A common source of bugs (and exam questions!) is calculating the shape of the tensor after a convolution.

**Formula:**
$$ W_{out} = \left\lfloor \frac{W_{in} - K + 2P}{S} \right\rfloor + 1 $$
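
As a worked instance of the formula (the numbers are chosen for illustration and are not taken from the slides), take $W_{in} = 8$, $K = 3$, $S = 2$, $P = 0$:

$$ W_{out} = \left\lfloor \frac{8 - 3 + 2 \cdot 0}{2} \right\rfloor + 1 = \lfloor 2.5 \rfloor + 1 = 3 $$

The floor matters here. The three windows start at indices 0, 2, and 4, so the last input column (index 7) is never covered; a fourth window would have to start at index 6 and would extend past the edge.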

In [None]:
def calculate_output_shape(input_size, kernel_size, stride, padding):
    """Output size along one axis: floor((W_in - K + 2P) / S) + 1."""
    # Integer floor division avoids the float round trip through np.floor
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Test Cases from Slides (e.g., Slide 46)
# Input: 32x32, Filter: 5x5, Stride: 1, Pad: 2
in_s = 32; k = 5; s = 1; p = 2
out_s = calculate_output_shape(in_s, k, s, p)
print(f"In: {in_s}, K: {k}, S: {s}, P: {p} -> Out: {out_s}")
print(f"Does shape match input? {in_s == out_s} (Padding=2 preserves size for K=5)")

# Input: 7x7, Filter: 3x3, Stride: 1, Pad: 0 (Slide 26)
in_s = 7; k = 3; s = 1; p = 0
out_s = calculate_output_shape(in_s, k, s, p)
print(f"In: {in_s}, K: {k}, S: {s}, P: {p} -> Out: {out_s}")
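
One way to gain confidence in the formula is to compare it with the shapes PyTorch actually produces. The following is a minimal sketch of such a cross-check; `expected_size` is a hypothetical stand-in for `calculate_output_shape` so that the block is self-contained, and the parameter grid is arbitrary.

```python
import itertools
import torch
import torch.nn.functional as F

def expected_size(w_in, k, s, p):
    # Same formula as above: floor((W_in - K + 2P) / S) + 1
    return (w_in - k + 2 * p) // s + 1

mismatches = []
for w_in, k, s, p in itertools.product([7, 8, 32], [3, 5], [1, 2, 3], [0, 1, 2]):
    x = torch.zeros(1, 1, w_in, w_in)   # (N, C, H, W)
    w = torch.zeros(1, 1, k, k)         # (out_C, in_C, K, K)
    actual = F.conv2d(x, w, stride=s, padding=p).shape[-1]
    if actual != expected_size(w_in, k, s, p):
        mismatches.append((w_in, k, s, p, actual))

print("mismatches:", mismatches)  # an empty list means the formula agrees everywhere
```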

---

## Part 4: Max Pooling

Pooling layers downsample the feature map, which reduces computation and makes the output less sensitive to small shifts in the input (a limited form of translation invariance).
**Max Pooling** selects the largest value in the window.

In [None]:
def naive_maxpool2d(input_tensor, kernel_size=2, stride=2):
    H_in, W_in = input_tensor.shape
    
    H_out = (H_in - kernel_size) // stride + 1
    W_out = (W_in - kernel_size) // stride + 1
    
    output = torch.zeros((H_out, W_out))
    
    for i in range(H_out):
        for j in range(W_out):
            r_start = i * stride
            c_start = j * stride
            patch = input_tensor[r_start : r_start+kernel_size, c_start : c_start+kernel_size]
            output[i, j] = torch.max(patch)
            
    return output

# Apply to our edge-detected image
pooled_img = naive_maxpool2d(edges_mag, kernel_size=4, stride=4)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1); plt.imshow(edges_mag, cmap='gray'); plt.title(f"Before Pooling {edges_mag.shape}")
plt.subplot(1, 2, 2); plt.imshow(pooled_img, cmap='gray'); plt.title(f"After MaxPool (4x4) {pooled_img.shape}")
plt.show()
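
The downsampling and shift-tolerance properties mentioned at the start of this part can also be seen on a tiny input. The sketch below uses PyTorch's built-in `F.max_pool2d`; the 4x4 tensors are made up for illustration.

```python
import torch
import torch.nn.functional as F

# 4x4 input holding 0..15, shaped (N, C, H, W) as F.max_pool2d expects
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
pooled = F.max_pool2d(x, kernel_size=2, stride=2)[0, 0]
print(pooled)  # [[5., 7.], [13., 15.]]: the maximum of each 2x2 block

def pool_single_pixel(row, col):
    """Pool a 4x4 image that is zero except for one bright pixel."""
    img = torch.zeros(1, 1, 4, 4)
    img[0, 0, row, col] = 1.0
    return F.max_pool2d(img, kernel_size=2, stride=2)[0, 0]

same_window = torch.equal(pool_single_pixel(0, 0), pool_single_pixel(1, 1))
other_window = torch.equal(pool_single_pixel(0, 0), pool_single_pixel(2, 2))
print(same_window)   # True: a shift inside one pooling window leaves the output unchanged
print(other_window)  # False: a shift into another window changes the output
```

The invariance is therefore only local: it holds for shifts that keep a feature inside the same pooling window.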

### Conclusion
You have now implemented the core building blocks of a Convolutional Neural Network from scratch! 
* **Convolution** extracts local patterns (features).
* **Padding** (together with stride) controls output size.
* **Pooling** summarizes features and reduces dimensions.

In the next tutorial, we will chain these together using PyTorch's optimized `nn.Conv2d` to classify images.