# Convolutional Neural Networks (CNNs)

This notebook implements simple CNN building blocks from scratch and explains how they differ from fully connected layers.


## 1) Convolutional Layer

A convolutional layer slides a small filter (kernel) over the input image.
At each location, it computes a dot product between the filter and the image patch underneath it. The dot product is large when the patch contains the pattern the filter has learned to detect.

This produces a feature map showing where the filter's pattern appears in the image.

- The filter's weights are learned during training (think of the filter as a small image).

- In practice, each filter looks for a specific pattern (e.g., vertical edge, corner).

- The output preserves the 2D spatial structure of the image.

- The layer sees only local regions at a time (the receptive field).

Two hyperparameters control how the filter slides. The padding is a border (usually of zeros) added around the edges of the input image. The stride is the step size of the slide: for example, with stride=2 the dot product is taken at every other location.
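
The effect of padding and stride on the output size along one dimension can be sketched as a small formula. The helper name `conv_output_size` is hypothetical and introduced only for illustration; it assumes no dilation.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Number of positions the kernel visits along one dimension."""
    # The padded length is input_size + 2 * padding. The kernel's top-left
    # corner can start at 0, stride, 2 * stride, ... while it still fits.
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(28, 3, padding=1, stride=1))  # 28 (size preserved)
print(conv_output_size(7, 3))                        # 5
print(conv_output_size(32, 3, padding=1, stride=2))  # 16
```

With a 3x3 kernel, padding=1 and stride=1 keep the spatial size unchanged, which is why that combination shows up so often.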

In [2]:
import numpy as np
import torch
import torch.nn as nn

In [None]:
# a simple 2D implementation in numpy

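def conv2d(input_matrix, kernel, padding, stride):
    """Slide `kernel` over a zero-padded 2D `input_matrix`.

    Like deep learning frameworks, this computes a cross-correlation:
    the kernel is not flipped before the dot product is taken.
    """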
    input_height, input_width = input_matrix.shape
    kernel_height, kernel_width = kernel.shape

    h = input_height + 2*padding
    w = input_width + 2*padding

    # adding padding to the input matrix
    padded = np.pad(input_matrix, padding)

    # initializing output matrix
    output_matrix = np.zeros(((h-kernel_height) // stride+1, (w-kernel_width) // stride+1))

    # slide the kernel across the image and, at each position, compute the dot product (sum of elementwise products)
    for out_i, i in enumerate(range(0, h-kernel_height+1, stride)):
        for out_j, j in enumerate(range(0, w-kernel_width+1, stride)):
            # elementwise multiply + sum = dot product between kernel and image patch
            output_matrix[out_i, out_j] = np.sum((padded[i:i+kernel_height, j:j+kernel_width] * kernel))
    
    return output_matrix
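
To see what a filter response looks like, here is a minimal sketch that applies a hand-written vertical-edge kernel to a tiny image. To keep the block self-contained it uses a vectorized stand-in, `conv2d_windows` (a hypothetical name), built on NumPy's `sliding_window_view`; it is meant to compute the same result as the loop above, not to replace it. The image and kernel values are made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_windows(input_matrix, kernel, padding=0, stride=1):
    """Vectorized stand-in for the loop-based conv2d (same operation)."""
    padded = np.pad(input_matrix, padding)
    # All kernel-sized patches, then keep every `stride`-th position.
    windows = sliding_window_view(padded, kernel.shape)[::stride, ::stride]
    # Dot product of each patch with the kernel.
    return np.einsum("ijkl,kl->ij", windows, kernel)


# Toy image: dark left half (0), bright right half (1).
image = np.zeros((4, 6))
image[:, 3:] = 1.0

# Responds when the right side of a patch is brighter than the left side.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])

response = conv2d_windows(image, vertical_edge)
print(response)  # both rows are [0, 3, 3, 0]
```

The feature map is zero over the flat regions and peaks where the patch straddles the dark-to-bright boundary.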

That was a 2D (single-channel) example for simplicity. In real examples, an RGB image has multiple channels, which makes it 3D. Images are also processed in batches, and the batch dimension makes the input 4D.

In PyTorch we have:

In [4]:
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

In this example there are 16 filters. Each filter consists of in_channels = 3 kernels of size 3x3, one per input channel, so there are $3 \cdot 16 = 48$ small 3x3 kernels in total.
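
The layout of the filters can be checked by inspecting the layer's tensors. This is a minimal sketch; the batch size of 8 and the 32x32 input are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# Weight layout: (out_channels, in_channels, kernel_height, kernel_width)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([16]), one bias per filter

# Input layout: (batch_size, channels, height, width)
x = torch.randn(8, 3, 32, 32)
out = conv(x)
print(out.shape)  # torch.Size([8, 16, 32, 32]), one feature map per filter
```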

## Why convolutions are more efficient than fully connected layers

Because the convolutional layer shares parameters (the sliding window uses the same weights at every location), it needs far fewer parameters than a fully connected layer.

Example:

Let's say there is a $32 \times 32$ RGB image.

A fully connected layer that maps the flattened image to 100 neurons would have $(3 \cdot 32 \cdot 32) \cdot 100 = 307200$ weights.

A convolutional layer with $16$ filters of size $3 \times 3 \times 3$ has $(16 \cdot 3 \cdot 3 \cdot 3) = 432$ weights (biases are ignored in both counts).
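
These counts can be reproduced by summing the sizes of each layer's parameter tensors. A minimal sketch follows; the helper name `count_parameters` is hypothetical, and `bias=False` is used first so the numbers match the weight-only arithmetic above.

```python
import torch.nn as nn


def count_parameters(layer):
    """Total number of scalar parameters in a layer."""
    return sum(p.numel() for p in layer.parameters())


# Weights only, matching the arithmetic above.
fc_weights = count_parameters(nn.Linear(3 * 32 * 32, 100, bias=False))
conv_weights = count_parameters(nn.Conv2d(3, 16, kernel_size=3, bias=False))
print(fc_weights, conv_weights)  # 307200 432

# With the default biases: one per output neuron / per filter.
fc_total = count_parameters(nn.Linear(3 * 32 * 32, 100))
conv_total = count_parameters(nn.Conv2d(3, 16, kernel_size=3))
print(fc_total, conv_total)  # 307300 448
```

Note that the convolutional count does not depend on the image size, while the fully connected count grows with every extra input pixel.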

## 2) Pooling Layers

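Pooling layers are another part of CNNs, most commonly average pooling or max pooling. They reduce the spatial dimensions by summarizing each small window with a single value, as in the following example, where each non-overlapping $2 \times 2$ window is replaced by its maximum: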

Max pooling (2 $\times$ 2): $\begin{bmatrix} 3 & 0 & 5 & 2  \\ 1 & 2 & 7 & 5 \\ 4 & 3 & 2 & 1\\ 3 & 1& 0 & 4\end{bmatrix}$  $\rightarrow$ $\begin{bmatrix} 3 & 7  \\ 4 & 4\end{bmatrix}$

Pooling layers make the network more robust to small spatial shifts and reduce the amount of computation in later layers.
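
The max pooling example above can be reproduced in a few lines of NumPy. This is a minimal sketch that assumes non-overlapping windows (stride equal to the window size) and input dimensions divisible by the window size; `pool2d` is a hypothetical helper name.

```python
import numpy as np


def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows of a 2D array."""
    h, w = x.shape
    if h % size or w % size:
        raise ValueError("input dimensions must be divisible by the window size")
    # Axes after reshape: (row block, row in block, column block, column in block)
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


x = np.array([[3, 0, 5, 2],
              [1, 2, 7, 5],
              [4, 3, 2, 1],
              [3, 1, 0, 4]])

max_pooled = pool2d(x, mode="max")
avg_pooled = pool2d(x, mode="avg")
print(max_pooled)  # [[3 7]
                   #  [4 4]]
print(avg_pooled)  # [[1.5  4.75]
                   #  [2.75 1.75]]
```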

## Example of a simple CNN

We combine several conv layers and pooling layers to build a full CNN. The conv and pooling layers extract features, which are then passed to the fully connected layers.

The example takes a grayscale input of shape 28 $\times$ 28 and learns to classify these images into 10 classes.

In [6]:
class CNNExample(nn.Module):
    def __init__(self):
        super().__init__()
        # input shape: (batch_size, 1, 28, 28)
        self.block1 = nn.Sequential(
            nn.Conv2d(1, 20, 3, padding=1),
            nn.BatchNorm2d(20), # Batch norm helps with training stability and speed. Not crucial to understand.
            nn.ReLU(),
            
            nn.Conv2d(20, 20, 3, padding=1),
            nn.BatchNorm2d(20),
            nn.ReLU(),
            
            nn.Conv2d(20, 20, 3, padding=1),
            nn.BatchNorm2d(20),
            nn.ReLU(),
            
            nn.MaxPool2d(2, 2) # 28x28 -> 14x14
        )
        self.block2 = nn.Sequential(
            nn.Conv2d(20, 40, 3, padding=1),
            nn.BatchNorm2d(40),
            nn.ReLU(),
            
            nn.Conv2d(40, 40, 3, padding=1),
            nn.BatchNorm2d(40),
            nn.ReLU(),
            
            nn.Conv2d(40, 40, 3, padding=1),
            nn.BatchNorm2d(40),
            nn.ReLU(),
            
            nn.MaxPool2d(2, 2) # 14x14 -> 7x7
        )
        self.conv1 = nn.Sequential(
            nn.Conv2d(40, 60, 3), # 7x7 -> 5x5
            nn.BatchNorm2d(60),
            nn.ReLU()
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(60, 40, 1), # 5x5 -> 5x5
            nn.BatchNorm2d(40),
            nn.ReLU()
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(40, 20, 1), # 5x5 -> 5x5
            nn.BatchNorm2d(20),
            nn.ReLU()
        )
        # input to avg pool will be of shape (batch_size, 20, 5, 5)
        self.avg1 = nn.AvgPool2d(5)
        # (batch_size, 20, 1, 1)
        # flatten to (batch_size, 20)
        self.fc1 = nn.Linear(20, 10)
        # output shape: (batch_size, 10)
        
    def forward(self, x):

        c1 = self.block1(x)
        c2 = self.block2(c1)
        c3 = self.conv1(c2)
        c4 = self.conv2(c3)
        c5 = self.conv3(c4)
        c6 = self.avg1(c5)
        c6 = torch.flatten(c6, 1)
        c7 = self.fc1(c6)
        y = c7
        
        return y
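
The shape comments in the model can be double-checked by tracing the spatial size through every layer with the usual output-size formula. This is a sketch under the assumption that the layer hyperparameters are exactly those listed in the class above; the helper `conv_output_size` and the `layers` list are hypothetical names written out by hand for illustration.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one dimension for a conv or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# (name, kernel_size, padding, stride), copied by hand from CNNExample.
# Pooling layers use a stride equal to their kernel size here.
layers = (
    [("block1 conv 3x3", 3, 1, 1)] * 3 + [("block1 max pool", 2, 0, 2)]
    + [("block2 conv 3x3", 3, 1, 1)] * 3 + [("block2 max pool", 2, 0, 2)]
    + [("conv1 3x3", 3, 0, 1), ("conv2 1x1", 1, 0, 1), ("conv3 1x1", 1, 0, 1)]
    + [("avg pool 5x5", 5, 0, 5)]
)

size = 28
sizes = [size]
for name, kernel_size, padding, stride in layers:
    size = conv_output_size(size, kernel_size, padding, stride)
    sizes.append(size)
    print(f"{name:<16} -> {size}x{size}")

print(sizes)  # [28, 28, 28, 28, 14, 14, 14, 14, 7, 5, 5, 5, 1]
```

The final 1x1 map with 20 channels is what gets flattened to `(batch_size, 20)` before the linear layer, so passing a tensor of shape `(batch_size, 1, 28, 28)` through the model should give an output of shape `(batch_size, 10)`.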