# Chapter 7: Convolutional Neural Networks

This chapter introduces **convolutional neural networks (CNNs)**, a powerful family of neural networks designed for image data. CNNs leverage the spatial structure of images through translation invariance and locality principles.

🔑 **KEY INSIGHT**: Images exhibit rich structure that CNNs exploit - nearby pixels are typically related. This allows CNNs to achieve both sample efficiency AND computational efficiency compared to fully connected networks.

---
## 7.1 From Fully Connected Layers to Convolutions

This section derives the structure of CNNs from first principles, showing why convolutions are the natural choice for image processing.

🔑 **KEY INSIGHT - The Parameter Problem**: A 1-megapixel image with just 1000 hidden units requires 10^9 parameters in a fully connected layer. CNNs dramatically reduce this through two principles:
1. **Translation Invariance**: The same pattern should be detected regardless of location
2. **Locality**: Only nearby pixels matter for computing hidden representations

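If the hidden representation is as large as the 1-megapixel input, a fully connected layer needs ~10^12 parameters. Together, these principles reduce that to roughly 4Δ² weights per kernel (where Δ is the kernel radius, typically <10).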
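
The arithmetic behind these counts can be checked directly. The sketch below assumes a single-channel 1000 × 1000 image. The helper `conv_kernel_weights` is a hypothetical name introduced here, and it uses the exact window size (2Δ + 1)², which 4Δ² approximates.

```python
# Parameter counts for a 1-megapixel (1000 x 1000) single-channel image.
pixels = 1000 * 1000

# Fully connected layer with 1000 hidden units: one weight per (pixel, unit) pair.
fc_1000_hidden = pixels * 1000

# Fully connected layer whose hidden representation is itself 1 megapixel.
fc_same_size = pixels * pixels


def conv_kernel_weights(delta):
    """Weights in one kernel covering offsets -delta..delta in each direction."""
    return (2 * delta + 1) ** 2


print(fc_1000_hidden)          # 1000000000
print(fc_same_size)            # 1000000000000
print(conv_kernel_weights(2))  # 25  (a 5 x 5 kernel)
print(conv_kernel_weights(9))  # 361
```

Even at the largest radius considered here, one kernel holds a few hundred weights, and that count does not depend on the image size.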

---
## 7.2 Convolutions for Images

Now that we understand how convolutional layers work in theory, we are ready to see how they work in practice.

In [None]:
from d2l import torch as d2l
import torch
from torch import nn

### The Cross-Correlation Operation

🔑 **KEY INSIGHT**: Despite the name "convolution", CNNs actually perform **cross-correlation**. The difference is cosmetic since kernels are learned - flipping doesn't matter when weights are trained from data.

In [None]:
def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = d2l.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = d2l.reduce_sum((X[i: i + h, j: j + w] * K))
    return Y

In [None]:
X = d2l.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = d2l.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)
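
To make the difference between the two operations concrete, here is a dependency-free sketch on nested lists, using the same `X` and `K` as above. The names `cross_correlate`, `flip`, and `convolve` are introduced here for illustration and are not part of `d2l`.

```python
def cross_correlate(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(out_w)]
            for i in range(out_h)]


def flip(K):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in K[::-1]]


def convolve(X, K):
    """Strict convolution: cross-correlation with the flipped kernel."""
    return cross_correlate(X, flip(K))


X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
K = [[0.0, 1.0], [2.0, 3.0]]

print(cross_correlate(X, K))  # [[19.0, 25.0], [37.0, 43.0]]
print(convolve(X, K))         # [[5.0, 11.0], [23.0, 29.0]]
```

For a fixed kernel the two results differ. A learned kernel absorbs the flip, since training under one definition simply yields the flipped weights of the other.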

### Convolutional Layers

A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an output.

In [None]:
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

### Object Edge Detection in Images

Let's see a simple application: detecting edges by finding pixel changes.

In [None]:
X = d2l.ones((6, 8))
X[:, 2:6] = 0
X

In [None]:
K = d2l.tensor([[1.0, -1.0]])

In [None]:
Y = corr2d(X, K)
Y

In [None]:
corr2d(d2l.transpose(X), K)
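
The behaviour of the kernel `[1, -1]` can be traced on a single row. This is a minimal sketch in plain Python: `corr_row` is a hypothetical helper, and the rows are copied by hand from the 6 × 8 image above and from its transpose.

```python
def corr_row(row, kernel):
    """1D cross-correlation of a single image row with a 1 x k kernel."""
    k = len(kernel)
    return [sum(row[j + b] * kernel[b] for b in range(k))
            for j in range(len(row) - k + 1)]


kernel = [1.0, -1.0]

# One row of the 6 x 8 image: ones on both sides, zeros in the middle.
row = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
print(corr_row(row, kernel))  # [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0]

# Each row of the transposed image is constant, so no horizontal change is found.
constant_row = [1.0] * 6
print(corr_row(constant_row, kernel))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The output is 1 where the row steps from 1 to 0, -1 where it steps from 0 to 1, and 0 elsewhere. This kernel therefore responds only to vertical edges, which is why the transposed input produces all zeros.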

### Learning a Kernel

🔑 **KEY INSIGHT**: We don't need to design kernels manually - we can **learn** them from data! This replaces feature engineering with evidence-based learning.

In [None]:
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.LazyConv2d(1, kernel_size=(1, 2), bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')

In [None]:
d2l.reshape(conv2d.weight.data, (1, 2))
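
The same idea can be sketched without any framework by writing out the gradient of the squared error by hand. The setup below is an assumption made for illustration: it uses a single row of the image, starts from zero weights so the run is reproducible, and uses a learning rate of 0.1 chosen for this smaller problem.

```python
# One row of the edge-detection input and the matching target output.
x = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
y = [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0]

w = [0.0, 0.0]   # kernel weights, initialized to zero for reproducibility
lr = 0.1         # learning rate (an assumed value for this single-row sketch)

for step in range(100):
    grad = [0.0, 0.0]
    for j in range(len(y)):
        # Prediction error at output position j.
        err = w[0] * x[j] + w[1] * x[j + 1] - y[j]
        # Gradient of the squared error with respect to each weight.
        grad[0] += 2 * err * x[j]
        grad[1] += 2 * err * x[j + 1]
    # Plain gradient descent update of both weights.
    w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]

print([round(v, 3) for v in w])  # [1.0, -1.0]
```

The weights approach the hand-designed kernel `[1, -1]`, the same target that the `nn.LazyConv2d` training loop above is fitted to.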

---
## 7.3 Padding and Stride

Techniques for controlling output size in convolutional layers.

🔑 **KEY INSIGHT - Output Size Formula** (per dimension, for input size n and kernel size k):
- Without padding/stride: output = (n - k + 1)
- With total padding p (summed over both sides) and stride s: output = ⌊(n - k + p + s) / s⌋

In [None]:
import torch
from torch import nn

### Padding

Padding adds extra pixels around the boundary to preserve spatial dimensions.

In [None]:
# We define a helper function to calculate convolutions. It initializes the
# convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape

In [None]:
# We use a convolution kernel with height 5 and width 3. The padding on either
# side of the height and width are 2 and 1, respectively
conv2d = nn.LazyConv2d(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape

### Stride

Stride controls how many positions the kernel moves per step, useful for downsampling.

In [None]:
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape

In [None]:
conv2d = nn.LazyConv2d(1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape
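
The shapes returned in this section can be checked against the output size formula. In the sketch below, `conv_output_size` is a hypothetical helper. It takes the padding per side, as PyTorch's `padding` argument does, and doubles it to obtain the total padding p used in the formula.

```python
def conv_output_size(n, k, pad_per_side=0, stride=1):
    """Output length along one axis: floor((n - k + p + s) / s).

    p is the TOTAL padding (both sides), so p = 2 * pad_per_side when the
    same amount is added on each side.
    """
    p = 2 * pad_per_side
    return (n - k + p + stride) // stride


n = 8  # the examples above use an 8 x 8 input

# kernel_size=3, padding=1
print(conv_output_size(n, 3, pad_per_side=1))            # 8
# kernel_size=(5, 3), padding=(2, 1)
print(conv_output_size(n, 5, pad_per_side=2),
      conv_output_size(n, 3, pad_per_side=1))            # 8 8
# kernel_size=3, padding=1, stride=2
print(conv_output_size(n, 3, pad_per_side=1, stride=2))  # 4
# kernel_size=(3, 5), padding=(0, 1), stride=(3, 4)
print(conv_output_size(n, 3, pad_per_side=0, stride=3),
      conv_output_size(n, 5, pad_per_side=1, stride=4))  # 2 2
```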

---
## 7.4 Multiple Input and Multiple Output Channels

Real images have multiple channels (RGB). This section extends convolutions to handle them.

In [None]:
from d2l import torch as d2l
import torch

### Multiple Input Channels

🔑 **KEY INSIGHT**: With multiple input channels, the kernel has shape (c_i × k_h × k_w). We perform cross-correlation on each channel separately and **sum the results**.

In [None]:
def corr2d_multi_in(X, K):
    # Iterate through the 0th dimension (channel) of K first, then add them up
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))

In [None]:
X = d2l.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = d2l.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K)
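
To see where each output value comes from, the per-channel results can be printed before they are summed. This is a plain-Python sketch using the same `X` and `K` as above. `cross_correlate` and `corr_multi_in` are hypothetical names and stand in for the `d2l` helpers.

```python
def cross_correlate(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


def corr_multi_in(X, K):
    """Cross-correlate each channel with its kernel slice, then add elementwise."""
    per_channel = [cross_correlate(x, k) for x, k in zip(X, K)]
    rows, cols = len(per_channel[0]), len(per_channel[0][0])
    total = [[sum(ch[i][j] for ch in per_channel) for j in range(cols)]
             for i in range(rows)]
    return per_channel, total


X = [[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
     [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
K = [[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]]

per_channel, total = corr_multi_in(X, K)
print(per_channel[0])  # [[19.0, 25.0], [37.0, 43.0]]
print(per_channel[1])  # [[37.0, 47.0], [67.0, 77.0]]
print(total)           # [[56.0, 72.0], [104.0, 120.0]]
```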

### Multiple Output Channels

To get multiple output channels, we create a kernel tensor of shape (c_o × c_i × k_h × k_w).

In [None]:
def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of K, and each time, perform
    # cross-correlation operations with input X. All of the results are
    # stacked together
    return d2l.stack([corr2d_multi_in(X, k) for k in K], 0)

In [None]:
K = d2l.stack((K, K + 1, K + 2), 0)
K.shape

In [None]:
corr2d_multi_in_out(X, K)

### 1×1 Convolutional Layer

🔑 **KEY INSIGHT**: A 1×1 convolution acts like a **fully connected layer applied at each pixel location** - it mixes channels without considering spatial neighbors. Used for channel dimension changes in architectures like ResNet.

In [None]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = d2l.reshape(X, (c_i, h * w))
    K = d2l.reshape(K, (c_o, c_i))
    # Matrix multiplication in the fully connected layer
    Y = d2l.matmul(K, X)
    return d2l.reshape(Y, (c_o, h, w))

In [None]:
X = d2l.normal(0, 1, (3, 3, 3))
K = d2l.normal(0, 1, (2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(d2l.reduce_sum(d2l.abs(Y1 - Y2))) < 1e-6
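
The per-pixel view can also be written out as explicit loops. The sketch below is plain Python. `conv1x1` is a hypothetical name, and the weights are hand-picked for illustration so that the channel mixing is easy to read off.

```python
def conv1x1(X, W):
    """Apply a 1x1 convolution.

    X: c_i x h x w nested lists; W: c_o x c_i weights of the 1x1 kernel.
    Each output pixel depends only on the same pixel across input channels.
    """
    c_i, h, w = len(X), len(X[0]), len(X[0][0])
    return [[[sum(W[o][c] * X[c][i][j] for c in range(c_i))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(W))]


# Two input channels, 2 x 2 pixels each.
X = [[[1.0, 2.0], [3.0, 4.0]],
     [[10.0, 20.0], [30.0, 40.0]]]
# Three output channels: copy channel 0, copy channel 1, sum of both.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

Y = conv1x1(X, W)
print(Y[2])  # [[11.0, 22.0], [33.0, 44.0]]
```

The layer holds only c_o × c_i weights (here 3 × 2), and the same weights are reused at every pixel.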

---
## 7.5 Pooling

Pooling layers serve two purposes:
1. Reduce sensitivity to location (provide some translation invariance)
2. Spatially downsample representations

In [None]:
from d2l import torch as d2l
import torch
from torch import nn

### Maximum Pooling and Average Pooling

🔑 **KEY INSIGHT**: Pooling has **no learnable parameters** - it applies a fixed operation (maximum or average) to each window. Max-pooling is generally preferred as it provides some degree of invariance to small translations.

In [None]:
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = d2l.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

In [None]:
X = d2l.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

In [None]:
pool2d(X, (2, 2), 'avg')
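
The claim about small translations can be illustrated on a single row. This is a minimal sketch with a hypothetical helper `max_pool_1d`. The window size and stride of 2 are assumptions chosen to keep the example short.

```python
def max_pool_1d(row, size, stride):
    """Max pooling over one row with the given window size and stride."""
    return [max(row[j:j + size]) for j in range(0, len(row) - size + 1, stride)]


row = [0, 0, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 0, 0]   # same pattern moved one pixel to the right

print(max_pool_1d(row, 2, 2))      # [0, 1, 0]
print(max_pool_1d(shifted, 2, 2))  # [0, 1, 0]

# A shift that crosses a window boundary does change the output.
shifted_again = [0, 0, 0, 0, 1, 0]
print(max_pool_1d(shifted_again, 2, 2))  # [0, 0, 1]
```

The invariance is therefore partial: the output is unchanged only while the feature stays inside the same pooling window.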

### Padding and Stride in Pooling

In [None]:
X = d2l.reshape(d2l.arange(16, dtype=d2l.float32), (1, 1, 4, 4))
X

In [None]:
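# Note: this rebinds the name pool2d from the function defined above to a
# PyTorch layer. The stride defaults to the window size, here (3, 3)
pool2d = nn.MaxPool2d(3)
# Pooling has no model parameters, hence it needs no initialization
pool2d(X)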

In [None]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

In [None]:
pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))
pool2d(X)

### Multiple Channels

Pooling operates on each channel independently (unlike convolution which sums across input channels).

In [None]:
X = d2l.concat((X, X + 1), 1)
X

In [None]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

---
## 7.6 Convolutional Neural Networks (LeNet)

LeNet-5 was among the first CNNs to achieve wide recognition for computer vision tasks, developed by Yann LeCun for handwritten digit recognition.

🔑 **KEY INSIGHT - LeNet Architecture**:
1. **Convolutional encoder**: Two conv layers that extract spatial features
2. **Dense block**: Three fully connected layers for classification

Key pattern: As we go deeper, spatial dimensions decrease while channel depth increases.

In [None]:
from d2l import torch as d2l
import torch
from torch import nn

In [None]:
def init_cnn(module):  #@save
    """Initialize weights for CNNs."""
    if type(module) == nn.Linear or type(module) == nn.Conv2d:
        nn.init.xavier_uniform_(module.weight)

In [None]:
class LeNet(d2l.Classifier):  #@save
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(),
            nn.LazyLinear(84), nn.Sigmoid(),
            nn.LazyLinear(num_classes))

### Inspecting the Model

Let's trace data through the network to understand the shape transformations.

In [None]:
@d2l.add_to_class(d2l.Classifier)  #@save
def layer_summary(self, X_shape):
    X = d2l.randn(*X_shape)
    for layer in self.net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)

model = LeNet()
model.layer_summary((1, 1, 28, 28))
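
The shapes can also be worked out by hand from the layer definitions. The sketch below is plain Python and assumes a 1 × 28 × 28 input, as in the call above. `conv_out` is a hypothetical helper, and the parameter count covers the weights and biases of the five learnable layers.

```python
def conv_out(n, k, pad=0, stride=1):
    """Output side length for a square input of side n (pad is per side)."""
    return (n + 2 * pad - k) // stride + 1


side, channels = 28, 1
params = 0

# (out_channels, kernel_size, padding) for the two convolutional layers
for out_channels, k, pad in [(6, 5, 2), (16, 5, 0)]:
    params += out_channels * (channels * k * k + 1)   # weights + biases
    side, channels = conv_out(side, k, pad), out_channels
    print('conv ->', (channels, side, side))
    side = conv_out(side, 2, stride=2)                # 2 x 2 average pooling
    print('pool ->', (channels, side, side))

features = channels * side * side                     # size after Flatten
print('flatten ->', features)

# Three fully connected layers: 120, 84, and 10 output units.
for units in [120, 84, 10]:
    params += units * (features + 1)                  # weights + biases
    features = units

print('total parameters:', params)
```

The trace gives (6, 28, 28), (6, 14, 14), (16, 10, 10), (16, 5, 5), then 400 flattened features and 61706 parameters in total. The first dense layer alone accounts for 48120 of them.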

### Training LeNet on Fashion-MNIST

In [None]:
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = LeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], init_cnn)
trainer.fit(model, data)

---
## Summary

Key takeaways from Chapter 7:

1. **Translation invariance + locality** → convolutions as the natural operation for images
2. **Cross-correlation** is the actual operation (convolution is a misnomer)
3. **Padding** preserves spatial dimensions; **stride** downsamples
4. **Multiple channels** allow learning diverse feature detectors
5. **1×1 convolutions** mix channels (act like per-pixel fully connected layers)
6. **Pooling** provides translation invariance and downsampling
7. **LeNet** pioneered the conv-pool-conv-pool-fc pattern still used today