# Convolutional Neural Nets

CNNs are designed to learn from spatial data, including images, 3D volumes, graphs, and more. For simplicity, we will focus on image data.

In [None]:
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn

First, let's grab an example image, use a CNN layer to process it, and see if we can calculate the output image.

In [None]:
data = load_digits()
image = data['data'][0,:] # images are packed as 1d vectors
image = np.reshape(image, (8,8)) #reshape to be a 2d matrix
print(image)
plt.imshow(image, cmap='gray')

Image data is represented on a per-pixel basis, so that each pixel is encoded with a brightness value. For color images, each pixel is encoded with an RGB value.

Our CNN layer will slide a window across the image and produce a value at each step. The CNN has several parameters that can be specified to control how it scans.

    Kernel size: how large the window is, in pixels.
    Stride: how many pixels the window moves between consecutive convolutions.
    Padding: how many extra pixels are added around the edges of the image.

From these parameters, you can calculate the exact dimensions of the output image. Let's try some examples.

In [None]:
# prepare our data for our CNN. a torch CNN expects data to follow this format
# [batch_size, channels, height (pixels vertical), width (pixels horizontal)]
image = np.reshape(image, (1, 1, 8, 8))
image = torch.tensor(image, dtype = torch.float32)

In [None]:
conv = nn.Conv2d(
    in_channels = 1, # our image is black and white, so only 1 channel (RGB has 3 channels)
    out_channels = 1, # or number of kernels to train
    kernel_size = (3, 3), # the window is of size 3x3
    stride = (1, 1), # only move one pixel between convolutions
    padding = (0, 0) # number of extra pixels to add to each edge (top/bottom, left/right)
)

# pass the image through our CNN. what will be the shape of the output?
output = conv(image)
print(output.shape)
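To make the "sliding window" idea concrete, here is a minimal sketch that reproduces a `Conv2d` output by hand. The helper `manual_conv2d` and the `demo_*` names are my own for illustration (not part of `torch`), and the sketch assumes a single channel, a single kernel, and no padding.

```python
import torch
import torch.nn as nn

def manual_conv2d(image, weight, bias, stride=1):
    """Slide one kernel over one single-channel image (no padding)."""
    k_h, k_w = weight.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    out = torch.zeros(out_h, out_w)
    for r in range(out_h):
        for c in range(out_w):
            # cut out the window under the kernel at this step
            window = image[r * stride : r * stride + k_h,
                           c * stride : c * stride + k_w]
            # multiply elementwise, sum, and add the bias
            out[r, c] = (window * weight).sum() + bias
    return out

torch.manual_seed(0)
demo_conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
demo_image = torch.rand(1, 1, 8, 8)

with torch.no_grad():
    expected = demo_conv(demo_image)[0, 0]
    by_hand = manual_conv2d(demo_image[0, 0], demo_conv.weight[0, 0], demo_conv.bias[0])

print(by_hand.shape)
print(torch.allclose(by_hand, expected, atol=1e-5))
```

The loop is far slower than the built-in layer; it is only meant to show what each output pixel is made of.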

Given our image's input shape (8x8), or any image's shape, could we calculate the output shape of a given `Conv2d` layer ahead of time?

Try writing the formula to do this by hand.

In [None]:
def calculate_output_shape(image_size: list[int], kernel: list[int], stride: list[int], padding: list[int]):
    # code here

    return output_shape

calculate_output_shape([8,8],[3,3],[1,1],[0,0])
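If you want to check your answer, here is one possible sketch. It uses a separate, hypothetical function name (`conv_output_shape`) so it does not overwrite your own attempt, and it assumes a dilation of 1: for each dimension, the output size is `floor((n + 2p - k) / s) + 1`.

```python
def conv_output_shape(image_size, kernel, stride, padding):
    """Output size per dimension: floor((n + 2p - k) / s) + 1 (dilation of 1 assumed)."""
    return [
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(image_size, kernel, stride, padding)
    ]

print(conv_output_shape([8, 8], [3, 3], [1, 1], [0, 0]))  # [6, 6]
```

You can compare the result against `output.shape` from the `Conv2d` layer above; the last two dimensions should match.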

Each kernel (or filter) in our `Conv2d` layer learns to extract features from the training data. A nice perk of CNNs is that we can look at each kernel after training, which can tell us something about what our model has learned.

We can demonstrate this with a simple example of a CNN trained to recognize the number 3. If we set the kernel size equal to the image size, the single kernel covers the whole image, so we can view the kernel as an image in its own right.

In [None]:
conv_1kernel = nn.Conv2d(
    in_channels = 1, # our image is black and white, so only 1 channel (RGB has 3 channels)
    out_channels = 1, # or number of kernels to train
    kernel_size = (8, 8), # the window is 8x8, the same size as the image
    stride = (1, 1), # only move one pixel between convolutions
    padding = (0, 0) # number of extra pixels to add to each edge (top/bottom, left/right)
)

In [None]:
# do some clever numpy indexing to grab all examples of 3s from our dataset
x = data['data'][data['target'] == 3]
y = data['target'][data['target'] == 3]

# prep the data for training
x = np.reshape(x, (len(x), 1, 8, 8))
x = torch.tensor(x, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)

# let's set up a training loop, and save a snapshot of the kernel after every epoch
optimizer = torch.optim.Adam(conv_1kernel.parameters(), lr = 0.001)
loss_fn = torch.nn.BCEWithLogitsLoss(reduction = 'none')
kernels = []

kernels.append([param.detach().numpy().copy() for param in conv_1kernel.parameters()][0])
for _ in range(10):
    for i in range(len(x)):
        optimizer.zero_grad()
        yhat = torch.squeeze(conv_1kernel(x[i:i+1]))
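        loss = loss_fn(y[i], yhat) # note: BCEWithLogitsLoss is documented as (input, target); called this way, the loss just falls as yhat rises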
        loss.backward()
        optimizer.step()
    kernels.append([param.detach().numpy().copy() for param in conv_1kernel.parameters()][0])

In [None]:
fig, axs = plt.subplots(1, 11, figsize=(30,3))

for i in range(len(kernels)):
    axs[i].imshow(kernels[i][0,0])

Indeed, our kernel is learning to look more and more like a 3. A typical CNN will have hundreds to thousands of kernels that are far smaller than the size of the input image, so interpreting them visually can be more challenging.

CNN layers also "stack" easily: the output of one `Conv2d` layer is fed into the input of the next. Here, I have the first layer of a CNN written out. Fill in the commented-out layers that follow so that the final convolutional output has shape `(*,64,1,1)`. Then, add a final fully connected layer that converts the output into shape `(*,10)`.

In [None]:
example_img = torch.rand((1,1,28,28))

conv1 = nn.Conv2d(
    in_channels = 1, 
    out_channels = 16,
    kernel_size = (5, 5), 
    stride = (1, 1),
    padding = (0, 0),
)
# maxpool1 = nn.MaxPool2d(kernel_size= 
#                        stride = 
#                        padding= 
# )
# conv2 = nn.Conv2d(
#     in_channels = 
#     out_channels = 
#     kernel_size = 
#     stride = 
#     padding = 
# )
# maxpool2 = nn.MaxPool2d(kernel_size=
#                        stride = 
#                        padding= 
# )
# linear1 = nn.Linear(in_features=
#                     out_features= 
# )

Now that you have the final shape, think about how you have transformed the data from a spatially sensitive 2D matrix into a kernel-sensitive 1D vector. Essentially, you have compressed the image. Now create a class from your layers so we can reuse it.
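As a reference point, here is one possible way to wrap the layers in a class. This is only a sketch: the class name `SmallCNN`, the layer sizes after `conv1`, and the `ReLU` activations are my own assumptions, and many other combinations also reach `(*,64,1,1)`.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Two conv/pool stages followed by one fully connected layer."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5),   # 28x28 -> 24x24
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),       # 24x24 -> 12x12
            nn.Conv2d(16, 64, kernel_size=5),  # 12x12 -> 8x8
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=8),       # 8x8 -> 1x1
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        x = self.features(x)        # (*, 64, 1, 1)
        x = torch.flatten(x, 1)     # (*, 64)
        return self.classifier(x)   # (*, n_classes)

model = SmallCNN()
example_img = torch.rand((1, 1, 28, 28))
print(model.features(example_img).shape)
print(model(example_img).shape)
```

Keeping the convolutional part in `features` makes it easy to check the intermediate shape separately from the final output.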

Now, let's grab a real image dataset from `torchvision`. We can use Fashion MNIST, which is like the MNIST digits dataset but more challenging to learn. Check it out here: https://github.com/zalandoresearch/fashion-mnist

`pip install torchvision` if you don't have it

In [None]:
import torchvision

# creates a data object
fashion_mnist_train = torchvision.datasets.FashionMNIST(root = './', download=True, train=True)
fashion_mnist_val = torchvision.datasets.FashionMNIST(root = './', download=True, train=False)

# we can fetch an item from the data like so, which gives us the image in PIL form, and the class label as an int
print(fashion_mnist_train[0])

# we can convert the image to a numpy array like so
img = fashion_mnist_train[0][0]
img = np.array(img)

# and here is a shoe
plt.imshow(img)

To speed you along, I have code here that processes the data into a format ready-to-ingest by our `CustomDataloader`. Because 60k images is a lot to hold in memory, we will just train on the first 1000. If your notebook starts crashing, you can run `del fashion_mnist_train` (and likewise for val) to free up some memory.
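The `CustomDataloader` code is not reproduced here, so the following is only a hypothetical sketch of the kind of preprocessing involved, not that code. The helper `to_tensors` is my own name. It assumes the dataset yields `(image, label)` pairs, as `fashion_mnist_train[0]` does above, and that scaling pixels to `[0, 1]` is acceptable.

```python
import numpy as np
import torch

def to_tensors(dataset, n_items=1000):
    """Stack the first n_items (image, label) pairs into a float and a long tensor."""
    n_items = min(n_items, len(dataset))
    pairs = [dataset[i] for i in range(n_items)]
    images = np.stack([np.array(img) for img, _ in pairs])   # (N, H, W)
    labels = [label for _, label in pairs]
    # add a channel dimension and scale 0..255 pixel values to 0..1 (an assumption)
    x = torch.tensor(images, dtype=torch.float32).unsqueeze(1) / 255.0
    y = torch.tensor(labels, dtype=torch.long)
    return x, y

# stand-in for the torchvision dataset: a list of (uint8 image, int label) pairs
fake_dataset = [(np.full((28, 28), i, dtype=np.uint8), i % 10) for i in range(5)]
x_small, y_small = to_tensors(fake_dataset, n_items=3)
print(x_small.shape, y_small)
```

With the real data, the call would look like `to_tensors(fashion_mnist_train, n_items=1000)`.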

Write a training loop to train your new model on the image data. Note that when I did this, I found model initialization and hyperparameters made a very large difference in whether the model converged or not. You may need to try the same parameters multiple times to get good results.
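If you are unsure where to start, here is a minimal mini-batch loop as a sketch. The function name `train`, the default hyperparameters, and the tiny stand-in model and random data are all assumptions for illustration; swap in your own CNN class and the Fashion MNIST tensors.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, x, y, epochs=5, batch_size=32, lr=1e-3):
    """Train with Adam and cross-entropy; return the mean loss of each epoch."""
    loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    history = []
    model.train()
    for _ in range(epochs):
        total = 0.0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)   # model outputs raw logits
            loss.backward()
            optimizer.step()
            total += loss.item() * len(xb)  # weight by batch size
        history.append(total / len(x))
    return history

# stand-in data with Fashion MNIST's shapes: 64 random "images", 10 classes
torch.manual_seed(0)
x_demo = torch.rand(64, 1, 28, 28)
y_demo = torch.randint(0, 10, (64,))
stand_in_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
losses = train(stand_in_model, x_demo, y_demo, epochs=3)
print(losses)
```

Returning the per-epoch loss makes it easy to plot a loss curve and to compare runs with different initializations.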

Visualize performance on validation data
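One way to do this is with a confusion matrix, sketched below on a toy example. The helper `confusion_matrix` and the `*_demo` values are hypothetical; in practice `logits` would be your model's output on the validation images.

```python
import torch

def confusion_matrix(logits, labels, n_classes=10):
    """Rows are true classes, columns are predicted classes."""
    preds = logits.argmax(dim=1)
    cm = torch.zeros(n_classes, n_classes, dtype=torch.long)
    for t, p in zip(labels.tolist(), preds.tolist()):
        cm[t, p] += 1
    return cm

# toy example: 4 predictions over 3 classes
logits_demo = torch.tensor([
    [2.0, 0.1, 0.0],
    [0.0, 3.0, 0.1],
    [0.2, 0.1, 1.5],
    [1.0, 0.0, 0.5],
])
labels_demo = torch.tensor([0, 1, 2, 2])

cm = confusion_matrix(logits_demo, labels_demo, n_classes=3)
accuracy = cm.diag().sum().item() / cm.sum().item()
print(cm)
print(accuracy)
```

Passing the matrix to `plt.imshow(cm)` gives a quick picture of which classes the model confuses with each other.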