<a href="https://colab.research.google.com/github/bec2148/computer-vision-3/blob/main/Homework3_Release_Updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Homework 3: Object Recognition
==========

> **Submission Instructions:** Before the deadline, export the completed notebook to PDF and upload it to GradeScope. The PDF should clearly show your code and the result of running it. Check the PDF to ensure that it is readable, the font size is not too small, and no information is cut off. There will be no make-ups or extensions for corrupted, damaged, or unreadable PDFs.

Brendan Cunnie  `bec2148`

**Names of Collaborators:** Coded with help from ChatGPT, eager assistance from Copilot, and much Googling. I also looked at an implementation of ResNet that I had coded for a previous class.

In this homework, we will investigate training a neural network with PyTorch. This will give you some familiarity with PyTorch and modern deep learning libraries. First, let's load PyTorch and several functions that we will use throughout the homework.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = "cuda" if torch.cuda.is_available() else "cpu"

def imshow(img):
    img = img / 2 + 0.5
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

For this homework, having access to a GPU will be very useful. To enable a GPU, follow the instructions from the previous homework. If the code below outputs `cuda`, then you are using a GPU.

In [None]:
print(device)

# 1. Problem 1: Building Neural Network

Introduction to PyTorch
-----------------------
PyTorch is similar to NumPy. For the most part, if there is a NumPy operation, there is an equivalent PyTorch operation. However, the advantage of PyTorch is that it automatically calculates gradients through back-propagation and that its operations can run on the GPU.

### Automatic Differentiation
You can view PyTorch as NumPy with gradient calculation built in. Once the forward computation is finished, calling `.backward()` on the result calculates the gradients for all of the operations that produced it. You no longer need to derive the gradients analytically, code them up, and write a gradient checker.

Let's look at a basic example. We will perform matrix multiplication between `a` and `b`, followed by element-wise multiplication with `c`, and finally we sum the result. Since it is computed in PyTorch, we can simply call `result.backward()` to have all of the gradients calculated throughout the computational graph.

In [None]:
a = torch.rand(2,2, requires_grad=True, device=device)
b = torch.rand(2,2, requires_grad=True, device=device)
c = torch.rand(2,2, device=device)

result = torch.matmul(a, b) * c
result = result.sum()

result.backward() # calculate the gradients with back-propagation to the input

print(f'Result: {result.cpu().item()}')
print(f'Gradient a:\n {a.grad}')
print(f'Gradient b:\n {b.grad}')
print(f'Gradient c:\n {c.grad}')

Note that only the tensors explicitly marked with `requires_grad=True` will have gradients calculated. Consequently, the gradient for `c` is `None` in the above computation. Automatic differentiation makes it possible to implement very creative and complex deep learning algorithms, and extensive research and development has produced many differentiable functions.
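
For this particular computation the gradients can also be derived by hand, which gives a way to sanity-check what `.backward()` reports. Writing the result as $r = \sum_{i,j} (ab)_{ij}\, c_{ij}$, the gradients are $\partial r / \partial a = c\, b^\top$ and $\partial r / \partial b = a^\top c$. The sketch below checks the first formula against finite differences in plain Python (no PyTorch needed). The helper names and the fixed input values are made up for illustration.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(row) for row in zip(*x)]

def forward(a, b, c):
    """result = sum((a @ b) * c), the same computation as in the cell above."""
    ab = matmul(a, b)
    return sum(ab[i][j] * c[i][j]
               for i in range(len(ab)) for j in range(len(ab[0])))

def grad_a_analytic(b, c):
    """Hand-derived gradient of the result with respect to a: c @ b^T."""
    return matmul(c, transpose(b))

def grad_a_numeric(a, b, c, eps=1e-6):
    """Central finite differences, perturbing one entry of a at a time."""
    grad = [[0.0] * len(a[0]) for _ in a]
    for i in range(len(a)):
        for k in range(len(a[0])):
            plus = [row[:] for row in a]
            minus = [row[:] for row in a]
            plus[i][k] += eps
            minus[i][k] -= eps
            grad[i][k] = (forward(plus, b, c) - forward(minus, b, c)) / (2 * eps)
    return grad

# Fixed example values (illustrative, not the random tensors from the cell above)
a = [[0.1, 0.2], [0.3, 0.4]]
b = [[0.5, 0.6], [0.7, 0.8]]
c = [[0.9, 1.0], [1.1, 1.2]]

analytic = grad_a_analytic(b, c)
numeric = grad_a_numeric(a, b, c)
max_diff = max(abs(analytic[i][k] - numeric[i][k])
               for i in range(2) for k in range(2))
print(analytic)
print(max_diff)  # should be tiny; only rounding error remains
```

With PyTorch tensors, the same check would compare `a.grad` against `c @ b.T`.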

PyTorch makes it easy to transfer variables between the CPU and the GPU. When the variable `a` is constructed, the `device=device` specifies whether to store it on the CPU or GPU, depending on the value of the `device` variable. There is also a `.cpu()` function to bring a variable back to the CPU, and a `.to(device)` function to transfer a variable between devices.

In [None]:
cpu_tensor = torch.rand(2,2)
# use `device` rather than a hard-coded "cuda" so this cell also runs without a GPU
gpu_tensor = cpu_tensor.to(device)

cpu_tensor_2 = gpu_tensor.to("cpu")
cpu_tensor_3 = gpu_tensor.cpu()

### Neural Network Layers

PyTorch also has a large library of deep learning layers. These layers allow you to operate at a higher level of abstraction than low-level tensor code. For example, one of the most basic layers in deep learning is the linear layer, which is a matrix multiplication followed by a vector addition. PyTorch has layers that handle this for you and automatically create the parameters that need to be learned.
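
As a rough illustration of what a linear layer computes, the sketch below applies $y = Wx + b$ in plain Python. The helper `linear` and the numbers are made up for illustration. `nn.Linear(3, 2)` stores its weight in the same `(out_features, in_features)` layout, a 2x3 matrix, but initializes it randomly.

```python
def linear(weight, bias, x):
    """Compute y = W x + b for a weight of shape (out_features, in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b_j
            for row, b_j in zip(weight, bias)]

# Hypothetical fixed parameters for a layer mapping 3 inputs to 2 outputs
weight = [[1.0, 0.0, -1.0],
          [0.5, 0.5, 0.5]]
bias = [0.0, 1.0]

x = [2.0, 4.0, 6.0]
y = linear(weight, bias, x)
print(y)  # [-4.0, 7.0]
```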

Below is an example. Notice how we first create the layer, then we call the layer. Creating the layer is like defining a function, which you can then call later. When the layer is created, the weights for the matrix multiplication and the bias for the addition are created automatically and initialized with random numbers.

In [None]:
test_input = torch.randn(3)

layer = nn.Linear(3, 2) # create the layer
output = layer(test_input) # call the layer

print('Weights of the Layer:')
print(layer.weight)

This allows us to chain layers together in order to create a neural network. For example, the code below creates a neural network similar to the one in the previous homework. The gradients can then be easily calculated through back-propagation across all of the layers.

In [None]:
test_input = torch.randn(3)

layer1 = nn.Linear(3, 20)
layer2 = nn.ReLU()
layer3 = nn.Linear(20, 1)

out = layer1(test_input)
out = layer2(out)
out = layer3(out)

print(f'Result: {out.item()}')

out.backward()

print('Gradient to weights in layer 1:')
print(layer1.weight.grad)

Let's use this knowledge to create a neural network for object recognition.

Loading Image Datasets
----------------------
We are going to work with the CIFAR10 dataset, which is a small image dataset consisting of just ten object categories. Most datasets today are many orders of magnitude larger in size, but the smaller dataset will allow us to work on commodity computers. The code below will download both the train/test splits of the CIFAR10 dataset, and visualize some of the images.
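
The loading code for this section normalizes each channel with mean 0.5 and standard deviation 0.5, which maps pixel values from [0, 1] (after `ToTensor()`) to [-1, 1]. The `imshow` helper defined earlier undoes this with `img / 2 + 0.5` before plotting. A minimal sketch of that round trip on plain numbers follows. The function names are illustrative.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Same arithmetic as transforms.Normalize for one value: (x - mean) / std."""
    return (pixel - mean) / std

def unnormalize(value):
    """Inverse used by imshow: value / 2 + 0.5."""
    return value / 2 + 0.5

pixels = [0.0, 0.25, 0.5, 1.0]
normalized = [normalize(p) for p in pixels]
restored = [unnormalize(v) for v in normalized]
print(normalized)  # [-1.0, -0.5, 0.0, 1.0]
print(restored)    # [0.0, 0.25, 0.5, 1.0]
```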

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))

Building the Neural Network
---------------------------

Create a convolutional neural network that classifies the category of the images in the CIFAR10 dataset. In the class below, there are two functions: `__init__` and `forward()`. In the constructor, instantiate the layers that you will need. In the `forward()` function, call these layers in order to run the neural network forwards.

Experiment with the neural network layers below to build a network that is able to classify the images:
- <a href="https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html">nn.Conv2d()</a>
- <a href="https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d">nn.MaxPool2d()</a>
- <a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear">nn.Linear()</a>
- <a href="https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU">nn.ReLU()</a>

The input `x` will be a tensor of size `4x3x32x32`, which represents a batch of four 3-channel 32x32 input images. The output should be a ten-dimensional vector for each image, with one score per category. You will most likely need to use other PyTorch operations as well, such as `torch.flatten()`. Feel free to use other layers and operations as you see fit.

We recommend first trying the following neural network: convolution, max pooling, convolution, max pooling, convolution, convolution, flattening, linear, linear, linear. Note that you should put the activation function in the right spots.
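
The input size of the first linear layer depends on how the convolutions and pooling layers change the image size. The sketch below traces the height/width through the recommended layer order using the standard output-size formula, assuming 3x3 convolutions with stride 1 and padding 1, and 2x2 max pooling with stride 2. Those hyperparameters, the helper names, and the final channel count of 128 are assumptions for illustration.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output height/width of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output height/width of max pooling without padding."""
    return (size - kernel) // stride + 1

size = 32  # CIFAR10 images are 32x32
trace = [size]
for layer in ["conv", "pool", "conv", "pool", "conv", "conv"]:
    size = conv_out(size) if layer == "conv" else pool_out(size)
    trace.append(size)

final_channels = 128  # assumed channel count of the last convolution
flat_features = final_channels * size * size
print(trace)          # [32, 32, 16, 16, 8, 8, 8]
print(flat_features)  # 8192
```

Under these assumptions the first linear layer needs 8192 input features. Moving the pooling layers elsewhere in the stack gives the same result as long as there are exactly two of them.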

In [None]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # DONE: Initialized network layers
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(128 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)


    def forward(self, x):
        # DONE: Implemented the forward pass using the layers defined above
        #       and the proper activation functions
        c1 = self.conv1(x)
        relu1 = F.relu(c1)
        c2 = self.conv2(relu1)
        relu2 = F.relu(c2)
        pool2 = self.pool(relu2)
        c3 = self.conv3(pool2)
        relu3 = F.relu(c3)
        c4 = self.conv4(relu3)
        relu4 = F.relu(c4)
        pool4 = self.pool(relu4)
        # pool4.shape == [batch, 128, 8, 8] (batch is 4 here)
        # flatten everything except the batch dimension, because the
        # fully connected layers expect one feature vector per image
        flattened = torch.flatten(pool4, start_dim=1)
        # flattened.shape == [batch, 8192]
        x = F.relu(self.fc1(flattened))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

Before we proceed, let's visualize the weights of the first convolutional layer in the neural network. Modify the code below in order to plot the weights of the first convolutional layer. (You need to use `detach()` in order for this to work.)

In [None]:
# imshow(torchvision.utils.make_grid(net.conv1.weight.cpu().detach()))


# Visualize the weights of the first convolutional layer
# (shape [32, 3, 3, 3]: 32 filters, 3 input channels, kernel_size=3)
weights = net.conv1.weight.cpu().detach()  # move to the CPU; detach because we are visualizing, not training
plt.figure(figsize=(10, 10))
plt.axis("off")  # axes are not needed; we only show the weights
# .permute(1, 2, 0) changes PyTorch's (Channels, Height, Width) layout to the (Height, Width, Channels) layout that plt.imshow expects
plt.imshow(torchvision.utils.make_grid(weights, normalize=True, pad_value=1).permute(1, 2, 0))
plt.show()

Training the Network
--------------------

Since the neural network is initialized with random weights, the filters visualized above are just random noise. In order to train them, we need to specify both a) a loss function and b) an optimization algorithm. We will use the cross-entropy loss function with the Adam optimizer, an adaptive variant of stochastic gradient descent. In PyTorch, we can specify these by creating the two objects below:

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.0001)

Notice that the optimizer accepts the parameter `net.parameters()`. The call `net.parameters()` is a bit of magic. It automatically determines which tensors are learnable inside the network and returns an iterator over them, which tells the optimizer what to update. In the previous homework, you needed to manually track these variables, but in PyTorch there is book-keeping underneath the API that does this for you automatically.
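
As a rough cross-check on what `net.parameters()` gathers, the number of learnable values can be worked out by hand. The sketch below does this in plain Python, assuming the layer sizes of the `Net` class defined above (3x3 convolutions with biases, followed by three linear layers). The helper names are illustrative.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights (out * in * k * k) plus one bias per output channel."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

def linear_params(in_features, out_features):
    """Weights (out * in) plus one bias per output feature."""
    return out_features * in_features + out_features

# Layer sizes assumed to match the Net class above
counts = {
    "conv1": conv_params(3, 32, 3),
    "conv2": conv_params(32, 64, 3),
    "conv3": conv_params(64, 128, 3),
    "conv4": conv_params(128, 128, 3),
    "fc1": linear_params(128 * 8 * 8, 256),
    "fc2": linear_params(256, 128),
    "fc3": linear_params(128, 10),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n}")
print(total)  # 2372426
```

In the notebook, `sum(p.numel() for p in net.parameters())` should report the same total. If it does not, a layer was probably defined with different sizes than assumed here. Note that `fc1` alone accounts for most of the parameters.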

Now, we are ready to train the neural network.

In [None]:
def train_network(net, n_epochs=2):
    # NOTE: relies on the global `criterion`, `optimizer`, and `trainloader`;
    # the optimizer must have been created from this network's parameters.
    net.to(device)
    net.train()  # training mode (matters for layers such as batch norm)

    for epoch in range(n_epochs):  # loop over the dataset multiple times
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data

            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print the loss of the current batch every 1000 iterations
            if i % 1000 == 0:
                print(f'Epoch={epoch + 1} Iter={i + 1:5d} Loss={loss.item():.3f}')
    print('Finished Training')
    return net


If you want, you can train the network for longer too. This will help improve the performance.
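
When reading the printed losses, it helps to know the chance-level value. With ten classes, a network that assigns equal scores to every class has a cross-entropy of $\ln 10 \approx 2.30$, so losses well below that indicate the network has learned something. The sketch below computes the cross-entropy for a single example in plain Python. The function name and the example logits are made up for illustration, and `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Equal scores for all ten classes: the chance-level loss
uniform = cross_entropy([0.0] * 10, label=3)

# A confident and correct prediction: a much smaller loss
confident = cross_entropy([0.0] * 9 + [5.0], label=9)

print(round(uniform, 3))  # 2.303
print(confident < uniform)  # True
```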

In [None]:
train_network(net, n_epochs=5)

Visualizing Predictions
-----------------------

Unless you train the neural network for a long time, the loss will most likely not go to zero. However, it should still go down, which means it has learned some association between visual patterns and the category labels in the dataset. Let's try the model on some images in the test set and see what it predicts for them.

In [None]:
dataiter = iter(testloader)
images, labels = next(dataiter)

net.to(device)
net.eval()  # inference mode (matters for layers such as dropout or batch norm)
with torch.no_grad():  # gradients are not needed for prediction
    predictions = net(images.to(device)).argmax(dim=1).cpu()
accuracy = (labels == predictions).double().mean()

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', '\t'.join(f'{classes[labels[j]]:5s}' for j in range(len(labels))))
print('Predictions: ', '\t'.join(f'{classes[predictions[j]]:5s}' for j in range(len(labels))))
print(f'Accuracy: {accuracy.item() * 100:.1f}%')

In our implementation, the predictions are not always correct, but they are often reasonable. This is impressive considering we have barely trained the neural network at all.

Let's calculate the accuracy on the full test set.

In [None]:
dataiter = iter(testloader)

net.eval()  # inference mode
running_accuracy = 0.0
running_count = 0
with torch.no_grad():  # gradients are not needed for evaluation
    for images, labels in dataiter:
        predictions = net(images.to(device)).argmax(dim=1).cpu()
        accuracy = (labels == predictions).double().mean()

        running_accuracy += accuracy.item()
        running_count += 1

print(f'Accuracy: {running_accuracy/running_count*100:.2f}%')

In our solution, we get 74% accuracy after training for about 10 minutes on the Colab GPU. Can you do better?
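
One caveat about the evaluation loop above: it averages per-batch accuracies, which equals the true accuracy only when every batch has the same size. That holds here, because the 10,000 test images divide evenly into batches of 4, but it can be slightly off for other batch sizes. The sketch below illustrates the difference on made-up correctness flags (1 for a correct prediction, 0 for a wrong one). The function names are illustrative.

```python
def mean_of_batch_accuracies(batches):
    """Average the accuracy of each batch, as the loop above does."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def overall_accuracy(batches):
    """Count correct predictions over all examples."""
    correct = sum(sum(b) for b in batches)
    total = sum(len(b) for b in batches)
    return correct / total

equal_batches = [[1, 1, 0, 1], [0, 1, 1, 1]]
ragged_batches = [[1, 1, 1, 1], [0]]  # the last batch is smaller

print(mean_of_batch_accuracies(equal_batches), overall_accuracy(equal_batches))    # 0.75 0.75
print(mean_of_batch_accuracies(ragged_batches), overall_accuracy(ragged_batches))  # 0.5 0.8
```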

# 2. Problem 2: Building ResNet

Residual Network
----------------

Residual networks have become a standard architecture because they are able to efficiently scale to a large number of layers. While some residual networks have hundreds or even more than a thousand layers, they would be too expensive to train in time for the homework deadline. Let's implement just a simple residual network.

Implement a ResNet block which contains convolutional layers with a skip connection across them. Note that the dimensions of the original input and the output of the convolutional layers may not match up for addition. Hint: one way to address this is to introduce a linear transformation (like a 1x1 kernel convolution) to resize the input when necessary. Another would be to 0-pad the input to match dimensions for addition.
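
To see why a strided 1x1 convolution on the skip path lines up with the main path, the output sizes can be compared directly. The sketch below assumes the main path is a 3x3 convolution with the given stride and padding 1, followed by a 3x3 convolution with stride 1 and padding 1, and that the skip path is a 1x1 convolution with the same stride and no padding. These are one possible set of choices, not the only valid design, and the helper name is illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Output height/width of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

results = []
for size in [32, 16, 15, 8]:
    for stride in [1, 2]:
        # main path: strided 3x3 conv, then a 3x3 conv that keeps the size
        main = conv_out(conv_out(size, 3, stride, 1), 3, 1, 1)
        # skip path: 1x1 conv with the same stride
        skip = conv_out(size, 1, stride, 0)
        results.append((size, stride, main, skip))
        print(f"input={size:2d} stride={stride} main={main:2d} skip={skip:2d}")

all_match = all(main == skip for _, _, main, skip in results)
print(all_match)  # True
```

The channel counts match as well if the 1x1 convolution outputs the same number of channels as the main path, so the two tensors can be added element-wise.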

In [None]:
# Code largely taken from an implementation I had done in a previous class,
# which relied on open source implementation https://github.com/kuangliu/pytorch-cifar/blob/master/models/resnet.py
class ResNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # DONE: Initializing Two Convolutional Layers in the Residual Block
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        # Batch normalization (suggested by a Google search) helps stabilize and speed up training
        self.bn1 = nn.BatchNorm2d(out_channels)

        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        self.skip_connection = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            # DONE: Using a Conv2d layer with kernel_size=1 to "resize" input
            self.skip_connection = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity_x = self.skip_connection(x)

        # DONE: Implemented Forward pass using 2 Conv Layers.
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        assert out.shape == identity_x.shape
        out += identity_x
        out = F.relu(out)
        return out


In the code block below, complete the class for a residual network. (PyTorch has a residual network built in, but you should not use it. Instead, create a residual network using the building blocks introduced above.) We recommend the following architecture: Convolution, Maxpooling, ResNetBlock, ResNetBlock, flatten, linear, linear, linear. Be sure to use ReLU activations where necessary. Try experimenting with different channel dimensions.

In [None]:
class ResNet(nn.Module):
    def __init__(self):
        super().__init__()
        # DONE: initialize network layers
        # Initial convolution and pooling

        # CIFAR-10 has 3 input channels
        self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Two ResNet blocks
        self.block1 = ResNetBlock(64, 128, stride=2)
        self.block2 = ResNetBlock(128, 128, stride=1)

        # After conv + pool + block1 (stride=2), spatial dims go from 32x32 → 16x16 → 8x8
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(128 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)  # 10 classes for CIFAR-10

    def forward(self, x):
        # DONE: Implemented forward pass
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.pool(x)

        x = self.block1(x)
        x = self.block2(x)

        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


# Instantiate the network
res_net = ResNet()

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(res_net.parameters(), lr=0.0001)

After you create the network, we can use the training loop from above in order to train it.

In [None]:
res_net = train_network(res_net, n_epochs=5)

Let's visualize some of the predictions from the trained network.

In [None]:
dataiter = iter(testloader)
images, labels = next(dataiter)

res_net.to(device)
res_net.eval()  # batch norm must use its running statistics at test time
with torch.no_grad():  # gradients are not needed for prediction
    predictions = res_net(images.to(device)).argmax(dim=1).cpu()
accuracy = (labels == predictions).double().mean()

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', '\t'.join(f'{classes[labels[j]]:5s}' for j in range(len(labels))))
print('Predictions: ', '\t'.join(f'{classes[predictions[j]]:5s}' for j in range(len(labels))))
print(f'Accuracy: {accuracy.item() * 100:.1f}%')

Let's also calculate the accuracy on the full test set.

In [None]:
dataiter = iter(testloader)

res_net.eval()  # batch norm must use its running statistics at test time
running_accuracy = 0.0
running_count = 0
with torch.no_grad():  # gradients are not needed for evaluation
    for images, labels in dataiter:
        predictions = res_net(images.to(device)).argmax(dim=1).cpu()
        accuracy = (labels == predictions).double().mean()

        running_accuracy += accuracy.item()
        running_count += 1

print(f'Accuracy: {running_accuracy/running_count*100:.2f}%')

In our solution, we get 75% accuracy.

Debugging Tips:
- Try increasing epochs or adjusting the learning rate.
- Try increasing model capacity by changing the number of layers.
- Check if Batch Norm is applied after every convolution.
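
Related to the last tip, batch norm behaves differently in training and evaluation mode, which matters with a batch size as small as 4. In training mode each value is normalized with the statistics of its own batch, so the same input can produce different outputs depending on what else is in the batch. In evaluation mode, fixed running statistics are used instead. The sketch below shows this for a single channel in plain Python. It is simplified (no learnable scale and shift), and the running statistics of 1.0 are hypothetical values chosen for illustration.

```python
def batch_norm_train(values, eps=1e-5):
    """Training mode: normalize with the statistics of the current batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def batch_norm_eval(values, running_mean, running_var, eps=1e-5):
    """Evaluation mode: normalize with fixed running statistics."""
    return [(v - running_mean) / (running_var + eps) ** 0.5 for v in values]

# The first value (2.0) is identical in both batches
batch_a = [2.0, 0.0, 0.0, 0.0]
batch_b = [2.0, 4.0, 4.0, 4.0]

train_a = batch_norm_train(batch_a)[0]
train_b = batch_norm_train(batch_b)[0]
eval_a = batch_norm_eval(batch_a, running_mean=1.0, running_var=1.0)[0]
eval_b = batch_norm_eval(batch_b, running_mean=1.0, running_var=1.0)[0]

print(train_a > 0, train_b < 0)  # True True
print(eval_a == eval_b)          # True
```

In PyTorch, calling `.eval()` on the network before testing and `.train()` before training switches between these two behaviors.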