# Convolutional Neural Networks for Computer Vision

## *Bogdan Bošković*



**Introduction.**  In this exercise you will utilize convolutional neural networks to solve problems in computer vision, such as image classification, object detection, and segmentation.  

**Instructions.** As usual, please submit your code and its output as a pdf file, generated from a Jupyter notebook.  I recommend you complete this assignment in Google CoLab [(link)](https://colab.research.google.com/), but it is also certainly possible to complete it in a local IDE if you first install pytorch (instructions not included here).  The assignment will be divided into "Problems", which will be indicated below along with the number of points awarded for completion.  We will begin the assignment by importing important libraries.

## **PROBLEM 1 (40 Total Points)**

**Part (a) (10 points)**
You will begin this problem by setting up a baseline neural network model, and helper functions, that you developed in the last assignment.  Therefore this first part will mostly involve running code that I provide to you here, or utilizing code that you developed in the previous assignment.  Start by importing the software libraries below.

In [1]:
# You will need the following libraries to complete the assignment
import torch
from torch import nn
from torch.utils.data import DataLoader
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt
import numpy as np
import torch.optim as optim
# suppress warnings
import warnings
warnings.filterwarnings("ignore")

**Now load the MNIST Data**, along with built-in PyTorch data loaders. Run the following code to load the MNIST data.

In [2]:
# Fill in the details for the "transform" variable
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.05), (0.05))])

# We will use a relatively large batch size of 128 here to
#  accelerate the training process
batch_size = 128

# Download the MNIST dataset and data loaders
trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                        download=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.MNIST(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

# Label the classes
classes = ('zero', 'one', 'two', 'three',
           'four', 'five', 'six', 'seven', 'eight', 'nine')

**Construct a Baseline Model**.  Our baseline model will be a fully-connected neural network with 8 total layers of parameters.  Aside from the output layer, each layer has 50 hidden units, with ReLU activations.  We will call this network *NetFC*.  Note that this is simply the first model that you were asked to create in the previous assignment, however I have provided the code for you below.

In [3]:
class NetFc(nn.Module):
    def __init__(self):
      super().__init__()
      self.fc1 = nn.Linear(28*28, 50)
      self.fc2 = nn.Linear(50, 50)
      self.fc3 = nn.Linear(50, 50)
      self.fc4 = nn.Linear(50, 50)
      self.fc5 = nn.Linear(50, 50)
      self.fc6 = nn.Linear(50, 50)
      self.fc7 = nn.Linear(50, 50)
      self.fc8 = nn.Linear(50, 10)


    def forward(self, x):

      x = torch.flatten(x, 1)
      x1 = F.relu(self.fc1(x))
      x2 = F.relu(self.fc2(x1))
      x3 = F.relu(self.fc3(x2))
      x4 = F.relu(self.fc4(x3))
      x5 = F.relu(self.fc5(x4))
      x6 = F.relu(self.fc6(x5))
      x7 = F.relu(self.fc7(x6))

      output = self.fc8(x5)

      # Return the output of the network
      return output

**Import Helper Functions** In the last assignment you were required to create two functions: *trainMyModel* and *testMyModel*.  You will need to re-use these functions again here and you can paste them here and run them.  To keep the notebook a little cleaner, I import these two functions from another python file called *dl_assignment4_helper_functions*, below.  I use the prefix *hlp* to call these functions. It is up to you whether you paste these functions into the notebook, or import them.  However, note that the code skeletons below assume that they are imported with the *hlp* prefix, and you will have to remove/modify the prefix if you don't import them in a similar fashion.

In [4]:
import dl_assignment4_helper_functions as hlp

**Train and Test the Baseline Model** Now Run your *NetFC* model using a learning rate of 0.01 for 2 epochs.  These values are chosen because they work relatively well.  This model should usually achieve around 94% accuracy, and this will serve as our baseline.

In [5]:
# Train your model.
net = NetFc();
lr = 0.01;
n_epochs = 2;
trainedNet = hlp.trainMyModel(net,lr,trainloader,n_epochs);

Training on cuda:0
[epoch: 1, batch: 100] loss: 0.731
[epoch: 1, batch: 200] loss: 0.325
[epoch: 1, batch: 300] loss: 0.269
[epoch: 1, batch: 400] loss: 0.257
[epoch: 2, batch: 100] loss: 0.260
[epoch: 2, batch: 200] loss: 0.230
[epoch: 2, batch: 300] loss: 0.243
[epoch: 2, batch: 400] loss: 0.195
✨ Finished Training ✨


In [6]:
# Test your model
hlp.testMyModel(trainedNet,testloader)

Accuracy of the network on the 10000 test images: 94.4 %


94.4

**Part (b) (20 points)** Now we will see the advantages of convolutional structures in a deep neural network.  Below, fill in the template to create convolutional neural network, called 'NetCnn' that has the following structure:

layer1: 8 3x3 convolutional filters, one pixel of zero-padding, and stride of one

layer2: 16 3x3 convolutional filters, one pixel of zero-padding, and stride of one

layer3: 2x2 max pooling, with stride of 2. No zero-padding.

layer4: 32 3x3 convolutional filters, one pixel of zero-padding, and stride of one

layer5: 64 3x3 convolutional filters, one pixel of zero-padding, and stride of one

layer6: 2x2 max pooling, with stride of 2. No zero-padding.

layer7: a fully connected layer of 50 neurons.

Layer8: a fully connected layer of 10 neurons.

In [66]:
# Convolutional model - adding in convolutional layers

class NetCnn(nn.Module):
    def __init__(self):
      super().__init__()
      # input channels = 1 because MNIST images are 1D
      self.conv1 = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=(3,3), padding=1, stride=1)
      # in_channels = 8 because 8 output channels from previous layer
      self.conv2 = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=(3,3), padding=1, stride=1)
      self.pool3 = nn.MaxPool2d(kernel_size=(2,2), stride=2, padding=0)
      self.conv4 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(3,3), padding=1, stride=1)
      self.conv5 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3,3), padding=1, stride=1)
      self.pool6 = nn.MaxPool2d(kernel_size=(2,2), stride=2, padding=0)

      # 1x1 convolution to match channels of layer 1 with layer 4, for skip to 5
      self.channel_match = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=1)

      # fully connected layer with 50 neurons -- 64 is the number of output channels, 
      # 7x7 are the dimensions of the image after two max pooling layers
      self.fc7 = nn.Linear(in_features=64*7*7, out_features=50)
      # fully connected layer with 10 neurons = 10 classes
      self.fc8 = nn.Linear(in_features=50, out_features=10)

    def forward(self, x):
      x1 = F.relu(self.conv1(x))
      x2 = F.relu(self.conv2(x1))
      x3 = self.pool3(x2)
      x4 = F.relu(self.conv4(x3))

      # upsample x2 to mactn dimensions of x5
      x2_up = F.interpolate(x2, size=(x4.size(2), x4.size(3)), mode='nearest')
      # match the number of channels in x5 (run through 1x1 conv from init)
      x2_matched = self.channel_match(x2_up)
      
      # skip connection from layer 2 to layer 4
      x5 = F.relu(self.conv5(x4 + x2_matched))
      x6 = self.pool6(x5)
      # flatten the tensor for fully connected layer
      x6 = torch.flatten(x6, 1)
      x7 = F.relu(self.fc7(x6))
      output = self.fc8(x7)
      # return output of model
      return output


Now train the model for 2 epochs using your *trainMyModel* function, which should report the loss every 100 iterations, as was requested in the last assignment.  Then, using *testMyModel* function, evaluate the accuracy of your trained model on the test set. If done correctly, you should obtain around 97% accuracy on the testing set, a relatively significant improvement over *NetFc* if you consider how much error remains.  Note that you may need to tune the learning rate a little bit to achieve this level of accuracy.  

Not that we could add skip connections to 'NetCnn' as well, which would further improve its performance, but this is a little tricky and it will not be part of this assignment.  

In [67]:
# Train your model.
net = NetCnn()
lr = 0.01
n_epochs = 2

trainedNet = hlp.trainMyModel(net,lr,trainloader,n_epochs)

# Test your model
hlp.testMyModel(trainedNet,testloader)

Training on cuda:0


[epoch: 1, batch: 100] loss: 0.504
[epoch: 1, batch: 200] loss: 0.131
[epoch: 1, batch: 300] loss: 0.100
[epoch: 1, batch: 400] loss: 0.092
[epoch: 2, batch: 100] loss: 0.094
[epoch: 2, batch: 200] loss: 0.081
[epoch: 2, batch: 300] loss: 0.071
[epoch: 2, batch: 400] loss: 0.083
✨ Finished Training ✨
Accuracy of the network on the 10000 test images: 98.08 %


98.08

**Part (c) (10 points)**  Compute the number of parameters in the NetFC model and the NetCnn model, respectively, as described in UDL.  Please show your work, and then report your final answer in scientific notation $x \times 10^y$ where you need to fill in $x$ and $y$.  You need only report $x$ reported to one decimal place, and $y$ should be an integer.  You will be primarily graded on a correct order of magnitude, $y$.

**ANSWER:**

**Part (c):**

The number of parameters in a fully connected layer is:

$$
\text{Number of parameters} = \text{previous layer} \times \text{current layer} + \text{current layer}
$$

So, in the NetFC model, we have:

$$
\text{Number of parameters} = 784
$$
$$
\text{Number of parameters 1st layer} = 50 \times 784 + 50 = 39200
$$
$$
\text{Number of parameters 2nd layer} = 50 \times 50 + 50 = 2550
$$

Subsequent layers have the same number of parameters, except for the output layer, which has 10 neurons. So, the number of parameters in the NetFC model over the first 7 layers is:

$$
\text{Number of parameters 1-7} = 2550 \times 6 + 39200 = 54500
$$
$$
\text{Number of parameters 1-8} = 54500 + 50 \times 10 + 10 = 55010
$$

Now, the number of parameters in a convolutional layer is:

$$
\text{Number of parameters} = \text{number of filters} \times \text{channels in previous layer} \times \text{filter} + \text{number of filters}
$$

So, for the NetCnn model, we have:

$$
\text{Number of parameters layer 1} = 8 \times 1 \times 3 \times 3 + 8 = 80
$$
$$
\text{Number of parameters layer 2} = 16 \times 8 \times 3 \times 3 + 16 = 1168
$$
$$
\text{Number of parameters layer 4} = 32 \times 16 \times 3 \times 3 + 32 = 4640
$$
$$
\text{Number of parameters layer 5} = 64 \times 32 \times 3 \times 3 + 64 = 18464
$$
$$
\text{Number of parameters layer 7} = 64 \times 7 \times 7 \times 50 + 50 = 164200
$$
$$
\text{Number of parameters layer 8} = 50 \times 10 + 10 = 510
$$

So, the total number of parameters in the NetCnn model is:

$$
\text{Number of parameters NetCnn} = 80 + 1168 + 4640 + 18496 + 157850 + 510 = 189062
$$






**Part (d) (10 POINTS)**  In this last subproblem, you will add batch normalization layers to your network.  Batch normalization, and its variants (e.g., "layer norm") are another structure that is now widely-used in modern deep neural networks.  In this problem you will design a neural network called *NetCnnBn* with the exact same structure as 'NetCnn' except you will add two batch normalization layers in the following locations: (i) after the 2nd convolutional layer, and (ii) after the 1st fully connected layer.  

Train your NetCnnBn for 2 epochs using your *trainMyModel* function, and then report its accuracy on the test set using the *testMyModel* function.  If done properly, you should be now be able to achieve approximately 99% accuracy on the testing dataset after two epochs of training.  NOte that you may need to adjust the learning rate again.  Despite this significant performance improvement, note that batch normalization contributes a very small number of parameters.  In our case, for example, it adds $<200$ parameters.

In [88]:
# Convolutional model - adding in batch norm
class NetCnnBn(nn.Module):
    def __init__(self):
      super().__init__()
      # input channels = 1 because MNIST images are 1D
      self.conv1 = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=(3,3), padding=1, stride=1)
      # in_channels = 8 because 8 output channels from previous layer
      self.conv2 = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=(3,3), padding=1, stride=1)
      # adding batch normalization, num_features = 16 because 16 output channels from previous layer
      self.bn3 = nn.BatchNorm2d(num_features=16)
      self.pool4 = nn.MaxPool2d(kernel_size=(2,2), stride=2, padding=0)

      # 1x1 convolution to match channels of conv2 with conv5, for skip to conv6
      self.channel_match = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=1)

      self.conv5 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(3,3), padding=1, stride=1)
      self.conv6 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3,3), padding=1, stride=1)
      self.pool7 = nn.MaxPool2d(kernel_size=(2,2), stride=2, padding=0)
      # fully connected layer with 50 neurons -- 64 is the number of output channels, 
      # 7x7 are the dimensions of the image after two max pooling layers
      self.fc8 = nn.Linear(in_features=64*7*7, out_features=50)
      # batch normalization
      self.bn9 = nn.BatchNorm1d(num_features=50)
      # fully connected layer with 10 neurons = 10 classes
      self.fc10 = nn.Linear(in_features=50, out_features=10)

    def forward(self, x):
      x1 = F.relu(self.conv1(x))
      x2 = F.relu(self.conv2(x1))
      x3 = self.bn3(x2)
      x4 = self.pool4(x3)
      x5 = F.relu(self.conv5(x4))

      # upsample x2 to match dimensions of x5
      x2_up = F.interpolate(x2, size=(x5.size(2), x5.size(3)), mode='nearest')
      # match the number of channels in x5 (run through 1x1 conv from init)
      x2_matched = self.channel_match(x2_up)

      # skip connection from layer 2 to layer 5
      x6 = F.relu(self.conv6(x5 + x2_matched))
      x7 = self.pool7(x6)
      # flatten the tensor for fully connected layer
      x8 = torch.flatten(x7, 1)
      x9 = F.relu(self.fc8(x8))
      x10 = self.bn9(x9)
      output = self.fc10(x10)
      # return output of model
      return output

In [89]:
# Train your model.
net = NetCnnBn()
lr = 0.01
n_epochs = 2

trainedNet = hlp.trainMyModel(net,lr,trainloader,n_epochs)

# Test your model
hlp.testMyModel(trainedNet,testloader)

Training on cuda:0


[epoch: 1, batch: 100] loss: 0.266
[epoch: 1, batch: 200] loss: 0.079
[epoch: 1, batch: 300] loss: 0.062
[epoch: 1, batch: 400] loss: 0.054
[epoch: 2, batch: 100] loss: 0.055
[epoch: 2, batch: 200] loss: 0.045
[epoch: 2, batch: 300] loss: 0.046
[epoch: 2, batch: 400] loss: 0.042
✨ Finished Training ✨
Accuracy of the network on the 10000 test images: 98.99 %


98.99

## **PROBLEM 2 (20 Total Points)**

In this problem you will investigate transfer learning.  Load in a resnet18 model, and initialize its training with weights that were pre-trained on the ImageNet dataset.  Call this model *pretrainedResNet* As a hint, you cannot simply apply a pre-trained resnet18 to this problem; you will need to make two changes to the model structure for it to work properly.  

Once you have made the proper modifications (fill in code below), train and test your model on the MNIST data, as you have done with previous models.  If done properly, you should only require a few lines of code, and you should usually obtain around 97% accuracy with 1 epoch of training and the learning rate provided below (lr = 0.0001).  You only need to show your results with these settings. Unfortunately the MNIST dataset is not ideal for demonstrating the tremendous benefits of transfer learning, but this exercise will help familiarize you with the process of adapting pre-trained models to a custom task, which is important in practice.   

For this problem I highly recommend that you use a GPU because training will be  relatively slow without it (e.g., a couple minutes for 1 epoch, depending upon your hardware).  With a GPU the training should generally run very quickly, finishing in under 30 seconds or less.  Note that you can procure a free GPU to use on Google Colab, however, you are given a limited GPU compute per day unless you pay. Consequently I strongly recommend that you debug on a CPU before deploying onto the GPU.  

In [92]:
# Please load a pre-trained resnet-18 model and make the necessary changes so that it will work on the MNIST problem

# load the pre-trained model
preTrainedResNet = torchvision.models.resnet18(pretrained=True)
# change input layer to accept 1 channel
preTrainedResNet.conv1 = nn.Conv2d(1, 64, kernel_size=1)
# change output layer to have 10 classes
preTrainedResNet.fc = nn.Linear(in_features=512, out_features=10, bias=True)

In [93]:
# Train and test your model
lr = 0.0001;
n_epochs = 1;
trainedNet = hlp.trainMyModel(preTrainedResNet,lr,trainloader,n_epochs);

hlp.testMyModel(trainedNet,testloader)

Training on cuda:0
[epoch: 1, batch: 100] loss: 0.754
[epoch: 1, batch: 200] loss: 0.208
[epoch: 1, batch: 300] loss: 0.150
[epoch: 1, batch: 400] loss: 0.116
✨ Finished Training ✨
Accuracy of the network on the 10000 test images: 96.99 %


96.99