# Homework 2: Convolutional Neural Networks (100 points)

### Overview

With new knowledge of convolutional neural networks, we can accomplish a more difficult image recognition task. The CIFAR-10 classification dataset consists of 60,000 labelled images split between 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks.

For this assignment, we compare two models on the same dataset: a fully connected neural network (as in Homework 1) called ANN and a new convolutional architecture called CNN, outlined in the next section. To keep the comparison fair, we attempt to give the ANN about the same number of trainable parameters as the CNN, which means the ANN uses the same input transformation as in Homework 1: each image is reduced to a single grayscale channel and flattened. The CNN reaps the full benefit of the original 2D image in RGB.
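As a rough illustration of the ANN's input transformation, the sketch below collapses a channel-first RGB image to one channel by summing the three channels per pixel and then flattens it row by row. This mirrors the `torch.sum(inputs, axis=1).view(-1, 32*32)` call in the training cell. The helper name `to_ann_input` and the nested-list image layout are assumptions made for illustration; the notebook itself operates on tensors.

```python
def to_ann_input(image):
    """Collapse a [channels][height][width] image to a flat list of per-pixel channel sums."""
    channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    flat = []
    for row in range(height):
        for col in range(width):
            # Sum the R, G and B values of this pixel into a single number
            flat.append(sum(image[c][row][col] for c in range(channels)))
    return flat


# Tiny 3-channel, 2x2 example; a CIFAR-10 image is 3x32x32 and yields 1024 values
tiny = [
    [[1, 2], [3, 4]],          # R channel
    [[10, 20], [30, 40]],      # G channel
    [[100, 200], [300, 400]],  # B channel
]
print(to_ann_input(tiny))  # [111, 222, 333, 444]
```

The flat vector no longer records which pixels were neighbors or which channel a value came from, which is the information the CNN keeps.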

### CNN Architecture

Each image consists of 32x32 RGB pixel values between 0 and 255. Beyond converting each image to a tensor and normalizing each channel, we do not need to perform any preprocessing, as the convolutional model uses all three channels concurrently as input.

The architecture has 5 layers: a convolutional layer followed by a pooling layer, then another convolutional layer, then two fully connected dense layers. (In the code below, the same pooling layer is also applied after the second convolution.) The last dense layer has 10 neurons, one per class, to provide the classification output.
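To see why the first dense layer in the model code expects `outChannels * 5 * 5` inputs, the following sketch traces the spatial size of a 32x32 image through the layers. It assumes what the code later in this notebook uses: 5x5 kernels with stride 1 and no padding, and 2x2 max pooling with stride 2 after each convolution. The function names are hypothetical helpers, not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output width/height of a pooling layer along one spatial dimension."""
    return (size - kernel) // stride + 1


def trace_sizes(input_size=32, kernel=5):
    """Spatial size after each of: input, conv1, pool, conv2, pool."""
    sizes = [input_size]
    sizes.append(conv_out(sizes[-1], kernel))  # conv1
    sizes.append(pool_out(sizes[-1]))          # pool after conv1
    sizes.append(conv_out(sizes[-1], kernel))  # conv2
    sizes.append(pool_out(sizes[-1]))          # pool after conv2
    return sizes


print(trace_sizes())  # [32, 28, 14, 10, 5]
```

The final 5x5 feature map, repeated over the output channels of the second convolution, is what gets flattened before the dense layers.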

### Your Task

At the bottom of this notebook file, there are four short answer questions testing your understanding of this neural network architecture. As before, some questions will require you to experiment with model hyperparameters.

Below each question is a cell with the text “Type Markdown and LaTeX.” Double-click the cell and type your response to the question. Save your responses by clicking on the floppy disk icon or choosing File - Save and Checkpoint.

After responding to the questions, download your notebook as a `.html` file by choosing File - Download as - html (.html). You will be submitting this `.html` file to your instructor for grading.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

In [2]:
torch.manual_seed(0)
torch.set_num_threads(4)
torch.set_num_interop_threads(4)

In [3]:
trainTransform = transforms.Compose([transforms.RandomRotation(5),
                                     transforms.ToTensor(),
                                     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
                                    ])
testTransform = transforms.Compose([transforms.ToTensor(),
                                     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
                                    ])

In [4]:
root_dir = 'assets_week2'
trainDataset = torchvision.datasets.CIFAR10(root=root_dir, train=True, download=True, transform=trainTransform)
trainLoader = torch.utils.data.DataLoader(trainDataset, batch_size=4, shuffle=True, num_workers=2)
testDataset = torchvision.datasets.CIFAR10(root=root_dir, train=False, download=True, transform=testTransform)
testLoader = torch.utils.data.DataLoader(testDataset, batch_size=4, shuffle=False, num_workers=2)

Files already downloaded and verified
Files already downloaded and verified


In [5]:
class ANNModel(nn.Module):
    def __init__(self, hiddenSize, dropoutRate, activate):
        super().__init__()
        # Note that 'layer' and 'dense' differ only in name (to show similarity to CNN)
        self.activate = nn.Sigmoid() if activate == "Sigmoid" else nn.ReLU()
        self.layer1 = nn.Linear(1024, 100)
        self.layer2 = nn.Linear(100, 15 * 5 * 5)
        self.dense1 = nn.Linear(15 * 5 * 5, hiddenSize)
        self.dropout = nn.Dropout(dropoutRate)
        self.dense2 = nn.Linear(hiddenSize, 10)
        
    def forward(self, x):
        x = self.activate(self.layer1(x))
        x = self.activate(self.layer2(x))
        x = self.dropout(self.activate(self.dense1(x)))
        return self.dense2(x)

class CNNModel(nn.Module):
    def __init__(self, hiddenSize, outChannels, dropoutRate, activate):
        super().__init__()
        self.outChannels = outChannels
        self.activate = nn.Sigmoid() if activate == "Sigmoid" else nn.ReLU()
        self.conv1 = nn.Conv2d(3, 24, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(24, outChannels, 5)
        self.dense1 = nn.Linear(outChannels * 5 * 5, hiddenSize)
        self.dropout = nn.Dropout(dropoutRate)
        self.dense2 = nn.Linear(hiddenSize, 10)

    def forward(self, x):
        x = self.pool(self.activate(self.conv1(x)))
        x = self.pool(self.activate(self.conv2(x)))
        x = x.view(-1, self.outChannels * 5 * 5)
        x = self.dropout(self.activate(self.dense1(x)))
        return self.dense2(x)

In [6]:
# Number of neurons in the first fully-connected layer
hiddenSize = 100
# Number of feature filters in second convolutional layer
numFilters = 25
# Dropout rate
dropoutRate = 0
# Activation function
activation = "ReLU"
# Learning rate
learningRate = 0.001
# Momentum for SGD optimizer
momentum = 0.9
# Number of training epochs
numEpochs = 10

In [7]:
ann = ANNModel(hiddenSize, dropoutRate, activation)
cnn = CNNModel(hiddenSize, numFilters, dropoutRate, activation)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(list(ann.parameters()) + list(cnn.parameters()), lr=learningRate, momentum=momentum)

print('>>> Beginning training!') 
ann.train()
cnn.train()
for epoch in range(numEpochs):  # loop over the dataset multiple times
    annRunningLoss, cnnRunningLoss = 0, 0
    for i, (inputs, labels) in enumerate(trainLoader, 0):
        annInputs = torch.sum(inputs, axis=1).view(-1, 32*32)
        
        optimizer.zero_grad()

        # Forward propagation
        annOutputs = ann(annInputs)
        cnnOutputs = cnn(inputs)
        
        # Backpropagation
        annLoss = criterion(annOutputs, labels)
        cnnLoss = criterion(cnnOutputs, labels)
        annLoss.backward()
        cnnLoss.backward()
        
        # Gradient update
        optimizer.step()

        annRunningLoss += annLoss.item()
        cnnRunningLoss += cnnLoss.item()
        if (i+1) % 2000 == 0:    # print every 2000 mini-batches
            print('Epoch [{}/{}], Step [{}/{}], ANN Loss: {}, CNN Loss: {}'.format(epoch + 1, numEpochs, i + 1, len(trainLoader), annRunningLoss/2000, cnnRunningLoss/2000))
            annRunningLoss, cnnRunningLoss = 0, 0

print()
print('>>> Beginning validation!')
ann.eval()
cnn.eval()
annCorrect, cnnCorrect = 0, 0
total = 0
with torch.no_grad():  # gradients are not needed during evaluation
    for inputs, labels in testLoader:
        annInputs = torch.sum(inputs, axis=1).view(-1, 32*32)
        annOutputs = ann(annInputs)
        cnnOutputs = cnn(inputs)
        _, annPredicted = torch.max(annOutputs, 1)
        _, cnnPredicted = torch.max(cnnOutputs, 1)
        total += labels.size(0)
        annCorrect += (annPredicted == labels).sum().item()
        cnnCorrect += (cnnPredicted == labels).sum().item()
print('ANN validation accuracy: {}%, CNN validation accuracy: {}%'.format(annCorrect / total * 100, cnnCorrect / total * 100))

>>> Beginning training!
Epoch [1/10], Step [2000/12500], ANN Loss: 2.0651431404650213, CNN Loss: 2.013633511543274
Epoch [1/10], Step [4000/12500], ANN Loss: 1.9222866840958595, CNN Loss: 1.6558185454905032
Epoch [1/10], Step [6000/12500], ANN Loss: 1.866415710836649, CNN Loss: 1.5057528860867024
Epoch [1/10], Step [8000/12500], ANN Loss: 1.8486922891438007, CNN Loss: 1.4342953490763903
Epoch [1/10], Step [10000/12500], ANN Loss: 1.819504773169756, CNN Loss: 1.3722973599135875
Epoch [1/10], Step [12000/12500], ANN Loss: 1.792697628468275, CNN Loss: 1.2984128977954388
Epoch [2/10], Step [2000/12500], ANN Loss: 1.7398476891368628, CNN Loss: 1.2341302007064223
Epoch [2/10], Step [4000/12500], ANN Loss: 1.721903137549758, CNN Loss: 1.2104474659375846
Epoch [2/10], Step [6000/12500], ANN Loss: 1.7148346040248872, CNN Loss: 1.1882972740493716
Epoch [2/10], Step [8000/12500], ANN Loss: 1.7160489804148673, CNN Loss: 1.145599445145577
Epoch [2/10], Step [10000/12500], ANN Loss: 1.71327527056634

## Homework Questions

**To make sure your code produces consistent results, it is advisable to click "Kernel -> Restart & Run All" every time you want to run your code.**

### Question 1: CNN Advantage (10 points)

Compute the accuracy of a simple dense neural network and a simple CNN on the dataset. Explain the results and briefly overview the advantages of a CNN over a standard neural network for image-related tasks.

**Answer:**

The simple dense neural network reaches an accuracy of 42.12% on the dataset, while the simple CNN reaches 68.66%. The CNN's advantage is unsurprising: a dense network has no built-in mechanism (such as kernels) for exploiting the spatial structure of image data, whereas the CNN does. Flattening the image and feeding it into the dense network discards the spatial arrangement of the pixels, which a CNN preserves by connecting local patches of pixels to neurons. Moreover, a dense layer on a flattened image needs one weight per pixel per neuron, so its parameter count grows quickly with image size and makes training computationally intensive, while a convolutional layer reuses the same small kernel across the whole image. The CNN can therefore extract spatial features automatically, whereas the dense network has to learn each pixel-to-feature relationship separately, making it a cumbersome choice for image data. Even so, the CNN's accuracy leaves a lot to be desired and can likely be improved with hyperparameter tuning and/or additional data augmentation.
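The parameter argument can be checked with a back-of-the-envelope count. The sketch below tallies weights and biases for the two models, assuming the layer sizes in the `ANNModel` and `CNNModel` definitions above and the default settings in box 6 (`hiddenSize = 100`, `numFilters = 25`). The helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


def ann_params(hidden_size=100):
    return (linear_params(1024, 100)
            + linear_params(100, 15 * 5 * 5)
            + linear_params(15 * 5 * 5, hidden_size)
            + linear_params(hidden_size, 10))


def cnn_params(hidden_size=100, out_channels=25):
    return (conv_params(3, 24, 5)
            + conv_params(24, out_channels, 5)
            + linear_params(out_channels * 5 * 5, hidden_size)
            + linear_params(hidden_size, 10))


print(linear_params(1024, 100))  # 102500 (first dense layer of the ANN)
print(conv_params(3, 24, 5))     # 1824   (first convolution of the CNN)
print(ann_params())              # 178985
print(cnn_params())              # 80459
```

Under these assumptions the ANN has roughly twice as many parameters as the CNN yet scores lower, which suggests that the accuracy gap comes from how the parameters are used rather than how many there are.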

### Question 2: Dropout Rate (25 points)

Explain the purpose of dropout in any neural network model. In doing so, note what can happen if the dropout rate is too high and what can happen if the dropout rate is too low.

**Answer:**

Neural network models tend to overfit the training dataset, leading to poor generalization on the test dataset. To counter this, dropout randomly drops some neurons during training, which is roughly equivalent to training many different thinned networks that share weights. This approximates model averaging, so that the model doesn't rely on just one configuration of neurons to make a prediction.

If the dropout rate is too high, most neurons are dropped at each step, resulting in slow convergence and underfitting. If the dropout rate is too low, its purpose (preventing overfitting) is lost, since the network is hardly different from the original one and generalization barely improves. Dropout rates therefore need to be tuned, possibly per layer and training stage. Note that the argument of `nn.Dropout` is the probability of dropping a neuron; a common starting point for hidden layers is a drop probability between 0.2 and 0.5 (equivalently, a keep probability between 0.5 and 0.8).
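The mechanism can be sketched in a few lines of plain Python. This is a minimal illustration of inverted dropout, the scheme `nn.Dropout` follows: during training each value is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, while at evaluation time the input passes through unchanged. The function name and the list-based interface are assumptions for illustration.

```python
import random


def dropout(values, rate, rng, training=True):
    """Inverted dropout on a list of activations."""
    if not training or rate == 0:
        return list(values)            # evaluation mode: identity
    if rate >= 1:
        return [0.0 for _ in values]   # everything is dropped
    keep = 1.0 - rate
    # Keep each value with probability `keep`, rescaling so the expected sum is unchanged
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10
print(dropout(activations, 0.5, rng))                  # mix of 0.0 and 2.0
print(dropout(activations, 0.5, rng, training=False))  # unchanged
```

With a rate close to 1 almost every entry becomes zero and little signal reaches the next layer; with a rate close to 0 the output is nearly identical to the input and the regularizing effect disappears.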

### Question 3: Kernel Size (25 points)

Explain the purpose of spatial filters (kernels) in a CNN. Additionally, explain where they fit into the overall architecture of the CNN in this coding example. Finally, explain what can happen if the kernel size is too large and what can happen if the kernel size is too small.

**Answer:**

Spatial filters (kernels) in a CNN extract features (such as edges and corners) from images, providing important signals for computer vision prediction tasks. They are essentially small matrices that slide over the input and output, at each position, the dot product of the kernel with the patch of input beneath it. The kernels share their parameters spatially.

Within the overall architecture of the CNN, kernels are generally followed by non-linearities (such as the ReLU function) and are interspersed with pooling layers that downsample the activation maps while preserving the spatial structure, followed by fully connected layers for the output prediction. In this notebook's CNN, `conv1` applies 24 kernels of size 5x5 to the RGB input and `conv2` applies `outChannels` kernels of size 5x5; each convolution is followed by the activation function and 2x2 max pooling, and the result is flattened and passed to two dense layers.

A very large kernel may capture global features but miss finer, local ones. Moreover, it increases the number of parameters and the computation time for training. A very small kernel, on the other hand, extracts extremely fine-grained, local features with little information from neighboring pixels.
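The sliding dot product can be illustrated without any library. The sketch below applies a square kernel to a single-channel image with stride 1 and no padding, which is the cross-correlation that `nn.Conv2d` computes per channel. The function name, the toy image, and the hand-picked edge kernel are assumptions for illustration; in the CNN the kernel values are learned.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product of the kernel with the k x k patch whose top-left corner is (i, j)
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# Kernel that responds to dark-to-bright transitions from left to right
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

print(conv2d_valid(image, edge_kernel))  # [[0, 3, 3], [0, 3, 3], [0, 3, 3]]
```

The output is large only where the patch straddles the edge. The output size also shows the kernel-size trade-off: a 1x1 kernel sees one pixel at a time and returns a 5x5 map, while a 5x5 kernel covers the whole toy image and returns a single value.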

### Question 4: Data Augmentation (40 points)

Use the code snippet provided in the next box to implement data augmentation by updating the contents of box 3 and re-running the model. Compare your accuracy without and with data augmentation and explain the results. In doing so, explain the purpose of data augmentation.

In [8]:
transforms.RandomRotation(5),

(RandomRotation(degrees=[-5.0, 5.0], interpolation=nearest, expand=False, fill=0),)

**Answer:**

With data augmentation, the accuracy of a simple dense neural network on the dataset is 41.68%, while the accuracy of a simple CNN on the dataset is 68.23%.

Without data augmentation, the accuracy of a simple dense neural network on the dataset is 42.12%, while the accuracy of a simple CNN on the dataset is 68.66%.

In both cases, the CNN performs better than the simple dense neural network. However, compared with the results without data augmentation, the accuracy of both models decreased slightly with data augmentation (by less than half a percentage point each). This is likely due to the small data size, batch size, computational limit, and/or lack of further variations in data augmentation. A gap this small may also fall within ordinary run-to-run variation, and a rotation of at most 5 degrees is a mild augmentation.

In general, data augmentation is used to augment the training data with artificially generated data involving the transformation of the input data (in this case, images) such that the original class label of this data is preserved. The number and variety of images in the training data can be increased by multiple techniques such as rotation, cropping, blurring, flipping, etc. In this case, only Random Rotation was used. Data augmentation is typically done so that the model can become more robust in identifying the diverse features and nuances of the training data, which would prevent it from memorizing the training data and overfitting. 
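As a label-preserving transformation that is easy to write by hand, the sketch below uses a random horizontal flip. This is a swapped-in technique chosen for simplicity, not the `RandomRotation(5)` used in this notebook, since rotation needs interpolation. The function names and the nested-list image layout (`[channels][height][width]`) are assumptions for illustration.

```python
import random


def random_horizontal_flip(image, rng, p=0.5):
    """Return a copy of a [channels][height][width] image, mirrored left-right with probability p."""
    if rng.random() < p:
        return [[row[::-1] for row in channel] for channel in image]
    return [[list(row) for row in channel] for channel in image]


def augmented_samples(image, label, copies, rng):
    """Generate `copies` randomly flipped variants that all keep the original label."""
    return [(random_horizontal_flip(image, rng), label) for _ in range(copies)]


# One-channel 2x2 toy image with a hypothetical class label of 3
toy = [[[1, 2],
        [3, 4]]]
print(random_horizontal_flip(toy, random.Random(0), p=1.0))  # [[[2, 1], [4, 3]]]

for variant, label in augmented_samples(toy, 3, 4, random.Random(0)):
    print(label, variant)
```

Each variant shows the same object under a different view, so the model sees more variety per class without any new labels being collected.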