STAT 479: Deep Learning (Spring 2019)  
Instructor: Sebastian Raschka (sraschka@wisc.edu)  
Course website: http://pages.stat.wisc.edu/~sraschka/teaching/stat479-ss2019/  
GitHub repository: https://github.com/rasbt/stat479-deep-learning-ss19

---

# Homework 4: Implementing a Convolutional Neural Network (40 pts)

In this 4th homework, your task is to implement a convolutional neural network for classifying images in the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html). 

### Dataset Overview

- The CIFAR-10 dataset contains 60,000 color images with pixel dimensions 32x32. 
- There are 50,000 training images and 10,000 test images.
- The snapshot below shows a random selection of images from each of the 10 object classes (from https://www.cs.toronto.edu/~kriz/cifar.html):

![](cifar-snapshot.png)

The CIFAR-10 dataset is already made accessible via the PyTorch API as it is a common dataset for benchmarking image classifiers. Hence, you do not have to download the dataset manually -- it will be downloaded automatically when you call

```python
train_dataset = datasets.CIFAR10(root='data', 
                                 train=True, 
                                 transform=transforms.ToTensor(),
                                 download=True)
```

in the provided code cells below for the first time. Thus, keep in mind that calling this function for the first time may be a bit slow depending on your internet connection. On a conventional internet connection, it should be downloaded in a matter of seconds, though.

Note that we are **not** using a separate validation dataset in this homework for tuning this network. This is intentional for the purposes of simplicity. However, in a real-world application, you are highly advised to use a validation dataset to tune the hyperparameters of the network as discussed in class.
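
As a rough illustration of the advice above, a validation split could be carved out of the 50,000 training images by shuffling the example indices once and holding out a portion of them. The sketch below uses only the standard library; the 5,000-image validation size and the helper name `split_indices` are assumptions for illustration and are not part of this homework.

```python
import random


def split_indices(num_examples, num_valid, seed=1):
    """Shuffle example indices and split them into train/validation parts."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    return indices[num_valid:], indices[:num_valid]


# 50,000 CIFAR-10 training images; hold out 5,000 for validation (assumed size)
train_idx, valid_idx = split_indices(num_examples=50000, num_valid=5000)
print(len(train_idx), len(valid_idx))
```

In PyTorch, index lists like these can be passed to `torch.utils.data.SubsetRandomSampler`, which is then handed to a `DataLoader` via its `sampler` argument (in that case, `shuffle` must not be set to `True`).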

### Your Tasks

Your main task is to implement a simple convolutional neural network that is loosely inspired by the AlexNet architecture that won the ImageNet competition in 2012: 

- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). [ImageNet classification with deep convolutional neural networks](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks). In Advances in Neural Information Processing Systems (pp. 1097-1105).

Then, you will make several simple modifications to this network architecture to improve its performance.

Note that in this homework, as explained above, you will NOT be working with ImageNet but CIFAR-10, which is a much smaller dataset, in order for you to be able to train the network in a timely manner.

In particular, you will be asked to first implement a basic convolutional neural network based on AlexNet and then make several improvements to optimize the performance and reduce overfitting. These "improvements" include Dropout, BatchNorm, and image augmentation, which will serve as a good exercise for familiarizing yourself with "Deep Learning Tricks" as well as convolutional neural networks.

Note that the homework is relatively easy and straightforward, but training the network in each of the 5 sections takes ~5 min on a GPU and will probably take much longer on a CPU. For that reason, and because you probably don't want your computer to overheat, I **highly recommend running this homework on a cloud server**, for example, Google Colab (which allows you to use a GPU for free). Since you don't have to download the dataset manually, it should be relatively straightforward to do this homework on Google Colab and then download the solution for submission via Canvas. Please let me know if you have any questions about that. For a cloud computing refresher, please see: https://github.com/rasbt/stat479-deep-learning-ss19/tree/master/L07_cloud-computing.

**The due date for this homework is Friday, 11 April 11:59 pm.** Please start as soon as possible, because while this homework is conceptually straightforward, you have to factor in the computation time (~30 min runtime for the complete notebook if you run this notebook with correct solutions).

## Imports

In [1]:
###############################
### NO NEED TO CHANGE THIS CELL
###############################

import os
import time

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

from torchvision import datasets
from torchvision import transforms

import matplotlib.pyplot as plt
from PIL import Image


if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True

In [2]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -ud -iv 

numpy       1.15.4
pandas      0.23.4
torch       1.0.1.post2
PIL.Image   5.3.0
Sebastian Raschka 
last updated: 2019-04-02


<br>
<br>

## Model Settings

In [3]:
###############################
### NO NEED TO CHANGE THIS CELL
###############################


#-------------------------
### SETTINGS
#-------------------------

# Hyperparameters
RANDOM_SEED = 1
LEARNING_RATE = 0.001
BATCH_SIZE = 256
NUM_EPOCHS = 20

# Architecture
NUM_FEATURES = 32*32
NUM_CLASSES = 10

# Other
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"  # fall back to CPU without a GPU

<br>
<br>

## Dataset

In [4]:
###############################
### NO NEED TO CHANGE THIS CELL
###############################

#-------------------------
### CIFAR-10 Dataset
#-------------------------


# Note transforms.ToTensor() scales input images
# to 0-1 range
train_dataset = datasets.CIFAR10(root='data', 
                                 train=True, 
                                 transform=transforms.ToTensor(),
                                 download=True)

test_dataset = datasets.CIFAR10(root='data', 
                                train=False, 
                                transform=transforms.ToTensor())


train_loader = DataLoader(dataset=train_dataset, 
                          batch_size=BATCH_SIZE, 
                          num_workers=8,
                          shuffle=True)

test_loader = DataLoader(dataset=test_dataset, 
                         batch_size=BATCH_SIZE,
                         num_workers=8,
                         shuffle=False)

# Checking the dataset
for images, labels in train_loader:  
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break

Files already downloaded and verified
Image batch dimensions: torch.Size([256, 3, 32, 32])
Image label dimensions: torch.Size([256])


In [5]:
###############################
### NO NEED TO CHANGE THIS CELL
###############################

def compute_epoch_loss(model, data_loader):
    model.eval()
    curr_loss, num_examples = 0., 0
    with torch.no_grad():
        for features, targets in data_loader:
            features = features.to(DEVICE)
            targets = targets.to(DEVICE)
            logits, probas = model(features)
            loss = F.cross_entropy(logits, targets, reduction='sum')
            num_examples += targets.size(0)
            curr_loss += loss

        curr_loss = curr_loss / num_examples
        return curr_loss


def compute_accuracy(model, data_loader, device):
    model.eval()
    correct_pred, num_examples = 0, 0
    for i, (features, targets) in enumerate(data_loader):
            
        features = features.to(device)
        targets = targets.to(device)

        logits, probas = model(features)
        _, predicted_labels = torch.max(probas, 1)
        num_examples += targets.size(0)
        correct_pred += (predicted_labels == targets).sum()
    return correct_pred.float()/num_examples * 100

<br>
<br>

## 1) Implement a Convolutional Neural Network

In this part, you will be implementing the AlexNet-variant that you will be using and modifying throughout this homework. On purpose, this will be a bit more "hands-off" than usual, so that you get a chance to practice implementing neural networks from scratch based on sketches and short descriptions (which is a useful real-world skill as it is quite common to reimplement architectures from literature in order to verify results and compare those architectures to your own methods).

The architecture is as follows:

![](architecture-1.png)

Note that I made this network based on AlexNet, as mentioned in the introduction, but there are some differences. Overall though, there are 7 hidden layers in total: 5 convolutional layers and 2 fully-connected layers. There is one output layer mapping the last hidden layer's activations to the classes. For this network, 

- all hidden layers are connected via ReLU activation functions
- the output layer uses a softmax activation function
- make sure you return the logits and the softmax output; the logits are used for computing the cross-entropy loss (instead of passing the softmax outputs) for numerical stability reasons as discussed in earlier lectures.
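
To illustrate the numerical stability point in the last bullet, here is a minimal standard-library sketch (not the course's implementation) that compares a naive softmax-then-log computation with a cross-entropy computed directly from the logits via the log-sum-exp trick. The function names and the example logits are made up for illustration.

```python
import math


def naive_cross_entropy(logits, target):
    """Softmax first, then log: exp() overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    probas = [e / sum(exps) for e in exps]
    return -math.log(probas[target])


def stable_cross_entropy(logits, target):
    """Cross-entropy from logits using the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps exp() in a safe range
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


large_logits = [1000.0, 990.0, 980.0]
try:
    naive_cross_entropy(large_logits, target=0)
    naive_ok = True
except OverflowError:
    naive_ok = False  # math.exp(1000.0) is out of range for a float

stable_loss = stable_cross_entropy(large_logits, target=0)
print(naive_ok, stable_loss)
```

This is the reason `F.cross_entropy` expects logits rather than softmax outputs: it applies the log-softmax internally.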

In [None]:
##########################
### MODEL
##########################

class ConvNet1(nn.Module):

    def __init__(self, num_classes=10):
        super(ConvNet1, self).__init__()

        self.conv1 =  nn.Conv2d(3, 64,    kernel_size=5, stride=1, padding=2)
        
        # ... add the remaining convolutional layers
        # and fully connected layers ...
        
        self.linear3 = nn.Linear(4096, num_classes)  
        

    def forward(self, x):
        
        # ... IMPLEMENT FORWARD PASS ...
        logits = # ...
        probas = F.softmax(logits, dim=1)
        return logits, probas

    
torch.manual_seed(RANDOM_SEED)

model1 = ConvNet1(NUM_CLASSES)
model1.to(DEVICE)

optimizer = torch.optim.Adam(model1.parameters(), lr=LEARNING_RATE)  

In [None]:
###############################
### NO NEED TO CHANGE THIS CELL
###############################

def train(model, train_loader, test_loader):

    minibatch_cost, epoch_cost = [], []
    start_time = time.time()
    for epoch in range(NUM_EPOCHS):

        model.train()
        for batch_idx, (features, targets) in enumerate(train_loader):

            features = features.to(DEVICE)
            targets = targets.to(DEVICE)

            ### FORWARD AND BACK PROP
            logits, probas = model(features)
            cost = F.cross_entropy(logits, targets)
            optimizer.zero_grad()

            cost.backward()
            minibatch_cost.append(cost.item())  # store a float, not the graph-attached tensor

            ### UPDATE MODEL PARAMETERS
            optimizer.step()

            ### LOGGING
            if not batch_idx % 150:
                print ('Epoch: %03d/%03d | Batch %04d/%04d | Cost: %.4f' 
                       %(epoch+1, NUM_EPOCHS, batch_idx, 
                         len(train_loader), cost))

    
        with torch.set_grad_enabled(False): # save memory during inference
            print('Epoch: %03d/%03d | Train: %.3f%%' % (
                  epoch+1, NUM_EPOCHS, 
                  compute_accuracy(model, train_loader, device=DEVICE)))
            
            cost = compute_epoch_loss(model, train_loader)
            epoch_cost.append(cost.item())  # store a float so it can be plotted

        print('Time elapsed: %.2f min' % ((time.time() - start_time)/60))

    print('Total Training Time: %.2f min' % ((time.time() - start_time)/60))


    with torch.set_grad_enabled(False): # save memory during inference
        print('Test accuracy: %.2f%%' % (compute_accuracy(model, test_loader, device=DEVICE)))

    print('Total Time: %.2f min' % ((time.time() - start_time)/60))
    
    return minibatch_cost, epoch_cost
    

minibatch_cost, epoch_cost = train(model1, train_loader, test_loader)


plt.plot(range(len(minibatch_cost)), minibatch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Minibatch')
plt.show()

plt.plot(range(len(epoch_cost)), epoch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Epoch')
plt.show()

In [None]:
del model1  # to save memory if you don't use it anymore

<br>
<br>

## 2) Adding Dropout

In this second part, your task is to add dropout layers to reduce overfitting. You can copy and paste your architecture from above and make the appropriate modifications. In particular,

- place a `Dropout2d` layer (also referred to as "spatial dropout"; this will be explained in the lecture) before each max-pooling layer with dropout probability p=0.2,
- place a regular dropout after each fully connected layer with probability p=0.5, except for the last (output) layer.
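
To make the difference between the two dropout variants concrete, here is a small standard-library sketch. It assumes "inverted" dropout, where surviving activations are rescaled by 1/(1-p) during training (the convention PyTorch's dropout layers follow); the helper names and toy inputs are hypothetical.

```python
import random


def dropout(values, p, rng):
    """Regular dropout: zero each value with probability p, rescale the rest."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


def spatial_dropout(channels, p, rng):
    """Spatial dropout: zero whole channels (feature maps) with probability p."""
    keep_scale = 1.0 / (1.0 - p)
    out = []
    for channel in channels:
        drop = rng.random() < p  # one decision per channel, not per pixel
        out.append([[0.0 if drop else v * keep_scale for v in row]
                    for row in channel])
    return out


rng = random.Random(1)
activations = [1.0] * 10
dropped = dropout(activations, p=0.5, rng=rng)

feature_maps = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(8)]  # 8 channels, 2x2
dropped_maps = spatial_dropout(feature_maps, p=0.2, rng=rng)
print(dropped)
```

Note that dropout is only active in training mode, which is why the provided helper functions call `model.eval()` before computing the loss and accuracy.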

The architecture is as follows (changes, compared to the previous section, are highlighted in red):

![](architecture-2.png)

In [None]:
##########################
### MODEL
##########################

class ConvNet2(nn.Module):

    def __init__(self, num_classes=10):
        super(ConvNet2, self).__init__()
        
        #### YOUR CODE
        

    def forward(self, x):

        #### YOUR CODE
        probas = F.softmax(logits, dim=1)
        return logits, probas

    
torch.manual_seed(RANDOM_SEED)

model2 = ConvNet2(NUM_CLASSES)
model2.to(DEVICE)

optimizer = torch.optim.Adam(model2.parameters(), lr=LEARNING_RATE)

minibatch_cost, epoch_cost = train(model2, train_loader, test_loader)


plt.plot(range(len(minibatch_cost)), minibatch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Minibatch')
plt.show()

plt.plot(range(len(epoch_cost)), epoch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Epoch')
plt.show()

In [None]:
del model2  # to save memory if you don't use it anymore

<br>
<br>

## 3) Add BatchNorm

In this 3rd part, you are now going to add BatchNorm layers to further improve the performance of the network. Use `BatchNorm2d` for the convolutional layers and `BatchNorm1d` for the fully connected layers.
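
As a reminder of what a BatchNorm layer computes during training, the following standard-library sketch normalizes a single feature over a minibatch and then applies a scale (`gamma`) and shift (`beta`). The function name and toy values are assumptions for illustration; `eps=1e-5` mirrors the PyTorch default.

```python
import math


def batchnorm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over the minibatch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


feature = [2.0, 4.0, 6.0, 8.0]  # one feature across a minibatch of 4
normalized = batchnorm_1d(feature)
print([round(x, 3) for x in normalized])
```

For convolutional layers, the same statistics are computed per channel over the batch, height, and width dimensions. Each normalized feature or channel has two learnable parameters (`gamma` and `beta`), which matters when you count parameters at the end of this homework.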


The architecture is as follows (changes, compared to the previous section, are highlighted in red):

![](architecture-3.png)

In [None]:
##########################
### MODEL
##########################

class ConvNet3(nn.Module):

    def __init__(self, num_classes=10):
        super(ConvNet3, self).__init__()
        
        #### YOUR CODE
        

    def forward(self, x):

        #### YOUR CODE

        probas = F.softmax(logits, dim=1)
        return logits, probas

    
torch.manual_seed(RANDOM_SEED)

model3 = ConvNet3(NUM_CLASSES)
model3.to(DEVICE)

optimizer = torch.optim.Adam(model3.parameters(), lr=LEARNING_RATE)

minibatch_cost, epoch_cost = train(model3, train_loader, test_loader)


plt.plot(range(len(minibatch_cost)), minibatch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Minibatch')
plt.show()

plt.plot(range(len(epoch_cost)), epoch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Epoch')
plt.show()

In [None]:
del model3  # to save memory if you don't use it anymore

<br>
<br>

## 4) Going All-Convolutional

In this 4th part, your task is to remove all max-pooling layers and replace the fully-connected layers with convolutional layers. Note that the number of elements in the activation tensors of the hidden layers should not change. That is, when you remove the max-pooling layers, you need to increase the stride of the convolutional layers from 1 to 2 to achieve the same downscaling. Furthermore, you can replace a fully-connected layer with a convolutional layer using stride=1 and a kernel with height and width equal to 1.
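
The effect of swapping max-pooling for a larger stride can be checked with the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below uses the `conv1` settings from section 1 (kernel size 5, padding 2); the 2x2 max-pooling with stride 2 is an assumption, since the pooling settings are only given in the architecture figure.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Stride-1 convolution followed by 2x2 max pooling with stride 2 (assumed)
after_conv = conv_output_size(32, kernel_size=5, stride=1, padding=2)
after_pool = conv_output_size(after_conv, kernel_size=2, stride=2, padding=0)

# Stride-2 convolution without pooling
after_strided_conv = conv_output_size(32, kernel_size=5, stride=2, padding=2)

print(after_conv, after_pool, after_strided_conv)  # prints: 32 16 16
```

Both routes reduce a 32x32 input to 16x16, so the activation tensor shapes stay the same after the pooling layers are removed.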

The new architecture is as follows (changes, compared to the previous section, are highlighted in red):

![](architecture-4.png)

In [None]:
##########################
### MODEL
##########################

class ConvNet4(nn.Module):

    def __init__(self, num_classes=10):
        super(ConvNet4, self).__init__()
        
        #### YOUR CODE
        

    def forward(self, x):
        #### YOUR CODE

        x = self.conv8(x)
        logits = x.view(x.size(0), NUM_CLASSES)
        probas = F.softmax(logits, dim=1)
        return logits, probas

    
torch.manual_seed(RANDOM_SEED)

model4 = ConvNet4(NUM_CLASSES)
model4.to(DEVICE)

optimizer = torch.optim.Adam(model4.parameters(), lr=LEARNING_RATE)

minibatch_cost, epoch_cost = train(model4, train_loader, test_loader)


plt.plot(range(len(minibatch_cost)), minibatch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Minibatch')
plt.show()

plt.plot(range(len(epoch_cost)), epoch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Epoch')
plt.show()

In [None]:
del model4

<br>
<br>

## 5) Add Image Augmentation

In this last section, you should use the architecture from the previous section (section 4) but use additional image augmentation during training to improve the generalization performance. 


In particular, you should modify the `train_transform = transforms.Compose([...])` pipeline so that it

- performs a random horizontal flip with probability 50%
- resizes the image from 32x32 to 40x40
- performs a 32x32 random crop from the 40x40 images
- normalizes the pixel intensities such that they are within the range [-1, 1]

The `test_transform = transforms.Compose([...])` pipeline should be modified accordingly, such that it 

- resizes the image from 32x32 to 40x40
- performs a 32x32 **center** crop from the 40x40 images
- normalizes the pixel intensities such that they are within the range [-1, 1]
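
The arithmetic behind the cropping and normalization steps can be sketched without torchvision. The example below assumes pixel intensities already scaled to [0, 1] (as `transforms.ToTensor()` does) and uses a mean and standard deviation of 0.5 to reach [-1, 1]; the helper names are hypothetical.

```python
import random


def normalize(pixel, mean=0.5, std=0.5):
    """Map a pixel intensity from [0, 1] to [-1, 1] via (x - mean) / std."""
    return (pixel - mean) / std


def crop_offset(full_size, crop_size, rng=None):
    """Top-left offset of a crop: random if rng is given, else centered."""
    max_offset = full_size - crop_size
    if rng is None:
        return max_offset // 2
    return rng.randint(0, max_offset)  # inclusive on both ends


print(normalize(0.0), normalize(0.5), normalize(1.0))  # prints: -1.0 0.0 1.0
center = crop_offset(40, 32)                  # centered crop, as at test time
rand = crop_offset(40, 32, random.Random(1))  # random crop, as in training
print(center, rand)
```

The random crop shifts the image by up to 8 pixels per axis during training, while the center crop always uses an offset of 4, which keeps the test-time evaluation deterministic.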

In [None]:
train_transform = transforms.Compose([
        #### YOUR CODE
])

test_transform = transforms.Compose([
        #### YOUR CODE
])


train_dataset = datasets.CIFAR10(root='data', 
                                 train=True, 
                                 transform=train_transform,
                                 download=True)

test_dataset = datasets.CIFAR10(root='data', 
                                train=False, 
                                transform=test_transform)


train_loader = DataLoader(dataset=train_dataset, 
                          batch_size=BATCH_SIZE, 
                          num_workers=8,
                          shuffle=True)

test_loader = DataLoader(dataset=test_dataset, 
                         batch_size=BATCH_SIZE,
                         num_workers=8,
                         shuffle=False)


torch.manual_seed(RANDOM_SEED)

model4 = ConvNet4(NUM_CLASSES)
model4.to(DEVICE)

optimizer = torch.optim.Adam(model4.parameters(), lr=LEARNING_RATE)

minibatch_cost, epoch_cost = train(model4, train_loader, test_loader)

<br>
<br>

## 6) Optional: Training the network for 200 epochs

In this optional section, train the network from the previous part for 200 epochs to see how it performs as the training loss converges. This will take about 50 minutes and is optional (you will not receive a penalty if you don't run this section).

In [None]:
NUM_EPOCHS = 200

torch.manual_seed(RANDOM_SEED)

model4 = ConvNet4(NUM_CLASSES)
model4.to(DEVICE)

optimizer = torch.optim.Adam(model4.parameters(), lr=LEARNING_RATE)

minibatch_cost, epoch_cost = train(model4, train_loader, test_loader)

In [None]:
plt.plot(range(len(minibatch_cost)), minibatch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Minibatch')
plt.show()

plt.plot(range(len(epoch_cost)), epoch_cost)
plt.ylabel('Cross Entropy')
plt.xlabel('Epoch')
plt.show()

<br>
<br>

## Conclusions (Your Answers Required)

Now that you have implemented the AlexNet-like architecture and made several modifications to it, please report the number of learnable parameters for each model, i.e., the weights, biases, batchnorm parameters, etc. (excluding the parameters of the Adam optimizer). Also, please paste the training and test set accuracies below.
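
As a starting point for the parameter computations, the counting rules can be written down as small helper functions. This is only a sketch: it assumes square kernels and the default `bias=True`, and the helper names are made up. The two examples use the layers whose shapes are given in the `ConvNet1` template in section 1.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights (out x in x k x k) plus one bias per output channel."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def linear_params(in_features, out_features):
    """Weights (out x in) plus one bias per output unit."""
    return out_features * (in_features + 1)


def batchnorm_params(num_features):
    """One learnable scale (gamma) and shift (beta) per feature or channel."""
    return 2 * num_features


print(conv2d_params(3, 64, kernel_size=5))  # conv1: 64 * (3*5*5 + 1) = 4864
print(linear_params(4096, 10))              # linear3: 10 * (4096 + 1) = 40970
```

Dropout layers have no learnable parameters. You can cross-check your manual totals in PyTorch with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.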


---

- **Model from section 1)**
    - Train accuracy: ???
    - Test accuracy: ???

Number of learnable parameters: (include your computation to receive partial points if the final answer is wrong)

[insert computation and answer]


- Conv2d (1)          ???
- Conv2d (2)          ???
- Conv2d (3)          ???
- Conv2d (4)          ???
- Conv2d (5)          ???
- FC (1)              ???
- FC (2)              ???
- FC (3)              ???
- Total number of parameters: ???

---

- **Model from section 2)**
    - Train accuracy: ???
    - Test accuracy: ???

Number of learnable parameters: (include your computation to receive partial points if the final answer is wrong)

[insert computation and answer]


- Conv2d (1)          ???
- Conv2d (2)          ???
- Conv2d (3)          ???
- Conv2d (4)          ???
- Conv2d (5)          ???
- FC (1)              ???
- FC (2)              ???
- FC (3)              ???
- Total number of parameters: ???



---

- **Model from section 3)**
    - Train accuracy: ???
    - Test accuracy: ???

Number of learnable parameters: (include your computation to receive partial points if the final answer is wrong)

[insert computation and answer]


- Conv2d (1)          ???
- Conv2d (2)          ???
- Conv2d (3)          ???
- Conv2d (4)          ???
- Conv2d (5)          ???
- FC (1)              ???
- FC (2)              ???
- FC (3)              ???
- Total number of parameters: ???



---

- **Model from section 4)**
    - Train accuracy: ???
    - Test accuracy: ???

Number of learnable parameters: (include your computation to receive partial points if the final answer is wrong)

[insert computation and answer]


- Conv2d (1)          ???
- Conv2d (2)          ???
- Conv2d (3)          ???
- Conv2d (4)          ???
- Conv2d (5)          ???
- FC (1)              ???
- FC (2)              ???
- FC (3)              ???
- Total number of parameters: ???



---

- **Model from section 5)**
    - Train accuracy: ???
    - Test accuracy: ???

Number of learnable parameters: (include your computation to receive partial points if the final answer is wrong)

[insert computation and answer]


- Conv2d (1)          ???
- Conv2d (2)          ???
- Conv2d (3)          ???
- Conv2d (4)          ???
- Conv2d (5)          ???
- FC (1)              ???
- FC (2)              ???
- FC (3)              ???
- Total number of parameters: ???



---

- **Model from section 6) [optional]**
    - Train accuracy: ???%
    - Test accuracy: ???%


---