# ResNets

This week we discussed ResNets, which are convolutional neural networks that use a residual learning framework: residual blocks built with shortcut connections. In this notebook, we'll implement a fully connected NN, a plain CNN, and a residual network, and compare their results on the MNIST dataset.

We'll start off by loading the data. We'll be working with the MNIST dataset, which provides 70,000 labeled 28x28 images of handwritten digits. Our goal is to construct a classifier that recognizes handwritten digits.

In [1]:
import torch
from torch import nn
import numpy as np
from sklearn.datasets import fetch_openml
import torch.utils.data as data_utils
from sklearn.model_selection import train_test_split
from torch import optim
import matplotlib.pyplot as plt
from torchvision import transforms

%matplotlib notebook

# Load the dataset... This can take a while
if("mnist" not in globals()):  # Don't load the dataset twice...
    mnist = fetch_openml('mnist_784', version=1)
    # Convert DataFrame to 28x28 numpy arrays...
    imgs = mnist.data.to_numpy().reshape(70000, 28, 28).astype(np.float32)
    # Labels for the mnist data, 0-9 being the number...
    labels = np.asarray(mnist.target).astype(int)
    
    imgs_train, imgs_test, labels_train, labels_test = train_test_split(
        imgs, labels, test_size=0.2, random_state=1
    )

Let's plot one of the digits at random to see what they look like.

In [2]:
idx = np.random.randint(0, imgs.shape[0])
single_digit = imgs[idx]
single_label = labels[idx]

fig, ax = plt.subplots(1, 1)
ax.set_title(f"Digit: {single_label}")
ax.imshow(single_digit, cmap="Greys")
fig.show()

We'll attempt to use a GPU to run these models. We'll check if one is available and assign it to our device.

In [3]:
is_cuda = torch.cuda.is_available()

if is_cuda:
    device = torch.device("cuda") # GPU
else:
    device = torch.device("cpu") # CPU
    
print(f"Running PyTorch Using: {device}")

Running PyTorch Using: cpu


Now we'll prepare our data for training and testing using PyTorch's DataLoader, which passes in samples in "minibatches" and reshuffles the data at every epoch to reduce overfitting.

In [4]:
# Set the batch size:
batch_size = 10

# Set up data loaders, these will be used to train and test models...
to_device = lambda a: torch.from_numpy(a).to(device)

train_loader = data_utils.DataLoader(
    data_utils.TensorDataset(to_device(imgs_train), to_device(labels_train)),
    batch_size = batch_size,
    shuffle = True
)

test_loader = data_utils.DataLoader(
    data_utils.TensorDataset(to_device(imgs_test), to_device(labels_test)),
    batch_size = batch_size,
    shuffle = True
)
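As a quick sanity check on the loaders above, the number of minibatches per epoch follows from the dataset size, the 80/20 split, and `batch_size`. The sketch below is plain Python arithmetic (the helper name `batches_per_epoch` is ours, not part of PyTorch), and it assumes the loader keeps a final partial batch, which is the `DataLoader` default.

```python
import math

def batches_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of minibatches per epoch when a final partial batch is kept."""
    return math.ceil(n_samples / batch_size)

n_total = 70_000                 # MNIST images loaded above
n_test = n_total * 20 // 100     # test_size=0.2 in train_test_split
n_train = n_total - n_test

print(batches_per_epoch(n_train, 10))  # training batches per epoch
print(batches_per_epoch(n_test, 10))   # test batches per epoch
```

With `batch_size = 10` this gives 5600 training batches, which matches the `Iter: .../5600` counter printed in the training logs.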

## Simple Model: Fully Connected NN

Now that we have our data set up, let's implement our first network: a fully connected neural network.

In [5]:
class SimpleNet(nn.Module):
    def __init__(
        self, 
        img_w: int, 
        img_h: int, 
        hidden_layer_sizes: list, 
        class_size: int
    ):
        super().__init__()
        
        # Store the flattened input size so forward() does not depend on global variables.
        self._in_features = img_w * img_h
        self._in_layer = nn.Linear(self._in_features, hidden_layer_sizes[0])
        
        layers = []
        for hls1, hls2 in zip(hidden_layer_sizes[:-1], hidden_layer_sizes[1:]):
            layers.extend([nn.Linear(hls1, hls2), nn.ReLU()])
        
        self._hidden_layers = nn.Sequential(*layers)
        
        self._out_layer = nn.Linear(hidden_layer_sizes[-1], class_size)
        self._softmax = nn.Softmax(dim=1)
        
    def forward(self, x: torch.tensor) -> torch.tensor:
        # Flatten each image into a single vector of pixels.
        x = x.reshape(-1, self._in_features)
        
        x = self._in_layer(x)
        x = self._hidden_layers(x)
        return self._softmax(self._out_layer(x))

Let's define some variables to use with our simple network.

In [6]:
# Set these!
img_w = 28  # Width of the images
img_h = 28  # Height of the images
hidden_layer_sizes = [5, 5, 5]  # Sizes of the hidden layers
class_size = 10  # Number of output classes (digits 0-9)

simple_net = SimpleNet(img_w, img_h, hidden_layer_sizes, class_size)
simple_net.to(device)

SimpleNet(
  (_in_layer): Linear(in_features=784, out_features=5, bias=True)
  (_hidden_layers): Sequential(
    (0): Linear(in_features=5, out_features=5, bias=True)
    (1): ReLU()
    (2): Linear(in_features=5, out_features=5, bias=True)
    (3): ReLU()
  )
  (_out_layer): Linear(in_features=5, out_features=10, bias=True)
  (_softmax): Softmax(dim=1)
)
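To get a feel for how small this network is, we can count its trainable parameters by hand from the layer sizes printed above. This is a plain-Python sketch; the helper names `linear_params` and `simple_net_params` are ours and only mirror the layer layout of `SimpleNet`.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def simple_net_params(img_w: int, img_h: int, hidden_layer_sizes: list, class_size: int) -> int:
    """Total parameters of the input, hidden and output Linear layers."""
    sizes = [img_w * img_h, *hidden_layer_sizes, class_size]
    return sum(linear_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

print(linear_params(28 * 28, 5))                    # first layer only
print(simple_net_params(28, 28, [5, 5, 5], 10))     # whole network
```

Nearly all of the parameters sit in the first layer, which squeezes 784 pixels into 5 features. With PyTorch at hand, `sum(p.numel() for p in simple_net.parameters())` should report the same total.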

We also want some functions to see how our network performs. These are defined below:

In [7]:
# Functions for training a model...
def train_model(
    model, 
    train_data, 
    test_data, 
    optimizer, 
    error_func, 
    n_epochs, 
    augment_method=None, 
    print_every=300
):
    for epoch_i in range(1, n_epochs + 1):
        for i, (img, label) in enumerate(train_data, 1):
            model.zero_grad()
            # If an augmentation method is passed, use it before passing the image to the model.
            predicted = model.forward(img if(augment_method is None) else augment_method(img))
            loss = error_func(predicted, label)
            loss.backward()
            optimizer.step()
            
            if((i % print_every == 0) or (i == len(train_data))):
                print(f"Epoch: {epoch_i}/{n_epochs}, Iter: {i}/{len(train_data)}, Loss: {loss:.04f}")
                
        # Run against the test set and train set at the end of each epoch to get accuracy...
        acc1 = get_accuracy(model, train_data, augment_method)
        print(f"Epoch {epoch_i} Train Accuracy: {acc1 * 100:.02f}%")
        acc2 = get_accuracy(model, test_data, augment_method)
        print(f"Epoch {epoch_i} Test Accuracy: {acc2 * 100:.02f}%\n")
    
    return model
        
        
def get_accuracy(model, data, im_mod = None):
    run = 0
    correct = 0

    for img, label in data:
        img = im_mod(img) if(im_mod is not None) else img   # Allows us to modify the images...
        run += len(img)
        result = model.forward(img).cpu().detach().numpy()
        correct += np.sum(np.argmax(result, axis=1) == label.cpu().detach().numpy())

    return correct / run

Let's set a number of epochs and a learning rate for the model and see how it does.

In [8]:
# Set these!
n_epochs = 15
lr = 1e-4

# Set up everything...
optimizer = optim.Adam(simple_net.parameters(), lr=lr)
loss_func = nn.CrossEntropyLoss()

In [9]:
# Train the model...
simple_net = train_model(simple_net, train_loader, test_loader, optimizer, loss_func, n_epochs)

Epoch: 1/15, Iter: 300/5600, Loss: 2.1479
Epoch: 1/15, Iter: 600/5600, Loss: 2.3159
Epoch: 1/15, Iter: 900/5600, Loss: 2.3064
Epoch: 1/15, Iter: 1200/5600, Loss: 2.2287
Epoch: 1/15, Iter: 1500/5600, Loss: 2.3111
Epoch: 1/15, Iter: 1800/5600, Loss: 2.2402
Epoch: 1/15, Iter: 2100/5600, Loss: 2.2040
Epoch: 1/15, Iter: 2400/5600, Loss: 2.3169
Epoch: 1/15, Iter: 2700/5600, Loss: 2.1422
Epoch: 1/15, Iter: 3000/5600, Loss: 2.0449
Epoch: 1/15, Iter: 3300/5600, Loss: 2.2247
Epoch: 1/15, Iter: 3600/5600, Loss: 2.2280
Epoch: 1/15, Iter: 3900/5600, Loss: 2.1281
Epoch: 1/15, Iter: 4200/5600, Loss: 2.0428
Epoch: 1/15, Iter: 4500/5600, Loss: 2.1480
Epoch: 1/15, Iter: 4800/5600, Loss: 2.3084
Epoch: 1/15, Iter: 5100/5600, Loss: 2.1504
Epoch: 1/15, Iter: 5400/5600, Loss: 2.2113
Epoch: 1/15, Iter: 5600/5600, Loss: 2.2139
Epoch 1 Train Accuracy: 28.02%
Epoch 1 Test Accuracy: 28.22%

Epoch: 2/15, Iter: 300/5600, Loss: 2.0729
Epoch: 2/15, Iter: 600/5600, Loss: 2.0748
Epoch: 2/15, Iter: 900/5600, Loss: 2.134

Epoch: 10/15, Iter: 2700/5600, Loss: 1.7436
Epoch: 10/15, Iter: 3000/5600, Loss: 1.5730
Epoch: 10/15, Iter: 3300/5600, Loss: 1.6703
Epoch: 10/15, Iter: 3600/5600, Loss: 1.6566
Epoch: 10/15, Iter: 3900/5600, Loss: 1.6592
Epoch: 10/15, Iter: 4200/5600, Loss: 1.6206
Epoch: 10/15, Iter: 4500/5600, Loss: 1.9964
Epoch: 10/15, Iter: 4800/5600, Loss: 1.8477
Epoch: 10/15, Iter: 5100/5600, Loss: 1.5663
Epoch: 10/15, Iter: 5400/5600, Loss: 1.6639
Epoch: 10/15, Iter: 5600/5600, Loss: 1.4902
Epoch 10 Train Accuracy: 80.50%
Epoch 10 Test Accuracy: 79.99%

Epoch: 11/15, Iter: 300/5600, Loss: 1.6688
Epoch: 11/15, Iter: 600/5600, Loss: 1.5785
Epoch: 11/15, Iter: 900/5600, Loss: 1.6835
Epoch: 11/15, Iter: 1200/5600, Loss: 1.5950
Epoch: 11/15, Iter: 1500/5600, Loss: 1.5664
Epoch: 11/15, Iter: 1800/5600, Loss: 1.4634
Epoch: 11/15, Iter: 2100/5600, Loss: 1.7228
Epoch: 11/15, Iter: 2400/5600, Loss: 1.4889
Epoch: 11/15, Iter: 2700/5600, Loss: 1.4686
Epoch: 11/15, Iter: 3000/5600, Loss: 1.7608
Epoch: 11/15, I

### Is this model robust to translations?

Our initial results are pretty good! Let's see the accuracy when we "shift" the image data around a little bit...

In [10]:
img_transform = transforms.RandomAffine((-10, 10), (0.2, 0.2), (0.8, 1.2), (-5, 5))

def img_shift_and_warp(img: torch.tensor) -> torch.tensor:
    if(len(img.shape) == 2):
        img = img.reshape(1, *img.shape)   
    img = img_transform(img)
    return img
    

# Show what img_shift_and_warp does to our images...
random_idx = np.random.randint(0, len(imgs))

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.set_title("Original Image")
ax2.set_title("Shifted Image")
ax1.imshow(imgs[random_idx], cmap="Greys")
ax2.imshow(img_shift_and_warp(torch.from_numpy(imgs[random_idx]))[0], cmap="Greys")

fig.show()
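The four positional arguments given to `transforms.RandomAffine` above are, in order, `degrees`, `translate`, `scale` and `shear`. `translate` is expressed as a fraction of the image size, so the sketch below converts it to pixels for our 28x28 digits. It is plain arithmetic and does not call torchvision; the variable names are ours.

```python
img_w, img_h = 28, 28

# Same values as passed to RandomAffine above.
degrees, translate, scale, shear = (-10, 10), (0.2, 0.2), (0.8, 1.2), (-5, 5)

# Maximum absolute shift in pixels along each axis.
max_dx = translate[0] * img_w
max_dy = translate[1] * img_h

print(f"rotation between {degrees[0]} and {degrees[1]} degrees")
print(f"shift of up to {max_dx:.1f} px horizontally and {max_dy:.1f} px vertically")
print(f"zoom factor between {scale[0]} and {scale[1]}")
print(f"shear between {shear[0]} and {shear[1]} degrees")
```

A shift of up to about 5.6 pixels is large relative to a 28-pixel-wide image, which helps explain why the augmented test set is so much harder.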

In [11]:
print(f"Normal Test Accuracy: {get_accuracy(simple_net, test_loader) * 100:.02f}%")
print(f"Augmented Test Accuracy: {get_accuracy(simple_net, test_loader, img_shift_and_warp) * 100:.02f}%")

Normal Test Accuracy: 80.87%
Augmented Test Accuracy: 22.56%


Yikes! Shifting the digits around destroys the accuracy of our simple fully connected network. Can we fix this through data augmentation?

In [12]:
# Train the model...
simple_net = train_model(simple_net, train_loader, test_loader, optimizer, loss_func, n_epochs, img_shift_and_warp)

Epoch: 1/15, Iter: 300/5600, Loss: 2.1107
Epoch: 1/15, Iter: 600/5600, Loss: 2.4468
Epoch: 1/15, Iter: 900/5600, Loss: 1.9593
Epoch: 1/15, Iter: 1200/5600, Loss: 2.1520
Epoch: 1/15, Iter: 1500/5600, Loss: 2.4082
Epoch: 1/15, Iter: 1800/5600, Loss: 2.4568
Epoch: 1/15, Iter: 2100/5600, Loss: 2.3871
Epoch: 1/15, Iter: 2400/5600, Loss: 2.2405
Epoch: 1/15, Iter: 2700/5600, Loss: 2.1792
Epoch: 1/15, Iter: 3000/5600, Loss: 2.4469
Epoch: 1/15, Iter: 3300/5600, Loss: 2.2720
Epoch: 1/15, Iter: 3600/5600, Loss: 2.3819
Epoch: 1/15, Iter: 3900/5600, Loss: 2.3634
Epoch: 1/15, Iter: 4200/5600, Loss: 2.2570
Epoch: 1/15, Iter: 4500/5600, Loss: 2.1936
Epoch: 1/15, Iter: 4800/5600, Loss: 2.1791
Epoch: 1/15, Iter: 5100/5600, Loss: 2.1778
Epoch: 1/15, Iter: 5400/5600, Loss: 2.3600
Epoch: 1/15, Iter: 5600/5600, Loss: 2.2356
Epoch 1 Train Accuracy: 27.02%
Epoch 1 Test Accuracy: 26.63%

Epoch: 2/15, Iter: 300/5600, Loss: 2.3521
Epoch: 2/15, Iter: 600/5600, Loss: 2.2553
Epoch: 2/15, Iter: 900/5600, Loss: 2.341

Epoch: 10/15, Iter: 2700/5600, Loss: 2.3420
Epoch: 10/15, Iter: 3000/5600, Loss: 2.1639
Epoch: 10/15, Iter: 3300/5600, Loss: 2.1686
Epoch: 10/15, Iter: 3600/5600, Loss: 2.0459
Epoch: 10/15, Iter: 3900/5600, Loss: 2.2221
Epoch: 10/15, Iter: 4200/5600, Loss: 2.2585
Epoch: 10/15, Iter: 4500/5600, Loss: 1.9659
Epoch: 10/15, Iter: 4800/5600, Loss: 1.9605
Epoch: 10/15, Iter: 5100/5600, Loss: 1.8629
Epoch: 10/15, Iter: 5400/5600, Loss: 2.0625
Epoch: 10/15, Iter: 5600/5600, Loss: 2.2526
Epoch 10 Train Accuracy: 37.22%
Epoch 10 Test Accuracy: 36.98%

Epoch: 11/15, Iter: 300/5600, Loss: 1.7620
Epoch: 11/15, Iter: 600/5600, Loss: 2.1262
Epoch: 11/15, Iter: 900/5600, Loss: 1.9238
Epoch: 11/15, Iter: 1200/5600, Loss: 2.0623
Epoch: 11/15, Iter: 1500/5600, Loss: 2.3997
Epoch: 11/15, Iter: 1800/5600, Loss: 2.0601
Epoch: 11/15, Iter: 2100/5600, Loss: 2.1320
Epoch: 11/15, Iter: 2400/5600, Loss: 2.1534
Epoch: 11/15, Iter: 2700/5600, Loss: 1.7854
Epoch: 11/15, Iter: 3000/5600, Loss: 2.1602
Epoch: 11/15, I

In [13]:
print(f"Normal Test Accuracy: {get_accuracy(simple_net, test_loader) * 100:.02f}%")
print(f"Augmented Test Accuracy: {get_accuracy(simple_net, test_loader, img_shift_and_warp) * 100:.02f}%")

Normal Test Accuracy: 60.59%
Augmented Test Accuracy: 39.92%


Even after training the network using augmented images, we can see that a fully connected network still performs worse on transformed images. Even more interesting, the network now performs worse on the original data.

### Question

__Why do you think the model performs more poorly on the original data when trained on augmented data?__

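Fully connected networks have trouble recognizing images that have been shifted, rotated, or scaled. In other words, they are not invariant to transformations such as translation and rotation: each weight is tied to a fixed pixel position, so a shifted digit looks like a different input. With only a few small hidden layers, the network likely lacks the capacity to fit both the original and the augmented images, so training on augmented data costs it accuracy on the originals.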
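A toy 1-D example can make this concrete. The sketch below is pure Python with made-up numbers (the names `shift_right`, `linear` and `conv_then_global_max` are ours). It compares one fully connected unit, whose weights are tied to positions, with one shared kernel followed by a global max. It is only an illustration: real CNNs are approximately, not exactly, shift invariant because of borders, strides and pooling windows.

```python
def shift_right(signal, k=1):
    """Shift a 1-D signal right by k positions, padding with zeros on the left."""
    return [0] * k + signal[:-k]

def linear(signal, weights):
    """One fully connected unit: each weight is tied to a fixed position."""
    return sum(s * w for s, w in zip(signal, weights))

def conv_then_global_max(signal, kernel):
    """Slide one shared kernel over the signal and keep the strongest response."""
    n = len(signal) - len(kernel) + 1
    responses = [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ]
    return max(responses)

digit = [0, 0, 1, 2, 1, 0, 0, 0]   # a small "stroke" in a 1-D image
shifted = shift_right(digit)       # the same stroke moved one pixel

fc_weights = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 2, 1]

print(linear(digit, fc_weights), linear(shifted, fc_weights))
print(conv_then_global_max(digit, kernel), conv_then_global_max(shifted, kernel))
```

The fully connected unit responds differently to the two inputs (16 versus 20), while the shared kernel with global pooling gives the same response (6) for both.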

## Using a ResNet

Now that we've looked at a fully connected network, let's see if we can achieve even better results with a residual network (ResNet).

We'll start by defining a residual block. Let's do a quick recap on what this is.

![residual_block.png](attachment:residual_block.png)

Here we let $H(\textbf{x})$ be the mapping that a few stacked nonlinear layers should fit, where $\textbf{x}$ is the input to the first of these layers. 
Rather than fitting $H$ directly, we'll let these stacked layers fit a residual mapping (the difference between the desired output and the input), $F(\textbf{x}) = H(\textbf{x}) - \textbf{x}$.
Our original function now becomes $H(\textbf{x}) = F(\textbf{x}) + \textbf{x}$. We expect this to be easier for the solvers to optimize as opposed to the original unreferenced mapping.

A residual block can be formally defined with the equation
$\textbf{y} = F(\textbf{x}, \{W_{i}\}) + W_s\textbf{x}$ where $\textbf{x}, \textbf{y}$ are the input and output vectors, $F(\textbf{x}, \{W_{i}\})$ is the residual mapping to be learned, and $W_s$ is a linear projection that matches the shortcut's dimensions to those of $F$ (it is simply the identity when the dimensions already agree).
For the example block shown here (what we'll be implementing) with two weight layers and a ReLU function, we let $F = W_2 \sigma (W_1\textbf{x})$, where $\sigma$ denotes the ReLU function and biases are omitted.

The main thing to take away from residual blocks (and what makes ResNets differ from plain CNNs) is the use of the identity shortcut connection, which allows the input to be added to the output of the stacked layers.
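To make the formulation above concrete, here is a minimal pure-Python sketch (no PyTorch) of a two-layer block with and without the shortcut. The helper names `matvec`, `relu`, `plain_block` and `residual_block` are ours, biases are omitted, and $W_s$ is taken to be the identity. With all weights set to zero, the residual block passes its non-negative input straight through, while the plain block outputs zeros.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_branch(x, W1, W2):
    """F(x) = W2 * relu(W1 * x), the stacked layers without a shortcut."""
    return matvec(W2, relu(matvec(W1, x)))

def plain_block(x, W1, W2):
    """Plain block: relu(F(x))."""
    return relu(residual_branch(x, W1, W2))

def residual_block(x, W1, W2):
    """Residual block: relu(F(x) + x), using an identity shortcut."""
    f = residual_branch(x, W1, W2)
    return relu([fi + xi for fi, xi in zip(f, x)])

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

print(plain_block(x, zeros, zeros))
print(residual_block(x, zeros, zeros))
```

This is one way to see why the identity mapping is easy for a residual block to represent: the stacked layers only need to drive $F$ towards zero, rather than learn the identity through a stack of nonlinear layers.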

In [14]:
class ResidualBlock(nn.Module):
    """
    A single residual block.
    """
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1):
        super().__init__()
        
        if(kernel_size % 2 == 0):
            raise ValueError("Kernel size must be odd!")
            
        self._in_channels = in_channels
        self._out_channels = out_channels
        
        self._conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding=kernel_size // 2)
        self._relu = nn.ReLU()
        self._conv2 = nn.Conv2d(out_channels, out_channels, kernel_size, 1, padding=kernel_size // 2)
        
        if(stride == 1 and in_channels == out_channels):
            self._identity = nn.Identity()
        else:
            # As the paper suggests, use a 1x1 convolution with the block's stride to
            # linearly project (and downsample) the input so it matches the output shape.
            self._identity = nn.Conv2d(in_channels, out_channels, (1, 1), stride)
        
        self._relu2 = nn.ReLU()
    
    def forward(self, x: torch.tensor) -> torch.tensor:
        # Non-Linear part...
        x_nonlinear = self._conv2(self._relu(self._conv1(x)))
        # Linear part...
        x_linear = self._identity(x)
        return self._relu2(x_nonlinear + x_linear)

    
class VanillaBlock(nn.Module):
    """
    A single vanilla block (no skip connections).
    """
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1):
        super().__init__()
        
        if(kernel_size % 2 == 0):
            raise ValueError("Kernel size must be odd!")
            
        self._in_channels = in_channels
        self._out_channels = out_channels
        
        self._conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding=kernel_size // 2)
        self._relu = nn.ReLU()
        self._conv2 = nn.Conv2d(out_channels, out_channels, kernel_size, 1, padding=kernel_size // 2)
        self._relu2 = nn.ReLU()
    
    def forward(self, x: torch.tensor) -> torch.tensor:
        x_nonlinear = self._conv2(self._relu(self._conv1(x)))     
        return self._relu2(x_nonlinear)
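For the addition in `ResidualBlock.forward` to work, the shortcut and the convolutional path must produce feature maps of the same size. The sketch below checks this with the standard convolution output-size formula (dilation 1). The helper names are ours, and the example sizes 14, 7 and 4 are an assumption about what a 28x28 input produces at the three stride-2 blocks.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def main_path_out(size: int, kernel: int = 3, stride: int = 2) -> int:
    """_conv1 (with the block's stride) followed by _conv2 (stride 1)."""
    padding = kernel // 2
    return conv_out(conv_out(size, kernel, stride, padding), kernel, 1, padding)

def shortcut_out(size: int, stride: int = 2) -> int:
    """1x1 convolution with the block's stride and no padding."""
    return conv_out(size, 1, stride, 0)

for size in (14, 7, 4):
    print(size, "->", main_path_out(size), shortcut_out(size))
```

Both paths agree for even and odd input sizes, which is what lets the strided 1x1 convolution stand in for the identity when a block downsamples.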


Now, we'll define a network capable of taking a layer description list and building a ResNet style CNN out of this list.

In [15]:
class ResNet(nn.Module):
    def __init__(self,
        in_channels: int,
        hidden_size: int, 
        output_size: int,
        start_depth: int = 64
    ):
        super().__init__()
        
        self._output_size = output_size
        self._in_channels = in_channels
        
        l_in = start_depth
        
        self._initial_conv = nn.Sequential(
            nn.Conv2d(in_channels, l_in, 7, stride=1, padding=3),  # Modification, remove stride, makes final image too small...
            nn.MaxPool2d(3, 2, padding=1)
        )
        
        # Output dims, kernel size, stride, amount...
        layers = self.get_layers(start_depth)
        
        self._final_layer_len = layers[-1][0]
        
        residual_layers = []
                
        for l_out, kernel, stride, count, block_cls in layers:
            for i in range(count):
                res = block_cls(l_in, l_out, kernel, stride)
                residual_layers.append(res)
                # Later blocks in the same group take the new channel count as input.
                l_in = l_out
        
        self._residual_blocks = nn.Sequential(*residual_layers)
        
        self._final_pooling = nn.AdaptiveAvgPool2d(1)
        
        self._fully_connected = nn.Sequential(
            nn.Linear(self._final_layer_len, hidden_size),
            nn.Linear(hidden_size, output_size)
        )
        self._softmax = nn.Softmax(dim=1)
        
    def forward(self, x: torch.tensor, run_pooling: bool=True) -> torch.tensor:
        batch_size = x.shape[0]
        im_shape = x.shape[1:] if(len(x.shape) == 3) else x.shape[2:]
        
        x = x.reshape(batch_size, self._in_channels, *im_shape)
        x = self._initial_conv(x)
        x = self._residual_blocks(x)
        
        if(run_pooling):
            x = torch.moveaxis(self._final_pooling(x), 1, -1)
            x = x.reshape(-1, self._final_layer_len)
            return self._softmax(self._fully_connected(x))
        else:
            # No pooling: Run the classifier on every pixel in the image...
            im_shape = x.shape[2:]
            x = torch.moveaxis(x, 1, -1).reshape(-1, self._final_layer_len)
            x = self._fully_connected(x)
            return x.reshape(batch_size, *im_shape, self._output_size)
            
    def get_layers(self, start_depth) -> list:
        raise NotImplementedError("Use one of the ResNet subclasses!")
        

Now we can use this class to define several types of ResNets and regular CNNs.

In [16]:
# Layers: A list of tuples which describe the following in order:
# (CNN layer depth, kernel size, stride, number of blocks, block class)
        
class VanillaNet18(ResNet):
    def get_layers(self, l_in: int) -> list:
        return [
            (l_in, 3, 1, 2, VanillaBlock),
            (l_in * 2, 3, 2, 1, VanillaBlock),
            (l_in * 2, 3, 1, 1, VanillaBlock),
            (l_in * 4, 3, 2, 1, VanillaBlock),
            (l_in * 4, 3, 1, 1, VanillaBlock),
            (l_in * 8, 3, 2, 1, VanillaBlock),
            (l_in * 8, 3, 1, 1, VanillaBlock)
        ]
    
class ResNet18(ResNet):
    def get_layers(self, l_in: int) -> list:
        return [
            (l_in, 3, 1, 2, ResidualBlock),
            (l_in * 2, 3, 2, 1, ResidualBlock),
            (l_in * 2, 3, 1, 1, ResidualBlock),
            (l_in * 4, 3, 2, 1, ResidualBlock),
            (l_in * 4, 3, 1, 1, ResidualBlock),
            (l_in * 8, 3, 2, 1, ResidualBlock),
            (l_in * 8, 3, 1, 1, ResidualBlock)
        ]
    
class VanillaNet34(ResNet):
    def get_layers(self, l_in: int) -> list:
        return [
            (l_in, 3, 1, 3, VanillaBlock),
            (l_in * 2, 3, 2, 1, VanillaBlock),
            (l_in * 2, 3, 1, 3, VanillaBlock),
            (l_in * 4, 3, 2, 1, VanillaBlock),
            (l_in * 4, 3, 1, 5, VanillaBlock),
            (l_in * 8, 3, 2, 1, VanillaBlock),
            (l_in * 8, 3, 1, 2, VanillaBlock)
        ]
    
class ResNet34(ResNet):
    def get_layers(self, l_in: int) -> list:
        return [
            (l_in, 3, 1, 3, ResidualBlock),
            (l_in * 2, 3, 2, 1, ResidualBlock),
            (l_in * 2, 3, 1, 3, ResidualBlock),
            (l_in * 4, 3, 2, 1, ResidualBlock),
            (l_in * 4, 3, 1, 5, ResidualBlock),
            (l_in * 8, 3, 2, 1, ResidualBlock),
            (l_in * 8, 3, 1, 2, ResidualBlock)
        ]
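The "18" and "34" in the class names can be recovered from the layer descriptions above. The sketch below counts weighted layers the way the original ResNet naming does: one stem convolution, two convolutions per block, and one final classifier layer. The function name `weighted_layers` is ours. As a caveat, the `ResNet` class in this notebook uses two `nn.Linear` layers in its head and adds 1x1 shortcut convolutions, so its literal layer count is slightly higher.

```python
# "Number of blocks" column from the layer descriptions above.
net18_block_counts = [2, 1, 1, 1, 1, 1, 1]
net34_block_counts = [3, 1, 3, 1, 5, 1, 2]

def weighted_layers(block_counts, convs_per_block=2, stem_convs=1, classifier_layers=1):
    """Stem convolution + convolutions inside the blocks + final classifier layer."""
    return stem_convs + convs_per_block * sum(block_counts) + classifier_layers

print(weighted_layers(net18_block_counts))
print(weighted_layers(net34_block_counts))
```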

We'll define a new ResNet (or non-ResNet) below.

In [17]:
# What values should be here?
in_channels = 1  # Number of channels in passed data...
output_size = 10  # Number of output nodes.
channel_start = 20  # Number of channels to start the residual blocks with.
hidden_layer_size = 20  # Size of the fully connected hidden layer at the end.
model_class = ResNet18  # The model class to use.


res_net = model_class(in_channels, hidden_layer_size, output_size, channel_start)
res_net.to(device)

ResNet18(
  (_initial_conv): Sequential(
    (0): Conv2d(1, 20, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (1): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  )
  (_residual_blocks): Sequential(
    (0): ResidualBlock(
      (_conv1): Conv2d(20, 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (_relu): ReLU()
      (_conv2): Conv2d(20, 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (_identity): Identity()
      (_relu2): ReLU()
    )
    (1): ResidualBlock(
      (_conv1): Conv2d(20, 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (_relu): ReLU()
      (_conv2): Conv2d(20, 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (_identity): Identity()
      (_relu2): ReLU()
    )
    (2): ResidualBlock(
      (_conv1): Conv2d(20, 40, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (_relu): ReLU()
      (_conv2): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (_identit

Now we can train the ResNet using the same methods as the simple model above.

In [18]:
# Set these! 
n_epochs2 = 10
lr2 = 1e-3

# Set up everything...
optimizer2 = optim.SGD(res_net.parameters(), lr=lr2, momentum=0.9)
loss_func2 = nn.CrossEntropyLoss()

In [19]:
# Train the model...
res_net = train_model(res_net, train_loader, test_loader, optimizer2, loss_func2, n_epochs2)

Epoch: 1/10, Iter: 300/5600, Loss: 2.0489
Epoch: 1/10, Iter: 600/5600, Loss: 1.9597
Epoch: 1/10, Iter: 900/5600, Loss: 1.9947
Epoch: 1/10, Iter: 1200/5600, Loss: 1.7181
Epoch: 1/10, Iter: 1500/5600, Loss: 1.9868
Epoch: 1/10, Iter: 1800/5600, Loss: 1.7549
Epoch: 1/10, Iter: 2100/5600, Loss: 2.0530
Epoch: 1/10, Iter: 2400/5600, Loss: 1.6162
Epoch: 1/10, Iter: 2700/5600, Loss: 1.5823
Epoch: 1/10, Iter: 3000/5600, Loss: 1.5750
Epoch: 1/10, Iter: 3300/5600, Loss: 1.5622
Epoch: 1/10, Iter: 3600/5600, Loss: 1.6630
Epoch: 1/10, Iter: 3900/5600, Loss: 1.4653
Epoch: 1/10, Iter: 4200/5600, Loss: 1.5681
Epoch: 1/10, Iter: 4500/5600, Loss: 1.6624
Epoch: 1/10, Iter: 4800/5600, Loss: 1.5612
Epoch: 1/10, Iter: 5100/5600, Loss: 1.4615
Epoch: 1/10, Iter: 5400/5600, Loss: 1.4619
Epoch: 1/10, Iter: 5600/5600, Loss: 1.4612
Epoch 1 Train Accuracy: 95.43%
Epoch 1 Test Accuracy: 94.73%

Epoch: 2/10, Iter: 300/5600, Loss: 1.5610
Epoch: 2/10, Iter: 600/5600, Loss: 1.4612
Epoch: 2/10, Iter: 900/5600, Loss: 1.494

Epoch: 10/10, Iter: 2700/5600, Loss: 1.4612
Epoch: 10/10, Iter: 3000/5600, Loss: 1.4612
Epoch: 10/10, Iter: 3300/5600, Loss: 1.5612
Epoch: 10/10, Iter: 3600/5600, Loss: 1.4612
Epoch: 10/10, Iter: 3900/5600, Loss: 1.4612
Epoch: 10/10, Iter: 4200/5600, Loss: 1.4612
Epoch: 10/10, Iter: 4500/5600, Loss: 1.4612
Epoch: 10/10, Iter: 4800/5600, Loss: 1.4612
Epoch: 10/10, Iter: 5100/5600, Loss: 1.4612
Epoch: 10/10, Iter: 5400/5600, Loss: 1.4612
Epoch: 10/10, Iter: 5600/5600, Loss: 1.4612
Epoch 10 Train Accuracy: 99.18%
Epoch 10 Test Accuracy: 98.80%
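The loss values above repeatedly settle at exactly 1.4612 rather than approaching zero. One plausible explanation is that the models end in `nn.Softmax`, while `nn.CrossEntropyLoss` applies a log-softmax to whatever it receives, so the probabilities are normalized a second time. The pure-Python sketch below reproduces that computation for a single sample; the function name `cross_entropy_from_scores` is ours, and the "one confident mistake" case is an assumed scenario.

```python
import math

def cross_entropy_from_scores(scores, target):
    """Per-sample cross entropy with an internal log-softmax: -log_softmax(scores)[target]."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target]

# Best case for a model that already applied softmax: probability 1 on the correct class.
perfect_probs = [1.0] + [0.0] * 9
floor = cross_entropy_from_scores(perfect_probs, target=0)
print(f"{floor:.4f}")

# A batch of 10 with nine such predictions and one fully confident mistake.
wrong = cross_entropy_from_scores(perfect_probs, target=1)
print(f"{(9 * floor + wrong) / 10:.4f}")
```

The first value is 1.4612 and the second is 1.5612, both of which appear in the logs. If this reading is right, returning the raw scores from `forward` and leaving the softmax to the loss function would remove the floor. The results in this notebook were produced with the code as written.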



Let's see how this model performs on augmented data.

In [20]:
print(f"Normal Test Accuracy: {get_accuracy(res_net, test_loader) * 100:.02f}%")
print(f"Augmented Test Accuracy: {get_accuracy(res_net, test_loader, img_shift_and_warp) * 100:.02f}%")

Normal Test Accuracy: 98.80%
Augmented Test Accuracy: 48.93%


The model still performs poorly on augmented images, but can we achieve better performance by training on the randomly augmented data?

In [21]:
# Retrain using image warping...
res_net = train_model(res_net, train_loader, test_loader, optimizer2, loss_func2, n_epochs2, img_shift_and_warp)

Epoch: 1/10, Iter: 300/5600, Loss: 1.5956
Epoch: 1/10, Iter: 600/5600, Loss: 2.1445
Epoch: 1/10, Iter: 900/5600, Loss: 1.6051
Epoch: 1/10, Iter: 1200/5600, Loss: 1.7611
Epoch: 1/10, Iter: 1500/5600, Loss: 1.6348
Epoch: 1/10, Iter: 1800/5600, Loss: 1.4612
Epoch: 1/10, Iter: 2100/5600, Loss: 1.4612
Epoch: 1/10, Iter: 2400/5600, Loss: 1.5675
Epoch: 1/10, Iter: 2700/5600, Loss: 1.5597
Epoch: 1/10, Iter: 3000/5600, Loss: 1.5386
Epoch: 1/10, Iter: 3300/5600, Loss: 1.5609
Epoch: 1/10, Iter: 3600/5600, Loss: 1.4875
Epoch: 1/10, Iter: 3900/5600, Loss: 1.7607
Epoch: 1/10, Iter: 4200/5600, Loss: 1.4612
Epoch: 1/10, Iter: 4500/5600, Loss: 1.5616
Epoch: 1/10, Iter: 4800/5600, Loss: 1.5891
Epoch: 1/10, Iter: 5100/5600, Loss: 1.5612
Epoch: 1/10, Iter: 5400/5600, Loss: 1.5611
Epoch: 1/10, Iter: 5600/5600, Loss: 1.5097
Epoch 1 Train Accuracy: 92.44%
Epoch 1 Test Accuracy: 91.95%

Epoch: 2/10, Iter: 300/5600, Loss: 1.4612
Epoch: 2/10, Iter: 600/5600, Loss: 1.7604
Epoch: 2/10, Iter: 900/5600, Loss: 1.858

Epoch: 10/10, Iter: 2700/5600, Loss: 1.4612
Epoch: 10/10, Iter: 3000/5600, Loss: 1.4612
Epoch: 10/10, Iter: 3300/5600, Loss: 1.4612
Epoch: 10/10, Iter: 3600/5600, Loss: 1.6611
Epoch: 10/10, Iter: 3900/5600, Loss: 1.4612
Epoch: 10/10, Iter: 4200/5600, Loss: 1.4612
Epoch: 10/10, Iter: 4500/5600, Loss: 1.4612
Epoch: 10/10, Iter: 4800/5600, Loss: 1.4612
Epoch: 10/10, Iter: 5100/5600, Loss: 1.4612
Epoch: 10/10, Iter: 5400/5600, Loss: 1.4612
Epoch: 10/10, Iter: 5600/5600, Loss: 1.5612
Epoch 10 Train Accuracy: 94.97%
Epoch 10 Test Accuracy: 94.38%



In [22]:
print(f"Normal Test Accuracy: {get_accuracy(res_net, test_loader) * 100:.02f}%")
print(f"Augmented Test Accuracy: {get_accuracy(res_net, test_loader, img_shift_and_warp) * 100:.02f}%")

Normal Test Accuracy: 96.51%
Augmented Test Accuracy: 94.63%


In [23]:
model_class = ResNet18  # The model class to use.
res_net0 = model_class(in_channels, hidden_layer_size, output_size, channel_start)
res_net0.to(device)
optimizer2 = optim.SGD(res_net0.parameters(), lr=lr2, momentum=0.9)
res_net0 = train_model(res_net0, train_loader, test_loader, optimizer2, loss_func2, n_epochs2, img_shift_and_warp)
print(f"Normal Test Accuracy: {get_accuracy(res_net0, test_loader) * 100:.02f}%")
print(f"Augmented Test Accuracy: {get_accuracy(res_net0, test_loader, img_shift_and_warp) * 100:.02f}%")

Epoch: 1/10, Iter: 300/5600, Loss: 2.2254
Epoch: 1/10, Iter: 600/5600, Loss: 2.2136
Epoch: 1/10, Iter: 900/5600, Loss: 2.2299
Epoch: 1/10, Iter: 1200/5600, Loss: 1.9648
Epoch: 1/10, Iter: 1500/5600, Loss: 2.0171
Epoch: 1/10, Iter: 1800/5600, Loss: 1.7678
Epoch: 1/10, Iter: 2100/5600, Loss: 2.1211
Epoch: 1/10, Iter: 2400/5600, Loss: 1.7940
Epoch: 1/10, Iter: 2700/5600, Loss: 2.1205
Epoch: 1/10, Iter: 3000/5600, Loss: 1.9226
Epoch: 1/10, Iter: 3300/5600, Loss: 1.7203
Epoch: 1/10, Iter: 3600/5600, Loss: 1.8044
Epoch: 1/10, Iter: 3900/5600, Loss: 1.6920
Epoch: 1/10, Iter: 4200/5600, Loss: 1.6621
Epoch: 1/10, Iter: 4500/5600, Loss: 1.9606
Epoch: 1/10, Iter: 4800/5600, Loss: 1.8294
Epoch: 1/10, Iter: 5100/5600, Loss: 1.6111
Epoch: 1/10, Iter: 5400/5600, Loss: 1.7171
Epoch: 1/10, Iter: 5600/5600, Loss: 1.6599
Epoch 1 Train Accuracy: 81.67%
Epoch 1 Test Accuracy: 81.34%

Epoch: 2/10, Iter: 300/5600, Loss: 1.6148
Epoch: 2/10, Iter: 600/5600, Loss: 1.7609
Epoch: 2/10, Iter: 900/5600, Loss: 1.494

Epoch: 10/10, Iter: 2700/5600, Loss: 1.4612
Epoch: 10/10, Iter: 3000/5600, Loss: 1.4632
Epoch: 10/10, Iter: 3300/5600, Loss: 1.4612
Epoch: 10/10, Iter: 3600/5600, Loss: 1.5612
Epoch: 10/10, Iter: 3900/5600, Loss: 1.4612
Epoch: 10/10, Iter: 4200/5600, Loss: 1.5877
Epoch: 10/10, Iter: 4500/5600, Loss: 1.5612
Epoch: 10/10, Iter: 4800/5600, Loss: 1.5612
Epoch: 10/10, Iter: 5100/5600, Loss: 1.4612
Epoch: 10/10, Iter: 5400/5600, Loss: 1.5612
Epoch: 10/10, Iter: 5600/5600, Loss: 1.5612
Epoch 10 Train Accuracy: 94.96%
Epoch 10 Test Accuracy: 94.84%

Normal Test Accuracy: 96.79%
Augmented Test Accuracy: 94.63%


In [24]:
model_class = VanillaNet34 # The model class to use.
res_net1 = model_class(in_channels, hidden_layer_size, output_size, channel_start)
res_net1.to(device)
optimizer2 = optim.SGD(res_net1.parameters(), lr=lr2, momentum=0.9)
res_net1 = train_model(res_net1, train_loader, test_loader, optimizer2, loss_func2, n_epochs2, img_shift_and_warp)
print(f"Normal Test Accuracy: {get_accuracy(res_net1, test_loader) * 100:.02f}%")
print(f"Augmented Test Accuracy: {get_accuracy(res_net1, test_loader, img_shift_and_warp) * 100:.02f}%")

Epoch: 1/10, Iter: 300/5600, Loss: 2.3001
Epoch: 1/10, Iter: 600/5600, Loss: 2.3032
Epoch: 1/10, Iter: 900/5600, Loss: 2.2973
Epoch: 1/10, Iter: 1200/5600, Loss: 2.3052
Epoch: 1/10, Iter: 1500/5600, Loss: 2.3046
Epoch: 1/10, Iter: 1800/5600, Loss: 2.3051
Epoch: 1/10, Iter: 2100/5600, Loss: 2.3037
Epoch: 1/10, Iter: 2400/5600, Loss: 2.2990
Epoch: 1/10, Iter: 2700/5600, Loss: 2.2986
Epoch: 1/10, Iter: 3000/5600, Loss: 2.3005
Epoch: 1/10, Iter: 3300/5600, Loss: 2.3001
Epoch: 1/10, Iter: 3600/5600, Loss: 2.3008
Epoch: 1/10, Iter: 3900/5600, Loss: 2.3006
Epoch: 1/10, Iter: 4200/5600, Loss: 2.3035
Epoch: 1/10, Iter: 4500/5600, Loss: 2.3040
Epoch: 1/10, Iter: 4800/5600, Loss: 2.2998
Epoch: 1/10, Iter: 5100/5600, Loss: 2.3060
Epoch: 1/10, Iter: 5400/5600, Loss: 2.3042
Epoch: 1/10, Iter: 5600/5600, Loss: 2.3027
Epoch 1 Train Accuracy: 11.15%
Epoch 1 Test Accuracy: 11.66%

Epoch: 2/10, Iter: 300/5600, Loss: 2.3062
Epoch: 2/10, Iter: 600/5600, Loss: 2.3046
Epoch: 2/10, Iter: 900/5600, Loss: 2.305

Epoch: 10/10, Iter: 2700/5600, Loss: 2.3100
Epoch: 10/10, Iter: 3000/5600, Loss: 2.3005
Epoch: 10/10, Iter: 3300/5600, Loss: 2.3118
Epoch: 10/10, Iter: 3600/5600, Loss: 2.3158
Epoch: 10/10, Iter: 3900/5600, Loss: 2.3039
Epoch: 10/10, Iter: 4200/5600, Loss: 2.3117
Epoch: 10/10, Iter: 4500/5600, Loss: 2.3093
Epoch: 10/10, Iter: 4800/5600, Loss: 2.3111
Epoch: 10/10, Iter: 5100/5600, Loss: 2.2935
Epoch: 10/10, Iter: 5400/5600, Loss: 2.2951
Epoch: 10/10, Iter: 5600/5600, Loss: 2.2963
Epoch 10 Train Accuracy: 11.15%
Epoch 10 Test Accuracy: 11.66%

Normal Test Accuracy: 11.66%
Augmented Test Accuracy: 11.66%


In [25]:
model_class = VanillaNet18 # The model class to use.
res_net2 = model_class(in_channels, hidden_layer_size, output_size, channel_start)
res_net2.to(device)
optimizer2 = optim.SGD(res_net2.parameters(), lr=lr2, momentum=0.9)
res_net2 = train_model(res_net2, train_loader, test_loader, optimizer2, loss_func2, n_epochs2, img_shift_and_warp)
print(f"Normal Test Accuracy: {get_accuracy(res_net2, test_loader) * 100:.02f}%")
print(f"Augmented Test Accuracy: {get_accuracy(res_net2, test_loader, img_shift_and_warp) * 100:.02f}%")

Epoch: 1/10, Iter: 300/5600, Loss: 2.3046
Epoch: 1/10, Iter: 600/5600, Loss: 2.3066
Epoch: 1/10, Iter: 900/5600, Loss: 2.2950
Epoch: 1/10, Iter: 1200/5600, Loss: 2.3003
Epoch: 1/10, Iter: 1500/5600, Loss: 2.3070
Epoch: 1/10, Iter: 1800/5600, Loss: 2.3010
Epoch: 1/10, Iter: 2100/5600, Loss: 2.3021
Epoch: 1/10, Iter: 2400/5600, Loss: 2.3009
Epoch: 1/10, Iter: 2700/5600, Loss: 2.2974
Epoch: 1/10, Iter: 3000/5600, Loss: 2.3035
Epoch: 1/10, Iter: 3300/5600, Loss: 2.3055
Epoch: 1/10, Iter: 3600/5600, Loss: 2.3028
Epoch: 1/10, Iter: 3900/5600, Loss: 2.3038
Epoch: 1/10, Iter: 4200/5600, Loss: 2.3066
Epoch: 1/10, Iter: 4500/5600, Loss: 2.3002
Epoch: 1/10, Iter: 4800/5600, Loss: 2.3088
Epoch: 1/10, Iter: 5100/5600, Loss: 2.2994
Epoch: 1/10, Iter: 5400/5600, Loss: 2.3050
Epoch: 1/10, Iter: 5600/5600, Loss: 2.3019
Epoch 1 Train Accuracy: 9.97%
Epoch 1 Test Accuracy: 9.81%

Epoch: 2/10, Iter: 300/5600, Loss: 2.3022
Epoch: 2/10, Iter: 600/5600, Loss: 2.3036
Epoch: 2/10, Iter: 900/5600, Loss: 2.3102


Epoch: 10/10, Iter: 2700/5600, Loss: 2.3004
Epoch: 10/10, Iter: 3000/5600, Loss: 2.3088
Epoch: 10/10, Iter: 3300/5600, Loss: 2.2913
Epoch: 10/10, Iter: 3600/5600, Loss: 2.2793
Epoch: 10/10, Iter: 3900/5600, Loss: 2.2939
Epoch: 10/10, Iter: 4200/5600, Loss: 2.3056
Epoch: 10/10, Iter: 4500/5600, Loss: 2.3055
Epoch: 10/10, Iter: 4800/5600, Loss: 2.3052
Epoch: 10/10, Iter: 5100/5600, Loss: 2.2960
Epoch: 10/10, Iter: 5400/5600, Loss: 2.2888
Epoch: 10/10, Iter: 5600/5600, Loss: 2.2897
Epoch 10 Train Accuracy: 11.15%
Epoch 10 Test Accuracy: 11.66%

Normal Test Accuracy: 11.66%
Augmented Test Accuracy: 11.66%


### Questions

__What results do you get using the ResNet? How do these differ from the simple fully connected network, and why do you think the results differ?__

The ResNet results are far better for both normal test accuracy (96.51% compared to 60.59%) and augmented test accuracy (94.63% compared to 39.92%). While some of this improvement may be attributable to the residual connections, much of it is likely due to the convolutional layers, which detect local features regardless of where they appear in the image.

__Try experimenting with different models above (including non-residual CNNs). How do your results differ from the regular ResNet?__

While the ResNets of both sizes perform nearly equivalently, the non-residual CNNs struggle to train beyond chance-level accuracy (11.66% on the test set). This is likely due to the optimization difficulties of deep plain networks; the shortcut connections in residual networks make it much easier to carry the identity mapping forward.
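The plain networks' loss hovering around 2.30 is itself informative. For a 10-class problem, a classifier that spreads its prediction evenly over all classes has a cross-entropy loss of $\ln 10$. The sketch below is a small arithmetic check; the function name `uniform_prediction_loss` is ours.

```python
import math

def uniform_prediction_loss(n_classes: int) -> float:
    """Cross entropy when every class receives probability 1 / n_classes."""
    return -math.log(1.0 / n_classes)

chance_loss = uniform_prediction_loss(10)
chance_accuracy = 1.0 / 10

print(f"{chance_loss:.4f}")
print(f"{chance_accuracy:.0%}")
```

A loss that stays near this value for all ten epochs suggests the plain networks never move away from an essentially uninformative prediction.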

### Bonus Exercise: Localized Results

You may wonder if the CNN is not only able to tell us what digit we're looking at, but also where it is located. We can test if the CNN is providing localized results by removing the final average pooling layer, and applying the fully connected layers on every pixel instead of on the average of the pixels. We'll also pass the CNN an image with multiple numbers, to see if it is capable of identifying several numbers across the image.

__What results do you observe below? Is the ResNet able to provide localized results?__

In [None]:
for img, label in train_loader:
    fig = plt.figure(figsize=(6, 20))
    axs = fig.subplots(11, 1)
    
    ax1, axs = axs[0], axs[1:]
    
    # Take the first set of numbers...
    num_nums = min(batch_size, 10)
    
    img = torch.cat([p for p in img[:num_nums]], dim=1)
    
    # Network doesn't apply the softmax when we disable pooling, so we apply it here.
    # This places all the scores between 0 and 1, and emphasises the highest score. 
    single_res_grid = torch.softmax(res_net.forward(img.reshape(1, 28, num_nums * 28), False), -1)[0]
    
    ax1.imshow(img.cpu().detach().numpy(), cmap="Greys")
    ax1.set_title(f"Original Image\n(Labels: {[int(l) for l in label[:num_nums]]})")
    
    for i, ax in enumerate(axs):
        ax.set_title(f"Heatmap of CNN Output: {i}")
        # Show the per-position score for digit class i.
        ax.imshow(single_res_grid[:, :, i].cpu().detach().numpy(), vmin=0, vmax=1)
    
    fig.tight_layout()
    fig.show()
    
    break
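When reading the heatmaps, it helps to know how coarse they are. The sketch below traces one spatial dimension through the network, assuming `res_net` is the `ResNet18` configuration defined earlier (a 7x7 stem with stride 1, a 3x3 max pool with stride 2, and three stride-2 blocks). The helper names are ours. Stride-1 blocks keep the size unchanged, so they are skipped.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or max pool along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def feature_map_size(size: int, n_downsampling_blocks: int = 3) -> int:
    """Size of the final feature map along one dimension, before average pooling."""
    size = conv_out(size, 7, 1, 3)       # initial 7x7 convolution, stride 1
    size = conv_out(size, 3, 2, 1)       # 3x3 max pooling, stride 2
    for _ in range(n_downsampling_blocks):
        size = conv_out(size, 3, 2, 1)   # first convolution of each stride-2 block
    return size

num_nums = 10
height = feature_map_size(28)
width = feature_map_size(num_nums * 28)
print(height, width)
```

Under these assumptions, each heatmap is a 2 x 18 grid for ten concatenated digits, so each digit spans a little under two columns. Any localization we observe is therefore quite coarse.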