# Assignment 4: Self-Attention for Vision

In this assignment, we implement self-attention blocks in a convolutional neural network for CIFAR-10 classification.

# Part I. Preparation

First, we load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

In [2]:
NUM_TRAIN = 49000

# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
                T.ToTensor(),
                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
            ])

# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
cifar10_train = dset.CIFAR10('./data/datasets', train=True, download=True,
                             transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10('./data/datasets', train=True, download=True,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10('./data/datasets', train=False, download=True, 
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/datasets/cifar-10-python.tar.gz


100%|████████████████████████| 170498071/170498071 [00:17<00:00, 9543232.34it/s]


Extracting ./data/datasets/cifar-10-python.tar.gz to ./data/datasets
Files already downloaded and verified
Files already downloaded and verified
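
The train/val split described in the comments above can be checked without loading any data. The following is a minimal sketch that assumes only the constants from the cell (`NUM_TRAIN = 49000`, 50,000 CIFAR-10 training images, `batch_size=64`); the names `TOTAL_TRAIN_IMAGES` and `BATCH_SIZE` are introduced here for illustration.

```python
import math

NUM_TRAIN = 49000            # from the cell above
TOTAL_TRAIN_IMAGES = 50000   # size of the CIFAR-10 training set (illustrative name)
BATCH_SIZE = 64              # batch_size passed to each DataLoader

# Index ranges handed to the two SubsetRandomSampler objects.
train_indices = range(NUM_TRAIN)
val_indices = range(NUM_TRAIN, TOTAL_TRAIN_IMAGES)

# The two splits share no example.
splits_are_disjoint = set(train_indices).isdisjoint(val_indices)

# Number of minibatches per pass (the last batch may be smaller).
train_batches = math.ceil(len(train_indices) / BATCH_SIZE)
val_batches = math.ceil(len(val_indices) / BATCH_SIZE)

print(len(train_indices), len(val_indices), train_batches, val_batches)
```

One epoch therefore covers 766 training minibatches, which is consistent with the training logs later in the notebook ending each epoch at iteration 700 when printing every 100 iterations.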


You have the option to **use a GPU by setting the flag to True below**. A GPU is not necessary for this assignment. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()` will return False and this notebook will fall back to CPU mode.

The global variables `dtype` and `device` control the data type and device used throughout this assignment.

In [3]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 100

print('using device:', device)

using device: cpu


## Flatten Function

In [4]:
def flatten(x):
    N = x.shape[0] # batch size; x has shape (N, C, H, W)
    return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

def test_flatten():
    x = torch.arange(12).view(2, 1, 3, 2)
    print('Before flattening: ', x)
    print('After flattening: ', flatten(x))

test_flatten()

Before flattening:  tensor([[[[ 0,  1],
          [ 2,  3],
          [ 4,  5]]],


        [[[ 6,  7],
          [ 8,  9],
          [10, 11]]]])
After flattening:  tensor([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]])


### Check Accuracy Function


In [5]:
import torch.nn.functional as F  # useful stateless functions
def check_accuracy(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
        return 100 * acc
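
To see what the prediction and counting steps in `check_accuracy` do, here is a small sketch on made-up numbers. The scores and labels below are invented for illustration and do not come from any model in this notebook.

```python
import torch

# Hypothetical class scores for 3 examples and 3 classes (illustrative values).
scores = torch.tensor([[0.1, 2.0, -1.0],
                       [1.5, 0.2, 0.3],
                       [0.0, 0.1, 0.9]])
y = torch.tensor([1, 0, 1])  # hypothetical ground-truth labels

# max(1) returns (values, indices) along the class dimension;
# the indices are the predicted classes.
_, preds = scores.max(1)

num_correct = (preds == y).sum().item()
num_samples = preds.size(0)
acc = num_correct / num_samples

print(preds.tolist(), num_correct, num_samples)
```

Here the third example is misclassified, so two of the three predictions count as correct.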

### Training Loop

In [6]:
def train(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: Nothing, but prints model accuracies during training.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    acc_max = 0
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each  parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Epoch %d, Iteration %d, loss = %.4f' % (e, t, loss.item()))
                acc = check_accuracy(loader_val, model)
                if acc >= acc_max:
                    acc_max = acc
                print()
    print("Maximum accuracy attained: ", acc_max)

In [7]:
# We need to wrap the `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)

## Vanilla CNN; No Attention
We implement the vanilla architecture for you here. Do not modify the architecture. You will use the same architecture in the following parts. Do not modify the hyper-parameters.
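
As a rough guide to the size of this architecture, the sketch below counts its learnable parameters by hand. It assumes the layer shapes used in the following cell (two 3x3 convolutions with 64 and 32 output channels, then a linear layer on the flattened 32x32 feature map); the helper names `conv2d_params` and `linear_params` are introduced here for illustration.

```python
def conv2d_params(c_in, c_out, kernel):
    """Weights plus biases of a square-kernel Conv2d layer."""
    return c_in * c_out * kernel * kernel + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a Linear layer."""
    return n_in * n_out + n_out


channel_1, channel_2, num_classes = 64, 32, 10

conv1 = conv2d_params(3, channel_1, 3)
conv2 = conv2d_params(channel_1, channel_2, 3)
fc = linear_params(channel_2 * 32 * 32, num_classes)
total = conv1 + conv2 + fc

print(conv1, conv2, fc, total)
```

Most of the parameters sit in the final linear layer, because padding keeps the feature map at 32x32 and no pooling is applied before flattening.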

In [8]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3
num_classes = 10

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, num_classes),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)


train(model, optimizer, epochs=1)

Epoch 0, Iteration 0, loss = 2.3117
Checking accuracy on validation set
Got 116 / 1000 correct (11.60)

Epoch 0, Iteration 100, loss = 1.8795
Checking accuracy on validation set
Got 421 / 1000 correct (42.10)

Epoch 0, Iteration 200, loss = 1.5732
Checking accuracy on validation set
Got 475 / 1000 correct (47.50)

Epoch 0, Iteration 300, loss = 1.2348
Checking accuracy on validation set
Got 521 / 1000 correct (52.10)

Epoch 0, Iteration 400, loss = 1.7007
Checking accuracy on validation set
Got 519 / 1000 correct (51.90)

Epoch 0, Iteration 500, loss = 1.3144
Checking accuracy on validation set
Got 528 / 1000 correct (52.80)

Epoch 0, Iteration 600, loss = 1.1541
Checking accuracy on validation set
Got 559 / 1000 correct (55.90)

Epoch 0, Iteration 700, loss = 1.2107
Checking accuracy on validation set
Got 545 / 1000 correct (54.50)

Maximum accuracy attained:  55.900000000000006
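
The first loss in the log above (2.3117) can be sanity-checked. As a rough sketch, assuming an untrained model assigns approximately equal probability to each of the 10 classes, the cross-entropy loss should be close to the natural log of the number of classes.

```python
import math

num_classes = 10

# Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes).
expected_initial_loss = math.log(num_classes)

print(round(expected_initial_loss, 4))
```

A first-iteration loss far from this value would suggest a problem with initialization or with the loss computation.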


## Test set -- run this only once

Now we test our model on the test set. Think about how this compares to your validation set accuracy.
You should be able to see at least 55% accuracy.

In [9]:
vanillaModel = model
check_accuracy(loader_test, vanillaModel)


Checking accuracy on test set
Got 5632 / 10000 correct (56.32)


56.32

## Part II. Self-Attention

In this section, you will implement an Attention layer, which you will then use within the convnet architecture defined above for the CIFAR-10 classification task.

A self-attention layer is formulated as follows:

Input: $X$ of shape $(H\times W, C)$

Query, key, value linear transforms are $W_Q$, $W_K$, $W_V$, of shape $(C, C)$. We implement these linear transforms as 1x1 convolutional layers of the same dimensions.

$XW_Q$, $XW_K$, $XW_V$, represent the output volumes when input X is passed through the transforms.


Self-Attention is given by the formula: $Attention(X) = X + Softmax(\frac{XW_Q(XW_K)^\top}{\sqrt{C}})XW_V$
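
To make the shapes in this formula explicit, the computation can be written out step by step. This is a sketch that assumes the softmax is applied row-wise (over the last dimension), which the text above does not state explicitly; the symbols $Q$, $K$, $V$, and $A$ are shorthand introduced here.

$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V,\qquad Q, K, V \in \mathbb{R}^{HW \times C}$$

$$A = Softmax\left(\frac{QK^\top}{\sqrt{C}}\right) \in \mathbb{R}^{HW \times HW},\qquad \sum_{j} A_{ij} = 1$$

$$Attention(X) = X + AV \in \mathbb{R}^{HW \times C}$$

Under this reading, row $i$ of $A$ holds the weights that position $i$ assigns to every position $j$ of the same feature map.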

### Inline Question 1: Self-Attention is equivalent to which of the following: (5 points)
1. K-means clustering <br />
2. Non-local means <br />
3. Residual Block <br />
4. Gaussian Blurring <br />

Your Answer:
Self-attention is equivalent to non-local means. The non-local means algorithm replaces the value of a pixel with an average of a selection of other pixels' values. Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that same sequence. K-means is a method of vector quantization that aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean. A residual block is a stack of layers arranged so that the output of one layer is added to the output of a layer deeper in the block, but it includes no attention. Gaussian blurring blurs an image by convolving it with a Gaussian function.

### Here you implement the Attention module, and run it in the next section (40 points)

In [10]:
# Initialize the attention module as a nn.Module subclass
class Attention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        
        # TODO: Implement the Key, Query and Value linear transforms as 1x1 convolutional layers
        # Hint: channel size remains constant throughout
        self.conv_query = nn.Conv2d(in_channels, in_channels, 1)
        self.conv_key = nn.Conv2d(in_channels, in_channels, 1)
        self.conv_value = nn.Conv2d(in_channels, in_channels, 1)


    def forward(self, x):
        N, C, H, W = x.shape
        
        # TODO: Pass the input through conv_query, reshape the output volume to (N, C, H*W)
        q = self.conv_query(x).reshape(N, C, H*W)
        # TODO: Pass the input through conv_key, reshape the output volume to (N, C, H*W)
        k = self.conv_key(x).reshape(N, C, H*W)
        # TODO: Pass the input through conv_value, reshape the output volume to (N, C, H*W)
        v = self.conv_value(x).reshape(N, C, H*W)
        # TODO: Implement the above formula for attention using q, k, v, C
        # NOTE: The X in the formula is already added for you in the return line
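        # Shape note: with q and k laid out as (N, C, H*W), q @ k^T has shape (N, C, C),
        # so the softmax below is taken over channels rather than spatial positions.
        temp = torch.matmul(q, torch.transpose(k, 1, 2))/(np.sqrt(C))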
        attention = torch.matmul(F.softmax(temp, dim=-1), v)
        # Reshape the output to (N, C, H, W) before adding to the input volume
        attention = attention.reshape(N, C, H, W)
        return x + attention
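
For comparison with the formula, here is a minimal sketch of a variant that arranges the features as $(H \times W, C)$ rows, so that the softmax runs over spatial positions. It is not the graded solution and was not used to produce any results in this notebook; the class name `SpatialSelfAttention` is hypothetical. The cell above instead multiplies tensors laid out as (N, C, H*W), which yields a (C, C) score matrix per image.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialSelfAttention(nn.Module):
    """Hypothetical variant: attention over the H*W spatial positions."""

    def __init__(self, in_channels):
        super().__init__()
        # 1x1 convolutions play the role of W_Q, W_K, W_V.
        self.conv_query = nn.Conv2d(in_channels, in_channels, 1)
        self.conv_key = nn.Conv2d(in_channels, in_channels, 1)
        self.conv_value = nn.Conv2d(in_channels, in_channels, 1)

    def forward(self, x):
        N, C, H, W = x.shape
        # (N, C, H*W) -> (N, H*W, C): one row per spatial position.
        q = self.conv_query(x).reshape(N, C, H * W).transpose(1, 2)
        k = self.conv_key(x).reshape(N, C, H * W).transpose(1, 2)
        v = self.conv_value(x).reshape(N, C, H * W).transpose(1, 2)
        # Position-to-position scores, shape (N, H*W, H*W).
        scores = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(C)
        weights = F.softmax(scores, dim=-1)  # each row sums to 1
        out = torch.matmul(weights, v)       # (N, H*W, C)
        # Back to (N, C, H, W) before the residual addition.
        out = out.transpose(1, 2).reshape(N, C, H, W)
        return x + out


layer = SpatialSelfAttention(8)
x = torch.randn(2, 8, 4, 4)
y = layer(x)
print(y.shape)
```

One trade-off of this layout is memory: on a 32x32 feature map the score matrix has 1024 x 1024 entries per image, which for a batch of 64 in float32 is about 256 MiB for the scores alone.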

## Single Attention Block: Early attention; After the first conv layer. (10 points)

In [11]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the first Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Attention->Relu->Conv->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    Attention(channel_1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, 10),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.2929
Checking accuracy on validation set
Got 160 / 1000 correct (16.00)

Epoch 0, Iteration 100, loss = 1.5620
Checking accuracy on validation set
Got 450 / 1000 correct (45.00)

Epoch 0, Iteration 200, loss = 1.3225
Checking accuracy on validation set
Got 497 / 1000 correct (49.70)

Epoch 0, Iteration 300, loss = 1.4317
Checking accuracy on validation set
Got 543 / 1000 correct (54.30)

Epoch 0, Iteration 400, loss = 1.4624
Checking accuracy on validation set
Got 572 / 1000 correct (57.20)

Epoch 0, Iteration 500, loss = 1.3272
Checking accuracy on validation set
Got 570 / 1000 correct (57.00)

Epoch 0, Iteration 600, loss = 1.2060
Checking accuracy on validation set
Got 588 / 1000 correct (58.80)

Epoch 0, Iteration 700, loss = 1.3392
Checking accuracy on validation set
Got 610 / 1000 correct (61.00)

Epoch 1, Iteration 0, loss = 1.3130
Checking accuracy on validation set
Got 598 / 1000 correct (59.80)

Epoch 1, Iteration 100, loss = 0.8741
[... output truncated ...]

Got 640 / 1000 correct (64.00)

Epoch 9, Iteration 600, loss = 0.1451
Checking accuracy on validation set
Got 626 / 1000 correct (62.60)

Epoch 9, Iteration 700, loss = 0.1046
Checking accuracy on validation set
Got 624 / 1000 correct (62.40)

Maximum accuracy attained:  67.60000000000001


## Test set -- run this only once

Now we test our model on the test set. Think about how this compares to your validation set accuracy.
You should see an improvement of about 2-3% over the vanilla convnet model. *Use this part to tune your Attention module and then move on to the next parts.*

In [12]:
earlyAttention = model
check_accuracy(loader_test, earlyAttention)

Checking accuracy on test set
Got 6155 / 10000 correct (61.55)


61.550000000000004

## Single Attention Block: Late attention; After the second conv layer. (10 points)

In [13]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the Second Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Conv->Relu->Attention->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1),
    nn.ReLU(),
    Attention(channel_2),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, 10),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.3125
Checking accuracy on validation set
Got 94 / 1000 correct (9.40)

Epoch 0, Iteration 100, loss = 1.4392
Checking accuracy on validation set
Got 454 / 1000 correct (45.40)

Epoch 0, Iteration 200, loss = 1.4765
Checking accuracy on validation set
Got 497 / 1000 correct (49.70)

Epoch 0, Iteration 300, loss = 1.2352
Checking accuracy on validation set
Got 519 / 1000 correct (51.90)

Epoch 0, Iteration 400, loss = 1.3181
Checking accuracy on validation set
Got 530 / 1000 correct (53.00)

Epoch 0, Iteration 500, loss = 1.1116
Checking accuracy on validation set
Got 573 / 1000 correct (57.30)

Epoch 0, Iteration 600, loss = 1.1926
Checking accuracy on validation set
Got 588 / 1000 correct (58.80)

Epoch 0, Iteration 700, loss = 1.3299
Checking accuracy on validation set
Got 585 / 1000 correct (58.50)

Epoch 1, Iteration 0, loss = 0.8473
Checking accuracy on validation set
Got 601 / 1000 correct (60.10)

Epoch 1, Iteration 100, loss = 1.0140
[... output truncated ...]

Got 613 / 1000 correct (61.30)

Epoch 9, Iteration 600, loss = 0.2296
Checking accuracy on validation set
Got 615 / 1000 correct (61.50)

Epoch 9, Iteration 700, loss = 0.0352
Checking accuracy on validation set
Got 615 / 1000 correct (61.50)

Maximum accuracy attained:  64.8


## Test set -- run this only once

Now we test our model on the test set. Think about how this compares to your validation set accuracy.

In [14]:
lateAttention = model
check_accuracy(loader_test, lateAttention)

Checking accuracy on test set
Got 6117 / 10000 correct (61.17)


61.17

### Inline Question 2: Provide one example each of usage of self-attention and attention in computer vision. Explain the difference between the two. (5 points)


Your Answer:

One example of self-attention in computer vision is the Self-Attention Generative Adversarial Network (SAGAN).
One example of attention in computer vision is the Convolutional Block Attention Module (CBAM).

The main difference is that in self-attention the queries, keys, and values are all computed from the same feature map, so each position attends to other positions of that same input. A general attention mechanism is not restricted in this way: the attention weights can be learned from a different source, such as another layer, than the features they are applied to.


## Double Attention Blocks: After conv layers 1 and 2 (10 points)

In [15]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after both the first and second convolutional layers.
# Essentially the architecture should be [Conv->Relu->Attention->Relu->Conv->Relu->Attention->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    Attention(channel_1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1),
    nn.ReLU(),
    Attention(channel_2),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, 10),
)



optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.3199
Checking accuracy on validation set
Got 122 / 1000 correct (12.20)

Epoch 0, Iteration 100, loss = 1.5206
Checking accuracy on validation set
Got 437 / 1000 correct (43.70)

Epoch 0, Iteration 200, loss = 1.3188
Checking accuracy on validation set
Got 514 / 1000 correct (51.40)

Epoch 0, Iteration 300, loss = 1.3824
Checking accuracy on validation set
Got 517 / 1000 correct (51.70)

Epoch 0, Iteration 400, loss = 1.2026
Checking accuracy on validation set
Got 549 / 1000 correct (54.90)

Epoch 0, Iteration 500, loss = 1.3991
Checking accuracy on validation set
Got 556 / 1000 correct (55.60)

Epoch 0, Iteration 600, loss = 1.3911
Checking accuracy on validation set
Got 578 / 1000 correct (57.80)

Epoch 0, Iteration 700, loss = 1.0877
Checking accuracy on validation set
Got 614 / 1000 correct (61.40)

Epoch 1, Iteration 0, loss = 0.9966
Checking accuracy on validation set
Got 596 / 1000 correct (59.60)

Epoch 1, Iteration 100, loss = 0.9535
[... output truncated ...]

Got 643 / 1000 correct (64.30)

Epoch 9, Iteration 600, loss = 0.0677
Checking accuracy on validation set
Got 642 / 1000 correct (64.20)

Epoch 9, Iteration 700, loss = 0.3252
Checking accuracy on validation set
Got 614 / 1000 correct (61.40)

Maximum accuracy attained:  67.80000000000001


## Test set -- run this only once

Now we test our model on the test set. Think about how this compares to your validation set accuracy.

In [16]:
doubleAttention = model
check_accuracy(loader_test, doubleAttention)

Checking accuracy on test set
Got 6233 / 10000 correct (62.33)


62.33

## Resnet with Attention 

Now we will experiment with applying attention within the Resnet10 architecture that we implemented in Homework 2. Please note that for a deeper model such as Resnet, we do not expect significant improvements in performance with Attention.

## Vanilla Resnet, No Attention

The architecture for Resnet is given below, please train it and evaluate it on the test set.

In [17]:
import torch
import torch.nn as nn

class ResNet(nn.Module):

    def __init__(self, block, layers, img_channels=3, num_classes=100, batchnorm=False):
        super(ResNet, self).__init__() #layers = [1, 1, 1, 1] 
        self.in_channels = 64
        self.conv1 = nn.Conv2d(img_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.batchnorm = batchnorm
        self.layer1 = self.make_layer(block, layers[0], out_channels=64, stride=1, batchnorm=batchnorm)
        self.layer2 = self.make_layer(block, layers[1], out_channels=128, stride=1, batchnorm=batchnorm)
        self.layer3 = self.make_layer(block, layers[2], out_channels=256, stride=1, batchnorm=batchnorm)
        self.layer4 = self.make_layer(block, layers[3], out_channels=512, stride=2, batchnorm=batchnorm)

        self.averagepool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    
    def forward(self, x):

        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x) 
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.averagepool(x)
        x = x.reshape(x.shape[0], -1)  # flatten (N, 512, 1, 1) to (N, 512)
        x = self.fc(x)

        return x


        

    def make_layer(self, block, num_blocks, out_channels, stride, batchnorm=False):
        downsampler = None
        layers = []
        if stride != 1 or self.in_channels != out_channels:
            downsampler = nn.Sequential(nn.Conv2d(self.in_channels, out_channels, kernel_size = 1, stride = stride), nn.BatchNorm2d(out_channels))

        layers.append(block(self.in_channels, out_channels, downsampler, stride, batchnorm=batchnorm))

        self.in_channels = out_channels

        for i in range(num_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        
        return nn.Sequential(*layers)
        
class block(nn.Module):

    def __init__(self, in_channels, out_channels, downsampler = None, stride = 1, batchnorm=False):
        
        super(block, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size = 3, padding = 2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsampler = downsampler
        self.relu = nn.ReLU()
        self.batchnorm = batchnorm

    
    def forward(self, x):

        residual = x
        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        if self.batchnorm:
            x = self.bn2(x)
        x = self.relu(x)
        
        if self.downsampler:
            residual = self.downsampler(residual)

        return self.relu(residual + x)
    


def ResNet10(num_classes = 100, batchnorm= False):

    return ResNet(block, [1, 1, 1, 1], num_classes=num_classes, batchnorm=batchnorm)

## Test set -- run this only once

Now we test our model on the test set. Think about how this compares to your validation set accuracy.

In [18]:
learning_rate = 1e-3

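# Note: ResNet10() defaults to num_classes=100; CIFAR-10 has only 10 classes,
# so ResNet10(num_classes=10) would be sufficient here.
model = ResNet10()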

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

vanillaResnet = model
check_accuracy(loader_test, vanillaResnet)

Epoch 0, Iteration 0, loss = 4.6591
Checking accuracy on validation set
Got 119 / 1000 correct (11.90)

Epoch 0, Iteration 100, loss = 1.5874
Checking accuracy on validation set
Got 374 / 1000 correct (37.40)

Epoch 0, Iteration 200, loss = 1.7084
Checking accuracy on validation set
Got 479 / 1000 correct (47.90)

Epoch 0, Iteration 300, loss = 1.3771
Checking accuracy on validation set
Got 500 / 1000 correct (50.00)

Epoch 0, Iteration 400, loss = 1.6123
Checking accuracy on validation set
Got 508 / 1000 correct (50.80)

Epoch 0, Iteration 500, loss = 1.2234
Checking accuracy on validation set
Got 541 / 1000 correct (54.10)

Epoch 0, Iteration 600, loss = 1.0217
Checking accuracy on validation set
Got 571 / 1000 correct (57.10)

Epoch 0, Iteration 700, loss = 1.1153
Checking accuracy on validation set
Got 577 / 1000 correct (57.70)

Epoch 1, Iteration 0, loss = 1.0259
Checking accuracy on validation set
Got 612 / 1000 correct (61.20)

Epoch 1, Iteration 100, loss = 0.9818
[... output truncated ...]

Got 773 / 1000 correct (77.30)

Epoch 9, Iteration 600, loss = 0.5445
Checking accuracy on validation set
Got 750 / 1000 correct (75.00)

Epoch 9, Iteration 700, loss = 0.3145
Checking accuracy on validation set
Got 760 / 1000 correct (76.00)

Maximum accuracy attained:  77.3
Checking accuracy on test set
Got 7406 / 10000 correct (74.06)


74.06

## Resnet with Attention (5 points)

In [19]:
import torch
import torch.nn as nn

class ResNet_Attention(nn.Module):

    def __init__(self, block, layers, img_channels=3, num_classes=100, batchnorm=False):
        super(ResNet_Attention, self).__init__() #layers = [1, 1, 1, 1] 
        self.in_channels = 64
        self.conv1 = nn.Conv2d(img_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.batchnorm = batchnorm
        self.layer1 = self.make_layer(block, layers[0], out_channels=64, stride=1, batchnorm=batchnorm)
        self.layer2 = self.make_layer(block, layers[1], out_channels=128, stride=1, batchnorm=batchnorm)
        self.attention = Attention(128)
        self.layer3 = self.make_layer(block, layers[2], out_channels=256, stride=1, batchnorm=batchnorm)
        self.layer4 = self.make_layer(block, layers[3], out_channels=512, stride=2, batchnorm=batchnorm)

        self.averagepool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)
   
    def forward(self, x):

        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x) 
        x = self.layer2(x)
        x = self.attention(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.averagepool(x)
        x = x.reshape(x.shape[0], -1)  # flatten (N, 512, 1, 1) to (N, 512)
        x = self.fc(x)

        return x

    def make_layer(self, block, num_blocks, out_channels, stride, batchnorm=False):
        downsampler = None
        layers = []
        if stride != 1 or self.in_channels != out_channels:
            downsampler = nn.Sequential(nn.Conv2d(self.in_channels, out_channels, kernel_size = 1, stride = stride), nn.BatchNorm2d(out_channels))

        layers.append(block(self.in_channels, out_channels, downsampler, stride, batchnorm=batchnorm))

        self.in_channels = out_channels

        for i in range(num_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        
        return nn.Sequential(*layers)
        
class block(nn.Module):

    def __init__(self, in_channels, out_channels, downsampler = None, stride = 1, batchnorm=False):
        
        super(block, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size = 3, padding = 2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsampler = downsampler
        self.relu = nn.ReLU()
        self.batchnorm = batchnorm

    
    def forward(self, x):

        residual = x
        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        if self.batchnorm:
            x = self.bn2(x)
        x = self.relu(x)
        
        if self.downsampler:
            residual = self.downsampler(residual)

        return self.relu(residual + x)
    


def ResNet10_Attention(num_classes = 100, batchnorm= False):

    return ResNet_Attention(block, [1, 1, 1, 1], num_classes=num_classes, batchnorm=batchnorm)
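
To see how large the feature map is when it reaches the attention layer, the spatial size can be traced through the network by hand. This is a sketch that assumes 32x32 CIFAR-10 inputs and the kernel, stride, and padding values from the definitions above; the helpers `conv_out` and `block_out` are hypothetical names introduced here.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def block_out(size, stride):
    """Side length after one residual block as defined above:
    conv1 (3x3, padding 2, stride 1), then conv2 (3x3, no padding, given stride)."""
    size = conv_out(size, kernel=3, stride=1, padding=2)
    return conv_out(size, kernel=3, stride=stride, padding=0)


size = 32                                              # CIFAR-10 image side
after_conv1 = conv_out(size, kernel=7, stride=2, padding=3)
after_pool = conv_out(after_conv1, kernel=3, stride=2, padding=1)
after_layer1 = block_out(after_pool, stride=1)
after_layer2 = block_out(after_layer1, stride=1)       # input to self.attention
positions_at_attention = after_layer2 ** 2
after_layer3 = block_out(after_layer2, stride=1)
after_layer4 = block_out(after_layer3, stride=2)

print(after_conv1, after_pool, after_layer2, positions_at_attention, after_layer4)
```

Under these assumptions the attention layer sees an 8x8 map with 128 channels, that is, 64 spatial positions, far fewer than the 1024 positions of the 32x32 maps in the small convnet.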

In [20]:
## Resnet with Attention

learning_rate = 1e-3

# TODO: Use the above Attention module after the 2nd resnet block i.e. after self.layer2.

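# Note: ResNet10_Attention() defaults to num_classes=100; CIFAR-10 has only 10 classes,
# so ResNet10_Attention(num_classes=10) would be sufficient here.
model = ResNet10_Attention()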

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 4.5269
Checking accuracy on validation set
Got 107 / 1000 correct (10.70)

Epoch 0, Iteration 100, loss = 1.4466
Checking accuracy on validation set
Got 413 / 1000 correct (41.30)

Epoch 0, Iteration 200, loss = 1.2313
Checking accuracy on validation set
Got 440 / 1000 correct (44.00)

Epoch 0, Iteration 300, loss = 1.6204
Checking accuracy on validation set
Got 413 / 1000 correct (41.30)

Epoch 0, Iteration 400, loss = 1.0470
Checking accuracy on validation set
Got 535 / 1000 correct (53.50)

Epoch 0, Iteration 500, loss = 1.3376
Checking accuracy on validation set
Got 538 / 1000 correct (53.80)

Epoch 0, Iteration 600, loss = 1.1351
Checking accuracy on validation set
Got 563 / 1000 correct (56.30)

Epoch 0, Iteration 700, loss = 1.0257
Checking accuracy on validation set
Got 599 / 1000 correct (59.90)

Epoch 1, Iteration 0, loss = 0.9757
Checking accuracy on validation set
Got 587 / 1000 correct (58.70)

Epoch 1, Iteration 100, loss = 1.2443
[... output truncated ...]

Got 763 / 1000 correct (76.30)

Epoch 9, Iteration 600, loss = 0.2048
Checking accuracy on validation set
Got 767 / 1000 correct (76.70)

Epoch 9, Iteration 700, loss = 0.1444
Checking accuracy on validation set
Got 772 / 1000 correct (77.20)

Maximum accuracy attained:  78.8


## Test set -- run this only once

Now we test our model on the test set. Think about how this compares to your validation set accuracy.

In [21]:
AttentionResnet = model
check_accuracy(loader_test, AttentionResnet)

Checking accuracy on test set
Got 7758 / 10000 correct (77.58)


77.58

## Inline Question 3: Rank the above models based on their performance on test dataset (15 points)
(You are encouraged to run each of the training experiments at least 3 times to get an average estimate.)

Report the test accuracies alongside the model names. For example: 1. Vanilla CNN (57.45%, 57.99%), etc.

1. Attention Resnet10: (77.2%, 77.4%, 77.58%), avg accuracy = 77.4% <br />
2. Vanilla Resnet10: (74.4%, 74.25%, 74.06%), avg accuracy = 74.24% <br />
3. Double Attention CNN: (63.76%, 63.52%, 62.33%), avg accuracy = 63.2% <br />
4. Single Early Attention CNN: (61.87%, 61.73%, 61.55%), avg accuracy = 61.72% <br />
5. Single Late Attention CNN: (60.8%, 61.02%, 61.17%), avg accuracy = 61% <br />
6. Vanilla CNN: (56.1%, 57.14%, 56.32%), avg accuracy = 56.52% <br />
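
The averages and the ordering in the list above can be reproduced from the three reported runs per model. This is a small sketch; the dictionary `runs` simply restates the test accuracies listed above, and the variable names are illustrative.

```python
# Test accuracies (in percent) as reported in the ranking above.
runs = {
    "Attention Resnet10": [77.2, 77.4, 77.58],
    "Vanilla Resnet10": [74.4, 74.25, 74.06],
    "Double Attention CNN": [63.76, 63.52, 62.33],
    "Single Early Attention CNN": [61.87, 61.73, 61.55],
    "Single Late Attention CNN": [60.8, 61.02, 61.17],
    "Vanilla CNN": [56.1, 57.14, 56.32],
}

# Mean accuracy per model, then models sorted from best to worst.
averages = {name: sum(accs) / len(accs) for name, accs in runs.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {averages[name]:.2f}%")
```

With only three runs per model, gaps of less than about one percentage point, such as early versus late attention, are best read as tentative.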

### Bonus Question (Ungraded): Can you give a possible explanation that supports the rankings?
Your Answer:

Residual networks add a skip connection across every two layers. A plain CNN can suffer from vanishing gradients as depth grows, whereas the residual block design lets the network learn parameters in deeper stacks of layers. This helps explain why ResNet10 performs better overall than the vanilla CNN. An attention mechanism can further increase accuracy by directing the model to pay "attention" to particular features. The gap between single early attention and single late attention is small (61.72% vs. 61% on average), so it is difficult to rank the two with confidence. Compared with a single attention block, however, the double attention blocks do improve accuracy further (63.2% on average).
