# Assignment 4: Self-Attention for Vision

For this assignment, we're going to implement self-attention blocks in a convolutional neural network for CIFAR-10 Classification.

# Part I. Preparation

First, we load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.

In [41]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

In [42]:
NUM_TRAIN = 49000

# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
                T.ToTensor(),
                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
            ])

# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
cifar10_train = dset.CIFAR10('./data/datasets', train=True, download=True,
                             transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10('./data/datasets', train=True, download=True,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10('./data/datasets', train=False, download=True, 
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified


You have an option to **use GPU by setting the flag to True below**. It is not necessary to use GPU for this assignment. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()` will return False and this notebook will fallback to CPU mode.

The global variables `dtype` and `device` will control the data types throughout this assignment. 

In [43]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 100

print('using device:', device)

using device: cuda


## Flatten Function

In [44]:
def flatten(x):
    N = x.shape[0] # read in N, C, H, W
    return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

def test_flatten():
    x = torch.arange(12).view(2, 1, 3, 2)
    print('Before flattening: ', x)
    print('After flattening: ', flatten(x))

test_flatten()

Before flattening:  tensor([[[[ 0,  1],
          [ 2,  3],
          [ 4,  5]]],


        [[[ 6,  7],
          [ 8,  9],
          [10, 11]]]])
After flattening:  tensor([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]])


### Check Accuracy Function


In [45]:
import torch.nn.functional as F  # useful stateless functions
def check_accuracy(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
        return 100 * acc

### Training Loop

In [46]:
def train(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: Nothing, but prints model accuracies during training.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    acc_max = 0
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each  parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Epoch %d, Iteration %d, loss = %.4f' % (e, t, loss.item()))
                acc = check_accuracy(loader_val, model)
                if acc >= acc_max:
                    acc_max = acc
                print()
    print("Maximum accuracy attained: ", acc_max)

In [47]:
# We need to wrap `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)

## Vanilla CNN; No Attention
We implement the vanilla architecture for you here. Do not modify the architecture. You will use the same architecture in the following parts. Do not modify the hyper-parameters.

In [94]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3
num_classes = 10

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, num_classes),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)


train(model, optimizer, epochs=1)

Epoch 0, Iteration 0, loss = 2.2902
Checking accuracy on validation set
Got 119 / 1000 correct (11.90)

Epoch 0, Iteration 100, loss = 1.5537
Checking accuracy on validation set
Got 451 / 1000 correct (45.10)

Epoch 0, Iteration 200, loss = 1.5256
Checking accuracy on validation set
Got 473 / 1000 correct (47.30)

Epoch 0, Iteration 300, loss = 1.3774
Checking accuracy on validation set
Got 488 / 1000 correct (48.80)

Epoch 0, Iteration 400, loss = 1.0604
Checking accuracy on validation set
Got 535 / 1000 correct (53.50)

Epoch 0, Iteration 500, loss = 1.3256
Checking accuracy on validation set
Got 528 / 1000 correct (52.80)

Epoch 0, Iteration 600, loss = 1.2408
Checking accuracy on validation set
Got 570 / 1000 correct (57.00)

Epoch 0, Iteration 700, loss = 1.1947
Checking accuracy on validation set
Got 601 / 1000 correct (60.10)

Maximum accuracy attained:  60.099999999999994


## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.
You should be able to see atleast 55% accuracy

In [95]:
vanillaModel = model
check_accuracy(loader_test, vanillaModel)


Checking accuracy on test set
Got 5952 / 10000 correct (59.52)


59.519999999999996

## Part II Self-Attention

In the next section, you will implement an Attention layer which you will then use within a convnet architecture defined above for cifar 10 classification task.

A self-attention layer is formulated as following:

Input: $X$ of shape $(H\times W, C)$

Query, key, value linear transforms are $W_Q$, $W_K$, $W_V$, of shape $(C, C)$. We implement these linear transforms as 1x1 convolutional layers of the same dimensions.

$XW_Q$, $XW_K$, $XW_V$, represent the output volumes when input X is passed through the transforms.


Self-Attention is given by the formula: $Attention(X) = X + Softmax(\frac{XW_Q(XW_K)^\top}{\sqrt{C}})XW_V$

### Inline Question 1: Self-Attention is equivalent to which of the following: (5 points)
1. K-means clustering <br />
2. Non-local means <br />
3. Residual Block <br />
4. Gaussian Blurring <br />

Your Answer: 2

### Here you implement the Attention module, and run it in the next section (40 points)

In [84]:
# Initialize the attention module as a nn.Module subclass
class Attention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        
        # TODO: Implement the Key, Query and Value linear transforms as 1x1 convolutional layers
        # Hint: channel size remains constant throughout
        self.conv_query = nn.Conv2d(in_channels, in_channels, 1)
        self.conv_key = nn.Conv2d(in_channels, in_channels, 1)
        self.conv_value = nn.Conv2d(in_channels, in_channels, 1)

    def forward(self, x):
        N, C, H, W = x.shape
        
        # TODO: Pass the input through conv_query, reshape the output volume to (N, C, H*W)
        q = self.conv_query(x)
        q = torch.reshape(q, (N, C, -1))
        # TODO: Pass the input through conv_key, reshape the output volume to (N, C, H*W)
        k = self.conv_key(x)
        k = torch.reshape(k, (N, C, -1))
        # TODO: Pass the input through conv_value, reshape the output volume to (N, C, H*W)
        v = self.conv_value(x)
        v = torch.reshape(v, (N, C, -1))
        # TODO: Implement the above formula for attention using q, k, v, C
        # NOTE: The X in the formula is already added for you in the return line
        q = torch.transpose(q, 1, 2)
        att_distrib = torch.matmul(q, k)
        att_distrib = att_distrib / torch.sqrt(torch.tensor([C], device=device))
        att_distrib = torch.softmax(att_distrib, dim=1) 
        # print(att_distrib.shape, torch.Tensor([C]))
        attention = torch.matmul(att_distrib, torch.transpose(v, 1, 2)) # (N, H*W, C)
        attention = torch.transpose(attention, 1, 2) # (N, C, H*W)
        # Reshape the output to (N, C, H, W) before adding to the input volume
        attention = attention.reshape(N, C, H, W)
        return x + attention

## Single Attention Block: Early attention; After the first conv layer. (10 points)

In [85]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the first Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Attention->Relu->Conv->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    Attention(channel_1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1, stride=1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, num_classes),
)


optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.2777
Checking accuracy on validation set
Got 145 / 1000 correct (14.50)

Epoch 0, Iteration 100, loss = 1.7835
Checking accuracy on validation set
Got 421 / 1000 correct (42.10)

Epoch 0, Iteration 200, loss = 1.3915
Checking accuracy on validation set
Got 451 / 1000 correct (45.10)

Epoch 0, Iteration 300, loss = 1.5871
Checking accuracy on validation set
Got 510 / 1000 correct (51.00)

Epoch 0, Iteration 400, loss = 1.4586
Checking accuracy on validation set
Got 529 / 1000 correct (52.90)

Epoch 0, Iteration 500, loss = 1.1199
Checking accuracy on validation set
Got 506 / 1000 correct (50.60)

Epoch 0, Iteration 600, loss = 1.1241
Checking accuracy on validation set
Got 541 / 1000 correct (54.10)

Epoch 0, Iteration 700, loss = 1.0821
Checking accuracy on validation set
Got 552 / 1000 correct (55.20)

Epoch 1, Iteration 0, loss = 1.1126
Checking accuracy on validation set
Got 583 / 1000 correct (58.30)

Epoch 1, Iteration 100, loss = 1.2216
Checking acc

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.
You should see improvement of about 2-3% over the vanilla convnet model. * Use this part to tune your Attention module and then move on to the next parts. *

In [86]:
earlyAttention = model
check_accuracy(loader_test, earlyAttention)

Checking accuracy on test set
Got 6112 / 10000 correct (61.12)


61.12

## Single Attention Block: Late attention; After the second conv layer. (10 points)

In [87]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the Second Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Conv->Relu->Attention->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1, stride=1),
    nn.ReLU(),
    Attention(channel_2),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, num_classes),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.3091
Checking accuracy on validation set
Got 125 / 1000 correct (12.50)

Epoch 0, Iteration 100, loss = 1.5839
Checking accuracy on validation set
Got 438 / 1000 correct (43.80)

Epoch 0, Iteration 200, loss = 1.9136
Checking accuracy on validation set
Got 497 / 1000 correct (49.70)

Epoch 0, Iteration 300, loss = 1.3352
Checking accuracy on validation set
Got 527 / 1000 correct (52.70)

Epoch 0, Iteration 400, loss = 1.3180
Checking accuracy on validation set
Got 532 / 1000 correct (53.20)

Epoch 0, Iteration 500, loss = 1.3627
Checking accuracy on validation set
Got 532 / 1000 correct (53.20)

Epoch 0, Iteration 600, loss = 1.4136
Checking accuracy on validation set
Got 554 / 1000 correct (55.40)

Epoch 0, Iteration 700, loss = 1.3577
Checking accuracy on validation set
Got 567 / 1000 correct (56.70)

Epoch 1, Iteration 0, loss = 1.3397
Checking accuracy on validation set
Got 557 / 1000 correct (55.70)

Epoch 1, Iteration 100, loss = 1.0281
Checking acc

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [88]:
lateAttention = model
check_accuracy(loader_test, lateAttention)

Checking accuracy on test set
Got 5831 / 10000 correct (58.31)


58.309999999999995

### Inline Question 2: Provide one example each of usage of self-attention and attention in computer vision. Explain the difference between the two. (5 points)


Your Answer:
* Usage of self-attention: The Vision Transformer (ViT) leverages self-attention to process images as sequences of patches, enabling it to capture long-range dependencies and interactions between different parts of the image. This approach provides an alternative to traditional CNNs in object classification, offering competitive performance on various computer vision tasks.

* Usage of attention: The Squeeze-and-Excitation Network (SENet) enhances CNNs by recalibrating channel-wise feature responses using attention method. This allows the network to focus on the most relevant parts of the input sequence, improving performance and global feature representation.

* Differences: 

    * Self-attention is typically part of a layer (like in ViT) capturing non-local dependencies by computing relationships between all pairs of elements in an input. It can be used very frequently in a network. It models within-sequence relationships mainly. The weights are computed from the input sequence directly, and hence different inputs correspond to different weights. It improves performance by capturing dependencies within the input sequence more effectively by doing dot-products using queries and keys for the weights. 
    
    * Attention is typically an individual module that concatenates encoder and decoder, serving the function of capturing the sequence-to-sequence relationships in the network. It's often used only a few times to concatenate two modules (like an encoder and a decoder) in a network. And hence, it models sequence-to-sequence relationships mainly. The weights in the MLP are learnt, and hence different inputs correspond to the same group of learnt weights. Attention improves performance by emphasizing important features based on the learnt weights indicating relative importance. 

## Double Attention Blocks: After conv layers 1 and 2 (10 points)

In [89]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the Second Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Attention->Relu->Conv->Relu->Attention->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    Attention(channel_1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1, stride=1),
    nn.ReLU(),
    Attention(channel_2),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, num_classes),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.2909
Checking accuracy on validation set
Got 109 / 1000 correct (10.90)

Epoch 0, Iteration 100, loss = 1.9164
Checking accuracy on validation set
Got 322 / 1000 correct (32.20)

Epoch 0, Iteration 200, loss = 1.5259
Checking accuracy on validation set
Got 450 / 1000 correct (45.00)

Epoch 0, Iteration 300, loss = 1.3118
Checking accuracy on validation set
Got 467 / 1000 correct (46.70)

Epoch 0, Iteration 400, loss = 1.4503
Checking accuracy on validation set
Got 506 / 1000 correct (50.60)

Epoch 0, Iteration 500, loss = 1.3227
Checking accuracy on validation set
Got 515 / 1000 correct (51.50)

Epoch 0, Iteration 600, loss = 1.2445
Checking accuracy on validation set
Got 524 / 1000 correct (52.40)

Epoch 0, Iteration 700, loss = 1.3021
Checking accuracy on validation set
Got 532 / 1000 correct (53.20)

Epoch 1, Iteration 0, loss = 1.2930
Checking accuracy on validation set
Got 530 / 1000 correct (53.00)

Epoch 1, Iteration 100, loss = 1.3517
Checking acc

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [90]:
vanillaModel = model
check_accuracy(loader_test, vanillaModel)

Checking accuracy on test set
Got 5862 / 10000 correct (58.62)


58.620000000000005

## Resnet with Attention 

Now we will experiment with applying attention within the Resnet10 architecture that we implemented in Homework 2. Please note that for a deeper model such as Resnet we do not expect significant improvements in performance with Attention

## Vanilla Resnet, No Attention

The architecture for Resnet is given below, please train it and evaluate it on the test set.

In [57]:
import torch
import torch.nn as nn

class ResNet(nn.Module):

    def __init__(self, block, layers, img_channels=3, num_classes=100, batchnorm=False):
        super(ResNet, self).__init__() #layers = [1, 1, 1, 1] 
        self.in_channels = 64
        self.conv1 = nn.Conv2d(img_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.batchnorm = batchnorm
        self.layer1 = self.make_layer(block, layers[0], out_channels=64, stride=1, batchnorm=batchnorm)
        self.layer2 = self.make_layer(block, layers[1], out_channels=128, stride=1, batchnorm=batchnorm)
        self.layer3 = self.make_layer(block, layers[2], out_channels=256, stride=1, batchnorm=batchnorm)
        self.layer4 = self.make_layer(block, layers[3], out_channels=512, stride=2, batchnorm=batchnorm)

        self.averagepool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    
    def forward(self, x):

        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x) 
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.averagepool(x)
        x = x.reshape(x.shape[0], -1)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)

        return x


        

    def make_layer(self, block, num_blocks, out_channels, stride, batchnorm=False):
        downsampler = None
        layers = []
        if stride != 1 or self.in_channels != out_channels:
            downsampler = nn.Sequential(nn.Conv2d(self.in_channels, out_channels, kernel_size = 1, stride = stride), nn.BatchNorm2d(out_channels))

        layers.append(block(self.in_channels, out_channels, downsampler, stride, batchnorm=batchnorm))

        self.in_channels = out_channels

        for i in range(num_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        
        return nn.Sequential(*layers)
        
class block(nn.Module):

    def __init__(self, in_channels, out_channels, downsampler = None, stride = 1, batchnorm=False):
        
        super(block, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size = 3, padding = 2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsampler = downsampler
        self.relu = nn.ReLU()
        self.batchnorm = batchnorm

    
    def forward(self, x):

        residual = x
        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        if self.batchnorm:
            x = self.bn2(x)
        x = self.relu(x)
        
        if self.downsampler:
            residual = self.downsampler(residual)

        return self.relu(residual + x)
    


def ResNet10(num_classes = 100, batchnorm= False):

    return ResNet(block, [1, 1, 1, 1], num_classes=num_classes, batchnorm=batchnorm)

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [58]:
learning_rate = 1e-3

model = ResNet10()

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

vanillaResnet = model
check_accuracy(loader_test, vanillaResnet)

Epoch 0, Iteration 0, loss = 4.7750
Checking accuracy on validation set
Got 105 / 1000 correct (10.50)

Epoch 0, Iteration 100, loss = 1.5069
Checking accuracy on validation set
Got 395 / 1000 correct (39.50)

Epoch 0, Iteration 200, loss = 1.3592
Checking accuracy on validation set
Got 449 / 1000 correct (44.90)

Epoch 0, Iteration 300, loss = 1.4675
Checking accuracy on validation set
Got 432 / 1000 correct (43.20)

Epoch 0, Iteration 400, loss = 1.2042
Checking accuracy on validation set
Got 471 / 1000 correct (47.10)

Epoch 0, Iteration 500, loss = 1.1853
Checking accuracy on validation set
Got 528 / 1000 correct (52.80)

Epoch 0, Iteration 600, loss = 1.1258
Checking accuracy on validation set
Got 571 / 1000 correct (57.10)

Epoch 0, Iteration 700, loss = 1.0300
Checking accuracy on validation set
Got 573 / 1000 correct (57.30)

Epoch 1, Iteration 0, loss = 1.0037
Checking accuracy on validation set
Got 577 / 1000 correct (57.70)

Epoch 1, Iteration 100, loss = 0.8076
Checking acc

74.88

## Resnet with Attention (5 points)

In [92]:
## Resnet with Attention

learning_rate = 1e-3

# TODO: Use the above Attention module after the 2nd resnet block i.e. after self.layer2.
class ResNetAttention(nn.Module):

    def __init__(self, block, layers, img_channels=3, num_classes=100, batchnorm=False):
        super(ResNetAttention, self).__init__() #layers = [1, 1, 1, 1] 
        self.in_channels = 64
        self.conv1 = nn.Conv2d(img_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.batchnorm = batchnorm
        self.layer1 = self.make_layer(block, layers[0], out_channels=64, stride=1, batchnorm=batchnorm)
        self.layer2 = self.make_layer(block, layers[1], out_channels=128, stride=1, batchnorm=batchnorm)
        self.attention = Attention(128)
        self.layer3 = self.make_layer(block, layers[2], out_channels=256, stride=1, batchnorm=batchnorm)
        self.layer4 = self.make_layer(block, layers[3], out_channels=512, stride=2, batchnorm=batchnorm)

        self.averagepool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    
    def forward(self, x):

        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x) 
        x = self.layer2(x)
        x = self.attention(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.averagepool(x)
        x = x.reshape(x.shape[0], -1)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)

        return x

    def make_layer(self, block, num_blocks, out_channels, stride, batchnorm=False):
        downsampler = None
        layers = []
        if stride != 1 or self.in_channels != out_channels:
            downsampler = nn.Sequential(nn.Conv2d(self.in_channels, out_channels, kernel_size = 1, stride = stride), nn.BatchNorm2d(out_channels))

        layers.append(block(self.in_channels, out_channels, downsampler, stride, batchnorm=batchnorm))

        self.in_channels = out_channels

        for i in range(num_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        
        return nn.Sequential(*layers)
        
class ResNetAttention10block(nn.Module):

    def __init__(self, in_channels, out_channels, downsampler = None, stride = 1, batchnorm=False):
        
        super(ResNetAttention10block, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size = 3, padding = 2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsampler = downsampler
        self.relu = nn.ReLU()
        self.batchnorm = batchnorm

    
    def forward(self, x):

        residual = x
        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        if self.batchnorm:
            x = self.bn2(x)
        x = self.relu(x)
        
        if self.downsampler:
            residual = self.downsampler(residual)

        return self.relu(residual + x)

def ResNetAttention10(num_classes = 100, batchnorm= False):

    return ResNetAttention(ResNetAttention10block, [1, 1, 1, 1], num_classes=num_classes, batchnorm=batchnorm)

model = ResNetAttention10()

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 4.6545
Checking accuracy on validation set
Got 98 / 1000 correct (9.80)

Epoch 0, Iteration 100, loss = 1.7813
Checking accuracy on validation set
Got 301 / 1000 correct (30.10)

Epoch 0, Iteration 200, loss = 1.5791
Checking accuracy on validation set
Got 466 / 1000 correct (46.60)

Epoch 0, Iteration 300, loss = 1.3994
Checking accuracy on validation set
Got 482 / 1000 correct (48.20)

Epoch 0, Iteration 400, loss = 1.3925
Checking accuracy on validation set
Got 496 / 1000 correct (49.60)

Epoch 0, Iteration 500, loss = 1.2355
Checking accuracy on validation set
Got 516 / 1000 correct (51.60)

Epoch 0, Iteration 600, loss = 1.5204
Checking accuracy on validation set
Got 507 / 1000 correct (50.70)

Epoch 0, Iteration 700, loss = 1.3769
Checking accuracy on validation set
Got 584 / 1000 correct (58.40)

Epoch 1, Iteration 0, loss = 0.8449
Checking accuracy on validation set
Got 538 / 1000 correct (53.80)

Epoch 1, Iteration 100, loss = 1.1106
Checking accur

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [93]:
AttentionResnet = model
check_accuracy(loader_test, AttentionResnet)

Checking accuracy on test set
Got 7701 / 10000 correct (77.01)


77.01

## Inline Question 3: Rank the above models based on their performance on test dataset (15 points)
( You are encouraged to run each of the experiments (training) at
least 3 times to get an average estimate )

Report the test accuracies alongside the model names. For example, 1. Vanilla CNN (57.45%, 57.99%).. etc

1. Resnet with attention (75.98%, 76.34%, 77.01%) <br /> 
2. Vanilla Resnet (74.88%, 74.43%, 73.52%) <br />
3. CNN with early attention (60.60%, 62.32%, 61.12%) <br />
4. CNN with double attention blocks (58.31%, 60.24%, 58.62%) <br />
5. Vanilla CNN (59.51%, 56.98%, 59.52%) <br />
6. CNN with late attention (56.37%, 57.68%, 58.30%) <br />

### Bonus Question (Ungraded): Can you give a possible explanation that supports the rankings?
Your Answer:

Sometimes Resnet with attention performs worse than Resnet vanilla. This is because the sampling efficiency decreases as we use Resnet with attention instead of Resnet, making the network require more epochs to train. After increasing the number of training epochs, the performance of Resnet with attention stably gets better than the vanilla one.