# Welcome to CS 5242 **Assignment 2**

ASSIGNMENT DEADLINE ⏰ : **3 March 2024**

In this assignment, we have three parts:
1. Implement some operations in CNNs from scratch *(2 Points)*
2. Implement a simple CNN and train on MNIST using PyTorch  *(4 Points)*
3. Implement a VGG network with PyTorch *(4 Points)*

Colab is a hosted Jupyter notebook service that requires no setup and provides free access to computing resources, including GPUs. This semester, we will use Colab to run our experiments.
1. Log in to Google Colab: https://colab.research.google.com/
2. In this assignment, we **need a GPU** to train the CNN model. You may need to **choose GPU in Runtime -> Change runtime type -> Hardware accelerator**.
![Alt text](image-1.png)


### **Grades Policy**

This homework is worth 10 points. Late submissions lose 15% per day, and you receive 0 points if you submit 7 days after the deadline.

### **Cautions**

**DO NOT** copy the code from the internet, e.g. GitHub.
---

**DO NOT** use any LLMs to write the code, e.g. ChatGPT.
---

### **Contact**

Please feel free to contact us if you have any questions about this homework or need any further information.

Slack: Wangbo Zhao


> If you have not joined the Slack group, you can click [here](https://join.slack.com/t/cs5242-2024spring/shared_invite/zt-2cw3jgqab-wFhoaIVa4RIX4fCZ_k~vjQ)

## Setup

Start by running the cell below to set up all required software.

In [1]:
!pip install numpy matplotlib
!pip install torch torchvision


Import the necessary libraries and fix the random seeds for Python, NumPy and PyTorch.

In [2]:
import math
import random

import numpy as np
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x10944caf0>

Now let's set up the GPU environment. Colab provides a free GPU. Do as follows:

- Runtime -> Change Runtime Type -> select `GPU` in Hardware accelerator
- Click `connect` on the top-right

After connecting to a GPU, you can check its status using the `nvidia-smi` command.

In [3]:
!nvidia-smi

torch.cuda.is_available()

zsh:1: command not found: nvidia-smi


False
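
In the run above `torch.cuda.is_available()` returned `False`, so the code has to work without a GPU as well. One common way to handle both cases is to pick a device once and move tensors and modules to it. The following is only a minimal sketch; the names `device`, `x_demo` and `layer_demo` are illustrative and not part of the assignment code.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors and modules are moved with .to(device); this works on both backends.
x_demo = torch.randn(2, 3, 32, 32).to(device)
layer_demo = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
print(device, layer_demo(x_demo).shape)
```

With this pattern the same cell runs unchanged on a Colab GPU runtime and on a CPU-only machine.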

Everything is ready. You can move on, and ***good luck!*** 😃

## Implement some operations in CNNs from scratch

In this section, you need to implement some operations commonly used in CNNs, including convolution and pooling.

Compare the results of your implementation with those of PyTorch. For a correct implementation, the error relative to PyTorch should be very small.


### Step 1
Given an input with 3 channels and 32x32 pixels, create a torch tensor with `torch.randn()`.

In [4]:
batch_size = 2
c = 3
h = 32
w = 32
x = torch.randn(batch_size, c, h, w)
print(x)
print(x.shape)

tensor([[[[-1.1258, -1.1524, -0.2506,  ...,  1.5863,  0.9463, -0.8437],
          [-0.6136,  0.0316, -0.4927,  ..., -1.2341,  1.8197, -0.5515],
          [-0.5692,  0.9200,  1.1108,  ..., -0.9565,  0.0335,  0.7101],
          ...,
          [ 1.0166,  1.2868,  2.0820,  ...,  0.8161, -0.5711, -0.1195],
          [-0.4274,  0.8143, -1.4121,  ..., -0.1394, -0.3677, -0.4574],
          [-1.2945,  0.7012, -1.9098,  ...,  0.5374,  1.0826, -1.7105]],

         [[-1.0841, -0.1287, -0.6811,  ..., -0.9825,  0.7184,  0.4402],
          [-0.5619,  0.6640, -2.1033,  ..., -0.7821, -2.1407,  0.3337],
          [-1.1230,  0.6210, -0.8764,  ...,  0.9159,  0.2990,  0.1771],
          ...,
          [ 2.2746, -0.9119,  0.5105,  ...,  0.4876, -0.9265, -0.5748],
          [ 0.7300, -0.9287,  0.1743,  ..., -0.7073, -0.8813, -0.5895],
          [-0.8363, -1.8354,  0.4765,  ..., -0.3812, -1.6687,  1.0869]],

         [[ 0.6657,  0.8847,  0.4671,  ...,  0.7709, -0.8416,  1.7962],
          [ 0.1924, -0.1777,  

### Step 2
We first implement these operations with PyTorch so that we can compare the results of our own implementation with those of the original PyTorch layers.


In [5]:

# 1. Build a max pooling layer torch_max_pool with PyTorch. The kernel size of the pooling is 2, the stride is 2, and there is no padding.
torch_max_pool = nn.MaxPool2d(kernel_size=2,
                              stride=2,
                              padding=0)

# 2. Build an average pooling layer torch_avg_pool with PyTorch. The kernel size of the pooling is 2, the stride is 1, and the padding is 1.
torch_avg_pool = nn.AvgPool2d(kernel_size=2,
                              stride=1,
                              padding=1)

# 3. Build a 2D convolutional layer torch_conv with PyTorch. The kernel size of the convolution is 3 and the stride is 1. The input and output channels are 3 and 64, respectively. We use zero padding of 1 to keep the spatial size of the output feature map.
torch_conv = nn.Conv2d(in_channels=3,
                       out_channels=64,
                       kernel_size=3,
                       stride=1,
                       padding=1)

# 4. Build a 2D batch normalization layer torch_norm with PyTorch for 3 input channels.
torch_norm = nn.BatchNorm2d(3)



In [6]:
torch_max_pool_out = torch_max_pool(x)
print(torch_max_pool_out.shape)

torch_avg_pool_out = torch_avg_pool(x)
print(torch_avg_pool_out.shape)

torch_conv_out = torch_conv(x)
print(torch_conv_out.shape)

torch_norm_out = torch_norm(x)
print(torch_norm_out.shape)


torch.Size([2, 3, 16, 16])
torch.Size([2, 3, 33, 33])
torch.Size([2, 64, 32, 32])
torch.Size([2, 3, 32, 32])
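
The spatial sizes printed above follow from the usual output-size rule for pooling and convolution, `floor((size + 2 * padding - kernel_size) / stride) + 1`. The sketch below reproduces the three sizes; the helper name `out_size` is hypothetical and only used for illustration.

```python
def out_size(size, kernel_size, stride, padding):
    """Output length along one spatial axis: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

max_pool_size = out_size(32, kernel_size=2, stride=2, padding=0)  # max pooling
avg_pool_size = out_size(32, kernel_size=2, stride=1, padding=1)  # average pooling
conv_size = out_size(32, kernel_size=3, stride=1, padding=1)      # convolution
print(max_pool_size, avg_pool_size, conv_size)  # 16 33 32
```

Batch normalization does not change the shape, which is why its output stays at 32 x 32.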


### Step 3

Implement these operations from scratch. Output your tensors as "my_xxx_out".

In [7]:
def my_max_pool(x, kernel_size, stride, padding):
    """
    Args:
        x: torch tensor with size (N, C_in, H_in, W_in),
        kernel_size: size of the window to take a max over,
        stride: stride of the window,
        padding: implicit negative-infinity padding to be added on both sides
            (as in nn.MaxPool2d, so padded positions never win the max),

    Return:
        y: torch tensor of size (N, C_in, H_out, W_out).
    """

    y = None
    # === Complete the code (0.5')
    N, C_in, H_in, W_in = x.shape
    # Calculate output dimensions
    H_out = (H_in + 2 * padding - kernel_size) // stride + 1
    W_out = (W_in + 2 * padding - kernel_size) // stride + 1

    # Apply padding with -inf: zero padding would give wrong maxima
    # wherever a border window contains only negative values
    if padding > 0:
        x_padded = torch.nn.functional.pad(x, (padding, padding, padding, padding), "constant", float("-inf"))
    else:
        x_padded = x

    # Initialize output tensor on the same device as the input
    y = torch.zeros((N, C_in, H_out, W_out), dtype=x.dtype, device=x.device)

    # Apply max pooling
    for n in range(N):
        for c in range(C_in):
            for h in range(H_out):
                for w in range(W_out):
                    h_start = h * stride
                    w_start = w * stride
                    h_end = h_start + kernel_size
                    w_end = w_start + kernel_size

                    window = x_padded[n, c, h_start:h_end, w_start:w_end]
                    y[n, c, h, w] = torch.max(window)

    # === Complete the code
    return y
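
The four nested Python loops are easy to follow but slow. As an optional alternative, the windows can be built as tensor views with `Tensor.unfold` and reduced in one call. This is only a sketch for the unpadded case; `max_pool_unfold` and `x_demo` are hypothetical names, not part of the required solution.

```python
import torch
import torch.nn.functional as F

def max_pool_unfold(x, kernel_size, stride):
    """Max pooling without padding, using sliding-window views instead of loops."""
    # After both unfolds the shape is (N, C, H_out, W_out, kernel_size, kernel_size)
    windows = x.unfold(2, kernel_size, stride).unfold(3, kernel_size, stride)
    # Reduce over the two trailing window dimensions
    return windows.amax(dim=(-2, -1))

torch.manual_seed(0)
x_demo = torch.randn(2, 3, 32, 32)
out = max_pool_unfold(x_demo, kernel_size=2, stride=2)
ref = F.max_pool2d(x_demo, kernel_size=2, stride=2)
print(out.shape, torch.equal(out, ref))
```

Because taking a maximum involves no rounding, the result should match PyTorch's exactly rather than only up to a tolerance.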

In [8]:
def my_avg_pool(x, kernel_size, stride, padding):
    """
    Args:
        x: torch tensor with size (N, C_in, H_in, W_in),
        kernel_size: size of the window,
        stride: stride of the window,
        padding: implicit zero padding to be added on both sides,

    Return:
        y: torch tensor of size (N, C_out, H_out, W_out).
    """

    y = None
    # === Complete the code (0.5')
    N, C_in, H_in, W_in = x.shape
    # Calculate output dimensions
    H_out = (H_in + 2 * padding - kernel_size) // stride + 1
    W_out = (W_in + 2 * padding - kernel_size) // stride + 1

    # Apply padding
    if padding > 0:
        x_padded = torch.nn.functional.pad(x, (padding, padding, padding, padding), "constant", 0)
    else:
        x_padded = x

    # Initialize output tensor
    y = torch.zeros((N, C_in, H_out, W_out), dtype=x.dtype)

    # Apply average pooling
    for n in range(N):
        for c in range(C_in):
            for h in range(H_out):
                for w in range(W_out):
                    h_start = h * stride
                    w_start = w * stride
                    h_end = h_start + kernel_size
                    w_end = w_start + kernel_size

                    window = x_padded[n, c, h_start:h_end, w_start:w_end]
                    y[n, c, h, w] = torch.mean(window)

    # === Complete the code
    return y

In [9]:
def my_conv(x, in_channels, out_channels, kernel_size, stride, padding, weight, bias):
    """
    Args:
        x: torch tensor with size (N, C_in, H_in, W_in),
        in_channels: number of channels in the input image, it is C_in;
        out_channels: number of channels produced by the convolution;
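        kernel_size: size of the convolving kernel,
        stride: stride of the convolution,
        padding: implicit zero padding to be added on both sides of each dimension,
        weight: convolution kernels of size (C_out, C_in, kernel_size, kernel_size),
        bias: per-output-channel bias of size (C_out,),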

    Return:
        y: torch tensor of size (N, C_out, H_out, W_out)
    """

    y = None
    # === Complete the code (0.5')
    N, C_in, H_in, W_in = x.shape
    # Calculate output dimensions
    H_out = (H_in + 2 * padding - kernel_size) // stride + 1
    W_out = (W_in + 2 * padding - kernel_size) // stride + 1

    # Apply padding
    x_padded = torch.nn.functional.pad(x, (padding, padding, padding, padding), "constant", 0)

    # Initialize output tensor
    y = torch.zeros((N, out_channels, H_out, W_out), dtype=x.dtype)

    # Apply convolution
    for n in range(N):
        for c_out in range(out_channels):
            for h in range(H_out):
                for w in range(W_out):
                    h_start = h * stride
                    w_start = w * stride
                    h_end = h_start + kernel_size
                    w_end = w_start + kernel_size

                    x_slice = x_padded[n, :, h_start:h_end, w_start:w_end]
                    y[n, c_out, h, w] = torch.sum(x_slice * weight[c_out], dim=(0, 1, 2)) + bias[c_out]
    # === Complete the code
    return y
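
A convolution can also be written as one matrix multiplication over unfolded patches (often called im2col). The sketch below uses `torch.nn.functional.unfold` for this and checks the result against `F.conv2d`; the function name `conv2d_im2col` and the demo tensors are illustrative assumptions, not the assignment's reference solution.

```python
import torch
import torch.nn.functional as F

def conv2d_im2col(x, weight, bias, stride=1, padding=0):
    """2D convolution as a single matrix multiply over unfolded patches."""
    N, _, H_in, W_in = x.shape
    C_out, _, k_h, k_w = weight.shape
    H_out = (H_in + 2 * padding - k_h) // stride + 1
    W_out = (W_in + 2 * padding - k_w) // stride + 1
    # cols: (N, C_in * k_h * k_w, H_out * W_out), zero-padded like nn.Conv2d
    cols = F.unfold(x, kernel_size=(k_h, k_w), padding=padding, stride=stride)
    # (C_out, C_in*k_h*k_w) @ (N, C_in*k_h*k_w, L) -> (N, C_out, L)
    out = weight.reshape(C_out, -1) @ cols + bias.reshape(1, -1, 1)
    return out.reshape(N, C_out, H_out, W_out)

torch.manual_seed(0)
x_demo = torch.randn(2, 3, 32, 32)
w_demo = torch.randn(64, 3, 3, 3)
b_demo = torch.randn(64)
ours = conv2d_im2col(x_demo, w_demo, b_demo, stride=1, padding=1)
ref = F.conv2d(x_demo, w_demo, b_demo, stride=1, padding=1)
print(ours.shape, (ours - ref).abs().max().item())
```

The two results are expected to differ only by floating-point rounding, since the order of the additions is not the same.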

In [10]:
def my_batchnorm(x, num_features, eps=1e-5):
    """
    Args:
        x: torch tensor with size (N, C, H, W),
        num_features: number of features in the input tensor, it is C;
        eps: a value added to the denominator for numerical stability. Default: 1e-5

    Return:
        y: torch tensor of size (N, C, H, W)
    """

    y = torch.empty_like(x)
    # === Complete the code (0.5')
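    # Training-mode batch norm: statistics are computed per channel over the
    # batch and spatial dimensions (N, H, W). The default affine parameters of
    # nn.BatchNorm2d (weight=1, bias=0) leave the normalized output unchanged.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    # The biased variance (unbiased=False) is what batch norm uses to normalize
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)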

    # Normalize
    y = (x - mean) / torch.sqrt(var + eps)

    # === Complete the code
    return y
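
`my_batchnorm` matches a freshly created `nn.BatchNorm2d` because that layer starts with scale 1 and shift 0. For a trained layer the learnable scale and shift would also have to be applied. The following sketch shows that extension under the assumption of training-mode statistics; `batchnorm_affine`, `gamma` and `beta` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def batchnorm_affine(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over (N, H, W), then a per-channel scale and shift."""
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # gamma and beta have shape (C,); reshape so they broadcast over N, H, W
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

torch.manual_seed(0)
x_demo = torch.randn(2, 3, 32, 32)
gamma = torch.tensor([1.0, 2.0, 0.5])
beta = torch.tensor([0.0, -1.0, 3.0])
ours = batchnorm_affine(x_demo, gamma, beta)
ref = F.batch_norm(x_demo, None, None, weight=gamma, bias=beta, training=True, eps=1e-5)
print((ours - ref).abs().max().item())
```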

In [12]:
my_max_pool_out = my_max_pool(x, kernel_size=2, stride=2, padding=0)
my_avg_pool_out = my_avg_pool(x, kernel_size=2, stride=1, padding=1)
my_conv_out = my_conv(x,
                      in_channels=3,
                      out_channels=64,
                      kernel_size=3,
                      stride=1,
                      padding=1,
                      weight=torch_conv.weight,
                      bias=torch_conv.bias)
my_norm_out = my_batchnorm(x, num_features=3, eps=1e-5)


### Step 4

Compare and show that "torch_xxx_out" and "my_xxx_out" are equal up to small numerical errors.

In [13]:
print(F.mse_loss(my_max_pool_out, torch_max_pool_out))
print(F.mse_loss(my_avg_pool_out, torch_avg_pool_out))
print(F.mse_loss(my_conv_out, torch_conv_out))
print(F.mse_loss(my_norm_out, torch_norm_out))

tensor(0.)
tensor(4.2238e-16)
tensor(4.8772e-15, grad_fn=<MseLossBackward0>)
tensor(3.8154e-15, grad_fn=<MseLossBackward0>)
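
The mean squared error averages over all elements, so a single badly wrong element can hide behind many correct ones. Two complementary checks are the maximum absolute difference and `torch.allclose`. The sketch below demonstrates both on a small stand-in example (2x2 average pooling with stride 2), not on the assignment tensors; the tolerances are illustrative choices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_demo = torch.randn(2, 3, 32, 32)

# Reference result and a hand-written equivalent (2x2 average pooling, stride 2)
ref = F.avg_pool2d(x_demo, kernel_size=2, stride=2)
mine = x_demo.reshape(2, 3, 16, 2, 16, 2).mean(dim=(3, 5))

# Largest element-wise deviation; unlike MSE it is not averaged away
max_abs_err = (mine - ref).abs().max().item()
# True when |mine - ref| <= atol + rtol * |ref| holds for every element
close = torch.allclose(mine, ref, rtol=1e-5, atol=1e-6)
print(max_abs_err, close)
```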


## Implement a simple CNN and train it on MNIST using PyTorch

### Step 1
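Create the datasets. The MNIST dataset consists of handwritten digit images with labels from 0 to 9: 60,000 training samples and 10,000 test samples, each a 28 * 28 pixel grayscale image. Note that the cell below loads `torchvision.datasets.FashionMNIST`, which has the same split sizes, image size and number of classes but contains images of clothing items instead of digits.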

In [14]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

train_set = torchvision.datasets.FashionMNIST(
    root = 'FashionMNIST/',
    train = True,
    download = True,
    transform = transforms.Compose([
        transforms.ToTensor()
    ])
)

test_set = torchvision.datasets.FashionMNIST(
    root = 'FashionMNIST/',
    train = False,
    download = True,
    transform = transforms.Compose([
        transforms.ToTensor()
    ])
)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=100)

### Step 2
Create the model.
You can build a simple convolutional neural network for the classification task. You may refine the architecture based on the accuracy, and you can also try different learning rates.
**The test accuracy should reach at least 85%.**


In [15]:
class Network(nn.Module):
    def __init__(self):
        super(Network,self).__init__()
        # Define the network layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully connected layer
        self.fc1 = nn.Linear(in_features=64 * 7 * 7, out_features=128)  # Assuming input images are 28x28
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(in_features=128, out_features=10)  # Assuming 10 classes


    def forward(self, input):
        x = self.pool1(self.relu1(self.conv1(input)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7)  # Flatten the tensor for the fully connected layer
        x = self.relu3(self.fc1(x))
        t = self.fc2(x)

        return t

network = Network()
if torch.cuda.is_available():
    network = network.cuda()


optimizer = optim.Adam(network.parameters(), lr=0.01)
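
The value `64 * 7 * 7` used for `fc1` follows from tracing a 28 x 28 input through the two convolution and pooling stages. The sketch below does this trace in plain Python and also counts the learnable parameters per layer; `conv_out`, `flat_features` and `params` are illustrative names.

```python
def conv_out(size, kernel_size, stride, padding):
    """Output length along one spatial axis."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 28                          # input image side length
size = conv_out(size, 3, 1, 1)     # conv1 keeps 28
size = conv_out(size, 2, 2, 0)     # pool1 -> 14
size = conv_out(size, 3, 1, 1)     # conv2 keeps 14
size = conv_out(size, 2, 2, 0)     # pool2 -> 7
flat_features = 64 * size * size   # 64 channels after conv2

# Weights plus one bias per output channel / output unit
params = {
    "conv1": (3 * 3 * 1 + 1) * 32,
    "conv2": (3 * 3 * 32 + 1) * 64,
    "fc1": (flat_features + 1) * 128,
    "fc2": (128 + 1) * 10,
}
total_network_params = sum(params.values())
print(size, flat_features, total_network_params)  # 7 3136 421642
```

Most of the parameters sit in `fc1`, which is typical for small CNNs with a fully connected head.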

### Step 3

Build the training and test loops.

In [16]:
for epoch in range(10):
    total_loss = 0
    total_correct = 0
    for batch in train_loader:
        images, labels = batch
        if torch.cuda.is_available():
            images = images.cuda()
            labels = labels.cuda()

        optimizer.zero_grad()
        preds = network(images)
        loss = F.cross_entropy(preds, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _,prelabels=torch.max(preds,dim=1)
        total_correct += (prelabels==labels).sum().item()
    accuracy = total_correct/len(train_set)
    print("Epoch:%d  ,  Loss:%f  , Train Accuracy:%f "%(epoch, total_loss, accuracy * 100))


correct = 0
total = 0
network.eval()
with torch.no_grad():
    for batch in test_loader:
        imgs, labels = batch
        if torch.cuda.is_available():
            imgs = imgs.cuda()
            labels = labels.cuda()
        preds = network(imgs)
        _, prelabels = torch.max(preds, dim=1)
        total = total + labels.size(0)
        correct = correct + int((prelabels == labels).sum())
    accuracy = correct / total
    print("Test Accuracy: ", accuracy * 100)

Epoch:0  ,  Loss:249.585365  , Train Accuracy:84.935000 
Epoch:1  ,  Loss:173.049925  , Train Accuracy:89.330000 
Epoch:2  ,  Loss:154.921687  , Train Accuracy:90.358333 
Epoch:3  ,  Loss:144.079812  , Train Accuracy:91.001667 
Epoch:4  ,  Loss:134.901535  , Train Accuracy:91.515000 
Epoch:5  ,  Loss:129.654091  , Train Accuracy:91.803333 
Epoch:6  ,  Loss:122.124460  , Train Accuracy:92.360000 
Epoch:7  ,  Loss:117.382260  , Train Accuracy:92.713333 
Epoch:8  ,  Loss:116.227659  , Train Accuracy:92.713333 
Epoch:9  ,  Loss:109.195896  , Train Accuracy:93.123333 
Test Accuracy:  88.88000000000001
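
The evaluation logic can be wrapped in a reusable function, which makes it easy to report test accuracy after every epoch instead of only once at the end. The following is a sketch under stated assumptions: `evaluate`, `toy_model` and the synthetic `fake_loader` are hypothetical stand-ins so the example runs without downloading a dataset.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, device):
    """Return classification accuracy in [0, 1] over all batches of `loader`."""
    was_training = model.training
    model.eval()                      # dropout off, batch norm uses running statistics
    correct, total = 0, 0
    with torch.no_grad():             # no gradients are needed for evaluation
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    model.train(was_training)         # restore the previous mode
    return correct / total

# Synthetic stand-in data with the same shape as the real images
torch.manual_seed(0)
fake_images = torch.randn(50, 1, 28, 28)
fake_labels = torch.randint(0, 10, (50,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=20)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
acc = evaluate(toy_model, fake_loader, torch.device("cpu"))
print(acc)
```

Restoring the training mode at the end matters once the model contains dropout or batch normalization layers.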


## Implement a VGG network with PyTorch
VGG is a type of CNN (convolutional neural network) that was considered one of the best computer vision models in 2015.
https://arxiv.org/abs/1409.1556

Here is the configuration of the network from the paper. You need to implement **Config C** with PyTorch.

![Alt text](image-2.png)

In [17]:
import torch
from torch import nn

class VGG(nn.Module):
    def __init__(self, num_classes=1000) -> None:
        super().__init__()
        # Define the convolutional blocks
        # The configuration for Config C is listed below; the third convolution in blocks 3, 4 and 5 is 1x1, all others are 3x3:
        # '64', '64', 'M', '128', '128', 'M', '256', '256', '256', 'M', '512', '512', '512', 'M', '512', '512', '512', 'M'
        self.features = nn.Sequential(
            # Conv Block 1
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Conv Block 2
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Conv Block 3
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # 1x1 convolution (conv1-256): padding=0 keeps the spatial size unchanged
            nn.Conv2d(256, 256, kernel_size=1, padding=0),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Conv Block 4
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # 1x1 convolution (conv1-512): padding=0 keeps the spatial size unchanged
            nn.Conv2d(512, 512, kernel_size=1, padding=0),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Conv Block 5
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # 1x1 convolution (conv1-512): padding=0 keeps the spatial size unchanged
            nn.Conv2d(512, 512, kernel_size=1, padding=0),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        
        # Define the classifier (fully connected layers)
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
    
    def forward(self, image):
        x = self.features(image)
        x = torch.flatten(x, 1)  # Flatten the output for the fully connected layers
        x = self.classifier(x)
        return x
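
The first linear layer of the classifier expects a 512 x 7 x 7 feature map for a 224 x 224 input. The sketch below traces the spatial size through the five blocks of Config C in plain Python. It suggests that the expected size is only reached when the 1x1 convolutions use no padding; `conv_out`, `config_c_kernels` and `final_size` are illustrative names.

```python
def conv_out(size, kernel_size, padding, stride=1):
    """Output length along one spatial axis."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Kernel sizes per block for Config C; each block ends with 2x2 max pooling
config_c_kernels = [[3, 3], [3, 3], [3, 3, 1], [3, 3, 1], [3, 3, 1]]

def final_size(input_size, pad_for_1x1):
    size = input_size
    for block in config_c_kernels:
        for k in block:
            padding = 1 if k == 3 else pad_for_1x1
            size = conv_out(size, k, padding)
        size = size // 2              # nn.MaxPool2d(kernel_size=2, stride=2)
    return size

size_pad0 = final_size(224, pad_for_1x1=0)
size_pad1 = final_size(224, pad_for_1x1=1)
print(size_pad0, size_pad1)  # 7 8
```

With a padding of 1 on the 1x1 convolutions, each of them grows the feature map by 2 pixels, and the flattened features no longer fit `nn.Linear(512 * 7 * 7, 4096)`.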

Then, please calculate the number of parameters and FLOPs (floating point operations) of **Config C**.
You only need to consider the FLOPs of the convolutional and fully connected (FC) layers in **Config C**.


In [2]:
model_layers = [
    (3, 64, 3, 1, 1),
    (64, 64, 3, 1, 1),
    'M',  # Adding 'M' to represent MaxPooling which affects the size but not FLOPs directly
    (64, 128, 3, 1, 1),
    (128, 128, 3, 1, 1),
    'M',
    (128, 256, 3, 1, 1),
    (256, 256, 3, 1, 1),
    (256, 256, 1, 1, 0),  # 1x1 convolution: no padding, so the spatial size is preserved
    'M',
    (256, 512, 3, 1, 1),
    (512, 512, 3, 1, 1),
    (512, 512, 1, 1, 0),  # 1x1 convolution: no padding
    'M',
    (512, 512, 3, 1, 1),
    (512, 512, 3, 1, 1),
    (512, 512, 1, 1, 0),  # 1x1 convolution: no padding
    'M',
]

fc_layers = [
    (512 * 7 * 7, 4096),
    (4096, 4096),
    (4096, 1000),
]

# Initialize counts
total_params = 0
total_flops = 0

# Image dimensions, starting with the input layer
input_width, input_height = 224, 224

# Convolutional layers
for layer in model_layers:
    if layer == 'M':  # MaxPooling, halving the dimensions (integer division keeps them ints)
        input_width //= 2
        input_height //= 2
    else:
        in_channels, out_channels, kernel_size, stride, padding = layer
        # Parameter count for Conv Layers: (kernel_size^2 * in_channels + 1) * out_channels
        params = (kernel_size ** 2 * in_channels + 1) * out_channels
        
        # Output dimensions
        output_width = (input_width + 2 * padding - kernel_size) // stride + 1
        output_height = (input_height + 2 * padding - kernel_size) // stride + 1
        
        # FLOPs for Conv Layers: 2 * kernel_size^2 * in_channels * output_width * output_height * out_channels
        # (one multiply-accumulate counts as two FLOPs; bias additions are ignored)
        flops = 2 * kernel_size ** 2 * in_channels * output_width * output_height * out_channels
        
        total_params += params
        total_flops += flops
        
        # Update dimensions for next layer
        input_width, input_height = output_width, output_height

# Fully connected layers
for in_features, out_features in fc_layers:
    # Parameter count for FC layers: (in_features + 1) * out_features (including bias)
    params = (in_features + 1) * out_features
    # FLOPs for FC layers: 2 * in_features * out_features
    flops = 2 * in_features * out_features
    
    total_params += params
    total_flops += flops

total_params, total_flops



(133638952, 23541776384)

For Config C of the VGG network:
The total number of parameters is approximately 134 million.
The total number of floating point operations (FLOPs) for the convolutional and fully connected layers is approximately 23.5 billion for a 224 x 224 input, counting each multiply-accumulate as two FLOPs.
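
The per-layer parameter formulas can be cross-checked against PyTorch by counting the elements of a layer's parameters. The sketch below does this for the first convolution and the last fully connected layer of Config C only, which keeps memory use small; `count_params` is a hypothetical helper name.

```python
import torch
from torch import nn

def count_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # first layer of Config C
fc = nn.Linear(4096, 1000)                          # last layer of Config C
conv_count = count_params(conv)
fc_count = count_params(fc)
print(conv_count, (3 ** 2 * 3 + 1) * 64)            # 1792 1792
print(fc_count, (4096 + 1) * 1000)                  # 4097000 4097000
```

The same helper could be applied to a full model instance to confirm the total, at the cost of allocating all of its weights.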