## <h1><center>Assignment 3: Contrastive Learning</center></h1>



<center>
    <img src="https://www.cs.cornell.edu/courses/cs4782/2025sp/images/p3_header.jpeg" style="width:45%;">
</center>



&nbsp;


---


**Goal:** In this project you will explore different strategies for contrastive pre-training. You will use different computer vision models to generate meaningful image features, which will later be used to perform image classification on CIFAR-10, a widely used machine learning dataset.

&nbsp;

**WHAT YOU'LL SUBMIT:** Your submission to Gradescope includes:


1.   A `.zip` file uploaded to ***Coding Assignment 3*** [here](https://www.gradescope.com/courses/963234/assignments/5901455) containing the following files:

<center>

\#|Files
---|---
i. | `submission.py`
ii. |`linear_model_preds.csv`

</center>


2.   A `.pdf` version of `responses.tex` with responses to the questions in this notebook uploaded to ***Coding Assignment 3 Responses*** [here](https://www.gradescope.com/courses/963234/assignments/5901482).

*More on how you are expected to access, modify and save these files as you follow along the instructions in the notebook.*

&nbsp;

**IMPORTANT:**

This coding assignment requires training 3 separate models (~30+ minutes each) and therefore requires the use of GPUs. We have requested Google compute credits for every student in the class. These credits can be used to provision a GPU in Google Cloud, which can then be used in Colab. To receive your compute credits and set up a Google Cloud instance, use the instructions [here](https://www.cs.cornell.edu/courses/cs4782/2025sp/docs/gc_guide.pdf).

Things to keep in mind:
*   **The GPU charges by the minute. Only use it when required and remember to delete the GPU job as soon as you're done.**  If you use up the credits given to you, we will not be able to provide you with more credits.
*   You are only allowed one access code per person. The credits given using the code will be used across Coding Assignments 3, 4, 5 and the Final project. It is imperative that you use them wisely. In the case you run out of credits, we will not be able to award you more and you may not be able to complete future assignments!
*   When selecting the GPU, select the region to be somewhere in the US.
*   For GPU type, request either T4 or P100. In our experience, P100s are more easily available.
*   If the GPU resource is not available, you can try to reserve a GPU in another region.
*   To check which GPUs are assigned to you and to delete the GPU job, go [here](https://console.cloud.google.com/marketplace/product/colab-marketplace-image-public/colab) and click "View Deployments"

&nbsp;

**GOOGLE CLOUD:**

While working on Google Cloud, there is no option to mount Google Drive to access `submission.py` as you have done in previous projects. Instead, you **MUST** upload your `submission.py` to the session storage of your notebook on your Google Cloud instance.

**WARNING:** Files in the session storage are automatically deleted when you close your Google Cloud instance, so you MUST download your `submission.py` if you make any changes to it on your Google Cloud instance before exiting. 

**Our recommendation:** Write your code in regular Colab as you have for the previous 2 assignments. Debug using the free T4 instance provided by Colab before launching your notebook on the Google Cloud instance for a final run-through. You have ~2 hours of free T4 access a day, which is not enough to complete the entire assignment but is enough to debug. To run on a T4 GPU in Colab, click on the Runtime menu and change your runtime type to T4 GPU (this should make your training faster).

&nbsp;

**DO's:**


1.   **Recommendation:** Finish coding and debugging on regular Colab; only use the Google Cloud in the end to get the final results.
2.   As before, all functionality you need to modify is within `submission.py`.
3.   When on Google Cloud, upload your `submission.py` to the session storage.
4.   Remember to execute all code cells sequentially, not just those you’ve edited, to ensure your code runs properly.
5.   Please cite any external sources you use to complete this assignment in your written responses.
6.   Before starting your work, please review <a href="https://s3.amazonaws.com/ecornell/global/eCornellPlagiarismPolicy.pdf">eCornell's policy regarding plagiarism</a> (the presentation of someone else's work as your own without source credit).

&nbsp;

**DON'Ts:**


1.   DO NOT leave your GPU instance running after you are done working on this assignment. **AGAIN: The GPU charges by the minute. Only use it when required and remember to delete the GPU job as soon as you're done.**  If you use up the credits given to you, we will not be able to provide you with more credits.
2.   DO NOT put your credit card on your Google Cloud account. If you accidentally leave your GPU instance running, you can be charged and will not be refunded! This has happened in the past!
3.   DO NOT forget to download `submission.py` from the session storage in Google Cloud before closing your instance if you made any edits!
4.   DO NOT change the names of any provided functions, classes, or variables within the existing code cells, as this will interfere with grading.
5.   DO NOT delete any provided code/imports.

&nbsp;

***NOTE:***
    
*You can resubmit your work as many times as necessary before the submission deadline. If you experience difficulty or have questions about this exercise, use the Ed discussion board to engage with your peers or seek assistance from the TAs.*

# Part 0: Setting up the Colab environment.

The first few code blocks will set up your Colab environment. Upload the `a3_release` folder to your Google Drive and run/update the cells below, following the TODO instructions. Just like in the first assignment, you must specify the path to your implementation so it can be accessed by this notebook (see *TODO 0*).

In [None]:
!pip install einops
!pip install datasets
!pip install umap-learn

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from torchvision.transforms import RandAugment
from IPython.core.debugger import set_trace

import transformers
from einops import rearrange
from einops.layers.torch import Rearrange
from datasets import load_dataset

import os
import sys
import json
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

import umap

from google.colab import drive

torch.manual_seed(0)

**IMPORTANT:**

- If on **regular Google Colab**, run the following cell to mount your Drive!  
- If on **Google Cloud**, SKIP this cell as mounting your Google Drive is not supported on Google Cloud. Instead, upload your `submission.py` to the session storage and run the cell after this one. 
- **DO NOT FORGET** to download your `submission.py` before closing your Google Cloud instance if you make any changes as files in the session storage are deleted.

In [None]:
# TODO 0: Mount your Google Drive; this allows the runtime environment to access your drive.
drive.mount('/content/gdrive')

# NOTE: Make sure your path does NOT include a '/' at the end!
base_dir = "/content/gdrive/MyDrive/<path-to-a3-release>"
sys.path.append(base_dir)
## END TODO

Always run the following cell (even if on Google Cloud).

In [None]:
# This makes sure the submission module is reloaded whenever you make edits.
%load_ext autoreload
%aimport submission
%autoreload 1
import submission

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(F"Device set to {device}")

In [None]:
NUM_EPOCHS = 10

# Part 0.5: Setting up ResNet and Cifar10

First, we provide a ResNet implementation, which will be trained to extract image features using the contrastive learning losses you will implement in Part 1.

In [None]:
class ResidualBlock(nn.Module):
    def __init__(self, in_channel, interm_channel, out_channel, stride=1):
        """
        Inputs:
        in_channel = number of channels in the input to the first convolutional layer
        interm_channel = number of channels in the output of the first convolutional layer
                       = number of channels in the input to the second convolutional layer
        out_channel = number of channels in the output
        stride = stride for convolution, defaults to 1
        """
        super().__init__()

        self.conv1 = nn.Conv2d(in_channel, interm_channel, kernel_size = 3, stride = stride, padding = 1)
        self.conv2 = nn.Conv2d(interm_channel, out_channel, kernel_size = 3, stride = stride, padding = 1)
        self.conv3 = nn.Conv2d(in_channel, out_channel, kernel_size = 1, stride = stride) # 1x1 convolution
        self.bn1 = nn.BatchNorm2d(interm_channel)
        self.bn2 = nn.BatchNorm2d(out_channel)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        x = self.conv3(x)  # 1x1 convolution so the shortcut matches y's channel count
        y += x  # shortcut (skip) connection
        return F.relu(y)


class ResNet(nn.Module):
    def __init__(self, num_blocks, layer1_channel, layer2_channel, out_channel):
        """
        Inputs:
        num_blocks = number of blocks in a block layer
        layer1_channel = number of channels in the input to the first block layer
        layer2_channel = number of channels in the output of the first block layer
                       = number of channels in the input to the second block layer
        out_channel = number of channels in the output
        """
        super(ResNet, self).__init__()
        self.first = nn.Sequential(
            nn.Conv2d(3, layer1_channel, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(layer1_channel), nn.SiLU(),
        )

        self.last = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),

        )

        self.layer1 = self.block_layer(num_blocks, layer1_channel, layer2_channel)
        self.layer2 = self.block_layer(num_blocks, layer2_channel, out_channel)

        self.projection_head = nn.Sequential(
            nn.Linear(out_channel, out_channel),
            nn.SiLU(),
            nn.Linear(out_channel, out_channel)
        )


    def block_layer(self, num_blocks, in_channel, out_channel):
        """
        Inputs:
        num_blocks = number of blocks in the block layer
        in_channel = number of input channels to the entire block layer
        out_channel = number of output channels in the output of the entire block layer
        """
        blk = []
        for i in range(num_blocks):
            if i == 0:
                blk.append(ResidualBlock(in_channel, out_channel, out_channel))
            else:
                blk.append(ResidualBlock(out_channel, out_channel, out_channel))

        return nn.Sequential(*blk)


    def forward(self, x, return_embedding=False):
        # x: (batch_size, 3, 32, 32)
        y = self.first(x)
        y = self.layer1(y)
        # 2x2 avg pooling
        y = F.avg_pool2d(y, kernel_size=2, stride=2)
        y = self.layer2(y)
        y = self.last(y)
        if return_embedding:
            return y
        y = self.projection_head(y)
        return y



The following code block defines the dataset we will be using for this assignment. We will be using the [Cifar10](https://huggingface.co/datasets/cifar10) dataset, which contains color images of size 32x32 for 10 different classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck).

We will be creating positive pairs of images by performing two different, randomized image transformations on the same image.

In [None]:
# Load cifar10 data from Hugging Face
cifar10 = load_dataset('cifar10')

# Load the data
train_data = cifar10['train']
test_data = cifar10['test']

# Define randomized transforms for training
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
    RandAugment(2, 9),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Define fixed SimCLR transforms for test-time
test_transform = transforms.Compose([
    transforms.Resize(32),
    transforms.CenterCrop(32),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Define the SimCLR dataset
class SimCLRDataset(Dataset):
    def __init__(self, data, transform, split='train'):
        self.split = split
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        image = self.data[idx]['img']
        if self.split == 'train':
            image1 = self.transform(image)
            image2 = self.transform(image)
            return image1, image2
        else:
            image = self.transform(image)
            return image

# Create the SimCLR dataset
train_dataset = SimCLRDataset(train_data, train_transform, split='train')
val_dataset = SimCLRDataset(train_data, test_transform, split='test')
test_dataset = SimCLRDataset(test_data, test_transform, split='test')

# Create the SimCLR dataloaders
BATCH_SIZE = 256
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

## Part 1: Contrastive Learning Loss Functions (55 pts)

For this assignment, you will be implementing two different contrastive learning loss functions, the triplet loss and the SimCLR loss.

### Part 1.1: Triplet Loss (25 pts)

The formula for the triplet loss is as follows:
$$Loss = \texttt{max}(0,\text{sim}(x_i,\ x_i^n) - \text{sim}(x_i,\ x_i^p) + m)$$

where $x_i$ is a training example, $x_i^n$ and $x_i^p$ are negative and positive examples of $x_i$, respectively, $m$ is the margin, and $\text{sim}(a,b)$ is some similarity function.

You will implement the `triplet_loss` function as follows:

1.   L2-normalize each of the input training examples.
2.   Calculate the triplet loss between all possible triplets of negative and positive examples.

        a. The similarity function is the dot product of the examples, i.e. $\text{sim}(a, b) = a(b)^T$

        b. `queries` and `keys` both contain $b$ examples, where $b$ is the batch size. The $i^{th}$ example in `keys` is a **positive example** for the $j^{th}$ example in `queries` if $i = j$. Otherwise, if $i \neq j$, then the $i^{th}$ example in `keys` is a **negative example** for the $j^{th}$ example in `queries`.

3. Average the loss across all triplets.
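The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference solution: the name `triplet_loss_sketch` and the default `margin=1.0` are hypothetical, and the sketch assumes the average is taken over the $b(b-1)$ (query, positive, negative) triplets only, excluding the diagonal.

```python
import torch
import torch.nn.functional as F


def triplet_loss_sketch(queries, keys, margin=1.0):
    """Hypothetical sketch; margin=1.0 is an assumed default."""
    # Step 1: L2-normalize each example so the dot product is bounded in [-1, 1].
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)

    # Step 2a: sim[j, i] is the dot product of query j and key i.
    sim = q @ k.T                        # (b, b)

    # Step 2b: the diagonal holds sim(x_j, x_j^p); off-diagonal entries are negatives.
    pos = sim.diagonal().unsqueeze(1)    # (b, 1), broadcast across each row
    losses = F.relu(sim - pos + margin)  # max(0, sim_neg - sim_pos + m)

    # Step 3: average over negatives only (assumption: diagonal entries are excluded).
    neg_mask = ~torch.eye(sim.shape[0], dtype=torch.bool, device=sim.device)
    return losses[neg_mask].mean()


queries = torch.randn(8, 16)
keys = torch.randn(8, 16)
print(triplet_loss_sketch(queries, keys).item())
```

Two sanity checks follow from the formula: mutually orthogonal examples with a margin of 1 give a loss of 0, and a batch of identical examples gives a loss equal to the margin.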

The following PyTorch functions may be helpful for implementing the triplet loss:


*   https://pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html
*   https://pytorch.org/docs/stable/generated/torch.matmul.html
*   https://pytorch.org/docs/stable/generated/torch.eye.html
*   https://pytorch.org/docs/stable/generated/torch.nn.functional.relu.html








In [None]:
# TODO: Implement the triplet_loss function in submission.py
from submission import triplet_loss

Now that we have implemented the triplet loss, we can train our ResNet to produce meaningful image features.

In [None]:
# Define the triplet model
triplet_model = ResNet(2, 64, 128, 256).cuda()

# Define the optimizer
optimizer = optim.AdamW(triplet_model.parameters(), lr=0.001)

# Define the learning rate scheduler
scheduler = transformers.get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=NUM_EPOCHS * len(train_dataloader))

In [None]:
def train_epoch(model, loader, optimizer, scheduler, loss_fn):
    model.train()
    total_loss = 0
    for image1, image2 in tqdm(loader):
        image1 = image1.to(device)
        image2 = image2.to(device)

        queries = model(image1)
        keys = model(image2)

        loss = loss_fn(queries, keys)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()

    return total_loss / len(loader)

The following code will train the ResNet using the triplet loss you just implemented. This cell takes a while to run (~30-40 minutes on GPU).

In [None]:
# Train the model
loss_fn = triplet_loss
for epoch in range(NUM_EPOCHS):
    train_loss = train_epoch(triplet_model, train_dataloader, optimizer, scheduler, loss_fn)
    print(f'Epoch {epoch}, Train Loss: {train_loss:.4f}')

# Switch to evaluation mode before extracting features for the linear classifier
triplet_model.eval()

Before we can use the image features produced by the ResNet to help our image classification model, we need to define a function, `extract_features`, that will pass each of our images through the pre-trained model to extract the corresponding image features.

In [None]:
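def extract_features(model, dataloader):
    """Return (features, pixels) for every image in `dataloader`.

    features: NumPy array of embeddings taken before the projection head.
    pixels: list of per-batch CPU tensors holding the transformed input images.
    """
    features = []
    pixels = []
    with torch.no_grad():
        for image in dataloader:
            image = image.cuda()
            pixels.append(image.to('cpu'))
            feature = model(image, return_embedding=True)
            features.append(feature)
    features = torch.cat(features).to('cpu').numpy()
    return features, pixels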

features, train_img_features = extract_features(triplet_model, val_dataloader)

The following code will train a linear classifier on the extracted features.

In [None]:
subsample = np.random.choice(features.shape[0], size=5000, replace=False)
features_subsample = features[subsample]
train_data_label = np.array(train_data['label'])[subsample]

classifier = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100, solver='saga', multi_class='multinomial', verbose=1))
classifier.fit(features_subsample, train_data_label)

To examine how beneficial these image features are for classification, we can compare the model trained using the ResNet image features to a linear model that simply uses the pixels of the original image as an input. The following code will train the pixel-space classifier. This cell may take 1-2 minutes to run on GPU. 

In [None]:
# Train the pixel-space classifier
train_img_array = np.concatenate(train_img_features, axis=0)

train_img_array = train_img_array[subsample]
pixel_classifier = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100, solver='saga', multi_class='multinomial', verbose=1))
pixel_classifier.fit(train_img_array.reshape(-1, 32 * 32 * 3), train_data_label)

In [None]:
test_features, test_img_features = extract_features(triplet_model, test_dataloader)

Finally, let us compare the performance of our two models on the test set.

In [None]:
# Test the feature classifier
predictions = classifier.predict(test_features)
triplet_accuracy = accuracy_score(test_data['label'], predictions)
print(f'Accuracy: {triplet_accuracy:.4f}')

# Convert the 'img' list into a numpy array
test_img_array = np.concatenate(test_img_features, axis=0)

# Test the pixel-space classifier
pixel_predictions = pixel_classifier.predict(test_img_array.reshape(-1, 32 * 32 * 3))
pixel_accuracy = accuracy_score(test_data['label'], pixel_predictions)
print(f'Pixel Accuracy: {pixel_accuracy:.4f}')

To reference the performance of our models later, you can run this cell to save the model accuracies to a JSON file.

In [None]:
# Save accuracy to a file
with open(f'accuracy_epoch{NUM_EPOCHS}_0.json', 'w') as f:
    json.dump({'triplet_accuracy': triplet_accuracy, 'pixel_accuracy': pixel_accuracy}, f)
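To read a saved accuracy file back later, a small helper along these lines should work. The helper name `load_accuracies` is hypothetical, and the demo writes placeholder values (not real results) to a temporary file so the snippet runs on its own.

```python
import json
import os
import tempfile


def load_accuracies(path):
    """Read a JSON file of accuracies back into a dict (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)


# Demo with placeholder values (NOT real results), written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "accuracy_demo.json")
with open(demo_path, "w") as f:
    json.dump({"triplet_accuracy": 0.5, "pixel_accuracy": 0.25}, f)

accuracies = load_accuracies(demo_path)
for name, value in sorted(accuracies.items()):
    print(f"{name}: {value:.4f}")
```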

### Q1: How do the performances of the two models you just trained compare? What do you think might contribute to the differences you noticed? Write 2-3 sentences. (5 pts)

**Answer: add your answer to responses.tex**

### Part 1.2: SimCLR Loss (25 pts)

Another popular contrastive learning framework is SimCLR, which learns representations by maximizing the agreement between differently augmented views of the same training example. Given $2B$ training examples, the loss for SimCLR is calculated via the following steps:


1.   L2-normalize each of the input training examples.
2.   Calculate the similarity matrix $S$, where $S_{ij} = x_i(x_j)^T \ \ \forall i,j \in \{1,...,2B\}$.
3.   Let the loss between two examples be $l(i,j) = -\log \left(\frac{\exp(S_{ij}/ \tau)}{\sum_{k=1}^{2B}\mathbf{1}_{k\neq i}\exp(S_{ik}/ \tau)}\right)$, where $\tau$ is the temperature.
     
     Calculate the loss $L = \frac{1}{2B} \sum_{k=1}^B \left[l(2k-1, 2k) + l(2k, 2k-1)\right]$, where the $(2k-1)^{th}$ and $2k^{th}$ images are two differently-augmented views of the same image.
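The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference solution: the name `nt_xent_sketch` and the default `tau=0.5` are hypothetical. The sketch stacks `queries` on top of `keys`, so row $i$ is paired with row $i + B$; this is only a reordering of the $2B$ examples and does not change the value of the loss.

```python
import torch
import torch.nn.functional as F


def nt_xent_sketch(queries, keys, tau=0.5):
    """Hypothetical sketch; tau=0.5 is an assumed default temperature."""
    b = queries.shape[0]

    # Step 1: stack both views into 2B examples and L2-normalize them.
    z = F.normalize(torch.cat([queries, keys], dim=0), dim=-1)  # (2B, d)

    # Step 2: temperature-scaled similarity matrix S / tau.
    sim = (z @ z.T) / tau                                       # (2B, 2B)

    # Step 3: drop the k == i terms from each row's denominator.
    self_mask = torch.eye(2 * b, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # Row i's positive is at column i + B (first half) or i - B (second half).
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)

    # Cross-entropy over each row equals l(i, j); the mean gives the 1/(2B) factor.
    return F.cross_entropy(sim, targets)


queries = torch.randn(8, 16)
keys = torch.randn(8, 16)
print(nt_xent_sketch(queries, keys).item())
```

Treating each row of the masked similarity matrix as logits lets `cross_entropy` compute the log-ratio in $l(i,j)$ in a numerically stable way.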


Here are some PyTorch functions that may be useful in implementing this loss:


*   https://pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html
*   https://pytorch.org/docs/stable/generated/torch.matmul.html
*   https://pytorch.org/docs/stable/generated/torch.eye.html
*   https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html





In [None]:
# TODO: Implement the nt_xent_loss function in submission.py
from submission import nt_xent_loss

Now that we have implemented the SimCLR loss, we can, again, train a ResNet with it to produce image features.

In [None]:
# Define the SimCLR model
simclr_model = ResNet(2, 64, 128, 256)
simclr_model = simclr_model.cuda()

# Define the optimizer
optimizer = optim.AdamW(simclr_model.parameters(), lr=0.001)

# Define the learning rate scheduler
scheduler = transformers.get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=NUM_EPOCHS * len(train_dataloader))

The following code will train the ResNet using the SimCLR loss you just implemented. This cell takes a while to run (~30-40 minutes on GPU).

In [None]:
# Train the model
loss_fn = nt_xent_loss
for epoch in range(NUM_EPOCHS):
    train_loss = train_epoch(simclr_model, train_dataloader, optimizer, scheduler, loss_fn)
    print(f'Epoch {epoch}, Train Loss: {train_loss:.4f}')


The following code will train another logistic regression model to perform image classification, using the SimCLR image features.

In [None]:
# Test the model by extracting features and training a linear classifier
simclr_model.eval()

features, train_img_features = extract_features(simclr_model, val_dataloader)
test_features, test_img_features = extract_features(simclr_model, test_dataloader)

features_subsample = features[subsample]
train_data_label = np.array(train_data['label'])[subsample]

classifier = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100, solver='saga', multi_class='multinomial', verbose=1))
classifier.fit(features_subsample, train_data_label)

Finally, let us compare how the classifier using SimCLR features compares to the previous classifiers.

In [None]:
predictions = classifier.predict(test_features)
simclr_accuracy = accuracy_score(test_data['label'], predictions)
print(f'Accuracy: {simclr_accuracy:.4f}')

with open(f'accuracy_epoch{NUM_EPOCHS}_1.json', 'w') as f:
    json.dump({'simclr_accuracy': simclr_accuracy,'triplet_accuracy': triplet_accuracy, 'pixel_accuracy': pixel_accuracy}, f)

## Part 2: Vision Transformers (ViTs) (45 pts)

Another popular model for image tasks is the Vision Transformer (ViT), a transformer adapted for processing images instead of sequences. In this portion of the assignment, you will be implementing a ViT model that we will later train to produce image features to be used for image classification, similar to the ResNet we used in part 1.

ViTs contain many of the same components as the Transformer you implemented for the previous project.

We provide the function `posemb_sincos_2d` to produce the positional encoding for image patches. Since images are 2-dimensional (rather than 1-dimensional like text), the function is slightly different from the one we used in the previous assignment.

In [None]:
def posemb_sincos_2d(h, w, dim, temperature: int = 10000, dtype = torch.float32):
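    '''
    h: Number of patches along the image height.
    w: Number of patches along the image width.
    dim: The dimension of the model embeddings (must be a multiple of 4).
    Returns a tensor of shape (h * w, dim).
    '''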

    y, x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    assert (dim % 4) == 0, "feature dimension must be multiple of 4 for sincos emb"

    omega = torch.arange(dim // 4) / (dim // 4 - 1)
    omega = 1.0 / (temperature ** omega)

    y = y.flatten()[:, None] * omega[None, :]
    x = x.flatten()[:, None] * omega[None, :]
    pe = torch.cat((x.sin(), x.cos(), y.sin(), y.cos()), dim=1)
    return pe.type(dtype)

### Part 2.1: ViT Implementation (30 pts)
You will be implementing a ViT as depicted in the diagram below.

<center><img src="https://www.cs.cornell.edu/courses/cs4782/2025sp/images/vit.png"/></center>


The architecture you will implement is as follows:

1.   `to_patch_embedding`, which will:
      1.  Rearrange the training images (`b` x `c` x `h` x `w`) into flattened patches (`b` x `number of patches` x `size of patches`).
      2.  Pass the flattened patches through a LayerNorm layer.
      3.  Project the LayerNorm output up to the dimension of the Transformer Encoder, `d_model`.
      4.  Pass the projected embeddings through a second LayerNorm layer.
2.   `pos_embedding`, which will add the 2D positional embeddings produced by `posemb_sincos_2d` to the outputs of the second LayerNorm layer.
3.   `encoder`: instead of implementing the transformer encoder from scratch, use [nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html). The encoder has `num_layer` layers, `d_model` features, `num_heads` attention heads, `d_ff` feedforward layer dimensionality, and a dropout probability of `p`. Note that a dropout layer (with dropout probability `p`) is applied before the encoder and a LayerNorm is applied after the encoder. There are no naming restrictions for the dropout layer, but the LayerNorm layer must be named `output_ln`.
4.   `projection_head`, which consists of an MLP with 2 layers, similar to the ResNet model. There is a SiLU activation after the first layer. Each MLP layer takes as input `d_model` features and produces `d_model` features. `projection_head` should only be used if `return_embedding` is False.


Here are some functions that might be helpful in the implementation of the ViT:

*   https://einops.rocks/api/rearrange/
*   https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html
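For the encoder step, one possible way to wire dropout, `nn.TransformerEncoder`, and the output LayerNorm together is sketched below. All hyperparameter values are placeholders chosen for illustration, and `batch_first=True` is an assumption so that inputs have shape `(batch, patches, d_model)`. How the per-patch outputs are reduced to a single image embedding is not covered by this sketch.

```python
import torch
import torch.nn as nn

# Placeholder hyperparameters (assumptions for illustration only).
d_model, num_heads, d_ff, num_layers, p = 256, 4, 512, 2, 0.1

dropout = nn.Dropout(p)                 # applied before the encoder
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=num_heads,
    dim_feedforward=d_ff,
    dropout=p,
    batch_first=True,                   # inputs are (batch, sequence, features)
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
output_ln = nn.LayerNorm(d_model)       # applied after the encoder

tokens = torch.randn(2, 64, d_model)    # stand-in for patch embeddings + positions
encoded = output_ln(encoder(dropout(tokens)))
print(encoded.shape)  # torch.Size([2, 64, 256])
```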





In [None]:
# TODO: Implement the ViT class in submission.py
from submission import ViT

## ViT Training

The following cell will train the ViT model using the SimCLR loss function implemented in Part 1. This cell takes a while to run (~30-40 minutes on GPU).

In [None]:
simclr_vit = ViT(256, 4)
simclr_vit = simclr_vit.cuda()

# Define the optimizer
optimizer = optim.AdamW(simclr_vit.parameters(), lr=0.001)

# Define the learning rate scheduler
scheduler = transformers.get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=50, num_training_steps=NUM_EPOCHS * len(train_dataloader))

# Train the model
loss_fn = nt_xent_loss
for epoch in range(NUM_EPOCHS):
    train_loss = train_epoch(simclr_vit, train_dataloader, optimizer, scheduler, loss_fn)
    print(f'Epoch {epoch}, Train Loss: {train_loss:.4f}')

Next, we can train another linear classifier using the SimCLR features produced by the ViT we just trained.

In [None]:
# Test the model by extracting features and training a linear classifier
simclr_vit.eval()

features, _ = extract_features(simclr_vit, val_dataloader)
test_features, _ = extract_features(simclr_vit, test_dataloader)

subsample = np.random.choice(features.shape[0], size=5000, replace=False)

features_subsample = features[subsample]
train_data_label = np.array(train_data['label'])[subsample]

classifier = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100, solver='saga', multi_class='multinomial', verbose=1))
classifier.fit(features_subsample, train_data_label)

Finally, let us compare how the ViT performed compared to the previous models on the test set.

In [None]:
predictions = classifier.predict(test_features)
simclr_vit_accuracy = accuracy_score(test_data['label'], predictions)
print(f'Accuracy: {simclr_vit_accuracy:.4f}')

If you would like to save the test accuracies of the four classification models for future reference, run the following cell.

In [None]:
with open(f'accuracy_epoch{NUM_EPOCHS}_2.json', 'w') as f:
    json.dump({'simclr_vit_accuracy': simclr_vit_accuracy, 'simclr_accuracy': simclr_accuracy,'triplet_accuracy': triplet_accuracy, 'pixel_accuracy': pixel_accuracy}, f)

### Part 2.2: Image Feature Visualization
Similar to how we visualized the word embeddings in Homework 2, we can also visualize the image features obtained from our ViT model using UMAP (similar to t-SNE).

The following code will produce two plots. Each point represents a test set image and is colored according to its label. The first plot visualizes the image features obtained from the ViT. The second plot visualizes the same points (images) using the image pixels instead.

In [None]:
# Visualize the features
class2name_list = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

# Reduce the dimensionality of the features
reducer = umap.UMAP()
scaled_test_features = StandardScaler().fit_transform(test_features)

reduced_features = reducer.fit_transform(scaled_test_features)

# Plot the features
plt.figure(figsize=(10, 10))
for i in range(10):
    test_label_array = np.array(test_data['label'])
    mask = test_label_array == i
    subsample = np.random.choice(np.where(mask)[0], size=100, replace=False)
    plt.scatter(reduced_features[subsample, 0], reduced_features[subsample, 1], label=class2name_list[i])
plt.title('Contrastive Feature Visualization')
plt.legend()
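# Save before showing: plt.show() clears the current figure, so saving afterwards writes a blank image.
plt.savefig(f'features_epoch{NUM_EPOCHS}.png')
plt.show()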

In [None]:
# Visualize the pixel space
# Reduce the dimensionality of the pixel space
reducer = umap.UMAP()
test_img_array = np.concatenate(test_img_features, axis=0)
reduced_pixels = reducer.fit_transform(test_img_array.reshape(-1, 32 * 32 * 3))

# Plot the pixel space
plt.figure(figsize=(10, 10))
for i in range(10):
    mask = test_label_array == i
    subsample = np.random.choice(np.where(mask)[0], size=100, replace=False)
    plt.scatter(reduced_pixels[subsample, 0], reduced_pixels[subsample, 1], label=class2name_list[i])
plt.title('Pixel-Space Feature Visualization')
plt.legend()
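# Save before showing: plt.show() clears the current figure, so saving afterwards writes a blank image.
plt.savefig(f'pixels_epoch{NUM_EPOCHS}.png')
plt.show()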

### Q2: How do the visualizations of the contrastive learning features compare to the visualization of the image pixels? How tightly clustered are the points for the different classes in each of the two visualizations? How might your observations relate to the utility of these two sets of features for image classification? Write 3-4 sentences. (5 pts)

**Answer: add your answer to responses.tex**

## Run the following to create your submission files.
`linear_model_preds.csv` tests the accuracy of your ViT implementation trained with the SimCLR loss. (10 pts)

In [None]:
simclr_vit.eval()

features, train_img_features = extract_features(simclr_vit, val_dataloader)
test_features, test_img_features = extract_features(simclr_vit, test_dataloader)

subsample = np.random.choice(features.shape[0], size=5000, replace=False)

features_subsample = features[subsample]
train_data_label = np.array(train_data['label'])[subsample]

classifier = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        max_iter=100, solver="saga", multi_class="multinomial", verbose=1
    ),
)
classifier.fit(features_subsample, train_data_label)

predictions = classifier.predict(test_features)
pred_df = pd.DataFrame(predictions, columns=["preds"])
pred_df.to_csv("linear_model_preds.csv", index=False)

print("Predictions saved to 'linear_model_preds.csv'")