<a href="https://colab.research.google.com/github/dlvh/biods271/blob/main/BIODS271_Homework_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Vision-Language Foundation Models
Vision-language models (VLMs) jointly learn relationships between images and text. In this assignment, we will explore how VLMs can be used to perform a variety of reasoning tasks on medical images.

❗Make sure to click File > Save a copy in Drive before you get started on this assignment. If you edit this notebook directly, your changes will not be saved.

❗Before you get started, click Runtime > Change runtime type and select "T4 GPU". Then, click "Connect" in the upper-right corner of this notebook.

### Install Python Packages

In [None]:
# Run this cell to install necessary Python packages
!pip install open_clip_torch

### Load Python Packages

In [None]:
# Python packages that may be useful
import open_clip
import torch
import torchvision
import matplotlib.pyplot as plt
import numpy as np
from torch.utils.data import DataLoader

## Part 1 [Coding Questions]: Exploring General-Domain Vision-Language Models

### Load OpenCLIP VLM (2 points)

The [OpenCLIP](https://github.com/mlfoundations/open_clip) codebase provides access to a range of pretrained VLMs. Load the pretrained OpenCLIP ViT-B/16 model (id = laion2B-s34B-b88K; calling `open_clip.list_pretrained()` lists the exact model and tag strings that the library accepts). Then report the number of model parameters, the context length (i.e., the maximum number of input tokens accepted by the text encoder), and the vocabulary size of the text encoder. This VLM uses a Vision Transformer backbone for the image encoder and was pretrained on the 2 billion image-text pairs in the LAION-2B dataset.

**Expected Outputs:**
- Number of trained parameters in an OpenCLIP ViT-B/16 model (Hint: this should be in the hundreds of millions!)
- Context length for the text encoder
- Vocabulary size for the text encoder
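
As a starting point for the parameter count, the snippet below is a minimal sketch that counts parameters on a small stand-in `torch.nn` module rather than on the actual OpenCLIP model. The helper name `count_parameters` and the toy module are assumptions made for illustration; the same helper should work on any `torch.nn.Module`.

```python
import torch
from torch import nn


def count_parameters(model: nn.Module, trainable_only: bool = False) -> int:
    """Sum the number of elements over all parameter tensors of a module."""
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )


# Hypothetical stand-in for a real model: an embedding table plus a linear layer.
toy_model = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 4))

# 100 * 16 embedding weights + 16 * 4 linear weights + 4 biases
num_params = count_parameters(toy_model)
print(f"{num_params:,} parameters")
```

Passing the loaded VLM in place of `toy_model` gives the count requested above.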

In [None]:
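# create_model_and_transforms returns (model, train_transform, eval_transform);
# fill in the model name and pretrained tag in place of the ? placeholders.
model, _, preprocess = open_clip.create_model_and_transforms(?, pretrained=?)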

### Exploring CIFAR-100 (2 points)
For the first part of this assignment, we will be working with the [CIFAR-100](https://paperswithcode.com/dataset/cifar-100) dataset, which includes objects from 100 classes.

We'll begin by performing an exploratory analysis. Visualize any ten images from CIFAR-100.

**Expected Output:**
- Visualization of ten CIFAR-100 images
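
One possible way to lay out the images is a small matplotlib grid. The sketch below uses randomly generated arrays in place of real CIFAR-100 images so that it runs on its own; the helper name `show_image_grid` and the toy data are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np


def show_image_grid(images, titles, ncols=5):
    """Draw images in a grid with one title per image and return the figure."""
    nrows = (len(images) + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(2 * ncols, 2 * nrows))
    axes = np.atleast_1d(axes).ravel()
    for ax in axes:
        ax.axis("off")  # hide ticks, including on unused grid cells
    for ax, image, title in zip(axes, images, titles):
        ax.imshow(image)
        ax.set_title(title, fontsize=8)
    fig.tight_layout()
    return fig


# Toy stand-ins for ten 32x32 RGB images and their class names.
rng = np.random.default_rng(0)
toy_images = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(10)]
toy_titles = [f"class {i}" for i in range(10)]
fig = show_image_grid(toy_images, toy_titles)
```

With the dataset loaded as in the cell below (no transform), each `cifar_100[i]` is an `(image, label)` pair, so the images and `class_names[label]` strings can be passed to the helper in place of the toy data.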

In [None]:
# Load Dataset
cifar_100 = torchvision.datasets.CIFAR100(root='.', train=False, download=True)
class_names = cifar_100.classes

In [None]:
# Your code here

### Perform Zero-Shot Classification (10 points)

Let's evaluate the OpenCLIP ViT-B/16 model by performing zero-shot classification on CIFAR-100.

Use the OpenCLIP ViT-B/16 model to encode each image and each label in CIFAR-100. For this task, please directly encode the provided CIFAR-100 class names as text and **do not** perform any prompt tuning or modify the label names.

Hint: Is your code too slow? Make sure you're using the GPU!

**Expected Output:**
- Zero-shot classification accuracy of the OpenCLIP ViT-B/16 model on CIFAR-100. Hint: Classification accuracy should be > 0.6.
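
The core of zero-shot classification is a nearest-neighbor lookup in the shared embedding space: each image is assigned the class whose text embedding has the highest cosine similarity. The sketch below illustrates that step on hand-written toy features instead of real model outputs; the function name `zero_shot_predict` and all tensors are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def zero_shot_predict(image_features, text_features):
    """Return, for each image, the index of the most similar text embedding."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    similarity = image_features @ text_features.T  # (num_images, num_classes)
    return similarity.argmax(dim=-1)


# Toy features: three classes whose text embeddings are the coordinate axes.
toy_text_features = torch.eye(3)
toy_image_features = torch.tensor(
    [[0.9, 0.1, 0.0], [0.0, 2.0, 0.5], [0.2, 0.1, 0.7], [1.0, 0.0, 0.2]]
)
toy_labels = torch.tensor([0, 1, 2, 1])

preds = zero_shot_predict(toy_image_features, toy_text_features)
accuracy = (preds == toy_labels).float().mean().item()
print(preds.tolist(), accuracy)  # [0, 1, 2, 0] 0.75
```

In the actual task, the two feature matrices would come from the model's image and text encoders.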

In [None]:
# Set up dataloader for CIFAR100
cifar_100 = torchvision.datasets.CIFAR100(root='.', train=False, download=True, transform=?)  # Don't forget to fill in the transform!
data_loader = DataLoader(cifar_100, batch_size=?, shuffle=False, drop_last=False)

# Your code here

### Prompt Engineering for VLMs (6 points)

VLMs are sensitive to the input prompts. In the previous step, we directly encoded the labels as text (i.e. "apple" or "cloud"). However, using more descriptive prompts that encode labels as phrases or sentences (e.g. "the apple") can help improve model performance. Can you improve zero-shot classification performance by customizing prompts?

Hint: Is it possible to ensemble a set of multiple prompts for a given class label? See the lecture slides from 04/23 for more information on prompt ensembles.

**Expected Output:**
- Zero-shot classification accuracy of the OpenCLIP ViT-B/16 model on CIFAR-100 using your custom prompts. Make sure that your prompts contribute to performance improvements when compared to the results from the previous cell.

Note that this is an open-ended question, and there are many possible solutions.
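
One common form of prompt ensembling averages the normalized text embeddings of several templates for each class and then renormalizes the result. The sketch below shows only that averaging step on toy tensors; the templates, class names, and the helper `ensemble_text_features` are illustrative assumptions, not a prescribed solution.

```python
import torch
import torch.nn.functional as F


def ensemble_text_features(per_prompt_features):
    """Average prompt embeddings per class.

    Input shape:  (num_classes, num_templates, dim)
    Output shape: (num_classes, dim), each row with unit norm.
    """
    per_prompt = F.normalize(per_prompt_features, dim=-1)
    class_features = per_prompt.mean(dim=1)
    return F.normalize(class_features, dim=-1)


# Hypothetical templates; each class name is substituted for the braces.
toy_templates = ["a photo of the {}.", "a low-resolution photo of the {}."]
toy_class_names = ["apple", "cloud"]
prompts = [[t.format(name) for t in toy_templates] for name in toy_class_names]

# Toy embeddings standing in for the encoded prompts (2 classes x 2 templates).
toy_features = torch.tensor(
    [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [4.0, 0.0]]]
)
ensembled = ensemble_text_features(toy_features)
print(ensembled)
```

The ensembled matrix can then be used in place of the single-prompt text features when computing similarities.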



In [None]:
# Your code here

## Part 2 [Coding Questions]: Exploring Medical Vision-Language Models

Let's now try a medical image dataset. We will use PatchCamelyon (PCam), a dataset consisting of color images extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue (label of 1 = lymph node with metastasis; label of 0 = normal lymph node).

In [None]:
# Run this cell to download PCam

!gdown 1qV65ZqZvWzuIVthK8eVDhIwrbnsJdbg_
!gdown 17BHrSrwWKjYsOgTMmoqrIjDy6Fa2o_gP

# NOTE: The previous two lines of code may occasionally throw an error if too
# many people have attempted to download the file within a short time-frame. If
# you see this error, download the files manually from Google Drive and upload
# to your Colab disk. Then, run the rest of this cell to format the files.
# (File 1) https://drive.google.com/uc?id=1qV65ZqZvWzuIVthK8eVDhIwrbnsJdbg_
# (File 2) https://drive.google.com/uc?id=17BHrSrwWKjYsOgTMmoqrIjDy6Fa2o_gP

!gunzip camelyonpatch_level_2_split_test_x.h5.gz
!gunzip camelyonpatch_level_2_split_test_y.h5.gz

!mkdir -p pcam
!mv camelyonpatch_level_2_split_test_x.h5 pcam
!mv camelyonpatch_level_2_split_test_y.h5 pcam

### Exploring PCam (2 points)
We'll begin by performing an exploratory analysis. Visualize any ten images from PCam and describe any visual differences that you observe between the two classes.

**Expected Output:**
- Visualization of ten PCam images
- Description of visual differences between images from the two classes.
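
To compare the two classes side by side, it can help to pick the same number of examples from each. The sketch below collects the first few indices per class from a sequence of labels; the helper name `first_indices_per_class` and the toy labels are assumptions for illustration.

```python
def first_indices_per_class(labels, n_per_class, classes=(0, 1)):
    """Return {class: [indices]} holding the first n_per_class indices of each class."""
    picked = {c: [] for c in classes}
    for idx, label in enumerate(labels):
        bucket = picked.get(int(label))
        if bucket is not None and len(bucket) < n_per_class:
            bucket.append(idx)
        # Stop scanning as soon as every class has enough examples.
        if all(len(b) == n_per_class for b in picked.values()):
            break
    return picked


toy_labels = [0, 1, 1, 0, 0, 1, 0]
picked = first_indices_per_class(toy_labels, n_per_class=2)
print(picked)  # {0: [0, 3], 1: [1, 2]}
```

For PCam, a generator such as `(pcam[i][1] for i in range(len(pcam)))` could supply the labels, assuming each dataset item is an `(image, label)` pair; five indices per class would give the ten images requested.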

In [None]:
# Load dataset
pcam = torchvision.datasets.PCAM(root='.', split='test', download=False)

In [None]:
# Your code here

**Description of Visual Differences Between Class=0 and Class=1**: [Your Answer Here]

### Perform Zero-Shot Classification with General-Domain VLM (4 points)

Let's evaluate the OpenCLIP ViT-B/16 model by performing zero-shot binary classification on PCam.

Use the OpenCLIP ViT-B/16 model to encode each image and each label in PCam. Feel free to experiment with prompts of your choice.

**Expected Output:**
- Zero-shot classification accuracy of the OpenCLIP ViT-B/16 model on PCam
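
Because the PCam test split is large, accuracy is usually accumulated batch by batch with gradients disabled and tensors moved to the GPU when one is available. The sketch below shows such a loop with an identity function standing in for the image encoder and toy tensors standing in for images; `evaluate_zero_shot` and all toy data are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"


@torch.no_grad()
def evaluate_zero_shot(encode_image, text_features, loader, device):
    """Accumulate zero-shot accuracy over all batches of a DataLoader."""
    text_features = F.normalize(text_features.to(device), dim=-1)
    correct, total = 0, 0
    for images, labels in loader:
        features = F.normalize(encode_image(images.to(device)), dim=-1)
        preds = (features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total


# Toy data: 2-d "images" that are already features, and two class embeddings.
toy_images = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.9, 0.2], [0.1, 0.8], [0.7, 0.6]])
toy_labels = torch.tensor([0, 1, 0, 1, 1])
toy_loader = DataLoader(TensorDataset(toy_images, toy_labels), batch_size=2, shuffle=False)
toy_text_features = torch.eye(2)

# The identity function stands in for a real image encoder.
accuracy = evaluate_zero_shot(lambda x: x, toy_text_features, toy_loader, device)
print(accuracy)  # 0.8
```

With a real model, the stand-in encoder would be replaced by the model's image encoder after moving the model to `device`.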

In [None]:
# Set up dataloader for PCam
pcam = torchvision.datasets.PCAM(root='.', split='test', download=False, transform=?)  # Don't forget to fill in the transform!
data_loader = DataLoader(pcam, batch_size=?, shuffle=False, drop_last=False)

# Your code here

### Perform Zero-Shot Classification with a Biomedical VLM (6 points)

Next, let's explore [BiomedCLIP](https://arxiv.org/abs/2303.00915), a medical VLM pretrained on a large collection of images and text from PubMed. Again, feel free to experiment with prompts of your choice.

**Expected Output:**

- Zero-shot classification accuracy of BiomedCLIP on PCam

In [None]:
from open_clip import create_model_from_pretrained, get_tokenizer

# Load BiomedCLIP Weights
model, preprocess = create_model_from_pretrained('hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224')
tokenizer = get_tokenizer('hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224')
model.eval()

# Your code here

## Part 3 [Open-Ended Exploration]: Evaluating Foundation Models

### Evaluations (10 points)
Now that we have explored applications of the OpenCLIP ViT-B/16 and BiomedCLIP VLMs on PCam, let's analyze the learned embedding spaces of various foundation models (FMs). For this task, you will (1) generate image embeddings for each sample in the PCam test set, (2) cluster the image-level embeddings, and (3) generate cluster visualizations. Details are provided below:
1. *Generate image embeddings*: You will generate embeddings for each image in the PCam test set.
2. *Cluster image embeddings*: You will use a clustering algorithm of your choice (such as K-means) in order to cluster the embeddings. Hint: If this stage is too time-consuming, it might help to reduce the dimensionality of image embeddings via algorithms like PCA or UMAP.
3. *Visualize clusters*: Generate plots that visualize the clusters. There are many possible options for visualizing clusters; feel free to choose any reasonable approach. Use the ground-truth labels associated with each sample to color each point in your plot.
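
The three steps above can be sketched end to end on synthetic data. The example below is one possible pipeline, assuming scikit-learn is available: it reduces toy embeddings with PCA, clusters them with K-means, and colors a scatter plot by the ground-truth labels. The synthetic blobs, the choice of two clusters, and the `agreement` score are illustrative assumptions rather than required choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Step 1 (stand-in): two well-separated Gaussian blobs instead of real embeddings.
rng = np.random.default_rng(0)
toy_embeddings = np.concatenate(
    [rng.normal(0.0, 1.0, size=(100, 64)), rng.normal(4.0, 1.0, size=(100, 64))]
)
toy_labels = np.repeat([0, 1], 100)

# Step 2: reduce dimensionality, then cluster the reduced embeddings.
reduced = PCA(n_components=2, random_state=0).fit_transform(toy_embeddings)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# Cluster ids are arbitrary, so score agreement under both possible labelings.
agreement = max((clusters == toy_labels).mean(), (clusters != toy_labels).mean())

# Step 3: plot the 2-d projection, colored by ground-truth label.
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(reduced[:, 0], reduced[:, 1], c=toy_labels, cmap="coolwarm", s=8)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title(f"Toy embeddings (cluster/label agreement = {agreement:.2f})")
```

Repeating the same pipeline for each model's embeddings makes the resulting plots directly comparable.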

For full credit on this section, you must evaluate **at least 3** models. At least one of these models must be distinct from the OpenCLIP ViT-B/16 and BiomedCLIP VLMs explored earlier in this notebook. Some examples of models you may evaluate include PLIP (discussed in the 04/07 lecture), other OpenCLIP variants, and CONCH. You may also experiment with vision-only FMs, such as UNI.

In [None]:
# Your code here

### Written Analysis (8 points)

Provide details on the open-ended experiments you conducted. In particular, justify (1) the three (or more) models that you chose to evaluate, (2) the clustering algorithm and hyperparameters that you selected, and (3) the visualization algorithm that you used. Then, summarize the key findings from your visualizations. Which models appear to generate the best image embeddings for this dataset? How can you tell?

[Your answer here]

## Submission Guidelines

When you have completed this assignment, please export to PDF by selecting File > Print > Save as PDF. **Please confirm that the outputs of all code cells are visible in the PDF.** Then, upload your PDF on Canvas.