Task 3 - Evaluating Contrastive CLIP Model on Out of Domain Data to Study Semantic Shift - CIFAR-100 Dataset

Making all the necessary imports

In [1]:
import torch
from transformers import CLIPModel, CLIPProcessor
from torchvision.datasets import CIFAR100
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
from torchvision import transforms

Load the CLIP model and processor. The model we use is "openai/clip-vit-base-patch32". The processor prepares the image and text inputs in the form CLIP-ViT-Base expects. The model is moved to the GPU if one is available.

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

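We define a custom transformation pipeline so that the torchvision dataset yields tensors the CLIP model can consume directly: (1) every image is resized to 224x224, the input resolution of CLIP-ViT-Base, and converted to a tensor, and (2) each channel is normalized. Note that the normalization constants used here are the ImageNet statistics, not the ones bundled with the CLIP processor; the accuracy reported at the end was obtained with these values.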

In [3]:
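transform = transforms.Compose([
    # CIFAR-100 images are 32x32; upscale to the 224x224 input size of CLIP-ViT-Base.
    transforms.Resize((224, 224)),
    # Convert the PIL image to a float tensor with values in [0, 1].
    transforms.ToTensor(),
    # Per-channel normalization with ImageNet mean and std.
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])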
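
To make the normalization step concrete, here is a minimal sketch in plain Python of what `transforms.Normalize` computes for a single pixel, `(x - mean) / std` per channel. The helper `normalize_pixel` is hypothetical and only for illustration. The CLIP constants are the values published with OpenAI's CLIP preprocessing; they are quoted as an assumption about what the processor would apply and are not used anywhere in this notebook.

```python
# Per-channel normalization as applied by transforms.Normalize: (x - mean) / std.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Assumption: statistics published with OpenAI's CLIP preprocessing (not used in this notebook).
CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
CLIP_STD = [0.26862954, 0.26130258, 0.27577711]


def normalize_pixel(rgb, mean, std):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, mean, std)]


mid_gray = [0.5, 0.5, 0.5]
imagenet_out = normalize_pixel(mid_gray, IMAGENET_MEAN, IMAGENET_STD)
clip_out = normalize_pixel(mid_gray, CLIP_MEAN, CLIP_STD)

# The same pixel maps to slightly different values under the two sets of statistics.
print([round(v, 3) for v in imagenet_out])
print([round(v, 3) for v in clip_out])
```

The two outputs differ only slightly, but the model then sees inputs on a somewhat different scale than during pre-training, so the choice of constants may affect zero-shot accuracy.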

Below we define the CIFAR-100 dataset and dataloader. Zero-shot image classification involves no training, so there is no need for a train/test split; we use only the test set of CIFAR-100. The dataloader serves the dataset in batches of 128 images.

In [4]:
cifar100 = CIFAR100(root="./data", download=True, transform=transform, train=False)
dataloader = DataLoader(cifar100, batch_size=128, shuffle=False)

Files already downloaded and verified
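
As a quick sanity check on the batching, the number of batches follows from the size of the CIFAR-100 test split (10,000 images) and the batch size. This is a small arithmetic sketch, not part of the evaluation itself; the variable names are illustrative.

```python
import math

num_test_images = 10_000  # size of the CIFAR-100 test split
batch_size = 128          # same value passed to DataLoader above

# The last batch is smaller because 10,000 is not a multiple of 128.
num_batches = math.ceil(num_test_images / batch_size)
last_batch_size = num_test_images - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)  # 79 16
```

The 79 batches match the progress bar shown for the evaluation loop.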


Here we get the class names of the CIFAR-100 dataset and use them to build the text inputs. Zero-shot classification relies on the cosine similarity between image and text embeddings, so each class is wrapped in a short prompt. Richer, more descriptive prompts generally give the model more context about what to look for in an image.

In [5]:
class_names = cifar100.classes 
text_inputs = processor(text=[f"a photo of a {c}" for c in class_names], return_tensors="pt", padding=True).to(device)
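
The prompt construction can be sketched in isolation. CIFAR-100 class names in torchvision contain underscores (for example `aquarium_fish`), and the cell above passes them through unchanged. The helper below, `build_prompts`, is a hypothetical variation that replaces underscores with spaces; it is not what produced the accuracy reported in this notebook, and swapping it in may change the result.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn raw class names into natural-language prompts.

    Underscores are replaced by spaces, e.g. "pickup_truck" -> "pickup truck".
    """
    return [template.format(name.replace("_", " ")) for name in class_names]


# A few CIFAR-100 class names, used here only as sample input.
sample_classes = ["apple", "aquarium_fish", "pickup_truck"]
prompts = build_prompts(sample_classes)
print(prompts)
```

The `template` argument makes it easy to try other phrasings without touching the rest of the pipeline.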

This is where the cosine similarities and predictions are computed. Image and text embeddings are produced and L2-normalized, so their dot product is the cosine similarity. For each image, the predicted label is the class whose text embedding has the highest cosine similarity with the image embedding. This is done for the complete dataset, and the accuracy is reported at the end.

In [6]:
correct = 0
total = 0

model.eval()

with torch.no_grad():
    # The text prompts are identical for every batch, so embed them once.
    text_features = model.get_text_features(**text_inputs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    for images, labels in tqdm(dataloader, desc="Evaluating"):
        images = images.to(device)
        labels = labels.to(device)

        # Embed the images and L2-normalize so the dot product is the cosine similarity.
        image_features = model.get_image_features(pixel_values=images)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)

        # Scaled cosine similarities, turned into per-class probabilities.
        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

        # Predict the class with the highest probability for each image.
        predicted_labels = similarity.argmax(dim=-1)

        correct += (predicted_labels == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total
print(f"Zero-shot classification accuracy on CIFAR-100: {accuracy * 100:.2f}%")

Evaluating: 100%|██████████| 79/79 [00:43<00:00,  1.82it/s]

Zero-shot classification accuracy on CIFAR-100: 59.40%
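
The scoring step used in the evaluation loop can be illustrated on toy vectors. The following is a minimal sketch in plain Python rather than PyTorch, so it runs without a model; the vectors and the helper names (`l2_normalize`, `softmax`, `classify`) are made up for illustration. It also shows that the softmax does not change which class wins, since it preserves the ordering of the scores.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_vec, text_vecs, scale=100.0):
    """Return cosine similarities, softmax probabilities, and the predicted index."""
    img = l2_normalize(image_vec)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_vecs]
    probs = softmax([scale * s for s in sims])
    pred = max(range(len(probs)), key=probs.__getitem__)
    return sims, probs, pred


# Toy "text embeddings" for three classes and one toy "image embedding".
toy_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
toy_image = [0.5, 0.9, 0.1]

sims, probs, pred = classify(toy_image, toy_text)
print(pred)  # index of the class whose text vector is closest in angle
```

Because the argmax of the probabilities equals the argmax of the raw similarities, the softmax matters only when the probabilities themselves are needed, for example to inspect how confident a prediction is.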



