Task 3 - Evaluating Contrastive CLIP Model on Out of Domain Data to Study Covariate Shift - PACS Dataset

Making all the necessary imports.

In [15]:
import os
from PIL import Image
import numpy as np
import torch
from transformers import CLIPProcessor, CLIPModel
from sklearn.metrics import accuracy_score


Load the CLIP model and processor. The model we use is "openai/clip-vit-base-patch32". The processor prepares the image and text inputs in the format that CLIP-ViT-Base expects. The model is moved to the GPU if one is available.

In [16]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")



The function below loads the PACS dataset. It walks through every class subfolder of every domain and returns two NumPy arrays: the images (as RGB pixel arrays) and their integer class labels, which we then use for zero-shot image classification. Images from all four domains are pooled into a single array; the domain of each image is not recorded.

In [17]:


def load_pacs_dataset(dataset_path):
    domains = ['art_painting', 'cartoon', 'photo', 'sketch']
    classes = ['dog', 'elephant', 'giraffe', 'guitar', 'horse', 'house', 'person']
    
    images = []
    labels = []
    
    for domain in domains:
        for class_idx, class_name in enumerate(classes):
            class_dir = os.path.join(dataset_path, domain, class_name)
            for img_name in os.listdir(class_dir):
                img_path = os.path.join(class_dir, img_name)
                img = Image.open(img_path).convert('RGB')
                img_array = np.array(img)
                images.append(img_array)
                labels.append(class_idx)
    
    return np.array(images), np.array(labels)
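
Because the loader above pools all four domains and keeps every decoded image in memory, a lighter alternative is to index the files first and keep track of the domain each one came from. The sketch below is only an illustration: `index_pacs_dataset`, `DOMAINS`, and `CLASSES` are hypothetical names that are not part of the notebook, and the `<root>/<domain>/<class>/<file>` folder layout is assumed from the loader above.

```python
import os

# Assumed PACS layout: <root>/<domain>/<class>/<image file>
DOMAINS = ['art_painting', 'cartoon', 'photo', 'sketch']
CLASSES = ['dog', 'elephant', 'giraffe', 'guitar', 'horse', 'house', 'person']


def index_pacs_dataset(dataset_path, domains=DOMAINS, classes=CLASSES):
    """Hypothetical helper: list (path, class index, domain index) triples
    without opening any image."""
    records = []
    for domain_idx, domain in enumerate(domains):
        for class_idx, class_name in enumerate(classes):
            class_dir = os.path.join(dataset_path, domain, class_name)
            if not os.path.isdir(class_dir):
                continue  # skip missing folders instead of raising
            # Sort the file names so the ordering is reproducible.
            for img_name in sorted(os.listdir(class_dir)):
                img_path = os.path.join(class_dir, img_name)
                records.append((img_path, class_idx, domain_idx))
    return records
```

Each image can then be opened with `Image.open(path).convert('RGB')` only when it is needed, and the stored domain index makes it possible to report results per domain.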


Call the function to load the dataset.

In [18]:
dataset_path = './PACS'
images, labels = load_pacs_dataset(dataset_path)
print(f"Loaded {len(images)} images and {len(labels)} labels from the PACS dataset.")

Loaded 9991 images and 9991 labels from the PACS dataset.


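Define the class names present in the PACS dataset. The order must match the list inside load_pacs_dataset, because the labels are indices into it. These names are also used directly as the text inputs for CLIP.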

In [19]:
classes = ['dog', 'elephant', 'giraffe', 'guitar', 'horse', 'house', 'person']

This is where the cosine similarities and predictions are computed. Image and text embeddings are generated and L2-normalized, and their cosine similarity is computed. Each image is assigned the class whose text embedding has the highest cosine similarity with it. This is done for the complete dataset, and the accuracy is reported at the end. This should be slightly lower than 

In [20]:
def zero_shot_classification(images, labels):
    # Encode the class names once; no gradients are needed for inference.
    text_inputs = processor(text=classes, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        text_features = model.get_text_features(**text_inputs)
    # L2-normalize so that the dot product below equals cosine similarity.
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    predictions = []
    true_labels = []

    for img, label in zip(images, labels):
        # Preprocess a single image into the tensor format CLIP expects.
        image_input = processor(images=img, return_tensors="pt").to(device)

        with torch.no_grad():
            image_features = model.get_image_features(**image_input)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)

        # Cosine similarity between this image and every class name.
        similarity = (image_features @ text_features.T).squeeze(0)

        # Predict the class with the highest similarity.
        pred_class = similarity.argmax().item()

        predictions.append(pred_class)
        true_labels.append(label)

    accuracy = accuracy_score(true_labels, predictions)
    return accuracy
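
To make the normalize-then-dot-product step concrete, here is a small NumPy sketch on made-up 3-dimensional vectors. These are toy values, not CLIP outputs, and `cosine_argmax` is a hypothetical helper that only mirrors the arithmetic of the function above.

```python
import numpy as np


def cosine_argmax(image_features, text_features):
    """Return the most similar text row for each image row, plus the similarities."""
    # L2-normalize each row so the dot product equals cosine similarity.
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    similarity = image_features @ text_features.T  # shape: (n_images, n_classes)
    return similarity.argmax(axis=-1), similarity


# Toy "embeddings" (made-up values, not produced by CLIP).
toy_text = np.array([[1.0, 0.0, 0.0],    # class 0
                     [0.0, 2.0, 0.0]])   # class 1
toy_images = np.array([[0.1, 5.0, 0.0],  # points mostly along class 1
                       [3.0, 0.5, 0.0]]) # points mostly along class 0

toy_pred, toy_sim = cosine_argmax(toy_images, toy_text)
print(toy_pred)  # [1 0]
```

Because both sets of vectors are normalized, the length of an embedding has no effect on the prediction; only its direction matters.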


Call the zero_shot_classification function to get the accuracy.

In [21]:
accuracy = zero_shot_classification(images, labels)
print(f"Zero-shot classification accuracy on PACS: {accuracy * 100:.2f}%")

Zero-shot classification accuracy on PACS: 81.34%
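
The figure above pools all four domains. Since the aim is to study covariate shift, a per-domain breakdown may be more informative. The sketch below assumes two things that the notebook does not currently provide: a `domain_ids` sequence recorded while loading, and a version of the classification function that returns its predictions. `per_domain_accuracy` is a hypothetical helper, and the numbers in the example are made up to show the mechanics; they are not PACS results.

```python
from collections import defaultdict


def per_domain_accuracy(true_labels, predictions, domain_ids, domain_names):
    """Hypothetical helper: accuracy computed separately for each domain index."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, d in zip(true_labels, predictions, domain_ids):
        total[d] += 1
        correct[d] += int(y == p)
    # Only domains that actually appear in domain_ids are reported.
    return {domain_names[d]: correct[d] / total[d] for d in sorted(total)}


# Made-up example data, not results from PACS.
example = per_domain_accuracy(
    true_labels=[0, 1, 2, 3],
    predictions=[0, 1, 0, 3],
    domain_ids=[0, 0, 1, 1],
    domain_names=['art_painting', 'cartoon', 'photo', 'sketch'],
)
print(example)  # {'art_painting': 1.0, 'cartoon': 0.5}
```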
