Locality Bias of CLIP-ViT-Base: Localized Noise Injection

In this part we will see how injecting noise into a random localized region of an image affects the accuracy of our model.
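The noise-injected dataset itself is created outside this notebook, so the exact procedure is not shown here. As a rough illustration only, localized noise injection could look like the sketch below. The helper name `inject_local_noise`, the Gaussian noise model, the 8x8 patch, and `sigma=0.5` are all assumptions made for this example, not the settings used to build `./CIFAR_noise`.

```python
import numpy as np

def inject_local_noise(image, patch_size=8, sigma=0.5, rng=None):
    """Add Gaussian noise to one randomly placed square patch (hypothetical helper).

    image: float array of shape (H, W, C) with values in [0, 1].
    Returns the noisy copy and the (top, left) corner of the patch.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape

    # Pick a top-left corner so that the patch fits fully inside the image
    top = int(rng.integers(0, h - patch_size + 1))
    left = int(rng.integers(0, w - patch_size + 1))

    noisy = image.copy()
    noise = rng.normal(0.0, sigma, size=(patch_size, patch_size, c))
    patch = noisy[top:top + patch_size, left:left + patch_size]

    # Only the selected patch is perturbed; values are kept in [0, 1]
    noisy[top:top + patch_size, left:left + patch_size] = np.clip(patch + noise, 0.0, 1.0)
    return noisy, (top, left)

# Example on a CIFAR-10-sized dummy image (uniform gray)
image = np.full((32, 32, 3), 0.5)
noisy, (top, left) = inject_local_noise(image, rng=np.random.default_rng(0))
changed = np.any(noisy != image, axis=-1)  # pixels that differ from the original
```

Everything outside the chosen patch is left untouched, which is what makes the perturbation local rather than global.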

Make all the necessary imports.

In [1]:
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm
from transformers import CLIPModel, CLIPProcessor

Load the CLIP model and processor. The model we will be using is "openai/clip-vit-base-patch32". We also load the matching processor, which prepares the image and text inputs that CLIP-ViT-Base requires, and move the model to the GPU if one is available.

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

We need to define a custom transform so that the dataset matches what CLIP expects: images are resized to 224x224, converted to tensors, and normalized with a mean and standard deviation of 0.5 per channel.

In [3]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),  
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])
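A note on the normalization constants: the transform above uses a mean and standard deviation of 0.5, which maps pixel values from [0, 1] to [-1, 1]. The OpenAI CLIP checkpoints were published with their own per-channel statistics, and the sketch below compares the two on a single white pixel. It assumes the CLIP constants listed are the ones the processor would apply by default for this checkpoint; the effect of the difference on the accuracy reported in this notebook was not measured.

```python
import numpy as np

# Statistics used by the transform in this notebook
NOTEBOOK_MEAN = np.array([0.5, 0.5, 0.5])
NOTEBOOK_STD = np.array([0.5, 0.5, 0.5])

# Statistics published with the OpenAI CLIP checkpoints (assumption: these are
# what CLIPProcessor applies by default for openai/clip-vit-base-patch32)
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def normalize(pixel, mean, std):
    """Channel-wise (x - mean) / std, the same formula transforms.Normalize uses."""
    return (pixel - mean) / std

white = np.ones(3)  # a pure white RGB pixel after ToTensor()
notebook_white = normalize(white, NOTEBOOK_MEAN, NOTEBOOK_STD)
clip_white = normalize(white, CLIP_MEAN, CLIP_STD)

print("0.5 / 0.5 statistics:", notebook_white)
print("CLIP statistics:     ", clip_white)
```

In recent transformers releases the processor's values should be readable from `processor.image_processor.image_mean` and `processor.image_processor.image_std`, which would avoid hard-coding them; treat the attribute names as something to verify against the installed version.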

Load the noise-injected images that we created earlier and batch them in groups of 128 with a DataLoader.

In [4]:
root_dir = './CIFAR_noise'

dataset = datasets.ImageFolder(root=root_dir, transform=transform)
data_loader = DataLoader(dataset, batch_size=128, shuffle=False)
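As a quick sanity check on the batching, the number of batches follows directly from the dataset size and the batch size. The sketch below assumes the folder holds 10,000 images (the size of the CIFAR-10 test split); that count is not stated in this notebook, but it is consistent with the 79 batches shown in the evaluation progress bar.

```python
import math

num_images = 10_000  # assumption: size of the CIFAR-10 test split
batch_size = 128     # value passed to DataLoader above

# The last batch is smaller when the dataset size is not a multiple of 128
num_batches = math.ceil(num_images / batch_size)
last_batch_size = num_images - (num_batches - 1) * batch_size

print(f"{num_batches} batches, last batch has {last_batch_size} images")
```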

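Define the class names of our noise-injected CIFAR-10 dataset and build one text prompt per class for zero-shot image classification. The order of the list has to match the label indices that ImageFolder assigns, which follow the sorted folder names.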

In [5]:
class_names = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]

text_inputs = processor(text=[f"a photo of a {c}" for c in class_names], return_tensors="pt", padding=True).to(device)
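Because ImageFolder derives label indices from the sorted sub-folder names, the position of each prompt must line up with that ordering. The check below is a minimal sketch that assumes the sub-folders of `./CIFAR_noise` are named exactly like the entries of `class_names`; if the folders use other names, the mapping would need to be read from `dataset.class_to_idx` instead.

```python
class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

# One prompt per class, in the same order as class_names
prompts = [f"a photo of a {name}" for name in class_names]

# Mimic ImageFolder: indices are assigned to the sorted folder names
# (assumption: folder names equal the class names above)
class_to_idx = {name: idx for idx, name in enumerate(sorted(class_names))}

# True only if prompt i describes the class whose label index is i
order_matches = all(class_to_idx[name] == idx for idx, name in enumerate(class_names))
print("Prompt order matches label indices:", order_matches)
```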

Now perform zero-shot image classification and measure the accuracy on this new dataset. Recall that CLIP-ViT-Base previously reached 87.06% accuracy on the unmodified CIFAR-10 dataset.

In [6]:
correct = 0
total = 0

model.eval()

with torch.no_grad():
    # The text prompts are the same for every batch, so encode them once
    text_features = model.get_text_features(**text_inputs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    for images, labels in tqdm(data_loader, desc="Evaluating"):
        images = images.to(device)
        labels = labels.to(device)

        # Encode the batch and L2-normalize so the dot product is a cosine similarity
        image_features = model.get_image_features(pixel_values=images)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)

        # Scaled cosine similarity between each image and each class prompt
        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

        # The most similar prompt gives the predicted class
        predicted_labels = similarity.argmax(dim=-1)

        correct += (predicted_labels == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total
print(f"Zero-shot classification accuracy on noise CIFAR-10: {accuracy * 100:.2f}%")

Evaluating: 100%|██████████| 79/79 [00:44<00:00,  1.79it/s]

Zero-shot classification accuracy on noise CIFAR-10: 73.51%
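The scoring step used in the evaluation loop can be isolated on toy data to make the mechanics explicit: normalize both sets of features, take scaled dot products, apply a softmax over classes, and pick the largest entry. The sketch below uses NumPy and made-up 2-dimensional features instead of real CLIP embeddings; the function name `zero_shot_predict` is hypothetical.

```python
import numpy as np

def zero_shot_predict(image_features, text_features, scale=100.0):
    """Return predicted class indices and class probabilities (illustrative helper)."""
    # L2-normalize so that dot products become cosine similarities
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)

    # One row per image, one column per class prompt
    logits = scale * image_features @ text_features.T

    # Numerically stable softmax over the class dimension
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs

# Toy features: two "class prompts" and two "images", each close to one prompt
text_features = np.array([[1.0, 0.0], [0.0, 1.0]])
image_features = np.array([[2.0, 0.1], [0.3, 3.0]])

predictions, probs = zero_shot_predict(image_features, text_features)
print(predictions)
```

The softmax does not change which class wins, so the argmax could be taken on the raw similarities as well; the probabilities are only needed if confidence scores are of interest.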





After introducing noise, the accuracy falls to 73.51%, down 13.55 percentage points from the 87.06% baseline, which is a significant decrease.
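The size of the drop can be expressed both in absolute and in relative terms. The short calculation below uses only the two accuracy figures reported in this notebook; the relative drop works out to about 15.6% of the baseline accuracy.

```python
baseline_acc = 87.06  # accuracy on the unmodified CIFAR-10 dataset, in percent
noisy_acc = 73.51     # accuracy on the noise-injected dataset, in percent

# Absolute drop in percentage points
absolute_drop = baseline_acc - noisy_acc

# Relative drop as a share of the baseline accuracy
relative_drop = 100.0 * absolute_drop / baseline_acc

print(f"Absolute drop: {absolute_drop:.2f} percentage points")
print(f"Relative drop: {relative_drop:.1f}% of the baseline accuracy")
```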