Evaluating the Texture Bias of CLIP-ViT-Base

Making all necessary imports

In [2]:
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import torch
from transformers import CLIPProcessor, CLIPModel
from tqdm import tqdm

Class Names in the CIFAR-10G Dataset

In [3]:
CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

Load this subset of the CIFAR-10 dataset. The class below loops through each class folder under the root directory and collects every image file it finds, labelling each image with the index of its class in CIFAR10_CLASSES. It also accepts an optional transform that is applied to each image when it is fetched.

In [4]:
class NewCIFAR10Dataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        self.image_paths = []
        self.labels = []
        
        for idx, category in enumerate(CIFAR10_CLASSES):
            category_dir = os.path.join(root_dir, category)
            
            for img_file in os.listdir(category_dir):
                if img_file.endswith(('.png', '.jpg', '.jpeg')):  
                    img_path = os.path.join(category_dir, img_file)
                    self.image_paths.append(img_path)
                    self.labels.append(idx) 

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        image = Image.open(img_path).convert('L') 

        label = self.labels[idx]

        if self.transform:
            image = self.transform(image)
        
        return image, label
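The loader above expects one sub-folder per class name under the root directory. As a minimal sketch of that layout and of the indexing logic, the following builds a throwaway directory tree and walks it the same way. The helper `index_images`, the shortened class list, and the file names are hypothetical and exist only for this illustration.

```python
import os
import tempfile

CLASSES = ['airplane', 'automobile']  # shortened class list, for illustration only

def index_images(root_dir, classes):
    """Collect image paths and labels the same way the dataset's __init__ does."""
    image_paths, labels = [], []
    for idx, category in enumerate(classes):
        category_dir = os.path.join(root_dir, category)
        # sorted() is used here only to make the listing order deterministic
        for img_file in sorted(os.listdir(category_dir)):
            # str.endswith is case-sensitive, so 'd.PNG' is skipped
            if img_file.endswith(('.png', '.jpg', '.jpeg')):
                image_paths.append(os.path.join(category_dir, img_file))
                labels.append(idx)
    return image_paths, labels

with tempfile.TemporaryDirectory() as root:
    layout = [('airplane', ['a.png', 'b.jpg', 'notes.txt']),
              ('automobile', ['c.jpeg', 'd.PNG'])]
    for category, files in layout:
        os.makedirs(os.path.join(root, category))
        for name in files:
            open(os.path.join(root, category, name), 'w').close()
    paths, labels = index_images(root, CLASSES)
    file_names = [os.path.basename(p) for p in paths]
```

Files with other extensions, or with upper-case extensions, are silently ignored, which is worth checking if the image count looks lower than expected.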


We define custom transformations that (1) turn each image into a normalized tensor so that a torchvision dataset can be fed to CLIP and (2) augment the dataset to help the model with image classification.

In [5]:
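transform = transforms.Compose([
    # Replicate the single grayscale channel to the 3 channels CLIP expects.
    transforms.Grayscale(num_output_channels=3),
    # RandomHorizontalFlip and ColorJitter are random, so results can vary slightly between runs.
    transforms.RandomHorizontalFlip(),
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),
    transforms.ToTensor(),
    # Note: these are the ImageNet statistics; CLIPProcessor uses its own mean/std by default.
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])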
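`transforms.Normalize` computes `(x - mean) / std` independently for each channel. The pure-Python sketch below reproduces only that arithmetic for a single pixel (the helper name `normalize_pixel` is made up for this illustration). It shows that a mid-gray pixel, which has identical values in all three channels after `Grayscale(num_output_channels=3)`, no longer has equal channel values once normalized.

```python
# Per-channel statistics taken from the Normalize call above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std to each channel of one pixel."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]

gray_pixel = [0.5, 0.5, 0.5]  # ToTensor scales pixel values into [0, 1]
normalized = normalize_pixel(gray_pixel)
print(normalized)
```

Because the mean and standard deviation differ per channel, the three outputs differ even though the inputs were equal.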

Below we define the path to the CIFAR-10 subset and load the dataset. We also create the DataLoader, which splits the dataset into batches of size 64.

In [6]:
root_dir = './texture_bias_dataset'  
new_dataset = NewCIFAR10Dataset(root_dir=root_dir, transform=transform)

new_dataloader = DataLoader(new_dataset, batch_size=64, shuffle=False)
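With the default `drop_last=False`, a DataLoader yields `ceil(N / batch_size)` batches, the last of which may be smaller than the rest. A small sketch of that count follows; `num_batches` is a hypothetical helper, not part of PyTorch.

```python
import math

def num_batches(num_images, batch_size=64):
    """Number of batches when the final partial batch is kept (drop_last=False)."""
    return math.ceil(num_images / batch_size)

# Any dataset size from 961 to 1024 images gives 16 batches of size 64.
for n in (960, 961, 1024, 1025):
    print(n, num_batches(n))
```

The progress bar in this notebook's final output shows 16 batches, which under this formula corresponds to somewhere between 961 and 1024 images.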

Load the CLIP model and processor. The model we use is "openai/clip-vit-base-patch32". The processor prepares the image and text inputs required by CLIP-ViT-Base. We also move the model to the GPU if one is available.

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


Function for zero-shot image classification

This is where the cosine similarities and predictions are computed. Image and text embeddings are produced and normalized, and their cosine similarity is computed. Each image is then assigned the label with the highest cosine similarity. This is done for the complete dataset, and the accuracy is reported at the end.

In [8]:
def zero_shot_classification(model, processor, dataloader, device):
    model.eval()

    # Encode the class names once and L2-normalize them outside the batch loop.
    text_inputs = processor(text=CIFAR10_CLASSES, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        text_features = model.get_text_features(**text_inputs)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    correct_predictions = 0
    total_images = 0

    for images, labels in tqdm(dataloader):
        images = images.to(device)
        labels = labels.to(device)

        with torch.no_grad():
            image_features = model.get_image_features(pixel_values=images)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)

            # Scaled cosine similarity between every image and every class name.
            similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

        # The predicted class is the one with the highest similarity.
        predicted_class = similarity.argmax(dim=1)

        correct_predictions += (predicted_class == labels).sum().item()
        total_images += labels.size(0)

    # Fraction of correctly classified images, in [0, 1].
    accuracy = correct_predictions / total_images
    return accuracy
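To make the prediction rule concrete, here is a toy, pure-Python sketch of the same steps on hand-picked 2-D vectors: normalize, take dot products, scale, apply softmax, and pick the argmax. The vectors and helper names are invented for illustration and have nothing to do with real CLIP embeddings.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_scores(image_vec, text_vecs):
    """Cosine similarity between one image vector and each text vector."""
    img = l2_normalize(image_vec)
    return [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_vecs]

def softmax(scores, scale=100.0):
    """Softmax over scaled scores; subtracting the max keeps exp() stable."""
    m = max(scores)
    exps = [math.exp(scale * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

toy_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up "class" vectors
toy_image = [0.9, 0.8]                            # one made-up "image" vector

scores = cosine_scores(toy_image, toy_text)
probs = softmax(scores)
pred_from_scores = max(range(len(scores)), key=scores.__getitem__)
pred_from_probs = max(range(len(probs)), key=probs.__getitem__)
print(pred_from_scores, pred_from_probs)
```

Softmax is monotonic, so the argmax over the probabilities equals the argmax over the raw cosine scores. The softmax step only matters if calibrated-looking probabilities are needed.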


The accuracy we obtained on CIFAR-10 was 87.06%, so that will be our total accuracy. To calculate the texture bias, we use the following formula.

$$
\text{Texture Bias} = \frac{\text{Texture Accuracy}}{\text{Total Accuracy}}
$$


The accuracy returned on the texture dataset goes in the numerator. In the following cell we call the function, and the resulting output is the texture bias.

In [9]:
accuracy = zero_shot_classification(model, processor, new_dataloader, device)

# Ratio of texture accuracy to total accuracy (a fraction, not a percentage).
print(f"Texture Bias is : {(accuracy*100/87.06):.3f}")

100%|██████████| 16/16 [00:07<00:00,  2.16it/s]

Texture Bias is : 0.420
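As a sanity check, the formula can be inverted to recover the texture accuracy implied by the reported ratio of 0.420 and the total accuracy of 87.06%. The sketch below does only that arithmetic. The function names are hypothetical, and because the reported ratio is rounded to three decimals, the recovered accuracy is approximate (roughly 36.6%).

```python
def texture_bias(texture_accuracy, total_accuracy):
    """Ratio of texture accuracy to total accuracy (both in the same units)."""
    return texture_accuracy / total_accuracy

def implied_texture_accuracy(bias, total_accuracy):
    """Invert the ratio to recover the texture accuracy."""
    return bias * total_accuracy

total_accuracy_pct = 87.06  # total accuracy on CIFAR-10, from the text above
reported_bias = 0.420       # value printed by the notebook, rounded to 3 decimals

texture_accuracy_pct = implied_texture_accuracy(reported_bias, total_accuracy_pct)
print(f"Implied texture accuracy: {texture_accuracy_pct:.1f}%")
```

Read this way, the result says the model is right on the texture dataset about 42% as often as it is on CIFAR-10.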



