# 👩‍💻 Day 8: CLIP in Action — Fine-Tuning and Probing Multimodal Capabilities

Welcome back to Day 8 of our VLM journey! 🎯 Yesterday, we dissected the dual-encoder architecture and contrastive learning mechanism that make OpenAI’s CLIP model so effective.

Today, we’ll bring that theory to life by fine-tuning CLIP on a small image-text dataset and exploring its performance with zero-shot probing and classification tasks.

Let’s dive in 🚀

### 🛠️ What We'll Build

We'll set up a Kaggle-friendly experiment that does the following:

* Loads a pretrained CLIP model.
* Prepares a small image-text dataset (we’ll use Flickr8k or a custom dummy dataset if needed).
* Encodes both image and text using CLIP encoders.
* Trains a linear probing head (optional) or fine-tunes CLIP.
* Applies early stopping to avoid overfitting.
* Evaluates using cosine similarity.
* Visualizes predictions and logs training results.



### ✅ Step 0: Setup (Kaggle Environment & Libraries)

Kaggle notebooks already have many libraries preinstalled, but you may need to install Hugging Face `transformers`, `torchvision`, and `ftfy` manually.

In [1]:
!pip install -q transformers torchvision ftfy

In [4]:
# Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
from transformers import CLIPProcessor, CLIPModel



### 📦 Step 1: Load Pretrained CLIP Model from HuggingFace

We’ll use OpenAI’s CLIP ViT-B/32 variant, which pairs a base-size Vision Transformer (32×32-pixel patches) with a Transformer text encoder. It is widely used and offers a good balance between accuracy and speed.

In [5]:
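# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIPModel and CLIPProcessor were imported in the cell above.
# The processor bundles the tokenizer and the image preprocessing for this checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")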


### 🖼️ Step 2: Prepare Your Dataset

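Attach the Flickr8k dataset to your Kaggle notebook. The cells below assume it is mounted at `/kaggle/input/flickr8k`, with a `captions.txt` file and an `Images` folder; update the paths if your copy is laid out differently.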

In [6]:
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os
import torchvision.transforms as transforms

class ImageTextDataset(Dataset):
    """Pairs each image file with one caption string."""

    def __init__(self, image_paths, captions, processor, transform=None):
        self.image_paths = image_paths
        self.captions = captions
        self.processor = processor  # stored for convenience; not applied in __getitem__
        self.transform = transform  # optional torchvision transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Force 3-channel RGB so grayscale or RGBA files don't break batching
        image = Image.open(self.image_paths[idx]).convert("RGB")
        if self.transform:
            image = self.transform(image)

        caption = self.captions[idx]
        return image, caption

In [9]:
from collections import defaultdict

caption_dict = defaultdict(list)

with open("/kaggle/input/flickr8k/captions.txt", "r") as f:  # Update the path
    next(f)  # skip header
    for line in f:
        parts = line.strip().split(',', 1)  # Split only at the first comma
        if len(parts) != 2:
            continue  # Skip bad lines
        filename, caption = parts
        caption_dict[filename].append(caption)
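
To see why the split is limited to the first comma, here is a small self-contained check. The sample rows are made up for illustration (they are not taken from the dataset) and `parse_captions` is a hypothetical helper that mirrors the loop above:

```python
from collections import defaultdict
import io

# Hypothetical sample in the same "image,caption" layout as captions.txt
sample = io.StringIO(
    "image,caption\n"
    "a.jpg,A dog runs , jumps , and barks .\n"
    "a.jpg,A brown dog outside .\n"
    "broken line without a comma\n"
    "b.jpg,Two kids play .\n"
)

def parse_captions(handle):
    """Group captions by filename, splitting each line at the first comma only."""
    grouped = defaultdict(list)
    next(handle)  # skip the header row
    for line in handle:
        parts = line.strip().split(",", 1)
        if len(parts) != 2:
            continue  # no comma at all, so this is not an image,caption row
        filename, caption = parts
        grouped[filename].append(caption)
    return grouped

parsed = parse_captions(sample)
print(parsed["a.jpg"][0])  # commas inside the caption are preserved
print(sorted(parsed))      # distinct image names that were parsed
```

Splitting on every comma would instead cut the first caption into several fragments, which is why `split(',', 1)` is used.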


In [11]:
import os

image_folder = "/kaggle/input/flickr8k/Images"  # Path to your image folder
image_paths, captions = [], []

for img_name, caps in caption_dict.items():
    full_path = os.path.join(image_folder, img_name)
    if os.path.exists(full_path):
        image_paths.append(full_path)
        captions.append(caps[0])  # use the first caption only


In [12]:
print(f"Total Image-Caption Pairs: {len(image_paths)}")
print("Sample Image Path:", image_paths[0])
print("Sample Caption:", captions[0])

Total Image-Caption Pairs: 8091
Sample Image Path: /kaggle/input/flickr8k/Images/1000268201_693b08cb0e.jpg
Sample Caption: A child in a pink dress is climbing up a set of stairs in an entry way .
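
Early stopping, listed in the plan above, needs data the model does not train on. A minimal sketch of one way to hold out a validation subset follows; the helper name `split_pairs`, the 10% fraction, and the seed are assumptions for illustration, and the toy lists stand in for the real `image_paths` and `captions`:

```python
import random

def split_pairs(image_paths, captions, val_fraction=0.1, seed=42):
    """Shuffle paired lists together and split off a validation subset."""
    if len(image_paths) != len(captions):
        raise ValueError("image_paths and captions must have the same length")
    indices = list(range(len(image_paths)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = max(1, int(len(indices) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = ([image_paths[i] for i in train_idx], [captions[i] for i in train_idx])
    val = ([image_paths[i] for i in val_idx], [captions[i] for i in val_idx])
    return train, val

# Toy stand-ins for the real lists
demo_paths = [f"img_{i}.jpg" for i in range(20)]
demo_caps = [f"caption {i}" for i in range(20)]
(train_paths, train_caps), (val_paths, val_caps) = split_pairs(demo_paths, demo_caps)
print(len(train_paths), len(val_paths))
```

Shuffling indices rather than the lists themselves keeps every image aligned with its own caption.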


### 🔁 Step 3: Encode Images and Captions Using CLIP

Now we’ll use CLIPProcessor to preprocess both images and text, and CLIPModel to generate image and text embeddings.

In [23]:
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class CLIPDataset(Dataset):
    def __init__(self, image_paths, captions):
        self.image_paths = image_paths
        self.captions = captions

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        caption = self.captions[idx]
        return image, caption


In [24]:
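def collate_fn(batch):
    # batch is a list of (PIL image, caption) tuples; unzip into two sequences
    images, texts = zip(*batch)
    inputs = processor(
        text=list(texts),
        images=list(images),
        return_tensors="pt",
        padding=True,     # pad captions to the longest one in this batch
        truncation=True   # cut captions that exceed CLIP's 77-token context
    )
    return inputs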


### 🧪 Create Dataset and Dataloader

In [26]:
dataset = CLIPDataset(image_paths, captions)
loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)

### 🔍 Quick Sanity Check

In [27]:
batch = next(iter(loader))
for key in batch:
    print(f"{key}: shape = {batch[key].shape}")

input_ids: shape = torch.Size([4, 16])
attention_mask: shape = torch.Size([4, 16])
pixel_values: shape = torch.Size([4, 3, 224, 224])
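
The sequence length of 16 is not fixed: with `padding=True`, each batch is padded to its own longest caption. The toy sketch below shows the idea conceptually. It is not the tokenizer's actual implementation, and the token ids and `pad_id=0` are made-up values:

```python
def pad_batch(token_id_lists, pad_id=0):
    """Right-pad token-id lists to the longest one and build a matching attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up token ids for three captions of different lengths
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 2, 102]]
ids, mask = pad_batch(toy_batch)
print(len(ids), len(ids[0]))  # batch size and padded length
print(mask[1])                # mask for the shortest caption
```

The attention mask is what lets the text encoder ignore the padded positions.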


In [28]:
with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=batch["pixel_values"].to(device))
    text_embeds = model.get_text_features(input_ids=batch["input_ids"].to(device),
                                          attention_mask=batch["attention_mask"].to(device))
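
Embeddings like these are all that zero-shot classification needs: embed one text prompt per class, then assign each image to the class whose prompt is most similar. The sketch below uses NumPy with tiny made-up vectors in place of real CLIP outputs. The function name `zero_shot_predict`, the class names, and `scale=100.0` (standing in for the model's learned logit scale) are assumptions for illustration:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_predict(image_embeds, class_text_embeds, scale=100.0):
    """Return the predicted class index and class probabilities for each image."""
    sims = l2_normalize(image_embeds) @ l2_normalize(class_text_embeds).T
    logits = scale * sims
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Made-up 3-d embeddings; real CLIP ViT-B/32 embeddings are 512-d
image_embeds_demo = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.95]])
class_text_embeds_demo = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
class_names = ["dog", "cat", "car"]

pred, probs = zero_shot_predict(image_embeds_demo, class_text_embeds_demo)
print([class_names[i] for i in pred])
```

With the real model, `class_text_embeds_demo` would come from `model.get_text_features` applied to prompts such as "a photo of a dog".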


### 🧠 Step 4: Training Loop with InfoNCE Loss (Contrastive Learning)

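We'll now define the core components to fine-tune CLIP on image-text pairs. The loss is the symmetric InfoNCE objective: in a batch of N pairs, each image should score highest against its own caption and each caption against its own image, so the correct targets lie on the diagonal of the N×N similarity matrix.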

In [29]:
# Define Cosine Similarity & Contrastive Loss (InfoNCE)
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # L2-normalize so the dot product below is a cosine similarity
    image_embeds = F.normalize(image_embeds, p=2, dim=-1)
    text_embeds = F.normalize(text_embeds, p=2, dim=-1)

    # Cosine similarity matrix, sharpened by the temperature
    # (0.07 is the value CLIP's temperature was initialized to)
    logits_per_image = torch.matmul(image_embeds, text_embeds.T) / temperature
    logits_per_text = logits_per_image.T

    # Pair i matches caption i, so the targets are the diagonal indices
    batch_size = image_embeds.size(0)
    labels = torch.arange(batch_size, device=image_embeds.device)

    loss_i = F.cross_entropy(logits_per_image, labels)  # image -> text
    loss_t = F.cross_entropy(logits_per_text, labels)   # text -> image

    return (loss_i + loss_t) / 2
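
As a framework-free cross-check, the same loss can be written in a few lines of NumPy. This is a sketch for building intuition rather than a replacement for the PyTorch version; `info_nce_numpy` is a hypothetical name and the embeddings are random stand-ins:

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce_numpy(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE: matching pairs sit on the diagonal of the logit matrix."""
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    diag = np.arange(logits.shape[0])
    loss_i = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i + loss_t) / 2

rng = np.random.default_rng(0)
text_demo = rng.normal(size=(4, 8))
aligned = text_demo + 0.01 * rng.normal(size=(4, 8))  # each image close to its caption
shuffled = aligned[[1, 2, 3, 0]]                      # every image paired with a wrong caption

loss_aligned = info_nce_numpy(aligned, text_demo)
loss_shuffled = info_nce_numpy(shuffled, text_demo)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The aligned batch should give a much smaller loss than the shuffled one, which is the behavior the training loop relies on.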


In [30]:
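# Set Up Optimizer & Model
from transformers import CLIPModel
import torch

# Reload the pretrained checkpoint so fine-tuning starts from clean weights
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
# Full fine-tuning: every parameter is trainable, so keep the learning rate small
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)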
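
The optimizer above updates every CLIP parameter. The lighter alternative mentioned in the plan, a linear probe, keeps the encoders frozen and trains only a small classifier on their embeddings. The sketch below shows that idea with NumPy on synthetic features that stand in for precomputed image embeddings. The function name, learning rate, and data are all assumptions for illustration:

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax classifier on fixed features with plain gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic stand-in for frozen embeddings: two well-separated clusters
rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0, 0.0, 0.0], [-2.0, 0.0, 0.0, 0.0]])
probe_labels = rng.integers(0, 2, size=200)
probe_features = centers[probe_labels] + 0.5 * rng.normal(size=(200, 4))

W, b = train_linear_probe(probe_features, probe_labels, num_classes=2)
accuracy = ((probe_features @ W + b).argmax(axis=1) == probe_labels).mean()
print(f"train accuracy: {accuracy:.2f}")
```

Because only `W` and `b` are learned, a probe is cheap to train and cannot degrade the pretrained encoders, at the cost of less adaptation than full fine-tuning.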

In [31]:
# Training Loop (Basic)
num_epochs = 5

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    for batch in loader:
        # Move the preprocessed batch to the same device as the model
        pixel_values = batch['pixel_values'].to(device)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        # return_loss=False: skip the built-in loss and use our own below
        outputs = model(
            pixel_values=pixel_values,
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_loss=False
        )

        image_embeds = outputs.image_embeds
        text_embeds = outputs.text_embeds

        loss = clip_contrastive_loss(image_embeds, text_embeds)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Average training loss over all batches in this epoch
    avg_loss = total_loss / len(loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")


Epoch 1/5, Loss: 0.1264
Epoch 2/5, Loss: 0.0599
Epoch 3/5, Loss: 0.0610
Epoch 4/5, Loss: 0.0622
Epoch 5/5, Loss: 0.0475
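
The loop above always runs all five epochs. A minimal early-stopping helper could look like the sketch below; the class name, `patience`, and `min_delta` are assumptions, not part of the notebook. It replays the per-epoch losses printed above purely to illustrate the mechanics. In practice the monitored value should be a validation loss, since training loss alone cannot reveal overfitting:

```python
class EarlyStopping:
    """Signal a stop once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Replaying the five losses from the training log, for illustration only
losses = [0.1264, 0.0599, 0.0610, 0.0622, 0.0475]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

With `patience=2` the replay stops after epoch 4 and never reaches the lower loss at epoch 5, which shows the trade-off: a small patience saves compute but can stop during a temporary plateau.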


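#### 📝 Bonus: Evaluation Code Snippet

The cell below is a quick sanity check rather than a held-out evaluation: the batch comes from the same loader the model was trained on. If training worked, the diagonal of the similarity matrix (matching image-caption pairs) should hold the largest value in each row.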

In [32]:


model.eval()
with torch.no_grad():
    batch = next(iter(loader))
    outputs = model(
        pixel_values=batch['pixel_values'].to(device),
        input_ids=batch['input_ids'].to(device),
        attention_mask=batch['attention_mask'].to(device),
        return_loss=False
    )

    sim_matrix = torch.matmul(
        F.normalize(outputs.image_embeds, dim=1),
        F.normalize(outputs.text_embeds, dim=1).T
    )

    print("Similarity Matrix:\n", sim_matrix)

Similarity Matrix:
 tensor([[ 0.3049, -0.1690, -0.0891,  0.1364],
        [-0.1880,  0.5654, -0.0407, -0.1371],
        [-0.1063, -0.1542,  0.5103, -0.0806],
        [ 0.0287, -0.0455, -0.1565,  0.6267]], device='cuda:0')
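
A similarity matrix like this can be condensed into a retrieval score. The sketch below computes recall@1 in both directions with NumPy, using the values printed above; `recall_at_1` is a hypothetical helper name:

```python
import numpy as np

# Similarity matrix copied from the output above (rows: images, columns: captions)
sim = np.array([
    [ 0.3049, -0.1690, -0.0891,  0.1364],
    [-0.1880,  0.5654, -0.0407, -0.1371],
    [-0.1063, -0.1542,  0.5103, -0.0806],
    [ 0.0287, -0.0455, -0.1565,  0.6267],
])

def recall_at_1(similarity):
    """Fraction of items whose best match is their own pair (the diagonal entry)."""
    targets = np.arange(similarity.shape[0])
    image_to_text = (similarity.argmax(axis=1) == targets).mean()
    text_to_image = (similarity.argmax(axis=0) == targets).mean()
    return float(image_to_text), float(text_to_image)

i2t, t2i = recall_at_1(sim)
print(f"image->text R@1: {i2t:.2f}, text->image R@1: {t2i:.2f}")
```

Every row and column here peaks on the diagonal, so both scores are 1.0 for this batch of four. A score over a larger held-out set would be a more meaningful measure.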
