# Retraining CLIP

## Discussion

This is what the original training procedure of the CLIP model looks like:

![Alt text](https://raw.githubusercontent.com/avaoneal/Retrain_CLIP/main/diagrams/base.drawio.png)

We had to try several types of retraining and ensembling methods before one was successful. I will briefly discuss what we tried here.

1. Firstly, we tried to retrain the transformer models.

  The parts we retrained are in orange and the parts we froze are in blue.

  ![Alt text](https://raw.githubusercontent.com/avaoneal/Retrain_CLIP/main/diagrams/model1.drawio.png)

  We took this approach because we believed that adding convolutional layers would help the model recognize fine-grained details, and that creating a more context-specific set of text embeddings would help with classification. However, this method was unsuccessful. Each additional convolutional layer worsened the overfitting problem we later traced to the embedding space, and the additional transformer had no meaningful impact on performance. This helped us recognize where the misclassification was originating: in the embedding space.

2. Secondly, we tried to add a transformer classifier head.

  The parts we retrained are in orange and the parts we froze are in blue.

  ![Alt text](https://raw.githubusercontent.com/avaoneal/Retrain_CLIP/main/diagrams/model2.drawio.png)

  We took this approach because we believed that adding a classifier after the contrastive learning steps would refine the selection among the text labels and pick up on more image details that distinguish similar classes. However, this method was also unsuccessful, because the largest problem CLIP experiences when classifying is overfitting. The transformer's complex architecture only made this worse, even with dropout, schedulers, low learning rates, and other overfitting-prevention techniques. It kept fitting only our training data instead of reducing the overfitting from contrastive learning, because the transformer merely reaffirmed CLIP's overfit decisions. This showed us that we needed to simplify the architecture and use a structure that could help widen the boundaries between classes.

3. Finally, we landed on this successful architecture of an added MLP classifier head.

  The parts we retrained are in orange and the parts we froze are in blue.

  ![Alt text](https://raw.githubusercontent.com/avaoneal/Retrain_CLIP/main/diagrams/model3.drawio.png)

  This worked because the simpler architecture lets us limit the power of contrastive learning and so cut down on overfitting. We used a single linear layer to produce logits for each class. We unfroze the ViT vision encoder and used a very gentle learning rate to update its weights, while training our linear layer more aggressively. We trained with cross-entropy loss, using the AdamW optimizer, a cosine learning-rate scheduler, and early stopping for efficiency and to limit overfitting. We blended logits during training and validation, averaging the scores from zero-shot CLIP and our linear classifier, to retain the power of CLIP without overfitting. This architecture has shown success.
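
The blending step described in item 3 can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `l2_normalize` and `blend_logits` are hypothetical, the scores are made-up numbers, and the notebook itself performs the same arithmetic on PyTorch tensors.

```python
import math

def l2_normalize(scores):
    """Scale a list of scores to unit Euclidean length."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores]

def blend_logits(clip_scores, classifier_scores, weight=0.5):
    """Average two score vectors after putting them on the same scale."""
    clip_unit = l2_normalize(clip_scores)
    head_unit = l2_normalize(classifier_scores)
    return [weight * h + (1.0 - weight) * c for h, c in zip(head_unit, clip_unit)]

# Hypothetical scores for three classes (made-up numbers)
clip_scores = [0.30, 0.28, 0.10]        # zero-shot CLIP similarities
classifier_scores = [1.0, 4.0, -2.0]    # raw logits from the linear head

blended = blend_logits(clip_scores, classifier_scores)
predicted_class = max(range(len(blended)), key=blended.__getitem__)
print(predicted_class)
```

Normalizing both vectors first keeps either source from dominating the average simply because its raw scores are larger.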



## Set Up

Load in all our packages

In [None]:
# Install necessary packages
!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git
!pip install scipy

import clip
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from torchvision import transforms
from torchvision import datasets
from PIL import Image
import os
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import shutil
import packaging
import kagglehub
import pandas as pd


In [None]:
# Check PyTorch version
# Ensure compatibility with CUDA
import packaging.version  # make sure the `version` submodule is loaded

version = packaging.version.parse(torch.__version__)
if version > packaging.version.parse('1.7.0'):
    print("Pytorch version is above 1.7.0")
    print("It is version:", version)
else:
    print("PyTorch version is not above 1.7.0. Please Upgrade")

Pytorch version is above 1.7.0
It is version: 2.6.0+cu124


Get the CLIP Model

In [None]:
# Load CLIP model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()

100%|███████████████████████████████████████| 338M/338M [00:10<00:00, 32.9MiB/s]


### Unfreeze more layers from CLIP

When fine-tuning a pre-trained model like CLIP, its internal layers are often kept frozen, meaning the weights of those layers don't get updated during training. Freezing preserves the features learned during the original training. But if we want the model to adapt to our new data, we need to make sure the layers we want to train are "unfrozen" (their `requires_grad` flag is `True`).

In [None]:
for name, param in model.named_parameters():
    # This loop goes through every parameter (weight/bias) in the CLIP model.
    # `name` is a string describing which layer the parameter belongs to.
    # `param` is the actual parameter tensor (a PyTorch object containing weights).

    if "visual" in name:
        # Only unfreeze layers in the "visual" part of the model.
        # CLIP has two main parts: a visual encoder (for images) and a text encoder (for text).
        # We only want to modify the visual encoder.

        param.requires_grad = True
        # This tells PyTorch: "Yes, this parameter should be updated during training."
        # Any parameter with `requires_grad = False` will be ignored during backpropagation.
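
The loop above marks every visual-encoder parameter as trainable. If you wanted to train only the last few transformer blocks instead, one option is to filter on the parameter name. The sketch below is a hypothetical helper, not part of this notebook: it assumes the visual blocks are named `visual.transformer.resblocks.<i>.` and that there are 12 of them, so please confirm both against `model.named_parameters()` before relying on it.

```python
import re

# Assumed naming pattern for the visual transformer blocks;
# confirm against the names printed by model.named_parameters().
BLOCK_PATTERN = re.compile(r"^visual\.transformer\.resblocks\.(\d+)\.")

def should_unfreeze(name, num_blocks=12, last_n=2):
    """Return True for parameters in the last `last_n` visual transformer blocks."""
    match = BLOCK_PATTERN.match(name)
    if match is None:
        return False  # not inside a visual transformer block
    return int(match.group(1)) >= num_blocks - last_n

# Illustrative parameter names (assumed, for demonstration only)
example_names = [
    "visual.conv1.weight",
    "visual.transformer.resblocks.0.attn.in_proj_weight",
    "visual.transformer.resblocks.10.mlp.c_fc.weight",
    "visual.transformer.resblocks.11.ln_2.bias",
    "transformer.resblocks.11.ln_2.bias",  # text encoder, never selected
]
selected = [n for n in example_names if should_unfreeze(n)]
print(selected)
```

Inside the loop you could then write `param.requires_grad = should_unfreeze(name)`, which freezes everything the helper does not select.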


###  Define linear classification head

This is a very simple MLP neural network: a single fully connected linear layer. It's used to map the output of CLIP's image encoder to a set of class predictions. Think of it like the final decision layer that says: "I think this image is class X."

In [None]:
class LinearClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        # Constructor for the class. Called when we create an instance of LinearClassifier.
        # `input_dim` is the size of the input features (from the CLIP image encoder: 512).
        # `num_classes` is the number of categories we want to classify

        super(LinearClassifier, self).__init__()

        self.fc = nn.Linear(input_dim, num_classes)
        # This creates the linear (fully connected) layer.
        # It takes a vector of size `input_dim`
        # and outputs a vector of size `num_classes` with values
        # representing the similarity of an image to each class.

    def forward(self, image_features):
        # This function defines how the data flows through the model during forward propagation.
        # It's called automatically during training and inference.

        return self.fc(image_features)
        # The output is a set of raw scores (logits) for each class.


## Step-by-Step Instructions for Re-training on New Data

This notebook is set up to work with any image classification dataset that follows this folder structure:

```
data/
├── train/
│   ├── class_1/
│   │   ├── image_001.png
│   │   ├── ...
│   ├── class_2/
│   │   ├── image_001.png
│   │   ├── ...
│   └── ...
└── test/
```

If it does not, that is ok, but you will need to change the structure slightly.
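
A quick way to check whether a folder matches this layout is to count the files in each class subfolder. The helper below is a hypothetical sketch (the name `summarize_class_folders` is not part of this notebook), and the demo builds a throwaway directory so that it runs on its own.

```python
import os
import tempfile

def summarize_class_folders(train_dir):
    """Return {class_name: number_of_files} for each subfolder of train_dir."""
    summary = {}
    for entry in sorted(os.listdir(train_dir)):
        class_dir = os.path.join(train_dir, entry)
        if os.path.isdir(class_dir):
            files = [f for f in os.listdir(class_dir)
                     if os.path.isfile(os.path.join(class_dir, f))]
            summary[entry] = len(files)
    return summary

# Demo on a temporary folder with two made-up classes
with tempfile.TemporaryDirectory() as root:
    for class_name, count in [("class_1", 3), ("class_2", 2)]:
        class_dir = os.path.join(root, "train", class_name)
        os.makedirs(class_dir)
        for i in range(count):
            open(os.path.join(class_dir, f"image_{i:03d}.png"), "w").close()
    summary = summarize_class_folders(os.path.join(root, "train"))

print(summary)  # {'class_1': 3, 'class_2': 2}
```

An empty result suggests the path points one level too high or too low in the downloaded dataset.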

### **How to Run on New Data (Folder Structure)**

1. Replace the Kaggle Dataset

  Make sure the dataset has a train/ folder with one subfolder per class, like the structure above.

2. Define Dataset Paths

  Now set the path to your training folder. You may need to inspect the downloaded folder structure with os.listdir(path).

  Then run:

  ```
  dataset_root = os.path.join(path, 'YOUR_TRAIN_FOLDER_NAME_HERE')  # Update this!

  val_root = "/kaggle/working/YOUR_PROJECT_NAME/val"  # Folder where val set will be saved
  ```

3. Automatically Create a Validation Set (from training data)

  There is a clear example of this below. This code splits 20% of each class into a new validation set. No need to touch anything in this part.

4. Load the Data into PyTorch

  No changes needed here — this automatically loads your training and validation data using torchvision.datasets.ImageFolder, which works with your folder structure.

5. Automatically Generate Class Labels for CLIP

  This code grabs all the class folder names and uses them to generate natural-language prompts like "A photo of a husky". You don’t need to manually type class names — it's all done for you.

6. Define Test Set Path

  Now, set the path to your test folder. You may need to inspect the new test folder structure just like you did with the training data.

  ```
  test_root = os.path.join(path, 'YOUR_TEST_FOLDER_NAME_HERE')  # Update this!
  ```
  
7. Load the Test Data into PyTorch

  Once the test path is set, this part will load the test data and create the corresponding DataLoader for evaluation

8. Generate Class Names for the Test Set

  This code retrieves all class names from the test dataset, which are used for generating the CLIP text features. This part is also automatic — no need to manually update class names.

9. Generate Text Features for the Test Set

  CLIP needs to know about your test classes to make predictions. This part creates text features for all classes in the test set, using the format: “A photo of a [class name].”
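
The prompt generation in steps 5 and 9 can be sketched as a small helper. Note one assumption: this version also replaces underscores in folder names with spaces, which the notebook itself does not do, and whether that helps accuracy is untested here. The function name `build_prompts` is hypothetical.

```python
def build_prompts(class_names, template="A photo of a {}"):
    """Turn folder-style class names into natural-language prompts."""
    prompts = []
    for name in class_names:
        readable = name.replace("_", " ")  # optional cleanup, not done in this notebook
        prompts.append(template.format(readable))
    return prompts

prompts = build_prompts(["Afghan_hound", "Shih_Tzu", "papillon"])
print(prompts)
# ['A photo of a Afghan hound', 'A photo of a Shih Tzu', 'A photo of a papillon']
```

The resulting list can be passed to `clip.tokenize` in place of the list comprehension used in the cells below.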



**Summary - What to Edit:**

* Kaggle dataset to use

  ```
  kagglehub.dataset_download(...)
  ```
* Name of the train folder
  ```
  os.path.join(path, 'YOUR_FOLDER')
  ```
* Validation output folder name (somewhere on your working directory)
  ```
  val_root = "/kaggle/working/..."
  ```
* Test set folder name and path
  ```
  test_root = os.path.join(path, 'YOUR_TEST_FOLDER_NAME_HERE')
  ```

### **If Your Dataset Uses a CSV File for Labels (Not Folder Structure)**

Some data is instead stored like this:

```
data/
├── train/
│   ├── image_001.png
│   ├── image_002.png
│   └── ...
├── test/
├── train.csv
└── test.csv
```

with a CSV file of labels:

```
filename,class
dog_001.jpg,husky
dog_002.jpg,beagle
dog_003.jpg,husky
```

No problem! Here’s what you need to change:

1. Load the CSV

  ```
  csv_path = os.path.join(path, "your_labels.csv")  # update this!
  df = pd.read_csv(csv_path)
  ```
  Make sure your CSV has at least two columns: the image filename and the label.

2. Update the Root Directory

  This should point to the folder where the images live, usually called train or images:

  ```
  image_folder = os.path.join(path, "train")
  ```
3. Create a Custom Dataset Class

  Replace the ImageFolder logic with a custom dataset that uses your CSV and image folder:

  ```
  class CSVDataset(Dataset):
      def __init__(self, dataframe, root_dir, transform=None):
          self.data = dataframe
          self.root_dir = root_dir
          self.transform = transform

          # Map class names to numeric labels
          self.classes = sorted(self.data['class'].unique())
          self.class_to_idx = {cls: idx for idx, cls in enumerate(self.classes)}
          self.data['label'] = self.data['class'].map(self.class_to_idx)

      def __len__(self):
          return len(self.data)

      def __getitem__(self, idx):
          row = self.data.iloc[idx]
          img_path = os.path.join(self.root_dir, row['filename'])
          image = Image.open(img_path).convert("RGB")
          label = row['label']

          if self.transform:
              image = self.transform(image)

          return image, label
  ```
4. Create Train/Val Splits with Custom Data Set

  Once the custom dataset is ready, you can split the data into training and validation sets. This code will split 20% of each class into a validation set:

  ```
  from sklearn.model_selection import train_test_split

  train_df, val_df = train_test_split(df, test_size=0.2, stratify=df['class'], random_state=42)

  train_dataset = CSVDataset(train_df, root_dir=image_folder, transform=preprocess)
  val_dataset = CSVDataset(val_df, root_dir=image_folder, transform=preprocess)

  train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
  val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
  ```
5. Update Class Names and Text Prompts

  After loading the dataset, generate the text prompts for each class using the class column. This step is necessary to create the text features for CLIP, which will be used to train the model.

  ```
  class_names = train_dataset.classes

  with torch.no_grad():
      all_text_prompts = [f"A photo of a {classname}" for classname in class_names]
      tokenized_texts = clip.tokenize(all_text_prompts).to(device)
      text_features_all = model.encode_text(tokenized_texts)
      text_features_all = F.normalize(text_features_all, dim=-1).float()
  ```
6. Load the CSV File for the Test Set

  Similar to the training set, you need to load the test.csv file that contains the image filenames and their corresponding labels:

  ```
  csv_test_path = os.path.join(path, "test.csv")  # Update this with your test CSV file path!
  df_test = pd.read_csv(csv_test_path)
  ```

  Ensure that the test CSV has at least two columns: filename and class.

7. Update the Root Directory for Test Images

  ```
  test_image_folder = os.path.join(path, "test")
  ```

8. Use the Same Custom Dataset Class for the Test Set

  ```
  test_dataset = CSVDataset(df_test, root_dir=test_image_folder, transform=preprocess)
  test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
  ```

9. Generate Class Names and Text Prompts for CLIP (Test Set)

  ```
  class_names_test = test_dataset.classes  # Retrieve the class names for the test set

  with torch.no_grad():
      all_text_prompts_test = [f"A photo of a {classname}" for classname in class_names_test]
      tokenized_texts_test = clip.tokenize(all_text_prompts_test).to(device)
      text_features_all_test = model.encode_text(tokenized_texts_test)  # Shape: (num_classes, 512)
      text_features_all_test = F.normalize(text_features_all_test, dim=-1).float()  # Normalize the text features
  ```

There is an example of this in the GTSRB file.
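
Before building the custom dataset, it can help to confirm that the label file has the expected columns and to see how many images each class has. The sketch below uses only the standard library and an in-memory copy of the example CSV shown earlier; the function name `check_label_rows` is hypothetical, and the column names `filename` and `class` are the ones assumed throughout this section.

```python
import csv
import io
from collections import Counter

def check_label_rows(csv_file, required=("filename", "class")):
    """Verify the required columns exist and count the rows per class."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return Counter(row["class"] for row in reader)

# In-memory stand-in for a real labels file
sample = io.StringIO(
    "filename,class\n"
    "dog_001.jpg,husky\n"
    "dog_002.jpg,beagle\n"
    "dog_003.jpg,husky\n"
)
class_counts = check_label_rows(sample)
print(class_counts)  # Counter({'husky': 2, 'beagle': 1})
```

With a real file you would pass `open(csv_path, newline="")` instead of the `StringIO` object. A class with only one image is worth noticing here, since a stratified 80/20 split cannot place it in both sets.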

## **DATA: This is the part you edit**

To run this re-training procedure, this is the **only** part you need to edit. All necessary changes can be made here. Changes elsewhere may affect the model and make results difficult to compare.

### Ok, now we actually do this on a dogs data set

Load data

In [None]:
# Download latest version
path = kagglehub.dataset_download("amirmakir/dogs-dataset") # Change to your data set
print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/dogs-dataset


Split out validation set

In [None]:
dataset_root = os.path.join(path, 'dogs', 'train')  # Path to your "train" folder, change as needed
val_root = "/kaggle/working/dogs/val"  # Path to save the validation split (working directory), change as needed
# Define paths


if not os.path.exists(val_root):
    os.makedirs(val_root)
# Create the validation root folder if it doesn't exist


for class_name in os.listdir(dataset_root):
# Split data within each class folder

    class_folder = os.path.join(dataset_root, class_name)

    if os.path.isdir(class_folder):

        image_files = [f for f in os.listdir(class_folder) if os.path.isfile(os.path.join(class_folder, f))]
        # Get list of image files in the class folder

        train_files, val_files = train_test_split(image_files, test_size=0.2, random_state=42)
        # Split the images into training and validation sets

        val_class_folder = os.path.join(val_root, class_name)
        if not os.path.exists(val_class_folder):
            os.makedirs(val_class_folder)
        # Create corresponding folders in the validation directory

        for val_image in val_files:
            src = os.path.join(class_folder, val_image)
            dst = os.path.join(val_class_folder, val_image)
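            shutil.copy(src, dst)  # Use copy instead of move
        # Copy validation images to the validation folder
        # Note: because the files are copied, they also remain in dataset_root,
        # so a training set loaded from dataset_root still contains them.

# After this, the validation set should be created in /kaggle/working/dogs/val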
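
Because the cell above copies the validation images rather than moving them, the same files are still present under `dataset_root`. If you want a validation set that does not overlap with the training images, one option is to copy every file into exactly one of two new folders. The sketch below is a hypothetical alternative, not what this notebook runs: the function name and output folders are assumptions, and the demo uses placeholder files in a temporary directory.

```python
import os
import random
import shutil
import tempfile

def split_class_folder(class_dir, train_out, val_out, val_fraction=0.2, seed=42):
    """Copy each file in class_dir into exactly one of train_out or val_out."""
    files = sorted(f for f in os.listdir(class_dir)
                   if os.path.isfile(os.path.join(class_dir, f)))
    random.Random(seed).shuffle(files)  # reproducible shuffle
    n_val = max(1, int(len(files) * val_fraction)) if files else 0
    os.makedirs(train_out, exist_ok=True)
    os.makedirs(val_out, exist_ok=True)
    for i, name in enumerate(files):
        target = val_out if i < n_val else train_out
        shutil.copy(os.path.join(class_dir, name), os.path.join(target, name))
    return len(files) - n_val, n_val

# Demo with ten placeholder files for one made-up class
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "src", "husky")
    os.makedirs(src)
    for i in range(10):
        with open(os.path.join(src, f"image_{i:03d}.png"), "w") as fh:
            fh.write("placeholder")
    train_dir = os.path.join(root, "train", "husky")
    val_dir = os.path.join(root, "val", "husky")
    n_train, n_val = split_class_folder(src, train_dir, val_dir)
    train_files = set(os.listdir(train_dir))
    val_files = set(os.listdir(val_dir))

print(n_train, n_val, train_files & val_files)  # 8 2 set()
```

To use something like this in the notebook, you would call it once per class folder and then point `ImageFolder` at the two new roots. Doing so would change the training set size, so the numbers printed later in this notebook would no longer apply.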


Preprocess

In [None]:
train_transform = preprocess
val_transform = preprocess
# Define the transformation for CLIP preprocessing (same as when we loaded the model)
# CLIP preprocess automatically resizes, normalizes, and converts to tensor

train_dataset = datasets.ImageFolder(root=dataset_root, transform=train_transform)
val_dataset = datasets.ImageFolder(root=val_root, transform=val_transform)
# Create datasets for train and validation using ImageFolder

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
# Create DataLoaders for train and validation sets

Get classes

In [None]:
class_names = train_dataset.classes
print(class_names)
# Extract class names from folders

with torch.no_grad():
    all_text_prompts = [f"A photo of a {classname}" for classname in class_names]
    tokenized_texts = clip.tokenize(all_text_prompts).to(device)
    text_features_all = model.encode_text(tokenized_texts)
    text_features_all = F.normalize(text_features_all, dim=-1).float()  # Normalize to unit length and cast to float32
# Update class names with text prompt

['Afghan_hound', 'Blenheim_spaniel', 'Chihuahua', 'Japanese_spaniel', 'Maltese_dog', 'Pekinese', 'Rhodesian_ridgeback', 'Shih_Tzu', 'papillon', 'toy_terrier']


Get test set set up for later

In [None]:
# Paths
test_root = os.path.join(path, 'dogs', 'test') # Change this for your own data

# Load test set
test_dataset = datasets.ImageFolder(root=test_root, transform=preprocess)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Get all class names
class_names_test = test_dataset.classes
print("Class names:", class_names_test)

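# Generate text features for all classes once
# Note: this overwrites text_features_all, which the training loop uses, so it is normalized here as well
with torch.no_grad():
    all_texts = [f"A photo of a {classname}" for classname in class_names_test] # Feel free to change the prompt if desired
    tokenized_texts = clip.tokenize(all_texts).to(device)
    text_features_all = model.encode_text(tokenized_texts)  # Shape: (num_classes, 512)
    text_features_all = F.normalize(text_features_all, dim=-1).float()  # Normalize the text features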

Class names: ['Afghan_hound', 'Blenheim_spaniel', 'Chihuahua', 'Japanese_spaniel', 'Maltese_dog', 'Pekinese', 'Rhodesian_ridgeback', 'Shih_Tzu', 'papillon', 'toy_terrier']


## **OK, STOP EDITING HERE**

The rest of this file should work just fine without edits if you didn't change any variable names

## Let's Retrain

In [None]:
classifier = LinearClassifier(input_dim=512, num_classes=len(class_names)).to(device)
# Initializes the classifier we defined earlier

optimizer = torch.optim.AdamW([
    {"params": model.visual.parameters(), "lr": 1e-6},
    {"params": classifier.parameters(), "lr": 1e-4}
], weight_decay=1e-4)
# We're training two parts:
# 1) model.visual: The vision encoder from CLIP — we fine-tune it very gently using a small learning rate (1e-6)
# 2) classifier: Our new linear layer — it starts from scratch, so we train it more aggressively (1e-4)
# AdamW is a common optimizer

criterion = nn.CrossEntropyLoss()
# Cross-entropy compares the predicted scores (logits) against the true label and penalizes wrong guesses.

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
# This slowly reduces the learning rate over time in a smooth cosine curve
# this is a common trick to make training more stable and avoid overshooting the minimum loss.


num_epochs = 10
# how many times we loop through the whole dataset

best_val_acc = 0
# keeps track of the best accuracy we've seen so far

patience = 3
# For early stopping — we stop training if validation accuracy doesn’t improve for 3 straight epochs
# This trains more efficiently and prevents overfitting

epochs_no_improve = 0
# how many times we've failed to beat our best accuracy

for epoch in range(num_epochs):

    classifier.train()
    # classifier.train() puts the model in training mode

    total_loss, correct, total = 0, 0, 0

    # Training ##################################################

    for images, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} (Train)"):

        images, labels = images.to(device), labels.to(device)
        image_features = model.encode_image(images).float()
        # Use CLIP’s vision model to encode the images into 512-dimension feature vectors
        image_features = F.normalize(image_features, dim=-1)
        # Normalize them (unit length) so comparisons (dot products) behave like cosine similarity

        with torch.no_grad():
            clip_logits = image_features @ text_features_all.T  # (B, num_classes)
        # Dot product between image and text features. Gives similarity scores
        # (logits) between each image and all class names.

        classifier_logits = classifier(image_features)
        # our classifier’s own guess — based on its trained weights

        clip_logits = clip_logits / clip_logits.norm(dim=-1, keepdim=True)
        classifier_logits = classifier_logits / classifier_logits.norm(dim=-1, keepdim=True)
        # Normalize - ensures the same scale

        blended_logits = 0.5 * classifier_logits + 0.5 * clip_logits
        # average the scores from CLIP and our linear classifier

        loss = criterion(blended_logits, labels)
        # Calculate the loss from the blended prediction

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Clear old gradients, backpropagate new ones, and take an optimizer step

        total_loss += loss.item()
        correct += (blended_logits.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
        # Count how many predictions were correct and update total loss and accuracy

    train_acc = 100 * correct / total
    print(f"Epoch {epoch+1}: Train Loss = {total_loss:.4f}, Train Acc = {train_acc:.2f}%")

    # Validation ################################################

    classifier.eval()
    # Switch model to evaluation mode

    correct, total = 0, 0

    with torch.no_grad():
        for images, labels in tqdm(val_loader, desc=f"Epoch {epoch+1}/{num_epochs} (Val)"):
            images, labels = images.to(device), labels.to(device)
            image_features = model.encode_image(images).float()
            image_features = F.normalize(image_features, dim=-1)

            # Already inside torch.no_grad(), so no nested context is needed
            clip_logits = image_features @ text_features_all.T
            classifier_logits = classifier(image_features)
            clip_logits = clip_logits / clip_logits.norm(dim=-1, keepdim=True)
            classifier_logits = classifier_logits / classifier_logits.norm(dim=-1, keepdim=True)
            logits = 0.5 * classifier_logits + 0.5 * clip_logits
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)

    val_acc = 100 * correct / total
    print(f"Epoch {epoch+1}: Val Acc = {val_acc:.2f}%")
    # Count correct predictions to compute validation accuracy

    # Early Stopping #############################################

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        epochs_no_improve = 0
        torch.save(classifier.state_dict(), 'best_linear_classifier.pth')
        print("Improved validation accuracy. Saved model.")
    # If we beat our best validation accuracy, save the model

    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print("Early stopping.")
            break
    # If we’ve gone patience epochs with no improvement, stop training early

    scheduler.step()
    # Move along the cosine schedule — lower the learning rate a bit


Epoch 1/10 (Train): 100%|██████████| 51/51 [00:17<00:00,  2.84it/s]


Epoch 1: Train Loss = 112.3489, Train Acc = 33.65%


Epoch 1/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  5.76it/s]


Epoch 1: Val Acc = 68.52%
Improved validation accuracy. Saved model.


Epoch 2/10 (Train): 100%|██████████| 51/51 [00:13<00:00,  3.89it/s]


Epoch 2: Train Loss = 100.9878, Train Acc = 76.72%


Epoch 2/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  6.08it/s]


Epoch 2: Val Acc = 88.58%
Improved validation accuracy. Saved model.


Epoch 3/10 (Train): 100%|██████████| 51/51 [00:12<00:00,  3.97it/s]


Epoch 3: Train Loss = 97.3580, Train Acc = 90.70%


Epoch 3/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  6.15it/s]


Epoch 3: Val Acc = 91.98%
Improved validation accuracy. Saved model.


Epoch 4/10 (Train): 100%|██████████| 51/51 [00:12<00:00,  3.95it/s]


Epoch 4: Train Loss = 95.6824, Train Acc = 94.19%


Epoch 4/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  6.18it/s]


Epoch 4: Val Acc = 95.37%
Improved validation accuracy. Saved model.


Epoch 5/10 (Train): 100%|██████████| 51/51 [00:12<00:00,  3.97it/s]


Epoch 5: Train Loss = 94.5789, Train Acc = 96.32%


Epoch 5/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  6.06it/s]


Epoch 5: Val Acc = 96.91%
Improved validation accuracy. Saved model.


Epoch 6/10 (Train): 100%|██████████| 51/51 [00:12<00:00,  4.00it/s]


Epoch 6: Train Loss = 93.9600, Train Acc = 97.19%


Epoch 6/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  6.02it/s]


Epoch 6: Val Acc = 97.22%
Improved validation accuracy. Saved model.


Epoch 7/10 (Train): 100%|██████████| 51/51 [00:12<00:00,  3.96it/s]


Epoch 7: Train Loss = 93.3998, Train Acc = 97.69%


Epoch 7/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  6.00it/s]


Epoch 7: Val Acc = 98.15%
Improved validation accuracy. Saved model.


Epoch 8/10 (Train): 100%|██████████| 51/51 [00:12<00:00,  3.99it/s]


Epoch 8: Train Loss = 93.1269, Train Acc = 98.13%


Epoch 8/10 (Val): 100%|██████████| 11/11 [00:02<00:00,  5.49it/s]


Epoch 8: Val Acc = 98.77%
Improved validation accuracy. Saved model.


Epoch 9/10 (Train): 100%|██████████| 51/51 [00:12<00:00,  3.96it/s]


Epoch 9: Train Loss = 92.9793, Train Acc = 98.19%


Epoch 9/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  5.87it/s]


Epoch 9: Val Acc = 99.07%
Improved validation accuracy. Saved model.


Epoch 10/10 (Train): 100%|██████████| 51/51 [00:12<00:00,  3.94it/s]


Epoch 10: Train Loss = 92.9358, Train Acc = 98.31%


Epoch 10/10 (Val): 100%|██████████| 11/11 [00:01<00:00,  6.05it/s]

Epoch 10: Val Acc = 99.07%
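
To see how the cosine schedule lowers the learning rate across the run, here is a plain-Python sketch of the closed-form cosine annealing formula, lr(t) = eta_min + (base_lr - eta_min) * (1 + cos(pi * t / T_max)) / 2. It assumes `eta_min=0` and `T_max=10`, matching the scheduler call above; the helper name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(base_lr, epoch, t_max=10, eta_min=0.0):
    """Closed-form cosine annealing: learning rate after `epoch` scheduler steps."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

# Learning rate of the classifier group (base 1e-4) after 0..10 scheduler steps
classifier_lrs = [cosine_lr(1e-4, e) for e in range(11)]
for e, lr in enumerate(classifier_lrs):
    print(f"after {e} steps: lr = {lr:.2e}")
```

The rate starts at the base value, reaches half of it at the midpoint, and approaches zero at `T_max`. The vision-encoder group follows the same curve scaled to its own base rate of 1e-6.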





## Compute Accuracy with Newly Trained Model

In [None]:
def compute_topk_accuracy(logits, labels, topk=(1, 3, 5)):
    max_k = max(topk)
    batch_size = labels.size(0)

    _, pred = logits.topk(max_k, dim=1, largest=True, sorted=True)
    pred = pred.t()
    correct = pred.eq(labels.view(1, -1).expand_as(pred))

    topk_accs = {}
    for k in topk:
        correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
        topk_accs[f"top{k}"] = (correct_k / batch_size).item() * 100.0

    return topk_accs

# Evaluate fine-tuned classifier
classifier.eval()
top1_total, top3_total, top5_total, total_samples = 0, 0, 0, 0

with torch.no_grad():
    for images, labels in tqdm(test_loader, desc="Evaluating Fine-tuned Classifier"):
        images, labels = images.to(device), labels.to(device)
        image_features = model.encode_image(images).float()
        image_features = F.normalize(image_features, dim=-1)

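        logits = classifier(image_features)
        # Note: this scores the linear classifier alone, not the blended logits used in training and validation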
        accs = compute_topk_accuracy(logits, labels)

        top1_total += accs['top1'] * images.size(0)
        top3_total += accs['top3'] * images.size(0)
        top5_total += accs['top5'] * images.size(0)
        total_samples += images.size(0)

print(f"\nFine-tuned Classifier Accuracy:")
print(f"Top-1: {top1_total / total_samples:.2f}%")
print(f"Top-3: {top3_total / total_samples:.2f}%")
print(f"Top-5: {top5_total / total_samples:.2f}%")


Evaluating Fine-tuned Classifier: 100%|██████████| 10/10 [00:03<00:00,  3.09it/s]


Fine-tuned Classifier Accuracy:
Top-1: 88.64%
Top-3: 98.11%
Top-5: 99.37%
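
For readers who want to check the top-k logic by hand, here is a plain-Python sketch of the same idea on a tiny made-up example. The function name `topk_accuracy` and the scores are illustrative only. Note also that the tensor version above calls `logits.topk(5)`, so with the default `topk=(1, 3, 5)` it needs at least five classes.

```python
def topk_accuracy(score_rows, labels, k):
    """Percentage of rows whose true label is among the k highest scores."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)

# Four made-up samples over three classes
scores = [
    [0.1, 0.7, 0.2],  # true class 1 is ranked first
    [0.5, 0.3, 0.2],  # true class 1 is ranked second
    [0.2, 0.3, 0.5],  # true class 0 is ranked third
    [0.6, 0.3, 0.1],  # true class 0 is ranked first
]
labels = [1, 1, 0, 0]

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
print(top1, top2)  # 50.0 75.0
```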





## Compare To Zero Shot Accuracy

In [None]:
original_model = clip.load("ViT-B/32", device=device)[0].float().eval()
with torch.no_grad():
    tokenized_texts = clip.tokenize([f"A photo of a {classname}" for classname in class_names_test]).to(device)
    text_features_all = original_model.encode_text(tokenized_texts)
    text_features_all = F.normalize(text_features_all, dim=-1).float()

def compute_zero_shot_topk_accuracy(model, image_loader, text_features_all, device):
    model.eval()
    text_features_all = F.normalize(text_features_all, dim=-1)

    top1_total, top3_total, top5_total, total_samples = 0, 0, 0, 0

    with torch.no_grad():
        for images, labels in tqdm(image_loader, desc="Evaluating Zero-Shot CLIP"):
            images, labels = images.to(device), labels.to(device)
            image_features = model.encode_image(images).float()
            image_features = F.normalize(image_features, dim=-1)

            logits = image_features @ text_features_all.T
            accs = compute_topk_accuracy(logits, labels)

            top1_total += accs['top1'] * images.size(0)
            top3_total += accs['top3'] * images.size(0)
            top5_total += accs['top5'] * images.size(0)
            total_samples += images.size(0)

    return {
        'top1': top1_total / total_samples,
        'top3': top3_total / total_samples,
        'top5': top5_total / total_samples,
    }

# Run zero-shot evaluation
zero_shot_results = compute_zero_shot_topk_accuracy(original_model, test_loader, text_features_all, device)

print("\nZero-Shot CLIP Accuracy:")
print(f"Top-1: {zero_shot_results['top1']:.2f}%")
print(f"Top-3: {zero_shot_results['top3']:.2f}%")
print(f"Top-5: {zero_shot_results['top5']:.2f}%")

Evaluating Zero-Shot CLIP: 100%|██████████| 10/10 [00:02<00:00,  4.84it/s]


Zero-Shot CLIP Accuracy:
Top-1: 74.45%
Top-3: 91.17%
Top-5: 97.48%
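
The gap between the two sets of results printed above can be computed directly. The numbers below are copied from this notebook's outputs; the snippet only does the subtraction.

```python
# Accuracies (in percent) copied from the outputs above
fine_tuned = {"top1": 88.64, "top3": 98.11, "top5": 99.37}
zero_shot = {"top1": 74.45, "top3": 91.17, "top5": 97.48}

# Absolute gain in percentage points for each metric
gains = {k: round(fine_tuned[k] - zero_shot[k], 2) for k in fine_tuned}
print(gains)  # {'top1': 14.19, 'top3': 6.94, 'top5': 1.89}
```

The largest gain is in top-1 accuracy, while top-5 was already high for zero-shot CLIP and leaves less room to improve.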



