# Environment setup 


## Installing CLIP and YoloV5 and Imports

In the first section of this file, the required components are installed. The first bash lines install CLIP and the dependencies of YOLOv5, respectively (the YOLOv5 model itself is loaded later through `torch.hub`). These two neural networks form the foundation of the project.

In [1]:
%%capture
%%bash
# Download CLIP and YOLO
pip install git+https://github.com/openai/CLIP.git
pip install -qr https://raw.githubusercontent.com/ultralytics/yolov5/master/requirements.txt

# Command to install some needed dependencies in the AWS machine
sudo apt-get update && sudo apt-get install ffmpeg libsm6 libxext6 -y

## List of imports

In [2]:
# general imports
import pickle
import json
import tarfile
import os
import math
import torch
import clip
from PIL import Image, ImageFilter, ImageDraw

# utility libraries imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tqdm import tqdm

# torch imports
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
from torchvision import transforms

## Setting the Clip model and Yolo model variables

In [3]:
# Choosing the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# choosing the clip model and the yolo versions, both pre-trained
clip_model, preprocess = clip.load('RN50', device)
yolo_model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True, trust_repo=True)

# Ensure the model is in float32 precision and transferred to the correct device
clip_model = clip_model.to(device).float()

Downloading: "https://github.com/ultralytics/yolov5/zipball/master" to /home/sagemaker-user/.cache/torch/hub/master.zip
YOLOv5 🚀 2025-1-29 Python-3.11.10 torch-2.3.1.post300 CUDA:0 (Tesla T4, 14918MiB)

Fusing layers... 
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
Adding AutoShape... 


## Dataset

The following code sections contain the structures needed to load the data from the RefCOCOg dataset.\
RefCOCOg is designed for Referring Expression Grounding, whose goal is to identify an object given a referring expression. This corresponds to the objective of this project.\
The dataset is composed of 25799 images, each having an average of 3.7 referring expressions. These expressions are related to specific objects inside the image. The ground truth is represented by the bounding boxes.

The set of files composing the dataset is:
 - `instances.json`, which contains all the information about the bounding boxes of each image
 - `refs(umd).p`, a serialized file with all the descriptions related to a bounding box and the split it belongs to (train/validation/test)
 - the `images` directory with all the images

This Dataset class reads the `instances.json` and `refs(umd).p` files and creates the associations image_id -> image_name and annotation_id -> bounding_box to simplify the retrieval of a single element in the `__getitem__` method.\
Moreover, a list of samples is created with all the dataset entries; each sample is composed of the image id, the annotation id, and the sentence. The objective of this structure, besides containing all samples for the `__len__` method, is to simplify the implementation of the `__getitem__` method.\
The latter takes as input an idx (the index of the element currently requested by the iterator) and returns the image cropped to the bounding box and the sentence related to that box.\
The dataset can perform data augmentation if needed, and handles the bounding box according to the `crop_borders` variable:
- `cut` -> simply cut the image to the bounding boxes
- `blur` -> blur what's outside the box
- `none` -> do nothing
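
The sample list described above can be illustrated on a toy record. This is a minimal sketch with made-up ids and sentences: the keys (`image_id`, `ann_id`, `sentences`, `sent`) are the ones read by `_prepare_samples` in the class below, while `toy_refs` and `flatten_refs` are hypothetical names used only for this illustration.

```python
def flatten_refs(refs):
    """Expand each reference into one sample per referring sentence."""
    samples = []
    for ref in refs:
        for sentence in ref["sentences"]:
            samples.append({
                "image_id": ref["image_id"],
                "ann_id": ref["ann_id"],
                "sentence": sentence["sent"],
            })
    return samples


# Made-up record: one annotated object described by two sentences
toy_refs = [
    {"image_id": 1, "ann_id": 10, "split": "train",
     "sentences": [{"sent": "the dog on the left"}, {"sent": "brown dog"}]},
]

samples = flatten_refs(toy_refs)
print(len(samples))  # 2
```

Each referring sentence becomes its own dataset entry, so an object with several descriptions is visited several times per epoch, once per sentence.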

Before being returned, the image is preprocessed using the CLIP preprocessing (`self.transform`).\
The object returned by `__getitem__` contains:
- the image path (for the BaseModel)
- the image (with any required modifications)
- the sentence related to that image
- the bounding boxes (converted into coordinates)
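
The conversion of the bounding box into coordinates can be sketched in isolation. The annotations store boxes as `[x, y, width, height]`, and the class below turns them into corner coordinates that are clamped to the image in the `cut` and `blur` modes. The function name `xywh_to_clamped_xyxy` and the sample values are assumptions made for this example.

```python
def xywh_to_clamped_xyxy(bbox, image_size):
    """Convert an [x, y, w, h] box to integer corners that lie inside the image."""
    x, y, w, h = bbox
    width, height = image_size
    x1, y1, x2, y2 = map(int, (x, y, x + w, y + h))
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return x1, y1, x2, y2


# Made-up box that extends past the right and bottom edges of a 640x480 image
corners = xywh_to_clamped_xyxy([600.5, 400.2, 100.0, 120.0], image_size=(640, 480))
print(corners)  # (600, 400, 640, 480)
```

Note that in the `none` mode the class skips this integer conversion and clamping, so the returned box keeps the original floating-point corners.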

In [4]:
class RefCOCOgDataset(Dataset):
    def __init__(self, transform=None, augment=None, split='train', device='cuda', crop_borders='cut'):

        ## Load images' management properties
        self.image_dir = os.path.join('refcocog', 'images')
        self.transform = transform
        self.augment = augment
        self.crop_borders = crop_borders  # Can be 'cut', 'blur', or 'none'

        # Define class properties for split and device
        self.split = split
        self.device = device

        # Load dataset references and instances
        self.refs = self.load_refs()
        self.instances = self.load_instances()

        # Create lookup dictionaries for efficient indexing
        self.image_id_to_filename = {img['id']: img['file_name'] 
                                     for img in self.instances['images']}
        self.ann_id_to_bbox = {ann['id']: ann['bbox'] 
                               for ann in self.instances['annotations']}

        # Prepare samples for the dataset
        self.samples = self._prepare_samples()

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]

        # Load the image
        image_name = self.image_id_to_filename[sample['image_id']]
        image_path = os.path.join(self.image_dir, image_name)


        image = Image.open(image_path).convert("RGB")
        # Get the bounding box
        bbox = self.ann_id_to_bbox[sample['ann_id']]
        x1, y1, w, h = bbox
        x2, y2 = x1 + w, y1 + h

        if self.crop_borders == 'cut':
            # cut the image to the bounding box
            x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
            x1, y1 = max(0, x1), max(0, y1)
            x2, y2 = min(image.size[0], x2), min(image.size[1], y2)
            image = image.crop((x1, y1, x2, y2))

        elif self.crop_borders == 'blur':
            # Blur the area outside the bounding box
            x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
            x1, y1 = max(0, x1), max(0, y1)
            x2, y2 = min(image.size[0], x2), min(image.size[1], y2)

            # Create a blurred version of the image
            blurred_image = image.filter(ImageFilter.GaussianBlur(radius=15))

            # Create a mask to preserve the bounding box region
            mask = Image.new("L", image.size, 0)
            draw = ImageDraw.Draw(mask)
            draw.rectangle((x1, y1, x2, y2), fill=255)

            # Composite the image with the blurred background
            image = Image.composite(image, blurred_image, mask)

        # Apply data augmentations
        if self.augment:
            image = self.augment(image)

        # Apply preprocessing transforms
        if self.transform:
            image = self.transform(image)

        # Prepare the sample
        sample = {
            'image_path':image_path,
            'image': image,
            'sentence': sample['sentence'],
            'bbox': torch.Tensor([x1, y1, x2, y2])
        }
        return sample

    def load_refs(self):
        annotation_file = os.path.join('refcocog', 'annotations', 'refs(umd).p')
        with open(annotation_file, 'rb') as f:
            data = pickle.load(f)
        return [item for item in data if item['split'] == self.split]

    def load_instances(self):
        instances_file = os.path.join('refcocog', 'annotations', 'instances.json')
        with open(instances_file, 'r') as f:
            return json.load(f)

    def _prepare_samples(self):
        samples = []
        for ref in self.refs:
            for sentence in ref['sentences']:
                samples.append({
                    'image_id': ref['image_id'],
                    'ann_id': ref['ann_id'],
                    'sentence': sentence['sent']
                })
        return samples

# Fine-tuning Clip

## Dataloaders

For all the models, three datasets are always instantiated: one for training, one for validation, and one for testing. Each step uses different parameters, in order to find the best options for training the models.

For the fine-tuning of CLIP, after different attempts with cropped images, I decided to blur what is outside the bounding boxes. In this way, CLIP can be fine-tuned while keeping the information on the location of the boxes, and be slightly more resilient to overfitting: images that would be cropped to a very small size, and therefore be heavily altered by preprocessing, are now easier for the network to interpret.

Each dataset is then wrapped in a DataLoader. All these dataloaders use multiple worker processes (`num_workers`) to speed up training and validation.\
It is important to point out that while the training set is shuffled, the validation and test sets are not, since shuffling them would be pointless.\
Moreover, data are split into batches of size `64`. This parameter has also been chosen for speed reasons: batches of 64 elements represent a good trade-off, since they are neither too large nor too small, and the weights are updated after a reasonable number of examples (given the dataset size).

In [14]:
# Train, validation, and test datasets, blurring what's outside the boxes to limit overfitting
# (crop_borders accepts 'cut', 'blur', or 'none'; use 'cut' for the cropped variant)
finetune_train_dataset = RefCOCOgDataset(transform=preprocess, split='train', crop_borders='blur')
finetune_val_dataset = RefCOCOgDataset(transform=preprocess, split='val', crop_borders='blur')
finetune_test_dataset = RefCOCOgDataset(transform=preprocess, split='test', crop_borders='blur')

# DataLoader options
batch_size = 64
num_workers = 4
pin_memory = True
persistent_workers = True

# DataLoaders
finetune_train_loader = DataLoader(
    dataset=finetune_train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=pin_memory,
    persistent_workers=persistent_workers
)

finetune_val_loader = DataLoader(
    dataset=finetune_val_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    pin_memory=pin_memory,
    persistent_workers=persistent_workers
)

finetune_test_loader = DataLoader(
    dataset=finetune_test_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    pin_memory=pin_memory,
    persistent_workers=persistent_workers
)

# Print dataset sizes
print("=======================================================")
print(f"Number of training samples: {len(finetune_train_dataset)}")
print(f"Number of validation samples: {len(finetune_val_dataset)}")
print(f"Number of test samples: {len(finetune_test_dataset)}")
print("=======================================================")

Number of training samples: 80512
Number of validation samples: 4896
Number of test samples: 9602
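
To make the batch-size trade-off concrete, the number of batches per split follows from the sizes printed above. This is a small arithmetic sketch; it assumes the `DataLoader` default of keeping the last, smaller batch (`drop_last=False`), which the cell above does not override.

```python
import math

batch_size = 64
split_sizes = {"train": 80512, "val": 4896, "test": 9602}

# Number of batches per epoch, counting the final partial batch
batches = {name: math.ceil(n / batch_size) for name, n in split_sizes.items()}

# Size of the final batch (a full batch when the split divides evenly)
last_batch = {name: n % batch_size or batch_size for name, n in split_sizes.items()}

print(batches)     # {'train': 1258, 'val': 77, 'test': 151}
print(last_batch)  # {'train': 64, 'val': 32, 'test': 2}
```

One training epoch therefore performs 1258 weight updates. The small final batches of the validation and test splits are worth keeping in mind, since an in-batch matching accuracy is easier to score on a batch of 2 or 32 candidates than on a batch of 64.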


## Train and validation functions

The training and validation functions below are used for both fine-tuning approaches.

The `train_epoch` function iterates through the batches. For each one, the images and the tokenized sentences are passed through the model to obtain their respective feature embeddings. These embeddings are normalized to unit vectors to ensure a stable similarity measure, which is computed as the dot product between image and text features. The resulting similarity matrix reflects how well each image aligns with each text, guiding the model’s learning.

The cross-entropy loss is computed on the similarity matrix, where the labels correspond to the correct image-text pairs (the diagonal of the matrix). This contrastive loss is computed in both directions (image-to-text and text-to-image), so that both modalities are equally optimized to match each other. The two losses are averaged, and gradients are backpropagated through the model before the optimizer step. Gradient clipping is applied to prevent gradients from becoming too large, ensuring stable training.

The accuracy is calculated by comparing the predicted alignment (i.e., the most similar text for each image) with the true labels. For each image in the batch, the function takes the index of the maximum similarity in its row of the similarity matrix, which corresponds to the predicted match. If that index matches the label, the prediction is counted as correct. The correct predictions are accumulated, and the running accuracy is obtained by dividing them by the total number of samples processed.

The `validate` function performs similar steps, but without updating the model’s weights. It evaluates the model on a validation set by calculating the same similarity metric, contrastive loss, and accuracy as in training, providing an assessment of how well the model generalizes to unseen data.
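
The loss and accuracy described above can be worked through by hand on a tiny similarity matrix. The following is a dependency-free sketch, not the notebook's implementation: the 3x3 matrix holds made-up cosine similarities, and `cross_entropy` and `transpose` are small helpers written only for this example.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy over rows; labels[i] is the target column of row i."""
    total = 0.0
    for row, target in zip(logits, labels):
        log_sum_exp = math.log(sum(math.exp(v) for v in row))
        total += log_sum_exp - row[target]
    return total / len(logits)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Toy cosine similarities: row i = image i, column j = sentence j
similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.4],  # image 2 is most similar to sentence 0: a wrong match
]
labels = [0, 1, 2]  # the matching pairs lie on the diagonal

loss_i2t = cross_entropy(similarity, labels)             # image-to-text
loss_t2i = cross_entropy(transpose(similarity), labels)  # text-to-image
loss = (loss_i2t + loss_t2i) / 2

predictions = [row.index(max(row)) for row in similarity]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(f"loss={loss:.4f}, accuracy={accuracy:.2f}")
```

Two of the three images are matched correctly here. Because the similarities are plain cosine values in [-1, 1] with no temperature scaling, the loss has a floor: with 64 candidates per batch, even a perfect match (1 on the diagonal, -1 elsewhere) gives `log(1 + 63 * exp(-2))`, roughly 2.25. The original CLIP training multiplies the similarities by a learned temperature before the cross-entropy, which the functions below do not do.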



In [15]:
def train_epoch(model, train_loader, optimizer, device, epoch):
    model.train()
    total_loss = 0
    correct_predictions = 0
    total_samples = 0

    pbar = tqdm(train_loader, desc=f'Epoch {epoch}')
    for batch in pbar:
        images = batch['image'].to(device)
        texts = clip.tokenize(batch['sentence']).to(device)

        optimizer.zero_grad(set_to_none=True)

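        # Forward pass: the model must return (image_features, text_features),
        # as CLIPGrounding does; the stock CLIP forward returns logits instead
        image_features, text_features = model(images, texts)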

        # Normalize features
        image_features = F.normalize(image_features, dim=1)
        text_features = F.normalize(text_features, dim=1)

        # Similarity matrix
        similarity = (image_features @ text_features.t())

        # Labels for contrastive learning
        labels = torch.arange(len(images)).to(device)

        # Calculate loss
        loss_i2t = F.cross_entropy(similarity, labels)
        loss_t2i = F.cross_entropy(similarity.t(), labels)
        loss = (loss_i2t + loss_t2i) / 2

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        # Calculate accuracy
        predictions = similarity.argmax(dim=1)
        correct_predictions += (predictions == labels).sum().item()
        total_samples += len(images)

        # Update progress bar
        total_loss += loss.item()
        pbar.set_postfix({
            'loss': f'{loss.item():.4f}',
            'acc': f'{100 * correct_predictions / total_samples:.2f}%'
        })

    return total_loss / len(train_loader), correct_predictions / total_samples



def validate(model, val_loader, device):
    model.eval()
    total_loss = 0
    correct_predictions = 0
    total_samples = 0

    with torch.no_grad():
        for batch in tqdm(val_loader, desc='Validating'):
            images = batch['image'].to(device)
            texts = clip.tokenize(batch['sentence']).to(device)

            # Forward pass
            image_features, text_features = model(images, texts)

            # Normalize features
            image_features = F.normalize(image_features, dim=1)
            text_features = F.normalize(text_features, dim=1)

            # Similarity matrix
            similarity = image_features @ text_features.t()

            # Labels for contrastive learning
            labels = torch.arange(len(images)).to(device)

            # Calculate loss
            loss_i2t = F.cross_entropy(similarity, labels)
            loss_t2i = F.cross_entropy(similarity.t(), labels)
            loss = (loss_i2t + loss_t2i) / 2

            # Calculate accuracy
            predictions = similarity.argmax(dim=1)
            correct_predictions += (predictions == labels).sum().item()
            total_samples += len(images)
            total_loss += loss.item()

    # Return average loss and accuracy for validation
    return total_loss / len(val_loader), correct_predictions / total_samples
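
One detail worth checking: both functions unpack the model output as `(image_features, text_features)`, which is what `CLIPGrounding.forward` returns. As far as the openai/CLIP source shows, the bare CLIP model's forward call returns `(logits_per_image, logits_per_text)` instead, so passing it directly to these functions would normalize logits rather than embeddings. A possible way to handle both cases is sketched below; `extract_features` is a hypothetical helper, and the two `Fake...` classes are stand-ins used only to exercise it without loading CLIP.

```python
def extract_features(model, images, texts):
    """Return (image_features, text_features) for either kind of model.

    Models exposing encode_image/encode_text (such as the bare CLIP model)
    are queried through those methods; any other model is assumed to
    return the two feature tensors directly from its forward call.
    """
    if hasattr(model, "encode_image") and hasattr(model, "encode_text"):
        return model.encode_image(images), model.encode_text(texts)
    return model(images, texts)


class FakeClip:
    """Stand-in mimicking the bare CLIP interface."""

    def encode_image(self, images):
        return [("img_feat", i) for i in images]

    def encode_text(self, texts):
        return [("txt_feat", t) for t in texts]

    def __call__(self, images, texts):
        return "logits_per_image", "logits_per_text"


class FakeGrounding:
    """Stand-in mimicking a model whose forward returns features."""

    def __call__(self, images, texts):
        return "image_features", "text_features"


print(extract_features(FakeClip(), [0], ["a dog"]))
print(extract_features(FakeGrounding(), [0], ["a dog"]))
```

With such a helper, the line `image_features, text_features = model(images, texts)` could become `extract_features(model, images, texts)` in both functions, leaving the rest of the loss computation unchanged.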


## Print results function

This function saves the fine-tuning learning curves as an image in a folder dedicated to either the default model (bare CLIP) or the custom CLIP model.

In [16]:
def plot_training_curves(num_epochs, training_losses, validation_losses,
                         training_accuracies, validation_accuracies,
                         lr, output_folder='finetuning2'):

    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Plot training curves; use the number of recorded epochs so the plot
    # still works when early stopping ends training before num_epochs
    epochs = range(1, len(training_losses) + 1)
    plt.figure(figsize=(15, 5))

    # Plot Losses
    plt.subplot(1, 2, 1)
    plt.plot(epochs, training_losses, label='Training Loss', marker='o')
    plt.plot(epochs, validation_losses, label='Validation Loss', marker='o', linestyle='--')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss')
    plt.legend()
    plt.grid(True)

    # Plot Accuracies
    plt.subplot(1, 2, 2)
    plt.plot(epochs, training_accuracies, label='Training Accuracy', marker='o')
    plt.plot(epochs, validation_accuracies, label='Validation Accuracy', marker='o', linestyle='--')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title('Training and Validation Accuracy')
    plt.legend()
    plt.grid(True)

    # Adjust layout
    plt.tight_layout()

    # Save the plot
    image_name = f'{output_folder}/training_{num_epochs}_{lr}.png'
    plt.savefig(image_name)
    plt.close()

    print(f"Plot saved as {image_name}")

## 1. Default CLIP model

### Train and validation loop
The training loop runs the training function over the train dataloader and then runs the validation function to check the performance of the model and detect possible overfitting.\
In this process various learning rates were tried, since overfitting occurs easily and finding a learning rate that fine-tunes CLIP while preserving its pre-training information is not easy.\
The learning rate that seems to perform best is `1e-7`.\
As can be seen from the code, different optimizers with different settings and learning rates were tested, but overfitting turned out to be very common and difficult to avoid. Therefore, to limit it, different techniques were applied (on the dataset, on the loop, and on the train function):
- gradient clipping
- learning rate scheduling with `CosineAnnealingWarmRestarts`, which lowers the lr after each epoch and resets it at the start of each new cycle
- early stopping (keeping a checkpoint of the best model)
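
The effect of the scheduler can be previewed without training. The training cell below configures `CosineAnnealingWarmRestarts` with `T_0=5` and `T_mult=2` and steps it once per epoch; the sketch reproduces the cosine formula with warm restarts in plain Python. The function `warm_restart_lr` is a hypothetical helper, and `eta_min=0` is assumed because the cell does not set a minimum.

```python
import math


def warm_restart_lr(step, base_lr, t_0, t_mult, eta_min=0.0):
    """Learning rate after `step` scheduler steps, cosine annealing with warm restarts."""
    t_i, t_cur = t_0, step
    # Skip the completed cycles; each new cycle is t_mult times longer
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i)) / 2


# Values taken from the training cell: lr=1e-7, T_0=5, T_mult=2, 10 epochs.
# Epoch n trains with the rate reached after n-1 scheduler steps.
schedule = [warm_restart_lr(epoch - 1, base_lr=1e-7, t_0=5, t_mult=2)
            for epoch in range(1, 11)]
for epoch, rate in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {rate:.2e}")
```

Under these assumptions the rate decays over epochs 1 to 5 and jumps back to `1e-7` at epoch 6, where a second, longer cycle starts. That restart may be worth keeping in mind when reading the loss curves around epoch 5.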

Given the size of the dataset and the depth of the clip model, the number of epochs is set to `10`.

As a reminder, the notebook was executed on an `ml.g4dn.xlarge` AWS machine (the most powerful allowed, as reported in the course's slides); each training epoch took about 22 minutes.

To try to achieve better results, another model has been implemented below. 


In [None]:
# Learning rate and optimizer
lr=1e-7
# # optimizer = Adam(clip_model.parameters(), lr=1e-3) # Overfit
# # optimizer = Adam(clip_model.parameters(), lr=5e-5, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.02) # Overfit
# # optimizer = Adam(clip_model.parameters(), lr=5e-6, betas=(0.9, 0.98), weight_decay=0.02) # Overfit
# # optimizer = Adam(clip_model.parameters(), lr=1e-6, betas=(0.9, 0.98), weight_decay=0.02) # Overfit
# optimizer = Adam(clip_model.parameters(), lr=lr, betas=(0.9, 0.98), weight_decay=0.1)

# Training loop
training_losses = []
validation_losses = []
training_accuracies = []
validation_accuracies = []

# Values to get the best model
best_val_loss = float('inf')
patience = 3
patience_counter = 0


# Initialize optimizer
optimizer = Adam(
    clip_model.parameters(),
    lr=lr,
    betas=(0.9, 0.98),
    weight_decay=0.1
)

# Scheduler for the learning rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=5,
    T_mult=2
)

# Number of epochs for training
num_epochs = 10

# Best model's name
best_model_name = f'best_clip_{lr}.pth'

# Training loop
for epoch in range(1, num_epochs + 1):
    print(f"\nEpoch {epoch}/{num_epochs}")

    # Train and Validate for one epoch
    train_loss, train_accuracy = train_epoch(
        clip_model,
        finetune_train_loader,
        optimizer,
        device,
        epoch
    )

    val_loss, val_accuracy = validate(
        clip_model,
        finetune_val_loader,
        device
    )

    # Step the learning rate scheduler
    scheduler.step()

    # Store losses for plotting
    training_losses.append(train_loss)
    training_accuracies.append(train_accuracy)
    validation_losses.append(val_loss)
    validation_accuracies.append(val_accuracy)

    # Model checkpoint and early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0

        # Save best model
        torch.save({
            'epoch': epoch,
            'model_state_dict': clip_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_loss': val_loss,
            'val_accuracy': val_accuracy
        }, best_model_name)
    else:
        patience_counter += 1

    # Print epoch metrics
    print(f'Train Loss: {train_loss:.4f} | Train Acc: {train_accuracy:.4f}')
    print(f'Val Loss: {val_loss:.4f} | Val Acc: {val_accuracy:.4f}')
    print(f'Current LR: {optimizer.param_groups[0]["lr"]:.2e}')

    # Early stopping check
    if patience_counter >= patience:
        print(f'Early stopping at epoch {epoch}')
        break


# Save learning curve in an image
plot_training_curves(num_epochs,
                     training_losses,
                     validation_losses,
                     training_accuracies,
                     validation_accuracies,
                     lr,
                     output_folder='finetuning1')
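
The patience rule used in the loop above can be traced on a list of made-up validation losses. This is a minimal sketch that mirrors the counter logic of the loop; `early_stopping_epoch` and `toy_val_losses` are hypothetical names introduced for the illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Epochs are 1-based; stop_epoch is None when training runs to the end.
    """
    best_loss = float("inf")
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: this is where the loop saves a checkpoint
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
        if counter >= patience:
            return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses: the best value is reached at epoch 3
toy_val_losses = [4.10, 4.05, 4.01, 4.02, 4.03, 4.04, 4.00]
print(early_stopping_epoch(toy_val_losses, patience=3))  # (6, 3)
```

In this example training stops at epoch 6 while the best weights date from epoch 3. When that happens, the model held in memory is the one from the last epoch, so evaluating the best model requires reloading the `model_state_dict` entry of the checkpoint saved under `best_model_name`.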

### Test

In [None]:
test_loss, test_accuracy = validate(clip_model, finetune_test_loader, device)

print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

### Results
<div style="text-align: center;">
    <h3>Comparison of Blurred and Cropped Images</h3>
</div>
<div style="display: flex; justify-content: space-between;">
    <img src="attachment:35d7dc98-e2aa-4a50-87e6-e61e650d88ce.png" alt="Blurred Images" style="width: 48%;"/>
    <img src="attachment:401f2182-051b-4a17-83e8-f13407c91f55.png" alt="Cropped Images" style="width: 48%;"/>
</div>

The results do not show very promising curves. The two plots on the left refer to the fine-tuning with blurred borders around the bounding boxes. Here the loss does not follow the typically expected progression: it decreases slowly and stays almost constant up until epoch 5, then decreases faster afterwards. The fine-tuning with cropped images (the two plots on the right) shows a better curve, but training still slows down around epoch 5 before the loss starts decreasing again.\
From what I can see, these plots suggest that more epochs would benefit the fine-tuning process, but since the training time is already very long and I would not expect a large increase in performance, I decided to stop here and try another approach.


## 2. Custom CLIP model
After trying different learning rates to avoid overfitting while still improving the loss on the validation set, I decided to change approach.\
I decided to add a projection layer at the end of CLIP. The idea is to define a new model that performs the image and text encoding, then applies a linear layer at the end, and to train the network again, allowing the last layer to be fully trained on the CLIP image and text features.

The idea is to optimize the fine-tuning, also applying normalization to stabilize the learning process, as it does not introduce much computational overhead.

The final goal is of course to obtain better results than the bare clip fine-tuning above.

### Model

In [5]:
class CLIPGrounding(nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model
        self.clip_dim = self.clip_model.visual.output_dim

        # Add a projection layer
        self.projection = nn.Linear(self.clip_dim, self.clip_dim)

    def forward(self, images, texts):
        # Encode images and text with CLIP

        # with torch.no_grad(): # Uncomment (and indent the two lines below) to freeze CLIP
        image_features = self.clip_model.encode_image(images)
        text_features = self.clip_model.encode_text(texts)

        # Simple projection
        image_features = self.projection(image_features)
        text_features = self.projection(text_features)

        # Normalize features (add epsilon to avoid NaN)
        image_features = F.normalize(image_features, dim=-1, eps=1e-6)
        text_features = F.normalize(text_features, dim=-1, eps=1e-6)

        return image_features, text_features

### Train and validation loop

The training loop fine-tunes a CLIP-based model for visual grounding, using separate learning rates for the CLIP model and its projection layer to prevent overfitting. The model is initialized using the `CLIPGrounding` class and moved to the correct device. The learning rate for the CLIP model is set to a small value of `1e-7` to allow for fine-tuning (and avoid losing the pre-trained weights), while the projection layer has a higher learning rate of `1e-4` to focus on enhancing its training.

The optimizer used is `AdamW` (similar to Adam but with decoupled weight decay), which is configured with weight decay and epsilon values for both the CLIP model and the projection layer to ensure stable training. To adjust the learning rate dynamically during training, the loop uses a `CosineAnnealingLR` scheduler.

Each epoch consists of training and validation. During the training phase, the `train_epoch()` function computes the training loss and accuracy. Afterward, the model is validated using the `validate()` function to calculate the validation loss and accuracy. These metrics are tracked throughout the training process, with training and validation losses and accuracies stored in separate lists.

At the end of each epoch, the learning rate scheduler updates the learning rate based on the cosine annealing strategy. The model is then saved if it achieves the best validation accuracy seen so far, ensuring the best model is kept. After training is complete, the results are visualized through training curves, and the model’s state is saved for future use, enabling continued evaluation or deployment.


In [None]:
# Learning rates from smallest to highest
# lrs = [1e-6, 1e-5, 1e-4]


# for lr in lrs:

# Create model
ft_clip_model = CLIPGrounding(clip_model).to(device)

# Print lr iteration information
# print(f'Learning rate = {lr}:')

# number of epochs for each learning rate
num_epochs = 10
lr_clip = 1e-7
lr_proj = 1e-4

# Best model's name
best_model_name = f'best_CLIPGrounding_{num_epochs}_{lr_clip}_{lr_proj}.pth'

# Initialize different lr for clip to avoid overfitting
# while enhancing train on projection layer.
optimizer = torch.optim.AdamW(
    params=[
        {
            "params": ft_clip_model.clip_model.parameters(),
            "lr": lr_clip,
            "weight_decay": 0.01,
            "eps": 1e-8
        },
        {
            "params": ft_clip_model.projection.parameters(),
            "lr": lr_proj,
            "weight_decay": 0.01,
            "eps": 1e-8
        }
    ]
)

# optimizer = torch.optim.AdamW(ft_clip_model.parameters(), lr=1e-4, weight_decay=0.01, eps=1e-8)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_epochs
)

# Lists to store metrics
training_losses = []
validation_losses = []
training_accuracies = []
validation_accuracies = []

# Best validation accuracy seen so far, used to keep the best model
best_val_acc = 0.0

# Train/validation loop
for epoch in range(1, num_epochs+1):
    # Train epoch
    train_loss, train_acc = train_epoch(
        ft_clip_model, finetune_train_loader, optimizer, device, epoch
    )

    # Validate
    val_loss, val_acc = validate(ft_clip_model, finetune_val_loader, device)

    # Store metrics
    training_losses.append(train_loss)
    validation_losses.append(val_loss)
    training_accuracies.append(train_acc)
    validation_accuracies.append(val_acc)

    # Update learning rate
    scheduler.step()

    # Print metrics
    print(f'Epoch {epoch}/{num_epochs}:')
    print(f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc*100:.2f}%')
    print(f'Val Loss: {val_loss:.4f}, Val Acc: {val_acc*100:.2f}%')

    # Save the model when it reaches the best validation accuracy so far
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save({
            'epoch': epoch,
            'ft_clip_model_state_dict': ft_clip_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_acc': val_acc,
        }, best_model_name)

# Save learning curve in an image
plot_training_curves(num_epochs,
                     training_losses,
                     validation_losses,
                     training_accuracies,
                     validation_accuracies,
                     lr_clip,
                     output_folder='finetuning2')

### Test

In [None]:
# Test
test_loss, test_acc = validate(ft_clip_model, finetune_test_loader, device)

print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

### Results
<div style="text-align: center;">
    <h3>Results with Blurred Borders vs Cropped Images</h3>
</div>
<div style="display: flex; justify-content: space-between;">
    <img src="attachment:ca3f94c7-c21c-4667-805a-270ada44b98c.png" alt="Results with Blurred Borders" style="width: 48%;"/>
    <img src="attachment:31bdeecd-a2fd-458c-b1c8-c0e8cbd66991.png" alt="Results with Cropped Images" style="width: 48%;"/>
</div>


The results have effectively improved with this second model, with the training on cropped images still performing better. The curves now appear more consistent with expectations, as the loss decreases exponentially while the validation accuracy increases consistently.\
The accuracy of the second model increases very slowly, reaching its peak at epoch 10. This might suggest trying a higher learning rate, which I did, but it ended up overfitting again without any real performance improvement.\
Again, increasing the number of epochs would likely result in a long training process without any significant improvement, as the losses stop decreasing significantly already after 4 or 5 epochs.\
For this reason, I decided to stop here with the fine-tuning phase and move on to the models, which are the main focus of the project. I expected better results, but I decided to use this fine-tuned model in the custom visual grounding model implemented below, with the goal of seeing the actual difference in performance. I am aware that this may cause a noticeable reduction in the expressive power of the model, but so far I have not been able to improve the fine-tuning.

# Models 

## Dataloaders

Since the goal of this model is to predict the bounding box of an object in a picture from a textual description, new Dataset instances are created without any crop of the images, applying only the CLIP preprocessing.

In [8]:
# Train, validation, and test datasets without cropping the images, applying only the CLIP preprocessing
train_dataset = RefCOCOgDataset(transform=preprocess, split='train', crop_borders='none')
val_dataset = RefCOCOgDataset(transform=preprocess, split='val', crop_borders='none')
test_dataset = RefCOCOgDataset(transform=preprocess, split='test', crop_borders='none')

# DataLoaders batch size and other options. Computation is done with 4 workers to speed it up
batch_size = 64
shuffle = True
pin_memory = True
num_workers = 4
persistent_workers = True

# DataLoader, shuffled in case of training set and not shuffled in case of test and validation sets
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=pin_memory,
    persistent_workers=persistent_workers
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    pin_memory=pin_memory,
    persistent_workers=persistent_workers
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    pin_memory=pin_memory,
    persistent_workers=persistent_workers
)

# Print dataset sizes
print("=======================================================")
print(f"Number of training samples: {len(train_dataset)}")
print(f"Number of validation samples: {len(val_dataset)}")
print(f"Number of test samples: {len(test_dataset)}")
print("=======================================================")

Number of training samples: 80512
Number of validation samples: 4896
Number of test samples: 9602


## Loss function

While thinking about what could be a suitable function to minimize in order to train the network, I decided to look into already existing models. For this reason I started looking into YOLO error functions, finding out that its most recent version uses particular implementations of `IoU` (Intersection over Union).

IoU computes the area of the intersection of the two boxes and the area of their union, and then divides the former by the latter, returning a value that is always between 0 and 1. It is a good evaluation metric because it gives a numerical idea of how close the predicted and ground truth boxes are, and in the form `1-IoU` it can be used in a minimization problem. However, it has some important drawbacks:
- when the boxes are close, the derivative of the function becomes small, making it harder for the optimizer to understand when the model is performing well
- if the boxes are small and the predicted box is slightly smaller than the ground truth, IoU might drop, leading to large gradients and unstable training
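
As a small worked example of the definition above, the sketch below computes IoU and `1-IoU` for a pair of boxes in plain Python. The helper name `iou_xyxy` and the `(x1, y1, x2, y2)` tuple format are assumptions made for illustration; the notebook's own implementation works on batched tensors.

```python
def iou_xyxy(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples (illustrative helper)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get an empty intersection
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


pred = (0, 0, 2, 2)
target = (1, 1, 3, 3)
iou = iou_xyxy(pred, target)  # intersection 1, union 7
loss = 1 - iou                # the quantity that would be minimized
print(f"IoU = {iou:.4f}, 1 - IoU = {loss:.4f}")
```

Two 2x2 boxes shifted by one unit along both axes overlap on a single unit square, so the IoU is 1/7 even though the prediction is only slightly off.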

So, while looking through the various versions of IoU (Generalized IoU, Distance IoU, Complete IoU), I found out that a good fit as error measure could be the Efficient IoU called `EIoU` [[reference](https://medium.com/@cshyo1004/iou-and-variants-overview-a328acf177cd)], an implementation of IoU that breaks down the error into 3 main components:

- `IoU` -> measures the overlap between the predicted and ground truth boxes
- `Distance Penalty` -> penalizes the normalized squared Euclidean distance between the centroids of the predicted and ground truth boxes. This is calculated as the ratio of the squared distance between centers (c²) to the squared diagonal length of the smallest enclosing box (d²)
- `Aspect Ratio Penalty` -> penalizes mismatches in the width-to-height ratios of the bounding boxes, calculated from the squared difference between the arctangents of the two width-to-height ratios

EIoU helps achieve better regression by considering both the overlap and geometric factors of the bounding boxes. The distance term helps improve localization accuracy, while the aspect ratio term helps maintain proper box shapes. The normalization of the distance penalty by the enclosing box diagonal makes the metric scale-invariant, helping with both large and small objects. By combining these terms, EIoU provides more informative gradients for optimization compared to standard IoU, especially when boxes are close to overlapping or when dealing with small objects.
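
Written out, the loss computed by the `compute_eiou` and `EIoULoss` code in this section amounts to the expression below. This is a transcription of that code rather than a formula taken from the reference; the symbols $w_p, h_p$ (predicted width and height) and $w_t, h_t$ (target width and height) are notation introduced here, while $c^2$ and $d^2$ are the quantities defined in the list above.

$$
\mathcal{L}_{\text{EIoU}} = 1 - \text{IoU} + \frac{c^2}{d^2} + \frac{1}{4}\left(\arctan\frac{w_p}{h_p} - \arctan\frac{w_t}{h_t}\right)^2
$$

When the two boxes coincide, IoU is 1 and both penalties vanish, so the loss is 0.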

Here a class has been defined to implement this loss function that will be used to train the models. This class leverages functions to calculate IoU (which will be also used as evaluation metric) and EIoU itself.

In [9]:
def compute_iou(boxes1, boxes2):

    # Calculate intersection coordinates
    x1 = torch.max(boxes1[:, 0], boxes2[:, 0])
    y1 = torch.max(boxes1[:, 1], boxes2[:, 1])
    x2 = torch.min(boxes1[:, 2], boxes2[:, 2])
    y2 = torch.min(boxes1[:, 3], boxes2[:, 3])

    # Calculate intersection area
    intersection = torch.clamp(x2 - x1, min=0) * torch.clamp(y2 - y1, min=0)

    # Calculate union area
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    union = area1 + area2 - intersection

    # Calculate IoU
    iou = intersection / (union + 1e-6)  # Add small epsilon to avoid division by zero
    return iou

def compute_eiou(boxes1, boxes2):

    iou = compute_iou(boxes1, boxes2)

    # Calculate centers for both sets of boxes
    boxes1_cx = (boxes1[:, 0] + boxes1[:, 2]) / 2
    boxes1_cy = (boxes1[:, 1] + boxes1[:, 3]) / 2
    boxes2_cx = (boxes2[:, 0] + boxes2[:, 2]) / 2
    boxes2_cy = (boxes2[:, 1] + boxes2[:, 3]) / 2

    # Calculate widths and heights
    boxes1_w = boxes1[:, 2] - boxes1[:, 0]
    boxes1_h = boxes1[:, 3] - boxes1[:, 1]
    boxes2_w = boxes2[:, 2] - boxes2[:, 0]
    boxes2_h = boxes2[:, 3] - boxes2[:, 1]

    # Central point distance term
    c_2 = torch.pow(boxes2_cx - boxes1_cx, 2) + torch.pow(boxes2_cy - boxes1_cy, 2)

    # Diagonal length of the smallest enclosing box
    d_2 = torch.pow(torch.max(boxes1[:, 2], boxes2[:, 2]) - torch.min(boxes1[:, 0], boxes2[:, 0]), 2) + \
          torch.pow(torch.max(boxes1[:, 3], boxes2[:, 3]) - torch.min(boxes1[:, 1], boxes2[:, 1]), 2)

    # Distance penalty
    distance_penalty = c_2 / (d_2 + 1e-7)

    # Aspect ratio penalty
    boxes1_aspect = torch.atan2(boxes1_w, boxes1_h)
    boxes2_aspect = torch.atan2(boxes2_w, boxes2_h)
    aspect_penalty = torch.pow(boxes1_aspect - boxes2_aspect, 2) / 4

    # Final EIoU
    eiou = iou - distance_penalty - aspect_penalty

    return eiou

class EIoULoss(nn.Module):
    def __init__(self, reduction='mean'):
        super().__init__()
        self.reduction = reduction

    def forward(self, pred_boxes, target_boxes, similarity_scores=None):
        eiou_loss = 1 - compute_eiou(pred_boxes, target_boxes)

        # Add similarity score component if provided
        if similarity_scores is not None:
            similarity_loss = 1 - similarity_scores
            combined_loss = eiou_loss + 0.5 * similarity_loss
        else:
            combined_loss = eiou_loss

        if self.reduction == 'mean':
            return combined_loss.mean()
        elif self.reduction == 'sum':
            return combined_loss.sum()
        else:
            return combined_loss
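
As a hand check of the three terms, the sketch below recomputes the same quantities for a single pair of boxes in plain Python. The helper names `eiou_terms` and `eiou_loss` are hypothetical, and the small epsilons used in the tensor version are omitted, so the values can differ marginally from the output of `EIoULoss`. It assumes both boxes have positive area.

```python
import math


def eiou_terms(pred, target):
    """Return (iou, distance_penalty, aspect_penalty) for two (x1, y1, x2, y2) boxes."""
    # Overlap term
    iw = max(min(pred[2], target[2]) - max(pred[0], target[0]), 0)
    ih = max(min(pred[3], target[3]) - max(pred[1], target[1]), 0)
    inter = iw * ih
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    tw, th = target[2] - target[0], target[3] - target[1]
    iou = inter / (pw * ph + tw * th - inter)

    # Squared centre distance over squared diagonal of the enclosing box
    c2 = ((pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2) ** 2 \
        + ((pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2) ** 2
    d2 = (max(pred[2], target[2]) - min(pred[0], target[0])) ** 2 \
        + (max(pred[3], target[3]) - min(pred[1], target[1])) ** 2
    distance_penalty = c2 / d2

    # Squared difference of the aspect-ratio angles, scaled by 1/4
    aspect_penalty = (math.atan2(pw, ph) - math.atan2(tw, th)) ** 2 / 4
    return iou, distance_penalty, aspect_penalty


def eiou_loss(pred, target):
    iou, dist, aspect = eiou_terms(pred, target)
    return 1 - (iou - dist - aspect)


shifted = eiou_loss((0, 0, 2, 2), (1, 1, 3, 3))  # IoU = 1/7, distance = 1/9, aspect = 0
perfect = eiou_loss((0, 0, 2, 2), (0, 0, 2, 2))  # identical boxes: every penalty vanishes
print(f"shifted: {shifted:.4f}, perfect: {perfect:.4f}")
```

For the shifted pair the distance penalty adds 1/9 on top of `1-IoU`, which shows how the extra term keeps pushing the centres together even when the overlap is already non-zero.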

## Metric functions 
The first function, `calculate_metrics`, is designed to measure the performance of both models and represents the common ground for evaluation. It compares the predicted bounding boxes (`pred_boxes`) against the target bounding boxes (`target_boxes`). The function calculates the Intersection over Union (IoU) for all predictions using the `compute_iou` function, which measures the overlap between predicted and target boxes. From this, it derives the **localization accuracy**, defined as the proportion of predictions with an IoU greater than 0.5, indicating how well the model localizes objects. Additionally, it computes the **grounding accuracy**, which is the recall of predictions with IoU > 0.5 relative to the total number of target boxes; since there is exactly one prediction per target box here, this value coincides with the localization accuracy. If feature vectors (`pred_features` and `target_features`) are provided, the function also calculates the **semantic similarity** using cosine similarity, which measures how closely the predicted and target features align in the feature space. Finally, it computes the **mean IoU**, which is the average IoU across all predictions, giving an overall measure of localization performance. All these metrics are returned as a dictionary, making it easy to track and compare model performance.

The second function, `display_metrics`, takes the metrics dictionary produced by `calculate_metrics` and prints them in a structured and readable format.
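
To make the two IoU-based metrics concrete, here is a minimal plain-Python sketch on a handful of made-up IoU values. The function name `summarize_ious` and the toy numbers are assumptions for illustration only; they are not results from the notebook.

```python
def summarize_ious(ious, threshold=0.5):
    """Localization accuracy and mean IoU from a list of per-sample IoU values."""
    n = len(ious)
    # A prediction counts as correct when its IoU exceeds the threshold
    hits = sum(1 for value in ious if value > threshold)
    return {
        "localization_accuracy": hits / n if n else 0.0,
        "miou": sum(ious) / n if n else 0.0,
    }


toy_ious = [0.8, 0.3, 0.55, 0.0]  # made-up values, one per sample
summary = summarize_ious(toy_ious)
print(f"Localization Accuracy: {summary['localization_accuracy']:.2%}")
print(f"Mean IoU: {summary['miou']:.2%}")
```

Two of the four toy samples exceed the 0.5 threshold, so the accuracy is 50% while the mean IoU is lower, because the misses pull the average down.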

In [10]:
def calculate_metrics(pred_boxes, target_boxes, pred_features=None, target_features=None):

    metrics = {}

    # Move tensors to CPU
    pred_boxes = pred_boxes.cpu()
    target_boxes = target_boxes.cpu()

    # Calculate IoU for all predictions
    ious = compute_iou(pred_boxes, target_boxes)

    # Localization Accuracy (IoU > 0.5)
    localization_accuracy = torch.mean((ious > 0.5).float()).item()
    metrics['localization_accuracy'] = localization_accuracy

    # Grounding Accuracy (recall with IoU > 0.5)
    true_positives = torch.sum(ious > 0.5).item()
    total_targets = len(target_boxes)
    grounding_accuracy = true_positives / total_targets if total_targets > 0 else 0
    metrics['grounding_accuracy'] = grounding_accuracy

    # Semantic Similarity (if features are provided)
    if pred_features is not None and target_features is not None:
        pred_features = pred_features.cpu()
        target_features = target_features.cpu()
        semantic_similarity = F.cosine_similarity(pred_features, target_features).mean().item()
        metrics['semantic_similarity'] = semantic_similarity

    # Calculate mean IoU
    metrics['miou'] = ious.mean().item()

    # Return all the metrics
    return metrics

def display_metrics(metrics, epoch=None, split="test"):

    print("\n" + "="*50)
    if epoch is not None:
        print(f"Epoch {epoch} - {split.capitalize()} Metrics:")
    else:
        print(f"{split.capitalize()} Metrics:")
    print("="*50)

    if "loss" in metrics:
        print(f"Loss: {metrics['loss']:.4f}")
    if "localization_accuracy" in metrics:
        print(f"Localization Accuracy: {metrics['localization_accuracy']:.2%}")
    if "grounding_accuracy" in metrics:
        print(f"Grounding Accuracy: {metrics['grounding_accuracy']:.2%}")
    if "semantic_similarity" in metrics:
        print(f"Semantic Similarity: {metrics['semantic_similarity']:.4f}")
    if "miou" in metrics:
        print(f"Mean IoU: {metrics['miou']:.2%}")

    print("="*50)

## 1. Base Model

### Model definition

The `BaseModel` is defined as a starting point to study the visual grounding task and become familiar with it. The approach described in the project statement is to combine YOLO with CLIP, taking the bounding boxes computed by the former and finding the one most similar to the description using the latter.

This model has been created using `YOLO V5`, and keeps objects recognized with a confidence higher than `0.4`. Images for which no bounding box is found receive an all-zero box and a similarity score of 0, and they are tracked in a variable that records their paths and count. However, it has been observed that images with no bounding boxes are very few.

Once YOLO has produced the bounding boxes, they are used to get the cropped images, which are preprocessed using `process_crops_for_clip`, which applies the preprocessing required by CLIP. Once this is done, the embeddings for the cropped images and the text description are computed with CLIP and compared with cosine similarity to get the bounding box that best matches the description. As mentioned, the forward pass also returns an object keeping track of the paths and count of images with no recognized objects (generally <10).
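
The selection step described above can be sketched in isolation as follows. This is a plain-Python illustration with made-up three-dimensional vectors standing in for CLIP embeddings; the names `cosine_similarity` and `pick_best_box` are hypothetical helpers, not part of the notebook or of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_best_box(text_embedding, crop_embeddings, boxes):
    """Return the box whose crop embedding is most similar to the text embedding."""
    scores = [cosine_similarity(e, text_embedding) for e in crop_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return boxes[best], scores[best]


# Toy stand-ins: one text embedding and three crop embeddings with their boxes
text_embedding = [1.0, 0.0, 1.0]
crop_embeddings = [[0.0, 1.0, 0.0], [2.0, 0.1, 1.9], [1.0, 1.0, 0.0]]
boxes = [(10, 10, 50, 50), (60, 20, 120, 90), (0, 0, 30, 30)]

best_box, best_score = pick_best_box(text_embedding, crop_embeddings, boxes)
print(best_box, round(best_score, 3))
```

Because cosine similarity ignores vector length, the second crop wins even though its embedding has a larger norm than the text embedding: only the direction matters.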

In [11]:
class BaseModel(nn.Module):
    def __init__(self, yolo_model, clip_model, confidence_threshold=0.4, transform=None, device="cuda"):
        super().__init__()

        # Initialize class' variables
        self.device = device
        self.confidence_threshold = confidence_threshold
        self.transform = transform

        # Initialize class' models
        self.yolo_model = yolo_model
        self.clip_model = clip_model
        self.yolo_model.conf = confidence_threshold
        self.yolo_model.to(device)
        self.clip_model.to(device)

    def forward(self, images_paths, descriptions):

        # Lists to store results for the batch
        batch_best_boxes = []
        batch_best_scores = []
        images_producing_no_bbox = {
            "paths": [],
            "count": 0
        }
        # Process each image in the batch
        for idx in range(len(images_paths)):
            image_path = images_paths[idx]
            description = descriptions[idx]

            # Get all objects with YOLO inference
            with torch.no_grad():

                # Get yolo objects
                yolo_results = self.yolo_model(image_path)
                crops = yolo_results.crop(save=False)

                # Extract bounding boxes
                boxes = [torch.tensor(crop['box'], device=self.device) for crop in crops]

                # Prepare crops for CLIP
                crops = process_crops_for_clip(crops, transform=self.transform, device=self.device)


            # Convert list to batch tensor
            if crops is not None and len(crops) > 0:
                # Encode text description
                with torch.no_grad():
                    description_tokens = clip.tokenize(description).to(self.device)
                    description_embedding = self.clip_model.encode_text(description_tokens).float()
                    object_embeddings = self.clip_model.encode_image(crops).float()

                # Calculate similarities for all objects at once
                similarities = torch.cosine_similarity(
                    object_embeddings,
                    description_embedding.repeat(len(object_embeddings), 1),
                    dim=1
                )

                # Find best match
                best_match_index = similarities.argmax()
                best_match_box = boxes[best_match_index]
                best_match_score = similarities[best_match_index]
            else:
                # Handle the case where YOLO found no objects or no crop could be processed
                best_match_box = torch.zeros(4, device=self.device)
                best_match_score = torch.tensor(0.0, device=self.device)
                images_producing_no_bbox["paths"].append(image_path)
                images_producing_no_bbox["count"] += 1

            batch_best_boxes.append(best_match_box)
            batch_best_scores.append(best_match_score)

        # Stack results into tensors
        pred_boxes = torch.stack(batch_best_boxes)
        similarity_scores = torch.stack(batch_best_scores)

        return pred_boxes, similarity_scores, images_producing_no_bbox


def process_crops_for_clip(crops, transform=None, device='cuda'):

    processed_crops = []
    for crop in crops:
        # Extract image from crop dictionary
        if isinstance(crop, dict) and 'im' in crop:
            crop_img = crop['im']
        else:
            crop_img = crop

        # Convert numpy array to PIL Image if necessary
        if isinstance(crop_img, np.ndarray):
            crop_img = Image.fromarray(crop_img)

        # Process the image with the provided transform
        if transform is not None:
            processed_crop = transform(crop_img).to(device)
            processed_crops.append(processed_crop)

    # Stack all processed crops into a batch
    if processed_crops:
        return torch.stack(processed_crops)
    return None

### Validation and test functions

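The base model is not trained here, so only validation and test functions are defined. Both report the `EIoULoss` defined above, together with the IoU-based localization accuracy.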

In [12]:
def validate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    all_preds = []
    all_targets = []
    num_batches = len(dataloader)

    localization_correct = 0
    total_samples = 0
    total_no_bbox_images = 0

    with torch.no_grad():
        pbar = tqdm(dataloader, total=num_batches, desc='Validation')
        for batch in pbar:
            # Get preprocessed data from dataset
            images = batch["image_path"]
            texts = batch["sentence"]
            target_boxes = batch["bbox"].to(device, non_blocking=True)

            # Forward pass - unpack the tuple
            pred_boxes, similarity_scores, no_bbox_info = model(images, texts)
            # Update the loss function call to include similarity scores
            loss = criterion(pred_boxes, target_boxes, similarity_scores)

            # Update metrics
            total_loss += loss.item()
            all_preds.append(pred_boxes.cpu())
            all_targets.append(target_boxes.cpu())

            # Calculate IoU for localization accuracy
            ious = compute_iou(pred_boxes, target_boxes)
            localization_correct += (ious > 0.5).sum().item()
            total_samples += len(images)

            # Update total_no_bbox_images with the count from model output
            total_no_bbox_images += no_bbox_info["count"]

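            pbar.set_postfix({
                'loss': f'{loss.item():.4f}',
                'localization_accuracy': f'{localization_correct / total_samples:.4f}',
                'total_no_bbox_images': f'{total_no_bbox_images}'
            })

    # Return the aggregate results so the caller can inspect them
    return {
        'loss': total_loss / num_batches if num_batches > 0 else float('inf'),
        'localization_accuracy': localization_correct / total_samples if total_samples > 0 else 0,
        'total_no_bbox_images': total_no_bbox_images
    }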


def test(model, test_loader, criterion, device):
    model.eval()
    total_loss = 0
    num_batches = len(test_loader)

    all_preds = []
    all_targets = []
    all_pred_features = []
    all_target_features = []

    total_samples = 0
    total_no_bbox_images = 0

    with torch.no_grad():
        pbar = tqdm(test_loader, total=num_batches, desc='Testing')
        for batch in pbar:
            # Get preprocessed data from dataset
            images = batch["image_path"]
            texts = batch["sentence"]
            target_boxes = batch["bbox"].to(device, non_blocking=True)

            # Forward pass - unpack the tuple
            pred_boxes, similarity_scores, no_bbox_info = model(images, texts)
            # Update the loss function call to include similarity scores
            loss = criterion(pred_boxes, target_boxes, similarity_scores)

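            # Get text features with CLIP (tokenization is a function of the clip package, not of the model)
            text_tokens = clip.tokenize(texts).to(device)
            pred_features = model.clip_model.encode_text(text_tokens).float()
            # NOTE: both feature sets are the same text embeddings, so the
            # resulting semantic similarity is 1 by construction
            target_features = model.clip_model.encode_text(text_tokens).float()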

            # Store predictions and targets
            all_preds.append(pred_boxes.cpu())
            all_targets.append(target_boxes.cpu())
            all_pred_features.append(pred_features.cpu())
            all_target_features.append(target_features.cpu())

            total_samples += len(images)
            total_loss += loss.item()
            total_no_bbox_images += no_bbox_info["count"]

            # Calculate running metrics for progress bar
            running_metrics = calculate_metrics(pred_boxes, target_boxes)

            pbar.set_postfix({
                'loss': f'{loss.item():.4f}',
                'localization_accuracy': f'{running_metrics["localization_accuracy"]:.4f}',
                'total_no_bbox_images': f'{total_no_bbox_images}'
            })

    if all_preds and all_targets:
        all_preds = torch.cat(all_preds, dim=0)
        all_targets = torch.cat(all_targets, dim=0)
        all_pred_features = torch.cat(all_pred_features, dim=0)
        all_target_features = torch.cat(all_target_features, dim=0)

        # Calculate final metrics using your function
        metrics = calculate_metrics(
            all_preds, 
            all_targets,
            all_pred_features,
            all_target_features
        )

        # Add additional metrics not covered by calculate_metrics
        metrics['loss'] = total_loss / num_batches
        metrics['total_no_bbox_images'] = total_no_bbox_images

        # Display the metrics
        display_metrics(metrics, split="test")

        return metrics

    return {
        'loss': float('inf'),
        'localization_accuracy': 0,
        'grounding_accuracy': 0,
        'semantic_similarity': 0,
        'miou': 0,
        'total_no_bbox_images': total_no_bbox_images
    }


### Model test

In [None]:
# Initialize model and criterion (CLIP's `preprocess` is needed to prepare the YOLO crops)
base_model = BaseModel(yolo_model, clip_model, transform=preprocess, device=device)
criterion = EIoULoss()

# Test the model on the validation and test set
val_metrics = validate(base_model, val_loader, criterion, device)
test_metrics = test(base_model, test_loader, criterion, device)

display_metrics(test_metrics, split="test")

## 2.  Custom Model

### Model definition

The idea of the custom model is rather simple, and it somewhat resembles what has been done for the pre-training.\
The model takes the fine-tuned CLIP and appends layers to predict the bounding boxes: the encoded text and image outputs are concatenated and fed into these final layers to obtain the bounding box coordinates.

In [10]:
class CustomModel(nn.Module):

    def __init__(self, clip_grounding_model, hidden_dim=1024):
        super().__init__()

        # Set the pretrained clip model
        self.clip_grounding_model = clip_grounding_model

        # concatenate a MLP for bounding boxes
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4)
        )

    def forward(self, images, texts):
        # Extract global image and text features
        image_features, text_features = self.clip_grounding_model(images, texts)

        # Debug prints
        # print("Image features shape:", image_features.shape)
        # print("Text features shape:", text_features.shape)

        # Normalize features
        image_features = F.normalize(image_features, dim=-1, eps=1e-6)
        text_features = F.normalize(text_features, dim=-1, eps=1e-6)

        # Concatenate image and text features
        combined_features = torch.cat([image_features, text_features], dim=-1)
        # print("Combined features shape:", combined_features.shape)

        # Predict bounding boxes
        bounding_boxes = self.bbox_head(combined_features)
        # print("Bounding boxes shape:", bounding_boxes.shape)

        return bounding_boxes

    def get_embeddings(self, images, text_tokens):
        # `text_tokens` must already be tokenized with clip.tokenize, as done in test()
        clip_model = self.clip_grounding_model.clip_model
        text_embeddings = clip_model.encode_text(text_tokens.to(images.device)).float()
        images_embeddings = clip_model.encode_image(images).float()

        return images_embeddings, text_embeddings

### Train, validation, and test functions

The loss function used to train the model is the `EIoULoss` defined above, which builds on IoU (Intersection over Union).

In [11]:
def train_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    model.clip_grounding_model.train()
    total_loss = 0
    num_batches = len(dataloader)
    pbar = tqdm(dataloader, total=num_batches, desc='Training')

    for batch in pbar:
        # Get preprocessed data from dataset
        images = batch["image"].to(device, non_blocking=True)
        texts = clip.tokenize(batch["sentence"]).to(device)

        target_boxes = batch["bbox"].to(device, non_blocking=True)

        # Training step
        optimizer.zero_grad(set_to_none=True)
        pred_boxes = model(images, texts)

        loss = criterion(pred_boxes, target_boxes)
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        # Update metrics
        total_loss += loss.item()
        pbar.set_description(f'Train Loss: {loss.item():.4f}')


    return total_loss / num_batches

def validate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    all_preds = []
    all_targets = []
    num_batches = len(dataloader)

    with torch.no_grad():
        pbar = tqdm(dataloader, total=num_batches, desc='Validation')
        for batch in pbar:
            try:
                
                # Get preprocessed data from dataset
                images = batch["image"].to(device, non_blocking=True)
                texts = clip.tokenize(batch["sentence"]).to(device)

                target_boxes = batch["bbox"].to(device, non_blocking=True)

                # Forward pass
                pred_boxes = model(images, texts)
                loss = criterion(pred_boxes, target_boxes)

                # Update metrics
                total_loss += loss.item()
                all_preds.append(pred_boxes.cpu())
                all_targets.append(target_boxes.cpu())

                pbar.set_description(f'Val Loss: {loss.item():.4f}')

            except Exception as e:
                print(f"Error in validation batch: {str(e)}")
                continue

    # Calculate metrics only if we have predictions
    if all_preds and all_targets:
        all_preds = torch.cat(all_preds, dim=0)
        all_targets = torch.cat(all_targets, dim=0)

        # Calculate IoU for localization accuracy
        ious = compute_iou(all_preds, all_targets)
        localization_accuracy = torch.mean((ious > 0.5).float()).item()

        return total_loss / num_batches, {'localization_accuracy': localization_accuracy}

    return float('inf'), None

def test(model, test_loader, criterion, device):
    model.eval()
    total_loss = 0
    num_batches = len(test_loader)

    all_preds = []
    all_targets = []
    all_image_features = []
    all_text_features = []

    with torch.no_grad():
        pbar = tqdm(test_loader, total=num_batches, desc="Testing")
        for batch in pbar:
            # Get preprocessed data from the dataset
            images = batch["image"].to(device, non_blocking=True) 
            texts = clip.tokenize(batch["sentence"]).to(device)

            target_boxes = batch["bbox"].to(device, non_blocking=True)  # Ground truth bounding boxes

            # Forward pass: predict bounding boxes (the model returns a single tensor)
            pred_boxes = model(images, texts)
            loss = criterion(pred_boxes, target_boxes)

            # Get image and text embeddings for similarity calculation
            image_features, text_features = model.get_embeddings(images, texts)

            # Store predictions, targets, and embeddings
            all_preds.append(pred_boxes.cpu())
            all_targets.append(target_boxes.cpu())
            all_image_features.append(image_features.cpu())
            all_text_features.append(text_features.cpu())

            total_loss += loss.item()

            # Update progress bar with current loss
            pbar.set_postfix({"loss": f"{loss.item():.4f}"})


    # Process results if there are predictions
    if all_preds and all_targets:
        all_preds = torch.cat(all_preds, dim=0)
        all_targets = torch.cat(all_targets, dim=0)
        all_image_features = torch.cat(all_image_features, dim=0)
        all_text_features = torch.cat(all_text_features, dim=0)

        # Calculate metrics
        metrics = calculate_metrics(
            all_preds, 
            all_targets,
            all_image_features,
            all_text_features
        )

        # Add loss to metrics
        metrics["loss"] = total_loss / num_batches

        return metrics

    # If no predictions were made, return default metrics
    return {
        "loss": float("inf"),
        "localization_accuracy": 0,
        "grounding_accuracy": 0,
        "semantic_similarity": 0,
        "miou": 0,
    }


### Pretrained model instantiation

In [None]:
# Path to the fine-tuned model
finetuned_model_path = "best_ClipGROUND_per_lr/best_CLIPGrounding_10_1e-07_0.0001_crop.pth"

# Initialize CLIPGrounding with the base clip_model
clip_grounding = CLIPGrounding(clip_model).to(device)

# Load the fine-tuned weights
checkpoint = torch.load(finetuned_model_path, map_location=device)

# Load state_dict into the clip_grounding model
clip_grounding.load_state_dict(checkpoint['ft_clip_model_state_dict'], strict=True)


# Wrap clip_grounding with CustomModel
custom_model = CustomModel(clip_grounding).to(device)

### New model instantiation

In [12]:
# Initialize CLIPGrounding with the base clip_model
clip_grounding = CLIPGrounding(clip_model).to(device)

# Wrap clip_grounding with CustomModel
custom_model = CustomModel(clip_grounding).to(device)

### Train and validation loop

In [None]:
# Training parameters
criterion = EIoULoss(reduction='mean').to(device)
# optimizer = torch.optim.AdamW(custom_model.parameters(), lr=1e-4, weight_decay=0.01)

optimizer = torch.optim.AdamW(
    params=[
        {
            "params":  custom_model.clip_grounding_model.parameters(),
            "lr": lr_clip,
            "weight_decay": 0.01,
            "eps": 1e-8
        },
        {
            "params": custom_model.bbox_head.parameters(),
            "lr": lr_proj,
            "weight_decay": 0.01,
            "eps": 1e-8
        }
    ]
)

# Scheduler for the learning rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=5,
    T_mult=2
)

# Number of epochs
num_epochs = 10

best_val_loss = float('inf')
patience = 3
patience_counter = 0

for epoch in range(num_epochs):
    print(f'\nEpoch {epoch+1}/{num_epochs}')

    # Train
    train_loss = train_epoch(custom_model, train_loader, optimizer, criterion, device)

    # Validate
    val_loss, val_metrics = validate(custom_model, val_loader, criterion, device)

    # Learning rate scheduling
    scheduler.step()

    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        # Save best model
        torch.save({
            'epoch': epoch,
            'model_state_dict': custom_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_loss': val_loss,
            'val_metrics': val_metrics,
        }, 'best_model.pth')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f'Early stopping at epoch {epoch+1}')
            break

    # Print metrics
    print(f'Train Loss: {train_loss:.4f}')
    print(f'Val Loss: {val_loss:.4f}')
    # validate() returns None for the metrics when no batch was processed
    if val_metrics is not None:
        print('Validation Metrics:')
        for k, v in val_metrics.items():
            print(f'{k}: {v:.4f}')
    print(f'Learning Rate: {optimizer.param_groups[0]["lr"]:.2e}')
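
The early-stopping rule used in the loop above can be isolated as a small plain-Python sketch. The function name `early_stopping_epoch` and the list of validation losses are made up for illustration; the patience logic mirrors the `patience_counter` bookkeeping of the loop, with 1-indexed epochs.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the counter
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            # No improvement: stop once `patience` such epochs have accumulated
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch


losses = [0.90, 0.80, 0.82, 0.81, 0.83, 0.70]  # made-up validation losses
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best checkpoint from epoch {best_epoch}")
```

In this toy run the loop stops at epoch 5 and never sees the better loss at epoch 6, which is the usual trade-off of a small patience value.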

### Model test

In [None]:
metrics = test(custom_model, test_loader, criterion, device)
display_metrics(metrics, split="test")

# Result discussion

TODO: take the graphs of `BaseModel` and of `CustomModel` with and without pre-training, and compare them.

# Notes on what to do 

Things to consider when implementing the custom model are:
 - Data augmentation (increase the size of the dataset by applying transformations to the images)
 - Regularization techniques
 - Hyperparameter tuning (partially done in the fine-tuning part, as different learning rate values have been tested)

In [None]:
# yolo_model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True, trust_repo=True) 

base_transform = transforms.Compose([
            transforms.ToTensor(),  # Convert PIL to tensor and normalize to [0,1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                              std=[0.229, 0.224, 0.225])  # ImageNet normalization
        ])


#Images returning an empty crop
# refcocog/images/COCO_train2014_000000276874.jpg
# refcocog/images/COCO_train2014_000000276874.jpg
# refcocog/images/COCO_train2014_000000497807.jpg
# refcocog/images/COCO_train2014_000000497807.jpg


yolo_model.conf = 0.2

im1 = 'refcocog/images/COCO_train2014_000000380440.jpg'
im2 = 'refcocog/images/COCO_train2014_000000560180.jpg'


# Load the images
image1 = Image.open(im1)
image2 = Image.open(im2)

# image1 = base_transform(image1)
# image2 = base_transform(image2)

# image1 = image1.resize((640, 640))
# image2 = image2.resize((640, 640))

print(f"Shape of Image 1: {image1.size}")  # (width, height)
print(f"Shape of Image 2: {image2.size}")

# Show the images
# image1.show(title="Image 1")
# image2.show(title="Image 2")

#Process first image
results = yolo_model(im1)  # inference
print("results image 1")
# print("result1")
# print(results) 
results.show()

#Process second image
results = yolo_model(im2)  # inference
print("results image 2")
# print("result2")
# print(results)
results.show()


In [None]:
def compute_iou(boxes1, boxes2):
    # Calculate intersection coordinates
    x1 = torch.max(boxes1[:, 0], boxes2[:, 0])
    y1 = torch.max(boxes1[:, 1], boxes2[:, 1])
    x2 = torch.min(boxes1[:, 2], boxes2[:, 2])
    y2 = torch.min(boxes1[:, 3], boxes2[:, 3])

    # Calculate intersection area
    intersection = torch.clamp(x2 - x1, min=0) * torch.clamp(y2 - y1, min=0)

    # Calculate union area
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    union = area1 + area2 - intersection

    # Calculate IoU
    iou = intersection / (union + 1e-6)  # Add small epsilon to avoid division by zero
    return iou

def compute_siou(pred_boxes, target_boxes):
    """Per-box loss: 1 - IoU plus a center-distance cost and an angle cost.

    Both inputs are [N, 4] tensors of (x1, y1, x2, y2) boxes; the result has shape [N].
    """
    # Part 1: classical IoU from compute_iou, shape [N]
    iou = compute_iou(pred_boxes, target_boxes)

    # Split the coordinates into [N] tensors. unbind (rather than chunk, which
    # yields [N, 1]) keeps every term aligned with `iou`, so nothing
    # broadcasts to [N, N] when the terms are combined below.
    pred_x1, pred_y1, pred_x2, pred_y2 = pred_boxes.unbind(dim=-1)
    target_x1, target_y1, target_x2, target_y2 = target_boxes.unbind(dim=-1)

    # Calculate box centers
    pred_cx = (pred_x1 + pred_x2) / 2
    pred_cy = (pred_y1 + pred_y2) / 2
    target_cx = (target_x1 + target_x2) / 2
    target_cy = (target_y1 + target_y2) / 2

    # Box widths and heights
    pred_w = pred_x2 - pred_x1
    pred_h = pred_y2 - pred_y1
    target_w = target_x2 - target_x1
    target_h = target_y2 - target_y1

    # Part 2: distance and angle costs
    # Distance between the two box centers
    c_2 = torch.pow(target_cx - pred_cx, 2) + torch.pow(target_cy - pred_cy, 2)
    c = torch.sqrt(c_2 + 1e-7)

    # Diagonal length of the smallest enclosing box
    d = torch.sqrt(torch.pow(torch.max(pred_x2, target_x2) - torch.min(pred_x1, target_x1), 2) +
                   torch.pow(torch.max(pred_y2, target_y2) - torch.min(pred_y1, target_y1), 2))

    # Distance ratio
    rho = c / (d + 1e-7)

    # Aspect-ratio consistency term, used here as the angle component
    v = (4 / (math.pi ** 2)) * torch.pow(torch.atan2(target_w, target_h) -
                                         torch.atan2(pred_w, pred_h), 2)

    # Calculate alpha (trade-off parameter for the angle cost)
    alpha = v / (1 - iou + v + 1e-7)

    # Angle cost (close to 1 whenever v is well above the epsilon, 0 when v is 0)
    omega = v / (v + 1e-7)

    # Part 3: final loss value (lower is better)
    siou = 1 - iou + (rho + alpha * omega)

    return siou

class SIoULoss(nn.Module):
    def __init__(self, reduction='mean'):
        super().__init__()
        self.reduction = reduction

    def forward(self, pred_boxes, target_boxes, similarity_scores=None):
        # Localization loss for each predicted/target box pair
        siou_loss = compute_siou(pred_boxes, target_boxes)

        # Add similarity score component if provided
        if similarity_scores is not None:
            # Convert similarity scores to a loss (1 - similarity)
            similarity_loss = 1 - similarity_scores
            # Combine the two terms; 0.5 is the weight of the similarity term and can be tuned
            combined_loss = siou_loss + 0.5 * similarity_loss
        else:
            combined_loss = siou_loss

        # Apply the requested reduction; any other value returns the per-box losses
        if self.reduction == 'mean':
            return combined_loss.mean()
        elif self.reduction == 'sum':
            return combined_loss.sum()
        else:
            return combined_loss
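
As a sanity check for the tensor-based `compute_iou` above, the same arithmetic can be worked through by hand on a single pair of boxes. The sketch below is a plain-Python reference, not part of the project; `iou_single` is a hypothetical name and the two boxes are made-up values chosen so that the result is easy to verify.

```python
def iou_single(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples (hypothetical helper)."""
    # Width and height of the overlap, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h

    # Union = sum of the two areas minus the overlap counted twice
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


pred = (0.0, 0.0, 2.0, 2.0)    # area 4
target = (1.0, 1.0, 3.0, 3.0)  # area 4, overlapping pred in a 1x1 square

# intersection = 1, union = 4 + 4 - 1 = 7, so IoU = 1/7
print(round(iou_single(pred, target), 4))  # 0.1429
```

Passing the same two boxes as `[1, 4]` tensors to `compute_iou` should give approximately the same value, up to the small epsilon added to its denominator.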