# **Part-A**


### Import Required Libraries

In this cell, we import the necessary libraries for the image captioning model:

- **PyTorch**: for model building and training (`torch`, `torch.nn`, `torch.utils.data`, etc.).
- **Vision and Image Processing**: libraries like `torchvision`, `albumentations`, and `PIL` for handling image data and transformations.
- **Transformers**: for utilizing pre-trained Vision Transformer (ViT) and GPT-2 models from the Hugging Face `transformers` library.
- **Data Handling**: `pandas` for dataset manipulation, `numpy` for numerical computations.
- **Evaluation**: libraries like `nltk` and `rouge_score` for evaluating captioning performance using BLEU, ROUGE, and METEOR scores.
- **Other utilities**: `timm` for model creation, `Path` for file paths, and `gc` for garbage collection.

We also download the NLTK wordnet resource to assist with tokenization and other NLP tasks.

Finally, we set the device to use a GPU if available, otherwise fallback to CPU.


In [None]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from timm import create_model, list_models
from types import SimpleNamespace
from tqdm import tqdm
from transformers import ViTModel, GPT2LMHeadModel,GPT2TokenizerFast, get_linear_schedule_with_warmup
import albumentations as A
from albumentations.pytorch import ToTensorV2
from transformers import AutoProcessor, AutoModelForVision2Seq
from pathlib import Path
from sklearn.model_selection import train_test_split
from torch.cuda.amp import GradScaler, autocast
import gc
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score

import nltk
nltk.download('wordnet')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


### Download and Extract Dataset

In this cell, we use the `gdown` library to download a dataset from Google Drive. The provided URL points to a zipped dataset file. We specify the output file name as `custom_captions_dataset.zip`.

Once the dataset is downloaded, we use Python's `zipfile` module to extract the contents of the zip file into a directory called `custom_captions_dataset`. This step is essential for accessing the dataset for further processing and model training.


In [None]:
import gdown

# Your Google Drive file link
url = "https://drive.google.com/uc?id=1FMVcFM78XZE1KE1rIkGBpCdcdI58S1LB"
output = "custom_captions_dataset.zip"

gdown.download(url, output, quiet=False)

import zipfile

with zipfile.ZipFile("custom_captions_dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("custom_captions_dataset")

### Data Preprocessing and Custom Dataset Class

In this cell, we define the data augmentation and transformation pipelines, as well as a custom dataset class for loading and preparing data for training:

1. **Data Augmentation**:
   - `sample_tfms`: A list of augmentation transformations applied to the training images. These include horizontal flipping, random brightness and contrast adjustments, color jittering, random shifting, scaling, and rotating, and hue-saturation adjustments.
   - `train_tfms`: Composes the transformations for training data, including resizing, normalization, and the augmentations from `sample_tfms`.
   - `valid_tfms`: Composes transformations for validation data, focusing on resizing, normalization, and ensuring the image is in tensor format.

2. **Custom Dataset Class (`CustomCaptionDataset`)**:
   - The dataset class inherits from `torch.utils.data.Dataset`. It loads the captions and images from a CSV file and a directory respectively.
   - The `__getitem__` method fetches an image and its corresponding caption, applies the transformations, and tokenizes the caption.
   - Captions are tokenized using the provided tokenizer, and the input IDs are shifted left to create the labels.
   - The `collate_fn` function ensures that the batched images are stacked correctly, and the input IDs and labels are padded and masked appropriately for training.

This class will be used for both training and validation datasets.


In [None]:
import os

sample_tfms = [
    A.HorizontalFlip(),
    A.RandomBrightnessContrast(),
    A.ColorJitter(),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.3, rotate_limit=45, p=0.5),
    A.HueSaturationValue(p=0.3),
]

train_tfms = A.Compose([
    *sample_tfms,
    A.Resize(224, 224),
    A.Normalize(mean=[0.5]*3, std=[0.5]*3),
    ToTensorV2()
])

valid_tfms = A.Compose([
    A.Resize(224, 224),
    A.Normalize(mean=[0.5]*3, std=[0.5]*3),
    ToTensorV2()
])


class CustomCaptionDataset(Dataset):
    def __init__(self, csv_path, image_dir, tokenizer, transform):
        self.df = pd.read_csv(csv_path)
        self.image_dir = image_dir
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image_path = os.path.join(self.image_dir, row['filename'])
        caption = row['caption'] + " <|endoftext|>"

        image = np.array(Image.open(image_path).convert('RGB'))
        image = self.transform(image=image)['image']

        # Tokenize caption
        input_ids = self.tokenizer(caption, truncation=True)['input_ids']
        labels = input_ids.copy()
        labels[:-1] = input_ids[1:]  # Shift left

        return image, input_ids, labels
    
def collate_fn(batch):
    images = [b[0] for b in batch]
    input_ids = [b[1] for b in batch]
    labels = [b[2] for b in batch]

    images = torch.stack(images)

    # Pad input_ids and labels
    input_ids = tokenizer.pad({'input_ids': input_ids}, return_tensors='pt')['input_ids']
    labels = tokenizer.pad({'input_ids': labels}, return_tensors='pt')['input_ids']

    # Mask for loss
    mask = (input_ids != tokenizer.pad_token_id)
    labels[~mask] = -100

    return images, input_ids, labels

### Initialize Tokenizer and Load Datasets

In this cell, we:

1. **Initialize the GPT-2 Tokenizer**:
   - We load the GPT-2 tokenizer (`GPT2TokenizerFast`) from the Hugging Face model hub.
   - We set the `pad_token` to be the same as the `eos_token` (end-of-sequence token) because GPT-2 does not have a dedicated padding token by default.
   - We add a special token `<|endoftext|>` to mark the end of captions in our dataset.

2. **Load Training and Validation Datasets**:
   - We create instances of the `CustomCaptionDataset` for both the training and validation sets.
   - We pass the CSV file paths (`train.csv` and `val.csv`) along with the corresponding image directories for both datasets.
   - The `train_tfms` and `valid_tfms` transformations are applied to the training and validation datasets, respectively.
   - The tokenizer and transformation pipeline ensure the data is correctly processed before being fed into the model during training and evaluation.


In [None]:
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({'eos_token': '<|endoftext|>'})

root_dir = "custom_captions_dataset/custom_captions_dataset"

train_ds = CustomCaptionDataset(
    csv_path=os.path.join(root_dir, "train.csv"),
    image_dir=os.path.join(root_dir, "train"),
    tokenizer=tokenizer,
    transform=train_tfms
)

val_ds = CustomCaptionDataset(
    csv_path=os.path.join(root_dir, "val.csv"),
    image_dir=os.path.join(root_dir, "val"),
    tokenizer=tokenizer,
    transform=valid_tfms
)

### GPT-2 Attention Layer Implementation

In this cell, we define a custom attention layer inspired by the GPT-2 attention mechanism. This is the core building block for the Transformer model's self-attention mechanism. Key components include:

1. **Initialization (`__init__`)**:
   - The attention layer receives a configuration object (`config`) that contains hyperparameters such as embedding dimension (`embed_dim`), number of attention heads (`num_heads`), sequence length (`seq_len`), and dropout values.
   - The embedding dimension must be divisible by the number of attention heads to ensure even splitting of the embeddings across heads.
   - The `c_attn` linear layer computes the query (`q`), key (`k`), and value (`v`) matrices.
   - A triangular attention mask (`mask`) is registered to prevent attention to future positions (important for autoregressive models like GPT-2).
   - `c_proj` is another linear layer to project the output of the attention mechanism back into the embedding space.
   - Dropout layers are applied for attention (`attn_dropout`) and residual connections (`resid_dropout`).

2. **Forward Pass (`forward`)**:
   - The input tensor `x` is of shape `(batch_size, seq_len, embed_dim)`.
   - The `q`, `k`, and `v` matrices are computed by passing the input through the `c_attn` linear layer and splitting it into three parts.
   - The queries and keys are reshaped to separate the attention heads and are used to calculate attention scores (`qk_t`).
   - A mask is applied to prevent attending to future tokens in the sequence.
   - The attention weights are then computed using the softmax function and applied to the value tensor `v`.
   - The attention output is projected back into the original embedding space using `c_proj`, and residual dropout is applied to the final output.

This custom attention layer is designed for use in a Transformer-based architecture and mimics the attention mechanism found in the GPT-2 model.


In [6]:
class GPT2Attention(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.embed_dim = config.embed_dim
        self.n_heads = config.num_heads
        assert self.embed_dim % self.n_heads == 0, 'embedding dimension by be divisible by number of heads'
        self.head_size = self.embed_dim // self.n_heads
        self.seq_len = config.seq_len
        
        self.c_attn = nn.Linear(self.embed_dim, self.head_size * self.n_heads * 3,bias=True)
        self.scale = self.head_size ** -0.5
        
        self.register_buffer('mask',torch.tril(torch.ones(1,1,self.seq_len,self.seq_len)))
        
        self.c_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=True)
        
        self.attn_dropout = nn.Dropout(config.attention_dropout)
        self.resid_dropout = nn.Dropout(config.residual_dropout)
        
        
    def forward(self, x):
        b,t,c = x.shape
        # q,k,v shape individually: batch_size x seq_len x embed_dim
        # we know that qk_t = q x k_t, where q=bxtxhead_dim, k_t=bxhead_timxt
        q,k,v = self.c_attn(x).chunk(3,dim=-1)
        q = q.view(b,t,self.n_heads,self.head_size).permute(0,2,1,3) # batch x n_heads x seq_len x head_dim
        k = k.view(b,t,self.n_heads,self.head_size).permute(0,2,1,3)
        v = v.view(b,t,self.n_heads,self.head_size).permute(0,2,1,3)
        
        qk_t = (q@k.transpose(-2,-1)) * self.scale
        qk_t = qk_t.masked_fill(self.mask[:,:,:t,:t]==0,float('-inf'))
        qk_t = F.softmax(qk_t,dim=-1)
        weights = self.attn_dropout(qk_t)
        
        attention = weights @ v # batch x n_heads x t x head_size
        attention = attention.permute(0,2,1,3).contiguous().view(b,t,c) # batch x t x embed_dim
        
        out = self.c_proj(attention)
        out = self.resid_dropout(out)
        
        return out

### GPT-2 Cross Attention Layer Implementation

In this cell, we define a custom **Cross Attention** layer, similar to the attention mechanism used in transformer models, but specifically designed to work across two different input sequences (e.g., image and text). Key components include:

1. **Initialization (`__init__`)**:
   - The cross-attention layer takes in a configuration object (`config`) with parameters like embedding dimension (`embed_dim`), number of attention heads (`num_heads`), and sequence length (`seq_len`).
   - The embedding dimension must be divisible by the number of attention heads for proper splitting.
   - `q`, `k`, and `v` are three linear layers that project the input queries, keys, and values to the appropriate dimensions for attention calculation.
   - A scaling factor is applied to the attention scores, based on the size of the attention heads (`head_size`).
   - `c_proj` is a linear layer that projects the output of the attention mechanism back into the embedding space.
   - Dropout layers are included to apply regularization during attention computation (`attn_dropout`) and residual connections (`resid_dropout`).
   - A custom weight initialization function (`_init_weights`) is applied to initialize the weights of the linear layers using a normal distribution and set the biases to zero.

2. **Forward Pass (`forward`)**:
   - The input `q`, `k`, and `v` are the queries, keys, and values that are passed into the layer. These typically represent different sequences (e.g., text queries and image features).
   - The queries, keys, and values are passed through the respective linear layers to compute the projected representations.
   - The reshaped queries, keys, and values are used to compute the attention scores (`qk_t`), which are then normalized using softmax.
   - The attention weights are applied to the values (`v`), and the output is projected back to the original embedding space using `c_proj`.
   - Finally, residual dropout is applied to the output.

This custom cross-attention layer can be used to combine information from different modalities (e.g., images and text), making it useful in models that involve multimodal input, such as vision-language transformers.


In [7]:
class GPT2CrossAttention(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.embed_dim = config.embed_dim
        self.n_heads = config.num_heads
        assert self.embed_dim % self.n_heads == 0, 'embedding dimension by be divisible by number of heads'
        self.head_size = self.embed_dim // self.n_heads
        self.seq_len = config.seq_len
        
        self.q = nn.Linear(self.embed_dim,self.embed_dim)
        self.k = nn.Linear(self.embed_dim,self.embed_dim)
        self.v = nn.Linear(self.embed_dim,self.embed_dim)
        self.scale = self.head_size ** -0.5
        
        self.c_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=True)
        
        self.attn_dropout = nn.Dropout(config.attention_dropout)
        self.resid_dropout = nn.Dropout(config.residual_dropout)
        
        self.apply(self._init_weights)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        
        
    def forward(self, q,k,v):
        b,t,c = q.shape
        
        q = self.q(q)
        k = self.k(k)
        v = self.v(v)
        
        q = q.view(b,q.size(1),self.n_heads,self.head_size).permute(0,2,1,3) # batch x n_heads x seq_len x head_dim
        k = k.view(b,k.size(1),self.n_heads,self.head_size).permute(0,2,1,3)
        v = v.view(b,v.size(1),self.n_heads,self.head_size).permute(0,2,1,3)
        
        qk_t = (q@k.transpose(-2,-1)) * self.scale
        qk_t = F.softmax(qk_t,dim=-1)
        weights = self.attn_dropout(qk_t)
        
        attention = weights @ v # batch x n_heads x t x head_size
        attention = attention.permute(0,2,1,3).contiguous().view(b,t,c) # batch x t x embed_dim
        
        out = self.c_proj(attention)
        out = self.resid_dropout(out)
        
        return out

### GPT-2 MLP (Multilayer Perceptron) Layer Implementation

In this cell, we define a custom **MLP** (Multilayer Perceptron) layer used within the GPT-2 architecture. The MLP layer is applied to the output of the attention layers to project the features into a new space. The key components include:

1. **Initialization (`__init__`)**:
   - The MLP layer is configured using the `config` object, which provides parameters such as the embedding dimension (`embed_dim`), the MLP ratio (`mlp_ratio`), and the dropout probability (`mlp_dropout`).
   - `c_fc`: A linear layer that projects the input embedding dimension to a higher dimensional space (scaled by `mlp_ratio`).
   - `c_proj`: A linear layer that projects the output of the hidden layer back to the original embedding dimension.
   - `act`: The activation function applied between the two linear layers. In this case, we use the GELU (Gaussian Error Linear Unit) activation function, commonly used in GPT-2.
   - `dropout`: A dropout layer applied after the final projection to prevent overfitting.

2. **Forward Pass (`forward`)**:
   - The input `x` passes through the `c_fc` linear layer, followed by the GELU activation (`act`).
   - It then passes through the `c_proj` linear layer, and dropout is applied to the output to ensure regularization.

This MLP layer is used to transform the attention outputs and help the model learn higher-level representations, forming a critical part of the feed-forward network in GPT-2.


In [8]:
class GPT2MLP(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.embed_dim = config.embed_dim
        self.mlp_ratio = config.mlp_ratio
        self.mlp_dropout = config.mlp_dropout
        
        self.c_fc = nn.Linear(self.embed_dim,self.embed_dim*self.mlp_ratio)
        self.c_proj = nn.Linear(self.embed_dim*self.mlp_ratio,self.embed_dim)
        self.act = nn.GELU()
        self.dropout = nn.Dropout(self.mlp_dropout)
        
    def forward(self,x):
        x = self.c_fc(x)
        x = self.act(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

### GPT-2 Block (Transformer Block) Implementation

In this cell, we define a custom **GPT-2 Block**, which is a standard Transformer block containing a series of layers for both self-attention and cross-attention. The block also includes feed-forward operations. Key components of the block include:

1. **Initialization (`__init__`)**:
   - The `GPT2Block` is initialized with the following layers:
     - `ln_1`, `ln_2`, `ln_3`: Layer normalization layers that are applied to the input before the attention and MLP layers.
     - `attn`: The self-attention layer (`GPT2Attention`) that processes the input sequence on its own.
     - `cross_attn`: The cross-attention layer (`GPT2CrossAttention`), which attends to external (encoder) outputs, enabling the model to process multimodal data (e.g., images and text).
     - `mlp`: The MLP layer (`GPT2MLP`) that processes the output of the attention layers and applies non-linear transformations.

2. **Forward Pass (`forward`)**:
   - The input `x` is first passed through the self-attention layer (`attn`), with the result added back to the input (`x + attention_output`).
   - Then, the cross-attention layer (`cross_attn`) is applied, attending to an external encoder output (`enc_out`), and the result is added back to the current sequence (`x + cross_attention_output`).
   - Finally, the MLP layer is applied to the output, and the result is again added back to the input (`x + mlp_output`).

This GPT-2 Block is a crucial building block of the GPT-2 architecture, handling both self-attention and cross-attention, along with feed-forward processing. It supports complex sequence modeling, particularly useful in tasks involving both sequence data and external context (e.g., image and text).


In [9]:
class GPT2Block(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.embed_dim = config.embed_dim
        self.ln_1 = nn.LayerNorm(self.embed_dim)
        self.attn = GPT2Attention(config)
        self.ln_2 = nn.LayerNorm(self.embed_dim)
        self.mlp = GPT2MLP(config)
        self.ln_3 = nn.LayerNorm(self.embed_dim)
        self.cross_attn = GPT2CrossAttention(config)
        
    def forward(self,x,enc_out):
        x = x+self.attn(self.ln_1(x))
        x = x+self.cross_attn(self.ln_2(x),enc_out,enc_out)
        x = x+self.mlp(self.ln_3(x))
        return x


### Image Captioning Model (ViT-GPT2 Encoder-Decoder Architecture)

In this cell, we define the **ImageCaptionModel**, which combines a **Vision Transformer (ViT)** encoder with a **GPT-2** decoder to generate captions for images. This architecture is capable of processing image data (via ViT) and text data (via GPT-2), making it suitable for image captioning tasks.

#### Key Components:

1. **Initialization (`__init__`)**:
   - **ViT Encoder**: The ViT encoder (`vit_small_patch16_224`) is used as the image feature extractor. The input image is divided into patches, and each patch is passed through the ViT encoder.
     - `patch_embed`: Embedding layer for patches.
     - `cls_token`: Class token for the sequence.
     - `pos_embed`: Positional embedding for patches.
     - `blocks`: The transformer blocks of ViT.
     - `encoder_to_decoder`: A linear layer that projects the output of the ViT encoder to match the embedding size of the GPT-2 decoder.
   
   - **GPT-2 Decoder**:
     - **Embedding Layer**: `wte` and `wpe` represent the token and positional embeddings for the GPT-2 decoder.
     - **Transformer Blocks**: The GPT-2 model consists of a series of `GPT2Block`s, which include attention and feed-forward layers.
     - **Final Layer Norm (`ln_f`)**: Applied to the final output before feeding into the language model head.
     - **Language Model Head (`lm_head`)**: A linear layer that projects the output embeddings to the vocabulary size.

2. **Positional Embedding** (`_pos_embed` method):
   - The positional embedding is added to the input image sequence, with the class token (`cls_token`) prepended to the image tokens.

3. **Training Functionality**:
   - **Freezing Layers**: The `pretrained_layers_trainable` method allows selective freezing or unfreezing of model layers during training.
   - **Unfreezing GPT-2 Layers**: The `unfreeze_gpt_layers` method enables the unfreezing of GPT-2 layers for fine-tuning.

4. **Model Initialization from Pretrained Weights** (`from_pretrained` class method):
   - This method loads the model weights from a pretrained GPT-2 model and initializes the ViT components.
   - Weights for GPT-2 layers are transposed as needed to match the shapes between ViT and GPT-2 layers.

5. **Forward Pass**:
   - **Image Input**: The image is passed through the ViT encoder (`patch_embed` and `blocks`), then projected to the decoder embedding space.
   - **Text Input**: The token embeddings (`wte`) are added to the positional embeddings (`wpe`), and the result is passed through the transformer layers.
   - **Loss Calculation**: If `labels` are provided, the model computes the cross-entropy loss using the `lm_head` layer.

6. **Text Generation** (`generate` method):
   - The model can generate captions by autoregressively predicting the next token in the sequence until the maximum number of tokens is reached or the `<|endoftext|>` token is generated. The `temperature` parameter controls the randomness of the predictions, and the `deterministic` flag determines whether to use greedy decoding or sampling.

This architecture enables the generation of descriptive captions for images by leveraging both vision and language models. It is an encoder-decoder style model where the encoder processes the image, and the decoder generates the caption text.


In [None]:
class ImageCaptionModel(nn.Module):
    def __init__(self,config):
        super().__init__()
        
        self.config = config
        
        vit = create_model('vit_small_patch16_224',pretrained=True,num_classes=0)
        self.patch_embed = vit.patch_embed
        num_patches = self.patch_embed.num_patches
        
        self.cls_token = vit.cls_token
        embed_len = num_patches + vit.num_prefix_tokens
        self.pos_embed = vit.pos_embed
        self.pos_drop = nn.Dropout(p=0.)
        
        self.blocks = nn.ModuleList([vit.blocks[i] for i in range(config.depth)])
        #check this
        self.encoder_to_decoder = nn.Linear(config.vit_embed_dim, config.embed_dim)

        
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size,config.embed_dim),
            wpe = nn.Embedding(config.seq_len,config.embed_dim),
            drop = nn.Dropout(config.emb_dropout),
            h = nn.ModuleList([GPT2Block(config) for _ in range(config.depth)]),
            ln_f = nn.LayerNorm(config.embed_dim)
        ))
        self.lm_head = nn.Linear(config.embed_dim,config.vocab_size,bias=False)
        self.transformer.wte.weight = self.lm_head.weight
        
    def _pos_embed(self,x):
        pos_embed = self.pos_embed
        x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
        x = x + pos_embed
        return self.pos_drop(x)
    
    def pretrained_layers_trainable(self,trainable=False):
        layers = [
            self.cls_token, self.patch_embed, self.pos_embed, self.blocks,
            self.transformer.wte, self.transformer.wpe,
            self.transformer.ln_f, self.lm_head
        ]
        gpt_layers = [[
            self.transformer.h[i].ln_1,self.transformer.h[i].ln_2,
            self.transformer.h[i].attn,self.transformer.h[i].mlp
        ] for i in range(self.config.depth)]
        for l in gpt_layers:
            layers.extend(l)
        
        for layer in layers:
            if not isinstance(layer,nn.Parameter):
                for p in layer.parameters():
                    p.requires_grad = trainable
            else:
                layer.requires_grad = trainable
                
        total_frozen_params = sum([p.numel() for p in self.parameters() if not p.requires_grad])
        print(f'{total_frozen_params=}')
        
    def unfreeze_gpt_layers(self,):
        gpt_layers = [[
            self.transformer.h[i].ln_1,self.transformer.h[i].ln_2,
            self.transformer.h[i].attn,self.transformer.h[i].mlp
        ] for i in range(self.config.depth)]
        flatten = []
        for l in gpt_layers:
            flatten.extend(l)
            
        for layer in flatten:
            if not isinstance(layer,nn.Parameter):
                for p in layer.parameters():
                    p.requires_grad = True
            else:
                layer.requires_grad = True
        
    @classmethod    
    def from_pretrained(self,config):
        model = ImageCaptionModel(config)
        sd = model.state_dict()
        keys = sd.keys()
        ignore_matches = ['blocks.','cross_attn.','ln_3','cls_token','pos_embed','patch_embed.','.attn.mask']
        vit_keys = [key for key in keys if any(match in key for match in ignore_matches)]
        gpt_keys = [key for key in keys if key not in vit_keys]
        
        gpt2_small = GPT2LMHeadModel.from_pretrained('gpt2')
        sd_hf = gpt2_small.state_dict()
        hf_keys = sd_hf.keys()
        hf_keys = [k for k in hf_keys if not k.endswith('.attn.masked_bias')]
        hf_keys = [k for k in hf_keys if not k.endswith('.attn.bias')]
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        
        for k in hf_keys:
            if any(match in k for match in ignore_matches):
                continue
            if any(k.endswith(w) for w in transposed):
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])
            
        model.load_state_dict(sd)
        
        return model
    
    def forward(self,image,input_ids,labels=None):
        
        image = self.patch_embed(image)
        image = self._pos_embed(image)
        #check this out 
        image_proj = self.encoder_to_decoder(image)


        
        token_embeddings = self.transformer.wte(input_ids) # batch x seq_len
        pos_embs = torch.arange(0,input_ids.size(1)).to(input_ids.device)
        positional_embeddings = self.transformer.wpe(pos_embs)
        input_ids = self.transformer.drop(token_embeddings+positional_embeddings)
        
        for i in range(self.config.depth):
            image = self.blocks[i](image)
            # input_ids = self.transformer.h[i](input_ids, image)
            #check this out
            input_ids = self.transformer.h[i](input_ids, image_proj)
        
        input_ids = self.transformer.ln_f(input_ids)
        
        if labels is not None:
            lm_logits = self.lm_head(input_ids)
            loss = F.cross_entropy(lm_logits.view(-1, lm_logits.shape[-1]), labels.view(-1))
            return loss
        
        lm_logits = self.lm_head(input_ids[:,[-1],:])
        return lm_logits
    
    def generate(self,image,sequence,max_tokens=50,temperature=1.0,deterministic=False):
        for _ in range(max_tokens):
            out = self(image,sequence)
            out = out[:,-1,:] / temperature
            probs = F.softmax(out,dim=-1)
            if deterministic:
                next_token = torch.argmax(probs,dim=-1,keepdim=True)
            else:
                next_token = torch.multinomial(probs,num_samples=1)
            sequence = torch.cat([sequence,next_token],dim=1)
            if next_token.item() == tokenizer.eos_token_id:
                break
            
        return sequence.cpu().flatten()

### Trainer Class for Training and Evaluation

The `Trainer` class handles the training and evaluation process for the **ImageCaptionModel**. It integrates the model training, validation, and saving/loading of the best model, as well as the text generation functionality for captioning.

#### Key Components:

1. **Initialization (`__init__`)**:
   - **Model Initialization**: The model is instantiated using the `ImageCaptionModel.from_pretrained` method, which loads the pretrained weights and sets up the model. Initially, all pretrained layers are frozen.
   - **Tokenizer**: A GPT-2 tokenizer (`GPT2TokenizerFast`) is used to process input text. The `bos_token` and `pad_token` are explicitly defined for the tokenizer.
   - **GradScaler**: The `GradScaler` is used for mixed precision training with `autocast` for automatic mixed-precision scaling.
   - **DataLoaders**: `train_dl` and `val_dl` are the training and validation DataLoader objects.
   - **Optimizer**: Adam optimizer with a learning rate scheduler (`OneCycleLR`) for dynamic learning rate adjustment during training.

2. **Saving and Loading Models**:
   - **`save_model`**: Saves the current model state to the specified directory.
   - **`load_best_model`**: Loads the best saved model based on validation performance.

3. **Training Loop**:
   - **`train_one_epoch`**: Handles the training process for a single epoch. The loss is computed for each batch, gradients are backpropagated, and the optimizer is updated using mixed precision.
   - **`valid_one_epoch`**: Evaluates the model on the validation set for one epoch and computes the loss and perplexity.
   - **`fit`**: Manages the entire training process over multiple epochs. It trains the model, evaluates it, saves the best model based on validation perplexity, and supports layer freezing/unfreezing during training.

4. **Metrics**:
   - The metrics DataFrame (`self.metrics`) tracks the loss and perplexity for both training and validation during the training process.

5. **Model Cleaning**:
   - **`clean`**: Frees up GPU memory by running garbage collection and clearing the cache.

6. **Text Generation (for Captioning)**:
   - **`generate_caption`**: Given an image, the method generates a caption by feeding the image through the model and decoding the generated sequence into text using the tokenizer.

#### Training and Fine-Tuning:
- **Freezing Layers**: The model layers can be selectively frozen or unfrozen during training to prevent certain parts of the model (such as the GPT-2 decoder) from being updated in the early training phases. The freezing/unfreezing process is controlled by `freeze_epochs_gpt` and `freeze_epochs_all` in the `train_config`.
  
#### Example Usage:

```python
# Initialize Trainer with model configuration, training configuration, and DataLoaders
trainer = Trainer(model_config, train_config, dls)

# Train the model
trainer.fit()

# Generate captions from an image
caption = trainer.generate_caption(image_path)
print(caption)


In [None]:
class Trainer:
    def __init__(self,model_config,train_config, dls):
        
        self.train_config = train_config
        self.model_config = model_config
        self.device = self.train_config.device
        
        self.model = ImageCaptionModel.from_pretrained(model_config).to(self.device)
        self.model.pretrained_layers_trainable(trainable=False)
        
        print(f'trainable parameters: {sum([p.numel() for p in self.model.parameters() if p.requires_grad])}')
        
        self.tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.bos_token = self.tokenizer.bos_token  # Explicitly define BOS token
        
        self.scaler = GradScaler()
        
        self.train_dl, self.val_dl = dls
        
        total_steps = len(self.train_dl)
        
        self.optim = torch.optim.Adam(self.model.parameters(), lr=self.train_config.lr / 25.)
        self.sched = torch.optim.lr_scheduler.OneCycleLR(
            self.optim,
            max_lr=self.train_config.lr,
            epochs=self.train_config.epochs,
            steps_per_epoch=total_steps
        )
        
#         self.sched = get_linear_schedule_with_warmup(self.optim,num_warmup_steps=0,num_training_steps=total_steps)
        
        self.metrics = pd.DataFrame()
        self.metrics[['train_loss','train_perplexity','val_loss','val_perplexity']] = None
        
        self.gen_tfms = A.Compose([
            A.Resize(224,224),
            A.Normalize(mean=[0.5,0.5,0.5],std=[0.5,0.5,0.5],always_apply=True),
            ToTensorV2()
        ])
            
        
    def save_model(self,):
        self.train_config.model_path.mkdir(exist_ok=True)
        sd = self.model.state_dict()
        torch.save(sd,self.train_config.model_path/'captioner.pt')
        
        
    def load_best_model(self,):
        sd = torch.load(self.train_config.model_path/'captioner.pt')
        self.model.load_state_dict(sd)
    
    
    def train_one_epoch(self,epoch):
        
        prog = tqdm(self.train_dl,total=len(self.train_dl))
        
        running_loss = 0.
        
        for image, input_ids, labels in prog:
            
            with autocast():
                image = image.to(self.device)
                input_ids = input_ids.to(self.device)
                labels = labels.to(self.device)
                
                loss = self.model(image,input_ids,labels)
                
                self.scaler.scale(loss).backward()
                self.scaler.step(self.optim)
                self.scaler.update()
                self.sched.step()
                self.optim.zero_grad(set_to_none=True)
                
                running_loss += loss.item()
                
                prog.set_description(f'train loss: {loss.item():.3f}')
                
            del image, input_ids, labels, loss
            
        train_loss = running_loss / len(self.train_dl)
        train_pxp = np.exp(train_loss)
        
        self.metrics.loc[epoch,['train_loss','train_perplexity']] = (train_loss,train_pxp)
        
        
    @torch.no_grad()
    def valid_one_epoch(self,epoch):
        
        prog = tqdm(self.val_dl,total=len(self.val_dl))
        
        running_loss = 0.
        
        for image, input_ids, labels in prog:
            
            with autocast():
                image = image.to(self.device)
                input_ids = input_ids.to(self.device)
                labels = labels.to(self.device)
                
                loss = self.model(image,input_ids,labels)
                running_loss += loss.item()
                
                prog.set_description(f'valid loss: {loss.item():.3f}')
                
            del image, input_ids, labels, loss
            
        val_loss = running_loss / len(self.val_dl)
        val_pxp = np.exp(val_loss)
        
        self.metrics.loc[epoch,['val_loss','val_perplexity']] = (val_loss,val_pxp)
        
        return val_pxp
        
        
    def clean(self):
        gc.collect()
        torch.cuda.empty_cache()
       
    
    def fit(self,):
        
        best_pxp = 1e9
        best_epoch = -1
        prog = tqdm(range(self.train_config.epochs))
        
        for epoch in prog:
            
            if epoch == self.train_config.freeze_epochs_gpt:
                self.model.unfreeze_gpt_layers()
                print('unfreezing GPT2 entirely...')
                
            if epoch == self.train_config.freeze_epochs_all:
                self.model.pretrained_layers_trainable(trainable=True)
            
            self.model.train()
            prog.set_description('training')
            self.train_one_epoch(epoch)
            self.clean()
            
            self.model.eval()
            prog.set_description('validating')
            pxp = self.valid_one_epoch(epoch)
            self.clean()
            
            print(self.metrics.tail(1))
            
            if pxp < best_pxp:
                best_pxp = pxp
                best_epoch = epoch
                print('saving best model...')
                self.save_model()
                
        return {
            'best_perplexity': best_pxp,
            'best_epoch': best_epoch
        }
           
        
    @torch.no_grad()
    def generate_caption(self,image,max_tokens=50,temperature=1.0,deterministic=False):
        
        self.model.eval()
        
        image = Image.open(image).convert('RGB')
        image = np.array(image)
        image = self.gen_tfms(image=image)['image']
        image = image.unsqueeze(0).to(self.device)
        sequence = torch.ones(1,1).to(device=self.device).long() * self.tokenizer.bos_token_id
        
        caption = self.model.generate(
            image,
            sequence,
            max_tokens=max_tokens,
            temperature=temperature,
            deterministic=deterministic
        )
        caption = self.tokenizer.decode(caption.numpy(),skip_special_tokens=True)
        
        return caption

### Model Configuration (`model_config`)

This configuration defines the structure and behavior of the `ImageCaptionModel`.

- **vocab_size**: `50,257` - The size of the vocabulary used by the tokenizer.
- **embed_dim**: `768` - The dimensionality of the embeddings in the model (for both ViT and GPT-2 components).
- **vit_embed_dim**: `384` - The embedding dimension for the Vision Transformer (ViT).
- **num_heads**: `12` - Number of attention heads in the multi-head attention mechanism.
- **seq_len**: `1024` - Maximum sequence length for input token sequences.
- **depth**: `12` - Number of transformer layers in the GPT-2 decoder.
- **attention_dropout**: `0.1` - Dropout rate applied to the attention layers.
- **residual_dropout**: `0.1` - Dropout applied to the residual connections in the transformer model.
- **mlp_ratio**: `4` - The ratio determining the hidden layer size in the MLP block.
- **mlp_dropout**: `0.1` - Dropout rate applied to the MLP (feedforward) block.
- **emb_dropout**: `0.1` - Dropout applied to the embeddings before the transformer layers.

### Training Configuration (`train_config`)

This configuration defines the training parameters, optimizer settings, and model saving/loading paths.

- **epochs**: `1` - Number of training epochs. Set to 1 to avoid running out of memory during the second epoch.
- **freeze_epochs_gpt**: `1` - Number of epochs to freeze the GPT layers. After this epoch, the GPT layers will be unfrozen.
- **freeze_epochs_all**: `2` - Number of epochs to freeze all pretrained layers (including ViT). After this, all layers will be trainable.
- **lr**: `1e-4` - Learning rate for the Adam optimizer.
- **device**: `'cuda'` - The device for training (GPU).
- **model_path**: `Path('captioner')` - Directory where the trained model will be saved.
- **batch_size**: `32` - Batch size used during training.


In [12]:
model_config = SimpleNamespace(
    vocab_size = 50_257,
    embed_dim = 768,
    vit_embed_dim=384,
    num_heads = 12,
    seq_len = 1024,
    depth = 12,
    attention_dropout = 0.1,
    residual_dropout = 0.1,
    mlp_ratio = 4,
    mlp_dropout = 0.1,
    emb_dropout = 0.1,
)
train_config = SimpleNamespace(
    epochs = 1,#changed this since oom during epoch 2
    freeze_epochs_gpt = 1,
    freeze_epochs_all = 2,
    lr = 1e-4,
    device = 'cuda',
    model_path = Path('captioner'),
    batch_size = 32
)

### Create DataLoaders

We create two DataLoaders for the training and validation datasets:

- **Training DataLoader**:
    - Loads the `train_ds` dataset.
    - Batch size is defined by the `train_config.batch_size`.
    - Shuffling is enabled for training.
    - Data is pinned to memory for faster access.
    - Uses 2 worker threads for parallel data loading.
    - A custom `collate_fn` function is used to handle batching.

- **Validation DataLoader**:
    - Loads the `val_ds` dataset.
    - Batch size is defined by the `train_config.batch_size`.
    - Shuffling is disabled for validation (no need to shuffle).
    - Data is pinned to memory for faster access.
    - Uses 2 worker threads for parallel data loading.
    - A custom `collate_fn` function is used for batching.

### Create Trainer Instance

A `Trainer` instance is created to handle the training loop. The model configuration, training configuration, and both DataLoaders (training and validation) are passed as arguments.

### Start Training

The training process is initiated by calling the `fit()` method of the `Trainer` class.


In [13]:
# Create DataLoaders
train_dl = torch.utils.data.DataLoader(
    train_ds,
    batch_size=train_config.batch_size,
    shuffle=True,
    pin_memory=True,
    num_workers=2,
    persistent_workers=True,
    collate_fn=collate_fn
)

val_dl = torch.utils.data.DataLoader(
    val_ds,
    batch_size=train_config.batch_size,
    shuffle=False,
    pin_memory=True,
    num_workers=2,
    persistent_workers=True,
    collate_fn=collate_fn
)

# Create Trainer instance
trainer = Trainer(model_config, train_config, (train_dl, val_dl))

# Start training
trainer.fit()

model.safetensors:   0%|          | 0.00/88.2M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

total_frozen_params=146104704
trainable parameters: 28662528


  self.scaler = GradScaler()
  A.Normalize(mean=[0.5,0.5,0.5],std=[0.5,0.5,0.5],always_apply=True),
training:   0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/179 [00:00<?, ?it/s][AYou're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  with autocast():

train loss: 8.315:   0%|          | 0/179 [00:02<?, ?it/s][A
train loss: 8.315:   1%|          | 1/179 [00:02<07:18,  2.46s/it][A
train loss: 8.415:   1%|          | 1/179 [00:02<07:18,  2.46s/it][A
train loss: 8.415:   1%|          | 2/179 [00:02<03:50,  1.30s/it][A
train loss: 8.403:   1%|          | 2/179 [00:03<03:50,  1.30s/it]

  train_loss train_perplexity val_loss val_perplexity
0   6.500959       665.779517  5.60402     271.515759
saving best model...


validating: 100%|██████████| 1/1 [02:12<00:00, 132.29s/it]


{'best_perplexity': 271.515759109497, 'best_epoch': 0}

In [14]:
trainer.metrics

Unnamed: 0,train_loss,train_perplexity,val_loss,val_perplexity
0,6.500959,665.779517,5.60402,271.515759


### Evaluation on Test Set

This section outlines the process for evaluating the model on the test set. The evaluation involves generating captions for images in the test dataset and computing several common evaluation metrics: BLEU, METEOR, and ROUGE-L.

1. **Prepare the Test Dataset**:
   - The test dataset is loaded using a custom `CustomCaptionDataset` class, which reads the CSV file (`test.csv`) and the images from the specified directory.
   - The dataset is tokenized using the pre-trained tokenizer, and any necessary transformations (such as resizing or normalization) are applied to the images.

2. **Load the Best Model**:
   - The best model (based on validation performance) is loaded using the `trainer.load_best_model()` method.

3. **Setup Evaluation Metrics**:
   - The following evaluation metrics are loaded using the `evaluate` library:
     - **BLEU**: Measures the precision of n-grams in the generated captions compared to the reference captions.
     - **METEOR**: Computes the harmonic mean of precision and recall, considering synonymy and stemming.
     - **ROUGE-L**: Measures the longest common subsequence between the generated and reference captions.

4. **Generate Captions**:
   - For each image in the test dataset, the model generates a caption. The generated caption is stored in the `predictions` list, while the reference caption is stored in the `references` list. Note that BLEU and METEOR expect the references to be in a list format.

5. **Evaluate the Model**:
   - The BLEU, METEOR, and ROUGE-L scores are computed by comparing the model's predictions with the ground truth references.

6. **Report Results**:
   - Finally, the evaluation scores are printed, providing insight into the model's performance on the test set. The BLEU, METEOR, and ROUGE-L scores indicate the quality of the generated captions.


In [17]:
import evaluate
from tqdm import tqdm

# Step 1: Prepare the test dataset
test_ds = CustomCaptionDataset(
    csv_path=os.path.join(root_dir, "test.csv"),
    image_dir=os.path.join(root_dir, "test"),
    tokenizer=tokenizer,
    transform=valid_tfms
)

# Step 2: Load the best model
trainer.load_best_model()

# Step 3: Setup evaluation metrics
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

predictions = []
references = []

# Step 4: Generate captions
for i in tqdm(range(len(test_ds))):
    image_path = test_ds.df.iloc[i]['filename']
    full_path = os.path.join(test_ds.image_dir, image_path)
    
    ref_caption = test_ds.df.iloc[i]['caption']
    gen_caption = trainer.generate_caption(full_path)

    predictions.append(gen_caption.strip())
    references.append([ref_caption.strip()])  # BLEU and METEOR need list of references

# Step 5: Evaluate
bleu_score = bleu.compute(predictions=predictions, references=references)
meteor_score = meteor.compute(predictions=predictions, references=references)
rouge_score = rouge.compute(predictions=predictions, references=[r[0] for r in references])

# Step 6: Report
print("\n=== Evaluation on Test Set ===")
print(f"BLEU Score: {bleu_score['bleu']:.4f}")
print(f"METEOR Score: {meteor_score['meteor']:.4f}")
print(f"ROUGE-L Score: {rouge_score['rougeL']:.4f}")


  sd = torch.load(self.train_config.model_path/'captioner.pt')


Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.02k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

100%|██████████| 928/928 [10:25<00:00,  1.48it/s]



=== Evaluation on Test Set ===
BLEU Score: 0.0000
METEOR Score: 0.1194
ROUGE-L Score: 0.1495


### Load Pretrained SmolVLM Model and Processor

In this step, we load the pretrained SmolVLM model and its associated processor from the Hugging Face model hub. The processor is responsible for pre-processing the input data (e.g., images and text) to ensure compatibility with the model. The model is loaded using `AutoModelForVision2Seq`, which is a Vision-to-Sequence model designed for tasks like image captioning.

1. **Load the Processor**:
   - The processor is loaded using `AutoProcessor.from_pretrained()`, which automatically retrieves the processor for the specified pretrained model, `"HuggingFaceTB/SmolVLM-256M-Instruct"`. This processor will handle the transformation of input data into a format suitable for the model.

2. **Load the Model**:
   - The model is loaded using `AutoModelForVision2Seq.from_pretrained()` with the same model identifier (`"HuggingFaceTB/SmolVLM-256M-Instruct"`). The model is loaded with the appropriate precision (either `bfloat16` or `float32`) depending on the availability of GPU support.
   - The model is then moved to the specified device (`cuda` if available, otherwise `cpu`) to ensure optimal computation.
   - The `_attn_implementation="eager"` argument is set as a fallback for `flash_attention_2` in case it is unavailable.

The loaded model and processor will be used for inference and image captioning tasks.


In [40]:

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-256M-Instruct",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    _attn_implementation="eager",  # fallback for flash_attention_2
).to(device)


### Setup Evaluation Metrics and Zero-Shot Captioning

In this section, we set up the evaluation metrics (BLEU, METEOR, and ROUGE) and define a function for zero-shot image captioning.

1. **Setup BLEU Smoothing**:
   - The `SmoothingFunction` from the `nltk.translate.bleu_score` module is used to apply a smoothing function to the BLEU score calculation. This helps avoid zero BLEU scores for very short sentences or captions. We use `method4` as the smoothing method.

2. **Setup Evaluation Metrics**:
   - The evaluation metrics (BLEU, METEOR, and ROUGE) are loaded using the `evaluate` library. These metrics are used to assess the quality of generated captions in comparison to reference captions.

3. **Zero-Shot Captioning**:
   - The `zero_shot_captioning` function takes an image as input and generates a caption without the need for fine-tuning on a specific dataset. The process involves:
     - Constructing a prompt asking the model, "What’s in this image?"
     - Applying the processor's `apply_chat_template` method to format the prompt and image for the model.
     - Generating a caption using the model with a maximum token length of 50.
     - Cleaning the caption to trim it to just the response (removing the question and assistant label).

The generated caption provides a description of the image, which can be evaluated using the aforementioned metrics.


In [None]:


from nltk.translate.bleu_score import SmoothingFunction
smooth_fn = SmoothingFunction().method4


# Step 2: Setup evaluation metrics
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")


def zero_shot_captioning(image: Image.Image) -> str:
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What’s in this image?"}
        ]
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=50)
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # ✅ Clean prompt & trim to just the answer
    caption = caption.split("What’s in this image?")[-1].replace("Assistant:", "").strip()
    return caption




### Evaluate SmolVLM Zero-Shot on Test Set

In this step, we evaluate the SmolVLM model's zero-shot captioning performance on the test set. We generate captions for each image in the test set, compare them with reference captions, and compute evaluation metrics (BLEU, METEOR, and ROUGE).

1. **Generate Captions**:
   - For each image in the test set, we load the image and attempt to generate a caption using the `zero_shot_captioning` function.
   - If an error occurs during image processing (e.g., an invalid image file), the image is skipped, and an error message is printed.
   - The generated captions are stored in the `predictions` list, and the reference captions are stored in the `references` list.

2. **Compute Metrics**:
   - After generating captions for all the images, we compute the evaluation metrics:
     - **BLEU**: Measures n-gram precision.
     - **METEOR**: Harmonic mean of precision and recall, accounting for synonyms and stemming.
     - **ROUGE-L**: Longest common subsequence between generated and reference captions.

3. **Report Results**:
   - The computed scores (BLEU, METEOR, and ROUGE-L) are printed to give an overview of the model's performance on the test set.

This evaluation gives insight into how well the SmolVLM model performs in a zero-shot setting on the given test data.


In [42]:
# Step 5: Evaluate
predictions = []
references = []

print("Evaluating SmolVLM Zero-Shot on Test Set...")
for i in tqdm(range(len(test_ds))):
    image_path = test_ds.df.iloc[i]['filename']
    full_path = os.path.join(test_ds.image_dir, image_path)

    ref_caption = test_ds.df.iloc[i]['caption']

    try:
        image = Image.open(full_path).convert("RGB")
        gen_caption = zero_shot_captioning(image)
    except Exception as e:
        print(f"[Skipped] Error processing {image_path}: {e}")
        continue

    predictions.append(gen_caption.strip())
    references.append([ref_caption.strip()])

# Step 6: Compute metrics
bleu_score = bleu.compute(predictions=predictions, references=references)
meteor_score = meteor.compute(predictions=predictions, references=references)
rouge_score = rouge.compute(predictions=predictions, references=[r[0] for r in references])

# Step 7: Report results
print("\n=== Zero-Shot Evaluation (SmolVLM) ===")
print(f"BLEU Score:   {bleu_score['bleu']:.4f}")
print(f"METEOR Score: {meteor_score['meteor']:.4f}")
print(f"ROUGE-L:      {rouge_score['rougeL']:.4f}")




Evaluating SmolVLM Zero-Shot on Test Set...


100%|██████████| 928/928 [37:58<00:00,  2.45s/it]



=== Zero-Shot Evaluation (SmolVLM) ===
BLEU Score:   0.0289
METEOR Score: 0.1690
ROUGE-L:      0.2175


# **Part-B**


### Image Occlusion Function

This function occludes a given image by masking random patches. The percentage of occlusion is controlled by the `occlusion_pct` parameter, and the patch size is controlled by the `patch_size` parameter.

1. **Function Overview**:
   - The `occlude_image` function takes an image (`img`), an occlusion percentage (`occlusion_pct`), and an optional patch size (`patch_size`).
   - The image is first converted to a NumPy array for easier manipulation.
   - The image dimensions are checked to ensure that they are divisible by the patch size.
   - A number of patches to mask is calculated based on the `occlusion_pct`.
   - Random patches are selected and set to black (occluded).

2. **Steps**:
   - The image is divided into non-overlapping patches of the specified size (`patch_size`).
   - A set number of patches is randomly chosen to be masked (set to black).
   - The function returns the occluded image as a PIL image.

This function can be useful for tasks like testing robustness to occlusion or generating perturbations for training.


In [35]:
import numpy as np
from PIL import Image
import random

def occlude_image(img: Image.Image, occlusion_pct: float, patch_size: int = 16) -> Image.Image:
    img_np = np.array(img)
    h, w, _ = img_np.shape
    assert h % patch_size == 0 and w % patch_size == 0, "Image must be divisible by patch size."

    num_patches = (h // patch_size) * (w // patch_size)
    num_mask = int(occlusion_pct * num_patches)

    patch_coords = [(i, j) for i in range(h // patch_size) for j in range(w // patch_size)]
    masked_coords = random.sample(patch_coords, num_mask)

    for i, j in masked_coords:
        y1, y2 = i * patch_size, (i + 1) * patch_size
        x1, x2 = j * patch_size, (j + 1) * patch_size
        img_np[y1:y2, x1:x2] = 0  # black patch

    return Image.fromarray(img_np)


### Evaluate on Occluded Images

This function evaluates a model on images with varying levels of occlusion. It applies occlusion to each image in the test dataset, generates a caption using the model, and tracks the results.

1. **Function Overview**:
   - The `evaluate_on_occluded_images` function takes the following inputs:
     - `model_type`: Specifies the model being used (e.g., "smolvlm").
     - `occlusion_pct`: The percentage of the image to occlude (e.g., 0.2 for 20% occlusion).
     - `test_df`: The test dataframe containing image filenames and reference captions.
     - `image_dir`: The directory containing the images.
   - The function loops through each image in the test dataset, applies occlusion, and generates a caption.
   - It tracks the predictions, references, occlusion levels, and image indices for later analysis.

2. **Steps**:
   - For each image:
     - The image is loaded and resized.
     - Occlusion is applied using the `occlude_image` function.
     - The model is used to generate a caption for the occluded image.
     - The generated caption, along with the reference caption and other relevant data, is stored for evaluation.
   - If the model type is "smolvlm", the `zero_shot_captioning` function is used; otherwise, the occluded image is saved temporarily, and the model generates a caption using that file.

3. **Outputs**:
   - The function returns four lists:
     - `predictions`: The captions generated by the model.
     - `references`: The reference captions for the images.
     - `levels`: The occlusion percentage applied to each image.
     - `indices`: The indices of the images in the test set (to track which samples were processed).

This function is useful for evaluating how well the model performs under various occlusion levels, which can provide insight into the model’s robustness to partial information.


In [None]:
import tempfile

def evaluate_on_occluded_images(model_type: str, occlusion_pct: float, test_df, image_dir):
    predictions, references, levels, indices = [], [], [], []

    for i in tqdm(range(len(test_df)), desc=f"{model_type} - Occlusion {int(occlusion_pct*100)}%"):
        row = test_df.iloc[i]
        img_path = os.path.join(image_dir, row['filename'])
        ref_caption = row['caption'].strip()

        img = Image.open(img_path).convert("RGB").resize((256, 256))  # Resize for patching
        occluded_img = occlude_image(img, occlusion_pct=occlusion_pct)

        if model_type == "smolvlm":
            try:
                gen_caption = zero_shot_captioning(occluded_img)
            except Exception as e:
                print(f"Error on image {img_path}: {e}")
                continue
        else:
            with tempfile.NamedTemporaryFile(suffix=".png") as tmpfile:# Save occluded image to a temporary file and pass its path
                occluded_img.save(tmpfile.name)
                gen_caption = trainer.generate_caption(tmpfile.name)

        predictions.append(gen_caption.strip())
        references.append([ref_caption])
        levels.append(int(occlusion_pct * 100))
        indices.append(i)  # ✅ Track which sample this was

    return predictions, references, levels, indices


### Compute Evaluation Metrics

This section defines a utility function to compute standard evaluation metrics for image captioning: BLEU, METEOR, and ROUGE-L.

1. **Metric Initialization**:
   - The evaluation metrics are loaded using the `evaluate` library:
     - **BLEU**: Measures n-gram precision between the generated and reference captions.
     - **METEOR**: Considers precision, recall, stemming, and synonymy for a more nuanced evaluation.
     - **ROUGE-L**: Measures the longest common subsequence between the prediction and reference.

2. **`compute_metrics` Function**:
   - Takes in:
     - `predictions`: A list of generated captions.
     - `references`: A list of corresponding reference captions (in nested list format).
   - Computes each metric using the loaded evaluation tools.
   - Returns the scores for BLEU, METEOR, and ROUGE-L as floats.

This function serves as a centralized way to evaluate and compare captioning model performance across different scenarios or perturbations.


In [None]:
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

def compute_metrics(predictions, references):
    bleu_score = bleu.compute(predictions=predictions, references=references)["bleu"]
    meteor_score = meteor.compute(predictions=predictions, references=references)["meteor"]
    rouge_score = rouge.compute(predictions=predictions, references=[r[0] for r in references])["rougeL"]
    return bleu_score, meteor_score, rouge_score


### Evaluate Models at Multiple Occlusion Levels

This section runs a comparative evaluation of two models—`custom` and `smolvlm`—under various levels of image occlusion. It computes captioning performance metrics and prepares a dataset for further analysis in Part C.

1. **Setup**:
   - A list of occlusion levels (`10%`, `50%`, `80%`) is defined.
   - `results` is used to store aggregated metric scores.
   - `full_outputs` stores detailed per-sample results for later use in Part C.

2. **Evaluation Loop**:
   - For each model (`custom` and `smolvlm`) and each occlusion level:
     - Occluded images are evaluated using `evaluate_on_occluded_images`.
     - Metrics (BLEU, METEOR, ROUGE-L) are computed using `compute_metrics`.
     - The results are appended to the `results` list.
     - Each prediction, along with metadata (filename, occlusion level, model type, original and generated captions), is stored in `full_outputs`.

3. **Export Results for Part C**:
   - The `full_outputs` list is converted into a DataFrame.
   - The data is saved as a CSV file named `occlusion_eval_partC.csv`, which can be used for training or evaluating a classifier in Part C.

This block enables systematic robustness evaluation of captioning models under occlusion and prepares labeled data for downstream analysis or classification tasks.


In [None]:
import pandas as pd

occlusion_levels = [0.1, 0.5, 0.8]
results = []
full_outputs = []  # ✅ For Part C

for model_type in ["custom","smolvlm"]:
    for level in occlusion_levels:
        preds, refs, levels, indices = evaluate_on_occluded_images(
            model_type=model_type,
            occlusion_pct=level,
            test_df=test_ds.df,
            image_dir=test_ds.image_dir
        )
        bleu_score, meteor_score, rouge_score = compute_metrics(preds, refs)

        results.append({
            "model": model_type,
            "occlusion_level": int(level * 100),
            "BLEU": bleu_score,
            "METEOR": meteor_score,
            "ROUGE-L": rouge_score,
        })

        # ✅ Save full per-sample data for Part C using correct indices
        for i, idx in enumerate(indices):
            full_outputs.append({
                "filename": test_ds.df.iloc[idx]["filename"],
                "perturbation_level": int(level * 100),
                "original_caption": refs[i][0],
                "generated_caption": preds[i],
                "model": model_type
            })

# ✅ After the loop: save all data to CSV
df_full = pd.DataFrame(full_outputs)
df_full.to_csv("occlusion_eval_partC.csv", index=False)

custom - Occlusion 10%: 100%|██████████| 928/928 [10:52<00:00,  1.42it/s]
custom - Occlusion 50%: 100%|██████████| 928/928 [10:35<00:00,  1.46it/s]
custom - Occlusion 80%: 100%|██████████| 928/928 [10:54<00:00,  1.42it/s]
smolvlm - Occlusion 10%: 100%|██████████| 928/928 [36:59<00:00,  2.39s/it]
smolvlm - Occlusion 50%: 100%|██████████| 928/928 [29:18<00:00,  1.90s/it]
smolvlm - Occlusion 80%: 100%|██████████| 928/928 [28:33<00:00,  1.85s/it]


### Print Robustness Summary

This section displays a formatted summary of the model performance across different occlusion levels.

1. **Loop Through Results**:
   - Iterates over each entry in the `results` list (which contains performance metrics for both `custom` and `smolvlm` models at various occlusion levels).
   
2. **Formatted Output**:
   - For each result, prints the model name, occlusion level, and its corresponding evaluation scores:
     - **BLEU**
     - **METEOR**
     - **ROUGE-L**

This summary gives a quick and clear view of how robust each model is when faced with increasing levels of image occlusion.


In [39]:
print("\n=== Robustness Summary ===")
for res in results:
    print(f"{res['model']} | Occlusion {res['occlusion_level']}% -> "
          f"BLEU: {res['BLEU']:.4f}, METEOR: {res['METEOR']:.4f}, ROUGE-L: {res['ROUGE-L']:.4f}")



=== Robustness Summary ===
custom | Occlusion 10% -> BLEU: 0.0012, METEOR: 0.1200, ROUGE-L: 0.1498
custom | Occlusion 50% -> BLEU: 0.0000, METEOR: 0.1184, ROUGE-L: 0.1487
custom | Occlusion 80% -> BLEU: 0.0000, METEOR: 0.1207, ROUGE-L: 0.1505
smolvlm | Occlusion 10% -> BLEU: 0.0157, METEOR: 0.1399, ROUGE-L: 0.1974
smolvlm | Occlusion 50% -> BLEU: 0.0011, METEOR: 0.0793, ROUGE-L: 0.1453
smolvlm | Occlusion 80% -> BLEU: 0.0002, METEOR: 0.0529, ROUGE-L: 0.0963
