# Baseline Modeling & Evaluation for Performance Benchmarks

This notebook evaluates several baseline image captioning models (Random, Most Common, and Nearest Neighbor) to provide a performance benchmark.

## Setup and Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import numpy as np
import pandas as pd
import logging
from typing import Dict, List, Tuple

from vtt.data.caption_preprocessing import load_and_clean_captions
from vtt.data.data_loader import load_split_datasets
from vtt.data.image_preprocessing import load_features 
from vtt.evaluation.evaluate import evaluate_captions
from vtt.baselines import (
    generate_random_captions,
    generate_most_common_caption,
    generate_nearest_neighbor_captions
)


In [5]:
# Configure logging
logging.basicConfig(
    level=logging.INFO, 
    format='%(asctime)s - %(levelname)s: %(message)s', 
    datefmt='%Y-%m-%d %H:%M:%S'
)
logger = logging.getLogger(__name__)

logger.info("Notebook setup complete and logging configured.")

2025-07-20 13:49:35 - INFO: Notebook setup complete and logging configured.


## Define Paths and Load Data

In [6]:
logger.info("Defining data paths.")
dataset_name = "flickr8k"

features_path = f"../data/processed/{dataset_name}_features.npz"
captions_path = f"../data/processed/{dataset_name}_padded_caption_sequences.npz"
tokenizer_path = f"../data/processed/{dataset_name}_tokenizer.json"
captions_file = f"../data/raw/{dataset_name}_captions.csv"
logger.info("Finished defining paths.")

2025-07-20 13:49:37 - INFO: Defining data paths.
2025-07-20 13:49:37 - INFO: Finished defining paths.


In [7]:
# Load data in numpy format to easily access image IDs, features, and captions
logger.info("Loading dataset splits and raw captions...")
train_data_np, val_data_np, test_data_np = load_split_datasets(
    features_path=features_path,
    captions_path=captions_path,
    batch_size=64, # Batch size not directly relevant for numpy return, but kept for consistency
    val_split=0.15,
    test_split=0.10,
    shuffle=True,
    buffer_size=1000,
    seed=42,
    cache=False, # No need to cache if returning numpy arrays
    return_numpy=True # Crucial for accessing raw numpy arrays (features, captions, IDs)
)

# Unpack the numpy arrays for easier access
train_features_all_samples, train_caption_seqs_all_samples, train_image_ids_all_samples = train_data_np
test_features_all_samples, test_caption_seqs_all_samples, test_image_ids_all_samples = test_data_np

logger.info("Finished loading.")

2025-07-20 13:49:44 - INFO: Loading dataset splits and raw captions...
2025-07-20 13:50:15 - INFO: Finished loading.



--- Dataset Split Sizes (number of individual samples) ---
Total samples loaded: 38008
Train samples: 28507
Validation samples: 5701
Test samples: 3800
----------------------------------------------------------
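
The split sizes above count individual (image, caption) samples. How `load_split_datasets` assigns samples to splits is not shown in this notebook, so the following is only a minimal sketch, using a hypothetical helper `split_by_image_id`, of how a split could be made per unique image ID so that all captions of one image land in the same split. The function name, its parameters, and the toy IDs are assumptions for illustration, not part of the `vtt` package.

```python
import random
from typing import Dict, List, Sequence


def split_by_image_id(
    sample_image_ids: Sequence[str],
    val_split: float = 0.15,
    test_split: float = 0.10,
    seed: int = 42,
) -> Dict[str, List[int]]:
    """Return sample indices per split so that no image ID spans two splits.

    Hypothetical helper for illustration; not the vtt implementation.
    """
    # Shuffle the unique image IDs, not the individual (image, caption) samples.
    unique_ids = sorted(set(sample_image_ids))
    random.Random(seed).shuffle(unique_ids)

    n_test = int(len(unique_ids) * test_split)
    n_val = int(len(unique_ids) * val_split)
    test_ids = set(unique_ids[:n_test])
    val_ids = set(unique_ids[n_test:n_test + n_val])

    # Route every sample to the split that owns its image ID.
    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for idx, img_id in enumerate(sample_image_ids):
        if img_id in test_ids:
            splits["test"].append(idx)
        elif img_id in val_ids:
            splits["val"].append(idx)
        else:
            splits["train"].append(idx)
    return splits


# Toy data: 20 images with 5 captions each (100 samples in total).
toy_ids = [f"img_{i:02d}.jpg" for i in range(20) for _ in range(5)]
toy_splits = split_by_image_id(toy_ids)
train_images = {toy_ids[i] for i in toy_splits["train"]}
test_images = {toy_ids[i] for i in toy_splits["test"]}
print({name: len(indices) for name, indices in toy_splits.items()})
print("Images shared by train and test:", train_images & test_images)
```

Splitting on image IDs rather than on samples means the number of samples per split only approximates the requested fractions, but a test image can never contribute captions or features to the training side.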



In [8]:
# Load all raw (cleaned) captions for reference and pooling
# This dictionary is {unique_image_id: [list of cleaned captions]}
logger.info("Loading and cleaning all raw captions (for references and pool)...")
clean_captions_dict = load_and_clean_captions(captions_file)

# Load pre-extracted image features as a dictionary for Nearest Neighbor baseline
# This dictionary is {unique_image_id: single_feature_vector}
logger.info("Loading all unique image features...")
all_unique_image_features_dict = load_features(features_path)

# Create a flat pool of all training captions for random and most common baselines
# This flattens the list of lists of captions for all UNIQUE training images.
# We get unique train_image_ids first to ensure we don't duplicate captions if
# train_image_ids_all_samples has duplicates due to multiple captions per image.
unique_train_image_ids_for_pool = list(set(train_image_ids_all_samples))
training_captions_pool = [
    caption for img_id in unique_train_image_ids_for_pool
    if img_id in clean_captions_dict
    for caption in clean_captions_dict[img_id]
]

if not training_captions_pool:
    # Stop early: the Random and Most Common baselines cannot run on an empty pool.
    raise ValueError("No training captions found for the pool. Check data loading or paths.")

logger.info(f"Prepared a pool of {len(training_captions_pool)} training captions for random and most common assignment.")

# Get unique test image IDs for evaluation (these are the images we need to caption)
unique_test_image_ids = list(set(test_image_ids_all_samples))
logger.info(f"Total unique test images to evaluate: {len(unique_test_image_ids)}.")

results = {} # Dictionary to store evaluation scores for all baselines

2025-07-20 13:51:19 - INFO: Loading and cleaning all raw captions (for references and pool)...
2025-07-20 13:51:19 - INFO: Loading all unique image features...
2025-07-20 13:51:26 - INFO: Prepared a pool of 37957 training captions for random and most common assignment.
2025-07-20 13:51:26 - INFO: Total unique test images to evaluate: 3158.


## Run Random Caption Model and Evaluate

In [9]:
logger.info("\n--- Running Random Caption Model ---")

random_assignments = generate_random_captions(unique_test_image_ids, training_captions_pool)
random_scores = evaluate_captions(clean_captions_dict, random_assignments)
results['Random'] = random_scores

logger.info("Random Caption Model Results:")
for metric, value in random_scores.items():
    print(f"  {metric}: {value:.4f}")

2025-07-20 13:52:11 - INFO: 
--- Running Random Caption Model ---
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2025-07-20 13:57:28 - INFO: Random Caption Model Results:


  BLEU-1: 0.3955
  BLEU-2: 0.1748
  BLEU-3: 0.0778
  BLEU-4: 0.0474
  METEOR: 0.2580
  BERTScore_P: 0.8735
  BERTScore_R: 0.8733
  BERTScore_F1: 0.8733
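
The implementation of `generate_random_captions` lives in `vtt.baselines` and is not shown here. As a rough illustration of what such a baseline does, the sketch below draws one caption uniformly at random from the pool for each image. The function name `random_caption_baseline`, the `seed` parameter, and the returned `{image_id: caption}` shape are assumptions.

```python
import random
from typing import Dict, Optional, Sequence


def random_caption_baseline(
    image_ids: Sequence[str],
    caption_pool: Sequence[str],
    seed: Optional[int] = None,
) -> Dict[str, str]:
    """Assign each image a caption drawn uniformly at random from the pool.

    Hypothetical sketch; the real vtt.baselines function may differ.
    """
    if not caption_pool:
        raise ValueError("caption_pool must not be empty")
    rng = random.Random(seed)  # a local RNG keeps the draw reproducible
    return {img_id: rng.choice(caption_pool) for img_id in image_ids}


toy_pool = [
    "a dog runs through the grass",
    "a man climbs a rock wall",
    "two children play in the water",
]
toy_assignments = random_caption_baseline(["img_1", "img_2", "img_3"], toy_pool, seed=0)
for img_id, caption in toy_assignments.items():
    print(img_id, "->", caption)
```

Passing a fixed seed makes the random baseline's scores repeatable between runs, which helps when the numbers are copied into a report.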


## Run Most Common Caption Model and Evaluate

In [10]:
logger.info("\n--- Running Most Common Caption Model ---")

most_common_assignments = generate_most_common_caption(unique_test_image_ids, training_captions_pool)
most_common_scores = evaluate_captions(clean_captions_dict, most_common_assignments)
results['Most Common'] = most_common_scores

logger.info("Most Common Caption Model Results:")
for metric, value in most_common_scores.items():
    print(f"  {metric}: {value:.4f}")

2025-07-20 13:58:32 - INFO: 
--- Running Most Common Caption Model ---
2025-07-20 14:01:04 - INFO: Most Common Caption Model Results:


  BLEU-1: 0.4381
  BLEU-2: 0.1852
  BLEU-3: 0.0978
  BLEU-4: 0.0659
  METEOR: 0.2692
  BERTScore_P: 0.9002
  BERTScore_R: 0.8782
  BERTScore_F1: 0.8890
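
For orientation, a most-common-caption baseline can be sketched in a few lines: count every caption string in the pool and assign the single most frequent one to all images. This is an assumption about what `generate_most_common_caption` does, based only on its name; `most_common_caption_baseline` below is a hypothetical stand-in.

```python
from collections import Counter
from typing import Dict, Sequence


def most_common_caption_baseline(
    image_ids: Sequence[str],
    caption_pool: Sequence[str],
) -> Dict[str, str]:
    """Assign the most frequent caption in the pool to every image.

    Hypothetical sketch; the real vtt.baselines function may differ.
    """
    if not caption_pool:
        raise ValueError("caption_pool must not be empty")
    # most_common(1) returns a list holding one (caption, count) pair.
    top_caption, _count = Counter(caption_pool).most_common(1)[0]
    return {img_id: top_caption for img_id in image_ids}


toy_pool = [
    "a dog runs through the grass",
    "a man climbs a rock wall",
    "a dog runs through the grass",
]
toy_assignments = most_common_caption_baseline(["img_1", "img_2"], toy_pool)
print(toy_assignments)
```

Because every image receives the same sentence, this baseline measures how far a model can get from dataset-level word statistics alone, without looking at any image.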


## Run Nearest Neighbor Image Caption Model and Evaluate

In [11]:
logger.info("\n--- Running Nearest Neighbor Image Caption Model (k=1) ---")

# Filter unique image IDs that actually have features loaded
train_image_ids_for_nn = [img_id for img_id in unique_train_image_ids_for_pool if img_id in all_unique_image_features_dict]
test_image_ids_for_nn = [img_id for img_id in unique_test_image_ids if img_id in all_unique_image_features_dict]

# Create feature dictionaries containing only relevant (unique) image features
train_features_for_nn = {
    img_id: all_unique_image_features_dict[img_id]
    for img_id in train_image_ids_for_nn
}
test_features_for_nn = {
    img_id: all_unique_image_features_dict[img_id]
    for img_id in test_image_ids_for_nn
}

# Define number of neighbors
k = 1

nn_assignments = generate_nearest_neighbor_captions(
    test_image_ids_for_nn,
    train_image_ids_for_nn,
    train_features_for_nn,
    test_features_for_nn,
    clean_captions_dict,
    k_neighbors=k
)
nn_scores = evaluate_captions(clean_captions_dict, nn_assignments)
results[f'Nearest Neighbor (k={k})'] = nn_scores

logger.info(f"Nearest Neighbor (k={k}) Model Results:")
for metric, value in nn_scores.items():
    print(f"  {metric}: {value:.4f}")

2025-07-20 14:06:49 - INFO: 
--- Running Nearest Neighbor Image Caption Model (k=1) ---
2025-07-20 14:11:55 - INFO: Nearest Neighbor (k=1) Model Results:


  BLEU-1: 0.9975
  BLEU-2: 0.9967
  BLEU-3: 0.9963
  BLEU-4: 0.9962
  METEOR: 0.9966
  BERTScore_P: 0.9260
  BERTScore_R: 0.9253
  BERTScore_F1: 0.9256
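
The retrieval step behind a k=1 nearest-neighbor baseline can be sketched as follows. The distance measure used by `generate_nearest_neighbor_captions` is not shown in this notebook; cosine similarity, the function names, and the choice of the first caption of the matched image are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_caption_baseline(
    test_features: Dict[str, Sequence[float]],
    train_features: Dict[str, Sequence[float]],
    captions_by_image: Dict[str, List[str]],
) -> Dict[str, str]:
    """For each test image, copy a caption from its most similar training image (k=1).

    Hypothetical sketch; the real vtt.baselines function may differ.
    """
    assignments: Dict[str, str] = {}
    for test_id, test_vec in test_features.items():
        # Pick the training image whose feature vector is closest in angle.
        best_train_id = max(
            train_features,
            key=lambda train_id: cosine_similarity(test_vec, train_features[train_id]),
        )
        assignments[test_id] = captions_by_image[best_train_id][0]
    return assignments


toy_train = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_test = {"t1": [0.9, 0.1], "t2": [0.2, 0.8]}
toy_captions = {"a": ["a dog runs on grass"], "b": ["a child plays in water"]}
toy_assignments = nearest_neighbor_caption_baseline(toy_test, toy_train, toy_captions)
print(toy_assignments)
```

In a sketch like this, if an image ID appears in both `train_features` and `test_features`, the test image matches itself with similarity 1.0 and receives one of its own reference captions, so the score says nothing about generalization.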


## Compile Summary Table

In [15]:
logger.info("\n--- Baseline Model Performance Summary --- ")

results_df = pd.DataFrame(results).T # Transpose to have models as rows

# Define preferred order of metrics for display
ordered_columns = [
    'BLEU-1', 'BLEU-2', 'BLEU-3', 'BLEU-4',
    'METEOR',
    'BERTScore_P', 'BERTScore_R', 'BERTScore_F1'
]
# Filter to only include columns that actually exist in the DataFrame
final_columns = [col for col in ordered_columns if col in results_df.columns]
results_df = results_df[final_columns]

# Format numbers for display (column-wise Series.map avoids the deprecated DataFrame.applymap)
results_df_formatted = results_df.apply(lambda col: col.map(lambda x: f"{x:.4f}"))
results_df_formatted

2025-07-20 14:15:55 - INFO: 
--- Baseline Model Performance Summary --- 


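|                        | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | BERTScore_P | BERTScore_R | BERTScore_F1 |
|:-----------------------|:-------|:-------|:-------|:-------|:-------|:------------|:------------|:-------------|
| Random                 | 0.3955 | 0.1748 | 0.0778 | 0.0474 | 0.2580 | 0.8735      | 0.8733      | 0.8733       |
| Most Common            | 0.4381 | 0.1852 | 0.0978 | 0.0659 | 0.2692 | 0.9002      | 0.8782      | 0.8890       |
| Nearest Neighbor (k=1) | 0.9975 | 0.9967 | 0.9963 | 0.9962 | 0.9966 | 0.9260      | 0.9253      | 0.9256       |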


In [None]:
# Print as Markdown table (good for copying to reports)
print(results_df_formatted.to_markdown(numalign="left", stralign="left"))

2025-07-20 14:15:14 - INFO: 
--- Baseline Model Evaluation Finished ---


|                        | BLEU-1   | BLEU-2   | BLEU-3   | BLEU-4   | METEOR   | BERTScore_P   | BERTScore_R   | BERTScore_F1   |
|:-----------------------|:---------|:---------|:---------|:---------|:---------|:--------------|:--------------|:---------------|
| Random                 | 0.3955   | 0.1748   | 0.0778   | 0.0474   | 0.258    | 0.8735        | 0.8733        | 0.8733         |
| Most Common            | 0.4381   | 0.1852   | 0.0978   | 0.0659   | 0.2692   | 0.9002        | 0.8782        | 0.889          |
| Nearest Neighbor (k=1) | 0.9975   | 0.9967   | 0.9963   | 0.9962   | 0.9966   | 0.926         | 0.9253        | 0.9256         |


### Interpretation

* Random and Most Common: These are realistic lower bounds for a generative image captioning model. Any successful generative model should significantly outperform them.
* Nearest Neighbor (k=1): This is a valid baseline, but its scores need careful interpretation. BLEU and METEOR values above 0.99 are close to perfect and highly unusual for a typical generative captioning task.
    * What this likely means:
        * Retrieval, not generation: This baseline acts as a retrieval system, not a generative one. For each test image, it finds the most visually similar image in the training set and assigns one of the ground-truth reference captions associated with that training image.
        * Near-duplicate images: After verifying that the data splitting process does not allow data leakage and that there is no direct overlap or duplication, the most likely reason for the near-perfect scores is that the test images are extremely similar to the training images, which leads to very similar image feature vectors. Visual inspection of the images supports this.
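
The no-overlap statement above is worth re-checking against the logged counts. The caption pool built from unique training image IDs holds 37957 captions while the training split has only 28507 samples, which suggests that some training images also have captions in the validation or test samples. A BLEU-4 of 0.9962 also means the assigned captions match a reference almost word for word. The helper below is a hypothetical sketch (the name `split_overlap_report` is an assumption) that counts image IDs shared by two splits.

```python
from typing import Dict, Iterable


def split_overlap_report(
    train_ids: Iterable[str],
    test_ids: Iterable[str],
) -> Dict[str, float]:
    """Count unique image IDs per split and how many appear in both.

    Hypothetical helper for illustration; not part of the vtt package.
    """
    train_set, test_set = set(train_ids), set(test_ids)
    shared = train_set & test_set
    return {
        "unique_train": len(train_set),
        "unique_test": len(test_set),
        "shared": len(shared),
        # Fraction of test images that the training side has also seen.
        "shared_fraction_of_test": len(shared) / len(test_set) if test_set else 0.0,
    }


# Toy per-sample ID lists: image "c" has samples in both splits.
report = split_overlap_report(["a", "a", "b", "c"], ["c", "d"])
print(report)
```

In this notebook the same check could be run on `train_image_ids_all_samples` and `test_image_ids_all_samples`. A non-zero `shared` count would mean the nearest-neighbor baseline can retrieve the test image itself, and the reported scores would then reflect leakage rather than visual similarity.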