# AML Challenge: Code Submission

**Group: Golden retrieval**
* Valeria Avino (ID: `1905974`)
* Xavier Del Giudice (ID: `1967219`)
* Gabriel Pinos (ID: `1965035`)
* Leonardo Rocci (ID: `1922496`)

---

## 1. Notebook Description

This notebook presents the code and configuration that generated one of our best-performing submissions.

This notebook will:
1.  Import all necessary libraries and custom modules from our `/src` project structure.
2.  Define the main `run_experiment` function that encapsulates the entire pipeline.
3.  Define the final, optimized `CONFIG` dictionary containing the best hyperparameters.
4.  Execute the experiment to train the model and generate the final `submission.csv`.
5.  Load our Optuna study database to demonstrate the hyperparameter tuning process.

## 2. The Experiment Pipeline

We define a single function, `run_experiment`, that runs the end-to-end process: data loading, model construction, training, and submission generation. Each stage is driven by the `CONFIG` dictionary.

The main components and their configurable features are:

### `EmbeddingDataModule`
Our custom data module handles all data loading, preprocessing, and sampling.
* **Data Augmentation:** It supports loading an additional training dataset (`coco_npz_path`) to be merged and shuffled with the original Flickr30 data.
    * *Note: The datasets we load have been **pre-cleaned by us** to remove any training captions (and their associated images) that matched captions in the test set, preventing data leakage.*
* **Flexible Sampling:** We can control the `batch_size` for training and the `val_mrr_batch_size` for validation (which defaults to 100 as per Kaggle's evaluation).
* **Advanced Training Samplers:** The module allows for:
    * `group_captions`: Ensures all captions for a single image are grouped into the same batch.
    * `sample_hard_images`: Uses a pre-computed Faiss index to sample the hardest images for the batch, improving training efficiency.
* **Rich Validation:** We can control the validation strategy with `val_mrr_folds` (how many batches to use) and `val_num_caption_views` (1-5), logging the overall MRR and the MRR for each view separately.
* **Feature Engineering:** Includes an option to use `num_anchors` to compute and include relative embeddings as input features.
* **Optional Standardization:** Can automatically standardize the training data and de-standardize validation and test predictions.
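
To make the `group_captions` option concrete, here is a minimal sketch of batching that keeps all captions of an image together. It is not the module's actual sampler: the function name `grouped_caption_batches` and its arguments are hypothetical, and it only illustrates the grouping idea under the assumption that each caption is tagged with the id of its image.

```python
import random
from collections import defaultdict


def grouped_caption_batches(caption_image_ids, batch_size, seed=0):
    """Yield batches of caption indices so that every caption of an image
    lands in the same batch (hypothetical helper, not the project's sampler)."""
    # Collect the caption indices that belong to each image
    by_image = defaultdict(list)
    for caption_idx, image_id in enumerate(caption_image_ids):
        by_image[image_id].append(caption_idx)

    # Shuffle at the image level so caption groups stay intact
    image_ids = list(by_image)
    random.Random(seed).shuffle(image_ids)

    batch = []
    for image_id in image_ids:
        group = by_image[image_id]
        # Start a new batch if this group would overflow the current one
        if batch and len(batch) + len(group) > batch_size:
            yield batch
            batch = []
        batch.extend(group)
    if batch:
        yield batch


# Toy data: 6 images with 5 captions each
caption_image_ids = [img for img in range(6) for _ in range(5)]
batches = list(grouped_caption_batches(caption_image_ids, batch_size=10))
```

With 5 captions per image, a batch size that is a multiple of 5 fills every batch exactly, which is why `batch_size` is kept a multiple of 5 in the configuration.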

### `MlpConnector` (Model)
The model architecture is fully defined by the config.
* **Architecture:** We can set all architectural parameters, such as `input_dim`, `hidden_dim`, `output_dim`, and the `activation_fun` (e.g., `torch.nn.SiLU`).
* **Hyperparameters:** We also control `dropout_rate` and whether to `normalize_out`.
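
As a rough illustration of how these parameters fit together, the sketch below builds a one-hidden-layer MLP with the same configurable pieces. The class name `MlpConnectorSketch`, the single hidden layer, and the layer ordering are assumptions made for illustration; the actual `MlpConnector` in `/src` may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MlpConnectorSketch(nn.Module):
    """Hypothetical stand-in for the connector: input -> hidden -> output."""

    def __init__(self, input_dim=1024, hidden_dim=2048, output_dim=1536,
                 dropout_rate=0.7, activation_fun=nn.SiLU, normalize_out=True):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            activation_fun(),          # the activation is passed as a class
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, output_dim),
        )
        self.normalize_out = normalize_out

    def forward(self, x):
        out = self.net(x)
        if self.normalize_out:
            # L2-normalize each output vector to unit length
            out = F.normalize(out, dim=-1)
        return out


model = MlpConnectorSketch()
model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    out = model(torch.randn(4, 1024))
```

With `normalize_out=True`, every output row has unit norm, so a `dot` similarity between outputs and normalized image embeddings behaves like cosine similarity.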

### `TranslationTrainer` (Lightning Module)
Our Lightning module wraps the model and contains the core training logic.
* **Flexible Losses:** It can use different loss functions (like `Triplet` or `InfoNCE`), which are applied directly to a similarity matrix. The loss computation is highly configurable:
    * **`multi_positive`**: Can be set to handle multiple positive examples (correct captions for one image) within the loss calculation.
    * **`symmetric`**: Can be set to compute the loss on the transposed similarity matrix as well, effectively adding the inverse task (image-to-caption) to the training objective.
* **Similarity Metric:** The similarity matrix itself is configurable, supporting a standard `dot` product or a kernel-based similarity (with an `alpha` parameter).
* **Dynamic Schedulers:** We implemented custom schedulers for:
    * `BatchModeScheduler`: To dynamically change the batch composition (e.g., start with easy samples and move to hard ones).
    * `AlphaScheduler`: To anneal loss hyperparameters.
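
The interaction between a triplet loss and `multi_positive` can be sketched in plain Python. This is an illustrative assumption about the behavior, not the code in `src/losses`: the function name, the averaging over negatives, and the use of the diagonal as the positive score are all hypothetical choices.

```python
def triplet_loss_multi_positive(sim, image_ids, margin):
    """Hinge-style triplet loss on a square similarity matrix.

    sim[i][j]    : similarity between translated caption i and the image of row j
    image_ids[i] : id of the image that caption i describes
    Columns that share caption i's image are treated as positives and are
    never used as negatives (the multi-positive masking assumed here).
    """
    n = len(sim)
    total, count = 0.0, 0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if image_ids[j] == image_ids[i]:
                continue  # same image: skip, it is not a valid negative
            total += max(0.0, margin + sim[i][j] - pos)
            count += 1
    return total / count if count else 0.0


# Toy batch: captions 0-1 describe image 0, captions 2-3 describe image 1
image_ids = [0, 0, 1, 1]
sim = [
    [0.9, 0.9, 0.8, 0.8],
    [0.9, 0.9, 0.2, 0.2],
    [0.1, 0.1, 0.7, 0.7],
    [0.6, 0.6, 0.7, 0.7],
]
loss = triplet_loss_multi_positive(sim, image_ids, margin=0.2)
```

Without the mask, caption 0 would be penalized against column 1 even though that column holds its own image, which matters once `group_captions` places all captions of an image in the same batch.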

Finally, the function sets up the PyTorch Lightning `Trainer`. While we used **Early Stopping** extensively during hyperparameter tuning (see Section 5), this final notebook uses a custom `StopAtEpochCallback`. This allows us to disable validation, train on the **entire dataset** (Flickr30 + COCO), and stop at the specific epoch we identified as optimal during tuning. The trainer then generates predictions using the final model checkpoint and saves the `submission.csv`.
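
A callback of this kind can be sketched as follows. The real `StopAtEpochCallback` lives in `src.utils` and is not shown here, so the class name `StopAtEpochSketch` and its body are assumptions; the sketch relies only on Lightning's `on_train_epoch_end` hook and the `trainer.should_stop` flag, and falls back to a plain base class so it also runs without Lightning installed.

```python
from types import SimpleNamespace

try:
    import pytorch_lightning as pl
    _CallbackBase = pl.Callback
except ImportError:  # lets the sketch run without Lightning installed
    _CallbackBase = object


class StopAtEpochSketch(_CallbackBase):
    """Hypothetical callback that ends training after `stop_epoch` epochs."""

    def __init__(self, stop_epoch):
        super().__init__()
        self.stop_epoch = stop_epoch

    def on_train_epoch_end(self, trainer, pl_module):
        # current_epoch is 0-based, so index 114 is the 115th epoch
        if trainer.current_epoch + 1 >= self.stop_epoch:
            trainer.should_stop = True


# Exercise the hook with a stand-in object instead of a real Trainer
callback = StopAtEpochSketch(stop_epoch=115)
fake_trainer = SimpleNamespace(current_epoch=113, should_stop=False)
callback.on_train_epoch_end(fake_trainer, pl_module=None)
stopped_early = fake_trainer.should_stop

fake_trainer.current_epoch = 114
callback.on_train_epoch_end(fake_trainer, pl_module=None)
stopped_at_target = fake_trainer.should_stop
```

Stopping on a fixed epoch rather than on a monitored metric is what makes it possible to train without a validation split.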

In [3]:
# Environment setup and imports
import torch
import wandb
import pytorch_lightning as pl
from dotenv import load_dotenv
import importlib
from src import utils, data, models, trainer, losses

# Reload modules in case they were edited during development
importlib.reload(utils)
importlib.reload(data)
importlib.reload(models)
importlib.reload(trainer)
importlib.reload(losses)

from src.utils import set_seed
from src.data import EmbeddingDataModule
from src.models.load_model import build_model_from_config
from src.trainer import TranslationTrainer
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint, LearningRateMonitor
from src.utils import StopAtEpochCallback, save_submission

load_dotenv()
wandb.login()


def run_experiment(config: dict):
    """
    Launch a full training + validation + submission pipeline.
    """
    
    # Setup & Logging
    set_seed(config["seed"])
    wandb.init(project=config["project_name"], name=config["run_name"], config=config)
    wandb_logger = WandbLogger(log_model=False)

    # Data
    datamodule = EmbeddingDataModule(
        data_path = config["data_path"],
        coco_npz_path = config["coco_npz_path"],
        batch_size=config["batch_size"],
        seed=config["seed"],
        standardize=config.get("standardize", False),
        val_mrr_batch_size=config.get("val_mrr_batch_size", 100),
        val_mrr_folds=config.get("val_mrr_folds", 15),
        val_num_caption_views=config.get("val_num_caption_views", 5),
        num_anchors=config.get("num_anchors", 0),
        group_captions=config.get("group_captions", False),
        sample_hard_images=config.get("sample_hard_images", False),
    )
    datamodule.setup()

    # Model
    print(f"\n Building model: {config['model'].get('type', 'MLPTranslator')}")
    model = build_model_from_config(config["model"])

    # Lightning Module
    val_data = {
        "gallery": datamodule.val_image_gallery,
        "mu_y": datamodule.mu_y,
        "std_y": datamodule.std_y,
    }
    alpha_scheduler = None
    batch_scheduler  = None
    if config.get("alpha_scheduler"):
        from src.utils import AlphaScheduler
        alpha_scheduler = AlphaScheduler(**config["alpha_scheduler"])
    if config.get("batch_scheduler"):
        from src.utils import BatchModeScheduler
        batch_scheduler = BatchModeScheduler(**config["batch_scheduler"])
    lightning_module = TranslationTrainer(
        model=model,
        config={
            **config,
            "alpha_scheduler": alpha_scheduler,
            "batch_scheduler": batch_scheduler,   
        },
        val_data=val_data,
    )

    # Callbacks
    # Commented out early stopping and best model checkpointing to always save last model, for reproducibility
    # early_stopping = EarlyStopping(**config["early_stopping"], verbose=True)
    # checkpoint = ModelCheckpoint(
    #     monitor=config["early_stopping"]["monitor"],
    #     mode=config["early_stopping"]["mode"],
    #     dirpath="checkpoints",
    #     filename=f"{config['run_name']}-best-model",
    #     save_top_k=1,
    # )
    checkpoint = ModelCheckpoint(
        dirpath="checkpoints",
        save_top_k=0,
        save_last=True
        )
    
    lr_monitor = LearningRateMonitor(logging_interval="step")
    stop_callback = StopAtEpochCallback(stop_epoch=115) # Best model at epoch 114
    
    # Trainer
    trainer = pl.Trainer(
        **config["trainer"],
        logger=wandb_logger,
        callbacks=[checkpoint, lr_monitor, stop_callback], #early_stopping
    )

    # Training
    print("Starting training.")
    trainer.fit(lightning_module, datamodule=datamodule)

    # Prediction & Submission
    print("Training finished. Generating predictions with the best model.")
    predictions = trainer.predict(datamodule=datamodule, ckpt_path="last")
    all_preds = torch.cat(predictions, dim=0)

    submission_file = f"submission_{config['run_name']}.csv"
    save_submission(datamodule.test_ids, all_preds, filename=submission_file)

    print(f"Run complete. Submission saved to {submission_file}")
    wandb.finish()

[34m[1mwandb[0m: Currently logged in as: [33mdelgiudice-1967219[0m ([33mdelgiudice-1967219-sapienza-universit-di-roma[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


## 3. Final Configuration + Submission Generation

This is the `CONFIG` dictionary containing the optimized hyperparameters for our best run. These values were determined through extensive tuning (shown in Section 5).

In [4]:
CONFIG = {
    # Project info
    "project_name": "Triplet",
    "run_name": "Golden_Submission",
    "seed": 42,

    # Data 
    "data_path": "data",
    "coco_npz_path": "data/coco_25.npz",
    "batch_size": 2000,            # multiple of 5 (for grouped captions)
    "standardize": False,
    "group_captions": True,        # grouping captions by image
    "sample_hard_images": False,
    
    
    # Validation / MRR
    "val_mrr_batch_size": 0,
    "val_mrr_folds": 0,
    "val_num_caption_views": 0,
    "val_log_chunks": False,
    "num_anchors": 0,

    # Model
    "model": {
        "type": "MlpConnector",
        "input_dim": 1024,
        "output_dim": 1536,
        "hidden_dim": 2048,
        "dropout_rate": 0.7042707796929302,
        "normalize_out": True,
        "activation_fun": torch.nn.SiLU,
    },

    # Loss
    "loss_name": "triplet",
    "margin": 1.164467807751587,
    "similarity_type": "dot",   
    "symmetric": False,
    "multi_positive": True,
    "use_mse": True,
    "lambda_mse" : 7.876500463590652,
    
    
    # Optimization
    "optimizer": {
        "lr": 0.00030361953337802,
        "weight_decay": 9.41586739869854e-06,
    },

    # Scheduler
    "use_lr_scheduler": True,
    "lr_scheduler_type": "cosine",

    # Early stopping
    # "early_stopping": {
    #     "monitor": "val/mrr_overall_avg",
    #     "mode": "max",
    #     "patience": 100,
    # },

    # Trainer
    "trainer": {
        "max_epochs": 200,          
        "accelerator": "auto",
        "precision": 16,
        "gradient_clip_val": 1.089126334404685,
    },
}

## 4. Run Final Training & Prediction

With the helper function and the final configuration defined, this cell executes the full pipeline and generates the submission file.

In [5]:
# Launch the experiment
run_experiment(CONFIG)

Setting up data module...
Loading BASE training data...
Loading COCO training data...
Loaded 125000 COCO captions and 25000 unique images.
Combined datasets. Total images: 50000, Total captions: 250000
Data split: 250000 train captions, 0 val captions.
Building image-centric structures for training sampler...
Built caption lookup for 50000 training images.
Loading test data...
Test data: 1500 captions.
Data setup complete. MRR config -> batch_size=0, folds=0, views=0, N_used=0


c:\Users\xavie\Desktop\challenge_last\.venv\Lib\site-packages\lightning_fabric\connector.py:571: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
You are using a CUDA device ('NVIDIA GeForce RTX 4070') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
c:\Users\xavie\Desktop\challenge_last\.venv\Lib\site-packages\pytorch_lightning\loggers\wandb.py:397: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogge


 Building model: MlpConnector
Created MLPConnector: 1024 -> 2048 -> 1536
Starting training.
Data module already set up. Skipping redundant setup.
Skipping mixed-equal MRR loader (need 5 views).

c:\Users\xavie\Desktop\challenge_last\.venv\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:433: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
c:\Users\xavie\Desktop\challenge_last\.venv\Lib\site-packages\pytorch_lightning\utilities\data.py:106: Total length of `DataLoader` across ranks is zero. Please make sure this was your intention.
c:\Users\xavie\Desktop\challenge_last\.venv\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:433: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Using GroupedImageBatchSampler for training.
Epoch 114: 100%|██████████| 125/125 [00:03<00:00, 37.01it/s, v_num=pczt, train/loss_step=0.506, train/loss_epoch=0.525]
 StopAtEpochCallback: Reached epoch 115. Stop training.
Epoch 114: 100%|██████████| 125/125 [00:03<00:00, 35.54it/s, v_num=pczt, train/loss_step=0.506, train/loss_epoch=0.520]

Restoring states from the checkpoint path at C:\Users\xavie\Desktop\challenge_last\checkpoints\last-v1.ckpt



Training finished. Generating predictions with the best model.
Data module already set up. Skipping redundant setup.


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at C:\Users\xavie\Desktop\challenge_last\checkpoints\last-v1.ckpt
c:\Users\xavie\Desktop\challenge_last\.venv\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:433: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00, 498.91it/s]


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Submission saved to submission_Golden_Submission.csv (1500 rows)
Run complete. Submission saved to submission_Golden_Submission.csv


Run history:
epoch: ▁▁▁▂▂▂▂▂▃▃▄▄▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇█
lr-AdamW: ██████████▇▇▇▇▇▇▇▇▇▆▅▅▅▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁
train/loss_epoch: █▆▅▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁
train/loss_step: █▇▇▅▆▄▄▃▃▄▃▄▄▃▃▃▄▃▄▃▃▃▂▃▂▃▂▃▂▂▁▂▁▁▁▂▁▂▁▁
trainer/global_step: ▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇██

Run summary:
epoch: 114.0
lr-AdamW: 0.00012
train/loss_epoch: 0.5202
train/loss_step: 0.54664
trainer/global_step: 14374.0
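
The final `lr-AdamW` value logged above (0.00012) is consistent with a plain cosine schedule stopped partway through. The check below is a sketch under assumptions not confirmed by the source: no warmup, a minimum learning rate of 0, and a schedule length equal to `max_epochs` (200). The helper name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(base_lr, epoch, max_epochs, min_lr=0.0):
    """Cosine-annealed learning rate at a given (0-based) epoch."""
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * epoch / max_epochs))
    return min_lr + (base_lr - min_lr) * cos_factor


# Base learning rate from CONFIG, evaluated at the last trained epoch (114)
lr_at_stop = cosine_lr(base_lr=0.00030361953337802, epoch=114, max_epochs=200)
print(f"{lr_at_stop:.5f}")
```

Because training stops at epoch 115 of a 200-epoch schedule, the learning rate never decays to its minimum; the model is taken from roughly the middle of the cosine curve.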


### Second best submission (other hyperparameters)

In [6]:
CONFIG = {
    # Project info
    "project_name": "Triplet",
    "run_name": "Golden2",
    "seed": 42,

    # Data 
    "data_path": "data",
    "coco_npz_path": "data/coco_25.npz",
    "batch_size": 2000,            # multiple of 5 (for grouped captions)
    "standardize": False,
    "group_captions": True,        # grouping captions by image
    "sample_hard_images": False,
    
    
    # Validation / MRR
    "val_mrr_batch_size": 0,
    "val_mrr_folds": 0,
    "val_num_caption_views": 0,
    "val_log_chunks": False,
    "num_anchors": 0,

    # Model
    "model": {
        "type": "MlpConnector",
        "input_dim": 1024,
        "output_dim": 1536,
        "hidden_dim": 2048,
        "dropout_rate": 0.88,
        "normalize_out": True,
        "activation_fun": torch.nn.SiLU,
    },

    # Loss
    "loss_name": "triplet",
    "margin": 0.883,
    "similarity_type": "dot",   
    "symmetric": False,
    "multi_positive": True,
    "use_mse": True,
    "lambda_mse" : 3.7,
    
    
    # Optimization
    "optimizer": {
        "lr": 0.0005048222945628928,
        "weight_decay": 1.135105292535283e-05,
    },

    # Scheduler
    "use_lr_scheduler": True,
    "lr_scheduler_type": "cosine",

    # Early stopping
    # "early_stopping": {
    #     "monitor": "val/mrr_overall_avg",
    #     "mode": "max",
    #     "patience": 100,
    # },

    # Trainer
    "trainer": {
        "max_epochs": 200,          
        "accelerator": "auto",
        "precision": 16,
        "gradient_clip_val": 0.5,
    },
}

In [None]:
run_experiment(CONFIG)

## 5. Hyperparameter Tuning (Optuna)

To show our tuning process, we load the Optuna study from its SQLite database. This shows the best trial and its parameters, which informed our final `CONFIG` above.
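
Optuna reports trial parameters as a flat dictionary, while `CONFIG` nests some of them under `model`, `optimizer`, and `trainer`. A minimal sketch of that mapping is shown below; the helper `apply_trial_params` and the skeleton `base_config` are hypothetical, the numeric values are rounded for illustration, and the placement of each key is read off the final `CONFIG`.

```python
import copy


def apply_trial_params(base_config, params):
    """Return a copy of base_config with flat trial parameters written
    into the nested sections where CONFIG stores them (hypothetical helper)."""
    config = copy.deepcopy(base_config)
    for key, value in params.items():
        if key in ("lr", "weight_decay"):
            config["optimizer"][key] = value
        elif key in ("dropout_rate", "hidden_dim"):
            config["model"][key] = value
        elif key == "gradient_clip_val":
            config["trainer"][key] = value
        else:
            # margin, lambda_mse and batch_size live at the top level
            config[key] = value
    return config


# Hypothetical skeleton holding only the sections touched by tuning
base_config = {"model": {}, "optimizer": {}, "trainer": {"max_epochs": 200}}
params = {"lr": 3e-4, "dropout_rate": 0.7, "margin": 1.16, "gradient_clip_val": 1.09}
tuned = apply_trial_params(base_config, params)
```

In the notebook, the same helper could be called as `apply_trial_params(base_config, study.best_trial.params)` once the study has been loaded.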

In [7]:
import optuna
import pandas as pd
import os

# Configuration
DB_NAME = "tuning.db" 
STUDY_NAME = "tri_tuning_study_5"
DB_PATH = f"sqlite:///{DB_NAME}"

if not os.path.exists(DB_NAME):
    print(f"Error: Database file not found at '{DB_NAME}'")
    print("Please make sure the .db file is in the same directory as this notebook.")
else:
    try:
        # Load the study
        study = optuna.load_study(
            study_name=STUDY_NAME,
            storage=DB_PATH
        )
    
        print(f"Successfully loaded study '{STUDY_NAME}' from '{DB_PATH}'")
        print(f"Total trials: {len(study.trials)}")
        
        # Show the best trial
        print("\n Best Trial")
        best_trial = study.best_trial
        print(f"Value (Validation Score): {best_trial.value}")
        
        print("\nBest Parameters Found:")
        for key, value in best_trial.params.items():
            print(f"  {key}: {value}")
    
        # Show a DataFrame of the top 5 trials
        print("\n Top 5 Trials")
        df = study.trials_dataframe()
        print(df.sort_values(by="value", ascending=False).head(5))
    
    except Exception as e:
        print("Could not load Optuna study. Please check DB_PATH and STUDY_NAME.")
        print(f"Error: {e}")



Successfully loaded study 'tri_tuning_study_5' from 'sqlite:///tuning.db'
Total trials: 62

 Best Trial
Value (Validation Score): 0.9040225744247437

Best Parameters Found:
  lr: 0.00030361953337802
  weight_decay: 9.41586739869854e-06
  dropout_rate: 0.7042707796929302
  margin: 1.164467807751587
  lambda_mse: 7.876500463590652
  hidden_dim: 2048
  batch_size: 2000
  gradient_clip_val: 1.089126334404685

 Top 5 Trials
    number     value             datetime_start          datetime_complete  \
24      24  0.904023 2025-11-15 12:56:26.906317 2025-11-15 13:09:07.610189   
52      52  0.903479 2025-11-15 14:22:22.030465 2025-11-15 14:31:38.113688   
17      17  0.902143 2025-11-15 12:24:49.912719 2025-11-15 12:36:40.454871   
53      53  0.902033 2025-11-15 14:31:38.124690 2025-11-15 14:40:24.292023   
54      54  0.901853 2025-11-15 14:40:24.303531 2025-11-15 14:49:44.216120   

                 duration  params_batch_size  params_dropout_rate  \
24 0 days 00:12:40.703872              