# **SBERT Training**

In my work on the MS MARCO Passage Ranking dataset, I manage several files that each serve a distinct role in building a training and evaluation pipeline for my information retrieval model. Here’s how I organize and utilize the different files:

Collection

What It Is:
A large file that contains all the text in the dataset—specifically, each passage (or document). For Passage Ranking, it typically consists of rows formatted as pid<TAB>passage_text, linking a passage ID (pid) to its actual text.

Usage:
I use the collection file to look up the actual text for each passage by its ID.

Queries

What It Is:
A list of real user queries that have been anonymized.

Typical Format:
The file is usually formatted as qid<TAB>query_text, mapping each query ID (qid) to its corresponding query.

Usage:
I retrieve passages for these queries during the training and evaluation phases of my model.

Qrels.train

What It Is:
The relevance judgments (qrels) for the training queries. This file uses the TREC format, such as qid 0 pid relevance_label.

Usage:
It tells me which (qid, pid) pairs are actually relevant. For pointwise training, I treat these pairs as positive examples (label=1) and assume that any other (qid, pid) pair in my candidate set is negative (label=0).

Qrels.dev

What It Is:
The relevance judgments for the development (or validation) queries.

Usage:
I use the dev set to evaluate how well my model generalizes to unseen data. For each dev query, I can identify which passages are relevant (label=1) and compute metrics like MRR or nDCG.
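To make the MRR metric mentioned above concrete, here is a minimal sketch of MRR@k on toy data. The function names and the IDs are my own placeholders for illustration; they are not part of MS MARCO or of any library.

```python
def reciprocal_rank(ranked_pids, relevant_pids, k=10):
    """Return 1/rank of the first relevant passage within the top k, else 0.0."""
    for rank, pid in enumerate(ranked_pids[:k], start=1):
        if pid in relevant_pids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, qrels, k=10):
    """Average the reciprocal rank over every query in `rankings`."""
    scores = [
        reciprocal_rank(ranked, qrels.get(qid, set()), k)
        for qid, ranked in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical IDs, not taken from the dataset)
toy_qrels = {"q1": {"p3"}, "q2": {"p7"}, "q3": {"p9"}}
toy_rankings = {
    "q1": ["p3", "p1", "p2"],  # relevant passage at rank 1
    "q2": ["p4", "p7", "p5"],  # relevant passage at rank 2
    "q3": ["p1", "p2", "p3"],  # relevant passage not retrieved
}

mrr = mean_reciprocal_rank(toy_rankings, toy_qrels, k=10)
print(f"MRR@10 = {mrr:.3f}")
```

A query whose relevant passage never appears in the top k contributes 0, so MRR rewards placing the first relevant passage as high as possible.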

Putting It All Together

I load the passages from the collection file into a dictionary (mapping pid to passage_text).

I load the queries into another dictionary (mapping qid to query_text).

I use the qrels.train file to form (query, passage, label) training samples for my model, where the label is set to 1 if the (qid, pid) pair is marked relevant in qrels.train, and 0 otherwise.

After training, I evaluate the performance of my model using the separate qrels.dev file, which provides a held-out set of queries and their corresponding relevant passages.

This pipeline allows me to train my model on a well-defined set of relevance judgments and then assess its effectiveness on unseen data, ensuring that my re-ranker generalizes well before final testing.
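The steps above can be sketched with tiny in-memory stand-ins for the three files. This is only an illustration under simplifying assumptions: the rows are made up, the helper name `load_two_column_tsv` is hypothetical, and every passage is treated as a candidate for every query.

```python
import csv
import io

# Made-up rows standing in for collection.tsv, queries.train.tsv and qrels.train.tsv
collection_tsv = "0\tpassage about amber urine\n1\tpassage about the Manhattan Project\n"
queries_tsv = "10\twhat color is amber urine\n"
qrels_tsv = "10\t0\t0\t1\n"  # qid, unused, pid, relevance


def load_two_column_tsv(handle):
    """Read id<TAB>text rows into a dict mapping id -> text."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader}


passages = load_two_column_tsv(io.StringIO(collection_tsv))
queries = load_two_column_tsv(io.StringIO(queries_tsv))

# Collect the (qid, pid) pairs that the qrels mark as relevant
positives = set()
for qid, _unused, pid, rel in csv.reader(io.StringIO(qrels_tsv), delimiter="\t"):
    if int(rel) > 0:
        positives.add((qid, pid))

# Label each candidate pair: 1 if it appears in the qrels, 0 otherwise
samples = [
    (queries[qid], passages[pid], 1 if (qid, pid) in positives else 0)
    for qid in queries
    for pid in passages
]
for sample in samples:
    print(sample)
```

In the notebook itself the same joins are done with pandas merges, which scale better to the 8.8 million passages in the collection.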

# **Step-by-Step Approach in this Notebook:**
1. Prepare Training Data
2. Load a pre-trained SBERT model
3. Convert data for SBERT training
4. Train SBERT
5. Evaluate model
6. Save the fine-tuned model, ready to be used after the FAISS retrieval step

In [None]:
pip install -U sentence-transformers


In [None]:
!pip install ftfy



In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# file_path = "qrels.train.tsv"  # use this instead when the file is in the working directory
file_path = '/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/qrels.train.tsv'

df_qrels = pd.read_csv(
    file_path,
    sep='\t',           # tab-separated
    header=None,
    names=["qid", "unused", "pid", "rel"]  # column names for clarity
)

df_qrels.head()

Unnamed: 0,qid,unused,pid,rel
0,1185869,0,0,1
1,1185868,0,16,1
2,597651,0,49,1
3,403613,0,60,1
4,1183785,0,389,1


In [None]:
df_qrels.shape

(532761, 4)

4 columns, 532,761 rows

* qid is query id
* pid is passage id
* rel is relevance label

In [None]:
import pandas as pd

#file_path = "queries.train.tsv"
file_path = '/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/queries.train.tsv'

df_queries = pd.read_csv(
    file_path,
    sep='\t',
    header=None,
    names=["qid", "query_text"]
)

df_queries.head()

Unnamed: 0,qid,query_text
0,121352,define extreme
1,634306,what does chattel mean on credit history
2,920825,what was the great leap forward brainly
3,510633,tattoo fixers how much does it cost
4,737889,what is decentralization process.


In [None]:
df_queries.shape

(808731, 2)

In [None]:
df_collection = pd.read_csv(
    '/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/collection.tsv',
    sep='\t',
    header=None,
    names=['pid', 'passage']
)
df_collection.head()


Unnamed: 0,pid,passage
0,0,The presence of communication amid scientific ...
1,1,The Manhattan Project and its atomic bomb help...
2,2,Essay on The Manhattan Project - The Manhattan...
3,3,The Manhattan Project was the name for a proje...
4,4,versions of each volume as well as complementa...


In [None]:
df_collection.shape

(8841823, 2)

In [None]:
df_collection.loc[49, "passage"]

'Color—urine can be a variety of colors, most often shades of yellow, from very pale or colorless to very dark or amber. Unusual or abnormal urine colors can be the result of a disease process, several medications (e.g., multivitamins can turn urine bright yellow), or the result of eating certain foods.'

Some entries in the data have formatting problems, as in the passage above, so the text needs cleaning.
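To illustrate what this kind of formatting problem typically looks like, here is a small sketch that reproduces one common form of it using only the standard library. It assumes the damage comes from UTF-8 bytes being decoded as Windows-1252, which is an assumption on my part; the notebook does not record which encoding mix-up actually occurred.

```python
original = "Color\u2014urine can be a variety of colors"  # \u2014 is an em dash

# Simulate the damage: UTF-8 bytes wrongly decoded as Windows-1252
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Undo it by reversing the two steps
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

Reversing the steps by hand only works when the exact mix-up is known. `ftfy.fix_text` detects and repairs this class of damage automatically, which is why I use it here.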

In [None]:
from ftfy import fix_text

# Fix encoding in 'passage'
df_collection["passage"] = df_collection["passage"].apply(fix_text)

# Check result
print(df_collection.loc[49, "passage"])

Color—urine can be a variety of colors, most often shades of yellow, from very pale or colorless to very dark or amber. Unusual or abnormal urine colors can be the result of a disease process, several medications (e.g., multivitamins can turn urine bright yellow), or the result of eating certain foods.



Above you can see that the passage text was fixed. Let's now do the same for the query text.

In [None]:
from ftfy import fix_text

# Fix encoding in 'query text'
df_queries["query_text"] = df_queries["query_text"].apply(fix_text)

In [None]:
df_queries.head()

Unnamed: 0,qid,query_text
0,121352,define extreme
1,634306,what does chattel mean on credit history
2,920825,what was the great leap forward brainly
3,510633,tattoo fixers how much does it cost
4,737889,what is decentralization process.


In [None]:
# Note: .loc selects by row label (index 121352), not by qid 121352
print(df_queries.loc[121352, "query_text"])

is a green pea considered a vegetable or a protein


## **Now we can merge all tables into a dataframe:**

In [None]:
# Merge qrels with queries on 'qid'
df_merged = pd.merge(df_qrels, df_queries, on='qid', how='left')

# Merge the resulting positives with the collection on 'pid'
df_merged = pd.merge(df_merged, df_collection, on='pid', how='left')

df_merged.drop(["unused"], axis=1, inplace=True)

df_merged.head()

Unnamed: 0,qid,pid,rel,query_text,passage
0,1185869,0,1,)what was the immediate impact of the success ...,The presence of communication amid scientific ...
1,1185868,16,1,_________ justice is designed to repair the ha...,The approach is based on a theory of justice t...
2,597651,49,1,what color is amber urine,"Color—urine can be a variety of colors, most o..."
3,403613,60,1,is autoimmune hepatitis a bile acid synthesis ...,Inborn errors of bile acid synthesis can produ...
4,1183785,389,1,elegxo meaning,The word convict here (elegcw /elegxo) means t...


In [None]:
df_merged.shape

(532761, 5)

Saving the merged DataFrame:

In [None]:
df_merged.to_csv('df_merged.csv', index=False)
from google.colab import files
files.download('df_merged.csv')


In [None]:
df_merged = pd.read_csv('df_merged.csv')

Now I don't need to rerun all the code before this cell; I can just load df_merged.csv.

The DataFrame has 532,761 rows (data points).

**The DataFrame above cannot yet be used to train the model, because it contains only queries and passages with positive relevance labels. I need to add pairs with negative relevance labels.**

In [None]:
pip install joblib




In [None]:
# all_pids is the list of all passage IDs from the collection
all_pids = df_collection['pid'].unique().tolist()


In [None]:
df_merged.head()

Unnamed: 0,qid,pid,rel,query_text,passage
0,1185869,0,1,)what was the immediate impact of the success ...,The presence of communication amid scientific ...
1,1185868,16,1,_________ justice is designed to repair the ha...,The approach is based on a theory of justice t...
2,597651,49,1,what color is amber urine,"Color—urine can be a variety of colors, most o..."
3,403613,60,1,is autoimmune hepatitis a bile acid synthesis ...,Inborn errors of bile acid synthesis can produ...
4,1183785,389,1,elegxo meaning,The word convict here (elegcw /elegxo) means t...


In [None]:
df_merged['rel'].unique()

array([1])

Checking how many CPU cores I have:

In [None]:
import multiprocessing

num_cores = multiprocessing.cpu_count()
print("Number of CPU cores:", num_cores)

Number of CPU cores: 8


**Negative Sampling:**

The MS MARCO qrels contain only query-passage pairs with positive relevance, so negative (label 0) pairs have to be generated manually. Here's the plan:

Progressive approach:
* Start with generating 50,000 negative pairs (by increasing queries to ~5,000 and negatives per query to 10)
* Then subsample the positives to 50,000
* This gives you a balanced dataset of 100,000 examples, which is sufficient for initial

In [None]:
import random
import pandas as pd

# Ensure keys are of the same type across DataFrames and all_pids
# convert key columns to str for consistency
df_qrels['qid'] = df_qrels['qid'].astype(str)
df_qrels['pid'] = df_qrels['pid'].astype(str)
df_queries['qid'] = df_queries['qid'].astype(str)
df_collection['pid'] = df_collection['pid'].astype(str)

# Ensure all_pids are strings as well
all_pids = [str(pid) for pid in all_pids]

# Debug: Print data types to verify consistency
print("df_qrels dtypes:")
print(df_qrels.dtypes)
print("\ndf_queries dtypes:")
print(df_queries.dtypes)
print("\ndf_collection dtypes:")
print(df_collection.dtypes)
print("\nType of first element in all_pids:", type(all_pids[0]))


# Sample 5000 QUERIES to generate 50k negatives

all_qids = df_qrels["qid"].unique().tolist()
random.shuffle(all_qids)

subsample_size = 5000  # Target ~5000 queries
sub_qids = all_qids[:subsample_size]
print(f"Selected {len(sub_qids)} unique queries for negative sampling")

# Process in reasonable chunks

chunk_size = 100
chunks = [sub_qids[i : i + chunk_size] for i in range(0, len(sub_qids), chunk_size)]
print(f"Split into {len(chunks)} chunks of {chunk_size} queries each")

# Build a dictionary of positives for each qid
pos_dict = {}
for row in df_qrels.itertuples(index=False):
    q = row.qid
    p = row.pid
    if q not in pos_dict:
        pos_dict[q] = set()
    pos_dict[q].add(p)


# Improved negative sampling function

def sample_negatives_for_qid(qid, max_samples=10):
    """
    For a given qid, efficiently sample negative passages.

    Args:
        qid: The query ID
        max_samples: Maximum number of negative samples to generate

    Returns:
        List of (qid, pid, 0) tuples representing negative samples
    """
    pos_pids = pos_dict.get(qid, set())

    # Fixed number of negatives per query to prevent excessive sampling
    neg_samples = []
    attempts = 0
    max_attempts = max_samples * 20  # Allow more failed attempts

    while len(neg_samples) < max_samples and attempts < max_attempts:
        attempts += 1
        # Pick a random pid from all_pids
        pid = random.choice(all_pids)
        # Only add if it's not a positive for this query
        if pid not in pos_pids and (qid, pid, 0) not in neg_samples:
            neg_samples.append((qid, pid, 0))

    return neg_samples


# Process each chunk sequentially with progress tracking
all_neg_samples = []  # store negative samples
target_negatives = 50000
negatives_per_query = min(10, target_negatives // len(sub_qids) + 1)

print(f"Target: {target_negatives} negatives at {negatives_per_query} per query")

for idx, chunk_qids in enumerate(chunks):
    print(f"Processing chunk {idx+1}/{len(chunks)}: {len(chunk_qids)} queries")

    # Process queries sequentially
    for i, qid in enumerate(chunk_qids):
        neg_samples = sample_negatives_for_qid(qid, max_samples=negatives_per_query)
        all_neg_samples.extend(neg_samples)

        # Progress update every 50 queries
        if (i+1) % 50 == 0 or i+1 == len(chunk_qids):
            print(f"  -> Processed {i+1}/{len(chunk_qids)} queries in current chunk")
            print(f"  -> Total negatives so far: {len(all_neg_samples)}")

    # Check if we've reached our target
    if len(all_neg_samples) >= target_negatives:
        print(f"Reached target of {target_negatives} negatives. Stopping.")
        break

#Convert negative pairs to DataFrame

df_neg = pd.DataFrame(all_neg_samples[:target_negatives], columns=["qid", "pid", "rel"])
print("\nTotal negative pairs:", len(df_neg))
print("Sample of negative pairs:")
print(df_neg.head())


# Merge negative pairs with query and passage text

print("Merging negative pairs with query and passage text...")
df_neg_merged = pd.merge(df_neg, df_queries, on='qid', how='left')
df_neg_merged = pd.merge(df_neg_merged, df_collection, on='pid', how='left')


# Subsample positives to match number of negatives
print(f"Subsampling {len(df_neg)} positives from {len(df_merged)} total positives...")
df_pos_subsampled = df_merged.sample(n=len(df_neg), random_state=42)


# Create final balanced dataset

df_balanced = pd.concat([df_pos_subsampled, df_neg_merged], ignore_index=True)
df_balanced = df_balanced.sample(frac=1.0, random_state=42)  # Shuffle the dataset

print("\nFinal balanced dataset size:", df_balanced.shape[0])
print("Class distribution:")
print(df_balanced["rel"].value_counts())
print("\nSample of the balanced dataset:")
print(df_balanced.head(5))

# Save the balanced dataset
df_balanced.to_csv('balanced_dataset_100k.csv', index=False)
print("Balanced dataset saved to 'balanced_dataset_100k.csv'")
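The sampling cell above does not seed `random`, so rerunning it produces a different set of negatives each time. Below is a minimal sketch of a reproducible variant on toy data. The function name `sample_negatives` and the IDs are hypothetical, and it uses a set rather than a list to drop duplicate draws.

```python
import random


def sample_negatives(qid, positives, candidate_pids, n, rng):
    """Draw up to n distinct pids for qid that are not labeled relevant."""
    chosen = set()
    attempts = 0
    while len(chosen) < n and attempts < n * 20:
        attempts += 1
        pid = rng.choice(candidate_pids)
        if pid not in positives:
            chosen.add(pid)  # the set silently ignores repeated draws
    # Sort so the output order does not depend on set iteration order
    return [(qid, pid, 0) for pid in sorted(chosen)]


# Toy inputs (hypothetical IDs)
toy_pids = [str(i) for i in range(1000)]
toy_positives = {"q1": {"3", "7"}}

first = sample_negatives("q1", toy_positives["q1"], toy_pids, 10, random.Random(42))
second = sample_negatives("q1", toy_positives["q1"], toy_pids, 10, random.Random(42))

print(first[:3])
print("Same draw on rerun:", first == second)
```

Passing an explicit `random.Random(seed)` instance keeps the sampling independent of any other code that touches the global random state.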

In [None]:
df_balanced


In [None]:
df_negative_pairs = df_balanced[df_balanced['rel'] == 0]
df_positive_pairs = df_balanced[df_balanced['rel'] == 1]


In [None]:
df_negative_pairs

In [None]:
df_positive_pairs


You can see both of the classes have 50,000 pairs. This dataframe can be used to train our model.

In [None]:
from google.colab import files

# Save df_balanced to a CSV file in the Colab environment
df_balanced.to_csv('df_balanced.csv', index=False)

#Download the CSV file to your local machine
files.download('df_balanced.csv')


In [None]:
file_path = '/content/df_merged.csv'
with open(file_path, 'r') as file:
    data = file.read()
print(data[:1000])  # preview only; the full file is large

In [None]:
file_path = '/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/df_balanced.csv'

df_balanced = pd.read_csv(file_path)

df_balanced.head(5)

Unnamed: 0,qid,pid,rel,query_text,passage
0,1147448,3292484,0,what law was put into place to end child labor,The basic idea is to use a sentence structure ...
1,1146837,1911754,0,what pay range is considered middle class,A: Clinical signs in humans usually develop wi...
2,1150869,1505975,1,what is the function of the macrophages in the...,function of alveolar macrophagesThe function o...
3,525889,6063694,0,two types of nucleic acids viruses may have,→ دِبْلُوماسيّ diplomat diplomat Diplomat διπλ...
4,34120,6965147,0,average cost of new home construction,People with Down syndrome may have a variety o...


In [None]:
print(df_balanced['rel'].value_counts())

rel
0    50000
1    50000
Name: count, dtype: int64


Confirming class balance: both classes contain 50,000 pairs.

# Choosing and importing a relevant SBERT Model:

When a query comes in, the SBERT model (a bi-encoder pre-trained on MS MARCO) generates an embedding for the query and for each of the 10 candidate academic passages. It then computes the cosine similarity between the query embedding and each candidate embedding, and the final ranking orders the candidates from highest to lowest similarity. In other words, the similarity scores act as estimates of each candidate's semantic relevance to the query.
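The ranking step described above can be sketched without loading a model. The 3-dimensional vectors below are made-up placeholders for real SBERT embeddings, and `rank_candidates` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs):
    """Return (pid, score) pairs sorted from most to least similar."""
    scored = [
        (pid, cosine_similarity(query_vec, vec))
        for pid, vec in candidate_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up embeddings standing in for model output
query_embedding = [1.0, 0.0, 1.0]
candidate_embeddings = {
    "p1": [1.0, 0.0, 1.0],  # same direction as the query
    "p2": [1.0, 1.0, 0.0],  # partially aligned
    "p3": [0.0, 1.0, 0.0],  # orthogonal to the query
}

ranking = rank_candidates(query_embedding, candidate_embeddings)
for pid, score in ranking:
    print(pid, round(score, 3))
```

With real embeddings the same logic applies; only the vectors come from `model.encode(...)` instead of being written by hand.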

In [None]:
!pip install evaluate



In [None]:
!pip install datasets
!pip install transformers
!pip install accelerate -U
!pip install transformers[torch]
!pip install wandb



In [None]:
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
import multiprocessing
from datasets import Dataset
from transformers import AutoTokenizer

import evaluate
import torch

In [None]:
from sentence_transformers import InputExample
from torch.utils.data import DataLoader
import random

# 1. Split the balanced dataset into train and validation sets
train_df, val_df = train_test_split(
    df_balanced,
    test_size=0.1,
    random_state=42,
    stratify=df_balanced['rel']  # Maintain class balance in splits
)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")

# 2. Convert DataFrames to lists of InputExample objects
def df_to_input_examples(df):
    examples = []
    for _, row in df.iterrows():
        # SBERT expects InputExample objects with texts and a label
        examples.append(InputExample(
            texts=[row['query_text'], row['passage']],
            label=float(row['rel'])
        ))
    return examples

train_examples = df_to_input_examples(train_df)
val_examples = df_to_input_examples(val_df)

# Sample some examples to verify
print("\nSample training examples:")
for i in range(3):
    ex = random.choice(train_examples)
    print(f"Query: {ex.texts[0][:50]}...")
    print(f"Passage: {ex.texts[1][:50]}...")
    print(f"Label: {ex.label}\n")

Training samples: 90000
Validation samples: 10000

Sample training examples:
Query: who and when was rifling invented barrel...
Passage: Diversity programs may include outreach to the com...
Label: 0.0

Query: what city is in newark...
Passage: Newark is a city in Alameda County, California, Un...
Label: 1.0

Query: what is tv refresh rate mean...
Passage: Essentially, a TV's refresh rate is how fast the d...
Label: 1.0



In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-v4')
model = model.to(device)  # Automatically uses GPU if available

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Define loss
train_loss = losses.CosineSimilarityLoss(model=model)

# Create evaluator for validation set
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(val_examples, name='val-eval')

# Fine-tune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=2,
    evaluation_steps=100,  # Validation every 100 steps
    output_path='/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/SBERT_MODEL/'
)



Step,Training Loss,Validation Loss,Val-eval Pearson Cosine,Val-eval Spearman Cosine
100,No log,No log,0.966356,0.865689
200,No log,No log,0.967574,0.865749
300,No log,No log,0.968418,0.865807
400,No log,No log,0.968535,0.865845
500,0.060000,No log,0.968272,0.86587
600,0.060000,No log,0.96835,0.865888
700,0.060000,No log,0.968525,0.865899
800,0.060000,No log,0.968593,0.865906
900,0.060000,No log,0.968828,0.865905
1000,0.040600,No log,0.968677,0.865901
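For context on the training loss column: as far as I understand it, `CosineSimilarityLoss` with its default settings computes the mean squared error between the cosine similarity of the two embeddings and the float label. The sketch below reproduces that arithmetic on made-up 2-dimensional vectors. It illustrates the formula only and is not the library's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(batch):
    """Mean squared error between cos(u, v) and the label over a batch."""
    errors = [(cosine_similarity(u, v) - label) ** 2 for u, v, label in batch]
    return sum(errors) / len(errors)


# (query embedding, passage embedding, label) triples with made-up vectors
toy_batch = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # similar and labeled relevant: no error
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # dissimilar and labeled irrelevant: no error
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # similar but labeled irrelevant: squared error of 1
]

loss = cosine_similarity_loss(toy_batch)
print(f"loss = {loss:.4f}")
```

Under this objective, relevant pairs are pushed toward a cosine similarity of 1 and the sampled negatives toward 0.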


In [None]:
# This snippet loads a previously saved model from a directory into this notebook environment
# from sentence_transformers import SentenceTransformer

# # Path where you saved the model
# model_path = '/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/SBERT_MODEL_13TH_APRIL/'

# # Load the model
# model = SentenceTransformer(model_path)


Preparing the test set:

**Training and monitoring evaluation:**

**Post-Training Validation:**
* Load the official dev set (qrels.dev.tsv), which serves as the test set here

In [None]:
#Read qrels.dev.tsv
test_qrels = pd.read_csv(
    '/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/qrels.dev.tsv',
    sep='\t',
    header=None,
    names=['qid', 'unused', 'pid', 'rel']  # "unused" corresponds to the '0' column
)

# read queries.dev.tsv
test_queries = pd.read_csv(
    '/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/queries.dev.tsv',
    sep='\t',
    header=None,
    names=['qid', 'query_text']
)

For the final validation I use the MS MARCO dev set, which was not used for training or for tuning model parameters. In theory, it should give an unbiased performance assessment.

**When I run the BinaryClassificationEvaluator, it does the following:**
1. Feeds each pair to the model to get a similarity score (e.g., cosine similarity between embeddings).

2. Compares that predicted score against the ground-truth rel label to compute classification metrics (such as accuracy, F1, or AUC).
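The two steps above can be illustrated with a simplified sketch that turns similarity scores into binary predictions by choosing a decision threshold. The scores and labels are made up, the helper names are hypothetical, and this is not the evaluator's actual code.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Predict 1 when score >= threshold and return the fraction predicted correctly."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def best_threshold(scores, labels):
    """Try each observed score as a threshold and keep the most accurate one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: accuracy_at_threshold(scores, labels, t))


# Made-up cosine similarities and ground-truth relevance labels
toy_scores = [0.91, 0.15, 0.78, 0.40, 0.05, 0.66]
toy_labels = [1, 0, 1, 0, 0, 1]

threshold = best_threshold(toy_scores, toy_labels)
accuracy = accuracy_at_threshold(toy_scores, toy_labels, threshold)
print(f"threshold = {threshold}, accuracy = {accuracy:.2f}")
```

Because the threshold is tuned on the same pairs it is scored on, the resulting accuracy is an optimistic estimate.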

In [None]:
# merge qrels.dev with queries.dev on 'qid'
#    [qid, unused, pid, rel, query_text]
test_merged = pd.merge(test_qrels, test_queries, on='qid', how='left')

# Merge again with collection on 'pid' to get the passage text
#    [qid, unused, pid, rel, query_text, passage]
test_merged = pd.merge(test_merged, df_collection, on='pid', how='left')

# (Optional) drop the 'unused' column
test_merged.drop(columns=['unused'], inplace=True)

print("Dev set (test) shape:", test_merged.shape)
test_merged.head()


Dev set (test) shape: (59273, 5)


Unnamed: 0,qid,pid,rel,query_text,passage
0,1102432,2026790,1,. what is a corporation?,McDonald's Corporation is one of the most reco...
1,1102431,7066866,1,why did rachel carson write an obligation to e...,The Obligation to Endure by Rachel Carson Rach...
2,1102431,7066867,1,why did rachel carson write an obligation to e...,Carson believes that as man tries to eliminate...
3,1090282,7066900,1,symptoms of a dying mouse,The symptoms are similar but the mouse will be...
4,39449,7066905,1,average number of lightning strikes per day,Although many lightning flashes are simply clo...


In [None]:
print(test_merged['rel'].value_counts())

rel
1    59273
Name: count, dtype: int64


In [None]:
import pandas as pd

# Subsample only if test_merged has more than 5,000 samples
if test_merged.shape[0] > 5000:
    test_merged_sub = test_merged.sample(n=5000, random_state=42).reset_index(drop=True)
    print("Subsampled test_merged to 5,000 data samples.")
else:
    test_merged_sub = test_merged.copy()
    print("test_merged has 5,000 or fewer samples; no subsampling performed.")

# Optional: Print the shape and a preview of the subsampled DataFrame
print("New shape of test_merged_sub:", test_merged_sub.shape)
print(test_merged_sub.head())


Subsampled test_merged to 5,000 data samples.
New shape of test_merged_sub: (5000, 5)
       qid      pid  rel                                         query_text  \
0  1084031  7132043    1        what does constructivist structuralist mean   
1   332830  5789735    1        how old a do you have to be to get a tattoo   
2  1088785  7091207    1                            watchguard body cameras   
3  1033718  7212203    1  what is the purpose of genetically modified cr...   
4   617167  7713540    1                     what days is the ides of march   

                                             passage  
0  Definition of constructivism - a style or move...  
1  You have to be 18 years old to get a tattoo. I...  
2  WatchGuard Vista WatchGuard Vista: WatchGuard ...  
3  More detail on some of the traits crops are ge...  


Above we can see that the test dataset contains only query-passage pairs with positive relevance, so once again negative pairs have to be generated manually.

There is no need to validate the model on roughly 120k samples (59,273 positives plus as many negatives); that's overkill. I subsample test_merged to 5,000 rows and generate about 5,000 negative pairs to match.

Confirming that the dataset was cut to 5,000 entries:

In [None]:
print(test_merged_sub['rel'].value_counts())

rel
1    5000
Name: count, dtype: int64


Generating 5k negative query-passage pairs

In [None]:
import random
import pandas as pd

#  Ensure keys are strings for consistency
test_merged_sub['qid'] = test_merged_sub['qid'].astype(str)
test_merged_sub['pid'] = test_merged_sub['pid'].astype(str)
test_queries['qid'] = test_queries['qid'].astype(str)
df_collection['pid'] = df_collection['pid'].astype(str)

# Get all passage IDs from the collection
all_pids = list(df_collection['pid'].unique())

#  Build a dictionary of positives from test_merged_sub
test_pos_dict = {}
for _, row in test_merged_sub.iterrows():
    qid = row['qid']
    pid = row['pid']
    if qid not in test_pos_dict:
        test_pos_dict[qid] = set()
    test_pos_dict[qid].add(pid)

# Negative sampling function (1:1 ratio with positives)

def sample_negatives_for_qid_test(qid, max_samples):
    """
    For a given test query id, sample negative passages (ones not in the positive set).
    max_samples is set to match the number of positives for that query.
    """
    pos_pids = test_pos_dict.get(qid, set())
    neg_samples = []
    attempts = 0
    max_attempts = max_samples * 20  # Allow extra attempts if needed
    while len(neg_samples) < max_samples and attempts < max_attempts:
        attempts += 1
        pid = random.choice(all_pids)
        if pid not in pos_pids and (qid, pid, 0) not in neg_samples:
            neg_samples.append((qid, pid, 0))
    return neg_samples

# Process queries in chunks to generate negative samples
# Create a shuffled list of qids from the subsampled test data
all_qids = list(test_pos_dict.keys())
random.shuffle(all_qids)

# Define chunk size (here we use 100 queries per chunk)
chunk_size = 100
chunks = [all_qids[i:i + chunk_size] for i in range(0, len(all_qids), chunk_size)]
print(f"Total queries to process: {len(all_qids)} in {len(chunks)} chunks of {chunk_size} each.")

neg_samples_list = []

# Index the lookups once so each query/passage lookup avoids a full DataFrame scan
# (assumes qid and pid values are unique)
query_lookup = test_queries.set_index('qid')['query_text']
passage_lookup = df_collection.set_index('pid')['passage']

# Process each chunk sequentially
for idx, chunk_qids in enumerate(chunks):
    print(f"Processing chunk {idx+1}/{len(chunks)} with {len(chunk_qids)} queries.")
    for qid in chunk_qids:
        # Retrieve the query text; skip queries missing from test_queries
        query_text = query_lookup.get(qid)
        if query_text is None:
            continue

        # Set negatives count equal to number of positives for this query
        num_pos = len(test_pos_dict[qid])
        neg_samples = sample_negatives_for_qid_test(qid, num_pos)

        # For each negative, look up its passage text from the collection
        for (qid_neg, pid_neg, rel_neg) in neg_samples:
            passage_text = passage_lookup.get(pid_neg)
            if passage_text is None:
                continue
            neg_samples_list.append({
                'qid': qid_neg,
                'pid': pid_neg,
                'rel': rel_neg,
                'query_text': query_text,
                'passage': passage_text
            })

    print(f"  -> Total negatives so far: {len(neg_samples_list)}")

print("Finished processing all chunks.")

# convert negative samples into a DataFrame
df_neg_test = pd.DataFrame(neg_samples_list)
print("Total negative samples for evaluation:", len(df_neg_test))
print("Negative samples preview:")
print(df_neg_test.head())


# Combine with positive pairs (from the subsampled test set) to create final evaluation DataFrame

df_pos_test = test_merged_sub[['qid', 'pid', 'rel', 'query_text', 'passage']]
test_final = pd.concat([df_pos_test, df_neg_test], ignore_index=True)
test_final = test_final.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Check class distribution
print("Final evaluation set class distribution:")
print(test_final['rel'].value_counts())
print("\nFinal evaluation set preview:")
print(test_final.head())


Total queries to process: 4969 in 50 chunks of 100 each.
Processing chunk 1/50 with 100 queries.
  -> Total negatives so far: 100
Processing chunk 2/50 with 100 queries.
  -> Total negatives so far: 201
Processing chunk 3/50 with 100 queries.
  -> Total negatives so far: 301
Processing chunk 4/50 with 100 queries.
  -> Total negatives so far: 402
Processing chunk 5/50 with 100 queries.
  -> Total negatives so far: 502
Processing chunk 6/50 with 100 queries.
  -> Total negatives so far: 602
Processing chunk 7/50 with 100 queries.
  -> Total negatives so far: 703
Processing chunk 8/50 with 100 queries.
  -> Total negatives so far: 804
Processing chunk 9/50 with 100 queries.
  -> Total negatives so far: 904
Processing chunk 10/50 with 100 queries.
  -> Total negatives so far: 1005
Processing chunk 11/50 with 100 queries.
  -> Total negatives so far: 1106
Processing chunk 12/50 with 100 queries.
  -> Total negatives so far: 1207
Processing chunk 13/50 with 100 queries.
  -> Total negatives
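
The 1:1 negative sampling described above can be illustrated on toy data. This is a minimal sketch, not the notebook's actual cell: `sample_negatives`, `toy_pos`, and `toy_passages` are hypothetical names and values that stand in for `test_pos_dict` and the collection, and a seeded `random.Random` is passed in only to make the run repeatable.

```python
import random

def sample_negatives(qid, pos_dict, all_pids, rng, attempts_per_sample=20):
    """Sample as many negatives for qid as it has positives (1:1 ratio)."""
    pos_pids = pos_dict.get(qid, set())
    target = len(pos_pids)
    chosen = set()  # set membership keeps the duplicate check O(1)
    attempts = 0
    while len(chosen) < target and attempts < target * attempts_per_sample:
        attempts += 1
        pid = rng.choice(all_pids)
        if pid not in pos_pids:
            chosen.add(pid)
    return [(qid, pid, 0) for pid in sorted(chosen)]

# Toy stand-ins (hypothetical values, not MS MARCO data)
toy_pos = {"q1": {"p1", "p2"}, "q2": {"p3"}}
toy_passages = {f"p{i}": f"passage text {i}" for i in range(1, 11)}
toy_pids = list(toy_passages)

rng = random.Random(42)
rows = []
for qid in toy_pos:
    for q, pid, rel in sample_negatives(qid, toy_pos, toy_pids, rng):
        # Dictionary lookup for the passage text instead of filtering a DataFrame
        rows.append({"qid": q, "pid": pid, "rel": rel, "passage": toy_passages[pid]})

print(len(rows), "negative rows")
```

Because negatives are drawn uniformly from the whole collection, most of them are likely to be easy to tell apart from the positives, which is worth keeping in mind when reading the evaluation scores.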

Saving test_final as a CSV locally:

In [None]:
from google.colab import files

# Save test_final to a CSV file (without the index column)
test_final.to_csv('test_final.csv', index=False)

# Trigger a download of the CSV file
files.download('test_final.csv')



## **Evaluation:**

The code evaluates your model's ability to decide, from the cosine similarity of their embeddings, whether a pair of texts (a query and a passage) is relevant (label 1.0) or irrelevant (label 0.0).

In [None]:
from sentence_transformers import InputExample
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# Convert merged test set to InputExamples
test_examples = []
for _, row in test_final.iterrows():
    test_examples.append(InputExample(
        texts=[row['query_text'], row['passage']],
        label=float(row['rel'])  # 1.0 = relevant, 0.0 = irrelevant
    ))


In [None]:
# Initialise evaluator
evaluator = BinaryClassificationEvaluator.from_input_examples(
    test_examples,
    name='test-eval'
)

# Run evaluation
evaluator(model)


{'test-eval_cosine_accuracy': 0.9865,
 'test-eval_cosine_accuracy_threshold': 0.5036637783050537,
 'test-eval_cosine_f1': 0.9865013498650135,
 'test-eval_cosine_f1_threshold': 0.47773101925849915,
 'test-eval_cosine_precision': 0.9864027194561088,
 'test-eval_cosine_recall': 0.9866,
 'test-eval_cosine_ap': 0.9990008667480205,
 'test-eval_cosine_mcc': 0.9730000194600006}

This process ultimately lets you assess how well your fine-tuned SBERT model differentiates between semantically relevant and non-relevant query–passage pairs, which is critical for ranking tasks.
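
To make the reported `cosine_accuracy` and `cosine_accuracy_threshold` values more concrete, here is a simplified sketch of the idea: sweep candidate thresholds over the cosine scores and keep the one that gives the best accuracy. It is an illustration under my own assumptions, not the library's implementation; `best_accuracy_threshold` and the toy scores are hypothetical.

```python
import numpy as np

def best_accuracy_threshold(scores, labels):
    """Return (best accuracy, threshold) when predicting 1 for score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    sorted_scores = np.sort(scores)
    # Candidate thresholds: midpoints between consecutive sorted scores
    candidates = (sorted_scores[:-1] + sorted_scores[1:]) / 2.0
    best_acc, best_thr = -1.0, None
    for thr in candidates:
        preds = (scores >= thr).astype(int)
        acc = float((preds == labels).mean())
        if acc > best_acc:
            best_acc, best_thr = acc, float(thr)
    return best_acc, best_thr

# Toy cosine scores: relevant pairs score high, irrelevant pairs score low
toy_scores = [0.74, 0.15, 0.66, 0.06, 0.86, 0.14]
toy_labels = [1, 0, 1, 0, 1, 0]
acc, thr = best_accuracy_threshold(toy_scores, toy_labels)
print(f"accuracy={acc:.2f} at threshold={thr:.3f}")
```

A threshold chosen this way is tuned on the same pairs it is scored on, so the resulting accuracy is best read as an upper bound for that data rather than a held-out estimate.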

Ranking:

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('/content/drive/MyDrive/Colab_Notebooks/Information Retrieval/Search Engine Project/SBERT_MODEL/')

def compute_cosine_similarity(vec1, vec2):
    """Compute the cosine similarity between two vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Dictionary to store ranking results by query
ranking_results = {}

# Group the evaluation DataFrame by query id
for qid, group in test_final.groupby('qid'):
    # extract query text
    query_text = group.iloc[0]['query_text']
    # Get list of candidate passages and corresponding ids and ground truth labels
    candidate_passages = group['passage'].tolist()
    candidate_ids = group['pid'].tolist()
    candidate_labels = group['rel'].tolist()

    # Compute embeddings
    # For the query:
    query_embedding = model.encode([query_text])[0]  # single embedding vector
    # For candidates:
    candidate_embeddings = model.encode(candidate_passages)

    # Compute cosine similarity scores for each candidate passage
    similarity_scores = [
        compute_cosine_similarity(query_embedding, candidate_emb)
        for candidate_emb in candidate_embeddings
    ]

    # Convert scores to a NumPy array for sorting
    similarity_scores = np.array(similarity_scores)

    # Sort candidate indices based on similarity scores (descending order)
    sorted_indices = np.argsort(-similarity_scores)

    # Retrieve sorted candidates, their ids, scores, and labels
    sorted_candidate_ids = [candidate_ids[i] for i in sorted_indices]
    sorted_scores = similarity_scores[sorted_indices]
    sorted_candidate_labels = [candidate_labels[i] for i in sorted_indices]

    # Store results in a dictionary for this query
    ranking_results[qid] = {
        'query_text': query_text,
        'sorted_candidate_ids': sorted_candidate_ids,
        'sorted_scores': sorted_scores,
        'sorted_candidate_labels': sorted_candidate_labels
    }

    # Print ranking for this query (for demonstration)
    print(f"Query ID: {qid}")
    print("Ranked Passage IDs:", sorted_candidate_ids)
    print("Similarity Scores:", sorted_scores)
    print("\n")

# Compute Mean Reciprocal Rank (MRR) as a ranking metric
def compute_mrr(sorted_labels):
    """Return reciprocal rank for the first relevant (label==1) candidate."""
    for rank, label in enumerate(sorted_labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

# Calculate MRR for each query and average them
mrr_list = [compute_mrr(data['sorted_candidate_labels']) for data in ranking_results.values()]
mean_mrr = np.mean(mrr_list)
print("Mean Reciprocal Rank (MRR) across queries:", mean_mrr)


Streaming output truncated to the last 5000 lines.
Ranked Passage IDs: ['7661942', '7306061']
Similarity Scores: [0.73657346 0.14602663]


Query ID: 758008
Ranked Passage IDs: ['7561394', '1829092']
Similarity Scores: [0.6587354  0.05528211]


Query ID: 758104
Ranked Passage IDs: ['5536159', '7436084']
Similarity Scores: [0.8620485  0.13836473]


Query ID: 759243
Ranked Passage IDs: ['7445414', '6566971']
Similarity Scores: [0.9398777  0.14405185]


Query ID: 759861
Ranked Passage IDs: ['7556003', '1364694']
Similarity Scores: [0.9324275  0.03937463]


Query ID: 760774
Ranked Passage IDs: ['7884334', '5752137']
Similarity Scores: [0.80378866 0.03762812]


Query ID: 761194
Ranked Passage IDs: ['7726869', '3988862']
Similarity Scores: [0.55772984 0.06785862]


Query ID: 76169
Ranked Passage IDs: ['7339645', '1681780']
Similarity Scores: [0.6684715  0.00907359]


Query ID: 761700
Ranked Passage IDs: ['7717531', '1202799']
Similarity Scores: [0.8980647  0.02435574]


Query ID
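
The ranking step can also be written in vectorized form: normalize the embeddings once, take a single matrix product to get all cosine scores, then sort. The sketch below uses small hand-made vectors rather than SBERT embeddings; `rank_by_cosine`, `reciprocal_rank`, and the toy arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def rank_by_cosine(query_vec, candidate_matrix):
    """Rank candidates by cosine similarity to the query (highest first)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_matrix / np.linalg.norm(candidate_matrix, axis=1, keepdims=True)
    scores = c @ q                    # one cosine score per candidate row
    order = np.argsort(-scores)       # indices sorted by descending score
    return order, scores[order]

def reciprocal_rank(sorted_labels):
    """Return 1/rank of the first relevant (label == 1) item, or 0.0 if none."""
    for rank, label in enumerate(sorted_labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

# Toy 2-D "embeddings": candidate 2 points the same way as the query
toy_query = np.array([1.0, 0.0])
toy_candidates = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
toy_labels = [0, 1, 0]  # only candidate 1 is marked relevant

order, sorted_scores = rank_by_cosine(toy_query, toy_candidates)
sorted_labels = [toy_labels[i] for i in order]
rr = reciprocal_rank(sorted_labels)
print(order.tolist(), rr)
```

Here the relevant candidate lands in second place, so its reciprocal rank is 0.5; averaging this value over all queries gives the MRR.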

In [None]:
import pandas as pd

# Define the evaluation metrics dictionary
metrics = {
    'test-eval_cosine_accuracy': 0.9865,
    'test-eval_cosine_accuracy_threshold': 0.5036637783050537,
    'test-eval_cosine_f1': 0.9865013498650135,
    'test-eval_cosine_f1_threshold': 0.47773101925849915,
    'test-eval_cosine_precision': 0.9864027194561088,
    'test-eval_cosine_recall': 0.9866,
    'test-eval_cosine_ap': 0.9990008667480205,
    'test-eval_cosine_mcc': 0.9730000194600006
}

# Convert dictionary to a DataFrame
df = pd.DataFrame(list(metrics.items()), columns=["Metric", "Value"])

# Display the table
df

                                Metric     Value
0            test-eval_cosine_accuracy    0.9865
1  test-eval_cosine_accuracy_threshold  0.503664
2                  test-eval_cosine_f1  0.986501
3        test-eval_cosine_f1_threshold  0.477731
4           test-eval_cosine_precision  0.986403
5              test-eval_cosine_recall    0.9866
6                  test-eval_cosine_ap  0.999001
7                 test-eval_cosine_mcc     0.973


In [None]:
import json
import pandas as pd

# Example JSON data as a string (you could also load this from a file)
data_json = """
{
  "query_ndcg": {
    "1": {
      "BM25 baseline": 0.9702678299923723,
      "Basic LSI with k=150 dimensions": 0.9758521455086463,
      "Field-weighted LSI": 0.9848920829832359,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9830304948282328
    },
    "2": {
      "BM25 baseline": 0.9676702855814429,
      "Basic LSI with k=150 dimensions": 0.9629813342264886,
      "Field-weighted LSI": 0.9667761720787513,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9794256057551403
    },
    "3": {
      "BM25 baseline": 0.9044563947524946,
      "Basic LSI with k=150 dimensions": 0.9933589454400779,
      "Field-weighted LSI": 0.9898865373853496,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9881020176053733
    },
    "4": {
      "BM25 baseline": 0.9581091124245024,
      "Basic LSI with k=150 dimensions": 0.948554309956447,
      "Field-weighted LSI": 0.9960109913990778,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9878911924184829
    },
    "5": {
      "BM25 baseline": 0.9658615078539107,
      "Basic LSI with k=150 dimensions": 0.9575689562960752,
      "Field-weighted LSI": 0.9505284770054349,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9655739770289112
    },
    "6": {
      "BM25 baseline": 0.9752052508426301,
      "Basic LSI with k=150 dimensions": 0.9627981168426469,
      "Field-weighted LSI": 0.9581046966570809,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9591202408479728
    },
    "7": {
      "BM25 baseline": 0.9818993385732202,
      "Basic LSI with k=150 dimensions": 0.9807871865261831,
      "Field-weighted LSI": 0.9809150399185756,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9823527299043489
    },
    "8": {
      "BM25 baseline": 0.9321327160076697,
      "Basic LSI with k=150 dimensions": 0.9710645286552664,
      "Field-weighted LSI": 0.9762610082520967,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9754601319164254
    },
    "9": {
      "BM25 baseline": 0.9245261750389756,
      "Basic LSI with k=150 dimensions": 0.9676601688646809,
      "Field-weighted LSI": 0.9667775455963419,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9776080183622335
    },
    "10": {
      "BM25 baseline": 0.9529765131303531,
      "Basic LSI with k=150 dimensions": 0.9645555969472415,
      "Field-weighted LSI": 0.9691697980787715,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9621459956970113
    },
    "11": {
      "BM25 baseline": 0.8964577324065195,
      "Basic LSI with k=150 dimensions": 0.9609562747330634,
      "Field-weighted LSI": 0.9807828639897396,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9793535110310084
    },
    "12": {
      "BM25 baseline": 0.984695848675374,
      "Basic LSI with k=150 dimensions": 0.9788847916645828,
      "Field-weighted LSI": 0.9693232588401846,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9857402370727113
    },
    "13": {
      "BM25 baseline": 0.9782201890488277,
      "Basic LSI with k=150 dimensions": 0.9855064285891632,
      "Field-weighted LSI": 0.9902678789460896,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9887766464301465
    },
    "14": {
      "BM25 baseline": 0.970356466152375,
      "Basic LSI with k=150 dimensions": 0.9628576588661965,
      "Field-weighted LSI": 0.9280259962459354,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9809106119256944
    },
    "15": {
      "BM25 baseline": 0.9835913643588563,
      "Basic LSI with k=150 dimensions": 0.9690110565535949,
      "Field-weighted LSI": 0.9027999093185158,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9030498209383528
    },
    "16": {
      "BM25 baseline": 0.9966040770726391,
      "Basic LSI with k=150 dimensions": 0.9855664146422799,
      "Field-weighted LSI": 0.994992385847428,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9942910776627677
    },
    "17": {
      "BM25 baseline": 0.9825657033733907,
      "Basic LSI with k=150 dimensions": 0.9827942717636607,
      "Field-weighted LSI": 0.8947880226442804,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9809054561880343
    },
    "18": {
      "BM25 baseline": 0.9621443885246534,
      "Basic LSI with k=150 dimensions": 0.9441357282937284,
      "Field-weighted LSI": 0.9594627248767399,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9832791715266921
    },
    "19": {
      "BM25 baseline": 0.9334185635964451,
      "Basic LSI with k=150 dimensions": 0.9829274742442599,
      "Field-weighted LSI": 0.9933716770959169,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9847174929495353
    },
    "20": {
      "BM25 baseline": 0.9516312021336671,
      "Basic LSI with k=150 dimensions": 0.9796666282990453,
      "Field-weighted LSI": 0.9786787250738833,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9556153266519222
    },
    "21": {
      "BM25 baseline": 0.9578392479069989,
      "Basic LSI with k=150 dimensions": 0.9766666646329698,
      "Field-weighted LSI": 0.9758102199492783,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9850558651966211
    },
    "22": {
      "BM25 baseline": 0.9748673310883267,
      "Basic LSI with k=150 dimensions": 0.9918139964433209,
      "Field-weighted LSI": 0.9873837905369207,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9883659434195166
    },
    "23": {
      "BM25 baseline": 0.9018046783186535,
      "Basic LSI with k=150 dimensions": 0.9741947524048282,
      "Field-weighted LSI": 0.9574728776175179,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9573189253711919
    },
    "24": {
      "BM25 baseline": 0.9261770173775742,
      "Basic LSI with k=150 dimensions": 0.9739773208175877,
      "Field-weighted LSI": 0.9941760144154915,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9866652960369205
    },
    "25": {
      "BM25 baseline": 0.9514747319710257,
      "Basic LSI with k=150 dimensions": 0.9493207798604006,
      "Field-weighted LSI": 0.9288759835628523,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9038360948563292
    },
    "26": {
      "BM25 baseline": 0.9699219676915909,
      "Basic LSI with k=150 dimensions": 0.9609181526850149,
      "Field-weighted LSI": 0.9203767389876794,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9689337600130596
    },
    "27": {
      "BM25 baseline": 0.9655588343889085,
      "Basic LSI with k=150 dimensions": 0.9249466577802449,
      "Field-weighted LSI": 0.9275834806993435,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9140204740001455
    },
    "28": {
      "BM25 baseline": 0.9657423422096584,
      "Basic LSI with k=150 dimensions": 0.9836744745815332,
      "Field-weighted LSI": 0.9854102537186108,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9854102537186108
    },
    "29": {
      "BM25 baseline": 0.9925132360531861,
      "Basic LSI with k=150 dimensions": 0.986803734351502,
      "Field-weighted LSI": 0.9915805450761829,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9840163312121945
    },
    "30": {
      "BM25 baseline": 0.965399565159306,
      "Basic LSI with k=150 dimensions": 0.9437529122758956,
      "Field-weighted LSI": 0.9827509727452103,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9774730536666356
    }
  },
  "system_averages": {
    "BM25 baseline": 0.9581363203901849,
    "Basic LSI with k=150 dimensions": 0.9694519154248875,
    "Field-weighted LSI": 0.966107888984751,
    "Field-weighted LSI with BERT-enhanced indexing": 0.9716148584744075
  },
  "overall_comparison": [
    {
      "system": "Field-weighted LSI with BERT-enhanced indexing",
      "average_ndcg@10": 0.9716148584744075
    },
    {
      "system": "Basic LSI with k=150 dimensions",
      "average_ndcg@10": 0.9694519154248875
    },
    {
      "system": "Field-weighted LSI",
      "average_ndcg@10": 0.966107888984751
    },
    {
      "system": "BM25 baseline",
      "average_ndcg@10": 0.9581363203901849
    }
  ]
}
"""

# Parse JSON data
data = json.loads(data_json)

# Extract the query_ndcg dictionary
query_ndcg = data["query_ndcg"]

# Initialize a dictionary to accumulate scores per system
system_scores = {}

# Loop over each query in query_ndcg (keys are query numbers as strings)
for query_num, systems in query_ndcg.items():
    for system, ndcg in systems.items():
        system_scores.setdefault(system, []).append(ndcg)

# Compute the average NDCG for each system across all queries
average_ndcg = {system: sum(scores) / len(scores) for system, scores in system_scores.items()}

# Convert the averages into a DataFrame for easy viewing
df_avg_ndcg = pd.DataFrame(list(average_ndcg.items()), columns=["System", "Average NDCG@10"])
df_avg_ndcg = df_avg_ndcg.sort_values("Average NDCG@10", ascending=False).reset_index(drop=True)
print(df_avg_ndcg.to_string(index=False))

# Interpretation:
print("\nInterpretation:")
print("NDCG (Normalized Discounted Cumulative Gain) is a ranking metric that evaluates")
print("how well the system orders the documents; values closer to 1 indicate near-ideal ranking.")
print("The computed average NDCG@10 for each system are:")
for system, avg in average_ndcg.items():
    print(f"  - {system}: {avg:.4f}")
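# Report the top system from the computed averages rather than hard-coding its name
best_system = max(average_ndcg, key=average_ndcg.get)
print(f"\nOverall, {best_system} shows the highest average NDCG@10,")
print("indicating it ranks results most closely to the ideal across the evaluated queries.")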


                                        System  Average NDCG@10
Field-weighted LSI with BERT-enhanced indexing         0.971615
               Basic LSI with k=150 dimensions         0.969452
                            Field-weighted LSI         0.966108
                                 BM25 baseline         0.958136

Interpretation:
NDCG (Normalized Discounted Cumulative Gain) is a ranking metric that evaluates
how well the system orders the documents; values closer to 1 indicate near-ideal ranking.
The computed average NDCG@10 for each system are:
  - BM25 baseline: 0.9581
  - Basic LSI with k=150 dimensions: 0.9695
  - Field-weighted LSI: 0.9661
  - Field-weighted LSI with BERT-enhanced indexing: 0.9716

Overall, Field-weighted LSI with BERT-enhanced indexing shows the highest average NDCG@10,
indicating it ranks results most closely to the ideal across the evaluated queries.
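
For reference, NDCG@k for a single query can be computed as sketched below. The per-query scores above were produced elsewhere, so this is only an illustration of the metric: it assumes the linear-gain form (relevance divided by log2 of rank + 1), which may differ from the formulation used to generate those numbers, and `dcg_at_k` and `ndcg_at_k` are hypothetical helper names.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1) for ranks 1..k
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant items at all
    return dcg_at_k(relevances, k) / ideal_dcg

# Graded relevance labels in the order a system returned them
perfect = ndcg_at_k([3, 2, 1, 0], k=10)   # already in ideal order
swapped = ndcg_at_k([2, 3, 1, 0], k=10)   # top two results swapped
print(perfect, round(swapped, 4))
```

An ordering that matches the ideal scores exactly 1, and swapping the top two results lowers the score, which is why values close to 1 indicate near-ideal ranking.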


In [None]:
import pandas as pd

# Create a list of dictionaries with the data
data = [
    {
        "system": "Field-weighted LSI with BERT-enhanced indexing",
        "average_ndcg@10": 0.9716148584744075
    },
    {
        "system": "Basic LSI with k=150 dimensions",
        "average_ndcg@10": 0.9694519154248875
    },
    {
        "system": "Field-weighted LSI",
        "average_ndcg@10": 0.966107888984751
    },
    {
        "system": "BM25 baseline",
        "average_ndcg@10": 0.9581363203901849
    }
]

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df

                                           system  average_ndcg@10
0  Field-weighted LSI with BERT-enhanced indexing         0.971615
1                 Basic LSI with k=150 dimensions         0.969452
2                              Field-weighted LSI         0.966108
3                                   BM25 baseline         0.958136


In [None]:
import json
import pandas as pd
import numpy as np

# Sample JSON data (replace this multi-line string with your actual JSON data if reading from a file)
data_json = """
{
  "query_ndcg": {
    "1": {
      "BM25 baseline": 0.9702678299923723,
      "Basic LSI with k=150 dimensions": 0.9758521455086463,
      "Field-weighted LSI": 0.9848920829832359,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9830304948282328
    },
    "2": {
      "BM25 baseline": 0.9676702855814429,
      "Basic LSI with k=150 dimensions": 0.9629813342264886,
      "Field-weighted LSI": 0.9667761720787513,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9794256057551403
    },
    "3": {
      "BM25 baseline": 0.9044563947524946,
      "Basic LSI with k=150 dimensions": 0.9933589454400779,
      "Field-weighted LSI": 0.9898865373853496,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9881020176053733
    },
    "4": {
      "BM25 baseline": 0.9581091124245024,
      "Basic LSI with k=150 dimensions": 0.948554309956447,
      "Field-weighted LSI": 0.9960109913990778,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9878911924184829
    },
    "5": {
      "BM25 baseline": 0.9658615078539107,
      "Basic LSI with k=150 dimensions": 0.9575689562960752,
      "Field-weighted LSI": 0.9505284770054349,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9655739770289112
    },
    "6": {
      "BM25 baseline": 0.9752052508426301,
      "Basic LSI with k=150 dimensions": 0.9627981168426469,
      "Field-weighted LSI": 0.9581046966570809,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9591202408479728
    },
    "7": {
      "BM25 baseline": 0.9818993385732202,
      "Basic LSI with k=150 dimensions": 0.9807871865261831,
      "Field-weighted LSI": 0.9809150399185756,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9823527299043489
    },
    "8": {
      "BM25 baseline": 0.9321327160076697,
      "Basic LSI with k=150 dimensions": 0.9710645286552664,
      "Field-weighted LSI": 0.9762610082520967,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9754601319164254
    },
    "9": {
      "BM25 baseline": 0.9245261750389756,
      "Basic LSI with k=150 dimensions": 0.9676601688646809,
      "Field-weighted LSI": 0.9667775455963419,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9776080183622335
    },
    "10": {
      "BM25 baseline": 0.9529765131303531,
      "Basic LSI with k=150 dimensions": 0.9645555969472415,
      "Field-weighted LSI": 0.9691697980787715,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9621459956970113
    },
    "11": {
      "BM25 baseline": 0.8964577324065195,
      "Basic LSI with k=150 dimensions": 0.9609562747330634,
      "Field-weighted LSI": 0.9807828639897396,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9793535110310084
    },
    "12": {
      "BM25 baseline": 0.984695848675374,
      "Basic LSI with k=150 dimensions": 0.9788847916645828,
      "Field-weighted LSI": 0.9693232588401846,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9857402370727113
    },
    "13": {
      "BM25 baseline": 0.9782201890488277,
      "Basic LSI with k=150 dimensions": 0.9855064285891632,
      "Field-weighted LSI": 0.9902678789460896,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9887766464301465
    },
    "14": {
      "BM25 baseline": 0.970356466152375,
      "Basic LSI with k=150 dimensions": 0.9628576588661965,
      "Field-weighted LSI": 0.9280259962459354,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9809106119256944
    },
    "15": {
      "BM25 baseline": 0.9835913643588563,
      "Basic LSI with k=150 dimensions": 0.9690110565535949,
      "Field-weighted LSI": 0.9027999093185158,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9030498209383528
    },
    "16": {
      "BM25 baseline": 0.9966040770726391,
      "Basic LSI with k=150 dimensions": 0.9855664146422799,
      "Field-weighted LSI": 0.994992385847428,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9942910776627677
    },
    "17": {
      "BM25 baseline": 0.9825657033733907,
      "Basic LSI with k=150 dimensions": 0.9827942717636607,
      "Field-weighted LSI": 0.8947880226442804,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9809054561880343
    },
    "18": {
      "BM25 baseline": 0.9621443885246534,
      "Basic LSI with k=150 dimensions": 0.9441357282937284,
      "Field-weighted LSI": 0.9594627248767399,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9832791715266921
    },
    "19": {
      "BM25 baseline": 0.9334185635964451,
      "Basic LSI with k=150 dimensions": 0.9829274742442599,
      "Field-weighted LSI": 0.9933716770959169,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9847174929495353
    },
    "20": {
      "BM25 baseline": 0.9516312021336671,
      "Basic LSI with k=150 dimensions": 0.9796666282990453,
      "Field-weighted LSI": 0.9786787250738833,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9556153266519222
    },
    "21": {
      "BM25 baseline": 0.9578392479069989,
      "Basic LSI with k=150 dimensions": 0.9766666646329698,
      "Field-weighted LSI": 0.9758102199492783,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9850558651966211
    },
    "22": {
      "BM25 baseline": 0.9748673310883267,
      "Basic LSI with k=150 dimensions": 0.9918139964433209,
      "Field-weighted LSI": 0.9873837905369207,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9883659434195166
    },
    "23": {
      "BM25 baseline": 0.9018046783186535,
      "Basic LSI with k=150 dimensions": 0.9741947524048282,
      "Field-weighted LSI": 0.9574728776175179,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9573189253711919
    },
    "24": {
      "BM25 baseline": 0.9261770173775742,
      "Basic LSI with k=150 dimensions": 0.9739773208175877,
      "Field-weighted LSI": 0.9941760144154915,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9866652960369205
    },
    "25": {
      "BM25 baseline": 0.9514747319710257,
      "Basic LSI with k=150 dimensions": 0.9493207798604006,
      "Field-weighted LSI": 0.9288759835628523,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9038360948563292
    },
    "26": {
      "BM25 baseline": 0.9699219676915909,
      "Basic LSI with k=150 dimensions": 0.9609181526850149,
      "Field-weighted LSI": 0.9203767389876794,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9689337600130596
    },
    "27": {
      "BM25 baseline": 0.9655588343889085,
      "Basic LSI with k=150 dimensions": 0.9249466577802449,
      "Field-weighted LSI": 0.9275834806993435,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9140204740001455
    },
    "28": {
      "BM25 baseline": 0.9657423422096584,
      "Basic LSI with k=150 dimensions": 0.9836744745815332,
      "Field-weighted LSI": 0.9854102537186108,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9854102537186108
    },
    "29": {
      "BM25 baseline": 0.9925132360531861,
      "Basic LSI with k=150 dimensions": 0.986803734351502,
      "Field-weighted LSI": 0.9915805450761829,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9840163312121945
    },
    "30": {
      "BM25 baseline": 0.965399565159306,
      "Basic LSI with k=150 dimensions": 0.9437529122758956,
      "Field-weighted LSI": 0.9827509727452103,
      "Field-weighted LSI with BERT-enhanced indexing": 0.9774730536666356
    }
  },
  "system_averages": {
    "BM25 baseline": 0.9581363203901849,
    "Basic LSI with k=150 dimensions": 0.9694519154248875,
    "Field-weighted LSI": 0.966107888984751,
    "Field-weighted LSI with BERT-enhanced indexing": 0.9716148584744075
  },
  "overall_comparison": [
    {
      "system": "Field-weighted LSI with BERT-enhanced indexing",
      "average_ndcg@10": 0.9716148584744075
    },
    {
      "system": "Basic LSI with k=150 dimensions",
      "average_ndcg@10": 0.9694519154248875
    },
    {
      "system": "Field-weighted LSI",
      "average_ndcg@10": 0.966107888984751
    },
    {
      "system": "BM25 baseline",
      "average_ndcg@10": 0.9581363203901849
    }
  ]
}
"""

# Parse the JSON data
data = json.loads(data_json)

# Extract the query_ndcg values
query_ndcg = data["query_ndcg"]

# Initialize a dictionary to accumulate the per-system scores
system_scores = {}

for query, scores in query_ndcg.items():
    for system, ndcg in scores.items():
        system_scores.setdefault(system, []).append(ndcg)

# Compute the average, median, and standard deviation for each system
# (np.std defaults to the population standard deviation, ddof=0)
summary = []
for system, scores in system_scores.items():
    avg = np.mean(scores)
    median = np.median(scores)
    std = np.std(scores)
    summary.append({
        "System": system,
        "Average NDCG@10": avg,
        "Median NDCG@10": median,
        "STD NDCG@10": std
    })

# Create a DataFrame to display the summary
df_summary = pd.DataFrame(summary)
df_summary = df_summary.sort_values("Average NDCG@10", ascending=False).reset_index(drop=True)

print(df_summary)


                                           System  Average NDCG@10  \
0  Field-weighted LSI with BERT-enhanced indexing         0.971615   
1                 Basic LSI with k=150 dimensions         0.969452   
2                              Field-weighted LSI         0.966108   
3                                   BM25 baseline         0.958136   

   Median NDCG@10  STD NDCG@10  
0        0.980908     0.023714  
1        0.972521     0.015790  
2        0.976036     0.027842  
3        0.965651     0.026032  
