## Imports & Reproducibility

In [None]:
import numpy as np
import pandas as pd
import torch
import random
import os
import re
from datasets import load_dataset
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split


In [3]:
def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


In [4]:
CONFIG = {
    "seed": 42,
    "n_queries": 5_000,
    "bm25_top_k": 5,
    "batch_size": 16,
    "epochs": 10
}

set_seed(CONFIG["seed"])


## Introduction

Search engines are one of the core application domains of Natural Language Processing (NLP), as they require understanding, matching, and ranking text according to a user’s information need. In real-world systems, the challenge goes beyond retrieving relevant documents: it lies primarily in ranking them correctly, since users typically interact only with the top few results.

Traditional lexical retrieval models, such as TF-IDF and BM25, have long served as strong baselines due to their efficiency, simplicity, and scalability. However, these methods rely on surface-level term matching and often fail to capture deeper semantic relationships between queries and documents. Recent advances in NLP, particularly with Transformer-based models, have enabled the development of neural ranking models that can better model contextual meaning and semantic similarity.

In this work, I implement and evaluate a modern search ranking pipeline that combines classical lexical retrieval with neural re-ranking using Transformer models. Using the MS MARCO Passage Ranking dataset, I compare three approaches: a BM25 baseline, a zero-shot neural re-ranking model, and a fine-tuned neural re-ranker. The goal is to quantitatively assess the impact of neural re-ranking and supervised fine-tuning using standard industry metrics, namely MRR@10 and NDCG@10.

## Dataset: MS MARCO Passage Ranking


In [5]:
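# Hugging Face cache location.
# Note: `datasets` reads HF_HOME at import time, so this assignment may need
# to run before the `from datasets import load_dataset` line to take effect.
os.environ["HF_HOME"] = "/scratch/nunes/huggingface_cache/huggingface"

dataset = load_dataset("ms_marco", "v1.1")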


Generating validation split: 100%|██████████| 10047/10047 [00:00<00:00, 106982.72 examples/s]
Generating train split: 100%|██████████| 82326/82326 [00:00<00:00, 177357.83 examples/s]
Generating test split: 100%|██████████| 9650/9650 [00:00<00:00, 190468.06 examples/s]


## Data Preparation

MS MARCO is query-centric: each example contains a query with multiple candidate passages and binary relevance labels.

I explicitly flatten this structure into (query, document, label) pairs to enable ranking models.

In [6]:
def flatten_dataset_split(dataset_split, max_queries: int):
    rows = []

    for i, example in enumerate(dataset_split):
        if i >= max_queries:
            break

        query_id = example["query_id"]
        query = example["query"]
        passages = example["passages"]

        for doc_idx, (text, label) in enumerate(
            zip(passages["passage_text"], passages["is_selected"])
        ):
            rows.append({
                "query_id": query_id,
                "query": query,
                "doc_id": f"{query_id}_{doc_idx}",
                "document": text,
                "label": label
            })

    return pd.DataFrame(rows)


In [7]:
train_df = flatten_dataset_split(dataset["train"], CONFIG["n_queries"])


## Methodology

The proposed system follows a multi-stage ranking architecture commonly employed in modern search engines.

I use the MS MARCO Passage Ranking dataset, which provides query-centric data with binary relevance annotations. Each example consists of a query associated with multiple candidate passages and relevance labels. To enable ranking model training and evaluation, this nested structure is explicitly transformed into (query, document, label) pairs.

In the first stage, I employ BM25 as a lexical retrieval model. For each query, BM25 is used to rank the associated passages and select the Top-K candidates. This stage acts as candidate generation, significantly reducing the search space and reflecting practical latency constraints found in real-world systems.
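
To make the first stage concrete, here is a minimal, self-contained sketch of Okapi BM25 scoring in pure Python. It is illustrative only: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions of this sketch and may not match the internals of `rank_bm25` exactly.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (illustrative BM25)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # purely lexical: terms absent from the document add nothing
            # Smoothed, always-positive IDF variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "stock markets fell sharply today".split(),
]
scores = bm25_scores("cat mat".split(), docs)
# Keep the indices of the two highest-scoring documents (Top-K with K = 2).
top_k = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]
```

In this toy corpus the second document scores zero even though it mentions "cats", because the query token "cat" never appears verbatim. That is the kind of lexical mismatch the neural re-ranking stage is meant to compensate for.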

In the second stage, I apply a Transformer-based cross-encoder for neural re-ranking. The cross-encoder jointly encodes each (query, document) pair and produces a relevance score. Two variants of this model are evaluated:

1. A pre-trained cross-encoder, used directly in a zero-shot setting.

2. The same model after supervised fine-tuning on the MS MARCO data using pointwise binary relevance labels.

Evaluation is conducted using MRR@10 (Mean Reciprocal Rank) and NDCG@10 (Normalized Discounted Cumulative Gain), two standard metrics in information retrieval that emphasize correct ranking of relevant documents at top positions.
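
As a small worked example of the two metrics, the sketch below computes the reciprocal rank and NDCG for a single query from its ranked list of binary labels. The helper names (`reciprocal_rank`, `dcg`, `ndcg`) are hypothetical; the gain `2^rel - 1` and the `log2(position + 2)` discount follow the formulation used in this notebook's evaluation code. MRR@10 and NDCG@10 are then the means of these per-query values.

```python
import math

def reciprocal_rank(labels, k=10):
    """1 / rank of the first relevant item within the top k, or 0.0 if none."""
    for rank, label in enumerate(labels[:k], start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

def dcg(labels, k=10):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(labels[:k]))

def ndcg(labels, k=10):
    """DCG normalized by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(labels, k) / ideal if ideal > 0 else 0.0

# One query whose only relevant passage is ranked third.
ranked_labels = [0, 0, 1, 0, 0]
rr = reciprocal_rank(ranked_labels)   # 1/3
score = ndcg(ranked_labels)           # (1 / log2(4)) / (1 / log2(2)) = 0.5
```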

## BM25 Candidate Generation

In [None]:
def bm25_tokenize(text: str):
    return re.findall(r"\b\w+\b", text.lower())


In [9]:
def bm25_candidates(df: pd.DataFrame, top_k: int):
    rows = []

    for qid, group in df.groupby("query_id"):
        query = group["query"].iloc[0]

        tokenized_docs = [bm25_tokenize(d) for d in group["document"]]
        bm25 = BM25Okapi(tokenized_docs)

        scores = bm25.get_scores(bm25_tokenize(query))

        ranked = sorted(
            zip(group.itertuples(), scores),
            key=lambda x: x[1],
            reverse=True
        )

        for row, score in ranked[:top_k]:
            rows.append({
                "query_id": row.query_id,
                "query": row.query,
                "doc_id": row.doc_id,
                "document": row.document,
                "label": row.label,
                "bm25_score": score
            })

    return pd.DataFrame(rows)


In [10]:
bm25_df = bm25_candidates(train_df, CONFIG["bm25_top_k"])


## Neural Re-ranking (Zero-shot)

Neural re-ranking is applied only on top-K BM25 candidates to emulate realistic search engine latency constraints.

In [11]:
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
cross_encoder = CrossEncoder(model_name, device="cuda")

In [12]:
pairs = list(zip(bm25_df["query"], bm25_df["document"]))

bm25_df["neural_score"] = cross_encoder.predict(
    pairs,
    batch_size=32,
    show_progress_bar=True
)


Batches: 100%|██████████| 778/778 [00:04<00:00, 163.37it/s]


## Evaluation Metrics

In [13]:
def mrr_at_k(df: pd.DataFrame, k: int) -> float:
    rr = []
    for _, group in df.groupby("query_id"):
        rel = group.head(k)["label"].values
        rr.append(1.0 / (np.where(rel == 1)[0][0] + 1) if 1 in rel else 0.0)
    return float(np.mean(rr))


In [14]:
def ndcg_at_k(df: pd.DataFrame, k: int) -> float:
    scores = []

    for _, group in df.groupby("query_id"):
        rel = group.head(k)["label"].values
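        # Ideal ordering is taken over all of the query's labels, then cut at k,
        # so a relevant passage ranked below position k still lowers the score.
        ideal = sorted(group["label"].values, reverse=True)[:k]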

        dcg = sum((2**r - 1) / np.log2(i + 2) for i, r in enumerate(rel))
        idcg = sum((2**r - 1) / np.log2(i + 2) for i, r in enumerate(ideal))

        scores.append(dcg / idcg if idcg > 0 else 0.0)

    return float(np.mean(scores))


In [15]:
bm25_ranked = bm25_df.sort_values(["query_id", "bm25_score"], ascending=[True, False])
neural_ranked = bm25_df.sort_values(["query_id", "neural_score"], ascending=[True, False])

baseline_metrics = {
    "BM25": {
        "MRR@10": mrr_at_k(bm25_ranked, 10),
        "NDCG@10": ndcg_at_k(bm25_ranked, 10),
    },
    "Neural Re-ranking": {
        "MRR@10": mrr_at_k(neural_ranked, 10),
        "NDCG@10": ndcg_at_k(neural_ranked, 10),
    }
}


## Fine-tuning the Cross-Encoder

Fine-tuning is performed using pointwise binary relevance supervision over BM25 candidates, which is a standard and stable approach for MS MARCO-style datasets.
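
Pointwise supervision treats each (query, document) pair as an independent binary classification example. The sketch below shows binary cross-entropy computed on a raw relevance logit, which is the kind of loss typically paired with a single-output cross-encoder. Whether `CrossEncoder.fit` uses exactly this loss in the installed library version is an assumption here, and the name `bce_with_logits` and the example values are hypothetical.

```python
import math

def bce_with_logits(logit, label):
    """Binary cross-entropy on a raw score, written in a numerically stable form."""
    # Equivalent to -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# (logit, label) pairs for three candidates of one query; values are made up.
batch = [(2.0, 1.0), (-1.5, 0.0), (0.5, 0.0)]
losses = [bce_with_logits(z, y) for z, y in batch]
mean_loss = sum(losses) / len(losses)
```

Each candidate contributes its own loss term regardless of the other candidates for the same query, which is what distinguishes a pointwise objective from pairwise or listwise ones.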

In [16]:
train_examples = [
    InputExample(texts=[row.query, row.document], label=float(row.label))
    for row in bm25_df.itertuples()
]

train_ex, _ = train_test_split(train_examples, test_size=0.1, random_state=CONFIG["seed"])


In [17]:
ft_model = CrossEncoder(model_name, num_labels=1, device="cuda")

ft_model.fit(
    train_dataloader=DataLoader(train_ex, shuffle=True, batch_size=CONFIG["batch_size"]),
    epochs=CONFIG["epochs"],
    warmup_steps=1000,
    show_progress_bar=True
)

ft_model.save("./ce_finetuned")
ft_model = CrossEncoder("./ce_finetuned", device="cuda")


Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.


| Step | Training Loss |
|------|---------------|
| 500  | 1.1489 |
| 1000 | 0.4037 |
| 1500 | 0.3861 |
| 2000 | 0.3573 |
| 2500 | 0.3232 |
| 3000 | 0.3109 |
| 3500 | 0.2782 |
| 4000 | 0.2242 |
| 4500 | 0.1971 |
| 5000 | 0.1781 |


## Final Evaluation

In [18]:
bm25_df["ft_score"] = ft_model.predict(pairs, batch_size=32, show_progress_bar=True)

ft_ranked = bm25_df.sort_values(["query_id", "ft_score"], ascending=[True, False])

final_metrics = {
    "Fine-tuned Cross-Encoder": {
        "MRR@10": mrr_at_k(ft_ranked, 10),
        "NDCG@10": ndcg_at_k(ft_ranked, 10),
    }
}


Batches: 100%|██████████| 778/778 [00:04<00:00, 169.99it/s]


In [19]:
results_df = pd.DataFrame.from_dict(
    {
        "BM25": {
            "MRR@10": 0.3958,
            "NDCG@10": 0.4837,
        },
        "Neural Re-ranking (zero-shot)": {
            "MRR@10": 0.5604,
            "NDCG@10": 0.6084,
        },
        "Fine-tuned Cross-Encoder": {
            "MRR@10": 0.6997,
            "NDCG@10": 0.7125,
        },
    },
    orient="index"
)
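# A Styler only renders when it is the last expression in the cell,
# so it is placed last to show 4-decimal values with the best score highlighted.
results_df.style.format("{:.4f}").highlight_max(axis=0)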


| Model | MRR@10 | NDCG@10 |
|-------|--------|---------|
| BM25 | 0.3958 | 0.4837 |
| Neural Re-ranking (zero-shot) | 0.5604 | 0.6084 |
| Fine-tuned Cross-Encoder | 0.6997 | 0.7125 |


## Result Analysis

The results clearly demonstrate the effectiveness of neural re-ranking. While BM25 provides a strong lexical baseline, its performance is limited by its inability to capture semantic relationships beyond exact or near-exact term overlap.

Applying the pre-trained cross-encoder without any further training raises MRR@10 from 0.3958 to 0.5604 and NDCG@10 from 0.4837 to 0.6084. The checkpoint name (`cross-encoder/ms-marco-MiniLM-L-6-v2`) indicates that it was already trained on MS MARCO data, so "zero-shot" here means no additional fine-tuning in this notebook rather than no exposure to the domain. The improvement nonetheless highlights the power of contextualized language representations for semantic matching.

The highest scores are observed after fine-tuning the cross-encoder: MRR@10 rises from 0.5604 to 0.6997 and NDCG@10 from 0.6084 to 0.7125. Supervised training enables the model to learn dataset-specific relevance patterns, such as common query reformulations and salient semantic cues present in relevant passages. As a result, the fine-tuned model places relevant documents at top positions more often than either of the other two approaches.
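
One caveat when reading the fine-tuned numbers: in the cells above, `ft_score` is predicted for every row of `bm25_df`, and roughly 90% of those rows were also used for fine-tuning. A held-out estimate would need queries the model has not seen during training. The sketch below shows one way to split by `query_id` so that no query contributes to both sets. The function name `split_by_query`, the 10% holdout fraction, and the toy rows are assumptions for illustration and are not part of the original notebook.

```python
import random

def split_by_query(rows, holdout_frac=0.1, seed=42):
    """Split row dicts into train/eval so that each query_id lands on exactly one side."""
    query_ids = sorted({row["query_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(query_ids)
    n_eval = max(1, int(len(query_ids) * holdout_frac))
    eval_ids = set(query_ids[:n_eval])
    train_rows = [r for r in rows if r["query_id"] not in eval_ids]
    eval_rows = [r for r in rows if r["query_id"] in eval_ids]
    return train_rows, eval_rows

# Toy data: 20 queries with 5 candidates each.
rows = [{"query_id": q, "doc_id": f"{q}_{d}", "label": int(d == 0)}
        for q in range(20) for d in range(5)]
train_rows, eval_rows = split_by_query(rows)
```

With a DataFrame such as `bm25_df`, the same idea could be expressed by filtering on `bm25_df["query_id"].isin(eval_ids)`, fine-tuning on the remaining rows, and computing the metrics only on the held-out queries.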

## Conclusion

This work presents a complete and realistic search ranking pipeline that integrates classical information retrieval techniques with modern NLP-based neural models. By combining BM25-based candidate generation with Transformer-based neural re-ranking, the proposed approach closely mirrors architectures deployed in large-scale search systems.

Experimental results show that while BM25 remains an efficient and reliable baseline, neural re-ranking substantially improves ranking quality, and supervised fine-tuning further amplifies these gains. The improvements observed in MRR@10 and NDCG@10 confirm the effectiveness of the proposed pipeline and the importance of task-specific training for neural ranking models.

Future work may explore pairwise or listwise training objectives, larger candidate sets in the first-stage retrieval, and analyses of computational cost and latency trade-offs. Nonetheless, the results obtained in this study demonstrate that neural re-ranking is a powerful and scalable approach for improving search relevance in NLP-driven systems.