# Exercise 10: Re-ranking

**Objective:** Implement and evaluate a two-stage Information Retrieval pipeline, and analyze the impact of re-ranking on ranking quality.

## Part 1. Corpus preparation

* Load the corpus (documents/passages).
* Load the queries.
* Load the qrels (relevance judgments).
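
For orientation, the three objects loaded in this part are plain nested dictionaries. The sketch below mimics their shape with invented toy entries (the IDs and texts are illustrative and not taken from SciFact) and flattens the qrels into rows; the helper name `qrels_to_rows` is hypothetical.

```python
# Toy stand-ins for the three objects; all IDs and texts are invented.
corpus = {
    "d1": {"title": "Vitamin B12", "text": "B12 deficiency raises homocysteine."},
    "d2": {"title": "Myelin", "text": "Structure of the myelin basic protein gene."},
}
queries = {"q1": "A deficiency of vitamin B12 increases homocysteine."}
qrels = {"q1": {"d1": 1}}  # query_id -> {doc_id: relevance label}


def qrels_to_rows(qrels):
    """Flatten nested qrels into (query_id, doc_id, relevance) tuples."""
    return [
        (qid, doc_id, rel)
        for qid, docs in qrels.items()
        for doc_id, rel in docs.items()
    ]


rows = qrels_to_rows(qrels)
print(rows)
```

Each tuple corresponds to one row of a qrels table with the columns `query_id`, `doc_id` and `relevance`.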

In [1]:
!pip install beir

Successfully installed beir-2.2.0 pytrec-eval-terrier-0.5.10


In [2]:
from beir import util
from beir.datasets.data_loader import GenericDataLoader
import pandas as pd



In [3]:
DATASET_NAME = "scifact"
DATA_DIR = "../data/beir_datasets"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{DATASET_NAME}.zip"
util.download_and_unzip(url, DATA_DIR)

'../data/beir_datasets/scifact'

In [4]:
dataset_path = DATA_DIR + "/" + DATASET_NAME
corpus, queries, qrels = GenericDataLoader(dataset_path).load(split="test")

In [5]:
df_corpus = (
    pd.DataFrame.from_dict(corpus, orient="index")
      .reset_index()
      .rename(columns={"index": "doc_id"})
)

df_corpus

| | doc_id | text | title |
|---|---|---|---|
| 0 | 4983 | Alterations of the architecture of cerebral wh... | Microstructural development of human newborn c... |
| 1 | 5836 | Myelodysplastic syndromes (MDS) are age-depend... | Induction of myelodysplasia by myeloid-derived... |
| 2 | 7912 | ID elements are short interspersed elements (S... | BC1 RNA, the transcript from a master gene for... |
| 3 | 18670 | DNA methylation plays an important role in bio... | The DNA Methylome of Human Peripheral Blood Mo... |
| 4 | 19238 | Two human Golli (for gene expressed in the oli... | The human myelin basic protein gene is include... |
| ... | ... | ... | ... |
| 5178 | 195689316 | BACKGROUND The main associations of body-mass ... | Body-mass index and cause-specific mortality i... |
| 5179 | 195689757 | A key aberrant biological difference between t... | Targeting metabolic remodeling in glioblastoma... |
| 5180 | 196664003 | A signaling pathway transmits information from... | Signaling architectures that transmit unidirec... |
| 5181 | 198133135 | AIMS Trabecular bone score (TBS) is a surrogat... | Association between pre-diabetes, type 2 diabe... |


In [6]:
df_queries = (
    pd.DataFrame.from_dict(queries, orient="index", columns=["query"])
      .reset_index()
      .rename(columns={"index": "query_id"})
)

df_queries

| | query_id | query |
|---|---|---|
| 0 | 1 | 0-dimensional biomaterials show inductive prop... |
| 1 | 3 | 1,000 genomes project enables mapping of genet... |
| 2 | 5 | 1/2000 in UK have abnormal PrP positivity. |
| 3 | 13 | 5% of perinatal mortality is due to low birth ... |
| 4 | 36 | A deficiency of vitamin B12 increases blood le... |
| ... | ... | ... |
| 295 | 1379 | Women with a higher birth weight are more like... |
| 296 | 1382 | aPKCz causes tumour enhancement by affecting g... |
| 297 | 1385 | cSMAC formation enhances weak ligand signalling. |
| 298 | 1389 | mTORC2 regulates intracellular cysteine levels... |


In [7]:
rows = []
for qid, docs in qrels.items():
    for doc_id, rel in docs.items():
        rows.append({
            "query_id": qid,
            "doc_id": doc_id,
            "relevance": rel
        })

df_qrels = pd.DataFrame(rows)
df_qrels

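| | query_id | doc_id | relevance |
|---|---|---|---|
| 0 | 1 | 31715818 | 1 |
| 1 | 3 | 14717500 | 1 |
| 2 | 5 | 13734012 | 1 |
| 3 | 13 | 1606628 | 1 |
| 4 | 36 | 5152028 | 1 |
| ... | ... | ... | ... |
| 334 | 1379 | 17450673 | 1 |
| 335 | 1382 | 17755060 | 1 |
| 336 | 1385 | 306006 | 1 |
| 337 | 1389 | 23895668 | 1 |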


In [8]:
# Pick an arbitrary query that has several relevant documents
qid = "133"

print("Query:")
print(df_queries.loc[df_queries["query_id"] == qid, "query"].values[0])

print("\nRelevant documents for this query:")
df_qrels[(df_qrels["query_id"] == qid) & (df_qrels["relevance"] > 0)]

Query:
Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Relevant documents for this query:


| | query_id | doc_id | relevance |
|---|---|---|---|
| 31 | 133 | 38485364 | 1 |
| 32 | 133 | 6969753 | 1 |
| 33 | 133 | 17934082 | 1 |
| 34 | 133 | 16280642 | 1 |
| 35 | 133 | 12640810 | 1 |


In [9]:
!pip install rank_bm25 sentence-transformers lightgbm

Successfully installed rank_bm25-0.2.2


## Part 2. Initial retrieval (baseline)

* Implement the initial retrieval stage with BM25.
* Compute the metrics Recall@10 and nDCG@10.
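
To make the two metrics concrete, here is a minimal pure-Python sketch of Recall@k and nDCG@k for a single query. It assumes a linear gain with a log2 rank discount, which is a common formulation; the evaluation library used in the notebook may differ in details, and the function names and toy data below are hypothetical.

```python
import math


def recall_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    n_rel = sum(1 for rel in relevant.values() if rel > 0)
    if n_rel == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if relevant.get(doc_id, 0) > 0)
    return hits / n_rel


def ndcg_at_k(ranked_doc_ids, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        relevant.get(doc_id, 0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_doc_ids[:k])
    )
    ideal_gains = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(position + 2) for position, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: two relevant documents, only one of them in the top 3.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1": 1, "d2": 1}

recall_3 = recall_at_k(ranking, relevant, k=3)
ndcg_3 = ndcg_at_k(ranking, relevant, k=3)
print(f"Recall@3: {recall_3:.4f} | nDCG@3: {ndcg_3:.4f}")
```

Recall@k only checks whether relevant documents made it into the top k, while nDCG@k also rewards placing them closer to the top. This is why a re-ranker can improve nDCG@10 even when Recall@10 stays the same.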

In [10]:
from rank_bm25 import BM25Okapi
from beir.retrieval.evaluation import EvaluateRetrieval

In [12]:
# 1. Simple corpus preprocessing
corpus_ids = list(corpus.keys())
corpus_list = [corpus[doc_id] for doc_id in corpus_ids]
# Tokenization: lowercase and split on whitespace (punctuation is kept)
tokenized_corpus = [
    (doc['title'] + " " + doc['text']).lower().split()
    for doc in corpus_list
]

# 2. Indexing with BM25
bm25 = BM25Okapi(tokenized_corpus)

# 3. Retrieval
results_bm25 = {}
top_k = 100  # Retrieve 100 candidates per query

for qid, query_text in queries.items():
    query_tokens = query_text.lower().split()
    # Score every document in the corpus
    doc_scores = bm25.get_scores(query_tokens)
    # Keep the indices of the top_k highest-scoring documents
    top_n_indices = sorted(range(len(doc_scores)), key=lambda i: doc_scores[i], reverse=True)[:top_k]
    # Store the results in the BEIR format {doc_id: score}
    results_bm25[qid] = {corpus_ids[i]: float(doc_scores[i]) for i in top_n_indices}

# 4. Baseline evaluation
evaluator = EvaluateRetrieval()
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results_bm25, k_values=[10])

print("\n--- Baseline results (BM25) ---")
print(f"nDCG@10:  {ndcg['NDCG@10']:.4f}")
print(f"Recall@10: {recall['Recall@10']:.4f}")


--- Baseline results (BM25) ---
nDCG@10:  0.5597
Recall@10: 0.6862
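
As a rough illustration of what the BM25 stage computes, the sketch below scores a toy corpus with a textbook Okapi BM25 formula. It is not the `rank_bm25` implementation: the non-negative IDF variant, the parameter defaults `k1=1.5` and `b=0.75`, the function name `bm25_scores` and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: number of documents containing each term
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (an assumption, see the lead-in)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "src kinase activates invadopodia assembly".split(),
    "vitamin b12 deficiency and homocysteine".split(),
    "src signalling in cancer cells".split(),
]
scores = bm25_scores("src invadopodia".split(), docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

Because the score depends only on term overlap, a document that repeats query terms such as "Src" can rank highly without being relevant, which is the kind of error a second-stage re-ranker is meant to correct.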


## Part 3. Re-ranking with a _cross-encoder_

* Re-rank the top-k candidates for each query.
* Identify which documents change position in the top 10.
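
One possible way to identify position changes in the top 10 is to compare the ranks a document receives before and after re-ranking. The sketch below works on two `{doc_id: score}` dictionaries for a single query; the helper names `top_k_ids` and `top_k_changes` and the toy scores are hypothetical.

```python
def top_k_ids(scores, k=10):
    """Return doc IDs sorted by descending score, truncated to k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def top_k_changes(before, after, k=10):
    """Compare two {doc_id: score} dicts that rank the same candidates."""
    top_before = top_k_ids(before, k)
    top_after = top_k_ids(after, k)
    rank_before = {doc_id: rank for rank, doc_id in enumerate(top_before, start=1)}
    rank_after = {doc_id: rank for rank, doc_id in enumerate(top_after, start=1)}
    return {
        # Documents promoted into the top k by the re-ranker
        "entered": [d for d in top_after if d not in rank_before],
        # Documents pushed out of the top k
        "dropped": [d for d in top_before if d not in rank_after],
        # Documents in both lists whose rank changed: (old rank, new rank)
        "moved": {
            d: (rank_before[d], rank_after[d])
            for d in top_after
            if d in rank_before and rank_before[d] != rank_after[d]
        },
    }


# Toy scores for four candidates of one query
first_stage = {"a": 9.0, "b": 8.0, "c": 7.0, "d": 1.0}
reranked = {"a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}
changes = top_k_changes(first_stage, reranked, k=3)
print(changes)
```

With the notebook's variables, the same helper could be applied per query as `top_k_changes(results_bm25[qid], results_cross_encoder[qid])` once both dictionaries exist.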

In [13]:
from sentence_transformers import CrossEncoder
import numpy as np

In [14]:
# 1. Load the cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
results_cross_encoder = {}

# 2. Iterate over the queries and their BM25 candidates
for qid, hits in results_bm25.items():
    pairs = []
    doc_ids_list = []

    # Build one (query, document) pair per candidate
    for doc_id in hits:
        doc_text = corpus[doc_id]['title'] + " " + corpus[doc_id]['text']
        pairs.append([queries[qid], doc_text])
        doc_ids_list.append(doc_id)

    if not pairs:
        continue

    # 3. Predict relevance scores
    scores = cross_encoder.predict(pairs)

    # 4. Store the results
    # BEIR expects a dict {doc_id: score}
    results_cross_encoder[qid] = {
        doc_id: float(score)
        for doc_id, score in zip(doc_ids_list, scores)
    }

In [28]:
# Visualization
# Use the example query '133' to see how the top of the ranking changed
qid_ejemplo = "133"
if qid_ejemplo in results_cross_encoder:
    print(f"\nAnalysis for Query {qid_ejemplo}:")

    # Top 5 BM25
    sorted_bm25 = sorted(results_bm25[qid_ejemplo].items(), key=lambda x: x[1], reverse=True)[:5]
    print("\nTop 5 Original (BM25):")
    for doc_id, score in sorted_bm25:
        print(f"Title: {corpus[doc_id].get('title', 'No Title')}")
        print(f"Doc: {doc_id} | Score: {score:.4f} | Rel: {qrels[qid_ejemplo].get(doc_id, 0)}")
        print("-" * 50)

    # Top 5 Cross-Encoder
    sorted_ce = sorted(results_cross_encoder[qid_ejemplo].items(), key=lambda x: x[1], reverse=True)[:5]
    print("\nTop 5 Re-ranked (Cross-Encoder):")
    for doc_id, score in sorted_ce:
        print(f"Title: {corpus[doc_id].get('title', 'No Title')}")
        print(f"Doc: {doc_id} | Score: {score:.4f} | Rel: {qrels[qid_ejemplo].get(doc_id, 0)}")
        print("-" * 50)


Analysis for Query 133:

Top 5 Original (BM25):
Title: Schizophrenia susceptibility pathway neuregulin 1–ErbB4 suppresses Src upregulation of NMDA receptors
Doc: 26688294 | Score: 55.1214 | Rel: 0
--------------------------------------------------
Title: Focal contacts as mechanosensors: externally applied local mechanical force induces growth of focal contacts by an mDia1-dependent and ROCKindependent mechanism
Doc: 9507605 | Score: 50.4710 | Rel: 0
--------------------------------------------------
Title: Local Ca2+ influx through Ca2+ release-activated Ca2+ (CRAC) channels stimulates production of an intracellular messenger and an intercellular pro-inflammatory signal.
Doc: 37964706 | Score: 49.9029 | Rel: 0
--------------------------------------------------
Title: Combating trastuzumab resistance by targeting SRC, a common node downstream of multiple resistance pathways
Doc: 5270265 | Score: 46.2733 | Rel: 0
--------------------------------------------------
Title: The regul

## Part 4. Re-ranking with _LTR_ (learning to rank)

* Re-rank the top-k candidates for each query.
* Identify which documents change position in the top 10.
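
A listwise or pairwise ranker such as LambdaRank is typically trained on a flat feature matrix plus a list of group sizes that tells the model which consecutive rows belong to the same query. The sketch below shows that layout with toy data; the helper name `build_ltr_rows` and the feature values are hypothetical, and the two features are assumed to be a lexical score and a semantic similarity.

```python
def build_ltr_rows(features_by_query, labels_by_query):
    """Flatten per-query candidates into X, y and group sizes."""
    X, y, group = [], [], []
    for qid, candidates in features_by_query.items():
        for doc_id, features in candidates.items():
            X.append(list(features))
            # Unjudged documents are treated as non-relevant (label 0)
            y.append(labels_by_query.get(qid, {}).get(doc_id, 0))
        # One entry per query: how many consecutive rows belong to it
        group.append(len(candidates))
    return X, y, group


# Toy features: [lexical score, semantic similarity]
features_by_query = {
    "q1": {"d1": [12.0, 0.8], "d2": [15.0, 0.1]},
    "q2": {"d3": [7.0, 0.4], "d4": [6.5, 0.7], "d5": [3.0, 0.2]},
}
labels_by_query = {"q1": {"d1": 1}, "q2": {"d4": 1}}

X, y, group = build_ltr_rows(features_by_query, labels_by_query)
print(group, y)
```

The group sizes must sum to the number of rows in `X`, and the rows of each query must be contiguous, because the ranker only compares documents within the same group.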

In [16]:
import lightgbm as lgb
from sentence_transformers import SentenceTransformer, util as sbert_util

In [17]:
# 1. Load a fast bi-encoder to generate an extra feature
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the entire corpus once
corpus_embeddings = bi_encoder.encode(
    [(corpus[did]['title'] + " " + corpus[did]['text']) for did in corpus_ids],
    convert_to_tensor=True,
    show_progress_bar=True
)
corpus_map = {did: idx for idx, did in enumerate(corpus_ids)}

def extract_features(qid, doc_id, bm25_score):
    """Return the feature vector [BM25 score, query-document cosine similarity]."""
    # Fall back to zero similarity for documents without an embedding
    if doc_id not in corpus_map:
        return [bm25_score, 0.0]

    # Note: the query is re-encoded on every call; caching it per qid would be faster
    query_emb = bi_encoder.encode(queries[qid], convert_to_tensor=True, show_progress_bar=False)
    doc_emb = corpus_embeddings[corpus_map[doc_id]]
    cos_score = sbert_util.cos_sim(query_emb, doc_emb).item()

    return [bm25_score, cos_score]

In [18]:
# Training the LTR model
X_train = []
y_train = []
group_train = []
# Take the first 50 queries as the "train" set
train_qids = list(qrels.keys())[:50]

for qid in train_qids:
    if qid not in results_bm25:
        continue

    # Use the BM25 candidates as the document pool
    candidates = results_bm25[qid]
    relevances = qrels[qid]

    current_group_size = 0
    for doc_id, bm25_score in candidates.items():
        # Label: 1 if relevant, 0 otherwise
        label = relevances.get(doc_id, 0)

        # Features
        feats = extract_features(qid, doc_id, bm25_score)

        X_train.append(feats)
        y_train.append(label)
        current_group_size += 1

    # Number of rows that belong to this query
    group_train.append(current_group_size)

# Convert to numpy arrays
X_train = np.array(X_train)
y_train = np.array(y_train)

# Train LightGBM with LambdaRank
ranker = lgb.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    n_estimators=100
)
ranker.fit(X_train, y_train, group=group_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000519 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 5000, number of used features: 2
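
The training cell takes its 50 queries from the same test split that is later used for evaluation. A sketch of one way to keep training and evaluation queries disjoint is shown below; the helper name `split_query_ids`, the seed and the toy IDs are hypothetical and not part of the notebook.

```python
import random


def split_query_ids(qids, n_train, seed=42):
    """Shuffle query IDs reproducibly and split them into disjoint lists."""
    shuffled = sorted(qids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


all_qids = [str(i) for i in range(1, 11)]
train_qids, test_qids = split_query_ids(all_qids, n_train=3)
print(len(train_qids), len(test_qids))
```

Under this setup the ranker would be fitted on `train_qids` only, and every system being compared would be evaluated on `test_qids` only, so that the comparison stays fair.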


In [20]:
# Re-ranking on the test set
results_ltr = {}

# Note: this loop scores every query in results_bm25, including the 50 queries
# used to train the ranker, so the LTR metrics are likely optimistic.
for qid, hits in results_bm25.items():
    X_test = []
    doc_ids_list = []

    for doc_id, bm25_score in hits.items():
        feats = extract_features(qid, doc_id, bm25_score)
        X_test.append(feats)
        doc_ids_list.append(doc_id)

    if not X_test:
        continue

    # Predict scores
    scores = ranker.predict(np.array(X_test))

    results_ltr[qid] = {
        doc_id: float(score)
        for doc_id, score in zip(doc_ids_list, scores)
    }



In [25]:
# Visualization
qid_ejemplo = "133"

if qid_ejemplo in results_ltr:
    print(f"\nAnalysis for Query {qid_ejemplo}:")

    # Top 5 BM25
    sorted_bm25 = sorted(results_bm25[qid_ejemplo].items(), key=lambda x: x[1], reverse=True)[:5]
    print("\nTop 5 Original (BM25):")
    for doc_id, score in sorted_bm25:
        print(f"Title: {corpus[doc_id].get('title', 'No Title')}")
        print(f"Doc: {doc_id} | Score: {score:.4f} | Rel: {qrels[qid_ejemplo].get(doc_id, 0)}")
        print("-" * 50)

    # Top 5 LTR
    sorted_ltr = sorted(results_ltr[qid_ejemplo].items(), key=lambda x: x[1], reverse=True)[:5]
    print("\nTop 5 Re-ranked (LightGBM / LTR):")
    for doc_id, score in sorted_ltr:
        print(f"Title: {corpus[doc_id].get('title', 'No Title')}")
        print(f"Doc: {doc_id} | Score: {score:.4f} | Rel: {qrels[qid_ejemplo].get(doc_id, 0)}")
        print("-" * 50)


Analysis for Query 133:

Top 5 Original (BM25):
Title: Schizophrenia susceptibility pathway neuregulin 1–ErbB4 suppresses Src upregulation of NMDA receptors
Doc: 26688294 | Score: 55.1214 | Rel: 0
--------------------------------------------------
Title: Focal contacts as mechanosensors: externally applied local mechanical force induces growth of focal contacts by an mDia1-dependent and ROCKindependent mechanism
Doc: 9507605 | Score: 50.4710 | Rel: 0
--------------------------------------------------
Title: Local Ca2+ influx through Ca2+ release-activated Ca2+ (CRAC) channels stimulates production of an intracellular messenger and an intercellular pro-inflammatory signal.
Doc: 37964706 | Score: 49.9029 | Rel: 0
--------------------------------------------------
Title: Combating trastuzumab resistance by targeting SRC, a common node downstream of multiple resistance pathways
Doc: 5270265 | Score: 46.2733 | Rel: 0
--------------------------------------------------
Title: The regul

## Part 5. Evaluation after re-ranking

Compute the following metrics:
* nDCG@10
* MAP
* Recall@10
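
MAP is listed among the metrics, so a small pure-Python sketch of how it can be computed may help. It assumes binary relevance and divides by the total number of relevant documents per query; the evaluation library may apply a rank cutoff or differ in other details, and the function names and toy data are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant, k=None):
    """Mean of precision@rank taken at each rank that holds a relevant document."""
    n_rel = sum(1 for rel in relevant.values() if rel > 0)
    if n_rel == 0:
        return 0.0
    considered = ranked_doc_ids if k is None else ranked_doc_ids[:k]
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(considered, start=1):
        if relevant.get(doc_id, 0) > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_rel


def mean_average_precision(rankings, qrels, k=None):
    """Average the per-query AP over all ranked queries."""
    aps = [average_precision(docs, qrels.get(qid, {}), k) for qid, docs in rankings.items()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy rankings for two queries
rankings = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d6"]}
toy_qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d5": 1}}

map_score = mean_average_precision(rankings, toy_qrels)
print(f"MAP: {map_score:.4f}")
```

For `q1` the relevant documents sit at ranks 2 and 4, giving an AP of (1/2 + 2/4) / 2 = 0.5; for `q2` the only relevant document is ranked first, giving an AP of 1.0.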

In [29]:
# 1. Baseline BM25
ndcg_b, _, recall_b, _ = evaluator.evaluate(qrels, results_bm25, k_values=[10])

# 2. Cross-Encoder
ndcg_ce, map_ce, recall_ce, _ = evaluator.evaluate(qrels, results_cross_encoder, k_values=[10])

# 3. LTR (LightGBM)
ndcg_ltr, map_ltr, recall_ltr, _ = evaluator.evaluate(qrels, results_ltr, k_values=[10])

# Build the comparison table
results_df = pd.DataFrame([
    {
        "Model": "BM25 (Baseline)",
        "nDCG@10": ndcg_b['NDCG@10'],
        "Recall@10": recall_b['Recall@10']
    },
    {
        "Model": "Cross-Encoder (Re-ranker)",
        "nDCG@10": ndcg_ce['NDCG@10'],
        "Recall@10": recall_ce['Recall@10']
    },
    {
        "Model": "LTR (LightGBM)",
        "nDCG@10": ndcg_ltr['NDCG@10'],
        "Recall@10": recall_ltr['Recall@10']
    }
])

results_df

| | Model | nDCG@10 | Recall@10 |
|---|---|---|---|
| 0 | BM25 (Baseline) | 0.5597 | 0.68617 |
| 1 | Cross-Encoder (Re-ranker) | 0.65092 | 0.74961 |
| 2 | LTR (LightGBM) | 0.60682 | 0.72878 |
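
To quantify the impact of re-ranking, the relative gain of each re-ranker over the BM25 baseline can be derived from the values in the table above. The sketch below hard-codes those reported numbers; the helper name `relative_gain` is hypothetical.

```python
# Metric values copied from the comparison table
baseline = {"nDCG@10": 0.5597, "Recall@10": 0.68617}
rerankers = {
    "Cross-Encoder (Re-ranker)": {"nDCG@10": 0.65092, "Recall@10": 0.74961},
    "LTR (LightGBM)": {"nDCG@10": 0.60682, "Recall@10": 0.72878},
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


gains = {
    model: {metric: relative_gain(value, baseline[metric]) for metric, value in metrics.items()}
    for model, metrics in rerankers.items()
}

for model, model_gains in gains.items():
    summary = ", ".join(f"{metric}: {gain:+.1f}%" for metric, gain in model_gains.items())
    print(f"{model}: {summary}")
```

On these numbers the cross-encoder improves nDCG@10 by roughly 16% relative to BM25 and the LTR model by roughly 8%. Recall@10 can also increase because both re-rankers reorder a pool of 100 candidates, so relevant documents ranked below position 10 by BM25 can be promoted into the top 10.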
