# Getting started

### CLEF 2025 - CheckThat! Lab  - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook enables participants of subtask 4b to quickly get started. It includes the following:
- Code to upload data, including:
    - code to upload the collection set (CORD-19 academic papers' metadata)
    - code to upload the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to evaluate the baseline model

Participants are free to use this notebook and add their own models for the competition.

# 1) Importing data

In [18]:
import numpy as np
import pandas as pd
import torch
import pickle
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
from tqdm.auto import tqdm

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available in the GitLab repository here: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file, then upload it to the Google Colab session with the following steps.


In [19]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = 'subtask4b_collection_data.pkl' #MODIFY PATH

In [20]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [21]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [22]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available in the GitLab repository here: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file, then upload it to the Google Colab session with the following steps.

In [23]:
# 1) Download the query tweets from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_TRAIN_DATA = 'subtask4b_query_tweets_train.tsv' #MODIFY PATH
PATH_QUERY_DEV_DATA = 'subtask4b_query_tweets_dev.tsv' #MODIFY PATH

In [24]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')

In [25]:
df_query_train.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5
1,1,this study isn't receiving sufficient attentio...,4kfl29ul
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69


In [26]:
df_query_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12853 entries, 0 to 12852
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     12853 non-null  int64 
 1   tweet_text  12853 non-null  object
 2   cord_uid    12853 non-null  object
dtypes: int64(1), object(2)
memory usage: 301.4+ KB


In [27]:
df_query_dev.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,16,covid recovery: this study from the usa reveal...,3qvh482o
1,69,"""Among 139 clients exposed to two symptomatic ...",r58aohnu
2,73,I recall early on reading that researchers who...,sts48u9i
3,93,You know you're credible when NIH website has ...,3sr2exq9
4,96,Resistance to antifungal medications is a grow...,ybwwmyqy


In [28]:
df_query_dev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     1400 non-null   int64 
 1   tweet_text  1400 non-null   object
 2   cord_uid    1400 non-null   object
dtypes: int64(1), object(2)
memory usage: 32.9+ KB


# 2) Running the baseline
The following code runs a BM25 baseline.
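
To make the scoring concrete, here is a minimal sketch of Okapi BM25 in plain Python on a toy corpus. It is a simplified illustration and not the `rank_bm25` implementation: the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions, and the IDF variant shown may differ from the one the library uses.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive (one common variant, assumed here)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace like the baseline
toy_docs = [
    "face masks reduce exposure to respiratory infections".split(" "),
    "quarantine lessons from the history of plague".split(" "),
    "smallpox transmission mode and infection control".split(" "),
]
toy_scores = bm25_scores("do face masks work".split(" "), toy_docs)
best_doc = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(best_doc, toy_scores)
```

Only the first toy document shares tokens with the query, so it is the only one with a non-zero score. Whitespace tokenization is case- and punctuation-sensitive, which is one reason a lexical baseline can miss matches that a neural model recovers.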


In [29]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi


In [30]:
# Create the BM25 corpus
corpus = df_collection[:][['title', 'abstract']].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
cord_uids = df_collection[:]['cord_uid'].tolist()
tokenized_corpus = [doc.split(' ') for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

In [31]:
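# Cache of query text -> top-5 cord_uids. It lives outside the function so that
# repeated queries are served from the cache instead of being re-scored.
text2bm25top = {}

def get_top_cord_uids(query):
  if query in text2bm25top:
      return text2bm25top[query]
  tokenized_query = query.split(' ')
  doc_scores = bm25.get_scores(tokenized_query)
  # Indices of the 5 highest-scoring documents, best first
  indices = np.argsort(-doc_scores)[:5]
  bm25_topk = [cord_uids[x] for x in indices]

  text2bm25top[query] = bm25_topk
  return bm25_topk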


In [None]:
# Retrieve topk candidates using the BM25 model
df_query_train['bm25_topk'] = df_query_train['tweet_text'].apply(lambda x: get_top_cord_uids(x))
df_query_dev['bm25_topk'] = df_query_dev['tweet_text'].apply(lambda x: get_top_cord_uids(x))

# 3) Evaluating the baseline
The following code evaluates the BM25 retrieval baseline on the query set using the Mean Reciprocal Rank score (MRR@5).
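
As a worked example of the metric, the sketch below computes MRR@k in plain Python on toy data. The helper names `reciprocal_rank` and `mrr_at_k` and the toy ids are illustrative assumptions; the notebook's own `get_performance_mrr` computes the same quantity on DataFrames.

```python
def reciprocal_rank(gold, preds, k):
    """Return 1/rank of `gold` within the first k predictions, else 0."""
    top_k = list(preds)[:k]
    return 1.0 / (top_k.index(gold) + 1) if gold in top_k else 0.0

def mrr_at_k(golds, preds_list, k):
    """Average the reciprocal ranks over all queries."""
    return sum(reciprocal_rank(g, p, k) for g, p in zip(golds, preds_list)) / len(golds)

# Toy data: three queries, each with a ranked list of candidate ids
toy_golds = ["a", "b", "c"]
toy_preds = [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"]]

mrr1 = mrr_at_k(toy_golds, toy_preds, k=1)  # only the first query hits at rank 1
mrr5 = mrr_at_k(toy_golds, toy_preds, k=5)  # hits at ranks 1 and 2, third query misses
print(mrr1, mrr5)
```

Here MRR@1 is 1/3 and MRR@5 is (1 + 1/2 + 0) / 3 = 0.5. A query whose gold paper is missing from the top k contributes 0, so MRR@k can never exceed the fraction of queries whose gold paper was retrieved.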

In [None]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k = [1, 5, 10]):
    d_performance = {}
    for k in list_k:
        data["in_topx"] = data.apply(lambda x: (1/([i for i in x[col_pred][:k]].index(x[col_gold]) + 1) if x[col_gold] in [i for i in x[col_pred][:k]] else 0), axis=1)
        # performances.append(data["in_topx"].mean())
        d_performance[k] = data["in_topx"].mean()
    return d_performance


In [None]:
results_train = get_performance_mrr(df_query_train, 'cord_uid', 'bm25_topk')
results_dev = get_performance_mrr(df_query_dev, 'cord_uid', 'bm25_topk')
# Printed MRR@k results in the following format: {k: MRR@k}
print(f"Results on the train set: {results_train}")
print(f"Results on the dev set: {results_dev}")

# 4) Exporting results to prepare the submission on Codalab

In [22]:
df_query_dev['preds'] = df_query_dev['bm25_topk'].apply(lambda x: x[:5])

In [23]:
df_query_dev[['post_id', 'preds']].to_csv('predictions.tsv', index=None, sep='\t')

# Implementing Neural Re-ranking Model

In this part we extend the BM25 retrieval baseline with a neural re-ranking model. BM25 provides a first-stage candidate list, and a pretrained cross-encoder reranks those candidates by jointly modeling the semantic relationship between the tweet and each paper (title + abstract).

## Step 1: Installing & Loading Dependencies

In [24]:
!pip install -q torch transformers


In [25]:
tqdm.pandas()
# Load the pretrained cross-encoder model (which is a lightweight BERT-based ranker)
tokenizer=AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
model=AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

# pipeline for pairwise scoring
reranker = TextClassificationPipeline(model=model,tokenizer=tokenizer,return_all_scores=False,
    device=-1)  # device=-1 runs on CPU; set device=0 if a GPU is available
print("Cross-encoder ready for reranking!")



Cross-encoder ready for reranking!




## Step 2: Loading Predictions and Collection, Prepare Rerank Inputs

We first read in the BM25 baseline predictions and the dev tweet queries, casting the join key post_id to string in both so the merge does not fail on mismatched types. We loaded the CORD-19 paper metadata (7,718 documents), concatenated each paper's title and abstract into a single text field (paper_text), and built a lookup dictionary from cord_uid to paper_text.
Next, we expanded the top-5 predictions for each tweet into individual rows, giving N_tweets × 5 rows, each holding one candidate. From these rows we assembled a list of (tweet_text, paper_text) pairs, which are fed to the cross-encoder reranker.
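
The expansion step can be sketched without pandas. The dictionaries `toy_preds`, `toy_tweets`, and `toy_papers` below are hypothetical stand-ins for the notebook's DataFrames, used only to show the shape of the data that reaches the reranker.

```python
# Hypothetical toy inputs standing in for the notebook's DataFrames
toy_preds = {"16": ["p1", "p2"], "69": ["p3", "p1"]}  # post_id -> ranked candidates
toy_tweets = {"16": "long covid study", "69": "salon exposure"}  # post_id -> tweet text
toy_papers = {  # cord_uid -> title + abstract
    "p1": "Title one Abstract one",
    "p2": "Title two Abstract two",
    "p3": "Title three Abstract three",
}

rows = []
for post_id, candidates in toy_preds.items():
    for rank, cand_uid in enumerate(candidates, start=1):
        rows.append({
            "post_id": post_id,
            "rank": rank,  # first-stage rank, 1 = best
            "cand_uid": cand_uid,
            "tweet_text": toy_tweets[post_id],
            "paper_text": toy_papers[cand_uid],
        })

# One (query, document) pair per row, in the same order as `rows`
toy_pairs = [(r["tweet_text"], r["paper_text"]) for r in rows]
print(len(rows), toy_pairs[0])
```

Keeping `toy_pairs` in the same order as `rows` matters: the scores come back as a flat list and are assigned to rows by position.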


In [26]:
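# Loading BM25 predictions. The preds column stores Python list literals,
# which ast.literal_eval parses without executing arbitrary code (unlike eval).
import ast
df_preds = pd.read_csv("predictions.tsv",sep="\t",converters={"preds": ast.literal_eval})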

df_preds["post_id"] = df_preds["post_id"].astype(str)

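# Loading dev queries. header=0 replaces the file's own header row with `names`
# instead of reading that row as a data record.
df_dev = pd.read_csv("subtask4b_query_tweets_dev.tsv",sep="\t",header=0,names=["post_id", "tweet_text", "cord_uid"],)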

df_dev["post_id"] = df_dev["post_id"].astype(str)

# Loading collection metadata
with open("subtask4b_collection_data.pkl", "rb") as f:
    df_papers = pickle.load(f)

df_papers["paper_text"] = (
    df_papers["title"].fillna("") + " " + df_papers["abstract"].fillna(""))
paper_text_lookup = df_papers.set_index("cord_uid")["paper_text"].to_dict()

rows = []
for _, row in df_preds.iterrows():
    post_id = row["post_id"]
    for rank, cand_uid in enumerate(row["preds"], start=1):
        rows.append({"post_id": post_id, "rank": rank, "cand_uid": cand_uid})
df_rerank = pd.DataFrame(rows)

# Merge tweet text and ground-truth cord_uid
df_rerank = df_rerank.merge(df_dev[["post_id", "tweet_text", "cord_uid"]],on="post_id",how="left")

df_rerank["paper_text"] = df_rerank["cand_uid"].map(paper_text_lookup)

pairs = list(zip(df_rerank["tweet_text"].tolist(), df_rerank["paper_text"].tolist()))

print(f"dataframe for reranking size: {df_rerank.shape}")
print(f"total pairs: {len(pairs)}")

dataframe for reranking size: (7000, 6)
total pairs: 7000


## Step 3: Computing Cross-Encoder Scores in Batches

We used the HuggingFace cross-encoder/ms-marco-MiniLM-L-6-v2 model to jointly score each tweet–paper pair. To scale to our 5 × N_dev inputs without running out of memory, we processed pairs in mini-batches of 32. Each batch was tokenized as [CLS] tweet [SEP] paper_text [SEP] and passed through the model in evaluation mode. We extracted the scalar relevance score (for regression heads) or the probability/logit of the “relevant” class (for classification heads), and stored these in the DataFrame. These scores will drive our new reranking order.
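
The mini-batching itself is independent of the model. A minimal sketch, assuming a hypothetical helper named `iter_batches`, shows how a list of pairs is cut into slices of at most 32 items:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

toy_items = list(range(70))
batch_sizes = [len(batch) for batch in iter_batches(toy_items, 32)]
print(batch_sizes)

# Number of forward passes needed for the 7,000 dev pairs at batch size 32
n_batches_dev = len(list(iter_batches(list(range(7000)), 32)))
print(n_batches_dev)
```

The last batch is simply shorter than the others, so no padding of the pair list is needed. Peak memory is driven by the batch size and the tokenized sequence length rather than by the total number of pairs.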

In [28]:
model.eval()
scores = []
batch_size = 32

for start in range(0, len(pairs), batch_size):
    batch_pairs = pairs[start : start + batch_size]
    queries = [q for (q, d) in batch_pairs]
    docs    = [d for (q, d) in batch_pairs]
    
    # Tokenize the batch of pairs
    encodings = tokenizer(queries, docs,padding=True,truncation=True,return_tensors="pt")
    device = next(model.parameters()).device
    for k, v in encodings.items():
        encodings[k] = v.to(device)

    # Forward pass without gradients
    with torch.no_grad():
        logits = model(**encodings).logits

    if logits.shape[-1] == 1:
        batch_scores = logits.squeeze(-1).tolist()
    else:
        # 1: “relevant”
        batch_scores = logits[:, 1].tolist()
    scores.extend(batch_scores)

df_rerank["score"] = scores
print("Assigned reranking scores. Sample:\n", df_rerank.head())

Assigned reranking scores. Sample:
   post_id  rank  cand_uid                                         tweet_text  \
0      16     1  25aj8rj5  covid recovery: this study from the usa reveal...   
1      16     2  gatxuwz7  covid recovery: this study from the usa reveal...   
2      16     3  59up4v56  covid recovery: this study from the usa reveal...   
3      16     4  styavbvi  covid recovery: this study from the usa reveal...   
4      16     5  6sy80720  covid recovery: this study from the usa reveal...   

   cord_uid                                         paper_text     score  
0  3qvh482o  Long covid-mechanisms, risk factors, and manag...  0.787592  
1  3qvh482o  Cerebral microvascular endothelial glycocalyx ... -2.689593  
2  3qvh482o  Fatigue and Cognitive Impairment in Post-COVID...  4.022662  
3  3qvh482o  COVCOG 2: Cognitive and Memory Deficits in Lon...  1.950408  
4  3qvh482o  A further plot twist: will 'long COVID' have a... -5.372519  


## Step 4: Re-rank and Evaluate

In [29]:
# Build re-ranked predictions
def rerank_group(group): # sort by the cross-encoder score and take top 5
    top5 = group.sort_values("score", ascending=False).head(5)
    return list(top5["cand_uid"])

# Apply over each post_id. Include only columns: ['post_id','preds']
df_new_preds = (df_rerank.groupby("post_id", group_keys=False).apply(lambda g: rerank_group(g)).reset_index(name="preds"))

# Merge with ground-truth cord_uid for evaluation
df_eval = df_new_preds.merge(df_dev[["post_id", "cord_uid"]],on="post_id",how="left")

# MRR@ 1,5,10
results_rerank = get_performance_mrr(df_eval,col_gold="cord_uid",col_pred="preds",list_k=[1, 5, 10])
print(f"Neural Re-ranking results on dev set: {results_rerank}")

Neural Re-ranking results on dev set: {1: 0.535, 5: 0.5738809523809524, 10: 0.5738809523809524}




Reranking the BM25 candidates with a BERT-based cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) improved on the lexical baseline. The reported metrics are:

- MRR@1: how often the top re-ranked candidate is correct.
- MRR@5: the mean reciprocal rank among the top 5 (primary metric).
- MRR@10: extended view up to position 10. Because only five candidates per tweet are reranked, MRR@10 equals MRR@5 here.

Compared to the BM25 baseline on dev ({1: 0.505, 5: 0.552, 10: 0.552}), the neural re-ranker achieves:

- MRR@1: 0.505 → 0.535 (absolute gain of +0.030)
- MRR@5: 0.552 → 0.574 (absolute gain of +0.022)
- MRR@10: 0.552 → 0.574 (same as MRR@5, +0.022)

These gains suggest that jointly modeling tweet and paper text captures semantic relevance signals that BM25 misses. The effect is largest at rank 1, where the correct paper is promoted to the very top more often.

In [30]:
# Export in the required format: post_id \t ["pred1", "pred2", ..., "pred5"]
df_new_preds.to_csv("reranked_predictions.tsv", sep="\t", index=False)

## More Experiments:

### Experiment 1: Linear Score Fusion (α × BM25 + (1-α) × Neural, α=0.5)

For the first experiment, we combined BM25 and neural cross-encoder scores by first min-max normalizing each within every tweet (post_id), then taking:

fusion_score = 0.5×bm25_norm + 0.5×neural_norm.

We re‐ranked each tweet’s 5 candidates by this new fusion score and evaluated MRR@1, MRR@5, MRR@10.
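
A worked example for a single tweet may help. The sketch below applies the same min-max normalization and 50/50 fusion in plain Python; the candidate ids are hypothetical, and the neural scores are rounded from the sample output in Step 3. As in the notebook, the BM25 side is a rank-based proxy (`-rank`) rather than a true BM25 score.

```python
def min_max(values, eps=1e-8):
    """Rescale values to roughly [0, 1]; eps avoids division by zero."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]

# Toy candidates for one tweet, listed in BM25 order (rank 1 first)
cands = ["p1", "p2", "p3", "p4", "p5"]
bm25_proxy = [-rank for rank in range(1, 6)]  # rank-based proxy for the BM25 score
neural = [0.79, -2.69, 4.02, 1.95, -5.37]     # rounded cross-encoder scores

alpha = 0.5
bm25_norm = min_max(bm25_proxy)
neural_norm = min_max(neural)
fused = [alpha * b + (1 - alpha) * n for b, n in zip(bm25_norm, neural_norm)]

# Sort candidates by fused score, highest first
reranked = [c for _, c in sorted(zip(fused, cands), reverse=True)]
print(reranked)
```

With these numbers the fused order is p1, p3, p2, p4, p5. Pure neural ranking would put p3 first, so the example shows how the lexical prior can keep the BM25 top candidate in place when the neural margin is moderate.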

In [31]:
# Simulating a BM25 “score” from higher rank to lower rank
df_rerank["bm25_score"] = -df_rerank["rank"]

# Normalizing bm25 and neural score per post_id
def normalize(group, col):
    vals = group[col]
    return (vals - vals.min()) / (vals.max() - vals.min() + 1e-8)

df_rerank["bm25_norm"]  = df_rerank.groupby("post_id").apply(lambda g: normalize(g, "bm25_score")).reset_index(0,drop=True)
df_rerank["neural_norm"] = df_rerank.groupby("post_id").apply(lambda g: normalize(g, "score")).reset_index(0,drop=True)

# fusion scores
alpha = 0.5
df_rerank["fusion_score_lin"] = alpha * df_rerank["bm25_norm"] + (1-alpha) * df_rerank["neural_norm"]

# Re-ranking and collecting top-5
def get_top5_lin(g):
    return list(g.sort_values("fusion_score_lin", ascending=False)["cand_uid"].head(5))

df_preds_fusion_lin = (df_rerank.groupby("post_id", group_keys=False).apply(lambda g: get_top5_lin(g)).reset_index(name="preds"))

# Evaluation
df_eval_lin = df_preds_fusion_lin.merge(df_dev[["post_id","cord_uid"]],on="post_id", how="left")
results_fusion_lin = get_performance_mrr(df_eval_lin, "cord_uid", "preds", [1,5,10])
print("Linear Fusion (α=0.5) MRR:", results_fusion_lin)



Linear Fusion (α=0.5) MRR: {1: 0.5385714285714286, 5: 0.5749285714285715, 10: 0.5749285714285715}




Comparison to previous models:

| Model | MRR@1 | MRR@5 | MRR@10 |
|---|---|---|---|
| BM25 baseline | 0.505 | 0.552 | 0.552 |
| Neural re-ranking | 0.535 | 0.574 | 0.574 |
| Linear fusion (α=0.5) | 0.539 | 0.575 | 0.575 |

Relative to pure neural re-ranking:

- MRR@1 improved from 0.535 → 0.539 (+0.004)
- MRR@5 improved from 0.574 → 0.575 (+0.001)

Based on these results, the simple 50/50 fusion of lexical (BM25) and semantic (neural) scores yields a small but measurable gain over the pure neural reranker. This suggests that BM25 still contributes a complementary signal, especially in borderline cases where the cross-encoder might under- or over-score certain candidates.

### Experiment 2: Reciprocal-Rank Fusion

In the second experiment, we compute each candidate's rank under both BM25 and the neural cross-encoder. We then convert these to reciprocal ranks (1/r) and sum them to get a fused score.
Rank-based fusion is robust to differing score scales and emphasizes candidates that rank well under both methods.
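
The sketch below works through this fusion for one tweet in plain Python. The helper names and toy values are illustrative assumptions. The offset `k` is also an assumption added for comparison: `k=0` reproduces the plain 1/r sum used in this notebook, while the commonly cited reciprocal-rank fusion formulation uses 1/(k + r) with a larger constant such as 60.

```python
def rank_positions(scores):
    """Return 1-based ranks, where the highest score gets rank 1."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

def rr_fuse(ranks_a, ranks_b, k=0):
    """Sum of reciprocal ranks; k=0 matches the plain 1/r variant."""
    return [1.0 / (k + a) + 1.0 / (k + b) for a, b in zip(ranks_a, ranks_b)]

cands = ["p1", "p2", "p3", "p4", "p5"]     # listed in BM25 order
bm25_rank = [1, 2, 3, 4, 5]
neural = [0.79, -2.69, 4.02, 1.95, -5.37]  # toy cross-encoder scores
neural_rank = rank_positions(neural)

fused = rr_fuse(bm25_rank, neural_rank)
# Stable sort: candidates with equal fused scores keep their BM25 order
reranked = [c for _, c in sorted(zip(fused, cands), key=lambda t: -t[0])]
print(neural_rank, fused, reranked)
```

In this toy case p1 (ranks 1 and 3) and p3 (ranks 3 and 1) receive exactly the same fused score, because the sum is symmetric in the two ranks. Such ties are then decided by the sort's tie-breaking rather than by either model, which may be one factor in how this scheme behaves on short candidate lists.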

In [32]:
# Computing neural rank per post_id
df_rerank["neural_rank"] = (df_rerank.groupby("post_id")["score"].rank(ascending=False, method="first"))

# Compute reciprocal ranks for BM25 and neural
df_rerank["rr_bm25"]=1.0/df_rerank["rank"]
df_rerank["rr_neural"]=1.0/df_rerank["neural_rank"]

# Sum reciprocal ranks for fusion score
df_rerank["fusion_score_rr"]=df_rerank["rr_bm25"]+df_rerank["rr_neural"]

# Re-rank by the sum of reciprocal-rank
def get_top5_rr(group):
    return list(group.sort_values("fusion_score_rr", ascending=False)["cand_uid"].head(5))

df_preds_fusion_rr = (df_rerank.groupby("post_id", group_keys=False)
                      .apply(lambda g: get_top5_rr(g)).reset_index(name="preds"))

df_eval_rr = df_preds_fusion_rr.merge(df_dev[["post_id", "cord_uid"]],on="post_id",how="left")

# Evaluation with MRR@ 1,5,10
results_fusion_rr = get_performance_mrr(df_eval_rr, "cord_uid", "preds", [1, 5, 10])
print("Reciprocal-Rank Fusion MRR:", results_fusion_rr)

Reciprocal-Rank Fusion MRR: {1: 0.5278571428571428, 5: 0.5709761904761904, 10: 0.5709761904761904}




Comparison:

| Model | MRR@1 | MRR@5 |
|---|---|---|
| BM25 baseline | 0.505 | 0.552 |
| Neural re-ranking | 0.535 | 0.574 |
| Linear score fusion (α=0.5) | 0.539 | 0.575 |
| Reciprocal-rank fusion | 0.528 | 0.571 |

Reciprocal-rank fusion underperforms both pure neural re-ranking and linear fusion. Relative to pure neural re-ranking:

- MRR@1 falls from 0.535 → 0.528 (−0.007)
- MRR@5 falls from 0.574 → 0.571 (−0.003)

A likely explanation is that summing reciprocal ranks gives too much weight to small rank differences, which makes it less effective here. Linear fusion proved the most robust at balancing BM25's lexical signal with the cross-encoder's semantic scores.

### Experiment 3: α-Weight Tuning for Linear Fusion

Experiment 3 tunes the weight α of the linear score fusion. We vary α from 0.0 to 1.0 in steps of 0.1, re-rank the same top-5 candidates for each value, and compute MRR@5.
The α that maximizes dev MRR@5 shows whether a different balance beats the 0.5 default.
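
The sweep can be sketched end to end on toy data. Everything in the block below (the two toy queries, their scores, and the helper names) is a hypothetical illustration of the procedure, not the notebook's data.

```python
def min_max(values, eps=1e-8):
    """Rescale values to roughly [0, 1]; eps avoids division by zero."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]

def reciprocal_rank(gold, ranked):
    """Return 1/rank of `gold` in `ranked`, or 0 if it is absent."""
    return 1.0 / (ranked.index(gold) + 1) if gold in ranked else 0.0

# Hypothetical toy queries: candidates in BM25 order, neural scores, gold id
toy_queries = [
    {"cands": ["a", "b", "c"], "neural": [0.2, 2.0, 1.0], "gold": "b"},
    {"cands": ["d", "e", "f"], "neural": [1.0, 1.2, -3.0], "gold": "d"},
]

def mrr_for_alpha(alpha):
    """Fuse scores with weight alpha on the BM25 side and return the mean RR."""
    total = 0.0
    for q in toy_queries:
        n = len(q["cands"])
        bm25_norm = min_max([-(r + 1) for r in range(n)])  # rank-based proxy
        neural_norm = min_max(q["neural"])
        fused = [alpha * b + (1 - alpha) * s for b, s in zip(bm25_norm, neural_norm)]
        order = sorted(range(n), key=lambda i: -fused[i])
        ranked = [q["cands"][i] for i in order]
        total += reciprocal_rank(q["gold"], ranked)
    return total / len(toy_queries)

alphas = [i / 10 for i in range(11)]
sweep = [(a, mrr_for_alpha(a)) for a in alphas]
# max() returns the first best entry, so ties go to the smallest alpha
best_alpha, best_mrr = max(sweep, key=lambda t: t[1])
print(sweep)
print(best_alpha, best_mrr)
```

On this toy data both extremes (α=0.0 and α=1.0) get one query wrong, while intermediate values get both right. Note that selecting α on the dev set and reporting the result on the same dev set can overstate the gain; tuning on a held-out slice of the training queries is a safer check.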

In [34]:
# Introducing alphas:
alphas=[i/10 for i in range(0,11)]
results_alpha=[]

df_rerank["bm25_score"]=-df_rerank["rank"]
def normalize(group, col):
    vals = group[col]
    return (vals-vals.min())/(vals.max()-vals.min()+1e-8)
    
df_rerank["bm25_norm"]=df_rerank.groupby("post_id").apply(lambda g: normalize(g, "bm25_score")).reset_index(0,drop=True)
df_rerank["neural_norm"]=df_rerank.groupby("post_id").apply(lambda g: normalize(g, "score")).reset_index(0,drop=True)

for α in alphas:
    df_rerank["fusion_score"]=α*df_rerank["bm25_norm"]+(1-α)*df_rerank["neural_norm"]
    # Re-Ranking
    df_preds=(df_rerank.groupby("post_id", group_keys=False).apply(lambda g: list(g.sort_values("fusion_score", ascending=False)["cand_uid"]
                                                                                  .head(5))).reset_index(name="preds"))
    # Evaluation
    df_eval = df_preds.merge(df_dev[["post_id","cord_uid"]],on="post_id",how="left")
    perf = get_performance_mrr(df_eval,"cord_uid","preds",[5])
    results_alpha.append((α, perf[5]))

print("α vs. MRR@5:")
for α, mrr5 in results_alpha:
    print(f" α={α:.1f} → MRR@5={mrr5:.4f}")

best_alpha, best_mrr5 = max(results_alpha, key=lambda x: x[1])
print(f"\nBest α={best_alpha:.1f} with MRR@5={best_mrr5:.4f}")


α vs. MRR@5:
 α=0.0 → MRR@5=0.5739
 α=0.1 → MRR@5=0.5760
 α=0.2 → MRR@5=0.5776
 α=0.3 → MRR@5=0.5754
 α=0.4 → MRR@5=0.5739
 α=0.5 → MRR@5=0.5749
 α=0.6 → MRR@5=0.5696
 α=0.7 → MRR@5=0.5637
 α=0.8 → MRR@5=0.5555
 α=0.9 → MRR@5=0.5520
 α=1.0 → MRR@5=0.5520

Best α=0.2 with MRR@5=0.5776




Comparing the results:

- α = 0.0 (pure neural cross-encoder): MRR@5 = 0.5739
- α = 0.2: MRR@5 = 0.5776 (+0.0037 over pure neural, +0.0256 over BM25), the best mix
- α = 1.0 (pure BM25): MRR@5 = 0.5520

The optimal α = 0.2 indicates that weighting the neural model heavily and adding a light lexical signal yields the best retrieval performance on the dev set.

# Conclusion

Summary of the neural re-ranking experiments:

- **Pure neural re-ranking** gave the largest single improvement over BM25 (+0.022 MRR@5).
- **Linear fusion** added further gains: equal weighting already helps, and α=0.2 (80% neural, 20% BM25) maximizes MRR@5.
- **Reciprocal-rank fusion** was less effective, which points to the value of fusing normalized scores rather than ranks.
- The best experiment used α = 0.2, achieving **MRR@5 = 0.5776** on the dev set.