In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
#!ls "/content/drive/My Drive/clef2025-checkthat-lab-main-task4-subtask_4b/task4/subtask_4b"

In [None]:
# import os

# project_path = "/content/drive/My Drive/clef2025-checkthat-lab-main-task4-subtask_4b/task4/subtask_4b"
# os.chdir(project_path)

# Getting started

### CLEF 2025 - CheckThat! Lab  - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook enables participants of subtask 4b to quickly get started. It includes the following:
- Code to upload data, including:
    - code to upload the collection set (CORD-19 academic papers' metadata)
    - code to upload the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to evaluate the baseline model

Participants are free to use this notebook and add their own models for the competition.

# 1) Importing data

In [1]:
import numpy as np
import pandas as pd

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file, then upload it to the Google Colab session with the following steps.


In [2]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = 'subtask4b_collection_data.pkl' #MODIFY PATH

In [3]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [None]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [None]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the files, then upload them to the Google Colab session with the following steps.

In [4]:
# 1) Download the query tweets from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_TRAIN_DATA = 'subtask4b_query_tweets_train.tsv' #MODIFY PATH
PATH_QUERY_DEV_DATA = 'subtask4b_query_tweets_dev.tsv' #MODIFY PATH

In [5]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')

In [None]:
df_query_train.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5
1,1,this study isn't receiving sufficient attentio...,4kfl29ul
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69


In [None]:
df_query_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12853 entries, 0 to 12852
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     12853 non-null  int64 
 1   tweet_text  12853 non-null  object
 2   cord_uid    12853 non-null  object
dtypes: int64(1), object(2)
memory usage: 301.4+ KB


In [None]:
df_query_dev.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,16,covid recovery: this study from the usa reveal...,3qvh482o
1,69,"""Among 139 clients exposed to two symptomatic ...",r58aohnu
2,73,I recall early on reading that researchers who...,sts48u9i
3,93,You know you're credible when NIH website has ...,3sr2exq9
4,96,Resistance to antifungal medications is a grow...,ybwwmyqy


In [None]:
df_query_dev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     1400 non-null   int64 
 1   tweet_text  1400 non-null   object
 2   cord_uid    1400 non-null   object
dtypes: int64(1), object(2)
memory usage: 32.9+ KB


# 2) Running the baseline
The following code runs a BM25 baseline.


In [None]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi



The next cell builds a BM25-based search index that **scores papers against a tweet query** using lexical **text similarity** (word matching).
* Combines the title and abstract of each paper into a single text field.
* Creates a list of paper IDs in the same order as the corpus. These IDs let you **track which paper each text belongs to**.
* Splits each combined `title + abstract` string into words (tokens) on single spaces.
* `bm25 = BM25Okapi(tokenized_corpus)` creates the actual **BM25 model** from the list of tokenized documents. BM25 then knows which papers contain which words, and how informative those words are.

In [None]:
# Create the BM25 corpus
corpus = df_collection[:][['title', 'abstract']].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
cord_uids = df_collection[:]['cord_uid'].tolist()
tokenized_corpus = [doc.split(' ') for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
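To make the scoring behind the index more concrete, here is a minimal, standard-library sketch of a textbook BM25 formula on a toy corpus. It is an illustration only: the function name `bm25_scores`, the toy documents, and the smoothed IDF variant are assumptions, and `rank_bm25` may handle IDF slightly differently.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents in which each term occurs
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed, always-positive IDF (an assumed variant for this sketch)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalisation: long documents need more matches to score as high
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (illustrative text, not taken from the collection)
toy_docs = [
    "face masks reduce exposure to respiratory infections".split(" "),
    "quarantine history from plague to influenza".split(" "),
    "masks and quarantine".split(" "),
]
toy_scores = bm25_scores("face masks".split(" "), toy_docs)
best_index = max(range(len(toy_docs)), key=toy_scores.__getitem__)
```

The first document matches both query terms and ranks first; the second matches none and scores zero.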

This function is the **retrieval step** of the traditional IR model: it takes a tweet and returns the **top-5 paper IDs** that are most relevant according to BM25 scores.
* You give it a `query` (the tweet text). It returns a list of the **top 5 paper IDs (`cord_uid`)** that best match the query, based on BM25.
* A **module-level cache dictionary** stores results so that repeated queries are not scored twice. (A dictionary created inside the function would be reset on every call and never produce a hit.)
* Checks whether this query has already been processed and cached.
* Tokenizes the tweet into words (BM25 needs tokens, not raw strings).
  * For example: `"masks help prevent spread"` → `["masks", "help", "prevent", "spread"]`
* Scores each paper in the BM25 corpus by how well it matches the query tokens. This gives a list of scores (one per paper).
* Sorts the scores **in descending order** and keeps the first 5 indices.
* Looks up the **cord\_uid (paper ID)** of those top 5 indices.
* Saves the result to the cache and returns the top 5 paper IDs for this query.

In [None]:
# The cache lives outside the function so that it persists across calls
text2bm25top = {}

def get_top_cord_uids(query):
    if query in text2bm25top:
        return text2bm25top[query]
    tokenized_query = query.split(' ')
    doc_scores = bm25.get_scores(tokenized_query)
    # Negate the scores so that argsort yields the highest-scoring papers first
    indices = np.argsort(-doc_scores)[:5]
    bm25_topk = [cord_uids[x] for x in indices]

    text2bm25top[query] = bm25_topk
    return bm25_topk
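Memoisation of this kind only pays off when the cache outlives a single call. As an alternative to a hand-rolled dictionary, the standard library's `functools.lru_cache` can do the bookkeeping. The sketch below is a minimal illustration; `expensive_retrieval` is a hypothetical stand-in for the BM25 scoring step, not the notebook's function.

```python
from functools import lru_cache

# Counts how often the expensive step actually runs
call_counter = {"n": 0}

def expensive_retrieval(query: str):
    """Hypothetical stand-in for scoring the whole collection with BM25."""
    call_counter["n"] += 1
    return tuple(sorted(set(query.lower().split(" "))))[:5]

@lru_cache(maxsize=None)
def cached_retrieval(query: str):
    # Results are tuples, so callers cannot mutate what is stored in the cache
    return expensive_retrieval(query)

first = cached_retrieval("masks help prevent spread")
second = cached_retrieval("masks help prevent spread")  # served from the cache
```

After both calls `call_counter["n"]` is still 1, because the second call never reaches the expensive step.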


This cell **applies the BM25 retrieval model** to the training and dev datasets by calling `get_top_cord_uids()` (the BM25 search function) on every tweet.
* Takes each tweet in the `tweet_text` column of `df_query_train`
* Passes it to `get_top_cord_uids()`
* That function uses **BM25** to retrieve the **top-5 most relevant paper IDs**
* Saves the list of top 5 `cord_uid`s to a new column called `bm25_topk`

The same steps are then repeated for the dev set.

In [None]:
# Retrieve topk candidates using the BM25 model
df_query_train['bm25_topk'] = df_query_train['tweet_text'].apply(get_top_cord_uids)
df_query_dev['bm25_topk'] = df_query_dev['tweet_text'].apply(get_top_cord_uids)

# 3) Evaluating the baseline
The following code evaluates the BM25 retrieval baseline on the query set using Mean Reciprocal Rank at cutoff k (MRR@k), with MRR@5 as the main score.

The function measures **how well the model ranks the correct paper** at several top-k cutoffs (top-1, top-5, and top-10).
* For each query, it finds **the position** of the correct `cord_uid` within the first k predictions. The reciprocal rank is 1/position, or 0 if the paper is not among them.
* It returns the **average of the reciprocal ranks** over all queries, one value per k.
* `data`: The DataFrame of queries, with one row per tweet
* `col_gold`: The column containing the correct paper ID (usually `'cord_uid'`)
* `col_pred`: The column containing a list of predicted paper IDs (e.g. `'bm25_topk'` or `'reranked_topk'`)
* `list_k`: The cutoffs you want to compute MRR\@k for (e.g. `[1, 5, 10]`)

Because the baseline above retrieves only five candidates per query, its MRR@10 is necessarily equal to its MRR@5.

In [6]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k = [1, 5, 10]):
    d_performance = {}
    for k in list_k:
        # Reciprocal rank of the gold ID within the top-k predictions, 0 if absent
        data["in_topx"] = data.apply(
            lambda x: 1 / (list(x[col_pred][:k]).index(x[col_gold]) + 1)
            if x[col_gold] in list(x[col_pred][:k]) else 0,
            axis=1)
        d_performance[k] = data["in_topx"].mean()
    return d_performance
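A small worked example may help when checking the metric by hand. The sketch below recomputes MRR@k on three toy queries using plain Python; the helper names and the toy IDs are illustrative assumptions, not part of the notebook.

```python
def reciprocal_rank(gold, preds, k):
    """Return 1/rank of the gold ID within the first k predictions, else 0."""
    top = preds[:k]
    return 1.0 / (top.index(gold) + 1) if gold in top else 0.0

def toy_mrr_at_k(rows, k):
    """Average the reciprocal ranks over all (gold, predictions) pairs."""
    return sum(reciprocal_rank(gold, preds, k) for gold, preds in rows) / len(rows)

toy_rows = [
    ("a", ["a", "b", "c"]),  # gold at rank 1: contributes 1
    ("b", ["x", "y", "b"]),  # gold at rank 3: contributes 1/3 (only if k >= 3)
    ("c", ["x", "y", "z"]),  # gold missing: contributes 0
]

mrr_at_1 = toy_mrr_at_k(toy_rows, 1)  # (1 + 0 + 0) / 3
mrr_at_3 = toy_mrr_at_k(toy_rows, 3)  # (1 + 1/3 + 0) / 3
```

Raising k can only keep the score the same or increase it, since a larger cutoff never removes a hit.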


This block of code is **evaluating how well your BM25 retrieval performed** using the `get_performance_mrr()` function you defined earlier.

In [None]:
results_train = get_performance_mrr(df_query_train, 'cord_uid', 'bm25_topk')
results_dev = get_performance_mrr(df_query_dev, 'cord_uid', 'bm25_topk')
# Printed MRR@k results in the following format: {k: MRR@k}
print(f"Results on the train set: {results_train}")
print(f"Results on the dev set: {results_dev}")

Results on the train set: {1: 0.5080525947249669, 5: 0.5509388210275163, 10: 0.5509388210275163}
Results on the dev set: {1: 0.505, 5: 0.5520357142857142, 10: 0.5520357142857142}


# 4) Exporting results to prepare the submission on Codalab

In [None]:
#df_query_dev['preds'] = df_query_dev['bm25_topk'].apply(lambda x: x[:5])

In [None]:
#df_query_dev[['post_id', 'preds']].to_csv('predictions.tsv', index=None, sep='\t')
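The two commented-out cells above write a tab-separated file with a `post_id` column and a `preds` column holding the top-5 list. As a rough sketch of that layout, the snippet below writes the same two columns to an in-memory buffer with the standard `csv` module. The IDs are placeholders, and the exact format expected by Codalab is an assumption that should be checked against the task instructions.

```python
import csv
import io

# Placeholder predictions: (post_id, ranked list of cord_uids)
toy_predictions = [
    (16, ["uid_a", "uid_b", "uid_c", "uid_d", "uid_e", "uid_f"]),
    (69, ["uid_b", "uid_a"]),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["post_id", "preds"])
for post_id, preds in toy_predictions:
    # Keep at most five candidates and store the list in its Python string form
    writer.writerow([post_id, str(preds[:5])])

tsv_text = buffer.getvalue()
tsv_lines = tsv_text.splitlines()
```

Writing to a `StringIO` buffer makes it easy to inspect the output before saving it as `predictions.tsv`.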

# **Part 1: Traditional IR Model (BM25)**

In [None]:
# New version, adjusted after missing the CLEF deadline: evaluate on a local train/test split.

from sklearn.model_selection import train_test_split
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi

# 1. Combine and clean training + dev queries
df_query_train["split_origin"] = "train"
df_query_dev["split_origin"] = "dev"
df_all_queries = pd.concat([df_query_train, df_query_dev], ignore_index=True)
df_all_queries = df_all_queries.drop_duplicates(subset=["tweet_text"])

# 2. Split into new train/test sets (80/20)
df_train_split, df_test_split = train_test_split(
    df_all_queries, test_size=0.2, random_state=42
)

# 3. Build BM25 index on document collection
df_collection['combined_text'] = (df_collection['title'] + ' ' + df_collection['abstract']).fillna('')
corpus = df_collection['combined_text'].apply(word_tokenize).tolist()
bm25 = BM25Okapi(corpus)

# 4. Define BM25 retrieval function
def get_top_cord_uids(query, topk=20):
    tokenized_query = word_tokenize(query)
    scores = bm25.get_scores(tokenized_query)
    top_indices = scores.argsort()[-topk:][::-1]
    return df_collection.iloc[top_indices]['cord_uid'].tolist()

# 5. Run BM25 on test set
df_test_split['bm25_topk'] = df_test_split['tweet_text'].apply(lambda x: get_top_cord_uids(x))

# 6. Evaluate MRR@5 using your own function
def mrr_at_5(df):
    reciprocal_ranks = []
    for _, row in df.iterrows():
        true_uid = row['cord_uid']
        preds = row['bm25_topk'][:5]
        if true_uid in preds:
            rank = preds.index(true_uid) + 1
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

score = mrr_at_5(df_test_split)
print(f'MRR@5 (local test split): {score:.4f}')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


MRR@5 (local test split): 0.5460
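This version tokenizes with NLTK's `word_tokenize` instead of splitting on spaces. One reason this can matter for lexical matching is that whitespace splitting leaves punctuation attached to words, so `"Masks,"` and `"masks"` count as different terms. The sketch below illustrates the effect with a simple regex tokenizer; it is a standard-library stand-in and does not reproduce `word_tokenize` exactly (for example, it also lowercases, which `word_tokenize` does not).

```python
import re

def whitespace_tokens(text):
    """Split on single spaces, as in the first baseline cell."""
    return text.split(" ")

def regex_tokens(text):
    """Simplified stand-in tokenizer: lowercase, keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

tweet = "Masks, it turns out, reduce exposure."
ws = whitespace_tokens(tweet)   # punctuation stays glued to the words
rx = regex_tokens(tweet)        # punctuation dropped, case normalised
```

With whitespace splitting, a query term `masks` would not match the token `Masks,`; with the regex tokenizer it does.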


In [None]:
#df_query_dev.bm25_topk[1]

# **Part 2: Neural IR Model**

In [7]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd
import numpy as np
from typing import List, Dict

import os
os.environ["WANDB_DISABLED"] = "true"


In [8]:
def encode_texts(model: SentenceTransformer, texts: List[str], batch_size: int = 32):
    return model.encode(texts, convert_to_tensor=True, show_progress_bar=True, batch_size=batch_size)


In [9]:
def retrieve_top_k(query_embeddings, doc_embeddings, doc_ids, top_k=10) -> List[List[str]]:
    all_preds = []
    for query_embedding in query_embeddings:
        scores = util.cos_sim(query_embedding, doc_embeddings)[0]
        top_results = scores.argsort(descending=True)[:top_k]
        top_doc_ids = [doc_ids[i] for i in top_results]
        all_preds.append(top_doc_ids)
    return all_preds
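For intuition, the ranking step in `retrieve_top_k` amounts to scoring every document by cosine similarity to the query vector and keeping the best `top_k`. The standard-library sketch below shows the same idea on two-dimensional toy vectors; the function names and vectors are illustrative assumptions, and real embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_by_cosine(query_vec, doc_vecs, doc_ids, top_k=2):
    """Return the IDs of the top_k documents most similar to the query."""
    scored = sorted(
        zip(doc_ids, doc_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]

toy_ids = ["doc_a", "doc_b", "doc_c"]
toy_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
ranked = top_k_by_cosine([0.9, 0.1], toy_vecs, toy_ids)
```

The query points almost along the first axis, so `doc_a` ranks first and `doc_b` second.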


In [10]:
def get_mrr_at_k(data: pd.DataFrame, col_gold: str, col_pred: str, list_k = [1, 5, 10]) -> Dict[int, float]:
    performance = {}
    for k in list_k:
        data["in_topk"] = data.apply(
            lambda x: 1 / (x[col_pred][:k].index(x[col_gold]) + 1)
                      if x[col_gold] in x[col_pred][:k] else 0,
            axis=1)
        performance[k] = data["in_topk"].mean()
    return performance


In [11]:
def run_neural_ir_pipeline(
    model_name: str,
    df_collection: pd.DataFrame,
    df_queries: pd.DataFrame,
    top_k: int = 10
) -> Dict[int, float]:

    print(f"\n Loading model: {model_name}")
    model = SentenceTransformer(model_name)

    print("Encoding documents:")
    doc_texts = df_collection.apply(lambda row: f"{row['title']} {row['abstract']}", axis=1).tolist()
    doc_ids = df_collection['cord_uid'].tolist()
    doc_embeddings = encode_texts(model, doc_texts)

    print("Encoding queries:")
    query_texts = df_queries['tweet_text'].tolist()
    query_embeddings = encode_texts(model, query_texts)

    print("Retrieving top-K documents:")
    topk_preds = retrieve_top_k(query_embeddings, doc_embeddings, doc_ids, top_k=top_k)
    df_queries['neural_topk'] = topk_preds

    print("Evaluating with MRR@K:")
    results = get_mrr_at_k(df_queries, col_gold='cord_uid', col_pred='neural_topk')
    print(f"Results for {model_name}: {results}")
    return results


In [12]:
results_miniLM = run_neural_ir_pipeline(
    model_name='sentence-transformers/msmarco-MiniLM-L-12-v3',
    df_collection=df_collection,
    df_queries=df_query_dev
)


 Loading model: sentence-transformers/msmarco-MiniLM-L-12-v3


Encoding documents:

Encoding queries:

Retrieving top-K documents:
Evaluating with MRR@K:
Results for sentence-transformers/msmarco-MiniLM-L-12-v3: {1: np.float64(0.34785714285714286), 5: np.float64(0.4126428571428572), 10: np.float64(0.42183503401360545)}


In [13]:
results_e5 = run_neural_ir_pipeline(
    model_name='intfloat/e5-base-v2',
    df_collection=df_collection,
    df_queries=df_query_dev
)


 Loading model: intfloat/e5-base-v2


Encoding documents:

Encoding queries:

Retrieving top-K documents:
Evaluating with MRR@K:
Results for intfloat/e5-base-v2: {1: np.float64(0.54), 5: np.float64(0.5994047619047619), 10: np.float64(0.6070802154195011)}
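One detail worth checking for the E5 models: their model cards state that each input should start with `query: ` or `passage: `, whereas the pipeline above passes raw text. A minimal sketch of a helper that adds these prefixes follows; the name `add_e5_prefixes` is hypothetical, and whether prefixing changes the scores reported here has not been measured.

```python
from typing import List, Tuple

def add_e5_prefixes(queries: List[str], passages: List[str]) -> Tuple[List[str], List[str]]:
    """Prepend the role prefixes described in the E5 model cards."""
    prefixed_queries = [f"query: {q}" for q in queries]
    prefixed_passages = [f"passage: {p}" for p in passages]
    return prefixed_queries, prefixed_passages

toy_queries, toy_passages = add_e5_prefixes(
    ["masks help prevent spread"],
    ["Professional and Home-Made Face Masks Reduce Exposure"],
)
```

The prefixed lists could then be passed to `encode_texts` in place of `query_texts` and `doc_texts`; models without this convention should receive the raw text.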


In [14]:
results_distilbert = run_neural_ir_pipeline(
    model_name='sentence-transformers/msmarco-distilbert-base-v4',
    df_collection=df_collection,
    df_queries=df_query_dev
)


 Loading model: sentence-transformers/msmarco-distilbert-base-v4


Encoding documents:

Encoding queries:

Retrieving top-K documents:
Evaluating with MRR@K:
Results for sentence-transformers/msmarco-distilbert-base-v4: {1: np.float64(0.365), 5: np.float64(0.42802380952380953), 10: np.float64(0.43436281179138325)}


In [15]:
results_e5_large = run_neural_ir_pipeline(
    model_name='intfloat/e5-large-v2',
    df_collection=df_collection,
    df_queries=df_query_dev
)


 Loading model: intfloat/e5-large-v2


Encoding documents:

Encoding queries:

Retrieving top-K documents:
Evaluating with MRR@K:
Results for intfloat/e5-large-v2: {1: np.float64(0.5771428571428572), 5: np.float64(0.6394404761904762), 10: np.float64(0.6460663265306122)}


In [16]:
results_all_minilm = run_neural_ir_pipeline(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    df_collection=df_collection,
    df_queries=df_query_dev
)


 Loading model: sentence-transformers/all-MiniLM-L6-v2


Encoding documents:

Encoding queries:

Retrieving top-K documents:
Evaluating with MRR@K:
Results for sentence-transformers/all-MiniLM-L6-v2: {1: np.float64(0.4157142857142857), 5: np.float64(0.48966666666666664), 10: np.float64(0.4993826530612245)}


# **Part 3: Neural Re-Ranker (Re-ranking top 20 BM25 candidates using BERT)**


In [None]:
# !pip install -U accelerate
# !pip install datasets
# !pip install -U sentence-transformers
# !pip uninstall -y keras tensorflow tf-keras keras-nightly keras-preprocessing keras-vis
# !pip install -U sentence-transformers --no-deps  # avoid triggering keras/tf again

In [None]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"

import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Generate training data
train_examples = []
for _, row in df_train_split.iterrows():
    query = row['tweet_text']
    gold_uid = row['cord_uid']

    # Positive sample
    pos_doc = df_collection[df_collection['cord_uid'] == gold_uid]
    if not pos_doc.empty:
        pos_text = (pos_doc.iloc[0]['title'] + ' ' + pos_doc.iloc[0]['abstract']).strip()
        train_examples.append(InputExample(texts=[query, pos_text], label=1.0))

    # Negative sample (BM25 top doc not equal to gold)
    top_candidates = get_top_cord_uids(query, topk=20)
    for neg_uid in top_candidates:
        if neg_uid != gold_uid:
            neg_doc = df_collection[df_collection['cord_uid'] == neg_uid]
            if not neg_doc.empty:
                neg_text = (neg_doc.iloc[0]['title'] + ' ' + neg_doc.iloc[0]['abstract']).strip()
                train_examples.append(InputExample(texts=[query, neg_text], label=0.0))
                break

# Fine-tune the CrossEncoder
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', num_labels=1)
cross_encoder.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100, output_path='./finetuned_crossencoder')

# After training is done
cross_encoder.save("./finetuned_crossencoder")

# Reload fine-tuned model and apply reranking
cross_encoder = CrossEncoder('./finetuned_crossencoder')

def rerank_with_cross_encoder(tweet, topk_ids):
    # Fetch the candidate rows and restore the BM25 order
    candidates = df_collection[df_collection['cord_uid'].isin(topk_ids)].copy()
    candidates['sort_idx'] = candidates['cord_uid'].apply(lambda x: topk_ids.index(x))
    candidates = candidates.sort_values('sort_idx')

    texts = (candidates['title'] + ' ' + candidates['abstract']).fillna('').tolist()
    model_inputs = [(tweet, doc) for doc in texts]
    scores = cross_encoder.predict(model_inputs)

    # Pair each score with the ID of the row it was computed for, best first
    ranked = sorted(zip(scores, candidates['cord_uid'].tolist()), key=lambda p: p[0], reverse=True)
    return [uid for _, uid in ranked]

df_test_split['reranked_top5'] = df_test_split.apply(
    lambda row: rerank_with_cross_encoder(row['tweet_text'], row['bm25_topk']),
    axis=1
)

# Evaluate
def mrr_at_k(df, pred_col, gold_col='cord_uid', k=5):
    scores = []
    for _, row in df.iterrows():
        preds = row[pred_col][:k]
        if row[gold_col] in preds:
            rank = preds.index(row[gold_col]) + 1
            scores.append(1 / rank)
        else:
            scores.append(0)
    return np.mean(scores)

mrr5 = mrr_at_k(df_test_split, pred_col='reranked_top5', k=5)
print(f"Local MRR@5 (fine-tuned CrossEncoder): {mrr5:.4f}")



Step,Training Loss
500,0.5625
1000,0.4229


Local MRR@5 (fine-tuned CrossEncoder): 0.6352
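The re-ranking step boils down to three operations: build (query, document) pairs in candidate order, score them, and reorder the candidate IDs by score. The sketch below shows that flow with a dictionary lookup instead of DataFrame filtering. `overlap_scorer` is a hypothetical stand-in for `cross_encoder.predict`, and the toy texts are illustrative.

```python
def rerank(query, candidate_ids, uid_to_text, score_fn, top_k=5):
    """Reorder candidate IDs by the score of their (query, text) pair."""
    pairs = [(query, uid_to_text[uid]) for uid in candidate_ids]
    scores = score_fn(pairs)
    # Sort candidate positions by score so IDs and scores stay aligned
    order = sorted(range(len(candidate_ids)), key=lambda i: scores[i], reverse=True)
    return [candidate_ids[i] for i in order[:top_k]]

def overlap_scorer(pairs):
    """Hypothetical scorer: number of shared lowercase words per pair."""
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]

toy_texts = {
    "u1": "quarantine history",
    "u2": "face masks reduce exposure",
    "u3": "masks",
}
reranked = rerank("face masks", ["u1", "u2", "u3"], toy_texts, overlap_scorer)
```

Building the `uid_to_text` dictionary once avoids filtering the collection for every candidate, which can add up when re-ranking 20 candidates for thousands of queries.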


## Neural re-ranker on top of BM25

In [None]:
NEG_PER_POS = 4     # number of negatives per positive
CANDIDATE_K = 20    # BM25 candidate pool size
FINAL_K     = 5     # reranked top-k

train_examples = []
for _, row in df_train_split.iterrows():
    q       = row['tweet_text']
    pos_uid = row['cord_uid']
    cands = get_top_cord_uids(q, topk=NEG_PER_POS+1)
    if pos_uid not in cands:
        cands[-1] = pos_uid
    for uid in cands:
        doc_text = df_collection.loc[
            df_collection['cord_uid']==uid, 'combined_text'
        ].values[0]
        label = 1.0 if uid == pos_uid else 0.0
        train_examples.append(InputExample(texts=[q, doc_text], label=label))

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
ce = CrossEncoder(model_name, num_labels=1)

train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
ce.fit(
    train_dataloader=train_loader,
    epochs=2,
    output_path="./checkthat_reranker",
    save_best_model=True
)

def neural_rerank(query, bm25_uids, topk=FINAL_K):
    pairs = []
    for uid in bm25_uids:
        doc_text = df_collection.loc[
            df_collection['cord_uid']==uid, 'combined_text'
        ].values[0]
        pairs.append([query, doc_text])
    scores = ce.predict(pairs)
    idxs = np.argsort(scores)[::-1][:topk]
    return [bm25_uids[i] for i in idxs]

df_test_split['bm25_20'] = df_test_split['tweet_text']\
    .apply(lambda q: get_top_cord_uids(q, topk=CANDIDATE_K))
df_test_split['nn_top5'] = df_test_split.apply(
    lambda r: neural_rerank(r['tweet_text'], r['bm25_20'], topk=FINAL_K),
    axis=1
)

def mrr_list_at_5(pred_lists, true_uids):
    rr = []
    for preds, true in zip(pred_lists, true_uids):
        if true in preds:
            rr.append(1.0 / (preds.index(true) + 1))
        else:
            rr.append(0.0)
    return float(np.mean(rr))

nn_score = mrr_list_at_5(
    df_test_split['nn_top5'].tolist(),
    df_test_split['cord_uid'].tolist()
)
print(f'Neural re-ranking MRR@5: {nn_score:.4f}')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

BM25 MRR@5 (local split): 0.5460


Neural re-ranking MRR@5: 0.6499
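The training set for this re-ranker pairs each tweet with its gold paper and several BM25 hard negatives. The sketch below shows one way to express that construction as a pure function. It is a simplified variant of the loop above (the name `build_pairs` and the toy IDs are assumptions), and it returns plain tuples rather than `InputExample` objects.

```python
def build_pairs(query, gold_uid, bm25_ranked_uids, neg_per_pos=4):
    """Return one positive and up to neg_per_pos hard-negative training triples."""
    # Hard negatives: the highest-ranked BM25 candidates that are not the gold paper
    negatives = [uid for uid in bm25_ranked_uids if uid != gold_uid][:neg_per_pos]
    positive = [(query, gold_uid, 1.0)]
    return positive + [(query, uid, 0.0) for uid in negatives]

toy_pairs = build_pairs("q", "g", ["a", "g", "b", "c", "d", "e"])
```

Each query contributes exactly one positive regardless of whether BM25 retrieved the gold paper, so the positive-to-negative ratio stays fixed at 1 to `neg_per_pos` when enough candidates are available.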
