# Getting started

### CLEF 2025 - CheckThat! Lab - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook enables participants of subtask 4b to quickly get started. It includes the following:
- Code to upload data, including:
    - code to upload the collection set (CORD-19 academic papers' metadata)
    - code to upload the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to evaluate the baseline model

Participants are free to use this notebook and add their own models for the competition.

# 1) Importing data

In [1]:
!pip install pandas






In [2]:
import numpy as np
import pandas as pd

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file, then upload it to the Google Colab session using the following steps.


In [3]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = 'subtask4b_collection_data.pkl' #MODIFY PATH

In [4]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [5]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [5]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the files, then upload them to the Google Colab session using the following steps.

In [6]:
# 1) Download the query tweets from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_TRAIN_DATA = 'subtask4b_query_tweets_train.tsv' #MODIFY PATH
PATH_QUERY_DEV_DATA = 'subtask4b_query_tweets_dev.tsv' #MODIFY PATH

In [7]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')

In [8]:
df_query_dev.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,16,covid recovery: this study from the usa reveal...,3qvh482o
1,69,"""Among 139 clients exposed to two symptomatic ...",r58aohnu
2,73,I recall early on reading that researchers who...,sts48u9i
3,93,You know you're credible when NIH website has ...,3sr2exq9
4,96,Resistance to antifungal medications is a grow...,ybwwmyqy


In [9]:
df_query_dev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     1400 non-null   int64 
 1   tweet_text  1400 non-null   object
 2   cord_uid    1400 non-null   object
dtypes: int64(1), object(2)
memory usage: 32.9+ KB


# 2) Running the baseline
The following code runs a BM25 baseline.
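
As background, BM25 scores a document by summing, over the query terms it contains, an inverse-document-frequency weight multiplied by a term-frequency factor that saturates and is normalized by document length. The sketch below is a textbook Okapi BM25 in plain Python on a made-up three-document corpus. The function name `bm25_scores` and the defaults `k1=1.5`, `b=0.75` are assumptions for illustration; the `rank_bm25` package may handle IDF details differently.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up toy corpus, tokenized on whitespace like the baseline.
toy_docs = [
    "face masks reduce exposure to respiratory infections".split(),
    "history of quarantine from plague to influenza".split(),
    "masks and quarantine during the influenza pandemic".split(),
]
toy_scores = bm25_scores("face masks".split(), toy_docs)
toy_ranking = sorted(range(len(toy_docs)), key=lambda i: -toy_scores[i])
print(toy_ranking)
```

The first document matches both query terms, the third matches one, and the second matches none, so it receives a score of zero.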


In [10]:
# !pip install rank_bm25
from rank_bm25 import BM25Okapi


In [11]:
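# Create the BM25 corpus: one "title abstract" string per paper
corpus = df_collection[['title', 'abstract']].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
# cord_uids[i] is the ID of the paper at position i in the corpus
cord_uids = df_collection['cord_uid'].tolist()
# Naive whitespace tokenization: case and punctuation are preserved
tokenized_corpus = [doc.split(' ') for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)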
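
Splitting on a single space keeps case and punctuation, so `COVID-19,` and `covid-19` end up as different tokens. One possible refinement is sketched below. The helper `simple_tokenize` is hypothetical and not part of the original notebook, and whether it improves MRR on this dataset has not been tested here.

```python
import re

# Hypothetical helper for illustration only.
_TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def simple_tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (hyphenated words stay whole)."""
    return _TOKEN_RE.findall(text.lower())

print("COVID-19, masks!".split(" "))        # whitespace split keeps case and punctuation
print(simple_tokenize("COVID-19, masks!"))  # normalized tokens
```

If a tokenizer like this is used, it has to be applied to both the corpus and the queries, otherwise the terms no longer match.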

In [12]:
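# Cache of already-computed results, keyed by tweet text.
# It must live outside the function; otherwise it is reset on every call.
text2bm25top = {}

def get_top_cord_uids(query):
    """Return the cord_uids of the five best BM25-ranked papers for a query."""
    if query in text2bm25top:
        return text2bm25top[query]
    tokenized_query = query.split(' ')
    doc_scores = bm25.get_scores(tokenized_query)
    indices = np.argsort(-doc_scores)[:5]
    bm25_topk = [cord_uids[x] for x in indices]

    text2bm25top[query] = bm25_topk
    return bm25_topk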
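
The dictionary in `get_top_cord_uids` is meant to cache results per tweet text. As an alternative, the standard library's `functools.lru_cache` provides the same memoization without a hand-written dictionary. The sketch below uses a stand-in function (`cached_search`, hypothetical) in place of the BM25 call so that it runs on its own.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive search really runs

@lru_cache(maxsize=None)
def cached_search(query):
    """Stand-in for an expensive retrieval call; results are cached per query string."""
    global call_count
    call_count += 1
    # Return a tuple: cached values should be immutable so callers cannot alter them.
    return tuple(sorted(set(query.lower().split())))[:5]

cached_search("masks reduce spread")
cached_search("masks reduce spread")  # served from the cache
print(call_count)
```

The second call does not run the function body again, which is the behaviour the cache in the baseline is aiming for.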


In [13]:
# Retrieve topk candidates using the BM25 model
df_query_train['bm25_topk'] = df_query_train['tweet_text'].apply(lambda x: get_top_cord_uids(x))
df_query_dev['bm25_topk'] = df_query_dev['tweet_text'].apply(lambda x: get_top_cord_uids(x))

# 3) Evaluating the baseline
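The following code evaluates the BM25 retrieval baseline on the query set using the Mean Reciprocal Rank score (MRR@5). The function also reports MRR@1 and MRR@10; because `get_top_cord_uids` returns only five candidates per tweet, MRR@10 is identical to MRR@5 here.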
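
To make the metric concrete: for each query, the reciprocal rank is 1 divided by the position of the gold paper within the top k predictions, or 0 if the gold paper is not among them; MRR@k is the mean of these values over all queries. The sketch below works this out on made-up toy data. The helper name `reciprocal_rank` is an assumption for illustration.

```python
def reciprocal_rank(gold, ranked, k):
    """Return 1/rank of gold within the top-k of ranked, or 0.0 if absent."""
    top_k = ranked[:k]
    return 1.0 / (top_k.index(gold) + 1) if gold in top_k else 0.0

# Made-up toy data: (gold id, ranked predictions)
toy = [
    ("a", ["a", "b", "c"]),  # gold at rank 1
    ("b", ["c", "b", "a"]),  # gold at rank 2
    ("z", ["a", "b", "c"]),  # gold missing
]
mrr_at_1 = sum(reciprocal_rank(g, r, 1) for g, r in toy) / len(toy)
mrr_at_5 = sum(reciprocal_rank(g, r, 5) for g, r in toy) / len(toy)
print(mrr_at_1, mrr_at_5)
```

Only the first query counts at k=1, giving 1/3; at k=5 the second query adds 0.5, giving (1 + 0.5 + 0) / 3 = 0.5.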

In [14]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    """Return {k: MRR@k} for the gold column against the ranked prediction column."""
    def reciprocal_rank(row, k):
        # 1/rank of the gold ID within the top-k predictions, 0 if it is absent
        top_k = list(row[col_pred][:k])
        return 1 / (top_k.index(row[col_gold]) + 1) if row[col_gold] in top_k else 0

    d_performance = {}
    for k in list_k:
        data["in_topx"] = data.apply(lambda row: reciprocal_rank(row, k), axis=1)
        d_performance[k] = data["in_topx"].mean()
    return d_performance


In [15]:
results_train = get_performance_mrr(df_query_train, 'cord_uid', 'bm25_topk')
results_dev = get_performance_mrr(df_query_dev, 'cord_uid', 'bm25_topk')
# Printed MRR@k results in the following format: {k: MRR@k}
print(f"Results on the train set: {results_train}")
print(f"Results on the dev set: {results_dev}")

Results on the train set: {1: np.float64(0.5081303975725512), 5: np.float64(0.5509777224513084), 10: np.float64(0.5509777224513084)}
Results on the dev set: {1: np.float64(0.5057142857142857), 5: np.float64(0.5522738095238094), 10: np.float64(0.5522738095238094)}


# 4) Exporting results to prepare the submission on Codalab

In [16]:
df_query_dev['preds'] = df_query_dev['bm25_topk'].apply(lambda x: x[:5])

In [17]:
df_query_dev[['post_id', 'preds']].to_csv('predictions.tsv', index=None, sep='\t')
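
Because `preds` holds Python lists, pandas writes each list to the TSV as its text representation (for example `['id1', 'id2']`). The sketch below shows one way to read such a file back and sanity-check it before submitting. The sample rows are made up, the helper `check_predictions` is hypothetical, and the limit of five predictions is an assumption based on the top-5 cut above; the authoritative submission format is the one defined by the task organizers.

```python
import ast
import csv
import io

# Made-up sample mimicking the layout written above: a header row, then one
# row per tweet with the prediction list stored as its Python repr.
sample_tsv = "post_id\tpreds\n16\t['3qvh482o', 'r58aohnu']\n69\t['sts48u9i']\n"

def check_predictions(handle, max_preds=5):
    """Parse a predictions TSV and return {post_id: list of cord_uids}."""
    parsed = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        preds = ast.literal_eval(row["preds"])  # safely turns the repr back into a list
        if not isinstance(preds, list) or len(preds) > max_preds:
            raise ValueError(f"bad prediction list for post {row['post_id']}")
        if len(set(preds)) != len(preds):
            raise ValueError(f"duplicate ids for post {row['post_id']}")
        parsed[int(row["post_id"])] = preds
    return parsed

parsed = check_predictions(io.StringIO(sample_tsv))
print(parsed)
```

For a real file, `open('predictions.tsv', newline='')` can be passed in place of the in-memory sample.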

# 4) Neural Re-ranking Model
This section adds a neural re-ranking model that aims to improve on the BM25 baseline by re-scoring its top candidates with BERT and cross-attention. The model code is imported from a separate `neural_reranker` module.

In [18]:
# Install required dependencies
!pip install torch transformers tqdm rank_bm25

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Python312\\share'
Consider using the `--user` option or check the permissions.




In [None]:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

In [19]:
!pip install --upgrade numpy

ERROR: Could not install packages due to an OSError: [WinError 2] The system cannot find the file specified: 'C:\\Python312\\Scripts\\f2py.exe' -> 'C:\\Python312\\Scripts\\f2py.exe.deleteme'




In [22]:

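import os
import torch  # needed later for torch.save / torch.load
from tqdm import tqdm  # needed later for the evaluation progress bar
from neural_reranker import NeuralReranker, Reranker, prepare_training_data, train_reranker  # neural re-ranking model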

ModuleNotFoundError: No module named 'torch'

In [23]:
# Prepare training data
print("Preparing training data...")
training_data = prepare_training_data(df_query_train, df_collection)

Preparing training data...


NameError: name 'prepare_training_data' is not defined

In [None]:
# Initialize model
model = NeuralReranker(bert_model_name='bert-base-uncased')

# Train with a small batch size to limit memory use
print("Training neural reranker...")
trained_model = train_reranker(
    model=model,
    training_data=training_data,
    num_epochs=3,
    batch_size=2,  # Smaller batch size
    learning_rate=2e-5,
    gradient_accumulation_steps=1  # With the usual semantics, 1 means no accumulation; raise it to emulate a larger effective batch
)

# Save the trained model
print("Saving model...")
torch.save(trained_model.state_dict(), 'neural_reranker_model.pt')
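
Gradient accumulation, as the term is commonly used, sums the gradients of several small batches before a single optimizer step, so that a small `batch_size` can emulate a larger one. How `train_reranker` implements its `gradient_accumulation_steps` parameter is not shown in this notebook, so the following is only a sketch of the general idea in plain Python, on a made-up one-parameter regression problem.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]  # made-up (x, y) pairs
w = 0.5
micro_batch_size = 2
accumulation_steps = len(data) // micro_batch_size

accumulated = 0.0
for step in range(accumulation_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += mse_grad(w, micro) / accumulation_steps

full_batch = mse_grad(w, data)
print(accumulated, full_batch)
```

With equally sized micro-batches, the accumulated gradient matches the full-batch gradient up to floating-point rounding, which is why two accumulation steps at batch size 2 behave like one step at batch size 4.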

# 5) Neural Re-ranking Model Evaluation
This section implements the evaluation pipeline for the neural re-ranking model.

In [None]:
def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    """Calculate MRR@k for evaluation"""
    def reciprocal_rank(row, k):
        # 1/rank of the gold ID within the top-k predictions, 0 if it is absent
        top_k = list(row[col_pred][:k])
        return 1 / (top_k.index(row[col_gold]) + 1) if row[col_gold] in top_k else 0

    d_performance = {}
    for k in list_k:
        data["in_topx"] = data.apply(lambda row: reciprocal_rank(row, k), axis=1)
        d_performance[k] = data["in_topx"].mean()
    return d_performance

def evaluate_reranker(model, df_dev, df_collection, bm25_model, top_k=5):
    """Evaluate the reranker on the development set.

    bm25_model must have been built from df_collection in the same row order.
    With top_k=5 only five candidates are re-ranked, so MRR@10 equals MRR@5.
    """
    results = []
    reranker = Reranker(model=model)  # build once instead of once per query

    for _, row in tqdm(df_dev.iterrows(), total=len(df_dev), desc="Evaluating"):
        query = row['tweet_text']

        # Get BM25 candidates (row positions in df_collection, best first)
        doc_scores = bm25_model.get_scores(query.split(' '))
        bm25_indices = np.argsort(-doc_scores)[:top_k]
        candidates = df_collection.iloc[bm25_indices]
        bm25_candidates = candidates['cord_uid'].tolist()

        # Document texts and BM25 scores for the candidates
        candidate_texts = [f"{t} {a}" for t, a in zip(candidates['title'], candidates['abstract'])]
        bm25_scores = [doc_scores[i] for i in bm25_indices]

        # Rerank using the neural model; each result is a (candidate position, score) pair
        reranked_docs = reranker.rerank(query, candidate_texts, bm25_scores)
        reranked_ids = [bm25_candidates[i] for i, _ in reranked_docs]

        results.append({
            'post_id': row['post_id'],
            'gold_doc_id': row['cord_uid'],
            'predicted_docs': reranked_ids
        })

    return pd.DataFrame(results)
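
The evaluation code passes the BM25 scores to `reranker.rerank` and reads the result as `(candidate position, score)` pairs, best first. The internals of `Reranker` are not shown in this notebook. Purely as an illustration of that input/output contract, the sketch below blends min-max normalized BM25 scores with neural scores; the function `fuse_and_rank`, the weight `alpha`, and all scores are hypothetical.

```python
def fuse_and_rank(bm25_scores, neural_scores, alpha=0.5):
    """Blend min-max normalized BM25 scores with neural scores.

    Returns (candidate_index, fused_score) pairs, best first. alpha is a
    hypothetical weight on the neural score.
    """
    lo, hi = min(bm25_scores), max(bm25_scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    normalized = [(s - lo) / span for s in bm25_scores]
    fused = [alpha * n + (1 - alpha) * b for n, b in zip(neural_scores, normalized)]
    return sorted(enumerate(fused), key=lambda pair: pair[1], reverse=True)

# Made-up scores for three candidates; neural scores assumed to lie in [0, 1].
ranked = fuse_and_rank([12.0, 9.0, 6.0], [0.2, 0.9, 0.1])
reranked_positions = [i for i, _ in ranked]
print(reranked_positions)
```

Here the second candidate overtakes the first because its high neural score outweighs its lower BM25 score. The returned positions index into the original candidate list, which is how `reranked_ids` is built above.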

## 5.1) Train and Evaluate the Neural Reranker

In [None]:
# Prepare training data
print("Preparing training data...")
training_data = prepare_training_data(df_query_train, df_collection)

# Initialize and train model
print("Training neural reranker...")
model = NeuralReranker()
model = train_reranker(
    model=model,
    training_data=training_data,
    num_epochs=3,
    batch_size=8,
    learning_rate=2e-5
)

# Save trained model
print("Saving model...")
torch.save(model.state_dict(), 'neural_reranker_model.pt')

# Evaluate on dev set
print("Evaluating on development set...")
results_df = evaluate_reranker(model, df_query_dev, df_collection, bm25)

# Calculate MRR scores
print("Calculating MRR scores...")
mrr_scores = get_performance_mrr(
    results_df,
    'gold_doc_id',
    'predicted_docs',
    list_k=[1, 5, 10]
)

print("\nEvaluation Results:")
print(f"MRR@1: {mrr_scores[1]:.4f}")
print(f"MRR@5: {mrr_scores[5]:.4f}")
print(f"MRR@10: {mrr_scores[10]:.4f}")

## 5.2) Test the Neural Reranker on Individual Queries

In [None]:
def test_single_query(query: str, model_path: str, df_collection: pd.DataFrame, top_k: int = 5):
    """Test the reranker on a single query"""
    # Initialize BM25
    corpus = df_collection[['title', 'abstract']].apply(
        lambda x: f"{x['title']} {x['abstract']}", axis=1
    ).tolist()
    tokenized_corpus = [doc.split(' ') for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    
    # Get BM25 candidates
    tokenized_query = query.split(' ')
    doc_scores = bm25.get_scores(tokenized_query)
    bm25_indices = np.argsort(-doc_scores)[:top_k]
    bm25_candidates = [df_collection.iloc[i]['cord_uid'] for i in bm25_indices]
    
    # Get document texts for reranking
    candidate_texts = []
    for doc_id in bm25_candidates:
        doc = df_collection[df_collection['cord_uid'] == doc_id].iloc[0]
        candidate_texts.append(f"{doc['title']} {doc['abstract']}")
    
    # Get BM25 scores for candidates
    bm25_scores = [doc_scores[i] for i in bm25_indices]
    
    # Load and initialize reranker
    model = NeuralReranker()
    model.load_state_dict(torch.load(model_path))
    reranker = Reranker(model=model)
    
    # Rerank documents
    reranked_docs = reranker.rerank(query, candidate_texts, bm25_scores)
    
    # Print results
    print(f"\nQuery: {query}\n")
    print("BM25 Results:")
    for i, (doc_id, score) in enumerate(zip(bm25_candidates, bm25_scores)):
        print(f"{i+1}. Doc ID: {doc_id}, Score: {score:.4f}")
    
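    print("\nNeural Reranker Results:")
    # reranked_docs holds (candidate position, score) pairs, best first,
    # so the ID must be looked up by position rather than taken in BM25 order
    for i, (idx, score) in enumerate(reranked_docs):
        print(f"{i+1}. Doc ID: {bm25_candidates[idx]}, Score: {score:.4f}")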

# Test queries
test_queries = [
    "New study shows that COVID-19 can be transmitted through airborne particles",
    "Research indicates that wearing masks reduces the spread of respiratory diseases",
    "Scientists discover potential treatment for severe COVID-19 cases"
]

# Test each query
for query in test_queries:
    test_single_query(
        query=query,
        model_path='neural_reranker_model.pt',
        df_collection=df_collection
    )

## 5.3) Save Predictions for Submission

In [None]:
# Save predictions for submission, using the same column names as the baseline export
print("Saving predictions...")
results_df[['post_id', 'predicted_docs']].rename(columns={'predicted_docs': 'preds'}).to_csv(
    'neural_reranker_predictions.tsv',
    index=None,
    sep='\t'
)