# Getting started

### CLEF 2025 - CheckThat! Lab - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook enables participants of subtask 4b to quickly get started. It includes the following:
- Code to upload data, including:
    - code to upload the collection set (CORD-19 academic papers' metadata)
    - code to upload the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to evaluate the baseline model

Participants are free to use this notebook and add their own models for the competition.

# 1) Importing data

In [1]:
import numpy as np
import pandas as pd

In [2]:
import os

In [3]:
output_dir = "results"
experiment_name = "04_bm25-pt_t5-rerank_model"
rerank_model = "t5"  # or your actual reranker
results_filename = f"{experiment_name}_{rerank_model.replace('/', '-')}"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, results_filename + ".txt")
#f = open(output_path, "w")

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4?ref_type=heads

Participants should first download the file, then upload it to their Google Colab session using the following steps.


In [24]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = 'subtask4b_collection_data.pkl' #MODIFY PATH

In [25]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [None]:
df_collection.info()

In [None]:
df_collection.head()

In [None]:
print(df_collection.columns)

## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4?ref_type=heads

Participants should first download the file, then upload it to their Google Colab session using the following steps.

In [None]:
# 1) Download the query tweets from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_DATA = 'subtask4b_query_tweets_dev.tsv' #MODIFY PATH

In [27]:
df_query = pd.read_csv(PATH_QUERY_DATA, sep = '\t')

In [None]:
df_query.head()

In [None]:
df_query.info()
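
Before running retrieval, it can help to confirm that every gold `cord_uid` in the query set also appears in the collection, since a missing id can never be retrieved. The sketch below uses small made-up frames in place of `df_query` and `df_collection`. The helper name `missing_gold_ids` is hypothetical, and the column names `post_id` and `cord_uid` are assumed from how they are used elsewhere in this notebook.

```python
import pandas as pd

def missing_gold_ids(df_query, df_collection, id_col="cord_uid"):
    """Return the query rows whose gold id is absent from the collection."""
    known_ids = set(df_collection[id_col])
    return df_query[~df_query[id_col].isin(known_ids)]

# Toy data standing in for the real frames (all values are made up).
toy_collection = pd.DataFrame({"cord_uid": ["a1", "b2", "c3"]})
toy_queries = pd.DataFrame({"post_id": [1, 2], "cord_uid": ["b2", "zz"]})

missing = missing_gold_ids(toy_queries, toy_collection)
print(missing["post_id"].tolist())
```

On the real data, the same call would be `missing_gold_ids(df_query, df_collection)`; an empty result means every query has a retrievable target.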

In [11]:
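# Subsample 20 queries so the rest of the notebook runs quickly;
# fixing the seed keeps the sample reproducible.
np.random.seed(88)
df_query = df_query.sample(20)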

# 2) Running the baseline
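The following code runs a BM25-PT baseline: it indexes the title and abstract of every paper with the `bm25_pt` package and retrieves the top-ranked papers for each tweet.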


In [None]:
!pip install bm25_pt
from bm25_pt import BM25
from torch import Tensor

class BM25_Pytorch:
    def __init__(self, corpus, cord_uids, device=None):
        self.corpus = corpus
        self.cord_uids = cord_uids
        self.bm25 = BM25(device=device)
        self.bm25.index(self.corpus)

    def get_top_cord_uids(self, query, k=5000):
        doc_scores = Tensor.cpu(self.bm25.score(query))
        indices = np.argsort(-doc_scores)[:k]
        bm25_topk = [self.cord_uids[x] for x in indices]
        return bm25_topk

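    @staticmethod
    def retrieve_paper(paper_ids, df_collection):
        # Look up the title and abstract for each cord_uid.
        # Declared static because it does not use any instance state.
        paper_dict = {}
        for pid in paper_ids:
            paper_data = df_collection[df_collection['cord_uid'] == pid]
            if paper_data.empty:
                continue  # skip ids that are not in the collection
            title = paper_data.iloc[0]['title']
            abstract = paper_data.iloc[0]['abstract']
            paper_dict[pid] = {'title': title, 'abstract': abstract}
        return paper_dict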


In [None]:
corpus = df_collection[['title', 'abstract']].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
cord_uids = df_collection['cord_uid'].tolist()

bm25_pytorch = BM25_Pytorch(corpus=corpus, cord_uids=cord_uids)

In [30]:
text2bm25top = {}

def get_top_cord_uids_pt(query):
    if query in text2bm25top:
        return text2bm25top[query]
    else:
        bm25_topk = bm25_pytorch.get_top_cord_uids(query, k=5000)
        text2bm25top[query] = bm25_topk
        return bm25_topk

In [31]:
# Apply using BM25_pt model
df_query['bm25_topk'] = df_query['tweet_text'].apply(get_top_cord_uids_pt)
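
For intuition about what a BM25 ranker computes, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the `bm25_pt` implementation: the whitespace tokenizer, the smoothed IDF variant, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalised through `b`.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up mini corpus standing in for the title + abstract strings.
toy_corpus = [
    "masks reduce viral transmission",
    "vaccine efficacy in older adults",
    "transmission of the virus in schools",
]
toy_scores = bm25_scores("vaccine efficacy", toy_corpus)
best = max(range(len(toy_corpus)), key=toy_scores.__getitem__)
print(best)
```

Only the second toy document shares terms with the query, so it is the only one with a non-zero score and is ranked first.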

# 3) Evaluating the baseline
The following code evaluates the BM25 retrieval baseline on the query set using the Mean Reciprocal Rank score (MRR@5). The evaluation function also reports MRR@1 and MRR@10 for comparison.

In [57]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    d_performance = {}
    for k in list_k:
        # Reciprocal rank of the gold id within the top-k predictions (0 if it is absent)
        data["in_topx"] = data.apply(lambda x: (1 / (list(x[col_pred][:k]).index(x[col_gold]) + 1) if x[col_gold] in list(x[col_pred][:k]) else 0), axis=1)
        d_performance[k] = data["in_topx"].mean()
    return d_performance


In [None]:
bm25_results = get_performance_mrr(df_query, col_gold='cord_uid', col_pred='bm25_topk', list_k=[1,5,10])

In [None]:
print(df_query.columns)

In [None]:
df_query.columns = df_query.columns.str.strip()
df_query['cord_uid']

In [None]:
results = get_performance_mrr(df_query, 'cord_uid', 'bm25_topk')
# Printed MRR@k results in the following format: {k: MRR@k}
print(results)
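
To make the metric concrete, the following sketch computes MRR@k by hand on a made-up example. For each query, the reciprocal rank is 1/rank of the gold id if it appears in the top k, and 0 otherwise; MRR@k is the mean over queries. The function names `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical and independent of `get_performance_mrr`.

```python
def reciprocal_rank(gold, ranked, k):
    """Return 1/rank of `gold` within the first k items of `ranked`, else 0."""
    top_k = ranked[:k]
    return 1.0 / (top_k.index(gold) + 1) if gold in top_k else 0.0

def mean_reciprocal_rank(golds, rankings, k):
    """Average the reciprocal ranks over all queries."""
    return sum(reciprocal_rank(g, r, k) for g, r in zip(golds, rankings)) / len(golds)

# Made-up example: three queries whose gold ids sit at ranks 1, 3, and 6.
golds = ["a", "b", "c"]
rankings = [
    ["a", "x", "y", "z", "w", "v"],
    ["x", "y", "b", "z", "w", "v"],
    ["x", "y", "z", "w", "v", "c"],
]
for k in (1, 5, 10):
    print(k, round(mean_reciprocal_rank(golds, rankings, k), 4))
```

Here MRR@1 = (1 + 0 + 0) / 3 = 1/3, MRR@5 = (1 + 1/3 + 0) / 3 = 4/9, and MRR@10 = (1 + 1/3 + 1/6) / 3 = 1/2: the third query only contributes once the cutoff reaches its rank of 6.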

# 4) Possible solutions

## 4.1) BM25-PT for Retrieval and T5 for Rerank

In [None]:
!pip install rerankers
# Install rerankers together with all of its optional dependencies
!pip install "rerankers[all]"

In [33]:
from rerankers import Reranker, Document

In [None]:
!pip install --upgrade transformers

In [None]:
# Load the library's default T5-based reranker
# (which checkpoint is downloaded depends on the installed rerankers version)
reranker = Reranker("t5")

In [None]:
def retrieve_paper(paper_ids):
    paper_dict = {}
    for pid in paper_ids:
        paper_data = df_collection[df_collection['cord_uid'] == pid]
        if not paper_data.empty:
            title = paper_data.iloc[0]['title']
            abstract = paper_data.iloc[0]['abstract']
            paper_dict[pid] = {'title': title, 'abstract': abstract}
    return paper_dict

df_query['title_abstract'] = df_query['bm25_topk'].apply(retrieve_paper)


In [None]:
df_query[['tweet_text', 'title_abstract']].head()
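
`retrieve_paper` filters the whole collection once per candidate id, which can become slow with thousands of candidates per query. One possible alternative, sketched here on made-up data, builds a dictionary once and then performs constant-time lookups. The helper names `build_paper_lookup` and `retrieve_paper_fast` are hypothetical; keeping the first row per `cord_uid` is an assumption that mirrors the `iloc[0]` choice in `retrieve_paper`.

```python
import pandas as pd

def build_paper_lookup(df_collection):
    """Map each cord_uid to its title and abstract (first occurrence wins)."""
    unique = df_collection.drop_duplicates(subset="cord_uid")
    return unique.set_index("cord_uid")[["title", "abstract"]].to_dict("index")

def retrieve_paper_fast(paper_ids, lookup):
    """Return {cord_uid: {'title', 'abstract'}} for the ids present in `lookup`."""
    return {pid: lookup[pid] for pid in paper_ids if pid in lookup}

# Toy collection with one duplicated id (all values are made up).
toy_collection = pd.DataFrame({
    "cord_uid": ["a1", "b2", "a1"],
    "title": ["Title A", "Title B", "Duplicate A"],
    "abstract": ["Abstract A", "Abstract B", "Duplicate abstract"],
})
lookup = build_paper_lookup(toy_collection)
papers = retrieve_paper_fast(["b2", "missing", "a1"], lookup)
print(list(papers))
```

The returned dictionary preserves the order of `paper_ids`, so the BM25 ranking is kept, and ids that are not in the collection are skipped.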

In [None]:
def rerank_with_rerankers(row, top_k=1446):
    query = row["tweet_text"]

    # Ensure input format for Reranker
    docs = [
        f"{info['title']} {info['abstract']}"
        for info in row["title_abstract"].values()
    ]
    doc_ids = list(row["title_abstract"].keys())

    # Apply reranker
    results = reranker.rank(query=query, docs=docs, doc_ids=doc_ids)

    # Get top_k reordered cord_uids
    top_results = results.top_k(top_k)
    return [res.doc_id for res in top_results]

# Apply to DataFrame
df_query["bm25_reranker_topk"] = df_query.apply(rerank_with_rerankers, axis=1)
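
The cell above follows a retrieve-then-rerank pattern: a cheap first stage proposes candidates, and a more expensive scorer reorders them. The sketch below shows that pattern in isolation. The token-overlap scorer `overlap_score` is a hypothetical stand-in for the T5 reranker and is far weaker than a neural model; `rerank` is likewise an illustrative helper, not part of the `rerankers` API.

```python
def overlap_score(query, doc):
    """Toy relevance score: fraction of query tokens that occur in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0

def rerank(query, candidates, score_fn, top_k=5):
    """Reorder {doc_id: text} candidates by descending score and keep top_k ids."""
    ranked = sorted(
        candidates,
        key=lambda doc_id: score_fn(query, candidates[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]

# Made-up first-stage candidates, listed in their original retrieval order.
candidates = {
    "d1": "school closures and transmission",
    "d2": "vaccine efficacy in older adults",
    "d3": "vaccine side effects",
}
reranked = rerank("vaccine efficacy adults", candidates, overlap_score, top_k=2)
print(reranked)
```

Because the second stage can only reorder what the first stage returned, a gold paper that BM25 misses cannot be recovered by the reranker.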

In [None]:
df_query['preds'] = df_query['bm25_reranker_topk'].apply(lambda x: x[:5])
output_predictions_path = f"predictions_{experiment_name}_{rerank_model.replace('/', '-')}.tsv"
df_query[['post_id', 'preds']].to_csv(output_predictions_path, sep='\t', index=False)
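
Note that `to_csv` stores each list in `preds` as its string representation. The sketch below, on made-up data, shows one way to read such a file back and recover real lists with `ast.literal_eval`. It only illustrates how the list column serialises; the required submission format is defined by the task organisers and is not assumed here.

```python
import ast
import io

import pandas as pd

# Made-up predictions in the same two-column layout as the cell above.
toy_preds = pd.DataFrame({
    "post_id": [11, 12],
    "preds": [["a1", "b2"], ["c3", "d4"]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_preds.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

# Lists are stored in string form, so parse them back while reading.
loaded = pd.read_csv(buffer, sep="\t", converters={"preds": ast.literal_eval})
print(loaded["preds"].iloc[0])
```

Without the converter, `preds` would be read back as plain strings, and slicing such as `x[:5]` would return characters rather than paper ids.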

In [None]:
df_query.head()

In [None]:
mrr_scores = get_performance_mrr(
    df_query,                # Dataset
    "cord_uid",              # Ground-truth column
    "bm25_reranker_topk"     # New reranked lists
)
print(mrr_scores)