# Getting started

### CLEF 2025 - CheckThat! Lab - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook helps participants in subtask 4b get started quickly. It includes the following:
- Code to load the data, including:
    - code to load the collection set (CORD-19 academic papers' metadata)
    - code to load the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to evaluate the baseline model

Participants are free to use this notebook and add their own models for the competition.

# 1) Importing data

In [1]:
import numpy as np
import pandas as pd

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4?ref_type=heads

Participants should first download the file, then upload it to the Google Colab session using the following steps.


In [2]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = "subtask4b_collection_data.pkl"  # MODIFY PATH

In [13]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [14]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [15]:
duplicate_cord_uids = [
    "i9jrobdb",
    "dql5n9k7",
    "oynbwas9",
    "lxnc91rq",
    "nrc35akc",
    "bcraqemn",
    "2qwk2z22",
    "5ik10en4",
    "x0me00m0",
    "611ih9pl",
    "12th7nja",
    "z5zchmxz",
    "0r46eacc",
    "peiqusin",
    "7xb7hj9u",
    "xco6yr7o",
    "j8666ial",
    "qu7ddcw9",
    "uitc2tru",
    "5tkxucuw",
    "8f9auteb",
    "e1macqqm",
    "sdsmy0bw",
    "lzwgyzdp",
    "f18a3xiv",
    "6r1hy34r",
    "gsgbhqwl",
    "8v7150vv",
    "bggera4e",
    "ebgu29uh",
    "gvtp8s8b",
    "cj82xu0w",
    "r6gzbnu2",
    "34b511wf",
    "o86wki37",
    "4budtw8s",
    "xnkn38i9",
    "zl4ixyg1",
]

In [17]:
duplicate_ids_df = df_collection[df_collection.cord_uid.isin(duplicate_cord_uids)]
print(len(duplicate_ids_df))
duplicate_ids_df.to_csv("duplicate_ids_cord.tsv", sep="\t")

38


In [10]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4?ref_type=heads

Participants should first download the file, then upload it to the Google Colab session using the following steps.

In [None]:
# 1) Download the query tweets from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_DATA = "subtask4b_query_tweets.tsv"  # MODIFY PATH

In [None]:
df_query = pd.read_csv(PATH_QUERY_DATA, sep="\t")

In [None]:
df_query.head()

Unnamed: 0,tweet_text,cord_uid
0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5
1,this study isn't receiving sufficient attentio...,4kfl29ul
2,"thanks, xi jinping. a reminder that this study...",jtwb17u8
3,Taiwan - a population of 23 million has had ju...,0w9k8iy1
4,Obtaining a diagnosis of autism in lower incom...,tiqksd69


In [None]:
df_query.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14253 entries, 0 to 14252
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   tweet_text  14253 non-null  object
 1   cord_uid    14253 non-null  object
dtypes: object(2)
memory usage: 222.8+ KB
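
Since every tweet is paired with a gold cord_uid, it can be worth checking that each gold id actually occurs in the collection before running retrieval. The following is a minimal sketch of such a check, not part of the original notebook: the helper name `missing_gold_ids` and the toy ids are assumptions made for illustration.

```python
def missing_gold_ids(query_ids, collection_ids):
    """Return the sorted gold ids that do not occur in the collection."""
    known = set(collection_ids)
    return sorted({qid for qid in query_ids if qid not in known})


# Toy ids for illustration only; "not_in_collection" is made up.
toy_collection_ids = ["umvrwgaw", "spiud6ok", "aclzp3iy"]
toy_query_ids = ["spiud6ok", "umvrwgaw", "spiud6ok", "not_in_collection"]

print(missing_gold_ids(toy_query_ids, toy_collection_ids))
```

The helper accepts any iterables, so with the DataFrames loaded above it could be called as `missing_gold_ids(df_query["cord_uid"], df_collection["cord_uid"])`. An empty list would mean every gold id is retrievable.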


In [None]:
duplicate_cord_uids

array(['i9jrobdb', 'dql5n9k7', 'oynbwas9', 'lxnc91rq', 'nrc35akc',
       'bcraqemn', '2qwk2z22', '5ik10en4', 'x0me00m0', '611ih9pl',
       '12th7nja', 'z5zchmxz', '0r46eacc', 'peiqusin', '7xb7hj9u',
       'xco6yr7o', 'j8666ial', 'qu7ddcw9', 'uitc2tru', '5tkxucuw',
       '8f9auteb', 'e1macqqm', 'sdsmy0bw', 'lzwgyzdp', 'f18a3xiv',
       '6r1hy34r', 'gsgbhqwl', '8v7150vv', 'bggera4e', 'ebgu29uh',
       'gvtp8s8b', 'cj82xu0w', 'r6gzbnu2', '34b511wf', 'o86wki37',
       '4budtw8s', 'xnkn38i9', 'zl4ixyg1'], dtype=object)

In [None]:
df_query[df_query.cord_uid.isin(duplicate_cord_uids)]

Unnamed: 0,tweet_text,cord_uid
463,"""In summary, all the 455 contacts were exclude...",7xb7hj9u
507,for those that believe in the science. this st...,7xb7hj9u
521,but there *is* a massive rct on community mask...,zl4ixyg1
666,A recent study sparks some significant concern...,z5zchmxz
673,breaking: newly published paper concludes the ...,z5zchmxz
...,...,...
13371,Its time to take a more in-depth look at what ...,7xb7hj9u
13765,are covid-19 misinformation a threat to public...,gsgbhqwl
13952,"If you wear a mask that blocks droplets, you'l...",7xb7hj9u
14037,why has the covid-19 pandemic boosted support ...,5tkxucuw


# 2) Running the baseline
The following code runs a BM25 baseline.
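
For readers unfamiliar with BM25, the sketch below shows roughly what such a scorer computes: each query term contributes an inverse-document-frequency weight multiplied by a saturating, length-normalized term frequency. It is a simplified illustration written in plain Python, not the code of the `rank_bm25` package. The function name `bm25_scores`, the toy documents, the parameter values `k1=1.5` and `b=0.75`, and the non-negative idf formula are all assumptions of this sketch, and its scores may differ from the library's.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with an Okapi BM25 variant."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized through b
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Non-negative idf variant (an assumption of this sketch)
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy documents for illustration only
toy_docs = [
    "face masks reduce exposure".split(" "),
    "history of quarantine".split(" "),
    "smallpox transmission mode".split(" "),
]
toy_scores = bm25_scores("face masks".split(" "), toy_docs)
best = max(range(len(toy_docs)), key=toy_scores.__getitem__)
print(best)  # index of the highest-scoring toy document
```

Only the first toy document shares terms with the query, so it is the only one with a non-zero score.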


In [None]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [None]:
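# Create the BM25 corpus: one "title abstract" string per paper
corpus = df_collection[["title", "abstract"]].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
# Keep the ids in the same order as the corpus so BM25 indices map back to papers
cord_uids = df_collection["cord_uid"].tolist()
# Tokenize by splitting on single spaces (no lowercasing or punctuation removal)
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)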
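
Splitting on spaces keeps case and punctuation, so "Masks," and "masks" end up as different tokens. The sketch below shows one possible normalizing tokenizer for comparison. The name `simple_tokenize` and the regular expression are assumptions of this sketch, and whether such normalization improves the baseline on this dataset is not tested here.

```python
import re

# Runs of lowercase ASCII letters and digits (an assumption; adjust as needed)
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def simple_tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and digits."""
    return TOKEN_PATTERN.findall(text.lower())


sample = "COVID-19: masks work!"
print(sample.split(" "))        # whitespace split, as in the baseline
print(simple_tokenize(sample))  # normalized tokens
```

If a tokenizer like this is used, it has to be applied to both the corpus and the queries, otherwise the query tokens will not match the indexed ones.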

In [None]:
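# Cache of already-scored queries: tweet text -> ranked list of cord_uids
text2bm25top = {}


def get_top_cord_uids(query):
    """Return the cord_uids of the 1000 highest-scoring papers for a query."""
    if query in text2bm25top:
        return text2bm25top[query]

    tokenized_query = query.split(" ")
    doc_scores = bm25.get_scores(tokenized_query)
    # Sort by descending score and keep the first 1000 indices
    indices = np.argsort(-doc_scores)[:1000]
    bm25_topk = [cord_uids[x] for x in indices]

    text2bm25top[query] = bm25_topk
    return bm25_topk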

In [None]:
# Retrieve the top-k candidates for every tweet using the BM25 model
df_query["bm25_topk"] = df_query["tweet_text"].apply(get_top_cord_uids)
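
The retrieval function caches results in a dictionary so that repeated tweet texts are scored only once. As a hedged alternative, the same memoization pattern can be sketched with `functools.lru_cache` from the standard library. Here `expensive_rank` is a hypothetical stand-in for the BM25 scoring step, and `call_count` exists only to show how often it runs.

```python
from functools import lru_cache

call_count = 0


def expensive_rank(query):
    """Stand-in for the BM25 scoring step (hypothetical placeholder)."""
    global call_count
    call_count += 1
    return tuple(sorted(query.split(" ")))


@lru_cache(maxsize=None)
def cached_rank(query):
    # A tuple is returned so callers cannot mutate the cached value
    return expensive_rank(query)


for tweet in ["masks work", "masks work", "vaccine study"]:
    cached_rank(tweet)

print(call_count)  # the repeated tweet is computed only once
```

The explicit dictionary in the notebook has the advantage that the cache can be inspected or cleared directly. The decorator is shorter but requires hashable arguments, which plain strings satisfy.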

# 3) Evaluating the baseline
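The following code evaluates the BM25 retrieval baseline on the query set using Mean Reciprocal Rank at cutoff k (MRR@k). For each tweet, the score is 1/rank of the gold cord_uid within the top k retrieved papers, or 0 if the gold paper is not among them; MRR@k is the mean of these scores over all tweets. The code reports k = 1, 5, and 10, which includes MRR@5.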

In [None]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    """Return {k: MRR@k} for each k in list_k.

    Note: this overwrites a helper column "in_topx" in data on every iteration.
    """
    d_performance = {}
    for k in list_k:
        # Reciprocal rank of the gold id within the top-k predictions, 0 if absent
        data["in_topx"] = data.apply(
            lambda x: (
                1 / (x[col_pred][:k].index(x[col_gold]) + 1)
                if x[col_gold] in x[col_pred][:k]
                else 0
            ),
            axis=1,
        )
        d_performance[k] = data["in_topx"].mean()
    return d_performance

In [None]:
results = get_performance_mrr(df_query, "cord_uid", "bm25_topk")
# Print the MRR@k results in the format {k: MRR@k}
print(results)

{1: 0.5077527538062162, 5: 0.5509085806496877, 10: 0.5559879624339737}
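
To make the metric concrete, here is a small worked example of MRR@k in plain Python, independent of pandas. The helper names `reciprocal_rank` and `mrr_at_k` and the toy ids are assumptions of this sketch; it is meant to mirror the idea of the evaluation function above, not to replace it.

```python
def reciprocal_rank(gold, ranked, k):
    """Return 1/rank of gold within the first k items of ranked, else 0."""
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def mrr_at_k(golds, rankings, k):
    """Average the reciprocal ranks over all queries."""
    values = [reciprocal_rank(g, r, k) for g, r in zip(golds, rankings)]
    return sum(values) / len(values)


# Toy example with made-up ids: gold at rank 1, at rank 2, and missing
toy_golds = ["a", "b", "c"]
toy_rankings = [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"]]

print(mrr_at_k(toy_golds, toy_rankings, k=1))
print(mrr_at_k(toy_golds, toy_rankings, k=5))
```

At k=1 only the first query counts, giving (1 + 0 + 0) / 3, about 0.33. At k=5 the second query contributes 1/2, giving (1 + 0.5 + 0) / 3 = 0.5. This also shows why MRR@k can only stay the same or grow as k increases, as in the baseline results above.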
