# Getting started

### CLEF 2025 - CheckThat! Lab - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook helps participants of subtask 4b get started quickly. It includes the following:
- Code to load the data, including:
    - code to load the collection set (metadata of CORD-19 academic papers)
    - code to load the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to evaluate the baseline model

Participants are free to use this notebook and add their own models for the competition.

# 1) Importing data

In [1]:
import numpy as np
import pandas as pd

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file and then upload it to the Google Colab session by following the steps below.


In [2]:
# 1) Download the collection set from the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = 'subtask4b_collection_data.pkl'  #MODIFY PATH

In [3]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [4]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [5]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the files and then upload them to the Google Colab session by following the steps below.

In [6]:
# 1) Download the query tweets from the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_TRAIN_DATA = 'subtask4b_query_tweets_train.tsv' #MODIFY PATH
PATH_QUERY_DEV_DATA = 'subtask4b_query_tweets_dev.tsv' #MODIFY PATH

In [7]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')

In [8]:
df_query_train.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5
1,1,this study isn't receiving sufficient attentio...,4kfl29ul
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69


In [9]:
df_query_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12853 entries, 0 to 12852
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     12853 non-null  int64 
 1   tweet_text  12853 non-null  object
 2   cord_uid    12853 non-null  object
dtypes: int64(1), object(2)
memory usage: 301.4+ KB


In [10]:
df_query_dev.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,16,covid recovery: this study from the usa reveal...,3qvh482o
1,69,"""Among 139 clients exposed to two symptomatic ...",r58aohnu
2,73,I recall early on reading that researchers who...,sts48u9i
3,93,You know you're credible when NIH website has ...,3sr2exq9
4,96,Resistance to antifungal medications is a grow...,ybwwmyqy


In [11]:
df_query_dev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     1400 non-null   int64 
 1   tweet_text  1400 non-null   object
 2   cord_uid    1400 non-null   object
dtypes: int64(1), object(2)
memory usage: 32.9+ KB


# 2) Running the baseline
The following code runs a BM25 baseline.
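
As background for the baseline, the sketch below scores a toy corpus with a textbook Okapi BM25 formula in plain Python. It is an illustration only and not the `rank_bm25` implementation: the function name `bm25_scores`, the toy documents, the smoothed IDF variant, and the parameter values `k1=1.5` and `b=0.75` are all assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+ 1" keeps it non-negative); an assumption of this sketch
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalised through b
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (made up), tokenized on single spaces like the baseline
toy_docs = [
    "face masks reduce exposure to respiratory infections".split(" "),
    "history of quarantine from plague to influenza".split(" "),
    "smallpox transmission and airborne infection".split(" "),
]
toy_scores = bm25_scores("do face masks work".split(" "), toy_docs)
best_doc = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
```

Only the first toy document shares terms with the query, so it is the only one with a non-zero score.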


In [12]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi




In [13]:
# Create the BM25 corpus: one "title abstract" string per paper
corpus = df_collection[['title', 'abstract']].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
cord_uids = df_collection['cord_uid'].tolist()
# Whitespace tokenization: case-sensitive, punctuation stays attached to words
tokenized_corpus = [doc.split(' ') for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

In [14]:
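# Cache of already-scored queries, kept outside the function so it persists across calls
text2bm25top = {}

def get_top_cord_uids(query):
    """Return the cord_uids of the 5 highest-scoring papers for a tweet."""
    if query in text2bm25top:
        return text2bm25top[query]
    tokenized_query = query.split(' ')
    doc_scores = bm25.get_scores(tokenized_query)
    # Only 5 candidates are kept, so MRR@10 computed on this column equals MRR@5
    indices = np.argsort(-doc_scores)[:5]
    bm25_topk = [cord_uids[x] for x in indices]

    text2bm25top[query] = bm25_topk
    return bm25_topk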


In [15]:
# Retrieve the top-k candidates using the BM25 model
df_query_train['bm25_topk'] = df_query_train['tweet_text'].apply(get_top_cord_uids)
df_query_dev['bm25_topk'] = df_query_dev['tweet_text'].apply(get_top_cord_uids)

# 3) Evaluating the baseline
The following code evaluates the BM25 retrieval baseline on the query set using the Mean Reciprocal Rank score (MRR@5).
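
To make the metric concrete: for each query, the reciprocal rank is 1/rank of the gold paper if it appears among the top k predictions and 0 otherwise, and MRR@k is the mean over all queries. The sketch below works through a small made-up example; the helper names `reciprocal_rank` and `mrr_at_k` and the toy identifiers are illustrative assumptions, not part of the task code.

```python
def reciprocal_rank(gold, ranked_preds, k):
    """Return 1/rank of the gold item within the top-k predictions, or 0.0 if absent."""
    top_k = ranked_preds[:k]
    if gold in top_k:
        return 1.0 / (top_k.index(gold) + 1)
    return 0.0

def mrr_at_k(golds, all_preds, k):
    """Mean of the reciprocal ranks over all queries."""
    return sum(reciprocal_rank(g, p, k) for g, p in zip(golds, all_preds)) / len(golds)

# Three toy queries: gold at rank 1, gold at rank 2, gold not retrieved
toy_golds = ["a", "b", "c"]
toy_preds = [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"]]

mrr1 = mrr_at_k(toy_golds, toy_preds, k=1)  # (1 + 0 + 0) / 3
mrr3 = mrr_at_k(toy_golds, toy_preds, k=3)  # (1 + 1/2 + 0) / 3 = 0.5
```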

In [16]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    """Return {k: MRR@k} for the ranked predictions stored in `col_pred`."""
    d_performance = {}
    for k in list_k:
        def reciprocal_rank(row):
            top_k = list(row[col_pred][:k])
            # 1/rank of the gold paper if it is in the top k, else 0
            return 1 / (top_k.index(row[col_gold]) + 1) if row[col_gold] in top_k else 0
        d_performance[k] = data.apply(reciprocal_rank, axis=1).mean()
    return d_performance



In [17]:
results_train = get_performance_mrr(df_query_train, 'cord_uid', 'bm25_topk')
results_dev = get_performance_mrr(df_query_dev, 'cord_uid', 'bm25_topk')
# Print the MRR@k results in the following format: {k: MRR@k}
print(f"Results on the train set: {results_train}")
print(f"Results on the dev set: {results_dev}")

Results on the train set: {1: np.float64(0.5079747918773827), 5: np.float64(0.5508999196037242), 10: np.float64(0.5508999196037242)}
Results on the dev set: {1: np.float64(0.505), 5: np.float64(0.5520357142857142), 10: np.float64(0.5520357142857142)}


# 4) Exporting results to prepare the submission on CodaLab

In [18]:
df_query_dev['preds'] = df_query_dev['bm25_topk'].apply(lambda x: x[:5])

In [19]:
df_query_dev[['post_id', 'preds']].to_csv('predictions.tsv', index=None, sep='\t')
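
Before uploading, it can be useful to read the exported file back and check its shape. The sketch below parses a TSV with `post_id` and `preds` columns, where `preds` is the string form of a Python list as written by `to_csv` above. The function name `check_predictions`, the sample rows, and the rules enforced (between 1 and 5 predictions per tweet, no duplicates) are assumptions for illustration; the official submission requirements should be checked on the task page.

```python
import ast
import csv
import io

def check_predictions(tsv_text, max_preds=5):
    """Parse a predictions TSV and return {post_id: [cord_uid, ...]} after basic checks."""
    parsed = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # The list was serialised as text, e.g. "['a', 'b']"
        preds = ast.literal_eval(row["preds"])
        if not isinstance(preds, list) or not 1 <= len(preds) <= max_preds:
            raise ValueError(f"post_id {row['post_id']}: expected 1..{max_preds} predictions")
        if len(set(preds)) != len(preds):
            raise ValueError(f"post_id {row['post_id']}: duplicate cord_uid")
        parsed[int(row["post_id"])] = preds
    return parsed

# Made-up sample in the same layout as predictions.tsv
sample_tsv = "post_id\tpreds\n16\t['3qvh482o', 'r58aohnu']\n69\t['sts48u9i']\n"
parsed = check_predictions(sample_tsv)
```

For a real file, the text could be obtained with `open('predictions.tsv').read()`.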

# NLP model

In [20]:
!pip install -q sentence-transformers==2.2.2 transformers==4.28.1 huggingface-hub==0.14.1 pandas numpy tqdm scikit-learn

In [21]:
import pandas as pd
import numpy as np
import pickle
import re
import random
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sklearn.metrics.pairwise import cosine_similarity



#### Prepare the function for evaluation

In [22]:
def get_performance_mrr(data, col_gold='cord_uid', col_pred='preds', list_k=[1, 5]):
    results = {}
    for k in list_k:
        def reciprocal_rank(row):
            preds = row[col_pred][:k]
            if row[col_gold] in preds:
                return 1 / (preds.index(row[col_gold]) + 1)
            return 0
        mrr = data.apply(reciprocal_rank, axis=1).mean()
        results[f"MRR@{k}"] = round(mrr, 4)
    return results


#### Prepare the data

In [23]:
print("Loading the dataset...")
PATH_QUERY_TEST_DATA = 'subtask4b_query_tweets_test.tsv' 
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')
df_query_test = pd.read_csv(PATH_QUERY_TEST_DATA, sep = '\t')

# Load and prepare collection data
print("Loading collection data...")
collection_df = pd.read_pickle(PATH_COLLECTION_DATA)
collection_df['text'] = collection_df['title'].fillna('') + ' ' + collection_df['abstract'].fillna('')
paper_text_map = collection_df.set_index('cord_uid')['text'].to_dict()
paper_cord_uids = collection_df['cord_uid'].tolist()


Loading the dataset...
Loading collection data...


### Preprocessing the tweets dataset

In [24]:
# Function to clean tweet text
def clean_tweet(text):
    # Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    # Drop all non-ASCII characters (emojis, but also accented letters)
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Remove numeric fractions and counters such as "1/3" or "24/7"
    text = re.sub(r"\b\d+\/\d*\b", "", text)
    # Lowercase everything
    text = text.lower()
    # Collapse multiple spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Apply to the dataset
df_query_train['tweet_text'] = df_query_train['tweet_text'].apply(clean_tweet)
df_query_dev['tweet_text'] = df_query_dev['tweet_text'].apply(clean_tweet)
df_query_test['tweet_text'] = df_query_test['tweet_text'].apply(clean_tweet)
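
To see what the cleaning step does in practice, the sketch below repeats the same function and applies it to two made-up strings. The sample inputs are assumptions for illustration; note that the leftover `()` and the dropped accented letter are side effects of the rules as written.

```python
import re

def clean_tweet(text):
    text = re.sub(r"http\S+|www\S+", "", text)             # remove URLs
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"\b\d+\/\d*\b", "", text)               # remove patterns like "1/3"
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()               # collapse whitespace
    return text

sample = "New STUDY \U0001F9A0 (1/3): masks reduce exposure  https://example.org/paper"
cleaned = clean_tweet(sample)
# -> "new study (): masks reduce exposure"

accented = clean_tweet("Caf\u00e9 study")
# -> "caf study" (the accented letter is removed, not transliterated)
```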

### Baseline SBERT model 

In [25]:
# Load pre-trained Sentence-BERT model
print("Loading SBERT model...")
baseline_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Loading SBERT model...


In [26]:
# Encode collection texts into dense vectors
print("Encoding collection (papers)...")
paper_embeddings = baseline_model.encode(
    collection_df['text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

Encoding collection (papers)...


Batches: 100%|██████████| 121/121 [02:04<00:00,  1.03s/it]


In [27]:
# Encode test tweet texts
print("Encoding tweets...")
test_tweet_embeddings = baseline_model.encode(
    df_query_test['tweet_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Encode dev tweet texts
print("Encoding dev tweets...")
dev_tweet_embeddings = baseline_model.encode(
    df_query_dev['tweet_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

Encoding tweets...


Batches: 100%|██████████| 23/23 [00:04<00:00,  5.47it/s]


Encoding dev tweets...


Batches: 100%|██████████| 22/22 [00:04<00:00,  4.81it/s]


In [28]:
# Compute cosine similarities
print("Computing similarities...")
similarity_matrix = cosine_similarity(test_tweet_embeddings, paper_embeddings)

Computing similarities...


In [29]:
# For each tweet, keep the 5 most similar papers (the candidates scored by MRR@5)
print("Selecting top 5 documents for each tweet...")
top_k = 5
top_indices = np.argsort(similarity_matrix, axis=1)[:, -top_k:][:, ::-1]  # reverse sort for descending

top_cord_uids = [
    [paper_cord_uids[i] for i in row]
    for row in top_indices
]
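
The ranking step above can be spelled out in plain Python for a single query: compute the cosine similarity to every paper vector and keep the k best identifiers in descending order. This is a minimal sketch on made-up 2-dimensional vectors; the names `cosine`, `top_k_ids`, and the toy ids are assumptions for illustration and are not used elsewhere in the notebook.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_ids(query_vec, doc_vecs, doc_ids, k):
    """Return the k document ids with the highest cosine similarity, best first."""
    scored = ((cosine(query_vec, vec), doc_id) for vec, doc_id in zip(doc_vecs, doc_ids))
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Toy "paper embeddings" and one toy "tweet embedding"
toy_doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
toy_doc_ids = ["p0", "p1", "p2", "p3"]
top2 = top_k_ids([0.9, 0.1], toy_doc_vecs, toy_doc_ids, k=2)
# -> ["p0", "p1"]
```

`heapq.nlargest` avoids sorting the full collection when only a few candidates are needed.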

Selecting top 5 documents for each tweet...


In [30]:
# Build submission dataframe
print("Saving predictions...")
df_query_test['preds'] = top_cord_uids
submission_df = df_query_test[['post_id', 'preds']]

Saving predictions...


In [None]:
# Save to nlp_predictions_representation_baseline.tsv
submission_df.to_csv('nlp_predictions_representation_baseline.tsv', sep='\t', index=False)
print("Done! Output saved to 'nlp_predictions_representation_baseline.tsv'")

Done! Output saved to 'nlp_predictions_representation_baseline.tsv'


In [32]:
# Compute similarity with paper embeddings
print("Computing similarity for dev set...")
similarity_matrix_dev = cosine_similarity(dev_tweet_embeddings, paper_embeddings)

Computing similarity for dev set...


#### Evaluation

In [33]:
top_k = 10

# Get top-K predicted cord_uids for each tweet
top_indices_dev = np.argsort(similarity_matrix_dev, axis=1)[:, -top_k:][:, ::-1]
top_cord_uids_dev = [
    [paper_cord_uids[i] for i in row]
    for row in top_indices_dev
]

# Add predictions to the dev dataframe
df_query_dev['preds'] = top_cord_uids_dev

# Compute MRR@1, MRR@5 and MRR@10 on the dev set
mrr_results = get_performance_mrr(df_query_dev, list_k=[1, 5, 10])

# Print the SBERT baseline next to the BM25 baseline
print(f"SBERT baseline MRR@1: {mrr_results['MRR@1']},SBERT baseline MRR@5: {mrr_results['MRR@5']},SBERT baseline MRR@10: {mrr_results['MRR@10']}")
print(f"BM25 MRR@1: {results_dev[1]}, BM25 MRR@5: {results_dev[5]}, BM25 MRR@10: {results_dev[10]}")

SBERT baseline MRR@1: 0.41,SBERT baseline MRR@5: 0.4857,SBERT baseline MRR@10: 0.4955
BM25 MRR@1: 0.505, BM25 MRR@5: 0.5520357142857142, BM25 MRR@10: 0.5520357142857142


### Experiment 1: Positive pairs training

In [34]:
# Load pre-trained Sentence-BERT model
print("Loading SBERT model...")
pos_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Loading SBERT model...


In [35]:
# Prepare training examples
all_cord_uids = list(paper_text_map.keys())

train_examples_pos = []
train_examples_mixed = []
missing = 0


# Create positive and mixed examples
for _, row in df_query_train.iterrows():
    tweet = row['tweet_text']
    true_cord_uid = row['cord_uid']
    paper_text = paper_text_map.get(true_cord_uid)

    # Positive example
    if paper_text:
        pos_example = InputExample(texts=[tweet, paper_text], label=1.0)
        train_examples_pos.append(pos_example)
        train_examples_mixed.append(pos_example)

        # Create one negative sample (random paper ≠ true_cord_uid)
        negative_uid = random.choice([uid for uid in all_cord_uids if uid != true_cord_uid])
        neg_paper_text = paper_text_map[negative_uid]
        neg_example = InputExample(texts=[tweet, neg_paper_text], label=0.0)
        train_examples_mixed.append(neg_example)
    else:
        missing += 1

print(f"Positive-only examples: {len(train_examples_pos)}")
print(f"Mixed examples (pos + neg): {len(train_examples_mixed)}")
print(f"Skipped {missing} tweets due to missing paper metadata.")


Positive-only examples: 12853
Mixed examples (pos + neg): 25706
Skipped 0 tweets due to missing paper metadata.


In [36]:
train_dataloader_pos = DataLoader(train_examples_pos, shuffle=True, batch_size=16)
train_loss_pos = losses.CosineSimilarityLoss(pos_model)

pos_model.fit(
    train_objectives=[(train_dataloader_pos, train_loss_pos)],
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True
)

pos_model.save("fine_tuned_sbert_model_pos")
pos_model = SentenceTransformer("fine_tuned_sbert_model_pos")

Iteration: 100%|██████████| 804/804 [17:05<00:00,  1.28s/it]
Epoch: 100%|██████████| 1/1 [17:05<00:00, 1025.34s/it]
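
`CosineSimilarityLoss` is a mean squared error between the cosine similarity of each pair and its label. When every training pair has label 1.0, nothing in the objective pushes unrelated texts apart, which may help explain why the positive-only model scores below the untuned baseline on the dev set. The sketch below illustrates this on made-up vectors: a degenerate encoder that maps every text to the same vector reaches zero loss on positive-only data, but is penalised as soon as one pair with label 0.0 is present. The function names and vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between the cosine similarity of each pair and its label."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# A degenerate "collapsed" encoder: every text gets the same embedding
collapsed = [1.0, 0.0]
positive_pairs = [(collapsed, collapsed), (collapsed, collapsed)]
loss_pos_only = cosine_similarity_loss(positive_pairs, [1.0, 1.0])  # 0.0

# One pair with label 0.0 is enough to penalise the collapsed encoder
mixed_pairs = positive_pairs + [(collapsed, collapsed)]
loss_mixed = cosine_similarity_loss(mixed_pairs, [1.0, 1.0, 0.0])   # 1/3
```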


In [37]:
# Encode collection texts 
print("Encoding collection (papers)...")
paper_embeddings_pos = pos_model.encode(
    collection_df['text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

Encoding collection (papers)...


Batches: 100%|██████████| 121/121 [02:26<00:00,  1.21s/it]


In [None]:
# Encode tweet texts
print("Encoding tweets...")
test_tweet_embeddings_pos = pos_model.encode(
    df_query_test['tweet_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

similarity_matrix_pos = cosine_similarity(test_tweet_embeddings_pos, paper_embeddings_pos)

# Extract top-K predictions
top_k = 5
top_indices_pos = np.argsort(similarity_matrix_pos, axis=1)[:, -top_k:][:, ::-1]
top_cord_uids_pos = [[paper_cord_uids[i] for i in row] for row in top_indices_pos]

# Save submission
print("Saving predictions...")
df_query_test['preds_pos'] = top_cord_uids_pos
df_query_test[['post_id', 'preds_pos']].to_csv('nlp_predictions_representation_pos.tsv', sep='\t', index=False)


Encoding tweets...


Batches: 100%|██████████| 23/23 [00:05<00:00,  4.32it/s]


Saving predictions...


In [39]:
dev_tweet_embeddings_pos = pos_model.encode(
    df_query_dev['tweet_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Recompute cosine similarity for dev set
similarity_matrix_dev_pos = cosine_similarity(dev_tweet_embeddings_pos, paper_embeddings_pos)

Batches:   0%|          | 0/22 [00:00<?, ?it/s]

Batches: 100%|██████████| 22/22 [00:05<00:00,  3.97it/s]


In [47]:
top_k = 10

# Get top-K predicted cord_uids for each tweet
top_indices_dev_pos = np.argsort(similarity_matrix_dev_pos, axis=1)[:, -top_k:][:, ::-1]
top_cord_uids_dev_pos = [[paper_cord_uids[i] for i in row] for row in top_indices_dev_pos]

# Add predictions to the dev dataframe
df_query_dev['preds_pos'] = top_cord_uids_dev_pos

# Compute MRR@1, MRR@5 and MRR@10 on the dev set
mrr_results_pos = get_performance_mrr(df_query_dev, col_pred='preds_pos', list_k=[1, 5, 10])

# Print the positive-pairs model next to the BM25 baseline
print(f"SBERT pos MRR@1: {mrr_results_pos['MRR@1']},SBERT pos MRR@5: {mrr_results_pos['MRR@5']},SBERT pos MRR@10: {mrr_results_pos['MRR@10']}")
print(f"BM25 MRR@1: {results_dev[1]}, BM25 MRR@5: {results_dev[5]}, BM25 MRR@10: {results_dev[10]}")

SBERT pos MRR@1: 0.2493,SBERT pos MRR@5: 0.3056,SBERT pos MRR@10: 0.3149
BM25 MRR@1: 0.505, BM25 MRR@5: 0.5520357142857142, BM25 MRR@10: 0.5520357142857142


### Experiment 2: Mixed positive and negative pairs training

In [48]:
# Load pre-trained Sentence-BERT model
print("Loading SBERT model...")
mixed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Loading SBERT model...


In [49]:
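# Note: MultipleNegativesRankingLoss ignores InputExample.label and treats every pair as a match,
# so the label=0.0 pairs in train_examples_mixed are trained as positives as well.
train_dataloader_mixed = DataLoader(train_examples_mixed, shuffle=True, batch_size=16)
train_loss_mixed = losses.MultipleNegativesRankingLoss(mixed_model)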

mixed_model.fit(
    train_objectives=[(train_dataloader_mixed, train_loss_mixed)],
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True
)

mixed_model.save("fine_tuned_sbert_mixed_v1")
mixed_model = SentenceTransformer("fine_tuned_sbert_mixed_v1")

Iteration: 100%|██████████| 1607/1607 [31:19<00:00,  1.17s/it]
Epoch: 100%|██████████| 1/1 [31:19<00:00, 1879.08s/it]
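
For intuition about the objective used in this experiment, the sketch below reimplements the idea behind an in-batch negatives ranking loss in plain Python: each anchor is scored against every candidate in the batch, and a softmax cross-entropy rewards the candidate at the same position. The loss has no label input, so whatever is placed at position i is treated as the match for anchor i and all other candidates act as negatives. This is a simplified illustration, not the `sentence-transformers` code; the function name `mnr_loss`, the scale value of 20, and the toy vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mnr_loss(anchors, candidates, scale=20.0):
    """In-batch softmax loss: anchor i should score highest against candidate i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, cand) for cand in candidates]
        log_norm = math.log(sum(math.exp(z) for z in logits))
        total += log_norm - logits[i]  # cross-entropy with target index i
    return total / len(anchors)

tweets = [[1.0, 0.0], [0.0, 1.0]]
papers_aligned = [[0.9, 0.1], [0.1, 0.9]]  # each tweet sits next to its own paper
papers_swapped = [[0.1, 0.9], [0.9, 0.1]]  # each tweet sits next to the other paper

loss_aligned = mnr_loss(tweets, papers_aligned)  # close to 0
loss_swapped = mnr_loss(tweets, papers_swapped)  # large
```

The swapped batch gets a high loss because the model would have to move each tweet toward the wrong paper to reduce it.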


In [50]:
# Encode collection texts into dense vectors
print("Encoding collection (papers)...")
paper_embeddings_mixed = mixed_model.encode(
    collection_df['text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

Encoding collection (papers)...


Batches: 100%|██████████| 121/121 [02:16<00:00,  1.12s/it]


In [None]:
# Encode tweet texts
print("Encoding tweets...")
test_tweet_embeddings_mixed = mixed_model.encode(
    df_query_test['tweet_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

similarity_matrix_mixed = cosine_similarity(test_tweet_embeddings_mixed, paper_embeddings_mixed)

# Extract top-K predictions
top_k = 5
top_indices_mixed = np.argsort(similarity_matrix_mixed, axis=1)[:, -top_k:][:, ::-1]
top_cord_uids_mixed = [[paper_cord_uids[i] for i in row] for row in top_indices_mixed]

# Save submission
print("Saving predictions...")
df_query_test['preds_mixed'] = top_cord_uids_mixed
df_query_test[['post_id', 'preds_mixed']].to_csv('nlp_predictions_representation_mixed.tsv', sep='\t', index=False)


Encoding tweets...


Batches: 100%|██████████| 23/23 [00:04<00:00,  5.24it/s]


Saving predictions...


In [52]:
dev_tweet_embeddings_mixed = mixed_model.encode(
    df_query_dev['tweet_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Recompute cosine similarity for dev set
similarity_matrix_dev_mixed = cosine_similarity(dev_tweet_embeddings_mixed, paper_embeddings_mixed)

Batches:   0%|          | 0/22 [00:00<?, ?it/s]

Batches: 100%|██████████| 22/22 [00:05<00:00,  4.34it/s]


In [53]:
top_k = 10

# Get top-K predicted cord_uids for each tweet
top_indices_dev_mixed = np.argsort(similarity_matrix_dev_mixed, axis=1)[:, -top_k:][:, ::-1]
top_cord_uids_dev_mixed = [
    [paper_cord_uids[i] for i in row]
    for row in top_indices_dev_mixed
]

# Add predictions to the dev dataframe
df_query_dev['preds_mixed'] = top_cord_uids_dev_mixed

# Run and print MRR metrics
mrr_results_mixed = get_performance_mrr(df_query_dev, col_pred='preds_mixed', list_k=[1, 5, 10])

print(f"SBERT mixed MRR@1: {mrr_results_mixed['MRR@1']},SBERT mixed MRR@5: {mrr_results_mixed['MRR@5']},SBERT mixed MRR@10: {mrr_results_mixed['MRR@10']}")
print(f"BM25 MRR@1: {results_dev[1]}, BM25 MRR@5: {results_dev[5]}, BM25 MRR@10: {results_dev[10]}")

SBERT mixed MRR@1: 0.4993,SBERT mixed MRR@5: 0.5704,SBERT mixed MRR@10: 0.5785
BM25 MRR@1: 0.505, BM25 MRR@5: 0.5520357142857142, BM25 MRR@10: 0.5520357142857142


### Experiment 3: Mixed positive and hard negative pairs training

In [54]:
# Load pre-trained Sentence-BERT model
print("Loading SBERT model...")
hardneg_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Loading SBERT model...


In [55]:
# Encode all paper texts
paper_ids = collection_df['cord_uid'].tolist()
paper_texts = collection_df['text'].tolist()

print("Encoding paper texts...")
paper_embeddings_hardneg = hardneg_model.encode(paper_texts, show_progress_bar=True, convert_to_numpy=True, batch_size=64)

# Create lookup: cord_uid -> embedding
paper_embedding_map_hardneg = dict(zip(paper_ids, paper_embeddings_hardneg))

Encoding paper texts...


Batches: 100%|██████████| 121/121 [03:04<00:00,  1.53s/it]


In [56]:
# Encode all tweets in the training set
tweet_ids = df_query_train['post_id'].tolist()
tweet_texts = df_query_train['tweet_text'].tolist()
tweet_uids = df_query_train['cord_uid'].tolist() 

print("Encoding tweet texts...")
tweet_embeddings_hardneg = hardneg_model.encode(tweet_texts, show_progress_bar=True, convert_to_numpy=True, batch_size=64)

Encoding tweet texts...


Batches: 100%|██████████| 201/201 [00:54<00:00,  3.70it/s]


In [None]:
hard_negative_examples = []
k = 2  # number of hard negatives per tweet

for i in tqdm(range(len(tweet_texts))):
    tweet_emb = tweet_embeddings_hardneg[i]
    tweet_text = tweet_texts[i]
    true_uid = tweet_uids[i]
    
    true_text = paper_text_map.get(true_uid)
    if not true_text:
        continue  
    
    # Positive example 
    hard_negative_examples.append(InputExample(texts=[tweet_text, true_text], label=1.0))

    # Compute similarity to all paper embeddings
    sims = cosine_similarity([tweet_emb], paper_embeddings_hardneg)[0]
    
    # Mask out true paper
    sims = [(pid, score) for pid, score in zip(paper_ids, sims) if pid != true_uid]
    
    # Sort by similarity, take top k
    hard_negs = sorted(sims, key=lambda x: x[1], reverse=True)[:k]

    for neg_uid, _ in hard_negs:
        neg_text = paper_text_map.get(neg_uid)
        if neg_text:
            hard_negative_examples.append(InputExample(texts=[tweet_text, neg_text], label=0.0))


100%|██████████| 12853/12853 [02:06<00:00, 101.85it/s]
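
`MultipleNegativesRankingLoss` can also consume (anchor, positive, hard negative) triplets, which is an alternative way to feed mined negatives to the model. The sketch below shows how ranked candidates could be arranged into such triplets in plain Python. It is not the approach used in the cells above: the function name `build_triplets`, its parameters, and the toy data are assumptions for illustration.

```python
def build_triplets(tweets, gold_ids, ranked_candidates, paper_texts, n_neg=2):
    """Return (tweet, positive_text, negative_text) tuples, one per mined hard negative."""
    triplets = []
    for tweet, gold, ranked in zip(tweets, gold_ids, ranked_candidates):
        if gold not in paper_texts:
            continue  # skip tweets whose gold paper is missing
        # Highest-ranked papers that are not the gold paper
        negatives = [pid for pid in ranked if pid != gold][:n_neg]
        for neg in negatives:
            triplets.append((tweet, paper_texts[gold], paper_texts[neg]))
    return triplets

toy_papers = {"p0": "masks study", "p1": "quarantine history", "p2": "smallpox airborne"}
toy_triplets = build_triplets(
    tweets=["do masks work"],
    gold_ids=["p0"],
    ranked_candidates=[["p0", "p2", "p1"]],  # most similar first, gold included
    paper_texts=toy_papers,
)
```

Each tuple could then be wrapped as `InputExample(texts=[tweet, positive_text, negative_text])` before building the `DataLoader`.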


In [58]:
train_dataloader_hardneg = DataLoader(hard_negative_examples, shuffle=True, batch_size=16)
train_loss_hardneg = losses.MultipleNegativesRankingLoss(model=hardneg_model)

hardneg_model.fit(
    train_objectives=[(train_dataloader_hardneg, train_loss_hardneg)],
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True
)

hardneg_model.save("sbert_hardneg_finetuned")
hardneg_model = SentenceTransformer("sbert_hardneg_finetuned")

Iteration: 100%|██████████| 2410/2410 [49:19<00:00,  1.23s/it]
Epoch: 100%|██████████| 1/1 [49:19<00:00, 2959.95s/it]


In [59]:
# Encode collection texts into dense vectors
print("Encoding collection (papers)...")
paper_embeddings_hardneg = hardneg_model.encode(
    collection_df['text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

Encoding collection (papers)...


Batches: 100%|██████████| 121/121 [02:24<00:00,  1.19s/it]


In [None]:
# Encode tweet texts
print("Encoding tweets...")
test_tweet_embeddings_hardneg = hardneg_model.encode(
    df_query_test['tweet_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

similarity_matrix = cosine_similarity(test_tweet_embeddings_hardneg, paper_embeddings_hardneg)

# Extract top-K predictions
top_k = 5
top_indices_hardneg = np.argsort(similarity_matrix, axis=1)[:, -top_k:][:, ::-1]
top_cord_uids_hardneg = [[paper_cord_uids[i] for i in row] for row in top_indices_hardneg]

# Save submission
print("Saving predictions...")
df_query_test['preds_hardneg'] = top_cord_uids_hardneg
df_query_test[['post_id', 'preds_hardneg']].to_csv('nlp_predictions_representation_mixed_hardneg.tsv', sep='\t', index=False)


Encoding tweets...


Batches: 100%|██████████| 23/23 [00:04<00:00,  4.74it/s]


Saving predictions...


In [61]:
dev_tweet_embeddings_hardneg = hardneg_model.encode(
    df_query_dev['tweet_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Recompute cosine similarity for dev set
similarity_matrix_dev = cosine_similarity(dev_tweet_embeddings_hardneg, paper_embeddings_hardneg)

Batches: 100%|██████████| 22/22 [00:05<00:00,  4.04it/s]


In [62]:
top_k = 10

# Get top-K predicted cord_uids for each tweet
top_indices_dev_hardneg = np.argsort(similarity_matrix_dev, axis=1)[:, -top_k:][:, ::-1]
top_cord_uids_dev_hardneg = [
    [paper_cord_uids[i] for i in row]
    for row in top_indices_dev_hardneg
]

# Add predictions to the dev dataframe
df_query_dev['preds_hardneg'] = top_cord_uids_dev_hardneg

# Run and print MRR metrics
mrr_results_hardneg = get_performance_mrr(df_query_dev, col_pred='preds_hardneg', list_k=[1, 5, 10])

print(f"SBERT hardneg MRR@1: {mrr_results_hardneg['MRR@1']},SBERT hardneg MRR@5: {mrr_results_hardneg['MRR@5']},SBERT hardneg MRR@10: {mrr_results_hardneg['MRR@10']}")
print(f"BM25 MRR@1: {results_dev[1]}, BM25 MRR@5: {results_dev[5]}, BM25 MRR@10: {results_dev[10]}")

SBERT hardneg MRR@1: 0.4793,SBERT hardneg MRR@5: 0.5514,SBERT hardneg MRR@10: 0.5602
BM25 MRR@1: 0.505, BM25 MRR@5: 0.5520357142857142, BM25 MRR@10: 0.5520357142857142
