# CheckThat Re-ranking Model

---

First, we install the required dependencies:

In [None]:
!pip install sentence-transformers torch pandas numpy scikit-learn tqdm


## Setup:

To start, we import the dependencies and fix the random seeds. We decided to track training with Weights & Biases (wandb), since it makes it easy to compare models across runs.

In [None]:
import pickle
import requests
import io
import os
import random
import numpy as np
import pandas as pd
import torch
from torch import nn
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
from sentence_transformers.cross_encoder import CrossEncoder
from torch.utils.data import DataLoader
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
import warnings

warnings.filterwarnings('ignore')

random.seed(40)
np.random.seed(40)
torch.manual_seed(40)

<torch._C.Generator at 0x79ed258b32f0>

Now we load the CLEF 2025 CheckThat! Lab dataset (subtask 4b) for training and evaluation:

In [None]:
base_url = "https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task4/subtask_4b/"

response = requests.get(base_url + 'subtask4b_collection_data.pkl')
data = pickle.load(io.BytesIO(response.content))
df_collection = pd.DataFrame(data)

df_query_train = pd.read_csv(base_url + 'subtask4b_query_tweets_train.tsv', sep='\t')
df_query_dev = pd.read_csv(base_url + 'subtask4b_query_tweets_dev.tsv', sep='\t')

print(f"Collection: {len(df_collection)} documents")
print(f"Train: {len(df_query_train)}")
print(f"VAL: {len(df_query_dev)}")

Collection: 7718 documents
Train: 12853
VAL: 1400


In [None]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


In [None]:
df_query_train.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5
1,1,this study isn't receiving sufficient attentio...,4kfl29ul
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69


In [None]:
sample_query = df_query_train.iloc[0]
sample_doc = df_collection[df_collection['cord_uid'] == sample_query['cord_uid']].iloc[0]

print(f"Tweet: {sample_query['tweet_text']}")
print(f"Paper Title: {sample_doc['title']}")
print(f"Abstract: {sample_doc['abstract'][:100]}")

Tweet: Oral care in rehabilitation medicine: oral vulnerability, oral muscle wasting, and hospital-associated oral issues
Paper Title: Oral Management in Rehabilitation Medicine: Oral Frailty, Oral Sarcopenia, and Hospital-Associated Oral Problems
Abstract: Oral health is a crucial but often neglected aspect of rehabilitation medicine. Approximately 71% of


## The Base Model

To start, we train a simple baseline. The idea is to follow the standard approach: take a pretrained sentence-transformer model and fine-tune it on the data we have, which consists of positive (tweet, paper) pairs.

Each training example has the form (query_text, doc_text). The query text is the tweet; the document text is the paper's title followed by its abstract.

In [None]:
def create_first_train_set(df_query_train, df_collection):
    train_examples = []

    for _, row in tqdm(df_query_train.iterrows(), desc="Creating training examples"):
        doc_row = df_collection[df_collection['cord_uid'] == row['cord_uid']]
        if doc_row.empty:
            continue
        doc_text = doc_row.iloc[0]['title'] + " " + doc_row.iloc[0]['abstract']
        query_text = row['tweet_text']

        train_examples.append(InputExample(texts=[query_text, doc_text], label=1.0))

    print(f"{len(train_examples)} positive samples")
    return train_examples

train_examples = create_first_train_set(df_query_train, df_collection)
dev_examples = create_first_train_set(df_query_dev, df_collection)

Creating training examples: 12853it [00:16, 782.77it/s]


12853 positive samples


Creating training examples: 1400it [00:01, 795.09it/s]

1400 positive samples
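
The loop above filters `df_collection` once per query, which scans the whole collection each time. A minimal sketch of an alternative, assuming the collection is available as a list of records (in the notebook this could come from `df_collection.to_dict("records")`): build a `cord_uid` to text dictionary once and look documents up in constant time. The helper name `build_uid_to_text` and the toy records are hypothetical.

```python
def build_uid_to_text(records):
    """Map each cord_uid to 'title abstract', built in a single pass."""
    return {r["cord_uid"]: f"{r['title']} {r['abstract']}" for r in records}

# Toy records standing in for the rows of the collection (hypothetical values).
records = [
    {"cord_uid": "u1", "title": "Title one", "abstract": "Abstract one."},
    {"cord_uid": "u2", "title": "Title two", "abstract": "Abstract two."},
]

uid_to_text = build_uid_to_text(records)
doc_text = uid_to_text.get("u1")      # constant-time lookup
missing = uid_to_text.get("unknown")  # None when the id is not in the collection
print(doc_text)
print(missing)
```

The same dictionary could be reused by the hard negative mining and re-ranking steps, which perform the same lookup.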





Now we fine-tune the sentence transformer.

In [None]:
model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

batch_size = 128
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)

train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=40,
    warmup_steps=100,
    show_progress_bar=True
)



Step,Training Loss
500,0.6505
1000,0.3176
1500,0.2007
2000,0.1467
2500,0.1196
3000,0.1057
3500,0.0969
4000,0.095
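
Training on positive pairs alone works because `MultipleNegativesRankingLoss` uses the documents paired with the other queries in the same batch as negatives. The following is a dependency-free sketch of that idea, not the library's implementation: the 2-dimensional vectors are made-up stand-ins for embeddings, and the `scale` value of 20 is chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the only correct answer for query i."""
    per_query = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        per_query.append(log_z - logits[i])
    return sum(per_query) / len(per_query)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]  # doc i matches query i
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]  # doc i matches the other query

loss_aligned = in_batch_negatives_loss(queries, aligned_docs)
loss_swapped = in_batch_negatives_loss(queries, swapped_docs)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is close to zero when each query is most similar to its own document and large when another document in the batch is more similar, which is why larger batches give the model more negatives to learn from.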


Create the corpus and compute its embeddings:

In [None]:
corpus = (df_collection['title'] + " " + df_collection['abstract']).tolist()
corpus_ids = df_collection['cord_uid'].tolist()

corpus_embeddings = model.encode(
    corpus,
    convert_to_tensor=True,
    show_progress_bar=True,
    batch_size=128
)


## Implementation of the Retrieval Function

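For each query we retrieve the top-k candidates, ranked by cosine similarity between the query embedding and the corpus embeddings.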

In [None]:
def get_topk_candidates(query, model, corpus_embeddings, corpus_ids, k=10):
    query_emb = model.encode(query, convert_to_tensor=True)
    cos = nn.CosineSimilarity(dim=1)
    similarities = cos(query_emb.unsqueeze(0), corpus_embeddings)
    top_indices = similarities.cpu().numpy().argsort()[-k:][::-1]
    return [corpus_ids[i] for i in top_indices], similarities.cpu().numpy()[top_indices]

test_query = df_query_dev.iloc[0]['tweet_text']
candidates, scores = get_topk_candidates(test_query, model, corpus_embeddings, corpus_ids, k=5)

print(f"Query: {test_query}")
print(f"Top-5 Candidates: {candidates}")
print(f"Scores: {scores}")

Query: covid recovery: this study from the usa reveals that a proportion of cases experience impairment in some cognitive functions for several months after infection. some possible biases &amp; limitations but more research is required on impact of these long term effects.
Top-5 Candidates: ['3qvh482o', 'hg3xpej0', '8t2tic9n', 'styavbvi', 'nksd3wuw']
Scores: [0.6880174  0.6558095  0.64512086 0.62691975 0.5916209 ]


For the evaluation we use the mean reciprocal rank (MRR) at cutoffs 1, 5, and 10:

In [None]:
def calculate_mrr(df_queries, prediction_column='predictions', gold_column='cord_uid'):
    def get_mrr_score(row, k):
        if row[gold_column] in row[prediction_column][:k]:
            rank = row[prediction_column][:k].index(row[gold_column]) + 1
            return 1.0 / rank
        return 0.0

    results = {}
    for k in [1, 5, 10]:
        scores = df_queries.apply(lambda row: get_mrr_score(row, k), axis=1)
        results[f'MRR@{k}'] = scores.mean()

    return results

dev_predictions = []
for _, row in tqdm(df_query_dev.iterrows(), desc="Predicting"):
    candidates, _ = get_topk_candidates(row['tweet_text'], model, corpus_embeddings, corpus_ids, k=10)
    dev_predictions.append(candidates)

df_query_dev['predictions'] = dev_predictions

baseline_results = calculate_mrr(df_query_dev)
print(f"Baseline Results: {baseline_results}")

Predicting: 1400it [00:11, 127.27it/s]


Baseline Results: {'MRR@1': np.float64(0.5407142857142857), 'MRR@5': np.float64(0.6060476190476191), 'MRR@10': np.float64(0.6142732426303855)}
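
To make the metric concrete: MRR@k averages, over all queries, the reciprocal of the rank at which the correct document appears among the first k predictions, and counts 0 if it does not appear. A small worked example with hypothetical document ids (the function names here are illustrative and separate from `calculate_mrr` above):

```python
def reciprocal_rank(predictions, gold, k):
    """Return 1/rank of `gold` within the first k predictions, or 0.0 if absent."""
    top_k = predictions[:k]
    if gold in top_k:
        return 1.0 / (top_k.index(gold) + 1)
    return 0.0

def mean_reciprocal_rank(all_predictions, all_gold, k):
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(p, g, k) for p, g in zip(all_predictions, all_gold)]
    return sum(scores) / len(scores)

# Three toy queries whose correct document is always "a".
predictions = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "d"]]
gold = ["a", "a", "a"]

mrr_at_1 = mean_reciprocal_rank(predictions, gold, k=1)  # (1 + 0 + 0) / 3
mrr_at_3 = mean_reciprocal_rank(predictions, gold, k=3)  # (1 + 1/2 + 0) / 3
print(f"MRR@1 = {mrr_at_1:.4f}, MRR@3 = {mrr_at_3:.4f}")
```

MRR@1 is therefore the same as top-1 accuracy, while larger cutoffs also give partial credit for near misses.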


## Hard Negative Mining

So far the model has only learned to tell the correct paper apart from random other texts, which is easy. We want it to learn the harder nuances. Hard negative mining addresses this: we use our baseline model to find papers that look similar to the tweet but are not the correct answer, and add them to the training data.

In [None]:
def create_hard_negative_examples(df_query_train, df_collection, model, corpus_embeddings, corpus_ids, num_negatives=3):
    train_examples = []

    for _, row in tqdm(df_query_train.iterrows()):
        doc_row = df_collection[df_collection['cord_uid'] == row['cord_uid']]
        if doc_row.empty:
            continue

        doc_text = doc_row.iloc[0]['title'] + " " + doc_row.iloc[0]['abstract']
        query_text = row['tweet_text']
        train_examples.append(InputExample(texts=[query_text, doc_text], label=1.0))

        candidates, _ = get_topk_candidates(query_text, model, corpus_embeddings, corpus_ids, k=20)

        hard_negatives = []
        for candidate_uid in candidates:
            if candidate_uid != row['cord_uid']:
                neg_doc_row = df_collection[df_collection['cord_uid'] == candidate_uid]
                if not neg_doc_row.empty:
                    neg_doc_text = neg_doc_row.iloc[0]['title'] + " " + neg_doc_row.iloc[0]['abstract']
                    hard_negatives.append(neg_doc_text)
                    if len(hard_negatives) >= num_negatives:
                        break

        for neg_text in hard_negatives:
            train_examples.append(InputExample(texts=[query_text, neg_text], label=0.0))
    return train_examples

hard_negative_examples = create_hard_negative_examples(
    df_query_train, df_collection, model, corpus_embeddings, corpus_ids, num_negatives=5
)

12853it [03:12, 66.60it/s]
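
One caveat when these examples are later used with `MultipleNegativesRankingLoss`: according to the library's documentation, that loss does not read the `label` field and assumes every pair is a matching pair, so a (query, negative) pair with `label=0.0` is treated like a positive. The loss accepts explicit hard negatives as a third text instead, in the form (anchor, positive, negative). Below is a minimal sketch of building such triplets from ranked candidates; `build_triplets` and the toy data are hypothetical, and in the notebook each triplet would be wrapped as `InputExample(texts=[query, positive, negative])`.

```python
def build_triplets(query, gold_uid, ranked_uids, uid_to_text, num_negatives=3):
    """Pair the gold document with the highest-ranked wrong candidates."""
    if gold_uid not in uid_to_text:
        return []
    # Keep the retrieval order so the most similar wrong documents come first.
    negatives = [u for u in ranked_uids if u != gold_uid and u in uid_to_text]
    return [
        (query, uid_to_text[gold_uid], uid_to_text[neg_uid])
        for neg_uid in negatives[:num_negatives]
    ]

# Toy collection and ranking (hypothetical ids and texts).
uid_to_text = {
    "d1": "masks reduce exposure",
    "d2": "masks and influenza",
    "d3": "quarantine history",
}
ranked = ["d2", "d1", "d3"]  # first-stage ranking for the query below

triplets = build_triplets("do face masks work?", "d1", ranked, uid_to_text, num_negatives=2)
for triplet in triplets:
    print(triplet)
```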


Now we retrain a fresh model on these examples:

In [None]:

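improved_model = SentenceTransformer(model_name)

hard_neg_dataloader = DataLoader(hard_negative_examples, shuffle=True, batch_size=batch_size)

# Bind the loss to the model being trained; `train_loss` above wraps the baseline `model`.
improved_train_loss = losses.MultipleNegativesRankingLoss(improved_model)

improved_model.fit(
    train_objectives=[(hard_neg_dataloader, improved_train_loss)],
    epochs=20,
    warmup_steps=200,
    show_progress_bar=True,
    output_path="./improved_model",
    use_amp=True
)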



Step,Training Loss
500,0.8961
1000,0.5521
1500,0.4595
2000,0.4113
2500,0.3827
3000,0.3637
3500,0.3411
4000,0.322
4500,0.3143
5000,0.3088


We re-encode the corpus with the retrained model:

In [None]:
improved_corpus_embeddings = improved_model.encode(
    corpus,
    convert_to_tensor=True,
    show_progress_bar=True,
    batch_size=128
)


We evaluate it on the dev set and compare it with the baseline:

In [None]:
improved_predictions = []
for _, row in tqdm(df_query_dev.iterrows(), desc="Improved Predicting"):
    candidates, _ = get_topk_candidates(
        row['tweet_text'], improved_model, improved_corpus_embeddings, corpus_ids, k=10
    )
    improved_predictions.append(candidates)

df_query_dev['improved_predictions'] = improved_predictions

improved_results = calculate_mrr(df_query_dev, prediction_column='improved_predictions')

print(f"Baseline: {baseline_results}")
print(f"Improved: {improved_results}")

for metric in baseline_results.keys():
    improvement = improved_results[metric] - baseline_results[metric]
    print(f"{metric} Improvement: {improvement:.4f}")

Improved Predicting: 1400it [00:14, 98.47it/s]

Baseline: {'MRR@1': np.float64(0.5407142857142857), 'MRR@5': np.float64(0.6060476190476191), 'MRR@10': np.float64(0.6142732426303855)}
Improved: {'MRR@1': np.float64(0.5178571428571429), 'MRR@5': np.float64(0.585797619047619), 'MRR@10': np.float64(0.5943968253968254)}
MRR@1 Improvement: -0.0229
MRR@5 Improvement: -0.0203
MRR@10 Improvement: -0.0199





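## A Different Approach: Cross-Encoder

So far we have used a bi-encoder, which embeds the query and the document independently and compares the two embeddings by cosine similarity. A cross-encoder instead takes the query and the document together as one input and outputs a relevance score directly. This is slower per pair, so we apply it only to re-rank the candidates retrieved by the bi-encoder.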



In [None]:
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', num_labels=1)


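The training data has to be prepared differently: for each query we take the bi-encoder's top-k candidates and label each (query, document) pair 1 if the document is the correct paper and 0 otherwise.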

In [None]:
def create_cross_encoder_training_data(df_query_train, df_collection, bi_encoder, corpus_embeddings, corpus_ids, top_k=10):
    ce_examples = []

    for _, row in tqdm(df_query_train.iterrows(), desc="Creating CE training data"):
        query = row['tweet_text']
        correct_uid = row['cord_uid']

        candidates, _ = get_topk_candidates(query, bi_encoder, corpus_embeddings, corpus_ids, k=top_k)

        for uid in candidates:
            doc_row = df_collection[df_collection['cord_uid'] == uid]
            if doc_row.empty:
                continue

            doc_text = doc_row.iloc[0]['title'] + " " + doc_row.iloc[0]['abstract']
            label = 1 if uid == correct_uid else 0

            ce_examples.append(InputExample(texts=[query, doc_text], label=float(label)))

    print(f"{len(ce_examples)} samples")
    return ce_examples

ce_training_data = create_cross_encoder_training_data(
    df_query_train, df_collection, model, corpus_embeddings, corpus_ids
)

Creating CE training data: 12853it [04:13, 50.70it/s]

128530 samples





Now we train the cross-encoder on these labeled pairs:

In [None]:
ce_dataloader = DataLoader(ce_training_data, shuffle=True, batch_size=128)

cross_encoder.fit(
    train_dataloader=ce_dataloader,
    epochs=5,
    warmup_steps=100,
    output_path="./cross_encoder_model"
)

Step,Training Loss
500,0.1622
1000,0.1792
1500,0.1646
2000,0.1632
2500,0.1489
3000,0.1503
3500,0.1397
4000,0.1389
4500,0.1299
5000,0.1329


Now we combine both stages: the bi-encoder retrieves the candidates and the cross-encoder re-ranks them.

In [None]:
def complete_pipeline_prediction(query, bi_encoder, cross_encoder, corpus_embeddings, corpus_ids, df_collection,
                                first_stage_k=20, final_k=5):
    candidates, _ = get_topk_candidates(query, bi_encoder, corpus_embeddings, corpus_ids, k=first_stage_k)

    if cross_encoder is not None and len(candidates) > 1:
        pairs = []
        valid_candidates = []

        for uid in candidates:
            doc_row = df_collection[df_collection['cord_uid'] == uid]
            if not doc_row.empty:
                doc_text = doc_row.iloc[0]['title'] + " " + doc_row.iloc[0]['abstract']
                pairs.append([query, doc_text])
                valid_candidates.append(uid)

        if pairs:
            scores = cross_encoder.predict(pairs)
            ranked_indices = np.argsort(scores)[::-1]
            reranked_candidates = [valid_candidates[i] for i in ranked_indices]

            return reranked_candidates[:final_k]

    return candidates[:final_k]

test_query = df_query_dev.iloc[0]['tweet_text']
pipeline_result = complete_pipeline_prediction(
    test_query, model, cross_encoder, corpus_embeddings, corpus_ids, df_collection
)

print(f"Query: {test_query}")
print(f"Pipeline Result: {pipeline_result}")
print(f"Correct Answer: {df_query_dev.iloc[0]['cord_uid']}")

Query: covid recovery: this study from the usa reveals that a proportion of cases experience impairment in some cognitive functions for several months after infection. some possible biases &amp; limitations but more research is required on impact of these long term effects.
Pipeline Result: ['3qvh482o', '8t2tic9n', 'nksd3wuw', 'hg3xpej0', 'rthsl7a9']
Correct Answer: 3qvh482o
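
The two-stage logic can be illustrated without any model. In this minimal sketch the scoring functions are hypothetical lookup tables standing in for the bi-encoder and the cross-encoder, and `retrieve_and_rerank` is an illustrative name, not part of the notebook:

```python
def retrieve_and_rerank(query, first_stage_score, second_stage_score, doc_ids,
                        first_stage_k=3, final_k=2):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costly one."""
    shortlist = sorted(doc_ids, key=lambda d: first_stage_score(query, d),
                       reverse=True)[:first_stage_k]
    reranked = sorted(shortlist, key=lambda d: second_stage_score(query, d),
                      reverse=True)
    return reranked[:final_k]

# Made-up scores: "d4" is the best document for the second stage,
# but the first stage ranks it last.
cheap = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.1}
precise = {"d1": 0.2, "d2": 0.5, "d3": 0.95, "d4": 0.99}

result = retrieve_and_rerank(
    "some query",
    lambda q, d: cheap[d],
    lambda q, d: precise[d],
    list(cheap),
)
print(result)
```

The example also shows the main limitation of the design: the re-ranker can only reorder what the first stage retrieved, so a document missing from the shortlist (here `d4`) cannot be recovered, however well the second stage would score it.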


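Finally, we evaluate the full pipeline on the dev set, using the retrained bi-encoder (`improved_model`) for the first stage. Because the pipeline returns only the top 5 candidates (`final_k=5`), MRR@10 is identical to MRR@5 for the final model.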

In [None]:
final_predictions = []
for _, row in tqdm(df_query_dev.iterrows(), desc="Final Pipeline Predictions"):
    prediction = complete_pipeline_prediction(
        row['tweet_text'], improved_model, cross_encoder,
        improved_corpus_embeddings, corpus_ids, df_collection
    )
    final_predictions.append(prediction)

df_query_dev['final_predictions'] = final_predictions

final_results = calculate_mrr(df_query_dev, prediction_column='final_predictions')

print(f"Baseline:  {baseline_results}")
print(f"Improved:  {improved_results}")
print(f"Final:     {final_results}")

for metric in baseline_results.keys():
    baseline_score = baseline_results[metric]
    final_score = final_results[metric]
    improvement = final_score - baseline_score
    improvement_pct = (improvement / baseline_score) * 100 if baseline_score > 0 else 0
    # Let the format spec emit the sign so that a decrease is not printed as "+-".
    print(f"{metric}: {baseline_score:.4f} → {final_score:.4f} ({improvement:+.4f}, {improvement_pct:+.1f}%)")

Final Pipeline Predictions: 1400it [01:42, 13.73it/s]

Baseline:  {'MRR@1': np.float64(0.5407142857142857), 'MRR@5': np.float64(0.6060476190476191), 'MRR@10': np.float64(0.6142732426303855)}
Improved:  {'MRR@1': np.float64(0.5178571428571429), 'MRR@5': np.float64(0.585797619047619), 'MRR@10': np.float64(0.5943968253968254)}
Final:     {'MRR@1': np.float64(0.6292857142857143), 'MRR@5': np.float64(0.6662261904761905), 'MRR@10': np.float64(0.6662261904761905)}
MRR@1: 0.5407 → 0.6293 (+0.0886, +16.4%)
MRR@5: 0.6060 → 0.6662 (+0.0602, +9.9%)
MRR@10: 0.6143 → 0.6662 (+0.0520, +8.5%)





Predictions for the submission:

In [None]:
submission_predictions = []
for predictions in final_predictions:
    submission_predictions.append(predictions[:5])

df_submission = df_query_dev[['post_id']].copy()
df_submission['preds'] = submission_predictions

# Create the output directory first so that to_csv does not fail if it is missing.
output_path = '/content/tmp/neural_reranking_predictions.tsv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df_submission.to_csv(output_path, index=False, sep='\t')

print(f"Predictions saved in {output_path}")
print(f"{len(df_submission)} predictions created")

Predictions saved in /content/tmp/neural_reranking_predictions.tsv
1400 predictions created
