# Neural NLP Representation Learning Approach

### CLEF 2025 - CheckThat! Lab - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook implements an improved neural approach using sentence transformers with:
- Enhanced text preprocessing for scientific content
- Multi-query retrieval with domain-specific augmentation
- Semantic term matching boosts

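The ranking is still driven by dense sentence embeddings: the term-matching boosts are small additive corrections (at most 0.18 in total) on top of the cosine similarity scores.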

In [1]:
!pip install sentence-transformers scikit-learn





# 1) Importing data and packages

In [2]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re
import warnings
warnings.filterwarnings('ignore')



## 1.a) Import the collection set

In [3]:
PATH_COLLECTION_DATA = 'subtask4b_collection_data.pkl' #MODIFY PATH

In [4]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [5]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [6]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


## 1.b) Import the query set

In [7]:
PATH_QUERY_TRAIN_DATA = 'subtask4b_query_tweets_train.tsv' #MODIFY PATH
PATH_QUERY_DEV_DATA = 'subtask4b_query_tweets_dev.tsv' #MODIFY PATH

In [8]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep='\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep='\t')

In [9]:
df_query_train.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5
1,1,this study isn't receiving sufficient attentio...,4kfl29ul
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69


In [10]:
df_query_dev.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,16,covid recovery: this study from the usa reveal...,3qvh482o
1,69,"""Among 139 clients exposed to two symptomatic ...",r58aohnu
2,73,I recall early on reading that researchers who...,sts48u9i
3,93,You know you're credible when NIH website has ...,3sr2exq9
4,96,Resistance to antifungal medications is a grow...,ybwwmyqy


# 2) Text preprocessing functions

In [11]:
def clean_tweet_text(text):
    """Clean tweet text while preserving scientific information"""
    if pd.isna(text):
        return ""
    
    text = str(text)
    
    text = re.sub(r'&amp;', 'and', text)
    text = re.sub(r'&lt;', '<', text)
    text = re.sub(r'&gt;', '>', text)
    
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'http\S+', '', text)
    
    text = re.sub(r'#covid19', 'COVID-19', text, flags=re.IGNORECASE)
    text = re.sub(r'#sarscov2', 'SARS-CoV-2', text, flags=re.IGNORECASE)
    text = re.sub(r'#(covid|coronavirus)', 'COVID-19', text, flags=re.IGNORECASE)
    text = re.sub(r'#(\w+)', r'\1', text)
    
    text = re.sub(r'\bcovid-?19\b', 'COVID-19', text, flags=re.IGNORECASE)
    text = re.sub(r'\bsars-?cov-?2\b', 'SARS-CoV-2', text, flags=re.IGNORECASE)
    text = re.sub(r'\bcovid\b(?![\d-])', 'COVID-19', text, flags=re.IGNORECASE)
    
    text = re.sub(r'\bnih\b', 'NIH', text, flags=re.IGNORECASE)
    text = re.sub(r'\bicu\b', 'ICU', text, flags=re.IGNORECASE)
    text = re.sub(r'\bppe\b', 'PPE', text, flags=re.IGNORECASE)
    text = re.sub(r'\busa\b', 'USA', text, flags=re.IGNORECASE)
    
    text = re.sub(r'\bp\s*[<>=]\s*0\.(\d+)', r'p-value 0.\1', text, flags=re.IGNORECASE)
    text = re.sub(r'\b(\d+)%\s*ci\b', r'\1% confidence interval', text, flags=re.IGNORECASE)
    
    text = re.sub(r'[💃🚨▶️👍📈📊🔥✅❌🎯🧵👇🏻🔴☑️⬇️➡️]+', '', text)
    # Normalize curly double quotes to straight quotes
    text = re.sub(r'[“”]', '"', text)
    
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

def clean_scientific_text(text):
    """Minimal cleaning for scientific text"""
    if pd.isna(text):
        return ""
    
    text = str(text)
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

def create_enhanced_document_text(row):
    """Create structured document representation"""
    title = clean_scientific_text(row['title'])
    abstract = clean_scientific_text(row['abstract'])
    authors = str(row['authors']) if not pd.isna(row['authors']) else ""
    journal = str(row['journal']) if not pd.isna(row['journal']) else ""
    
    enhanced_text = f"Title: {title}. Abstract: {abstract}"
    
    if authors:
        enhanced_text += f" Authors: {authors}"
    if journal:
        enhanced_text += f" Journal: {journal}"
    
    return enhanced_text
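
To make the effect of the tweet cleaning concrete, here is a minimal sketch that applies a subset of the rules from `clean_tweet_text` to one made-up tweet. The helper name `normalize_covid_mentions` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re

def normalize_covid_mentions(text):
    """Hypothetical helper: a subset of the rules used in clean_tweet_text."""
    text = re.sub(r'@\w+', '', text)       # drop @mentions
    text = re.sub(r'http\S+', '', text)    # drop URLs
    text = re.sub(r'#covid19', 'COVID-19', text, flags=re.IGNORECASE)
    text = re.sub(r'#(\w+)', r'\1', text)  # keep the hashtag word, drop '#'
    # The lookahead skips "covid" that is already followed by a digit or hyphen
    text = re.sub(r'\bcovid\b(?![\d-])', 'COVID-19', text, flags=re.IGNORECASE)
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample tweet, for illustration only
sample = "@who new #Covid19 study: covid patients in #ICU https://t.co/abc"
cleaned = normalize_covid_mentions(sample)
print(cleaned)
```

The negative lookahead `(?![\d-])` is what keeps an already normalized `COVID-19` from being expanded a second time.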

# 3) Load sentence transformer model

In [12]:
# Load sentence transformer model; with no explicit device, the library uses a GPU when one is available
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# 4) Prepare document representations

In [13]:
# Create enhanced representations for all documents
df_collection['enhanced_text'] = df_collection.apply(create_enhanced_document_text, axis=1)

# Prepare corpus and IDs
corpus = df_collection['enhanced_text'].tolist()
cord_uids = df_collection['cord_uid'].tolist()

# 5) Encode all documents

In [14]:
# Encode all documents
doc_embeddings = model.encode(
    corpus,
    show_progress_bar=True,
    convert_to_numpy=True,
    batch_size=32,
    normalize_embeddings=True
)

Batches: 100%|██████████| 242/242 [01:25<00:00,  2.82it/s]
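
Because the embeddings are encoded with `normalize_embeddings=True`, every vector has unit length, so cosine similarity reduces to a plain dot product. The sketch below checks this on random stand-in vectors; the array shapes and names are assumptions chosen for illustration, not the real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))   # stand-in for 4 document embeddings
query = rng.normal(size=(1, 8))  # stand-in for 1 query embedding

# L2-normalize each row, which is what normalize_embeddings=True does
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query, axis=1, keepdims=True)

# Dot product of unit vectors
dot_scores = (query_unit @ docs_unit.T).ravel()

# Cosine similarity computed from the raw vectors
cos_scores = (query @ docs.T).ravel() / (
    np.linalg.norm(query) * np.linalg.norm(docs, axis=1)
)

print(np.allclose(dot_scores, cos_scores))
```

In practice this means `query_embedding @ doc_embeddings.T` would give the same scores as `cosine_similarity` on these normalized embeddings.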


# 6) Advanced retrieval functions

In [None]:
def create_augmented_queries(tweet_text):
    """Create multiple query variations for better retrieval"""
    base_query = clean_tweet_text(tweet_text)
    queries = [base_query]
    base_lower = base_query.lower()
    
    # Add scientific context if study-related terms present
    if any(term in base_lower for term in ['study', 'research', 'trial', 'analysis', 'findings']):
        queries.append(f"scientific research {base_query}")
    
    # Add COVID context if relevant
    if any(term in base_lower for term in ['covid', 'coronavirus', 'pandemic', 'vaccine', 'mask']):
        queries.append(f"COVID-19 pandemic study {base_query}")
    
    # Add statistical context if numbers present
    if re.search(r'\d+%|\bp-value|\bconfidence interval|\bodds ratio|\brisk', base_query, re.IGNORECASE):
        queries.append(f"statistical research findings {base_query}")
    
    return queries


def precompute_doc_features():
    """Pre-compute boolean arrays for fast boosting"""
    features = {}
    
    corpus_lower = [doc.lower() for doc in corpus]
    
    features['covid'] = np.array([
        any(term in doc for term in ['covid', 'coronavirus']) 
        for doc in corpus_lower
    ], dtype=float)
    
    features['stats'] = np.array([
        bool(re.search(r'\d+%|\bp-value', doc)) 
        for doc in corpus_lower
    ], dtype=float)
    
    features['study'] = np.array([
        any(term in doc for term in ['study', 'trial', 'research']) 
        for doc in corpus_lower
    ], dtype=float)
    
    features['medical'] = np.array([
        any(term in doc for term in ['vaccine', 'mask', 'treatment']) 
        for doc in corpus_lower
    ], dtype=float)
    
    return features

doc_features = precompute_doc_features()

def compute_semantic_boost(query_text, doc_features):
    """Vectorized semantic boosting"""
    query_lower = query_text.lower()
    boosts = np.zeros(len(doc_features['covid']))
    
    if any(term in query_lower for term in ['covid', 'coronavirus']):
        boosts += 0.05 * doc_features['covid']
    
    if re.search(r'\d+%|\bp-value', query_lower):
        boosts += 0.06 * doc_features['stats']
    
    if any(term in query_lower for term in ['study', 'trial', 'research']):
        boosts += 0.03 * doc_features['study']
    
    if any(term in query_lower for term in ['vaccine', 'mask', 'treatment']):
        boosts += 0.04 * doc_features['medical']
    
    return boosts

def retrieve_papers(query_text, k=5):
    """Neural retrieval using multi-query and semantic boosting"""
    queries = create_augmented_queries(query_text)
    
    all_similarities = []
    for query in queries:
        query_embedding = model.encode([query], 
                                     convert_to_numpy=True, 
                                     normalize_embeddings=True)
        base_similarities = cosine_similarity(query_embedding, doc_embeddings).flatten()
        
        boosts = compute_semantic_boost(query, doc_features)
        boosted_similarities = base_similarities + boosts
        all_similarities.append(boosted_similarities)
    
    if len(all_similarities) == 1:
        fused_scores = all_similarities[0]
    else:
        weights = [0.6] + [0.4 / (len(all_similarities) - 1)] * (len(all_similarities) - 1)
        fused_scores = np.average(all_similarities, axis=0, weights=weights)
    
    top_indices = np.argsort(fused_scores)[::-1][:k]
    return [cord_uids[i] for i in top_indices]
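
The score fusion in `retrieve_papers` gives the base query a weight of 0.6 and splits the remaining 0.4 evenly across the augmented queries. A minimal sketch with toy scores follows; the helper `fusion_weights` and the score matrix are made up for illustration.

```python
import numpy as np

def fusion_weights(n_queries):
    """Hypothetical helper mirroring the weighting used in retrieve_papers."""
    if n_queries == 1:
        return [1.0]
    return [0.6] + [0.4 / (n_queries - 1)] * (n_queries - 1)

# Toy scores: 3 query variants (rows) x 4 documents (columns)
scores = np.array([
    [0.50, 0.40, 0.30, 0.20],  # base query
    [0.10, 0.80, 0.30, 0.20],  # augmented query 1
    [0.10, 0.80, 0.30, 0.20],  # augmented query 2
])

weights = fusion_weights(len(scores))
fused = np.average(scores, axis=0, weights=weights)
ranking = np.argsort(fused)[::-1]  # document indices, best first

print(weights, fused, ranking)
```

In this toy case the two augmented queries together outvote the base query, so document 1 ranks first even though the base query alone prefers document 0.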

# 7) Running the improved neural model

In [24]:
# Retrieve the top-k candidates (k=5 by default) using the improved neural model
df_query_train['improved_neural_topk'] = df_query_train['tweet_text'].apply(retrieve_papers)
df_query_dev['improved_neural_topk'] = df_query_dev['tweet_text'].apply(retrieve_papers)

# 8) Evaluating the improved neural model

In [26]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    """Return {k: MRR@k}; a query scores 1/rank if its gold ID is in the top k, else 0."""
    def reciprocal_rank(row, k):
        top_k = list(row[col_pred][:k])
        return 1 / (top_k.index(row[col_gold]) + 1) if row[col_gold] in top_k else 0

    d_performance = {}
    for k in list_k:
        # Per-query reciprocal ranks are kept in a scratch column on the DataFrame
        data["in_topx"] = data.apply(lambda row: reciprocal_rank(row, k), axis=1)
        d_performance[k] = data["in_topx"].mean()
    return d_performance
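
As a worked example of the metric, the sketch below computes MRR@k by hand for three made-up queries, without pandas. The IDs and the helper name `reciprocal_rank` are illustrative assumptions.

```python
def reciprocal_rank(gold, preds, k):
    """1/rank of the gold ID within the top k predictions, or 0 if absent."""
    top_k = list(preds[:k])
    return 1 / (top_k.index(gold) + 1) if gold in top_k else 0.0

# (gold ID, ranked predictions) for three toy queries
examples = [
    ("a", ["a", "b", "c"]),  # gold at rank 1 -> 1
    ("b", ["a", "c", "b"]),  # gold at rank 3 -> 1/3
    ("z", ["a", "b", "c"]),  # gold not retrieved -> 0
]

def mrr_at(k):
    return sum(reciprocal_rank(g, p, k) for g, p in examples) / len(examples)

mrr_at_1 = mrr_at(1)  # only the first query counts: 1/3
mrr_at_5 = mrr_at(5)  # (1 + 1/3 + 0) / 3 = 4/9
print(mrr_at_1, mrr_at_5)
```

Raising k beyond the number of retrieved candidates cannot change the score, since there are no further ranks to credit.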

In [27]:
results_train = get_performance_mrr(df_query_train, 'cord_uid', 'improved_neural_topk')
results_dev = get_performance_mrr(df_query_dev, 'cord_uid', 'improved_neural_topk')
# Print MRR@k results as {k: MRR@k}. retrieve_papers returns only 5 candidates by default, so MRR@10 equals MRR@5 here.
print(f"Results on the train set: {results_train}")
print(f"Results on the dev set: {results_dev}")

Results on the train set: {1: np.float64(0.4220804481444021), 5: np.float64(0.4897196504058715), 10: np.float64(0.4897196504058715)}
Results on the dev set: {1: np.float64(0.4392857142857143), 5: np.float64(0.5115000000000001), 10: np.float64(0.5115000000000001)}


# 9) Exporting results to prepare the submission

In [28]:
df_query_dev['preds'] = df_query_dev['improved_neural_topk'].apply(lambda x: x[:5])

In [29]:
df_query_dev[['post_id', 'preds']].to_csv('predictions_improved_neural.tsv', index=None, sep='\t')
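
When a column of Python lists is written with `to_csv`, each list is stored as its string representation, so it has to be parsed when the file is read back. The sketch below shows one way to do that with the standard library. The file contents are a made-up sample of what the TSV is assumed to look like, and the document IDs are placeholders.

```python
import ast
import csv
import io

# Assumed shape of the exported file; IDs are placeholders, not real predictions
tsv_text = (
    "post_id\tpreds\n"
    "16\t['doc_a', 'doc_b']\n"
    "69\t['doc_c']\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
# ast.literal_eval safely turns the string "['doc_a', 'doc_b']" back into a list
parsed = {int(row['post_id']): ast.literal_eval(row['preds']) for row in reader}

print(parsed)
```

A quick check like this can confirm that every row still holds at most 5 predictions before submitting.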