# MedMentions Dataset Preprocessing

## Overview

This notebook preprocesses the MedMentions dataset for entity linking with the UMLS knowledge base.

**Dataset:** 352,496 entity mentions from 4,392 PubMed abstracts
**Knowledge Base:** UMLS (Unified Medical Language System), rather than Wikipedia as in general-domain entity linking
**Challenge:** High ambiguity, especially in medical abbreviations (MS, RA, CA)

**Preprocessing Pipeline:**
1. Load and parse PubTator format → Extract mentions with UMLS CUIDs
2. Filter valid entities → Remove corrupted annotations
3. Normalize mentions → Standardize medical terminology
4. Extract biomedical features → Add domain-specific attributes
5. Create structured records → Format for entity linking models
6. Export data → Save as JSONL and Parquet
7. Generate statistics → Document dataset characteristics

---

## Step 0: Import Libraries

In [None]:
import pandas as pd
import sys
import warnings
import re
import os
from tqdm import tqdm
warnings.filterwarnings('ignore')

In [None]:
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))

if parent_dir not in sys.path:
    sys.path.append(parent_dir)
import Utils as u

## Step 1: Load MedMentions Dataset

**Purpose:** Parse the PubTator format file and split data by official train/dev/test PMIDs.

**What it does:**
- Reads corpus_pubtator.txt (biomedical abstracts with entity annotations)
- Extracts mentions with UMLS Concept Unique Identifiers (CUIDs)
- Adds 200-character context windows around each mention
- Splits data according to official train/dev/test document lists

**Dataset structure:** Each mention includes:
- `pmid`: PubMed document ID
- `mention`: Entity text (e.g., "diabetes", "MS")
- `entity_id`: UMLS CUID (e.g., "C0011849")
- `entity_type`: UMLS semantic type ID(s), comma-separated (e.g., "T047" for Disease or Syndrome)
- Context fields for disambiguation
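
The parsing described above can be sketched as follows. This is a minimal illustration, not the actual `u.parse_pubtator_file`: the function name `parse_pubtator_sketch`, the `window` parameter, and the assumption that annotation offsets index into `title + " " + abstract` are all hypothetical. It expects the usual PubTator layout (`PMID|t|title`, `PMID|a|abstract`, then tab-separated annotation lines) and strips an optional `UMLS:` prefix from the concept ID. The sample lines are abridged from the first document in the corpus.

```python
from typing import Dict, Iterable, List


def parse_pubtator_sketch(lines: Iterable[str], window: int = 200) -> List[Dict]:
    """Illustrative PubTator parser (hypothetical; not u.parse_pubtator_file)."""
    mentions: List[Dict] = []
    title, abstract = "", ""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            if head[1] == "t":
                title, abstract = head[2], ""  # a title line starts a new document
            else:
                abstract = head[2]
            continue
        parts = line.split("\t")
        if len(parts) != 6:
            continue  # skip anything that is not a 6-column annotation line
        pmid, start_s, end_s, mention, sem_types, cui = parts
        start, end = int(start_s), int(end_s)
        # Assumption: offsets refer to the title and abstract joined by one character.
        text = f"{title} {abstract}"
        mentions.append({
            "pmid": pmid,
            "mention": mention,
            "entity_id": cui.split(":")[-1],  # drop an optional "UMLS:" prefix
            "entity_type": sem_types,
            "start": start,
            "end": end,
            "context_left": text[max(0, start - window):start],
            "context_right": text[end:end + window],
        })
    return mentions


sample = [
    "25763772|t|DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis",
    "25763772|a|Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients",
    "25763772\t0\t5\tDCTN4\tT116,T123\tC4308010",
]
records = parse_pubtator_sketch(sample)
print(records[0]["mention"], records[0]["entity_id"])
```

Keeping the left and right context as separate fields makes it easy to re-insert mention markers later without recomputing offsets.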

In [None]:

CORPUS_PATH = 'data/MedMentions/corpus_pubtator.txt'
TRAIN_PMIDS_PATH = 'data/MedMentions/corpus_pubtator_pmids_trng.txt'
DEV_PMIDS_PATH = 'data/MedMentions/corpus_pubtator_pmids_dev.txt'
TEST_PMIDS_PATH = 'data/MedMentions/corpus_pubtator_pmids_test.txt'

print("LOADING MEDMENTIONS DATASET")

mentions_list = u.parse_pubtator_file(CORPUS_PATH)
df_all = pd.DataFrame(mentions_list)

print(f"\nTotal mentions: {len(df_all):,}")
print(f"Total PMIDs: {df_all['pmid'].nunique():,}")
print(f"Total unique UMLS entities: {df_all['entity_id'].nunique():,}")

LOADING MEDMENTIONS DATASET
Parsing data/MedMentions/corpus_pubtator.txt...


365672it [00:02, 128875.31it/s]


Parsed 352496 mentions from 4392 documents

Total mentions: 352,496
Total PMIDs: 4,392
Total unique UMLS entities: 34,724


In [4]:
print("\nLoading train/dev/test splits...")

with open(TRAIN_PMIDS_PATH, 'r') as f:
    train_pmids = set(f.read().splitlines())

with open(DEV_PMIDS_PATH, 'r') as f:
    dev_pmids = set(f.read().splitlines())

with open(TEST_PMIDS_PATH, 'r') as f:
    test_pmids = set(f.read().splitlines())


df_train = df_all[df_all['pmid'].isin(train_pmids)].copy()
df_val = df_all[df_all['pmid'].isin(dev_pmids)].copy()
df_test = df_all[df_all['pmid'].isin(test_pmids)].copy()

print(f"\n Split Statistics:")
print(f"Train: {len(df_train):,} mentions from {df_train['pmid'].nunique():,} documents")
print(f"Val:   {len(df_val):,} mentions from {df_val['pmid'].nunique():,} documents")
print(f"Test:  {len(df_test):,} mentions from {df_test['pmid'].nunique():,} documents")


Loading train/dev/test splits...

 Split Statistics:
Train: 211,029 mentions from 2,635 documents
Val:   71,062 mentions from 878 documents
Test:  70,405 mentions from 879 documents
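
The split counts above add up to the corpus totals (211,029 + 71,062 + 70,405 = 352,496 mentions and 2,635 + 878 + 879 = 4,392 documents), which is what disjoint, complete splits would produce. A more direct check could look like the sketch below. The helper `check_splits` is hypothetical and not part of `Utils`; the toy PMIDs are made up.

```python
from typing import Dict, Set


def check_splits(all_pmids: Set[str], train: Set[str], dev: Set[str], test: Set[str]) -> Dict[str, int]:
    """Count overlaps between splits and PMIDs that fall outside them."""
    assigned = train | dev | test
    return {
        "train_dev_overlap": len(train & dev),
        "train_test_overlap": len(train & test),
        "dev_test_overlap": len(dev & test),
        "unassigned": len(all_pmids - assigned),   # in the corpus but in no split
        "unknown": len(assigned - all_pmids),      # listed in a split but not parsed
    }


# Toy example with made-up PMIDs; every count should be zero for clean splits.
report = check_splits({"1", "2", "3", "4"}, {"1", "2"}, {"3"}, {"4"})
print(report)
```

In the notebook this could be called as `check_splits(set(df_all['pmid']), train_pmids, dev_pmids, test_pmids)`, assuming `pmid` values are strings like the entries read from the split files.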


## Step 2: Filter Valid Entities

**Purpose:** Remove mentions with invalid or missing UMLS CUIDs to ensure data quality.

**What it does:**
- Validates that each mention has a proper UMLS identifier
- Removes corrupted or incomplete entity annotations
- Ensures entity spans match the actual text
- Filters out malformed entries

**Why it matters:** MedMentions can have parsing errors or incomplete annotations. Clean data prevents training issues and improves model reliability. This is especially important for biomedical data where entity IDs must match the UMLS knowledge base.
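
One way such a validity check might look is sketched below. The implementation of `u.apply_filter_valid_entities` is not shown in this notebook, so the helpers `is_valid_cui` and `filter_valid_mentions` are hypothetical. The sketch relies on the standard UMLS identifier shape: the letter `C` followed by seven digits.

```python
import re
from typing import Dict, Iterable, List

# UMLS concept identifiers look like "C0011849": "C" plus seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def is_valid_cui(value: object) -> bool:
    """True only for strings that match the CUI shape exactly."""
    return isinstance(value, str) and CUI_PATTERN.fullmatch(value.strip()) is not None


def filter_valid_mentions(mentions: Iterable[Dict]) -> List[Dict]:
    """Keep mentions that have non-empty text and a well-formed CUI."""
    return [m for m in mentions if m.get("mention") and is_valid_cui(m.get("entity_id"))]


rows = [
    {"mention": "diabetes", "entity_id": "C0011849"},
    {"mention": "diabetes", "entity_id": ""},          # missing ID
    {"mention": "", "entity_id": "C0011849"},          # empty mention text
    {"mention": "MS", "entity_id": "C123"},            # malformed ID
]
kept = filter_valid_mentions(rows)
print(f"Original: {len(rows)} | Valid: {len(kept)} | Removed: {len(rows) - len(kept)}")
```

A shape check like this only catches malformed IDs. Confirming that an ID actually exists in a given UMLS release would require a lookup against that release.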

In [None]:


print("Filtering valid entities...")
print("\nTrain:")
df_train = u.apply_filter_valid_entities(df_train)
print("\nVal:")
df_val = u.apply_filter_valid_entities(df_val)
print("\nTest:")
df_test = u.apply_filter_valid_entities(df_test)

Filtering valid entities...

Train:
Original: 211,029 | Valid: 211,029 | Removed: 0 (0.00%)

Val:
Original: 71,062 | Valid: 71,062 | Removed: 0 (0.00%)

Test:
Original: 70,405 | Valid: 70,405 | Removed: 0 (0.00%)


## Step 3: Normalize Mentions

**Purpose:** Standardize medical terminology for consistent matching across variations.

**What it does:**
- Converts to lowercase (except abbreviations)
- Removes extra whitespace and special characters
- Strips edge punctuation
- Creates `normalized_mention` field

**Medical examples:**
- "Type II diabetes" → "type ii diabetes"
- "COVID-19" → "covid-19"
- "rheumatoid arthritis (RA)" → "rheumatoid arthritis ra"

**Why it matters:** Medical texts use inconsistent formatting. Normalization helps match "Diabetes Mellitus", "diabetes mellitus", and "DIABETES MELLITUS" to the same UMLS concept, improving entity linking accuracy.
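
A minimal sketch of a normalizer that is consistent with the examples above is given here. It is not the actual `u.normalize_mention`. The function name `normalize_mention_sketch` and the abbreviation rule (a single token of two to six uppercase letters or digits, starting with a letter) are assumptions chosen so that short forms such as "MS" keep their case while "COVID-19" is lowercased.

```python
import re

# Assumed abbreviation rule: one short all-caps token, digits allowed after the first letter.
_ABBREVIATION = re.compile(r"[A-Z][A-Z0-9]{1,5}")


def normalize_mention_sketch(mention: str) -> str:
    """Lowercase and clean a mention, leaving short all-caps abbreviations untouched."""
    text = re.sub(r"[^\w\s-]", " ", mention)        # drop special characters, keep hyphens
    text = re.sub(r"\s+", " ", text).strip(" -_")   # collapse whitespace, trim the edges
    if _ABBREVIATION.fullmatch(text):
        return text                                  # e.g. "MS", "DCTN4"
    return text.lower()


for raw in ["Type II diabetes", "COVID-19", "rheumatoid arthritis (RA)", "MS"]:
    print(f"{raw!r} -> {normalize_mention_sketch(raw)!r}")
```

Preserving case only for stand-alone abbreviations keeps "MS" distinct from the title "Ms" while still collapsing "Diabetes Mellitus" and "DIABETES MELLITUS" to one form.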

In [None]:


print("Normalizing mentions...")

df_train['normalized_mention'] = df_train['mention'].apply(u.normalize_mention)
df_val['normalized_mention'] = df_val['mention'].apply(u.normalize_mention)
df_test['normalized_mention'] = df_test['mention'].apply(u.normalize_mention)

print("\nExamples:")
sample = df_train.head(10)
for orig, norm in zip(sample['mention'], sample['normalized_mention']):
    if orig != norm:
        print(f"  '{orig}' → '{norm}'")

print("\n Normalization complete!")

Normalizing mentions...

Examples:

 Normalization complete!


## Step 4: Extract Biomedical Features

**Purpose:** Add domain-specific features that help disambiguate medical entities.

**What it does:**
- `is_abbreviation`: Detects ALL CAPS short forms (MS, RA, CA, DNA)
- `mention_length`: Character count for the mention
- `mention_word_count`: Number of words in mention
- `context_length`: Word count in surrounding text
- `semantic_type_main`: Primary UMLS semantic type
- `is_multi_word`: Whether mention spans multiple words

**Why these features matter:**
- **Abbreviations** are 2-3x more ambiguous (e.g., "MS" = Multiple Sclerosis, Mass Spectrometry, Mississippi)
- **Semantic types** provide category hints (Disease vs Procedure vs Chemical)
- **Context length** indicates disambiguation difficulty
- **Multi-word terms** tend to be more specific

**Output:** Enhanced dataset with features for both analysis and model training.
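
The features listed above could be computed roughly as in this sketch. It is an illustration rather than the code behind `u.extract_biomedical_features`: the function name, the abbreviation regex, and the choice to count context words with a whitespace split are assumptions. The sample input is the first training mention ("DCTN4") with its stored context.

```python
import re
from typing import Dict

# Assumed rule for "ALL CAPS short forms": 2-6 characters, uppercase letters or digits.
ABBREVIATION = re.compile(r"[A-Z][A-Z0-9]{1,5}")


def extract_features_sketch(record: Dict) -> Dict:
    """Derive simple surface features from one mention record."""
    mention = record["mention"]
    words = mention.split()
    return {
        "is_abbreviation": ABBREVIATION.fullmatch(mention) is not None,
        "mention_length": len(mention),                          # characters
        "mention_word_count": len(words),
        "is_multi_word": len(words) > 1,
        "context_length": len(record["full_context"].split()),   # whitespace tokens
        "semantic_type_main": record["entity_type"].split(",")[0],
    }


example = {
    "mention": "DCTN4",
    "entity_type": "T116,T123",
    "full_context": " DCTN4  as a modifier of chronic Pseudomonas aeruginosa infection in "
                    "cystic fibrosis Pseudomonas aeruginosa (Pa) infection in cystic fibrosis "
                    "(CF) patients is associated with worse long-term pulmonary diseas",
}
features = extract_features_sketch(example)
print(features)
```

Taking the first listed semantic type as `semantic_type_main` is a simplification: a concept with several types (here T116 and T123) has no inherent primary one.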

In [None]:
print("Extracting biomedical features...")

df_train = u.extract_biomedical_features(df_train)
df_val = u.extract_biomedical_features(df_val)
df_test = u.extract_biomedical_features(df_test)

print("\n Feature Statistics (Train):")
print(f"Abbreviations: {df_train['is_abbreviation'].sum():,} ({df_train['is_abbreviation'].mean()*100:.1f}%)")
print(f"Mean mention length: {df_train['mention_length'].mean():.1f} chars")
print(f"Mean context length: {df_train['context_length'].mean():.1f} words")
print(f"Mean words per mention: {df_train['mention_word_count'].mean():.2f}")
print(f"\nTop 5 semantic types:")
print(df_train['semantic_type_main'].value_counts().head())

Extracting biomedical features...

 Feature Statistics (Train):
Abbreviations: 17,526 (8.3%)
Mean mention length: 10.7 chars
Mean context length: 56.8 words
Mean words per mention: 1.37

Top 5 semantic types:
semantic_type_main
T080    18689
T169    14241
T081    11888
T033     9511
T116     9194
Name: count, dtype: int64


## Step 5: Create Structured Mention Records

**Purpose:** Transform raw data into standardized format for entity linking models.

**What it does:**
- Creates structured record for each mention
- Includes original mention, normalized form, and context
- Adds entity ID (ground truth for training)
- Bundles all features into unified format

**Record structure:**
```json
{
  "mention": "diabetes",
  "normalized_mention": "diabetes",
  "context_left": "...patients with...",
  "context_right": "...mellitus type 2...",
  "label_id": "C0011849",
  "entity_type": "T047",
  "semantic_type_main": "T047",
  "is_abbreviation": false,
  "mention_length": 8,
  "mention_word_count": 1
}
```

**Why it matters:** This standardized format enables easy integration with various entity linking models (bi-encoders, cross-encoders, retrieval systems). The structure matches what Clarify-and-Link expects.
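
A record builder along these lines is sketched here, assuming each row exposes the columns produced in the earlier steps. It is not the actual `u.create_mention_record`. The names `RECORD_FIELDS` and `create_record_sketch` are hypothetical, and the field list is inferred from the example record printed by this notebook, where the `entity_id` column appears under the key `label_id`.

```python
import json
from typing import Any, Dict, Mapping

# Field order inferred from the example record; an assumption, not a specification.
RECORD_FIELDS = [
    "pmid", "mention", "normalized_mention", "context_left", "context_right",
    "full_context", "label_id", "entity_type", "semantic_type_main", "start", "end",
    "is_abbreviation", "mention_length", "mention_word_count", "context_length", "title",
]


def create_record_sketch(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Build a flat, JSON-serializable record from one row."""
    record: Dict[str, Any] = {}
    for field in RECORD_FIELDS:
        source = "entity_id" if field == "label_id" else field  # rename the gold ID
        value = row.get(source)
        # NumPy scalars coming from pandas are not JSON serializable; .item() unwraps them.
        record[field] = value.item() if hasattr(value, "item") else value
    return record


row = {
    "pmid": "25763772", "mention": "DCTN4", "normalized_mention": "DCTN4",
    "entity_id": "C4308010", "entity_type": "T116,T123", "semantic_type_main": "T116",
    "start": 0, "end": 5, "is_abbreviation": True,
}
record = create_record_sketch(row)
print(json.dumps(record, indent=2))
```

Fields missing from the row come out as `null` rather than raising, which keeps every record on the same schema.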

In [None]:
import json

print("Creating standardized mention records...")

train_records = [u.create_mention_record(row) for _, row in df_train.iterrows()]
val_records = [u.create_mention_record(row) for _, row in df_val.iterrows()]
test_records = [u.create_mention_record(row) for _, row in df_test.iterrows()]
print(f"\nCreated {len(train_records):,} train records")
print(f"Created {len(val_records):,} val records")
print(f"Created {len(test_records):,} test records")


print("\n Example record:")
print(json.dumps(train_records[0], indent=2))

Creating standardized mention records...

Created 211,029 train records
Created 71,062 val records
Created 70,405 test records

 Example record:
{
  "pmid": "25763772",
  "mention": "DCTN4",
  "normalized_mention": "DCTN4",
  "context_left": "",
  "context_right": " as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients is associated with worse long-term pulmonary diseas",
  "full_context": " DCTN4  as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients is associated with worse long-term pulmonary diseas",
  "label_id": "C4308010",
  "entity_type": "T116,T123",
  "semantic_type_main": "T116",
  "start": 0,
  "end": 5,
  "is_abbreviation": true,
  "mention_length": 5,
  "mention_word_count": 1,
  "context_length": 28,
  "title": "DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic f

## Step 6: Export Preprocessed Data

**Purpose:** Save processed data in multiple formats optimized for different use cases.

**Export formats:**

1. **JSONL (JSON Lines):**
   - One record per line
   - Easy streaming for large datasets
   - Human-readable for debugging
   - Standard format for NLP pipelines

2. **Parquet:**
   - Columnar storage (fast loading)
   - Efficient compression (smaller files)
   - Preserves data types
   - Best for pandas/analysis workflows

**Output files:**
- `train.jsonl` / `train.parquet` - Training data (largest split)
- `val.jsonl` / `val.parquet` - Validation for hyperparameter tuning
- `test.jsonl` / `test.parquet` - Final evaluation (never used during training)

**Why both formats:** JSONL for model training, Parquet for fast analysis and visualization.
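
The streaming property of JSONL can be illustrated with a small round trip. The helpers `write_jsonl` and `iter_jsonl` are hypothetical names, and the record is a cut-down example; the sketch writes to a temporary directory so it does not touch the notebook's output files.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator


def write_jsonl(records: Iterable[Dict], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def iter_jsonl(path: str) -> Iterator[Dict]:
    """Yield records one at a time, so the whole file never has to fit in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    original = [{"mention": "DCTN4", "label_id": "C4308010", "is_abbreviation": True}]
    write_jsonl(original, path)
    restored = list(iter_jsonl(path))

print(restored == original)
```

Because each line is an independent JSON document, a training loop can consume `iter_jsonl(...)` lazily, whereas the Parquet files are better suited to loading a whole split into a DataFrame at once.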

In [9]:
os.makedirs('data/processed/medmentions', exist_ok=True)

print("EXPORTING PREPROCESSED DATA")

# Format 1: JSONL (for entity linking models)
print("\n1. Exporting JSONL format...")
for split_name, records in [('train', train_records), ('val', val_records), ('test', test_records)]:
    with open(f'data/processed/medmentions/{split_name}.jsonl', 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')
    print(f"   {split_name}.jsonl")

# Format 2: Parquet (for fast loading)
print("\n2. Exporting Parquet format...")
df_train_export = pd.DataFrame(train_records)
df_val_export = pd.DataFrame(val_records)
df_test_export = pd.DataFrame(test_records)

df_train_export.to_parquet('data/processed/medmentions/train.parquet', index=False)
df_val_export.to_parquet('data/processed/medmentions/val.parquet', index=False)
df_test_export.to_parquet('data/processed/medmentions/test.parquet', index=False)
print("   train.parquet")
print("   val.parquet")
print("   test.parquet")

print("\n All exports complete!")

EXPORTING PREPROCESSED DATA

1. Exporting JSONL format...
   train.jsonl
   val.jsonl
   test.jsonl

2. Exporting Parquet format...
   train.parquet
   val.parquet
   test.parquet

 All exports complete!


## Step 7: Generate Preprocessing Statistics

**Purpose:** Create comprehensive statistics report for documentation, analysis, and reproducibility.

**Statistics computed:**
- **Dataset size:** Number of mentions, documents, unique entities
- **Mention characteristics:** Length distribution, abbreviation percentage
- **Ambiguity metrics:** How many unique entities per mention
- **Context quality:** Average context length
- **Split balance:** Distribution across train/val/test

**Output:** CSV file with statistics for each split, enabling:
- Dataset comparison (AIDA vs MedMentions)
- Quality verification (detect issues early)
- Paper/presentation metrics
- Reproducibility documentation

**Use case:** These statistics feed the milestone presentation and help justify why MedMentions is challenging (high abbreviation rate, medical domain complexity).
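
The ambiguity metric mentioned above (unique entities per mention) could be computed as in the following sketch. The function `ambiguity_stats` and its output keys are hypothetical and are not taken from `u.compute_split_statistics`. The two CUIs attached to "MS" in the toy data are placeholders, not real UMLS identifiers for those senses.

```python
from collections import defaultdict
from typing import Dict, Iterable


def ambiguity_stats(records: Iterable[Dict]) -> Dict[str, float]:
    """Summarize how many distinct entities each normalized mention links to."""
    candidates = defaultdict(set)
    for record in records:
        candidates[record["normalized_mention"]].add(record["label_id"])
    n_mentions = len(candidates)
    n_ambiguous = sum(1 for ids in candidates.values() if len(ids) > 1)
    total_entities = sum(len(ids) for ids in candidates.values())
    return {
        "num_unique_mentions": n_mentions,
        "num_ambiguous_mentions": n_ambiguous,
        "pct_ambiguous": 100.0 * n_ambiguous / n_mentions if n_mentions else 0.0,
        "avg_entities_per_mention": total_entities / n_mentions if n_mentions else 0.0,
    }


toy_records = [
    {"normalized_mention": "MS", "label_id": "C0000001"},        # placeholder CUI
    {"normalized_mention": "MS", "label_id": "C0000002"},        # placeholder CUI
    {"normalized_mention": "diabetes", "label_id": "C0011849"},
]
stats = ambiguity_stats(toy_records)
print(stats)
```

Grouping on the normalized form rather than the raw mention means case and punctuation variants count as one surface form, so the metric reflects genuine sense ambiguity rather than formatting noise.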

In [None]:

print("Computing preprocessing statistics...\n")

train_stats = u.compute_split_statistics(df_train, 'TRAIN')
val_stats = u.compute_split_statistics(df_val, 'VAL')
test_stats = u.compute_split_statistics(df_test, 'TEST')

stats_df = pd.DataFrame([train_stats, val_stats, test_stats])
print("MEDMENTIONS PREPROCESSING STATISTICS")
print("="*70)
print(stats_df.to_string(index=False))

stats_df.to_csv('data/processed/medmentions/preprocessing_stats.csv', index=False)
print("\nStatistics saved to preprocessing_stats.csv")

Computing preprocessing statistics...

MEDMENTIONS PREPROCESSING STATISTICS
split  num_mentions  num_pmids  num_unique_entities  num_unique_semantic_types  num_abbreviations pct_abbreviations avg_mention_length avg_context_length avg_words_per_mention top_semantic_type  top_semantic_type_count
TRAIN        211029       2635                25691                        126              17526             8.31%               10.7               56.8                  1.37              T080                    18689
  VAL         71062        878                12610                        124               5744             8.08%               10.7               56.9                  1.37              T080                     6435
 TEST         70405        879                12419                        123               5922             8.41%               10.7               56.8                  1.37              T080                     6361

Statistics saved to preprocessing_stats.csv


---

## ✅ Preprocessing Complete!

**Outputs generated:**
- `data/processed/medmentions/train.jsonl` + `train.parquet`
- `data/processed/medmentions/val.jsonl` + `val.parquet`
- `data/processed/medmentions/test.jsonl` + `test.parquet`
- `data/processed/medmentions/preprocessing_stats.csv`

**Key dataset characteristics:**
- 🏥 **Domain:** Biomedical literature (PubMed abstracts)
- 🔬 **Entities:** UMLS concepts (diseases, procedures, chemicals)
- 🎯 **Challenge:** High ambiguity, especially for abbreviations
- 📊 **Size:** 352K+ mentions, 4.4K documents, 34.7K unique entities

**Next steps:**
1. Run entity linking analysis (see `Analysis/med_mentions_analysis.ipynb`)
2. Compare with AIDA dataset characteristics
3. Train Clarify-and-Link model on preprocessed data