# Task 2.5: PubMed Linkage Pipeline
## NCT ID - PubMed Publication Mapping with AI Reference Detection

---

### Objective
Link clinical trials (NCT IDs) to their associated PubMed publications and identify AI references.

### Implementation Details
- **Input**: `Task2/clinical_trial_sample (1).csv` (9,428 NCT IDs)
- **Sample**: First 100 NCT IDs (scalable to full dataset)
- **AI Detection Method**: Rule-Based Keyword Matching (Option 1)
- **Output Format**: One row per NCT-PMID pair (Option A)
- **API**: NCBI E-utilities (ESearch, ESummary, EFetch)

### Output Schema
```
nct_id, pmid_from_pubmed_search, publication_year, journal, ai_reference_indicator
```
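
As a minimal illustration of the one-row-per-pair layout, the sketch below writes two rows taken from the output preview later in this notebook using the standard-library `csv` module: one linked publication and one trial with no match. The name `FIELDS` is introduced here only for illustration.

```python
import csv
import io

# Column order from the Output Schema (FIELDS is an illustrative name).
FIELDS = [
    "nct_id", "pmid_from_pubmed_search", "publication_year",
    "journal", "ai_reference_indicator",
]

rows = [
    # A trial with a linked publication: one row per NCT-PMID pair.
    {"nct_id": "NCT00422422", "pmid_from_pubmed_search": "38518434",
     "publication_year": "2024", "journal": "Epilepsy Res",
     "ai_reference_indicator": False},
    # A trial with no publication: a single row with empty fields.
    {"nct_id": "NCT00175851", "pmid_from_pubmed_search": "",
     "publication_year": "", "journal": "", "ai_reference_indicator": ""},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```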

## 1. Setup and Configuration

In [1]:
# Import required libraries
import requests
import pandas as pd
import time
import xml.etree.ElementTree as ET
from urllib.parse import quote
import json
from tqdm import tqdm
import logging
from typing import List, Dict, Tuple, Optional
import os

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

In [None]:
# Configuration
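# Read the NCBI API key from the environment instead of hard-coding a credential.
# If NCBI_API_KEY is unset, API_KEY is None and requests omits the parameter;
# NCBI then allows only 3 requests/second, so raise RATE_LIMIT_DELAY to about 0.34.
API_KEY = os.environ.get('NCBI_API_KEY')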
BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
RATE_LIMIT_DELAY = 0.11  # 10 requests/second with API key (0.1s + buffer)
MAX_RETRIES = 3
RETRY_DELAY = 2  # seconds
CHECKPOINT_INTERVAL = 50  # Save progress every 50 NCT IDs

# File paths
INPUT_FILE = '../Task2/clinical_trial_sample (1).csv'
OUTPUT_FILE = 'nct_pubmed_linkage.csv'
CHECKPOINT_FILE = 'checkpoint_results.csv'

# Sample size (set to None to process all 9,428 NCT IDs)
SAMPLE_SIZE = 100  # First 100 NCT IDs, matching the run recorded below
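
The delay between requests follows from NCBI's request limits: 10 requests per second with an API key and 3 without. Below is a minimal sketch of deriving the delay from whether a key is present. The environment variable name `NCBI_API_KEY` and the helper `delay_for` are assumptions made for this example, not part of the notebook.

```python
import os
from typing import Optional


def delay_for(api_key: Optional[str], buffer: float = 0.01) -> float:
    """Return the seconds to sleep between requests for the applicable NCBI limit."""
    # 10 requests/second with a key, 3 requests/second without one.
    requests_per_second = 10 if api_key else 3
    return round(1.0 / requests_per_second + buffer, 3)


api_key = os.environ.get("NCBI_API_KEY")  # hypothetical variable name
print(delay_for(api_key))
```

With a key this reproduces the 0.11 s used in the configuration; without one it returns 0.343 s.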

## 2. AI Keyword Dictionary

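List of AI-related terms used for keyword matching. It also includes general statistical terms (e.g. 'logistic regression', 'clustering') and short acronyms (e.g. 'gan', 'cnn'), which can raise false positives under plain substring matching.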

In [3]:
# AI Keywords for detection
AI_KEYWORDS = [
    # Core AI Terms
    'artificial intelligence', 'machine learning', 'deep learning',
    'neural network', 'convolutional neural network', 'cnn',
    'recurrent neural network', 'rnn', 'transformer',
    
    # AI Techniques
    'random forest', 'support vector machine', 'svm',
    'gradient boosting', 'xgboost', 'decision tree',
    'k-nearest neighbor', 'knn', 'naive bayes',
    'ensemble learning', 'supervised learning', 'unsupervised learning',
    'reinforcement learning', 'transfer learning',
    
    # Deep Learning Architectures
    'lstm', 'gru', 'attention mechanism', 'bert', 'gpt',
    'resnet', 'u-net', 'gan', 'generative adversarial network',
    'autoencoder', 'variational autoencoder', 'vae',
    
    # AI Applications in Healthcare/Drug Development
    'computer vision', 'natural language processing', 'nlp',
    'image recognition', 'object detection', 'semantic segmentation',
    'drug discovery ai', 'ai-driven', 'ai-powered', 'ai-enabled',
    'predictive modeling', 'feature extraction', 'dimensionality reduction',
    
    # Specific Algorithms
    'logistic regression', 'linear regression', 'pca',
    'clustering', 'classification algorithm', 'regression model'
]

print(f"Total AI keywords: {len(AI_KEYWORDS)}")

Total AI keywords: 54


## 3. Core API Functions

### 3.1 ESearch - Search for Publications by NCT ID

In [4]:
def search_pubmed_by_nct(nct_id: str, api_key: str = API_KEY) -> List[str]:
    """
    Search PubMed for publications referencing a specific NCT ID.
    
    Args:
        nct_id: Clinical trial identifier (e.g., 'NCT00175851')
        api_key: NCBI API key for rate limit increase
    
    Returns:
        List of PubMed IDs (PMIDs) as strings
    """
    url = f"{BASE_URL}esearch.fcgi"
    params = {
        'db': 'pubmed',
        'term': nct_id,
        'retmode': 'json',
        'retmax': 100,
        'api_key': api_key
    }
    
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            
            data = response.json()
            pmids = data.get('esearchresult', {}).get('idlist', [])
            
            time.sleep(RATE_LIMIT_DELAY)
            return pmids
            
        except Exception as e:
            logger.warning(f"ESearch attempt {attempt + 1} failed for {nct_id}: {e}")
            if attempt < MAX_RETRIES - 1:
                time.sleep(RETRY_DELAY * (attempt + 1))
            else:
                logger.error(f"ESearch failed for {nct_id} after {MAX_RETRIES} attempts")
                return []
    
    return []
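
Two properties of this search are worth keeping in mind. Searching the bare ID across all fields also returns papers that merely mention the trial, such as systematic reviews, and `retmax=100` silently drops any hits beyond the first 100. The sketch below shows one way to address both, assuming PubMed's Secondary Source ID field tag `[si]` indexes registry numbers such as NCT IDs and that the JSON response carries a `count` field. The helper names and the sample payload are hand-written for illustration, and the results recorded in this notebook were produced with the unrestricted search.

```python
from typing import Dict, List, Tuple


def build_esearch_params(nct_id: str, retmax: int = 100) -> Dict[str, object]:
    """Build ESearch parameters restricted to the Secondary Source ID field."""
    return {
        "db": "pubmed",
        "term": f"{nct_id}[si]",  # [si] = Secondary Source ID field tag (assumption)
        "retmode": "json",
        "retmax": retmax,
    }


def parse_esearch(payload: dict) -> Tuple[List[str], bool]:
    """Return (pmids, truncated); truncated is True when count exceeds the IDs returned."""
    result = payload.get("esearchresult", {})
    pmids = result.get("idlist", [])
    total = int(result.get("count", len(pmids)))
    return pmids, total > len(pmids)


# Hand-written payload in the shape ESearch returns with retmode=json.
sample = {"esearchresult": {"count": "4", "retmax": "2",
                            "idlist": ["38518434", "31810577"]}}
pmids, truncated = parse_esearch(sample)
print(build_esearch_params("NCT00422422")["term"], pmids, truncated)
```

When `truncated` is true, the caller could log a warning or page through the remaining results with `retstart`.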

### 3.2 ESummary - Retrieve Publication Metadata

In [5]:
def get_publication_metadata(pmids: List[str], api_key: str = API_KEY) -> Dict[str, Dict]:
    """
    Retrieve publication metadata (year, journal, title) for a list of PMIDs.
    
    Args:
        pmids: List of PubMed IDs
        api_key: NCBI API key
    
    Returns:
        Dictionary mapping PMID to metadata dict with keys:
        - 'year': Publication year
        - 'journal': Journal name
        - 'title': Article title
    """
    if not pmids:
        return {}
    
    url = f"{BASE_URL}esummary.fcgi"
    params = {
        'db': 'pubmed',
        'id': ','.join(pmids),
        'retmode': 'json',
        'api_key': api_key
    }
    
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            
            data = response.json()
            result = data.get('result', {})
            
            metadata = {}
            for pmid in pmids:
                if pmid in result:
                    pub_data = result[pmid]
                    
                    # Extract year from pubdate
                    pubdate = pub_data.get('pubdate', '')
                    year = pubdate.split()[0] if pubdate else ''
                    
                    metadata[pmid] = {
                        'year': year,
                        'journal': pub_data.get('source', ''),
                        'title': pub_data.get('title', '')
                    }
            
            time.sleep(RATE_LIMIT_DELAY)
            return metadata
            
        except Exception as e:
            logger.warning(f"ESummary attempt {attempt + 1} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                time.sleep(RETRY_DELAY * (attempt + 1))
            else:
                logger.error(f"ESummary failed after {MAX_RETRIES} attempts")
                return {}
    
    return {}
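
Taking the first whitespace-separated token works when `pubdate` starts with the year (for example `2024 Mar 15`). A slightly more defensive variant, sketched here with a hypothetical helper `extract_year`, searches for the first four-digit year anywhere in the string. The season-first input is a made-up edge case, not a value observed in this run.

```python
import re


def extract_year(pubdate: str) -> str:
    """Return the first four-digit year (1800-2099) in a pubdate string, or ''."""
    match = re.search(r"\b(?:1[89]|20)\d{2}\b", pubdate or "")
    return match.group(0) if match else ""


for value in ["2024 Mar 15", "2013 Jan-Feb", "Spring 2011", ""]:
    print(repr(value), "->", repr(extract_year(value)))
```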

### 3.3 EFetch - Retrieve Publication Abstracts

In [6]:
def get_publication_abstracts(pmids: List[str], api_key: str = API_KEY) -> Dict[str, str]:
    """
    Retrieve abstracts for a list of PMIDs.
    
    Args:
        pmids: List of PubMed IDs
        api_key: NCBI API key
    
    Returns:
        Dictionary mapping PMID to abstract text
    """
    if not pmids:
        return {}
    
    url = f"{BASE_URL}efetch.fcgi"
    params = {
        'db': 'pubmed',
        'id': ','.join(pmids),
        'retmode': 'xml',
        'rettype': 'abstract',
        'api_key': api_key
    }
    
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            
            # Parse XML response
            root = ET.fromstring(response.content)
            
            abstracts = {}
            for article in root.findall('.//PubmedArticle'):
                # Get PMID
                pmid_elem = article.find('.//PMID')
                if pmid_elem is not None:
                    pmid = pmid_elem.text
                    
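                    # Get abstract text; itertext() also captures text inside
                    # inline markup (e.g. <i>, <sup>), which .text alone would cut off
                    abstract_texts = []
                    for abstract_elem in article.findall('.//AbstractText'):
                        section_text = ''.join(abstract_elem.itertext()).strip()
                        if section_text:
                            abstract_texts.append(section_text)
                    
                    abstracts[pmid] = ' '.join(abstract_texts) if abstract_texts else ''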
            
            time.sleep(RATE_LIMIT_DELAY)
            return abstracts
            
        except Exception as e:
            logger.warning(f"EFetch attempt {attempt + 1} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                time.sleep(RETRY_DELAY * (attempt + 1))
            else:
                logger.error(f"EFetch failed after {MAX_RETRIES} attempts")
                return {}
    
    return {}
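
To make the XML handling concrete, the sketch below parses a hand-written fragment shaped like an EFetch PubMed record (not real data). It compares reading only `.text` with joining `itertext()` when an abstract section contains inline markup.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape of an EFetch PubMed record (not real data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="METHODS">A <i>deep learning</i> model was trained.</AbstractText>
          <AbstractText Label="RESULTS">Accuracy improved.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

root = ET.fromstring(SAMPLE_XML)
article = root.find(".//PubmedArticle")
pmid = article.find(".//PMID").text
sections = article.findall(".//AbstractText")

# .text stops at the first child element of each section.
text_only = " ".join(s.text or "" for s in sections)
# itertext() walks the section including nested inline elements.
full_text = " ".join("".join(s.itertext()).strip() for s in sections)

print(pmid)
print(text_only)
print(full_text)
```

In this sample the `.text` reading loses the phrase "deep learning", which is exactly the kind of term the keyword matcher looks for.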

## 4. AI Reference Detection

In [7]:
def detect_ai_reference(title: str, abstract: str, keywords: List[str] = AI_KEYWORDS) -> bool:
    """
    Detect AI references in publication title and abstract using keyword matching.
    
    Args:
        title: Publication title
        abstract: Publication abstract
        keywords: List of AI-related keywords to search for
    
    Returns:
        True if any AI keyword is found, False otherwise
    """
    # Combine title and abstract, convert to lowercase
    combined_text = (str(title) + ' ' + str(abstract)).lower()
    
    # Check for keyword matches
    for keyword in keywords:
        if keyword.lower() in combined_text:
            return True
    
    return False
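
Plain substring matching also fires when a short keyword occurs inside a longer word: `gan` matches "organ" and `gru` matches "congruent". One possible refinement, sketched below under the assumption that whole-term matching is acceptable for this task, compiles the keywords into a single regular expression with word boundaries and an optional plural `s`. The helper names are illustrative, and the results recorded in this notebook come from the substring version.

```python
import re
from typing import Iterable, List


def compile_keyword_pattern(keywords: Iterable[str]) -> re.Pattern:
    """Compile one case-insensitive regex matching any keyword as a whole term."""
    # Longest first so multi-word phrases win over shorter alternatives.
    ordered = sorted(set(keywords), key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    # (?<!\w) and (?!\w) act as boundaries that also work for terms like 'u-net';
    # the optional 's' lets simple plurals such as 'CNNs' match.
    return re.compile(rf"(?<!\w)({alternation})s?(?!\w)", re.IGNORECASE)


def find_ai_keywords(text: str, pattern: re.Pattern) -> List[str]:
    """Return the distinct keywords found in text, lowercased and sorted."""
    return sorted({m.group(1).lower() for m in pattern.finditer(text)})


pattern = compile_keyword_pattern(["gan", "cnn", "deep learning", "u-net"])
print(find_ai_keywords("Organ transplant outcomes", pattern))
print(find_ai_keywords("A U-Net and a GAN were compared", pattern))
```

Returning the matched terms instead of a bare boolean also makes it easier to audit why a publication was flagged.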

## 5. Main Pipeline Functions

In [8]:
def process_nct_id(nct_id: str) -> List[Dict]:
    """
    Process a single NCT ID: search PubMed, retrieve metadata and abstracts,
    detect AI references.
    
    Args:
        nct_id: Clinical trial identifier
    
    Returns:
        List of dictionaries, one per publication found (or a single row with
        empty publication fields if none are found)
    """
    results = []
    
    # Step 1: Search for PMIDs
    pmids = search_pubmed_by_nct(nct_id)
    
    if not pmids:
        # No publications found - return row with empty values
        results.append({
            'nct_id': nct_id,
            'pmid_from_pubmed_search': '',
            'publication_year': '',
            'journal': '',
            'ai_reference_indicator': ''
        })
        return results
    
    # Step 2: Get metadata (year, journal, title)
    metadata = get_publication_metadata(pmids)
    
    # Step 3: Get abstracts
    abstracts = get_publication_abstracts(pmids)
    
    # Step 4: Process each publication
    for pmid in pmids:
        meta = metadata.get(pmid, {})
        abstract = abstracts.get(pmid, '')
        title = meta.get('title', '')
        
        # Detect AI reference
        ai_detected = detect_ai_reference(title, abstract)
        
        results.append({
            'nct_id': nct_id,
            'pmid_from_pubmed_search': pmid,
            'publication_year': meta.get('year', ''),
            'journal': meta.get('journal', ''),
            'ai_reference_indicator': ai_detected
        })
    
    return results

In [9]:
def process_nct_batch(nct_ids: List[str], checkpoint_file: str = CHECKPOINT_FILE) -> pd.DataFrame:
    """
    Process a batch of NCT IDs with checkpoint saving.
    
    Args:
        nct_ids: List of NCT IDs to process
        checkpoint_file: File path for saving checkpoints
    
    Returns:
        DataFrame with all results
    """
    all_results = []
    
    # Check if checkpoint exists
    start_idx = 0
    if os.path.exists(checkpoint_file):
        logger.info(f"Loading checkpoint from {checkpoint_file}")
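        # Read everything as strings so empty fields stay '' (not NaN) and PMIDs
        # are not turned into floats; then restore booleans for the AI flag
        checkpoint_df = pd.read_csv(checkpoint_file, dtype=str, keep_default_na=False)
        checkpoint_df['ai_reference_indicator'] = checkpoint_df['ai_reference_indicator'].map(
            {'True': True, 'False': False}).fillna('')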
        all_results = checkpoint_df.to_dict('records')
        
        # Find where to resume
        processed_nct_ids = set(checkpoint_df['nct_id'].unique())
        for i, nct_id in enumerate(nct_ids):
            if nct_id not in processed_nct_ids:
                start_idx = i
                break
        else:
            start_idx = len(nct_ids)  # All processed
        
        logger.info(f"Resuming from NCT ID index {start_idx}")
    
    # Process NCT IDs
    for i, nct_id in enumerate(tqdm(nct_ids[start_idx:], desc="Processing NCT IDs", initial=start_idx, total=len(nct_ids))):
        try:
            results = process_nct_id(nct_id)
            all_results.extend(results)
            
            # Save checkpoint every CHECKPOINT_INTERVAL
            if (start_idx + i + 1) % CHECKPOINT_INTERVAL == 0:
                checkpoint_df = pd.DataFrame(all_results)
                checkpoint_df.to_csv(checkpoint_file, index=False)
                logger.info(f"Checkpoint saved at NCT ID {start_idx + i + 1}/{len(nct_ids)}")
        
        except Exception as e:
            logger.error(f"Error processing {nct_id}: {e}")
            # Add error row
            all_results.append({
                'nct_id': nct_id,
                'pmid_from_pubmed_search': '',
                'publication_year': '',
                'journal': '',
                'ai_reference_indicator': ''
            })
    
    # Convert to DataFrame
    df = pd.DataFrame(all_results)
    
    # Save final checkpoint
    df.to_csv(checkpoint_file, index=False)
    logger.info("Final checkpoint saved")
    
    return df
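
The resume step depends on two things: knowing which NCT IDs are already in the checkpoint, and reading the checkpoint back without changing value types. A checkpoint read with pandas defaults comes back with `NaN` for empty fields and floats for PMIDs, which would defeat the `!= ''` filters used in the summary cells. The sketch below uses only the standard library and hypothetical helper names. It reads rows as plain strings and computes the remaining IDs by set membership, which also covers a checkpoint that is not a contiguous prefix of the input.

```python
import csv
import io
from typing import Dict, List, Set


def load_checkpoint_rows(handle) -> List[Dict[str, str]]:
    """Read checkpoint rows as plain strings, so '' stays '' and PMIDs keep their digits."""
    return list(csv.DictReader(handle))


def remaining_ids(all_ids: List[str], rows: List[Dict[str, str]]) -> List[str]:
    """Return IDs not yet present in the checkpoint, preserving input order."""
    done: Set[str] = {row["nct_id"] for row in rows}
    return [nct_id for nct_id in all_ids if nct_id not in done]


# In-memory stand-in for checkpoint_results.csv, using rows from the output preview.
checkpoint = io.StringIO(
    "nct_id,pmid_from_pubmed_search,publication_year,journal,ai_reference_indicator\n"
    "NCT00175851,,,,\n"
    "NCT00422422,38518434,2024,Epilepsy Res,False\n"
)
rows = load_checkpoint_rows(checkpoint)
todo = remaining_ids(
    ["NCT00175851", "NCT00359632", "NCT00422422", "NCT00436748"], rows
)
print(rows[0]["pmid_from_pubmed_search"] == "", rows[1]["pmid_from_pubmed_search"])
print(todo)
```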

## 6. Load Input Data

In [10]:
# Load clinical trial dataset
df_clinical_trials = pd.read_csv(INPUT_FILE)

print(f"Total NCT IDs in dataset: {len(df_clinical_trials)}")
print(f"\nDataset columns: {df_clinical_trials.columns.tolist()}")
print(f"\nFirst few rows:")
df_clinical_trials.head()

Total NCT IDs in dataset: 9428

Dataset columns: ['nct_id', 'brief_title', 'overall_status', 'sponsor_name', 'gvkey_sponsor', 'phase_number', 'start_date', 'start_year']

First few rows:


| | nct_id | brief_title | overall_status | sponsor_name | gvkey_sponsor | phase_number | start_date | start_year |
|---|---|---|---|---|---|---|---|---|
| 0 | NCT00175851 | Open Label Trial to Study the Long-term Safety... | Withdrawn | UCB Pharma | 24454 | 3 | 2008-05-01 | 2008 |
| 1 | NCT00359632 | Study to Evaluate Eye Function in Patients Tak... | Terminated | Pfizer | 8530 | 3 | 2008-11-01 | 2008 |
| 2 | NCT00415155 | A Study of LY2181308 in Patients With Advanced... | Withdrawn | Eli Lilly and Company | 6730 | 2 | 2008-08-01 | 2008 |
| 3 | NCT00422110 | A Study to Evaluate the Efficacy and Safety of... | Withdrawn | UCB Pharma | 24454 | 3 | 2008-05-01 | 2008 |
| 4 | NCT00422422 | Open-label, Pharmacokinetic, Safety and Effica... | Completed | UCB Pharma | 24454 | 2 | 2011-07-01 | 2011 |


In [11]:
# Extract NCT IDs (first SAMPLE_SIZE rows in sample mode, otherwise all rows)
if SAMPLE_SIZE is not None:
    nct_ids = df_clinical_trials['nct_id'].head(SAMPLE_SIZE).tolist()
    print(f"Processing first {SAMPLE_SIZE} NCT IDs (sample mode)")
else:
    nct_ids = df_clinical_trials['nct_id'].tolist()
    print(f"Processing all {len(nct_ids)} NCT IDs (full dataset mode)")

print(f"\nExample NCT IDs: {nct_ids[:5]}")

Processing first 100 NCT IDs (sample mode)

Example NCT IDs: ['NCT00175851', 'NCT00359632', 'NCT00415155', 'NCT00422110', 'NCT00422422']


## 7. Run Pipeline

In [12]:
# Process all NCT IDs
logger.info("Starting PubMed linkage pipeline...")
df_results = process_nct_batch(nct_ids)
logger.info("Pipeline completed!")

2026-02-15 13:24:07,248 - INFO - Starting PubMed linkage pipeline...
Processing NCT IDs:  49%|████▉     | 49/100 [00:46<00:39,  1.28it/s]2026-02-15 13:24:54,717 - INFO - Checkpoint saved at NCT ID 50/100
Processing NCT IDs:  99%|█████████▉| 99/100 [01:41<00:01,  1.05s/it]2026-02-15 13:25:49,277 - INFO - Checkpoint saved at NCT ID 100/100
Processing NCT IDs: 100%|██████████| 100/100 [01:42<00:00,  1.02s/it]
2026-02-15 13:25:49,280 - INFO - Final checkpoint saved
2026-02-15 13:25:49,280 - INFO - Pipeline completed!


## 8. Export Results

In [13]:
# Save final results
df_results.to_csv(OUTPUT_FILE, index=False)
print(f"Results saved to {OUTPUT_FILE}")
print(f"\nTotal rows in output: {len(df_results)}")

Results saved to nct_pubmed_linkage.csv

Total rows in output: 129


## 9. Summary Statistics

In [14]:
# Calculate summary statistics
print("=" * 60)
print("SUMMARY STATISTICS")
print("=" * 60)

# Total NCT IDs processed
total_nct_ids = df_results['nct_id'].nunique()
print(f"\n1. Total NCT IDs processed: {total_nct_ids}")

# NCT IDs with publications
nct_with_pubs = df_results[df_results['pmid_from_pubmed_search'] != '']['nct_id'].nunique()
print(f"2. NCT IDs with at least one publication: {nct_with_pubs} ({nct_with_pubs/total_nct_ids*100:.1f}%)")

# NCT IDs without publications
nct_without_pubs = total_nct_ids - nct_with_pubs
print(f"3. NCT IDs without publications: {nct_without_pubs} ({nct_without_pubs/total_nct_ids*100:.1f}%)")

# Total publications found
total_publications = df_results[df_results['pmid_from_pubmed_search'] != ''].shape[0]
print(f"\n4. Total publications found: {total_publications}")

# Average publications per NCT ID (for those with publications)
if nct_with_pubs > 0:
    avg_pubs = total_publications / nct_with_pubs
    print(f"5. Average publications per NCT ID (with pubs): {avg_pubs:.2f}")

# Publications with AI references
ai_publications = df_results[df_results['ai_reference_indicator'] == True].shape[0]
if total_publications > 0:
    print(f"\n6. Publications with AI references: {ai_publications} ({ai_publications/total_publications*100:.1f}%)")
else:
    print(f"\n6. Publications with AI references: {ai_publications}")

# NCT IDs with at least one AI publication
nct_with_ai = df_results[df_results['ai_reference_indicator'] == True]['nct_id'].nunique()
print(f"7. NCT IDs with at least one AI publication: {nct_with_ai} ({nct_with_ai/total_nct_ids*100:.1f}%)")

SUMMARY STATISTICS

1. Total NCT IDs processed: 100
2. NCT IDs with at least one publication: 37 (37.0%)
3. NCT IDs without publications: 63 (63.0%)

4. Total publications found: 66
5. Average publications per NCT ID (with pubs): 1.78

6. Publications with AI references: 6 (9.1%)
7. NCT IDs with at least one AI publication: 6 (6.0%)
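
The reported figures can be cross-checked against each other. Each trial without a match contributes exactly one placeholder row, so the placeholder rows plus the publication rows should equal the 129 rows written in Section 8. A short sketch using only the numbers printed above:

```python
# Figures copied from the recorded output in Sections 8 and 9.
total_rows = 129
total_nct_ids = 100
nct_with_pubs = 37
total_publications = 66
ai_publications = 6

nct_without_pubs = total_nct_ids - nct_with_pubs
# Each trial without a match contributes exactly one placeholder row.
assert nct_without_pubs + total_publications == total_rows

avg_pubs = total_publications / nct_with_pubs
ai_share = ai_publications / total_publications * 100
print(f"{nct_without_pubs} placeholder rows + {total_publications} publication rows = {total_rows}")
print(f"{avg_pubs:.2f} publications per linked trial, {ai_share:.1f}% flagged as AI")
```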


In [15]:
# Top journals
print("\n" + "=" * 60)
print("TOP 10 JOURNALS")
print("=" * 60)
top_journals = df_results[df_results['journal'] != '']['journal'].value_counts().head(10)
print(top_journals)


TOP 10 JOURNALS
journal
Postgrad Med                       4
J Child Adolesc Psychopharmacol    4
Clin Ther                          2
Nat Rev Cancer                     2
J Comp Eff Res                     2
Cancer                             2
Curr Med Res Opin                  2
Leuk Lymphoma                      2
JAMA Psychiatry                    2
J Diabetes Investig                2
Name: count, dtype: int64


In [16]:
# Publication year distribution
print("\n" + "=" * 60)
print("PUBLICATION YEAR DISTRIBUTION")
print("=" * 60)
year_dist = df_results[df_results['publication_year'] != '']['publication_year'].value_counts().sort_index()
print(year_dist)


PUBLICATION YEAR DISTRIBUTION
publication_year
2009    2
2010    2
2011    8
2012    8
2013    9
2014    7
2015    6
2016    4
2017    7
2018    1
2019    3
2020    3
2021    2
2023    2
2024    2
Name: count, dtype: int64


In [17]:
# Preview final output
print("\n" + "=" * 60)
print("FINAL OUTPUT PREVIEW")
print("=" * 60)
print(f"\nFirst 10 rows:")
df_results.head(10)


FINAL OUTPUT PREVIEW

First 10 rows:


| | nct_id | pmid_from_pubmed_search | publication_year | journal | ai_reference_indicator |
|---|---|---|---|---|---|
| 0 | NCT00175851 | | | | |
| 1 | NCT00359632 | | | | |
| 2 | NCT00415155 | | | | |
| 3 | NCT00422110 | | | | |
| 4 | NCT00422422 | 38518434 | 2024 | Epilepsy Res | False |
| 5 | NCT00422422 | 31810577 | 2020 | Eur J Paediatr Neurol | False |
| 6 | NCT00422422 | 31250322 | 2019 | Paediatr Drugs | False |
| 7 | NCT00422422 | 28280887 | 2017 | Eur J Clin Pharmacol | False |
| 8 | NCT00436748 | 36791280 | 2023 | Cochrane Database Syst Rev | True |
| 9 | NCT00455052 | | | | |


In [18]:
# Show examples of different categories
print("\n" + "=" * 60)
print("EXAMPLE RESULTS")
print("=" * 60)

# Example: NCT with publications (no AI)
no_ai_example = df_results[(df_results['pmid_from_pubmed_search'] != '') & 
                           (df_results['ai_reference_indicator'] == False)].head(1)
if not no_ai_example.empty:
    print("\nExample: NCT with publication (No AI reference)")
    print(no_ai_example.to_string(index=False))

# Example: NCT with AI publication
ai_example = df_results[df_results['ai_reference_indicator'] == True].head(1)
if not ai_example.empty:
    print("\nExample: NCT with publication (AI reference detected)")
    print(ai_example.to_string(index=False))

# Example: NCT without publications
no_pub_example = df_results[df_results['pmid_from_pubmed_search'] == ''].head(1)
if not no_pub_example.empty:
    print("\nExample: NCT without publications")
    print(no_pub_example.to_string(index=False))


EXAMPLE RESULTS

Example: NCT with publication (No AI reference)
     nct_id pmid_from_pubmed_search publication_year      journal ai_reference_indicator
NCT00422422                38518434             2024 Epilepsy Res                  False

Example: NCT with publication (AI reference detected)
     nct_id pmid_from_pubmed_search publication_year                    journal ai_reference_indicator
NCT00436748                36791280             2023 Cochrane Database Syst Rev                   True

Example: NCT without publications
     nct_id pmid_from_pubmed_search publication_year journal ai_reference_indicator
NCT00175851                                                                        


## 10. Notes and Instructions

### Scaling to Full Dataset
To process all 9,428 NCT IDs instead of just 100:
1. Change `SAMPLE_SIZE = 100` to `SAMPLE_SIZE = None` in the Configuration cell
2. Re-run all cells from "6. Load Input Data" onwards
3. The pipeline will automatically use checkpoints to save progress

### Checkpoint/Resume Functionality
- Progress is automatically saved every 50 NCT IDs
- If the notebook crashes or is interrupted, simply re-run the pipeline
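- It will automatically detect the checkpoint file and resume from the last saved checkpoint; NCT IDs processed after that checkpoint are processed again
- Delete `checkpoint_results.csv` to start fresh, for example after changing `AI_KEYWORDS`, since checkpointed rows are not re-evaluated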

### API Rate Limits
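- With API key: 10 requests/second (3 requests/second without one)
- Observed time for 100 NCT IDs: about 1 minute 42 seconds in the run recorded in Section 7 (roughly 1 second per NCT ID)
- Estimated time for all 9,428 NCT IDs: roughly 2.5-3 hours at that throughput; slow responses or retries will lengthen it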
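
A runtime estimate can be derived from the run recorded in Section 7, where the progress bar shows 100 NCT IDs completing in 1 minute 42 seconds. The sketch below extrapolates that throughput to the full dataset and compares it with the time spent in rate-limit sleeps alone, assuming no retries occurred (none were logged) and that the sample's share of trials with publications holds for the rest.

```python
# Observed in the recorded run: 100 NCT IDs in 1 min 42 s.
observed_seconds = 102
observed_ids = 100
full_dataset_ids = 9428

seconds_per_id = observed_seconds / observed_ids
estimated_hours = full_dataset_ids * seconds_per_id / 3600

# Requests in the sample: 1 for a trial with no match (ESearch only),
# 3 for a trial with matches (ESearch, ESummary, EFetch).
requests_made = 63 * 1 + 37 * 3
sleep_seconds = requests_made * 0.11  # RATE_LIMIT_DELAY per request

print(f"{seconds_per_id:.2f} s per NCT ID, about {estimated_hours:.1f} h for the full dataset")
print(f"{requests_made} requests, {sleep_seconds:.1f} s spent in rate-limit sleeps")
```

The sleeps account for only about 19 of the 102 seconds, so network latency rather than the rate limit dominates the runtime.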

### Output Format
- **nct_id**: Clinical trial identifier
- **pmid_from_pubmed_search**: PubMed ID (empty if no publications)
- **publication_year**: Year of publication (empty if no publications)
- **journal**: Journal name (empty if no publications)
- **ai_reference_indicator**: True/False/empty (empty if no publications)

### Customization
- To modify AI keywords, edit the `AI_KEYWORDS` list in Section 2
- To adjust checkpoint frequency, change `CHECKPOINT_INTERVAL` in Configuration
- To modify rate limits, adjust `RATE_LIMIT_DELAY` in Configuration