# DOME Registry Pre-Repair Analysis

This notebook analyzes the DOME registry data, searches for specific papers, and cross-references user metadata before any DSW (Data Stewardship Wizard) repair operations are performed. It provides tools to verify the state of annotations and curator mappings.


In [2]:
import requests
import time
import pandas as pd

def fetch_epmc_metadata(pmid):
    """
    Fetches Title, Authors, PubYear, and DOI from Europe PMC API for a given PMID.
    """
    if not pmid or str(pmid) == 'nan':
        return None, None, None, None
        
    url = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
    params = {
        'query': f'EXT_ID:{pmid} SRC:MED',
        'format': 'json',
        'resultType': 'core'
    }
    
    try:
        # A timeout keeps a stalled connection from blocking the whole enrichment loop
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        
        result_list = data.get('resultList', {}).get('result', [])
        
        if result_list:
            top_result = result_list[0]
            title = top_result.get('title', '')
            authors = top_result.get('authorString', '')
            pub_year = top_result.get('pubYear', '')
            doi = top_result.get('doi', '')
            return title, authors, pub_year, doi
            
    except Exception as e:
        print(f"Error fetching PMID {pmid}: {e}")
        
    return None, None, None, None

# Load the TSV
tsv_file = "Dome-Recommendations-Annotated-Articles_20250202.tsv"
print(f"Loading {tsv_file}...")
df_recs = pd.read_csv(tsv_file, sep='\t')

# Ensure columns exist
new_cols = ['EPMC_title', 'EPMC_authors', 'EPMC_pub_year', 'EPMC_doi']
for col in new_cols:
    if col not in df_recs.columns:
        df_recs[col] = None

# Iterate and Enrich
print("Enriching data with Europe PMC API...")
total = len(df_recs)

for index, row in df_recs.iterrows():
    pmid = row.get('PMID')
    
    # Skip rows without a PMID, and rows already enriched (useful when re-running after a failure)
    if pd.isna(pmid) or pd.notna(row.get('EPMC_title')):
        continue
        
    print(f"Processing {index + 1}/{total}: PMID {pmid}", end='\r')
    
    title, authors, year, doi = fetch_epmc_metadata(pmid)
    
    df_recs.at[index, 'EPMC_title'] = title
    df_recs.at[index, 'EPMC_authors'] = authors
    df_recs.at[index, 'EPMC_pub_year'] = year
    df_recs.at[index, 'EPMC_doi'] = doi
    
    # Be nice to the API
    time.sleep(0.1)

print("\nEnrichment complete.")
df_recs.head()

Loading Dome-Recommendations-Annotated-Articles_20250202.tsv...
Enriching data with Europe PMC API...
Processing 188/188: PMID 24977146
Enrichment complete.


Unnamed: 0,Informazioni cronologiche,PMID,Journal name,Publication year,DOME version,Provenance,Dataset splits,Redundancy between data splits,Availability of data,Algorithm,...,Evaluation method,Performance measures,Comparison,Confidence,Availability of evaluation,Indirizzo email,EPMC_title,EPMC_authors,EPMC_pub_year,EPMC_doi
0,3/23/2022 16:36:18,33465072,PLoS Comput Biol.,2021,1.0,"yes, N_pos: 13128 N_neg: 32766",5-fold cross-validation,Correlation Feature Selection is used.,"yes, described at S1 table: https://deposition...","Naive Bayes, SVM, Random Forests",...,5-fold cross validation,ROC curves and AUC values,,,,andrewhatos@gmail.com,Genome-wide prediction of topoisomerase IIβ bi...,"Martínez-García PM, García-Torres M, Divina F,...",2021,10.1371/journal.pcbi.1007814
1,3/28/2022 0:44:11,33679869,Front Genet.,2021,1.0,public databases: 901 samples from The Cancer ...,10-fold cross validation,,"no, they claim: ""Publicly available datasets w...","Spathial, Random forest, LASSO",...,10-fold cross validation,"AUCs for 2-, 3-, and 5-year OS were 0.527, 0.5...",no,no,no,andrewhatos@gmail.com,Construction and Comprehensive Analyses of a M...,"Sun S, Fei K, Zhang G, Wang J, Yang Y, Guo W, ...",2020,10.3389/fgene.2020.617174
2,3/28/2022 16:40:10,34419924,EBioMedicine.,2021,1.0,samples from Stanford Health Care and Stanford...,The final analysis included for discovery coho...,,Conatact (hoganca@stanford.edu) is provided ri...,gradient boosted decision trees and random for...,...,novel experiments.,"AUC, sensitivity, specificity",costeffective comare to PCR tests and could be...,,"yes, https://github.com/stanfordmlgroup/influe...",andrewhatos@gmail.com,Nasopharyngeal metabolomics and machine learni...,"Hogan CA, Rajpurkar P, Sowrirajan H, Phillips ...",2021,10.1016/j.ebiom.2021.103546
3,3/29/2022 23:01:05,34112769,Nat Commun .,2021,1.0,Clinical data collected by the authors N_neg 3...,N_pos: Discovery 227 and Validation 77,,All data included in this study is available u...,composite model with several stepp with diffe...,...,independent dataset,"Accuracy (94.81% for the C4 prediction model, ...",no,no,no,andrewhatos@gmail.com,A new molecular classification to drive precis...,"Soret P, Le Dantec C, Desvaux E, Foulquier N, ...",2021,10.1038/s41467-021-23472-7
4,3/28/2022 12:23:43,32915751,IEEE J Biomed Health Inform.,2020,1.0,two public COVID-19 CT datasets: - https:/...,no,no,preprocessed data: https://drive.google.com/fi...,"deep learning, novel approach",...,four-fold cross-validation,"Accuracy, F1 score, Sensitivity, Precision, AUC",firstly comparision to COVID-Net method https:...,outperforming the original COVID-Net trained o...,no,andrewhatos@gmail.com,Contrastive Cross-Site Learning With Redesigne...,"Wang Z, Liu Q, Dou Q.",2020,10.1109/jbhi.2020.3023246
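
The enrichment cell above mixes the HTTP call with the parsing of the response. As a minimal sketch, the parsing can be pulled into pure functions that are testable without network access. The names `normalize_pmid` and `parse_epmc_core` are hypothetical helpers, not part of the notebook, and the sample payload is hand-written to mimic only the fields the cell reads.

```python
def normalize_pmid(value):
    """Return a PMID as a digit string, or None if it is missing or malformed."""
    if value is None:
        return None
    text = str(value).strip()
    # If the PMID column contains NaN, pandas reads it as float,
    # so 33465072 would arrive here as "33465072.0".
    if text.endswith(".0"):
        text = text[:-2]
    return text if text.isdigit() else None


def parse_epmc_core(payload):
    """Extract (title, authors, pub_year, doi) from a search response dict."""
    results = payload.get("resultList", {}).get("result", [])
    if not results:
        return None, None, None, None
    top = results[0]
    # Empty strings are mapped to None so missing fields look the same everywhere.
    return (
        top.get("title") or None,
        top.get("authorString") or None,
        top.get("pubYear") or None,
        top.get("doi") or None,
    )


# Hand-written payload for illustration only; not a real API response.
sample = {
    "resultList": {
        "result": [
            {"title": "Example title", "authorString": "Doe J.", "pubYear": "2021"}
        ]
    }
}
print(normalize_pmid(33465072.0))
print(parse_epmc_core(sample))
```

With helpers like these, `fetch_epmc_metadata` would only need to perform the request and hand `response.json()` to the parser.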


In [3]:
import json

# 1. Integrate User OIDs
users_file = "dome_users_20260130.json"
try:
    print(f"Loading users from {users_file}...")
    with open(users_file, 'r', encoding='utf-8') as f:
        users_data = json.load(f)
        
    # Create Email -> OID mapping
    email_to_oid = {}
    for u in users_data:
        email = u.get('email')
        oid = u.get('_id', {}).get('$oid')
        if email and oid:
            email_to_oid[email.strip()] = oid
            
    # Apply to DataFrame
    # Target column is 'Indirizzo email' based on file inspection
    if 'Indirizzo email' in df_recs.columns:
        df_recs['User_OID'] = df_recs['Indirizzo email'].apply(lambda x: email_to_oid.get(str(x).strip(), 'Unknown'))
        print("User OIDs mapped successfully.")
    else:
        print("Warning: 'Indirizzo email' column not found in TSV.")

except Exception as e:
    print(f"Error mapping users: {e}")

# 2. Reorder columns (EPMC info after PMID)
cols = list(df_recs.columns)
if 'PMID' in cols:
    pmid_idx = cols.index('PMID')
    
    # Remove EPMC cols if they are in the list currently
    active_new_cols = [c for c in new_cols if c in cols]
    for col in active_new_cols:
        cols.remove(col)
        
    # Insert them back after PMID
    for i, col in enumerate(active_new_cols):
        cols.insert(pmid_idx + 1 + i, col)
        
    df_recs = df_recs[cols]

# 3. Save (Overwriting/Creating the single enriched file)
output_file = "Dome-Recommendations-Annotated-Articles_20250202_Enriched.tsv"
df_recs.to_csv(output_file, sep='\t', index=False)
print(f"Saved enriched data with User OIDs to {output_file}")

Loading users from dome_users_20260130.json...
User OIDs mapped successfully.
Saved enriched data with User OIDs to Dome-Recommendations-Annotated-Articles_20250202_Enriched.tsv
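
The mapping above matches e-mail addresses exactly and reports success even when some rows fall back to `'Unknown'`. The following sketch shows one possible way to list the addresses that did not match. Case-insensitive matching is an assumption here (the cell above is case-sensitive), and `build_email_index`, `map_emails`, and the sample records are hypothetical.

```python
def build_email_index(users):
    """Map lower-cased, stripped e-mail addresses to each user's '$oid' string."""
    index = {}
    for user in users:
        email = user.get("email")
        oid = (user.get("_id") or {}).get("$oid")
        if email and oid:
            index[email.strip().lower()] = oid
    return index


def map_emails(emails, index):
    """Return (oids, unmatched): one OID or 'Unknown' per e-mail, plus sorted misses."""
    oids = []
    unmatched = set()
    for raw in emails:
        key = str(raw).strip().lower()
        oid = index.get(key, "Unknown")
        if oid == "Unknown":
            unmatched.add(key)
        oids.append(oid)
    return oids, sorted(unmatched)


# Made-up records in the same shape as the user export; values are placeholders.
users = [{"_id": {"$oid": "000000000000000000000001"}, "email": "Curator@Example.org "}]
index = build_email_index(users)
oids, unmatched = map_emails(["curator@example.org", "missing@example.org"], index)
print(oids)
print(unmatched)
```

Printing `unmatched` before saving makes it visible how many annotations could not be tied to a registry user.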


In [4]:
# 4. Standardize Column Names and Sync with JSON Schema

raw_reviews_file = "dome_review_raw_human_20260128.json"

# Known TSV -> DOME JSON Key Mapping
# Based on the TSV headers provided and DOME schema conventions
column_map = {
    'Journal name': 'publication/journal',
    'Publication year': 'publication/year',
    
    'Provenance': 'dataset/provenance',
    'Dataset splits': 'dataset/splits',
    'Redundancy between data splits': 'dataset/redundancy',
    'Availability of data': 'dataset/availability',
    
    'Algorithm': 'optimization/algorithm',
    'Meta-predictions': 'optimization/meta',
    'Data encoding': 'optimization/encoding',
    'Parameters': 'optimization/parameters',
    'Features': 'optimization/features',
    'Fitting': 'optimization/fitting',
    'Regularization': 'optimization/regularization',
    'Availability of configuration': 'optimization/config',
    
    'Interpretability': 'model/interpretability',
    'Output': 'model/output',
    'Execution time': 'model/duration',
    'Availability of software': 'model/availability',
    
    'Evaluation method': 'evaluation/method',
    'Performance measures': 'evaluation/measure',
    'Comparison': 'evaluation/comparison',
    'Confidence': 'evaluation/confidence',
    'Availability of evaluation': 'evaluation/availability',
    
    # Metadata/Extra
    'Informazioni cronologiche': 'timestamp',
    'Indirizzo email': 'user_email'
}

# 1. Rename Columns
print("Renaming TSV columns to match DOME JSON schema...")
df_recs.rename(columns=column_map, inplace=True)

# 2. Extract Full Schema from JSON
# We flatten the keys from the first few records to get a list of all possible "section/field" keys
all_json_keys = set()
try:
    with open(raw_reviews_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        
    for entry in data[:10]: # Check first 10 to cover bases
        for section, content in entry.items():
            if isinstance(content, dict):
                for field in content.keys():
                    if not field.startswith('$'): # Skip mongo internal keys
                        all_json_keys.add(f"{section}/{field}")
            else:
                if not section.startswith('_'):
                    all_json_keys.add(section)
                    
    print(f"Detected {len(all_json_keys)} standard schema keys from JSON.")
    
except Exception as e:
    print(f"Error reading JSON schema: {e}")
    # Fallback default keys if file read fails
    all_json_keys = {
        'publication/title', 'publication/authors', 'publication/doi', 'publication/year', 'publication/journal',
        'publication/tags',
        'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability',
        'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 
        'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config',
        'model/interpretability', 'model/output', 'model/duration', 'model/availability',
        'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availability'
    }

# 3. Add Missing Schema Columns
for key in all_json_keys:
    if key not in df_recs.columns:
        df_recs[key] = None # Add as empty

# 4. Remove unwanted columns
# Select columns whose names end in '/skip' or '/done'
cols_to_drop = [c for c in df_recs.columns if c.endswith(('/skip', '/done'))]

if 'DOME version' in df_recs.columns:
    cols_to_drop.append('DOME version')

if cols_to_drop:
    print(f"Dropping {len(cols_to_drop)} unwanted columns: {cols_to_drop}")
    df_recs.drop(columns=cols_to_drop, inplace=True)

# Re-Save with Version prefix
output_file_schema = "v1_Dome-Recommendations-Schema_Aligned.tsv"
df_recs.to_csv(output_file_schema, sep='\t', index=False)
print(f"Saved schema-aligned file to {output_file_schema}")
print("Columns:", list(df_recs.columns))

Renaming TSV columns to match DOME JSON schema...
Detected 44 standard schema keys from JSON.
Dropping 11 unwanted columns: ['optimization/done', 'publication/skip', 'publication/done', 'evaluation/done', 'dataset/done', 'model/done', 'optimization/skip', 'evaluation/skip', 'dataset/skip', 'model/skip', 'DOME version']
Saved schema-aligned file to v1_Dome-Recommendations-Schema_Aligned.tsv
Columns: ['timestamp', 'PMID', 'EPMC_title', 'EPMC_authors', 'EPMC_pub_year', 'EPMC_doi', 'publication/journal', 'publication/year', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'Availability of  configuration', 'model/interpretability', 'model/output', 'Execution time ', 'model/availability', 'evaluation/method', 'Performance measures ', 'evaluation/comparison', 'evaluation/confidenc
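
The column list above still contains `'Availability of  configuration'`, `'Execution time '` and `'Performance measures '`. These headers carry extra whitespace, so the exact-match rename did not touch them. A minimal sketch of normalizing headers before applying the map follows; `normalize_header` and `rename_headers` are hypothetical helpers, and only the three affected map entries are repeated here.

```python
import re


def normalize_header(name):
    """Collapse runs of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", str(name)).strip()


def rename_headers(headers, column_map):
    """Apply column_map to normalized headers; unmapped headers keep their normalized form."""
    renamed = []
    for header in headers:
        clean = normalize_header(header)
        renamed.append(column_map.get(clean, clean))
    return renamed


column_map = {
    "Availability of configuration": "optimization/config",
    "Execution time": "model/duration",
    "Performance measures": "evaluation/measure",
}
raw_headers = ["Availability of  configuration", "Execution time ", "Performance measures ", "PMID"]
print(rename_headers(raw_headers, column_map))
# ['optimization/config', 'model/duration', 'evaluation/measure', 'PMID']
```

On a DataFrame the same idea would be `df_recs.columns = [normalize_header(c) for c in df_recs.columns]` ahead of the `rename` call.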

In [5]:
# 5. Merge EPMC Data and Finalize Order

print("Merging EPMC data into main schema columns...")

# List of merge pairs: (Source, Target)
merge_pairs = [
    ('EPMC_title', 'publication/title'),
    ('EPMC_authors', 'publication/authors'),
    ('EPMC_doi', 'publication/doi')
]

merged_sources = []
for src, tgt in merge_pairs:
    if src in df_recs.columns and tgt in df_recs.columns:
        # Overwrite the schema column with the value fetched from Europe PMC.
        # To fill only missing values instead, use df_recs[tgt].fillna(df_recs[src]).
        df_recs[tgt] = df_recs[src]
        merged_sources.append(src)
        print(f"Merged {src} -> {tgt}")
    else:
        print(f"Skipping merge {src} -> {tgt} (Column missing)")

# Cleanup: drop only the temporary EPMC columns that were actually merged,
# so a skipped merge does not discard the fetched data
cols_to_drop = merged_sources
df_recs.drop(columns=cols_to_drop, inplace=True, errors='ignore')
print(f"Dropped source columns: {cols_to_drop}")


# Define the canonical field order based on DOME JSON structure
field_order = [
    # Metadata (Note: EPMC columns removed/moved)
    'user_email', 'User_OID', 'timestamp', 'PMID',
    
    # Publication
    'publication/title', 
    'publication/authors', 
    'publication/journal', 
    'publication/year', 
    'EPMC_pub_year', # Moved here for ordering
    'publication/doi',
    'publication/tags',
    
    # Dataset
    'dataset/provenance', 
    'dataset/splits', 
    'dataset/redundancy', 
    'dataset/availability',
    
    # Optimization
    'optimization/algorithm', 
    'optimization/meta', 
    'optimization/encoding', 
    'optimization/parameters', 
    'optimization/features', 
    'optimization/fitting', 
    'optimization/regularization', 
    'optimization/config',
    
    # Model
    'model/interpretability', 
    'model/output', 
    'model/duration', 
    'model/availability',
    
    # Evaluation
    'evaluation/method', 
    'evaluation/measure', 
    'evaluation/comparison', 
    'evaluation/confidence', 
    'evaluation/availability'
]

# Append any remaining columns that weren't explicitly ordered (just in case)
for col in df_recs.columns:
    if col not in field_order:
        field_order.append(col)

# Reindex the DataFrame
final_cols = [c for c in field_order if c in df_recs.columns]
df_recs = df_recs[final_cols]

# Final Save
output_file_final = "v4_Dome-Recommendations-Final_Merged.tsv"
df_recs.to_csv(output_file_final, sep='\t', index=False)
print(f"Saved final merged file to {output_file_final}")
print("Final Column Order:", list(df_recs.columns))

Merging EPMC data into main schema columns...
Merged EPMC_title -> publication/title
Merged EPMC_authors -> publication/authors
Merged EPMC_doi -> publication/doi
Dropped source columns: ['EPMC_title', 'EPMC_authors', 'EPMC_doi']
Saved final merged file to v4_Dome-Recommendations-Final_Merged.tsv
Final Column Order: ['user_email', 'User_OID', 'timestamp', 'PMID', 'publication/title', 'publication/authors', 'publication/journal', 'publication/year', 'EPMC_pub_year', 'publication/doi', 'publication/tags', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config', 'model/interpretability', 'model/output', 'model/duration', 'model/availability', 'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availability', '
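
The final order keeps `EPMC_pub_year` next to `publication/year`, which makes the two easy to compare. In the preview earlier in this notebook, PMID 33679869 has a curated year of 2021 and a Europe PMC year of 2020. The sketch below flags such disagreements on plain records; `as_year` and `year_mismatches` are hypothetical helpers, and the two sample records repeat only values shown in that preview.

```python
def as_year(value):
    """Coerce 2021, 2021.0, '2021' or ' 2021 ' to int; return None for missing values."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def year_mismatches(records, curated_key="publication/year",
                    epmc_key="EPMC_pub_year", id_key="PMID"):
    """Return (id, curated_year, epmc_year) for records where both years exist and differ."""
    mismatches = []
    for rec in records:
        curated = as_year(rec.get(curated_key))
        fetched = as_year(rec.get(epmc_key))
        if curated is None or fetched is None:
            continue  # nothing to compare
        if curated != fetched:
            mismatches.append((rec.get(id_key), curated, fetched))
    return mismatches


records = [
    {"PMID": 33465072, "publication/year": 2021, "EPMC_pub_year": "2021"},
    {"PMID": 33679869, "publication/year": 2021, "EPMC_pub_year": "2020"},
]
print(year_mismatches(records))
# [(33679869, 2021, 2020)]
```

For the DataFrame, `df_recs.to_dict("records")` yields records in this shape.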

In [6]:
# 6. Standardization: Map Legacy IDs and Timestamps

print("Standardizing ID and Timestamp columns...")

def promote_column(df, source, target):
    """Rename `source` to `target`, first dropping an existing `target` placeholder."""
    # Check the live columns; leave the placeholder alone if there is nothing to replace it
    if source not in df.columns:
        print(f"Warning: '{source}' column not found.")
        return df
    if target in df.columns:
        print(f"Dropping placeholder '{target}' column...")
        df = df.drop(columns=[target])
    print(f"Renaming '{source}' -> '{target}'...")
    return df.rename(columns={source: target})

# 1. Handle PMID -> publication/pmid
df_recs = promote_column(df_recs, 'PMID', 'publication/pmid')

# 2. Handle timestamp -> update
df_recs = promote_column(df_recs, 'timestamp', 'update')

# Save Version 5 (stacking on previous v4)
output_file_v5 = "v5_Dome-Recommendations-Standardized_Columns.tsv"
df_recs.to_csv(output_file_v5, sep='\t', index=False)
print(f"Saved standardized data to {output_file_v5}")
print("Current Columns:", list(df_recs.columns))

Standardizing ID and Timestamp columns...
Dropping placeholder 'publication/pmid' column...
Renaming 'PMID' -> 'publication/pmid'...
Dropping placeholder 'update' column...
Renaming 'timestamp' -> 'update'...
Saved standardized data to v5_Dome-Recommendations-Standardized_Columns.tsv
Current Columns: ['user_email', 'User_OID', 'update', 'publication/pmid', 'publication/title', 'publication/authors', 'publication/journal', 'publication/year', 'EPMC_pub_year', 'publication/doi', 'publication/tags', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config', 'model/interpretability', 'model/output', 'model/duration', 'model/availability', 'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availability', 'Availab

In [None]:
# 7. Standardization: Data Migration for Config, Duration, and Measure

print("Migrating legacy data to schema columns...")

# List of migration mappings: (Legacy Source Column, Target Schema Column)
# precise names taken from dataframe columns
migration_map = [
    ('Availability of  configuration', 'optimization/config'),
    ('Execution time ', 'model/duration'),
    ('Performance measures ', 'evaluation/measure')
]

for legacy_col, target_col in migration_map:
    # Ensure source exists
    if legacy_col in df_recs.columns:
        # We want to fill the target with legacy data. 
        # If target exists, we overwrite. If not, we rename (less likely if schema enforced, but safe).
        if target_col in df_recs.columns:
            print(f"Migrating '{legacy_col}' -> '{target_col}'...")
            df_recs[target_col] = df_recs[legacy_col]
            
            # Drop the legacy column
            df_recs.drop(columns=[legacy_col], inplace=True)
            print(f"Dropped legacy column '{legacy_col}'")
        else:
            print(f"Target column '{target_col}' missing. Renaming '{legacy_col}' to '{target_col}'.")
            df_recs.rename(columns={legacy_col: target_col}, inplace=True)
    else:
        print(f"Legacy column '{legacy_col}' not found in dataframe.")

# Save Version 6
output_file_v6 = "v6_Dome-Recommendations-Migrated_Legacy_Data.tsv"
df_recs.to_csv(output_file_v6, sep='\t', index=False)
print(f"Saved migrated data to {output_file_v6}")
print("Current Columns:", list(df_recs.columns))

Migrating legacy data to schema columns...
Legacy column 'Availability of configuration' not found in dataframe.
Legacy column 'Execution time' not found in dataframe.
Legacy column 'Performance measure' not found in dataframe.
Saved migrated data to v6_Dome-Recommendations-Migrated_Legacy_Data.tsv
Current Columns: ['user_email', 'User_OID', 'update', 'publication/pmid', 'publication/title', 'publication/authors', 'publication/journal', 'publication/year', 'EPMC_pub_year', 'publication/doi', 'publication/tags', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config', 'model/interpretability', 'model/output', 'model/duration', 'model/availability', 'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availabi
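
The messages above report names such as `'Performance measure'` as not found, while an earlier column listing shows `'Performance measures '` with a trailing space. As a sketch, near-miss names can be surfaced with the standard library's `difflib.get_close_matches`. `suggest_columns` is a hypothetical helper and the `0.8` cutoff is an arbitrary assumption.

```python
import difflib


def suggest_columns(wanted, available, cutoff=0.8):
    """For each wanted name absent from `available`, list the closest available names."""
    suggestions = {}
    for name in wanted:
        if name not in available:
            suggestions[name] = difflib.get_close_matches(name, available, n=3, cutoff=cutoff)
    return suggestions


# Header spellings as they appear in the column listing earlier in this notebook
available = ["Availability of  configuration", "Execution time ", "Performance measures ", "PMID"]
# Names as they appear in the "not found" messages
wanted = ["Availability of configuration", "Execution time", "Performance measure"]

for missing, candidates in suggest_columns(wanted, available).items():
    print(f"{missing!r} not found; closest: {candidates}")
```

Running such a check before a migration step shows whether a "not found" message comes from a truly absent column or from a whitespace or spelling difference.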