# DOME Registry Pre-Repair Analysis

This notebook is designed to analyze the DOME registry data, search for specific papers, and cross-reference user metadata before any DSW (Daily Study Workflow) repair operations are performed. It provides tools to verify the state of annotations and curator mappings.


In [38]:
import requests
import time
import pandas as pd

def fetch_epmc_metadata(pmid):
    """
    Fetches Title, Authors, PubYear, and DOI from Europe PMC API for a given PMID.
    """
    if not pmid or str(pmid) == 'nan':
        return None, None, None, None
        
    url = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
    params = {
        'query': f'EXT_ID:{pmid} SRC:MED',
        'format': 'json',
        'resultType': 'core'
    }
    
    try:
        # A timeout keeps a stalled connection from blocking the whole loop
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        
        result_list = data.get('resultList', {}).get('result', [])
        
        if result_list:
            top_result = result_list[0]
            title = top_result.get('title', '')
            authors = top_result.get('authorString', '')
            pub_year = top_result.get('pubYear', '')
            doi = top_result.get('doi', '')
            return title, authors, pub_year, doi
            
    except Exception as e:
        print(f"Error fetching PMID {pmid}: {e}")
        
    return None, None, None, None

# Load the TSV
tsv_file = "Dome-Recommendations-Annotated-Articles_20250202.tsv"
print(f"Loading {tsv_file}...")
df_recs = pd.read_csv(tsv_file, sep='\t')

# Ensure columns exist
new_cols = ['EPMC_title', 'EPMC_authors', 'EPMC_pub_year', 'EPMC_doi']
for col in new_cols:
    if col not in df_recs.columns:
        df_recs[col] = None

# Iterate and Enrich
print("Enriching data with Europe PMC API...")
total = len(df_recs)

for index, row in df_recs.iterrows():
    pmid = row.get('PMID')
    
    # Skip rows with no PMID.
    # Rows that were already enriched are not skipped; they are fetched again.
    if pd.isna(pmid):
        continue
        
    print(f"Processing {index + 1}/{total}: PMID {pmid}", end='\r')
    
    title, authors, year, doi = fetch_epmc_metadata(pmid)
    
    df_recs.at[index, 'EPMC_title'] = title
    df_recs.at[index, 'EPMC_authors'] = authors
    df_recs.at[index, 'EPMC_pub_year'] = year
    df_recs.at[index, 'EPMC_doi'] = doi
    
    # Be nice to the API
    time.sleep(0.1)

print("\nEnrichment complete.")
df_recs.head()

Loading Dome-Recommendations-Annotated-Articles_20250202.tsv...
Enriching data with Europe PMC API...
Processing 188/188: PMID 24977146
Enrichment complete.


Unnamed: 0,Informazioni cronologiche,PMID,Journal name,Publication year,DOME version,Provenance,Dataset splits,Redundancy between data splits,Availability of data,Algorithm,...,Evaluation method,Performance measures,Comparison,Confidence,Availability of evaluation,Indirizzo email,EPMC_title,EPMC_authors,EPMC_pub_year,EPMC_doi
0,3/23/2022 16:36:18,33465072,PLoS Comput Biol.,2021,1.0,"yes, N_pos: 13128 N_neg: 32766",5-fold cross-validation,Correlation Feature Selection is used.,"yes, described at S1 table: https://deposition...","Naive Bayes, SVM, Random Forests",...,5-fold cross validation,ROC curves and AUC values,,,,andrewhatos@gmail.com,Genome-wide prediction of topoisomerase IIβ bi...,"Martínez-García PM, García-Torres M, Divina F,...",2021,10.1371/journal.pcbi.1007814
1,3/28/2022 0:44:11,33679869,Front Genet.,2021,1.0,public databases: 901 samples from The Cancer ...,10-fold cross validation,,"no, they claim: ""Publicly available datasets w...","Spathial, Random forest, LASSO",...,10-fold cross validation,"AUCs for 2-, 3-, and 5-year OS were 0.527, 0.5...",no,no,no,andrewhatos@gmail.com,Construction and Comprehensive Analyses of a M...,"Sun S, Fei K, Zhang G, Wang J, Yang Y, Guo W, ...",2020,10.3389/fgene.2020.617174
2,3/28/2022 16:40:10,34419924,EBioMedicine.,2021,1.0,samples from Stanford Health Care and Stanford...,The final analysis included for discovery coho...,,Conatact (hoganca@stanford.edu) is provided ri...,gradient boosted decision trees and random for...,...,novel experiments.,"AUC, sensitivity, specificity",costeffective comare to PCR tests and could be...,,"yes, https://github.com/stanfordmlgroup/influe...",andrewhatos@gmail.com,Nasopharyngeal metabolomics and machine learni...,"Hogan CA, Rajpurkar P, Sowrirajan H, Phillips ...",2021,10.1016/j.ebiom.2021.103546
3,3/29/2022 23:01:05,34112769,Nat Commun .,2021,1.0,Clinical data collected by the authors N_neg 3...,N_pos: Discovery 227 and Validation 77,,All data included in this study is available u...,composite model with several stepp with diffe...,...,independent dataset,"Accuracy (94.81% for the C4 prediction model, ...",no,no,no,andrewhatos@gmail.com,A new molecular classification to drive precis...,"Soret P, Le Dantec C, Desvaux E, Foulquier N, ...",2021,10.1038/s41467-021-23472-7
4,3/28/2022 12:23:43,32915751,IEEE J Biomed Health Inform.,2020,1.0,two public COVID-19 CT datasets: - https:/...,no,no,preprocessed data: https://drive.google.com/fi...,"deep learning, novel approach",...,four-fold cross-validation,"Accuracy, F1 score, Sensitivity, Precision, AUC",firstly comparision to COVID-Net method https:...,outperforming the original COVID-Net trained o...,no,andrewhatos@gmail.com,Contrastive Cross-Site Learning With Redesigne...,"Wang Z, Liu Q, Dou Q.",2020,10.1109/jbhi.2020.3023246
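
In the preview above, row 1 (PMID 33679869) has a curated `Publication year` of 2021 while `EPMC_pub_year` is 2020. A minimal sketch of how such disagreements could be listed before any repair step is shown here. The helper name `flag_year_mismatches` and the toy frame are illustrative and not part of the notebook; the column names are taken from the TSV.

```python
import pandas as pd

def flag_year_mismatches(df, curated_col="Publication year", epmc_col="EPMC_pub_year"):
    """Return rows where both years are present but differ (hypothetical helper)."""
    # Coerce to numbers so that "2021" and 2021 compare as equal
    curated = pd.to_numeric(df[curated_col], errors="coerce")
    fetched = pd.to_numeric(df[epmc_col], errors="coerce")
    mask = curated.notna() & fetched.notna() & (curated != fetched)
    return df.loc[mask, ["PMID", curated_col, epmc_col]]

# Toy frame built from the first two rows of the preview
sample = pd.DataFrame({
    "PMID": [33465072, 33679869],
    "Publication year": [2021, 2021],
    "EPMC_pub_year": ["2021", "2020"],
})
mismatches = flag_year_mismatches(sample)
print(mismatches)
```

On the real data this would be called as `flag_year_mismatches(df_recs)` right after enrichment, while the original TSV column names are still in place.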


In [39]:
import json

# 1. Integrate User OIDs
users_file = "dome_users_20260202.json"
try:
    print(f"Loading users from {users_file}...")
    with open(users_file, 'r', encoding='utf-8') as f:
        users_data = json.load(f)
        
    # Create Email -> OID mapping
    email_to_oid = {}
    for u in users_data:
        email = u.get('email')
        oid = u.get('_id', {}).get('$oid')
        if email and oid:
            email_to_oid[email.strip()] = oid
            
    # Apply to DataFrame
    # Target column is 'Indirizzo email' based on file inspection
    if 'Indirizzo email' in df_recs.columns:
        df_recs['User_OID'] = df_recs['Indirizzo email'].apply(lambda x: email_to_oid.get(str(x).strip(), 'Unknown'))
        print("User OIDs mapped successfully.")
    else:
        print("Warning: 'Indirizzo email' column not found in TSV.")

except Exception as e:
    print(f"Error mapping users: {e}")

# 2. Reorder columns (EPMC info after PMID)
cols = list(df_recs.columns)
if 'PMID' in cols:
    pmid_idx = cols.index('PMID')
    
    # Remove EPMC cols if they are in the list currently
    active_new_cols = [c for c in new_cols if c in cols]
    for col in active_new_cols:
        cols.remove(col)
        
    # Insert them back after PMID
    for i, col in enumerate(active_new_cols):
        cols.insert(pmid_idx + 1 + i, col)
        
    df_recs = df_recs[cols]

# 3. Save (Overwriting/Creating the single enriched file)
output_file = "Dome-Recommendations-Annotated-Articles_20250202_Enriched.tsv"
df_recs.to_csv(output_file, sep='\t', index=False)
print(f"Saved enriched data with User OIDs to {output_file}")

Loading users from dome_users_20260202.json...
User OIDs mapped successfully.
Saved enriched data with User OIDs to Dome-Recommendations-Annotated-Articles_20250202_Enriched.tsv
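
The cell reports success without saying how many rows fell back to `'Unknown'`. The sketch below shows one way to count them, under the assumption that emails should also be compared case-insensitively (the notebook only strips whitespace). The helper names and the user records are hypothetical.

```python
import pandas as pd

def build_email_to_oid(users):
    """Map normalized email -> OID, skipping records that lack either field."""
    mapping = {}
    for user in users:
        email = user.get("email")
        oid = (user.get("_id") or {}).get("$oid")
        if email and oid:
            # Lower-casing is an assumption added here, not notebook behaviour
            mapping[email.strip().lower()] = oid
    return mapping

def lookup_oid(email, mapping):
    """Return the OID for an email, or 'Unknown' when there is no match."""
    return mapping.get(str(email).strip().lower(), "Unknown")

# Hypothetical user records in the shape the notebook reads from JSON
users = [
    {"_id": {"$oid": "aaa111"}, "email": " Curator@Example.org"},
    {"_id": {"$oid": "bbb222"}},  # no email, so this record is skipped
]
email_to_oid = build_email_to_oid(users)

df = pd.DataFrame({"Indirizzo email": ["curator@example.org", "other@example.org"]})
df["User_OID"] = df["Indirizzo email"].apply(lambda e: lookup_oid(e, email_to_oid))

unmapped = int((df["User_OID"] == "Unknown").sum())
print(f"{unmapped} of {len(df)} rows have no matching user")
```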


In [40]:
# 4. Standardize Column Names and Sync with JSON Schema

raw_reviews_file = "dome_review_raw_human_20260202.json"

# Known TSV -> DOME JSON Key Mapping
# Based on the TSV headers provided and DOME schema conventions
column_map = {
    'Journal name': 'publication/journal',
    'Publication year': 'publication/year',
    
    'Provenance': 'dataset/provenance',
    'Dataset splits': 'dataset/splits',
    'Redundancy between data splits': 'dataset/redundancy',
    'Availability of data': 'dataset/availability',
    
    'Algorithm': 'optimization/algorithm',
    'Meta-predictions': 'optimization/meta',
    'Data encoding': 'optimization/encoding',
    'Parameters': 'optimization/parameters',
    'Features': 'optimization/features',
    'Fitting': 'optimization/fitting',
    'Regularization': 'optimization/regularization',
    'Availability of configuration': 'optimization/config',
    
    'Interpretability': 'model/interpretability',
    'Output': 'model/output',
    'Execution time': 'model/duration',
    'Availability of software': 'model/availability',
    
    'Evaluation method': 'evaluation/method',
    'Performance measures': 'evaluation/measure',
    'Comparison': 'evaluation/comparison',
    'Confidence': 'evaluation/confidence',
    'Availability of evaluation': 'evaluation/availability',
    
    # Metadata/Extra
    'Informazioni cronologiche': 'timestamp',
    'Indirizzo email': 'user_email'
}

# 1. Rename Columns
print("Renaming TSV columns to match DOME JSON schema...")
df_recs.rename(columns=column_map, inplace=True)

# 2. Extract Full Schema from JSON
# We flatten the keys from the first few records to get a list of all possible "section/field" keys
all_json_keys = set()
try:
    with open(raw_reviews_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        
    for entry in data[:10]: # Check first 10 to cover bases
        for section, content in entry.items():
            if isinstance(content, dict):
                for field in content.keys():
                    if not field.startswith('$'): # Skip mongo internal keys
                        all_json_keys.add(f"{section}/{field}")
            else:
                if not section.startswith('_'):
                    all_json_keys.add(section)
                    
    print(f"Detected {len(all_json_keys)} standard schema keys from JSON.")
    
except Exception as e:
    print(f"Error reading JSON schema: {e}")
    # Fallback default keys if file read fails
    all_json_keys = {
        'publication/title', 'publication/authors', 'publication/doi', 'publication/year', 'publication/journal',
        'publication/tags',
        'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability',
        'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 
        'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config',
        'model/interpretability', 'model/output', 'model/duration', 'model/availability',
        'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availability'
    }

# 3. Add Missing Schema Columns
for key in all_json_keys:
    if key not in df_recs.columns:
        df_recs[key] = None # Add as empty

# 4. Remove unwanted columns
# Filter for exact columns ending in '/done' or '/skip'
cols_to_drop = [c for c in df_recs.columns if c.endswith('/skip') or c.endswith('/done')]

if 'DOME version' in df_recs.columns:
    cols_to_drop.append('DOME version')

if cols_to_drop:
    print(f"Dropping {len(cols_to_drop)} unwanted columns: {cols_to_drop}")
    df_recs.drop(columns=cols_to_drop, inplace=True)

# Re-Save with Version prefix
output_file_schema = "v1_Dome-Recommendations-Schema_Aligned.tsv"
df_recs.to_csv(output_file_schema, sep='\t', index=False)
print(f"Saved schema-aligned file to {output_file_schema}")
print("Columns:", list(df_recs.columns))

Renaming TSV columns to match DOME JSON schema...
Detected 45 standard schema keys from JSON.
Dropping 11 unwanted columns: ['optimization/done', 'publication/skip', 'publication/done', 'evaluation/done', 'dataset/done', 'model/done', 'optimization/skip', 'evaluation/skip', 'dataset/skip', 'model/skip', 'DOME version']
Saved schema-aligned file to v1_Dome-Recommendations-Schema_Aligned.tsv
Columns: ['timestamp', 'PMID', 'EPMC_title', 'EPMC_authors', 'EPMC_pub_year', 'EPMC_doi', 'publication/journal', 'publication/year', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'Availability of  configuration', 'model/interpretability', 'model/output', 'Execution time ', 'model/availability', 'evaluation/method', 'Performance measures ', 'evaluation/comparison', 'evaluation/confidenc
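
The column listing above still shows `'Availability of  configuration'`, `'Execution time '` and `'Performance measures '` under their TSV names. These headers carry extra whitespace, so they do not match the exact keys in `column_map`. One possible way to avoid this, sketched under the assumption that collapsing whitespace never makes two headers collide, is to normalize the headers before renaming. `normalize_header` is an illustrative name, and only a subset of `column_map` is reproduced.

```python
import re
import pandas as pd

def normalize_header(name):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", str(name)).strip()

# Subset of the notebook's column_map, enough for the demonstration
column_map = {
    "Algorithm": "optimization/algorithm",
    "Availability of configuration": "optimization/config",
    "Execution time": "model/duration",
    "Performance measures": "evaluation/measure",
}

# Empty frame carrying the raw headers exactly as they appear in the TSV
df = pd.DataFrame(columns=[
    "Algorithm",
    "Availability of  configuration",  # double space
    "Execution time ",                 # trailing space
    "Performance measures ",           # trailing space
])

df.columns = [normalize_header(c) for c in df.columns]
df = df.rename(columns=column_map)
print(list(df.columns))
```

With the headers normalized first, the rename covers all four columns and the separate legacy-migration pass would not be needed for them.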

In [41]:
# 5. Merge EPMC Data and Finalize Order

print("Merging EPMC data into main schema columns...")

# List of merge pairs: (Source, Target)
merge_pairs = [
    ('EPMC_title', 'publication/title'),
    ('EPMC_authors', 'publication/authors'),
    ('EPMC_doi', 'publication/doi')
]

for src, tgt in merge_pairs:
    if src in df_recs.columns and tgt in df_recs.columns:
        # Overwrite the schema column with the value fetched from Europe PMC.
        # To fill only missing values instead, use .fillna() on the target column.
        df_recs[tgt] = df_recs[src]
        print(f"Merged {src} -> {tgt}")
    else:
        print(f"Skipping merge {src} -> {tgt} (Column missing)")

# Cleanup: Drop the temporary EPMC columns that were merged
cols_to_drop = [src for src, tgt in merge_pairs]
df_recs.drop(columns=cols_to_drop, inplace=True, errors='ignore')
print(f"Dropped source columns: {cols_to_drop}")


# Define the canonical field order based on DOME JSON structure
field_order = [
    # Metadata (Note: EPMC columns removed/moved)
    'user_email', 'User_OID', 'timestamp', 'PMID',
    
    # Publication
    'publication/title', 
    'publication/authors', 
    'publication/journal', 
    'publication/year', 
    'EPMC_pub_year', # Moved here for ordering
    'publication/doi',
    'publication/tags',
    
    # Dataset
    'dataset/provenance', 
    'dataset/splits', 
    'dataset/redundancy', 
    'dataset/availability',
    
    # Optimization
    'optimization/algorithm', 
    'optimization/meta', 
    'optimization/encoding', 
    'optimization/parameters', 
    'optimization/features', 
    'optimization/fitting', 
    'optimization/regularization', 
    'optimization/config',
    
    # Model
    'model/interpretability', 
    'model/output', 
    'model/duration', 
    'model/availability',
    
    # Evaluation
    'evaluation/method', 
    'evaluation/measure', 
    'evaluation/comparison', 
    'evaluation/confidence', 
    'evaluation/availability'
]

# Append any remaining columns that weren't explicitly ordered (just in case)
for col in df_recs.columns:
    if col not in field_order:
        field_order.append(col)

# Reindex the DataFrame
final_cols = [c for c in field_order if c in df_recs.columns]
df_recs = df_recs[final_cols]

# Final Save
output_file_final = "v4_Dome-Recommendations-Final_Merged.tsv"
df_recs.to_csv(output_file_final, sep='\t', index=False)
print(f"Saved final merged file to {output_file_final}")
print("Final Column Order:", list(df_recs.columns))

Merging EPMC data into main schema columns...
Merged EPMC_title -> publication/title
Merged EPMC_authors -> publication/authors
Merged EPMC_doi -> publication/doi
Dropped source columns: ['EPMC_title', 'EPMC_authors', 'EPMC_doi']
Saved final merged file to v4_Dome-Recommendations-Final_Merged.tsv
Final Column Order: ['user_email', 'User_OID', 'timestamp', 'PMID', 'publication/title', 'publication/authors', 'publication/journal', 'publication/year', 'EPMC_pub_year', 'publication/doi', 'publication/tags', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config', 'model/interpretability', 'model/output', 'model/duration', 'model/availability', 'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availability', '
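
The merge above overwrites each target column outright. If curated values should take precedence, a fill-only variant could look like the sketch below. The toy values are invented for illustration; only the column names come from the notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "publication/title": ["Curated title", None],
    "EPMC_title": ["Fetched title A", "Fetched title B"],
})

# fillna with a Series aligns on the index and fills only the missing cells
df["publication/title"] = df["publication/title"].fillna(df["EPMC_title"])
print(df["publication/title"].tolist())
```

The first row keeps its curated title and only the empty second row takes the fetched value.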

In [42]:
# 6. Standardization: Map Legacy IDs and Timestamps

print("Standardizing ID and Timestamp columns...")

# Check current columns to avoid errors
cols = df_recs.columns

# 1. Handle PMID -> publication/pmid
# Drop the empty placeholder column if it exists
if 'publication/pmid' in cols:
    print("Dropping placeholder 'publication/pmid' column...")
    df_recs.drop(columns=['publication/pmid'], inplace=True)

# Rename the actual data column
if 'PMID' in cols:
    print("Renaming 'PMID' -> 'publication/pmid'...")
    df_recs.rename(columns={'PMID': 'publication/pmid'}, inplace=True)
else:
    print("Warning: 'PMID' column not found.")

# 2. Handle timestamp -> update
# Drop the empty placeholder column if it exists
if 'update' in cols:
    print("Dropping placeholder 'update' column...")
    df_recs.drop(columns=['update'], inplace=True)

# Rename the actual data column
if 'timestamp' in cols:
    print("Renaming 'timestamp' -> 'update'...")
    df_recs.rename(columns={'timestamp': 'update'}, inplace=True)
else:
    print("Warning: 'timestamp' column not found.")

# Save Version 5 (stacking on previous v4)
output_file_v5 = "v5_Dome-Recommendations-Standardized_Columns.tsv"
df_recs.to_csv(output_file_v5, sep='\t', index=False)
print(f"Saved standardized data to {output_file_v5}")
print("Current Columns:", list(df_recs.columns))

Standardizing ID and Timestamp columns...
Dropping placeholder 'publication/pmid' column...
Renaming 'PMID' -> 'publication/pmid'...
Dropping placeholder 'update' column...
Renaming 'timestamp' -> 'update'...
Saved standardized data to v5_Dome-Recommendations-Standardized_Columns.tsv
Current Columns: ['user_email', 'User_OID', 'update', 'publication/pmid', 'publication/title', 'publication/authors', 'publication/journal', 'publication/year', 'EPMC_pub_year', 'publication/doi', 'publication/tags', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config', 'model/interpretability', 'model/output', 'model/duration', 'model/availability', 'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availability', 'Availab

In [43]:
# 7. Standardization: Data Migration for Config, Duration, and Measure

print("Migrating legacy data to schema columns...")

# List of migration mappings: (Legacy Source Column, Target Schema Column)
# precise names taken from dataframe columns
migration_map = [
    ('Availability of  configuration', 'optimization/config'),
    ('Execution time ', 'model/duration'),
    ('Performance measures ', 'evaluation/measure')
]

for legacy_col, target_col in migration_map:
    # Ensure source exists
    if legacy_col in df_recs.columns:
        # We want to fill the target with legacy data. 
        # If target exists, we overwrite. If not, we rename (less likely if schema enforced, but safe).
        if target_col in df_recs.columns:
            print(f"Migrating '{legacy_col}' -> '{target_col}'...")
            df_recs[target_col] = df_recs[legacy_col]
            
            # Drop the legacy column
            df_recs.drop(columns=[legacy_col], inplace=True)
            print(f"Dropped legacy column '{legacy_col}'")
        else:
            print(f"Target column '{target_col}' missing. Renaming '{legacy_col}' to '{target_col}'.")
            df_recs.rename(columns={legacy_col: target_col}, inplace=True)
    else:
        print(f"Legacy column '{legacy_col}' not found in dataframe.")

# Save Version 6
output_file_v6 = "v6_Dome-Recommendations-Migrated_Legacy_Data.tsv"
df_recs.to_csv(output_file_v6, sep='\t', index=False)
print(f"Saved migrated data to {output_file_v6}")
print("Current Columns:", list(df_recs.columns))

Migrating legacy data to schema columns...
Migrating 'Availability of  configuration' -> 'optimization/config'...
Dropped legacy column 'Availability of  configuration'
Migrating 'Execution time ' -> 'model/duration'...
Dropped legacy column 'Execution time '
Migrating 'Performance measures ' -> 'evaluation/measure'...
Dropped legacy column 'Performance measures '
Saved migrated data to v6_Dome-Recommendations-Migrated_Legacy_Data.tsv
Current Columns: ['user_email', 'User_OID', 'update', 'publication/pmid', 'publication/title', 'publication/authors', 'publication/journal', 'publication/year', 'EPMC_pub_year', 'publication/doi', 'publication/tags', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config', 'model/interpretability', 'model/output', 'model/duratio
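
Because the migration overwrites the target column whenever it exists, it may be worth checking first whether any existing target values would be lost. The following is a rough sketch with a hypothetical helper `count_overwrites` and toy data; it is not part of the notebook.

```python
import pandas as pd

def count_overwrites(df, source_col, target_col):
    """Count rows whose existing target value would be replaced by a different one."""
    target = df[target_col]
    source = df[source_col]
    # A populated target that differs from the source, including an empty
    # source, counts as a value that the overwrite would discard
    return int((target.notna() & (target != source)).sum())

# Toy frame: row 0 is identical, row 1 has an empty target, row 2 would lose "y"
toy = pd.DataFrame({
    "model/duration": ["x", None, "y"],
    "Execution time ": ["x", "z", None],
})
lost = count_overwrites(toy, "Execution time ", "model/duration")
print(f"{lost} existing value(s) would be overwritten")
```

A result of zero for each pair in `migration_map` would indicate that the target columns were still empty placeholders when the migration ran.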

In [44]:
# 8. Standardization: Enforce Final Schema Order

print("Enforcing final schema order (restoring missing empty columns)...")

# Define the definitive desired order (updating names to current versions)
target_schema_order = [
    # Metadata
    'user_email', 'User_OID', 'update', 'publication/pmid',
    
    # Publication
    'publication/title', 
    'publication/authors', 
    'publication/journal', 
    'publication/year', 
    'EPMC_pub_year', 
    'publication/doi',
    'publication/tags',
    
    # Dataset
    'dataset/provenance', 
    'dataset/splits', 
    'dataset/redundancy', 
    'dataset/availability',
    
    # Optimization
    'optimization/algorithm', 
    'optimization/meta', 
    'optimization/encoding', 
    'optimization/parameters', 
    'optimization/features', 
    'optimization/fitting', 
    'optimization/regularization', 
    'optimization/config',
    
    # Model
    'model/interpretability', 
    'model/output', 
    'model/duration', 
    'model/availability',
    
    # Evaluation
    'evaluation/method', 
    'evaluation/measure', 
    'evaluation/comparison', 
    'evaluation/confidence', 
    'evaluation/availability'
]

# 1. Add missing columns from schema as empty
for col in target_schema_order:
    if col not in df_recs.columns:
        print(f"Adding missing schema column: {col}")
        df_recs[col] = None  # Add empty column

# 2. Reorder columns
# Identify any extra columns in dataframe not in schema to append at end
extra_cols = [c for c in df_recs.columns if c not in target_schema_order]
if extra_cols:
    print(f"Preserving extra columns: {extra_cols}")

# Create final ordering
final_order = target_schema_order + extra_cols
df_recs = df_recs[final_order]

# Save Version 7
output_file_v7 = "v7_Dome-Recommendations-Final_Schema_Ordered.tsv"
df_recs.to_csv(output_file_v7, sep='\t', index=False)
print(f"Saved final ordered data to {output_file_v7}")
print("Final Columns:", list(df_recs.columns))

Enforcing final schema order (restoring missing empty columns)...
Preserving extra columns: ['reviewState', 'score', 'uuid', 'publication/updated', 'public', 'shortid']
Saved final ordered data to v7_Dome-Recommendations-Final_Schema_Ordered.tsv
Final Columns: ['user_email', 'User_OID', 'update', 'publication/pmid', 'publication/title', 'publication/authors', 'publication/journal', 'publication/year', 'EPMC_pub_year', 'publication/doi', 'publication/tags', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config', 'model/interpretability', 'model/output', 'model/duration', 'model/availability', 'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availability', 'reviewState', 'score', 'uuid', 'publication/upda

In [45]:
# 9. Standardization: Final Reordering (System Columns & Publication Update)

print("Applying final column reordering...")

# Define the precise final order
final_reorder = [
    # System Columns (Moved to start)
    'public', 'shortid', 'uuid', 'reviewState',
    
    # Metadata
    'user_email', 'User_OID', 'update', 'publication/pmid',
    
    # Publication
    'publication/title', 
    'publication/authors', 
    'publication/journal', 
    'publication/year', 
    'EPMC_pub_year', 
    'publication/doi',
    'publication/updated', # Moved/Added here
    'publication/tags',
    
    # Dataset
    'dataset/provenance', 
    'dataset/splits', 
    'dataset/redundancy', 
    'dataset/availability',
    
    # Optimization
    'optimization/algorithm', 
    'optimization/meta', 
    'optimization/encoding', 
    'optimization/parameters', 
    'optimization/features', 
    'optimization/fitting', 
    'optimization/regularization', 
    'optimization/config',
    
    # Model
    'model/interpretability', 
    'model/output', 
    'model/duration', 
    'model/availability',
    
    # Evaluation
    'evaluation/method', 
    'evaluation/measure', 
    'evaluation/comparison', 
    'evaluation/confidence', 
    'evaluation/availability'
]

# Ensure publication/updated exists
if 'publication/updated' not in df_recs.columns:
    print("Adding missing 'publication/updated' column...")
    df_recs['publication/updated'] = None

# Reorder
# Check for any remaining columns not in our explicit list to append at the end
current_cols = df_recs.columns.tolist()
remaining_cols = [c for c in current_cols if c not in final_reorder]

if remaining_cols:
    print(f"Appending remaining columns: {remaining_cols}")

# Construct final list
full_order = final_reorder + remaining_cols

# reindex reorders the columns in one step. It does not filter: any name in
# full_order that is absent from the frame is created as an all-NaN column.
df_recs = df_recs.reindex(columns=full_order)

# Save Version 8
output_file_v8 = "v8_Dome-Recommendations-System_Cols_Reordered.tsv"
df_recs.to_csv(output_file_v8, sep='\t', index=False)
print(f"Saved reordered data to {output_file_v8}")
print("Final Columns:", list(df_recs.columns))

Applying final column reordering...
Appending remaining columns: ['score']
Saved reordered data to v8_Dome-Recommendations-System_Cols_Reordered.tsv
Final Columns: ['public', 'shortid', 'uuid', 'reviewState', 'user_email', 'User_OID', 'update', 'publication/pmid', 'publication/title', 'publication/authors', 'publication/journal', 'publication/year', 'EPMC_pub_year', 'publication/doi', 'publication/updated', 'publication/tags', 'dataset/provenance', 'dataset/splits', 'dataset/redundancy', 'dataset/availability', 'optimization/algorithm', 'optimization/meta', 'optimization/encoding', 'optimization/parameters', 'optimization/features', 'optimization/fitting', 'optimization/regularization', 'optimization/config', 'model/interpretability', 'model/output', 'model/duration', 'model/availability', 'evaluation/method', 'evaluation/measure', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/availability', 'score']
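
Steps 8 and 9 reorder columns in two different ways: step 8 uses bracket selection (`df_recs[final_order]`), while step 9 uses `reindex`. The two behave differently when a requested column is absent, as this small sketch on a toy frame illustrates.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# reindex silently creates the missing column 'c' and fills it with NaN
reindexed = df.reindex(columns=["b", "a", "c"])

# Bracket selection refuses to invent a column and raises KeyError instead
try:
    df[["b", "a", "c"]]
    raised = False
except KeyError:
    raised = True

print(list(reindexed.columns), raised)
```

Bracket selection is therefore the stricter choice when a missing column should be treated as an error rather than filled in.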


In [46]:
# 10. Standardization: Final Strict Schema Enforcement (Version 9)

print("Applying STRICT final column reordering and validation...")

# The definitive DOME Schema Order
strict_order = [
    '_id/$oid',
    'dataset/availability',
    'dataset/provenance',
    'dataset/redundancy',
    'dataset/splits',
    'dataset/done',
    'dataset/skip',
    'evaluation/availability',
    'evaluation/comparison',
    'evaluation/confidence',
    'evaluation/measure',
    'evaluation/method',
    'evaluation/done',
    'evaluation/skip',
    'model/availability',
    'model/duration',
    'model/interpretability',
    'model/output',
    'model/done',
    'model/skip',
    'optimization/algorithm',
    'optimization/config',
    'optimization/encoding',
    'optimization/features',
    'optimization/fitting',
    'optimization/meta',
    'optimization/parameters',
    'optimization/regularization',
    'optimization/done',
    'optimization/skip',
    'user/$oid',
    'publication/pmid',
    'publication/updated',
    'publication/authors',
    'publication/journal',
    'publication/title',
    'publication/doi',
    'publication/year',
    'publication/done',
    'publication/skip',
    'publication/tags',
    'public',
    'created/$date',
    'updated/$date',
    'uuid',
    'reviewState',
    'shortid',
    'update'
]

# Validation: Check for missing columns and add them
missing_cols = []
for col in strict_order:
    if col not in df_recs.columns:
        missing_cols.append(col)
        df_recs[col] = None # Add missing column as empty

if missing_cols:
    print(f"Added {len(missing_cols)} missing columns empty to match schema: {missing_cols}")

# Map intermediate column names to their final schema names.
# User_OID (added in step 1) carries the value for user/$oid.
# user_email is not in strict_order, so the reindex below drops it.

if 'User_OID' in df_recs.columns:
    print("Mapping User_OID -> user/$oid")
    df_recs['user/$oid'] = df_recs['User_OID']

# Note: previous steps created 'update', 'publication/pmid', 'publication/updated' so those are good.

# Apply the strict reorder
# reindex keeps only the columns listed in strict_order, in that order.
# Any column outside the list is dropped, so such columns are reported first.

current_cols = df_recs.columns.tolist()
extra_cols_found = [c for c in current_cols if c not in strict_order]
if extra_cols_found:
    print(f"Warning: The following columns are present but NOT in the strict schema list and will be dropped: {extra_cols_found}")

# Create Version 9 with strict order
df_v9 = df_recs.reindex(columns=strict_order)

# Save Version 9
output_file_v9 = "v9_Dome-Recommendations-Strict_Schema_Ordered.tsv"
df_v9.to_csv(output_file_v9, sep='\t', index=False)
print(f"Saved strict schema ordered data to {output_file_v9}")
print("Final Columns:", list(df_v9.columns))

Applying STRICT final column reordering and validation...
Added 14 missing columns empty to match schema: ['_id/$oid', 'dataset/done', 'dataset/skip', 'evaluation/done', 'evaluation/skip', 'model/done', 'model/skip', 'optimization/done', 'optimization/skip', 'user/$oid', 'publication/done', 'publication/skip', 'created/$date', 'updated/$date']
Mapping User_OID -> user/$oid
Saved strict schema ordered data to v9_Dome-Recommendations-Strict_Schema_Ordered.tsv
Final Columns: ['_id/$oid', 'dataset/availability', 'dataset/provenance', 'dataset/redundancy', 'dataset/splits', 'dataset/done', 'dataset/skip', 'evaluation/availability', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/measure', 'evaluation/method', 'evaluation/done', 'evaluation/skip', 'model/availability', 'model/duration', 'model/interpretability', 'model/output', 'model/done', 'model/skip', 'optimization/algorithm', 'optimization/config', 'optimization/encoding', 'optimization/features', 'optimization/fitting', '
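
The strict step adds missing columns and drops extras in place. A read-only check that reports the differences without modifying the frame could be sketched as follows. The helper `compare_to_schema` and the shortened schema are illustrative assumptions, not part of the notebook.

```python
def compare_to_schema(columns, schema):
    """Report missing columns, extra columns, and whether shared columns are in schema order."""
    columns = list(columns)
    missing = [c for c in schema if c not in columns]
    extra = [c for c in columns if c not in schema]
    # Compare the relative order of the columns that appear in both lists
    shared_in_frame_order = [c for c in columns if c in schema]
    shared_in_schema_order = [c for c in schema if c in columns]
    return {
        "missing": missing,
        "extra": extra,
        "order_ok": shared_in_frame_order == shared_in_schema_order,
    }

# Shortened schema and column list, for illustration only
schema = ["_id/$oid", "publication/pmid", "publication/title"]
columns = ["publication/title", "publication/pmid", "score"]

report = compare_to_schema(columns, schema)
print(report)
```

Called as `compare_to_schema(df_recs.columns, strict_order)` before the reindex, it would list which columns are about to be added or dropped.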

In [47]:
# 11. Standardization: Add PMCID Column (Version 10)

print("Adding 'publication/pmcid' column...")

# Work with the latest dataframe
df_final = df_v9.copy()

# target position
target_col = 'publication/pmid'
new_col = 'publication/pmcid'

if target_col in df_final.columns:
    # Get index and add 1 to insert after
    col_index = df_final.columns.get_loc(target_col) + 1
    df_final.insert(col_index, new_col, None)
    print(f"Inserted '{new_col}' after '{target_col}'")
else:
    print(f"Target '{target_col}' not found, appending '{new_col}' at end.")
    df_final[new_col] = None

# Save Version 10
output_file_v10 = "v10_Dome-Recommendations-With_PMCID.tsv"
df_final.to_csv(output_file_v10, sep='\t', index=False)
print(f"Saved data with PMCID to {output_file_v10}")
print("Final Columns:", list(df_final.columns))

Adding 'publication/pmcid' column...
Inserted 'publication/pmcid' after 'publication/pmid'
Saved data with PMCID to v10_Dome-Recommendations-With_PMCID.tsv
Final Columns: ['_id/$oid', 'dataset/availability', 'dataset/provenance', 'dataset/redundancy', 'dataset/splits', 'dataset/done', 'dataset/skip', 'evaluation/availability', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/measure', 'evaluation/method', 'evaluation/done', 'evaluation/skip', 'model/availability', 'model/duration', 'model/interpretability', 'model/output', 'model/done', 'model/skip', 'optimization/algorithm', 'optimization/config', 'optimization/encoding', 'optimization/features', 'optimization/fitting', 'optimization/meta', 'optimization/parameters', 'optimization/regularization', 'optimization/done', 'optimization/skip', 'user/$oid', 'publication/pmid', 'publication/pmcid', 'publication/updated', 'publication/authors', 'publication/journal', 'publication/title', 'publication/doi', 'publication/year', 'pu

In [48]:
import requests
import time
import pandas as pd

# 12. Generate EPMC Source Metadata File

input_v10_file = "v10_Dome-Recommendations-With_PMCID.tsv"
output_epmc_file = "epmc_source_metadata.tsv"

print(f"Loading {input_v10_file} to extract PMIDs...")
try:
    df_v10 = pd.read_csv(input_v10_file, sep='\t')
    # Convert to string, drop NA, get unique
    pmids = df_v10['publication/pmid'].dropna().astype(str).unique()
    # Filter out empty strings or non-numeric looking PMIDs if necessary, but API handles validation
    pmids = [p for p in pmids if p.strip() and p != 'nan']
    print(f"Found {len(pmids)} unique PMIDs to fetch.")
except FileNotFoundError:
    print(f"Error: {input_v10_file} not found. Make sure previous block was run.")
    pmids = []

def fetch_full_epmc_metadata(pmid):
    """
    Fetches comprehensive metadata from Europe PMC.
    """
    # Clean PMID: drop only a trailing ".0" left over from float parsing
    pmid_str = str(pmid).strip()
    if pmid_str.endswith('.0'):
        pmid_str = pmid_str[:-2]
    
    url = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
    query = f'EXT_ID:{pmid_str} SRC:MED'
    params = {'query': query, 'format': 'json', 'resultType': 'core'}
    
    try:
        # A timeout keeps one stalled request from hanging the whole loop
        r = requests.get(url, params=params, timeout=30)
        r.raise_for_status()
        data = r.json()
        result_list = data.get('resultList', {}).get('result', [])
        
        if not result_list:
            return None
            
        item = result_list[0]
        
        # Extract Mesh Terms
        mesh_list = item.get('meshHeadingList', {}).get('meshHeading', [])
        mesh_terms = "; ".join([m.get('descriptorName', '') for m in mesh_list])
        
        # Extract Keywords
        kw_list = item.get('keywordList', {}).get('keyword', [])
        keywords = "; ".join(kw_list) if kw_list else ""
        
        return {
            'publication/pmid': pmid_str,
            'epmc_title': item.get('title'),
            'epmc_authors': item.get('authorString'),
            'epmc_journal': item.get('journalInfo', {}).get('journal', {}).get('title'),
            'epmc_year': item.get('pubYear'),
            'epmc_doi': item.get('doi'),
            'epmc_pmcid': item.get('pmcid'),
            'epmc_mesh': mesh_terms,
            'epmc_keywords': keywords
        }
        
    except Exception as e:
        print(f"Error fetching {pmid}: {e}")
        return None

# Fetch Data
if len(pmids) > 0:
    epmc_data = []
    print("Fetching metadata from Europe PMC...")

    for i, pmid in enumerate(pmids):
        print(f"Processing {i+1}/{len(pmids)}: {pmid}", end='\r')
        meta = fetch_full_epmc_metadata(pmid)
        if meta:
            epmc_data.append(meta)
        time.sleep(0.1) # Rate limit courtesy

    print(f"\nFetched data for {len(epmc_data)} records.")

    # Create DataFrame and Save
    if epmc_data:
        df_epmc = pd.DataFrame(epmc_data)

        # Reorder columns nicely
        cols = ['publication/pmid', 'epmc_pmcid', 'epmc_doi', 'epmc_year', 'epmc_journal', 'epmc_title', 'epmc_authors', 'epmc_mesh', 'epmc_keywords']
        # Ensure all cols exist
        for c in cols:
            if c not in df_epmc.columns:
                df_epmc[c] = None
            
        df_epmc = df_epmc[cols]

        df_epmc.to_csv(output_epmc_file, sep='\t', index=False)
        print(f"Saved EPMC source metadata to {output_epmc_file}")
    else:
        print("No metadata fetched.")
else:
    print("No PMIDs to process.")

Loading v10_Dome-Recommendations-With_PMCID.tsv to extract PMIDs...
Found 176 unique PMIDs to fetch.
Fetching metadata from Europe PMC...
Processing 176/176: 24977146
Fetched data for 176 records.
Saved EPMC source metadata to epmc_source_metadata.tsv
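
The float-suffix cleanup applied to PMIDs above can be isolated in one helper so that every cell normalizes keys the same way. This is a minimal sketch, assuming PMIDs are purely numeric; `normalize_pmid` is a hypothetical name that does not appear elsewhere in this notebook.

```python
import math
import re

# Matches a trailing ".0" (or ".00", ...) left over from float parsing.
_FLOAT_SUFFIX = re.compile(r"\.0+$")

def normalize_pmid(value):
    """Return a PMID as a clean digit string, or '' if missing or not numeric."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    text = _FLOAT_SUFFIX.sub("", str(value).strip())
    return text if text.isdigit() else ""

print(normalize_pmid(24977146.0))
print(normalize_pmid("30201044"))
print(repr(normalize_pmid(float("nan"))))
```

Anchoring the pattern at the end of the string matters: a PMID such as 30201044 contains "0" characters that an unanchored pattern could remove.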


In [49]:
import ipywidgets as widgets
from IPython.display import display

# 13. Interactive Comparison Tool

print("Initializing interactive comparison interface...")

# Load Data (ensure consistent strings for joining)
try:
    df_v10 = pd.read_csv("v10_Dome-Recommendations-With_PMCID.tsv", sep='\t')
    df_epmc = pd.read_csv("epmc_source_metadata.tsv", sep='\t')
    
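    # Pre-processing keys: strip only a trailing ".0" left over from float parsing.
    # An unanchored '.0' is treated as a regex by older pandas and can corrupt PMIDs.
    df_v10['publication/pmid'] = df_v10['publication/pmid'].fillna('').astype(str).str.replace(r'\.0$', '', regex=True)
    df_epmc['publication/pmid'] = df_epmc['publication/pmid'].fillna('').astype(str).str.replace(r'\.0$', '', regex=True)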
    
    # Merge for easier navigation
    # Left join to verify all v10 records, even if EPMC failed
    df_merged = pd.merge(df_v10, df_epmc, on='publication/pmid', how='left')
    
    # Fields to compare
    comparison_fields = [
        ('Title', 'publication/title', 'epmc_title'),
        ('Authors', 'publication/authors', 'epmc_authors'),
        ('Journal', 'publication/journal', 'epmc_journal'),
        ('Year', 'publication/year', 'epmc_year'),
        ('DOI', 'publication/doi', 'epmc_doi')
    ]

except Exception as e:
    print(f"Error loading files: {e}")
    df_merged = pd.DataFrame()

# Navigation logic
current_index = 0
total_records = len(df_merged)

# Widgets
w_output = widgets.Output() # For display area
w_prev = widgets.Button(description="<< Previous", disabled=True)
w_next = widgets.Button(description="Next >>")
w_status = widgets.Label(value=f"Record 1 of {total_records}")

def display_record(index):
    if index < 0 or index >= total_records: return
    
    row = df_merged.iloc[index]
    pmid = row['publication/pmid']
    
    with w_output:
        w_output.clear_output()
        print(f"=== CHECKING PMID: {pmid} ===")
        print("-" * 80)
        print(f"{'FIELD':<15} | {'DOME (v10)':<40} | {'EPMC (Source)':<40}")
        print("-" * 80)
        
        for label, dome_col, epmc_col in comparison_fields:
            full_dome = str(row.get(dome_col, ''))
            full_epmc = str(row.get(epmc_col, ''))
            
            # Simple diff indicator: compare the full values, truncate only for display
            match = " " if full_dome == full_epmc else "*"
            val_dome = full_dome[:40]
            val_epmc = full_epmc[:40]
            print(f"{match} {label:<13} | {val_dome:<40} | {val_epmc:<40}")
            
        print("-" * 80)
        print("Raw EPMC Mesh Terms:")
        print(row.get('epmc_mesh', 'N/A'))

def on_prev(b):
    global current_index
    current_index = max(0, current_index - 1)
    update_ui()

def on_next(b):
    global current_index
    current_index = min(total_records - 1, current_index + 1)
    update_ui()
    
def update_ui():
    w_status.value = f"Record {current_index + 1} of {total_records}"
    w_prev.disabled = (current_index == 0)
    w_next.disabled = (current_index == total_records - 1)
    display_record(current_index)

w_prev.on_click(on_prev)
w_next.on_click(on_next)

# Layout
if not df_merged.empty:
    display(widgets.HBox([w_prev, w_status, w_next]))
    display(w_output)
    display_record(0)
else:
    print("No data available to verify.")

Initializing interactive comparison interface...


HBox(children=(Button(description='<< Previous', disabled=True, style=ButtonStyle()), Label(value='Record 1 of…

Output()
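
The diff indicator in the comparison tool depends on how values are turned into text before they are compared. The sketch below shows one way to do that, assuming that missing values should count as empty and that case and surrounding whitespace should be ignored; `as_text` and `values_match` are hypothetical helpers, not part of the tool above.

```python
import math

def as_text(value):
    """Render a cell as comparable text: missing -> '', 2019.0 -> '2019'."""
    if value is None:
        return ""
    if isinstance(value, float):
        if math.isnan(value):
            return ""
        if value.is_integer():
            return str(int(value))
    return str(value).strip()

def values_match(dome_value, epmc_value):
    """Compare full values, ignoring case and surrounding whitespace."""
    return as_text(dome_value).casefold() == as_text(epmc_value).casefold()

print(values_match(2019.0, 2019))
print(values_match("Food Chem", "food chem "))
print(values_match("x" * 40 + "A", "x" * 40 + "B"))
```

The last call compares two strings that share their first 40 characters, which is the case a comparison on truncated display values would miss.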

In [50]:
# 14. Data Repair: Overwrite DOME with EPMC Metadata (Version 11)

print("Repairing publication metadata with official EPMC data...")

# Start from v10 (already loaded as df_v10, but good to ensure freshness if cells skipped)
# We use df_merged created in previous step or rebuild it if needed
if 'df_merged' not in locals() or df_merged.empty:
    print("re-merging for repair operation...")
    df_v10 = pd.read_csv("v10_Dome-Recommendations-With_PMCID.tsv", sep='\t')
    df_epmc = pd.read_csv("epmc_source_metadata.tsv", sep='\t')
    
    # Pre-processing keys: strip only a trailing ".0" left over from float parsing
    df_v10['publication/pmid'] = df_v10['publication/pmid'].fillna('').astype(str).str.replace(r'\.0$', '', regex=True)
    df_epmc['publication/pmid'] = df_epmc['publication/pmid'].fillna('').astype(str).str.replace(r'\.0$', '', regex=True)
    
    df_merged = pd.merge(df_v10, df_epmc, on='publication/pmid', how='left')

# Create a copy for the new version
df_repaired = df_merged.copy()

# List of fields to overwrite: (Target DOME Col, Source EPMC Col)
overwrite_map = [
    ('publication/pmcid', 'epmc_pmcid'),
    ('publication/journal', 'epmc_journal'),
    ('publication/year', 'epmc_year'),
    # Only PMCID, journal, and year are overwritten here.
    # Title and authors are left unchanged because only these fields were requested.
]

count_repairs = 0
for target, source in overwrite_map:
    # Only overwrite where the EPMC source value is present (not null or blank);
    # otherwise the original DOME value is kept.
    
    mask = df_repaired[source].notna() & (df_repaired[source].astype(str).str.strip() != '')
    
    # Log how many rows have a usable source value (rows whose target already
    # holds the same value are counted too)
    n_changes = mask.sum()
    print(f"Update '{target}' from '{source}': {n_changes} records to be updated.")
    
    # Apply update
    df_repaired.loc[mask, target] = df_repaired.loc[mask, source]
    count_repairs += int(n_changes)

# Cleanup: Remove the appended EPMC columns from the merge (keys starting with epmc_)
epmc_cols = [c for c in df_repaired.columns if c.startswith('epmc_')]
df_repaired.drop(columns=epmc_cols, inplace=True)

# Save Version 11
output_file_v11 = "v11_Dome-Recommendations-EPMC_Metadata_Repaired.tsv"
df_repaired.to_csv(output_file_v11, sep='\t', index=False)
print(f"Saved repaired data to {output_file_v11}")
print("Final Columns:", list(df_repaired.columns))

Repairing publication metadata with official EPMC data...
Update 'publication/pmcid' from 'epmc_pmcid': 169 records to be updated.
Update 'publication/journal' from 'epmc_journal': 188 records to be updated.
Update 'publication/year' from 'epmc_year': 188 records to be updated.
Saved repaired data to v11_Dome-Recommendations-EPMC_Metadata_Repaired.tsv
Final Columns: ['_id/$oid', 'dataset/availability', 'dataset/provenance', 'dataset/redundancy', 'dataset/splits', 'dataset/done', 'dataset/skip', 'evaluation/availability', 'evaluation/comparison', 'evaluation/confidence', 'evaluation/measure', 'evaluation/method', 'evaluation/done', 'evaluation/skip', 'model/availability', 'model/duration', 'model/interpretability', 'model/output', 'model/done', 'model/skip', 'optimization/algorithm', 'optimization/config', 'optimization/encoding', 'optimization/features', 'optimization/fitting', 'optimization/meta', 'optimization/parameters', 'optimization/regularization', 'optimization/done', 'optimiza

 'PMC6708480' 'PMC6690680' 'PMC6242780' 'PMC5821274' 'PMC4606520'
 'PMC7816647' 'PMC5930664' 'PMC1847686' 'PMC4706063' 'PMC2701298'
 'PMC7734183' 'PMC7297119' 'PMC9328381' 'PMC8485143' 'PMC7073919'
 'PMC6902683' 'PMC2752621' 'PMC8336795' 'PMC8225676' 'PMC8093828'
 'PMC8100172' 'PMC7794018' 'PMC6548586' 'PMC5773889' 'PMC7735824'
 'PMC7406221' 'PMC6457539' 'PMC7446623' 'PMC5870574' 'PMC4894951'
 'PMC7707106' 'PMC8843059' 'PMC8067080' 'PMC8469072' 'PMC7473040'
 'PMC7721480' 'PMC7237030' 'PMC6732622' 'PMC6851483' 'PMC6459551'
 'PMC6532836' 'PMC5923460' 'PMC6214495' 'PMC5923460' 'PMC5034704'
 'PMC4460208' 'PMC7237030' 'PMC3009519' 'PMC5656045' 'PMC8230313'
 'PMC5104375' 'PMC2660303' 'PMC4315323' 'PMC7725002' 'PMC7068237'
 'PMC4315436' 'PMC5976622' 'PMC7648120' 'PMC6495231' 'PMC2638158'
 'PMC5738356' 'PMC7648120' 'PMC7333383' 'PMC3864407' 'PMC2665034'
 'PMC2559987' 'PMC6657583' 'PMC6832773' 'PMC6478501' 'PMC8288037'
 'PMC4507953' 'PMC4507953' 'PMC6908647' 'PMC6172579' 'PMC7442807'
 'PMC35422
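
The counts logged in step 14 are the number of rows that have an EPMC value, not the number of rows whose value actually changes. A minimal sketch of an overwrite that reports real changes is shown below; `overwrite_from_source` and the three-row demo frame are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

def overwrite_from_source(df, target, source):
    """Copy `source` into `target` where source is non-empty.

    Returns (new_df, n_changed); n_changed counts rows whose text differs.
    """
    out = df.copy()
    src = out[source]
    has_source = src.notna() & (src.astype(str).str.strip() != "")
    # Compare as text so that 2019 and "2019" are not reported as a change.
    differs = out[target].astype(str).str.strip() != src.astype(str).str.strip()
    changed = has_source & differs
    # Cast to object first so strings can be written into an all-empty column.
    out[target] = out[target].astype(object)
    out.loc[has_source, target] = src[has_source]
    return out, int(changed.sum())

# Illustrative data only
demo = pd.DataFrame({
    "publication/journal": ["Food Chem", "Old Name", None],
    "epmc_journal": ["Food Chem", "EBioMedicine", None],
})
repaired, n_changed = overwrite_from_source(demo, "publication/journal", "epmc_journal")
print(n_changed)
print(repaired["publication/journal"].tolist())
```

Casting the target column to `object` before assignment is a defensive choice: in recent pandas versions, writing strings into a column that was read as all-empty floats can trigger a dtype warning.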

In [51]:
import json
import collections

# 14b. JSON Source Uniqueness Check
# Check DOME Registry JSON for internal duplicates based on Title and Journal (Normalized)
# *Updated*: Only checks entries where "public": true

print("Checking JSON Source Uniqueness (Public Entries Only)...")

json_file = "dome_review_raw_human_20260202.json"

try:
    with open(json_file, 'r', encoding='utf-8') as f:
        json_data = json.load(f)
    print(f"Loaded {len(json_data)} total records from JSON.")
    
    # Filter for Public Entries
    public_entries = [e for e in json_data if e.get('public') is True]
    print(f"Analyzing {len(public_entries)} records marked as 'public': true")
    
    # Normalization Helper
    def normalize_flexible(text):
        if not isinstance(text, str): return ""
        # Keep only alphanumeric characters and lowercase the result, so that
        # differences in capitalization, spacing, and punctuation are ignored.
        return "".join(c for c in text if c.isalnum()).lower()

    # Store entries
    # Key: (Normalized Title, Normalized Journal) -> List of {OID, Original Title, Original Journal}
    groups = collections.defaultdict(list)
    
    for entry in public_entries:
        oid = entry.get('_id', {}).get('$oid', 'N/A')
        title = entry.get('publication', {}).get('title', '')
        journal = entry.get('publication', {}).get('journal', '')
        
        # Create Key
        norm_title = normalize_flexible(title)
        norm_journal = normalize_flexible(journal)
        
        # Only group if we actually have a title (journal might be empty, that's common)
        if norm_title:
            key = (norm_title, norm_journal)
            groups[key].append({
                'oid': oid,
                'title': title,
                'journal': journal
            })
            
    # Find Duplicates
    # Groups with > 1 entry
    dupes = [entries for key, entries in groups.items() if len(entries) > 1]
    
    print(f"\nPotential Duplicates Found (Public Only): {len(dupes)} groups")
    print("-" * 60)
    
    for i, group in enumerate(dupes):
        print(f"Group {i+1} ({len(group)} records):")
        for rec in group:
            print(f"  - OID: {rec['oid']}")
            print(f"    Title:   {rec['title']}")
            print(f"    Journal: {rec['journal']}")
        print("-" * 60)
        
    if not dupes:
        print("PASS: No internal duplicates found in Public entries based on flexible Title + Journal matching.")

except Exception as e:
    print(f"Error checking JSON uniqueness: {e}")

Checking JSON Source Uniqueness (Public Entries Only)...
Loaded 354 total records from JSON.
Analyzing 281 records marked as 'public': true

Potential Duplicates Found (Public Only): 9 groups
------------------------------------------------------------
Group 1 (2 records):
  - OID: 63516fedb9c880af1f305b1b
    Title:   Novel scaffold of natural compound eliciting sweet taste revealed by machine learning.
    Journal: Food Chem
  - OID: 63516fedb9c880af1f305b29
    Title:   Novel scaffold of natural compound eliciting sweet taste revealed by machine learning.
    Journal: Food Chem
------------------------------------------------------------
Group 2 (2 records):
  - OID: 63516fedb9c880af1f305b25
    Title:   MRI-based machine learning radiomics can predict HER2 expression level and pathologic response after neoadjuvant therapy in HER2 overexpressing breast cancer.
    Journal: EBioMedicine
  - OID: 63516fedb9c880af1f305b72
    Title:   MRI-based machine learning radiomics can predict HE
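
The grouping in 14b relies on a normalized (title, journal) key. A small worked example makes the behavior of that key visible; the three records below are invented for illustration and `normalize_key` mirrors the helper used in the cell above.

```python
from collections import defaultdict

def normalize_key(text):
    """Lowercase and keep only alphanumeric characters."""
    if not isinstance(text, str):
        return ""
    return "".join(ch for ch in text if ch.isalnum()).lower()

# Illustrative records only
records = [
    {"id": "a", "title": "Deep  Learning for X.", "journal": "Food Chem"},
    {"id": "b", "title": "deep learning for x", "journal": "food chem"},
    {"id": "c", "title": "Another paper", "journal": ""},
]

groups = defaultdict(list)
for rec in records:
    key = (normalize_key(rec["title"]), normalize_key(rec["journal"]))
    if key[0]:  # skip records without a usable title
        groups[key].append(rec["id"])

duplicates = [ids for ids in groups.values() if len(ids) > 1]
print(duplicates)
```

Because all punctuation is removed, titles that differ only by a hyphen or a trailing period also collapse to the same key, so flagged groups are candidates to review rather than confirmed duplicates.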

In [52]:
# 14c. Update JSON with Duplicate Info (Version 17)
# Action: Update the source JSON file directly.
#         Append new field 'Duplicate_shortid' to all entries.
#         For Public entries with duplicates, list the conflicting shortids.
#         For others, leave empty.

import json
import collections

# Using the 20260202 file as per global notebook instruction (overriding any legacy filenames)
json_file = "dome_review_raw_human_20260202.json"

print(f"Updating {json_file} with duplicate cross-references...")

try:
    with open(json_file, 'r', encoding='utf-8') as f:
        json_data = json.load(f)
    print(f"Loaded {len(json_data)} records.")

    # 1. Identify Duplicates (Public Only)
    public_entries = [e for e in json_data if e.get('public') is True]
    
    # Normalization
    def normalize_flexible(text):
        if not isinstance(text, str): return ""
        return "".join(c for c in text if c.isalnum()).lower()

    # Grouping
    groups = collections.defaultdict(list)
    for entry in public_entries:
        title = entry.get('publication', {}).get('title', '')
        journal = entry.get('publication', {}).get('journal', '')
        shortid = entry.get('shortid', '')
        
        # We use the normalized Title+Journal as the unique fingerprint
        norm_title = normalize_flexible(title)
        norm_journal = normalize_flexible(journal)
        
        if norm_title:
            key = (norm_title, norm_journal)
            groups[key].append({
                'shortid': shortid,
                'oid': entry.get('_id', {}).get('$oid')
            })

    # 2. Build Mapping: shortid -> [duplicate_shortids]
    # (Only for groups with > 1 entry)
    dupe_map = {}
    
    for key, group in groups.items():
        if len(group) > 1:
            # Get list of all shortids in this group
            # Filter out empty shortids just in case
            all_ids = [m['shortid'] for m in group if m['shortid']]
            
            for member in group:
                my_id = member['shortid']
                if my_id:
                    # Siblings are everyone else in the group
                    siblings = [sid for sid in all_ids if sid != my_id]
                    # Sort for consistency
                    siblings.sort()
                    dupe_map[my_id] = siblings

    # 3. Apply to All JSON Data
    updated_count = 0
    for entry in json_data:
        my_id = entry.get('shortid')
        
        # Look up dupes (returns [] if not found or if private)
        # Note: If an entry is private, it wasn't in 'groups', so it won't be in 'dupe_map'. Correct.
        duplicates = dupe_map.get(my_id, [])
        
        # Update/Add Field
        entry['Duplicate_shortid'] = duplicates
        
        if duplicates:
            updated_count += 1
            
    print(f"Tagged {updated_count} records with duplicate references.")

    # 4. Save back to file
    with open(json_file, 'w', encoding='utf-8') as f:
        json.dump(json_data, f, indent=4)
        
    print(f"Successfully updated {json_file}.")


except Exception as e:
    print(f"Error updating JSON: {e}")

Updating dome_review_raw_human_20260202.json with duplicate cross-references...
Loaded 354 records.
Tagged 19 records with duplicate references.
Successfully updated dome_review_raw_human_20260202.json.
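
Step 14c rewrites the source JSON in place, so an interrupted write could leave a truncated file. One hedged alternative is to write to a temporary file in the same directory and swap it in afterwards; `write_json_atomically` is a hypothetical helper and the file name in the demo is a placeholder.

```python
import json
import os
import tempfile

def write_json_atomically(path, data):
    """Write `data` to a temp file next to `path`, then replace `path` with it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=4)
        # os.replace overwrites the destination in a single step
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file and leave the original untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "example.json")
    write_json_atomically(target, [{"shortid": "abc", "Duplicate_shortid": []}])
    with open(target, encoding="utf-8") as handle:
        print(json.load(handle))
```

Keeping a dated backup copy of the original file before the first in-place update would be a simpler safeguard if atomicity is not a concern.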


In [53]:
# 14d. Create Public-Only JSON Subset
# Filter the source JSON to create a clean public release version.

import json

input_file = "dome_review_raw_human_20260202.json"
output_file = "public_dome_review_raw_human_20260202.json"

print(f"Filtering {input_file} for public entries...")

try:
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        
    # Filter
    public_content = [item for item in data if item.get('public') is True]
    
    # Save
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(public_content, f, indent=4)
        
    print(f"Success! Created '{output_file}' with {len(public_content)} entries.")

except Exception as e:
    print(f"Error creating public subset: {e}")

Filtering dome_review_raw_human_20260202.json for public entries...
Success! Created 'public_dome_review_raw_human_20260202.json' with 281 entries.


In [54]:
# 14e. Intermediate Deduplication (Pre-Validation)
# Remove duplicates from V11 based on Title, prioritizing completeness.

print("Performing intermediate deduplication on V11 data...")

# Normalized Title Helper
def normalize_title_dedup(text):
    if not isinstance(text, str): return ""
    return "".join(c for c in text if c.isalnum()).lower()

# Working on df_repaired from previous step
if 'df_repaired' not in locals():
    # Load if missing
    try:
        df_repaired = pd.read_csv("v11_Dome-Recommendations-EPMC_Metadata_Repaired.tsv", sep='\t')
    except FileNotFoundError:
        print("Error: df_repaired is not in memory and the v11 file was not found.")
        df_repaired = pd.DataFrame()

if not df_repaired.empty:
    # 1. Calculate Completeness Score
    # Count non-null values in columns (excluding system columns if possible, but general is fine)
    df_repaired['temp_completeness_score'] = df_repaired.notna().sum(axis=1)
    
    # 2. Normalize Title
    df_repaired['temp_norm_title'] = df_repaired['publication/title'].apply(normalize_title_dedup)
    
    # 3. Sort by Score (Complete -> Less Complete)
    # We want to keep the MOST complete one, so sort Descending.
    # drop_duplicates(keep='first') will then keep the top one (the most complete).
    df_repaired.sort_values(by='temp_completeness_score', ascending=False, inplace=True)
    
    # 4. Separate Duplicates
    # Identify duplicates on normalized title only. Rows without a title are
    # never treated as duplicates of each other.
    mask_has_title = df_repaired['temp_norm_title'] != ''
    mask_duplicate = mask_has_title & df_repaired.duplicated(subset=['temp_norm_title'], keep='first')
    
    # Keep the first (most complete) row of each title group
    df_clean = df_repaired[~mask_duplicate]
    
    # Identify dropped rows
    dropped_indices = df_repaired.index.difference(df_clean.index)
    df_dropped = df_repaired.loc[dropped_indices]
    
    # 5. Cleanup Helpers
    df_clean = df_clean.drop(columns=['temp_completeness_score', 'temp_norm_title'])
    df_dropped = df_dropped.drop(columns=['temp_completeness_score', 'temp_norm_title'])
    
    # Update the main variable for next steps
    df_repaired = df_clean.copy()
    
    print("-" * 30)
    print(f"Original V11 count: {len(df_clean) + len(df_dropped)}")
    print(f"Dropped duplicates: {len(df_dropped)}")
    print(f"Retained records:   {len(df_clean)}")
    
    # 6. Save
    file_retained = "v11b_Dome-Recommendations-Deduplicated.tsv"
    file_dropped = "v11b_Dome-Recommendations-Deduplicated_Dropped.tsv"
    
    df_clean.to_csv(file_retained, sep='\t', index=False)
    df_dropped.to_csv(file_dropped, sep='\t', index=False)
    print(f"Saved deduplicated data to {file_retained}")
    print(f"Saved dropped duplicates to {file_dropped}")

else:
    print("No data to deduplicate.")

Performing intermediate deduplication on V11 data...
------------------------------
Original V11 count: 188
Dropped duplicates: 12
Retained records:   176
Saved deduplicated data to v11b_Dome-Recommendations-Deduplicated.tsv
Saved dropped duplicates to v11b_Dome-Recommendations-Deduplicated_Dropped.tsv
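
The "keep the most complete row per title" rule from 14e can be checked on a toy frame. This is a sketch under the assumption that rows with no title should never be merged with each other; `dedupe_keep_most_complete` and the demo data are illustrative, not the notebook's own.

```python
import pandas as pd

def normalize_title(text):
    if not isinstance(text, str):
        return ""
    return "".join(ch for ch in text if ch.isalnum()).lower()

def dedupe_keep_most_complete(df, title_col):
    """Return (kept, dropped), keeping the row with the most non-null cells per title."""
    work = df.copy()
    work["_score"] = work.notna().sum(axis=1)
    work["_key"] = work[title_col].apply(normalize_title)
    # Stable sort so ties keep their original order
    work = work.sort_values("_score", ascending=False, kind="stable")
    is_dup = (work["_key"] != "") & work.duplicated(subset="_key", keep="first")
    kept = work.loc[~is_dup].drop(columns=["_score", "_key"]).sort_index()
    dropped = work.loc[is_dup].drop(columns=["_score", "_key"]).sort_index()
    return kept, dropped

# Illustrative data only
demo = pd.DataFrame({
    "publication/title": ["A Study.", "a study", None, None],
    "publication/doi": [None, "10.1/x", "10.1/y", "10.1/z"],
})
kept, dropped = dedupe_keep_most_complete(demo, "publication/title")
print(kept.index.tolist(), dropped.index.tolist())
```

Row 1 wins over row 0 because it has a DOI, and both untitled rows survive because an empty key is excluded from the duplicate check.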


In [55]:
import json

# 15. Validation: Match Records against Original JSON Source (Version 12)
# Revised Logic: Title-Based Validation (Normalized)
# *Updated*: Uses v11b Deduplicated TSV and Public JSON source

print("Validating records based on Normalized Title matching...")

# Using the new Public Subset for validation
raw_reviews_file = "public_dome_review_raw_human_20260202.json"

try:
    with open(raw_reviews_file, 'r', encoding='utf-8') as f:
        json_data = json.load(f)
    print(f"Loaded {len(json_data)} source records from Public JSON.")
except FileNotFoundError:
    print(f"Error: {raw_reviews_file} not found.")
    json_data = []

def normalize_title(text):
    """
    Normalizes a title string for lenient comparison.
    - Lowercase
    - Removes all non-alphanumeric characters (punctuation, spaces)
    """
    if not isinstance(text, str): return ""
    return "".join(c for c in text if c.isalnum()).lower()

# Build Set of Normalized Titles from JSON Source
json_titles = set()
count_titles = 0
for entry in json_data:
    title = entry.get('publication', {}).get('title', '')
    if title:
        norm = normalize_title(title)
        if norm:
            json_titles.add(norm)
            count_titles += 1

print(f"Indexed {len(json_titles)} unique titles from {count_titles} JSON entries.")

# Prepare DataFrame Records
# Try to load v11b if not present in memory
if 'df_repaired' not in locals() or df_repaired.empty: 
    print("Loading v11b from disk...")
    try:
        df_to_validate = pd.read_csv("v11b_Dome-Recommendations-Deduplicated.tsv", sep='\t')
    except FileNotFoundError:
        print("Error: v11b file not found. Run previous step.")
        df_to_validate = pd.DataFrame()
else:
    # Use version in memory (which should be the deduped one from 14e)
    # Double check by size if possible, or just copy
    df_to_validate = df_repaired.copy()

matched_rows = []
unmatched_rows = []

print("Matching records against JSON titles...")

if not df_to_validate.empty:
    for idx, row in df_to_validate.iterrows():
        # Extract Title from DataFrame
        d_title = str(row.get('publication/title', ''))
        
        # Normalize
        d_norm = normalize_title(d_title)
        
        # Check Match: Title must exist in JSON source
        if d_norm and d_norm in json_titles:
            matched_rows.append(row)
        else:
            unmatched_rows.append(row)

# Create Output DataFrames
df_matched = pd.DataFrame(matched_rows)
df_unmatched = pd.DataFrame(unmatched_rows)

print(f"Matching Results: {len(df_matched)} Retained | {len(df_unmatched)} Dropped (No Match)")

# Save V12 Retained
output_file_v12_retained = "v12_Dome-Recommendations-Validated_Retained.tsv"
if not df_matched.empty:
    df_matched.to_csv(output_file_v12_retained, sep='\t', index=False)
    print(f"Saved matched records to {output_file_v12_retained}")
else:
    print("Warning: No records matched!")
    # Save empty structure if dataframe columns exist
    if not df_to_validate.empty:
        df_to_validate.iloc[0:0].to_csv(output_file_v12_retained, sep='\t', index=False)

# Save Dropped Backup
output_file_v12_dropped = "v12_Dome-Recommendations-Validated_Dropped_Backup.tsv"
if not df_unmatched.empty:
    df_unmatched.to_csv(output_file_v12_dropped, sep='\t', index=False)
    print(f"Saved unmatched records to {output_file_v12_dropped}")
else:
    print("All records matched.")
    if not df_to_validate.empty:
        df_to_validate.iloc[0:0].to_csv(output_file_v12_dropped, sep='\t', index=False)

Validating records based on Normalized Title matching...
Loaded 281 source records from Public JSON.
Indexed 271 unique titles from 281 JSON entries.
Matching records against JSON titles...
Matching Results: 143 Retained | 33 Dropped (No Match)
Saved matched records to v12_Dome-Recommendations-Validated_Retained.tsv
Saved unmatched records to v12_Dome-Recommendations-Validated_Dropped_Backup.tsv
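
The row-by-row title check in step 15 can also be expressed as a single boolean mask. The sketch below assumes the same alphanumeric normalization; `split_by_title_match` and the demo values are hypothetical.

```python
import pandas as pd

def normalize_title(text):
    if not isinstance(text, str):
        return ""
    return "".join(ch for ch in text if ch.isalnum()).lower()

def split_by_title_match(df, title_col, source_titles):
    """Split df into (matched, unmatched) by normalized title membership."""
    known = {normalize_title(t) for t in source_titles} - {""}
    keys = df[title_col].apply(normalize_title)
    matched = keys.isin(known)
    return df[matched], df[~matched]

# Illustrative data only
demo = pd.DataFrame({
    "publication/title": ["Deep Learning: A Review", None, "Unrelated"],
})
df_hit, df_miss = split_by_title_match(
    demo, "publication/title", ["deep learning a review."]
)
print(len(df_hit), len(df_miss))
```

Removing the empty string from `known` ensures that rows with a missing title always land in the unmatched set, which is how the loop above treats them.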


In [56]:
import json
import os

# 16. Cleanup: Append Curator Emails (Version 13)
# Processing both Retained and Dropped/Backup V12 files

print("Appending curator emails based on User OID (Version 13)...")

users_file = "dome_users_20260202.json"

# Define the pairs of files to process: (Input V12, Output V13)
files_to_process = [
    ("v12_Dome-Recommendations-Validated_Retained.tsv", "v13_Dome-Recommendations-With_Emails_Retained.tsv"),
    ("v12_Dome-Recommendations-Validated_Dropped_Backup.tsv", "v13_Dome-Recommendations-With_Emails_Dropped.tsv")
]

try:
    # Load User Data Once
    with open(users_file, 'r', encoding='utf-8') as f:
        users_data = json.load(f)
        
    # Create OID -> Email mapping
    oid_to_email = {}
    for u in users_data:
        email = u.get('email')
        oid = u.get('_id', {}).get('$oid')
        if email and oid:
            oid_to_email[oid.strip()] = email.strip()
            
    print(f"Loaded {len(oid_to_email)} user mappings.")

    # Process each file
    for input_file, output_file in files_to_process:
        if not os.path.exists(input_file):
            print(f"Skipping {input_file} (File not found)")
            continue
            
        print(f"Processing {input_file} -> {output_file}...")
        df_temp = pd.read_csv(input_file, sep='\t')
        
        if df_temp.empty:
            print(f"  Warning: {input_file} is empty.")
            
        # Get OIDs from user/$oid and map to email
        # Handle case where user/$oid might be missing or NaN gracefully
        if 'user/$oid' in df_temp.columns:
            df_temp['curator_email'] = df_temp['user/$oid'].apply(lambda x: oid_to_email.get(str(x).strip(), '') if pd.notna(x) else '')
        else:
            print("  Column 'user/$oid' not found; creating empty column.")
            df_temp['curator_email'] = ''

        # Reorder columns: curator_email first
        cols = list(df_temp.columns)
        if 'curator_email' in cols:
            cols.remove('curator_email')
            cols.insert(0, 'curator_email')
            df_temp = df_temp[cols]
        
        # Save
        df_temp.to_csv(output_file, sep='\t', index=False)
        print(f"  Saved {output_file} ({len(df_temp)} records).")

except Exception as e:
    print(f"Error mapping emails: {e}")

Appending curator emails based on User OID (Version 13)...
Loaded 120 user mappings.
Processing v12_Dome-Recommendations-Validated_Retained.tsv -> v13_Dome-Recommendations-With_Emails_Retained.tsv...
  Saved v13_Dome-Recommendations-With_Emails_Retained.tsv (143 records).
Processing v12_Dome-Recommendations-Validated_Dropped_Backup.tsv -> v13_Dome-Recommendations-With_Emails_Dropped.tsv...
  Saved v13_Dome-Recommendations-With_Emails_Dropped.tsv (33 records).
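
Step 16 silently leaves `curator_email` blank when an OID has no entry in the users file. A short sketch that also reports which OIDs failed to map may help when reviewing the output; `map_curator_emails`, the OIDs, and the address below are placeholders.

```python
import pandas as pd

def map_curator_emails(df, oid_to_email, oid_col="user/$oid"):
    """Add curator_email as the first column; return (new_df, unmapped_oids)."""
    def lookup(value):
        if not isinstance(value, str):
            return ""
        return oid_to_email.get(value.strip(), "")

    out = df.copy()
    out.insert(0, "curator_email", out[oid_col].map(lookup))
    missing = out.loc[out["curator_email"] == "", oid_col]
    unmapped = sorted({v.strip() for v in missing if isinstance(v, str)})
    return out, unmapped

# Illustrative data only
oid_to_email = {"64a1": "curator@example.org"}
demo = pd.DataFrame({"user/$oid": ["64a1", " 64a1 ", None, "ffff"]})
with_emails, unmapped = map_curator_emails(demo, oid_to_email)
print(with_emails["curator_email"].tolist())
print(unmapped)
```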


In [62]:
import json
import pandas as pd
import numpy as np
import os

# 17. Data Completeness: Append Missing Entries from Source JSON (Version 14)
# Logic: Simple Title or DOI match. Appending unique missing JSON entries.
# *Updated*: Uses dedup_public_dome_review_raw_human_20260202.json

print("Appending missing records from Deduplicated Public JSON source to create complete dataset (Version 14)...")

input_file = "v12_Dome-Recommendations-Validated_Retained.tsv"
output_file_v14 = "v14_Dome-Recommendations-Complete_From_Source.tsv"
json_file = "dedup_public_dome_review_raw_human_20260202.json"

try:
    # 1. Load v12 Data (Validated)
    if os.path.exists(input_file):
        df_v12 = pd.read_csv(input_file, sep='\t')
    else:
        df_v12 = pd.DataFrame()
        
    print(f"Loaded {len(df_v12)} retained validated records from {input_file}")
    
    # 2. Load JSON Source
    with open(json_file, 'r', encoding='utf-8') as f:
        json_data = json.load(f)
    print(f"Loaded {len(json_data)} total source records from JSON")

    # 3. Build sets of Existing Identifiers from v12 (DOI and Title)
    def normalize_text(text):
        if not isinstance(text, str): return ""
        # Aggressive normalization: alphanumeric only, lowercase
        return "".join(c for c in text if c.isalnum()).lower()

    existing_titles = set()
    existing_dois = set()

    # Populate lookups from the existing dataframe
    for idx, row in df_v12.iterrows():
        # Pass raw values: normalize_text returns "" for NaN, whereas str(NaN) would index "nan"
        t = normalize_text(row.get('publication/title', ''))
        d = normalize_text(row.get('publication/doi', ''))
        
        if t: existing_titles.add(t)
        if d: existing_dois.add(d)

    print(f"Index built: {len(existing_titles)} titles and {len(existing_dois)} DOIs found in v12.")

    # 4. Filter JSON for Missing Entries
    new_rows = []
    
    # Get schema from v12 for flattening
    # We want the new rows to respect the column structure of v12
    if not df_v12.empty:
        schema_cols = list(df_v12.columns)
    else:
        # Fallback if v12 is empty, unlikely but safe
        schema_cols = [] 

    # Identify columns to ignore during flattening (curator_email is removed later anyway, but cleaner to skip)
    cols_to_map = [c for c in schema_cols if c != 'curator_email']

    def flatten_entry(entry, cols):
        row = {}
        for col in cols:
            # Logic to walk keys: "publication/title" -> entry["publication"]["title"]
            keys = col.split('/')
            val = entry
            try:
                for k in keys:
                    val = val[k]
                # If we get a complex object (dict/list) instead of a value, treat as None/Empty
                if isinstance(val, (dict, list)):
                    row[col] = None
                else:
                    row[col] = val
            except (KeyError, TypeError):
                # If path doesn't exist
                row[col] = None
        return row

    count_appended = 0
    
    for entry in json_data:
        # Check if this entry exists in v12 based on simple Title OR DOI match
        j_title = normalize_text(entry.get('publication', {}).get('title', ''))
        j_doi = normalize_text(entry.get('publication', {}).get('doi', ''))
        
        # Match logic: a DOI match OR a title match
        # (if either matches, we assume the record is already present in v12)
        match_found = False
        
        if j_doi and j_doi in existing_dois:
            match_found = True
        elif j_title and j_title in existing_titles:
            match_found = True
            
        if not match_found:
            # It is MISSING -> Flatten and Add to queue
            new_row = flatten_entry(entry, cols_to_map)
            new_rows.append(new_row)
            count_appended += 1
            
    print(f"Identified {count_appended} entries in JSON that are missing from v12.")

    # 5. Concatenate
    if new_rows:
        df_new = pd.DataFrame(new_rows)
        # Ensure new dataframe matches v12 columns (except those we skipped)
        df_final = pd.concat([df_v12, df_new], ignore_index=True)
    else:
        df_final = df_v12.copy()
        
    # 6. Final Cleanup
    if 'curator_email' in df_final.columns:
        df_final.drop(columns=['curator_email'], inplace=True)
        print("Dropped 'curator_email' column.")
        
    # 7. Verification
    print("-" * 30)
    print(f"Total Entries in Source JSON: {len(json_data)}")
    print(f"Total Entries in Final v14:   {len(df_final)}")
    
    # Strict numeric check
    if len(df_final) == len(json_data):
        print("SUCCESS: Final count matches Source JSON count exactly.")
    else:
        diff = len(df_final) - len(json_data)
        print(f"WARNING: Mismatch by {diff} records.")
        if diff > 0:
            print("  (Final has MORE rows than JSON -> Likely duplicates in original v12 TSV)")
        else:
            print("  (Final has FEWER rows than JSON -> Some JSON entries not appended? Check match logic.)")

    # 8. Save
    df_final.to_csv(output_file_v14, sep='\t', index=False)
    print(f"Saved complete dataset to {output_file_v14}")

except Exception as e:
    print(f"Error generating v14: {e}")

Appending missing records from Deduplicated Public JSON source to create complete dataset (Version 14)...
Loaded 143 retained validated records from v12_Dome-Recommendations-Validated_Retained.tsv
Loaded 271 total source records from JSON
Index built: 143 titles and 143 DOIs found in v12.
Identified 128 entries in JSON that are missing from v12.
------------------------------
Total Entries in Source JSON: 271
Total Entries in Final v14:   271
SUCCESS: Final count matches Source JSON count exactly.
Saved complete dataset to v14_Dome-Recommendations-Complete_From_Source.tsv
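
The matching rule used in the cell above (normalized DOI match OR normalized title match) can be sketched on its own with a few toy records. This is a minimal illustration, not the notebook's code path: `find_missing` is a hypothetical helper name, and the three sample entries are made up rather than taken from the registry.

```python
def normalize_text(text):
    """Lowercase and keep only alphanumeric characters; non-strings become ''."""
    if not isinstance(text, str):
        return ""
    return "".join(c for c in text if c.isalnum()).lower()


def find_missing(entries, existing_titles, existing_dois):
    """Return entries whose normalized DOI and title are both absent from the lookups."""
    missing = []
    for entry in entries:
        pub = entry.get("publication", {})
        doi = normalize_text(pub.get("doi"))
        title = normalize_text(pub.get("title"))
        # Empty keys never count as a match
        if (doi and doi in existing_dois) or (title and title in existing_titles):
            continue
        missing.append(entry)
    return missing


# Toy lookups and records (hypothetical values)
existing_titles = {normalize_text("Deep Learning for Proteins")}
existing_dois = {normalize_text("10.1000/ABC.123")}
entries = [
    {"publication": {"title": "deep learning for proteins!", "doi": None}},   # title match
    {"publication": {"title": "Renamed paper", "doi": "10.1000/abc.123"}},    # DOI match
    {"publication": {"title": "A genuinely new paper", "doi": "10.1000/xyz.999"}},
]

missing = find_missing(entries, existing_titles, existing_dois)
print([e["publication"]["title"] for e in missing])  # ['A genuinely new paper']
```

Because the normalization strips punctuation and case, two titles that differ only in formatting are treated as the same paper; genuinely different papers with identical titles would also be merged, which is why the DOI is checked first.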


In [63]:
# 16b. Deduplicate Public JSON Source
# Creates: dedup_public_dome_review_raw_human_20260202.json
# Logic: Group by normalized Title+Journal. Retain only the 'best' entry (highest coverage/length).

import json
import collections

print("Deduplicating Public JSON Source...")

input_json = "public_dome_review_raw_human_20260202.json"
output_json = "dedup_public_dome_review_raw_human_20260202.json"

try:
    with open(input_json, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"Loaded {len(data)} entries from {input_json}.")

    # Helper: Calculate 'Coverage Score' for a JSON entry
    def calculate_json_score(entry):
        # Coverage score = number of non-empty leaf values, counted recursively
        def recurse_count(obj):
            local_score = 0
            if isinstance(obj, dict):
                for v in obj.values():
                    local_score += recurse_count(v)
            elif isinstance(obj, list):
                for item in obj:
                    local_score += recurse_count(item)
            elif obj:  # Truthy leaf: not None, not empty string, not False, not 0
                local_score += 1
            return local_score

        return recurse_count(entry)

    # Normalization
    def normalize_flexible(text):
        if not isinstance(text, str): return ""
        return "".join(c for c in text if c.isalnum()).lower()

    # Grouping
    groups = collections.defaultdict(list)
    for entry in data:
        title = entry.get('publication', {}).get('title', '')
        journal = entry.get('publication', {}).get('journal', '')
        
        norm_title = normalize_flexible(title)
        norm_journal = normalize_flexible(journal)
        
        if norm_title:
            key = (norm_title, norm_journal)
            groups[key].append(entry)
        else:
            # No title: keep the entry as its own group, keyed by its OID.
            # If the OID is missing too, fall back to a key derived from the current
            # group count so that such entries are never merged with each other.
            oid = entry.get('_id', {}).get('$oid') or f"__no_oid_{len(groups)}__"
            groups[('__no_title__', oid)].append(entry)

    # Selection
    final_entries = []
    dropped_count = 0
    
    for key, group in groups.items():
        if len(group) == 1:
            final_entries.append(group[0])
        else:
            # Conflict! Pick best.
            # Sort by score descending
            group.sort(key=calculate_json_score, reverse=True)
            best = group[0]
            final_entries.append(best)
            dropped_count += (len(group) - 1)

    # Save
    with open(output_json, 'w', encoding='utf-8') as f:
        json.dump(final_entries, f, indent=4)
        
    print("-" * 30)
    print(f"Final Count: {len(final_entries)}")
    print(f"Dropped Duplicates: {dropped_count}")
    print(f"Saved deduplicated JSON to {output_json}")

except Exception as e:
    print(f"Error deduplicating JSON: {e}")

Deduplicating Public JSON Source...
Loaded 281 entries from public_dome_review_raw_human_20260202.json.
------------------------------
Final Count: 271
Dropped Duplicates: 10
Saved deduplicated JSON to dedup_public_dome_review_raw_human_20260202.json
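
The "group, score, keep best" logic of this cell can be tried on toy data. The sketch below is an illustration under assumptions: the three records are invented, `coverage_score` and `norm` are shortened stand-ins for the helpers above, and `max()` is used instead of sorting each group (both return the first record among ties).

```python
import collections


def coverage_score(obj):
    """Count non-empty leaf values in a nested dict/list structure."""
    if isinstance(obj, dict):
        return sum(coverage_score(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(coverage_score(v) for v in obj)
    return 1 if obj else 0


def norm(text):
    """Alphanumeric-only, lowercase key; non-strings become ''."""
    if not isinstance(text, str):
        return ""
    return "".join(c for c in text if c.isalnum()).lower()


# Toy records (hypothetical): the first two describe the same paper
records = [
    {"publication": {"title": "A Study", "journal": "J. Test", "doi": ""}, "dataset": {}},
    {"publication": {"title": "a study.", "journal": "J Test", "doi": "10.1000/demo"},
     "dataset": {"size": "100"}},
    {"publication": {"title": "Another Study", "journal": "J. Test"}},
]

# Group by normalized (title, journal)
groups = collections.defaultdict(list)
for rec in records:
    pub = rec.get("publication", {})
    groups[(norm(pub.get("title")), norm(pub.get("journal")))].append(rec)

# Keep the record with the highest coverage score in each group
best = [max(group, key=coverage_score) for group in groups.values()]
print(len(best), [coverage_score(r) for r in records])  # 2 [2, 4, 2]
```

Note that the score counts filled fields only; it does not judge whether the retained values are correct, so a longer but outdated record would still win.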


In [64]:
# 18. Deduplication: Intelligent Resolution (Version 15)
# Strategy: "Keep Best, Review All Conflicts"
# 1. Score records by Completeness (filled fields) and Recency (date).
# 2. Sort data so "Best" records are first.
# 3. Deduplicate by DOI and Title (keeping the TOP/BEST record).
# 4. Generate Clean Dataset (retaining the survivors).
# 5. Generate Review Dataset (showing both Survivors and Dropped duplicates for comparison).
# *Updated*: Uses dedup_public_dome_review_raw_human_20260202.json for validation

print("Performing Intelligent Deduplication (Version 15)...")

input_file = "v14_Dome-Recommendations-Complete_From_Source.tsv"
output_file_v15 = "v15_Dome-Recommendations-Best_Candidate_Retained.tsv"
review_file_v15 = "v15_Duplicates_Report_Survivors_and_Dropped.tsv" 
json_file = "dedup_public_dome_review_raw_human_20260202.json"

try:
    if os.path.exists(input_file):
        df_v15 = pd.read_csv(input_file, sep='\t')
        print(f"Loaded {len(df_v15)} records from v14.")
    else:
        print(f"Error: {input_file} not found. Please run previous cell.")
        df_v15 = pd.DataFrame()

    if not df_v15.empty:
        # --- 1. Preparation & Scoring ---
        
        # Normalization Helpers
        def normalize_title_simple(text):
            if not isinstance(text, str): return ""
            return " ".join(text.split()).lower()

        def normalize_doi(text):
            if not isinstance(text, str): return ""
            return text.strip().lower()

        df_v15['bs_norm_title'] = df_v15['publication/title'].apply(normalize_title_simple)
        df_v15['bs_norm_doi'] = df_v15['publication/doi'].apply(normalize_doi)

        # Calculate Scores
        # A. Completeness: Count non-null values
        # (empty TSV cells are read as NaN, so they are excluded; temp columns are excluded too)
        data_cols = [c for c in df_v15.columns if not c.startswith('bs_')]
        df_v15['bs_completeness'] = df_v15[data_cols].notna().sum(axis=1)

        # B. Recency: Parse 'update' column if exists
        if 'update' in df_v15.columns:
            df_v15['bs_date'] = pd.to_datetime(df_v15['update'], errors='coerce')
        else:
            df_v15['bs_date'] = pd.NaT

        # --- 2. Sorting to Prioritize Best Records ---
        # Sort Order:
        # 1. Completeness (Desc) -> More data is better
        # 2. Date (Desc) -> Newer is better
        # 3. Index (Asc) -> Stability (keep original/top if tie)
        
        df_v15.sort_values(by=['bs_completeness', 'bs_date'], ascending=[False, False], inplace=True)
        # The index is deliberately not reset: duplicated()/drop_duplicates() work in row order,
        # so the sorted order alone decides which copy counts as 'first'.
        
        # --- 3. Identification of Conflicts ---
        
        # We need to identify ALL rows involved in duplicates for the report,
        # irrespective of whether they will be kept or dropped.
        
        title_mask = df_v15['bs_norm_title'] != ''
        # doi_mask = df_v15['bs_norm_doi'] != '' # DOI conflicts checked below
        
        # Identify duplicates (Mark all occurrences)
        title_dupes_all = df_v15.duplicated(subset=['bs_norm_title'], keep=False) & title_mask
        doi_dupes_all = df_v15.duplicated(subset=['bs_norm_doi'], keep=False) & (df_v15['bs_norm_doi'] != '')
        
        conflict_mask = title_dupes_all | doi_dupes_all
        
        # Extract the Conflict Group for Review BEFORE we drop anything
        df_conflicts = df_v15[conflict_mask].copy()
        
        if not df_conflicts.empty:
            df_conflicts['Review_Reason'] = ''
            df_conflicts.loc[title_dupes_all, 'Review_Reason'] += 'TITLE_MATCH '
            df_conflicts.loc[doi_dupes_all, 'Review_Reason'] += 'DOI_MATCH '
            
            # Sort for Review: Group by Title/DOI to see pairs
            df_conflicts.sort_values(by=['bs_norm_title', 'bs_norm_doi'], inplace=True)
            
            # Reorder columns
            cols = list(df_conflicts.columns)
            priority = ['Review_Reason', 'bs_completeness', 'update', 'publication/title', 'publication/doi']
            # Filter priority to valid columns
            priority = [c for c in priority if c in cols]
            remaining = [c for c in cols if c not in priority and not c.startswith('bs_')]
            
            df_review = df_conflicts[priority + remaining]
            
            df_review.to_csv(review_file_v15, sep='\t', index=False)
            print("-" * 30)
            print(f"REPORT: {len(df_review)} rows involved in duplicates saved to {review_file_v15}")
            print("        (This file contains BOTH the retained survivor and the dropped duplicates)")
        
        # --- 4. Deduplication (Keep First/Best) ---
        
        initial_count = len(df_v15)
        
        # Explicit Strategy:
        # Since we sorted by Completeness (Desc) and Date (Desc),
        # keep='first' will retain the record with the most data (or newest).
        # It drops subsequent (worse/identical) copies.
        
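        # Drop DOI duplicates (rows with an empty DOI are exempt, so they are not collapsed into one group)
        doi_dupes = df_v15.duplicated(subset=['bs_norm_doi'], keep='first') & (df_v15['bs_norm_doi'] != '')
        df_v15_dedup = df_v15[~doi_dupes]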
        
        # Now Drop Title duplicates from the result
        # Handle empty titles carefully (don't dedup empty strings against each other as one group)
        with_title = df_v15_dedup[df_v15_dedup['bs_norm_title'] != '']
        no_title = df_v15_dedup[df_v15_dedup['bs_norm_title'] == '']
        
        with_title_dedup = with_title.drop_duplicates(subset=['bs_norm_title'], keep='first')
        
        df_clean = pd.concat([with_title_dedup, no_title], ignore_index=True)
        
        # --- 5. Cleanup and Save ---
        
        # Remove helper columns
        clean_cols = [c for c in df_clean.columns if not c.startswith('bs_')]
        df_clean = df_clean[clean_cols]
        
        final_count = len(df_clean)
        dropped_count = initial_count - final_count

        print("-" * 30)
        print(f"Original Count: {initial_count}")
        print(f"Dropped:        {dropped_count} (Worse copies removed)")
        print(f"Final Count:    {final_count}")
        
        df_clean.to_csv(output_file_v15, sep='\t', index=False)
        print(f"Saved optimized dataset to {output_file_v15}")

        # Strict Check
        try:
            with open(json_file, 'r', encoding='utf-8') as f:
                json_data = json.load(f)
            
            print(f"\nSource Public Deduplicated JSON entries: {len(json_data)}")
            if len(df_clean) == len(json_data):
                print("SUCCESS: Final count matches Source JSON count exactly.")
            else:
                 diff = len(df_clean) - len(json_data)
                 print(f"Note: Count difference is {diff}.")
                 
        except Exception as e:
            print(f"Error checking JSON: {e}")

except Exception as e:
    print(f"Error in deduplication: {e}")

Performing Intelligent Deduplication (Version 15)...
Loaded 271 records from v14.
------------------------------
REPORT: 2 rows involved in duplicates saved to v15_Duplicates_Report_Survivors_and_Dropped.tsv
        (This file contains BOTH the retained survivor and the dropped duplicates)
------------------------------
Original Count: 271
Dropped:        1 (Worse copies removed)
Final Count:    270
Saved optimized dataset to v15_Dome-Recommendations-Best_Candidate_Retained.tsv

Source Public Deduplicated JSON entries: 271
Note: Count difference is -1.
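
The "Keep Best" strategy can also be sketched without pandas. The version below is a single-pass variant written for illustration, assuming flat records with hypothetical `title`, `doi`, `update` and `notes` fields; it is not identical to the two-pass `drop_duplicates` approach in the cell and may differ from it in edge cases. It does show the two ideas that matter: rank first, then keep the first record per key, and never let empty keys match each other.

```python
from datetime import date


def completeness(rec):
    """Number of fields that are neither None nor an empty string."""
    return sum(1 for v in rec.values() if v not in (None, ""))


def dedupe_keep_best(records):
    """Rank by completeness, then recency; keep the first record per DOI/title."""
    ranked = sorted(
        records,
        key=lambda r: (completeness(r), r.get("update") or date.min),
        reverse=True,
    )
    seen_dois, seen_titles, kept = set(), set(), []
    for rec in ranked:
        doi = (rec.get("doi") or "").strip().lower()
        title = " ".join((rec.get("title") or "").split()).lower()
        # Empty keys are skipped so records without a DOI or title are never merged
        if (doi and doi in seen_dois) or (title and title in seen_titles):
            continue
        if doi:
            seen_dois.add(doi)
        if title:
            seen_titles.add(title)
        kept.append(rec)
    return kept


# Toy records (hypothetical): two copies of one paper, two records without keys
records = [
    {"title": "A Study", "doi": "10.1/a", "update": date(2024, 1, 1), "notes": ""},
    {"title": "a  study", "doi": "10.1/A", "update": date(2025, 1, 1), "notes": "full"},
    {"title": "", "doi": "", "update": None, "notes": "x"},
    {"title": "", "doi": "", "update": None, "notes": "y"},
]

kept = dedupe_keep_best(records)
print(len(kept), kept[0]["notes"])  # 3 full
```

The more complete copy of the duplicated paper survives, and both key-less records are retained rather than being treated as duplicates of one another.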


In [67]:
# 18b. Final Formatting: Set System Flags
# Objective: Reset workflow flags and ensure public visibility for the final dataset.

print("Applying final system flags...")

input_file = "v15_Dome-Recommendations-Best_Candidate_Retained.tsv"
output_file_v16 = "v16_Dome-Recommendations-Import_Ready_With_Flags.tsv"

try:
    if os.path.exists(input_file):
        df_flags = pd.read_csv(input_file, sep='\t')
        print(f"Loaded {len(df_flags)} records from {input_file}")
        
        # Apply Flags as requested
        # publication/done -> 0.0
        # publication/skip -> 0.0
        # public -> 1.0
        
        df_flags['publication/done'] = 0.0
        df_flags['publication/skip'] = 0.0
        df_flags['public'] = 1.0
        
        print("Updated columns: 'publication/done' (0.0), 'publication/skip' (0.0), 'public' (1.0)")
        
        # Save
        df_flags.to_csv(output_file_v16, sep='\t', index=False)
        print(f"Saved final prepared dataset to {output_file_v16}")
        
    else:
        print(f"Error: {input_file} not found. Please run previous cell.")

except Exception as e:
    print(f"Error applying flags: {e}")

Applying final system flags...
Loaded 270 records from v15_Dome-Recommendations-Best_Candidate_Retained.tsv
Updated columns: 'publication/done' (0.0), 'publication/skip' (0.0), 'public' (1.0)
Saved final prepared dataset to v16_Dome-Recommendations-Import_Ready_With_Flags.tsv


In [70]:
# 19. Metadata Remediation: Enforce PMCID/PMID from External Registry (Version 17)
# Objective: Augment v16 dataset with PMCIDs and PMIDs from 'PMCIDs_DOME_Registry_Contents_2026-01-09.tsv'
# Match Logic: DOI (Primary) and Title (Secondary verification)
# Action: Fill missing values. Flag conflicts where existing value differs from source.

import os
import pandas as pd

print("Starting Metadata Remediation (PMID/PMCID Injection)...")

input_file_v16 = "v16_Dome-Recommendations-Import_Ready_With_Flags.tsv"
source_registry_file = "../DOME_Registry_TSV_Files/PMCIDs_DOME_Registry_Contents_2026-01-09.tsv"
output_file_v17 = "v17_Dome-Recommendations-Remediated_Metadata.tsv"
conflict_report_file = "v17_Metadata_Conflict_Report.tsv"
missing_pmid_file = "v17_Missing_PMID_Report.tsv"

try:
    # 1. Load Datasets
    if os.path.exists(input_file_v16):
        df_target = pd.read_csv(input_file_v16, sep='\t')
        print(f"Loaded target dataset: {len(df_target)} records.")
    else:
        print(f"Error: {input_file_v16} not found.")
        df_target = pd.DataFrame()
        
    if os.path.exists(source_registry_file):
        df_source = pd.read_csv(source_registry_file, sep='\t')
        print(f"Loaded source registry: {len(df_source)} records.")
    else:
        # Try absolute path just in case notebook relative path fails
        abs_path = os.path.abspath(os.path.join(os.getcwd(), source_registry_file))
        if os.path.exists(abs_path):
             df_source = pd.read_csv(abs_path, sep='\t')
             print(f"Loaded source registry (abs path): {len(df_source)} records.")
        else:
            print(f"Error: Registry file not found at {source_registry_file}")
            df_source = pd.DataFrame()

    if not df_target.empty and not df_source.empty:
        
        # 2. Prepare Source Dictionary for Fast Lookup
        # Normalize Keys
        def normalize_key(text):
            if not isinstance(text, str): return ""
            return "".join(c for c in text if c.isalnum()).lower()

        # Index the source by normalized DOI
        source_cols = df_source.columns
        # The PMID/PMCID columns are expected to be named 'mapped_pmid' and 'mapped_pmcid';
        # the DOI and title columns are detected by name.
        doi_col_source = next((c for c in source_cols if 'doi' in c.lower()), None)
        title_col_source = next((c for c in source_cols if 'title' in c.lower()), None)
        
        print(f"Source Columns Identified - DOI: {doi_col_source}, Title: {title_col_source}")

        source_lookup = {}
        for idx, row in df_source.iterrows():
            if doi_col_source:
                doi_norm = normalize_key(str(row[doi_col_source]))
                if doi_norm:
                    source_lookup[doi_norm] = {
                        'pmid': str(row.get('mapped_pmid', '')),
                        'pmcid': str(row.get('mapped_pmcid', '')),
                        'title': str(row.get(title_col_source, '')) if title_col_source else ''
                    }

        # 3. Iterate and Update Target
        # Counters
        count_updated_pmid = 0
        count_updated_pmcid = 0
        count_conflicts = 0
        conflicts = []

        # Ensure target columns exist
        if 'publication/pmid' not in df_target.columns: df_target['publication/pmid'] = None
        if 'publication/pmcid' not in df_target.columns: df_target['publication/pmcid'] = None

        for idx, row in df_target.iterrows():
            target_doi = normalize_key(str(row.get('publication/doi', '')))
            
            if target_doi in source_lookup:
                src_data = source_lookup[target_doi]
                
                # --- PMID Logic ---
                curr_pmid = str(row.get('publication/pmid', '')).replace('nan', '').replace('.0', '').strip()
                new_pmid = src_data['pmid'].replace('nan', '').replace('.0', '').strip()
                
                if new_pmid:
                    if not curr_pmid:
                        # Fill missing
                        df_target.at[idx, 'publication/pmid'] = new_pmid
                        count_updated_pmid += 1
                    elif curr_pmid != new_pmid:
                        # Conflict
                        count_conflicts += 1
                        conflicts.append({
                            'index': idx,
                            'doi': row.get('publication/doi'),
                            'field': 'PMID',
                            'current_value': curr_pmid,
                            'registry_value': new_pmid,
                            'source_title_check': src_data['title']
                        })

                # --- PMCID Logic ---
                curr_pmcid = str(row.get('publication/pmcid', '')).replace('nan', '').strip()
                new_pmcid = src_data['pmcid'].replace('nan', '').strip()
                
                if new_pmcid:
                    if not curr_pmcid:
                        # Fill missing
                        df_target.at[idx, 'publication/pmcid'] = new_pmcid
                        count_updated_pmcid += 1
                    elif curr_pmcid != new_pmcid:
                        # Conflict
                        count_conflicts += 1
                        conflicts.append({
                            'index': idx,
                            'doi': row.get('publication/doi'),
                            'field': 'PMCID',
                            'current_value': curr_pmcid,
                            'registry_value': new_pmcid,
                            'source_title_check': src_data['title']
                        })
        
        # Calculate Totals
        def has_value(val):
            s = str(val).lower().strip().replace('nan', '').replace('.0', '')
            return s != ''

        total_pmids = df_target['publication/pmid'].apply(has_value).sum()
        total_pmcids = df_target['publication/pmcid'].apply(has_value).sum()

        # 4. Save Results
        df_target.to_csv(output_file_v17, sep='\t', index=False)
        print("-" * 30)
        print("Results of Metadata Remediation:")
        print(f"  Processed {len(df_target)} records.")
        print(f"  PMIDs Filled:   {count_updated_pmid}")
        print(f"  PMCIDs Filled:  {count_updated_pmcid}")
        print(f"  Total PMIDs:    {total_pmids}")
        print(f"  Total PMCIDs:   {total_pmcids}")
        print(f"  Conflicts Found: {count_conflicts}")
        print(f"Saved remediated dataset to {output_file_v17}")
        
        # 5. Save Conflicts
        if conflicts:
            df_conflicts = pd.DataFrame(conflicts)
            df_conflicts.to_csv(conflict_report_file, sep='\t', index=False)
            print(f"Saved conflict details to {conflict_report_file}")
        else:
            print("No conflicts detected.")
            
        # 6. Save Missing PMID Report
        # Filter for rows where has_value is False for PMID
        missing_pmid_mask = ~df_target['publication/pmid'].apply(has_value)
        df_missing_pmid = df_target[missing_pmid_mask].copy()

        if not df_missing_pmid.empty:
            # Columns to export
            export_cols = ['publication/title', 'publication/journal', 'publication/doi']
            # Only include columns that actually exist in the dataframe
            final_export_cols = [c for c in export_cols if c in df_missing_pmid.columns]
            
            df_missing_pmid[final_export_cols].to_csv(missing_pmid_file, sep='\t', index=False)
            print(f"Saved {len(df_missing_pmid)} records with missing PMIDs to {missing_pmid_file}")
        else:
            print("Great news! No records are missing PMIDs.")

except Exception as e:
    print(f"Error in metadata remediation: {e}")

Starting Metadata Remediation (PMID/PMCID Injection)...
Loaded target dataset: 270 records.
Loaded source registry: 280 records.
Source Columns Identified - DOI: publication_doi, Title: publication_title
------------------------------
Results of Metadata Remediation:
  Processed 270 records.
  PMIDs Filled:   27
  PMCIDs Filled:  107
  Total PMIDs:    262
  Total PMCIDs:   237
  Conflicts Found: 3
Saved remediated dataset to v17_Dome-Recommendations-Remediated_Metadata.tsv
Saved conflict details to v17_Metadata_Conflict_Report.tsv
Saved 8 records with missing PMIDs to v17_Missing_PMID_Report.tsv
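
The fill-or-flag rule applied to each identifier in Step 19 can be isolated in a small helper. This is a sketch under assumptions: `clean_id` and `reconcile` are hypothetical names, and the cleaning here strips only a trailing `.0` (the float artifact pandas produces for numeric PMIDs), which is slightly stricter than the chained `replace()` calls in the cell.

```python
import math


def clean_id(value):
    """Normalize a PMID/PMCID cell to a plain string; missing values become ''."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    text = str(value).strip()
    if text.lower() in ("nan", "none"):
        return ""
    if text.endswith(".0"):  # 12345678.0 -> "12345678"
        text = text[:-2]
    return text


def reconcile(current, registry):
    """Return (action, value): fill a missing value, flag a conflict, or keep."""
    cur, new = clean_id(current), clean_id(registry)
    if not new:
        return ("keep", cur)        # registry has nothing to offer
    if not cur:
        return ("fill", new)        # target is empty: take the registry value
    if cur != new:
        return ("conflict", cur)    # both present but different: keep current, report
    return ("keep", cur)


print(reconcile(float("nan"), 12345678.0))  # ('fill', '12345678')
print(reconcile("123", "456"))              # ('conflict', '123')
print(reconcile(123.0, "123"))              # ('keep', '123')
```

As in the cell above, a conflict never overwrites the existing value; it is only recorded so that it can be resolved manually in the next step.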


In [71]:
# 20. Manual Metadata Remediation Interface (Version 18 Preview)
# Objective: Interactive widget to manually resolve missing PMIDs and conflicts identified in Step 19.
# Features:
# - Loads v17 dataset.
# - Targets rows with missing PMIDs or reported conflicts.
# - Persists progress to 'v18_Remediation_Log.json' to avoid re-doing work.
# - Allows DOI search.

import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML
import json
import os

# --- Configuration ---
input_file = "v17_Dome-Recommendations-Remediated_Metadata.tsv"
conflict_file = "v17_Metadata_Conflict_Report.tsv"
progress_log_file = "v18_Remediation_Log.json"
output_file = "v18_Dome-Recommendations-Manual_Remediated.tsv"

# --- Load Data ---
if os.path.exists(input_file):
    df_work = pd.read_csv(input_file, sep='\t')
    print(f"Loaded dataset: {len(df_work)} records")
else:
    print("Error: v17 file not found. Run Step 19 first.")
    df_work = pd.DataFrame()

# Verify Columns
if 'publication/pmid' not in df_work.columns: df_work['publication/pmid'] = ""
if 'publication/pmcid' not in df_work.columns: df_work['publication/pmcid'] = ""

# Load Conflicts (to prioritize)
# Conflicts are matched by DOI rather than by row index, so the lookup stays valid
# even if row positions differ between the v17 report and the reloaded v17 dataset.
conflict_dois = set()
if os.path.exists(conflict_file):
    try:
        df_conflicts = pd.read_csv(conflict_file, sep='\t')
        conflict_dois = set(df_conflicts['doi'].dropna().astype(str).str.lower().str.strip())
    except Exception as e:
        print(f"Warning: could not read {conflict_file}: {e}")

# Load Progress
remediation_log = {}
if os.path.exists(progress_log_file):
    with open(progress_log_file, 'r') as f:
        remediation_log = json.load(f)
    print(f"Loaded existing progress: {len(remediation_log)} entries resolved.")

# --- Identify Work Queue ---
# Criteria: 
# 1. DOI is in conflict list OR
# 2. PMID is missing (empty or nan)
# AND
# 3. DOI is NOT in remediation_log

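def get_doi(row):
    doi = row.get('publication/doi', '')
    # A missing DOI is read as NaN; return "" so the row is skipped instead of being keyed as "nan"
    return '' if pd.isna(doi) else str(doi).strip()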

def needs_remediation(row):
    doi = get_doi(row)
    if not doi: return False # Skip empty DOI rows
    if doi in remediation_log: return False # Already done
    
    # Check conflict
    is_conflict = doi.lower() in conflict_dois
    
    # Check missing PMID
    pmid = str(row.get('publication/pmid', ''))
    is_missing = pmid.lower() in ['nan', '', 'none', '.0']
    
    return is_conflict or is_missing

# Create a queue of indices
df_work['temp_doi_key'] = df_work['publication/doi'].astype(str).str.strip()
queue_indices = [i for i, row in df_work.iterrows() if needs_remediation(row)]

print(f"Total items needing remediation: {len(queue_indices)}")

# --- Interface Logic ---

current_q_index = 0

# Widgets
output_area = widgets.Output()

style = {'description_width': 'initial'}
w_doi_display = widgets.HTML(value="<b>DOI:</b> ...")
w_title_display = widgets.HTML(value="<b>Title:</b> ...")
w_journal_display = widgets.HTML(value="<b>Journal:</b> ...")
w_status_display = widgets.HTML(value="")

w_pmid_input = widgets.Text(description="PMID:", style=style)
w_pmcid_input = widgets.Text(description="PMCID:", style=style)

w_save_btn = widgets.Button(description="Save & Next", button_style='success')
w_skip_btn = widgets.Button(description="Skip", button_style='warning')
w_stop_btn = widgets.Button(description="Stop & Export", button_style='danger')

w_search_input = widgets.Text(description="Search DOI:", placeholder="Paste DOI here")
w_search_btn = widgets.Button(description="Find")

# Helper to get current actual dataframe index
def get_current_idx():
    if 0 <= current_q_index < len(queue_indices):
        return queue_indices[current_q_index]
    return None

def load_entry(idx):
    if idx is None:
        w_status_display.value = "<b>Queue Complete!</b>"
        return

    row = df_work.loc[idx]
    doi = str(row.get('publication/doi', ''))
    title = str(row.get('publication/title', ''))
    journal = str(row.get('publication/journal', ''))
    
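    # PMIDs reloaded from the TSV may be floats (e.g. 12345678.0); strip the suffix as in Step 19
    curr_pmid = str(row.get('publication/pmid', '')).replace('nan', '').replace('.0', '')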
    curr_pmcid = str(row.get('publication/pmcid', '')).replace('nan', '')

    w_doi_display.value = f"<b>DOI:</b> <a href='https://doi.org/{doi}' target='_blank'>{doi}</a> (Index: {idx})"
    w_title_display.value = f"<b>Title:</b> {title}"
    w_journal_display.value = f"<b>Journal:</b> {journal}"
    
    # Check if conflict
    if doi.lower() in conflict_dois:
        w_status_display.value = "<span style='color:red'><b>CONFLICT DETECTED</b> check sources</span>"
    elif curr_pmid == "":
        w_status_display.value = "<span style='color:orange'>Missing PMID</span>"
    else:
        w_status_display.value = "Review mode"

    # Pre-fill inputs with current values
    w_pmid_input.value = curr_pmid
    w_pmcid_input.value = curr_pmcid

def save_current(_):
    global current_q_index
    idx = get_current_idx()
    if idx is None: return

    row = df_work.loc[idx]
    doi = str(row.get('publication/doi', '')).strip()
    
    # Get values
    new_pmid = w_pmid_input.value.strip()
    new_pmcid = w_pmcid_input.value.strip()
    
    # Update DataFrame
    df_work.at[idx, 'publication/pmid'] = new_pmid
    df_work.at[idx, 'publication/pmcid'] = new_pmcid
    
    # Log persistence
    if doi:
        remediation_log[doi] = {
            'pmid': new_pmid,
            'pmcid': new_pmcid,
            'action': 'manual_update'
        }
        # Save log immediately
        with open(progress_log_file, 'w') as f:
            json.dump(remediation_log, f, indent=2)
    
    # Move next
    current_q_index += 1
    if current_q_index < len(queue_indices):
        load_entry(queue_indices[current_q_index])
    else:
        w_status_display.value = "Queue Complete! Click Stop & Export."

def skip_current(_):
    global current_q_index
    # Skip means "do it later": move on without writing to the log,
    # so the entry shows up in the queue again the next time this cell is run.
    
    current_q_index += 1
    if current_q_index < len(queue_indices):
        load_entry(queue_indices[current_q_index])
    else:
        w_status_display.value = "Queue Complete!"

def export_data(_):
    # Apply all logs to df_work one last time to be sure
    print("Applying log to dataset...")
    for doi, data in remediation_log.items():
        # Find index by DOI
        matches = df_work[df_work['temp_doi_key'] == doi].index
        for m_idx in matches:
            df_work.at[m_idx, 'publication/pmid'] = data.get('pmid', '')
            df_work.at[m_idx, 'publication/pmcid'] = data.get('pmcid', '')
    
    # Export a copy without the temp column; df_work keeps it so the export can be run again
    df_export = df_work.drop(columns=['temp_doi_key'], errors='ignore')
    df_export.to_csv(output_file, sep='\t', index=False)
    print(f"Exported remediated dataset to {output_file}")
    with output_area:
        print(f"Saved {output_file}")

def search_doi(_):
    search_term = w_search_input.value.strip()
    if not search_term: return
    
    # Find in df_work
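    # Literal, case-insensitive substring match on the DOI column.
    # regex=False because DOIs contain '.', '(' and ')', which are regex metacharacters.
    matches = df_work[df_work['publication/doi'].astype(str).str.contains(search_term, case=False, na=False, regex=False)]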
    
    if not matches.empty:
        found_idx = matches.index[0]
        # If the match is in the queue, sync current_q_index to it so that
        # Save and Skip continue from there; otherwise load it as a one-off edit.
        if found_idx in queue_indices:
            global current_q_index
            current_q_index = queue_indices.index(found_idx)
            load_entry(found_idx)
            w_status_display.value += " (Found in Queue)"
        else:
            # It's not in the queue, but we want to edit it.
            # We can force load it
            load_entry(found_idx)
            w_status_display.value = "<b>Manually Loaded (Not in active queue)</b>"
            
            # Set an override index so this one-off edit targets found_idx
            global get_current_idx_override
            get_current_idx_override = found_idx
            
            # Limitation: moving to the "next" entry is not fully supported
            # after loading a search result that is not in the queue.
            
    else:
        w_status_display.value = f"DOI '{search_term}' not found."

# Wire up events
w_save_btn.on_click(save_current)
w_skip_btn.on_click(skip_current)
w_stop_btn.on_click(export_data)
w_search_btn.on_click(search_doi)

# Layout
ui = widgets.VBox([
    widgets.HBox([w_search_input, w_search_btn]),
    widgets.HTML("<hr>"),
    w_doi_display,
    w_title_display,
    w_journal_display,
    w_status_display,
    widgets.HTML("<hr>"),
    w_pmid_input,
    w_pmcid_input,
    widgets.HBox([w_save_btn, w_skip_btn, w_stop_btn]),
    output_area
])

# Initialize
if queue_indices:
    load_entry(queue_indices[0])
    display(ui)
else:
    print("All conflicts and missing PMIDs resolved! Check output.")
    export_data(None)


Loaded dataset: 270 records
Total items needing remediation: 11


VBox(children=(HBox(children=(Text(value='', description='Search DOI:', placeholder='Paste DOI here'), Button(…
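
The DOI search box matches with `str.contains`, which interprets its pattern as a regular expression unless `regex=False` is passed. DOIs routinely contain `.`, `(` and `)`, so a pasted DOI can match the wrong rows or none at all when treated as a pattern. The sketch below shows a literal, case-insensitive lookup. The helper name `find_doi_rows` and the sample DOIs are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def find_doi_rows(df, term, column="publication/doi"):
    """Return rows whose DOI contains `term` as a literal, case-insensitive substring."""
    term = term.strip()
    if not term:
        return df.iloc[0:0]  # empty frame with the same columns
    mask = df[column].astype(str).str.contains(term, case=False, na=False, regex=False)
    return df[mask]

# Illustrative data only (hypothetical DOIs)
demo = pd.DataFrame({"publication/doi": [
    "10.1000/S0140-6736(20)30183-5",
    "10.1000/abc123",
    None,
]})

hits = find_doi_rows(demo, "s0140-6736(20)30183-5")
print(hits.index.tolist())
```

With the default `regex=True`, the parentheses in the first DOI would be read as a capture group and the literal string would not match.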

In [72]:
# 21. Final Format Validation (PMID/PMCID)
# Purpose: Ensure strict formatting for ID columns before import.
# Rules:
# - PMID: Must be numeric digits only.
# - PMCID: Must start with 'PMC' followed by digits.

import pandas as pd
import re
import os

# Determine input file (prioritize Manual output, fall back to Auto output)
input_file = "v18_Dome-Recommendations-Manual_Remediated.tsv"
if not os.path.exists(input_file):
    print(f"{input_file} not found, checking v17...")
    input_file = "v17_Dome-Recommendations-Remediated_Metadata.tsv"

output_file = "v19_Dome-Recommendations-Format_Validated.tsv"
dropped_report = "v19_Invalid_IDs_Dropped.tsv"

print(f"Validating formats in {input_file}...")

if os.path.exists(input_file):
    df = pd.read_csv(input_file, sep='\t')
    
    dropped_log = []

    # 1. Validate PMID
    def validate_pmid(row, idx):
        val = row.get('publication/pmid')
        s = str(val).strip()
        
        # Cleanup float conversions
        if s.endswith('.0'): s = s[:-2]
        
        # Check emptiness
        if s.lower() in ['nan', 'none', '', '<na>']: return None
        
        # Rules: Digits only
        if re.match(r'^\d+$', s):
            return s
        
        # Log Drop
        dropped_log.append({
            'index': idx,
            'doi': row.get('publication/doi'),
            'field': 'PMID',
            'value': s,
            'reason': 'Not numeric'
        })
        return None

    # 2. Validate PMCID
    def validate_pmcid(row, idx):
        val = row.get('publication/pmcid')
        s = str(val).strip()
        
        if s.lower() in ['nan', 'none', '', '<na>']: return None
        
        # Rules: Must start with PMC and be followed by digits
        # Allow case insensitive input, standardize to uppercase
        if re.match(r'^PMC\d+$', s, re.IGNORECASE):
            return s.upper()
        
        # Log Drop
        dropped_log.append({
            'index': idx,
            'doi': row.get('publication/doi'),
            'field': 'PMCID',
            'value': s,
            'reason': 'Invalid format (Expected PMC#)'
        })
        return None

    # Track counts
    pmid_count_start = df['publication/pmid'].notna() & (df['publication/pmid'].astype(str) != 'nan')
    pmcid_count_start = df['publication/pmcid'].notna() & (df['publication/pmcid'].astype(str) != 'nan')
    
    # Apply validations
    # Using list comprehensions for cleaner index access in logging
    new_pmids = [validate_pmid(row, i) for i, row in df.iterrows()]
    new_pmcids = [validate_pmcid(row, i) for i, row in df.iterrows()]
    
    df['publication/pmid'] = new_pmids
    df['publication/pmcid'] = new_pmcids
    
    # Check results
    pmid_count_end = df['publication/pmid'].notna().sum()
    pmcid_count_end = df['publication/pmcid'].notna().sum()
    
    print("-" * 30)
    print(f"Validation Results:")
    print(f"PMID: {pmid_count_start.sum()} -> {pmid_count_end} valid. (Removed {len([x for x in dropped_log if x['field']=='PMID'])})")
    print(f"PMCID: {pmcid_count_start.sum()} -> {pmcid_count_end} valid. (Removed {len([x for x in dropped_log if x['field']=='PMCID'])})")
    
    # Save Main File
    df.to_csv(output_file, sep='\t', index=False)
    print(f"Saved validated file to {output_file}")
    
    # Save Drop Log
    if dropped_log:
        pd.DataFrame(dropped_log).to_csv(dropped_report, sep='\t', index=False)
        print(f"Saved invalid inputs report to {dropped_report}")
    else:
        print("No invalid formats found.")

else:
    print(f"Error: Could not find input file {input_file}")

Validating formats in v18_Dome-Recommendations-Manual_Remediated.tsv...
------------------------------
Validation Results:
PMID: 264 -> 263 valid. (Removed 1)
PMCID: 238 -> 238 valid. (Removed 0)
Saved validated file to v19_Dome-Recommendations-Format_Validated.tsv
Saved invalid inputs report to v19_Invalid_IDs_Dropped.tsv
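
The `.0` cleanup in `validate_pmid` is needed because pandas infers a float column when an integer-like column contains blanks. One way to avoid the artifact at the source, sketched here under the assumption that ID columns should always be handled as text, is to pass `dtype` to `pd.read_csv`. The inline TSV is made up for illustration.

```python
import io
import pandas as pd

# Made-up two-row TSV: the second row has no PMID
tsv_text = "publication/pmid\tpublication/doi\n12345\t10.1000/a\n\t10.1000/b\n"

# Default inference: the blank forces a float column, so the ID gains a '.0' suffix
inferred = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(str(inferred["publication/pmid"].iloc[0]))

# Reading the ID column as text keeps the digits exactly as written
as_text = pd.read_csv(io.StringIO(tsv_text), sep="\t", dtype={"publication/pmid": str})
print(as_text["publication/pmid"].iloc[0])
```

Missing values still come back as NaN in the text column, so the emptiness checks used in the validators remain necessary.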


In [74]:
# 22. EPMC Metadata Overwrite (Version 20)
# Objective: Query Europe PMC API using valid PMIDs from v19 to overwrite Title, Journal, Authors, and DOI.
# Rationale: Ensure high-quality, consistent metadata from the source of truth (EPMC).
# Fix: includes robust handling for float-string conversion (e.g. '12345.0') when reading CSVs.

import requests
import time
import pandas as pd
import numpy as np
import os

input_file = "v19_Dome-Recommendations-Format_Validated.tsv"
output_file = "v20_Dome-Recommendations-EPMC_Metadata.tsv"
epmc_log_file = "v20_EPMC_Retrieval_Log.tsv"

# EPMC API Endpoint
EPMC_API_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def get_epmc_batch_metadata(pmid_list):
    if not pmid_list: return {}
    
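    # Construct Query: (EXT_ID:123 AND SRC:MED) OR (EXT_ID:456 AND SRC:MED) ...
    # Restrict each term to SRC:MED so EXT_ID matches PMIDs only; hits from other
    # sources would otherwise use up the pageSize limit set below.
    # API limits: URL length. Keep batches reasonable (e.g., 20-25)
    query_parts = [f"(EXT_ID:{p} AND SRC:MED)" for p in pmid_list]
    query = " OR ".join(query_parts)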
    
    params = {
        'query': query,
        'format': 'json',
        'resultType': 'core',
        'pageSize': len(pmid_list)
    }
    
    try:
        # Timeout prevents a stalled connection from hanging the whole batch loop
        response = requests.get(EPMC_API_URL, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        
        results = {}
        if 'resultList' in data and 'result' in data['resultList']:
            for item in data['resultList']['result']:
                # Key by PMID
                p = item.get('pmid')
                if p:
                    results[str(p)] = {
                        'title': item.get('title', ''),
                        'journal': item.get('journalInfo', {}).get('journal', {}).get('title', ''),
                        'authors': item.get('authorString', ''),
                        'doi': item.get('doi', '')
                    }
        return results
    except Exception as e:
        print(f"API Request failed for batch: {e}")
        return {}

def robust_pmid_clean(val):
    """Clean PMID string, handling floats and NaNs common in CSV reads."""
    s = str(val).strip()
    if s.lower() in ['nan', 'none', '', '<na>']:
        return None
    # Remove .0 artifact if pandas read it as float
    if s.endswith('.0'):
        s = s[:-2]
    # Verify it is digits
    if s.isdigit():
        return s
    return None

if os.path.exists(input_file):
    df = pd.read_csv(input_file, sep='\t')
    print(f"Loaded {len(df)} records from {input_file}")
    
    # Ensure author column exists
    if 'publication/authors' not in df.columns:
        print("Adding 'publication/authors' column...")
        df['publication/authors'] = ""

    # Robust extraction of unique PMIDs
    unique_pmids = set()
    if 'publication/pmid' in df.columns:
        for val in df['publication/pmid']:
            cleaned = robust_pmid_clean(val)
            if cleaned:
                unique_pmids.add(cleaned)
    
    valid_pmids = sorted(list(unique_pmids))
    
    print(f"Found {len(valid_pmids)} unique PMIDs to query.")
    
    if valid_pmids:
        # Batch Processing
        BATCH_SIZE = 25
        epmc_data_map = {}
        retrieval_log = []
        
        print("Querying EPMC API...")
        for i in range(0, len(valid_pmids), BATCH_SIZE):
            batch = valid_pmids[i:i+BATCH_SIZE]
            print(f"  Batch {i}-{min(i+BATCH_SIZE, len(valid_pmids))} / {len(valid_pmids)}...", end='\r')
            
            results = get_epmc_batch_metadata(batch)
            epmc_data_map.update(results)
            
            time.sleep(0.3) # Politeness delay
            
        print(f"\nAPI Retrieval Complete. Retrieved metadata for {len(epmc_data_map)} PMIDs.")
        
        # Apply Updates
        updated_count = 0
        
        print("Applying updates to dataframe...")
        for idx, row in df.iterrows():
            pmid = robust_pmid_clean(row.get('publication/pmid'))
            
            if pmid and pmid in epmc_data_map:
                meta = epmc_data_map[pmid]
                
                # Capture old values for logging
                old_title = str(row.get('publication/title', ''))
                old_doi = str(row.get('publication/doi', ''))
                
                # Overwrite
                df.at[idx, 'publication/title'] = meta['title']
                
                # Always overwrite the journal with the raw EPMC value (no name standardization)
                df.at[idx, 'publication/journal'] = meta['journal']
                
                df.at[idx, 'publication/authors'] = meta['authors']
                
                # Update DOI only if EPMC provides one; otherwise keep the existing value
                new_doi = old_doi
                if meta['doi']:
                    df.at[idx, 'publication/doi'] = meta['doi']
                    new_doi = meta['doi']

                updated_count += 1
                
                retrieval_log.append({
                    'pmid': pmid,
                    'old_title': old_title,
                    'new_title': meta['title'],
                    'old_doi': old_doi,
                    'new_doi': new_doi
                })
                
        # Save
        df.to_csv(output_file, sep='\t', index=False)
        print("-" * 30)
        print(f"Overwrite Complete.")
        print(f"Updated {updated_count} records with EPMC data.")
        print(f"Saved to {output_file}")
        
        # Save Log
        if retrieval_log:
            pd.DataFrame(retrieval_log).to_csv(epmc_log_file, sep='\t', index=False)
            print(f"Saved retrieval log to {epmc_log_file}")
    else:
        print("No valid PMIDs found to process. Saving copy of input to v20.")
        df.to_csv(output_file, sep='\t', index=False)

else:
    print(f"Input file {input_file} not found.")

Loaded 270 records from v19_Dome-Recommendations-Format_Validated.tsv
Found 263 unique PMIDs to query.
Querying EPMC API...
  Batch 250-263 / 263...
API Retrieval Complete. Retrieved metadata for 262 PMIDs.
Applying updates to dataframe...
------------------------------
Overwrite Complete.
Updated 262 records with EPMC data.
Saved to v20_Dome-Recommendations-EPMC_Metadata.tsv
Saved retrieval log to v20_EPMC_Retrieval_Log.tsv
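
The run above queried 263 PMIDs but retrieved metadata for 262, and the cell does not report which PMID came back empty. The sketch below shows two small helpers that could close that gap without any network call: one builds a batched query restricted to the MED source (as the single-PMID lookup at the top of the notebook does), the other lists requested PMIDs that are absent from the result map. All names are hypothetical, and the explicit `AND` with parentheses is an assumption about the Europe PMC query syntax.

```python
def build_epmc_query(pmids):
    """Join PMIDs into one OR query, restricting each term to the MED source."""
    return " OR ".join(f"(EXT_ID:{p} AND SRC:MED)" for p in pmids)

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def missing_pmids(requested, retrieved_map):
    """Return requested PMIDs that have no entry in the retrieved metadata map."""
    return sorted(set(requested) - set(retrieved_map), key=int)

# Illustrative values only
requested = ["101", "102", "103"]
retrieved = {"101": {"title": "A"}, "103": {"title": "C"}}

print(build_epmc_query(requested[:2]))
print(missing_pmids(requested, retrieved))
```

Writing the output of `missing_pmids` to the retrieval log would make unresolved PMIDs visible for manual follow-up.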


In [75]:
# 23. Update Timestamp (Version 21)
# Objective: Set 'publication/updated' to the current date and time for all records.
# Format: MM/DD/YYYY HH:MM:SS (e.g., 06/23/2022 03:07:23)

import pandas as pd
from datetime import datetime
import os

input_file = "v20_Dome-Recommendations-EPMC_Metadata.tsv"
output_file = "v21_Dome-Recommendations-Ready_For_Import.tsv"

if os.path.exists(input_file):
    df = pd.read_csv(input_file, sep='\t')
    print(f"Loaded {len(df)} records from {input_file}")
    
    # Generate timestamp
    # Format: MM/DD/YYYY HH:MM:SS
    now_str = datetime.now().strftime("%m/%d/%Y %H:%M:%S")
    print(f"Setting 'publication/updated' to: {now_str}")
    
    # Overwrite column for all rows
    df['publication/updated'] = now_str
    
    # Save final dataset
    df.to_csv(output_file, sep='\t', index=False)
    print(f"Saved final dataset to {output_file}")
    
else:
    print(f"Error: {input_file} not found.")

Loaded 270 records from v20_Dome-Recommendations-EPMC_Metadata.tsv
Setting 'publication/updated' to: 02/02/2026 19:46:52
Saved final dataset to v21_Dome-Recommendations-Ready_For_Import.tsv


In [76]:
# 24. Reset Review State (Version 22)
# Objective: Reset 'reviewState' to 'undefined' for all records.
# Rationale: Prepare records for fresh import/review cycle.

import pandas as pd
import os

input_file = "v21_Dome-Recommendations-Ready_For_Import.tsv"
output_file = "v22_Dome-Recommendations-ReviewState_Reset.tsv"

if os.path.exists(input_file):
    df = pd.read_csv(input_file, sep='\t')
    print(f"Loaded {len(df)} records from {input_file}")
    
    # Overwrite column for all rows
    print("Setting 'reviewState' to 'undefined' for all entries...")
    df['reviewState'] = 'undefined'
    
    # Save final dataset
    df.to_csv(output_file, sep='\t', index=False)
    print(f"Saved dataset with reset reviewState to {output_file}")
    
else:
    print(f"Error: {input_file} not found.")

Loaded 270 records from v21_Dome-Recommendations-Ready_For_Import.tsv
Setting 'reviewState' to 'undefined' for all entries...
Saved dataset with reset reviewState to v22_Dome-Recommendations-ReviewState_Reset.tsv


In [77]:
# 25. Migrate 'update' to 'created/$date' (Version 23)
# Objective: Move values from legacy 'update' column to 'created/$date' where the target is empty.
# Formatting: Ensure consistency with existing 'created/$date' format.
# Cleanup: Clear the 'update' column after migration.

import pandas as pd
import os
from dateutil import parser
from datetime import datetime

input_file = "v22_Dome-Recommendations-ReviewState_Reset.tsv"
output_file = "v23_Dome-Recommendations-Created_Date_Fixed.tsv"

if os.path.exists(input_file):
    df = pd.read_csv(input_file, sep='\t')
    print(f"Loaded {len(df)} records from {input_file}")
    
    # helper for checking emptiness
    def is_empty(val):
        return str(val).lower() in ['nan', 'none', '', '<na>', 'nat']

    # Ensure columns exist
    if 'created/$date' not in df.columns: df['created/$date'] = ""
    if 'update' not in df.columns:
        print("Column 'update' not found, skipping migration.")
    else:
        migrated_count = 0
        
        # Sample the existing non-empty 'created/$date' values so that migrated
        # values can be written in the same format (ISO 8601 vs. US style).
        example_dates = df[~df['created/$date'].apply(is_empty)]['created/$date'].head(5).tolist()
        print(f"Sample existing 'created/$date' values: {example_dates}")
        
        for idx, row in df.iterrows():
            target_val = row.get('created/$date', '')
            source_val = row.get('update', '')
            
            # Only migrate if target is empty AND source is NOT empty
            if is_empty(target_val) and not is_empty(source_val):
                
                # Parse the source value and rewrite it in the format used by the
                # existing 'created/$date' entries (detected from example_dates).
                try:
                    dt = parser.parse(str(source_val))
                    
                    # If existing entries contain 'T', assume ISO 8601;
                    # if they contain '/', assume US style (MM/DD/YYYY).
                    has_iso = any('T' in str(x) for x in example_dates)
                    has_slash = any('/' in str(x) for x in example_dates)
                    
                    if has_iso:
                        # Note: %f writes 6-digit microseconds, while the existing
                        # entries use 3-digit milliseconds.
                        new_val = dt.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
                    elif has_slash:
                        new_val = dt.strftime("%m/%d/%Y %H:%M:%S")
                    else:
                        # Fallback to a plain, unambiguous representation
                        new_val = dt.strftime("%Y-%m-%d %H:%M:%S")

                    df.at[idx, 'created/$date'] = new_val
                    migrated_count += 1
                    
                except Exception:
                    # If parsing fails, copy the raw value unchanged
                    df.at[idx, 'created/$date'] = source_val
                    migrated_count += 1

        print(f"Migrated {migrated_count} values from 'update' to 'created/$date'.")

        # Clear 'update' column
        print("Clearing 'update' column...")
        df['update'] = "" 
        
    df.to_csv(output_file, sep='\t', index=False)
    print(f"Saved dataset to {output_file}")
else:
    print(f"{input_file} not found.")

Loaded 270 records from v22_Dome-Recommendations-ReviewState_Reset.tsv
Sample existing 'created/$date' values: ['2022-09-01T15:16:05.444Z', '2022-09-01T15:16:05.444Z', '2022-09-01T15:16:05.444Z', '2022-09-01T15:16:05.443Z', '2022-12-21T01:13:22.874Z']
Migrated 143 values from 'update' to 'created/$date'.
Clearing 'update' column...
Saved dataset to v23_Dome-Recommendations-Created_Date_Fixed.tsv
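
The existing `created/$date` samples above carry three fractional digits, whereas `strftime` with `%f` always writes six. A minimal sketch of producing the millisecond form directly uses `datetime.isoformat(timespec="milliseconds")`. The helper name `to_iso_millis` is hypothetical, and treating naive datetimes as UTC is an assumption the notebook does not confirm.

```python
from datetime import datetime, timedelta, timezone

def to_iso_millis(dt):
    """Format a datetime as YYYY-MM-DDTHH:MM:SS.mmmZ.

    Aware datetimes are converted to UTC first; naive ones are ASSUMED to be UTC.
    """
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    # timespec="milliseconds" truncates the fraction to three digits
    return dt.isoformat(timespec="milliseconds") + "Z"

print(to_iso_millis(datetime(2022, 5, 20, 16, 23, 2)))
print(to_iso_millis(datetime(2022, 9, 1, 15, 16, 5, 444900)))

# Aware input at UTC+2 is shifted to UTC before formatting
plus_two = timezone(timedelta(hours=2))
print(to_iso_millis(datetime(2022, 5, 20, 18, 23, 2, tzinfo=plus_two)))
```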


In [78]:
# 26. Repair Date Formatting (Version 24)
# Objective: Align 'created/$date' format to strict ISO with 3 millisecond digits (YYYY-MM-DDTHH:MM:SS.mmmZ).
# Fix: Truncate 6-digit microseconds (python default) to 3-digit milliseconds to match existing DOME format.

import pandas as pd
import os
import re

input_file = "v23_Dome-Recommendations-Created_Date_Fixed.tsv"
output_file = "v24_Dome-Recommendations-Dates_Aligned.tsv"

if os.path.exists(input_file):
    df = pd.read_csv(input_file, sep='\t')
    print(f"Loaded {len(df)} records.")
    
    def align_date_ms(val):
        s = str(val).strip()
        if s.lower() in ['nan', 'none', '']: return ""
        
        # Regex to find .mu_secondsZ pattern (e.g. .000000Z or .123456Z)
        # We want to keep just the first 3 digits of the fraction.
        
        # 1. Check if it matches the 'Bad' format (6 digits)
        # Regex: Ends with .ddddddZ
        if re.search(r'\.\d{6}Z$', s):
            # Truncate to .dddZ
            # Split on the decimal point logic is safest
            parts = s.split('.')
            if len(parts) >= 2:
                main_part = parts[0] # YYYY-MM-DDTHH:MM:SS
                frac_part = parts[-1] # 000000Z
                
                # Take first 3 digits of fraction
                if len(frac_part) >= 3:
                    new_frac = frac_part[:3] + "Z"
                    return f"{main_part}.{new_frac}"
        
        return s

    # Apply to created/$date
    if 'created/$date' in df.columns:
        print("Repairing date formats in 'created/$date'...")
        
        # Check counts of bad formats
        bad_mask = df['created/$date'].astype(str).str.contains(r'\.\d{6}Z', regex=True, na=False)
        print(f"Entries requiring repair: {bad_mask.sum()}")
        
        if bad_mask.sum() > 0:
            sample = df.loc[bad_mask, 'created/$date'].iloc[0]
            print(f"Sample before: {sample}")
            
            # Apply fix
            df['created/$date'] = df['created/$date'].apply(align_date_ms)
            
            # check after
            new_sample = align_date_ms(sample)
            print(f"Sample after:  {new_sample}")
        else:
            print("No 6-digit timestamps found.")
        
        # Verify
        print("Repair complete.")
    
    df.to_csv(output_file, sep='\t', index=False)
    print(f"Saved aligned dataset to {output_file}")

else:
    print(f"{input_file} not found.")

Loaded 270 records.
Repairing date formats in 'created/$date'...
Entries requiring repair: 143
Sample before: 2022-05-20T16:23:02.000000Z
Sample after:  2022-05-20T16:23:02.000Z
Repair complete.
Saved aligned dataset to v24_Dome-Recommendations-Dates_Aligned.tsv
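
The truncation performed by `align_date_ms` can also be expressed as a single anchored substitution. The sketch below is an alternative formulation, not the notebook's method; `MICROS_TAIL` and `truncate_to_millis` are hypothetical names, and the helper assumes string input (missing values would need to be handled before calling it).

```python
import re
import pandas as pd

# Matches a trailing 6-digit fraction such as '.123456Z' and captures the first 3 digits
MICROS_TAIL = re.compile(r"\.(\d{3})\d{3}Z$")

def truncate_to_millis(value):
    """Trim a trailing .ddddddZ fraction to .dddZ; leave other strings unchanged."""
    return MICROS_TAIL.sub(r".\1Z", value)

dates = pd.Series([
    "2022-05-20T16:23:02.000000Z",  # needs repair
    "2022-09-01T15:16:05.444Z",     # already in the target format
    "",                             # empty stays empty
])
fixed = dates.map(truncate_to_millis)
print(fixed.tolist())
```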


In [79]:
# 27. Sync 'publication/updated' to 'updated/$date' (Version 25)
# Objective: Copy timestamp from 'publication/updated' to 'updated/$date'.
# Transformation: Convert from US Format (MM/DD/YYYY HH:MM:SS) to ISO 8601 (YYYY-MM-DDTHH:MM:SS.mmmZ)
# to ensure the columns are synchronized in time but aligned to their respective schema formats.

import pandas as pd
import os
from datetime import datetime

input_file = "v24_Dome-Recommendations-Dates_Aligned.tsv"
output_file = "v25_Dome-Recommendations-Sync_Updates.tsv"

if os.path.exists(input_file):
    df = pd.read_csv(input_file, sep='\t')
    print(f"Loaded {len(df)} records.")
    
    # helper to convert formats
    def convert_date_format(val):
        s = str(val).strip()
        if s.lower() in ['nan', 'none', '']: return ""
        
        try:
            # Parse from US format set in Step 23
            # Format: MM/DD/YYYY HH:MM:SS
            dt = datetime.strptime(s, "%m/%d/%Y %H:%M:%S")
            
            # Output to ISO format required for $date fields (as per Step 26 standards)
            # Format: YYYY-MM-DDTHH:MM:SS.mmmZ
            return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")
        except ValueError:
            # Value is not in the expected US format (it may already be ISO);
            # keep it unchanged rather than lose data.
            print(f"Warning: Could not parse date '{s}' with expected format. Copying raw.")
            return s

    if 'publication/updated' in df.columns:
        print("Syncing 'publication/updated' -> 'updated/$date' with reformatting...")
        
        # Ensure target column exists
        if 'updated/$date' not in df.columns:
            df['updated/$date'] = ""
            
        # Apply conversion
        df['updated/$date'] = df['publication/updated'].apply(convert_date_format)
        
        # Verify
        print(f"Sample Source: {df['publication/updated'].iloc[0]}")
        print(f"Sample Target: {df['updated/$date'].iloc[0]}")
        
        df.to_csv(output_file, sep='\t', index=False)
        print(f"Saved synced dataset to {output_file}")
    else:
        print("Error: 'publication/updated' column missing.")

else:
    print(f"{input_file} not found.")

Loaded 270 records.
Syncing 'publication/updated' -> 'updated/$date' with reformatting...
Sample Source: 02/02/2026 19:46:52
Sample Target: 2026-02-02T19:46:52.000Z
Saved synced dataset to v25_Dome-Recommendations-Sync_Updates.tsv
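
The conversion above appends `Z`, which denotes UTC, to a value that was generated with `datetime.now()`, which returns local time. On a machine that is not set to UTC the resulting `updated/$date` would be off by the local offset. The sketch below attaches the source timezone before converting. The notebook does not record which timezone was in effect, so `source_tz` and the UTC+1 example are assumptions.

```python
from datetime import datetime, timedelta, timezone

def us_to_iso_utc(value, source_tz=timezone.utc):
    """Convert 'MM/DD/YYYY HH:MM:SS' recorded in `source_tz` to ISO 8601 UTC with milliseconds."""
    local_dt = datetime.strptime(value, "%m/%d/%Y %H:%M:%S").replace(tzinfo=source_tz)
    utc_dt = local_dt.astimezone(timezone.utc)
    # The source has whole seconds only, so the millisecond part is always 000
    return utc_dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")

# Hypothetical: the timestamp was written on a machine running at UTC+1
utc_plus_one = timezone(timedelta(hours=1))
print(us_to_iso_utc("02/02/2026 19:46:52", utc_plus_one))

# If the machine was already on UTC, the clock time is unchanged
print(us_to_iso_utc("02/02/2026 19:46:52"))
```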


In [None]:
# 28. Sync IDs from JSON (Version 26 - Adaptive Structure Discovery)
# Objective: Retrieve IDs using direct match on specific fields.
# Improvement: Automatically discovers the JSON structure to find where fields like 'regularization' are hiding.
# This solves issues where the "content" wrapper might be missing or keys might be capitalized differently.

import pandas as pd
import json
import os

input_file = "v25_Dome-Recommendations-Sync_Updates.tsv"
json_source = "public_dome_review_raw_human_20260202.json"
output_file = "v26_Dome-Recommendations-ID_Sync_Adaptive.tsv"
failure_report = "v26_ID_Sync_Failures.tsv"

PRIORITY_FIELDS = [
    'optimization/regularization', 
    'dataset/provenance',
    'dataset/splits',
    'model/learning'
]

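def normalize(val):
    """Lowercase and strip a value; treat None and NaN placeholders as empty."""
    if val is None: return ""
    s = str(val).lower().strip()
    # Compare the whole string: replace('nan', '') would also mangle words containing 'nan'
    return "" if s == 'nan' else s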

# --- Structure Discovery Helper ---
def find_keys_recursive(data, target_key, current_path=None):
    """
    Recursively searches nested dictionaries for a specific key (case-insensitive).
    Returns the path to the parent of that key, or None if it is not found.
    """
    if current_path is None:
        current_path = []
    if isinstance(data, dict):
        for k, v in data.items():
            if k.lower() == target_key.lower():
                return current_path # Found it! Return path to parent
            
            # Recurse
            path = find_keys_recursive(v, target_key, current_path + [k])
            if path is not None:
                return path
    return None

def get_value_by_path(entry, path, leaf_key):
    """Retrieves value given a path list and a leaf key."""
    curr = entry
    try:
        # Traverse path
        for p in path:
            curr = curr[p]
        
        # Find the leaf key (case-insensitive match in the final dict)
        if isinstance(curr, dict):
            for k, v in curr.items():
                if k.lower() == leaf_key.lower():
                    return v
    except (KeyError, IndexError, TypeError):
        return ""
    return ""

if os.path.exists(input_file) and os.path.exists(json_source):
    df = pd.read_csv(input_file, sep='\t')
    print(f"Loaded TSV: {len(df)} records.")
    
    with open(json_source, 'r') as f:
        json_data = json.load(f)
    print(f"Loaded JSON: {len(json_data)} entries.")

    # 1. Inspect Structure
    # Look at the first 20 entries to find where each priority field's key lives
    found_paths = {}
    
    print("Detecting JSON structure...")
    for field in PRIORITY_FIELDS:
        _, key = field.split('/')
        
        for entry in json_data[:20]: # Check first 20 entries
            path = find_keys_recursive(entry, key, [])
            if path is not None:
                found_paths[field] = path
                print(f"  Found '{key}' at path: {path}")
                break
        
        if field not in found_paths:
            print(f"  WARNING: Could not find key '{key}' in sample JSON entries. Matching for this field will fail.")

    # 2. Build Lookups using Discovered Paths
    lookups = {field: {} for field in PRIORITY_FIELDS}
    collisions = {field: set() for field in PRIORITY_FIELDS}
    
    print("Building lookups...")
    
    for entry in json_data:
        ids = {
            '_id/$oid': entry.get('_id', {}).get('$oid', ''),
            'uuid': entry.get('uuid', ''),
            'shortid': entry.get('shortid', '')
        }
        
        for field in PRIORITY_FIELDS:
            if field in found_paths:
                section, key = field.split('/')
                path = found_paths[field]
                
                # Extract value using the dynamic path
                raw_val = get_value_by_path(entry, path, key)
                val = normalize(raw_val)
                
                if len(val) > 3:
                    if val in lookups[field]:
                        collisions[field].add(val)
                    else:
                        lookups[field][val] = ids

    # Clean collisions
    for field in PRIORITY_FIELDS:
        if collisions[field]:
            print(f"  Field '{field}': Removing {len(collisions[field])} non-unique values.")
            for k in collisions[field]:
                if k in lookups[field]:
                    del lookups[field][k]

    # 3. Match
    synced_count = 0
    updated_rows = 0
    failures = []
    
    for c in ['_id/$oid', 'uuid', 'shortid']:
        if c not in df.columns: df[c] = ""
        
    for i, row in df.iterrows():
        # Skip rows that already have both '_id/$oid' and 'uuid'
        if normalize(row.get('_id/$oid')) and normalize(row.get('uuid')):
            continue
            
        match_found = None
        match_source = ""
        
        for field in PRIORITY_FIELDS:
            val = normalize(row.get(field, ''))
            
            if len(val) > 3 and field in lookups and val in lookups[field]:
                match_found = lookups[field][val]
                match_source = field
                break
        
        if match_found:
            updated_any = False
            for k, v in match_found.items():
                if not normalize(df.at[i, k]):
                    df.at[i, k] = v
                    updated_any = True
            if updated_any:
                synced_count += 1
                
        # Check Final Status for Reporting
        # If we still don't have an ID after attempting match
        current_oid = normalize(df.at[i, '_id/$oid'])
        if not current_oid:
            failures.append({
                'index': i,
                'doi': row.get('publication/doi'),
                'title': row.get('publication/title'),
                'optimization/regularization': row.get('optimization/regularization'),
                'reason': 'No unique match found in priority fields'
            })

    # Save
    df.to_csv(output_file, sep='\t', index=False)
    print("-" * 30)
    print(f"Sync Complete. Recovered IDs for {synced_count} entries.")
    print(f"Saved to {output_file}")
    
    # 4. Report Failures
    if failures:
        print(f"\n[ATTENTION] {len(failures)} entries are STILL missing IDs after sync.")
        df_fail = pd.DataFrame(failures)
        df_fail.to_csv(failure_report, sep='\t', index=False)
        print(f"Failure report saved to: {failure_report}")
        print("First 5 failures:")
        print(df_fail[['index', 'doi', 'optimization/regularization']].head(5).to_string())
    else:
        print("\n[SUCCESS] No missing IDs remaining!")
    
else:
    print("Files not found.")

Loaded TSV: 270 records.
Loaded JSON: 281 entries.
Detecting JSON structure...
  Found 'regularization' at path: ['optimization']
  Found 'provenance' at path: ['dataset']
  Found 'splits' at path: ['dataset']
Building lookups...
  Field 'optimization/regularization': Removing 4 non-unique values.
  Field 'dataset/provenance': Removing 2 non-unique values.
  Field 'dataset/splits': Removing 4 non-unique values.
------------------------------
Sync Complete. Recovered IDs for 118 entries.
Saved to v26_Dome-Recommendations-ID_Sync_Adaptive.tsv
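
The matching step depends on each lookup value identifying exactly one JSON entry, which is why colliding values are removed before matching. The sketch below isolates that idea with a counter-based helper. It is a simplified illustration: `build_unique_lookup` is a hypothetical name, the entries are flat dictionaries with made-up values rather than the nested JSON structure, and the length threshold mirrors the `len(val) > 3` check.

```python
from collections import Counter

def build_unique_lookup(entries, field, min_len=4):
    """Map normalized field values to entries, keeping only values that occur exactly once."""
    def norm(v):
        s = "" if v is None else str(v).strip().lower()
        return "" if s == "nan" else s

    values = [norm(e.get(field)) for e in entries]
    # Count only values long enough to be a reliable matching key
    counts = Counter(v for v in values if len(v) >= min_len)
    return {v: e for v, e in zip(values, entries) if counts.get(v) == 1}

# Made-up entries for illustration
entries = [
    {"shortid": "a1", "regularization": "Dropout 0.5"},
    {"shortid": "b2", "regularization": "dropout 0.5 "},    # collides with a1 after normalization
    {"shortid": "c3", "regularization": "L2 weight decay"},
    {"shortid": "d4", "regularization": "no"},               # too short to use as a key
]

lookup = build_unique_lookup(entries, "regularization")
print(sorted(lookup))
```

Counting first and filtering afterwards avoids the separate collision set and the delete pass, at the cost of holding all normalized values in memory.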
