# Europe PMC Links Generator

This notebook is a Python port of the original Perl script. It follows the same steps: it validates your files against the official Europe PMC schema (labslink.xsd), reads or creates the provider details file, and converts your TSV file into the required links XML format.

## Prerequisites
To run this in a local Jupyter environment or Google Colab, you will need the lxml, pandas, and requests libraries. You can install them by running this cell first:

In [1]:
!pip install lxml pandas requests



## Step 1: Imports and Setup
This cell imports the necessary libraries, then downloads and compiles the Europe PMC schema that the later steps validate against.

In [2]:
import os
import sys
import time
import requests
import pandas as pd
from lxml import etree

# Location of the official Europe PMC LabsLink schema
SCHEMA_URL = 'http://europepmc.org/docs/labslink.xsd'

def get_schema():
    """Fetches and parses the validation schema from Europe PMC."""
    try:
        # A timeout keeps the cell from hanging if the server does not respond
        response = requests.get(SCHEMA_URL, timeout=30)
        response.raise_for_status()
        schema_root = etree.XML(response.content)
        return etree.XMLSchema(schema_root)
    except Exception as e:
        print(f"Error loading schema from {SCHEMA_URL}: {e}")
        sys.exit(1)

# Initialize schema
schema = get_schema()
print("Schema loaded successfully.")

Schema loaded successfully.


## Step 2: Configuration
Define your file names here. In a notebook, this replaces the command-line arguments.

In [3]:
# --- USER CONFIGURATION ---
TAB_DELIMITED_FILE = 'DOME_Registry_TSV_Files/remediated_Failed_DOI_Mappings_2025-11-20.tsv'       # Your input TSV file
PROVIDER_DETAILS_FILE = 'provider_info.xml'  # Where to save/read provider details
OUTPUT_LINKS_FILE = 'output_links.xml'       # Final XML output
# --------------------------

if not os.path.exists(TAB_DELIMITED_FILE):
    print(f"Error: Cannot find input file '{TAB_DELIMITED_FILE}'.")
    print("Please upload your TSV file before proceeding.")
else:
    print(f"Input file '{TAB_DELIMITED_FILE}' found.")

Input file 'DOME_Registry_TSV_Files/remediated_Failed_DOI_Mappings_2025-11-20.tsv' found.
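Finding the file does not guarantee that it has the columns the conversion step reads. The following is a minimal sketch of a header check using only the standard library. The helper name `check_header` and the in-memory sample are illustrative assumptions; the column names are the ones accessed with `row.get(...)` in Step 4.

```python
import csv
import io

# Columns the Step 4 cell reads via row.get(...)
ID_COLUMNS = ["Manual_PMID", "mapped_pmid", "Manual_PMCID", "mapped_pmcid", "publication_doi"]
KEY_COLUMNS = ["shortid", "_id"]


def check_header(tsv_handle):
    """Return a list of problems found in the TSV header (hypothetical helper)."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    problems = []
    if not any(col in header for col in KEY_COLUMNS):
        problems.append("no 'shortid' or '_id' column")
    if not any(col in header for col in ID_COLUMNS):
        problems.append("no PMID, PMCID, or DOI column")
    return problems


# Illustrative in-memory sample, not taken from the real file
sample = io.StringIO("shortid\tmapped_pmid\tpublication_title\nabc123\t12345678\tExample\n")
print(check_header(sample))
```

With the real input, you would open `TAB_DELIMITED_FILE` with `open(..., newline='')` and pass the handle in; an empty list means the header has at least one usable key column and one identifier column.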


## Step 3: Handle Provider Details
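This section checks whether your provider XML file exists. If it does, the file is validated against the schema and the provider ID is read from it. If not, you are prompted for the details (provider ID, resource name, description, email address), and the file is created, validated, and saved.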

In [4]:
provider_id = ''

def prompt_user(message):
    """Helper to get user input."""
    value = input(f"{message}: ").strip()
    if not value:
        raise ValueError("You must enter a value.")
    return value

if os.path.exists(PROVIDER_DETAILS_FILE):
    # File exists: Read and Validate
    try:
        parser = etree.XMLParser(remove_blank_text=True)
        xml_doc = etree.parse(PROVIDER_DETAILS_FILE, parser)
        schema.assertValid(xml_doc)
        
        # Extract Provider ID
        root = xml_doc.getroot()
        # Namespace handling might be needed depending on schema,
        # but usually LabsLink is simple. We look for the <id> tag.
        found_id = root.find('.//id')
        if found_id is not None:
            provider_id = found_id.text
            print(f"Loaded Provider ID: {provider_id}")
        else:
            raise ValueError("Could not find <id> element in provider file.")

    except etree.DocumentInvalid as e:
        print(f"Validation Error in {PROVIDER_DETAILS_FILE}:")
        print(e)
        sys.exit(1)
else:
    # File does not exist: Create it
    print(f"'{PROVIDER_DETAILS_FILE}' not found. Let's create it.")
    
    try:
        p_id = prompt_user("Enter your provider ID")
        r_name = prompt_user("Enter your resource name (heading for links)")
        desc = prompt_user("Enter a description")
        email = prompt_user("Enter your email address")
        
        # Build XML
        providers = etree.Element('providers')
        provider = etree.SubElement(providers, 'provider')
        
        etree.SubElement(provider, 'id').text = p_id
        etree.SubElement(provider, 'resourceName').text = r_name
        etree.SubElement(provider, 'description').text = desc
        etree.SubElement(provider, 'email').text = email
        
        xml_doc = etree.ElementTree(providers)
        
        # Validate before saving
        schema.assertValid(xml_doc)
        
        # Save
        xml_doc.write(PROVIDER_DETAILS_FILE, pretty_print=True, xml_declaration=True, encoding='UTF-8')
        provider_id = p_id
        print(f"Created and validated {PROVIDER_DETAILS_FILE}")
        
    except Exception as e:
        print(f"Error creating provider file: {e}")
        sys.exit(1)

'provider_info.xml' not found. Let's create it.
Created and validated provider_info.xml
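For reference, the provider file built above has a simple nested shape. The sketch below reproduces it with the standard library's `xml.etree.ElementTree`, swapped in for lxml here only so the snippet runs without extra installs; the element names match the cell above. The function name `build_provider_xml` is hypothetical and all values are placeholders, not real provider details.

```python
import xml.etree.ElementTree as ET


def build_provider_xml(p_id, resource_name, description, email):
    """Build a <providers> tree with the same element names as the cell above."""
    providers = ET.Element("providers")
    provider = ET.SubElement(providers, "provider")
    ET.SubElement(provider, "id").text = p_id
    ET.SubElement(provider, "resourceName").text = resource_name
    ET.SubElement(provider, "description").text = description
    ET.SubElement(provider, "email").text = email
    return providers


# Placeholder values for illustration only
tree = build_provider_xml("0000", "Example Resource", "An example description", "user@example.org")
print(ET.tostring(tree, encoding="unicode"))

# Reading the ID back mirrors root.find('.//id') in the cell above
print(tree.find(".//id").text)
```

This sketch only shows the structure. It does not validate anything; schema validation still relies on the lxml `schema` object loaded in Step 1.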


## Step 4: Convert TSV to Links XML
This is the main logic. It reads your TSV with pandas (which accepts both Unix and Windows line endings without extra handling), builds the XML tree with lxml, validates the result against the schema, and saves it.

In [6]:
print("Processing links...")

try:
    # Read the TSV file. 
    # We read it as a standard dataframe first, then map to the required structure
    # The input file is expected to be the DOME Registry TSV (or remediation file)
    df = pd.read_csv(TAB_DELIMITED_FILE, sep='\t', dtype=str)
    
    print(f"Loaded {len(df)} rows from {TAB_DELIMITED_FILE}")
    
    # Create root XML element
    root_element = etree.Element('links')

    valid_rows = 0

    # Helper to check if value is valid (not NaN/None/Empty).
    # Defined once, outside the loop, rather than on every iteration.
    def is_valid(val):
        return pd.notna(val) and str(val).strip().lower() not in ('', 'nan')

    for index, row in df.iterrows():
        # 1. Determine Source and ID
        # Priority: Manual PMID > Mapped PMID > Manual PMCID > Mapped PMCID > DOI
        source = None
        record_id = None

        # Check PMIDs (Manual then Mapped)
        if is_valid(row.get('Manual_PMID')):
            source = 'MED'
            record_id = str(row.get('Manual_PMID')).split('.')[0] # Remove .0 if float string
        elif is_valid(row.get('mapped_pmid')):
            source = 'MED'
            record_id = str(row.get('mapped_pmid')).split('.')[0]
            
        # Check PMCIDs if no PMID
        if not source:
            if is_valid(row.get('Manual_PMCID')):
                source = 'PMC'
                record_id = str(row.get('Manual_PMCID'))
            elif is_valid(row.get('mapped_pmcid')):
                source = 'PMC'
                record_id = str(row.get('mapped_pmcid'))
        
        # Check DOI if no PMID/PMCID
        if not source:
            if is_valid(row.get('publication_doi')):
                source = 'DOI'
                record_id = str(row.get('publication_doi'))

        # 2. Determine URL and Title
        shortid = row.get('shortid')
        if not is_valid(shortid):
            # Try '_id' if shortid is missing
            shortid = row.get('_id')
            
        if not is_valid(shortid):
            print(f"Skipping line {index + 2}: Missing shortid.")
            continue
            
        # Generate DOME URL
        link_url = f"https://registry.dome-ml.org/entry/{shortid}"
        
        # Get Title
        link_title = row.get('publication_title', '')
        if not is_valid(link_title):
            link_title = f"DOME Registry Entry {shortid}"

        # Skip if we couldn't find a valid source/id to link TO in Europe PMC
        if not source or not record_id:
            print(f"Skipping line {index + 2} (ID: {shortid}): No valid PMID, PMCID, or DOI found.")
            continue

        # Create hierarchy: link -> resource/record
        link_element = etree.SubElement(root_element, 'link', providerId=provider_id)
        
        # Resource block (The external link to DOME)
        resource_element = etree.SubElement(link_element, 'resource')
        url_element = etree.SubElement(resource_element, 'url')
        url_element.text = link_url
        
        if link_title:
            title_element = etree.SubElement(resource_element, 'title')
            title_element.text = str(link_title)

        # Record block (The Europe PMC record we are attaching to)
        record_element = etree.SubElement(link_element, 'record')
        source_element = etree.SubElement(record_element, 'source')
        source_element.text = source
        id_element = etree.SubElement(record_element, 'id')
        id_element.text = record_id
        
        valid_rows += 1

    # Create the ElementTree
    links_doc = etree.ElementTree(root_element)

    # Validate the generated XML against the schema
    schema.assertValid(links_doc)
    print("Generated XML is valid against labslink.xsd")

    # Write to file
    links_doc.write(OUTPUT_LINKS_FILE, pretty_print=True, xml_declaration=True, encoding='UTF-8')
    
    print("------------------------------------------------")
    print("Program finished.")
    print(f"Processed {valid_rows} links.")
    print(f"Files created: {OUTPUT_LINKS_FILE}, {PROVIDER_DETAILS_FILE}")
    print("Please upload both files to the Europe PMC FTP site.")

except etree.DocumentInvalid as e:
    print(f"XML Validation Failed: {e}")
    # Save error log
    with open(f"validation_errors_{int(time.time())}.txt", "w") as f:
        f.write(str(e))
    print("See validation_errors file for details.")

except Exception as e:
    print(f"An unexpected error occurred: {e}")

Processing links...
Loaded 33 rows from DOME_Registry_TSV_Files/remediated_Failed_DOI_Mappings_2025-11-20.tsv
XML Validation Failed: Element 'source': [facet 'enumeration'] The value 'DOI' is not an element of the set {'AGR', 'CIT', 'CBA', 'CTX', 'ETH', 'HIR', 'MED', 'PAT', 'PMC'}.
See validation_errors file for details.
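The error above comes from the schema's `source` enumeration: `DOI` is not one of the permitted values, so a single row that falls through to the DOI branch makes the whole document invalid. One possible workaround, sketched below under assumptions, is to resolve each row's identifier in a standalone function that returns `None` for rows with only a DOI, so they can be skipped and reported rather than written. The function name `pick_record` and the sample rows are hypothetical; the allowed-source set is copied from the error message and the column names from the cell above. Whether labslink.xsd offers another way to link by DOI is not covered here and would need checking against the schema itself.

```python
# Allowed <source> values, copied from the validation error message above
ALLOWED_SOURCES = {"AGR", "CIT", "CBA", "CTX", "ETH", "HIR", "MED", "PAT", "PMC"}

# (column, source) pairs in the same priority order as the conversion cell
PRIORITY = [
    ("Manual_PMID", "MED"),
    ("mapped_pmid", "MED"),
    ("Manual_PMCID", "PMC"),
    ("mapped_pmcid", "PMC"),
]


def is_valid(val):
    """True if val is non-empty and not a stringified NaN."""
    return val is not None and str(val).strip().lower() not in ("", "nan")


def pick_record(row):
    """Return (source, record_id) for a row dict, or None if no PMID/PMCID is available."""
    for column, source in PRIORITY:
        value = row.get(column)
        if is_valid(value):
            record_id = str(value).strip()
            if source == "MED":
                # Drop a trailing ".0" left over from float-like strings
                record_id = record_id.split(".")[0]
            return source, record_id
    # DOI-only rows (or rows with no identifier) end up here
    return None


# Hypothetical sample rows for illustration
rows = [
    {"shortid": "a1", "mapped_pmid": "12345678.0"},
    {"shortid": "b2", "Manual_PMCID": "PMC1234567"},
    {"shortid": "c3", "publication_doi": "10.1000/example"},
]
for row in rows:
    print(row["shortid"], pick_record(row))
```

In the conversion loop, a `None` result would take the same `continue` path as the existing "No valid PMID, PMCID, or DOI found" message, which keeps unsupported sources out of the generated XML.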
