# Link Dataset to Publications
**[Work in progress]**

This notebook downloads and standardizes accession numbers for life science and biological databases that [Europe PMC](https://europepmc.org/) has text-mined from PubMed Central (PMC) full-text articles and preprints (PPR), for ingestion into a Knowledge Graph. In addition, it downloads PubMed to PDB mappings.

Data sources: [Europe PMC](ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/), [PMC](https://www.ncbi.nlm.nih.gov/pmc/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
ftp = 'ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/'

In [4]:
# os.environ[...] raises a KeyError naming the variable if NEO4J_IMPORT is not set
NEO4J_IMPORT = Path(os.environ['NEO4J_IMPORT'])
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


## Assign unique identifiers for interoperability
A [CURIE](https://en.wikipedia.org/wiki/CURIE) (Compact URI) is a compact abbreviation for a Uniform Resource Identifier (URI). A CURIE consists of a registered prefix and an accession number (prefix:accession). The prefix provides a namespace that keeps identifiers unique and makes them interoperable among data resources.

[Identifiers.org](http://identifiers.org/) provides a registry and resolution service for life science CURIEs. 
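
As a minimal sketch of the prefix:accession pattern, the helpers below build a CURIE, split it again, and form the corresponding Identifiers.org URL. The function names `to_curie`, `split_curie`, and `resolver_url` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical helpers illustrating the prefix:accession pattern.
IDENTIFIERS_ORG = "https://identifiers.org/"  # base URL of the resolution service


def to_curie(prefix, accession):
    """Join a registered prefix and an accession number into a CURIE."""
    return f"{prefix}:{accession}"


def split_curie(curie):
    """Split a CURIE at the first colon into (prefix, accession)."""
    prefix, accession = curie.split(":", 1)
    return prefix, accession


def resolver_url(curie):
    """Build the Identifiers.org URL for a CURIE."""
    return IDENTIFIERS_ORG + curie


curie = to_curie("uniprot", "P0DTD1")
print(curie)
print(split_curie(curie))
print(resolver_url(curie))
```

Splitting at the first colon only keeps accessions that themselves contain a colon intact.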

In [5]:
def assign_publication_id(row):
    if row['PMCID'] != '':
        # CURIE: pmc (PubMed Central, PMC)
        return 'pmc:' + str(row['PMCID'])
    elif row['SOURCE'] == 'PPR':
        # no CURIE available, use URI for preprints
        return 'https://europepmc.org/article/PPR/' + row['EXTID']
    else:
        return ''
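
The three branches of `assign_publication_id` can be checked on a few hand-written rows. The sketch below repeats the function so that it runs on its own and uses plain dictionaries in place of DataFrame rows. The third row is a made-up case (no PMCID, not a preprint) chosen only to exercise the final branch.

```python
def assign_publication_id(row):
    # Same logic as the notebook cell, repeated so this sketch is self-contained
    if row['PMCID'] != '':
        return 'pmc:' + str(row['PMCID'])
    elif row['SOURCE'] == 'PPR':
        return 'https://europepmc.org/article/PPR/' + row['EXTID']
    else:
        return ''


# Illustrative rows; plain dicts stand in for DataFrame rows
rows = [
    {'PMCID': 'PMC7512552', 'EXTID': '32963006', 'SOURCE': 'MED'},  # article in PMC
    {'PMCID': '', 'EXTID': 'PPR204419', 'SOURCE': 'PPR'},           # preprint
    {'PMCID': '', 'EXTID': '32963006', 'SOURCE': 'MED'},            # made-up: neither
]

ids = [assign_publication_id(r) for r in rows]
for r, i in zip(rows, ids):
    print(r['SOURCE'], repr(i))
```

Rows that fall through to the empty string are the ones later removed by the `id != ''` queries.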

### UniProt
**accession**: CURIE: [uniprot](https://registry.identifiers.org/registry/uniprot) (UniProt Knowledgebase, UniProtKB)

In [6]:
uniprot = pd.read_csv(ftp + "uniprot.csv", dtype=str)
uniprot.fillna('', inplace=True)
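
The combination of `dtype=str` and `fillna('')` matters for the identifier assignment that follows. Without the fill, a missing PMCID is `NaN`, the test `row['PMCID'] != ''` is true, and the identifier would become `pmc:nan`. The sketch below shows this on a small in-memory file. The rows are made up in the column layout of the Europe PMC files and are not real data.

```python
import io

import pandas as pd

# Made-up rows in the column layout of the Europe PMC files (illustrative only)
csv_text = """uniprot,PMCID,EXTID,SOURCE
P0DTD1,PMC7512552,32963006,MED
P0DTC2,,PPR204419,PPR
"""

df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df['PMCID'].isna().tolist())    # the empty field is read as a missing value

df.fillna('', inplace=True)
print((df['PMCID'] != '').tolist())   # a plain string comparison now works
```

Reading every column as a string also keeps numeric-looking identifiers such as EXTID from being converted to integers or floats.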

In [7]:
uniprot['id'] = uniprot.apply(assign_publication_id, axis=1)
uniprot['accession'] = 'uniprot:' + uniprot['uniprot']
uniprot = uniprot[['id','accession']]
uniprot.query("id != ''", inplace=True)

In [8]:
uniprot.shape

(388137, 2)

### NCBI Reference Sequences
**accession**: CURIE: [refseq](https://registry.identifiers.org/registry/refseq) (NCBI Reference Sequences, RefSeq)

In [9]:
refseq = pd.read_csv(ftp + "refseq.csv", dtype=str)
refseq.fillna('', inplace=True)
refseq.head()

Unnamed: 0,refseq,PMCID,EXTID,SOURCE
0,NM_015973,PMC7512552,32963006,MED
1,NM_001034,PMC7512552,32963006,MED
2,NM_001789,PMC7512552,32963006,MED
3,NM_001008708,PMC7512552,32963006,MED
4,NM_203467,PMC7512552,32963006,MED


In [10]:
refseq['id'] = refseq.apply(assign_publication_id, axis=1)
# Remove version number from refseq to match to the latest version
refseq['accession'] = 'refseq:' + refseq['refseq'].str.split('.', expand=True)[0]
refseq = refseq[['id','accession']]
refseq.query("id != ''", inplace=True)
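
The version removal can be illustrated on a few values. This is a sketch with example accessions. The versioned forms are included only to show the `accession.version` pattern, and `.str[0]` is used here as an alternative to `expand=True` followed by `[0]`.

```python
import pandas as pd

# Example accessions, with and without a version suffix (illustrative only)
s = pd.Series(['NM_015973', 'NC_045512.2', 'NM_001034.4'])

# Split at the first dot and keep the part before it
stripped = 'refseq:' + s.str.split('.', n=1).str[0]
print(stripped.tolist())
```

Accessions without a dot pass through unchanged, so versioned and unversioned mentions collapse onto the same CURIE.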

In [11]:
refseq.shape

(311282, 2)

### GISAID Genome Sequences
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)


In [12]:
gisaid = pd.read_csv(ftp + "gisaid.csv", dtype=str)
gisaid.fillna('', inplace=True)

In [13]:
gisaid['id'] = gisaid.apply(assign_publication_id, axis=1)

In [14]:
gisaid['accession'] = 'https://www.gisaid.org/' + gisaid['gisaid']
gisaid = gisaid[['id','accession']]
gisaid.query("id != ''", inplace=True)

In [15]:
gisaid.shape

(6978, 2)

### Protein Data Bank
**accession**: CURIE: [pdb](https://registry.identifiers.org/registry/pdb) (Protein Data Bank, PDB)

In [16]:
pdb = pd.read_csv(ftp + "pdb.csv", dtype=str)
pdb.fillna('', inplace=True)

In [17]:
pdb['id'] = pdb.apply(assign_publication_id, axis=1)
pdb['accession'] = 'pdb:' + pdb['pdb']
pdb = pdb[['id','accession']]
pdb.query("id != ''", inplace=True)

In [18]:
pdb.shape

(525528, 2)

## Match Dataset Mentions

### Match by UniProt accessions

In [19]:
ref1 = pd.read_csv(NEO4J_IMPORT / "01a-UniProtProtein.csv")
ref1 = ref1[['accession']]
ref1 = ref1.drop_duplicates()
ref1.head()

Unnamed: 0,accession
0,uniprot:P0DTD1
16,uniprot:P0DTC7
18,uniprot:P0DTD2
19,uniprot:P0DTC9
20,uniprot:P0DTC3


In [20]:
pmc_uniprot = pd.merge(uniprot, ref1, on='accession')
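
`pd.merge` performs an inner join by default, so only mentions whose accession also occurs in the reference table are kept. The sketch below uses made-up publication ids and one made-up accession (`uniprot:X00000`) to show the effect, and adds a left join with `indicator=True` as one possible way to inspect what was dropped.

```python
import pandas as pd

# Made-up mentions; 'uniprot:X00000' is a placeholder that is absent from the reference
mentions = pd.DataFrame({
    'id': ['pmc:PMC1', 'pmc:PMC2', 'pmc:PMC3'],
    'accession': ['uniprot:P0DTD1', 'uniprot:P24385', 'uniprot:X00000'],
})
reference = pd.DataFrame({'accession': ['uniprot:P0DTD1', 'uniprot:P24385']})

# Inner join (the default): keeps only accessions present in both tables
matched = pd.merge(mentions, reference, on='accession')
print(matched)

# Left join with an indicator column to list the mentions without a match
audit = pd.merge(mentions, reference, on='accession', how='left', indicator=True)
unmatched = audit[audit['_merge'] == 'left_only']
print(unmatched[['id', 'accession']])
```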

In [21]:
pmc_uniprot.to_csv(NEO4J_IMPORT / "01h-PMC-UniProtProtein.csv", index=False)

In [22]:
print('UniProt mentions:', pmc_uniprot.shape[0])

UniProt mentions: 161775


In [23]:
pmc_uniprot.head()

Unnamed: 0,id,accession
0,pmc:PMC6713643,uniprot:P24385
1,pmc:PMC5761900,uniprot:P24385
2,pmc:PMC3275796,uniprot:P24385
3,pmc:PMC4823807,uniprot:P24385
4,pmc:PMC5474285,uniprot:P24385


### Match Strains by NCBI RefSeq and GISAID accessions

In [24]:
ref2 = pd.read_csv(NEO4J_IMPORT / "01c-CNCBStrain.csv")

In [25]:
ref2['secondaryAccession'] = ref2['accessions'].str.split(';')
ref2 = ref2.explode('secondaryAccession')
ref2.rename(columns={'id': 'primaryAccession'}, inplace=True)
ref2 = ref2[['primaryAccession', 'secondaryAccession']]
ref2 = ref2.drop_duplicates()
ref2.head()

Unnamed: 0,primaryAccession,secondaryAccession
0,https://www.gisaid.org/EPI_ISL_402132,NMDC60013088-01
0,https://www.gisaid.org/EPI_ISL_402132,https://www.gisaid.org/EPI_ISL_402132
1,https://www.gisaid.org/EPI_ISL_403963,https://www.gisaid.org/EPI_ISL_403963
2,https://www.gisaid.org/EPI_ISL_403962,https://www.gisaid.org/EPI_ISL_403962
3,https://www.gisaid.org/EPI_ISL_402120,NMDC60013085-01
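
The split-and-explode step turns one row per strain into one row per (primary, secondary) accession pair. A minimal sketch on two rows follows. The contents of the `accessions` strings are assumptions modeled on the output above, not copied from the input file.

```python
import pandas as pd

# Two illustrative strains; the 'accessions' strings are assumed, not real file content
strains = pd.DataFrame({
    'id': ['https://www.gisaid.org/EPI_ISL_402132', 'insdc:MN908947'],
    'accessions': [
        'NMDC60013088-01;https://www.gisaid.org/EPI_ISL_402132',
        'refseq:NC_045512',
    ],
})

# Split the semicolon-separated string into a list, then emit one row per list element
strains['secondaryAccession'] = strains['accessions'].str.split(';')
exploded = strains.explode('secondaryAccession')
print(exploded[['id', 'secondaryAccession']])
```

`explode` repeats the original index for every element of a list, which is why the index `0` appears twice in the output above.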


In [26]:
pmc_cncb1 = pd.merge(refseq, ref2, left_on='accession', right_on='secondaryAccession')

In [27]:
pmc_cncb1.head()

Unnamed: 0,id,accession,primaryAccession,secondaryAccession
0,pmc:PMC7596387,refseq:NC_045512,insdc:MN908947,refseq:NC_045512
1,pmc:PMC7290700,refseq:NC_045512,insdc:MN908947,refseq:NC_045512
2,https://europepmc.org/article/PPR/PPR204419,refseq:NC_045512,insdc:MN908947,refseq:NC_045512
3,pmc:PMC7352669,refseq:NC_045512,insdc:MN908947,refseq:NC_045512
4,https://europepmc.org/article/PPR/PPR227723,refseq:NC_045512,insdc:MN908947,refseq:NC_045512


Use the primary accession to establish the linkage.

In [28]:
pmc_cncb1['accession'] = pmc_cncb1['primaryAccession']

In [29]:
pmc_cncb1 = pmc_cncb1[['id', 'accession']]
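
Taken together, the last few cells look up each mentioned accession among the secondary accessions and then replace it with the strain's primary accession. The sketch below condenses that into one small example. The second mention uses a made-up publication id and is there to show that accessions without a matching strain are dropped.

```python
import pandas as pd

# Illustrative mentions; 'pmc:PMC0000000' is a made-up publication id
mentions = pd.DataFrame({
    'id': ['pmc:PMC7596387', 'pmc:PMC0000000'],
    'accession': ['refseq:NC_045512', 'refseq:NM_015973'],
})
# Lookup table from secondary to primary accession
lookup = pd.DataFrame({
    'primaryAccession': ['insdc:MN908947'],
    'secondaryAccession': ['refseq:NC_045512'],
})

# Join the mentioned accession against the secondary accessions
linked = pd.merge(mentions, lookup, left_on='accession', right_on='secondaryAccession')
# Replace the mentioned accession with the primary accession and keep two columns
linked = linked.assign(accession=linked['primaryAccession'])[['id', 'accession']]
print(linked)
```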

In [30]:
print('refseq mentions:', pmc_cncb1.shape[0])

refseq mentions: 670


In [31]:
pmc_cncb1.head()

Unnamed: 0,id,accession
0,pmc:PMC7596387,insdc:MN908947
1,pmc:PMC7290700,insdc:MN908947
2,https://europepmc.org/article/PPR/PPR204419,insdc:MN908947
3,pmc:PMC7352669,insdc:MN908947
4,https://europepmc.org/article/PPR/PPR227723,insdc:MN908947


In [32]:
pmc_cncb2 = pd.merge(gisaid, ref2, left_on='accession', right_on='secondaryAccession')

In [33]:
pmc_cncb2.head()

Unnamed: 0,id,accession,primaryAccession,secondaryAccession
0,https://europepmc.org/article/PPR/PPR167663,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131
1,pmc:PMC7166309,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131
2,https://europepmc.org/article/PPR/PPR190800,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131
3,pmc:PMC7497811,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131
4,pmc:PMC7205519,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131


In [34]:
pmc_cncb2['accession'] = pmc_cncb2['primaryAccession']

In [35]:
pmc_cncb2 = pmc_cncb2[['id', 'accession']]

In [36]:
print('GISAID mentions:', pmc_cncb2.shape[0])

GISAID mentions: 3081


In [37]:
pmc_cncb2.head()

Unnamed: 0,id,accession
0,https://europepmc.org/article/PPR/PPR167663,GWHABKP00000001
1,pmc:PMC7166309,GWHABKP00000001
2,https://europepmc.org/article/PPR/PPR190800,GWHABKP00000001
3,pmc:PMC7497811,GWHABKP00000001
4,pmc:PMC7205519,GWHABKP00000001


In [38]:
pmc_cncb = pd.concat([pmc_cncb1, pmc_cncb2])

In [39]:
pmc_cncb.to_csv(NEO4J_IMPORT / "01h-PMC-CNCBStrain.csv", index=False)

### Match Genomes by RefSeq accession

In [40]:
ref3 = pd.read_csv(NEO4J_IMPORT / "Genome.csv")
ref3 = ref3[['refSeq']]
ref3 = ref3.drop_duplicates()
ref3.rename(columns={'refSeq': 'accession'}, inplace=True)
ref3.head()

Unnamed: 0,accession
0,refseq:NC_045512
1,refseq:NC_038294
2,refseq:NC_004718
3,refseq:NC_002645
4,refseq:NC_000001


In [41]:
pmc_genome = pd.merge(refseq, ref3, on='accession')

In [42]:
print('refseq mentions:', pmc_genome.shape[0])

refseq mentions: 3400


In [43]:
pmc_genome.head()

Unnamed: 0,id,accession
0,pmc:PMC2684143,refseq:NC_002645
1,pmc:PMC6390631,refseq:NC_002645
2,pmc:PMC3812135,refseq:NC_002645
3,pmc:PMC7121196,refseq:NC_002645
4,pmc:PMC3966378,refseq:NC_002645


In [44]:
pmc_genome.to_csv(NEO4J_IMPORT / "01h-PMC-Genome.csv", index=False)

### Match Protein-Protein Interactions by UniProt accession

In [45]:
ref4 = pd.read_csv(NEO4J_IMPORT / "01e-ProteinProteinInteractionProtein.csv")
ref4 = ref4[['accession']]
ref4 = ref4.drop_duplicates()
ref4.head()

Unnamed: 0,accession
0,uniprot:A0A663DJA2
1,uniprot:A0MZ66
2,uniprot:A0PJW6
3,uniprot:A1L3X0
4,uniprot:A3KN83


In [46]:
pmc_ppi = pd.merge(uniprot, ref4, on='accession')

In [47]:
print('Protein-Protein interaction mentions:', pmc_ppi.shape[0])

Protein-Protein interaction mentions: 22635


In [48]:
pmc_ppi.head()

Unnamed: 0,id,accession
0,pmc:PMC6713643,uniprot:Q92769
1,pmc:PMC6109601,uniprot:Q92769
2,pmc:PMC4182858,uniprot:Q92769
3,pmc:PMC4446364,uniprot:Q92769
4,pmc:PMC4253452,uniprot:Q92769


In [49]:
pmc_ppi.to_csv(NEO4J_IMPORT / "01h-PMC-ProteinProteinInteraction.csv", index=False)

### Match Protein Structures by PDB ID

In [50]:
ref5 = pd.read_csv(NEO4J_IMPORT / "01f-PDBStructure.csv")
ref5 = ref5[['pdbId']]
ref5.rename(columns={'pdbId': 'accession'}, inplace=True)
ref5 = ref5.drop_duplicates()
ref5.head()

Unnamed: 0,accession
0,pdb:4CBT
1,pdb:4CBU
2,pdb:4CBV
3,pdb:4CBW
4,pdb:4CBX


In [51]:
pmc_pdb = pd.merge(pdb, ref5, on='accession')

In [52]:
print('PDB mentions:', pmc_pdb.shape[0])

PDB mentions: 413281


In [53]:
pmc_pdb.head()

Unnamed: 0,id,accession
0,pmc:PMC4309170,pdb:1A0J
1,pmc:PMC4190110,pdb:1A0J
2,pmc:PMC3458898,pdb:1A0J
3,pmc:PMC3057020,pdb:1A0J
4,pmc:PMC2974730,pdb:1A0J


In [54]:
pmc_pdb.to_csv(NEO4J_IMPORT / "01h-PMC-PDBStructures.csv", index=False)

### Concatenate all ids

In [55]:
pmc_ids = pd.concat([pmc_uniprot, pmc_cncb, pmc_genome, pmc_ppi, pmc_pdb])

### Create a list of preprints (anything that is not a pmcId)

In [56]:
preprints = pmc_ids[~(pmc_ids['id'].str.startswith('pmc'))][['id']]
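
The filter above keeps every row whose id does not start with `pmc`, so a preprint that mentions several accessions appears several times. The sketch below shows the same boolean mask on three rows and, as an optional extra step that the notebook does not perform, removes the repeats with `drop_duplicates`.

```python
import pandas as pd

# Three illustrative ids: one PMC article and the same preprint twice
ids = pd.DataFrame({'id': [
    'pmc:PMC7166309',
    'https://europepmc.org/article/PPR/PPR167663',
    'https://europepmc.org/article/PPR/PPR167663',
]})

is_pmc = ids['id'].str.startswith('pmc:')   # True for PMC CURIEs
preprints = ids.loc[~is_pmc, ['id']]        # everything else is treated as a preprint
unique_preprints = preprints.drop_duplicates()

print(len(preprints), len(unique_preprints))
```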

In [57]:
preprints.head()

Unnamed: 0,id
144,https://europepmc.org/article/PPR/PPR176767
240,https://europepmc.org/article/PPR/PPR190474
503,https://europepmc.org/article/PPR/PPR184992
509,https://europepmc.org/article/PPR/PPR162209
544,https://europepmc.org/article/PPR/PPR180528


In [58]:
preprints.to_csv(NEO4J_IMPORT / "01h-PPR-Ids.csv", index=False)

In [59]:
pmc_ids.rename(columns={'id': 'pmcId'}, inplace=True)

In [60]:
pmc_ids = pmc_ids[['pmcId']]

In [61]:
print('Number of matched ids', pmc_ids.shape[0])
pmc_ids.head()

Number of matched ids 604842


Unnamed: 0,pmcId
0,pmc:PMC6713643
1,pmc:PMC5761900
2,pmc:PMC3275796
3,pmc:PMC4823807
4,pmc:PMC5474285


### Download PMID to PMCID mappings and metadata

In [62]:
pmc = pd.read_csv("https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz", 
                 usecols=['PMID', 'PMCID', 'DOI', 'Journal Title', 'Year', 'Volume', 'Issue', 'Page'], dtype='str')

In [63]:
# Rename all columns in a single call
pmc.rename(columns={
    'PMID': 'pmId',
    'PMCID': 'pmcId',
    'DOI': 'doi',
    'Journal Title': 'journal',
    'Year': 'year',
    'Volume': 'volume',
    'Issue': 'issue',
    'Page': 'page',
}, inplace=True)

In [64]:
pmc.fillna('', inplace=True)

Assign unique identifiers (CURIEs) resolvable by [Identifiers.org](https://identifiers.org)

In [65]:
pmc['id'] = 'pubmed:' + pmc['pmId']
pmc['pmcId'] = 'pmc:' + pmc['pmcId']
pmc['doi'] = 'doi:' + pmc['doi']
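
Because missing values were replaced with empty strings before this step, any row that lacks a DOI or PMID ends up with a bare prefix (`doi:` or `pubmed:`) rather than an empty value. Whether that matters depends on the downstream import, which is not shown here. If bare prefixes are unwanted, one possible guard is `Series.where`, sketched below on two illustrative values.

```python
import pandas as pd

# One real-looking DOI and one missing value (already filled with '')
doi = pd.Series(['10.1186/bcr271', ''])

naive = 'doi:' + doi                              # the empty entry becomes 'doi:'
guarded = ('doi:' + doi).where(doi != '', '')     # keep '' where no DOI is available

print(naive.tolist())
print(guarded.tolist())
```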

In [66]:
pmc.head()

Unnamed: 0,journal,year,volume,issue,page,doi,pmcId,pmId,id
0,Breast Cancer Res,2000,3,1,55,doi:10.1186/bcr271,pmc:PMC13900,11250746,pubmed:11250746
1,Breast Cancer Res,2000,3,1,61,doi:10.1186/bcr272,pmc:PMC13901,11250747,pubmed:11250747
2,Breast Cancer Res,2000,3,1,66,doi:10.1186/bcr273,pmc:PMC13902,11250748,pubmed:11250748
3,Breast Cancer Res,1999,2,1,59,doi:10.1186/bcr29,pmc:PMC13911,11056684,pubmed:11056684
4,Breast Cancer Res,1999,2,1,64,doi:10.1186/bcr30,pmc:PMC13912,11400682,pubmed:11400682


In [67]:
pmc_ids = pmc_ids.merge(pmc, on='pmcId')

In [68]:
print('Number of PMC matches:', pmc_ids.shape[0])
pmc_ids.head()

Number of PMC matches: 596489


Unnamed: 0,pmcId,journal,year,volume,issue,page,doi,pmId,id
0,pmc:PMC6713643,Cancer Biol Med,2019,16,2,377,doi:10.20892/j.issn.2095-3941.2018.0386,31516757,pubmed:31516757
1,pmc:PMC6713643,Cancer Biol Med,2019,16,2,377,doi:10.20892/j.issn.2095-3941.2018.0386,31516757,pubmed:31516757
2,pmc:PMC6713643,Cancer Biol Med,2019,16,2,377,doi:10.20892/j.issn.2095-3941.2018.0386,31516757,pubmed:31516757
3,pmc:PMC6713643,Cancer Biol Med,2019,16,2,377,doi:10.20892/j.issn.2095-3941.2018.0386,31516757,pubmed:31516757
4,pmc:PMC6713643,Cancer Biol Med,2019,16,2,377,doi:10.20892/j.issn.2095-3941.2018.0386,31516757,pubmed:31516757
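
The first five rows above are identical because `pmc_ids` was concatenated from several mention tables, so a publication appears once per matched accession and each copy is joined to the same metadata. If one row per publication is wanted, a sketch of one option is to drop duplicates before merging. The metadata table below is a made-up stand-in, and `pubmed:0` is a placeholder id.

```python
import pandas as pd

# The same publication listed twice, as happens after concatenating mention tables
pmc_ids = pd.DataFrame({'pmcId': ['pmc:PMC6713643', 'pmc:PMC6713643', 'pmc:PMC5761900']})
# Made-up stand-in for the PMC metadata table ('pubmed:0' is a placeholder)
meta = pd.DataFrame({
    'pmcId': ['pmc:PMC6713643', 'pmc:PMC5761900'],
    'id': ['pubmed:31516757', 'pubmed:0'],
})

merged = pmc_ids.merge(meta, on='pmcId')            # duplicates carry over: 3 rows
unique = (pmc_ids.drop_duplicates()
                 .merge(meta, on='pmcId', validate='one_to_one'))  # 2 rows

print(merged.shape[0], unique.shape[0])
```

The `validate='one_to_one'` argument makes pandas raise an error if either side still contains repeated keys.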


In [69]:
pmc_ids.to_csv(NEO4J_IMPORT / "01h-PMC-Ids.csv", index=False)

### Download PDB - PubMed mappings

In [70]:
sifts_url = 'http://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/tsv/pdb_pubmed.tsv.gz'

In [71]:
pm_pdb = pd.read_csv(sifts_url, usecols=['PDB', 'PUBMED_ID'], sep='\t', skiprows=1, dtype=str)
pm_pdb.head()

Unnamed: 0,PDB,PUBMED_ID
0,100d,7816639
1,101d,7711020
2,102d,7608897
3,102l,8429913
4,103d,7966337
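
The `skiprows=1` argument suggests that the file starts with one line that precedes the column header, and `usecols` drops the remaining columns. The sketch below reproduces that read on an in-memory stand-in. The comment line and the `ORDINAL` column are assumptions about the file layout, and only the two data rows are taken from the output above.

```python
import io

import pandas as pd

# In-memory stand-in for the SIFTS file; first line and ORDINAL column are assumed
tsv_text = (
    "# illustrative first line that precedes the header\n"
    "PDB\tORDINAL\tPUBMED_ID\n"
    "100d\t0\t7816639\n"
    "101d\t0\t7711020\n"
)

df = pd.read_csv(io.StringIO(tsv_text), usecols=['PDB', 'PUBMED_ID'],
                 sep='\t', skiprows=1, dtype=str)

# Upper-case the PDB id so it matches CURIEs such as pdb:4CBT used elsewhere
df['accession'] = 'pdb:' + df['PDB'].str.upper()
print(df)
```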


In [72]:
pm_pdb.rename(columns={'PDB': 'accession', 'PUBMED_ID': 'id'}, inplace=True)

In [73]:
pm_pdb['accession'] = 'pdb:' + pm_pdb['accession'].str.upper()
pm_pdb['id'] = 'pubmed:' + pm_pdb['id']

In [74]:
print('Number of matches:', pm_pdb.shape[0])
pm_pdb.head()

Number of matches: 167450


Unnamed: 0,accession,id
0,pdb:100D,pubmed:7816639
1,pdb:101D,pubmed:7711020
2,pdb:102D,pubmed:7608897
3,pdb:102L,pubmed:8429913
4,pdb:103D,pubmed:7966337


In [75]:
pm_pdb.to_csv(NEO4J_IMPORT / "01h-PM-PDBStructures.csv", index=False)