# Load Accession Number Mappings
**[Work in progress]**

This notebook downloads accession numbers for life science and biological databases that [Europe PMC](https://europepmc.org/) has text-mined from PubMed Central full-text articles, and standardizes them for ingestion into a Knowledge Graph.

Data source: [Europe PMC](ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
ftp = 'ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/'

In [4]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
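
The cell above assumes that the `NEO4J_IMPORT` environment variable is set; when it is not, `os.getenv` returns `None` and `Path(None)` raises a `TypeError`. A minimal sketch of a more explicit check follows. The helper name `get_import_dir` is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def get_import_dir(var_name: str = 'NEO4J_IMPORT') -> Path:
    """Return the directory named by an environment variable as a Path.

    Raises EnvironmentError with a readable message if the variable
    is unset or empty, instead of failing inside Path().
    """
    value = os.getenv(var_name)
    if not value:
        raise EnvironmentError(f"Environment variable {var_name} is not set")
    return Path(value)
```

With this helper, the assignment could read `NEO4J_IMPORT = get_import_dir()`.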


## Assign unique identifiers for interoperability
A [CURIE](https://en.wikipedia.org/wiki/CURIE) (Compact URI) is a compact abbreviation for a Uniform Resource Identifier (URI). A CURIE consists of a registered prefix and an accession number (prefix:accession). The prefix provides a namespace that keeps identifiers unique and makes them interoperable among data resources.

[Identifiers.org](http://identifiers.org/) provides a registry and resolution service for life science CURIEs.
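
As a minimal sketch of the prefix:accession convention, the following helpers build and parse a CURIE. The names `make_curie` and `split_curie` are hypothetical and are not used elsewhere in this notebook.

```python
from typing import Tuple


def make_curie(prefix: str, accession: str) -> str:
    """Join a registered prefix and an accession number into a CURIE."""
    return f"{prefix}:{accession}"


def split_curie(curie: str) -> Tuple[str, str]:
    """Split a CURIE at the first colon into (prefix, accession)."""
    prefix, accession = curie.split(":", 1)
    return prefix, accession


curie = make_curie("uniprot", "P0DTD1")
print(curie)               # uniprot:P0DTD1
print(split_curie(curie))  # ('uniprot', 'P0DTD1')
```

Splitting at the first colon only is a deliberate choice here, so that an accession that itself contains a colon stays intact.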

In [5]:
def assign_publication_id(row):
    if row['PMCID'] != '':
        # CURIE: pmc (PubMed Central, PMC)
        return 'pmc:' + str(row['PMCID'])
    elif row['SOURCE'] == 'PPR':
        # no CURIE available, use URI for preprints
        return 'https://europepmc.org/article/PPR/' + row['EXTID']
    else:
        return ''
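
To illustrate the three branches of `assign_publication_id`, here is a small sketch on a toy DataFrame. The function is repeated so that the block runs on its own; the third row, including its `EXTID` value, is made up for illustration.

```python
import pandas as pd


def assign_publication_id(row):
    if row['PMCID'] != '':
        # CURIE: pmc (PubMed Central, PMC)
        return 'pmc:' + str(row['PMCID'])
    elif row['SOURCE'] == 'PPR':
        # no CURIE available, use URI for preprints
        return 'https://europepmc.org/article/PPR/' + row['EXTID']
    else:
        return ''


toy = pd.DataFrame({
    'PMCID': ['PMC7512552', '', ''],
    'EXTID': ['32963006', 'PPR204419', '00000000'],  # last value is a placeholder
    'SOURCE': ['MED', 'PPR', 'MED'],
})
toy['id'] = toy.apply(assign_publication_id, axis=1)
print(toy['id'].tolist())
# ['pmc:PMC7512552', 'https://europepmc.org/article/PPR/PPR204419', '']
```

Rows that receive an empty `id` are the ones removed later by `query("id != ''")`.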

### UniProt
**accession**: CURIE: [uniprot](https://registry.identifiers.org/registry/uniprot) (UniProt Knowledgebase, UniProtKB)

In [6]:
uniprot = pd.read_csv(ftp + "uniprot.csv", dtype=str)
uniprot.fillna('', inplace=True)

In [7]:
uniprot['id'] = uniprot.apply(assign_publication_id, axis=1)
uniprot['accession'] = 'uniprot:' + uniprot['uniprot']
uniprot = uniprot[['id','accession']]
uniprot.query("id != ''", inplace=True)

In [8]:
uniprot.shape

(387031, 2)

### NCBI Reference Sequences
**accession**: CURIE: [refseq](https://registry.identifiers.org/registry/refseq) (NCBI Reference Sequences, RefSeq)

In [9]:
refseq = pd.read_csv(ftp + "refseq.csv", dtype=str)
refseq.fillna('', inplace=True)
refseq.head()

Unnamed: 0,refseq,PMCID,EXTID,SOURCE
0,NM_015973,PMC7512552,32963006,MED
1,NM_001789,PMC7512552,32963006,MED
2,NM_001008708,PMC7512552,32963006,MED
3,NM_001100625,PMC7512552,32963006,MED
4,NM_018455,PMC7512552,32963006,MED


In [10]:
refseq['id'] = refseq.apply(assign_publication_id, axis=1)
# Strip the version suffix so that mentions match the unversioned accession
refseq['accession'] = 'refseq:' + refseq['refseq'].str.split('.', expand=True)[0]
refseq = refseq[['id','accession']]
refseq.query("id != ''", inplace=True)

In [11]:
refseq.shape

(313536, 2)
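
The version-stripping step can be checked on a few sample values. This is a sketch only: the version suffixes below are illustrative inputs, not values taken from `refseq.csv`.

```python
import pandas as pd

# Sample accessions, with and without a version suffix (illustrative values)
accessions = pd.Series(['NC_045512.2', 'NM_015973', 'NM_001789.3'])

# A single-character pattern is treated as a literal '.', not a regex;
# column 0 of the expanded frame holds the part before the first '.'
unversioned = 'refseq:' + accessions.str.split('.', expand=True)[0]
print(unversioned.tolist())
# ['refseq:NC_045512', 'refseq:NM_015973', 'refseq:NM_001789']
```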

### GISAID Genome Sequences
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)


In [12]:
gisaid = pd.read_csv(ftp + "gisaid.csv", dtype=str)
gisaid.fillna('', inplace=True)

In [13]:
gisaid['id'] = gisaid.apply(assign_publication_id, axis=1)

In [14]:
gisaid['accession'] = 'https://www.gisaid.org/' + gisaid['gisaid']
gisaid = gisaid[['id','accession']]
gisaid.query("id != ''", inplace=True)

In [15]:
gisaid.shape

(6696, 2)

### Protein Data Bank
**accession**: CURIE: [pdb](https://registry.identifiers.org/registry/pdb) (Protein Data Bank, PDB)

In [16]:
pdb = pd.read_csv(ftp + "pdb.csv", dtype=str)
pdb.fillna('', inplace=True)

In [17]:
pdb['id'] = pdb.apply(assign_publication_id, axis=1)
pdb['accession'] = 'pdb:' + pdb['pdb']
pdb = pdb[['id','accession']]
pdb.query("id != ''", inplace=True)

In [18]:
pdb.shape

(522980, 2)

# Match Dataset Mentions

### Match by UniProt accessions

In [19]:
ref1 = pd.read_csv(NEO4J_IMPORT / "01a-UniProtProtein.csv")
ref1 = ref1[['accession']]
ref1 = ref1.drop_duplicates()
ref1.head()

Unnamed: 0,accession
0,uniprot:P0DTD1
16,uniprot:P0DTC7
18,uniprot:P0DTD2
19,uniprot:P0DTC9
20,uniprot:P0DTC3


In [20]:
pmc_uniprot = pd.merge(uniprot, ref1, on='accession')

In [21]:
pmc_uniprot.to_csv(NEO4J_IMPORT / "01h-PMC-UniProtProtein.csv", index=False)

In [22]:
print('UniProt mentions:', pmc_uniprot.shape[0])

UniProt mentions: 161184


In [23]:
pmc_uniprot.head()

Unnamed: 0,id,accession
0,pmc:PMC6713643,uniprot:P24385
1,pmc:PMC5761900,uniprot:P24385
2,pmc:PMC3275796,uniprot:P24385
3,pmc:PMC4823807,uniprot:P24385
4,pmc:PMC5474285,uniprot:P24385
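
`pd.merge` performs an inner join by default, so only mentions whose accession also occurs in the reference table are kept. A minimal sketch with toy data follows; the third mention row is a made-up placeholder that has no match.

```python
import pandas as pd

# Text-mined mentions: publication id and accession
mentions = pd.DataFrame({
    'id': ['pmc:PMC6713643', 'pmc:PMC5761900', 'pmc:PMC0000000'],
    'accession': ['uniprot:P24385', 'uniprot:P24385', 'uniprot:X00000'],
})

# Accessions already present in the Knowledge Graph
reference = pd.DataFrame({'accession': ['uniprot:P24385', 'uniprot:P0DTD1']})

# Inner join: rows without a matching accession are dropped
matched = pd.merge(mentions, reference, on='accession')
print(matched)
```

Only the two `uniprot:P24385` rows remain; the placeholder mention and the unmentioned reference accession both drop out.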


### Match Strains by NCBI RefSeq and GISAID accessions

In [24]:
ref2 = pd.read_csv(NEO4J_IMPORT / "01c-CNCBStrain.csv")

In [25]:
ref2['secondaryAccession'] = ref2['accessions'].str.split(';')
ref2 = ref2.explode('secondaryAccession')
ref2.rename(columns={'id': 'primaryAccession'}, inplace=True)
ref2 = ref2[['primaryAccession', 'secondaryAccession']]
ref2 = ref2.drop_duplicates()
ref2.head()

Unnamed: 0,primaryAccession,secondaryAccession
0,https://www.gisaid.org/EPI_ISL_402132,NMDC60013088-01
0,https://www.gisaid.org/EPI_ISL_402132,https://www.gisaid.org/EPI_ISL_402132
1,https://www.gisaid.org/EPI_ISL_403963,https://www.gisaid.org/EPI_ISL_403963
2,https://www.gisaid.org/EPI_ISL_403962,https://www.gisaid.org/EPI_ISL_403962
3,https://www.gisaid.org/EPI_ISL_402120,NMDC60013085-01
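
The repeated index `0` in the output above is a result of `explode`, which creates one row per list element and keeps the original row index. A small sketch, assuming (as the code above does) that the `accessions` column holds a semicolon-separated string:

```python
import pandas as pd

strains = pd.DataFrame({
    'id': ['https://www.gisaid.org/EPI_ISL_402132'],
    'accessions': ['NMDC60013088-01;https://www.gisaid.org/EPI_ISL_402132'],
})

# Split the string into a list, then create one row per list element
strains['secondaryAccession'] = strains['accessions'].str.split(';')
strains = strains.explode('secondaryAccession')

strains = strains.rename(columns={'id': 'primaryAccession'})
strains = strains[['primaryAccession', 'secondaryAccession']].drop_duplicates()
print(strains)
```

Both resulting rows carry index `0` and the same primary accession, each paired with one secondary accession.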


In [26]:
pmc_cncb1 = pd.merge(refseq, ref2, left_on='accession', right_on='secondaryAccession')

In [27]:
pmc_cncb1.head()

Unnamed: 0,id,accession,primaryAccession,secondaryAccession
0,pmc:PMC7596387,refseq:NC_045512,insdc:MN908947,refseq:NC_045512
1,pmc:PMC7290700,refseq:NC_045512,insdc:MN908947,refseq:NC_045512
2,https://europepmc.org/article/PPR/PPR204419,refseq:NC_045512,insdc:MN908947,refseq:NC_045512
3,https://europepmc.org/article/PPR/PPR227723,refseq:NC_045512,insdc:MN908947,refseq:NC_045512
4,https://europepmc.org/article/PPR/PPR199641,refseq:NC_045512,insdc:MN908947,refseq:NC_045512


Use the primary accession to establish linkage.

In [28]:
pmc_cncb1['accession'] = pmc_cncb1['primaryAccession']

In [29]:
pmc_cncb1 = pmc_cncb1[['id', 'accession']]

In [30]:
print('refseq mentions:', pmc_cncb1.shape[0])

refseq mentions: 629


In [31]:
pmc_cncb1.head()

Unnamed: 0,id,accession
0,pmc:PMC7596387,insdc:MN908947
1,pmc:PMC7290700,insdc:MN908947
2,https://europepmc.org/article/PPR/PPR204419,insdc:MN908947
3,https://europepmc.org/article/PPR/PPR227723,insdc:MN908947
4,https://europepmc.org/article/PPR/PPR199641,insdc:MN908947
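
The mapping from a mentioned (secondary) accession to the strain's primary accession can be sketched on toy data. The lookup table below is a hand-built stand-in for `ref2`, using values shown in the outputs above.

```python
import pandas as pd

# Mentions refer to the strain by a secondary accession
mentions = pd.DataFrame({
    'id': ['pmc:PMC7596387', 'pmc:PMC7290700'],
    'accession': ['refseq:NC_045512', 'refseq:NC_045512'],
})

# Lookup from every known accession of a strain to its primary accession
lookup = pd.DataFrame({
    'primaryAccession': ['insdc:MN908947', 'insdc:MN908947'],
    'secondaryAccession': ['insdc:MN908947', 'refseq:NC_045512'],
})

# Join on the secondary accession, then replace it with the primary one
linked = pd.merge(mentions, lookup,
                  left_on='accession', right_on='secondaryAccession')
linked['accession'] = linked['primaryAccession']
linked = linked[['id', 'accession']]
print(linked)
```

After this step every mention points at the primary accession, which is the key used to link publications to strains.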


In [32]:
pmc_cncb2 = pd.merge(gisaid, ref2, left_on='accession', right_on='secondaryAccession')

In [33]:
pmc_cncb2.head()

Unnamed: 0,id,accession,primaryAccession,secondaryAccession
0,https://europepmc.org/article/PPR/PPR167663,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131
1,pmc:PMC7166309,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131
2,https://europepmc.org/article/PPR/PPR190800,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131
3,pmc:PMC7497811,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131
4,pmc:PMC7205519,https://www.gisaid.org/EPI_ISL_402131,GWHABKP00000001,https://www.gisaid.org/EPI_ISL_402131


In [34]:
pmc_cncb2['accession'] = pmc_cncb2['primaryAccession']

In [35]:
pmc_cncb2 = pmc_cncb2[['id', 'accession']]

In [36]:
print('GISAID mentions:', pmc_cncb2.shape[0])

GISAID mentions: 2840


In [37]:
pmc_cncb2.head()

Unnamed: 0,id,accession
0,https://europepmc.org/article/PPR/PPR167663,GWHABKP00000001
1,pmc:PMC7166309,GWHABKP00000001
2,https://europepmc.org/article/PPR/PPR190800,GWHABKP00000001
3,pmc:PMC7497811,GWHABKP00000001
4,pmc:PMC7205519,GWHABKP00000001


In [38]:
pmc_cncb = pd.concat([pmc_cncb1, pmc_cncb2])

In [39]:
pmc_cncb.to_csv(NEO4J_IMPORT / "01h-PMC-CNCBStrain.csv", index=False)

### Match Genomes by RefSeq accession

In [40]:
ref3 = pd.read_csv(NEO4J_IMPORT / "Genome.csv")
ref3 = ref3[['refSeq']]
ref3 = ref3.drop_duplicates()
ref3.rename(columns={'refSeq': 'accession'}, inplace=True)
ref3.head()

Unnamed: 0,accession
0,refseq:NC_045512
1,refseq:NC_038294
2,refseq:NC_004718
3,refseq:NC_002645
4,refseq:NC_000001


In [41]:
pmc_genome = pd.merge(refseq, ref3, on='accession')

In [42]:
print('refseq mentions:', pmc_genome.shape[0])

refseq mentions: 3341


In [43]:
pmc_genome.head()

Unnamed: 0,id,accession
0,pmc:PMC2684143,refseq:NC_002645
1,pmc:PMC6390631,refseq:NC_002645
2,pmc:PMC3812135,refseq:NC_002645
3,pmc:PMC7121196,refseq:NC_002645
4,pmc:PMC3966378,refseq:NC_002645


In [44]:
pmc_genome.to_csv(NEO4J_IMPORT / "01h-PMC-Genome.csv", index=False)

### Match Protein-Protein Interactions by UniProt accession

In [45]:
ref4 = pd.read_csv(NEO4J_IMPORT / "01e-ProteinProteinInteractionProtein.csv")
ref4 = ref4[['accession']]
ref4 = ref4.drop_duplicates()
ref4.head()

Unnamed: 0,accession
0,uniprot:A0A663DJA2
1,uniprot:A0MZ66
2,uniprot:A0PJW6
3,uniprot:A1L3X0
4,uniprot:A3KN83


In [46]:
pmc_ppi = pd.merge(uniprot, ref4, on='accession')

In [47]:
print('Protein-Protein interaction mentions:', pmc_ppi.shape[0])

Protein-Protein interaction mentions: 22543


In [48]:
pmc_ppi.head()

Unnamed: 0,id,accession
0,pmc:PMC6713643,uniprot:Q92769
1,pmc:PMC6109601,uniprot:Q92769
2,pmc:PMC4182858,uniprot:Q92769
3,pmc:PMC4446364,uniprot:Q92769
4,pmc:PMC4253452,uniprot:Q92769


In [49]:
pmc_ppi.to_csv(NEO4J_IMPORT / "01h-PMC-ProteinProteinInteraction.csv", index=False)

### Match Protein Structures by PDB ID

In [50]:
ref5 = pd.read_csv(NEO4J_IMPORT / "01f-PDBStructure.csv")
ref5 = ref5[['pdbId']]
ref5.rename(columns={'pdbId': 'accession'}, inplace=True)
ref5 = ref5.drop_duplicates()
ref5.head()

Unnamed: 0,accession
0,pdb:6W9Q
1,pdb:6VXS
2,pdb:6W9C
3,pdb:6VWW
4,pdb:6VYO


In [51]:
pmc_pdb = pd.merge(pdb, ref5, on='accession')

In [52]:
print('PDB mentions:', pmc_pdb.shape[0])

PDB mentions: 3963


In [53]:
pmc_pdb.head()

Unnamed: 0,id,accession
0,pmc:PMC7554297,pdb:6VYB
1,pmc:PMC7467145,pdb:6VYB
2,pmc:PMC7584483,pdb:6VYB
3,https://europepmc.org/article/PPR/PPR151136,pdb:6VYB
4,pmc:PMC7282679,pdb:6VYB


In [54]:
pmc_pdb.to_csv(NEO4J_IMPORT / "01h-PMC-PDBStructures.csv", index=False)

In [55]:
pmc_ids = pd.concat([pmc_uniprot, pmc_cncb, pmc_genome, pmc_ppi, pmc_pdb])

In [56]:
pmc_ids.shape

(194500, 2)

In [57]:
# Filter the combined table (not pmc_uniprot) so the shape check below reflects it
pmc_ids.query("id != ''", inplace=True)

In [58]:
pmc_ids.shape

(194500, 2)

In [60]:
pmc_ids.to_csv(NEO4J_IMPORT / "01h-PMC-Ids.csv", index=False)