# Load Accession Number Mappings
**[Work in progress]**

This notebook downloads accession numbers for life science and biological databases that [Europe PMC](https://europepmc.org/) has text-mined from PubMed Central full-text articles, and standardizes them for ingestion into a Knowledge Graph.

Data source: [Europe PMC](ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
ftp = 'ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/'

#### Collect reference accession numbers from datasets in the Neo4j import directory

In [4]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-19636412-9e74-4bac-8a4c-c6c8b49bb9d3/installation-4.1.0/import
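
`os.getenv` returns `None` when a variable is not set, and `Path(None)` then raises a `TypeError` that does not name the missing variable. A minimal sketch of a more explicit check follows; the helper name `get_import_dir` is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def get_import_dir(var_name="NEO4J_IMPORT"):
    """Return the directory named by an environment variable as a Path.

    Hypothetical helper: raises a descriptive error instead of the
    TypeError that Path(None) would produce when the variable is unset.
    """
    value = os.getenv(var_name)
    if not value:
        raise EnvironmentError(f"Environment variable {var_name} is not set")
    return Path(value)
```

With this helper, the cell above could read `NEO4J_IMPORT = get_import_dir()`.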


In [5]:
ref1 = pd.read_csv(NEO4J_IMPORT / "01b-Nextstrain.csv")
ref1 = ref1[['id']]
ref1 = ref1.drop_duplicates()
ref1.rename(columns={'id': 'accession'}, inplace=True)
ref1.head()

Unnamed: 0,accession
0,https://www.gisaid.org/EPI_ISL_406798
1,https://www.gisaid.org/EPI_ISL_402121
2,https://www.gisaid.org/EPI_ISL_402128
3,https://www.gisaid.org/EPI_ISL_402130
4,https://www.gisaid.org/EPI_ISL_457733


In [6]:
ref2 = pd.read_csv(NEO4J_IMPORT / "01d-CNCBStrain.csv")
ref2.head()

Unnamed: 0,id,name,alias,taxonomyId,hostTaxonomyId,collectionDate,country,admin1,admin2,city,locationLevels
0,NMDC60013088-01,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01;EPI_ISL_402132,taxonomy:2697049,taxonomy:9606,2019-12-30,China,Hubei,,,1
1,https://www.gisaid.org/EPI_ISL_402132,BetaCoV/Wuhan/HBCDC-HB-01/2019,NMDC60013088-01;EPI_ISL_402132,taxonomy:2697049,taxonomy:9606,2019-12-30,China,Hubei,,,1
2,https://www.gisaid.org/EPI_ISL_403963,hCoV-19/Thailand/74/2020,EPI_ISL_403963,taxonomy:2697049,taxonomy:9606,2020-01-13,Thailand,Nonthaburi,,,1
3,https://www.gisaid.org/EPI_ISL_403962,hCoV-19/Thailand/61/2020,EPI_ISL_403962,taxonomy:2697049,taxonomy:9606,2020-01-08,Thailand,Nonthaburi,,,1
4,NMDC60013085-01,BetaCoV/Wuhan/IVDC-HB-04/2020,NMDC60013085-01;EPI_ISL_402120,taxonomy:2697049,taxonomy:9606,2020-01-01,China,Hubei,Wuhan,,2


In [7]:
ref2 = ref2[['id']]
ref2 = ref2.drop_duplicates()
ref2.rename(columns={'id': 'accession'}, inplace=True)
ref2.head()

Unnamed: 0,accession
0,NMDC60013088-01
1,https://www.gisaid.org/EPI_ISL_402132
2,https://www.gisaid.org/EPI_ISL_403963
3,https://www.gisaid.org/EPI_ISL_403962
4,NMDC60013085-01


In [8]:
ref3 = pd.read_csv(NEO4J_IMPORT / "01c-NCBIRefSeq.csv")
ref3 = ref3[['genomeAccession']]
ref3 = ref3.drop_duplicates()
ref3.rename(columns={'genomeAccession': 'accession'}, inplace=True)
ref3.head()

Unnamed: 0,accession
0,ncbiprotein:NC_045512
38,insdc:MN908947


In [9]:
ref4 = pd.read_csv(NEO4J_IMPORT / "01e-ProteinProteinInteractionProtein.csv")
ref4 = ref4[['accession']]
ref4 = ref4.drop_duplicates()
ref4.head()

Unnamed: 0,accession
0,uniprot:A0A663DJA2
1,uniprot:A0MZ66
2,uniprot:A0PJW6
3,uniprot:A1L3X0
4,uniprot:A3KN83


In [10]:
ref = pd.concat([ref1, ref2, ref3, ref4])

## Assign unique identifiers for interoperability
A [CURIE](https://en.wikipedia.org/wiki/CURIE) (Compact URI) is a compact abbreviation for a Uniform Resource Identifier (URI). A CURIE consists of a registered prefix and an accession number (prefix:accession). The prefix provides a namespace that keeps identifiers unique and makes them interoperable among data resources.

[Identifiers.org](http://identifiers.org/) provides a registry and resolution service for life science CURIEs. 
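
As an illustration of the prefix:accession pattern, the sketch below builds a CURIE, splits it again, and forms a resolver URL by appending the CURIE to the Identifiers.org base address. The helper names are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical helpers for illustration only.
IDENTIFIERS_ORG = "https://identifiers.org/"


def make_curie(prefix, accession):
    """Join a registered prefix and an accession number into a CURIE."""
    return f"{prefix}:{accession}"


def split_curie(curie):
    """Split a CURIE at the first colon into (prefix, accession)."""
    prefix, _, accession = curie.partition(":")
    return prefix, accession


def to_resolver_url(curie):
    """Build an Identifiers.org URL for a CURIE."""
    return IDENTIFIERS_ORG + curie


curie = make_curie("uniprot", "P24385")
prefix, accession = split_curie(curie)
resolver_url = to_resolver_url(curie)
```

Splitting at the first colon only matters for accessions that themselves contain a colon.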

### NCBI Reference Sequences
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [ncbiprotein](https://registry.identifiers.org/registry/ncbiprotein) (NCBI Reference Sequences, RefSeq)

In [11]:
df1 = pd.read_csv(ftp + "refseq.csv", dtype=str)
df1.head()

Unnamed: 0,refseq,PMCID,EXTID,SOURCE
0,NM_199203,PMC2785473,19956559,MED
1,NM_006544,PMC2785473,19956559,MED
2,NM_212472,PMC2785473,19956559,MED
3,NM_003934,PMC2785473,19956559,MED
4,NM_153693,PMC2785473,19956559,MED


In [12]:
# Strip the version suffix from each RefSeq accession so that it matches the unversioned reference accessions
df1['id'] = 'pmc:' + df1['PMCID']
df1['accession'] = 'ncbiprotein:' + df1['refseq'].str.split('.', expand=True)[0]
df1 = df1[['id','accession']]
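
To show what the version stripping does, here is a small self-contained sketch on illustrative values. The versioned forms and the pairing of accessions with PMC IDs are assumptions made for the example, not rows taken from `refseq.csv`.

```python
import pandas as pd

# Illustrative rows only; the real file is read from the Europe PMC FTP site.
sample = pd.DataFrame({
    "refseq": ["NC_045512.2", "NM_199203", "NP_000509.1"],
    "PMCID": ["PMC7290700", "PMC2785473", "PMC2785473"],
})

sample["id"] = "pmc:" + sample["PMCID"]
# Keep only the part before the first '.', which drops any version suffix;
# accessions without a '.' are left unchanged.
sample["accession"] = "ncbiprotein:" + sample["refseq"].str.split(".").str[0]
sample = sample[["id", "accession"]]
```

Both `.str.split('.').str[0]` and `.str.split('.', expand=True)[0]` select the text before the first dot; the first form avoids building an intermediate DataFrame.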

In [13]:
df1 = df1.merge(ref, on="accession")
df1 = df1[['id','accession']]
print("Number of refseq matches:", df1.shape[0])
df1.head()

Number of refseq matches: 106


Unnamed: 0,id,accession
0,pmc:PMC7290700,ncbiprotein:NC_045512
1,pmc:PMC7272177,ncbiprotein:NC_045512
2,pmc:PMC7272177,ncbiprotein:NC_045512
3,pmc:PMC7295489,ncbiprotein:NC_045512
4,pmc:PMC7118541,ncbiprotein:NC_045512
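
The preview above lists the same id/accession pair more than once (for example `pmc:PMC7272177`). An inner merge emits one row for every matching pair of rows, so repeated accessions in either the text-mined table or the combined reference table multiply the result. The sketch below uses made-up data and assumes that repeated pairs are not wanted downstream; the notebook itself does not remove them.

```python
import pandas as pd

# Made-up data: one accession mentioned twice in the same article,
# and listed twice in the reference table.
mined = pd.DataFrame({
    "id": ["pmc:PMC1", "pmc:PMC1", "pmc:PMC2"],
    "accession": ["ncbiprotein:NC_045512"] * 3,
})
reference = pd.DataFrame({
    "accession": ["ncbiprotein:NC_045512", "ncbiprotein:NC_045512", "uniprot:P24385"],
})

# Plain inner merge: 3 mined rows x 2 matching reference rows = 6 rows.
raw = mined.merge(reference, on="accession")

# Deduplicate the reference first, then drop repeated id/accession pairs.
deduped = (
    mined.merge(reference.drop_duplicates(), on="accession")
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Whether to deduplicate depends on how the Knowledge Graph import treats repeated rows.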


### GISAID Genome Sequences
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)


In [14]:
df2 = pd.read_csv(ftp + "gisaid.csv", dtype=str)

In [15]:
df2['id'] = 'pmc:' + df2['PMCID']
df2['accession'] = 'https://www.gisaid.org/' + df2['gisaid']
df2 = df2[['id','accession']]
df2.head()

Unnamed: 0,id,accession
0,pmc:PMC4634248,https://www.gisaid.org/EPI179482
1,pmc:PMC4634248,https://www.gisaid.org/EPI179438
2,pmc:PMC4634248,https://www.gisaid.org/EPI232919
3,pmc:PMC4634248,https://www.gisaid.org/EPI233018
4,pmc:PMC4634248,https://www.gisaid.org/EPI272597


In [16]:
df2 = df2.merge(ref, on="accession")
df2.dropna(inplace=True)
print("Number of GISAID matches:", df2.shape[0])
df2.head()

Number of GISAID matches: 688


Unnamed: 0,id,accession
0,pmc:PMC7314507,https://www.gisaid.org/EPI_ISL_435057
1,pmc:PMC7314507,https://www.gisaid.org/EPI_ISL_435057
2,pmc:PMC7314507,https://www.gisaid.org/EPI_ISL_424366
3,pmc:PMC7314511,https://www.gisaid.org/EPI_ISL_424366
4,pmc:PMC7314507,https://www.gisaid.org/EPI_ISL_427391


### UniProt
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [uniprot](https://registry.identifiers.org/registry/uniprot) (UniProt Knowledgebase, UniProtKB)

In [17]:
df3 = pd.read_csv(ftp + "uniprot.csv", dtype=str)

In [18]:
df3['id'] = 'pmc:' + df3['PMCID']
df3['accession'] = 'uniprot:' + df3['uniprot']
df3 = df3[['id','accession']]
df3.head()

Unnamed: 0,id,accession
0,pmc:PMC6713643,uniprot:P24385
1,pmc:PMC6713643,uniprot:P42224
2,pmc:PMC6713643,uniprot:Q13043
3,pmc:PMC6713643,uniprot:Q92769
4,pmc:PMC6713643,uniprot:P16234


In [19]:
df3 = df3.merge(ref, on="accession")
df3.dropna(inplace=True)
print("Number of UniProt matches:", df3.shape[0])
df3.head()

Number of UniProt matches: 20991


Unnamed: 0,id,accession
0,pmc:PMC6713643,uniprot:Q92769
1,pmc:PMC6109601,uniprot:Q92769
2,pmc:PMC4182858,uniprot:Q92769
3,pmc:PMC4446364,uniprot:Q92769
4,pmc:PMC4253452,uniprot:Q92769


### Protein Data Bank (NOT USED YET)
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [pdb](https://registry.identifiers.org/registry/pdb) (Protein Data Bank, PDB)

In [20]:
# df4 = pd.read_csv(ftp + "pdb.csv", dtype=str)

In [21]:
# df4['id'] = 'pmc:' + df4['PMCID']
# df4['accession'] = 'pdb:' + df4['pdb']
# df4 = df4[['id','accession']]
# df4.head()

### Digital Object Identifier (DOI) (NOT USED YET)
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [doi](https://registry.identifiers.org/registry/doi) (Digital Object Identifier System, DOI)

In [22]:
# df5 = pd.read_csv(ftp + "doi.csv", dtype=str)

In [23]:
# df5['id'] = 'pmc:' + df5['PMCID']
# df5['accession'] = 'doi:' + df5['doi']
# df5 = df5[['id','accession']]
# df5.head()

### Save data for Knowledge Graph Import

In [24]:
df = pd.concat([df1, df2, df3])
df = df.fillna('')
df = df.query("id != ''")
df = df.query("accession != ''")
print('Mappings:', df.count())

Mappings: id           21780
accession    21780
dtype: int64
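
Before saving, it can help to see how the mappings are distributed across namespaces. The sketch below is an optional check on a few pairs copied from the previews above; the `namespace` helper is hypothetical, and it treats GISAID accessions separately because they are URIs rather than CURIEs.

```python
import pandas as pd

GISAID_URI = "https://www.gisaid.org/"


def namespace(accession):
    """Return 'gisaid' for GISAID URIs, else the CURIE prefix (hypothetical helper)."""
    if accession.startswith(GISAID_URI):
        return "gisaid"
    return accession.split(":", 1)[0]


# A few id/accession pairs taken from the previews in this notebook.
sample = pd.DataFrame({
    "id": ["pmc:PMC7290700", "pmc:PMC7314507", "pmc:PMC6713643", "pmc:PMC6109601"],
    "accession": [
        "ncbiprotein:NC_045512",
        "https://www.gisaid.org/EPI_ISL_435057",
        "uniprot:Q92769",
        "uniprot:Q92769",
    ],
})

# Count mappings per namespace.
counts = sample["accession"].map(namespace).value_counts().to_dict()
```

Applied to the full table, the same expression would be `df["accession"].map(namespace).value_counts()`.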


In [25]:
df.head()

Unnamed: 0,id,accession
0,pmc:PMC7290700,ncbiprotein:NC_045512
1,pmc:PMC7272177,ncbiprotein:NC_045512
2,pmc:PMC7272177,ncbiprotein:NC_045512
3,pmc:PMC7295489,ncbiprotein:NC_045512
4,pmc:PMC7118541,ncbiprotein:NC_045512


In [26]:
df.to_csv(NEO4J_IMPORT / "01h-PMC-Accession.csv", index=False)