# Load Accession Number Mappings
**[Work in progress]**

This notebook downloads and standardizes accession numbers for life science and biological databases that [Europe PMC](https://europepmc.org/) text-mined from PubMed Central full-text articles, for ingestion into a Knowledge Graph.

Data source: [ftp site](ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
ftp = 'ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/'

In [4]:
NEO4J_HOME = Path(os.getenv('NEO4J_HOME'))
print(NEO4J_HOME)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-4af96121-2328-4e2f-ba60-6d8b728a26d5/installation-4.0.3
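
The cell above assumes that the `NEO4J_HOME` environment variable is set; if it is not, `os.getenv` returns `None` and `Path(None)` raises a `TypeError`. A minimal sketch of a guard that fails with a clearer message is shown below. The helper name `path_from_env` and the variable `DEMO_NEO4J_HOME` are hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def path_from_env(name: str) -> Path:
    """Return the path stored in environment variable `name`, or fail clearly."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return Path(value)


# Demonstration with a placeholder variable (hypothetical name and value)
os.environ["DEMO_NEO4J_HOME"] = "/tmp/neo4j"
print(path_from_env("DEMO_NEO4J_HOME") / "import")
```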


## Assign unique identifiers for interoperability
A [CURIE](https://en.wikipedia.org/wiki/CURIE) (Compact URI) is a compact abbreviation for a Uniform Resource Identifier (URI). A CURIE consists of a registered prefix and an accession number (prefix:accession). The prefix acts as a namespace, which keeps identifiers unique and makes them interoperable across data resources.

[Identifiers.org](http://identifiers.org/) provides a registry and resolution service for life science CURIEs. 
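
As an illustration of the prefix:accession convention, the sketch below builds and parses a CURIE. The helpers `make_curie` and `split_curie` are hypothetical names introduced here for illustration; the notebook itself builds CURIEs by plain string concatenation.

```python
from typing import Tuple


def make_curie(prefix: str, accession: str) -> str:
    """Join a registry prefix and an accession into a CURIE (prefix:accession)."""
    return f"{prefix}:{accession}"


def split_curie(curie: str) -> Tuple[str, str]:
    """Split a CURIE at the first colon into (prefix, accession)."""
    prefix, _, accession = curie.partition(":")
    return prefix, accession


curie = make_curie("pmc", "PMC7118541")
print(curie)               # pmc:PMC7118541
print(split_curie(curie))  # ('pmc', 'PMC7118541')
```

Splitting at the first colon only matters for accessions that themselves contain a colon; the prefix never does.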

### NCBI Reference Sequences (NOT USED YET)
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [ncbiprotein](https://registry.identifiers.org/registry/ncbiprotein) (NCBI Reference Sequences, Refseq)

In [5]:
df1 = pd.read_csv(ftp + "refseq.csv", dtype=str)

In [6]:
# Build CURIEs; strip the version suffix from each RefSeq accession so it matches the latest version
df1['id'] = 'pmc:' + df1['PMCID']
df1['accession'] = 'ncbiprotein:' + df1['refseq'].str.split('.', expand=True)[0]
df1 = df1[['id','accession']]
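
The version stripping in the cell above can be checked on a small sample. The sketch below uses made-up PMCIDs with the same column names as the notebook (`PMCID`, `refseq`) and an equivalent idiom, `str.split(".", n=1).str[0]`, which keeps everything before the first dot and leaves unversioned accessions untouched.

```python
import pandas as pd

# Made-up sample rows using the column names assumed by this notebook
sample = pd.DataFrame({
    "PMCID": ["PMC0000001", "PMC0000002"],
    "refseq": ["NC_045512.2", "NC_045512"],
})

# Keep everything before the first '.', i.e. drop the version suffix if present
unversioned = sample["refseq"].str.split(".", n=1).str[0]
sample["accession"] = "ncbiprotein:" + unversioned
print(sample["accession"].tolist())
# ['ncbiprotein:NC_045512', 'ncbiprotein:NC_045512']
```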

In [7]:
# Load the RefSeq accessions from the Neo4j import directory
refseq = pd.read_csv(NEO4J_HOME / "import/01c-NCBIRefSeq.csv")
refseq = refseq[['genbank_id']]
refseq = refseq.drop_duplicates()

In [8]:
# Inner join: keep only mentions whose accession matches a RefSeq entry
df1 = df1.merge(refseq, left_on="accession", right_on='genbank_id')
df1 = df1[['id','accession']]
df1.head()

Unnamed: 0,id,accession
0,pmc:PMC7118541,ncbiprotein:NC_045512
1,pmc:PMC7067954,ncbiprotein:NC_045512
2,pmc:PMC7092802,ncbiprotein:NC_045512
3,pmc:PMC7102578,ncbiprotein:NC_045512
4,pmc:PMC7060195,ncbiprotein:NC_045512
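
The merge above acts as a filter: pandas merges with `how="inner"` by default, so text-mined accessions without a match in the reference table are dropped. A small sketch with made-up rows (the PMCIDs and the `XX_000000` accession are placeholders) shows the effect.

```python
import pandas as pd

# Made-up text-mined mentions and a one-row reference table
mentions = pd.DataFrame({
    "id": ["pmc:PMC0000001", "pmc:PMC0000002", "pmc:PMC0000003"],
    "accession": ["ncbiprotein:NC_045512", "ncbiprotein:XX_000000", "ncbiprotein:NC_045512"],
})
known = pd.DataFrame({"genbank_id": ["ncbiprotein:NC_045512"]})

# The default how="inner" keeps only rows whose accession appears in `known`
kept = mentions.merge(known, left_on="accession", right_on="genbank_id")
kept = kept[["id", "accession"]]
print(len(mentions), "->", len(kept))  # 3 -> 2
```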


### GISAID Genome Sequences
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)


In [9]:
df2 = pd.read_csv(ftp + "gisaid.csv", dtype=str)

In [10]:
df2['id'] = 'pmc:' + df2['PMCID']
df2['accession'] = 'https://www.gisaid.org/' + df2['gisaid']
df2 = df2[['id','accession']]
df2.head()

Unnamed: 0,id,accession
0,pmc:PMC4634248,https://www.gisaid.org/EPI179482
1,pmc:PMC4634248,https://www.gisaid.org/EPI179438
2,pmc:PMC4634248,https://www.gisaid.org/EPI232919
3,pmc:PMC4634248,https://www.gisaid.org/EPI233018
4,pmc:PMC4634248,https://www.gisaid.org/EPI272597


In [11]:
# Load the Nextstrain ids from the Neo4j import directory and expose them as 'accession'
nextstrain = pd.read_csv(NEO4J_HOME / "import/01b-Nextstrain.csv")
nextstrain['accession'] = nextstrain['id']
nextstrain = nextstrain[['accession']]

In [12]:
# Inner join: keep only GISAID mentions that match a Nextstrain entry
df2 = df2.merge(nextstrain, on="accession")
df2.dropna(inplace=True)
df2.head()

Unnamed: 0,id,accession
0,pmc:PMC7106203,https://www.gisaid.org/EPI_ISL_406716
1,pmc:PMC7106203,https://www.gisaid.org/EPI_ISL_408008
2,pmc:PMC7106203,https://www.gisaid.org/EPI_ISL_408009
3,pmc:PMC7106203,https://www.gisaid.org/EPI_ISL_406034
4,pmc:PMC7106203,https://www.gisaid.org/EPI_ISL_408010
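
One thing to watch with this kind of merge: unlike the RefSeq table, the Nextstrain table is not deduplicated before the join, so any repeated accession on the right side would repeat the matching rows in the result. Whether the Nextstrain file actually contains duplicates is not known here; the sketch below only demonstrates the behavior on made-up data.

```python
import pandas as pd

# Made-up data: one mention on the left, the same accession listed twice on the right
left = pd.DataFrame({
    "id": ["pmc:PMC0000001"],
    "accession": ["https://www.gisaid.org/EPI_ISL_000001"],
})
right = pd.DataFrame({"accession": ["https://www.gisaid.org/EPI_ISL_000001"] * 2})

dup = left.merge(right, on="accession")                      # right-side duplicates repeat the row
dedup = left.merge(right.drop_duplicates(), on="accession")  # one row per mention
print(len(dup), len(dedup))  # 2 1
```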


### UniProt (NOT USED YET)
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [uniprot](https://registry.identifiers.org/registry/uniprot) (UniProt Knowledgebase, UniProtKB)

In [13]:
# df3 = pd.read_csv(ftp + "uniprot.csv", dtype=str)

In [14]:
# df3['id'] = 'pmc:' + df3['PMCID']
# df3['accession'] = 'uniprot:' + df3['uniprot']
# df3 = df3[['id','accession']]
# df3.head()

### Protein Data Bank (NOT USED YET)
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [pdb](https://registry.identifiers.org/registry/pdb) (Protein Data Bank, PDB)

In [15]:
# df4 = pd.read_csv(ftp + "pdb.csv", dtype=str)

In [16]:
# df4['id'] = 'pmc:' + df4['PMCID']
# df4['accession'] = 'pdb:' + df4['pdb']
# df4 = df4[['id','accession']]
# df4.head()

### Digital Object Identifier (DOI) (NOT USED YET)
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [doi](https://registry.identifiers.org/registry/doi) (Digital Object Identifier System, DOI)

In [17]:
# df5 = pd.read_csv(ftp + "doi.csv", dtype=str)

In [18]:
# df5['id'] = 'pmc:' + df5['PMCID']
# df5['accession'] = 'doi:' + df5['doi']
# df5 = df5[['id','accession']]
# df5.head()

### Save data for Knowledge Graph Import

In [19]:
# Combine the mappings and drop rows with a missing id or accession
df = pd.concat([df1, df2])
df = df.fillna('')
df = df.query("id != ''")
df = df.query("accession != ''")
print('Mappings:', df.count())

Mappings: id           81
accession    81
dtype: int64
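
The combine-and-filter step can also be written with a boolean mask, and `len` gives a single row count instead of one count per column. The sketch below uses made-up rows and is an equivalent formulation under those assumptions, not the notebook's own code.

```python
import pandas as pd

# Made-up frames: one complete row, one with a missing id, one with an empty accession
a = pd.DataFrame({
    "id": ["pmc:PMC0000001", None],
    "accession": ["ncbiprotein:NC_045512", "ncbiprotein:NC_045512"],
})
b = pd.DataFrame({"id": ["pmc:PMC0000002"], "accession": [""]})

combined = pd.concat([a, b], ignore_index=True).fillna("")
# Keep only rows where both columns are non-empty
complete = combined[(combined["id"] != "") & (combined["accession"] != "")]
print("Mappings:", len(complete))  # Mappings: 1
```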


In [20]:
df.to_csv(NEO4J_HOME / "import/01d-PMC-Accession.csv", index=False)
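
Passing `index=False` keeps the DataFrame's row index out of the file, so the CSV header contains only the `id` and `accession` columns. The sketch below writes a made-up one-row frame to in-memory buffers to show the difference in the header line.

```python
import io

import pandas as pd

# Made-up single mapping
df_demo = pd.DataFrame({"id": ["pmc:PMC0000001"], "accession": ["ncbiprotein:NC_045512"]})

with_index = io.StringIO()
df_demo.to_csv(with_index)                  # row index written as an unnamed first column
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)  # only the data columns

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)     # ,id,accession
print(header_without)  # id,accession
```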