# Load Accession Number Mappings
**[Work in progress]**

This notebook downloads and standardizes accession numbers for life science and biological databases that [Europe PMC](https://europepmc.org/) has text-mined from PubMed Central full-text articles, for ingestion into a Knowledge Graph.

Data source: [Europe PMC](ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
ftp = 'ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/'

#### Collect reference accessions from previously imported datasets

In [4]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-328d8379-6ab4-4cc1-a397-2de37909d2e4/installation-4.1.0/import
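
The cell above relies on the `NEO4J_IMPORT` environment variable being set. If it is missing, `os.getenv` returns `None` and `Path(None)` raises a `TypeError`. A minimal sketch of a more explicit check follows; the helper name `get_import_dir` is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def get_import_dir(var_name='NEO4J_IMPORT'):
    """Return the import directory as a Path, or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    value = os.getenv(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return Path(value)
```

Calling `get_import_dir()` in place of `Path(os.getenv('NEO4J_IMPORT'))` would surface a missing variable before any file is read.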


In [5]:
ref1 = pd.read_csv(NEO4J_IMPORT / "01b-Nextstrain.csv")
ref1 = ref1[['id']]
ref1 = ref1.drop_duplicates()
ref1.rename(columns={'id': 'accession'}, inplace=True)
ref1.head()

Unnamed: 0,accession
0,https://www.gisaid.org/EPI_ISL_428451
1,https://www.gisaid.org/EPI_ISL_457733
2,https://www.gisaid.org/EPI_ISL_402121
3,https://www.gisaid.org/EPI_ISL_402130
4,https://www.gisaid.org/EPI_ISL_406798


In [6]:
ref2 = pd.read_csv(NEO4J_IMPORT / "01d-CNCBStrain.csv")

In [7]:
ref2 = ref2[['id']]
ref2 = ref2.drop_duplicates()
ref2.rename(columns={'id': 'accession'}, inplace=True)
ref2.head()

Unnamed: 0,accession
0,https://www.gisaid.org/EPI_ISL_402132
1,https://www.gisaid.org/EPI_ISL_403963
2,https://www.gisaid.org/EPI_ISL_403962
3,https://www.gisaid.org/EPI_ISL_402120
4,https://www.gisaid.org/EPI_ISL_402119


In [8]:
ref3 = pd.read_csv(NEO4J_IMPORT / "01c-NCBIRefSeq.csv")
ref3 = ref3[['genomeAccession']]
ref3 = ref3.drop_duplicates()
ref3.rename(columns={'genomeAccession': 'accession'}, inplace=True)
ref3.head()

Unnamed: 0,accession
0,refseq:NC_045512
38,insdc:MN908947


In [9]:
ref4 = pd.read_csv(NEO4J_IMPORT / "01e-ProteinProteinInteractionProtein.csv")
ref4 = ref4[['accession']]
ref4 = ref4.drop_duplicates()
ref4.head()

Unnamed: 0,accession
0,uniprot:A0A663DJA2
1,uniprot:A0MZ66
2,uniprot:A0PJW6
3,uniprot:A1L3X0
4,uniprot:A3KN83


In [10]:
ref5 = pd.read_csv(NEO4J_IMPORT / "01f-PDBStructure.csv")
ref5 = ref5[['pdbId']]
ref5.rename(columns={'pdbId': 'accession'}, inplace=True)
ref5 = ref5.drop_duplicates()
ref5.head()

Unnamed: 0,accession
0,pdb:5R84
1,pdb:5R83
2,pdb:5R7Y
3,pdb:5R80
4,pdb:5R82


In [11]:
ref = pd.concat([ref1, ref2, ref3, ref4, ref5])

## Assign unique identifiers for interoperability
A [CURIE](https://en.wikipedia.org/wiki/CURIE) (Compact URI) is a compact abbreviation for a Uniform Resource Identifier (URI). A CURIE consists of a registered prefix and an accession number (prefix:accession). The prefix acts as a namespace, which keeps identifiers unique and makes them interoperable among data resources.

[Identifiers.org](http://identifiers.org/) provides a registry and resolution service for life science CURIEs. 
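
The prefix:accession structure can be sketched in a few lines. The helper names below are hypothetical, and the resolver URL form (`https://identifiers.org/<prefix>:<accession>`) is assumed to apply to the prefixes used in this notebook.

```python
def make_curie(prefix, accession):
    """Join a registered prefix and an accession number into a CURIE."""
    return f"{prefix}:{accession}"


def split_curie(curie):
    """Split a CURIE at the first colon into (prefix, accession)."""
    prefix, accession = curie.split(":", 1)
    return prefix, accession


def to_identifiers_org_url(curie):
    """Build a resolver URL for a CURIE (assumed URL pattern)."""
    return "https://identifiers.org/" + curie


curie = make_curie("uniprot", "Q92769")
print(curie)                          # uniprot:Q92769
print(split_curie(curie))             # ('uniprot', 'Q92769')
print(to_identifiers_org_url(curie))  # https://identifiers.org/uniprot:Q92769
```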

In [12]:
def assign_publication_id(row):
    if row['PMCID'] != '':
        # CURIE: pmc (PubMed Central, PMC)
        return 'pmc:' + str(row['PMCID'])
    elif row['SOURCE'] == 'PPR':
        # no CURIE available, use URI for preprints
        return 'https://europepmc.org/article/PPR/' + row['EXTID']
    else:
        return ''
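
To illustrate the three branches of `assign_publication_id`, here is a small sketch on a toy DataFrame. The function is repeated so the example runs on its own. The rows are made up for illustration: it is an assumption that preprint rows have an empty `PMCID`, and the third `EXTID` is a placeholder.

```python
import pandas as pd


def assign_publication_id(row):
    if row['PMCID'] != '':
        # CURIE: pmc (PubMed Central, PMC)
        return 'pmc:' + str(row['PMCID'])
    elif row['SOURCE'] == 'PPR':
        # no CURIE available, use URI for preprints
        return 'https://europepmc.org/article/PPR/' + row['EXTID']
    else:
        return ''


# Illustrative rows: a PMC article, a preprint, and a record with neither
sample = pd.DataFrame({
    'PMCID': ['PMC2785473', '', ''],
    'EXTID': ['19956559', 'PPR204419', '00000000'],
    'SOURCE': ['MED', 'PPR', 'MED'],
})
sample['id'] = sample.apply(assign_publication_id, axis=1)
print(sample['id'].tolist())
# ['pmc:PMC2785473', 'https://europepmc.org/article/PPR/PPR204419', '']
```

Rows that receive an empty `id` are filtered out before the final export.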

### NCBI Reference Sequences
**accession**: CURIE: [refseq](https://registry.identifiers.org/registry/refseq) (NCBI Reference Sequences, RefSeq)

In [13]:
df1 = pd.read_csv(ftp + "refseq.csv", dtype=str)
df1.fillna('', inplace=True)
df1.head()

Unnamed: 0,refseq,PMCID,EXTID,SOURCE
0,NM_199203,PMC2785473,19956559,MED
1,NM_006544,PMC2785473,19956559,MED
2,NM_212472,PMC2785473,19956559,MED
3,NM_003934,PMC2785473,19956559,MED
4,NM_153693,PMC2785473,19956559,MED


In [14]:
df1['id'] = df1.apply(assign_publication_id, axis=1)
# Remove the version suffix (e.g. ".2") so that accessions match regardless of version
df1['accession'] = 'refseq:' + df1['refseq'].str.split('.', expand=True)[0]
df1 = df1[['id','accession']]
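
The version-stripping step can be checked in isolation. This is a minimal sketch with illustrative accession strings (the last one is made up); it uses the same `str.split('.', expand=True)[0]` expression as the cell above.

```python
import pandas as pd

# Illustrative accessions: versioned, unversioned, and a two-digit version
accessions = pd.Series(['NC_045512.2', 'NM_199203', 'NP_000000.10'])

# Keep only the part before the first '.', then add the CURIE prefix
stripped = 'refseq:' + accessions.str.split('.', expand=True)[0]
print(stripped.tolist())
# ['refseq:NC_045512', 'refseq:NM_199203', 'refseq:NP_000000']
```

Accessions without a version suffix pass through unchanged, so the expression is safe to apply to mixed input.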

In [15]:
df1 = df1.merge(ref, on="accession")
df1 = df1[['id','accession']]
df1.dropna(inplace=True)
print("Number of refseq matches:", df1.shape[0])
df1.head()

Number of refseq matches: 390


Unnamed: 0,id,accession
0,pmc:PMC7290700,refseq:NC_045512
1,https://europepmc.org/article/PPR/PPR204419,refseq:NC_045512
2,pmc:PMC7352669,refseq:NC_045512
3,https://europepmc.org/article/PPR/PPR199641,refseq:NC_045512
4,pmc:PMC7272177,refseq:NC_045512
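
The `merge` call performs an inner join, so only text-mined accessions that also occur in `ref` are kept. One property worth knowing: if the reference table contains the same accession more than once (for example when two concatenated sources overlap), every match is repeated. Whether this applies to the actual `ref` here is not verified; the sketch below uses toy data with hypothetical names to show the effect and how `drop_duplicates` avoids it.

```python
import pandas as pd

# Toy text-mined mappings (publication id -> accession)
mined = pd.DataFrame({
    'id': ['pmc:PMC1', 'pmc:PMC2', 'pmc:PMC3'],
    'accession': ['refseq:NC_045512', 'refseq:NC_045512', 'refseq:NM_199203'],
})

# Two toy reference sources that overlap on one accession
ref_a = pd.DataFrame({'accession': ['refseq:NC_045512']})
ref_b = pd.DataFrame({'accession': ['refseq:NC_045512', 'pdb:5R84']})

ref_with_dups = pd.concat([ref_a, ref_b])
ref_unique = ref_with_dups.drop_duplicates()

matches_dups = mined.merge(ref_with_dups, on='accession')
matches_unique = mined.merge(ref_unique, on='accession')

print(len(matches_dups))    # 4: each of the two matching rows appears twice
print(len(matches_unique))  # 2: one row per matching publication
```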


### GISAID Genome Sequences
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)


In [16]:
df2 = pd.read_csv(ftp + "gisaid.csv", dtype=str)
df2.fillna('', inplace=True)

In [17]:
df2['id'] = df2.apply(assign_publication_id, axis=1)

In [18]:
df2['accession'] = 'https://www.gisaid.org/' + df2['gisaid']
df2 = df2[['id','accession']]

In [19]:
df2 = df2.merge(ref, on="accession")
df2.dropna(inplace=True)
print("Number of GISAID matches:", df2.shape[0])
df2.head()

Number of GISAID matches: 1289


Unnamed: 0,id,accession
0,https://europepmc.org/article/PPR/PPR204749,https://www.gisaid.org/EPI_ISL_515350
1,https://europepmc.org/article/PPR/PPR204749,https://www.gisaid.org/EPI_ISL_515342
2,https://europepmc.org/article/PPR/PPR204749,https://www.gisaid.org/EPI_ISL_515340
3,https://europepmc.org/article/PPR/PPR204749,https://www.gisaid.org/EPI_ISL_515341
4,https://europepmc.org/article/PPR/PPR204749,https://www.gisaid.org/EPI_ISL_515348


### UniProt
**accession**: CURIE: [uniprot](https://registry.identifiers.org/registry/uniprot) (UniProt Knowledgebase, UniProtKB)

In [20]:
df3 = pd.read_csv(ftp + "uniprot.csv", dtype=str)
df3.fillna('', inplace=True)

In [21]:
df3['id'] = df3.apply(assign_publication_id, axis=1)
df3['accession'] = 'uniprot:' + df3['uniprot']
df3 = df3[['id','accession']]

In [22]:
df3 = df3.merge(ref, on="accession")
df3.dropna(inplace=True)
print("Number of UniProt matches:", df3.shape[0])
df3.head()

Number of UniProt matches: 21961


Unnamed: 0,id,accession
0,pmc:PMC6713643,uniprot:Q92769
1,pmc:PMC6109601,uniprot:Q92769
2,pmc:PMC4182858,uniprot:Q92769
3,pmc:PMC4446364,uniprot:Q92769
4,pmc:PMC4253452,uniprot:Q92769


### Protein Data Bank (NOT USED YET)
**accession**: CURIE: [pdb](https://registry.identifiers.org/registry/pdb) (Protein Data Bank, PDB)

In [23]:
df4 = pd.read_csv(ftp + "pdb.csv", dtype=str)
df4.fillna('', inplace=True)

In [24]:
df4['id'] = df4.apply(assign_publication_id, axis=1)
df4['accession'] = 'pdb:' + df4['pdb']
df4 = df4[['id','accession']]

In [25]:
df4 = df4.merge(ref, on="accession")
df4.dropna(inplace=True)
print("Number of PDB matches:", df4.shape[0])
df4.head()

Number of PDB matches: 2184


Unnamed: 0,id,accession
0,pmc:PMC7467145,pdb:6Y2E
1,pmc:PMC7282679,pdb:6Y2E
2,https://europepmc.org/article/PPR/PPR204619,pdb:6Y2E
3,pmc:PMC7394272,pdb:6Y2E
4,pmc:PMC7403965,pdb:6Y2E


### Digital Object Identifier (DOI) (NOT USED YET)
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [doi](https://registry.identifiers.org/registry/doi) (Digital Object Identifier System, DOI)

In [26]:
#df5 = pd.read_csv(ftp + "doi.csv", dtype=str)
#df5.fillna('', inplace=True)

In [27]:
#df5['id'] = df5.apply(assign_publication_id, axis=1)

In [28]:
#df5.head()

### Save data for Knowledge Graph Import

In [29]:
df = pd.concat([df1, df2, df3, df4])
df.fillna('', inplace=True)
df = df.query("id != ''")
df = df.query("accession != ''")
print('Mappings:', df.shape[0])

Mappings: 25766


In [30]:
df.head()

Unnamed: 0,id,accession
0,pmc:PMC7290700,refseq:NC_045512
1,https://europepmc.org/article/PPR/PPR204419,refseq:NC_045512
2,pmc:PMC7352669,refseq:NC_045512
3,https://europepmc.org/article/PPR/PPR199641,refseq:NC_045512
4,pmc:PMC7272177,refseq:NC_045512


In [31]:
df.to_csv(NEO4J_IMPORT / "01h-PMC-Accession.csv", index=False)