# Load Accession Numbers Mappings
**[Work in progress]**

This notebook downloads accession numbers for life science databases that [Europe PMC](https://europepmc.org/) has text-mined from PubMed Central full-text articles, and standardizes them for ingestion into a Knowledge Graph.

Data source: [Europe PMC](ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import dateutil
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
ftp = 'ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/'

#### Collect datasets with epi references

In [4]:
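# Raises a clear KeyError if NEO4J_IMPORT is unset
# (os.getenv would return None, and Path(None) fails with a TypeError)
NEO4J_IMPORT = Path(os.environ['NEO4J_IMPORT'])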
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


In [5]:
ref1 = pd.read_csv(NEO4J_IMPORT / "01a-UniProtProtein.csv")
ref1 = ref1[['accession']]
ref1 = ref1.drop_duplicates()
ref1.head()

Unnamed: 0,accession
0,uniprot:P0DTD1
16,uniprot:P0DTC7
18,uniprot:P0DTD2
19,uniprot:P0DTC9
20,uniprot:P0DTC3


In [6]:
ref2 = pd.read_csv(NEO4J_IMPORT / "01d-CNCBStrain.csv")

In [7]:
ref2['accession'] = ref2['accessions'].str.split(';')
ref2 = ref2.explode('accession')
ref2 = ref2[['accession']]
ref2 = ref2.drop_duplicates()
ref2.head()

Unnamed: 0,accession
0,NMDC60013088-01
0,https://www.gisaid.org/EPI_ISL_402132
1,https://www.gisaid.org/EPI_ISL_403963
2,https://www.gisaid.org/EPI_ISL_403962
3,NMDC60013085-01


In [8]:
ref2[ref2['accession'].str.startswith('refseq')]

Unnamed: 0,accession
7,refseq:NC_045512


In [9]:
ref3 = pd.read_csv(NEO4J_IMPORT / "Genome.csv")
ref3 = ref3[['refSeq']]
ref3 = ref3.drop_duplicates()
ref3.rename(columns={'refSeq': 'accession'}, inplace=True)
ref3.head()

Unnamed: 0,accession
0,refseq:NC_045512
1,refseq:NC_038294
2,refseq:NC_004718
3,refseq:NC_000001
4,refseq:NC_000002


In [10]:
ref4 = pd.read_csv(NEO4J_IMPORT / "01e-ProteinProteinInteractionProtein.csv")
ref4 = ref4[['accession']]
ref4 = ref4.drop_duplicates()
ref4.head()

Unnamed: 0,accession
0,uniprot:A0A663DJA2
1,uniprot:A0MZ66
2,uniprot:A0PJW6
3,uniprot:A1L3X0
4,uniprot:A3KN83


In [11]:
ref5 = pd.read_csv(NEO4J_IMPORT / "01f-PDBStructure.csv")
ref5 = ref5[['pdbId']]
ref5.rename(columns={'pdbId': 'accession'}, inplace=True)
ref5 = ref5.drop_duplicates()
ref5.head()

Unnamed: 0,accession
0,pdb:6W9Q
1,pdb:6VXS
2,pdb:6W9C
3,pdb:6VWW
4,pdb:6VYO


In [12]:
ref6 = pd.read_csv(NEO4J_IMPORT / "01c-NCBIGenome.csv")
ref6 = ref6[['genomeAccession']]
ref6.rename(columns={'genomeAccession': 'accession'}, inplace=True)
ref6 = ref6.drop_duplicates()
ref6.head()

Unnamed: 0,accession
0,refseq:NC_045512
1,refseq:NC_038294
2,refseq:NC_004718


In [13]:
ref = pd.concat([ref1, ref2, ref3, ref4, ref5, ref6])
#ref = pd.concat([ref2, ref3, ref4, ref5, ref6])

In [14]:
ref.drop_duplicates(inplace=True)

In [15]:
print('Number of unique accessions: ', ref.shape[0])

Number of unique accessions:  258668


## Assign unique identifiers for interoperability
A [CURIE](https://en.wikipedia.org/wiki/CURIE) (Compact URI) is a compact abbreviation for a Uniform Resource Identifier (URI). A CURIE consists of a registered prefix and an accession number (prefix:accession). The prefix provides a namespace that keeps identifiers unique and makes them interoperable across data resources.

[Identifiers.org](http://identifiers.org/) provides a registry and resolution service for life science CURIEs. 
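
As a minimal sketch of the prefix:accession convention described above, the helpers below build a CURIE, split it again, and form an Identifiers.org resolution URL. The function names `make_curie`, `split_curie`, and `resolver_url` are hypothetical and are not used elsewhere in this notebook; the URL is only constructed, not fetched.

```python
from typing import Tuple


def make_curie(prefix: str, accession: str) -> str:
    """Join a registered prefix and a local accession into a CURIE."""
    return f"{prefix}:{accession}"


def split_curie(curie: str) -> Tuple[str, str]:
    """Split a CURIE into (prefix, accession) at the first colon."""
    prefix, _, accession = curie.partition(":")
    return prefix, accession


def resolver_url(curie: str) -> str:
    """Build an Identifiers.org URL for a CURIE (string only, no network access)."""
    return "https://identifiers.org/" + curie


curie = make_curie("uniprot", "P0DTD1")
print(curie)                # uniprot:P0DTD1
print(split_curie(curie))   # ('uniprot', 'P0DTD1')
print(resolver_url(curie))  # https://identifiers.org/uniprot:P0DTD1
```

Splitting at the first colon only keeps accessions that themselves contain a colon intact.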

In [16]:
def assign_publication_id(row):
    if row['PMCID'] != '':
        # CURIE: pmc (PubMed Central, PMC)
        return 'pmc:' + str(row['PMCID'])
    elif row['SOURCE'] == 'PPR':
        # no CURIE available, use URI for preprints
        return 'https://europepmc.org/article/PPR/' + row['EXTID']
    else:
        return ''
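
The row-wise function above can be slow on large files because `DataFrame.apply(..., axis=1)` calls Python once per row. The sketch below shows one possible vectorized equivalent and checks it against the row-wise version on a toy table. The name `assign_publication_ids_vectorized` and the toy rows are assumptions made for illustration (the third row's `EXTID` is invented); the notebook itself uses the row-wise function.

```python
import pandas as pd

# Toy rows in the Europe PMC column layout (illustrative values only)
toy = pd.DataFrame({
    "PMCID": ["PMC7512552", "", ""],
    "EXTID": ["32963006", "PPR167663", "12345678"],
    "SOURCE": ["MED", "PPR", "MED"],
})


def assign_publication_id(row):
    """Row-wise version, copied from the cell above."""
    if row["PMCID"] != "":
        return "pmc:" + str(row["PMCID"])
    elif row["SOURCE"] == "PPR":
        return "https://europepmc.org/article/PPR/" + row["EXTID"]
    else:
        return ""


def assign_publication_ids_vectorized(df):
    """Hypothetical vectorized equivalent built from boolean masks."""
    ids = pd.Series("", index=df.index, dtype=object)
    # Preprints: no CURIE available, fall back to the Europe PMC URI
    ids = ids.mask(df["SOURCE"] == "PPR",
                   "https://europepmc.org/article/PPR/" + df["EXTID"])
    # Applied last so that a PMCID takes priority, as in the if/elif chain
    ids = ids.mask(df["PMCID"] != "", "pmc:" + df["PMCID"])
    return ids


row_wise = toy.apply(assign_publication_id, axis=1)
vectorized = assign_publication_ids_vectorized(toy)
print(vectorized.tolist())
# ['pmc:PMC7512552', 'https://europepmc.org/article/PPR/PPR167663', '']
```

Rows with neither a PMCID nor a preprint source get an empty id, which the final cell of this notebook filters out.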

### NCBI Reference Sequences
**accession**: CURIE: [refseq](https://registry.identifiers.org/registry/refseq) (NCBI Reference Sequences, Refseq)

In [17]:
df1 = pd.read_csv(ftp + "refseq.csv", dtype=str)
df1.fillna('', inplace=True)
df1.head()

Unnamed: 0,refseq,PMCID,EXTID,SOURCE
0,NM_015973,PMC7512552,32963006,MED
1,NM_001034,PMC7512552,32963006,MED
2,NM_001789,PMC7512552,32963006,MED
3,NM_001008708,PMC7512552,32963006,MED
4,NM_203467,PMC7512552,32963006,MED


In [18]:
df1['id'] = df1.apply(assign_publication_id, axis=1)
# Remove version number from refseq to match to the latest version
df1['accession'] = 'refseq:' + df1['refseq'].str.split('.', expand=True)[0]
df1 = df1[['id','accession']]
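
To illustrate the version stripping in the cell above, here is a small sketch on toy accessions. The sample values and the use of `.str[0]` instead of `expand=True` are choices made for this example, not part of the notebook; both forms keep the text before the first dot.

```python
import pandas as pd

# Toy RefSeq accessions, with and without a version suffix (illustrative)
raw = pd.Series(["NC_045512.2", "NM_015973", "NC_000008.11"])

# A single-character pattern is treated as a literal dot, not as a regex
base = raw.str.split(".").str[0]
accession = "refseq:" + base

print(accession.tolist())
# ['refseq:NC_045512', 'refseq:NM_015973', 'refseq:NC_000008']
```

Dropping the version means a text-mined mention of any version is matched to the unversioned accession stored in the Knowledge Graph.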

In [19]:
df1 = df1.merge(ref, on="accession")
df1 = df1[['id','accession']]
df1.dropna(inplace=True)
print("Number of refseq matches:", df1.shape[0])
df1.head()

Number of refseq matches: 3224


Unnamed: 0,id,accession
0,pmc:PMC7027195,refseq:NC_000008
1,pmc:PMC1181929,refseq:NC_000008
2,pmc:PMC4430495,refseq:NC_000008
3,pmc:PMC4768118,refseq:NC_000008
4,pmc:PMC4210469,refseq:NC_000008
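
The merge above works as a filter: `DataFrame.merge` defaults to an inner join, so only text-mined accessions that also appear in `ref` are kept. A minimal sketch with toy data, where `mined` and `toy_ref` are hypothetical stand-ins for `df1` and `ref`:

```python
import pandas as pd

# Toy text-mined mappings (publication id -> accession)
mined = pd.DataFrame({
    "id": ["pmc:PMC7027195", "pmc:PMC7512552"],
    "accession": ["refseq:NC_000008", "refseq:NM_015973"],
})

# Toy reference set of accessions present in the Knowledge Graph
toy_ref = pd.DataFrame({"accession": ["refseq:NC_000008", "refseq:NC_045512"]})

# Default how="inner": rows without a partner in toy_ref are dropped
matched = mined.merge(toy_ref, on="accession")
print(matched)
```

Only the `refseq:NC_000008` row survives here, because `refseq:NM_015973` is absent from the toy reference set.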


### GISAID Genome Sequences
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: URI: [https://www.gisaid.org/](https://www.gisaid.org/help/publish-with-gisaid-references) (Global Initiative on Sharing All Influenza Data, GISAID)


In [20]:
df2 = pd.read_csv(ftp + "gisaid.csv", dtype=str)
df2.fillna('', inplace=True)

In [21]:
df2['id'] = df2.apply(assign_publication_id, axis=1)

In [22]:
df2['accession'] = 'https://www.gisaid.org/' + df2['gisaid']
df2 = df2[['id','accession']]

In [23]:
df2 = df2.merge(ref, on="accession")
df2.dropna(inplace=True)
print("Number of GISAID matches:", df2.shape[0])
df2.head()

Number of GISAID matches: 2797


Unnamed: 0,id,accession
0,https://europepmc.org/article/PPR/PPR167663,https://www.gisaid.org/EPI_ISL_402131
1,pmc:PMC7166309,https://www.gisaid.org/EPI_ISL_402131
2,https://europepmc.org/article/PPR/PPR190800,https://www.gisaid.org/EPI_ISL_402131
3,pmc:PMC7497811,https://www.gisaid.org/EPI_ISL_402131
4,pmc:PMC7205519,https://www.gisaid.org/EPI_ISL_402131


### UniProt
**accession**: CURIE: [uniprot](https://registry.identifiers.org/registry/uniprot) (UniProt Knowledgebase, UniProtKB)

In [24]:
df3 = pd.read_csv(ftp + "uniprot.csv", dtype=str)
df3.fillna('', inplace=True)

In [25]:
df3['id'] = df3.apply(assign_publication_id, axis=1)
df3['accession'] = 'uniprot:' + df3['uniprot']
df3 = df3[['id','accession']]

In [26]:
df3 = df3.merge(ref, on="accession")
df3.dropna(inplace=True)
print("Number of UniProt matches:", df3.shape[0])
df3.head()

Number of UniProt matches: 160907


Unnamed: 0,id,accession
0,pmc:PMC6713643,uniprot:P24385
1,pmc:PMC5761900,uniprot:P24385
2,pmc:PMC3275796,uniprot:P24385
3,pmc:PMC4823807,uniprot:P24385
4,pmc:PMC5474285,uniprot:P24385


### Protein Data Bank
**accession**: CURIE: [pdb](https://registry.identifiers.org/registry/pdb) (Protein Data Bank, PDB)

In [27]:
df4 = pd.read_csv(ftp + "pdb.csv", dtype=str)
df4.fillna('', inplace=True)

In [28]:
df4['id'] = df4.apply(assign_publication_id, axis=1)
df4['accession'] = 'pdb:' + df4['pdb']
df4 = df4[['id','accession']]

In [29]:
df4 = df4.merge(ref, on="accession")
df4.dropna(inplace=True)
print("Number of PDB matches:", df4.shape[0])
df4.head()

Number of PDB matches: 3731


Unnamed: 0,id,accession
0,pmc:PMC7554297,pdb:6VYB
1,pmc:PMC7467145,pdb:6VYB
2,pmc:PMC7584483,pdb:6VYB
3,pmc:PMC7282679,pdb:6VYB
4,pmc:PMC7169934,pdb:6VYB


### Digital Object Identifier (DOI) (NOT USED YET)
**id**: CURIE: [pmc](https://registry.identifiers.org/registry/pmc) (PubMed Central, PMC)

**accession**: CURIE: [doi](https://registry.identifiers.org/registry/doi) (Digital Object Identifier System, DOI)

In [30]:
#df5 = pd.read_csv(ftp + "doi.csv", dtype=str)
#df5.fillna('', inplace=True)

In [31]:
#df5['id'] = df5.apply(assign_publication_id, axis=1)

In [32]:
#df5.head()

### Save data for Knowledge Graph Import

In [33]:
df = pd.concat([df1, df2, df3, df4])
df.fillna('', inplace=True)
df = df.query("id != ''")
df.drop_duplicates(inplace=True)
df = df.query("accession != ''")
print('Mappings:', df.shape[0])

Mappings: 170286


In [34]:
df.head()

Unnamed: 0,id,accession
0,pmc:PMC7027195,refseq:NC_000008
1,pmc:PMC1181929,refseq:NC_000008
2,pmc:PMC4430495,refseq:NC_000008
3,pmc:PMC4768118,refseq:NC_000008
4,pmc:PMC4210469,refseq:NC_000008


In [35]:
df.to_csv(NEO4J_IMPORT / "01h-PMC-Accession.csv", index=False)

In [36]:
df = df[(df['accession'].str.startswith('https://www.gisaid.org/')) | (df['accession'].str.startswith('refseq'))]

In [37]:
df.to_csv(NEO4J_IMPORT / "01h-PMC-Strain-Accession.csv", index=False)