# UniProt Viral and Host Protein Data
**[Work in progress]**

This notebook downloads and standardizes viral and host protein data from UniProt for ingestion into the Knowledge Graph.

Data source: [UniProt](https://www.uniprot.org/)

Authors: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import re
import hashlib 
import urllib
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


### Get list of organisms in the Knowledge Graph

In [4]:
organisms = pd.read_csv("../../reference_data/Organism.csv", dtype=str)

In [5]:
# exclude organisms without an NCBI taxonomy id; copy to avoid a SettingWithCopyWarning below
organisms = organisms[organisms['id'].str.startswith('taxonomy')].copy()
# strip the CURIE prefix, e.g., 'taxonomy:9606' -> '9606'
organisms['taxonomy'] = organisms['id'].apply(lambda x: x.split(':')[1])
taxonomy_ids = organisms['taxonomy'].unique()

### Download data from UniProt

In [6]:
columns = 'id,entry%20name,p,sequence,length,protein%20names,reviewed,organism-id,feature(CHAIN),feature(PEPTIDE),go(biological%20process)'
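
The percent-encoded column string above is written by hand. As a minimal sketch, the same kind of query URL could be assembled with `urllib.parse.urlencode` so that spaces are escaped automatically. The helper name `build_query_url` and the shortened column list are illustrative assumptions; the base URL and the parameter names are taken from the download cell that follows.

```python
from urllib.parse import quote, urlencode

# Base URL as used in the download cell; the helper name is hypothetical.
BASE_URL = 'https://www.uniprot.org/uniprot/'

def build_query_url(taxon, column_names):
    """Assemble the tab-format query URL for one taxonomy id."""
    params = {
        'query': f'organism:{taxon}',
        'columns': ','.join(column_names),
        'format': 'tab',
    }
    # quote_via=quote encodes spaces as %20 (not '+'); `safe` keeps ':', ','
    # and parentheses literal, matching the hand-written string.
    return BASE_URL + '?' + urlencode(params, quote_via=quote, safe=':,()')

url = build_query_url('9606', ['id', 'entry name', 'feature(CHAIN)'])
print(url)
```

Passing plain column names such as `'entry name'` avoids double-encoding, which would happen if the already-escaped `entry%20name` were passed through `urlencode`.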

In [7]:
dfs = list()
for taxon in taxonomy_ids:
    url = f'https://www.uniprot.org/uniprot/?query=organism:{taxon}&columns={columns}&format=tab'
    try:
        df = pd.read_csv(url, sep='\t', dtype='str')
        if df.shape[0] > 0:
            print(f'Downloaded {df.shape[0]} proteins for taxonomy id {taxon}')
            dfs.append(df)
        else:
            print(f'Downloaded 0 proteins for taxonomy id {taxon}')
    except Exception:  # e.g., an HTTP error or an empty response body
        print(f'Downloaded 0 proteins for taxonomy id {taxon}')

Downloaded 18413 proteins for taxonomy id 2697049
Downloaded 10 proteins for taxonomy id 1263720
Downloaded 96 proteins for taxonomy id 694009
Downloaded 9 proteins for taxonomy id 443239
Downloaded 1578 proteins for taxonomy id 31631
Downloaded 455 proteins for taxonomy id 11137
Downloaded 985 proteins for taxonomy id 277944
Downloaded 14 proteins for taxonomy id 12131
Downloaded 16 proteins for taxonomy id 12134
Downloaded 10 proteins for taxonomy id 766791
Downloaded 146 proteins for taxonomy id 693998
Downloaded 15 proteins for taxonomy id 1487703
Downloaded 12 proteins for taxonomy id 285949
Downloaded 194237 proteins for taxonomy id 9606
Downloaded 86618 proteins for taxonomy id 10090
Downloaded 144 proteins for taxonomy id 59477
Downloaded 22 proteins for taxonomy id 608659
Downloaded 0 proteins for taxonomy id 49442
Downloaded 80 proteins for taxonomy id 9974
Downloaded 187 proteins for taxonomy id 143292
Downloaded 0 proteins for taxonomy id 71116
Downloaded 0 proteins for tax

In [8]:
unp = pd.concat(dfs)

In [9]:
unp.reset_index(drop=True,inplace=True)

In [10]:
unp.fillna('', inplace=True)
print(unp.shape)

(393931, 10)


In [11]:
unp.tail()

Unnamed: 0,Entry,Entry name,Sequence,Length,Protein names,Status,Organism ID,Chain,Peptide,Gene ontology (biological process)
393926,A0A0A0Y6P1,A0A0A0Y6P1_PANTI,MSYRDLIVGIIFLSQTVVGIVGNFSLLYHYLFLYHTECRVRSTHLI...,314,Vomeronasal type-1 receptor,unreviewed,9694,,,response to pheromone [GO:0019236]
393927,K7NVN4,K7NVN4_PANTI,FKYLLMFLITMMILVTANNLFQLFIGWEGVGIMSFLLIGWWYGRAD...,159,NADH:ubiquinone reductase (H(+)-translocating)...,unreviewed,9694,,,
393928,B6E5W4,B6E5W4_PANTI,LLGKTECHFTNGTELVRFLDRYFYNGEEYVRFDSDVGEYRAVTELG...,79,MHC class II antigen (Fragment),unreviewed,9694,,,antigen processing and presentation [GO:001988...
393929,H9L6P9,H9L6P9_PANTI,RLHQRGHDVVVIAPEASVYIKEGAFYTLKSYPVPFRREDVEASFTG...,213,UDP-glucuronosyltransferase 1A1 (Fragment),unreviewed,9694,,,
393930,Q5MLT9,Q5MLT9_PANTI,ILVTANNLFQLFIGWEGVGIMSFLLIGWWYGRADANTAALQAILYN...,163,NADH:ubiquinone reductase (H(+)-translocating)...,unreviewed,9694,,,


In [12]:
unp['reviewed'] = unp['Status'].apply(lambda s: 'True' if s == 'reviewed' else 'False')

#### Format synonyms

In [13]:
unp.query("Entry == 'P0DTC2'")['Protein names'].values

array(["Spike glycoprotein (S glycoprotein) (E2) (Peplomer protein) [Cleaved into: Spike protein S1; Spike protein S2; Spike protein S2']"],
      dtype=object)

Remove terms in brackets, e.g., [Cleaved into: ...]

"Spike glycoprotein (S glycoprotein) (E2) (Peplomer protein) [Cleaved into: Spike protein S1; Spike protein S2; Spike protein S2']"

In [14]:
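# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
unp['synonymes'] = unp['Protein names'].str.replace(r"\[.+\]", "", regex=True)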

In [15]:
unp.query("Entry == 'P0DTC2'")['synonymes'].values

array(['Spike glycoprotein (S glycoprotein) (E2) (Peplomer protein) '],
      dtype=object)

Convert synonyms to a semicolon-separated list to represent these one-to-many relationships in a CSV file.

In [16]:
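# regex=False: '(' and ')' are literal characters here, not regex groups
unp['synonymes'] = unp['synonymes'].str.replace('(', ';', regex=False)
unp['synonymes'] = unp['synonymes'].str.replace(' ;', ';', regex=False)
unp['synonymes'] = unp['synonymes'].str.replace(')', '', regex=False)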
unp['synonymes'] = unp['synonymes'].str.strip()

In [17]:
unp.query("Entry == 'P0DTC2'")['synonymes'].values

array(['Spike glycoprotein;S glycoprotein;E2;Peplomer protein'],
      dtype=object)
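
The formatting steps above can be sketched for a single string without pandas, which makes the intermediate behavior easier to inspect. The function name `to_synonyms` is a hypothetical helper, not part of the notebook.

```python
import re

def to_synonyms(protein_names):
    """Turn a UniProt 'Protein names' string into a semicolon-separated list."""
    # Drop bracketed annotations such as "[Cleaved into: ...]".
    names = re.sub(r"\[.+\]", "", protein_names)
    # Each "(" starts a new synonym; remove the space before it and the ")".
    names = names.replace('(', ';').replace(' ;', ';').replace(')', '')
    return names.strip()

spike = ("Spike glycoprotein (S glycoprotein) (E2) (Peplomer protein) "
         "[Cleaved into: Spike protein S1; Spike protein S2; Spike protein S2']")
print(to_synonyms(spike))
```

Note that this approach treats every `(` as a separator, so a name that itself contains parentheses, such as a made-up `Enzyme (H(+)-translocating)`, would be split in the middle of the name.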

In [18]:
unp.head()

Unnamed: 0,Entry,Entry name,Sequence,Length,Protein names,Status,Organism ID,Chain,Peptide,Gene ontology (biological process),reviewed,synonymes
0,P0DTC5,VME1_SARS2,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,222,Membrane protein (M) (E1 glycoprotein) (Matrix...,reviewed,2697049,"CHAIN 1..222; /note=""Membrane protein""; /id=...",,mitigation of host immune response by virus [G...,True,Membrane protein;M;E1 glycoprotein;Matrix glyc...
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein
2,P0DTD3,Y14_SARS2,MLQSCYNFLKEQHCQKASTQKGAEAAVKPLLVPHHVVATVQEIQLQ...,73,Uncharacterized protein 14 (ORF14),reviewed,2697049,"CHAIN 1..73; /note=""Uncharacterized protein 1...",,,True,Uncharacterized protein 14;ORF14
3,P0DTC4,VEMP_SARS2,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...,75,Envelope small membrane protein (E) (sM protein),reviewed,2697049,"CHAIN 1..75; /note=""Envelope small membrane p...",,pore formation by virus in membrane of host ce...,True,Envelope small membrane protein;E;sM protein
4,P0DTC2,SPIKE_SARS2,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,1273,Spike glycoprotein (S glycoprotein) (E2) (Pepl...,reviewed,2697049,"CHAIN 13..1273; /note=""Spike glycoprotein""; ...",,endocytosis involved in viral entry into host ...,True,Spike glycoprotein;S glycoprotein;E2;Peplomer ...


In [19]:
unp.query("Entry == 'P01042'")['Chain'].values

array(['CHAIN 19..644;  /note="Kininogen-1";  /id="PRO_0000006685"; CHAIN 19..380;  /note="Kininogen-1 heavy chain";  /id="PRO_0000006686"; CHAIN 390..644;  /note="Kininogen-1 light chain";  /id="PRO_0000006689"'],
      dtype=object)

In [20]:
unp.query("Entry == 'P01042'")['Peptide'].values

array(['PEPTIDE 376..389;  /note="T-kinin";  /id="PRO_0000372485"; PEPTIDE 380..389;  /note="Lysyl-bradykinin";  /id="PRO_0000006687"; PEPTIDE 381..389;  /note="Bradykinin";  /id="PRO_0000006688"; PEPTIDE 431..434;  /note="Low molecular weight growth-promoting factor";  /id="PRO_0000006690"'],
      dtype=object)

In [21]:
def parse_feature_record(record, feature_type):
    items = record.split(';')
    feature = np.empty(5, dtype=object)
        
    feature[0] = feature_type
    for item in items:
        item = item.strip()
        if '..' in item:
            start_end = item.split('..')
            # in a few cases a '?' is used to represent an unknown start or end, check if it's a digit
            if start_end[0].isdigit():
                feature[1] = start_end[0]
            else:
                feature[1] = ''
            if start_end[1].isdigit():
                feature[2] = start_end[1]
            else:
                feature[2] = ''
        elif item.startswith("/note="):
            name = item[6:].replace('\"', '')
            feature[3] = name
        elif item.startswith("/id="):
            pro_id = item[4:].replace('\"', '')
            feature[4] = 'uniprot.chain:' + pro_id
                
    return feature

In [22]:
def parse_features(row):
    chain_features = []
    if 'CHAIN' in row['Chain']:
        chains = row['Chain'].split('CHAIN')
        if chains[0] == '':
            chains = chains[1:]
        chain_features = [parse_feature_record(chain, 'CHAIN') for chain in chains]

    protein_features = []
    # Full-length (coding sequence) proteins are handled inconsistently in UniProt.
    # For some entries, the full-length protein is included
    # in the chain features (e.g., P0DTD1); for others it is not (e.g., P01042).
    # Check if the full-length protein is included in the chain list.
    full_length = False
    for f in chain_features:
        if f[1] == '1' and f[2] == row['Length']:
            full_length = True
            break
    # Add entry if full-length protein is not in chain list
    if not full_length:
        protein_name = row['Protein names'].split('(')[0]
        protein_features = [np.array(['PROTEIN','1', row['Length'], protein_name,''], dtype=object)]
            
    peptide_features = []
    if 'PEPTIDE' in row['Peptide']:
        peptides = row['Peptide'].split('PEPTIDE')
        if peptides[0] == '':
            peptides = peptides[1:]
        peptide_features = [parse_feature_record(peptide, 'PEPTIDE') for peptide in peptides]
    
    return protein_features + chain_features + peptide_features
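
As an alternative illustration of the feature format, the sketch below parses a `Chain` or `Peptide` string with a single regular expression instead of the split-based approach used above. It assumes every feature carries a position range, a `/note` and an `/id`, in that order; features that deviate from this layout would be skipped. The names `FEATURE_PATTERN` and `parse_feature_string` are hypothetical.

```python
import re

# One feature: TYPE start..end; /note="..."; /id="..."
FEATURE_PATTERN = re.compile(
    r'(CHAIN|PEPTIDE)\s+(\S+?)\.\.(\S+?);\s*/note="([^"]*)";\s*/id="([^"]*)"'
)

def parse_feature_string(text):
    """Return a list of dicts, one per CHAIN or PEPTIDE feature in text."""
    features = []
    for kind, start, end, note, pro_id in FEATURE_PATTERN.findall(text):
        features.append({
            'type': kind,
            # Unknown positions (e.g. '?') are stored as empty strings.
            'start': start if start.isdigit() else '',
            'end': end if end.isdigit() else '',
            'name': note,
            'proId': 'uniprot.chain:' + pro_id,
        })
    return features

# Chain string of P01042 as printed earlier in this notebook.
chain = ('CHAIN 19..644;  /note="Kininogen-1";  /id="PRO_0000006685"; '
         'CHAIN 19..380;  /note="Kininogen-1 heavy chain";  /id="PRO_0000006686"; '
         'CHAIN 390..644;  /note="Kininogen-1 light chain";  /id="PRO_0000006689"')
for feature in parse_feature_string(chain):
    print(feature)
```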

In [23]:
unp['Features'] = unp.apply(parse_features, axis=1)

In [24]:
unp = unp.explode('Features')

In [25]:
unp[['type', 'start', 'end', 'name', 'proId']] = unp.apply(lambda row: row['Features'], axis=1, result_type="expand")

Handle missing values

In [26]:
unp.fillna('', inplace=True)

In [27]:
unp.head(50)

Unnamed: 0,Entry,Entry name,Sequence,Length,Protein names,Status,Organism ID,Chain,Peptide,Gene ontology (biological process),reviewed,synonymes,Features,type,start,end,name,proId
0,P0DTC5,VME1_SARS2,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,222,Membrane protein (M) (E1 glycoprotein) (Matrix...,reviewed,2697049,"CHAIN 1..222; /note=""Membrane protein""; /id=...",,mitigation of host immune response by virus [G...,True,Membrane protein;M;E1 glycoprotein;Matrix glyc...,"[CHAIN, 1, 222, Membrane protein, uniprot.chai...",CHAIN,1,222,Membrane protein,uniprot.chain:PRO_0000449652
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 1, 4405, Replicase polyprotein 1a, uni...",CHAIN,1,4405,Replicase polyprotein 1a,uniprot.chain:PRO_0000449634
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 1, 180, Host translation inhibitor nsp...",CHAIN,1,180,Host translation inhibitor nsp1,uniprot.chain:PRO_0000449635
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 181, 818, Non-structural protein 2, un...",CHAIN,181,818,Non-structural protein 2,uniprot.chain:PRO_0000449636
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 819, 2763, Non-structural protein 3, u...",CHAIN,819,2763,Non-structural protein 3,uniprot.chain:PRO_0000449637
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 2764, 3263, Non-structural protein 4, ...",CHAIN,2764,3263,Non-structural protein 4,uniprot.chain:PRO_0000449638
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 3264, 3569, 3C-like proteinase, unipro...",CHAIN,3264,3569,3C-like proteinase,uniprot.chain:PRO_0000449639
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 3570, 3859, Non-structural protein 6, ...",CHAIN,3570,3859,Non-structural protein 6,uniprot.chain:PRO_0000449640
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 3860, 3942, Non-structural protein 7, ...",CHAIN,3860,3942,Non-structural protein 7,uniprot.chain:PRO_0000449641
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 3943, 4140, Non-structural protein 8, ...",CHAIN,3943,4140,Non-structural protein 8,uniprot.chain:PRO_0000449642


#### Cleave sequences into peptides

In [28]:
def get_subsequence(row):
    if row['start'].isdigit() and row['end'].isdigit():
        start = int(row['start'])
        end = int(row['end'])
        sequence = row['Sequence']
        return sequence[start-1: end]
    else:
        return ''

In [29]:
unp['sequence'] = unp.apply(get_subsequence, axis=1)
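
UniProt feature positions are 1-based and inclusive, while Python slices are 0-based and exclude the end index, which is why the slice is `sequence[start-1: end]`. A minimal sketch with a toy string (not a real protein sequence; `subsequence` is a hypothetical helper):

```python
def subsequence(sequence, start, end):
    """Return residues start..end (1-based, inclusive) of sequence."""
    return sequence[start - 1:end]

toy = 'ABCDEFGHIJ'  # toy sequence for illustration only
peptide = subsequence(toy, 3, 5)
print(peptide, len(peptide))
```

The length of the result is `end - start + 1`; for example, a feature annotated as 381..389 yields 9 residues.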

Set flag if protein chain is full length

In [30]:
unp['fullLength'] = (unp['start'] == '1') & (unp['end'] == unp['Length'])

In [31]:
unp['name'] = unp['name'].str.strip()

In [32]:
def set_synonymes(row):
    if row['fullLength']:
        return row['synonymes']
    else:
        return row['name']

A cleaved protein should not inherit synonyms from the full-length protein; it keeps only its own name.

In [33]:
unp['synonymes'] = unp.apply(set_synonymes, axis=1)

In [34]:
unp.head(50)

Unnamed: 0,Entry,Entry name,Sequence,Length,Protein names,Status,Organism ID,Chain,Peptide,Gene ontology (biological process),reviewed,synonymes,Features,type,start,end,name,proId,sequence,fullLength
0,P0DTC5,VME1_SARS2,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,222,Membrane protein (M) (E1 glycoprotein) (Matrix...,reviewed,2697049,"CHAIN 1..222; /note=""Membrane protein""; /id=...",,mitigation of host immune response by virus [G...,True,Membrane protein;M;E1 glycoprotein;Matrix glyc...,"[CHAIN, 1, 222, Membrane protein, uniprot.chai...",CHAIN,1,222,Membrane protein,uniprot.chain:PRO_0000449652,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,True
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 1, 4405, Replicase polyprotein 1a, uni...",CHAIN,1,4405,Replicase polyprotein 1a,uniprot.chain:PRO_0000449634,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,True
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Host translation inhibitor nsp1,"[CHAIN, 1, 180, Host translation inhibitor nsp...",CHAIN,1,180,Host translation inhibitor nsp1,uniprot.chain:PRO_0000449635,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,False
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Non-structural protein 2,"[CHAIN, 181, 818, Non-structural protein 2, un...",CHAIN,181,818,Non-structural protein 2,uniprot.chain:PRO_0000449636,AYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKR...,False
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Non-structural protein 3,"[CHAIN, 819, 2763, Non-structural protein 3, u...",CHAIN,819,2763,Non-structural protein 3,uniprot.chain:PRO_0000449637,APTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEKCSAYTVEL...,False
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Non-structural protein 4,"[CHAIN, 2764, 3263, Non-structural protein 4, ...",CHAIN,2764,3263,Non-structural protein 4,uniprot.chain:PRO_0000449638,KIVNNWLKQLIKVTLVFLFVAAIFYLITPVHVMSKHTDFSSEIIGY...,False
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,3C-like proteinase,"[CHAIN, 3264, 3569, 3C-like proteinase, unipro...",CHAIN,3264,3569,3C-like proteinase,uniprot.chain:PRO_0000449639,SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS...,False
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Non-structural protein 6,"[CHAIN, 3570, 3859, Non-structural protein 6, ...",CHAIN,3570,3859,Non-structural protein 6,uniprot.chain:PRO_0000449640,SAVKRTIKGTHHWLLLTILTSLLVLVQSTQWSLFFFLYENAFLPFA...,False
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Non-structural protein 7,"[CHAIN, 3860, 3942, Non-structural protein 7, ...",CHAIN,3860,3942,Non-structural protein 7,uniprot.chain:PRO_0000449641,SKMSDVKCTSVVLLSVLQQLRVESSSKLWAQCVQLHNDILLAKDTT...,False
1,P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Non-structural protein 8,"[CHAIN, 3943, 4140, Non-structural protein 8, ...",CHAIN,3943,4140,Non-structural protein 8,uniprot.chain:PRO_0000449642,AIASEFSSLPSYAAFATAQEAYEQAVANGDSEVVLKKLKKSLNVAK...,False


In [35]:
unp.rename(columns={'Organism ID': 'taxonomyId','Entry': 'accession', 'Entry name': 'entryName'}, inplace=True)

##### Assign unique identifiers

Use MD5 hash codes of the protein sequence as identifiers, and CURIEs for accession and taxonomyId.

In [36]:
unp['id'] = unp['sequence'].apply(lambda seq: 'md5:' + hashlib.md5(seq.encode()).hexdigest())

In [37]:
# disambiguate id by taxonomyId (same sequence for different organisms)
unp['id'] = unp['id'] + '-' + unp['taxonomyId']
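
The identifier scheme can be sketched in isolation as follows. The function name `protein_id` is hypothetical, and pairing the example sequence with two taxonomy ids is for illustration only, not a claim about the data.

```python
import hashlib

def protein_id(sequence, taxonomy_id):
    """Build an id of the form 'md5:<hex digest>-<taxonomy id>'."""
    digest = hashlib.md5(sequence.encode()).hexdigest()
    return f'md5:{digest}-{taxonomy_id}'

# Same sequence, two taxonomy ids: the hash part matches, the full ids differ.
id_a = protein_id('RPPGFSPFR', '9606')
id_b = protein_id('RPPGFSPFR', '10090')
print(id_a)
print(id_b)
```

Because the hash depends only on the sequence, identical sequences from different organisms would collide without the taxonomy suffix. MD5 serves here as a compact content fingerprint, not as a security measure.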

In [38]:
unp['accession'] = 'uniprot:' + unp['accession']
unp['taxonomyId'] = 'taxonomy:' + unp['taxonomyId']

In [39]:
unp.query("accession == 'uniprot:P01042'")

Unnamed: 0,accession,entryName,Sequence,Length,Protein names,Status,taxonomyId,Chain,Peptide,Gene ontology (biological process),reviewed,synonymes,Features,type,start,end,name,proId,sequence,fullLength,id
37974,uniprot:P01042,KNG1_HUMAN,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,644,Kininogen-1 (Alpha-2-thiol proteinase inhibito...,reviewed,taxonomy:9606,"CHAIN 19..644; /note=""Kininogen-1""; /id=""PRO...","PEPTIDE 376..389; /note=""T-kinin""; /id=""PRO_...",antimicrobial humoral immune response mediated...,True,Kininogen-1;Alpha-2-thiol proteinase inhibitor...,"[PROTEIN, 1, 644, Kininogen-1 , ]",PROTEIN,1,644,Kininogen-1,,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,True,md5:693c7762bf152c58e00ff05e19347899-9606
37974,uniprot:P01042,KNG1_HUMAN,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,644,Kininogen-1 (Alpha-2-thiol proteinase inhibito...,reviewed,taxonomy:9606,"CHAIN 19..644; /note=""Kininogen-1""; /id=""PRO...","PEPTIDE 376..389; /note=""T-kinin""; /id=""PRO_...",antimicrobial humoral immune response mediated...,True,Kininogen-1,"[CHAIN, 19, 644, Kininogen-1, uniprot.chain:PR...",CHAIN,19,644,Kininogen-1,uniprot.chain:PRO_0000006685,QESQSEEIDCNDKDLFKAVDAALKKYNSQNQSNNQFVLYRITEATK...,False,md5:7fce5e096d222db791e61728783862ef-9606
37974,uniprot:P01042,KNG1_HUMAN,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,644,Kininogen-1 (Alpha-2-thiol proteinase inhibito...,reviewed,taxonomy:9606,"CHAIN 19..644; /note=""Kininogen-1""; /id=""PRO...","PEPTIDE 376..389; /note=""T-kinin""; /id=""PRO_...",antimicrobial humoral immune response mediated...,True,Kininogen-1 heavy chain,"[CHAIN, 19, 380, Kininogen-1 heavy chain, unip...",CHAIN,19,380,Kininogen-1 heavy chain,uniprot.chain:PRO_0000006686,QESQSEEIDCNDKDLFKAVDAALKKYNSQNQSNNQFVLYRITEATK...,False,md5:65df10fd1e958993a8df7cbc9eeaef49-9606
37974,uniprot:P01042,KNG1_HUMAN,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,644,Kininogen-1 (Alpha-2-thiol proteinase inhibito...,reviewed,taxonomy:9606,"CHAIN 19..644; /note=""Kininogen-1""; /id=""PRO...","PEPTIDE 376..389; /note=""T-kinin""; /id=""PRO_...",antimicrobial humoral immune response mediated...,True,Kininogen-1 light chain,"[CHAIN, 390, 644, Kininogen-1 light chain, uni...",CHAIN,390,644,Kininogen-1 light chain,uniprot.chain:PRO_0000006689,SSRIGEIKEETTVSPPHTSMAPAQDEERDSGKEQGHTRRHDWGHEK...,False,md5:918354eb803d70a0af51f56240aa6918-9606
37974,uniprot:P01042,KNG1_HUMAN,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,644,Kininogen-1 (Alpha-2-thiol proteinase inhibito...,reviewed,taxonomy:9606,"CHAIN 19..644; /note=""Kininogen-1""; /id=""PRO...","PEPTIDE 376..389; /note=""T-kinin""; /id=""PRO_...",antimicrobial humoral immune response mediated...,True,T-kinin,"[PEPTIDE, 376, 389, T-kinin, uniprot.chain:PRO...",PEPTIDE,376,389,T-kinin,uniprot.chain:PRO_0000372485,ISLMKRPPGFSPFR,False,md5:7c3750d37b3f3520024958fdba8c0d46-9606
37974,uniprot:P01042,KNG1_HUMAN,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,644,Kininogen-1 (Alpha-2-thiol proteinase inhibito...,reviewed,taxonomy:9606,"CHAIN 19..644; /note=""Kininogen-1""; /id=""PRO...","PEPTIDE 376..389; /note=""T-kinin""; /id=""PRO_...",antimicrobial humoral immune response mediated...,True,Lysyl-bradykinin,"[PEPTIDE, 380, 389, Lysyl-bradykinin, uniprot....",PEPTIDE,380,389,Lysyl-bradykinin,uniprot.chain:PRO_0000006687,KRPPGFSPFR,False,md5:33b2d4498a0558b6ae786d4a6d4620cd-9606
37974,uniprot:P01042,KNG1_HUMAN,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,644,Kininogen-1 (Alpha-2-thiol proteinase inhibito...,reviewed,taxonomy:9606,"CHAIN 19..644; /note=""Kininogen-1""; /id=""PRO...","PEPTIDE 376..389; /note=""T-kinin""; /id=""PRO_...",antimicrobial humoral immune response mediated...,True,Bradykinin,"[PEPTIDE, 381, 389, Bradykinin, uniprot.chain:...",PEPTIDE,381,389,Bradykinin,uniprot.chain:PRO_0000006688,RPPGFSPFR,False,md5:c5a9e54cc23314d0f69ea9cca09ee617-9606
37974,uniprot:P01042,KNG1_HUMAN,MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNS...,644,Kininogen-1 (Alpha-2-thiol proteinase inhibito...,reviewed,taxonomy:9606,"CHAIN 19..644; /note=""Kininogen-1""; /id=""PRO...","PEPTIDE 376..389; /note=""T-kinin""; /id=""PRO_...",antimicrobial humoral immune response mediated...,True,Low molecular weight growth-promoting factor,"[PEPTIDE, 431, 434, Low molecular weight growt...",PEPTIDE,431,434,Low molecular weight growth-promoting factor,uniprot.chain:PRO_0000006690,WGHE,False,md5:fd9ee00f93cbe5c7daef3a211f113b63-9606


### Save proteins

In [43]:
columns = ['id', 'name', 'synonymes', 'accession', 'entryName', 'proId', 'taxonomyId', 'sequence',
           'start', 'end', 'fullLength', 'reviewed']
unp.to_csv(NEO4J_IMPORT / '01a-UniProtProtein.csv', columns=columns, index=False)

In [44]:
unp.head()

Unnamed: 0,accession,entryName,Sequence,Length,Protein names,Status,taxonomyId,Chain,Peptide,Gene ontology (biological process),reviewed,synonymes,Features,type,start,end,name,proId,sequence,fullLength,id
0,uniprot:P0DTC5,VME1_SARS2,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,222,Membrane protein (M) (E1 glycoprotein) (Matrix...,reviewed,taxonomy:2697049,"CHAIN 1..222; /note=""Membrane protein""; /id=...",,mitigation of host immune response by virus [G...,True,Membrane protein;M;E1 glycoprotein;Matrix glyc...,"[CHAIN, 1, 222, Membrane protein, uniprot.chai...",CHAIN,1,222,Membrane protein,uniprot.chain:PRO_0000449652,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,True,md5:1cd6abff79ad3633e17582eb0e576539-2697049
1,uniprot:P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,taxonomy:2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Replicase polyprotein 1a;pp1a;ORF1a polyprotein,"[CHAIN, 1, 4405, Replicase polyprotein 1a, uni...",CHAIN,1,4405,Replicase polyprotein 1a,uniprot.chain:PRO_0000449634,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,True,md5:e781b58591b8dbdd15f84dcbdec82105-2697049
1,uniprot:P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,taxonomy:2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Host translation inhibitor nsp1,"[CHAIN, 1, 180, Host translation inhibitor nsp...",CHAIN,1,180,Host translation inhibitor nsp1,uniprot.chain:PRO_0000449635,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,False,md5:5c2c364f44079728c451280435c4236a-2697049
1,uniprot:P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,taxonomy:2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Non-structural protein 2,"[CHAIN, 181, 818, Non-structural protein 2, un...",CHAIN,181,818,Non-structural protein 2,uniprot.chain:PRO_0000449636,AYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKR...,False,md5:073edb2349ddcd9a72ecd9f5c1dccdc4-2697049
1,uniprot:P0DTC1,R1A_SARS2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,4405,Replicase polyprotein 1a (pp1a) (ORF1a polypro...,reviewed,taxonomy:2697049,"CHAIN 1..4405; /note=""Replicase polyprotein 1...",,induction by virus of catabolism of host mRNA ...,True,Non-structural protein 3,"[CHAIN, 819, 2763, Non-structural protein 3, u...",CHAIN,819,2763,Non-structural protein 3,uniprot.chain:PRO_0000449637,APTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEKCSAYTVEL...,False,md5:73935ca55d0ab6130627210ef6743c39-2697049
