## Protein-Protein Interaction Databases

Extract PPI data from STRING (API courtesy of Alexander Pelletier. I (Dylan) modified it a little bit (mostly just reduced it down to the basics.)

NOTE You have to run the nodes file first, do this, then run the nodes file again

In [1]:
import pandas as pd
import csv
import json
import numpy as np
import urllib.parse
import urllib.request
import requests
from biomed_apis import *

In [2]:
protein_list = list(set(pd.read_csv('output/nodes/nodes_proteins.csv')['Protein (UniProt)']))
protein_set = set()
for protein_ids in protein_list:
    
    # Split each "protein" into multiple proteins if there are multiple
    protein_ids = protein_ids.split(', ')
    
    for protein_id in protein_ids:
        
        # Remove "UniProt:" prefix
        if 'UniProt:' in protein_id:
            protein_id = protein_id.split('UniProt:')[1]

        # Don't add numbers (genes), add UniProt IDs
        if not protein_id[0:].isnumeric():
            protein_set.add(protein_id)
len(protein_set)

21971

## Align UniProt-STRING 

In [3]:
job_id = submit_id_mapping_UniProtAPI(
                  from_db = 'UniProtKB_AC-ID',
                  to_db = 'STRING',
                  ids = protein_set)

if check_id_mapping_results_ready_UniProtAPI(job_id):
    link = get_id_mapping_results_link_UniProtAPI(job_id)
    results = get_id_mapping_results_search_UniProtAPI(link)

Job still running. Retrying in 3s
Job still running. Retrying in 3s
Job still running. Retrying in 3s
Job still running. Retrying in 3s
Job still running. Retrying in 3s
Fetched: 19243 / 19243

In [4]:
stringISuniprot, uniprotISstring = dict(), dict()

for entry in results['results']:
    string_id = entry['to']
    uniprot_id = entry['from']
    
    if '9606' in string_id:
        uniprotISstring[uniprot_id] = string_id
        stringISuniprot[string_id] = uniprot_id

print(len(stringISuniprot),'/',len(protein_set), 'UniProt proteins aligned with STRING')

18809 / 21971 UniProt proteins aligned with STRING


## STRING PPI

In [5]:
string_ids = list(stringISuniprot.keys())

string_ppi_df, counts1 = k_hop_interactors_STRINGAPI(string_ids, \
                                          k=5, \
                                          score_thresh = 0.70,\
                                          debug = True, \
                                          return_counts = True)
string_ppi_df.tail()

19074 total	265 new proteins added for 1 hop
19074 total	0 new proteins added for 2 hop
No results!!
No results!!
No results!!
19074 total proteins


Unnamed: 0,query_ensp,query_name,partner_ensp,partner_name,combined_score
498203,9606.ENSP00000466718,LSM12,9606.ENSP00000252622,LSM7,0.778
498204,9606.ENSP00000466718,LSM12,9606.ENSP00000313007,PABPC1,0.767
498205,9606.ENSP00000466718,LSM12,9606.ENSP00000468025,TMEM101,0.73
498206,9606.ENSP00000466718,LSM12,9606.ENSP00000410758,LSM5,0.73
498207,9606.ENSP00000480946,SETSIP,9606.ENSP00000482245,DACH1,0.716


## Align STRING-UniProt IDs

In [13]:
ppi_string_ids = set(string_ppi_df['query_ensp']).union(set(string_ppi_df['partner_ensp']))

### Select the PPI's STRING IDs that still need UniProt IDs
string_ids2 = set()
for string_id in ppi_string_ids:
    if string_id not in stringISuniprot.keys():
        string_ids2.add(string_id)
        
string_ids2.add(list(ppi_string_ids)[0])

In [14]:
### NOT WORKING, BUT WE ALREADY HAVE STRING IS UNIPROT
# HOWEVER THEY AREN'T ANY NEW PROTEINS
#
'''UniProt API: Aligning STRING to UniProt'''
job_id = submit_id_mapping_UniProtAPI(
                  from_db = 'STRING',
                  to_db = 'UniProtKB-Swiss-Prot', 
                  ids = list(string_ids2))

if check_id_mapping_results_ready_UniProtAPI(job_id):
    link = get_id_mapping_results_link_UniProtAPI(job_id)
    results = get_id_mapping_results_search_UniProtAPI(link)
    
for u2s in results['results']:
    uniprot = u2s['from']
    string  = u2s['to']
    
    if '9606' in string:
        uniprotISstring[uniprot] = string
        stringISuniprot[string] = uniprot
    
print('If 1/1, that means there were no new STRING-is-UniProt found')

Job still running. Retrying in 3s
If 1/1, that means there were no new STRING-is-UniProt found


In [7]:
print(len(stringISuniprot),'/',len(protein_set), 'UniProt proteins aligned with STRING')

0 / 21971 UniProt proteins aligned with STRING


In [None]:
18809 / 21971 UniProt proteins aligned with STRING


In [15]:
with open('output/protein2protein/edges_protein2protein_string.csv','w') as fout:
    writer = csv.writer(fout)
    writer.writerow(['Protein (UniProt)','Protein (UniProt)','Relationship','Weight'])
    string2string_ppi = dict()
    for i in range(0,len(string_ppi_df)):
        try:
            protein1 = stringISuniprot[string_ppi_df['query_ensp'].iloc[i]]
            protein2 = stringISuniprot[string_ppi_df['partner_ensp'].iloc[i]]
            score = str(string_ppi_df['combined_score'].iloc[i])
            writer.writerow(['UniProt:'+protein1, 'UniProt:'+protein2, '-ppi-',score])
            string2string_ppi.setdefault(protein1,[]).append(protein2)
        except:
            continue
            
    string_ppi = pd.read_csv('output/protein2protein/edges_protein2protein_string.csv')
string_ppi = string_ppi[~string_ppi.apply(frozenset, axis=1).duplicated()] # remove duplicates
string_ppi.to_csv('output/edges/edges_protein-INTERACTS-protein_string.csv')
string_ppi.tail()

Unnamed: 0,Protein (UniProt),Protein (UniProt).1,Relationship,Weight
491744,UniProt:Q96D46,UniProt:P42766,-ppi-,0.963
491745,UniProt:Q96D46,UniProt:Q9Y2X3,-ppi-,0.956
491746,UniProt:Q96D46,UniProt:P62424,-ppi-,0.956
491747,UniProt:Q96D46,UniProt:Q2NL82,-ppi-,0.954
491748,UniProt:Q96D46,UniProt:P62888,-ppi-,0.951


In [132]:
rels = []
for prot1,prot2 in string2string_ppi.items():
    rels.append(len(prot2))
print(sum(rels), 'PPI relationships (includes 2x duplicates)')

492430 PPI relationships (includes 2x duplicates)


### Reactome

In [10]:
# Get PPI from Reactome (includes mostly UniProt and some Chebi and Reactome)
with open('output/protein2protein/edges_all_protein2protein_reactome.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Protein (UniProt)','Protein (UniProt)', 'Relationship'])
    reactome_ppi = dict()
    not_a_protein = set()
    count = 0; rels = dict()
    with open('input/reactome_ppi_tsv.txt') as f:
        for line in f:
            count += 1
            if(count == 1):
                continue
            line = line.split('\t')
            prot1 = line[0].split(':')[1]
            prot2 = line[3].split(':')[1]
            rel = line[6]
            rels[rel] = rels.get(rel,0) + 1
            
            # Check if protein is not a gene
            try: 
                prot1 = int(prot1)
            except:
                try: prot2 = int(prot2)
                except: proteins_are_good = ''
                    
            if type(prot1) == str and type(prot2) == str and 'R-' not in prot1 and 'R-' not in prot2:
                writer.writerow([prot1, prot2, rel])
                reactome_ppi.setdefault(prot1,[]).append(prot2)
                reactome_ppi.setdefault(prot2,[]).append(prot1)
            else:
                not_a_protein.add(prot1)
                not_a_protein.add(prot2)
ppis = list()
for vs in reactome_ppi.values():
    for v in vs:
        ppis.append(v)
print('Proteins involved in reactions', len(reactome_ppi))
print('Protein-Protein Interactions:',len(ppis)/2)
rels

Proteins involved in reactions 5588
Protein-Protein Interactions: 69003.0


{'physical association': 91726,
 'enzymatic reaction': 7328,
 'oxidoreductase activity electron transfer reaction': 397,
 'dephosphorylation reaction': 603,
 'glycosylation reaction': 96,
 'phospholipase reaction': 142,
 'isomerase reaction': 86,
 'cleavage reaction': 1440,
 'decarboxylation reaction': 19,
 'deacetylation reaction': 18,
 'carboxylation reaction': 16,
 'acetylation reaction': 94,
 'nucleoside triphosphatase reaction': 52,
 'deneddylation reaction': 8,
 'deubiquitination reaction': 24,
 'gtpase reaction': 48,
 'atpase reaction': 6,
 'demethylation reaction': 20,
 'sulfurtransfer reaction': 5,
 'ubiquitination': 1,
 'amidation reaction': 9,
 'de-ADP-ribosylation reaction': 1,
 'phosphopantetheinylation': 1,
 'tyrosinylation': 2}

In [11]:
for np in not_a_protein.copy():
    if 'R-' not in str(np) and type(np) == str:
        not_a_protein.remove(np)
for np in list(set(not_a_protein)):
    print(np)

R-HSA-5649865
R-HSA-5633405
R-HSA-2393991
R-HSA-9624350
R-ALL-9611838
16393
R-ALL-983326
R-ALL-2454186
32796
32797
16412
R-HSA-5617863
8228
16422
16424
90152
16427
32814
R-ALL-68469
R-HSA-6806576
R-HSA-379750
90171
90174
16450
R-ALL-429974
R-HSA-5683890
57427
16469
R-HSA-8870704
16474
R-HSA-5096485
16480
R-HSA-9654776
R-HSA-379788
R-ALL-9679750
R-HSA-2888961
16494
57453
16500
R-ALL-2173769
16516
R-ALL-164585
16523
16525
16530
R-HSA-8938501
151
R-HSA-9613637
R-HSA-6806585
R-ALL-141678
16551
16556
24757
R-HSA-9708821
R-ALL-9695994
57538
16580
57540
32966
R-MMU-9624653
16584
R-HSA-8937042
R-ALL-167913
R-HSA-8954172
R-HSA-9653748
16595
R-HSA-9038547
57566
57567
16608
16610
R-HSA-3700977
16618
180460
R-ALL-70986
R-HSA-379727
R-HSA-2127254
16633
33019
R-HSA-2106587
57604
R-ALL-450597
R-HSA-2466389
16650
R-HSA-1911477
R-HSA-6797760
R-HSA-6806606
16670
R-HSA-6801206
R-ALL-5687642
57634
R-ALL-5368267
57642
R-HSA-9018329
R-HSA-6784670
R-ALL-1227674
57655
R-HSA-6803933
R-HSA-9006123
16704
57667
R

R-HSA-6791347
15477
15478
15479
R-ALL-901089
R-ALL-8851032
15487
15488
15494
R-HSA-5649928
R-HSA-6804192
15517
R-HSA-5660406
15519
15521
R-HSA-5641210
15524
15525
15531
R-ALL-9737774
15532
15533
64685
64689
R-HSA-6798348
R-HSA-6785425
15544
15547
31932
15551
15552
15553
15554
15555
R-HSA-8877337
R-HSA-6782418
R-HSA-8935722
R-COV-9685518
138490
15611
31998
R-HSA-9690376
R-HSA-8939051
R-ALL-927796
15627
15628
15632
15635
15637
15638
48408
15641
15647
15650
15651
R-HSA-2466373
R-ALL-9678745
R-ALL-9678687
15681
15682
138563
R-HSA-5617393
R-HSA-9613630
15694
R-HSA-379742
15698
R-HSA-4518558
R-HSA-72323
R-ALL-5637956
R-FLU-193303
R-ALL-9611741
138601
138602
15725
32111
15729
R-ALL-9737282
138612
138613
R-ALL-111345
R-ALL-9657530
R-ALL-75085
15746
R-HSA-5629151
15750
R-HSA-8937029
R-HSA-156798
15756
R-HAL-8982085
138640
138641
138642
15760
R-HSA-9629443
15765
138646
138647
15768
15769
R-ALL-9732875
R-ALL-9671491
R-ALL-2077408
40356
15784
R-ALL-9656757
15786
15793
R-ALL-9680829
R-HSA-110184
R-

In [12]:
reactome_ppi = pd.read_csv('output/protein2protein/edges_all_protein2protein_reactome.csv')

### Merged
try to merge these later. Reactome doesn't add much though

In [16]:
reactome_ppi.append(string_ppi)

Unnamed: 0,Protein (UniProt),Protein (UniProt).1,Relationship,Weight
0,Q9Y287,Q9Y287,physical association,
1,P37840,P37840,physical association,
2,P0DJI8,P0DJI8,physical association,
3,P06727,P06727,physical association,
4,P01160,P01160,physical association,
...,...,...,...,...
491744,UniProt:Q96D46,UniProt:P42766,-ppi-,0.963
491745,UniProt:Q96D46,UniProt:Q9Y2X3,-ppi-,0.956
491746,UniProt:Q96D46,UniProt:P62424,-ppi-,0.956
491747,UniProt:Q96D46,UniProt:Q2NL82,-ppi-,0.954


In [19]:
string_ppi.drop_duplicates()

Unnamed: 0,Protein (UniProt),Protein (UniProt).1,Relationship,Weight
0,UniProt:Q8NDX1,UniProt:Q9Y2T2,-ppi-,0.741
1,UniProt:Q8NDX1,UniProt:P62330,-ppi-,0.731
2,UniProt:Q8NDX1,UniProt:Q9Y2R4,-ppi-,0.706
3,UniProt:P04264,UniProt:Q07021,-ppi-,0.983
4,UniProt:P04264,UniProt:P13645,-ppi-,0.979
...,...,...,...,...
491744,UniProt:Q96D46,UniProt:P42766,-ppi-,0.963
491745,UniProt:Q96D46,UniProt:Q9Y2X3,-ppi-,0.956
491746,UniProt:Q96D46,UniProt:P62424,-ppi-,0.956
491747,UniProt:Q96D46,UniProt:Q2NL82,-ppi-,0.954


In [21]:
string_ppi = string_ppi[~string_ppi.apply(frozenset, axis=1).duplicated()]

Unnamed: 0,Protein (UniProt),Protein (UniProt).1,Relationship,Weight
0,UniProt:Q8NDX1,UniProt:Q9Y2T2,-ppi-,0.741
1,UniProt:Q8NDX1,UniProt:P62330,-ppi-,0.731
2,UniProt:Q8NDX1,UniProt:Q9Y2R4,-ppi-,0.706
3,UniProt:P04264,UniProt:Q07021,-ppi-,0.983
4,UniProt:P04264,UniProt:P13645,-ppi-,0.979
...,...,...,...,...
489526,UniProt:Q9H9T3,UniProt:Q9Y3T9,-ppi-,0.845
490290,UniProt:Q9Y3T9,UniProt:Q96D46,-ppi-,0.889
490354,UniProt:Q9Y3T9,UniProt:Q92979,-ppi-,0.789
490392,UniProt:Q9Y5K2,UniProt:P30411,-ppi-,0.766


In [23]:
string_ppi.to_csv('output/edges to use/Protein_(UniProt)_2_Protein_(UniProt).csv', index=False)

## Protein to text description
move this to the names file

In [37]:
# For each ID, get the protein names (one protein at a time)
IDs = protein_set
SIZE = str(500)
id2description = dict()
bad_ids = list()

for index, ID in enumerate(IDs):
    print(index+1,'/',len(IDs),end='\r')
    names = list()
    
    # Get data via UniProt API 
    if type(ID) != str:
        continue
    
    url = 'https://rest.uniprot.org/uniprotkb/search?fields=accession,protein_name,cc_function&format=json&query='\
           +ID+'&size='+SIZE
    res = requests.get(url).json()
    try:
        r = res['results'][0]['proteinDescription']
    except:
        bad_ids.append(ID)
        continue
        
    try:
        r = res['results'][0]
        uniprot_id = r['primaryAccession']
        name = r['proteinDescription']['recommendedName']['fullName']['value']
        function_description = r['comments'][0]['texts'][0]['value']
    except: 
        continue
    
    id2description[uniprot_id] = name + ' ' +function_description


5899 / 21971

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



14637 / 21971

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



21971 / 21971

In [39]:
json.dump(id2description, open('output/node2text/protein2text.json','w'))

In [38]:
id2description['P00156']

'Cytochrome b Component of the ubiquinol-cytochrome c reductase complex (complex III or cytochrome b-c1 complex) that is part of the mitochondrial respiratory chain. The b-c1 complex mediates electron transfer from ubiquinol to cytochrome c. Contributes to the generation of a proton gradient across the mitochondrial membrane that is then used for ATP synthesis'