# Protein-Protein Interaction Data
**[Work in progress]**

This notebook downloads and standardizes virus-host protein-protein interaction data from IntAct for ingestion into the Knowledge Graph.

Data source: [IntAct](https://www.ebi.ac.uk/intact/query/pubid:IM-27814)

Authors: Kaushik Ganapathy, Eric Yu, Peter Rose (krganapa@ucsd.edu, ery010@ucsd.edu, pwrose@ucsd.edu)

In [1]:
import os
import re
import hashlib 

import pandas as pd
import numpy as np

from pathlib import Path
from Bio import SeqIO

pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [2]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
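
If `NEO4J_IMPORT` is not set, `os.getenv` returns `None` and `Path(None)` raises a `TypeError` that does not name the missing variable. The following is a minimal sketch of a more explicit check; the helper name `resolve_import_dir` and its error messages are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_import_dir(var_name="NEO4J_IMPORT"):
    """Return the directory named by an environment variable as a Path.

    Hypothetical helper: raises a descriptive error when the variable
    is unset or does not point to an existing directory.
    """
    value = os.getenv(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    path = Path(value)
    if not path.is_dir():
        raise NotADirectoryError(f"{var_name} is not a directory: {path}")
    return path
```

With this helper, cell 2 could read `NEO4J_IMPORT = resolve_import_dir()`.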


https://www.uniprot.org/help/uniprotkb_column_names
https://www.uniprot.org/uniprot/P0DTD1#PRO_0000449630

### Get list of organisms to include in the Knowledge Graph

In [3]:
genomes = pd.read_csv("../../reference_data/Genome.csv", dtype=str)

In [4]:
genomes['taxonomy'] = genomes['taxonomyId'].apply(lambda x: x.split(':')[1])

In [5]:
taxonomy_ids = set(genomes['taxonomy'].unique())

In [6]:
taxonomy_ids

{'11137', '1263720', '2697049', '694009', '9606'}

### Retrieve interaction data from IntAct

In [7]:
urls = [f'https://www.ebi.ac.uk/intact/export?format=mitab_25&query=taxid%3A{taxon}&negative=false&spoke=false&ontology=false&sort=intact-miscore&asc=false'
        for taxon in taxonomy_ids]
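
The long URL above can be easier to read and modify when it is assembled from its query parameters. This sketch uses only the parameters that already appear in the URL; the function name `intact_export_url` is an assumption made for illustration.

```python
from urllib.parse import urlencode

INTACT_EXPORT = "https://www.ebi.ac.uk/intact/export"


def intact_export_url(taxon):
    """Build the MITAB 2.5 export URL for one NCBI taxonomy id."""
    params = {
        "format": "mitab_25",
        "query": f"taxid:{taxon}",  # urlencode escapes ':' as %3A
        "negative": "false",
        "spoke": "false",
        "ontology": "false",
        "sort": "intact-miscore",
        "asc": "false",
    }
    return f"{INTACT_EXPORT}?{urlencode(params)}"


# Two of the taxonomy ids listed above, used here as sample input
urls = [intact_export_url(taxon) for taxon in sorted({"9606", "2697049"})]
```

Because `urlencode` preserves the dictionary order, the result should be character-for-character identical to the f-string version.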

In [8]:
data = pd.concat((pd.read_csv(url, sep='\t', dtype='str') for url in urls))

In [9]:
print('Number of interactions:', data.shape[0])

Number of interactions: 694543


In [10]:
data.head(10)

Unnamed: 0,#ID(s) interactor A,ID(s) interactor B,Alt. ID(s) interactor A,Alt. ID(s) interactor B,Alias(es) interactor A,Alias(es) interactor B,Interaction detection method(s),Publication 1st author(s),Publication Identifier(s),Taxid interactor A,Taxid interactor B,Interaction type(s),Source database(s),Interaction identifier(s),Confidence value(s)
0,uniprotkb:P15423,uniprotkb:P15423,intact:EBI-25474626|uniprotkb:P89342|uniprotkb...,intact:EBI-25474626|uniprotkb:P89342|uniprotkb...,psi-mi:spike_cvh22(display_long)|uniprotkb:E2(...,psi-mi:spike_cvh22(display_long)|uniprotkb:E2(...,"psi-mi:""MI:0410""(3D electron microscopy)",Li et al. (2019),pubmed:31650956|imex:IM-27809,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0915""(physical association)","psi-mi:""MI:0469""(IntAct)",intact:EBI-25474657|pdbe:6u7h|emdb:EMD-20668|i...,intact-miscore:0.32
1,uniprotkb:P0C6X1-PRO_0000037301,uniprotkb:P0C6X1-PRO_0000037301,intact:EBI-25708564,intact:EBI-25708564,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,"psi-mi:""MI:0114""(x-ray crystallography)",Ponnusamy et al. (2008),pubmed:18694760|imex:IM-28148,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0469""(IntAct)",intact:EBI-25708571|pdbe:2J97|imex:IM-28148-1,intact-miscore:0.61
2,uniprotkb:P0C6X1-PRO_0000037301,uniprotkb:P0C6X1-PRO_0000037301,intact:EBI-25708564,intact:EBI-25708564,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,"psi-mi:""MI:0114""(x-ray crystallography)",Ponnusamy et al. (2008),pubmed:18694760|imex:IM-28148,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0469""(IntAct)",intact:EBI-25749642|pdbe:2J98|imex:IM-28148-10,intact-miscore:0.61
3,uniprotkb:P0C6X1-PRO_0000037301,uniprotkb:P0C6X1-PRO_0000037301,intact:EBI-25708564,intact:EBI-25708564,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,"psi-mi:""MI:0030""(cross-linking study)",Ponnusamy et al. (2008),pubmed:18694760|imex:IM-28148,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0915""(physical association)","psi-mi:""MI:0469""(IntAct)",intact:EBI-25711856|imex:IM-28148-3,intact-miscore:0.61
4,uniprotkb:P0C6X1-PRO_0000037301,uniprotkb:P0C6X1-PRO_0000037301,intact:EBI-25708564,intact:EBI-25708564,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,"psi-mi:""MI:0030""(cross-linking study)",Ponnusamy et al. (2008),pubmed:18694760|imex:IM-28148,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0469""(IntAct)",intact:EBI-25712141|imex:IM-28148-5,intact-miscore:0.61
5,uniprotkb:P0C6X1-PRO_0000037301,uniprotkb:P0C6X1-PRO_0000037301,intact:EBI-25708564,intact:EBI-25708564,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,"psi-mi:""MI:0030""(cross-linking study)",Ponnusamy et al. (2008),pubmed:18694760|imex:IM-28148,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0915""(physical association)","psi-mi:""MI:0469""(IntAct)",intact:EBI-25748324|imex:IM-28148-8,intact-miscore:0.61
6,uniprotkb:P0C6X1-PRO_0000037300,uniprotkb:P0C6X1-PRO_0000037299,intact:EBI-26366585,intact:EBI-26366570,psi-mi:p0c6x1-pro_0000037300(display_long)|uni...,psi-mi:p0c6x1-pro_0000037299(display_long)|uni...,"psi-mi:""MI:0069""(mass spectrometry studies of ...",Krichel et al. (2020),doi:10.1101/2020.09.30.320762|imex:IM-28416|pu...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0469""(IntAct)",intact:EBI-26366564|imex:IM-28416-29,intact-miscore:0.56
7,uniprotkb:P0C6X1-PRO_0000037300,uniprotkb:P0C6X1-PRO_0000037299,intact:EBI-26366585,intact:EBI-26366570,psi-mi:p0c6x1-pro_0000037300(display_long)|uni...,psi-mi:p0c6x1-pro_0000037299(display_long)|uni...,"psi-mi:""MI:0069""(mass spectrometry studies of ...",Krichel et al. (2020),doi:10.1101/2020.09.30.320762|imex:IM-28416|pu...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0469""(IntAct)",intact:EBI-26366593|imex:IM-28416-43,intact-miscore:0.56
8,uniprotkb:P0C6X1-PRO_0000037300,uniprotkb:P0C6X1-PRO_0000037299,intact:EBI-26366585,intact:EBI-26366570,psi-mi:p0c6x1-pro_0000037300(display_long)|uni...,psi-mi:p0c6x1-pro_0000037299(display_long)|uni...,"psi-mi:""MI:0069""(mass spectrometry studies of ...",Krichel et al. (2020),doi:10.1101/2020.09.30.320762|imex:IM-28416|pu...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0469""(IntAct)",intact:EBI-26366601|imex:IM-28416-31,intact-miscore:0.56
9,uniprotkb:P0C6X1-PRO_0000037297,uniprotkb:P0C6X1,intact:EBI-26366728,intact:EBI-25637276|uniprotkb:Q05002|uniprotkb...,psi-mi:p0c6x1-pro_0000037297(display_long)|uni...,psi-mi:r1ab_cvh22(display_long)|uniprotkb:rep(...,"psi-mi:""MI:0435""(protease assay)",Krichel et al. (2020),doi:10.1101/2020.09.30.320762|imex:IM-28416|pu...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0570""(protein cleavage)","psi-mi:""MI:0469""(IntAct)",intact:EBI-26366721|imex:IM-28416-47,intact-miscore:0.44


### Process Data

In [11]:
data.rename(columns={'#ID(s) interactor A': 'interactorA', 'ID(s) interactor B': 'interactorB'}, inplace=True)

#### Extract UniProt accession numbers and UniProt protein ids from the interactor columns

In [12]:
# uniprotkb:P0DTD1-PRO_0000449619 -> P0DTD1-PRO_0000449619
data['id_a'] = data['interactorA'].str.replace('uniprotkb:', '', regex=False)
data['id_b'] = data['interactorB'].str.replace('uniprotkb:', '', regex=False)

# P0DTD1-PRO_0000449619 -> P0DTD1 (UniProt accession number)
data['accession_a'] = data['id_a'].str.split('-PRO', expand=True)[0]
data['accession_b'] = data['id_b'].str.split('-PRO', expand=True)[0]

# Remove isoform id: Q9UBL6-2 -> Q9UBL6
data['accession_a'] = data['accession_a'].str.split('-', expand=True)[0]
data['accession_b'] = data['accession_b'].str.split('-', expand=True)[0]

# Add CURIE prefix "uniprot" (see https://registry.identifiers.org/registry/uniprot)
data['accession_a'] = 'uniprot:' + data['accession_a']
data['accession_b'] = 'uniprot:' + data['accession_b']

# P0DTD1-PRO_0000449619 -> _0000449619 (part after '-PRO'; missing if there is no chain id)
data['pro_id_a'] = data['id_a'].str.split('-PRO', expand=True)[1]
data['pro_id_b'] = data['id_b'].str.split('-PRO', expand=True)[1]

# Add CURIE prefix "uniprot.chain" and restore 'PRO': _0000449619 -> uniprot.chain:PRO_0000449619
# (see https://registry.identifiers.org/registry/uniprot.chain)
data['pro_id_a'] = data['pro_id_a'].str.replace('_', 'uniprot.chain:PRO_', regex=False)
data['pro_id_b'] = data['pro_id_b'].str.replace('_', 'uniprot.chain:PRO_', regex=False)

data.fillna('', inplace=True)
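
The column operations above can be checked on single identifiers. The following plain-Python sketch mirrors the same steps for one string; the function name `split_interactor_id` is a hypothetical helper, and it assumes chain ids always have the form `-PRO_<digits>` as in the examples shown.

```python
def split_interactor_id(raw):
    """Split a MITAB interactor id into (accession CURIE, chain CURIE).

    The chain CURIE is '' when the id carries no -PRO_ suffix.
    """
    identifier = raw.replace("uniprotkb:", "")
    accession, _, chain = identifier.partition("-PRO_")
    accession = accession.split("-")[0]  # drop isoform suffix, e.g. Q9UBL6-2
    chain_curie = f"uniprot.chain:PRO_{chain}" if chain else ""
    return f"uniprot:{accession}", chain_curie


for raw in ("uniprotkb:P0DTD1-PRO_0000449619", "uniprotkb:Q9UBL6-2", "uniprotkb:P15423"):
    print(raw, "->", split_interactor_id(raw))
```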

### Remove duplicates

Create an interaction id that does not depend on the order of the two interactors, so that A-B and B-A map to the same id.

In [13]:
# Sort the two ids so that A-B and B-A yield the same interaction id
data['interaction_id'] = data[['id_a', 'id_b']].apply(lambda x: ''.join(sorted((x['id_a'], x['id_b']))), axis=1)
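
To illustrate why the two ids are put in a fixed order before they are combined, here is a small standalone sketch. The function name `interaction_key` and the `|` separator are assumptions for illustration; the notebook itself concatenates the ids without a separator.

```python
def interaction_key(id_a, id_b, sep="|"):
    """Return the same key for (A, B) and (B, A)."""
    first, second = sorted((id_a, id_b))
    return f"{first}{sep}{second}"


pairs = [
    ("P0C6X1-PRO_0000037300", "P0C6X1-PRO_0000037299"),
    ("P0C6X1-PRO_0000037299", "P0C6X1-PRO_0000037300"),  # same pair, swapped
    ("P15423", "P15423"),  # self-interaction
]
unique_keys = {interaction_key(a, b) for a, b in pairs}
print(len(pairs), "pairs ->", len(unique_keys), "unique keys")
```

A separator that cannot occur inside an id rules out the unlikely case where two different pairs concatenate to the same string.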

In [14]:
data.shape

(694543, 22)

In [15]:
data.drop_duplicates(subset=['interaction_id'], inplace=True)

In [16]:
data.shape

(378593, 22)

In [17]:
data.head()

Unnamed: 0,interactorA,interactorB,Alt. ID(s) interactor A,Alt. ID(s) interactor B,Alias(es) interactor A,Alias(es) interactor B,Interaction detection method(s),Publication 1st author(s),Publication Identifier(s),Taxid interactor A,Taxid interactor B,Interaction type(s),Source database(s),Interaction identifier(s),Confidence value(s),id_a,id_b,accession_a,accession_b,pro_id_a,pro_id_b,interaction_id
0,uniprotkb:P15423,uniprotkb:P15423,intact:EBI-25474626|uniprotkb:P89342|uniprotkb...,intact:EBI-25474626|uniprotkb:P89342|uniprotkb...,psi-mi:spike_cvh22(display_long)|uniprotkb:E2(...,psi-mi:spike_cvh22(display_long)|uniprotkb:E2(...,"psi-mi:""MI:0410""(3D electron microscopy)",Li et al. (2019),pubmed:31650956|imex:IM-27809,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0915""(physical association)","psi-mi:""MI:0469""(IntAct)",intact:EBI-25474657|pdbe:6u7h|emdb:EMD-20668|i...,intact-miscore:0.32,P15423,P15423,uniprot:P15423,uniprot:P15423,,,P15423P15423
1,uniprotkb:P0C6X1-PRO_0000037301,uniprotkb:P0C6X1-PRO_0000037301,intact:EBI-25708564,intact:EBI-25708564,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,psi-mi:p0c6x1-pro_0000037301(display_long)|uni...,"psi-mi:""MI:0114""(x-ray crystallography)",Ponnusamy et al. (2008),pubmed:18694760|imex:IM-28148,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0469""(IntAct)",intact:EBI-25708571|pdbe:2J97|imex:IM-28148-1,intact-miscore:0.61,P0C6X1-PRO_0000037301,P0C6X1-PRO_0000037301,uniprot:P0C6X1,uniprot:P0C6X1,uniprot.chain:PRO_0000037301,uniprot.chain:PRO_0000037301,P0C6X1-PRO_0000037301P0C6X1-PRO_0000037301
6,uniprotkb:P0C6X1-PRO_0000037300,uniprotkb:P0C6X1-PRO_0000037299,intact:EBI-26366585,intact:EBI-26366570,psi-mi:p0c6x1-pro_0000037300(display_long)|uni...,psi-mi:p0c6x1-pro_0000037299(display_long)|uni...,"psi-mi:""MI:0069""(mass spectrometry studies of ...",Krichel et al. (2020),doi:10.1101/2020.09.30.320762|imex:IM-28416|pu...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0469""(IntAct)",intact:EBI-26366564|imex:IM-28416-29,intact-miscore:0.56,P0C6X1-PRO_0000037300,P0C6X1-PRO_0000037299,uniprot:P0C6X1,uniprot:P0C6X1,uniprot.chain:PRO_0000037300,uniprot.chain:PRO_0000037299,P0C6X1-PRO_0000037299P0C6X1-PRO_0000037300
9,uniprotkb:P0C6X1-PRO_0000037297,uniprotkb:P0C6X1,intact:EBI-26366728,intact:EBI-25637276|uniprotkb:Q05002|uniprotkb...,psi-mi:p0c6x1-pro_0000037297(display_long)|uni...,psi-mi:r1ab_cvh22(display_long)|uniprotkb:rep(...,"psi-mi:""MI:0435""(protease assay)",Krichel et al. (2020),doi:10.1101/2020.09.30.320762|imex:IM-28416|pu...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0570""(protein cleavage)","psi-mi:""MI:0469""(IntAct)",intact:EBI-26366721|imex:IM-28416-47,intact-miscore:0.44,P0C6X1-PRO_0000037297,P0C6X1,uniprot:P0C6X1,uniprot:P0C6X1,uniprot.chain:PRO_0000037297,,P0C6X1P0C6X1-PRO_0000037297
11,uniprotkb:P15130,uniprotkb:P15130,intact:EBI-8172439|uniprotkb:Q66175|intact:MIN...,intact:EBI-8172439|uniprotkb:Q66175|intact:MIN...,psi-mi:ncap_cvh22(display_long)|uniprotkb:N(ge...,psi-mi:ncap_cvh22(display_long)|uniprotkb:N(ge...,"psi-mi:""MI:0028""(cosedimentation in solution)",Lo et al. (2013),pubmed:23178926|mint:MINT-8401903|imex:IM-27962,taxid:11137(hcov-229e)|taxid:11137(Human coron...,taxid:11137(hcov-229e)|taxid:11137(Human coron...,"psi-mi:""MI:0407""(direct interaction)","psi-mi:""MI:0471""(MINT)",intact:EBI-8172458|mint:MINT-8401926|imex:IM-2...,intact-miscore:0.62,P15130,P15130,uniprot:P15130,uniprot:P15130,,,P15130P15130


#### Extract pubmed id
Example: imex:IM-27912|pubmed:32275855 -> 32275855

In [18]:
# Capture the digits that follow the 'pubmed:' prefix
pubmed_pattern = re.compile(r'pubmed:(\d+)')

def extract_pubmed_id(s):
    """Return the first PubMed id found in s, or '' if there is none."""
    match = pubmed_pattern.search(s)
    if match is None:
        return ''
    return match.group(1)

In [19]:
data['pubmedId'] = data['Publication Identifier(s)'].apply(extract_pubmed_id)

#### Extract taxonomy id
Example: taxid:9606(human)|taxid:9606(Homo sapiens) -> 9606

In [20]:
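# Capture the digits between 'taxid:' and the opening parenthesis
taxid_pattern = re.compile(r'taxid:(\d+)\(')

def extract_tax_id(s):
    """Return the first taxonomy id found in s, or '' if there is none."""
    match = taxid_pattern.search(s)
    if match is None:
        return ''
    return match.group(1)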
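
The sketch below runs the same kind of pattern on a few sample fields to show the no-match case. The `taxid:-1(in vitro)` value is an assumption about how MITAB encodes non-organism sources; it is included only to show that a negative pseudo id yields an empty string, which the taxonomy filter applied later would then drop.

```python
import re

TAXID_PATTERN = re.compile(r"taxid:(\d+)\(")


def extract_tax_id(s):
    """Return the first numeric taxonomy id in a MITAB taxid field, or ''."""
    match = TAXID_PATTERN.search(s)
    return match.group(1) if match else ""


fields = [
    "taxid:9606(human)|taxid:9606(Homo sapiens)",
    "taxid:-1(in vitro)",  # assumed pseudo id, not taken from the download
    "",                    # empty field
]
results = {field: extract_tax_id(field) for field in fields}
```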

In [21]:
data['taxonomy_id_a'] = data['Taxid interactor A'].apply(extract_tax_id)
data['taxonomy_id_b'] = data['Taxid interactor B'].apply(extract_tax_id)

In [22]:
data = data[['id_a', 'id_b', 'accession_a', 'accession_b', 'pro_id_a', 'pro_id_b', 'taxonomy_id_a', 'taxonomy_id_b', 'pubmedId']]

In [23]:
data.head()

Unnamed: 0,id_a,id_b,accession_a,accession_b,pro_id_a,pro_id_b,taxonomy_id_a,taxonomy_id_b,pubmedId
0,P15423,P15423,uniprot:P15423,uniprot:P15423,,,11137,11137,31650956
1,P0C6X1-PRO_0000037301,P0C6X1-PRO_0000037301,uniprot:P0C6X1,uniprot:P0C6X1,uniprot.chain:PRO_0000037301,uniprot.chain:PRO_0000037301,11137,11137,18694760
6,P0C6X1-PRO_0000037300,P0C6X1-PRO_0000037299,uniprot:P0C6X1,uniprot:P0C6X1,uniprot.chain:PRO_0000037300,uniprot.chain:PRO_0000037299,11137,11137,33024972
9,P0C6X1-PRO_0000037297,P0C6X1,uniprot:P0C6X1,uniprot:P0C6X1,uniprot.chain:PRO_0000037297,,11137,11137,33024972
11,P15130,P15130,uniprot:P15130,uniprot:P15130,,,11137,11137,23178926


#### Restrict data to the set of currently supported taxonomy ids

In [24]:
data = data[data['taxonomy_id_a'].isin(taxonomy_ids) & data['taxonomy_id_b'].isin(taxonomy_ids)]

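Remove rows whose interactor ids are not UniProt accessions. After the `uniprotkb:` prefix has been stripped, ids from other databases still contain a `:` (for example an `intact:` prefix).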

In [25]:
data = data[~(data['id_a'].str.contains(':')) & ~(data['id_b'].str.contains(':'))]

Remove self-interactions (they make graph display too crowded)

In [26]:
data = data[data['accession_a'] != data['accession_b']]
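
The three filters above can be summarized as a single predicate on one row. This is a plain-Python sketch for illustration only: the function name `keep_interaction` is hypothetical and the sample rows are constructed, not taken from the download.

```python
SUPPORTED_TAXA = {"11137", "1263720", "2697049", "694009", "9606"}


def keep_interaction(row, taxa=SUPPORTED_TAXA):
    """Apply the three filters to one row given as a dict."""
    both_supported = row["taxonomy_id_a"] in taxa and row["taxonomy_id_b"] in taxa
    both_uniprot = ":" not in row["id_a"] and ":" not in row["id_b"]
    not_self = row["accession_a"] != row["accession_b"]
    return both_supported and both_uniprot and not_self


rows = [
    # virus-host pair from two supported taxa: kept
    {"id_a": "P0DTC2", "id_b": "Q9BYF1", "accession_a": "uniprot:P0DTC2",
     "accession_b": "uniprot:Q9BYF1", "taxonomy_id_a": "2697049", "taxonomy_id_b": "9606"},
    # two chains of the same polyprotein share one accession: dropped as self-interaction
    {"id_a": "P0C6X1-PRO_0000037300", "id_b": "P0C6X1-PRO_0000037299",
     "accession_a": "uniprot:P0C6X1", "accession_b": "uniprot:P0C6X1",
     "taxonomy_id_a": "11137", "taxonomy_id_b": "11137"},
    # non-UniProt id still carries a database prefix: dropped
    {"id_a": "intact:EBI-25708564", "id_b": "Q9BYF1", "accession_a": "uniprot:intact:EBI",
     "accession_b": "uniprot:Q9BYF1", "taxonomy_id_a": "9606", "taxonomy_id_b": "9606"},
]
kept = [row for row in rows if keep_interaction(row)]
```

Note that the self-interaction filter compares accessions, not chain ids, so interactions between different mature chains of the same polyprotein are removed as well.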

In [27]:
data.shape

(292809, 9)

In [28]:
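# assign() returns a new DataFrame and avoids a SettingWithCopyWarning on the filtered data
data = data.assign(source='IntAct')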

### Save interaction data

In [29]:
# Keep only the columns needed for the Knowledge Graph and drop rows that are now identical
data = data[['accession_a', 'accession_b', 'pro_id_a', 'pro_id_b', 'source', 'pubmedId']].drop_duplicates()
data.to_csv(NEO4J_IMPORT / '01e-ProteinProteinInteraction.csv', index=False)

In [30]:
print('Number of interactions:', data.shape[0])

Number of interactions: 291365


In [31]:
data.sample(5)

Unnamed: 0,accession_a,accession_b,pro_id_a,pro_id_b,source,pubmedId
495103,uniprot:F5H1C8,uniprot:P20674,,,IntAct,30833792
171544,uniprot:Q9UBB9,uniprot:O75127,,,IntAct,32296183
417681,uniprot:P38936,uniprot:Q04724,,,IntAct,21900206
36998,uniprot:P57081,uniprot:Q9UBP6,,,IntAct,26751069
546518,uniprot:Q9H8H2,uniprot:P01859,,,IntAct,28514442
