# Protein-Protein Interaction Data
**[Work in progress]**

This notebook downloads and standardizes viral-host protein data from IntAct and other sources for ingestion into the Knowledge Graph.

Data sources: [IntAct](https://www.ebi.ac.uk/intact/query/pubid:IM-27814), [Sequence Information](https://docs.google.com/spreadsheets/d/1m2SiCxyU_B1f4Ruu0wZafNXu8VQnjmog73bjWCS834A/edit?usp=sharing), [bioRxiv](https://www.biorxiv.org/content/10.1101/2020.03.22.002386v3.full)

Authors: Kaushik Ganapathy, Eric Yu (krganapa@ucsd.edu, ery010@ucsd.edu)

### External Package Imports

In [38]:
import os
import re
import hashlib 

import pandas as pd
import numpy as np

from pathlib import Path

pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [None]:
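# os.environ[...] raises a KeyError naming the variable if NEO4J_HOME is unset;
# Path(os.getenv(...)) would fail with a less helpful TypeError on None.
NEO4J_HOME = Path(os.environ['NEO4J_HOME'])
print(NEO4J_HOME)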

**Data downloaded from IntAct in MI-TAB 2.5 format**

Automatic download is a work in progress; for now the file is read from a local copy.

In [9]:
data = pd.read_csv('../reference_data/intact_data.txt', sep = '\t')

### Data Cleanup

**Dropping unnecessary columns.**

In [10]:
# Columns not needed downstream. 'Interaction detection method(s)' is kept here
# (it still appears in the preview below) and is dropped after the renaming step.
columns_to_drop = [
    'Publication 1st author(s)',
    'Publication Identifier(s)',
    'Taxid interactor A',
    'Taxid interactor B',
    'Interaction type(s)',
    'Source database(s)',
    'Confidence value(s)',
]
# errors='ignore' skips names that are absent instead of raising a KeyError
data = data.drop(columns=columns_to_drop, errors='ignore')

In [11]:
data.head(3)

Unnamed: 0,#ID(s) interactor A,ID(s) interactor B,Alt. ID(s) interactor A,Alt. ID(s) interactor B,Alias(es) interactor A,Alias(es) interactor B,Interaction detection method(s),Interaction identifier(s)
0,uniprotkb:P0DTC4,uniprotkb:Q8IWA5,intact:EBI-25475850|intact:EBI-25475853|intact...,intact:EBI-11722896|uniprotkb:B2RBB1|uniprotkb...,psi-mi:e_wcpv(display_long),psi-mi:ctl2_human(display_long)|uniprotkb:SLC4...,"psi-mi:""MI:0096""(pull down)",intact:EBI-25490454
1,uniprotkb:P0DTC4,uniprotkb:Q86VM9,intact:EBI-25475850|intact:EBI-25475853|intact...,intact:EBI-1045965|uniprotkb:Q96DG4|uniprotkb:...,psi-mi:e_wcpv(display_long),psi-mi:zch18_human(display_long)|uniprotkb:NHN...,"psi-mi:""MI:0096""(pull down)",intact:EBI-25490454
2,uniprotkb:P0DTC4,uniprotkb:Q6UX04,intact:EBI-25475850|intact:EBI-25475853|intact...,intact:EBI-2214108|uniprotkb:O60529|uniprotkb:...,psi-mi:e_wcpv(display_long),psi-mi:cwc27_human(display_long)|uniprotkb:SDC...,"psi-mi:""MI:0096""(pull down)",intact:EBI-25490454
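
One detail of the column-dropping step is worth illustrating: a filter that only keeps columns not named in a list will silently ignore a misspelled name, whereas a strict `DataFrame.drop` raises. The sketch below uses a toy frame with placeholder values (not the real IntAct export) to contrast the two behaviours.

```python
import pandas as pd

# Toy frame standing in for the IntAct export; the values are placeholders.
toy = pd.DataFrame({
    'ID(s) interactor B': ['uniprotkb:Q8IWA5'],
    'Source database(s)': ['placeholder'],
})

wanted = ['Source database(s)', 'Confidence value(s)']

# Strict: raises KeyError because 'Confidence value(s)' is not a column here.
try:
    toy.drop(columns=wanted)
    strict_failed = False
except KeyError:
    strict_failed = True

# Lenient: drops what exists and ignores the rest.
lenient = toy.drop(columns=wanted, errors='ignore')

print(strict_failed, list(lenient.columns))
```

The strict form is a cheap way to catch typos in column names; the lenient form is convenient when the export format varies between downloads.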


**Minor Column Renaming and Validation of Data Sources**

***Cell should print "All set to clean up SARS-COV-2 Column"***

In [12]:
data = data.rename({'#ID(s) interactor A': 'SARS_COV2_Protein_ID'}, axis = 1)
unique_ids = data['SARS_COV2_Protein_ID'].unique()
unique_data_sources_ids = np.unique([id_.split(':')[0] for id_ in unique_ids])
if len(unique_data_sources_ids) == 1 and unique_data_sources_ids[0] == 'uniprotkb':
    print('All set to clean up SARS-COV-2 Column')
else:
    raise ValueError('Unknown Data Sources present. Please check before proceeding')

All set to clean up SARS-COV-2 Column
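
The prefix check above is repeated for the human column further down, so it could be factored into a helper. The following is a minimal sketch in plain Python; `id_sources` and `check_sources` are hypothetical names, and unlike the cell above it only rejects prefixes outside an allowed set rather than requiring an exact match.

```python
def id_sources(ids):
    """Return the set of database prefixes (the text before the first ':')."""
    return {id_.split(':', 1)[0] for id_ in ids}


def check_sources(ids, allowed):
    """Raise ValueError if any identifier uses a prefix outside `allowed`."""
    unknown = id_sources(ids) - set(allowed)
    if unknown:
        raise ValueError(f'Unknown data sources present: {sorted(unknown)}')


# Identifiers taken from the preview above; passes silently.
check_sources(['uniprotkb:P0DTC4', 'intact:EBI-25475912'], {'uniprotkb', 'intact'})
```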


In [13]:
def standardize_names(identifier_id):
    """Rename the 'uniprotkb' prefix to 'uniprot'; return any other identifier unchanged."""
    if 'uniprot' in identifier_id:
        return identifier_id.replace('uniprotkb', 'uniprot')
    return identifier_id  # IntAct and any other identifiers pass through as-is

data['SARS_COV2_Protein_ID'] = data['SARS_COV2_Protein_ID'].apply(standardize_names)

In [14]:
handled = {'uniprotkb', 'intact'}
data = data.rename({'ID(s) interactor B': 'Human_Protein_ID'}, axis = 1)
unique_ids = data['Human_Protein_ID'].unique()
unique_data_sources_ids = np.unique([id_.split(':')[0] for id_ in unique_ids])
if set(unique_data_sources_ids) == handled:
    print('All set!')
else:
    print(unique_data_sources_ids)
    print('Unknown Data Sources present. Please check before proceeding')

All set!


In [15]:
data['Human_Protein_ID'] = data['Human_Protein_ID'].apply(standardize_names)
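
For reference, here is what the prefix standardization does to the two kinds of identifiers seen so far. This is a stricter variant written for illustration (the name `normalize_prefix` is hypothetical): it only rewrites a leading `uniprotkb:` prefix rather than replacing the substring anywhere in the string.

```python
def normalize_prefix(identifier):
    """Rewrite a leading 'uniprotkb:' to 'uniprot:'; leave other identifiers untouched."""
    prefix, sep, accession = identifier.partition(':')
    if prefix == 'uniprotkb':
        return 'uniprot' + sep + accession
    return identifier


print(normalize_prefix('uniprotkb:P0DTC4'))     # uniprot:P0DTC4
print(normalize_prefix('intact:EBI-25475912'))  # intact:EBI-25475912
```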

### Standardizing names to match with other data sources

In [16]:
def find_viral_name(viral_name):
    return viral_name.split(':')[1].split('(')[0].split('_')[0].upper() 


def find_human_name(human_name):
    return human_name.split(':')[1].split('(')[0].upper() 


data['Alias(es) interactor A'] = data['Alias(es) interactor A'].apply(find_viral_name)
data['Alias(es) interactor B'] = data['Alias(es) interactor B'].apply(find_human_name)

data = data.rename({
    'Alias(es) interactor A': 'SARS_COV2_Protein_Name',
    'Alias(es) interactor B': 'Human_Protein_Name',
    'Interaction identifier(s)': 'Interaction_ID',
}, axis = 1)
data = data.drop('Interaction detection method(s)', axis = 1)
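
To make the alias parsing concrete, the sketch below walks two alias strings through the same split steps. The viral alias is taken from the preview table; the second entry of the human alias (`uniprotkb:EXAMPLE(gene name)`) is a hypothetical stand-in, since the real value is truncated in the preview.

```python
def first_alias(alias_field):
    """Return the first alias: the text between the first ':' and the first '('."""
    return alias_field.split(':')[1].split('(')[0]


viral = 'psi-mi:e_wcpv(display_long)'
human = 'psi-mi:ctl2_human(display_long)|uniprotkb:EXAMPLE(gene name)'

# Viral names additionally drop everything after the first underscore.
viral_name = first_alias(viral).split('_')[0].upper()
human_name = first_alias(human).upper()

print(viral_name)  # E
print(human_name)  # CTL2_HUMAN
```

Only the first alias in a `|`-separated field is used; later aliases are discarded by the `split(':')[1]` step.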

### Manual Fixes to errors in Data Entry 
_Automated Workflow: WIP_

**Correcting ORF3B**

In [17]:
data[data['Human_Protein_ID'] == 'intact:EBI-25475912']

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID,Alt. ID(s) interactor A,Alt. ID(s) interactor B,SARS_COV2_Protein_Name,Human_Protein_Name,Interaction_ID
159,uniprot:Q9UJZ1,intact:EBI-25475912,intact:EBI-1044428|uniprotkb:B4E1K7|uniprotkb:...,-,STML2,ORF3B_WCPV,intact:EBI-25491308


In [18]:
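# Interactors A and B are reversed in this row: the viral protein (ORF3B) sits in the human columns.
# Order: viral ID, human ID, viral alt IDs, human alt IDs, viral name, human name, interaction ID.
correct = ['intact:EBI-25475912', 'uniprot:Q9UJZ1', '-', 'intact:EBI-1044428|uniprotkb:B4E1K7|uniprotkb:O60376|uniprotkb:Q53G29|uniprotkb:Q96FY2|uniprotkb:Q9P042|uniprotkb:D3DRN3', 'ORF3B', 'STML2', 'intact:EBI-25491308']
data.loc[159] = correct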

**Correcting NSP5**

In [19]:
data[data['SARS_COV2_Protein_ID'] == 'uniprot:Q92769']

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID,Alt. ID(s) interactor A,Alt. ID(s) interactor B,SARS_COV2_Protein_Name,Human_Protein_Name,Interaction_ID
108,uniprot:Q92769,uniprot:P0DTD1-PRO_0000449623,intact:EBI-301821|uniprotkb:B3KRS5|uniprotkb:E...,intact:EBI-25475864,HDAC2,NSP5_WCPV,intact:EBI-25490970


In [20]:
correct = ['uniprot:P0DTD1-PRO_0000449623', 'uniprot:Q92769', 'intact:EBI-25475864', 'intact:EBI-301821|uniprotkb:B3KRS5|uniprotkb:E1P561|uniprotkb:Q5SRI8|uniprotkb:Q5SZ86|uniprotkb:Q8NEH4|uniprotkb:B4DL58', 'NSP5', 'HDAC2', 'intact:EBI-25490970']
data.loc[108] = correct

**Correcting NSP11**

In [21]:
data[data['SARS_COV2_Protein_ID'] == 'uniprot:O75347']

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID,Alt. ID(s) interactor A,Alt. ID(s) interactor B,SARS_COV2_Protein_Name,Human_Protein_Name,Interaction_ID
33,uniprot:O75347,uniprot:P0DTC1-PRO_0000449645,intact:EBI-2686341|uniprotkb:B4DT30,intact:EBI-25475882,TBCA,NSP11_WCPV,intact:EBI-25490682


In [22]:
correct = ['uniprot:P0DTC1-PRO_0000449645', 'uniprot:O75347', 'intact:EBI-25475882','intact:EBI-2686341|uniprotkb:B4DT30', 'NSP11', 'TBCA', 'intact:EBI-25490682']
data.loc[33] = correct
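
The three manual corrections above all follow the same pattern: interactors A and B are reversed, so each pair of columns has to trade places. Below is a rough sketch of how that could be automated. It assumes, based only on the three rows shown, that a reversed row can be recognized by a `_WCPV` suffix in the human-name column; `swap_reversed_rows` and `PAIRS` are hypothetical names, and the heuristic has not been checked against the full dataset.

```python
import pandas as pd

# Column pairs that trade places when interactors A and B are reversed.
PAIRS = [
    ('SARS_COV2_Protein_ID', 'Human_Protein_ID'),
    ('Alt. ID(s) interactor A', 'Alt. ID(s) interactor B'),
    ('SARS_COV2_Protein_Name', 'Human_Protein_Name'),
]


def swap_reversed_rows(df, viral_suffix='_WCPV'):
    """Return a copy of df with A/B columns swapped wherever the human-name
    column holds a viral alias (assumed to end with viral_suffix)."""
    out = df.copy()
    reversed_rows = out['Human_Protein_Name'].str.endswith(viral_suffix, na=False)
    if not reversed_rows.any():
        return out
    for a, b in PAIRS:
        out.loc[reversed_rows, [a, b]] = out.loc[reversed_rows, [b, a]].to_numpy()
    # Strip the suffix so the viral name matches the other rows (e.g. 'NSP11').
    out.loc[reversed_rows, 'SARS_COV2_Protein_Name'] = (
        out.loc[reversed_rows, 'SARS_COV2_Protein_Name'].str[: -len(viral_suffix)]
    )
    return out


# The NSP11 row as displayed above, before correction.
toy = pd.DataFrame({
    'SARS_COV2_Protein_ID': ['uniprot:O75347'],
    'Human_Protein_ID': ['uniprot:P0DTC1-PRO_0000449645'],
    'Alt. ID(s) interactor A': ['intact:EBI-2686341|uniprotkb:B4DT30'],
    'Alt. ID(s) interactor B': ['intact:EBI-25475882'],
    'SARS_COV2_Protein_Name': ['TBCA'],
    'Human_Protein_Name': ['NSP11_WCPV'],
    'Interaction_ID': ['intact:EBI-25490682'],
})
fixed = swap_reversed_rows(toy)
print(fixed.iloc[0].tolist())
```

For this row the result matches the hand-written list used in the NSP11 correction.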

**Correcting NSP5_C145A**

In [23]:
# Assign through a single .loc call; chained indexing (data.loc[102][...] = ...)
# may write to a temporary copy and leave data unchanged.
data.loc[[102, 103], 'SARS_COV2_Protein_ID'] = 'uniprot:NSP5_C145A'
data.loc[[102, 103], 'SARS_COV2_Protein_Name'] = 'NSP5_C145A'

**Writing ```interactions_data.csv```**

In [24]:
interactions = data[['SARS_COV2_Protein_ID', 'Human_Protein_ID', 'Interaction_ID']]
interactions.to_csv('interactions_data.csv', index = False)

### Creating Node data files

In [25]:
data = data.rename({'Alt. ID(s) interactor A':'SARS_COV2_Alt_ID', 'Alt. ID(s) interactor B':'Human_Alt_ID'}, axis = 1)
virus_df = data[['SARS_COV2_Protein_ID', 'SARS_COV2_Alt_ID', 'SARS_COV2_Protein_Name']]
virus_df = virus_df.drop_duplicates(subset = ['SARS_COV2_Protein_ID', 'SARS_COV2_Protein_Name']).reset_index(drop = True)

**Unnesting the Alternate IDs**

In [26]:
def pre_un_nest(id_):
    """Split a '|'-separated Alt. ID field into IntAct and UniProt identifiers.

    Each value is NaN (no ID), a string (one ID), or a list (several IDs).
    """
    def collapse(ids):
        # 0 IDs -> NaN, 1 ID -> the ID itself, otherwise the full list
        if not ids:
            return np.nan
        return ids[0] if len(ids) == 1 else ids

    ids = id_.split('|')  # a field without '|' (including '-') yields one entry
    intact_data = [i for i in ids if 'intact' in i]
    uniprot_data = [i for i in ids if 'intact' not in i and 'uniprot' in i]
    return {'Alt_uniprot_ID': collapse(uniprot_data), 'Alt_intact_ID': collapse(intact_data)}
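
As a worked example of the unnesting idea, the sketch below groups an Alt. ID field by database prefix in plain Python. It is a simplified variant rather than the function used in this notebook: `split_alt_ids` is a hypothetical name, and it always returns lists so that every cell has the same type.

```python
def split_alt_ids(field):
    """Group a '|'-separated Alt. ID field by database prefix."""
    grouped = {'intact': [], 'uniprot': []}
    for id_ in field.split('|'):
        prefix = id_.split(':', 1)[0]
        if prefix.startswith('uniprot'):  # covers 'uniprotkb' and 'uniprot'
            grouped['uniprot'].append(id_)
        elif prefix == 'intact':
            grouped['intact'].append(id_)
    return grouped


print(split_alt_ids('intact:EBI-2686341|uniprotkb:B4DT30'))
# {'intact': ['intact:EBI-2686341'], 'uniprot': ['uniprotkb:B4DT30']}
print(split_alt_ids('-'))
# {'intact': [], 'uniprot': []}
```

Returning lists in every case avoids columns that mix strings, lists, and NaN, at the cost of writing bracketed lists to the CSV even for single IDs.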

In [27]:
unnested = (virus_df['SARS_COV2_Alt_ID'].apply(pre_un_nest)).apply(pd.Series)
unnested['SARS_COV2_Protein_ID'] = virus_df['SARS_COV2_Protein_ID']
unnested = unnested[unnested.columns.tolist()[::-1]]
virus_df = virus_df.drop('SARS_COV2_Alt_ID', axis = 1)

**Load in Jeff Law's Data with Sequences**

In [28]:
sequence_interactions = pd.read_excel('../reference_data/2020-04-krogan-sarscov2-sequences-uniprot-mapping.xlsx')
sequence_interactions = sequence_interactions.loc[:26, :]

**Standardizing Names with IntAct data and selecting appropriate columns**

In [29]:
sequence_interactions['SARS_COV2_Protein_Name'] = sequence_interactions['Krogan name'].apply(lambda name: name.split()[-1].upper())
all_external_data = sequence_interactions[['SARS_COV2_Protein_Name', 'Sequence', 'Length', 'Start Pos', 'End Pos']]
# Outer merge on the shared name column keeps proteins that appear in only one source
virus_df = virus_df.merge(all_external_data, on = 'SARS_COV2_Protein_Name', how = 'outer')

### Generating md5 hash based on sequence

In [30]:
assert len(virus_df) == len(virus_df['Sequence'].unique())

virus_df['md5Hash'] = virus_df['Sequence'].apply(lambda seq: hashlib.md5(seq.encode()).hexdigest())
virus_df = virus_df.rename({'SARS_COV2_Protein_ID': 'SARS_COV2_Identifier', 'md5Hash': 'SARS_COV2_Protein_ID'}, axis = 1)

# .copy() makes sequences independent of virus_df, so the casts below do not raise SettingWithCopyWarning
sequences = virus_df[['SARS_COV2_Protein_ID', 'SARS_COV2_Identifier', 'SARS_COV2_Protein_Name', 'Sequence', 'Length', 'Start Pos', 'End Pos']].copy()
sequences['Start Pos'] = sequences['Start Pos'].astype(int)
sequences['End Pos'] = sequences['End Pos'].astype(int)

map_identifiers = virus_df[['SARS_COV2_Identifier', 'SARS_COV2_Protein_ID']]
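
The hashing step can be tried in isolation. The sketch below wraps the same `hashlib.md5` call in a helper (`sequence_id` is a hypothetical name) and runs it on short made-up strings, not real protein sequences.

```python
import hashlib


def sequence_id(sequence):
    """Hex MD5 digest of a sequence string, used as a stable identifier."""
    return hashlib.md5(sequence.encode()).hexdigest()


# Made-up toy strings, not real protein sequences.
a = sequence_id('MKT')
b = sequence_id('MKT')
c = sequence_id('mkt')

print(len(a))   # 32 hexadecimal characters
print(a == b)   # True: identical sequences always map to the same ID
print(a == c)   # False: the digest is case-sensitive
```

Because the digest depends on every character, sequences should be normalized (case, whitespace) consistently before hashing if IDs are meant to match across sources.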

### Mapping interactions to the ```interactions``` dataframe and the ```unnested``` dataframe

In [31]:
interactions = interactions.rename({'SARS_COV2_Protein_ID': 'SARS_COV2_Identifier'}, axis = 1)
interactions = interactions.merge(map_identifiers)
interactions = interactions.drop('SARS_COV2_Identifier', axis = 1)
interactions = interactions[['Interaction_ID', 'SARS_COV2_Protein_ID', 'Human_Protein_ID']]


unnested = unnested.rename({'SARS_COV2_Protein_ID': 'SARS_COV2_Identifier'}, axis = 1)
unnested['Alt_uniprot_ID'] = unnested['Alt_uniprot_ID'].apply(lambda id_: np.nan if id_ == [] else id_)
unnested = unnested.merge(map_identifiers).drop('SARS_COV2_Identifier', axis = 1)
unnested = unnested[unnested.columns[::-1]]
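
One property of the mapping step is worth keeping in mind: `DataFrame.merge` defaults to an inner join, so interactions whose identifier has no entry in the mapping table are dropped without warning. The sketch below, on toy data with a placeholder hash and a made-up `uniprot:UNMAPPED` identifier, shows one way to surface such rows using a left join with `indicator=True`.

```python
import pandas as pd

interactions = pd.DataFrame({
    'SARS_COV2_Identifier': ['uniprot:P0DTC4', 'uniprot:P0DTC4', 'uniprot:UNMAPPED'],
    'Human_Protein_ID': ['uniprot:Q8IWA5', 'uniprot:Q86VM9', 'uniprot:Q6UX04'],
})
mapping = pd.DataFrame({
    'SARS_COV2_Identifier': ['uniprot:P0DTC4'],
    'SARS_COV2_Protein_ID': ['placeholder-hash'],  # placeholder, not a real md5
})

# Default inner join: the unmapped interaction disappears.
inner = interactions.merge(mapping)

# Left join with an indicator column: unmatched rows are kept and flagged.
merged = interactions.merge(mapping, on='SARS_COV2_Identifier', how='left', indicator=True)
unmapped = merged[merged['_merge'] == 'left_only']

print(len(inner), len(merged), len(unmapped))  # 2 3 1
```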

### Creating the Human Proteins File

In [32]:
human_data = data[['Human_Protein_ID', 'Human_Protein_Name', 'Human_Alt_ID']]
unnested_aliases = (human_data['Human_Alt_ID'].apply(pre_un_nest)).apply(pd.Series)
human_data = human_data.drop('Human_Alt_ID', axis = 1)
unnested_aliases['Human_Protein_ID'] = human_data['Human_Protein_ID']
unnested_aliases = unnested_aliases[unnested_aliases.columns[::-1]]

### Writing all files to csv

In [None]:
sequences.to_csv(NEO4J_HOME / 'import/01e-virus_data.csv', index = False)
unnested.to_csv(NEO4J_HOME / 'import/01e-virus_alias.csv', index = False)
human_data.to_csv(NEO4J_HOME / 'import/01e-human_data.csv', index = False)
unnested_aliases.to_csv(NEO4J_HOME / 'import/01e-human_alias.csv', index = False)
interactions.to_csv(NEO4J_HOME / 'import/01e-interactions.csv', index = False)

### Resulting files

**```01e-virus_data.csv```**: Contains the viral protein sequences and associated information, with conflicts between the three data sources resolved. Each protein is identified by the md5 hash of its sequence, which serves as the protein ID. Also contains the sequence, its length, its start and end positions in the genome, and the standard identifiers.org representation.

**```01e-virus_alias.csv```**: Contains all known alias IDs for the viral proteins, whether from IntAct or UniProt. Keyed by the protein ID.

**```01e-human_data.csv```**: Contains the human proteins and associated information. Each protein is identified by its standard identifiers.org representation, which serves as the protein ID.

**```01e-human_alias.csv```**: Contains all known alias IDs for the human proteins, whether from IntAct or UniProt. Keyed by the protein ID.

**```01e-interactions.csv```**: Contains all the interactions between a viral protein and a human protein. Each one of these interactions also has an ID which is resolvable on identifiers.org.

In [34]:
# TODO:
# - Automated download of IntAct data
# - Nodes based on Taxonomy ID (eliminating the need for virus_data, human_data)