## Clustering workflow
This notebook shows how to prepare Pyseus outputs for clustering.
It also shows how to post-process clustering results so that they can be uploaded to the OpenCell database.

In [1]:
import sys
sys.path.append('../')

import pandas as pd
from pyseus import validation_analysis as va


Clustering an interactome requires a graph dataset, in which the data is represented by nodes and the edges that connect them. In our case, the nodes are baits and protein groups (or ENSG annotations), and the edges are some measure of the interaction between them.

In OpenCell, we use interaction stoichiometry as edges. Below is a simple example of how to create graph data from a Pyseus output file.

Here we'll use an already processed protein-protein interaction table from OpenCell. In this table, the target column identifies the IP-MS pulldown.

In [9]:
standard_output = pd.read_csv('../data/clustering/OpenCell_1.0_interactions.csv', index_col=0)
standard_output.head()

Unnamed: 0,target,prey,protein_ids,pvals,interaction_stoi,abundance_stoi,enrichment,plate,hits,minor_hits,fdr1,fdr5
1821637,AAMP,SSRP1,Q08945;A0A0U1RRK2;E9PMD4;E9PPZ7,4.665206,0.012199,1.746892,2.693357,CZBMPI_0022,False,True,"[3, 2.2]","[3, 1.9]"
1821877,AAMP,CWF19L2,Q2TBE0;Q2TBE0-2;H7C3G7;Q2TBE0-3;H0YE03,5.480125,0.022009,0.063036,2.889502,CZBMPI_0022,True,False,"[3, 2.2]","[3, 1.9]"
1822157,AAMP,ARGLU1,Q9NWB6;Q9NWB6-3;Q9NWB6-2,5.495914,0.128618,0.222927,2.868131,CZBMPI_0022,True,False,"[3, 2.2]","[3, 1.9]"
1822224,AAMP,SUPT16H,Q9Y5B9,6.259778,0.006352,1.854561,2.755307,CZBMPI_0022,True,False,"[3, 2.2]","[3, 1.9]"
1822281,AAMP,RPL10,X1WI28;P27635;B8A6G2;A6QRI9;Q96L21,15.156173,0.521148,13.14691,4.847218,CZBMPI_0022,True,False,"[3, 2.2]","[3, 1.9]"
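
To make the node/edge picture concrete, here is a minimal sketch that treats the five rows shown above as a weighted, undirected graph. It is not part of Pyseus: the `edges` list and the `graph` mapping are illustrative names, and the weights are the `interaction_stoi` values copied from the table.

```python
from collections import defaultdict

# (target, prey, interaction_stoi) copied from the rows shown above
edges = [
    ("AAMP", "SSRP1", 0.012199),
    ("AAMP", "CWF19L2", 0.022009),
    ("AAMP", "ARGLU1", 0.128618),
    ("AAMP", "SUPT16H", 0.006352),
    ("AAMP", "RPL10", 0.521148),
]

# Undirected weighted graph stored as: node -> {neighbor: edge weight}
graph = defaultdict(dict)
for target, prey, weight in edges:
    graph[target][prey] = weight
    graph[prey][target] = weight  # same edge seen from the other node

print(sorted(graph))            # every node in the graph
print(graph["AAMP"]["RPL10"])   # weight of the AAMP-RPL10 edge
```

Storing each edge in both directions keeps neighbor lookups simple, at the cost of holding every weight twice.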


We are going to convert the table into a unique-interactions table, which removes duplicate node-node relationships regardless of direction.

To do so, we create a Validation object from the validation_analysis module. In the process, we'll choose the p-value as the edge weight that represents each interaction. For target and prey, we'll use UniProt gene names, but you could also convert them to ENSG IDs.

In [28]:
# importlib replaces the deprecated imp module (removed in Python 3.12)
import importlib
importlib.reload(va)

<module 'pyseus.validation_analysis' from '/Users/kibeom.kim/Documents/GitHub/pyseus/notebooks/../pyseus/validation_analysis.py'>

In [29]:
validation = va.Validation(hit_table=None,
    interaction_table=standard_output,
    target_col='target',
    prey_col='prey',
    )

# create unique interactions table with p-values as edge weights
# note: getting the max edge weight for each unique interaction is time-consuming
validation.convert_to_unique_interactions(get_edge=True, edge='pvals')
uniques_pvals = validation.unique_interaction_table.copy()
uniques_pvals.head()

Unnamed: 0,prot_1,prot_2,pvals
0,AAMP,SSRP1,4.665206
1,AAMP,CWF19L2,5.480125
2,AAMP,ARGLU1,5.495914
3,AAMP,SUPT16H,6.259778
4,AAMP,RPL10,15.156173
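
The direction-independent deduplication can be sketched in plain pandas. This is an illustration of the idea, not the code behind `convert_to_unique_interactions()`: the toy values and the `interactions`, `ordered`, and `unique` names are hypothetical. Keeping the maximum weight per pair follows the comment in the cell above.

```python
import pandas as pd

# Toy table (hypothetical values): AAMP/SSRP1 appears in both directions
interactions = pd.DataFrame({
    "target": ["AAMP", "SSRP1", "AAMP"],
    "prey": ["SSRP1", "AAMP", "RPL10"],
    "pvals": [4.7, 6.1, 15.2],
})

# Order each pair alphabetically so (A, B) and (B, A) share one key
ordered = [sorted(pair) for pair in zip(interactions["target"], interactions["prey"])]

unique = (
    pd.DataFrame(ordered, columns=["prot_1", "prot_2"])
    .assign(pvals=interactions["pvals"].to_numpy())
    # keep the largest edge weight observed for each unordered pair
    .groupby(["prot_1", "prot_2"], as_index=False)["pvals"]
    .max()
)
print(unique)
```

Sorting the pair before grouping is what makes the relationship undirected; without it, the two AAMP/SSRP1 rows would be kept as separate edges.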


In OpenCell, we actually used both interaction and abundance stoichiometries to create the edge weights. This was based on empirical evidence from Marco Hein's paper that a 'circle' region of the 2-D stoichiometry plot contained the most stable complexes.
To create weights from this region, another function is in place: return_circle_uniques().

The weights are user-specified through 'major_val', 'minor_val', and 'rest_val'. You can tweak these weights during clustering for the best outputs; the defaults are what was used in OpenCell.

In [38]:
validation = va.Validation(hit_table=None,
    interaction_table=standard_output,
    target_col='target',
    prey_col='prey',
    )

# run the stoi-circle edge weight with all the default parameters
validation.return_circle_uniques(unique=True)

In [39]:
validation.circle_table

Unnamed: 0,prot_1,prot_2,circle_stoi
0,AAMP,SSRP1,0.2
1,AAMP,CWF19L2,0.2
2,AAMP,ARGLU1,4.0
3,AAMP,SUPT16H,0.2
4,AAMP,RPL10,0.2
...,...,...,...
28590,STX7,TMEM106B,0.2
28591,STARD3NL,TMEM106B,0.2
28592,TMEM106B,VAMP7,0.2
28593,DMXL2,WDR7,4.0
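
As a rough illustration of how a 'circle' region of the 2-D stoichiometry plot can be turned into edge weights, consider the sketch below. It is not the implementation of `return_circle_uniques()`. The circle center, the radius, the log10 scaling, and the `circle_weight` function are all assumptions made for this example. Only the two weight values (4.0 and 0.2) are taken from the table above, and the 'minor_val' tier is left out because its region is not described in this notebook.

```python
import math

# Hypothetical circle in log10(interaction stoi) x log10(abundance stoi) space.
# These are NOT the pyseus defaults; they are placeholders for illustration.
CENTER = (-1.0, 0.0)
RADIUS = 1.0

def circle_weight(interaction_stoi, abundance_stoi, inside_val=4.0, outside_val=0.2):
    """Return inside_val if the point falls within the circle, else outside_val."""
    x = math.log10(interaction_stoi)
    y = math.log10(abundance_stoi)
    distance = math.hypot(x - CENTER[0], y - CENTER[1])
    return inside_val if distance <= RADIUS else outside_val

print(circle_weight(0.1, 1.0))    # at the assumed center: inside the circle
print(circle_weight(0.001, 1.0))  # two log10 units away: outside the circle
```

A step function like this gives the clustering algorithm a strong preference for edges in the assumed region while still keeping the remaining edges in the graph at a low weight.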


The above table is a good starting point for a PPI (protein-protein interaction) graph dataset for clustering.
I haven't included the steps for graph clustering with MCL. I think there are likely improved clustering methods that could be developed for OpenCell 2.0 or beyond. The clustering done for OpenCell, as described in the methods section, is explained below:

Our final clustering analysis (using MCL) used an inflation parameter of 3.0 (Fig. S4I). The clusters
were pruned to remove any node included in a cluster on the basis of a single edge. The resulting
clusters correspond to the protein “communities” described in the text. We then utilized another
round of MCL clustering to identify core-clusters within each community by considering only
highly stoichiometric interactions (interaction stoichiometries between 0.05 and 10, and cellular
abundance stoichiometry between 0.1 and 10). The resulting core-clusters represented highly
stable core clusters within the original communities.
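
Although this notebook does not include the MCL steps, the role of the inflation parameter can be illustrated with a toy version of Markov clustering. This is a minimal sketch, not the OpenCell pipeline or a reference MCL implementation: the `mcl` function, the self-loop weight of 1.0, the tolerance, and the toy matrix are all assumptions. Only the inflation value of 3.0 comes from the methods text above.

```python
import numpy as np

def mcl(adjacency, inflation=3.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering on a symmetric, non-negative weight matrix."""
    m = np.asarray(adjacency, dtype=float).copy()
    np.fill_diagonal(m, m.diagonal() + 1.0)   # add self-loops (assumed weight 1.0)
    m /= m.sum(axis=0, keepdims=True)         # make columns sum to 1
    for _ in range(max_iter):
        previous = m
        m = m @ m                             # expansion: two-step random walks
        m = m ** inflation                    # inflation: boost strong flows
        m /= m.sum(axis=0, keepdims=True)     # re-normalize columns
        if np.allclose(m, previous, atol=tol):
            break
    # Rows with a non-zero diagonal are attractors; each such row lists a cluster
    clusters = set()
    for row in m[m.diagonal() > tol]:
        clusters.add(tuple(int(i) for i in np.flatnonzero(row > tol)))
    return sorted(clusters)

# Toy graph: two triangles (weight 4.0) joined by one weak edge (weight 0.2)
weights = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    weights[i, j] = weights[j, i] = 4.0
weights[2, 3] = weights[3, 2] = 0.2

print(mcl(weights, inflation=3.0))
```

Higher inflation values sharpen the contrast between strong and weak flows, which generally yields smaller, tighter clusters.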