### This is a notebook to run through OpenCell mass-spec processing.
The full OC mass-spec results file resides in CZBMPI_interactome on Google Drive. 
Our final processing used "Pulldown_Results_4x11Plates", which applied MBR (match between runs) across 11 plates. 
Here, the example uses 3 plates. 

Sample names usually have to be edited to fit the format the scripts expect. Refer to "change_sample_names.ipynb" for a guide to changing sample names. 

In [2]:
import sys
sys.path.append('../')

import pandas as pd
from pyseus import basic_processing as ip
from pyseus import primary_analysis as pa
from pyseus import validation_analysis as va
from pyseus import stoichiometry as stoi


### Assign parameters
Imputation:
For small batches, bait imputation performs better at filling in missing values. 
For large batches, there is usually no need for imputation, except for protein groups that do not have enough valid samples. In that case we use prey imputation, which imputes around the mean intensity of the protein group. 
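
The two modes can be illustrated with a minimal NumPy sketch. This is not the pyseus implementation: the helper names `impute_bait_style` and `impute_prey_style` are hypothetical, and reading `distance` and `width` as a down-shift and a narrowing of the observed standard deviation is an assumption based on a common convention for imputing missing intensities. Reusing `width` for the prey case is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_bait_style(sample, distance=1.8, width=0.3):
    """Fill NaNs in one sample (column) from a down-shifted, narrowed normal.

    Assumption: centre = mean - distance * std, spread = width * std,
    both computed from the observed values of that sample.
    """
    sample = np.asarray(sample, dtype=float)
    observed = sample[~np.isnan(sample)]
    mu, sigma = observed.mean(), observed.std()
    filled = sample.copy()
    missing = np.isnan(filled)
    filled[missing] = rng.normal(mu - distance * sigma, width * sigma, missing.sum())
    return filled

def impute_prey_style(row, width=0.3):
    """Fill NaNs in one protein group (row) around that row's own mean."""
    row = np.asarray(row, dtype=float)
    observed = row[~np.isnan(row)]
    mu, sigma = observed.mean(), observed.std()
    filled = row.copy()
    missing = np.isnan(filled)
    filled[missing] = rng.normal(mu, width * sigma, missing.sum())
    return filled

# Made-up log2 intensities with two missing values
values = np.array([25.0, 27.0, np.nan, 30.0, np.nan, 28.0])
bait_filled = impute_bait_style(values)
prey_filled = impute_prey_style(values)
```

In this sketch the bait-style values land well below the observed intensities (mimicking signal under the detection limit), while the prey-style values stay close to the protein group's typical intensity.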



In [7]:
root = '../data/OC_Plate_22-25_MBR/'
analysis = '20220726'

# this is a renamed pg file from 'change_sample_names.ipynb'
pg_file = 'proteinGroups_renamed.txt'

# Use LFQ or absolute intensity
intensity_type = 'LFQ intensity'

# Imputation parameters
impute_mode = 'bait' # the other mode is 'prey'; if you do not wish to impute, set it to 'prey' with a very high threshold

distance = 1.8
width = 0.3

thresh=100 # For prey-imputation, number of samples needed to skip imputation in a row

# regular expression to group the replicates together
regexp = r'(P\d{3})(?:.\d{2})?(_.*)_\d{2}'
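
To check what the regular expression does before running the pipeline, it can be tried on a few sample names with the standard `re` module. The sketch below assumes that the captured groups are concatenated to form the replicate-group name; how pyseus uses the groups internally is not shown in this notebook, and the helper `replicate_group` and the sample names are hypothetical.

```python
import re

regexp = r'(P\d{3})(?:.\d{2})?(_.*)_\d{2}'

def replicate_group(sample_name):
    """Return the group name for a sample, or None if the name does not match.

    Assumption: the group name is the concatenation of the captured groups,
    i.e. the plate ID plus the target, without the replicate suffix.
    """
    match = re.search(regexp, sample_name)
    if match is None:
        return None
    return ''.join(match.groups())

# Hypothetical sample names following the plate_target_replicate pattern
names = ['P022_AAMP_01', 'P022_AAMP_02', 'P022_AAMP_03', 'LFQ intensity P025_YWHAE_01']
groups = [replicate_group(n) for n in names]
```

With these names, the three `P022_AAMP` replicates map to the same group, and the intensity-type prefix is ignored because `re.search` scans for the first match.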

### Standard Processing
The standard processing goes through multiple steps: MaxQuant (MQ) filtering, log2 transformation, grouping triplicates, removing invalid rows, imputation, and creating the bait matrix. Please refer to the pyseus ReadTheDocs for more information.
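
The first steps can be sketched on a toy table with pandas. This is an illustration only, not the pyseus code: the table values are made up, the flag columns are the standard MaxQuant quality columns, and the definition of an "invalid" row is simplified here to a row with no measured value at all.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a MaxQuant proteinGroups table (values are made up)
pg = pd.DataFrame({
    'Protein IDs': ['A', 'B', 'C', 'D'],
    'Reverse': ['', '+', '', ''],
    'Potential contaminant': ['', '', '', ''],
    'Only identified by site': ['', '', '', ''],
    'LFQ intensity P022_AAMP_01': [1024.0, 2048.0, 0.0, 4096.0],
    'LFQ intensity P022_AAMP_02': [2048.0, 2048.0, 0.0, 0.0],
})

# 1. MQ filtering: drop rows flagged with '+' in any quality column
flags = ['Reverse', 'Potential contaminant', 'Only identified by site']
filtered = pg[~pg[flags].eq('+').any(axis=1)]

# 2. log2 transformation: a zero means "not detected", so treat it as missing
intensity_cols = [c for c in filtered.columns if c.startswith('LFQ intensity')]
log2_table = np.log2(filtered[intensity_cols].replace(0, np.nan))

# 3. Remove invalid rows (simplified: rows without any measured value)
valid = log2_table.dropna(how='all')
```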

In [9]:
initial_tables = ip.opencell_initial_processing(
    root=root,
    analysis=analysis,
    pg_file=pg_file,
    intensity_type=intensity_type,
    impute=impute_mode,
    distance=distance,
    width=width,
    thresh=thresh,
    group_regexp=regexp
    )

Filtered 169 of 3829 rows. Now 3660 rows.
Removed invalid rows. 3448 from 3660 rows remaining.


In [10]:
# save the intermediate tables to the designated analysis folder
initial_tables.bait_imputed_table.to_csv(root + analysis + '/imputed_table.csv')
initial_tables.bait_matrix.to_csv(root + analysis + '/analysis_exclusion_matrix.csv')

### Calculate p-values and Enrichment
For the p-value calculation, we use the AnalysisTables class in the primary_analysis.py module.
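
As a rough picture of what a p-value and an enrichment mean for a single prey in a single pulldown, the sketch below compares the bait replicates against a set of control samples. It assumes a Welch's t-test (via `scipy.stats.ttest_ind`) and a difference of mean log2 intensities; the test statistic, the choice of controls, and the enrichment definition that pyseus actually uses may differ, and the function name `pval_enrichment` and all numbers are made up.

```python
import numpy as np
from scipy import stats

def pval_enrichment(bait_reps, control_values):
    """Return (-log10 p, enrichment) for one prey in one pulldown.

    Inputs are log2 intensities of the prey in the bait replicates and
    in the control samples.
    """
    bait_reps = np.asarray(bait_reps, dtype=float)
    control_values = np.asarray(control_values, dtype=float)
    # Enrichment: difference of mean log2 intensities (a log2 fold change)
    enrichment = bait_reps.mean() - control_values.mean()
    # Welch's t-test (unequal variances) between replicates and controls
    _, pval = stats.ttest_ind(bait_reps, control_values, equal_var=False)
    return -np.log10(pval), enrichment

neg_log_p, enrichment = pval_enrichment(
    [30.1, 29.8, 30.3],                      # triplicate pulldown of the bait
    [24.9, 25.2, 25.0, 24.7, 25.1, 25.3],    # same prey in control samples
)
```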

In [11]:
# initiate class for calculating p-val and enrichment
an_tables = pa.AnalysisTables(
    root=root,
    analysis=analysis,
    grouped_table=initial_tables.bait_imputed_table,
    exclusion_matrix = initial_tables.bait_matrix)

an_tables.load_exclusion_matrix()

In [12]:
# Self-explanatory methods

# an_tables.print_baits()
# an_tables.print_excluded_controls('Glycine_Low_pH_LAMP1')
an_tables.print_controls('P022_AAMP')
# an_tables.select_wildtype_controls(wt_re='_WT')
# an_tables.restore_default_exclusion_matrix()

Unnamed: 0,Samples
1,P022_ATL3
2,P022_CDKN2A
3,P022_CLTA
4,P022_COMMD1
5,P022_COMMD2
...,...
118,P025_YWHAE
119,P025_YWHAG
120,P025_YWHAH
121,P025_YWHAQ


#### Two options are shown below: a simple p-value calculation or a two-step bootstrap calculation. For the difference between the two methods, please refer to the ReadTheDocs.

In [13]:
# Simple Calculation
# calculate p-val and enrichment, and convert table to standard format for validation
an_tables.simple_pval_enrichment(std_enrich=False)
an_tables.convert_to_standard_table()

P-val calculations..
Finished!


In [53]:
# Two step bootstrap pval calculation
an_tables.two_step_bootstrap_pval_enrichment(std_enrich=True)
an_tables.convert_to_standard_table(simple_analysis=False)

First round p-val calculations..
First round finished!
Second round p-val calculations...
Second round finished!


In [56]:
# save the pval/enrichment table
an_tables.standard_hits_table.to_csv(root + analysis + '/standard_pval_table.csv')

# save the wide table
an_tables.simple_pval_table.to_csv(root + analysis + '/wide_pval_table.csv')
# an_tables.two_step_pval_table.to_csv(root + analysis + '/wide_pval_table.csv')


### Next, we apply an FDR threshold for interaction calling. Here we use the dynamic FDR described in the paper. 
For this we use the Validation class from the validation_analysis.py module.
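
For intuition only, the sketch below shows a generic way to set a volcano-plot threshold from the data: hits are points above a hyperbolic curve, and the curve is pushed outward until the hits on the mirrored (negative-enrichment) side, which are treated as false positives, make up at most `perc` percent of the hits on the positive side. This does not reproduce the procedure from the paper or the `dynamic_fdr` implementation. The functions `is_hit` and `find_offset`, the parameters `curvature`, `step`, and `max_offset`, and this interpretation of `perc` are all assumptions.

```python
import numpy as np

def is_hit(neg_log_p, enrichment, curvature, offset):
    """True where a point lies above y = curvature / (x - offset), for x > offset."""
    x = np.asarray(enrichment, dtype=float)
    y = np.asarray(neg_log_p, dtype=float)
    hit = np.zeros(x.shape, dtype=bool)
    right = x > offset
    hit[right] = y[right] > curvature / (x[right] - offset)
    return hit

def find_offset(neg_log_p, enrichment, perc=10, curvature=3.0, step=0.1, max_offset=10.0):
    """Raise the offset until mirrored hits are at most perc% of positive-side hits."""
    x = np.asarray(enrichment, dtype=float)
    offset = 0.0
    while offset <= max_offset:
        pos = is_hit(neg_log_p, x, curvature, offset).sum()
        neg = is_hit(neg_log_p, -x, curvature, offset).sum()  # mirror the plot
        if pos == 0 or 100.0 * neg / pos <= perc:
            return offset
        offset += step
    return max_offset

# Made-up volcano-plot coordinates for one bait
enrichment = np.array([6.0, 5.0, 4.0, 0.5, 1.0, -0.5, -1.5, -1.0])
neg_log_p = np.array([8.0, 6.0, 5.0, 1.0, 0.5, 1.0, 4.0, 0.5])

offset = find_offset(neg_log_p, enrichment, perc=10)
hits = is_hit(neg_log_p, enrichment, 3.0, offset)
```

In this sketch, a smaller `perc` forces a larger offset and therefore a stricter threshold.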

In [17]:
# initiate class
vali = va.Validation(hit_table = an_tables.standard_hits_table, target_col='target', prey_col='prey')

# perc is the parameter that determines how strict the interaction calling is
vali.dynamic_fdr(perc=10)

# save the full table and just the interaction table
vali.called_table.to_csv(root + analysis + '/full_hits_table.csv')
vali.interaction_table.to_csv(root + analysis + '/interactions_table.csv')

### Stoichiometry calculations

To calculate stoichiometries, we need the imputed table from basic processing and the wide p-value table from the analysis, along with some other required files.
These are found in the stoichiometry_input directory.

The script for the stoichiometry calculation is in the stoichiometry.py module.
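
The idea of an interaction stoichiometry can be sketched as the abundance of each prey relative to the bait within one pulldown. The example below is a simplified illustration with made-up numbers and hypothetical column names. Normalising intensity by sequence length is an assumption motivated by the FASTA-derived `seq_table` input; how `compute_stoich_df` actually normalises and combines its inputs is not shown in this notebook.

```python
import pandas as pd

# Made-up example: raw (non-log) intensities of the proteins found in one
# pulldown, together with the sequence length of each protein
pulldown = pd.DataFrame({
    'protein': ['BAIT', 'PREY1', 'PREY2'],
    'intensity': [8.0e9, 2.0e9, 4.0e8],
    'length': [400, 200, 100],
})

# Normalise intensity by length so that longer proteins, which yield more
# peptides, are not over-counted (assumed normalisation)
pulldown['abundance'] = pulldown['intensity'] / pulldown['length']

# Interaction stoichiometry: abundance of each protein relative to the bait
bait_abundance = pulldown.loc[pulldown['protein'] == 'BAIT', 'abundance'].iloc[0]
pulldown['interaction_stoich'] = pulldown['abundance'] / bait_abundance
```

Under these assumptions a value of 1 means the prey is recovered at the same level as the bait, and values well below 1 indicate a sub-stoichiometric interactor.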

In [27]:
# import necessary input files
# read the accompanying Confluence page for details on how to retrieve/generate these files

pulldown_meta = pd.read_csv('../data/stoichiometry_input/pulldown_metadata_1002.csv')
total_abundances = pd.read_csv('../data/stoichiometry_input/HEK_abundance_digitonin_rnaseq_ensg.csv')
seq_table = stoi.fasta_df('../data/stoichiometry_input/uniprot_proteome.fasta')
ensembl_uniprot = pd.read_csv('../data/stoichiometry_input/ensembl_uniprot_association.csv')


In [134]:
# For 'pvals', here we use wide-pval table from analysis - an_tables.simple_pval_table, 
# but it could be an_tables.two_step_pval_table if you used two-step bootstrapping

stois, pg_mapping = stoi.compute_stoich_df(imputed=initial_tables.bait_imputed_table,
    seq_df=seq_table, rnaseq=total_abundances, pvals=an_tables.simple_pval_table,
    pull_uni=pulldown_meta, ensembl_uniprot=ensembl_uniprot,
    target_re=r'P(\d{3}_.*)')



In [137]:
# save the stoichiometry table, pg_mapping is a pg<->ensg mapping that may be useful as a reference
stois.to_csv(root + analysis + '/stoichiometry_table.csv')