## DAISY: the DAta-mIning SYnthetic-lethality-identification pipeline
```
Title:   Data Mining Synthetic Lethality Identification Pipeline (DAISY)
Author:  Bahar Tercan
Created: 02-07-2022
Purpose: Retrieve Synthetic Lethal Partners of the Genes in the Given List Using the DAISY Algorithm
Notes: Runs in MyBinder 
```


Please cite: 
For Implementation: 

Our paper,

For DAISY algorithm: 

Jerby-Arnon, L., Pfetzer, N., Waldman, Y. Y., McGarry, L., James, D., Shanks, E., ... & Gottlieb, E. (2014). Predicting cancer-specific vulnerability via data-driven detection of synthetic lethality. Cell, 158(5), 1199-1209.

For CCLE Omics data:

Ghandi, M., Huang, F.W., Jané-Valbuena, J. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019). https://doi.org/10.1038/s41586-019-1186-3

For CRISPR Data: 

Robin M. Meyers, Jordan G. Bryan, James M. McFarland, Barbara A. Weir, ... David E. Root, William C. Hahn, Aviad Tsherniak. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nature Genetics 2017 October 49:1779–1784. doi:10.1038/ng.3984

Dempster, J. M., Rossen, J., Kazachkova, M., Pan, J., Kugener, G., Root, D. E., & Tsherniak, A. (2019). Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines. BioRxiv, 720243.

For shRNA Data:

James M. McFarland, Zandra V. Ho, Guillaume Kugener, Joshua M. Dempster, Phillip G. Montgomery, Jordan G. Bryan, John M. Krill-Burger, Thomas M. Green, Francisca Vazquez, Jesse S. Boehm, Todd R. Golub, William C. Hahn, David E. Root, Aviad Tsherniak. (2018). Improved estimation of cancer dependencies from large-scale RNAi screens using model-based normalization and data integration. Nature Communications 9, 1. https://doi.org/10.1038/s41467-018-06916-5

For ISB-CGC:

Reynolds, S. M., Miller, M., Lee, P., Leinonen, K., Paquette, S. M., Rodebaugh, Z., ... & Shmulevich, I. (2017). The ISB Cancer Genomics Cloud: a flexible cloud-based platform for cancer genomics research. Cancer research, 77(21), e7-e10.

For Pancancer Atlas Data:

Hutter, C., and Zenklusen, J.C. (2018). The Cancer Genome Atlas: Creating Lasting Value beyond Its Data. Cell 173, 283–285.

This notebook is a reimplementation of the DAISY synthetic lethal (SL) pair prediction algorithm.

It consists of three modules:

1. SL candidate determination using gene co-expression
2. SL candidate determination using genomic survival of the fittest
3. SL candidate determination using CRISPR and shRNA experiments

* The results from the three modules are then aggregated into one ranked list of candidate SL pairs

Input Parameters
* Cancer type 
* The genes whose SL partners are sought

Input Data (available in BigQuery tables)
* Gene expression data 
* Gene mutation data
* Copy number variation data
* Gene effect data (CRISPR)
* Gene dependency scores data (shRNA)

Output
* List of candidate SL pairs

Please contact Bahar Tercan (btercan@systemsbiology.org) with questions or for more detailed information.

In [None]:
# This code block installs the dependencies. Run it only once, the first time you run this notebook.

!pip3 install google-cloud-bigquery
# importlib is part of the Python standard library, so it does not need to be installed
!pip3 install pandas
!pip3 install ipywidgets
!pip3 install numpy
!pip3 install statsmodels


### 1. Import the required Python libraries
The cell below imports the required libraries, including the `DAISY_operations` module from the ../Scripts/ folder.

In [None]:
import sys
sys.path.append('../Scripts/') # to be able to use the .py files in ../Scripts folder
from google.cloud import bigquery
import importlib
import pandas as pd
import DAISY_operations
importlib.reload(DAISY_operations)
from DAISY_operations import *
import ipywidgets as widgets

In [None]:
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

## Google Authentication
The first step is to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html); alternative authentication methods are described [here](https://googleapis.dev/python/google-api-core/latest/auth.html).

You also need to [create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) to be able to run BigQuery queries.

In [None]:
# Please make sure that the Google Cloud SDK is installed.
# Installation instructions: https://cloud.google.com/sdk/docs/install

!gcloud auth application-default login


### 2. Sign in to Google BigQuery with the project ID

The cell below creates the BigQuery client.

Please replace syntheticlethality with your own project ID.

In [None]:
# please replace 'syntheticlethality' with your project id
project_id='syntheticlethality'
client = bigquery.Client(project_id)

### 4. Prediction of synthetic lethal partners using the different DAISY modules


DAISY has three modules for synthetic lethal pair inference:
1. Pairwise gene coexpression
2. Genomic survival of the fittest
3. shRNA- or CRISPR-based functional examination

You can find more information in the original paper: https://www.sciencedirect.com/science/article/pii/S0092867414009775.

In the pairwise gene coexpression module and the genomic survival of the fittest module, we use PanCancer Atlas and CCLE data.<br>
In the functional examination module, we use CRISPR and shRNA data together with CCLE data. <br>

The required Python code is in the ../Scripts/ folder and is imported at the beginning of the notebook.


#### 4.0. Default parameters for DAISY (you can edit them)

For SDL prediction, replace 'SL' with 'SDL' and 'Inactive' with 'Overactive' in the code cells that follow.

For the survival of the fittest (SOF) and functional examination procedures, input_mutations is an optional parameter; omit it if you do not want to use mutation data.
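As a convenience, the SL and SDL settings described in this section can be kept side by side instead of being edited by hand. The sketch below is only an illustration: `daisy_defaults` and the `status_column` key are hypothetical names that do not exist in `DAISY_operations`, and the values are the defaults listed in the parameter cell.

```python
# Hypothetical helper, not part of DAISY_operations: returns the default
# thresholds for the requested analysis type.
def daisy_defaults(analysis="SL"):
    """Return the DAISY default thresholds for 'SL' or 'SDL' prediction."""
    common = {"cor_threshold": 0.5, "p_threshold": 0.05}
    if analysis == "SL":
        # Inactive gene: expression below the 10th percentile, copy number below -0.3
        return {**common, "percentile_threshold": 10, "cn_threshold": -0.3,
                "status_column": "Inactive"}
    if analysis == "SDL":
        # Overactive gene: expression above the 90th percentile, copy number above 0.3
        return {**common, "percentile_threshold": 90, "cn_threshold": 0.3,
                "status_column": "Overactive"}
    raise ValueError("analysis must be 'SL' or 'SDL'")

sdl_params = daisy_defaults("SDL")
print(sdl_params["percentile_threshold"], sdl_params["cn_threshold"])  # 90 0.3
```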

In [None]:
input_mutations = ['Nonsense_Mutation', 'Frame_Shift_Ins', 'Frame_Shift_Del'] 
# DAISY default parameters for SL prediction
percentile_threshold = 10
cn_threshold = -0.3 
cor_threshold = 0.5
p_threshold = 0.05
pval_correction = 'Bonferroni'
fdr_level='gene_level' #it can be gene_level or analysis_level

# for SDL prediction DAISY parameters are 
#percentile_threshold = 90
#cn_threshold = 0.3 
#cor_threshold = 0.5
#p_threshold = 0.05


The tumor types are TCGA cancer types; only the cancer types that have corresponding cell lines are listed in the selection box. Click on the tissue(s) you want to analyze (hold Ctrl or Shift to select more than one).

In [None]:
TCGA_list=GetTCGASubtypes(client)
TCGA_list = [i for i in TCGA_list if i]

tumor_type = widgets.SelectMultiple(
    options=['pancancer'] + TCGA_list  ,
    value=[],
    description='Tumor type',
    disabled=False
)
display(tumor_type)

The list of genes whose SL partners we are interested in finding.

In [None]:
gene_list=["BRCA1", "BRCA2", "ARID1A"] # any number of genes in list format

#### 4.1. Pairwise gene coexpression module

In the pairwise co-expression module, DAISY makes inferences based on the assumption that synthetic lethal gene pairs participate in related biological processes and are therefore co-expressed. Gene expression is taken from patient-derived data (TCGA) and from cancer cell line data (CCLE). Pairwise co-expression is estimated with the Spearman correlation, calculated between each gene of interest (each item in the query gene list) and all other genes. By default, candidate synthetic lethal gene pairs are those with a correlation coefficient greater than 0.5 and a Bonferroni-corrected P value smaller than 0.05; both thresholds can be changed.
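The selection rule can be illustrated on toy data with the standard library only. This is a minimal sketch, not the code in `DAISY_operations`: the gene names and expression values are made up, the function names are hypothetical, and the P value is an exact one-sided permutation P value, which is an assumption (the notebook may compute it differently).

```python
from itertools import permutations

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # positions i..j share one rank
        i = j + 1
    return ranks

def spearman_exact(x, y):
    """Return (rho, exact one-sided permutation P value for a positive correlation)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean = (len(rx) + 1) / 2  # mean of the ranks 1..n
    sxx = sum((a - mean) ** 2 for a in rx)
    syy = sum((b - mean) ** 2 for b in ry)
    observed = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    # Count the rank permutations that are at least as concordant as the data
    hits = total = 0
    for perm in permutations(ry):
        total += 1
        stat = sum((a - mean) * (b - mean) for a, b in zip(rx, perm))
        if stat >= observed - 1e-9:
            hits += 1
    return observed / (sxx * syy) ** 0.5, hits / total

# Toy expression profiles across seven samples (made up for illustration)
toy_query = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
toy_candidates = {
    "GENE_A": [1.1, 2.3, 2.9, 4.2, 5.1, 5.8, 7.4],  # perfectly monotone
    "GENE_B": [3.0, 1.0, 4.0, 2.0, 7.0, 5.0, 6.0],  # correlated, but weaker evidence
    "GENE_C": [7.0, 5.0, 6.0, 3.0, 4.0, 1.0, 2.0],  # anti-correlated
}

toy_results = {g: spearman_exact(toy_query, e) for g, e in toy_candidates.items()}
n_tests = len(toy_results)
selected = []
for gene, (rho, p_value) in toy_results.items():
    p_adjusted = min(1.0, p_value * n_tests)  # Bonferroni correction
    if rho > 0.5 and p_adjusted < 0.05:
        selected.append(gene)
print(selected)  # ['GENE_A']
```

`GENE_B` passes the correlation cutoff (rho is about 0.71) but not the corrected significance cutoff, which shows why both filters are applied together.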


4.1.1. Pairwise gene coexpression module on PancancerAtlas.

In [None]:
coexp_pancancer = CoexpressionAnalysis(client, 'SL', "PanCancerAtlas", gene_list , pval_correction, fdr_level, list(tumor_type.value))
try:
    # Keep the pairs that pass both the significance and the correlation thresholds
    coex_pan_intermediate_report = coexp_pancancer.loc[(coexp_pancancer['FDR'] < p_threshold) & (coexp_pancancer['Correlation'] > cor_threshold)]
    # Sort by corrected P value within each gene of interest
    coexp_pancancer_report = coex_pan_intermediate_report.groupby('Inactive').apply(lambda x: x.sort_values('FDR'))
except Exception:
    coexp_pancancer_report = pd.DataFrame()
    print("No results returned.")
    
coexp_pancancer_report

<br>
4.1.2. Pairwise gene coexpression module on CCLE data

In [None]:
coexp_CCLE=CoexpressionAnalysis(client, 'SL', 'CCLE', gene_list, pval_correction, fdr_level, list(tumor_type.value ))
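try:
    # Keep the pairs that pass both the significance and the correlation thresholds
    coex_ccle_intermediate_report = coexp_CCLE.loc[(coexp_CCLE['FDR'] < p_threshold) & (coexp_CCLE['Correlation'] > cor_threshold)]
    # Sort by corrected P value within each gene of interest
    coexp_CCLE_report = coex_ccle_intermediate_report.groupby('Inactive').apply(lambda x: x.sort_values('FDR'))
except Exception:
    coexp_CCLE_report = pd.DataFrame()
    print("No results returned.")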
coexp_CCLE_report    
    

#### 4.2. Genomic survival of fittest module
The genomic survival of the fittest inference module is based on a statistical test of the copy number alteration of each gene in the search domain, comparing the samples in which the gene of interest is inactive (or overactive) with the remaining samples. The gene of interest is considered inactive in a sample if its expression is below the 10th percentile across all samples and its copy number alteration is less than -0.3, or if it carries a nonsense or frameshift (insertion or deletion) mutation. The gene of interest is considered overactive in a sample if its expression exceeds the 90th percentile across all samples and its copy number alteration is greater than 0.3 (overactivity is used in synthetic dosage lethal pair prediction).
A one-sided Wilcoxon rank-sum (Mann-Whitney U) test is applied to the copy number alteration of each candidate partner gene. A higher copy number of the candidate partner in the samples where the gene of interest is inactive (overactive) is taken as an indicator of a synthetic lethal (synthetic dosage lethal) relationship. The SL/SDL pairs whose Bonferroni-corrected P value is less than 0.05 are returned. This inference procedure is applied to PanCancer Atlas and CCLE data separately.
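The inactivity rule can be sketched as follows. This is an illustration under stated assumptions rather than the code in `DAISY_operations`: the function names, the sample identifiers, and the toy values are hypothetical, each sample is assumed to carry at most one mutation class, and the percentile uses linear interpolation, which is only one of several common definitions.

```python
import math

def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def inactive_samples(expression, copy_number, mutations,
                     percentile_threshold=10, cn_threshold=-0.3,
                     input_mutations=("Nonsense_Mutation", "Frame_Shift_Ins",
                                      "Frame_Shift_Del")):
    """Return the set of samples in which the gene of interest is called inactive."""
    cutoff = percentile(list(expression.values()), percentile_threshold)
    inactive = set()
    for sample, expr in expression.items():
        # Rule 1: low expression together with copy number loss
        low_and_lost = expr < cutoff and copy_number[sample] < cn_threshold
        # Rule 2: a mutation class listed in input_mutations
        damaging = mutations.get(sample) in input_mutations
        if low_and_lost or damaging:
            inactive.add(sample)
    return inactive

# Toy data for one gene of interest across ten samples (made up for illustration)
toy_expression = {f"s{i}": float(i) for i in range(1, 11)}
toy_copy_number = {f"s{i}": 0.0 for i in range(1, 11)}
toy_copy_number.update({"s1": -0.8, "s3": -0.5})
toy_mutations = {"s5": "Frame_Shift_Del", "s7": "Missense_Mutation"}

toy_inactive = inactive_samples(toy_expression, toy_copy_number, toy_mutations)
print(sorted(toy_inactive))  # ['s1', 's5']
```

Sample `s3` has a copy number loss but normal expression, and `s7` has a mutation class that is not in `input_mutations`, so neither is called inactive. The copy number values of a candidate partner would then be split into the inactive group and the remaining samples before the rank-sum test.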

4.2.1. Genomic survival of fittest module on CCLE data

In [None]:
sof_CCLE = SurvivalOfFittest(client, 'SL', "CCLE", gene_list,  percentile_threshold, cn_threshold, pval_correction, fdr_level, list(tumor_type.value), input_mutations)
try:
    # Keep the pairs that pass the significance threshold
    sof_ccle_intermediate_report = sof_CCLE.loc[sof_CCLE['FDR'] < p_threshold]
    # Sort by corrected P value within each gene of interest
    sof_ccle_report = sof_ccle_intermediate_report.groupby('Inactive').apply(lambda x: x.sort_values('FDR'))
except Exception:
    sof_ccle_report = pd.DataFrame()
    print("No results returned.")
sof_ccle_report

4.2.2. Genomic survival of fittest module on PancancerAtlas data

In [None]:
sof_pancancer = SurvivalOfFittest(client, 'SL', "PanCancerAtlas", gene_list, percentile_threshold, cn_threshold, pval_correction, fdr_level, list(tumor_type.value), input_mutations)
try:
    # Keep the pairs that pass the significance threshold
    sof_pancancer_intermediate_report = sof_pancancer.loc[sof_pancancer['FDR'] < p_threshold]
    # Sort by corrected P value within each gene of interest
    sof_pancancer_report = sof_pancancer_intermediate_report.groupby('Inactive').apply(lambda x: x.sort_values('FDR'))
except Exception:
    sof_pancancer_report = pd.DataFrame()
    print("No results returned.")
sof_pancancer_report

#### 4.3. Functional examination inference module

The rationale for the functional examination inference module is that if the synthetic lethal partner of a gene is inactive in a given sample, subsequent inactivation of that gene will be lethal. Therefore, for each gene of interest, we first define two groups of samples: one in which the gene is inactive and one in which it is not. We then perform a one-sided Wilcoxon rank-sum (Mann-Whitney U) test on the knockdown/knockout sensitivity of each candidate synthetic lethal partner. Lower viability after knockdown/knockout (that is, higher sensitivity) in the inactive group indicates a potential synthetic lethal interaction (SLI). The synthetic lethal pairs whose test P value is lower than 0.05 are returned. This inference procedure is applied separately to the gene dependency scores (shRNA) and the gene effect scores (CRISPR).
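The one-sided rank-sum comparison can be sketched with the standard library. This is a minimal illustration, not the notebook's implementation: the function name and the scores are made up, and the P value comes from the normal approximation with tie and continuity corrections, which is only approximate for small groups.

```python
import math

def rank_sum_test_less(group_a, group_b):
    """One-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Alternative hypothesis: values in group_a tend to be lower than in group_b.
    Returns (U statistic of group_a, P value from the normal approximation).
    """
    combined = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # tied values share the mean rank
        t = j - i + 1
        tie_term += t ** 3 - t               # used for the tie-corrected variance
        rank_sum_a += mean_rank * sum(1 for k in range(i, j + 1) if combined[k][1] == 0)
        i = j + 1
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                           # all values identical: no evidence
        return u_a, 1.0
    z = (u_a - mean_u + 0.5) / math.sqrt(var_u)   # continuity correction
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    return u_a, p_value

# Toy sensitivity scores of one candidate partner (lower = less viable after knockout)
scores_inactive = [-1.2, -0.9, -1.0, -0.8, -1.1]          # gene of interest inactive
scores_other = [-0.1, 0.0, -0.3, 0.1, -0.2, -0.4, 0.05]   # gene of interest not inactive

u_stat, p_value = rank_sum_test_less(scores_inactive, scores_other)
print(u_stat, p_value < 0.05)  # 0.0 True
```

The same function covers the survival of the fittest test by swapping the arguments, because "higher in the inactive group" is equivalent to "lower in the other group".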

4.3.1. CRISPR based functional examination inference module

In [None]:
crispr_result = FunctionalExamination(client,'SL', "CRISPR", gene_list, percentile_threshold, 
                                      cn_threshold, pval_correction,  fdr_level, list(tumor_type.value), input_mutations )
try:
    # Keep the pairs that pass the significance threshold
    crispr_intermediate_report = crispr_result.loc[crispr_result['PValue'] < p_threshold]
    # Sort by P value within each gene of interest
    crispr_report = crispr_intermediate_report.groupby('Inactive').apply(lambda x: x.sort_values('PValue'))
except Exception:
    crispr_report = pd.DataFrame()
    print("No results returned.")
crispr_report   

<br>
4.3.2. shRNA based functional examination inference module

In [None]:
shRNA_result = FunctionalExamination(client, 'SL', "shRNA", gene_list, percentile_threshold,
                                     cn_threshold, pval_correction, fdr_level, list(tumor_type.value), input_mutations)
try:
    # Keep the pairs that pass the significance threshold
    shRNA_intermediate_report = shRNA_result.loc[shRNA_result['PValue'] < p_threshold]
    # Sort by P value within each gene of interest
    shRNA_report = shRNA_intermediate_report.groupby('Inactive').apply(lambda x: x.sort_values('PValue'))
except Exception:
    shRNA_report = pd.DataFrame()
    print("No results returned.")
shRNA_report

### 5. Integration of results

5.1. Integration of the pairwise gene co-expression results on PanCancer Atlas and CCLE

The union of the results from PanCancer Atlas and CCLE is used.


In [None]:
try:
    # Union of the PanCancer Atlas and CCLE co-expression results
    coexpression_result = UnionResults([coexp_pancancer_report, coexp_CCLE_report], 'SL', ['FDR', 'FDR'], list(tumor_type.value))
    coexpression_result = coexpression_result.sort_values('Inactive')
except Exception:
    coexpression_result = pd.DataFrame()
    print("No Result From Pairwise Co-expression Inference Procedure")
    
coexpression_result

<br>
5.2. Integration of the survival of the fittest results on PanCancer Atlas and CCLE

The union of the results from PanCancer Atlas and CCLE is used.

In [None]:
try:
    # Union of the CCLE and PanCancer Atlas survival of the fittest results
    sof_result = UnionResults([sof_ccle_report, sof_pancancer_report], 'SL', ['FDR', 'FDR'], list(tumor_type.value))
    sof_result = sof_result.sort_values('Inactive')
except Exception:
    sof_result = pd.DataFrame()
    print("No Result From Survival of Fittest Inference Procedure")
sof_result    

<br>
5.3. Integration of the shRNA- and CRISPR-based functional examination results

The union of the results from the shRNA and CRISPR datasets is reported.

In [None]:
try:
    # Union of the CRISPR and shRNA functional examination results
    functional_screening_result = UnionResults([crispr_report, shRNA_report], 'SL', ['PValue', 'PValue'], list(tumor_type.value))
    functional_screening_result = functional_screening_result.sort_values('Inactive')
except Exception:
    functional_screening_result = pd.DataFrame()
    print("No Result From Functional Examination Inference Procedure")
functional_screening_result    

<br>
5.4. Merging the results from all three inference procedures

The final list consists of the SL pairs found by all three inference procedures (the intersection of their results).
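The integration logic (union across datasets within each inference procedure, then intersection across procedures) can be illustrated with plain Python sets. This sketch is not how `UnionResults` or `MergeResults` are implemented, since those functions also carry statistics and tumor type information; `merge_pairs` and the partner gene names are hypothetical.

```python
def merge_pairs(module_results):
    """Intersect, across modules, the union of each module's per-dataset pair sets.

    module_results maps a module name to a list of sets of (gene, partner) pairs,
    one set per dataset.
    """
    per_module = [set().union(*datasets) for datasets in module_results.values()]
    return set.intersection(*per_module)

# Toy (gene of interest, candidate partner) pairs, made up for illustration
toy_results = {
    "coexpression": [
        {("BRCA1", "GENE_A"), ("BRCA1", "GENE_B")},   # PanCancer Atlas
        {("BRCA2", "GENE_C")},                        # CCLE
    ],
    "survival_of_fittest": [
        {("BRCA1", "GENE_A")},                        # CCLE
        {("BRCA2", "GENE_C"), ("BRCA2", "GENE_D")},   # PanCancer Atlas
    ],
    "functional_examination": [
        {("BRCA1", "GENE_A"), ("BRCA2", "GENE_D")},   # CRISPR
        {("BRCA1", "GENE_B")},                        # shRNA
    ],
}

final_pairs = merge_pairs(toy_results)
print(sorted(final_pairs))  # [('BRCA1', 'GENE_A')]
```

A pair only needs support from one dataset per procedure, but it must be supported by every procedure to reach the final list.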


In [None]:
try:
    # Keep the pairs supported by all three inference procedures
    all_merged_results = MergeResults([coexpression_result, sof_result, functional_screening_result], 'SL', list(tumor_type.value))
    all_merged_results = all_merged_results.sort_values('Inactive')
except Exception:
    all_merged_results = pd.DataFrame()
    print("No results found")
all_merged_results

The results can also be saved to an Excel file.

In [None]:
WriteToExcel("DAISY_SL_results.xlsx", [all_merged_results], ["final results"])