### Mutation dependent synthetic lethal pipeline
```
Title:       Mutation Dependent Synthetic Lethal Pipeline (MDSLP)
Author:      Guangrong Qin
Contact:     gqin@systemsbiology.org
Created:     
Description: This notebook identifies gene knockouts or knockdowns to which cell lines carrying mutations in the specified genes are likely to be sensitive.
```
Citations: The functional screening data and omics data for cell lines come from the DepMap and CCLE projects of the Broad Institute (DepMap Public 20Q3). If you use this Jupyter notebook or the data it relies on, please cite the following papers:<br/>

Bahar Tercan, Guangrong Qin, Taek-Kyun Kim, Boris Aguilar, Christopher J. Kemp, Nyasha Chambwe, Ilya Shmulevich. SL-Cloud: A Computational Resource to Support Synthetic Lethal Interaction Discovery. BioRxiv 2021.09.18.459450; doi: https://doi.org/10.1101/2021.09.18.459450

For this DepMap release:
DepMap, Broad (2020): DepMap 20Q3 Public. figshare. Dataset doi:10.6084/m9.figshare.11791698.v2.

For CRISPR datasets:
Robin M. Meyers, Jordan G. Bryan, James M. McFarland, Barbara A. Weir, ... David E. Root, William C. Hahn, Aviad Tsherniak. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nature Genetics 2017 October 49:1779–1784. doi:10.1038/ng.3984. PMID: 29083409

Dempster, J. M., Rossen, J., Kazachkova, M., Pan, J., Kugener, G., Root, D. E., & Tsherniak, A. (2019). Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines. BioRxiv, 720243.

For omics datasets:
Mahmoud Ghandi, Franklin W. Huang, Judit Jané-Valbuena, Gregory V. Kryukov, ... Todd R. Golub, Levi A. Garraway & William R. Sellers. 2019. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019). PMID: 31068700


In [9]:
#Check the required libraries
try:
    from google.cloud import bigquery
    print("module 'google-cloud-bigquery' is installed")
except ModuleNotFoundError:
    !pip install google-cloud-bigquery
    from google.cloud import bigquery

try:
    import ipywidgets as widgets
    print("module 'ipywidgets' is installed")
except ModuleNotFoundError:
    !pip install ipywidgets
    import ipywidgets as widgets

try:
    import pyarrow
    print("module 'pyarrow' is installed")
except ModuleNotFoundError:
    !pip install pyarrow
    import pyarrow

try:
    import pandas as pd
    print("module 'pandas' is installed")
except ModuleNotFoundError:
    !pip install pandas
    import pandas as pd

try:
    import numpy as np
    print("module 'numpy' is installed")
except ModuleNotFoundError:
    !pip install numpy
    import numpy as np

try:
    from scipy import stats    
    print("module 'scipy' is installed")
except ModuleNotFoundError:
    !pip install scipy
    from scipy import stats    

try:
    import statsmodels.stats.multitest as multi   
    print("module 'statsmodels' is installed")
except ModuleNotFoundError:
    !pip install statsmodels
    import statsmodels.stats.multitest as multi

try:
    from MDSLP import MDSLP
    print("module 'MDSLP' is installed")
except ModuleNotFoundError:
    !pip install -i https://test.pypi.org/simple/ MDSLP==0.2
    from MDSLP import MDSLP
        

module 'google-cloud-bigquery' is installed
module 'ipywidgets' is installed
module 'pyarrow' is installed
module 'pandas' is installed
module 'numpy' is installed
module 'scipy' is installed
module 'statsmodels' is installed
module 'MDSLP' is installed


In [3]:
# Users need to run the following command on their local machine or through the notebook.
# Make sure the Google Cloud SDK is installed in the local environment. For details on installing gcloud, see https://cloud.google.com/sdk/docs/install

!gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=uUsaONgBNmYNlZgP8yv7rofipnYfTJ&access_type=offline&code_challenge=6SHI2biCQEQtUyfUiLvFZDOGX0FLJ5sHDseeT1k2W_M&code_challenge_method=S256


Credentials saved to file: [/Users/guangrong/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "isb-cgc-04-0002" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.


In [None]:
%load_ext google.cloud.bigquery

#### Set user input 1:
###### 1. Data_source: the desired data source, either "shRNA" or "Crispr". datatype: string
###### 2. Mutated genes to be investigated: A list of either a single gene or multiple genes. datatype: list 
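
Both inputs can be checked before any query is sent. The following is a minimal sketch and is not part of MDSLP; `validate_inputs` and `VALID_SOURCES` are hypothetical names introduced only for illustration.

```python
# Hypothetical helper (not part of MDSLP): validate the two user inputs.
VALID_SOURCES = ("shRNA", "Crispr")

def validate_inputs(data_source, gene_list):
    """Return a cleaned gene list, or raise ValueError on bad input."""
    if data_source not in VALID_SOURCES:
        raise ValueError(f"Data_source must be one of {VALID_SOURCES}, got {data_source!r}")
    if not isinstance(gene_list, list) or not gene_list:
        raise ValueError("Gene_list must be a non-empty list of gene symbols")
    # Strip surrounding whitespace and drop duplicates while keeping input order.
    return list(dict.fromkeys(g.strip() for g in gene_list))

checked_genes = validate_inputs("shRNA", ["BRCA2", " BRCA2", "TP53"])
print(checked_genes)  # ['BRCA2', 'TP53']
```

Failing early with a `ValueError` avoids running BigQuery queries with an unsupported data source.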


In [13]:
# Users need to authenticate with their Google Cloud project to query the data in the BigQuery tables.
project_id='syntheticlethality' # replace this ID with your own Google Cloud project ID

# User input. The natural-language question asked here is: which genes show a synthetic lethal interaction with the target gene?
Data_source = "shRNA" # only two options are available, "shRNA" or "Crispr"; datatype: string

Gene_list = ['BRCA2'] # data type: list of gene symbols



#### Set user input 2:
###### Tumor types being considered. Users can select one or multiple tumor types for analysis. 
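
As a rough illustration of how a tumor-type selection could translate into a set of cell lines, the sketch below filters a toy table by `primary_disease`. It assumes that `'pancancer'` means "all annotated cell lines"; `select_cell_lines` is a hypothetical helper, the IDs are made up, and the actual selection logic inside MDSLP may differ.

```python
import pandas as pd

# Toy stand-in for the sample_info table; the real one is queried from BigQuery.
toy_sample_info = pd.DataFrame({
    "DepMap_ID": ["ACH-1", "ACH-2", "ACH-3", "ACH-4"],  # made-up IDs
    "primary_disease": ["Lung Cancer", "Lymphoma", "Lung Cancer", None],
})

def select_cell_lines(sample_table, tumor_types):
    """Hypothetical helper: return DepMap_IDs for the selected tumor types."""
    annotated = sample_table.dropna(subset=["primary_disease"])
    if "pancancer" in tumor_types:
        # Assumption: 'pancancer' selects every annotated cell line.
        return annotated["DepMap_ID"].tolist()
    mask = annotated["primary_disease"].isin(tumor_types)
    return annotated.loc[mask, "DepMap_ID"].tolist()

lung_lines = select_cell_lines(toy_sample_info, ["Lung Cancer"])
all_lines = select_cell_lines(toy_sample_info, ["pancancer"])
print(lung_lines)
print(all_lines)
```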


In [14]:
# Create the BigQuery client used to run the query below
client = bigquery.Client(project_id)

query = '''
SELECT DepMap_ID, primary_disease, TCGA_subtype
FROM `syntheticlethality.DepMap_public_20Q3.sample_info_Depmap_withTCGA_labels`
'''
sample_info = client.query(query).result().to_dataframe()

# Drop non-cancer entries and cell lines without a disease annotation
pancancer_cls = sample_info.loc[~sample_info['primary_disease'].isin(['Non-Cancerous','Unknown','Engineered','Immortalized'])]
pancancer_cls = pancancer_cls.loc[~(pancancer_cls['primary_disease'].isna())]

# Unique disease names offered in the selection widget (missing values were removed above)
TCGA_list = list(pancancer_cls['primary_disease'].unique())

tumor_type = widgets.SelectMultiple(
    options=['pancancer'] + TCGA_list  ,
    value=[],
    description='Tumor type',
    disabled=False
)
display(tumor_type)



SelectMultiple(description='Tumor type', options=('pancancer', 'Lymphoma', 'Lung Cancer', 'Gastric Cancer', 'M…

#### Get mutation data from CCLE, CRISPR gene knockout effects from DepMap, and shRNA gene knockdown dependency data from DEMETER2 v6. DepMap version 20Q3 is used for the following analysis.

In [11]:
# Query data resources for further analysis
client = bigquery.Client(project_id)

# ID mapping between the CCLE annotation and input gene symbols
id_mapping, Gene_list_matched = MDSLP.GeneSymbol_standardization(Gene_list, project_id)

# get the mutation data, shRNA data or Crispr dataset
Mut_mat = MDSLP.get_ccle_mutation_data(project_id) # Get mutation table for the ccle cell lines (version: Depmap 20Q3)

if Data_source == "shRNA" :
    Demeter_data = MDSLP.get_demeter_shRNA_data(project_id) # Get shRNA-based gene knockdown effects from the Depmap project (Demeter2)
elif Data_source == "Crispr": 
    Depmap_matrix = MDSLP.get_depmap_crispr_data(project_id) # Get the CRISPR-based gene knockout effects from the Depmap project (version: Depmap 20Q3)
else:
    # Stop here so that later cells do not fail on an undefined dependency matrix
    raise ValueError('Data_source has only two options: "shRNA" or "Crispr"')


Unnamed: 0
AZ521_STOMACH
GISTT1_GASTROINTESTINAL_TRACT
MB157_BREAST
SW527_BREAST


##### Expected output
The cell above is expected to print the following message:<br/>
Unnamed: 0 <br/>
AZ521_STOMACH<br/>
GISTT1_GASTROINTESTINAL_TRACT<br/>
MB157_BREAST<br/>
SW527_BREAST<br/>
<br/>
These cell lines are excluded from the analysis because their annotations do not match across the different datasets.
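
One way such mismatches could be detected is to compare the name recorded for each cell-line ID in two datasets and set aside the IDs whose names disagree. The sketch below uses made-up IDs and names; `split_by_annotation` is a hypothetical helper and does not reproduce the check performed inside MDSLP.

```python
# Made-up ID -> cell-line-name annotations from two hypothetical datasets.
names_in_mutation_data = {"ID1": "LINE_A_LUNG", "ID2": "LINE_B_BREAST", "ID3": "LINE_C_STOMACH"}
names_in_dependency_data = {"ID1": "LINE_A_LUNG", "ID2": "LINE_B_SKIN", "ID3": "LINE_C_STOMACH"}

def split_by_annotation(first, second):
    """Hypothetical helper: split shared IDs into consistent and mismatched lists."""
    consistent, mismatched = [], []
    for cell_id in sorted(first.keys() & second.keys()):
        if first[cell_id] == second[cell_id]:
            consistent.append(cell_id)
        else:
            mismatched.append(cell_id)
    return consistent, mismatched

kept_ids, excluded_ids = split_by_annotation(names_in_mutation_data, names_in_dependency_data)
print("kept:", kept_ids)          # kept: ['ID1', 'ID3']
print("excluded:", excluded_ids)  # excluded: ['ID2']
```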


#### Run the pipeline on the selected shRNA or Crispr dataset to infer synthetic lethal pairs for the mutated genes

In [17]:
# Pick the dependency matrix that matches the selected data source
if Data_source == "shRNA":
    dependency_data = Demeter_data
elif Data_source == "Crispr":
    dependency_data = Depmap_matrix
else:
    raise ValueError('Data_source has only two options: "shRNA" or "Crispr"')

result = MDSLP.Mutational_based_SL_pipeline(list(tumor_type.value), Gene_list_matched, Mut_mat, dependency_data, Data_source, project_id)

if result.shape[0] > 0:
    # Keep significant pairs (FDR < 0.05) with a negative effect size; ES < 0 represents SL pairs
    result_sig = result.loc[(result['FDR_all_exp'] < 0.05) & (result['ES'] < 0)]
else:
    result_sig = pd.DataFrame()



Gene mutated: BRCA2
Number of samples with mutation: 116


In [18]:
result_sig.sort_values(by = ['FDR_all_exp'])

| Unnamed: 0 | Gene_mut | Gene_mut_symbol | Gene_kd | Gene_kd_symbol | Mutated_samples | pvalue | ES | FDR_by_gene | FDR_all_exp | Tumor_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 3005 | BRCA2 | BRCA2 | CTNNB1 | CTNNB1 | 116 | 2.026703e-16 | -0.858768 | 2.214368e-12 | 2.214368e-12 | pancancer |
| 10876 | BRCA2 | BRCA2 | DDX27 | DDX27 | 116 | 2.558632e-16 | -0.855453 | 2.214368e-12 | 2.214368e-12 | pancancer |
| 3479 | BRCA2 | BRCA2 | DHX9 | DHX9 | 116 | 3.563206e-13 | -0.755262 | 2.055851e-09 | 2.055851e-09 | pancancer |
| 4512 | BRCA2 | BRCA2 | SCAP | SH2D2A | 116 | 6.567143e-12 | -0.711796 | 2.841767e-08 | 2.841767e-08 | pancancer |
| 3619 | BRCA2 | BRCA2 | DLST | DLST | 116 | 1.186358e-11 | -0.702331 | 4.106935e-08 | 4.106935e-08 | pancancer |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8535 | BRCA2 | BRCA2 | MAP3K11 | MAP3K11 | 116 | 1.557562e-03 | -0.323304 | 4.854290e-02 | 4.854290e-02 | pancancer |
| 11083 | BRCA2 | BRCA2 | DMAP1 | DMAP1 | 116 | 1.558155e-03 | -0.323325 | 4.854290e-02 | 4.854290e-02 | pancancer |
| 8250 | BRCA2 | BRCA2 | ZNF658B | ZNF658B | 81 | 1.586154e-03 | -0.383814 | 4.920205e-02 | 4.920205e-02 | pancancer |
| 4837 | BRCA2 | BRCA2 | SRGAP2 | SRGAP3 | 114 | 1.600124e-03 | -0.324791 | 4.936995e-02 | 4.936995e-02 | pancancer |


In [None]:
result_sig.to_csv("result_sig.csv")

###### Result interpretation 
The `result_sig` table contains the synthetic lethal gene pairs predicted by this pipeline.<br/>
###### Table annotation:
Gene_mut: mutated gene<br/>
Gene_kd: gene that is knocked down (shRNA) or knocked out (CRISPR)<br/>
Mutated_samples: number of mutated cell lines in the selected tumor type<br/>
pvalue: p-value from the t-test<br/>
ES: effect size of the difference in gene effect between the mutated group and the wild-type group<br/>
FDR_all_exp: FDR-adjusted p-value across all tests in the analysis<br/>
FDR_by_gene: FDR-adjusted p-value computed separately for each mutated gene<br/>
Tumor_type: tumor type(s) included in the analysis
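
To make these columns more concrete, the sketch below computes a p-value, an effect size, and FDR-adjusted p-values on synthetic data. It rests on several assumptions that the notebook does not confirm: the effect size is taken to be Cohen's d with a pooled standard deviation, and the FDR correction is taken to be Benjamini-Hochberg, written out by hand here for transparency (the notebook imports `statsmodels.stats.multitest`, whose `multipletests(..., method='fdr_bh')` offers the same correction). All numbers are made up, and the exact computation inside MDSLP may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic gene-effect scores for one knocked-down gene (made-up numbers).
mutated = rng.normal(loc=-0.8, scale=0.5, size=30)     # cell lines carrying the mutation
wild_type = rng.normal(loc=0.0, scale=0.5, size=200)   # wild-type cell lines

# Two-sample t-test between the mutated and wild-type groups.
t_stat, p_value = stats.ttest_ind(mutated, wild_type)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation (assumed ES)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

effect_size = cohens_d(mutated, wild_type)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / np.arange(1, len(p) + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty_like(p)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

# Pretend three other knockdowns were tested alongside this one.
fdr = benjamini_hochberg([p_value, 0.04, 0.5, 0.9])

# Same filter as in the notebook: significant after correction and ES < 0.
is_sl_candidate = (fdr[0] < 0.05) and (effect_size < 0)
print(f"p={p_value:.3g}, ES={effect_size:.3f}, FDR={fdr[0]:.3g}, candidate={is_sl_candidate}")
```

A negative effect size here means that the mutated group has lower gene-effect scores (stronger dependency) than the wild-type group, which matches the `ES < 0` filter used to select synthetic lethal pairs.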