### About this notebook:
This notebook identifies which gene knockouts or gene knockdowns show sensitivity to the mutation of a given gene or of a group of genes. <br/>

<font color='blue'>The functional screening data and omics data for cell lines come from the DepMap and CCLE projects at the Broad Institute (DepMap Public 20Q3). If you use this Jupyter notebook or the data it relies on, please cite the following papers:</font> <br/>

....our paper

For this DepMap release:
DepMap, Broad (2020): DepMap 20Q3 Public. figshare. Dataset doi:10.6084/m9.figshare.11791698.v2.

For CRISPR datasets:
Robin M. Meyers, Jordan G. Bryan, James M. McFarland, Barbara A. Weir, ... David E. Root, William C. Hahn, Aviad Tsherniak. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nature Genetics 2017 October 49:1779–1784. doi:10.1038/ng.3984. PMID: 29083409

Dempster, J. M., Rossen, J., Kazachkova, M., Pan, J., Kugener, G., Root, D. E., & Tsherniak, A. (2019). Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines. BioRxiv, 720243.

For omics datasets:
Mahmoud Ghandi, Franklin W. Huang, Judit Jané-Valbuena, Gregory V. Kryukov, ... Todd R. Golub, Levi A. Garraway & William R. Sellers. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019). PMID: 31068700

For more detailed information, please contact gqin@systemsbiology.org


In [1]:
#The following libraries are needed.
from google.cloud import bigquery
import pandas as pd
import numpy as np
import ipywidgets as widgets
from scipy import stats 
import statsmodels.stats.multitest as multi
import sys
sys.path.append('../Scripts/')
import MDSLP

In [None]:
# Run the following command on your local machine or from within the notebook.
# Make sure the Google Cloud SDK is installed in your local environment. For installation details, see https://cloud.google.com/sdk/docs/install

!gcloud auth application-default login

In [2]:
%load_ext google.cloud.bigquery

In [3]:
# A Google Cloud project is required to query the data in the BigQuery tables.
project_id='syntheticlethality'
client = bigquery.Client(project_id)

#### Get mutation data from CCLE, CRISPR gene knockout effects from DepMap, and shRNA gene knockdown dependency data from DEMETER2 v6. DepMap version 20Q3 is used for the following analysis.

In [4]:
# This step may take a while; grab a cup of coffee and relax.
Mut_mat = MDSLP.get_ccle_mutation_data()
Demeter_data = MDSLP.get_demeter_shRNA_data()
Depmap_matrix = MDSLP.get_depmap_crispr_data()

Unnamed: 0
AZ521_STOMACH
GISTT1_GASTROINTESTINAL_TRACT
MB157_BREAST
SW527_BREAST


##### Expected output
The cell above is expected to print the following message:<br/>
Unnamed: 0 <br/>
AZ521_STOMACH<br/>
GISTT1_GASTROINTESTINAL_TRACT<br/>
MB157_BREAST<br/>
SW527_BREAST<br/>
<br/>
It means the listed cell lines are excluded from the analysis because their annotations do not match across datasets.


#### Set user input:
###### 1. Data_source: only two options are available, "shRNA" or "Crispr"; data type: string
###### 2. Mutated genes of interest: a list of gene symbols; a single gene must also be given in list format.
###### 3. Tumor types to include in the analysis: select 'pancancer', or one or more tumor types of interest.
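
The inputs described above could be checked before the pipeline is run. The sketch below is illustrative only: `validate_inputs` is a hypothetical helper that is not part of MDSLP, and the checks simply mirror the three rules listed here.

```python
def validate_inputs(data_source, gene_list, tumor_types):
    """Hypothetical helper: raise ValueError if a user input has an unexpected form."""
    if data_source not in ("shRNA", "Crispr"):
        raise ValueError('Data_source must be "shRNA" or "Crispr"')
    if not isinstance(gene_list, list) or not gene_list:
        raise ValueError("Gene_list must be a non-empty list, e.g. ['BRCA2']")
    if not all(isinstance(gene, str) for gene in gene_list):
        raise ValueError("Gene_list must contain gene symbols as strings")
    if not tumor_types:
        raise ValueError("Select 'pancancer' or at least one tumor type")
    return True


# Example call with the values used in this notebook
validate_inputs("shRNA", ["BRCA2"], ["pancancer"])
```

Failing early with a clear message is cheaper than discovering a misspelled data source after the data have been loaded.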

In [None]:
# User input. The question asked here is: which genes show synthetic lethality with the mutated gene?
Data_source = "shRNA" # only two options are available, "shRNA" or "Crispr"; data type: string
Gene_list = ['BRCA2'] # data type: list of gene symbols


In [None]:
# ID mapping between the CCLE annotation and input gene symbols
id_mapping, Gene_list_matched = MDSLP.GeneSymbol_standardization(Gene_list)


#### Select tumor types

In [None]:
query = ''' 
SELECT DepMap_ID, primary_disease,TCGA_subtype
FROM `syntheticlethality.DepMap_public_20Q3.sample_info_Depmap_withTCGA_labels` 
'''
sample_info = client.query(query).result().to_dataframe()

# Exclude non-cancer entries and cell lines without a primary disease annotation
pancancer_cls = sample_info.loc[~sample_info['primary_disease'].isin(['Non-Cancerous','Unknown','Engineered','Immortalized'])]
pancancer_cls = pancancer_cls.loc[~(pancancer_cls['primary_disease'].isna())]

# Unique tumor types; missing values were already removed above
TCGA_list = sorted(pancancer_cls['primary_disease'].unique())

tumor_type = widgets.SelectMultiple(
    options=['pancancer'] + TCGA_list  ,
    value=[],
    description='Tumor type',
    disabled=False
)
display(tumor_type)

#### Select the shRNA or CRISPR dataset to infer synthetic lethal pairs for the mutated genes

In [None]:
Data_source = "shRNA"
result_shRNA_sig = pd.DataFrame()  # stays empty if the pipeline returns no results
if Data_source == "shRNA":
    result_shRNA = MDSLP.Mutational_based_SL_pipeline(list(tumor_type.value), Gene_list_matched, Mut_mat, Demeter_data, Data_source)
    if result_shRNA.shape[0] > 0:
        # Keep significant pairs (FDR < 0.05) with a negative effect size; ES < 0 represents SL pairs
        result_shRNA_sig = result_shRNA.loc[(result_shRNA['FDR_all_exp'] < 0.05) & (result_shRNA['ES'] < 0)]

In [None]:
result_shRNA_sig.to_csv("BRCA2_shRNA_sig.csv")

In [None]:
Data_source = "Crispr"
result_Crispr_sig = pd.DataFrame()  # stays empty if the pipeline returns no results
if Data_source == "Crispr":
    result_Crispr = MDSLP.Mutational_based_SL_pipeline(list(tumor_type.value), Gene_list_matched, Mut_mat, Depmap_matrix, Data_source)
    if result_Crispr.shape[0] > 0:
        # Keep significant pairs (FDR < 0.05) with a negative effect size; ES < 0 represents SL pairs
        result_Crispr_sig = result_Crispr.loc[(result_Crispr['FDR_all_exp'] < 0.05) & (result_Crispr['ES'] < 0)]

In [None]:
result_Crispr_sig.to_csv("BRCA2_Crispr_sig.csv")

###### Result interpretation 
result_Crispr_sig and result_shRNA_sig contain the synthetic lethal gene pairs predicted by this pipeline.<br/>
###### Table annotation:
Gene_mut: mutated gene<br/>
Gene_kd: gene that is knocked down or knocked out<br/>
Mutated_samples: number of mutated cell lines in the selected tumor type<br/>
pvalue: p-value from the t-test<br/>
ES: effect size of the gene effect between the mutated group and the wild-type group<br/>
FDR_all_exp: FDR computed over the p-values of all analyses<br/>
FDR_by_gene: FDR computed over the p-values for one mutated gene<br/>
Tumor_type: tumor type(s) in the analysis
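
As a rough illustration of how the columns above relate to each other, the sketch below computes a p-value, an effect size, and an FDR for three synthetic knockdown genes, then applies the same filter used in this notebook. It is not the MDSLP implementation. The data are randomly generated, the gene names are placeholders, the t-test is assumed to be Welch's variant, the effect size is assumed to be Cohen's d (mutant minus wild type), and `cohens_d` and `benjamini_hochberg` are hypothetical helpers written for this example.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cohens_d(mutant, wild_type):
    """Assumed effect size: mutant mean minus wild-type mean, in pooled-SD units."""
    n1, n2 = len(mutant), len(wild_type)
    pooled_var = ((n1 - 1) * np.var(mutant, ddof=1)
                  + (n2 - 1) * np.var(wild_type, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(mutant) - np.mean(wild_type)) / np.sqrt(pooled_var)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper for this sketch)."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return adjusted


rng = np.random.default_rng(0)
rows = []
# Synthetic gene effects: GENE_A is more essential in mutant lines, the others are not.
for gene, shift in [("GENE_A", -1.5), ("GENE_B", 0.0), ("GENE_C", 0.0)]:
    mutant = rng.normal(loc=shift, scale=0.5, size=15)
    wild_type = rng.normal(loc=0.0, scale=0.5, size=60)
    _, pvalue = stats.ttest_ind(mutant, wild_type, equal_var=False)
    rows.append({"Gene_kd": gene, "pvalue": pvalue, "ES": cohens_d(mutant, wild_type)})

toy_result = pd.DataFrame(rows)
toy_result["FDR_all_exp"] = benjamini_hochberg(toy_result["pvalue"])

# Same filter as in the notebook: significant pairs with a negative effect size.
toy_sig = toy_result.loc[(toy_result["FDR_all_exp"] < 0.05) & (toy_result["ES"] < 0)]
print(toy_sig)
```

A negative ES here means the knockdown hurts mutant cell lines more than wild-type ones, which is why only rows with ES < 0 are kept as candidate synthetic lethal pairs.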

In [None]:
# User defined analysis

### 2. Tumor specific analysis

In [None]:
# Add 'pancancer' once, even if this cell is run more than once
if 'pancancer' not in TCGA_list:
    TCGA_list.append('pancancer')

In [None]:
# This analysis may take a couple of minutes; grab a cup of coffee.
Gene_list = ['ARID1A']
pan_cancer_result =  pd.DataFrame()
for tumor in TCGA_list:
    print(tumor)
    Data_source = "shRNA"
    if Data_source == "shRNA":
        result_shRNA = MDSLP.Mutational_based_SL_pipeline([tumor], Gene_list, Mut_mat, Demeter_data, Data_source)
        if result_shRNA.shape[0] > 0:
            result_shRNA_ARID1B = result_shRNA.loc[result_shRNA['Gene_kd_symbol'] =='ARID1B']
            pan_cancer_result = pd.concat([pan_cancer_result, result_shRNA_ARID1B])
            

In [None]:
pan_cancer_result['source'] = 'MDSLP-shRNA'  # label every row with its data source

In [None]:
Gene_list = ['ARID1A']
pan_cancer_result_crispr =  pd.DataFrame()
for tumor in TCGA_list:
    print(tumor)
    Data_source = "Crispr"
    if Data_source == "Crispr":
        result_crispr = MDSLP.Mutational_based_SL_pipeline([tumor], Gene_list, Mut_mat, Depmap_matrix, Data_source)
        if result_crispr.shape[0] > 0:
            result_crispr_ARID1B = result_crispr.loc[result_crispr['Gene_kd_symbol'] =='ARID1B']
            pan_cancer_result_crispr = pd.concat([pan_cancer_result_crispr, result_crispr_ARID1B])
            

In [None]:
pan_cancer_result_crispr['source'] = 'MDSLP-CRISPR'  # label every row with its data source

In [None]:
result = pd.concat([pan_cancer_result_crispr,pan_cancer_result])

In [None]:
# Use the base-10 logarithm so that FDR = 0.05 maps to 1.301029996, the threshold drawn in the plot
result['-log(FDR)'] = -np.log10(result['FDR_all_exp'])

In [None]:
result.to_csv("tumor_specific_analysis_ARID1A_ARID1B.csv")

#### Visualization of the results

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize = [4,4], dpi = 300)

clrs = []
for x in range(0,7):
    clrs.append('#5477b4')
    clrs.append('#dc895a')

ax = sns.barplot(x="ES", y="Tumor_type", hue="source",data=result,
                 orient = 'h', 
                 order = ['pancancer',
                          'Ovarian Cancer',
                          
                          'Gastric Cancer',
                          'Colon/Colorectal Cancer',
                          'Bladder Cancer',
                          'Lung Cancer',
                          'Endometrial/Uterine Cancer',
                          'Pancreatic Cancer',
                          'Leukemia'
                          ],
                palette = clrs)

ax.set_xlabel('Effect size (Mut - WT)', fontsize=14)
ax.set_ylabel('')  # hide the y-axis label
ax.set(xlim=(-2, 0))
ax.legend(loc='lower left', fontsize=8)  # set legend position and text size in one call


In [None]:
clrs = []
for x in range(0,7):
    clrs.append('#5477b4')
    clrs.append('#dc895a')
    
plt.figure(figsize = [4,4], dpi = 300)
ax1 = sns.barplot(x="-log(FDR)", y="Tumor_type", hue="source",
                  data=result,
                  orient = 'h' ,
                  order = ['pancancer',
                             'Ovarian Cancer',
                             'Gastric Cancer',
                             'Colon/Colorectal Cancer',
                             'Bladder Cancer',
                             'Lung Cancer', 
                             'Endometrial/Uterine Cancer',
                             'Pancreatic Cancer',
                             'Leukemia'
                            ],
                  palette = clrs)
ax1.set_xlabel('Statistical Significance', fontsize=14)  # -log(FDR)
ax1.set_ylabel('')  # hide the y-axis label
ax1.legend(loc='lower right', fontsize=8)  # set legend position and text size in one call

# Mark the significance threshold across the full plot height: 1.301029996 = -log10(0.05)
ax1.axvline(1.301029996, color='k', lw=0.5)
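
The threshold value 1.301029996 used in the plot is the base-10 form of the FDR cutoff, -log10(0.05). The short check below, which uses only the standard library, reproduces that number and shows that the natural logarithm gives a different one. The FDR column therefore needs the same base-10 transformation for the threshold line to sit in the right place.

```python
import math

fdr_cutoff = 0.05
threshold_log10 = -math.log10(fdr_cutoff)  # about 1.301, the value drawn in the plot
threshold_ln = -math.log(fdr_cutoff)       # about 2.996, natural logarithm

print(round(threshold_log10, 9))
print(round(threshold_ln, 3))
```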


In [None]:
result.loc[result['source'] == 'MDSLP-CRISPR'].sort_values(by = ['FDR_all_exp'])

In [None]:
## End analysis