# DAISY: the DAta-mIning SYnthetic-lethality-identification pipeline

Please cite: 

For Implementation: 

Our paper,

For DAISY algorithm: 

Jerby-Arnon, L., Pfetzer, N., Waldman, Y. Y., McGarry, L., James, D., Shanks, E., ... & Gottlieb, E. (2014). Predicting cancer-specific vulnerability via data-driven detection of synthetic lethality. Cell, 158(5), 1199-1209.

For CCLE Omics data:

Ghandi, M., Huang, F.W., Jané-Valbuena, J. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019). https://doi.org/10.1038/s41586-019-1186-3

For CRISPR Data: 

Robin M. Meyers, Jordan G. Bryan, James M. McFarland, Barbara A. Weir, ... David E. Root, William C. Hahn, Aviad Tsherniak. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nature Genetics 2017 October 49:1779–1784. doi:10.1038/ng.3984

Dempster, J. M., Rossen, J., Kazachkova, M., Pan, J., Kugener, G., Root, D. E., & Tsherniak, A. (2019). Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines. BioRxiv, 720243.


This notebook is a reimplementation of the DAISY synthetic lethal (SL) pair prediction algorithm.

Please run the table_creation notebook before running the DAISY notebook.

It consists of three modules:

1. SL candidate determination using gene co-expression
2. SL candidate determination using survival of the fittest
3. SL candidate determination using CRISPR and shRNA experiments


* The results from the three modules are then aggregated into one ranked list of candidate SL pairs


Input Parameters
* Cancer type 
* The genes whose SL partners are sought


Input Data
* Gene expression data 
* Gene mutation data
* Copy number variation data
* Gene effect data (CRISPR)
* Gene Dependency scores data (shRNA)

Output
* Ranked list of candidate SL pairs
![../../figures/daisy_pipeline.png](attachment:dene.png)

In [1]:
reset 

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [2]:
pwd

'/Users/bahar/Desktop/DAISY_imp/SL-Cloud-main/DAISY_pipeline'

### 1. Import the required Python libraries
The cell below imports the required libraries, including the local modules in `../scripts/`.

In [12]:
from datetime import datetime
import sys
sys.path.append('../scripts/')  # make the "scripts" directory, one level up, importable
from google.cloud import bigquery
import importlib
import pandas as pd
import DAISY_operations
importlib.reload(DAISY_operations)
from DAISY_operations import *
from helper_functions import *
from BIGQUERY_operations import *

In [13]:
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

### 2. Sign in to Google BigQuery with the project ID

BigQuery connection.
Please replace `syntheticlethality` with your own project name.

In [14]:
project_id='syntheticlethality'
client = bigquery.Client(project_id)
#client = bigquery.Client(credentials=credentials, project=credentials.project_id)

!gcloud auth login


In [15]:
%load_ext google.cloud.bigquery

The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


In [16]:
%%bigquery tsg

UsageError: %%bigquery is a cell magic, but the cell body is empty.


### 3. Input genes of interest for the SL partner prediction

By default, we predict synthetic lethal partner genes for tumor suppressor genes.
The query uses a BigQuery table of tumor suppressor genes that requires access permission.
To run it, you need to register for a COSMIC account and obtain permission.
Please follow this link to register: https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/data/COSMIC_about.html.
If you want to test your own genes of interest instead, skip this step and assign them to the variable "input_genes".
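If you skip the COSMIC query, `input_genes` only needs to be a plain list of HUGO gene symbols. A minimal sketch, assuming you want to strip stray whitespace and drop duplicates first (the helper name `clean_gene_list` is hypothetical and not part of the DAISY scripts):

```python
def clean_gene_list(symbols):
    """Strip whitespace and de-duplicate gene symbols, keeping the input order."""
    seen = set()
    cleaned = []
    for symbol in symbols:
        normalized = symbol.strip()
        # Skip empty strings and symbols that were already added.
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Example symbols taken from genes that appear in this notebook.
input_genes = clean_gene_list(["ATM", " APC", "ATR", "ATM"])
print(input_genes)
```

The case of each symbol is left untouched on purpose, because some official symbols contain lowercase letters.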

In [17]:
query='''
SELECT Gene_Symbol
  FROM `isb-cgc.COSMIC_v90_grch38.Cancer_Gene_Census` 
 WHERE Role_in_Cancer LIKE '%TSG%'

INTERSECT DISTINCT

SELECT HGNC_gene_symbol
  FROM `syntheticlethality.gene_information.cancer_driver_genes`
 
'''
driver_tsg_genes = client.query(query).result().to_dataframe()

<br>
Conversion from HUGO symbols to Entrez IDs

In [18]:
input_genes = driver_tsg_genes["Gene_Symbol"].to_list()
input_entrez_ids = ConvertGene(client, input_genes, 'Gene', ['EntrezID'])
input_entrez_ids

| | Gene | EntrezID |
|---|---|---|
| 0 | APC | 324 |
| 1 | ATM | 472 |
| 2 | ATR | 545 |
| 3 | B2M | 567 |
| 4 | CIC | 23152 |
| ... | ... | ... |
| 108 | PRKAR1A | 5573 |
| 109 | SMARCA4 | 6597 |
| 110 | SMARCB1 | 6598 |
| 111 | TBL1XR1 | 79718 |
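The module calls in the next section take Entrez IDs rather than symbols; `[472]` is the Entrez ID listed for ATM in the output above. A small sketch of that lookup, using a few rows copied from the output as plain records (the `entrez_for` helper is hypothetical; in the notebook the mapping lives in the `input_entrez_ids` DataFrame):

```python
# A few (Gene, EntrezID) rows copied from the conversion output.
records = [
    {"Gene": "APC", "EntrezID": 324},
    {"Gene": "ATM", "EntrezID": 472},
    {"Gene": "ATR", "EntrezID": 545},
    {"Gene": "B2M", "EntrezID": 567},
]


def entrez_for(symbols, mapping_records):
    """Return the Entrez IDs for the given symbols, failing loudly on unknown ones."""
    lookup = {row["Gene"]: row["EntrezID"] for row in mapping_records}
    missing = [symbol for symbol in symbols if symbol not in lookup]
    if missing:
        raise KeyError(f"No Entrez ID found for: {missing}")
    return [lookup[symbol] for symbol in symbols]


atm_ids = entrez_for(["ATM"], records)
print(atm_ids)
```

With the DataFrame itself, the equivalent pandas expression would be `input_entrez_ids.loc[input_entrez_ids["Gene"] == "ATM", "EntrezID"].tolist()`.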


### 4. Prediction of synthetic lethal partners using different modules on DAISY


DAISY has three modules for inferring synthetic lethal pairs: (1) pairwise gene coexpression, (2) genomic survival of the fittest, and (3) shRNA- or CRISPR-based functional examination. More information is available in the original paper: https://www.sciencedirect.com/science/article/pii/S0092867414009775.

The pairwise gene coexpression and genomic survival of the fittest modules use PanCancerAtlas and CCLE data.<br>
The functional examination module uses CRISPR and shRNA data.<br>

The Python code for each module is in our internal library (../scripts/SL_library.py), which was imported at the beginning.


#### 4.0. Default parameters for DAISY (you can edit them)

In [19]:
input_mutations = ['Nonsense_Mutation', 'Frame_Shift_Ins', 'Frame_Shift_Del'] # These three mutation types are the DAISY defaults.
percentile_threshold = 10
cn_threshold = -0.3 
cor_threshold = 0.5
p_threshold = 0.05
pval_correction = 'Bonferroni'
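`pval_correction = 'Bonferroni'` requests a multiple-testing correction. How `DAISY_operations` applies it is not shown in this notebook; the sketch below is only the textbook Bonferroni rule (multiply each p-value by the number of tests and cap the result at 1), followed by a filter on `p_threshold`. The function name and the raw p-values are illustrative.

```python
def bonferroni(p_values):
    """Textbook Bonferroni adjustment: p * number_of_tests, capped at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


p_threshold = 0.05
raw_p_values = [0.001, 0.02, 0.04, 0.5]  # made-up values for four tested pairs

adjusted = bonferroni(raw_p_values)
significant = [p_adj <= p_threshold for p_adj in adjusted]

print(adjusted)
print(significant)
```

Only the first pair survives here: 0.02 and 0.04 are below 0.05 on their own, but not after multiplying by the four tests.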

#### 4.1. Pairwise gene coexpression module

4.1.1. Pairwise gene coexpression module on PancancerAtlas.

In [20]:
coexp_pancancer = CoexpressionAnalysis(client, "PanCancerAtlas", [472], cor_threshold, p_threshold, pval_correction)


In [21]:
coexp_pancancer

| Gene_Inactive (index) | | EntrezID_Inactive | Gene_Inactive | EntrezID_SL_Candidate | Gene_SL_Candidate | Correlation | PValue |
|---|---|---|---|---|---|---|---|
| ATM | 0 | 472 | ATM | 4297 | KMT2A | 0.719314 | 0.0 |
| ATM | 130 | 472 | ATM | 1106 | CHD2 | 0.525927 | 0.0 |
| ATM | 131 | 472 | ATM | 7621 | ZNF70 | 0.525772 | 0.0 |
| ATM | 132 | 472 | ATM | 100101467 | ZSCAN30 | 0.525512 | 0.0 |
| ATM | 133 | 472 | ATM | 23506 | BICRAL | 0.525503 | 0.0 |
| ATM | ... | ... | ... | ... | ... | ... | ... |
| ATM | 71 | 472 | ATM | 55619 | DOCK10 | 0.551148 | 0.0 |
| ATM | 72 | 472 | ATM | 54891 | INO80D | 0.550570 | 0.0 |
| ATM | 73 | 472 | ATM | 22990 | PCNX1 | 0.550037 | 0.0 |
| ATM | 63 | 472 | ATM | 64430 | PCNX4 | 0.554314 | 0.0 |
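The idea behind this module is that SL partners tend to be coexpressed, so candidates are genes whose expression correlates with the input gene above `cor_threshold`. A rough sketch on toy data, assuming a Pearson correlation (this notebook does not show which correlation measure `CoexpressionAnalysis` uses; the expression values and the gene names `GENE_X` and `GENE_Y` are made up for illustration):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


cor_threshold = 0.5

# Toy expression values across six samples (made up).
expression = {
    "ATM": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "GENE_X": [1.1, 2.3, 2.9, 4.2, 4.8, 6.1],  # tracks ATM closely
    "GENE_Y": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],  # only weakly related to ATM
}

target = "ATM"
candidates = {}
for gene, values in expression.items():
    if gene == target:
        continue
    r = pearson(expression[target], values)
    # Keep only genes correlated with the target above the threshold.
    if r > cor_threshold:
        candidates[gene] = r

print(candidates)
```

The real module additionally filters on `p_threshold` after the chosen p-value correction, which this sketch omits.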


<br>
4.1.2. Pairwise gene coexpression module on CCLE data

In [22]:
coexp_CCLE=CoexpressionAnalysis(client, "CCLE", [472], cor_threshold, p_threshold, pval_correction) 

In [23]:
coexp_CCLE

| Gene_Inactive (index) | | EntrezID_Inactive | Gene_Inactive | EntrezID_SL_Candidate | Gene_SL_Candidate | Correlation | PValue |
|---|---|---|---|---|---|---|---|
| ATM | 0 | 472 | ATM | 4863 | NPAT | 0.671066 | 3.222075e-167 |
| ATM | 1 | 472 | ATM | 200576 | PIKFYVE | 0.637336 | 2.370146e-145 |
| ATM | 2 | 472 | ATM | 2145 | EZH1 | 0.593870 | 6.509447e-121 |
| ATM | 3 | 472 | ATM | 84437 | MSANTD4 | 0.579161 | 1.842351e-113 |
| ATM | 4 | 472 | ATM | 143684 | FAM76B | 0.571153 | 1.462356e-109 |
| ATM | ... | ... | ... | ... | ... | ... | ... |
| ATM | 59 | 472 | ATM | 9044 | BTAF1 | 0.502733 | 3.157164e-80 |
| ATM | 60 | 472 | ATM | 10181 | RBM5 | 0.502723 | 3.183590e-80 |
| ATM | 61 | 472 | ATM | 10180 | RBM6 | 0.502311 | 4.570113e-80 |
| ATM | 62 | 472 | ATM | 55599 | RNPC3 | 0.500605 | 2.032599e-79 |


#### 4.2. Genomic survival of the fittest module

4.2.1. Genomic survival of the fittest module on CCLE data

In [24]:
sof_CCLE = SurvivalOfFittest(client, "CCLE", p_threshold, [472], input_mutations, percentile_threshold, cn_threshold, pval_correction)


Empty DataFrame
Columns: [Gene_Inactive, Gene_SL_Candidate, PValue]
Index: []
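The arguments `input_mutations`, `percentile_threshold`, and `cn_threshold` control which samples count as having the input gene inactivated. The exact rule lives in `DAISY_operations` and is not shown in this notebook; the sketch below assumes, based on the parameter names and the DAISY paper's description, that a gene is called inactive in a sample if it carries one of the listed mutation types, or if its expression falls in the lowest `percentile_threshold` percent and its copy-number value is below `cn_threshold`. All sample values are made up.

```python
import math

input_mutations = {"Nonsense_Mutation", "Frame_Shift_Ins", "Frame_Shift_Del"}
percentile_threshold = 10
cn_threshold = -0.3


def percentile_cutoff(values, percentile):
    """Nearest-rank percentile of the given values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def is_inactive(sample, expr_cutoff):
    """Assumed rule: damaging mutation, OR (low expression AND low copy number)."""
    mutated = sample["mutation"] in input_mutations
    low_expression = sample["expr"] <= expr_cutoff
    low_copy_number = sample["cn"] < cn_threshold
    return mutated or (low_expression and low_copy_number)


# Made-up measurements of one gene across five samples.
samples = [
    {"id": "S1", "expr": 0.2, "cn": -0.8, "mutation": None},
    {"id": "S2", "expr": 5.1, "cn": 0.1, "mutation": "Frame_Shift_Del"},
    {"id": "S3", "expr": 4.7, "cn": -0.5, "mutation": None},
    {"id": "S4", "expr": 6.3, "cn": 0.0, "mutation": "Missense_Mutation"},
    {"id": "S5", "expr": 3.9, "cn": 0.2, "mutation": None},
]

expr_cutoff = percentile_cutoff([s["expr"] for s in samples], percentile_threshold)
inactive = [s["id"] for s in samples if is_inactive(s, expr_cutoff)]
print(inactive)
```

Sample S3 has a low copy number but normal expression, and S4 has a mutation type outside the list, so neither is counted as inactive under this assumed rule.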


In [25]:
sof_CCLE

| | Gene_Inactive | Gene_SL_Candidate | PValue |
|---|---|---|---|


<br>
4.2.2. Genomic survival of the fittest module on PanCancerAtlas

In [26]:
sof_pancancer = SurvivalOfFittest(client, "PanCancerAtlas", p_threshold, [472], input_mutations, percentile_threshold, cn_threshold,pval_correction)


     Gene_Inactive Gene_SL_Candidate    PValue
0              ATM      LOC100132111  0.000000
1              ATM            OR10J5  0.000000
2              ATM              DEDD  0.000000
3              ATM           TOMM40L  0.000000
4              ATM             UHMK1  0.000000
...            ...               ...       ...
2446           ATM              ATL1  0.049183
2447           ATM      RP11-88I21.2  0.049570
2448           ATM              GNG2  0.049795
2449           ATM             GATA2  0.049817
2450           ATM             FRMD6  0.049853

[2451 rows x 3 columns]


In [27]:
sof_pancancer

| Gene_Inactive (index) | | EntrezID_Inactive | Gene_Inactive | EntrezID_SL_Candidate | Gene_SL_Candidate | PValue |
|---|---|---|---|---|---|---|
| ATM | 0 | 472 | ATM | 127385 | OR10J5 | 0.000000 |
| ATM | 58 | 472 | ATM | 57823 | SLAMF7 | 0.000000 |
| ATM | 57 | 472 | ATM | 2312 | FLG | 0.000000 |
| ATM | 56 | 472 | ATM | 84824 | FCRLA | 0.000000 |
| ATM | 55 | 472 | ATM | 5824 | PEX19 | 0.000000 |
| ATM | ... | ... | ... | ... | ... | ... |
| ATM | 2179 | 472 | ATM | 6400 | SEL1L | 0.048261 |
| ATM | 2180 | 472 | ATM | 51062 | ATL1 | 0.049183 |
| ATM | 2181 | 472 | ATM | 54331 | GNG2 | 0.049795 |
| ATM | 2182 | 472 | ATM | 2624 | GATA2 | 0.049817 |
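Conceptually, survival of the fittest asks whether a candidate partner gene has a higher copy number in samples where the input gene is inactive than in the remaining samples. The DAISY paper describes a one-sided Wilcoxon rank-sum test for this comparison. The sketch below is a plain-Python rank-sum test with a normal approximation and no tie correction, run on made-up copy-number values; it illustrates the comparison only and is not the code inside `SurvivalOfFittest`.

```python
import math


def rank_sum_greater(group_a, group_b):
    """One-sided rank-sum (Mann-Whitney U) test, normal approximation.

    Returns (U statistic of group_a, p-value) for the alternative that
    values in group_a tend to be larger than values in group_b.
    """
    labeled = sorted([(v, "a") for v in group_a] + [(v, "b") for v in group_b])
    ranks = [0.0] * len(labeled)
    start = 0
    while start < len(labeled):
        # Find the run of tied values and give each the average rank.
        end = start
        while end + 1 < len(labeled) and labeled[end + 1][0] == labeled[start][0]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[k] = average_rank
        start = end + 1

    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(r for r, (_, label) in zip(ranks, labeled) if label == "a")
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u_a, p_value


# Made-up copy-number values of a candidate partner gene.
partner_cn_when_inactive = [0.4, 0.6, 0.5, 0.7, 0.3]
partner_cn_otherwise = [-0.2, 0.1, 0.0, -0.1, 0.2]

u_stat, p_value = rank_sum_greater(partner_cn_when_inactive, partner_cn_otherwise)
print(u_stat, p_value < 0.05)
```

For real analyses a tested routine such as `scipy.stats.mannwhitneyu` is a safer choice than a hand-written approximation.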


#### 4.3. Functional examination inference module

4.3.1. CRISPR-based functional examination inference module

In [28]:
crispr_result = FunctionalExamination(client, "CRISPR", p_threshold,[472], percentile_threshold, cn_threshold, 'none')


In [29]:
crispr_result

| Gene_Inactive (index) | | EntrezID_Inactive | Gene_Inactive | EntrezID_SL_Candidate | Gene_SL_Candidate | PValue |
|---|---|---|---|---|---|---|
| ATM | 0 | 472 | ATM | 84811 | BUD13 | 2.512259e-10 |
| ATM | 1 | 472 | ATM | 2017 | CTTN | 3.285199e-10 |
| ATM | 2 | 472 | ATM | 156 | GRK2 | 3.697528e-10 |
| ATM | 3 | 472 | ATM | 10235 | RASGRP2 | 5.162614e-10 |
| ATM | 4 | 472 | ATM | 6734 | SRPRA | 2.068264e-09 |
| ATM | ... | ... | ... | ... | ... | ... |
| ATM | 2147 | 472 | ATM | 27248 | ERLEC1 | 4.975831e-02 |
| ATM | 2146 | 472 | ATM | 10874 | NMU | 4.975831e-02 |
| ATM | 2148 | 472 | ATM | 2145 | EZH1 | 4.981428e-02 |
| ATM | 2149 | 472 | ATM | 55032 | SLC35A5 | 4.982033e-02 |


<br>
4.3.2. shRNA-based functional examination inference module

In [30]:
siRNA_result = FunctionalExamination(client, "siRNA", p_threshold, [472], percentile_threshold, cn_threshold, 'none')


In [31]:
siRNA_result

| Gene_Inactive (index) | | EntrezID_Inactive | Gene_Inactive | EntrezID_SL_Candidate | Gene_SL_Candidate | PValue |
|---|---|---|---|---|---|---|
| ATM | 0 | 472 | ATM | 441087 | LOC441087 | 0.000006 |
| ATM | 1 | 472 | ATM | 341640 | FREM2 | 0.000007 |
| ATM | 2 | 472 | ATM | 89910 | UBE3B | 0.000008 |
| ATM | 3 | 472 | ATM | 6734 | SRPRA | 0.000016 |
| ATM | 4 | 472 | ATM | 29942 | PURG | 0.000023 |
| ATM | ... | ... | ... | ... | ... | ... |
| ATM | 1206 | 472 | ATM | 26145 | IRF2BP1 | 0.049769 |
| ATM | 1207 | 472 | ATM | 85437 | ZCRB1 | 0.049814 |
| ATM | 1208 | 472 | ATM | 3123 | HLA-DRB1 | 0.049859 |
| ATM | 1209 | 472 | ATM | 115548 | FCHO2 | 0.049892 |


### 5. Integration of results

5.1. Integration of the pairwise gene coexpression results on PanCancerAtlas and CCLE

In [32]:
coexpression_result = UnionResults([coexp_pancancer, coexp_CCLE])

In [33]:
coexpression_result

| | EntrezID_Inactive | Gene_Inactive | EntrezID_SL_Candidate | Gene_SL_Candidate | Correlation_0 | PValue_0 | Correlation_1 | PValue_1 | PValue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 472 | ATM | 4297 | KMT2A | 0.719314 | 0.0 | 0.550823 | 3.966528e-100 | 0.000000e+00 |
| 1 | 472 | ATM | 1106 | CHD2 | 0.525927 | 0.0 | | | 0.000000e+00 |
| 2 | 472 | ATM | 7621 | ZNF70 | 0.525772 | 0.0 | | | 0.000000e+00 |
| 3 | 472 | ATM | 100101467 | ZSCAN30 | 0.525512 | 0.0 | | | 0.000000e+00 |
| 4 | 472 | ATM | 23506 | BICRAL | 0.525503 | 0.0 | | | 0.000000e+00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 241 | 472 | ATM | 64766 | S100PBP | | | 0.504090 | 9.554430e-81 | 9.554430e-81 |
| 242 | 472 | ATM | 10181 | RBM5 | | | 0.502723 | 3.183590e-80 | 3.183590e-80 |
| 243 | 472 | ATM | 10180 | RBM6 | | | 0.502311 | 4.570113e-80 | 4.570113e-80 |
| 244 | 472 | ATM | 55599 | RNPC3 | | | 0.500605 | 2.032599e-79 | 2.032599e-79 |
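In this output the `_0` and `_1` suffixes follow the order in which the DataFrames were passed (PanCancerAtlas first, CCLE second), and an empty cell means the pair was reported by only one dataset. A minimal sketch of that outer-join layout on plain dictionaries, using a few values copied from the outputs above (how `UnionResults` derives the final `PValue` column is not shown in this notebook, so the sketch leaves it out; `union_by_pair` is a hypothetical name):

```python
def union_by_pair(result_sets):
    """Outer-join result rows on (inactive gene, candidate gene).

    Columns from the i-th input get the suffix _i; values missing from an
    input are filled with None.
    """
    merged = {}
    for idx, rows in enumerate(result_sets):
        for row in rows:
            key = (row["Gene_Inactive"], row["Gene_SL_Candidate"])
            entry = merged.setdefault(key, {})
            entry[f"Correlation_{idx}"] = row["Correlation"]
            entry[f"PValue_{idx}"] = row["PValue"]

    columns = [
        f"{name}_{idx}"
        for idx in range(len(result_sets))
        for name in ("Correlation", "PValue")
    ]
    return {key: {col: entry.get(col) for col in columns} for key, entry in merged.items()}


pancancer_rows = [
    {"Gene_Inactive": "ATM", "Gene_SL_Candidate": "KMT2A", "Correlation": 0.719314, "PValue": 0.0},
    {"Gene_Inactive": "ATM", "Gene_SL_Candidate": "CHD2", "Correlation": 0.525927, "PValue": 0.0},
]
ccle_rows = [
    {"Gene_Inactive": "ATM", "Gene_SL_Candidate": "KMT2A", "Correlation": 0.550823, "PValue": 3.966528e-100},
    {"Gene_Inactive": "ATM", "Gene_SL_Candidate": "RBM5", "Correlation": 0.502723, "PValue": 3.183590e-80},
]

union = union_by_pair([pancancer_rows, ccle_rows])
for pair, values in union.items():
    print(pair, values)
```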


<br>
5.2. Integration of survival of the fittest results on PanCancerAtlas and CCLE

In [34]:
sof_result = UnionResults([sof_CCLE, sof_pancancer])

At least one of the dataframes is empty, please run only with nonempty dataframes


In [35]:
sof_result=sof_pancancer
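The manual assignment above works around the empty CCLE result. A small sketch of the same guard written generically, assuming only that each result supports `len()` (true for pandas DataFrames and for the plain lists used as stand-ins here; `non_empty` is a hypothetical helper):

```python
def non_empty(results):
    """Keep only the results that contain at least one row."""
    return [result for result in results if len(result) > 0]


# Stand-ins for the two survival of the fittest results.
sof_ccle_rows = []  # the CCLE run returned no pairs
sof_pancancer_rows = [("ATM", "OR10J5"), ("ATM", "SLAMF7")]

usable = non_empty([sof_ccle_rows, sof_pancancer_rows])
if len(usable) == 1:
    # Nothing to union: use the single non-empty result as-is.
    sof_rows = usable[0]
else:
    # Zero or several results: leave the decision (or the union) to the caller.
    sof_rows = None

print(len(usable), sof_rows)
```

In the notebook this would correspond to `usable = non_empty([sof_CCLE, sof_pancancer])`, calling `UnionResults(usable)` only when more than one DataFrame remains.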

<br>
5.3. Integration of the shRNA- and CRISPR-based functional examination results

In [36]:
functional_screening_result = UnionResults([crispr_result, siRNA_result])

<br>
5.4. Merging the results from all three inference procedures

In [37]:
all_merged_results = MergeResults([coexpression_result, sof_result, functional_screening_result])

In [38]:
all_merged_results

| | EntrezID_Inactive | Gene_Inactive | EntrezID_SL_Candidate | Gene_SL_Candidate | PValue |
|---|---|---|---|---|---|
| 0 | 472 | ATM | 27185 | DISC1 | 0.0 |
| 1 | 472 | ATM | 9859 | CEP170 | 0.0 |
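The merged output is much shorter than any single module's output, which is consistent with keeping only pairs supported by all three inference procedures, as the DAISY paper describes. The sketch below shows that intersection on sets of gene pairs. It is an assumption about what `MergeResults` does, not a copy of it, and the partner names `GENE_A` to `GENE_E` are placeholders.

```python
# Placeholder candidate pairs from each of the three inference procedures.
coexpression_pairs = {("ATM", "GENE_A"), ("ATM", "GENE_B"), ("ATM", "GENE_C")}
sof_pairs = {("ATM", "GENE_B"), ("ATM", "GENE_C"), ("ATM", "GENE_D")}
functional_pairs = {("ATM", "GENE_C"), ("ATM", "GENE_B"), ("ATM", "GENE_E")}

# Keep only the pairs that every procedure reported.
supported_by_all = set.intersection(coexpression_pairs, sof_pairs, functional_pairs)

print(sorted(supported_by_all))
```

A pair reported by only one or two procedures, such as `GENE_A` or `GENE_D` here, drops out of the final list.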


The results are saved to an Excel file.

In [39]:
WriteToExcel("DAISY_results.xlsx", [coexp_pancancer,  all_merged_results],["Co-exp_Pancancer", "All"])
