<table style="border:2px solid white; border-collapse: collapse; border-spacing: 0;" cellspacing="0" cellpadding="0">
  <tr> 
    <th style="background-color:white"> <img src="../media/cnt_logo_700.png" width=250 height=250></th>
    <th style="background-color:white"> <img src="../media/CCAL.png" width=210 height=210></th>
    <th style="background-color:white"> <img src="../media/logoMoores.jpg" width=170 height=170></th>
    <th style="background-color:white"> <img src="../media/UCSD_School_of_Medicine_logo.png" width=175 height=175></th> 
  </tr>
</table>

# GSEA Notebook v2

# Find the Mystery Pathway

<hr style="border: none; border-bottom: 3px solid #88BBEE;">

## Configure notebook

%load_ext autoreload
%autoreload 2

import random
import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly as py
from IPython.display import display, HTML
from plotly.offline import init_notebook_mode

# Make the local copy of the ccal library importable
sys.path.insert(0, '../tools/')
import ccal

np.random.seed(7678)                # fixed seed so that results are reproducible
warnings.filterwarnings('ignore')   # silence library warnings in the notebook
init_notebook_mode(connected=True)  # enable inline plotly figures

# Widen the notebook layout
display(HTML(data="""
<style>
    div#notebook-container    { width: 95%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 99%; }
</style>
"""))

### Read the Hallmarks collection from MSigDB

In [11]:
gmt1 = '../data/h.all.v7.0.symbols.gmt'
gene_sets = ccal.read_gmts([gmt1], collapse=False)
gene_sets

Gene Set,Gene 0,Gene 1,Gene 10,Gene 100,Gene 101,Gene 102,Gene 103,Gene 104,Gene 105,Gene 106,...,Gene 90,Gene 91,Gene 92,Gene 93,Gene 94,Gene 95,Gene 96,Gene 97,Gene 98,Gene 99
HALLMARK_TNFA_SIGNALING_VIA_NFKB,KLF4,BCL3,RCAN1,KLF2,SLC2A3,GADD45B,NFKB2,PTPRE,MAFF,HES1,...,CXCL11,GEM,PLPP3,B4GALT5,DUSP1,PHLDA1,PDLIM5,IFIT2,GADD45A,ABCA1
HALLMARK_HYPOXIA,HEXA,FAM162A,PPP1R3C,KLHL24,HSPA5,ERO1A,GPC3,NR3C1,DUSP1,PDGFB,...,KLF6,B3GALT6,PPFIA4,GYS1,HOXB9,TGM2,PLAUR,HAS1,XPNPEP1,BNIP3L
HALLMARK_CHOLESTEROL_HOMEOSTASIS,LPL,MVK,ANXA5,,,,,,,,...,,,,,,,,,,
HALLMARK_MITOTIC_SPINDLE,PPP4R2,PLK1,TUBD1,KIF3B,ANLN,NEK2,SMC4,ARHGEF7,TIAM1,TAOK2,...,RASA1,INCENP,RAB3GAP1,DOCK4,RAPGEF6,TTK,CEP192,PDLIM5,RAPGEF5,CEP72
HALLMARK_WNT_BETA_CATENIN_SIGNALING,DKK1,TP53,HDAC5,,,,,,,,...,,,,,,,,,,
HALLMARK_TGF_BETA_SIGNALING,CDK9,SERPINE1,BMP2,,,,,,,,...,,,,,,,,,,
HALLMARK_IL6_JAK_STAT3_SIGNALING,IL3RA,PTPN1,CRLF2,,,,,,,,...,,,,,,,,,,
HALLMARK_DNA_REPAIR,NELFCD,TARBP2,PNP,ARL6IP1,NUDT21,RFC4,POLA2,GTF2H3,UMPS,AAAS,...,ZWINT,ERCC3,NME3,DDB2,POLR3C,PRIM1,POLD3,POLB,VPS37D,NT5C3A
HALLMARK_G2M_CHECKPOINT,RAD21,PLK1,KIF15,NEK2,NUP98,UPF1,ABL1,RBL1,MCM3,H2AFZ,...,PTTG1,HNRNPD,STAG1,CKS1B,HUS1,HMGA1,MAPK14,YTHDC1,NUP50,SMC4
HALLMARK_APOPTOSIS,SLC20A1,IGF2R,ERBB3,GSN,FAS,IL18,SATB1,PDCD4,HGF,RELA,...,IL1A,BIRC3,PRF1,XIAP,NEDD9,IFNB1,BTG3,TNFSF10,DNAJA1,MCL1
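
The `.gmt` file read above is a tab-separated text format with one gene set per line: the set name, a description field, and then the member genes. The following is a minimal sketch of parsing that layout with the standard library only. The helper `parse_gmt` and the two-line toy input are hypothetical illustrations, not the implementation of `ccal.read_gmts`.

```python
import io

def parse_gmt(handle):
    """Parse GMT text into {gene_set_name: [genes]} (hypothetical helper)."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]  # drop empty trailing fields
    return gene_sets

# Toy input: gene symbols are taken from the table above,
# the description field is a placeholder.
toy_gmt = io.StringIO(
    "HALLMARK_HYPOXIA\tdescription\tHEXA\tFAM162A\tPPP1R3C\n"
    "HALLMARK_APOPTOSIS\tdescription\tSLC20A1\tIGF2R\n"
)
toy_sets = parse_gmt(toy_gmt)
print({name: len(genes) for name, genes in toy_sets.items()})
```

Gene sets have different sizes, which is why the table above pads the shorter sets with empty cells.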


### Read Expression File for 750 Cancer Cell Lines
##### Cancer Cell Line Encyclopedia (CCLE): dataset from the [Dependency Map (DepMap) project at the Broad Institute](https://depmap.org/portal/), a widely used resource for cancer research.

In [32]:
ccle_exp = ccal.read_gct('../data/ccle_gene_expression.gct')
print(ccle_exp.shape)

(48642, 750)
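
For orientation, a GCT (v1.2) file is tab-separated text with a version line, a dimensions line, and then a header row whose first two columns are `Name` and `Description`. The sketch below reads that layout with pandas. The helper `read_gct_sketch` and the toy content are assumptions for illustration and may differ from what `ccal.read_gct` does internally.

```python
import io
import pandas as pd

def read_gct_sketch(handle):
    """Read GCT v1.2 text into a features x samples DataFrame (hypothetical helper)."""
    # Skip the "#1.2" version line and the "<rows>\t<cols>" dimensions line
    df = pd.read_csv(handle, sep="\t", skiprows=2, index_col=0)
    # The first remaining column is Description; keep only the sample columns
    return df.drop(columns=df.columns[0])

# Toy file content; names and values are made up
toy_gct = io.StringIO(
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tCELL_A\tCELL_B\tCELL_C\n"
    "GENE1\tna\t1.0\t2.0\t3.0\n"
    "GENE2\tna\t4.0\t5.0\t6.0\n"
)
toy_exp = read_gct_sketch(toy_gct)
print(toy_exp.shape)  # (features, samples)
```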


### Read Drug Sensitivity Database 
##### [Cancer Therapeutics Response Portal (CTRP)](https://portals.broadinstitute.org/ctrp.v2.1/). This is part of the [Cancer Target Discovery and Development Network (CTD2)](https://ocg.cancer.gov/programs/ctd2) project sponsored by the NCI Office of Cancer Genomics. UCSD is also a member of this network.

In [5]:
ccle_drug_sens = ccal.read_gct('../data/ccle_drug_sensitivity.gct')
print(ccle_drug_sens.shape)

(481, 645)


### To select a single drug as the phenotype, extract the relevant row using the drug name
#### The notnull function removes NA values (some drugs do not have entries for all cell lines)

In [33]:
phen = ccle_drug_sens.loc['drug name', pd.notnull(ccle_drug_sens.loc['drug name',:])]
phen
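
The row selection above can be tried on a small table first. The sketch below uses a made-up drugs x cell lines DataFrame (all names and values are hypothetical) and shows that the boolean-mask form used in the cell above gives the same result as the shorter `dropna()` call.

```python
import numpy as np
import pandas as pd

# Toy drugs x cell-lines table; names and values are made up for illustration
toy_sens = pd.DataFrame(
    [[10.2, np.nan, 4.9], [7.5, 8.1, np.nan]],
    index=["drug_A", "drug_B"],
    columns=["CELL1_LUNG", "CELL2_SKIN", "CELL3_OVARY"],
)

row = toy_sens.loc["drug_A"]
phen_mask = row[pd.notnull(row)]  # boolean-mask form, as in the cell above
phen_drop = row.dropna()          # equivalent, shorter form

print(phen_drop)
```

Either form leaves only the cell lines that were actually profiled for the chosen drug.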

# Find the Mystery Pathway

## The exercise uses the three datasets already loaded in the notebook:

### I. The Hallmarks gene sets (gene_sets)
### II. The CCLE gene expression dataset (ccle_exp)
### III. The CTRP drug sensitivity dataset (ccle_drug_sens)

### 1.- Define a phenotype based on the drug sensitivity for the compound ML162 in the CTRP dataset (row name 'ML162'). This drug is representative of a new type of therapeutic compound that induces "ferroptosis", a form of cell death. See, for example, the [ferroptosis](https://en.wikipedia.org/wiki/Ferroptosis) page on Wikipedia. Note that in the CTRP dataset lower values imply higher sensitivity to the drug (the sensitivity metric is the area under the curve, AUC).

### 2.- Use the single-sample GSEA function (from the first notebook) to produce a dataset of pathways (hallmarks) vs. samples. You can do this by copying the ccal.ssGSEA call from the first notebook and setting the parameters accordingly. This could take about 5-10 minutes.

### 3.- Use the make_match_panel function to match the ssGSEA profiles of the hallmark gene sets against the drug response of ML162. You can do this by copying the ccal.make_match_panel call from the first notebook and setting the parameters accordingly.

### Based on your results, answer these two questions:

### A. What is the pathway most associated with response to the ML162 compound? In other words, what is the cellular context or molecular pathway associated with response? This can be answered based on the best-scoring hallmark from step 3.

### B. What are the cancer types most sensitive to ML162? To find out, sort the ML162 phenotype and see which cell lines (and cancer types) correspond to the smallest values of the ML162 sensitivity profile. You can do this using the pandas sorting function: dataset.sort_values(ascending=True).

### Both questions are important as there are pharmaceutical companies developing clinical-grade drugs similar to ML162 to be used to treat human patients. In this context it is important to know how cancer patients could be selected as candidates for this drug based on some key molecular biomarkers (pathway) and/or cancer tissue type.

### Now try to see if you can answer those two questions


### If you get stuck, you can peek at the solution which is shown at the end of this notebook. However, only do that after a good try.

## This is the Solution


## Extract the profile of the ML162 compound

In [34]:
phen = ccle_drug_sens.loc['ML162', pd.notnull(ccle_drug_sens.loc['ML162',:])]
phen

CHP212_AUTONOMIC_GANGLIA             10.3520
IMR32_AUTONOMIC_GANGLIA               4.8662
KELLY_AUTONOMIC_GANGLIA               9.8269
KPNSI9S_AUTONOMIC_GANGLIA             9.5538
KPNYN_AUTONOMIC_GANGLIA               9.2834
MHHNB11_AUTONOMIC_GANGLIA            11.6980
NB1_AUTONOMIC_GANGLIA                 9.9825
NH6_AUTONOMIC_GANGLIA                 7.8775
SIMA_AUTONOMIC_GANGLIA               10.0620
SKNAS_AUTONOMIC_GANGLIA               7.7706
SKNBE2_AUTONOMIC_GANGLIA              7.6312
SKNDZ_AUTONOMIC_GANGLIA               2.5079
SKNSH_AUTONOMIC_GANGLIA              10.7390
HUH28_BILIARY_TRACT                   7.6524
SNU1079_BILIARY_TRACT                 6.7048
SNU1196_BILIARY_TRACT                 7.9141
SNU308_BILIARY_TRACT                 10.6410
SNU478_BILIARY_TRACT                 10.6680
SNU869_BILIARY_TRACT                  9.6990
A673_BONE                             6.4476
CAL78_BONE                           11.7890
G292CLONEA141B1_BONE                  7.0571
HOS_BONE  

### Run the ssGSEA analysis to produce a dataset of pathways vs. samples

In [12]:
pathway_exp = ccal.ssGSEA(
    ccle_exp, 
    gene_sets, 
    statistic="auc", 
    alpha = 1,
    file_path=None,
    sample_norm_type='zscore'
)

Estimating ssGSEA for sample #1 (out of 750) for 50 gene sets
Estimating ssGSEA for sample #2 (out of 750) for 50 gene sets
Estimating ssGSEA for sample #3 (out of 750) for 50 gene sets
...
Estimating ssGSEA for sample #661 (out of 750) for 50 gene sets
...
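
To give some intuition for what a single-sample enrichment score measures, here is a simplified sketch of a running-sum score for one sample. It ranks genes by expression, accumulates a weighted "hit" curve for genes in the set and a uniform "miss" curve for the rest, and sums the gap between the two curves. The function `single_sample_score` and the toy data are hypothetical, and whether this matches what ccal computes for `statistic="auc"` is an assumption, so treat it as an illustration only.

```python
import numpy as np
import pandas as pd

def single_sample_score(expression, gene_set, alpha=1.0):
    """Simplified running-sum enrichment score for one sample (illustrative only)."""
    ranked = expression.sort_values(ascending=False)
    in_set = ranked.index.isin(gene_set)
    if in_set.sum() == 0 or in_set.all():
        return np.nan  # score is undefined without both hits and misses
    weights = np.abs(ranked.to_numpy()) ** alpha
    hit = np.cumsum(np.where(in_set, weights, 0.0))
    if hit[-1] == 0:
        return np.nan  # all genes in the set have zero weight
    hit = hit / hit[-1]                           # weighted fraction of hits so far
    miss = np.cumsum(~in_set) / (~in_set).sum()   # fraction of misses so far
    return float(np.sum(hit - miss))              # area between the two curves

# Toy sample: ten made-up genes with decreasing expression
toy_sample = pd.Series(
    [5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0, -5.0],
    index=[f"g{i}" for i in range(10)],
)
top_score = single_sample_score(toy_sample, ["g0", "g1", "g2"])
bottom_score = single_sample_score(toy_sample, ["g7", "g8", "g9"])
print(top_score, bottom_score)
```

A gene set concentrated among the most highly expressed genes gets a positive score, and one concentrated at the bottom of the ranking gets a negative score. Repeating this for every gene set and every sample yields a pathways vs. samples table like `pathway_exp`.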

### Now match the dataset of pathways vs samples against the profile of the ML162 compound

In [35]:
DPE_scores = ccal.make_match_panel(
                target = phen,
                target_type = 'continuous',
                target_ascending = True,
                data = pathway_exp,
                data_type = 'continuous',
                n_extreme = 50,
                n_permutation = 10,
                n_sampling = 50,
                plot_std = 2,
                score_ascending = True,
                title = ' ',
                layout_width = 1100, 
                row_height = 50, 
                layout_side_margin = 260, 
                annotation_font_size = 9,
                # The logged output path suggests '.html' is appended to this prefix
                file_path_prefix = '../results/ML162_sens_vs_pathways')

target.index (620) & data.columns (750) have 577 in common.
Computing score using compute_information_coefficient with 1 process ...
Computing MoE with 50 sampling ...
Computing p-value and FDR with 10 permutation ...
../results/erastin_sens_vs_pathways.html.html
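
The first line of the log above reports how many samples the target and the data share, since only those can be compared. The sketch below reproduces that alignment step on toy data and then scores each pathway against the target. Note that it uses a plain Spearman rank correlation as a stand-in, not the information coefficient named in the log, and all names and values are made up.

```python
import pandas as pd

# Toy target (e.g. drug AUC) for samples S0..S7
target = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    index=[f"S{i}" for i in range(8)],
)

# Toy pathways x samples table for samples S3..S11
data = pd.DataFrame(
    [list(range(9)), list(range(9, 0, -1))],
    index=["PATHWAY_UP", "PATHWAY_DOWN"],
    columns=[f"S{i}" for i in range(3, 12)],
    dtype=float,
)

# Keep only the samples present in both objects
common = target.index.intersection(data.columns)
print(f"target.index ({target.size}) & data.columns ({data.shape[1]}) "
      f"have {common.size} in common.")

target_c = target.loc[common]
data_c = data.loc[:, common]

# Stand-in association score: Spearman correlation (Pearson on ranks)
scores = data_c.apply(lambda row: row.rank().corr(target_c.rank()), axis=1)
print(scores.sort_values(ascending=True))
```

Because lower AUC means higher sensitivity, the pathways with the most negative scores are the ones whose activity goes up in sensitive cell lines.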


### Now look at the names of the cell lines with the lowest ML162 values (the most sensitive ones)

In [36]:
phen.sort_values(ascending = True)

SKNDZ_AUTONOMIC_GANGLIA                        2.5079
OV7_OVARY                                      2.6479
D283MED_CENTRAL_NERVOUS_SYSTEM                 2.7238
LOXIMVI_SKIN                                   3.5063
HS934T_SKIN                                    4.1445
NCIH522_LUNG                                   4.2349
AM38_CENTRAL_NERVOUS_SYSTEM                    4.2962
IALM_LUNG                                      4.3319
8505C_THYROID                                  4.5002
SNU182_LIVER                                   4.5783
RKN_SOFT_TISSUE                                4.5793
TE5_OESOPHAGUS                                 4.6046
NCIH1341_LUNG                                  4.6450
IMR32_AUTONOMIC_GANGLIA                        4.8662
CAL54_KIDNEY                                   4.8784
JVM2_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE        5.0302
SNU8_OVARY                                     5.1242
BT20_BREAST                                    5.2077
SNU1033_LARGE_INTESTINE     
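
To go from cell line names to cancer types, the tissue can be read off the name itself. The sketch below assumes the `NAME_TISSUE` convention visible in the output above (everything after the first underscore is the tissue) and counts tissues among the most sensitive lines. The helper `tissue_of` is hypothetical, and the toy series holds a few values copied from the sorted output.

```python
import pandas as pd

# A few (cell line, AUC) pairs copied from the sorted output above
toy_phen = pd.Series({
    "SKNDZ_AUTONOMIC_GANGLIA": 2.5079,
    "OV7_OVARY": 2.6479,
    "D283MED_CENTRAL_NERVOUS_SYSTEM": 2.7238,
    "LOXIMVI_SKIN": 3.5063,
    "HS934T_SKIN": 4.1445,
    "NCIH522_LUNG": 4.2349,
})

def tissue_of(cell_line_name):
    """Return the part after the first underscore (assumed NAME_TISSUE convention)."""
    return cell_line_name.split("_", 1)[1]

# Lowest AUC values first, i.e. the most sensitive cell lines
most_sensitive = toy_phen.sort_values(ascending=True).head(5)
tissue_counts = most_sensitive.index.map(tissue_of).value_counts()
print(tissue_counts)
```

On the full `phen` series, the same counting over a larger `head(n)` gives a quick summary of which cancer types dominate the sensitive end of the profile.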

### Based on these results the answer is 

## The Epithelial Mesenchymal Transition (EMT)

### and the relevant cancers are, for example:

## autonomic ganglia, ovary, central nervous system, skin and lung




### If you had performed this analysis in the context of a real cancer project 4 years ago, you could have become a co-author of a high-profile cancer paper:

[Viswanathan et al. 2017 Dependency of a therapy-resistant state of cancer cells on a lipid peroxidase pathway Nature. 2017 Jul 27; 547(7664): 453–457.](https://www.nature.com/articles/nature23007)

### and helped open up a new potential area of development for novel cancer therapeutics



