<table style="border:2px solid white; border-collapse: collapse; border-spacing: 0;" cellspacing="0" cellpadding="0">
  <tr> 
    <th style="background-color:white"> <img src="../media/CCAL.png" width=225 height=225></th>
    <th style="background-color:white"> <img src="../media/logoMoores.jpg" width=175 height=175></th>
    <th style="background-color:white"> <img src="../media/GP.png" width=200 height=200></th>
    <th style="background-color:white"> <img src="../media/UCSD_School_of_Medicine_logo.png" width=175 height=175></th> 
    <th style="background-color:white"> <img src="../media/Broad.png" width=130 height=130></th> 
  </tr>
</table>

<hr style="border: none; border-bottom: 3px solid #88BBEE;">

# **Onco-*GPS* Methodology**
## **Chapter 1.  Generating Oncogenic Activation Signature** 

**Authors:** Huwate (Kwat) Yeerna -  *Computational Cancer Analysis Laboratory (CCAL), UCSD Moores Cancer Center*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 William Kim - Cancer Program, *Eli and Edythe Broad Institute*      
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
Taylor Cavazos - *Computational Cancer Analysis Laboratory (CCAL), UCSD Moores Cancer Center*   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
Kate Medetgul-Ernar - *Computational Cancer Analysis Laboratory (CCAL), UCSD Moores Cancer Center*   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
Clarence Mah - *Mesirov Lab, UCSD School of Medicine and Moores Cancer Center*      
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
Jill P. Mesirov - *Mesirov Lab, UCSD School of Medicine and Moores Cancer Center*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
Pablo Tamayo - *Computational Cancer Analysis Laboratory (CCAL), UCSD Moores Cancer Center* 

**Date:** Jan 5, 2017

**Article:** [*Kim et al.* Decomposing Oncogenic Transcriptional Signatures to Generate Maps of Divergent Cellular States](https://drive.google.com/file/d/0B0MQqMWLrsA4b2RUTTAzNjFmVkk/view?usp=sharing)

**Analysis overview**

The Onco-GPS method uses a signature from an isogenic system, which provides clean and direct information about the transcriptional changes associated with the activation of an oncogene, while also incorporating the diverse regulatory circuits represented across multiple cellular contexts in a reference dataset. This deconvolves the functional consequences of oncogene activation more directly and unambiguously.

In this notebook we will generate a KRAS signature based on RNASeq profiling of lentiviral constructs of KRAS mut G12 vs. controls in lung SALE epithelial cell lines. We performed pilot experiments to identify an optimal set of conditions (time, viral titer) for these experiments. The KRAS signature contains the differentially expressed genes, i.e. those whose association with the phenotype, as measured by the Information Coefficient (IC), has an FDR at or below 0.05 (the cutoff applied in section 3.4).

The Information Coefficient (IC) ([*Linfoot 1957*](http://www.sciencedirect.com/science/article/pii/S001999585790116X); [*Joe 1989*](https://www.jstor.org/stable/2289859?seq=1#page_scan_tab_contents); [*Kim, J.W., Botvinnik 2016*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4868596/)) is a normalized version of the mutual information defined as,

$$IC(x, y) = \operatorname{sign}(\rho(x,y)) \sqrt{1 - \exp(-2I(x,y))}$$

where $I(x, y)$  is the differential mutual information between $x$, the KRAS mut vs. cntrl binary phenotype, and $y$, the expression profile for each gene. This quantity  lies in the range [-1, 1], in analogy with the correlation coefficient. The sign of the correlation coefficient $\rho(x, y)$ is used to provide directionality. The differential [Mutual Information](https://en.wikipedia.org/wiki/Mutual_information) $I(x, y)$  is a function of the ratio of joint and marginal probabilities, 

$$I(x, y) = \int \int P(x, y) \log \frac{P(x,y)}{P(x)P(y)} \, dx \, dy = H(x) + H(y) - H(x, y).$$
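
As a quick sanity check of the analogy with the correlation coefficient, consider the special case of a bivariate Gaussian, for which the mutual information has the closed form $I = -\frac{1}{2}\log(1 - \rho^2)$. Substituting this into the IC definition (with the mutual information entering as $\exp(-2I)$) gives back $\rho$ itself. The sketch below verifies this numerically; the function names are illustrative and are not part of CCAL.

```python
import math


def gaussian_mutual_information(rho):
    """Closed-form mutual information (in nats) of a bivariate Gaussian."""
    return -0.5 * math.log(1.0 - rho ** 2)


def information_coefficient(mutual_information, rho):
    """IC = sign(rho) * sqrt(1 - exp(-2 * I))."""
    magnitude = math.sqrt(1.0 - math.exp(-2.0 * mutual_information))
    return math.copysign(magnitude, rho)


# For Gaussian data the IC should reproduce rho up to floating-point error.
for rho in (-0.9, -0.3, 0.5, 0.8):
    ic = information_coefficient(gaussian_mutual_information(rho), rho)
    print(f"rho = {rho:+.2f}  ->  IC = {ic:+.4f}")
```

For non-Gaussian relationships the IC and $\rho$ generally differ, which is why the mutual information has to be estimated from the data.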

Here $H(x,y)$ is the joint [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) and $H(x)$ and $H(y)$ are the marginal entropies. Estimating the mutual information between a phenotype and gene expression profiles requires the empirical approximation of continuous probability density distributions using kernel [density estimators](https://en.wikipedia.org/wiki/Density_estimation) ([*Sheather 2004*](http://www.stat.washington.edu/courses/stat527/s13/readings/Sheather_StatSci_2004.pdf)).
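
To make the estimation step concrete, here is a minimal NumPy sketch of a kernel-density-based IC. It is not the CCAL implementation: the product Gaussian kernel, the Scott's-rule bandwidth, the grid size and the function name `kde_information_coefficient` are all assumptions chosen for illustration. The joint density is evaluated on a grid, normalized to probability mass per cell, and the mutual information is summed over the grid.

```python
import numpy as np


def kde_information_coefficient(x, y, grid_size=64, bandwidth=None):
    """Estimate IC(x, y) from a gridded Gaussian kernel density estimate."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so that a single bandwidth suits both axes
    # (assumes neither vector is constant).
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    if bandwidth is None:
        bandwidth = x.size ** (-1.0 / 6.0)  # Scott's rule for 2-D data
    pad = 3.0 * bandwidth
    gx = np.linspace(x.min() - pad, x.max() + pad, grid_size)
    gy = np.linspace(y.min() - pad, y.max() + pad, grid_size)
    # Gaussian kernel between every grid point and every sample.
    kx = np.exp(-0.5 * ((gx[:, None] - x[None, :]) / bandwidth) ** 2)
    ky = np.exp(-0.5 * ((gy[:, None] - y[None, :]) / bandwidth) ** 2)
    joint = kx @ ky.T        # product-kernel joint density, unnormalized
    joint /= joint.sum()     # probability mass per grid cell
    px = joint.sum(axis=1, keepdims=True)   # marginal of x, shape (G, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y, shape (1, G)
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask]))
    mi = max(mi, 0.0)        # guard against tiny negative round-off
    rho = np.corrcoef(x, y)[0, 1]
    return np.sign(rho) * np.sqrt(1.0 - np.exp(-2.0 * mi))


# Toy usage: a synthetic "gene" correlated with a synthetic score.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.8 * a + 0.6 * rng.normal(size=500)
print(kde_information_coefficient(a, b))
print(kde_information_coefficient(a, -b))
```

Kernel smoothing blurs the joint density, so the estimate depends on the bandwidth; a data-driven bandwidth choice, as discussed in the Sheather reference, matters in practice.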

Go to the [next chapter (2)](2 Onco-GPS -- Decomposing Signature and Defining Transcriptional Components.ipynb).
Back to the [introduction (0)](0 Onco-GPS -- Introduction and Overview.ipynb).


<hr style="border: none; border-bottom: 3px solid #88BBEE;">
### 1. Set up notebook and import Computational Cancer Analysis Library ([CCAL](https://github.com/KwatME/ccal))

In [2]:
import sys
import os
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.figsize'] = (8, 5)
mpl.rcParams['figure.max_open_warning'] = 100
HOME_DIR = os.environ['HOME']
sys.path.insert(0, os.path.join(HOME_DIR, 'ccal'))
import ccal

<hr style="border: none; border-top: 3px solid #88BBEE;">
### 2. Set input and result files/directories
These are the files where the input expression data and the output signature will be stored.

In [3]:
DATASET = HOME_DIR + '/OncoGPS_Analysis2/data/kras_isogenic_vs_imortalized.gct'
SIGNATURE = HOME_DIR + '/OncoGPS_Analysis2/results/kras_signature'

<hr style="border: none; border-bottom: 3px solid #88BBEE;">
### 3. Generate oncogenic signature 
As mentioned in the introduction, the signature consists of the genes whose expression profiles are associated with the KRAS mut vs. cntrl phenotype, i.e. share information with it as estimated by the IC.

<hr style="border: none; border-bottom: 1px solid #88BBEE;">
#### 3.1 Read the input gene expression dataset 
The function below reads an expression dataset in GCT format.

In [4]:
gene_exp = ccal.read_gct(DATASET)
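
For readers unfamiliar with the format, a GCT (v1.2) file is tab-delimited: a version line (`#1.2`), a line with the matrix dimensions, and then a header row starting with `Name` and `Description` followed by the sample names. The sketch below parses such a file with pandas; `read_gct_sketch` is a hypothetical helper, and whether `ccal.read_gct` drops the `Description` column in the same way is an assumption.

```python
import io

import pandas as pd


def read_gct_sketch(path_or_buffer):
    """Read a GCT v1.2 file into a genes x samples DataFrame (illustrative)."""
    # Skip the version line and the dimensions line; use "Name" as the index.
    df = pd.read_csv(path_or_buffer, sep='\t', skiprows=2, index_col=0)
    # Drop the free-text "Description" column, keeping only expression values.
    return df.drop(columns=df.columns[0])


# Toy file with 2 genes and 3 samples.
toy_gct = io.StringIO(
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tdesc a\t1.0\t2.0\t3.0\n"
    "GENE_B\tdesc b\t4.0\t5.0\t6.0\n"
)
toy_exp = read_gct_sketch(toy_gct)
print(toy_exp)
```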

<hr style="border: none; border-bottom: 1px solid #88BBEE;">
#### 3.2 Define the KRAS mut vs. cntrl phenotype 
This is a vector of 1 and -1 indicating which samples are KRAS mut and which are controls (see [article](https://drive.google.com/file/d/0B0MQqMWLrsA4b2RUTTAzNjFmVkk/view?usp=sharing) for details).

In [5]:
phenotype = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
target = pd.Series(phenotype, name='KRAS mut vs. cntrl', index=gene_exp.columns)

<hr style="border: none; border-bottom: 1px solid #88BBEE;">
#### 3.3 Find top differentially expressed genes between KRAS mut vs. cntrl and show them in a heatmap.
This is the main function used in this notebook. It computes the association between the phenotype and the gene expression profiles as described in the introduction above. On completion, this function produces a heatmap (SIGNATURE.pdf) and a text file (SIGNATURE.txt) in which the genes are sorted by their association with the phenotype as measured by the IC. The function also computes a bootstrap confidence interval for the IC (shown in parentheses), as well as p-values and False Discovery Rates (FDR) from an empirical permutation test (using *n_permutations* times the number of genes). The heatmap below shows the 20 genes at the top (UP) and at the bottom (DOWN) of the list. The gene names are on the left of the heatmap. This computation takes a few hours, so it is best run overnight.

In [None]:
gene_scores = ccal.oncogps.make_association_panel(
    target=target,              # Target profile (e.g. phenotype)
    features=gene_exp,          # Data matrix with input data
    target_type='binary',       # Target profile type
    n_permutations=200,         # Number of random permutations
    filepath_prefix=SIGNATURE,  # Output files (.txt and .pdf)
    max_n_features=20,          # Max. number of features shown in heatmap
    random_seed=12345)          # Random number generation seed
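
The permutation test can be illustrated with a short, self-contained sketch. This is not what `make_association_panel` does internally: the Pearson correlation stands in for the IC to keep the example fast, the null distribution is pooled across genes, and the Benjamini-Hochberg adjustment is an assumed FDR procedure. The function name and toy data are hypothetical.

```python
import numpy as np


def permutation_pvalues_and_fdr(features, target, n_permutations=200, seed=12345):
    """Score each row against target; return scores, p-values and BH FDRs."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    # Center and scale each row once (assumes no constant rows).
    centered = features - features.mean(axis=1, keepdims=True)
    unit_rows = centered / np.linalg.norm(centered, axis=1, keepdims=True)

    def score(labels):
        # Pearson correlation of every row with the label vector.
        c = labels - labels.mean()
        return unit_rows @ (c / np.linalg.norm(c))

    observed = score(target)
    # Pooled null: n_permutations x n_genes scores from shuffled labels.
    null = np.concatenate(
        [score(rng.permutation(target)) for _ in range(n_permutations)])
    null_abs = np.sort(np.abs(null))
    # Number of null scores at least as extreme as each observed score.
    n_extreme = null_abs.size - np.searchsorted(
        null_abs, np.abs(observed), side='left')
    p_values = (n_extreme + 1) / (null_abs.size + 1)
    # Benjamini-Hochberg adjustment.
    order = np.argsort(p_values)
    adjusted = p_values[order] * p_values.size / np.arange(1, p_values.size + 1)
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]
    fdr = np.empty_like(p_values)
    fdr[order] = np.minimum(adjusted, 1.0)
    return observed, p_values, fdr


# Toy data: 50 random "genes", the first one tracking the phenotype.
rng = np.random.default_rng(0)
toy_target = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
toy_features = rng.normal(size=(50, 10))
toy_features[0] = toy_target + 0.05 * rng.normal(size=10)
toy_scores, toy_p, toy_fdr = permutation_pvalues_and_fdr(toy_features, toy_target)
print(toy_scores[0], toy_p[0], toy_fdr[0])
```

With only 10 samples there are few distinct label permutations, which is one reason to pool the null scores across genes: it gives the empirical p-values finer resolution.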

<hr style="border: none; border-bottom: 1px solid #88BBEE;">
#### 3.4 Generate signature and display the member genes in a heatmap


This computation selects the UP and DOWN genes with FDR at or below 0.05 to generate the KRAS oncogenic signature, which contains 995 genes in the run shown below.

In [13]:
# Keep genes passing the FDR cutoff (.loc replaces .ix, which was removed in pandas 1.0)
kras_relevant_genes = gene_scores.loc[gene_scores['fdr'] <= 0.05].index
print(len(kras_relevant_genes))

995
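
The selection above keeps UP and DOWN genes together. If the two groups are needed separately, they can be split by the sign of the association score, as in this toy sketch. The `fdr` column name comes from the cell above; the `score` column name and the toy values are hypothetical and may not match the columns of `gene_scores`.

```python
import pandas as pd

# Toy association table: positive scores are UP, negative scores are DOWN.
toy_scores = pd.DataFrame(
    {'score': [0.9, 0.4, -0.1, -0.8], 'fdr': [0.001, 0.20, 0.60, 0.002]},
    index=['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D'],
)

# Apply the FDR cutoff, then split by the direction of the association.
significant = toy_scores.loc[toy_scores['fdr'] <= 0.05]
up_genes = significant.index[significant['score'] > 0]
down_genes = significant.index[significant['score'] < 0]
print(list(up_genes), list(down_genes))
```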


Make a heatmap showing the profiles of the resulting signature genes (this heatmap is shown on the left of Fig 3 in the article).

In [None]:
ccal.plot_heatmap(
    dataframe=gene_exp.loc[kras_relevant_genes, :],  # Input data matrix (.loc replaces the removed .ix)
    normalization_method='-0-',                      # Normalization method
    normalization_axis=1,                            # Normalization axis
    column_annotation=phenotype,                     # Annotations for columns
    title='KRAS Oncogenic Activation Signature')
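
The heatmap call normalizes the data before plotting. Assuming that `'-0-'` denotes standardizing each gene to zero mean and unit variance across samples (an interpretation not confirmed by this notebook), the transformation can be sketched in pandas as follows. The helper `standardize_rows` and the toy matrix are illustrative only.

```python
import numpy as np
import pandas as pd


def standardize_rows(dataframe):
    """Give each row (gene) zero mean and unit variance across columns."""
    centered = dataframe.sub(dataframe.mean(axis=1), axis=0)
    return centered.div(dataframe.std(axis=1, ddof=0), axis=0)


# Toy matrix: two genes on very different expression scales.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]],
    index=['GENE_A', 'GENE_B'],
    columns=['S1', 'S2', 'S3'],
)
toy_z = standardize_rows(toy)
print(toy_z)
```

Standardizing per gene puts highly and lowly expressed genes on a common color scale, so the heatmap reflects the pattern across samples rather than absolute expression levels.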