# Chapter 2: Generate oncogenic-activation signature

<br>
<div>
    <img src="../media/method_chap1.png" width=2144 height=1041>
</div>

### Analysis overview

In this chapter we will execute the first step in the Onco-GPS methodology: generating the oncogenic activation signature.

The Onco-GPS method uses a signature derived from an isogenic system, which provides clean and direct information about the transcriptional changes associated with the activation of an oncogene, while also incorporating the diverse regulatory circuits represented across multiple cellular contexts in a reference dataset. This deconvolves the functional consequences of oncogene activation in a more direct and unambiguous way.

In this notebook we will generate a KRAS signature based on RNA-Seq profiling of lentiviral constructs of KRAS mut G12 vs. controls in lung SALE epithelial cell lines. We performed pilot experiments to identify an optimal set of conditions (time, viral titer) for carrying out the experiments. This KRAS signature will contain the top 1,000 differentially expressed genes (top 500 and bottom 500) according to the Information Coefficient (IC). The IC ([*Linfoot 1957*](http://www.sciencedirect.com/science/article/pii/S001999585790116X); [*Joe 1989*](https://www.jstor.org/stable/2289859?seq=1#page_scan_tab_contents); [*Kim, J.W., Botvinnik 2016*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4868596/)) is a normalized version of the mutual information, defined as

<p>
<div>
    <img src="../media/equation_chapter_2_1.png" width=300 height=50>
</div>

where $I(x, y)$ is the differential mutual information between $x$, the KRAS mut vs. cntrl binary phenotype, and $y$, the expression profile of each gene. The IC lies in the range [-1, 1], in analogy with the correlation coefficient, and the sign of the correlation coefficient $\rho(x, y)$ provides its directionality. The differential [mutual information](https://en.wikipedia.org/wiki/Mutual_information) $I(x, y)$ is a function of the ratio of joint and marginal probabilities,

<p>
<div>
    <img src="../media/equation_chapter_2_2.png" width=450 height=50>
</div>

Here $H(x, y)$, $H(x)$ and $H(y)$ are the joint and marginal [entropies](https://en.wikipedia.org/wiki/Entropy_(information_theory)). Estimating the mutual information between a phenotype and gene expression profiles requires an empirical approximation of continuous probability density distributions, which is obtained with kernel [density estimators](https://en.wikipedia.org/wiki/Density_estimation) ([*Sheather 2004*](http://www.stat.washington.edu/courses/stat527/s13/readings/Sheather_StatSci_2004.pdf)).
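
The estimate described above can be sketched in a few lines. This is a minimal illustration and not CCAL's implementation. It assumes the IC takes the form $\mathrm{sign}(\rho)\sqrt{1 - e^{-2I}}$ given in Kim et al. 2016, approximates the joint density with `scipy.stats.gaussian_kde` at its default bandwidth, and evaluates it on a regular grid. The function name `estimate_information_coefficient` and the `n_grid` parameter are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde


def estimate_information_coefficient(x, y, n_grid=25):
    """Sketch of a KDE-based IC between two 1-D vectors (not CCAL's code)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Approximate the joint density p(x, y) with a Gaussian kernel estimator.
    joint_kde = gaussian_kde(np.vstack([x, y]))

    # Evaluate the density on a regular grid padded beyond the data range.
    pad_x, pad_y = x.std(), y.std()
    grid_x = np.linspace(x.min() - pad_x, x.max() + pad_x, n_grid)
    grid_y = np.linspace(y.min() - pad_y, y.max() + pad_y, n_grid)
    mesh_x, mesh_y = np.meshgrid(grid_x, grid_y, indexing='ij')
    p_xy = joint_kde(np.vstack([mesh_x.ravel(), mesh_y.ravel()]))
    p_xy = p_xy.reshape(n_grid, n_grid)

    # Normalize to discrete probabilities and derive the marginals.
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)

    # Mutual information: sum of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    positive = p_xy > 0
    mutual_information = np.sum(
        p_xy[positive] * np.log(p_xy[positive] / (p_x * p_y)[positive]))
    # Guard against tiny negative values caused by rounding.
    mutual_information = max(mutual_information, 0.0)

    # Map to [-1, 1]; the sign comes from the Pearson correlation.
    rho = np.corrcoef(x, y)[0, 1]
    return np.sign(rho) * np.sqrt(1.0 - np.exp(-2.0 * mutual_information))


# Toy usage with synthetic, positively associated vectors.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=500)
toy_y = 0.8 * toy_x + 0.6 * rng.normal(size=500)
print(estimate_information_coefficient(toy_x, toy_y))
```

Deriving the marginals from the same gridded joint density keeps the discrete mutual information non-negative, so the square root is always defined.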


### 1. Set up notebook and import [CCAL](https://github.com/KwatME/ccal)

In [2]:
from notebook_environment import *


%load_ext autoreload
%autoreload 2
%matplotlib inline

Added '../tools' to the path.


### 2. Read gene expression dataset

In [3]:
gene_x_kras_isogenic_and_imortalized_celllines = ccal.read_gct(
    '../data/gene_x_kras_isogenic_and_imortalized_celllines.gct')

gene_x_kras_isogenic_and_imortalized_celllines.index.name = 'Gene'
gene_x_kras_isogenic_and_imortalized_celllines.columns.name = 'Cellline'
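
For readers without CCAL at hand, the following sketch shows how a GCT file can be read with plain pandas. It assumes the common GCT 1.2 layout (a version line, a dimensions line, then a tab-separated table whose first two columns are `Name` and `Description`). The helper `read_gct_sketch` and the toy gene and sample names are hypothetical, and `ccal.read_gct` may behave differently.

```python
import io

import pandas as pd

# Toy GCT content with hypothetical gene and sample names.
toy_gct = (
    '#1.2\n'
    '2\t3\n'
    'Name\tDescription\tsample_a\tsample_b\tsample_c\n'
    'GENE1\tna\t1.0\t2.0\t3.0\n'
    'GENE2\tna\t4.0\t5.0\t6.0\n'
)


def read_gct_sketch(file_or_path):
    """Read a GCT 1.2 table into a gene-by-sample DataFrame (sketch)."""
    # Skip the version and dimensions lines; use the Name column as index.
    df = pd.read_csv(file_or_path, sep='\t', skiprows=2, index_col=0)
    # Drop the free-text Description column, keeping only expression values.
    df = df.drop(columns='Description')
    df.index.name = 'Gene'
    df.columns.name = 'Cellline'
    return df


toy = read_gct_sketch(io.StringIO(toy_gct))
print(toy)
```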

### 3. Define KRAS mut vs. cntrl phenotype
This is a vector of 1s and -1s indicating which samples are KRAS mutants (1) and which are controls (-1).

In [4]:
target = pd.Series(
    [1] * 6 + [-1] * 4,
    name='KRAS Mutants vs Controls',
    index=gene_x_kras_isogenic_and_imortalized_celllines.columns)
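
Because the phenotype above is assigned by column position, it matches the data only if the first six columns are the KRAS mutants and the last four are the controls. The sketch below shows a quick sanity check on a toy index. The column labels (`mut_1`, `ctrl_1`, ...) are hypothetical stand-ins for the real cell-line names.

```python
import pandas as pd

# Hypothetical column labels standing in for the 10 cell-line samples.
toy_columns = pd.Index(
    ['mut_{}'.format(i) for i in range(1, 7)] +
    ['ctrl_{}'.format(i) for i in range(1, 5)],
    name='Cellline')

toy_target = pd.Series(
    [1] * 6 + [-1] * 4,
    name='KRAS Mutants vs Controls',
    index=toy_columns)

# The phenotype must share the expression matrix's column index.
assert toy_target.index.equals(toy_columns)

# Count samples per class and list which labels fall in each.
n_mutant = int((toy_target == 1).sum())
n_control = int((toy_target == -1).sum())
mutant_labels = list(toy_target.index[toy_target == 1])
control_labels = list(toy_target.index[toy_target == -1])
print(n_mutant, mutant_labels)
print(n_control, control_labels)
```

Printing the labels in each class makes a misordered input file easy to spot before the long-running association step.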

### 4. Find top differentially expressed genes between KRAS mutants and control

The function below computes the association between the phenotype (KRAS mut vs. cntrl) and each gene expression profile, as described in the introduction above. On completion it produces a heatmap (SIGNATURE.pdf) and a text file (SIGNATURE.txt) in which the genes are sorted by their association with the phenotype as measured by the IC. The computation takes a few hours, so it is best run overnight.

In [None]:
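# Commented out because the computation takes a few hours; uncomment to run.
# It defines gene_scores, which the next step uses.
# gene_scores = ccal.make_match_panel(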
#     target,
#     gene_x_kras_isogenic_and_imortalized_celllines,
#     n_jobs=28,
#     n_features=20,
#     n_permutations=200,
#     random_seed=12345,
#     target_type='binary',
#     file_path_prefix='../output/kras_isogenic_and_imortalized_celllines')
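
The general idea of scoring each gene against the phenotype and judging the score against shuffled labels can be sketched as follows. This is an illustration only and does not reproduce what `ccal.make_match_panel` does internally. It uses the Pearson correlation as a stand-in for the IC, and the helper names `pearson_score` and `score_with_permutations` are hypothetical. The defaults mirror the `n_permutations` and `random_seed` values above, but their exact semantics in CCAL are an assumption.

```python
import numpy as np


def pearson_score(target, profile):
    """Stand-in association score; the notebook uses the IC instead."""
    return np.corrcoef(target, profile)[0, 1]


def score_with_permutations(target, features, n_permutations=200,
                            random_seed=12345):
    """Score each row of features and estimate permutation p-values."""
    rng = np.random.default_rng(random_seed)
    target = np.asarray(target, dtype=float)
    features = np.asarray(features, dtype=float)

    # Observed association of each gene with the true phenotype.
    observed = np.array([pearson_score(target, row) for row in features])

    # Count how often shuffled labels score at least as extremely.
    n_as_extreme = np.zeros(len(features))
    for _ in range(n_permutations):
        shuffled = rng.permutation(target)
        null = np.array([pearson_score(shuffled, row) for row in features])
        n_as_extreme += np.abs(null) >= np.abs(observed)

    # Add-one correction keeps p-values strictly positive.
    p_values = (n_as_extreme + 1) / (n_permutations + 1)
    return observed, p_values


# Toy data: 5 genes x 10 samples, with gene 0 tracking the phenotype.
rng = np.random.default_rng(0)
toy_target = np.array([1] * 6 + [-1] * 4)
toy_features = rng.normal(size=(5, 10))
toy_features[0] = 2 * toy_target + 0.1 * rng.normal(size=10)

scores, p_values = score_with_permutations(toy_target, toy_features)
print(scores)
print(p_values)
```

With only 10 samples split 6 vs. 4 there are 210 distinct labelings, which bounds how small a permutation p-value can become.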

### 5. Generate oncogenic signature by selecting top and bottom 500 genes

The heatmap produced below is shown on the left of Fig 3 in the article.

In [None]:
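# Select the 500 top- and 500 bottom-scoring genes
# (requires gene_scores from step 4).
kras_relevant_genes = ccal.get_top_and_bottom_series_indices(
    gene_scores['Score'], 500).to_series()
kras_relevant_genes.name = 'KRAS Relevant Genes'

# Save the selected gene names, one per line, with a header.
kras_relevant_genes.to_csv(
    '../output/kras_relevant_genes.txt', index=False, header=True)

# Plot the expression of the selected genes across the 10 samples;
# column_annotation repeats the phenotype (6 mutants, 4 controls).
ccal.plot_heatmap(
    gene_x_kras_isogenic_and_imortalized_celllines.loc[kras_relevant_genes, :],
    normalization_axis=1,
    normalization_method='-0-',
    xticklabels=True,
    column_annotation=[1, 1, 1, 1, 1, 1, -1, -1, -1, -1],
    title='KRAS Oncogenic Activation Signature')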
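
The selection and normalization steps above can be illustrated on toy data with plain pandas. This sketch is not CCAL's code. `top_and_bottom_index` is a hypothetical equivalent of the top-and-bottom selection, and `standardize_rows` assumes that `normalization_method='-0-'` with `normalization_axis=1` means z-scoring each gene across samples, which should be checked against the CCAL source.

```python
import numpy as np
import pandas as pd


def top_and_bottom_index(scores, n):
    """Labels of the n highest- and n lowest-scoring entries (sketch)."""
    ranked = scores.sort_values(ascending=False)
    return ranked.index[:n].append(ranked.index[-n:])


def standardize_rows(df):
    """Z-score each row (ASSUMED meaning of normalization_method='-0-')."""
    return df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)


# Toy scores for six hypothetical genes.
toy_scores = pd.Series(
    [0.9, -0.8, 0.1, 0.7, -0.95, 0.0],
    index=['g1', 'g2', 'g3', 'g4', 'g5', 'g6'],
    name='Score')
selected = top_and_bottom_index(toy_scores, 2)

# Toy expression matrix (6 genes x 10 samples) restricted to the selection.
rng = np.random.default_rng(0)
toy_expression = pd.DataFrame(
    rng.normal(size=(6, 10)), index=toy_scores.index)
toy_heatmap_values = standardize_rows(toy_expression.loc[selected])
print(list(selected))
print(toy_heatmap_values.round(2))
```

Standardizing per gene puts every row on a comparable scale, so the heatmap colors reflect the mutant vs. control contrast and not absolute expression levels.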

### [Next chapter (3)](3%20Decompose%20oncogenic-activation%20signature%20and%20define%20transcriptional%20components.ipynb)