<table style="border:2px solid white; border-collapse: collapse; border-spacing: 0;" cellspacing="0" cellpadding="0">
  <tr> 
    <th style="background-color:white"> <img src="../media/ccal-logo-D3.png" width=225 height=225></th>
    <th style="background-color:white"> <img src="../media/logoMoores.jpg" width=175 height=175></th>
    <th style="background-color:white"> <img src="../media/GP.png" width=200 height=200></th>
    <th style="background-color:white"> <img src="../media/UCSD_School_of_Medicine_logo.png" width=175 height=175></th> 
    <th style="background-color:white"> <img src="../media/Broad.png" width=130 height=130></th> 
  </tr>
</table>

**Part 3 of building a PlatiRes Map**   
Daniela Nachmanson *Fall 2017*

This notebook adapts Chapter 3 of the **Onco-*GPS* notebooks** to an NMF run with a platinum-phenotype-derived gene list.
The gene list used in this NMF is derived from a bulk RNA-seq experiment done in Professor Harismendy's lab and from an experiment performed by Marchion et al.

<hr style="border: none; border-bottom: 3px solid #88BBEE;">
# **Onco-*GPS* Methodology**
## **Chapter 3. Annotating the Transcriptional Components**

**Authors:** William Kim$^{1}$, Huwate (Kwat) Yeerna$^{2}$, Taylor Cavazos$^{2}$, Kate Medetgul-Ernar$^{2}$, Clarence Mah$^{3}$, Stephanie Ting$^{2}$, Jason Park$^{2}$, Jill P. Mesirov$^{2, 3}$ and Pablo Tamayo$^{2,3}$.

**Date:** April 17, 2017

1. Eli and Edythe Broad Institute      
2. UCSD Moores Cancer Center
3. UCSD School of Medicine 

**Article:** [*Kim et al.* Decomposing Oncogenic Transcriptional Signatures to Generate Maps of Divergent Cellular States](https://drive.google.com/file/d/0B0MQqMWLrsA4b2RUTTAzNjFmVkk/view?usp=sharing)

**Analysis overview:** In this chapter we perform a detailed analysis of the platinum-resistance transcriptional components produced by the NMF decomposition in chapter 2 in order to assign a biological interpretation to each component. 

<img src="../media/method_chap3.png" width=2144 height=1041>

The analysis consists of the following steps:
* Define a target profile for each component in the CCLE Reference Dataset using the amplitudes of the $H$ matrix. This matrix represents the magnitude of each NMF component per sample. 
* Use the Information Coefficient (IC) ([*Kim, J.W., Botvinnik 2016*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4868596/)) to estimate the degree of association between each component target profile and several types of genomic features.

The genomic features include the following:

1. **Mutations and Copy Number Alterations (CNA).** CCLE mutation and copy number datasets (http://www.broadinstitute.org/ccle, [*Barretina et al. 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/)).
2. **Gene expression.** CCLE RNA-Seq dataset (http://www.broadinstitute.org/ccle, [*Barretina et al. 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/)).
3. **Pathway expression** (single-sample GSEA of MSigDB gene sets). MSigDB v5.1 sub-collections c2, c5, c6 and h (http://www.msigdb.org, Liberzon et al. 2011; [*Liberzon et al. 2016. Cell Systems, 1(6), pp.417–425.*](https://www.ncbi.nlm.nih.gov/pubmed/26771021)) and a few additional gene sets (see supplementary information in the article).
4. **Transcription factors and master regulators expression** (single-sample GSEA of gene sets). MSigDB v5.1 sub-collection c3 (http://www.msigdb.org, [*Liberzon et al. 2011*](https://www.ncbi.nlm.nih.gov/pubmed/21546393)) and 1,598 IPA gene sets (http://www.ingenuity.com).
5. **Protein expression.** CCLE Reverse Phase Protein Array (RPPA) dataset (http://www.broadinstitute.org/ccle, [*Barretina et al. 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/)).
6. **Drug sensitivity.** CCLE dataset (http://www.broadinstitute.org/ccle, [*Barretina et al. 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/)).
7. **Gene dependency.** RNAi Achilles dataset (http://www.broadinstitute.org/achilles, [*Cowley et al. 2014*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4432652/)).
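
To make the Information Coefficient mentioned in the analysis steps more concrete, here is a minimal sketch. It uses the published form IC = sign(ρ)·sqrt(1 − exp(−2·MI)), where ρ is the Pearson correlation and MI is the mutual information. As a simplifying assumption, MI is estimated here from a 2D histogram rather than the kernel density estimate used in the cited paper, so the values will not match CCAL's output. The function name and the `n_bins` parameter are hypothetical.

```python
import numpy as np


def information_coefficient(x, y, n_bins=10):
    """Histogram-based approximation of the Information Coefficient (IC).

    IC = sign(rho) * sqrt(1 - exp(-2 * MI)), with MI in nats.
    Assumption: MI comes from a 2D histogram, not a kernel density estimate.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Joint probability table from a 2D histogram, plus its two marginals
    joint, _, _ = np.histogram2d(x, y, bins=n_bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)

    # MI = sum of p(x,y) * log(p(x,y) / (p(x) p(y))) over non-empty cells
    nonzero = p_xy > 0
    mi = np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero]))
    mi = max(mi, 0.0)  # guard against tiny negative values from rounding

    # The sign of the Pearson correlation gives the direction of association
    rho = np.corrcoef(x, y)[0, 1]
    return np.sign(rho) * np.sqrt(1.0 - np.exp(-2.0 * mi))


rng = np.random.default_rng(0)
profile = rng.normal(size=2000)   # stand-in for a component target profile
noise = rng.normal(size=2000)     # unrelated feature

ic_same = information_coefficient(profile, profile)
ic_opposite = information_coefficient(profile, -profile)
ic_noise = information_coefficient(profile, noise)
```

A feature identical to the profile scores close to +1, its negation close to −1, and an unrelated feature close to 0. The histogram estimator is biased slightly upward, so the unrelated score is small but not exactly zero.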


<hr style="border: none; border-bottom: 3px solid #88BBEE;">
### 1. Set up notebook and import Computational Cancer Analysis Library ([CCAL](https://github.com/KwatME/ccal))

In [1]:
from environment import *

%matplotlib inline
%load_ext autoreload
%autoreload 2

### 2. Read the annotation datasets table and $H$  matrix

The table describing the datasets that will be used in the annotation analysis (annotation.data_table.txt) is included in the "../data" directory.

In [2]:
pd.read_csv('../data/annotation.data_table.txt', sep='\t')

Unnamed: 0,Data Name,Data Type,Emphasis,Filepath
0,drug_sensitivity,continuous,low,../data/ccle_drug_sensitivity.gct
1,gene_expression,continuous,high,../data/ccle_gene_expression.gct
2,gene_dependency,continuous,high,../data/ccle_gene_dependency.gct
3,mutation,binary,high,../data/ccle_mut_CNA.gct
4,pathway_expression,continuous,high,../data/ccle_pathway_expression_all.gct
5,protein_expression,continuous,high,../data/ccle_protein_expression.gct
6,regulator,continuous,high,../data/ccle_regulator.gct
7,tissue,binary,high,../data/ccle_tissue.gct


The function below reads that table and loads each dataset listed in the "Filepath" column.

In [3]:
data_table = ccal.load_data_table('../data/annotation.data_table.txt')

Making data bundle for drug_sensitivity ...
	Loaded ../data/ccle_drug_sensitivity.gct.
Making data bundle for gene_expression ...
	Loaded ../data/ccle_gene_expression.gct.
Making data bundle for gene_dependency ...
	Loaded ../data/ccle_gene_dependency.gct.
Making data bundle for mutation ...
	Loaded ../data/ccle_mut_CNA.gct.
Making data bundle for pathway_expression ...
	Loaded ../data/ccle_pathway_expression_all.gct.
Making data bundle for protein_expression ...
	Loaded ../data/ccle_protein_expression.gct.
Making data bundle for regulator ...
	Loaded ../data/ccle_regulator.gct.
Making data bundle for tissue ...
	Loaded ../data/ccle_tissue.gct.
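
For readers who want to inspect these files without CCAL, the loading step can be approximated with pandas alone. The sketch below assumes the files follow the standard GCT v1.2 layout (a version line, a dimensions line, then a header row starting with `Name` and `Description`). The helper names `read_gct_sketch` and `load_data_table_sketch` are hypothetical and are not part of CCAL; the dictionary they return is an illustrative structure, not CCAL's data bundle.

```python
import io

import pandas as pd


def read_gct_sketch(path_or_buffer):
    """Read a GCT v1.2 file into a features-by-samples DataFrame.

    Hypothetical helper for illustration; it is not ccal's own reader.
    """
    # Skip the version line ("#1.2") and the dimensions line
    df = pd.read_csv(path_or_buffer, sep='\t', skiprows=2, index_col=0)
    # The second GCT column holds free-text row descriptions
    return df.drop(columns='Description')


def load_data_table_sketch(table_path):
    """Return {data name: {'type', 'emphasis', 'df'}} for each table row."""
    table = pd.read_csv(table_path, sep='\t', index_col=0)
    bundle = {}
    for _, row in table.iterrows():
        bundle[row['Data Name']] = {
            'type': row['Data Type'],
            'emphasis': row['Emphasis'],
            'df': read_gct_sketch(row['Filepath']),
        }
    return bundle


# Tiny in-memory GCT file with two features and three samples
toy_gct = (
    '#1.2\n'
    '2\t3\n'
    'Name\tDescription\tS1\tS2\tS3\n'
    'GENE_A\tna\t1.0\t2.0\t3.0\n'
    'GENE_B\tna\t0.0\t1.0\t0.0\n'
)
toy = read_gct_sketch(io.StringIO(toy_gct))
```

Calling `load_data_table_sketch('../data/annotation.data_table.txt')` would then give one entry per row of the table shown above, provided the listed paths resolve from the working directory.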


Read the $H$ matrix produced in notebook 2.

In [4]:
h_matrix = ccal.read_gct('../output/nmf_cc/nmf/nmf_k7_h.gct')

In [5]:
h_matrix

Unnamed: 0,A101D_SKIN,A172_CENTRAL_NERVOUS_SYSTEM,A204_SOFT_TISSUE,A2058_SKIN,A2780_OVARY,A375_SKIN,A498_KIDNEY,A549_LUNG,A673_BONE,A704_KIDNEY,...,WM88_SKIN,WM983B_SKIN,YAPC_PANCREAS,YD10B_UPPER_AERODIGESTIVE_TRACT,YD38_UPPER_AERODIGESTIVE_TRACT,YD8_UPPER_AERODIGESTIVE_TRACT,YH13_CENTRAL_NERVOUS_SYSTEM,YKG1_CENTRAL_NERVOUS_SYSTEM,ZR751_BREAST,ZR7530_BREAST
C1,232.067182,1478.095344,610.102215,304.444904,1852.53,331.823433,101.448277,357.015088,3100.34806,6.750842,...,370.498254,555.332236,337.854484,462.251312,289.45773,177.374047,932.345,1634.421092,600.984288,5.163962
C2,0.321277,979.679441,3424.984387,534.758511,4272.116,705.565301,1387.366937,1331.712449,2519.452662,1746.373567,...,816.140078,219.001867,739.033621,471.127722,0.134198,840.432109,26.137021,1081.411271,4329.997888,4737.871
C3,644.84292,1955.798697,2324.925571,1360.520392,2931.981,1746.97191,1760.101873,3468.060422,2263.639368,2289.968057,...,50.866697,916.339951,1538.860277,2069.84564,2524.487797,1406.969205,1497.428196,799.44186,242.309702,0.027625
C4,186.648538,19.566442,38.619616,372.349661,6.124098,492.214407,953.565669,920.808159,139.932584,2211.291331,...,225.358655,161.965636,2335.1523,280.133274,794.193296,95.562497,362.996504,229.054029,2742.021757,3134.720759
C5,1379.580409,4059.425214,1739.160372,879.096088,756.9543,913.214162,3375.133431,1625.953739,1389.711154,2413.70049,...,936.252055,958.728154,677.965254,1262.685186,890.344381,4404.869902,5426.175363,3498.336481,0.770284,303.328827
C6,8190.890283,1008.767726,1851.351694,7205.088176,172.4504,6362.994378,615.308777,715.85283,917.742018,270.281574,...,8564.95703,7943.831754,330.214109,811.646059,846.579874,171.418896,1140.103031,3224.551292,650.885735,385.2462
C7,591.403983,350.068929,61.445347,369.370027,3.604511e-08,279.630035,1331.956109,1091.132989,0.056211,505.515183,...,486.540362,492.042817,3573.093226,4227.383185,4151.553681,2305.468819,312.526337,6.527126,1476.087208,1333.166071
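
Each row of $H$ is the target profile of one component: one amplitude per cell line. The short sketch below shows how a single profile can be pulled out and ordered with pandas. It uses a handful of values copied (and rounded) from the output above; the variable names are illustrative only.

```python
import pandas as pd

# A few H-matrix entries copied from the output above (rounded)
h_toy = pd.DataFrame(
    {
        'A101D_SKIN': [232.07, 8190.89],
        'A2780_OVARY': [1852.53, 172.45],
        'A549_LUNG': [357.02, 715.85],
    },
    index=['C1', 'C6'],
)

# The target profile of a component is its row of H, one amplitude per sample;
# sorting in descending order puts the strongest samples first
c6_profile = h_toy.loc['C6'].sort_values(ascending=False)

# Sample with the highest amplitude for each component
top_sample_per_component = h_toy.idxmax(axis=1)
```

In this subset, C6 is dominated by the skin line A101D_SKIN, while C1 is highest in A2780_OVARY.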


### 3. Find the top genomic features that match each component profile
The annotation consists of running the association analysis for each component against all the genomic datasets. Because this is a double iteration over components and feature datasets, it will take hours to complete. As the program runs, it displays the specific target vs. features comparison being made.

In [None]:
ccal.association.make_association_panels(
    target=h_matrix, data_bundle=data_table, dropna='all', target_ascending=False,
    target_prefix='', data_prefix='', target_type='continuous',
    n_jobs=1, n_features=20, n_samplings=30, n_permutations=50, random_seed=12345,
    directory_path='../output/component_annotation',
)

C1 vs Regulator ...
Created directory /Users/DanielaNachmanson/Desktop/OncoGPS_Analysis_paper/dana/results/component_annotation.
C1 vs Gene Expression ...
C1 vs Mutation ...
C1 vs Protein Expression ...
C1 vs Pathway Expression ...
C1 vs Tissue ...
C1 vs Gene Dependency ...
C1 vs Drug Sensitivity ...
C2 vs Regulator ...
C2 vs Gene Expression ...
C2 vs Mutation ...
C2 vs Protein Expression ...
C2 vs Pathway Expression ...
C2 vs Tissue ...
C2 vs Gene Dependency ...
C2 vs Drug Sensitivity ...
C3 vs Regulator ...
C3 vs Gene Expression ...
C3 vs Mutation ...
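
The structure of this analysis (every component against every dataset, with a permutation-based significance estimate and a top-feature cutoff) can be sketched on toy data as follows. This is not CCAL's implementation: Pearson correlation stands in for the Information Coefficient, the p-value is a simple pooled permutation estimate, and the function names `rank_features` and `annotate` are hypothetical. The parameter names `n_features`, `n_permutations` and `random_seed` are borrowed from the call above only for readability.

```python
import numpy as np
import pandas as pd


def rank_features(target, features, n_features=20, n_permutations=50, random_seed=12345):
    """Score every feature row against a target profile and keep the top hits.

    Assumption: Pearson correlation stands in for the Information Coefficient.
    """
    rng = np.random.default_rng(random_seed)
    samples = target.index.intersection(features.columns)
    t = target[samples].to_numpy(dtype=float)
    f = features[samples].to_numpy(dtype=float)

    def scores(vector):
        # Pearson correlation of each feature row with the vector
        fc = f - f.mean(axis=1, keepdims=True)
        vc = vector - vector.mean()
        return fc @ vc / (np.linalg.norm(fc, axis=1) * np.linalg.norm(vc))

    observed = scores(t)
    # Null distribution: scores against shuffled copies of the target
    null = np.concatenate([scores(rng.permutation(t)) for _ in range(n_permutations)])
    p_values = [(np.sum(np.abs(null) >= abs(s)) + 1) / (null.size + 1) for s in observed]

    result = pd.DataFrame({'score': observed, 'p_value': p_values}, index=features.index)
    return result.sort_values('score', ascending=False).head(n_features)


def annotate(h, datasets, **kwargs):
    """Double iteration: every component (row of h) against every dataset."""
    return {
        (component, name): rank_features(h.loc[component], df, **kwargs)
        for component in h.index
        for name, df in datasets.items()
    }


# Toy data: 2 components, 5 features, 40 samples, with one planted association
rng = np.random.default_rng(0)
sample_names = [f'S{i}' for i in range(40)]
h_toy = pd.DataFrame(rng.random((2, 40)), index=['C1', 'C2'], columns=sample_names)
expr = pd.DataFrame(
    rng.normal(size=(5, 40)),
    index=[f'GENE_{i}' for i in range(5)],
    columns=sample_names,
)
expr.loc['GENE_0'] = h_toy.loc['C1'].to_numpy() * 2.0 + 1.0

panels = annotate(h_toy, {'gene_expression': expr}, n_features=3)
top_c1 = panels[('C1', 'gene_expression')]
```

The planted feature `GENE_0` comes out as the top match for C1. The cost grows with components × datasets × features × permutations, which is consistent with the long run time noted above.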
