# Chapter 6. Annotate cellular states

<br>
<div>
    <img src="../media/method_chap5.png" width=2144 height=1041>
</div>

### Analysis overview
In this chapter we perform a detailed analysis and annotation of the KRAS mutant oncogenic states (S1-S4) in order to assign a biological interpretation to each state.

This analysis closely follows the one in chapter 4. The main difference is that the target profiles here are binary variables representing the membership of each sample in an oncogenic state, rather than the transcriptional component profiles.

As in [chapter 4](4 Annotate transcriptional components.ipynb), the genomic features include the following (see [chapter 1](1 Set up data) for sources):

1. **Mutations and Copy Number Alterations (CNA).** CCLE mutation and copy number datasets ([*Barretina et al. 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/)).
2. **Gene expression.** CCLE RNA Seq dataset ([*Barretina et al. 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/)).
3. **Pathway expression** (single sample GSEA of MSigDB gene sets). MSigDB v5.1 sub-collections c2, c5, c6 and h (Liberzon et al. 2011; [*Liberzon et al. 2016. Cell Systems, 1(6), pp.417–425.*](https://www.ncbi.nlm.nih.gov/pubmed/26771021)) and a few additional gene sets (see supplementary information in the article).
4. **Transcription factor and master regulator expression** (single sample GSEA of gene sets). MSigDB v5.1 sub-collection c3 ([*Liberzon et al. 2011*](https://www.ncbi.nlm.nih.gov/pubmed/21546393)) and 1,598 [IPA gene sets](http://www.ingenuity.com).
5. **Protein expression.** CCLE Reverse Phase Protein Array (RPPA) dataset ([*Barretina et al. 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/)).
6. **Drug sensitivity.** CCLE dataset ([*Barretina et al. 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/)).
7. **Gene dependency.** RNAi Achilles dataset ([*Cowley et al. 2014*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4432652/)).

### 1. Set up notebook and import [CCAL](https://github.com/KwatME/ccal)

In [1]:
from notebook_environment import *


%load_ext autoreload
%autoreload 2
%matplotlib inline

Added '../tools' to the path.


### 2. Read CCLE object and sample state labels

In [2]:
with gzip.open('../data/ccle.pickle.gz') as f:
    CCLE = pickle.load(f)

In [3]:
cs = pd.read_table('../output/hccs/hccs.txt', index_col=0)

kras_sample_labels = cs.loc['K4']

kras_sample_labels.name = 'State'

state_x_sample = ccal.make_membership_df_from_categorical_series(
    kras_sample_labels)
state_x_sample.index = ['State {}'.format(i + 1) for i in state_x_sample.index]

state_x_sample

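| State | A549_LUNG | AGS_STOMACH | ASPC1_PANCREAS | CAL62_THYROID | CALU1_LUNG | CALU6_LUNG | CAPAN1_PANCREAS | CAPAN2_PANCREAS | CFPAC1_PANCREAS | CL11_LARGE_INTESTINE | ... | T3M10_LUNG | T3M4_PANCREAS | T84_LARGE_INTESTINE | TCCPAN2_PANCREAS | TEN_ENDOMETRIUM | TGBC11TKB_STOMACH | TOV21G_OVARY | UMUC3_URINARY_TRACT | YAPC_PANCREAS | YD8_UPPER_AERODIGESTIVE_TRACT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| State 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| State 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| State 4 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |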
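
The membership matrix above has one row per state and a 1 wherever a sample belongs to that state. As a rough illustration of the idea (not CCAL's implementation of `make_membership_df_from_categorical_series`), the same shape can be produced with plain pandas. The sample names and labels below are made up, and the labels are assumed to be 0-based, consistent with the `i + 1` renaming in the cell above.

```python
import pandas as pd

# Hypothetical 0-based state labels for five made-up samples.
labels = pd.Series(
    [0, 1, 0, 3, 2],
    index=['s1', 's2', 's3', 's4', 's5'],
    name='State')

# One indicator column per state, then transpose to a state-by-sample matrix.
membership = pd.get_dummies(labels, dtype=int).T

# Rename the rows the same way as in the notebook cell.
membership.index = ['State {}'.format(i + 1) for i in membership.index]

print(membership)
```

Each column sums to 1 because every sample is assigned to exactly one state; each row is then usable as a binary target profile.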


### 3. Annotate the states

Here, each state is annotated with all the feature datasets using an information-based association metric, the Information Coefficient (IC).
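
To give a feel for the metric, the sketch below computes a simplified Information Coefficient, IC = sign(ρ) · sqrt(1 − exp(−2 · MI)), where ρ is the Pearson correlation and MI is the mutual information between the target and a feature. It is an illustration only: the mutual information here is estimated from a 2D histogram, which is a stand-in and may differ from the estimator CCAL uses. The function name, the `n_bins` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def information_coefficient(x, y, n_bins=4):
    """Histogram-based sketch of IC = sign(rho) * sqrt(1 - exp(-2 * MI))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Estimate the joint distribution with a 2D histogram (assumed estimator).
    joint, _, _ = np.histogram2d(x, y, bins=n_bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)

    # Mutual information in nats, skipping empty cells.
    nonzero = p_xy > 0
    mi = np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero]))
    mi = max(mi, 0.0)  # guard against tiny negative rounding errors

    # The sign of the Pearson correlation gives the direction of association.
    rho = np.corrcoef(x, y)[0, 1]
    return np.sign(rho) * np.sqrt(1 - np.exp(-2 * mi))


# Toy binary state membership and two made-up features.
state = [1, 1, 1, 1, 0, 0, 0, 0]
feature_up = [0.9, 0.8, 0.95, 0.85, 0.1, 0.2, 0.15, 0.05]
feature_down = [1 - v for v in feature_up]

ic_up = information_coefficient(state, feature_up)
ic_down = information_coefficient(state, feature_down)
print(ic_up, ic_down)
```

A feature that is high in the state's samples gets a positive score, and one that is low gets a negative score of the same magnitude. This sketch does not handle constant inputs, for which the correlation is undefined.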

In [None]:
# Loop over each state's binary membership profile (one row per state)
for i, state in state_x_sample.iterrows():

    # Loop over every feature dataset stored in the CCLE object
    for features_name, d in CCLE.items():

        features_ = d['df']
        emphasis = d['emphasis']
        data_type = d['data_type']

        print('Annotating with {} (emphasis={} & data_type={})'.format(
            features_name, emphasis, data_type))

        # The match panel call is left commented out; uncomment it to
        # compute the associations and save the panels to ../output/match_states
#         ccal.make_match_panel(
#             state,
#             features_,
#             n_jobs=16,
#             n_features=20,
#             n_samplings=3,
#             n_permutations=3,
#             scores_ascending=(emphasis != 'high'),
#             features_type=data_type,
#             title=features_name,
#             file_path_prefix='../output/match_states/match_{}_and_{}'.format(
#                 i, features_name))

#         mpl.pyplot.show()
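
The `n_permutations` argument in the commented call suggests that significance is assessed by permuting the target. How CCAL does this internally is not shown here, so the following is only a generic sketch of a permutation test: shuffle the state labels many times, recompute the association score, and report how often a shuffled score is at least as extreme as the observed one. The function names, the use of Pearson correlation as the score, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def permutation_p_value(target, feature, score, n_permutations=1000, seed=0):
    """Two-sided empirical p-value of score(target, feature)."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    feature = np.asarray(feature, dtype=float)

    observed = score(target, feature)

    # Null distribution: scores after shuffling the target labels.
    null = np.array([
        score(rng.permutation(target), feature)
        for _ in range(n_permutations)
    ])

    # Add-one correction so the estimate is never exactly zero.
    n_extreme = np.sum(np.abs(null) >= abs(observed))
    p_value = (1 + n_extreme) / (1 + n_permutations)
    return observed, p_value


def pearson(a, b):
    """Stand-in association score (hypothetical choice for this sketch)."""
    return np.corrcoef(a, b)[0, 1]


# Toy binary state membership and a made-up feature that tracks it.
state = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
feature = [0.9, 0.8, 0.95, 0.85, 0.7, 0.75,
           0.1, 0.2, 0.15, 0.05, 0.3, 0.25]

observed, p_value = permutation_p_value(state, feature, pearson)
print(observed, p_value)
```

With only a handful of permutations, as in `n_permutations=3` above, the smallest attainable p-value under this scheme is 1/4, so larger values are needed for a meaningful significance estimate.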

### [Next chapter (7)](7 Display genomic features on Onco-GPS map.ipynb)