# KnowEnG Pipelines Notebook

## Set up the Knowledge Engine for Genomics Pipelines.

* Output from one or more pipelines may be used as input to another.


### 1! Create a directory and follow the instructions in the repository to set up your local computer.
* Move this notebook to the directory where the pipelines have been cloned from GitHub.

### 2! Clone the code from GitHub.
* In your directory, copy and paste these commands into the command line:
*  git clone https://github.com/KnowEnG-Research/KnowEnG_Pipelines_Library.git
*  git clone https://github.com/KnowEnG-Research/Data_Cleanup_Pipeline.git
*  git clone https://github.com/KnowEnG-Research/Clustering_Evaluation.git
*  git clone https://github.com/KnowEnG-Research/Gene_Prioritization_Pipeline.git
*  git clone https://github.com/KnowEnG-Research/GeneSet_Characterization_Pipeline.git
*  git clone https://github.com/KnowEnG-Research/Samples_Clustering_Pipeline.git
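
Before running the import cell, it may help to confirm that every repository was cloned. The sketch below is only an illustration: `find_missing_clones` is a hypothetical helper, not part of any KnowEnG package, and it assumes the clones live under `../clones`, the relative path that the import cell of this notebook uses.

```python
import os

# Repository names taken from the git clone commands above.
EXPECTED_CLONES = [
    'KnowEnG_Pipelines_Library',
    'Data_Cleanup_Pipeline',
    'Clustering_Evaluation',
    'Gene_Prioritization_Pipeline',
    'GeneSet_Characterization_Pipeline',
    'Samples_Clustering_Pipeline',
]


def find_missing_clones(clones_dir, expected=EXPECTED_CLONES):
    """Return the expected repository names that are not directories under clones_dir."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(clones_dir, name))]


# '../clones' is an assumption based on the paths in the import cell.
missing = find_missing_clones('../clones')
if missing:
    print('Missing clones:', ', '.join(missing))
else:
    print('All expected repositories were found.')
```

If any names are printed as missing, repeat the matching `git clone` command before continuing.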

### 3! Run the next cell to import the libraries for this notebook page.

In [1]:
import os
import sys
import time

import numpy as np
import pandas as pd

sys.path.insert(1, '../clones/KnowEnG_Pipelines_Library')
sys.path.insert(1, '../clones/KnowEnG_Pipelines_Library/knpackage')
import redis_utilities as redisutil

sys.path.insert(1, '../clones/Data_Cleanup_Pipeline/src')
import data_cleanup_toolbox as dc_tbx

sys.path.insert(1, '../clones/Clustering_Evaluation/src')
import clustering_eval_toolbox as ce_tbx

sys.path.insert(1, '../clones/Gene_Prioritization_Pipeline/src')
import gene_prioritization_toolbox as gp_tbx

sys.path.insert(1, '../clones/GeneSet_Characterization_Pipeline/src')
import geneset_characterization_toolbox as gsc_tbx

sys.path.insert(1, '../clones/Samples_Clustering_Pipeline/src')
import sample_clustering_toolbox as sc_tbx

import knpackage.toolbox as kn

### 4! Enter your local data and results directory names in the quoted strings below.

In [8]:
local_dir = '/Users/mojo/'

testy_data_dir = os.path.join(local_dir, 'BigDataTank/pipeline_spreadsheets/raw')
testy_data_file_list = os.listdir(testy_data_dir)

# The next lines create the results directory.
this_dir = '../clones'
results_dir = kn.create_dir(this_dir, 'keggo_leggo_test')

# These lines set the data directory paths used for installation testing of the cloned repositories.
ce_dir = os.path.join(this_dir, 'Clustering_Evaluation/data')
dc_dir = os.path.join(this_dir, 'Data_Cleanup_Pipeline/data')

# Directory names match the repositories cloned in step 2.
gp_dir = os.path.join(this_dir, 'Gene_Prioritization_Pipeline/data')
gc_dir = os.path.join(this_dir, 'GeneSet_Characterization_Pipeline/data')
sc_dir = os.path.join(this_dir, 'Samples_Clustering_Pipeline/data')

### Setup for Samples Clustering: enter the names of the spreadsheet, network, and phenotype data files.
* The cloned samples clustering directory structure below has a network you may use.
* Note that the samples clustering workflow uses three different run parameter dictionaries:

   * 1) the cleanup parameters that describe the source files
   * 2) the Samples Clustering run-time option parameters
   * 3) the clustering evaluation post-processing parameters

* Each pipeline has YAML files, each read into a Python dictionary, in its associated ...Pipeline/data/run_files/ directory.
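
To see which run files a pipeline ships with, the `run_files` directory can be listed directly. This is a minimal sketch using only the standard library. `list_run_files` is a hypothetical helper, and the example path assumes the `../clones` layout used elsewhere in this notebook.

```python
import glob
import os


def list_run_files(pipeline_data_dir):
    """Return the sorted names of the .yml files in <pipeline_data_dir>/run_files."""
    pattern = os.path.join(pipeline_data_dir, 'run_files', '*.yml')
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


# Assumed location of the samples clustering data directory.
for name in list_run_files('../clones/Samples_Clustering_Pipeline/data'):
    print(name)
```

Any of the printed names can then be passed to `kn.get_run_parameters` as `'run_files/<name>'`, as the cells below do.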

### 5! Set up a Python dictionary for choosing and cleaning the input data:
* Then run the next two cells.

In [9]:
yml_cleanup_file = 'run_files/data_cleanup.yml'
cleanup_parameters = kn.get_run_parameters(run_directory=dc_dir, run_file=yml_cleanup_file)

cleanup_parameters['results_directory'] = results_dir
cleanup_parameters['spreadsheet_name_full_path'] = os.path.join(testy_data_dir,'Hsap.ccle.G.gene_mut.binary.df')
cleanup_parameters['phenotype_full_path'] = os.path.join(testy_data_dir,'Hsap.ccle.P.cell_line_info.mixed.df')

### 6! The next step requires an internet connection. Clean the data: check and translate the gene names.

In [4]:
# dc_tbx is the alias given to data_cleanup_toolbox in the import cell.
validation_flag, message = dc_tbx.run_samples_clustering_pipeline(cleanup_parameters)
print('validation_flag = ', validation_flag)
print('message = ', message)

validation_flag =  True
message =  This is a valid user spreadsheet. Proceed to next step analysis.


#### The message above should look something like:
* validation_flag =  True
* message =  This is a valid user spreadsheet. Proceed to next step analysis.

#### If so, set up and run the next cell, taking the file names from the results directory.


In [5]:
clean_data_dir = results_dir
phenotypes_file = os.path.join(clean_data_dir, 'Hsap.ccle.P.cell_line_info.mixed_ETL.tsv')

run_parameters = kn.get_run_parameters(sc_dir, 'run_files/BENCHMARK_7_SC_cc_net_nmf_parallel_shared.yml')
run_parameters['spreadsheet_name_full_path'] = os.path.join(clean_data_dir, 'Hsap.ccle.G.gene_mut.binary_ETL.tsv')

run_parameters.pop('phenotype_name_full_path', None)
run_parameters['number_of_bootstraps'] = 10

run_parameters['gg_network_name_full_path'] = os.path.join(sc_dir, 'networks/keg_ST90_4col.edge')
run_parameters['results_directory'] = results_dir

run_parameters['processing_method'] = 'parallel'   # available methods: serial, parallel, distribute

#### The next cell does a lot of processing and may take a long time. Running in an IPython notebook is slow compared to running on bare metal.
* The YAML file in the above cell is set to parallel by default.
* Parallel processing may cause the notebook to stop responding and not write the files to the results directory.
* If that happens, try setting 'processing_method' to 'serial'.
* The running time increases in linear proportion to the 'number_of_bootstraps' value.
* The 'number_of_clusters' key slows things down exponentially; 7 or 8 might be too much for a notebook.

In [6]:
t0 = time.time()
sc_tbx.run_cc_net_nmf(run_parameters)
print('Samples Clustering finished in %0.3f seconds'%(time.time()-t0))

Samples Clustering finished in 151.167 seconds


#### The output file names are time-stamped, so the next dictionary needs manual attention before you run the second cell below.
* Identify the file you need by its prefix 'samples_label_by_cluster_cc_net_nmf_' and its suffix '_viz.tsv'; the time stamp sits between them.
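
Finding the time-stamped file by hand can be avoided by matching on the prefix and suffix. The sketch below assumes the most recently modified match is the one you want. `newest_matching_file` is a hypothetical helper, and the literal results path is an assumption made so that the block runs on its own.

```python
import glob
import os


def newest_matching_file(directory, prefix, suffix):
    """Return the most recently modified file named <prefix>*<suffix>, or None if none exist."""
    pattern = os.path.join(glob.escape(directory), prefix + '*' + suffix)
    matches = glob.glob(pattern)
    return max(matches, key=os.path.getmtime) if matches else None


# Assumed path; inside the notebook pass results_dir instead.
label_file = newest_matching_file('../clones/keggo_leggo_test',
                                  'samples_label_by_cluster_cc_net_nmf_',
                                  '_viz.tsv')
print(label_file)
```

In the notebook, the returned path could be assigned to `ce_parameters['cluster_mapping_full_path']` in place of a hard-coded file name. Check that it is not `None` first.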

In [8]:
ce_parameters = kn.get_run_parameters(ce_dir, 'run_files/BENCHMARK_1_cluster_eval.yml')
ce_parameters['cluster_mapping_full_path'] =\
    os.path.join(results_dir,'samples_label_by_cluster_cc_net_nmf_Wed_15_Feb_2017_17_04_27.845877885_viz.tsv')
ce_parameters['phenotype_data_full_path'] = os.path.join(clean_data_dir, 'Hsap.ccle.P.cell_line_info.mixed_ETL.tsv')
ce_parameters['results_directory'] = results_dir

In [9]:
ce_tbx.clustering_evaluation(ce_parameters)

#### The results are all in the 'results_directory' and may be viewed in another notebook.
#### The output files (in reverse order) are listed below.
* clustering_evaluation_result_ ... .tsv  - statistical view of the clustering vs the phenotype data.
* silhouette_average_cc_net_nmf_ ... _viz.tsv  - the average silhouette score, which indicates how well the chosen 'number_of_clusters' separates the samples.
* top_genes_by_cluster_cc_net_nmf_ ... _download.tsv  - very sparse view of which genes are most associated with each cluster.
* genes_averages_by_cluster_cc_net_nmf_ ... _viz.tsv  - how strongly each gene is represented in each cluster.
* genes_variance_cc_net_nmf_ ... _viz.tsv  - the variance of each gene.
* genes_by_samples_heatmap_cc_net_nmf_ ... _viz.tsv  - the (network smoothed) spreadsheet that was clustered.
* consensus_matrix_cc_net_nmf_ ... _viz.tsv  - the samples x samples consensus matrix (if consensus clustering was used).
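
For a quick look at any of these files without opening a spreadsheet program, the first few rows can be read with the standard library. This is a sketch only: `preview_tsv` is a hypothetical helper, and it assumes nothing about the files beyond their being tab-separated text.

```python
import csv
import itertools


def preview_tsv(path, max_rows=5):
    """Return the first max_rows rows of a tab-separated file as lists of strings."""
    with open(path, newline='') as handle:
        reader = csv.reader(handle, delimiter='\t')
        return list(itertools.islice(reader, max_rows))
```

Calling `preview_tsv(path)` with the full path of one of the files above returns its first five rows, each as a list of strings. The first row is the header line if the file has one.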