

<h1>TUTORIAL: Build an EGRIN 2.0 Ensemble</h1>

### In a nutshell

The `BUILD` scripts configure [cMonkey2](https://github.com/baliga-lab/cmonkey2) to run on an SGE-managed cluster. In the process, they vary several cMonkey2 parameters so that each run differs from the others. They also choose the experiments to include in each run according to user-provided annotations.

### Requirements

Users must supply several files, including:

- `ratios`: a Tab-delimited or Comma-separated file containing a matrix of gene expression values across all conditions.
- `blocks`: a Comma-separated file containing annotations for each condition.
- `inclusion_blocks`: a Comma-separated file containing groups of blocks to be co-included in runs.
- `exclusion_blocks`: a Comma-separated file containing groups of blocks to be co-excluded in runs.

Optionally:

- `pipeline`: a JSON file containing a custom scoring pipeline. Currently, only the set-enrichment pipeline is supported.
- `setenrich_files`: Comma-separated files containing set-enrichment sets. Multiple files should be separated by a comma.

The format for each of these files will be described in detail below.

We have provided [example files](./static/example_files/) for building a *Mycobacterium tuberculosis* ensemble. The data come from the following publications:

[EJR Peterson, DJ Reiss, S Turkarslan, KJ Minch, T Rustad, CL Plaisier, WJR Longabaugh, DR Sherman, NS Baliga. (2014). A high-resolution network model for global gene regulation in Mycobacterium tuberculosis. *Nucleic Acids Research*.](http://nar.oxfordjournals.org/content/early/2014/09/17/nar.gku777.full)

[TR Rustad, KJ Minch, S Ma, JK Winkler, S Hobbes, MJ Hickey, W Brabant, S Turkarslan, ND Price, NS Baliga and DR Sherman. (2014). Mapping and manipulating the Mycobacterium tuberculosis transcriptome using a transcription factor overexpression-derived regulatory network. *Genome Biology*](http://genomebiology.com/2014/15/11/502/abstract)

[KJ Minch, TR Rustad, EJR Peterson, J Winkler, DJ Reiss, S Ma, M Hickey, W Brabant, B Morrison, S Turkarslan, C Mawhinney, JE Galagan, ND Price, NS Baliga, DR Sherman. (2015). The DNA-binding network of Mycobacterium tuberculosis. *Nature Communications*](http://www.nature.com/ncomms/2015/150112/ncomms6829/full/ncomms6829.html)

Additionally, the Python modules described on the [Home page](../index.ipynb) are required to run these scripts.

### Scripts

`cMonkeyQSub.py`: The control function for `BUILD` scripts. Writes QSub script.

`cMonkeyIniGen.py`: Templating function to generate cMonkey2 initialization (.ini) files.

`ensemblePicker.py`: Picks the experiments to include in a run, given user-supplied experimental blocks.

### `BUILD` an EGRIN 2.0 ensemble

In this tutorial, we will `BUILD` an ensemble for *Mycobacterium tuberculosis* using several example files, which we supply [here](./static/example_files/).


### STEP 1: Generate required input files


First, let's explore the required files.

#### `ratios.csv.gz`

A Tab-delimited or Comma-separated file containing a matrix of gene expression values across all conditions. Rows should correspond to genes and columns to individual conditions, as in the example file below.
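Before configuring runs, it can help to confirm that the ratios matrix is rectangular and numeric. The following is a minimal sketch, not part of the `BUILD` scripts. It assumes that the first row holds condition names and the first column holds gene identifiers. The function name `check_ratios` and the example values are made up for illustration.

```python
import csv
import io


def check_ratios(text, delimiter=","):
    """Return (genes, conditions) after checking the matrix layout.

    Assumption: row 1 holds condition names, column 1 holds gene IDs.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    conditions = header[1:]
    genes = []
    for row in body:
        if len(row) != len(header):
            raise ValueError(
                "gene %s has %d fields, expected %d" % (row[0], len(row), len(header))
            )
        for value in row[1:]:
            float(value)  # raises ValueError if an entry is not numeric
        genes.append(row[0])
    return genes, conditions


# Made-up expression values for two genes across two conditions.
example = "gene,cond_1,cond_2\nRv0001,0.12,-0.40\nRv0002,-1.05,0.33\n"
genes, conditions = check_ratios(example)
print(len(genes), "genes x", len(conditions), "conditions")
```

For a compressed file such as `ratios.csv.gz`, the text could be read with `gzip.open(path, "rt")` first. This sketch does not handle missing-value markers; how cMonkey2 treats those is not covered here.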

<img src="./static/ratios.png" height = "75%" width = "75%" style="display:inline">

#### `blocks.csv.gz`

A Tab-delimited or Comma-separated file containing the block membership for each experiment in the dataset as below.

<img src="./static/blocks.png" height = "75%" width = "75%" style="display:inline">

#### `inclusion_blocks.csv.gz`

A Tab-delimited or Comma-separated file containing definitions for groups of `blocks` (as defined above) to be co-included in a cMonkey2 run. These blocks are not **strictly** co-included in every cMonkey2 run. Rather, placing a `block` into an `inclusion block` increases the likelihood that it will be selected once another block from the same `inclusion block` has already been included in a particular cMonkey2 run.
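To make the idea of non-strict co-inclusion concrete, here is a small sketch of weighted sampling. It is an illustration only and is not the algorithm used by `ensemblePicker.py`. The function `pick_blocks`, the `boost` parameter, and the block names are all hypothetical.

```python
import random


def pick_blocks(blocks, inclusion_groups, n_pick, boost=3.0, seed=None):
    """Draw n_pick blocks without replacement.

    After each draw, every remaining block that shares an inclusion
    group with the chosen block has its weight multiplied by `boost`
    (a made-up parameter), so partners become more likely but are
    never guaranteed.
    """
    rng = random.Random(seed)
    weights = {block: 1.0 for block in blocks}
    chosen = []
    while len(chosen) < n_pick and weights:
        names = list(weights)
        pick = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        chosen.append(pick)
        del weights[pick]
        for group in inclusion_groups:
            if pick in group:
                for partner in group:
                    if partner in weights:
                        weights[partner] *= boost
    return chosen


# Hypothetical block names; "hypoxia" and "reaeration" form one inclusion group.
blocks = ["hypoxia", "reaeration", "tfoe", "stationary"]
groups = [{"hypoxia", "reaeration"}]
print(pick_blocks(blocks, groups, n_pick=2, seed=1))
```

With `boost=1.0` the draw reduces to uniform sampling without replacement, which makes the effect of the inclusion groups easy to compare.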

Each group of `blocks` should be separated by `:::`, as indicated in the template below. The names of each block should be the same as those defined in `blocks.csv.gz` above.

The `block.sample.num` column is not required, but can be useful for evaluating the blocks.
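Because block names must match those in `blocks.csv.gz`, a quick consistency check can catch typos before any runs are configured. The sketch below assumes each group is available as a single `:::`-separated string. The helper names `split_block_group` and `unknown_blocks` are hypothetical, and the column layout of the real file is not modelled here.

```python
def split_block_group(group):
    """Split one ':::'-separated group string into block names."""
    return [name.strip() for name in group.split(":::") if name.strip()]


def unknown_blocks(groups, known_blocks):
    """Return block names used in `groups` that are not in `known_blocks`."""
    known = set(known_blocks)
    missing = []
    for group in groups:
        for name in split_block_group(group):
            if name not in known and name not in missing:
                missing.append(name)
    return missing


# Hypothetical block names for illustration.
known = ["hypoxia", "reaeration", "tfoe"]
groups = ["hypoxia:::reaeration", "tfoe:::hypoxai"]
print(unknown_blocks(groups, known))
```

Any name reported by `unknown_blocks` would need to be corrected in the group definition or added to the blocks file.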

<img src="./static/inclusion_blocks.png" height = "75%" width = "75%" style="display:inline">

#### `exclusion_blocks.csv.gz`

A Tab-delimited or Comma-separated file containing definitions for groups of `blocks` to be co-excluded from a cMonkey2 run. Typically, these blocks are defined for testing purposes. If they are not defined, several random `exclusion blocks` will be defined in order to evaluate model over-fitting (this fallback is not yet implemented).

The format for the `exclusion blocks` is the same as for the `inclusion blocks` above. Each group of `blocks` should be separated by `:::`, as indicated in the template below. The names of each block refer to those defined in `blocks.csv.gz` above.

Again, the `block.sample.num` column is not required.

<img src="./static/exclusion_blocks.png" height = "75%" width = "75%" style="display:inline">

Each of these files needs to be supplied by the user.

### (Optional) Generate custom scoring pipeline and files

[cMonkey2](https://github.com/baliga-lab/cmonkey2) can use a custom scoring pipeline. Currently, this is limited to set-enrichment.

In this example, we will generate an ensemble that uses set-enrichment scoring. We do so by passing several additional options to the `cMonkeyQSub.py` script, namely:

- `pipeline`: a JSON file containing a custom scoring pipeline. Currently, only the set-enrichment pipeline is supported.
- `setenrich`: Name(s) of set-enrichment 'sets' to include. Names should be comma-separated.
- `setenrich_files`: Comma-separated files containing set-enrichment sets. Multiple files should be separated by a comma.

Users interested in building [custom scoring pipelines](https://github.com/baliga-lab/cmonkey2/wiki/configuration-and-run-options) or [configuring set-enrichment](https://github.com/baliga-lab/cmonkey2/wiki/Set-Enrichment-Setup) should consult the [cMonkey2 documentation](https://github.com/baliga-lab/cmonkey2/wiki/).

### STEP 2: Call `cMonkeyQSub.py`

Generating an ensemble configuration entails calling a single Python script. This command-line script accepts a number of required and optional parameters, which are listed in its help output:

In [3]:
%run ../../build/cMonkeyQSub.py -h

usage: cMonkeyQSub.py [-h] --organism ORGANISM --ratios RATIOS --targetdir
                      TARGETDIR [--numruns NUMRUNS] [--ncbi_code NCBI_CODE]
                      [--mincols MINCOLS] [--num_cores NUM_CORES]
                      [--max_tasks MAX_TASKS] [--user USER] [--csh]
                      [--blocks BLOCKS] [--inclusion INCLUSION]
                      [--exclusion EXCLUSION] [--pipeline PIPELINE]
                      [--setenrich SETENRICH]
                      [--setenrich_files SETENRICH_FILES]

cMonkeyQSub.py - prepare cluster runs

optional arguments:
  -h, --help            show this help message and exit
  --organism ORGANISM   3 letter organism code
  --ratios RATIOS       Path to ratios file
  --targetdir TARGETDIR
                        Path to output directory
  --numruns NUMRUNS     Number of cMonkey2 runs to configure
  --ncbi_code NCBI_CODE
                        NCBI organism code
  --mincols MINCOLS     Minimum number of experiments to include in a c

Here we will concentrate on the required arguments.

#### Required Arguments

- organism: 3-letter organism code
- ratios: ratios file described above
- targetdir: location of a directory in which to configure the cMonkey2 runs

If you do not supply `block` files as described above, the experiments to include in each run will be chosen randomly, and several random `exclusion_blocks` will be defined for testing. This fallback is currently not supported, so block files must be supplied (ANB 03042015).

To generate the ensemble configuration, we will assume that the required files are in the local working directory. Furthermore, we will assume that `cMonkeyQSub.py` is in the working directory and that all of the required modules are in your `$PYTHONPATH`.

For the example, we will generate 5 cMonkey2 runs.

On the command line this is as follows:

`python cMonkeyQSub.py --organism mtu --ratios ratios.csv --targetdir mtu-ens-2014 --numruns 5 --blocks blocks.csv --inclusion inclusion_blocks.csv --exclusion exclusion_blocks.csv --pipeline setenrich_pipeline.json --setenrich chipseq,tfoe --setenrich_files ChIPSeq.csv,DE.csv --csh`
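When the same call has to be repeated with different settings, it can be convenient to assemble the argument list programmatically. The sketch below builds a similar command from the options listed in the help output. The helper `build_qsub_command` is hypothetical and only prints the command; it does not run `cMonkeyQSub.py`.

```python
import shlex


def build_qsub_command(organism, ratios, targetdir, **options):
    """Return the argument list for a cMonkeyQSub.py call."""
    args = ["python", "cMonkeyQSub.py",
            "--organism", organism,
            "--ratios", ratios,
            "--targetdir", targetdir]
    for name, value in options.items():
        if value is None or value is False:
            continue                                  # option not requested
        if value is True:
            args.append("--" + name)                  # bare flag such as --csh
        elif isinstance(value, (list, tuple)):
            args += ["--" + name, ",".join(value)]    # comma-separated values
        else:
            args += ["--" + name, str(value)]
    return args


command = build_qsub_command(
    "mtu", "ratios.csv", "mtu-ens-2014",
    numruns=5,
    blocks="blocks.csv",
    inclusion="inclusion_blocks.csv",
    exclusion="exclusion_blocks.csv",
    pipeline="setenrich_pipeline.json",
    setenrich=["chipseq", "tfoe"],
    setenrich_files=["ChIPSeq.csv", "DE.csv"],
    csh=True,
)
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` to launch the script, assuming it is in the working directory as described above.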

Optionally, you can run this within IPython, assuming you have the entire Git repository.

In [22]:
%run ../../build/cMonkeyQSub.py --organism mtu --ratios ./static/example_files/ratios.csv --targetdir mtu-ens-2014 --numruns 5 --blocks ./static/example_files/blocks.csv --inclusion ./static/example_files/inclusion_blocks.csv --exclusion ./static/example_files/exclusion_blocks.csv --pipeline ./static/example_files/setenrich_pipeline.json --setenrich chipseq,tfoe --setenrich_files ./static/example_files/ChIPSeq.csv,./static/example_files/DE.csv --csh

Choosing ensemble conditions
Writing reports
Writing ratio files
Done
Writing ensemble config files


IndexError: list index out of range