
# TUTORIAL: Assemble an EGRIN 2.0 Ensemble

#### *Important! This tutorial assumes you have access to a complete cMonkey2 ensemble.*

### In a nutshell

The `ASSEMBLE` scripts transfer and compile individual [cMonkey2](https://github.com/baliga-lab/cmonkey2) SQLite databases into an integrated MongoDB database. 

In addition, they perform several post-processing steps: detection of gene regulatory elements (GREs) by comparing individual bicluster motifs with [`TOMTOM`](http://meme.ebi.edu.au/meme/doc/tomtom.html) and clustering them with [`MCL`](http://micans.org/mcl/), genome-wide scanning of motifs with `FIMO`, and detection of co-regulated modules, or **corems**, using link-community detection.

### Requirements

- MongoDB >= 2.4.9

IMPORTANT: This tutorial currently assumes that `TOMTOM`, `MCL` and `FIMO` have already been run.

A single GRE definition file is read from the ensemble head directory, e.g.:
    
    /ensemble-head-dir
        /out.mot_metaclustering.txt.I45.txt
        
FIMO scans are read from each run sub-directory, e.g.:
    
    /ensemble-head-dir
        /org-out-xxx
            /fimo-outs
                /fimo-out-xxxx.bz2
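
Before running the assembler, it can help to confirm that the ensemble directory matches the layout above. The following is a minimal sketch and not part of the `ASSEMBLE` scripts: the helper name `check_ensemble_layout` is hypothetical, and the `*-out-*` pattern used to find run sub-directories is an assumption based on the `org-out-xxx` example. The GRE file name is a parameter because it may differ in your ensemble.

```python
from pathlib import Path


def check_ensemble_layout(ensemble_dir,
                          gre_file="out.mot_metaclustering.txt.I45.txt",
                          run_glob="*-out-*"):
    """Report which expected post-processing files are present.

    Hypothetical helper: `run_glob` is an assumed naming pattern for run
    sub-directories and may need adjusting for your ensemble.
    """
    root = Path(ensemble_dir)
    report = {
        # Single GRE definition file in the ensemble head directory
        "gre_file_found": (root / gre_file).is_file(),
        "runs_with_fimo": [],
        "runs_missing_fimo": [],
    }
    for run_dir in sorted(p for p in root.glob(run_glob) if p.is_dir()):
        # FIMO scans are expected as bz2 files under <run>/fimo-outs/
        fimo_dir = run_dir / "fimo-outs"
        fimo_files = list(fimo_dir.glob("fimo-out-*.bz2")) if fimo_dir.is_dir() else []
        key = "runs_with_fimo" if fimo_files else "runs_missing_fimo"
        report[key].append(run_dir.name)
    return report
```

Calling `check_ensemble_layout("/ensemble-head-dir")` returns a dictionary that lists the runs with and without FIMO output, so missing scans can be spotted before the assembly starts.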

Optional:

- `row_annot`: tab-delimited row (gene) annotations. If not supplied, they are downloaded automatically from MicrobesOnline using `--ncbi_code`.
- `col_annot`: tab-delimited column (condition) annotations.

The format of these files is described in detail below.

Additionally, the Python modules described on the [Home page](../index.ipynb) are required to run these scripts.

### Scripts

- `assembler.py`: The control script for the `ASSEMBLE` scripts.
- `makeCorems.py`: Identifies corems.
- `resample_QSub.py`: Generates a QSub script for submitting resamples to a cluster.
- `sql2mongoDB.py`: Merges individual cMonkey2 SQLite databases and post-processing data into MongoDB.

    %run ../../assemble/assembler.py -h

    usage: assembler.py [-h] --organism ORGANISM --ratios RATIOS --targetdir
                        TARGETDIR --ncbi_code NCBI_CODE [--cores CORES]
                        [--ensembledir ENSEMBLEDIR] [--col_annot COL_ANNOT]
                        [--host HOST] [--port PORT] [--prefix PREFIX]
                        [--row_annot ROW_ANNOT]
                        [--row_annot_matchCol ROW_ANNOT_MATCHCOL]
                        [--gre2motif GRE2MOTIF] [--db DB]
                        [--genome_annot GENOME_ANNOT]
                        [--backbone_pval BACKBONE_PVAL]
                        [--link_comm_score LINK_COMM_SCORE]
                        [--link_comm_increment LINK_COMM_INCREMENT]
                        [--link_comm_density_score LINK_COMM_DENSITY_SCORE]
                        [--corem_size_threshold COREM_SIZE_THRESHOLD]
                        [--n_resamples N_RESAMPLES] [--cluster CLUSTER]
                        [--finish_only FINISH_ONLY] [--user USER]

    assemble.py - prepare cluster runs

    optional arguments:
      -h, --help
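
The usage message lists four required options: `--organism`, `--ratios`, `--targetdir` and `--ncbi_code`. As a sketch, an invocation could be put together from Python as shown below. The helper `build_assembler_command` is hypothetical and only builds the argument list; it does not run anything. All values are placeholders chosen for illustration (the organism code `eco`, the file names and the directory names are assumptions), while 511145 is the NCBI taxonomy ID of *E. coli* K-12 MG1655.

```python
def build_assembler_command(organism, ratios, targetdir, ncbi_code,
                            script="../../assemble/assembler.py", **optional):
    """Build an argument list for assembler.py (hypothetical helper).

    Keyword arguments are passed through as `--name value` and should be
    options listed in the usage message, e.g. ensembledir or col_annot.
    """
    cmd = ["python", script,
           "--organism", organism,
           "--ratios", ratios,
           "--targetdir", targetdir,
           "--ncbi_code", str(ncbi_code)]
    for name, value in optional.items():
        cmd += ["--" + name, str(value)]
    return cmd


# Placeholder values for illustration only
cmd = build_assembler_command("eco", "ratios.tsv", "./eco-ensemble-out", 511145,
                              ensembledir="./eco-ens",
                              col_annot="conditions.tsv")
print(" ".join(cmd))
# To launch it: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list makes it easy to inspect or log before handing it to `subprocess.run`.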

### `ASSEMBLE` an EGRIN 2.0 ensemble

In this tutorial we will `ASSEMBLE` an *Escherichia coli* EGRIN 2.0 ensemble using several example files and a couple of cMonkey2 runs, which we provide [here](./static/example_files/).

### STEP 1: Generate optional input files

First, let's explore the optional annotation files. Providing annotations for genes and conditions is a great way to enrich your analysis of the ensemble. You can get a better idea of how useful this metainformation is by following the [advanced mining tutorial](../query_docs/advanced_mining.ipynb).

As noted above, the `row_annot` file will be downloaded automatically from MicrobesOnline if a custom annotation is not provided. If you provide your own `row_annot` file, however, you will also need to specify `--row_annot_matchCol`, which is the name of the column in your annotation file that matches the gene names used by cMonkey2 (i.e. the row names in your ratios file).

The row annotation file should look like the annotation file supplied by MicrobesOnline, where each row specifies a gene and each column specifies some information about that gene. Again, you must ensure that at least one of the columns contains gene names that match the gene names in the ratios file used by cMonkey2. In the MicrobesOnline file, this is the `sysName` column.
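
One way to confirm that `--row_annot_matchCol` points at the right column is to compare that column against the row names of the ratios file before running the assembler. The sketch below is not part of the `ASSEMBLE` scripts. `match_fraction` is a hypothetical helper, and it assumes that both files are tab-delimited with a header line and that gene names sit in the first column of the ratios file.

```python
import csv


def match_fraction(row_annot_path, ratios_path, match_col):
    """Fraction of ratios-file gene names found in `match_col` of the annotation.

    Assumes tab-delimited files with a header line, and gene names in the
    first column of the ratios file (assumptions made for this sketch).
    """
    with open(row_annot_path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        if match_col not in (reader.fieldnames or []):
            raise KeyError("column %r not found in %s" % (match_col, row_annot_path))
        # Collect the non-empty gene identifiers from the chosen column
        annotated = {row[match_col] for row in reader if row[match_col]}

    with open(ratios_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader, None)  # skip the header line of condition names
        genes = {row[0] for row in reader if row}

    if not genes:
        return 0.0
    return len(genes & annotated) / len(genes)
```

A value close to 1.0 suggests the column is a suitable `--row_annot_matchCol`. A value near 0 suggests the wrong column or a different gene naming scheme.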

Here is an example annotation file for *E. coli* taken directly from MicrobesOnline. The file itself is available [here](./static/example_files/E_coli_v4_Build_6.experiment_feature_descriptions.tsv).

<img src="./static/row_annot.png" height = "75%" width = "75%" style="display:inline">

The `col_annot` file provides metainformation about the conditions (columns).



<img src="./static/experiment_annotation.png" height = "75%" width = "75%" style="display:inline">
