
# Basic Mining

## EGRIN2.0 MongoDB query using `coremFinder`

### In a nutshell

**Corems** or <u>condition-specific co-regulated modules</u> are sets of genes that are tightly co-expressed in a condition-specific manner and whose expression is often controlled by common transcriptional regulators.

*Expert note: A different perspective on `corems` is that they are highly refined, reproducibly detected biclusters that violate some of the constraints imposed on cMonkey-detected biclusters.*

`coremFinder` mines information about corems that are detected by EGRIN 2.0. 

Typically, about half of the genes in a genome are found in corems. For this subset of genes, the `coremFinder` function can return information about the corem, including:

- gene composition
- condition-specific activity
- edges contained in corem
- corem density

This function can be combined with the [`agglom` function](basic_mining_agglom.ipynb) to drive analysis of EGRIN 2.0 ensembles beyond genes contained in corems.

### Set-up

*Make sure the `egrin2-tools/` folder is in your Python path.*

You can do this on Mac/Linux by adding the path to your Bash shell startup files, e.g. `~/.bashrc` or `~/.bash_profile`.

For example, add the following line to `~/.bash_profile`:

`export PYTHONPATH=$PYTHONPATH:path/to/egrin2-tools/`
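
If you would rather not edit your shell startup files, the same effect can be approximated for a single session from inside Python. The following is a minimal sketch; the directory shown is the same placeholder path used in the `export` line above, so replace it with the location of your own checkout.

```python
import os
import sys

# Placeholder location of the egrin2-tools checkout; replace with your own path.
egrin2_tools_dir = os.path.abspath(os.path.join("path", "to", "egrin2-tools"))

# Append only once so that re-running the cell does not grow sys.path.
if egrin2_tools_dir not in sys.path:
    sys.path.append(egrin2_tools_dir)
```

Unlike the `PYTHONPATH` approach, this change lasts only for the current Python session.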

### Load required modules

In [1]:
from query.egrin2_query import *

host = "baligadev"
db = "eco_db"
port = 27017

There are several dependencies that need to be satisfied, including:
- pymongo
- numpy
- pandas
- joblib
- scipy
- statsmodels
- itertools (part of the Python standard library)
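
To see which of these modules are available in your environment before running any queries, something like the following sketch may help. It assumes Python 3.4 or later (for `importlib.util.find_spec`), and `find_missing` is a hypothetical helper written for this example, not part of the EGRIN 2.0 tools.

```python
import importlib.util

# Module names taken from the dependency list above.
required = ["pymongo", "numpy", "pandas", "joblib", "scipy", "statsmodels", "itertools"]

def find_missing(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = find_missing(required)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All listed modules can be located.")
```

This only checks that each module can be located; it does not import it or verify its version.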

### `coremFinder` function

The `coremFinder` function is very similar to the [`agglom` function](basic_mining_agglom.ipynb). It co-associates information about corems; the information supplied and retrieved is determined by the arguments `x`, `x_type`, and `y_type`.

The function returns the requested information about a corem.

You can find out more about this function and its parameters by issuing the following command:

In [5]:
?coremFinder

#### Example 1: Find corem genes

The most straightforward way to use `coremFinder` is to find genes contained in a corem.

For example, to find all genes in *E. coli* corem `#1` we would type:

In [2]:
corem_1 = coremFinder( x = 1, x_type = "corem", y_type = "genes", host=host, db=db)
corem_1

Unnamed: 0,genes
0,b3317
1,b3320
2,b3319
3,b3313
4,b3315
5,b3318
6,b3314
7,b3321
8,b3316


There are several things to note in this query.

First the arguments:
- `x` specifies the query. This can be `gene(s)`, `condition(s)`, `GRE(s)`, or `edge(s)`; corem IDs are also accepted, as in the query above. `x` can be a single entity or a list of entities of the same type.
- `x_type` indicates the type of `x`. Basically: "what is x?" Values include `gene`, `condition`, `gres`, and `edges`, as well as `corem` as used above. The parameter typing is fairly flexible: for example, `rows` can be used instead of `genes`.
- `y_type` is the type of information to return. Again, `genes`, `conditions`, `gres`, or `edges`; `corems` is also accepted, as in Example 2.
- `host` specifies where the MongoDB database is running. In this case it is running on a machine called `baligadev`. If you are hosting the database locally, this would be `localhost`.
- `db` is the name of the database you want to perform the query in. Typically databases are specified by the three letter organism code (e.g., **eco**) followed by **_db**. A list of maintained databases is available <a href="../check_dbs.ipynb">here</a>.

Also notice that corems (like GREs) are named as integer values. 

Note also that corems are ordered by their weighted density. Thus, corem `#1` is the most densely connected corem in the network. Basically, this means that each gene in the corem is frequently co-discovered in biclusters with every other gene in that corem (a strongly connected subnetwork).
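
The exact weighted-density formula used to rank corems is not given in this notebook. As a rough illustration of the idea that genes in a dense corem are co-discovered in biclusters with one another, the sketch below scores a gene set by its mean pairwise co-occurrence across a few made-up biclusters. The data, the function names, and the scoring formula itself are all assumptions made for illustration, not the EGRIN 2.0 implementation.

```python
from itertools import combinations

# Hypothetical toy biclusters (sets of made-up gene labels), not EGRIN 2.0 data.
toy_biclusters = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2"},
    {"g3", "g4"},
]

def cooccurrence_weight(gene_a, gene_b, biclusters):
    """Fraction of biclusters containing either gene that contain both."""
    either = sum(1 for b in biclusters if gene_a in b or gene_b in b)
    both = sum(1 for b in biclusters if gene_a in b and gene_b in b)
    return float(both) / either if either else 0.0

def toy_weighted_density(genes, biclusters):
    """Mean pairwise co-occurrence weight over all gene pairs (assumed definition)."""
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return 0.0
    total = sum(cooccurrence_weight(a, b, biclusters) for a, b in pairs)
    return total / len(pairs)

# g1 and g2 always appear together; g1 and g4 rarely do.
dense = toy_weighted_density({"g1", "g2"}, toy_biclusters)
sparse = toy_weighted_density({"g1", "g4"}, toy_biclusters)
```

Under this toy definition, a gene set whose members always appear together scores higher than one whose members rarely share a bicluster, which is the intuition behind ranking corems by density.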

Translating the names of these genes shows that they are part of a ribosomal operon, which makes sense because ribosomal genes are tightly co-expressed.

In [5]:
row2id_batch(corem_1.genes.tolist(), return_field = "name", host=host, db=db)

Reverting to translation by single matches. Defining 'input_type' will dramatically speed up query.


[u'rplB',
 u'rplC',
 u'rplD',
 u'rplP',
 u'rplV',
 u'rplW',
 u'rpsC',
 u'rpsJ',
 u'rpsS']

#### Example 2: Find corems for a specific gene

More commonly, you want to know the corems to which a particular gene belongs.

This can be accomplished by changing `x`, `x_type`, and `y_type`, as follows:

In [6]:
carA_corems = coremFinder( x = "carA", x_type = "gene", y_type = "corems", host=host, db=db)
carA_corems

Reverting to translation by single matches. Defining 'input_type' will dramatically speed up query.


Unnamed: 0,corems
0,107
1,471
2,835
3,847


We can see from this query that *carA* belongs to four corems. We can retrieve the genes in these corems as in Example 1:

In [8]:
coremFinder( x = carA_corems.corems.tolist(), x_type = "corems", y_type = "genes", host=host, db=db)

Unnamed: 0,genes
0,b0002
1,b0003
2,b0004
3,b0032
4,b0033
5,b0197
6,b0198
7,b0273
8,b0287
9,b0336


#### Example 3: Logical operations

As with the `agglom` function, we can apply logical operations. For example, to find the genes that belong to *all* of the corems of which *carA* is a member, we simply set `logic = "and"`:

In [9]:
coremFinder( x = carA_corems.corems.tolist(), x_type = "corems", y_type = "genes", host=host, db=db, logic = "and")

No genes found
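
An empty result like this is expected whenever no single gene is shared by every corem in the query. The difference between requiring membership in *all* corems and in *any* corem can be sketched with plain Python sets. Everything in this example is hypothetical: the corem IDs, the gene labels, and the `toy_genes_for_corems` helper are made up, and the `"or"` default is an assumption of this sketch, not a documented `coremFinder` behavior.

```python
# Hypothetical corem -> gene membership; IDs and gene labels are made up.
toy_corems = {
    1: {"geneA", "geneB", "geneC"},
    2: {"geneB", "geneC", "geneD"},
    3: {"geneE"},
}

def toy_genes_for_corems(corem_ids, logic="or"):
    """Combine gene sets: "or" takes the union, "and" takes the intersection."""
    gene_sets = [toy_corems[c] for c in corem_ids]
    if not gene_sets:
        return []
    if logic == "or":
        combined = set.union(*gene_sets)
    elif logic == "and":
        combined = set.intersection(*gene_sets)
    else:
        raise ValueError("logic must be 'or' or 'and'")
    return sorted(combined)

print(toy_genes_for_corems([1, 2], logic="or"))      # genes in either corem
print(toy_genes_for_corems([1, 2], logic="and"))     # genes in both corems
print(toy_genes_for_corems([1, 2, 3], logic="and"))  # no gene is in all three
```

The last call returns an empty list, which mirrors the "No genes found" message above: adding more corems to an "and" query can only shrink the set of shared genes.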
