# Example 01 - GRNBoost2 with transposed input file

This example notebook shows how to prepare the input data with a [Pandas](http://pandas.pydata.org/) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) when the input file is transposed with respect to the Arboreto input conventions, that is, when it stores genes as rows instead of columns.

In [1]:
import os
import pandas as pd

from arboreto.algo import grnboost2, genie3
from arboreto.utils import load_tf_names

## 1. Load and prepare the input data

* We use the [Pandas](http://pandas.pydata.org/) library to read the data from a tab-separated text file.
* Arboreto expects the `expression_data` matrix to have observations as rows and genes as columns.

In [2]:
wd = os.getcwd().split('arboreto')[0] + 'arboreto/resources/dream5/'

net1_ex_path = wd + 'net1/net1_expression_data.transposed.tsv'
net1_tf_path = wd + 'net1/net1_transcription_factors.tsv'

* Let's first have a look at the input expression matrix as read with the default settings.

In [3]:
wrong_ex_matrix = pd.read_csv(net1_ex_path, sep='\t')

In [4]:
wrong_ex_matrix.head()

Unnamed: 0,G1,0.4254475,0.4424002,1.0568470,1.1172264,0.9710677,1.1393856,1.0648694,0.8761173,1.2059661,...,0.8676780,0.9524525,0.8088893,0.7760007,0.7497160,0.9193069,0.7698596,0.7274581,0.8320656,0.6812067
0,G2,0.017829,0.050525,0.208454,0.003001,0.001056,0.122047,0.140508,0.073814,0.153407,...,0.021559,0.012235,0.000125,0.014447,0.002786,0.012417,0.033997,0.003877,0.010711,0.00056
1,G3,0.907989,0.869368,0.467448,0.317654,0.354651,0.402465,0.481763,1.058292,0.760861,...,0.811866,0.744307,0.336347,0.832623,0.489709,0.793202,0.73537,0.356966,0.830129,0.373693
2,G4,0.448247,0.445851,0.505077,0.387204,0.474532,0.348436,0.474857,0.730366,0.655846,...,0.50405,0.585896,0.625708,0.434733,0.497823,0.550684,0.349348,0.615756,0.465677,0.719729
3,G5,0.172324,0.173311,0.244883,0.253792,0.207718,0.168614,0.182643,0.053656,0.157731,...,0.225154,0.177781,0.017787,0.194447,0.141594,0.215473,0.171566,0.081724,0.179892,0.070691
4,G6,0.273489,0.274889,0.208451,0.17936,0.102833,0.255774,0.11243,0.175109,0.141754,...,0.152427,0.102012,0.0014,0.159369,0.153731,0.290485,0.148841,0.117133,0.101629,0.004762


In [5]:
assert wrong_ex_matrix.shape == (805, 1643)

AssertionError: 

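* **PROBLEM**: the orientation of the matrix is wrong. The genes are in the rows and the observations in the columns, and the first gene row (`G1`) was parsed as the header.
* Let's read the file again with `header=None` and `index_col=0`, so that every line is treated as data and the gene names become the index, then transpose the result with `.T`.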

In [6]:
ex_matrix = pd.read_csv(net1_ex_path, sep='\t', index_col=0, header=None).T
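To see why this combination of arguments works, here is a minimal sketch on a tiny in-memory file. The two-gene, three-observation content of `toy_tsv` is made up for illustration and only mimics the layout of the transposed file (gene name first, no header row).

```python
import io

import pandas as pd

# Hypothetical toy data: one gene per line, gene name first, no header row.
toy_tsv = (
    "G1\t0.1\t0.2\t0.3\n"
    "G2\t1.1\t1.2\t1.3\n"
)

# Default parsing: the first line is consumed as the header,
# so the G1 values end up as column labels.
wrong = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# header=None keeps every line as data, index_col=0 turns the gene names
# into the row index, and .T flips genes into columns.
right = pd.read_csv(io.StringIO(toy_tsv), sep="\t", index_col=0, header=None).T

print(wrong.shape)          # (1, 4)
print(right.shape)          # (3, 2)
print(list(right.columns))  # ['G1', 'G2']
```

After the transpose, each row is one observation and each column one gene, which is the orientation described in section 1.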

* Let's quickly check the input matrix by inspecting its shape and first 5 rows.

In [7]:
assert ex_matrix.shape == (805, 1643)

In [8]:
ex_matrix.shape

(805, 1643)

In [9]:
ex_matrix.head()

Unnamed: 0,G1,G2,G3,G4,G5,G6,G7,G8,G9,G10,...,G1634,G1635,G1636,G1637,G1638,G1639,G1640,G1641,G1642,G1643
1,0.425448,0.017829,0.907989,0.448247,0.172324,0.273489,0.843766,0.648201,1.004533,0.365305,...,0.011979,0.963306,1.16987,0.331381,0.3506,0.822844,0.304483,0.319917,0.36428,0.765945
2,0.4424,0.050525,0.869368,0.445851,0.173311,0.274889,0.764049,0.74787,1.022589,0.434106,...,0.022247,1.014137,0.888465,0.281649,0.48594,0.915617,0.317507,0.238074,0.50913,0.691403
3,1.056847,0.208454,0.467448,0.505077,0.244883,0.208451,0.665355,1.192092,0.824068,0.146987,...,0.422066,0.895203,1.028826,0.825126,0.444819,0.349069,0.04231,0.165208,0.952178,0.678781
4,1.117226,0.003001,0.317654,0.387204,0.253792,0.17936,0.939244,0.868668,0.963028,0.233785,...,0.001163,1.04654,1.058098,0.484225,0.150689,0.449126,0.125197,4.7e-05,0.878127,0.566691
5,0.971068,0.001056,0.354651,0.474532,0.207718,0.102833,0.745871,0.909753,1.151865,0.318988,...,0.000845,1.041745,1.061129,0.384363,0.326859,0.51227,0.26141,0.000156,0.883981,0.646715


* That's better!
* We can now proceed as usual to infer the gene regulatory network (GRN).

In [10]:
tf_names = load_tf_names(net1_tf_path)
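Before launching the inference, it can be worth checking that the transcription factor names actually occur among the gene columns, since a wrong orientation or a naming mismatch would show up here. The sketch below uses hypothetical stand-ins (`toy_matrix`, `toy_tf_names`) rather than the real `ex_matrix` and `tf_names`, and assumes the TF names are available as a plain list of strings.

```python
import pandas as pd

# Hypothetical stand-ins for ex_matrix (observations x genes) and tf_names.
toy_matrix = pd.DataFrame({"G1": [0.1, 0.2], "G2": [1.1, 1.2], "G3": [2.1, 2.2]})
toy_tf_names = ["G1", "G3", "G9"]

# TF names without a matching gene column.
missing = sorted(set(toy_tf_names) - set(toy_matrix.columns))

# TF names that can be matched, in their original order.
usable = [tf for tf in toy_tf_names if tf in toy_matrix.columns]

print(missing)  # ['G9']
print(usable)   # ['G1', 'G3']
```

With the real data, the same two expressions can be applied to `tf_names` and `ex_matrix.columns`.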

## 2. Launch gene regulatory network inference

In [11]:
%%time
network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names)

CPU times: user 30.9 s, sys: 15.7 s, total: 46.6 s
Wall time: 50.9 s


In [12]:
network.head()

Unnamed: 0,TF,target,importance
108,G109,G1406,155.517394
15,G16,G1440,130.234599
15,G16,G687,128.137108
175,G176,G228,127.023538
175,G176,G367,121.146559


In [13]:
len(network)

318833
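With several hundred thousand links, it is often convenient to look at a subset only. The following sketch shows two ordinary pandas selections on a link list with the same three columns. The rows in `toy_network` are invented values, not results from this run.

```python
import pandas as pd

# Hypothetical link list with the same columns as the inferred network.
toy_network = pd.DataFrame({
    "TF": ["G1", "G2", "G2", "G3"],
    "target": ["G10", "G11", "G12", "G13"],
    "importance": [4.0, 3.0, 2.0, 1.0],
})

# The k links with the highest importance.
top_links = toy_network.nlargest(2, "importance")

# All links regulated by a single transcription factor.
links_for_tf = toy_network[toy_network["TF"] == "G2"]

print(top_links["target"].tolist())     # ['G10', 'G11']
print(links_for_tf["target"].tolist())  # ['G11', 'G12']
```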

## 3. Write the GRN link list to file `[TF, target, importance]`.

In [14]:
network.to_csv('ex_03_network.tsv', sep='\t', header=False, index=False)
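Because the file is written with `header=False`, the column names have to be supplied again when reading it back. A minimal round-trip sketch, using an in-memory buffer and an invented `toy_network` instead of the real file:

```python
import io

import pandas as pd

# Hypothetical link list standing in for the inferred network.
toy_network = pd.DataFrame({
    "TF": ["G1", "G2"],
    "target": ["G10", "G11"],
    "importance": [2.5, 1.5],
})

# Write without header and index, as in the cell above.
buffer = io.StringIO()
toy_network.to_csv(buffer, sep="\t", header=False, index=False)
buffer.seek(0)

# Read back: header=None because the file has no header row,
# names= restores the column labels.
restored = pd.read_csv(buffer, sep="\t", header=None,
                       names=["TF", "target", "importance"])

print(list(restored.columns))  # ['TF', 'target', 'importance']
```

For the real output, the buffer would be replaced by the path of the written file.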