First, a note: here we use import statements as if you were in the pandasGEXpress directory (e.g., from wherever you've forked l1ktools: l1ktools/python/broadinstitute_cmap/io/pandasGEXpress). If you'd like to use nested imports, first run setup.py (found in l1ktools/python/broadinstitute_cmap). For details, type

```
$ python setup.py --help 
```

on the command line. Then, from your Python session, you should be able to access pandasGEXpress with import statements of the following form: 

```python
from broadinstitute_cmap.io.pandasGEXpress import [package]
```


# Reading in gct/x files

## If you'd like to read in an entire (.gct or .gctx) file to a GCToo instance:

In [5]:
from broadinstitute_cmap.io.GCToo import parse
my_gctoo = parse.parse("functional_tests/both_metadata_example_n1476x978.gctx")
my_gctoo

<GCToo.GCToo at 0x11694e490>

## If you're using the .gctx format, you can also read in only the row or column metadata.

In [45]:
# read in row metadata only 
row_metadata_only = parse.parse("functional_tests/mini_gctx_with_metadata_n2x3.gctx", meta_only="row")
row_metadata_only

rhd,pr_analyte_id,pr_analyte_num,pr_gene_id,pr_model_id
rid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
200814_at,Analyte 11,11,5720,
218597_s_at,Analyte 12,12,55847,
217140_s_at,Analyte 12,12,7416,


In [8]:
# read in column metadata only
col_metadata_only = parse.parse("functional_tests/mini_gctx_with_metadata_n2x3.gctx", meta_only="col")
col_metadata_only

chd,bead_batch,bead_revision
cid,Unnamed: 1_level_1,Unnamed: 2_level_1
LJP005_A375_24H:DMSO:-666,b19,r2
LJP005_A375_24H:BRD-K76908866:10,b19,r2


## You can also read in only a certain subset of rids and/or cids to a GCToo instance.

Practically speaking, this is more useful for GCTX files than for GCT files: a GCT file is plain text, so the entire file has to be read in anyway. **You'll need to have a list of desired rids and/or cids already (this can be obtained by reading in only the metadata first, then subsetting).**
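
One way to build such a list, sketched here in plain pandas: filter the row metadata and take the index of the rows that remain. The DataFrame below is typed in by hand to mirror the row metadata of the mini example file, so the snippet runs without a .gctx file; in practice it would come from reading the metadata only. The choice of filter ("Analyte 12") is just an illustration.

```python
import pandas as pd

# Hand-built stand-in for the row metadata of mini_gctx_with_metadata_n2x3.gctx
row_meta = pd.DataFrame(
    {
        "pr_analyte_id": ["Analyte 11", "Analyte 12", "Analyte 12"],
        "pr_gene_id": [5720, 55847, 7416],
    },
    index=pd.Index(["200814_at", "218597_s_at", "217140_s_at"], name="rid"),
)

# Keep only the rids whose analyte is "Analyte 12"
analyte_12_rids = row_meta.index[row_meta["pr_analyte_id"] == "Analyte 12"].tolist()
print(analyte_12_rids)
```

The resulting list can then be passed as the `rid` argument when parsing.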


In [46]:
my_rids = ["218597_s_at", "200814_at"]
my_cids = ["LJP005_A375_24H:BRD-K76908866:10"]

# you can subset by rids, cids, or both rids and cids 
mini_gctoo_subset = parse.parse("functional_tests/mini_gctx_with_metadata_n2x3.gctx", rid = my_rids, cid= my_cids)
mini_gctoo_subset.row_metadata_df

rhd,pr_analyte_id,pr_analyte_num,pr_gene_id,pr_model_id
rid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
200814_at,Analyte 11,11,5720,
218597_s_at,Analyte 12,12,55847,


In [11]:
mini_gctoo_subset.data_df.shape

(2, 1)

In [12]:
mini_gctoo_subset.col_metadata_df

chd,bead_batch,bead_revision
cid,Unnamed: 1_level_1,Unnamed: 2_level_1
LJP005_A375_24H:BRD-K76908866:10,b19,r2


# Construct a GCToo object programmatically

For instance, say we've read in a file (for this example, let's use `my_gctoo`) and want to create a GCToo instance with the same data but without any metadata.

In [48]:
import pandas as pd
from broadinstitute_cmap.io.GCToo import GCToo
from broadinstitute_cmap.io.GCToo import parse

my_gctoo = parse.parse("functional_tests/both_metadata_example_n1476x978.gctx")

minimal_row_meta = pd.DataFrame(index = my_gctoo.row_metadata_df.index)
minimal_col_meta = pd.DataFrame(index = my_gctoo.col_metadata_df.index)

data_only_gctoo = GCToo.GCToo(data_df = my_gctoo.data_df, 
 	row_metadata_df = minimal_row_meta, col_metadata_df = minimal_col_meta)

data_only_gctoo.row_metadata_df.shape

(978, 0)

In [49]:
data_only_gctoo.col_metadata_df.shape

(1476, 0)

In [50]:
data_only_gctoo.data_df.shape == my_gctoo.data_df.shape

True
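
Before constructing a GCToo by hand, it can help to confirm that the three DataFrames line up. The sketch below assumes (this is not confirmed by the library here) that the row ids of `data_df` should match the index of `row_metadata_df`, and that its column ids should match the index of `col_metadata_df`. `frames_line_up` is a hypothetical helper, not part of the library, and the ids are made up.

```python
import pandas as pd

def frames_line_up(data_df, row_metadata_df, col_metadata_df):
    """Hypothetical helper: True if the ids agree and are in the same order."""
    rows_ok = data_df.index.equals(row_metadata_df.index)
    cols_ok = data_df.columns.equals(col_metadata_df.index)
    return rows_ok and cols_ok

# Small stand-in data; the ids are made up for illustration
data_df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["r1", "r2"], columns=["c1", "c2"])
minimal_row_meta = pd.DataFrame(index=data_df.index)
minimal_col_meta = pd.DataFrame(index=data_df.columns)

aligned = frames_line_up(data_df, minimal_row_meta, minimal_col_meta)
# Reversing the row metadata breaks the ordering, so the check fails
misaligned = frames_line_up(data_df, minimal_row_meta.iloc[::-1], minimal_col_meta)
print(aligned, misaligned)
```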

# Merging gct/x files


## Several files (with consistent name prefix), from the command line

Say you have a bunch of files that start with 'LINCS_GCP' in your Downloads folder that you want to concatenate. Type the following in your command line:

```
python /Users/some_name/code/l1ktools/python/broadinstitute_cmap/io/GCToo/concat_gctoo.py --file_wildcard '/Users/some_name/Downloads/LINCS_GCP*'
```

This will save a file called `concated.gctx` in your current directory. Make sure that the wildcard is in quotes!

## Two files to concatenate, from a Python session 

Say you have 2 GCToo objects in memory that you want to concatenate. `hstack` is the function in concat_gctoo.py that actually does the concatenation. From within the Python console or script where you have your 2 GCToos (gct1 & gct2), type the following:

In [3]:
from broadinstitute_cmap.io.GCToo import concat_gctoo
from broadinstitute_cmap.io.GCToo import parse

gct1 = parse.parse("functional_tests/test_merge_left.gct")
gct2 = parse.parse("functional_tests/test_merge_right.gct")
merged_gcts = concat_gctoo.hstack([gct1, gct2])

gct1.data_df.shape

INFO 2016-11-10 00:37:52,966 parse_gctoo parse Reading GCT: functional_tests/test_merge_left.gct
INFO 2016-11-10 00:37:52,981 parse_gctoo parse Reading GCT: functional_tests/test_merge_right.gct
INFO 2016-11-10 00:37:53,002 concat_gctoo hstack build GCToo of all...


(4, 3)

In [4]:
gct2.data_df.shape

(3, 3)

In [5]:
merged_gcts.data_df.shape

(4, 6)

Note that hstack also takes 3 optional arguments that work around certain obstacles common to concatenation. The full list of arguments to hstack is:

```
hstack(gctoos_list, fields_to_remove=None, reset_sample_ids=False,
    sort_headers=False)
```

1)  fields_to_remove: If the row metadata contains headers with values that are not the same in all files, then you can remove these headers using the fields_to_remove argument.

2) reset_sample_ids: If the sample ids are not unique between different files, you can use the reset_sample_ids argument. This will move the cids to a new metadata field and assign a unique integer index for each sample.

3)  sort_headers: If the row metadata headers are the same between different files but not in the same order, you can sort them using the sort_headers argument.
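
To see why `reset_sample_ids` can be needed, here is a plain-pandas sketch of the underlying problem. It does not call `hstack` or reproduce its implementation; the two small DataFrames and their ids are made up, and the renumbering step only imitates the behaviour described in point 2.

```python
import pandas as pd

# Two made-up data matrices that share the sample id "s1"
left = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["r1", "r2"], columns=["s1", "s2"])
right = pd.DataFrame([[5.0], [6.0]], index=["r1", "r2"], columns=["s1"])

# Side-by-side concatenation keeps both "s1" columns, so the ids are no longer unique
combined = pd.concat([left, right], axis=1)
ids_unique_before = combined.columns.is_unique

# Imitate the described fix: remember the old ids, then assign integer ids
n_samples = combined.shape[1]
old_ids = pd.DataFrame({"old_cid": list(combined.columns)}, index=range(n_samples))
combined.columns = range(n_samples)

print(ids_unique_before, combined.columns.is_unique)
print(old_ids["old_cid"].tolist())
```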

## Two files to concatenate, from the command line 

Say you have 2 files that you want to concatenate: /Users/some_name/file_to_concatenate1.gct and /Users/some_name/file_to_concatenate2.gct. Type the following in your command line:

```
python /Users/some_name/code/l1ktools/python/broadinstitute_cmap/io/GCToo/concat_gctoo.py --list_of_gct_paths /Users/some_name/file_to_concatenate1.gct /Users/some_name/file_to_concatenate2.gct
```

Optional arguments are as above (for hstack). 

# Slicing gct/x files 

Slicing can be done from the command line or from an active Python session (assume you've already parsed a file and have an instance of GCToo called mini_gctoo). For this use case, you'll need to have a list of desired and/or undesired rids and/or cids already (for example, obtained by reading in only the metadata first, as described in the section on reading in gct/x files, or by other means). For example:

In [9]:
from broadinstitute_cmap.io.GCToo import parse
from broadinstitute_cmap.io.GCToo import slice_gct

mini_gctoo = parse.parse("functional_tests/mini_gctx_with_metadata_n2x3.gctx")

interesting_rids = ["218597_s_at", "200814_at"]
interesting_cids = ["LJP005_A375_24H:BRD-K76908866:10"]
boring_rids = ["217140_s_at"]

sliced_gctoo = slice_gct.slice_gctoo(mini_gctoo, rid = interesting_rids, cid = interesting_cids, exclude_rid = boring_rids)

sliced_gctoo.data_df.shape

(2, 1)
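
For intuition, the same selection can be sketched on a bare DataFrame with `.loc`. This is only an illustration in plain pandas, with data values typed in by hand from the mini example file; it is not how `slice_gctoo` is implemented, and it only touches the data matrix.

```python
import pandas as pd

# Data values copied by hand from the mini example file
data_df = pd.DataFrame(
    [[-0.283359, 0.011270], [0.304119, 1.921061], [0.398655, -0.144652]],
    index=["200814_at", "218597_s_at", "217140_s_at"],
    columns=["LJP005_A375_24H:DMSO:-666", "LJP005_A375_24H:BRD-K76908866:10"],
)

interesting_rids = ["218597_s_at", "200814_at"]
interesting_cids = ["LJP005_A375_24H:BRD-K76908866:10"]
boring_rids = ["217140_s_at"]

# Keep rows that are wanted and not excluded, preserving the original row order
keep_rows = [r for r in data_df.index if r in interesting_rids and r not in boring_rids]
sliced = data_df.loc[keep_rows, interesting_cids]
print(sliced.shape)
```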

# Add/delete annotations

First, let's take the simplest case. You have a GCToo instance with metadata and want to add a list **that you know is ordered already like the rids of the row metadata**. 

In [10]:
from broadinstitute_cmap.io.GCToo import parse 

mini = parse.parse("functional_tests/mini_gctx_with_metadata_n2x3.gctx")
mini.row_metadata_df

rhd,pr_analyte_id,pr_analyte_num,pr_gene_id,pr_model_id
rid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
200814_at,Analyte 11,11,5720,
218597_s_at,Analyte 12,12,55847,
217140_s_at,Analyte 12,12,7416,


In [11]:
# now delete the column "pr_model_id" (it's all NaN anyway, so say we don't care about it)
del mini.row_metadata_df["pr_model_id"]
mini.row_metadata_df 

rhd,pr_analyte_id,pr_analyte_num,pr_gene_id
rid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
200814_at,Analyte 11,11,5720
218597_s_at,Analyte 12,12,55847
217140_s_at,Analyte 12,12,7416


In [12]:
# now add a column from a list 
mini.row_metadata_df["is_this_fun"] = ["yes","yes", "maybe"]

mini.row_metadata_df

rhd,pr_analyte_id,pr_analyte_num,pr_gene_id,is_this_fun
rid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
200814_at,Analyte 11,11,5720,yes
218597_s_at,Analyte 12,12,55847,yes
217140_s_at,Analyte 12,12,7416,maybe


In [27]:
# BUT: This assumes the new column is ordered the same as the rids. 
# If this isn't the case:

import pandas as pd

mini = parse.parse("functional_tests/mini_gctx_with_metadata_n2x3.gctx")
mini.row_metadata_df

new_annotation = pd.DataFrame([89, 9137430, 3], index = ["217140_s_at","200814_at", "218597_s_at"], columns = ["important_metric"])
new_annotation

Unnamed: 0,important_metric
217140_s_at,89
200814_at,9137430
218597_s_at,3


In [28]:
mini.row_metadata_df = pd.merge(mini.row_metadata_df, new_annotation, left_index=True, right_index=True)
mini.row_metadata_df

rhd,pr_analyte_id,pr_analyte_num,pr_gene_id,pr_model_id,important_metric
217140_s_at,Analyte 12,12,7416,,89
200814_at,Analyte 11,11,5720,,9137430
218597_s_at,Analyte 12,12,55847,,3
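
One caveat worth checking: `pd.merge` performs an inner join by default, so any rid missing from the new annotation is dropped from the result without warning. Below is a minimal sketch of a guard, using hand-typed stand-ins for the two DataFrames. The shortened `new_annotation` is made up to trigger the problem.

```python
import pandas as pd

row_meta = pd.DataFrame(
    {"pr_gene_id": [5720, 55847, 7416]},
    index=["200814_at", "218597_s_at", "217140_s_at"],
)
# Made-up annotation that covers only two of the three rids
new_annotation = pd.DataFrame(
    {"important_metric": [89, 3]}, index=["217140_s_at", "218597_s_at"]
)

# rids present in the metadata but absent from the annotation
missing = row_meta.index.difference(new_annotation.index)
print(list(missing))

# how="left" keeps every original rid and fills the gaps with NaN
merged = pd.merge(row_meta, new_annotation, left_index=True, right_index=True, how="left")
print(merged.shape)
```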


# Transpose a GCToo

In [33]:
from broadinstitute_cmap.io.GCToo import GCToo
from broadinstitute_cmap.io.GCToo import parse 

mini = parse.parse("functional_tests/mini_gctx_with_metadata_n2x3.gctx")
mini.multi_index_df 

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,bead_batch,b19,b19
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,bead_revision,r2,r2
Unnamed: 0_level_2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,cid,LJP005_A375_24H:DMSO:-666,LJP005_A375_24H:BRD-K76908866:10
pr_analyte_id,pr_analyte_num,pr_gene_id,pr_model_id,rid,Unnamed: 5_level_3,Unnamed: 6_level_3
Analyte 11,11,5720,,200814_at,-0.283359,0.01127
Analyte 12,12,55847,,218597_s_at,0.304119,1.921061
Analyte 12,12,7416,,217140_s_at,0.398655,-0.144652


In [32]:
transposed_mini = GCToo.GCToo(data_df = mini.data_df.transpose(), 
	row_metadata_df = mini.col_metadata_df, col_metadata_df = mini.row_metadata_df)

transposed_mini.multi_index_df

Unnamed: 0_level_0,Unnamed: 1_level_0,pr_analyte_id,Analyte 11,Analyte 12,Analyte 12
Unnamed: 0_level_1,Unnamed: 1_level_1,pr_analyte_num,11,12,12
Unnamed: 0_level_2,Unnamed: 1_level_2,pr_gene_id,5720,55847,7416
Unnamed: 0_level_3,Unnamed: 1_level_3,pr_model_id,NaN,NaN,NaN
Unnamed: 0_level_4,Unnamed: 1_level_4,cid,200814_at,218597_s_at,217140_s_at
bead_batch,bead_revision,rid,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5
b19,r2,LJP005_A375_24H:DMSO:-666,-0.283359,0.304119,0.398655
b19,r2,LJP005_A375_24H:BRD-K76908866:10,0.01127,1.921061,-0.144652


# Column or row-wise basic math

## Column/row-wise means

In [40]:
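from broadinstitute_cmap.io.GCToo import GCToo
from broadinstitute_cmap.io.GCToo import parse 
import pandas as pd

mini = parse.parse("functional_tests/mini_gctx_with_metadata_n2x3.gctx")
# axis = 0 averages down the rows: one mean per column (cid)
rowwise_mean = mini.data_df.mean(axis = 0)
# axis = 1 averages across the columns: one mean per row (rid)
colwise_mean = mini.data_df.mean(axis = 1)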

# want to write to a GCToo? Easy!
colwise_mean_mini_gctoo = GCToo.GCToo(data_df = pd.DataFrame(colwise_mean, columns=["cid"]), 
	row_metadata_df = mini.row_metadata_df, col_metadata_df = pd.DataFrame(["colwise_mean"], index=["cid"]))
colwise_mean_mini_gctoo.data_df

Unnamed: 0_level_0,cid
rid,Unnamed: 1_level_1
200814_at,-0.136045
218597_s_at,1.11259
217140_s_at,0.127002
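
Because the `axis` argument is easy to mix up, here is a self-contained check in plain pandas, with the data values typed in by hand from the mini example file: `axis=0` yields one mean per column (sample), while `axis=1` yields one mean per row (probe).

```python
import pandas as pd

data_df = pd.DataFrame(
    [[-0.283359, 0.011270], [0.304119, 1.921061], [0.398655, -0.144652]],
    index=["200814_at", "218597_s_at", "217140_s_at"],
    columns=["LJP005_A375_24H:DMSO:-666", "LJP005_A375_24H:BRD-K76908866:10"],
)

per_column = data_df.mean(axis=0)  # indexed by cid, length 2
per_row = data_df.mean(axis=1)     # indexed by rid, length 3

print(per_column.index.tolist())
print(per_row.index.tolist())
```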


## Pairwise correlation of columns

Notes: 
- To compute correlations between rows with the pandas method, transpose data_df first.
- pandas' default correlation method is Pearson.

In [41]:
mini.data_df.corr(method = "spearman")

cid,LJP005_A375_24H:DMSO:-666,LJP005_A375_24H:BRD-K76908866:10
cid,Unnamed: 1_level_1,Unnamed: 2_level_1
LJP005_A375_24H:DMSO:-666,1.0,-0.5
LJP005_A375_24H:BRD-K76908866:10,-0.5,1.0
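
As a sketch of the first note, row-wise correlation only needs a transpose before `corr`. The values are typed in by hand from the mini example file. With just two samples per row, every pairwise Pearson correlation is necessarily +1 or -1, so this demonstrates the call rather than a meaningful result.

```python
import pandas as pd

data_df = pd.DataFrame(
    [[-0.283359, 0.011270], [0.304119, 1.921061], [0.398655, -0.144652]],
    index=["200814_at", "218597_s_at", "217140_s_at"],
    columns=["LJP005_A375_24H:DMSO:-666", "LJP005_A375_24H:BRD-K76908866:10"],
)

# After the transpose, the rids become columns, so corr() compares rows of data_df
row_corr = data_df.transpose().corr()  # default method is Pearson
print(row_corr.shape)
```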


## Aggregation by specified grouping 

This task is particularly well-suited to the multi-index DataFrame attribute of a GCToo instance.

In [42]:
mini.multi_index_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,bead_batch,b19,b19
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,bead_revision,r2,r2
Unnamed: 0_level_2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,cid,LJP005_A375_24H:DMSO:-666,LJP005_A375_24H:BRD-K76908866:10
pr_analyte_id,pr_analyte_num,pr_gene_id,pr_model_id,rid,Unnamed: 5_level_3,Unnamed: 6_level_3
Analyte 11,11,5720,,200814_at,-0.283359,0.01127
Analyte 12,12,55847,,218597_s_at,0.304119,1.921061
Analyte 12,12,7416,,217140_s_at,0.398655,-0.144652


In [43]:
# Say you want to group by pr_analyte_id and pr_gene_id,
# And sum the values of these group members
mini.multi_index_df.groupby(level=["pr_analyte_id", "pr_gene_id"]).sum()

Unnamed: 0_level_0,bead_batch,b19,b19
Unnamed: 0_level_1,bead_revision,r2,r2
Unnamed: 0_level_2,cid,LJP005_A375_24H:DMSO:-666,LJP005_A375_24H:BRD-K76908866:10
pr_analyte_id,pr_gene_id,Unnamed: 2_level_3,Unnamed: 3_level_3
Analyte 11,5720,-0.283359,0.01127
Analyte 12,7416,0.398655,-0.144652
Analyte 12,55847,0.304119,1.921061
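
If only one grouping field is needed, a similar aggregation can be sketched without `multi_index_df` by grouping the data on a column of the row metadata. The DataFrames below are hand-typed stand-ins for `mini.data_df` and `mini.row_metadata_df`, and grouping on `pr_analyte_id` alone is just an illustrative choice.

```python
import pandas as pd

rids = ["200814_at", "218597_s_at", "217140_s_at"]
data_df = pd.DataFrame(
    [[-0.283359, 0.011270], [0.304119, 1.921061], [0.398655, -0.144652]],
    index=rids,
    columns=["LJP005_A375_24H:DMSO:-666", "LJP005_A375_24H:BRD-K76908866:10"],
)
row_meta = pd.DataFrame(
    {"pr_analyte_id": ["Analyte 11", "Analyte 12", "Analyte 12"]}, index=rids
)

# Group rows by analyte (aligned on rid) and sum the values within each group
summed = data_df.groupby(row_meta["pr_analyte_id"]).sum()
print(summed)
```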


# Write a GCToo instance to file 

Assume you're in an active Python session in your local GCToo directory and have a GCToo instance (let's call it my_gctoo) that you'd like to write to .gct or .gctx.

To write to a GCT file:

```
import write_gct 

write_gct.write(my_gctoo, "some/path/to/my_gctoo.gct") 
```

To write to a GCTX file:

```
import write_gctx

write_gctx.write(my_gctoo, "some/path/to/my_gctoo.gctx")  
```

# Convert a gct -> gctx (or vice versa) from the command line

Converting from a gct to a gctx might be useful if you have a large gct and want faster IO in the future. 

To write some_thing.gct -> some_thing.gctx in working directory:

```
python gct2gctx.py -filename some_thing.gct 
```

To write some_thing.gct to a .gctx named something_else.gctx in a different out directory (both -outname and -outpath are optional):
```
python gct2gctx.py -filename some_thing.gct -outname something_else -outpath my/special/folder
```

Converting a gctx to a gct might be useful if you want to look at your .gctx file in a text editor or something similar. 

To write some_thing.gctx -> some_thing.gct in working directory:
```
python gctx2gct.py -filename some_thing.gctx
```

To write some_thing.gctx to a .gct named something_else.gct in a different out directory (both -outname and -outpath are optional):
```
python gctx2gct.py -filename some_thing.gctx -outname something_else -outpath my/special/folder
```