This notebook is a tutorial on common analysis tasks involving gct/x files.

As a refresher, gct files are plain-text files that combine a data matrix with the metadata describing it. gctx files are the binary, HDF5-based equivalent; they are smaller and faster to read and write.

## Reading in gct/x files

**To read in a .gct or .gctx file:**

In [30]:
% gct and gctx files that we'll be using throughout the tutorial
gctx_file_location = '../resources/example.gctx';
gct_file_location = '../resources/example.gct';

### gct

In [31]:
ds = parse_gctx(gct_file_location)

Reading ../resources/example.gct [978x377]
class:single
read:978/978
Done.


ds = 

        mat: [978x377 single]
        rid: {978x1 cell}
        rhd: {11x1 cell}
      rdesc: {978x11 cell}
      rdict: [11x1 containers.Map]
        cid: {377x1 cell}
        chd: {35x1 cell}
      cdesc: {377x35 cell}
      cdict: [35x1 containers.Map]
    version: '#1.3'
        src: '../resources/example.gct'


### gctx

In [32]:
ds = parse_gctx(gctx_file_location)

Reading ../resources/example.gctx [978x1476]
Done [0.79 s].

ds = 

        mat: [978x1476 double]
        rid: {978x1 cell}
        rhd: {11x1 cell}
      rdesc: {978x11 cell}
        cid: {1476x1 cell}
        chd: {35x1 cell}
      cdesc: {1476x35 cell}
      rdict: [11x1 containers.Map]
      cdict: [35x1 containers.Map]
        src: '../resources/example.gctx'
    version: 'GCTX1.0'
     h5path: '/0/DATA/0/matrix'
     h5name: 'unnamed'


We call `ds` a gct object, but it is represented as a MATLAB struct whose fields hold the data matrix and its metadata. Note that `parse_gctx` can read both gct and gctx files.

**Say you only want to read in the metadata from a big gctx file without touching the data.**

In [33]:
ds_with_only_meta = parse_gctx(gctx_file_location, 'annot_only', true)

Reading ../resources/example.gctx Done [0.75 s].

ds_with_only_meta = 

        mat: []
        rid: {978x1 cell}
        rhd: {11x1 cell}
      rdesc: {978x11 cell}
        cid: {1476x1 cell}
        chd: {35x1 cell}
      cdesc: {1476x35 cell}
      rdict: [11x1 containers.Map]
      cdict: [35x1 containers.Map]
        src: '../resources/example.gctx'
    version: 'GCTX1.0'
     h5path: '/0/DATA/0/matrix'


Note that `mat` is empty, but the metadata is the same as above.

**You can also read in only a certain subset of rids and/or cids.**

Practically speaking, this is more useful for gctx files than for gct files: because a gct file is plain text, the entire file has to be read in anyway. **You need a list of the desired rids and/or cids beforehand; one way to get it is to read in only the metadata first and then subset it.**

In [34]:
% Get rids and cids from ds_with_only_meta
my_rids = ds_with_only_meta.rid(3:5)
my_cids = ds_with_only_meta.cid(5)

my_rids = 

    '217140_s_at'
    '209253_at'
    '214404_x_at'


my_cids = 

    'LJP005_MCF7_24H_X1_B17:A07'


In [35]:
% Use my_rids and my_cids for getting a subset of the gctx
ds_subset = parse_gctx(gctx_file_location, 'rid', my_rids, 'cid', my_cids)

Reading ../resources/example.gctx [3x1]
Performing 1 hyperslab selections using single mode
Done [0.74 s].

ds_subset = 

        mat: [3x1 double]
        rid: {3x1 cell}
        rhd: {11x1 cell}
      rdesc: {3x11 cell}
        cid: {'LJP005_MCF7_24H_X1_B17:A07'}
        chd: {35x1 cell}
      cdesc: {1x35 cell}
      rdict: [11x1 containers.Map]
      cdict: [35x1 containers.Map]
        src: '../resources/example.gctx'
    version: 'GCTX1.0'
     h5path: '/0/DATA/0/matrix'
     h5name: 'unnamed'


## Modifying a gct object

You can modify the fields of the gct object directly. Be careful not to break its internal consistency, though: changing `rhd` so that it no longer aligns with `rdesc` and `rdict` (as the assignment below does) leaves the object in an inconsistent state.

In [36]:
ds.rhd = {}

ds = 

        mat: [978x1476 double]
        rid: {978x1 cell}
        rhd: {}
      rdesc: {978x11 cell}
        cid: {1476x1 cell}
        chd: {35x1 cell}
      cdesc: {1476x35 cell}
      rdict: [11x1 containers.Map]
      cdict: [35x1 containers.Map]
        src: '../resources/example.gctx'
    version: 'GCTX1.0'
     h5path: '/0/DATA/0/matrix'
     h5name: 'unnamed'
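
The alignment that the assignment above breaks can be checked with a few lines of base MATLAB. The following is a minimal sketch rather than a toolbox function: it uses a small mock struct in place of a parsed dataset, and it assumes that `rdict` maps each header name in `rhd` to a column index of `rdesc`.

```matlab
% Mock gct-like struct (hypothetical values, stands in for a parsed dataset)
ds = struct();
ds.rid   = {'r1'; 'r2'; 'r3'};
ds.rhd   = {'pr_gene_symbol'; 'pr_bset_id'};
ds.rdesc = {'A', 'dp52'; 'B', 'dp53'; 'C', 'dp52'};
% Assumption: rdict maps header name -> column index in rdesc
ds.rdict = containers.Map(ds.rhd, (1:numel(ds.rhd))');

% rdesc should have one row per rid and one column per header in rhd
rows_ok = size(ds.rdesc, 1) == numel(ds.rid);
cols_ok = size(ds.rdesc, 2) == numel(ds.rhd);
% every header in rhd should be a key of rdict
keys_ok = all(isKey(ds.rdict, ds.rhd));
is_consistent = rows_ok && cols_ok && keys_ok;

% Emptying rhd breaks the column alignment with rdesc
broken = ds;
broken.rhd = {};
broken_ok = size(broken.rdesc, 2) == numel(broken.rhd);
```

Here `is_consistent` is true for the intact struct, while `broken_ok` is false once `rhd` has been emptied. The same three comparisons apply to `cid`, `chd`, `cdesc` and `cdict` on the column side.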


## Merging gct/x files

Say you have two gct/x files that you want to merge. They have the same row metadata but different columns.

In [37]:
ds1 = parse_gctx(gctx_file_location)
ds2 = parse_gctx(gct_file_location)

Reading ../resources/example.gctx [978x1476]
Done [0.72 s].

ds1 = 

        mat: [978x1476 double]
        rid: {978x1 cell}
        rhd: {11x1 cell}
      rdesc: {978x11 cell}
        cid: {1476x1 cell}
        chd: {35x1 cell}
      cdesc: {1476x35 cell}
      rdict: [11x1 containers.Map]
      cdict: [35x1 containers.Map]
        src: '../resources/example.gctx'
    version: 'GCTX1.0'
     h5path: '/0/DATA/0/matrix'
     h5name: 'unnamed'

Reading ../resources/example.gct [978x377]
class:single
read:978/978
Done.


ds2 = 

        mat: [978x377 single]
        rid: {978x1 cell}
        rhd: {11x1 cell}
      rdesc: {978x11 cell}
      rdict: [11x1 containers.Map]
        cid: {377x1 cell}
        chd: {35x1 cell}
      cdesc: {377x35 cell}
      cdict: [35x1 containers.Map]
    version: '#1.3'
        src: '../resources/example.gct'


In [38]:
merged = merge_two(ds1, ds2)

Appending cols

merged = 

        mat: [978x1853 single]
        rid: {978x1 cell}
        rhd: {11x1 cell}
      rdesc: {978x11 cell}
        cid: {1853x1 cell}
        chd: {35x1 cell}
      cdesc: {1853x35 cell}
      rdict: [11x1 containers.Map]
      cdict: [35x1 containers.Map]
        src: 'unnamed'
    version: '#1.3'


## Slicing gct/x files

Let's say you want to slice a gct/x file to keep only `dp52` probes and only `DMSO` samples.

In [39]:
ds = parse_gctx(gctx_file_location)

Reading ../resources/example.gctx [978x1476]
Done [0.73 s].

ds = 

        mat: [978x1476 double]
        rid: {978x1 cell}
        rhd: {11x1 cell}
      rdesc: {978x11 cell}
        cid: {1476x1 cell}
        chd: {35x1 cell}
      cdesc: {1476x35 cell}
      rdict: [11x1 containers.Map]
      cdict: [35x1 containers.Map]
        src: '../resources/example.gctx'
    version: 'GCTX1.0'
     h5path: '/0/DATA/0/matrix'
     h5name: 'unnamed'


Get rids corresponding to dp52 probes.

In [40]:
beadset_ids = ds_get_meta(ds, 'row', 'pr_bset_id');
dp52_bool_array = strcmp('dp52', beadset_ids);
dp52_rids = ds.rid(dp52_bool_array);
length(dp52_rids)

ans =

   489


Get cids corresponding to DMSO samples.

In [41]:
pert_inames = ds_get_meta(ds, 'column', 'pert_iname');
dmso_bool_array = strcmp('DMSO', pert_inames);
dmso_cids = ds.cid(dmso_bool_array);
length(dmso_cids)

ans =

   100


In [42]:
sliced = ds_slice(ds, 'rid', dp52_rids, 'cid', dmso_cids)

sliced = 

        mat: [489x100 double]
        rid: {489x1 cell}
        rhd: {11x1 cell}
      rdesc: {489x11 cell}
        cid: {100x1 cell}
        chd: {35x1 cell}
      cdesc: {100x35 cell}
      rdict: [11x1 containers.Map]
      cdict: [35x1 containers.Map]
        src: '../resources/example.gctx'
    version: 'GCTX1.0'
     h5path: '/0/DATA/0/matrix'
     h5name: 'unnamed'


The size of `sliced` is as expected: 489 probes x 100 samples.
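
The boolean-mask selection used in the cells above can also be illustrated on a small mock dataset with base MATLAB only. This is a sketch of the idea, not of how `ds_slice` is implemented; the variable names and values are made up for illustration.

```matlab
% Mock data: 4 probes x 3 samples (values are arbitrary)
mat = magic(4);
mat = mat(:, 1:3);
rid = {'p1'; 'p2'; 'p3'; 'p4'};
cid = {'s1'; 's2'; 's3'};
bset = {'dp52'; 'dp53'; 'dp52'; 'dp53'};   % mock pr_bset_id, one per row
pert = {'DMSO'; 'drugA'; 'DMSO'};          % mock pert_iname, one per column

% strcmp returns a logical array that marks the matching entries
row_mask = strcmp('dp52', bset);
col_mask = strcmp('DMSO', pert);

% Logical indexing keeps only the marked rows and columns
sub_mat = mat(row_mask, col_mask);
sub_rid = rid(row_mask);
sub_cid = cid(col_mask);

% The sliced matrix must match the number of selected ids
assert(isequal(size(sub_mat), [numel(sub_rid), numel(sub_cid)]));
```

An assertion like the last line can replace a visual check of the printed dimensions when slicing is part of a longer script.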

## Compute correlations between columns

In [43]:
ds = parse_gctx(gct_file_location);
cc = ds_corr(ds);

Reading ../resources/example.gct [978x377]
class:single
read:978/978
Done.


In [44]:
% cc is a gct object
cc.mat(1:5, 1:5)

ans =

    1.0000    0.9445    0.9232    0.9317    0.8501
    0.9445    1.0000    0.9202    0.9264    0.8463
    0.9232    0.9202    1.0000    0.9160    0.8696
    0.9317    0.9264    0.9160    1.0000    0.8634
    0.8501    0.8463    0.8696    0.8634    1.0000


## Transpose a gct/x

In [45]:
ds = parse_gctx(gct_file_location)

Reading ../resources/example.gct [978x377]
class:single
read:978/978
Done.


ds = 

        mat: [978x377 single]
        rid: {978x1 cell}
        rhd: {11x1 cell}
      rdesc: {978x11 cell}
      rdict: [11x1 containers.Map]
        cid: {377x1 cell}
        chd: {35x1 cell}
      cdesc: {377x35 cell}
      cdict: [35x1 containers.Map]
    version: '#1.3'
        src: '../resources/example.gct'


In [46]:
transposed = transpose_gct(ds)

transposed = 

        mat: [377x978 single]
        rid: {377x1 cell}
        rhd: {35x1 cell}
      rdesc: {377x35 cell}
      rdict: [35x1 containers.Map]
        cid: {978x1 cell}
        chd: {11x1 cell}
      cdesc: {978x11 cell}
      cdict: [11x1 containers.Map]
    version: '#1.3'
        src: '../resources/example.gct'


## Writing gct/x files

In [47]:
mkgct('example_out.gct', ds)
mkgctx('example_out.gctx', ds)

Saving file to example_out_n377x978.gct
Dimensions of matrix: [978x377]
Setting precision to 4
Saved.

ans =

example_out_n377x978.gct

Saving HDF5 dataset to: example_out_n377x978.gctx...
Disabling compression.
Setting chunk size to: 978x268
done [0.15s].

ans =

example_out_n377x978.gctx


Note that the numbers of rows and columns are automatically appended to the filename. Note also that the number of **columns** comes first, so the number of rows is the last thing in the name: a glance at the end of the filename shows whether the file is in landmark space (978 rows).
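
The naming pattern seen in the output above can be reproduced, and read back, with base MATLAB. This is a hypothetical reconstruction based only on the filenames printed above, not the toolbox's own code; `base` and `mat` are illustrative stand-ins.

```matlab
% Stand-ins for the output name and the data matrix
base = 'example_out';
mat = zeros(978, 377, 'single');
[n_rows, n_cols] = size(mat);

% Columns first, then rows, matching 'example_out_n377x978.gct'
out_name = sprintf('%s_n%dx%d.gct', base, n_cols, n_rows);

% Recover the dimensions from the end of a filename
tok = regexp(out_name, '_n(\d+)x(\d+)\.gctx?$', 'tokens', 'once');
dims = str2double(tok);            % [n_cols, n_rows]
is_landmark = dims(2) == 978;      % landmark space has 978 rows
```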

In [48]:
% Clean-up
delete 'example_out_n377x978.gct'
delete 'example_out_n377x978.gctx'