__Author:__ Bram Van de Sande

__Date:__ 29 JAN 2018
 
__Outline:__ This notebook assesses the read performance of two formats for storing whole-genome rankings.

1. The legacy format, based on SQLite3, with the following schema:
```
CREATE TABLE rankings (geneID VARCHAR(255), ranking BLOB);
CREATE TABLE motifs (motifName VARCHAR(255), idx INTEGER);
```
For each gene, its rankings across all regulatory features for which it was scored are stored as a single BLOB.
2. The new feather format, which needs to be installed with:
```
pip install feather-format
```
and provides a fast, minimal API to store and read pandas ``DataFrame``s. This format also allows reading only a subset of the columns of a stored dataframe.
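
The BLOB-based storage of the legacy format (item 1) can be sketched with Python's standard `sqlite3` and `array` modules. This is a minimal illustration under assumptions: the toy gene and motif names are hypothetical, and encoding the ranks as C integers via `array("i")` is an assumption, since the actual binary encoding used by pySCENIC's `RankingDatabase` is not shown in this notebook.

```python
import sqlite3
from array import array

# Hypothetical toy data: two genes ranked against three regulatory features.
# Encoding ranks as C ints is an assumption, not the pySCENIC encoding.
rankings = {"TP53": [2, 0, 1], "MYC": [0, 1, 2]}
motifs = ["motif_a", "motif_b", "motif_c"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (geneID VARCHAR(255), ranking BLOB)")
conn.execute("CREATE TABLE motifs (motifName VARCHAR(255), idx INTEGER)")
conn.executemany(
    "INSERT INTO motifs VALUES (?, ?)",
    [(name, idx) for idx, name in enumerate(motifs)],
)
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [(gene, array("i", ranks).tobytes()) for gene, ranks in rankings.items()],
)

def load_gene(conn, gene):
    """Decode the BLOB of one gene back into a list of ranks."""
    (blob,) = conn.execute(
        "SELECT ranking FROM rankings WHERE geneID = ?", (gene,)
    ).fetchone()
    ranks = array("i")
    ranks.frombytes(blob)
    return ranks.tolist()

print(load_gene(conn, "TP53"))  # prints [2, 0, 1]
```

In this layout every gene in a signature needs its own row lookup and BLOB decode, whereas a column-oriented file can return the requested gene columns directly.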

Caveat: The HDF5 file format, accessed through pandas via the PyTables package, is not used because of its current 2000-column limit (https://stackoverflow.com/questions/16639503/unable-to-save-dataframe-to-hdf5-object-header-message-is-too-large).

In [1]:
import glob, os
import numpy as np
import pandas as pd
from feather.api import write_dataframe, read_dataframe
from pyscenic.rnkdb import RankingDatabase
from pyscenic.genesig import GeneSignature

For the performance assessment, existing human whole-genome rankings stored in the legacy format are used.

In [4]:
DB_FOLDER = "/Users/bramvandesande/Projects/lcb/databases/"
DB_GLOB = os.path.join(DB_FOLDER,"hg19*.db")
db_fnames = glob.glob(DB_GLOB)
def name(fname):
    return os.path.basename(fname).split(".")[0]
dbs = [RankingDatabase(fname=fname, name=name(fname), nomenclature="HGNC") for fname in db_fnames]
dbs

[RankingDatabase(name="hg19-tss-centered-5kb-10species",n_features=24453),
 RankingDatabase(name="hg19-500bp-upstream-10species",n_features=24453),
 RankingDatabase(name="hg19-tss-centered-10kb-7species",n_features=24453),
 RankingDatabase(name="hg19-500bp-upstream-7species",n_features=24453),
 RankingDatabase(name="hg19-tss-centered-5kb-7species",n_features=24453),
 RankingDatabase(name="hg19-tss-centered-10kb-10species",n_features=24453)]

These rankings need to be converted to the new feather format.

In [5]:
def convert2feather(db, fname):
    features, genes, rankings = db.load_full()
    # Genes must be columns because feather is a column-oriented format.
    # Specifying dtype penalizes read performance.
    df = pd.DataFrame(index=features, columns=genes, data=rankings)
    write_dataframe(df, fname)
    return df 

In [6]:
def convert(db):
    fname = os.path.join(DB_FOLDER, "{}.feather".format(db.name))
    convert2feather(db, fname)
    return fname
feather_fnames = [convert(db) for db in dbs]

The size on disk (in GB) is similar for both file formats: first the legacy databases, then the feather files.

In [7]:
fsizes = [float(os.path.getsize(fname))/1e9 for fname in db_fnames]
fsizes

[1.099437056, 1.099437056, 1.099437056, 1.099437056, 1.099437056, 1.099437056]

In [8]:
fsizes = [float(os.path.getsize(fname))/1e9 for fname in feather_fnames]
fsizes

[1.091646048, 1.091646048, 1.091646048, 1.091646048, 1.091646048, 1.091646048]

The gene signatures used in this performance assessment are downloaded from MSigDB (http://software.broadinstitute.org/gsea/msigdb). The C6 collection is used in this notebook.

In [11]:
GMT_FNAME = "/Users/bramvandesande/Projects/lcb/resources/c6.all.v6.1.symbols.gmt.txt"

In [12]:
msigdb_c6 = GeneSignature.from_gmt(
                        fname=GMT_FNAME,
                        nomenclature="HGNC",
                        gene_separator="\t",
                        field_separator="\t")
len(msigdb_c6)

189
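
For readers unfamiliar with the GMT layout read above, each line holds a gene set name, a description field, and then the gene symbols, all separated by tabs. The following is a minimal sketch of such a parser using only the standard library. It is not the `GeneSignature.from_gmt` implementation, and the function name `parse_gmt` and the two example lines are hypothetical.

```python
import io

def parse_gmt(handle):
    """Map each gene set name to the sorted list of its gene symbols."""
    signatures = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        signatures[name] = sorted(g for g in genes if g)
    return signatures

# Hypothetical two-line GMT content for illustration only.
example = io.StringIO(
    "SET_A\thttp://example.org/a\tTP53\tMYC\n"
    "SET_B\thttp://example.org/b\tEGFR\tKRAS\tBRAF\n"
)
print(parse_gmt(example))
```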

The feather format needs a list of gene symbols.

In [13]:
signatures = [sorted(gs.genes) for gs in msigdb_c6]

To compare the read performance, every signature in the MSigDB collection is loaded from all human ranking databases.

In [14]:
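def print_report(res):
    # res is the TimeitResult returned by %%timeit -o; with -r1 it holds exactly one run.
    seconds = res.all_runs[0]
    n_dbs = len(dbs)
    n_signatures = len(msigdb_c6)
    # Average wall-clock time of a single (signature, database) load, in milliseconds.
    print("{}ms per load".format((seconds/(n_dbs*n_signatures))*1000.0))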

In [15]:
%%timeit -n1 -r1 -o -q
for signature in signatures:
    for fname in feather_fnames:
        read_dataframe(fname, columns=signature)

<TimeitResult : 1min 17s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

In [16]:
print_report(_)

68.2256395943558ms per load


In [17]:
%%timeit -n1 -r1 -o -q
for gs in msigdb_c6:
    for db in dbs:
        db.load(gs)

<TimeitResult : 1min 58s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

In [18]:
print_report(_)

104.71501166578506ms per load
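
Both timings above come from a single run (`-n1 -r1`), so they can be affected by noise such as the state of the operating system's file cache. One way to make the measurement more robust is to repeat it with the standard `timeit` module, as in the minimal sketch below. The helper `ms_per_load` and the stand-in `dummy_workload` are hypothetical. In the notebook the workload would be one of the nested loops over signatures and databases. Note that repeated runs read from a warm cache, which may favor both formats.

```python
import timeit

def ms_per_load(workload, n_loads, repeat=3):
    """Run `workload` `repeat` times and return the best time per load in ms."""
    runs = timeit.repeat(workload, number=1, repeat=repeat)
    return min(runs) / n_loads * 1000.0

# Hypothetical stand-in workload, used here only so the sketch is runnable.
def dummy_workload():
    sum(range(100_000))

# 6 databases x 189 signatures, as in the benchmark above.
print("{:.6f}ms per load".format(ms_per_load(dummy_workload, n_loads=6 * 189)))
```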


__Conclusion:__ The new feather format has several advantages over the legacy format:
- Faster read performance on average: about 68 ms versus 105 ms per load in this single-run benchmark.
- A far simpler implementation that relies entirely on an external Python package.
- Interoperability between R and Python.
- Reading entire dataframes into memory is significantly faster with the feather format (not benchmarked in this notebook).
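
The first point can be quantified from the per-load figures printed by the two benchmark cells. The short calculation below uses only those reported numbers and the 6 databases and 189 signatures of this notebook. It reflects a single run and should be read as indicative rather than definitive.

```python
# Per-load timings reported by the two benchmark cells (milliseconds).
feather_ms = 68.2256395943558
sqlite_ms = 104.71501166578506

n_loads = 6 * 189  # 6 databases x 189 signatures

speedup = sqlite_ms / feather_ms
saved_seconds = (sqlite_ms - feather_ms) * n_loads / 1000.0

print("speedup: {:.2f}x".format(speedup))                           # speedup: 1.53x
print("time saved over all loads: {:.1f}s".format(saved_seconds))  # about 41 s
```

The saved time of about 41 seconds agrees with the difference between the two reported totals (1min 58s versus 1min 17s).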