__Author:__ Bram Van de Sande

__Date:__ 25 APR 2018

__Remaining implementation challenges:__
    
1. Disk volume of region-based databases: the inverted database design can significantly reduce the size on disk (from 120 GB to 4.7 GB for the human database with 1M regions and 24K features), but with a huge impact on read performance (from 1 second to several minutes for a typical signature). Potential mitigation strategies are:
    - Recoding the decompression of the inverted design in C++ using its standard template library. The impact will only be a constant factor, while the personal investment will be substantial.
    - Breaking the clean interface between database storage and the AUC and recovery curve calculation. The exact order of the genes is not really required for the AUC calculation, so decompression times could be significantly reduced.
2. A large memory footprint when using many cores combined with a large rank_threshold increases the probability of MemoryErrors. This can be mitigated by reducing the number of cores used (current strategy). The burden on memory can be relieved by reimplementing the calculation of the average recovery curve and its standard deviation, avoiding the memory-hungry vectorized approach in favor of an iterative one (a two-phase iteration will be necessary, i.e. one loop for the average curve and a second, subsequent one for the standard deviation). The real problem, however, has something to do with the dask framework, which shows memory leakage. Use memory profiling tools to investigate this before changing the RCC implementation (https://pythonhosted.org/Pympler/muppy.html).
3. Bus errors can be avoided by copying all auxiliary datasets (ranking database and motif annotation table) to the node-local scratch storage ($VSC_SCRATCH_NODE).
4. Execution is incredibly slow, even when using multiple cores. Do profiling to get more information on where the bottlenecks reside. Processing 1 percent takes 30 min, so theoretically all 60,000 modules would take 50 hours, or about 2 days.
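
The two-phase iteration mentioned in item 2 could look roughly like the sketch below. It is not the existing RCC code: the function name `two_pass_mean_std` and the `curve_factory` callable (which must return a fresh iterator over recovery curves on every call) are hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption. Only one curve plus two accumulators are held in memory at any time.

```python
import numpy as np


def two_pass_mean_std(curve_factory, n_points):
    """Average curve and standard deviation without stacking all curves.

    curve_factory: hypothetical callable returning a fresh iterator over
                   1-D arrays of length n_points (one recovery curve each).
    """
    # First pass: accumulate the sum to obtain the average curve.
    total = np.zeros(n_points, dtype=np.float64)
    n_curves = 0
    for curve in curve_factory():
        total += curve
        n_curves += 1
    if n_curves == 0:
        raise ValueError("no curves supplied")
    mean = total / n_curves

    # Second pass: accumulate squared deviations from the average curve.
    squared_dev = np.zeros(n_points, dtype=np.float64)
    for curve in curve_factory():
        squared_dev += (curve - mean) ** 2

    # Population standard deviation (ddof=0); an assumption, see lead-in.
    std = np.sqrt(squared_dev / n_curves)
    return mean, std


# Usage with synthetic curves standing in for real recovery curves.
rng = np.random.default_rng(0)
curves = rng.random((20, 50))
avg_curve, std_curve = two_pass_mean_std(lambda: iter(curves), n_points=50)
```

The price is that every curve is computed or read twice, so this trades run time for memory.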

_Caveat:_ The conversion from regions back to genes still needs to be coded to support AUCell in the last step of the pipeline.

In [1]:
from pyscenic.regions import RegionRankingDatabase, convert, Delineation, load
from pyscenic.genesig import GeneSignature
import os

In [2]:
DB_FOLDER = '/Users/bramvandesande/Projects/lcb/databases/'
HG19_DB = 'hg19-regions-1M-9species.inverted.feather'
RESOURCES_FOLDER = '/Users/bramvandesande/Projects/lcb/pyscenic/src/resources/tests'
GMT_FNAME = 'c6.all.v6.1.symbols.gmt'
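
For item 3 above, staging the auxiliary files on node-local scratch before loading them might be done as in this sketch. The helper `stage_to_scratch` is hypothetical and not part of pyscenic; it assumes the `VSC_SCRATCH_NODE` environment variable is set on the cluster node and falls back to the system temporary directory otherwise.

```python
import os
import shutil
import tempfile


def stage_to_scratch(paths, scratch_dir=None):
    """Copy files to node-local scratch and return their new locations."""
    if scratch_dir is None:
        # Fallback to the system temp dir is an assumption for local runs.
        scratch_dir = os.environ.get("VSC_SCRATCH_NODE", tempfile.gettempdir())
    os.makedirs(scratch_dir, exist_ok=True)

    staged = []
    for path in paths:
        target = os.path.join(scratch_dir, os.path.basename(path))
        # Skip the copy when a file of the same size is already staged.
        already_staged = (
            os.path.exists(target)
            and os.path.getsize(target) == os.path.getsize(path)
        )
        if not already_staged:
            shutil.copy2(path, target)
        staged.append(target)
    return staged
```

The returned paths can then be used in place of the originals, for example `stage_to_scratch([os.path.join(DB_FOLDER, HG19_DB)])[0]` for the ranking database.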

In [3]:
!ls {DB_FOLDER}

dm6-5kb-upstream-full-tx-11species.mc8nr.feather
dm6_UPSTREAM5KB_FULL_TX_motifRanking.RData
hg19-500bp-upstream-10species.mc9nr.feather
hg19-500bp-upstream-7species.mc9nr.feather
hg19-regions-1M-9species.bed.gz
hg19-regions-1M-9species.inverted.feather
hg19-regions-1M-9species.inverted.identifiers.txt
hg19-regions-220330-9species.extracted.feather
hg19-regions-220330-9species.inverted.feather
hg19-regions-220330-9species.inverted.identifiers.txt
hg19-tss-centered-10kb-10species.mc9nr.feather
hg19-tss-centered-10kb-7species.mc9nr.feather
hg19-tss-centered-5kb-10species.mc9nr.feather
hg19-tss-centered-5kb-7species.mc9nr.feather
mm9-500bp-upstream-10species.mc9nr.feather
mm9-500bp-upstream-7species.mc9nr.feather
mm9-tss-centered-10kb-10species.mc9nr.feather
mm9-tss-centered-10kb-7species.mc9nr.feather
mm9-tss-centered-5kb-10species.mc9nr.feather
mm9-tss-centered-5kb-7species.mc9nr.feather


In [4]:
db = RegionRankingDatabase(os.path.join(DB_FOLDER, HG19_DB), name='hg19-1M-regions')

In [5]:
len(db.regions)

1223024

In [6]:
signatures = GeneSignature.from_gmt(os.path.join(RESOURCES_FOLDER, GMT_FNAME), 'HGNC', field_separator='\t', gene_separator='\t')
len(signatures)

189

In [8]:
import numpy as np
np.random.uniform(low=0.5, high=13.3, size=(50,))

array([ 9.28067221,  8.68231655,  9.37239211,  9.22263887,  9.07035501,
       13.21788222,  5.22955063, 11.14638916,  6.43318771,  3.91015555,
       12.4239484 , 12.52129836, 13.09815328, 12.29158812,  3.6175327 ,
       10.2431006 ,  3.25987139,  7.79902101,  6.3108416 ,  1.82386732,
       13.28174089,  8.55206661,  2.6222048 ,  8.36171365, 11.23567103,
        4.04145687, 10.40181721, 13.23567878,  1.96096473,  2.00766474,
       10.2775063 ,  3.82094041, 11.08229546,  2.54086962,  8.04782619,
        3.41068053,  8.87874084,  8.13214622,  2.79587443,  6.50719285,
        5.19540032,  3.70436932,  9.47801332,  4.63939797,  8.91565459,
        6.95054319,  7.07546034,  7.16014038, 11.29293814,  7.10672253])

In [11]:
# Assign a random weight to every gene; size must be a one-element tuple (n,).
test = signatures[89].copy(gene2weight=list(zip(signatures[89].genes, np.random.uniform(low=0.1, high=10.0, size=(len(signatures[89]),)))))

In [12]:
test

GeneSignature(name='PTEN_DN.V2_UP', nomenclature='HGNC', gene2weight=<frozendict {'IL32': 6.074815374172376, 'EHF': 9.001424216811124, 'BBC3': 4.2357212149387475, 'DENND2A': 6.054371626698191, 'CPA4': 6.3641123070290515, 'EMP1': 9.401430329587095, 'OSGIN1': 5.463787390838832, 'AKR1C3': 8.720942986125003, 'MYL10': 3.4105283810736533, 'IL17A': 4.812175424328964, 'GMIP': 3.6450138415728506, 'TRIB2': 6.96068973924421, 'BUD31': 3.582587321759326, 'LOC647835': 6.613040219776352, 'GPR87': 7.955599777815841, 'CCL16': 9.689740699757767, 'KRT15': 3.9560173119664617, 'NAV3': 7.925310025899901, 'LIPG': 9.768395717078745, 'ARVCF': 6.255898172533963, 'ZNF668': 0.3563703995239139, 'CEP41': 8.34206017733226, 'CLPB': 3.3049721274341075, 'HSPB8': 8.444547376459543, 'SERPINB5': 9.029959831725854, 'TSHB': 1.6675709830719216, 'ARHGDIB': 6.735016002274071, 'SATB1': 3.2187211360302883, 'ATP2B2': 2.7190791616140206, 'TMEM127': 8.436954307504214, 'SPOCK1': 7.244063009611832, 'LCN2': 3.779366793217826, 'ZNF549'

In [13]:
convert(test, db, Delineation.HG19_500BP_UP)

GeneSignature(name='PTEN_DN.V2_UP', nomenclature='regions', gene2weight=<frozendict {'chr16-reg3999': 6.074815374172376, 'chr11-reg25202': 9.001424216811124, 'chr19-reg41736': 4.2357212149387475, 'chr19-reg41738': 4.2357212149387475, 'chr7-reg94938': 6.054371626698191, 'chr7-reg86082': 6.3641123070290515, 'chr12-reg12238': 9.401430329587095, 'chr16-reg59456': 5.463787390838832, 'chr16-reg59461': 5.463787390838832, 'chr10-reg4846': 8.720942986125003, 'chr7-reg68086': 3.4105283810736533, 'chr6-reg45817': 4.812175424328964, 'chr19-reg21924': 3.6450138415728506, 'chr2-reg11453': 6.96068973924421, 'chr7-reg65635': 3.582587321759326, 'chr3-reg103935': 7.955599777815841, 'chr17-reg29452': 9.689740699757767, 'chr17-reg34680': 3.9560173119664617, 'chr12-reg53504': 7.925310025899901, 'chr18-reg30389': 9.768395717078745, 'chr22-reg3047': 6.255898172533963, 'chr16-reg27291': 0.3563703995239139, 'chr16-reg27298': 0.3563703995239139, 'chr11-reg51268': 3.3049721274341075, 'chr12-reg84041': 8.44454737

In [12]:
%%timeit -r1 -n1
db.load(convert(signatures[89], db, Delineation.HG19_500BP_UP))

5min 35s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
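
To see where those minutes go, the standard-library profiler can wrap a single call. The sketch below is an illustration only: `profile_call` is a hypothetical helper, and `slow_placeholder` stands in for the real `db.load(...)` call so that the example runs without the database.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the `top` entries sorted by cumulative time into a string.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_placeholder(n):
    """Stand-in workload; replace with the call under investigation."""
    return sum(i * i for i in range(n))


result, report = profile_call(slow_placeholder, 100_000)
print(report)
```

In the notebook, the placeholder would be swapped for something like `profile_call(db.load, convert(signatures[89], db, Delineation.HG19_500BP_UP))`, so that the report separates decompression from the rest of the load.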
