__Author:__ Bram Van de Sande

__Date:__ 6 FEB 2018

__Outline:__ This notebook clarifies how the co-expression modules derived from GENIE3 can be refined into true regulomes, i.e. by excluding the indirect targets of transcription factors. This step is also known as "RcisTarget".

In [1]:
import os
import glob
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase, SQLiteRankingDatabase, MemoryDecorator
from pyscenic.genesig import GeneSignature, Regulome
from pyscenic.regulome import module2regulome_bincount_impl, derive_regulomes, module2regulome_numba_impl
from pyscenic.utils import load_motif_annotations

from dask import delayed
from dask.dot import dot_graph
from dask.multiprocessing import get
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler
from dask.diagnostics import ProgressBar
from distributed import LocalCluster, Client
from bokeh.io import output_notebook, push_notebook, show
output_notebook()
from dask.diagnostics import visualize

In [2]:
%load_ext snakeviz
%load_ext line_profiler

In [3]:
DATA_FOLDER = "/Users/bramvandesande/Projects/lcb/tmp"
RESOURCES_FOLDER = "/Users/bramvandesande/Projects/lcb/resources"
DATABASE_FOLDER = "/Users/bramvandesande/Projects/lcb/databases/"

SQLITE_GLOB = os.path.join(DATABASE_FOLDER, "mm9-*.db")
FEATHER_GLOB = os.path.join(DATABASE_FOLDER, "mm9-*.feather")

MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.mgi-m0.001-o0.0.tbl")

NOMENCLATURE = "MGI"

Make sure the databases are available in feather format.

In [4]:
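# One-off conversion of the SQLite databases to feather format; set the guard to True to run it.
if False:
    def derive_db_name(fname):
        # The database name is the file name up to the first dot.
        return os.path.basename(fname).split(".")[0]

    from pyscenic.rnkdb import convert2feather

    for fname in glob.glob(SQLITE_GLOB):
        convert2feather(fname, DATABASE_FOLDER, derive_db_name(fname), NOMENCLATURE)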

### Load resources

Co-expression modules were derived from GENIE3 output.

In [5]:
with open(os.path.join(DATA_FOLDER,'modules.pickle'), 'rb') as f:
    modules = pickle.load(f)

In [6]:
len(modules)

5106

### Load whole genome ranking databases

All implementations of the database are loaded for performance testing.

In [7]:
def name(fname):
    return os.path.basename(fname).split(".")[0]

In [8]:
db_fnames = glob.glob(FEATHER_GLOB)
dbs = [RankingDatabase(fname=fname, name=name(fname), nomenclature="MGI") for fname in db_fnames]

In [9]:
len(dbs)

6

In [10]:
sqldb_fnames = glob.glob(SQLITE_GLOB)
sqldbs = [SQLiteRankingDatabase(fname=fname, name=name(fname), nomenclature="MGI") for fname in sqldb_fnames]

In [11]:
len(sqldbs)

6

In [12]:
memdb = MemoryDecorator(dbs[0])

### Load motif annotations

In [13]:
motif_annotations = load_motif_annotations(MOTIF_ANNOTATIONS_FNAME)

In [14]:
motif_annotations.head()

| gene_name | #motif_id | motif_similarity_qvalue | orthologous_identity | description |
|---|---|---|---|---|
| Hoxa9 | bergman__Abd-B | 0.0006 | 1.0 | gene is annotated for similar motif cisbp__M10... |
| Zfp128 | bergman__Aef1 | 0.0 | 0.220264 | motif is annotated for orthologous gene FBgn00... |
| Zfp853 | bergman__Cf2 | 0.0 | 0.166667 | motif is annotated for orthologous gene FBgn00... |
| Nr1h2 | bergman__EcR_usp | 0.0 | 0.378924 | gene is orthologous to FBgn0000546 in D. melan... |
| Nr1h3 | bergman__EcR_usp | 0.0 | 0.408989 | gene is orthologous to FBgn0000546 in D. melan... |
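The annotation table is indexed by gene name and motif ID, so it can be queried in both directions. The sketch below rebuilds the five rows shown above as a toy DataFrame (the `description` column is left out because it is truncated in the output) and looks up the motifs for a transcription factor and the transcription factors for a motif. The variable names are illustrative and not part of pySCENIC.

```python
import pandas as pd

# Toy copy of the rows displayed by motif_annotations.head().
rows = [
    ("Hoxa9", "bergman__Abd-B", 0.0006, 1.0),
    ("Zfp128", "bergman__Aef1", 0.0, 0.220264),
    ("Zfp853", "bergman__Cf2", 0.0, 0.166667),
    ("Nr1h2", "bergman__EcR_usp", 0.0, 0.378924),
    ("Nr1h3", "bergman__EcR_usp", 0.0, 0.408989),
]
columns = ["gene_name", "#motif_id", "motif_similarity_qvalue", "orthologous_identity"]
toy_annotations = pd.DataFrame(rows, columns=columns).set_index(["gene_name", "#motif_id"])

# Motifs annotated to a given transcription factor (first index level).
motifs_for_tf = toy_annotations.loc["Nr1h2"].index.tolist()

# Transcription factors annotated to a given motif (second index level).
tfs_for_motif = toy_annotations.xs("bergman__EcR_usp", level="#motif_id").index.tolist()

print(motifs_for_tf)
print(tfs_for_motif)
```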


### Single-thread pipeline

The pipeline is first run and profiled on a single thread, before scaling it via dask to the full combinatorial space of databases x modules.
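To give a sense of the size of that combinatorial space, the sketch below enumerates every (database, module) pair using the counts reported earlier in this notebook (6 databases, 5106 modules). The database names and module IDs are placeholders, not the real objects.

```python
from itertools import product

n_databases = 6    # len(dbs) in this notebook
n_modules = 5106   # len(modules) in this notebook

# Placeholder identifiers standing in for the real databases and modules.
db_names = [f"db{i}" for i in range(n_databases)]
module_ids = range(n_modules)

# Each task is one (database, module) combination.
n_tasks = sum(1 for _ in product(db_names, module_ids))
print(n_tasks)
```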

In [15]:
module2regulome = module2regulome_bincount_impl

#### Feather-based storage implementation

In [20]:
%lprun -f module2regulome list((idx, module2regulome(dbs[0], module, motif_annotations)) for idx, module in enumerate(modules[0:25]))

In [17]:
%%snakeviz
regulomes = list((idx, module2regulome(dbs[0], module, motif_annotations)) for idx, module in enumerate(modules[0:25]))

 
*** Profile stats marshalled to file '/var/folders/cj/xhw0rd3s7hg5k4p78t4s3hph0000gn/T/tmp7lh2ty5d'. 


1. General performance is 78s for executing `module2regulome` 25 times.
1. 79% of time is spent at `recovery` and 18% at `db.load`.
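Since most of the time goes to `recovery`, it helps to see what such a step typically computes. The following is a minimal sketch of a bincount-based recovery curve and its AUC, assuming 0-based gene ranks per feature; it is not the pySCENIC implementation, and the function names and toy data are hypothetical.

```python
import numpy as np

def recovery_curves(rankings, rank_threshold):
    """Cumulative number of module genes recovered at each rank position.

    rankings: (n_features, n_genes) integer array; entry [i, j] is the
    0-based rank of module gene j in the whole-genome ranking of feature i.
    """
    n_features = rankings.shape[0]
    curves = np.empty((n_features, rank_threshold), dtype=np.int64)
    for i in range(n_features):
        ranks = rankings[i]
        ranks = ranks[ranks < rank_threshold]  # ignore genes ranked too low
        counts = np.bincount(ranks, minlength=rank_threshold)
        curves[i] = np.cumsum(counts)
    return curves

def auc(curves, rank_cutoff, n_genes):
    """Area under each recovery curve up to rank_cutoff, scaled to [0, 1]."""
    max_auc = float(rank_cutoff * n_genes)
    return curves[:, :rank_cutoff].sum(axis=1) / max_auc

# Toy example: feature 0 ranks all three genes at the top, feature 1 near the bottom.
toy_rankings = np.array([[0, 1, 2], [7, 8, 9]])
curves = recovery_curves(toy_rankings, rank_threshold=10)
aucs = auc(curves, rank_cutoff=5, n_genes=3)
print(aucs)
```

In this sketch each curve has `rank_threshold` entries, so the work per feature grows with that parameter.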

#### SQLite-based storage implementation

In [23]:
%lprun -f module2regulome list((idx, module2regulome(sqldbs[0], module, motif_annotations)) for idx, module in enumerate(modules[0:25]))

In [24]:
%%snakeviz
regulomes = list((idx, module2regulome(sqldbs[0], module, motif_annotations)) for idx, module in enumerate(modules[0:25]))

 
*** Profile stats marshalled to file '/var/folders/cj/xhw0rd3s7hg5k4p78t4s3hph0000gn/T/tmpl6qd4z5u'. 


1. General performance is 83s for executing `module2regulome` 25 times.
1. 42% of time is spent at `recovery` and 56% at `db.load`.

#### In-memory based implementation

In [26]:
%lprun -f module2regulome list((idx, module2regulome(memdb, module, motif_annotations)) for idx, module in enumerate(modules[0:25]))

In [23]:
%%snakeviz
regulomes = list((idx, module2regulome(memdb, module, motif_annotations)) for idx, module in enumerate(modules[0:25]))

 
*** Profile stats marshalled to file '/var/folders/cj/xhw0rd3s7hg5k4p78t4s3hph0000gn/T/tmpfwp_6eat'. 


1. General performance is 78s for executing `module2regulome` 25 times.
1. 89% of time is spent at `recovery` and 8.5% at `db.load`.

#### In-memory based implementation, assess effect of reducing rank_threshold parameter

In [22]:
%lprun -f module2regulome list((idx, module2regulome(memdb, module, motif_annotations, auc_threshold=0.01, rank_threshold=750)) for idx, module in enumerate(modules[0:25]))

1. General performance is 69s for executing `module2regulome` 25 times.
1. 93% of time is spent at `recovery` and 4.4% at `db.load`.

#### Feather-based implementation using an approach similar to the R implementation

In [20]:
%%snakeviz
regulomes = list((idx, module2regulome_numba_impl(dbs[0], module, motif_annotations)) for idx, module in enumerate(modules[0:25]))

 
*** Profile stats marshalled to file '/var/folders/cj/xhw0rd3s7hg5k4p78t4s3hph0000gn/T/tmpvme3jr1l'. 


1. General performance is 49s for executing `module2regulome_numba_impl` 25 times.
1. 47% of time is spent at `recovery` and 24% at `db.load`.

#### Approach combining all potential improvements (in-memory database, auc-only calculation to assess enriched features and numba JIT implementation).

In [19]:
%%snakeviz
regulomes = list((idx, module2regulome_numba_impl(memdb, module, motif_annotations)) for idx, module in enumerate(modules[0:25]))

 
*** Profile stats marshalled to file '/var/folders/cj/xhw0rd3s7hg5k4p78t4s3hph0000gn/T/tmphxcv1yj4'. 


1. General performance is 42s for executing `module2regulome_numba_impl` 25 times.
1. 81% of time is spent at `recovery`.
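The timings reported in this section can be put side by side. The sketch below derives the time per call and the speedup relative to the feather-based bincount run, and naively extrapolates to all 6 x 5106 database-module combinations. The extrapolation assumes that cost scales linearly with the number of calls, which the notebook does not verify.

```python
# Wall-clock seconds for 25 calls, as reported in the sections above.
timings_s = {
    "feather + bincount": 78,
    "sqlite + bincount": 83,
    "in-memory + bincount": 78,
    "in-memory + bincount, rank_threshold=750": 69,
    "feather + numba": 49,
    "in-memory + numba": 42,
}
N_CALLS = 25
N_TASKS = 6 * 5106  # databases x modules in this notebook

baseline = timings_s["feather + bincount"]
per_call_s = {name: t / N_CALLS for name, t in timings_s.items()}
speedup = {name: baseline / t for name, t in timings_s.items()}
# Naive linear extrapolation to the full single-threaded run, in hours.
est_hours = {name: s * N_TASKS / 3600 for name, s in per_call_s.items()}

for name in timings_s:
    print(f"{name}: {per_call_s[name]:.2f} s/call, "
          f"{speedup[name]:.2f}x, ~{est_hours[name]:.1f} h for all tasks")
```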

### Parallelized pipeline

#### Python multiprocessing implementation (db-dedicated workers using in memory copy + numba implementation of auc calculation).

Loading the database is also part of the overall timing. However, this fixed cost becomes negligible as the number of modules increases.
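The amortization argument can be made concrete with a small calculation. The load time and per-module cost below are hypothetical values chosen only for illustration; they are not measurements from this notebook.

```python
def load_fraction(load_s, per_module_s, n_modules):
    """Share of a worker's total runtime spent on loading the database."""
    return load_s / (load_s + per_module_s * n_modules)

LOAD_S = 30.0        # hypothetical one-off cost of loading a database in memory
PER_MODULE_S = 1.68  # hypothetical cost of processing one module

fractions = {n: load_fraction(LOAD_S, PER_MODULE_S, n) for n in (50, 500, 5106)}
for n, fraction in fractions.items():
    print(f"{n} modules: {fraction:.1%} of the time is spent loading")
```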

In [17]:
%%timeit -n1 -r1 -o -q
regulomes = derive_regulomes(dbs[0:6], modules[0:50], MOTIF_ANNOTATIONS_FNAME)
print(len(regulomes))

Worker for mm9-tss-centered-5kb-7species: database loaded in memory.
Worker for mm9-500bp-upstream-7species: database loaded in memory.
Worker for mm9-500bp-upstream-10species: database loaded in memory.
Worker for mm9-tss-centered-10kb-10species: database loaded in memory.
Worker for mm9-tss-centered-10kb-7species: database loaded in memory.
Worker for mm9-tss-centered-5kb-10species: database loaded in memory.
Worker for mm9-tss-centered-5kb-7species: motif annotations loaded in memory.
Worker for mm9-tss-centered-10kb-10species: motif annotations loaded in memory.
Worker for mm9-tss-centered-10kb-7species: motif annotations loaded in memory.
Worker for mm9-tss-centered-5kb-10species: motif annotations loaded in memory.
Worker for mm9-500bp-upstream-10species: motif annotations loaded in memory.
Worker for mm9-500bp-upstream-7species: motif annotations loaded in memory.
Worker for mm9-500bp-upstream-10species: 4 regulomes created.
Worker for mm9-500bp-upstream-7species: 3 regulomes cr

<TimeitResult : 4min 35s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
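The log above suggests one worker per database, each loading its database once and then processing all modules. The sketch below illustrates that pattern only. It uses a thread pool so that it runs anywhere, whereas the notebook describes a multiprocessing implementation, and `load_database` and `process_module` are hypothetical stand-ins rather than pySCENIC functions.

```python
from concurrent.futures import ThreadPoolExecutor

def load_database(name):
    # Hypothetical stand-in for reading a ranking database into memory.
    return {"name": name}

def process_module(db, module):
    # Hypothetical stand-in for the per-module enrichment step.
    return (db["name"], module)

def worker(db_name, modules):
    db = load_database(db_name)  # loading cost is paid once per worker
    return [process_module(db, module) for module in modules]

def run(db_names, modules):
    # One dedicated worker per database; every worker sees all modules.
    with ThreadPoolExecutor(max_workers=len(db_names)) as pool:
        futures = [pool.submit(worker, name, modules) for name in db_names]
        results = []
        for future in futures:
            results.extend(future.result())
    return results

results = run(["db_a", "db_b"], ["m1", "m2", "m3"])
print(len(results))
```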

#### Dask framework

In [17]:
with ProgressBar():
    with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof, CacheProfiler() as cprof:
        regulomes = derive_regulomes(dbs[0:2], modules[0:50], MOTIF_ANNOTATIONS_FNAME,
                                     client_or_address="local")

[########################################] | 100% Completed |  2min 33.5s


In [18]:
len(regulomes)

18

In [19]:
visualize([prof, rprof, cprof])

#### Dask with custom client

BUG: The workers seem to be dying; see the `Nanny.memory_monitor` errors in the output below.

In [26]:
local_cluster = LocalCluster(n_workers=6, 
                             threads_per_worker=1)

custom_client = Client(local_cluster)

In [27]:
custom_client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:52900, Dashboard: http://127.0.0.1:52901 | Workers: 6, Cores: 6, Memory: 12.88 GB |
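One hypothesis for the dying workers, not verified in this notebook, is memory pressure: the 12.88 GB reported by the client is shared by 6 workers, and each worker may hold a database in memory. The sketch below works out the memory share per worker and how many workers would fit for a given database size. The database size and headroom factor are hypothetical.

```python
TOTAL_MEMORY_GB = 12.88  # from the client summary above
N_WORKERS = 6            # from the client summary above

# Memory share per worker if the total is split evenly.
per_worker_gb = TOTAL_MEMORY_GB / N_WORKERS

def max_workers_that_fit(total_gb, db_size_gb, headroom=0.8):
    """Number of workers that can each hold one database in memory.

    headroom is the hypothetical fraction of total memory that is usable.
    """
    return int((total_gb * headroom) // db_size_gb)

DB_SIZE_GB = 3.0  # hypothetical in-memory size of one ranking database
n_fit = max_workers_that_fit(TOTAL_MEMORY_GB, DB_SIZE_GB)

print(f"{per_worker_gb:.2f} GB per worker; {n_fit} workers fit at {DB_SIZE_GB} GB each")
```

If memory does turn out to be the cause, using fewer workers or a larger per-worker memory limit on the cluster would be the natural things to try.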


In [28]:
regulomes = derive_regulomes(dbs[0:2],
                             modules[0:50],
                             MOTIF_ANNOTATIONS_FNAME,
                             client_or_address=custom_client)

In [29]:
regulomes

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:52913, threads: 1>>
Traceback (most recent call last):
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/psutil/_psosx.py", line 348, in catch_zombie
    yield
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/psutil/_psosx.py", line 387, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/distributed/nanny.py", line 245, in memory_monitor
    memory = psutil.Process(self.process.pid).memory_info().r

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:52996, threads: 1>>
Traceback (most recent call last):
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/psutil/_psosx.py", line 348, in catch_zombie
    yield
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/psutil/_psosx.py", line 387, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/distributed/nanny.py", line 245, in memory_monitor
    memory = psutil.Process(self.process.pid).memory_info().r

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:52909, threads: 1>>
Traceback (most recent call last):
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/psutil/_psosx.py", line 348, in catch_zombie
    yield
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/psutil/_psosx.py", line 387, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/Users/bramvandesande/miniconda3/envs/pyscenic_dev/lib/python3.6/site-packages/distributed/nanny.py", line 245, in memory_monitor
    memory = psutil.Process(self.process.pid).memory_info().r

In [15]:
custom_client.close()
local_cluster.close()