
Cachey #502

Merged: 14 commits into dask:master on Aug 12, 2015

Conversation

mrocklin (Member)

Adds opportunistic caching via callbacks and cachey

Example

In [1]: import dask.dataframe as dd

In [2]: from cachey import Cache

In [3]: from dask.diagnostics.cache import cache

In [4]: df = dd.read_csv('data/accounts.*.csv')

In [5]: c = Cache(1000000000)

In [6]: result = df.amount.sum()

In [7]: with cache(c):
   ...:     print result.compute()
   ...:
3575826832

In [8]: with cache(c):
   ...:     print result.compute()
   ...:
3575826832

In [9]: c.data
Out[9]: 
{('reduction-aggregation-6', 0): 3575826832,
 ('reduction-chunk-5', 0): 1187712489,
 ('reduction-chunk-5', 1): 1192801040,
 ('reduction-chunk-5', 2): 1195313303}

Some problems

  1. Badly named keys like ('x', 1000) keep this from being as effective as it could be (see the sketch after this list).
  2. Fused operations hide some of the useful intermediate computations that we could otherwise cache.
  3. I had to add cull into the async scheduler and depend on an impure start callback.
  4. Cachey is still quite immature.
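As a rough illustration of the first point (a hypothetical sketch, not dask internals): a cached result can only be reused if a later graph asks for exactly the same key, so positional names like ('x', 1000) that change from graph to graph defeat the cache, while deterministic hash-based names (as in #510) produce hits.

from cachey import Cache

c = Cache(1e9)                     # 1 GB of available space
c.put(('x', 1000), 42, cost=1.0)   # first run stores under an arbitrary positional name

('x', 1001) in c.data   # False: a renamed-but-identical task never hits the cache
('x', 1000) in c.data   # True (assuming cachey kept it): only an exact key match is reusable

# With hash-based names the key is derived from the computation itself,
# so a rebuilt graph regenerates the same key and finds the cached value.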

@hussainsultan (Contributor)

ping @jefsayshi, this might be of interest to you as you think about speeding up the calculation workflow with dask persistent storage and caching.

@mrocklin (Member, Author) commented Aug 3, 2015

I've removed the WIP label. This is ready for review.

When combined with #510 for dataframe hash naming, we get some nice results:

In [1]: import dask.dataframe as dd
In [2]: from dask.diagnostics import Cache
In [3]: c = Cache(5e7)  # 50 MB
In [4]: c.register()    # globally active

In [5]: df = dd.read_csv('accounts.*.csv')
In [6]: %time len(df)   # normal time to read the csv files
CPU times: user 887 ms, sys: 169 ms, total: 1.06 s
Wall time: 940 ms
Out[6]: 3000000

In [7]: %time len(df)  # other times mostly free
CPU times: user 1.51 ms, sys: 145 µs, total: 1.65 ms
Wall time: 1.44 ms
Out[7]: 3000000

In [8]: c.cache.data  # we only cached the reductions, not the entire dataset
Out[8]: 
{('reduction-aggregation-cf5d8142bdd248cc7c42a746e7108ca7', 0): 3000000,
 ('reduction-chunk-cf5d8142bdd248cc7c42a746e7108ca7', 0): 1000000,
 ('reduction-chunk-cf5d8142bdd248cc7c42a746e7108ca7', 1): 1000000,
 ('reduction-chunk-cf5d8142bdd248cc7c42a746e7108ca7', 2): 1000000}

In [9]: %time df.amount.sum().compute()  # so we don't get speedups
CPU times: user 951 ms, sys: 136 ms, total: 1.09 s
Wall time: 952 ms
Out[9]: 3575826832

In [10]: %time df.amount.mean().compute()  # but df.amount is small enough to keep
CPU times: user 50.2 ms, sys: 9.05 ms, total: 59.2 ms
Wall time: 26.4 ms
Out[10]: 1191.9422773333333

We get to hold on to small, frequently used pieces of the dataset, which makes somewhat new computations (e.g. df.amount.sum() followed by df.amount.mean()) very fast.

@mrocklin mrocklin closed this Aug 3, 2015
@mrocklin mrocklin reopened this Aug 3, 2015
@mrocklin mrocklin force-pushed the cachey branch 3 times, most recently from 2ff2f63 to a816ddb on August 10, 2015 at 21:31
@mrocklin (Member, Author)

This is now backed off of #569

@mrocklin (Member, Author)

I seem to be leaking memory somewhere. Memory usage goes well above the stated limits when using caching.

@mrocklin (Member, Author)

Hrm, I've ruled out most things. I slightly suspect that this is just my OS not flushing things. I'm considering merging this and seeing what happens.

@jcrist (Member) commented Aug 11, 2015

You could try adding a call to gc.collect() somewhere and see if that clears things up. Not as a permanent fix, just to diagnose whether this is a real bug or just memory not being cleaned up immediately.
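For instance (a rough sketch, reusing the registered cache and dataframe from the example above):

import gc

c.register()
len(df)         # run the cached computation once
del df          # drop any obvious references
gc.collect()    # force a full collection
# if process memory falls back toward the cache limit afterwards,
# the growth was just delayed cleanup rather than a genuine leak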

@mrocklin (Member, Author)

I've tried this. gc.collect() doesn't have much effect.

@mrocklin (Member, Author)

  • Clearing out the cache explicitly doesn't have much of an effect.
  • Calling gc.collect() doesn't do much.
  • This only occurs when things are actually put into the cache (e.g. a cache with a small size doesn't produce this problem).
  • It doesn't occur when using cachey manually (e.g. cachey.Cache(...).put(key, value, cost)).
  • It doesn't occur if we use dask but comment out the part that puts things into cachey.

The last two in particular are strange: together they seem to rule out every isolated part of the code.
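For reference, a minimal sketch of the manual-use check (the fourth bullet above), with a rough peak-memory reading via the standard resource module; the array sizes and the peak_rss_mb helper are illustrative only:

import resource
import numpy as np
from cachey import Cache

def peak_rss_mb():
    # ru_maxrss is reported in KB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

c = Cache(5e7)                     # 50 MB cache, as in the example above
for i in range(100):
    x = np.random.random(1000000)  # roughly 8 MB per array
    c.put(('x', i), x, cost=1.0)   # cachey decides what to keep and evicts to stay near its budget

print(peak_rss_mb())               # expected to stay near the cache budget when cachey is used directly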

@mrocklin (Member, Author)

I'd like to merge this and come back to the leak issue. The troublesome parts sit at the edges of dask (nothing else depends on them), while the PR also includes some fixes to more core parts that keep changing underneath it. I also intend to keep working on this for the foreseeable future.

mrocklin added a commit that referenced this pull request Aug 12, 2015
@mrocklin mrocklin merged commit 3a330c6 into dask:master Aug 12, 2015
@mrocklin mrocklin deleted the cachey branch August 12, 2015 15:41
shoyer added a commit to shoyer/xarray that referenced this pull request Aug 31, 2015
This will allow xray users to take advantage of dask's nascent support for
caching intermediate results (dask/dask#502).

For example:

	In [1]: import xray

	In [2]: from dask.diagnostics.cache import Cache

	In [3]: c = Cache(5e7)

	In [4]: c.register()

	In [5]: ds = xray.open_mfdataset('/Users/shoyer/data/era-interim/2t/2014-*.nc', engine='scipy')

	In [6]: %time ds.sum().load()
	CPU times: user 2.72 s, sys: 2.7 s, total: 5.41 s
	Wall time: 3.85 s
	Out[6]:
	<xray.Dataset>
	Dimensions:  ()
	Coordinates:
	    *empty*
	Data variables:
	    t2m      float64 5.338e+10

	In [7]: %time ds.mean().load()
	CPU times: user 5.31 s, sys: 1.86 s, total: 7.17 s
	Wall time: 1.81 s
	Out[7]:
	<xray.Dataset>
	Dimensions:  ()
	Coordinates:
	    *empty*
	Data variables:
	    t2m      float64 279.0

	In [8]: %time ds.mean().load()
	CPU times: user 7.73 ms, sys: 2.73 ms, total: 10.5 ms
	Wall time: 8.45 ms
	Out[8]:
	<xray.Dataset>
	Dimensions:  ()
	Coordinates:
	    *empty*
	Data variables:
	    t2m      float64 279.0
shoyer added a commit to shoyer/xarray that referenced this pull request Sep 5, 2015