Cachey #502
Conversation
ping @jefsayshi, this might be of interest to you as you think about speeding up the calculation workflow with dask persistent storage and caching.
I've removed the WIP label; this is ready for review. When combined with #510 for dataframe hash naming we get some nice results:

```python
In [1]: import dask.dataframe as dd

In [2]: from dask.diagnostics import Cache

In [3]: c = Cache(5e7)  # 50 MB

In [4]: c.register()  # globally active

In [5]: df = dd.read_csv('accounts.*.csv')

In [6]: %time len(df)  # normal time to read the csv files
CPU times: user 887 ms, sys: 169 ms, total: 1.06 s
Wall time: 940 ms
Out[6]: 3000000

In [7]: %time len(df)  # later runs are mostly free
CPU times: user 1.51 ms, sys: 145 µs, total: 1.65 ms
Wall time: 1.44 ms
Out[7]: 3000000

In [8]: c.cache.data  # we only cached the reductions, not the entire dataset
Out[8]:
{('reduction-aggregation-cf5d8142bdd248cc7c42a746e7108ca7', 0): 3000000,
 ('reduction-chunk-cf5d8142bdd248cc7c42a746e7108ca7', 0): 1000000,
 ('reduction-chunk-cf5d8142bdd248cc7c42a746e7108ca7', 1): 1000000,
 ('reduction-chunk-cf5d8142bdd248cc7c42a746e7108ca7', 2): 1000000}

In [9]: %time df.amount.sum().compute()  # a new reduction, so no speedup here
CPU times: user 951 ms, sys: 136 ms, total: 1.09 s
Wall time: 952 ms
Out[9]: 3575826832

In [10]: %time df.amount.mean().compute()  # but df.amount is small enough to keep
CPU times: user 50.2 ms, sys: 9.05 ms, total: 59.2 ms
Wall time: 26.4 ms
Out[10]: 1191.9422773333333
```

We get to hold on to small, frequently used pieces of the dataset, which makes somewhat new computations (like the mean after the sum) fast.
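For background, the scoring that decides what to keep comes from the cachey library itself. Roughly, it works like this (a sketch against cachey's public `Cache.put`/`get` API; the keys and costs here are made up for illustration):

```python
from cachey import Cache

# cachey scores each entry by its compute cost relative to its size,
# boosted by how recently and frequently it is used; only entries
# whose score justifies their bytes are kept.
c = Cache(1e9)  # budget of roughly 1 GB for cached values

c.put('small-expensive', 123, cost=50)       # tiny result, slow to compute
c.put('big-cheap', b'x' * 10**8, cost=0.01)  # 100 MB result, fast to compute

c.get('small-expensive')  # -> 123; hits also refresh the entry's score
'big-cheap' in c          # may be False: low cost per byte scores poorly
```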
Force-pushed from 2ff2f63 to a816ddb
This is now based off of #569.
I seem to be leaking memory somewhere. Memory usage goes well above the stated limits when using caching.
Hrm, I've ruled out most things. I slightly suspect that this is just my OS not flushing things. Considering merging this and seeing what happens.
You could try adding a call to …
I've tried this.
The last two in particular were strange; together they seem to rule out every isolated part of the code.
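One low-tech way to check whether the Python process itself (rather than the OS page cache) is holding the memory would be to watch resident set size around the cached computations. A sketch using psutil (not part of this PR):

```python
import os
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    """Resident memory of this Python process, in MB."""
    return proc.memory_info().rss / 1e6

before = rss_mb()
# ... run a cached computation here ...
print("resident memory grew by {:.1f} MB".format(rss_mb() - before))
```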
I'd like to merge this and come back to the leak issue. The troublesome parts sit sufficiently at the edges of dask (nothing else depends on them), and yet this also includes some fixes within more core parts that keep getting changed. I also intend to continue working on this for the foreseeable future.
This will allow xray users to take advantage of dask's nascent support for caching intermediate results (dask/dask#502). For example:

```python
In [1]: import xray

In [2]: from dask.diagnostics.cache import Cache

In [3]: c = Cache(5e7)

In [4]: c.register()

In [5]: ds = xray.open_mfdataset('/Users/shoyer/data/era-interim/2t/2014-*.nc', engine='scipy')

In [6]: %time ds.sum().load()
CPU times: user 2.72 s, sys: 2.7 s, total: 5.41 s
Wall time: 3.85 s
Out[6]:
<xray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    t2m      float64 5.338e+10

In [7]: %time ds.mean().load()
CPU times: user 5.31 s, sys: 1.86 s, total: 7.17 s
Wall time: 1.81 s
Out[7]:
<xray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    t2m      float64 279.0

In [8]: %time ds.mean().load()
CPU times: user 7.73 ms, sys: 2.73 ms, total: 10.5 ms
Wall time: 8.45 ms
Out[8]:
<xray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    t2m      float64 279.0
```
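If registering the cache globally is too invasive, the same callback should also work as a context manager, like other dask callbacks. A sketch, reusing the `ds` dataset from the session above:

```python
from dask.diagnostics.cache import Cache

# Activate the cache only for this block instead of calling register();
# dask callbacks generally double as context managers.
with Cache(5e7):
    ds.sum().load()   # populates the cache as tasks complete
    ds.mean().load()  # can reuse any cached intermediates
```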
Adds opportunistic caching via callbacks and cachey
Example

(See the IPython session in the conversation above.)
Some problems

- Repeated names like `('x', 1000)` stop this from being as effective as it might be
- `fuse` operations stop us from seeing some of the useful intermediate computations we could cache
- `cull` happens inside the async scheduler, and we depend on an impure start callback
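For a sense of the mechanism behind the PR: the callback approach amounts to timing every task and offering each result to cachey, which decides whether it is worth the space. A minimal sketch of that recording side (not the PR's actual code; it assumes dask's `Callback` hooks `_pretask`/`_posttask` and cachey's `put`):

```python
from timeit import default_timer

import cachey
from dask.callbacks import Callback

class OpportunisticCache(Callback):
    """Time each task and hand its result to cachey with the elapsed
    time as the cost, so slow-to-compute, small results get retained."""

    def __init__(self, available_bytes):
        self.cache = cachey.Cache(available_bytes)
        self.starts = {}

    def _pretask(self, key, dsk, state):
        # Record when the scheduler starts running this task.
        self.starts[key] = default_timer()

    def _posttask(self, key, result, dsk, state, worker_id):
        # Offer the finished result to cachey, scored by compute time.
        duration = default_timer() - self.starts.pop(key)
        self.cache.put(key, result, cost=duration)
```

A full implementation also has to consult the cache before running the graph and substitute cached values in for their tasks, which is where the interaction with `cull` and the impure start callback in the problem list above comes in; this sketch only shows the recording side.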