
Minor optimizations to Dask Array's store #3153

Merged
jakirkham merged 16 commits into dask:master from jakirkham:minor_store_opts
Feb 20, 2018

Conversation

@jakirkham
Member

@jakirkham jakirkham commented Feb 9, 2018

Performs some optimizations for Delayed targets. Namely, it skips collecting empty dictionaries when no Delayed target is present. It also optimizes the Delayed target portion of the Dask graph, which was not being done before (this is already done as part of the Dask Array optimizations). In cases not using Delayed targets, this should be a no-op.

If return_stored and compute are both True, builds the load_chunk Dask after calling persist on the store_chunk Dask. This comes with the nice benefit of inlining results from persist directly into the load_chunk Dask, simplifying the graph used for the Dask Arrays constructed in this case. It also simplifies the for-loop over sources and targets.

In the case of return_stored=True and compute=False, fuses the store_chunk and load_chunk calls into one load_store_chunk function, per @jcrist's suggestion. This allows the lock (if any) to be held for the duration of both the store and load steps, which should avoid contention between the store and load steps on chunks.
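To make the fusion concrete, here is a minimal illustrative sketch (the names mirror the PR's load_store_chunk but this is not Dask's actual implementation, and plain lists stand in for array-likes): a single lock acquisition covers both the store and the load, so nothing can interleave between the two steps.

```python
import threading

def load_store_chunk(x, out, index, lock):
    """Store ``x`` into ``out[index]`` and read it back under a single lock."""
    with lock:
        out[index] = x      # store step
        return out[index]   # load step, lock still held

lock = threading.Lock()
out = [0] * 6
result = load_store_chunk([7, 8], out, slice(2, 4), lock)
# result is [7, 8]; out is now [0, 0, 7, 8, 0, 0]
```

In the unfused version, the lock would be acquired once for the store and again for the load, allowing another chunk's operation to slip in between.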

Merges Dasks corresponding to chunks of particular types (e.g. Delayed targets, store_chunk calls, load_chunk calls) in a few cases before merging with other Dask types.

Also simplifies store_chunk and load_chunk by running fuse_slice for the provided region before constructing the Dask. This simplifies the logic in these functions and avoids the (however small) overhead of offloading this work to them.
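For intuition, fusing a region with a chunk slice means composing the two slices so that one indexing operation replaces two. A small illustrative sketch (this is not Dask's fuse_slice; it assumes 1-D, non-negative bounds):

```python
def compose_slices(region, index, length):
    """Compose two slices over an axis of the given length,
    so that arr[result] == arr[region][index]."""
    r_start, r_stop, r_step = region.indices(length)
    # Length of the intermediate view arr[region].
    inner_len = max(0, (r_stop - r_start + r_step - 1) // r_step)
    i_start, i_stop, i_step = index.indices(inner_len)
    return slice(r_start + i_start * r_step,
                 r_start + i_stop * r_step,
                 r_step * i_step)

# arr[1000:2000][15:20] is the same as arr[1015:1020]
fused = compose_slices(slice(1000, 2000), slice(15, 20), 10000)
```

Doing this composition once at graph-construction time means the per-chunk functions receive a single ready-made slice instead of a (region, index) pair to fuse on every call.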

Admittedly, similar optimizations to those done here are performed when submitting Dask Arrays for computation. However, as store represents a final step in many ways, it seems reasonable to perform some of these optimizations here before returning to the user.

@jakirkham
Member Author

@mrocklin, I would like to get this into the upcoming release if at all possible.

@jcrist
Member

jcrist commented Feb 9, 2018

My point from #3082 still stands: I believe these optimizations would be better done by generating the graphs directly rather than relying on the behavior of the optimization passes. Reiterating my argument from there:

  • Optimizations introduce overhead that can often be avoided by directly generating the graphs as intended. As people have started generating larger graphs this overhead has become more of a problem, leading to the addition of settings to turn off these optimizations.
  • Because of the above, no user-facing-function currently calls the optimization functions (e.g. cull) individually. We sometimes call __dask_optimize__ when converting between collections, but never use the optimization methods directly, instead generating the graphs as they should be the first time.
  • Optimization passes don't reliably show intent. I can see that you call cull and inline in the code, but it's not immediately apparent why those are needed (I understand why you're doing them here, but wouldn't have without the additional context of this PR). They don't make for the most self-documenting code.

This isn't a firm -1, I would just urge you to think about if there's a cleaner way to get the graph structure you're looking for that doesn't rely on calling the optimizations.
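For context on the passes being debated: a Dask graph is a plain dict mapping keys to tasks, and cull walks it to keep only keys reachable from the requested outputs. A minimal illustrative sketch (this is not dask.optimization.cull itself; the graph and keys are made up):

```python
from operator import add

def cull_sketch(dsk, outputs):
    """Keep only the graph keys reachable from ``outputs``."""
    def deps(task):
        # Dependencies are any graph keys appearing inside the task tuple
        # (position 0 of a task tuple is the function, so it is skipped).
        if isinstance(task, tuple):
            return [t for arg in task[1:] for t in deps(arg)]
        return [task] if task in dsk else []
    keep, stack = set(), list(outputs)
    while stack:
        k = stack.pop()
        if k not in keep:
            keep.add(k)
            stack.extend(deps(dsk[k]))
    return {k: dsk[k] for k in keep}

dsk = {'a': 1, 'b': (add, 'a', 1), 'c': (add, 'b', 1), 'unused': (add, 'a', 5)}
culled = cull_sketch(dsk, ['c'])   # 'unused' is dropped
```

The debate above is whether to run a traversal like this after building the graph, or to construct only the needed keys in the first place.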

@jakirkham
Member Author

Thanks for taking a look, @jcrist.

So, as noted above, the optimizations we have added here would happen anyway if the graph were submitted to the scheduler. Also, store (unlike other operations like add) represents a final point in the Dask graph where computation will occur and optimizations would be expected. Given this, I don't follow what the problem is with doing them ourselves. We already do this with the Dask Arrays anyway.

I would add that for the case of return_stored=True and compute=True, I don't see a clear way to generate this as one graph, as we have to submit part of the graph for computation before constructing the rest. As such, running inline seems reasonable to get rid of a now-unneeded intermediate piece of the graph. If you can think of a way, I would be open to suggestions.

If there are points where the intent is unclear, I'm happy to add comments.

@mrocklin
Member

mrocklin commented Feb 9, 2018

To @jcrist's point, historically having optimizations in algorithmic code has led to difficult-to-identify errors.

Admittedly of the optimizations done here, similar optimizations are performed when submitting Dask Arrays for computation. However as store represents a final step in many ways, it seems reasonable to perform some of these optimizations here before returning to the user.

I'm not very familiar with how store works anymore, so please forgive some questions here. Let's say the user calls store on a dask array with a complex graph behind it, but the user also asks us to give them back an array with the stored results. How deep is the graph that they get? I would expect it to be pretty shallow if they're loading from computed results. I would expect no change to the graph at all if they didn't call compute.

@jakirkham presumably these optimizations are relevant for your use cases. It might help to motivate this work if you were to include a bit about what is behaving poorly for you.

@mrocklin
Member

mrocklin commented Feb 9, 2018

In regards to the release I'd like to go ahead. I've been stalling for a few days as things have gotten cleaned up. I'd strongly prefer not to stall any longer. I would like to start this process and have most of it wrapped up (along with some other work that depends on it) before end of day today.

@jakirkham
Member Author

To @jcrist's point, historically having optimizations in algorithmic code has led to difficult-to-identify errors.

Of course. Avoiding running optimizations is an admirable goal.

That said, IMHO this is not algorithmic code (e.g. tensordot), and the optimizations are in line with what the scheduler already does IIUC (e.g. running cull on the graph to drop unreachable parts and running inline on Futures). Further, store is a reasonable point for these optimizations to occur, and there is no obvious way (to me) to write this to have the same effect without optimizations.

There were some other concerns that @jcrist reasonably raised in the previous PR regarding locking, which have been dropped from this one in the hopes it would be more appealing/less controversial.

How deep is the graph that they get? I would expect it to be pretty shallow if they're loading from computed results. I would expect no change to the graph at all if they didn't call compute.

As of PR ( #2980 ), the user will always have Dask Array optimizations applied to the sources in store. So even if they didn't call compute, they get these optimizations. In the end, we agreed this was reasonable even if compute=False (e.g. a Delayed object was returned), as we would be losing precious optimizations on Dask Arrays otherwise.

If they did call with compute=True and return_stored=True, then the store_chunks part of the graph is submitted to the scheduler via persist, returning keys with results or Futures as appropriate. So this is pretty shallow; my hope here was to make it even more shallow, namely to inline the Futures and cull the unneeded keys related to the store_chunks part.
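The inlining described here can be sketched with plain dicts (graph keys and task shapes are made up for illustration; this is not Dask's internals): after the store part is persisted, its computed values are spliced directly into the load-step graph, so the returned collection carries a shallow graph.

```python
def inline_results(load_dsk, computed):
    """Replace any reference to a persisted key with its computed value."""
    def subst(task):
        if isinstance(task, tuple) and task in computed:
            return computed[task]                 # inline the persisted result
        if isinstance(task, tuple):
            return tuple(subst(t) for t in task)  # recurse into task arguments
        return task
    return {k: subst(v) for k, v in load_dsk.items()}

load_dsk = {('load', 0): (list, ('store', 0))}
computed = {('store', 0): (1, 2, 3)}   # stand-in for a persisted chunk result
shallow = inline_results(load_dsk, computed)
# shallow no longer references the ('store', 0) key at all
```

After this substitution, culling would then drop the now-unreferenced store keys.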

@jakirkham
Member Author

In regards to the release I'd like to go ahead.

The reason I asked to try to get this into this release is that it was something I had raised as blocking for me at the last meeting. Unfortunately, urgent personal matters came up in between then and now, and I haven't had time to look at fixing the old PR ( #3082 ) until today. So I don't want to hold up the release if it is pressing to get it out (especially given I fumbled the ball here). If it is not pressing though, I would like a chance to address this issue. Please let me know either way.

@mrocklin
Member

mrocklin commented Feb 9, 2018

If they did call with compute=True and return_stored=True, then the store_chunks part of the graph is submitted to the scheduler via persist returning keys with results or Futures as appropriate. So this is pretty shallow, my hope here was to make it even more shallow. Namely inline the Futures and cull the unneeded keys related to the store_chunks part.

If we're using the result of persist then I would expect cull to already be called on the graph.

Regarding making the graph even more shallow, are the extra tasks causing significant performance issues? Are there advantages to fusing now rather than letting the downstream optimizations handle it? Alternatively, is it easy to generate the graphs you want by hand?

@mrocklin
Member

mrocklin commented Feb 9, 2018

If we can knock things out in twenty minutes or so I'm happy to wait. I doubt that this will happen though (changes get made, tests pass, more changes follow, etc.), so I'd rather move on with the release if that's alright.

@mrocklin
Member

mrocklin commented Feb 9, 2018

If you're able to provide a couple of small examples that might help me understand the value of these changes more clearly.

@jakirkham
Member Author

If we can knock things out in twenty minutes or so I'm happy to wait.

If it is that urgent, I guess go ahead.

@mrocklin
Member

mrocklin commented Feb 9, 2018

If it is that urgent, I guess go ahead.

It is somewhat urgent. Current bokeh builds are broken against dask. Other projects have been asking for a release.

More generally though, we've been trying to get a release out for a couple weeks and have constantly had people ask for small PRs to be included. At the meeting last week we decided on today as being a strong cutoff. I think that we should stick to it.

@jakirkham
Member Author

More generally though, we've been trying to get a release out for a couple weeks and have constantly had people ask for small PRs to be included. At the meeting last week we decided on today as being a strong cutoff. I think that we should stick to it.

It looks like Dask used milestones to track tasks for releases in the past. I'm not sure why that stopped, but maybe it is time to bring them back. This would help make it clearer which things are release-blocking and when the deadlines are. It doesn't mean there won't be disagreements about what should go in and when things should be done, but it would help make communication about each release clearer and easier to track. Thoughts?

@mrocklin
Member

mrocklin commented Feb 9, 2018

I'm not sure that milestones were ever seriously used. I think that one or two devs tried to make them a consistent practice, but it never really took hold. I'm open to the idea, but there are also procedural costs to consider (more paperwork to handle on every change). It would take someone championing the idea and putting in a fair amount of work to hold people to it to get the process started.

@jakirkham jakirkham force-pushed the minor_store_opts branch 11 times, most recently from 0fac252 to e143886 on February 11, 2018 at 00:42
@jakirkham
Member Author

Took a bit of work, but I managed to eliminate cull and inline while still getting the intended outcome. This also simplifies the code in store a bit compared to the original attempt at return_stored in PR ( #2980 ), which should make it a bit easier to follow as well. Obviously it's the weekend, so it would be good to hear some thoughts on this next week.

@jakirkham jakirkham force-pushed the minor_store_opts branch 2 times, most recently from 36af18a to ba91a70 on February 11, 2018 at 02:16
@jakirkham
Member Author

Thanks for the advice @mrocklin.

Have added a test using get_deps for case 1 and that works nicely.

Took a different tack for 2. Instead, I counted the number of times the lock was acquired at the end of the computation. As we know the lock should be acquired only once per chunk when compute=False and twice per chunk when compute=True, this is an easy thing to assert. I think this test should fail reliably if the behavior is changed, and it gets at the same underlying issue.

Please let me know what you think.
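A lock that counts its acquisitions, along the lines described above, can be sketched like this (names are illustrative; the actual test lives in dask's test suite):

```python
import threading

class CountingLock:
    """A Lock wrapper that records how many times it was acquired,
    so a test can assert once-per-chunk vs. twice-per-chunk behavior."""
    def __init__(self):
        self._lock = threading.Lock()
        self.acquire_count = 0

    def acquire(self, *args, **kwargs):
        self.acquire_count += 1
        return self._lock.acquire(*args, **kwargs)

    def release(self):
        return self._lock.release()

    __enter__ = acquire

    def __exit__(self, *exc):
        self.release()

lock = CountingLock()
for _ in range(4):   # e.g. a store step and a load step for each of two chunks
    with lock:
        pass
```

Passing such a lock as store's lock argument lets the test assert acquire_count == nchunks in the fused case and 2 * nchunks in the unfused one.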

@jakirkham
Member Author

Planning on merging tomorrow if no comments.

@jcrist jcrist left a comment
Member

Thanks @jakirkham, overall this looks good to me. I left a few nit-picky comments on style that you can feel free to ignore, and a few small things I'd like changed. Otherwise this looks good to merge. Thanks for being patient here.

store_keys = []
store_dsk = []
for tgt, src, reg in zip(targets, sources, regions):
tgt = getattr(tgt, "key", tgt)
Member

Can you explain what this does? It's not clear to me what case works with a target with a key attribute. If this is for supporting Delayed targets, I'd prefer an explicit instance check rather than an attribute check as:

  • It's more apparent what the intent is
  • Won't fail for targets that just happen to have a key attribute.

Member Author

Right, as you guessed, this is dealing with Delayed targets. The original code actually used try...except AttributeError:.... It was originally introduced in PR ( #2181 ), so this seemed like a small improvement. Though I agree we should be verifying things actually are Delayed. That said, we might want this check a bit earlier, as there is a call to __dask_optimize__ for these objects.

@jcrist jcrist Feb 20, 2018
Member

Makes sense to me. Perhaps something like this earlier:

targets2 = []
for t in targets:
    if isinstance(t, Delayed):
        targets2.append(t.key)
    elif is_dask_collection(t):
        raise TypeError("Targets must be either Delayed objects or array-likes")
    else:
        targets2.append(t)

Then use targets2 unconditionally in this loop.

Member

Should also probably update the docstring for store to add that targets can be instances of Delayed.

targets_dsk = sharedict.merge(*targets_dsk)
targets_dsk = Delayed.__dask_optimize__(targets_dsk, targets_keys)

load_stored = (return_stored and not compute)
Member

nit: these parentheses are unnecessary

Member Author

I know they aren't necessary, but it feels more readable to me. If there are strong objections, I can change it though.

slices = [fuse_slice(region, slc) for slc in slices]

name = 'store-%s' % arr.name
sto_chunk = store_chunk
Member

In cases like this in the past we've just used a chunk or func variable name. sto_chunk looks like a typo to me. No strong feelings though.

for t, slc in zip(core.flatten(arr.__dask_keys__()), slices):
store_key = (name,) + t[1:]
dsk[store_key] = (store_chunk, t, out, slc, lock, region, return_stored)
dsk[store_key] = (sto_chunk, t, out, slc, lock, return_stored) + args
Member

nit: I'd write this as a dictionary comprehension, but that's just my opinion. No strong feelings.

Member Author

Felt like I had made a lot of changes in the code and was trying to keep it easy to review, so I didn't rewrite this, but sure, we can give it a try.

load_key = ('load-%s' % k[0],) + k[1:]
# Reuse the result and arguments from `store_chunk` in `load_chunk`.
load_dsk[load_key] = (load_chunk, each_key,) + dsk[each_key][3:-1]
load_dsk[load_key] = (load_chunk, dsk_post[k]) + dsk_pre[k][3:-1]
Member

nit: I'd write this as a dictionary comprehension, but that's just me. No strong opinions.

load_dsk = {('load-%s' % k[0],) + k[1:]: (load_chunk, dsk_post[k]) + dsk_pre[k][3:-1]
            for k in keys}

Member Author

Same story as above. I was trying to keep this easy to review, but sure, it can be rewritten.

if i == 9:
assert False

# Verify number of lock calls
Member

Can you expand this comment slightly to say why it should be 2 * nchunks in one case and nchunks in the other? I know why right now, but future maintainers may not.

Member Author

Sure. Though I'd prefer to put that comment near the branch below if that is ok.

@jcrist jcrist Feb 20, 2018
Member

Besides this comment, all remaining things I pointed out were nits that can be ignored (up to you).

@jakirkham jakirkham Feb 20, 2018
Member Author

Sure. I was working my way through them; just now got to this one.

@jakirkham jakirkham force-pushed the minor_store_opts branch 2 times, most recently from ce32dd6 to 897ff7b on February 20, 2018 at 15:57
There was an extra comma at the end of a `tuple` in `retrieve_from_ooc`.
This drops that extraneous comma to improve readability.

Make sure that the Dask keys that are passed to `Array`'s
`__dask_optimize__` method are flattened. While the `__dask_optimize__`
method of `Array` does flatten the keys, it is best not to rely on that
in `store` and just ensure they are properly formatted.
Instead of leaving the user-provided `region` and chunk slice (i.e.
`index`) to be fused in the `store_chunk` and `load_chunk` functions, go
ahead and fuse them when constructing the Dask graph and leave `region`
as `None`. This should cut down the work done in `store_chunk` and
`load_chunk`. It should also allow the `region` argument in both of
these functions to be subsequently dropped.

Since `fuse_slice` is now run as part of Dask construction and this
removes the need to pass the `region` on for handling in the store/load
chunk functions, simply drop the `region` arguments and related logic
from `store_chunk` and `load_chunk`. This makes them a bit easier to
follow and maintain going forward.
Use the standard way of optimizing Dask Delayed objects to optimize the
targets in advance if needed. Thus if `cull`ing or any other
optimization strategy becomes commonplace for Delayed objects, it will
be handled through the standard mechanisms instead of us doing it in
some custom way.

As an added bonus, this avoids collecting empty dictionaries in the
event a target does not have a Dask (e.g. is not Delayed). Previously
`store` had collected empty dictionaries in this case. So this should be
an improvement when constructing and working with the `SharedDict`
object for `targets` in the common case (i.e. no Delayed targets).

Also provides a check to guarantee we are actually getting Dask Delayed
objects and are not supporting any other Dask collections. If we don't
get a Dask Delayed object, we assume it is some Array-like object.

For some time, `store` has supported `Delayed` `targets`. The intent was
to support creating a storage location only once storage occurs and
possibly handle other things. However this wasn't noted in the
docstring. So we make that a little clearer by noting `targets` can be
`Delayed`.
Before merging store Dasks with source and target Dasks, go ahead and
merge the store Dasks together alone. This simplifies the merging
process a bit and avoids directly relying on a dependency. It also makes
sure that there is a `SharedDict` for each group of related dictionaries
that gets added to the final store Dask, namely the `sources`,
`targets`, and now the `store_chunk` steps.

As we are only really interested in the one Dask graph that represents
all `store_chunk` operations, renamed `store_dsks_mrg` and `store_dsks`
to `store_dsk` to make that clearer.

Provides an option to give `retrieve_from_ooc` a computed Dask as well
to fill in values in the `load_chunk`s part of the Dask. This can be
useful when inlining results from the `persist`ed `store_chunk`s step.
This argument is totally optional and will be used to inline computed
results or `Future`s as the case may be. If this argument is not
provided, then the keys are simply inserted into the new Dask as was the
case before.
Renames `store_chunk` to `load_store_chunk`. The new `load_store_chunk`
function differs slightly from `store_chunk`'s previous behavior in that
it will skip storing results if `x` is `None`. Also, `load_store_chunk`
takes the new `load_stored` option, which allows for reading the results
back in from `out` after storing them. In other words,
`load_store_chunk` provides the behavior of `store_chunk`, `load_chunk`,
and a fused version of `store_chunk` and `load_chunk`. By keeping this
functionality all within one function using different options, it should
avoid the divergence that could occur with having different functions.

Reimplements `store_chunk` and `load_chunk` to be light wrappers around
`load_store_chunk` with only a few needed arguments. The rest are
handled in some default manner. By doing this, the original behavior of
`store_chunk` and `load_chunk` is restored while simply using
`load_store_chunk` under the hood. The `store_chunk` and `load_chunk`
functions are kept to preserve readability in terms of the code, Dasks
generated, and scheduler progress.

Makes some adjustments to `insert_to_ooc` to handle either using
`load_store_chunk` when `return_stored` and `load_stored` are both
`True` or simply using `store_chunk` for all other cases. This results
in some tweaks to the key name used and arguments provided. The purpose
of these adjustments is to ensure `load_store_chunk` shows up when
loading and storing will happen in the same step. Otherwise
`insert_to_ooc` only adds `store_chunk`, which may or may not be
followed by a separate `load_chunk` step depending on whether
`return_stored` is `True` or `False`.

Refreshes `store` so as to effectively inline all uses of `store_chunk`s
when `return_stored=True`. Thus users only see `load_store_*` nodes
whether `compute` is `True` or `False`. This avoids some lock contention
between storing and loading steps when `compute` is `False`. When
`compute` is `True`, this effectively inlines `Future`s or results that
the `store_chunk` step would have returned, making the most compact Dask
possible when `return_stored=True`.
It appears that `sto_chunk` was more confusing than helpful in the
review, so rename it to `func` to make it clear we intend this to be a
different function and not a typo of `store_chunk`.

In the `insert_to_ooc` and `retrieve_from_ooc` functions, make use of
dictionary comprehensions to simplify the code a bit. These also tend to
be a bit faster than Python `for`-loops.

When `return_stored` and `compute` are both set to `True`, ensure that
`store` returns Dask Arrays that contain depth-1 Dasks only. This helps
ensure that the Dask Arrays generated in this case don't have any
lingering connections to the Dask used for storing.

Verifies how many times a `Lock` instance is acquired in `store` when
using `return_stored`. Namely, if `compute=True`, the lock will be
acquired twice for each chunk, as storing and loading are separate steps
to allow storing of chunks to occur asynchronously. However, if
`compute=False`, the lock will be acquired only once for each chunk, as
store and load are fused into a single step that acquires the lock for
both parts. This acts as a check to ensure that the `compute=False` case
of `return_stored` does in fact fuse storing and loading into a single
step that holds the lock for both parts.
@jakirkham
Member Author

Thanks for the comments @jcrist. I think I have addressed the bulk of your concerns. Could use another look if you have time.

@jcrist
Member

jcrist commented Feb 20, 2018

LGTM. Feel free to merge when tests pass.

@jakirkham jakirkham merged commit fadc962 into dask:master Feb 20, 2018
@jakirkham jakirkham deleted the minor_store_opts branch on February 20, 2018 at 16:59
@jakirkham
Member Author

Thanks everyone. :)
