
Minor optimizations to Dask Array's store #3153

Merged
jakirkham merged 16 commits into dask:master from jakirkham:minor_store_opts
Feb 20, 2018

Conversation

@jakirkham
Member

@jakirkham jakirkham commented Feb 9, 2018

Performs some optimizations for Delayed targets. Namely, it skips collecting empty dictionaries when no Delayed target is present. It also optimizes the Delayed target portion of the Dask graph, which was not being done before (this is already done as part of the Dask Array optimizations). In cases not using Delayed targets, this should be a no-op.

If return_stored and compute are both True, builds the load_chunk Dask after calling persist on the store_chunk Dask. This comes with the nice benefit of inlining results from persist directly into the load_chunk Dask, simplifying the graph used for the Dask Arrays constructed in this case. It also simplifies the for-loop over sources and targets.

In the case of return_stored=True and compute=False, fuses the store_chunk and load_chunk calls into one load_store_chunk function, per @jcrist's suggestion. This allows the lock (if any) to be held for the duration of both the store and load steps, which should avoid contention between the store and load steps on chunks.
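To make the fusion concrete, here is a minimal illustrative sketch (the names mirror the PR's load_store_chunk but this is not Dask's actual implementation, and plain lists stand in for array-likes): a single lock acquisition covers both the store and the load, so nothing can interleave between the two steps.

```python
import threading

def load_store_chunk(x, out, index, lock):
    """Store ``x`` into ``out[index]`` and read it back under a single lock."""
    with lock:
        out[index] = x      # store step
        return out[index]   # load step, lock still held

lock = threading.Lock()
out = [0] * 6
result = load_store_chunk([7, 8], out, slice(2, 4), lock)
# result is [7, 8]; out is now [0, 0, 7, 8, 0, 0]
```

In the unfused version, the lock would be acquired once for the store and again for the load, allowing another chunk's operation to slip in between.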

Merges Dasks corresponding to chunks of particular types (e.g. Delayed targets, store_chunk calls, load_chunk calls) in a few cases before merging with other Dask types.

Also simplifies store_chunk and load_chunk by running fuse_slice for the provided region before constructing the Dask. This simplifies the logic in these functions and avoids the (however small) overhead of offloading this work to them.
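For intuition, fusing a region with a chunk slice means composing the two slices so that one indexing operation replaces two. A small illustrative sketch (this is not Dask's fuse_slice; it assumes 1-D, non-negative bounds):

```python
def compose_slices(region, index, length):
    """Compose two slices over an axis of the given length,
    so that arr[result] == arr[region][index]."""
    r_start, r_stop, r_step = region.indices(length)
    # Length of the intermediate view arr[region].
    inner_len = max(0, (r_stop - r_start + r_step - 1) // r_step)
    i_start, i_stop, i_step = index.indices(inner_len)
    return slice(r_start + i_start * r_step,
                 r_start + i_stop * r_step,
                 r_step * i_step)

# arr[1000:2000][15:20] is the same as arr[1015:1020]
fused = compose_slices(slice(1000, 2000), slice(15, 20), 10000)
```

Doing this composition once at graph-construction time means the per-chunk functions receive a single ready-made slice instead of a (region, index) pair to fuse on every call.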

Admittedly, similar optimizations to those done here are performed when submitting Dask Arrays for computation. However, as store represents a final step in many ways, it seems reasonable to perform some of these optimizations here before returning to the user.

@jakirkham
Member Author

@mrocklin, I would like to get this into the upcoming release if at all possible.

@jcrist
Member

jcrist commented Feb 9, 2018

My point from #3082 still stands: I believe these optimizations would be better done by generating the graphs directly rather than relying on the behavior of the optimization passes. Reiterating my argument from there:

  • Optimizations introduce overhead that can often be avoided by directly generating the graphs as intended. As people have started generating larger graphs this overhead has become more of a problem, leading to the addition of settings to turn off these optimizations.
  • Because of the above, no user-facing-function currently calls the optimization functions (e.g. cull) individually. We sometimes call __dask_optimize__ when converting between collections, but never use the optimization methods directly, instead generating the graphs as they should be the first time.
  • Optimization passes don't reliably show intent. I can see that you call cull and inline in the code, but it's not immediately apparent why those are needed (I understand why you're doing them here, but wouldn't have without the additional context of this PR). They don't make for the most self-documenting code.

This isn't a firm -1, I would just urge you to think about if there's a cleaner way to get the graph structure you're looking for that doesn't rely on calling the optimizations.
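For context on the passes being debated: a Dask graph is a plain dict mapping keys to tasks, and cull walks it to keep only keys reachable from the requested outputs. A minimal illustrative sketch (this is not dask.optimization.cull itself; the graph and keys are made up):

```python
from operator import add

def cull_sketch(dsk, outputs):
    """Keep only the graph keys reachable from ``outputs``."""
    def deps(task):
        # Dependencies are any graph keys appearing inside the task tuple
        # (position 0 of a task tuple is the function, so it is skipped).
        if isinstance(task, tuple):
            return [t for arg in task[1:] for t in deps(arg)]
        return [task] if task in dsk else []
    keep, stack = set(), list(outputs)
    while stack:
        k = stack.pop()
        if k not in keep:
            keep.add(k)
            stack.extend(deps(dsk[k]))
    return {k: dsk[k] for k in keep}

dsk = {'a': 1, 'b': (add, 'a', 1), 'c': (add, 'b', 1), 'unused': (add, 'a', 5)}
culled = cull_sketch(dsk, ['c'])   # 'unused' is dropped
```

The debate above is whether to run a traversal like this after building the graph, or to construct only the needed keys in the first place.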

@jakirkham
Member Author

Thanks for taking a look, @jcrist.

So, as noted above, the optimizations we have added here would happen anyway if the graph were submitted to the scheduler. Also, store (unlike other operations like add) represents a final point in the Dask graph where computation will occur and optimizations would be expected. Given this, I don't follow what the problem is with doing them ourselves. We already do this with the Dask Arrays anyway.

I would add that for the case of return_stored=True and compute=True, I don't see a clear way to generate this as one graph, as we have to submit part of the graph for computation before constructing the rest. As such, running inline seems reasonable to get rid of a now-unneeded intermediate piece of the graph. If you can think of a way, I would be open to suggestions.

If there are points where the intent is unclear, I'm happy to add comments.

@mrocklin
Member

mrocklin commented Feb 9, 2018

To @jcrist's point, historically having optimizations in algorithmic code has led to difficult-to-identify errors.

Admittedly of the optimizations done here, similar optimizations are performed when submitting Dask Arrays for computation. However as store represents a final step in many ways, it seems reasonable to perform some of these optimizations here before returning to the user.

I'm not very familiar with how store works anymore, so please forgive some questions here. Let's say the user calls store on a dask array with a complex graph behind it, but the user also asks us to give them back an array with the stored results. How deep is the graph that they get? I would expect it to be pretty shallow if they're loading from computed results. I would expect no change to the graph at all if they didn't call compute.

@jakirkham presumably these optimizations are relevant for your use cases. It might help to motivate this work if you were to include a bit about what is behaving poorly for you.

@mrocklin
Member

mrocklin commented Feb 9, 2018

In regards to the release I'd like to go ahead. I've been stalling for a few days as things have gotten cleaned up. I'd strongly prefer not to stall any longer. I would like to start this process and have most of it wrapped up (along with some other work that depends on it) before end of day today.

@jakirkham
Member Author

To @jcrist's point, historically having optimizations in algorithmic code has led to difficult-to-identify errors.

Of course. Avoiding running optimizations is an admirable goal.

That said, IMHO this is not algorithmic code (e.g. tensordot), and the optimizations are in line with what the scheduler already does IIUC (e.g. running cull on the graph to drop unreachable parts and running inline on Futures). Further, store is a reasonable point for these optimizations to occur, and there is no obvious way (to me) to write this to have the same effect without optimizations.

There were some other concerns that @jcrist reasonably raised in the previous PR regarding locking, which have been dropped from this one in the hopes it would be more appealing/less controversial.

How deep is the graph that they get? I would expect it to be pretty shallow if they're loading from computed results. I would expect no change to the graph at all if they didn't call compute.

As of PR ( #2980 ), the user will always have Dask Array optimizations applied to the sources in store. So even if they didn't call compute, they get these optimizations. In the end, we agreed this was reasonable even if compute=False (e.g. a Delayed object was returned), as we would be losing precious optimizations on Dask Arrays otherwise.

If they did call with compute=True and return_stored=True, then the store_chunks part of the graph is submitted to the scheduler via persist, returning keys with results or Futures as appropriate. So this is pretty shallow; my hope here was to make it even more shallow, namely to inline the Futures and cull the unneeded keys related to the store_chunks part.
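The inlining described here can be sketched with plain dicts (graph keys and task shapes are made up for illustration; this is not Dask's internals): after the store part is persisted, its computed values are spliced directly into the load-step graph, so the returned collection carries a shallow graph.

```python
def inline_results(load_dsk, computed):
    """Replace any reference to a persisted key with its computed value."""
    def subst(task):
        if isinstance(task, tuple) and task in computed:
            return computed[task]                 # inline the persisted result
        if isinstance(task, tuple):
            return tuple(subst(t) for t in task)  # recurse into task arguments
        return task
    return {k: subst(v) for k, v in load_dsk.items()}

load_dsk = {('load', 0): (list, ('store', 0))}
computed = {('store', 0): (1, 2, 3)}   # stand-in for a persisted chunk result
shallow = inline_results(load_dsk, computed)
# shallow no longer references the ('store', 0) key at all
```

After this substitution, culling would then drop the now-unreferenced store keys.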

@jakirkham
Member Author

In regards to the release I'd like to go ahead.

The reason I asked to try to get this into this release is that it was something I had raised as blocking for me at the last meeting. Unfortunately, urgent personal matters came up in between then and now, and I haven't had time to look at fixing the old PR ( #3082 ) until today. So I don't want to hold up the release if it is pressing to get it out (especially given I fumbled the ball here). If it is not pressing though, I would like a chance to address this issue. Please let me know either way.

@mrocklin
Member

mrocklin commented Feb 9, 2018

If they did call with compute=True and return_stored=True, then the store_chunks part of the graph is submitted to the scheduler via persist returning keys with results or Futures as appropriate. So this is pretty shallow, my hope here was to make it even more shallow. Namely inline the Futures and cull the unneeded keys related to the store_chunks part.

If we're using the result of persist then I would expect cull to already be called on the graph.

Regarding making the graph even more shallow, are the extra tasks causing significant performance issues? Are there advantages to fusing now rather than letting the downstream optimizations handle it? Alternatively, is it easy to generate the graphs you want by hand?

@mrocklin
Member

mrocklin commented Feb 9, 2018

If we can knock things out in twenty minutes or so I'm happy to wait. I doubt that this will happen though (changes get made, tests pass, more changes follow, etc.), so I'd rather move on with the release if that's alright.

@mrocklin
Member

mrocklin commented Feb 9, 2018

If you're able to provide a couple of small examples that might help me understand the value of these changes more clearly.

@jakirkham
Member Author

If we can knock things out in twenty minutes or so I'm happy to wait.

If it is that urgent, I guess go ahead.

@mrocklin
Member

mrocklin commented Feb 9, 2018

If it is that urgent, I guess go ahead.

It is somewhat urgent. Current bokeh builds are broken against dask. Other projects have been asking for a release.

More generally though, we've been trying to get a release out for a couple weeks and have constantly had people ask for small PRs to be included. At the meeting last week we decided on today as being a strong cutoff. I think that we should stick to it.

@jakirkham
Member Author

More generally though, we've been trying to get a release out for a couple weeks and have constantly had people ask for small PRs to be included. At the meeting last week we decided on today as being a strong cutoff. I think that we should stick to it.

It looks like Dask used milestones to track tasks for releases in the past. I'm not sure why that stopped, but maybe it is time to bring them back. This would help make it clearer which things are release-blocking and when the deadlines are. It doesn't mean there won't be disagreements about what should go in and when things should be done, but it would help make communication about each release clearer and easier to track. Thoughts?

@mrocklin
Member

mrocklin commented Feb 9, 2018

I'm not sure that milestones were ever seriously used. I think that one or two devs tried to make them a consistent practice, but it never really took hold. I'm open to the idea, but there are also procedural costs to consider (more paperwork to handle on every change). It would take someone championing the idea and putting in a fair amount of work to hold people to it to get the process started.

@jakirkham jakirkham force-pushed the minor_store_opts branch 11 times, most recently from 0fac252 to e143886 on February 11, 2018 at 00:42
@jakirkham
Member Author

Took a bit of work, but I managed to eliminate cull and inline while still getting the intended outcome. This also simplifies the code in store a bit compared to the original attempt at return_stored in PR ( #2980 ), which should make it a bit easier to follow as well. Obviously it's the weekend, so it would be good to hear some thoughts on this next week.

@jakirkham jakirkham force-pushed the minor_store_opts branch 2 times, most recently from 36af18a to ba91a70 on February 11, 2018 at 02:16
@jakirkham
Member Author

Thanks for the advice @mrocklin.

Have added a test using get_deps for case 1 and that works nicely.

Took a different tack for 2. Instead, I counted the number of times the lock was acquired at the end of the computation. As we know the lock should be acquired only once per chunk when compute=False and twice per chunk when compute=True, this is an easy thing to assert. I think this test should fail reliably if the behavior is changed, and it gets at the same underlying issue.

Please let me know what you think.
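A lock that counts its acquisitions, along the lines described above, can be sketched like this (names are illustrative; the actual test lives in dask's test suite):

```python
import threading

class CountingLock:
    """A Lock wrapper that records how many times it was acquired,
    so a test can assert once-per-chunk vs. twice-per-chunk behavior."""
    def __init__(self):
        self._lock = threading.Lock()
        self.acquire_count = 0

    def acquire(self, *args, **kwargs):
        self.acquire_count += 1
        return self._lock.acquire(*args, **kwargs)

    def release(self):
        return self._lock.release()

    __enter__ = acquire

    def __exit__(self, *exc):
        self.release()

lock = CountingLock()
for _ in range(4):   # e.g. a store step and a load step for each of two chunks
    with lock:
        pass
```

Passing such a lock as store's lock argument lets the test assert acquire_count == nchunks in the fused case and 2 * nchunks in the unfused one.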

@jakirkham
Member Author

Planning on merging tomorrow if no comments.

@jcrist jcrist left a comment
Member

Thanks @jakirkham, overall this looks good to me. I left a few nit-picky comments on style that you can feel free to ignore, and a few small things I'd like changed. Otherwise this looks good to merge. Thanks for being patient here.

store_keys = []
store_dsk = []
for tgt, src, reg in zip(targets, sources, regions):
tgt = getattr(tgt, "key", tgt)
Member

Can you explain what this does? It's not clear to me what case works with a target with a key attribute. If this is for supporting Delayed targets, I'd prefer an explicit instance check rather than an attribute check as:

  • It's more apparent what the intent is
  • Won't fail for targets that just happen to have a key attribute.

Member Author

Right, as you guessed, this is dealing with Delayed targets. The original code actually used try...except AttributeError:.... It was originally introduced in PR ( #2181 ), so this seemed like a small improvement. Though I agree we should be verifying things actually are Delayed. That said, we might want this check a bit earlier, as there is a call to __dask_optimize__ for these objects.

@jcrist jcrist Feb 20, 2018
Member

Makes sense to me. Perhaps something like this earlier:

targets2 = []
for t in targets:
    if isinstance(t, Delayed):
        targets2.append(t.key)
    elif is_dask_collection(t):
        raise TypeError("Targets must be either Delayed objects or array-likes")
    else:
        targets2.append(t)

Then use targets2 unconditionally in this loop.

Member

Should also probably update the docstring for store to add that targets can be instances of Delayed.

targets_dsk = sharedict.merge(*targets_dsk)
targets_dsk = Delayed.__dask_optimize__(targets_dsk, targets_keys)

load_stored = (return_stored and not compute)
Member

nit: these parentheses are unnecessary

Member Author

I know they aren't necessary, but it feels more readable to me. If there are strong objections, I can change it though.

slices = [fuse_slice(region, slc) for slc in slices]

name = 'store-%s' % arr.name
sto_chunk = store_chunk
Member

In cases like this in the past we've just used a chunk or func variable name. sto_chunk looks like a typo to me. No strong feelings though.

for t, slc in zip(core.flatten(arr.__dask_keys__()), slices):
store_key = (name,) + t[1:]
dsk[store_key] = (store_chunk, t, out, slc, lock, region, return_stored)
dsk[store_key] = (sto_chunk, t, out, slc, lock, return_stored) + args
Member

nit: I'd write this as a dictionary comprehension, but that's just my opinion. No strong feelings.

Member Author

Felt like I had made a lot of changes in the code and was trying to keep it easy to review, so I didn't rewrite this, but sure, we can give it a try.

load_key = ('load-%s' % k[0],) + k[1:]
# Reuse the result and arguments from `store_chunk` in `load_chunk`.
load_dsk[load_key] = (load_chunk, each_key,) + dsk[each_key][3:-1]
load_dsk[load_key] = (load_chunk, dsk_post[k]) + dsk_pre[k][3:-1]
Member

nit: I'd write this as a dictionary comprehension, but that's just me. No strong opinions.

load_dsk = {('load-%s' % k[0],) + k[1:]: (load_chunk, dsk_post[k]) + dsk_pre[k][3:-1]
            for k in keys}

Member Author

Same story as above. I was trying to keep this easy to review, but sure, it can be rewritten.

if i == 9:
assert False

# Verify number of lock calls
Member

Can you expand this comment slightly to say why it should be 2 * nchunks in one case and nchunks in the other? I know why right now, but future maintainers may not.

Member Author

Sure. Though I'd prefer to put that comment near the branch below if that is ok.

@jcrist jcrist Feb 20, 2018
Member

Besides this comment, all remaining things I pointed out were nits that can be ignored (up to you).

@jakirkham jakirkham Feb 20, 2018
Member Author

Sure. I was working my way through them; just now got to this one.

@jakirkham jakirkham force-pushed the minor_store_opts branch 2 times, most recently from ce32dd6 to 897ff7b on February 20, 2018 at 15:57
There was an extra comma at the end of a `tuple` in `retrieve_from_ooc`.
This drops that extraneous comma to improve readability.

Make sure that the Dask keys that are passed to `Array`'s
`__dask_optimize__` method are flattened. While the `__dask_optimize__`
method of `Array` does flatten the keys, it is best not to rely on that
in `store` and just ensure they are properly formatted.
Instead of leaving the user-provided `region` and chunk slice (i.e.
`index`) to be fused in the `store_chunk` and `load_chunk` functions, go
ahead and fuse them when constructing the Dask graph and leave `region`
as `None`. This should cut down the work done in `store_chunk` and
`load_chunk`. It should also allow the `region` argument in both of
these functions to be subsequently dropped.

Since `fuse_slice` is now run as part of Dask construction and this
removes the need to pass the `region` on for handling in the store/load
chunk functions, simply drop the `region` arguments and related logic
from `store_chunk` and `load_chunk`. This makes them a bit easier to
follow and maintain going forward.
Use the standard way of optimizing Dask Delayed objects to optimize the
targets in advance if needed. Thus if `cull`ing or any other
optimization strategy becomes commonplace for Delayed objects, it will
be handled through the standard mechanisms instead of us doing it in
some custom way.

As an added bonus, this avoids collecting empty dictionaries in the
event a target does not have a Dask (e.g. is not Delayed). Previously
`store` had collected empty dictionaries in this case. So this should be
an improvement when constructing and working with the `SharedDict`
object for `targets` in the common case (i.e. no Delayed targets).

Also provides a check to guarantee we are actually getting Dask Delayed
objects and are not supporting any other Dask collections. If we don't
get a Dask Delayed object, we assume it is some Array-like object.

For some time, `store` has supported `Delayed` `targets`. The intent was
to support creating a storage location only once storage occurs and
possibly handle other things. However this wasn't noted in the
docstring. So we make that a little clearer by noting `targets` can be
`Delayed`.
Before merging store Dasks with source and target Dasks, go ahead and
merge the store Dasks together alone. This simplifies the merging
process a bit and avoids directly relying on a dependency. It also makes
sure that there is a `SharedDict` for each group of related dictionaries
that gets added to the final store Dask, namely the `sources`,
`targets`, and now the `store_chunk` steps.

As we are only really interested in the one Dask graph that represents
all `store_chunk` operations, renamed `store_dsks_mrg` and `store_dsks`
to `store_dsk` to make that clearer.

Provides an option to give `retrieve_from_ooc` a computed Dask as well
to fill in values in the `load_chunk`s part of the Dask. This can be
useful when inlining results from the `persist`ed `store_chunk`s step.
This argument is totally optional and will be used to inline computed
results or `Future`s as the case may be. If this argument is not
provided, then the keys are simply inserted into the new Dask as was the
case before.
Renames `store_chunk` to `load_store_chunk`. The new `load_store_chunk`
function differs slightly from `store_chunk`'s previous behavior in that
it will skip storing results if `x` is `None`. Also, `load_store_chunk`
takes the new `load_stored` option, which allows for reading the results
back in from `out` after storing them. In other words,
`load_store_chunk` provides the behavior of `store_chunk`, `load_chunk`,
and a fused version of `store_chunk` and `load_chunk`. By keeping this
functionality all within one function using different options, it should
avoid the divergence that could occur with having different functions.

Reimplements `store_chunk` and `load_chunk` to be light wrappers around
`load_store_chunk` with only a few needed arguments. The rest are
handled in some default manner. By doing this, the original behavior of
`store_chunk` and `load_chunk` is restored while simply using
`load_store_chunk` under the hood. The `store_chunk` and `load_chunk`
functions are kept to preserve readability in terms of the code, Dasks
generated, and scheduler progress.

Makes some adjustments to `insert_to_ooc` to handle either using
`load_store_chunk` when `return_stored` and `load_stored` are both
`True` or simply using `store_chunk` for all other cases. This results
in some tweaks to the key name used and arguments provided. The purpose
of these adjustments is to ensure `load_store_chunk` shows up when
loading and storing will happen in the same step. Otherwise
`insert_to_ooc` only adds `store_chunk`, which may or may not be
followed by a separate `load_chunk` step depending on whether
`return_stored` is `True` or `False`.

Refreshes `store` so as to effectively inline all uses of `store_chunk`s
when `return_stored=True`. Thus users only see `load_store_*` nodes
whether `compute` is `True` or `False`. This avoids some lock contention
between storing and loading steps when `compute` is `False`. When
`compute` is `True`, this effectively inlines `Future`s or results that
the `store_chunk` step would have returned, making the most compact Dask
possible when `return_stored=True`.
It appears that `sto_chunk` was more confusing than helpful in the
review, so rename it to `func` to make it clear we intend this to be a
different function and not a typo of `store_chunk`.

In the `insert_to_ooc` and `retrieve_from_ooc` functions, make use of
dictionary comprehensions to simplify the code a bit. These also tend to
be a bit faster than Python `for`-loops.

When `return_stored` and `compute` are both set to `True`, ensure that
`store` returns Dask Arrays that contain depth-1 Dasks only. This helps
ensure that the Dask Arrays generated in this case don't have any
lingering connections to the Dask used for storing.

Verifies how many times a `Lock` instance is acquired in `store` when
using `return_stored`. Namely, if `compute=True`, the lock will be
acquired twice for each chunk, as storing and loading are separate steps
to allow storing of chunks to occur asynchronously. However, if
`compute=False`, the lock will be acquired only once for each chunk, as
store and load are fused into a single step that acquires the lock for
both parts. This acts as a check to ensure that the `compute=False` case
of `return_stored` does in fact fuse storing and loading into a single
step that holds the lock for both parts.
@jakirkham
Member Author

Thanks for the comments @jcrist. I think I have addressed the bulk of your concerns. Could use another look if you have time.

@jcrist
Member

jcrist commented Feb 20, 2018

LGTM. Feel free to merge when tests pass.

@jakirkham jakirkham merged commit fadc962 into dask:master Feb 20, 2018
@jakirkham jakirkham deleted the minor_store_opts branch on February 20, 2018 at 16:59
@jakirkham
Member Author

Thanks everyone. :)
