Use BlockwiseDep for map_blocks with block_id or block_info #7686
base: main
Conversation
cc @rjzamora @dask/array
@madsbk any tips on the serialization? If I understand correctly (which I'm not at all sure I do), …
Yes, this is correct. The worker should deserialize the data automatically, but the serialization logic in Distributed is very messy, so it is hard to know exactly why it doesn't happen :/ FYI: I am working on redesigning the serialization logic in Distributed, which will handle issues like this. E.g. the …
Sounds good. What's the timeframe, and is there a ticket I can watch so that I'll know when it's in place? If it's happening soon I might just leave this until then.
To check that I understand you correctly: if X is a `to_serialize` object, and the layer materialization produces a task like this:

…

it should work, whereas a task

…

might not? (The latter is approximately what I'm currently doing, although there are two layers of dict involved.)
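The snippets referenced above did not survive extraction, but the contrast being drawn can be sketched schematically. This is pure Python with a stand-in class playing the role of distributed's `to_serialize` wrapper (an illustration of the two task shapes, not distributed's actual traversal logic):

```python
class ToSerialize:
    """Stand-in for distributed.protocol.to_serialize (illustration only)."""

    def __init__(self, obj):
        self.obj = obj


def func(x):
    return x


X = {"payload": b"raw bytes"}

# Wrapper at the top level of the task tuple: the worker's
# deserialization machinery sees it directly and can unwrap it.
task_ok = (func, ToSerialize(X))

# Wrapper buried inside a dict argument: the traversal may not
# descend into the container, so the function could receive the
# wrapper object itself rather than the deserialized payload.
task_maybe_not = (func, {"x": ToSerialize(X)})
```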
This is an attempt to improve the serialization situation. Instead of trying to put the data into a custom subclass of BlockwiseDep, insert it as (constant) arguments to the wrapper function. This then relies on SubgraphCallable to handle the serialization. This may theoretically improve serialization costs when the graph is materialized on the client (which I think is still the default for arrays), because the raw data is inside the SubgraphCallable and hence only serialised once, with production of the individual block_infos left to the workers. The benchmark code in dask#7686 is a bit slower than the previous version (about 7s for compute), but still faster than main.
I've updated the PR with a different approach - see the commit message of bc2b235 for details. @rjzamora I feel like this new approach might not be in the "spirit" of BlockwiseDep, in that I'm just using an empty BlockwiseDepDict to access the block index, and all the logic for processing that block index into an input to the task is handled by a wrapper function and some constant (`indices is None`) arguments to the Blockwise. Let me know if you think I should revert bc2b235.
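The shape of this approach can be sketched as follows (all names here are hypothetical, not the PR's actual code): the metadata needed to build `block_info` travels as constant arguments baked into the wrapper, serialized once, while only the block index varies per task:

```python
def make_wrapper(func, shape, num_chunks):
    # shape/num_chunks are "constant" arguments: they live inside the
    # wrapper (and hence the SubgraphCallable), so they are serialized
    # once rather than repeated in every materialized task.
    def wrapper(block_id, *args):
        # Assemble the per-block info on the worker, at call time.
        block_info = {
            0: {
                "shape": shape,
                "num-chunks": num_chunks,
                "chunk-location": block_id,
            }
        }
        return func(*args, block_info=block_info)

    return wrapper


# Toy function that just reports which block it was called for.
wrapped = make_wrapper(
    lambda block_info: block_info[0]["chunk-location"],
    shape=(1000,),
    num_chunks=(10,),
)
print(wrapped((4,)))  # (4,)
```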
100% untested.
This still assembles the block_info parameter when the Blockwise layer is materialised, rather than as part of the task, so it is probably not going to be any more scalable, but it avoids creating an additional layer. It still needs to be updated with serialization support to make it work with distributed.
There seem to be some issues with it (causing the _BlockInfo itself to be passed to the function, rather than its expansion). To be fixed in a later commit.
This adds some tests and makes them pass, but the deserialization is probably happening in the wrong place or with the wrong method.
It treated a `numblocks` of `()` as if it were unspecified.
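The likely mechanism is Python truthiness: an empty tuple is falsy, so a bare `if numblocks:` check cannot distinguish `()` (a 0-d array, which has exactly one block) from `None` (not specified). A minimal sketch of the bug and the fix, assuming that is the cause (names are illustrative, not the actual dask code):

```python
def has_numblocks_buggy(numblocks=None):
    # Bug: () is falsy, so a 0-d array's numblocks looks "unspecified".
    return bool(numblocks)


def has_numblocks_fixed(numblocks=None):
    # Fix: compare against None explicitly.
    return numblocks is not None


print(has_numblocks_buggy(()))   # False -- wrongly treated as unspecified
print(has_numblocks_fixed(()))   # True  -- () is valid for a 0-d array
print(has_numblocks_fixed(None)) # False
```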
I've added some extra tests, so I think this is ready for review. I'm still not sure if I'm on the right path with serialization - it seems a bit different to the related code.

The failing tests seem to be flaky and not related to this PR.
When the `map_blocks` function has `block_id` or `block_info` arguments, they are populated with information about the current block. Previously this was done by synthesizing all the information at the time `map_blocks` was called and storing it in a `da.Array` of objects. Now that Blockwise supports I/O deps, it's feasible to defer this work until the Blockwise is materialised.

I've made this a draft PR because I have no idea what I'm doing with the serialization and need some advice. I've made a distributed test pass, but after seeing some comments in other PRs I'm aware that there are some subtleties about what may be deserialized on the scheduler (in this case I think an `np.ndarray` is getting deserialized there).

Maybe a solution would be to have `_BlockInfo.__getitem__` generate a task that produces the block_info, rather than generating it itself? That would also have the advantage that constructing the block info would be left to the workers.

I've extended `da.blockwise` to accept BlockwiseDep arguments (and documented it), so that `map_blocks` can pass them through (since `map_blocks` calls `da.blockwise` rather than `dask.blockwise`). I probably still need to add some direct tests, rather than relying on the coverage from `map_blocks`.

The performance looks good. Here's some benchmark code:
Before:
After:
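For reference, the per-block information that this PR defers to materialization can be computed from the chunk layout alone. A minimal pure-Python sketch of deriving a block's `array-location` entry (the helper name is made up; the key semantics follow the `block_info` format in the dask docs):

```python
from itertools import accumulate


def array_location(chunks, chunk_location):
    """Start/stop slice bounds of one block within the full array.

    chunks: per-dimension tuple of chunk sizes, e.g. ((100,) * 10,)
    chunk_location: index of the block along each dimension, e.g. (4,)
    """
    loc = []
    for dim_chunks, i in zip(chunks, chunk_location):
        # Cumulative offsets of chunk boundaries along this dimension.
        offsets = [0, *accumulate(dim_chunks)]
        loc.append((offsets[i], offsets[i + 1]))
    return loc


# Block 4 of a 1-D array of 1000 elements in ten chunks of 100
# spans elements [400, 500).
print(array_location(((100,) * 10,), (4,)))  # [(400, 500)]
```

Doing this per task on the worker is cheap, which is why deferring it out of the client-side `map_blocks` call is attractive.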
`black dask` / `flake8 dask` / `isort dask`