A HighLevelGraph abstract layer for map_overlap #7595
Conversation
Also adding @rjzamora
Thanks @GenevieveBuckley! I should be able to take a look today
dask/array/overlap.py
def __iter__(self):
    return iter(self._dict)

def __len__(self):
    return len(self._dict)
Could we plan on these not needing to materialize `self._dict`? Maybe we'd want a `get_output_keys()` method similar to `Blockwise`. In particular, `__len__` gets called by `_repr_html_`, so materializing the graph in there can make display in a notebook slow.
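To illustrate the suggestion, a metadata-only layer can answer `get_output_keys()` and `__len__` from the block grid alone, without ever building the task dict. This is only a sketch under that assumption; the class and attribute names here are hypothetical, not dask's actual `Layer` API:

```python
from itertools import product


class OverlapLayerSketch:
    """Hypothetical layer that derives keys from metadata alone."""

    def __init__(self, name, numblocks):
        self.name = name
        self.numblocks = numblocks  # blocks per dimension, e.g. (2, 3)
        self._dict = None  # the materialized graph, built only on demand

    def get_output_keys(self):
        # The output keys follow directly from the block grid,
        # so no graph materialization is needed here.
        return {
            (self.name,) + idx
            for idx in product(*(range(n) for n in self.numblocks))
        }

    def __len__(self):
        # Called by _repr_html_; must stay cheap for notebook display.
        length = 1
        for n in self.numblocks:
            length *= n
        return length


layer = OverlapLayerSketch("overlap-abc", (2, 3))
assert len(layer) == 6
assert ("overlap-abc", 1, 2) in layer.get_output_keys()
assert layer._dict is None  # nothing was materialized
```

The point is that both methods read only `name` and `numblocks`, so a notebook repr never triggers graph construction.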
Those are good things to flag, thanks @gjoseph92
Thanks for getting started on this @GenevieveBuckley. I think if we go down the path of doing a full implementation of `map_overlap` (which involves both rechunking and slicing), it would be too much to bite off. But if we can identify a useful subset with, e.g., no rechunking and constant boundary conditions, we'd have a good start.
Do you know if those conditions are met by some of the imaging applications you are interested in?
@emilmelnikov - while this PR isn't much to look at yet, Ian and I had one meeting today (with another planned later this week) to work out the best way to tackle things.
FYI, I had a chance to try moving the guts of the graph construction into the new layer. My current problem is working out how to add dependencies to that task graph (essentially, the input array `x` was specified as a dependency originally by the previous graph-construction code, and I need to reproduce that here).
Good news, bad news. Good news: the dependencies are fixed now (thanks Ian!), and things also work using LocalCluster. Bad news: we've only made a tiny, tiny dent in the time it takes. We need things to be orders of magnitude faster than they are now to be competitive.
We could try to delay more of the graph-building steps, but I suspect we might need to look at how to improve slicing too (sorry @ian-r-rose, I know you were hoping to avoid that).
Interesting, I get a huge speedup on your branch using Emil's benchmark! When I use …
I would note: I'll bet we can make this even faster by moving the computation of the keys into the layer.
Huh, that's very different. ...Turns out I have trouble telling the …
That's roughly 24x faster. Note, though, that it's only faster for this first bit: currently we're just delaying the slowness of materializing the graph until later on, so it's not faster overall; you still have to wait for that when things are computed. We'll try to optimize that a bit later on. @emilmelnikov, it would be useful to hear how much benefit (or not) you'd get just from shifting the slow parts to later on. Do you know if that's useful by itself, or is it critical to improve the overall time, including computation, too?
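To make the trade-off concrete, here is a toy model (illustrative only, not dask's actual implementation) of what "delaying materialization" means: the layer behaves like a mapping, but the cost of building the task dict is paid on first access rather than at creation time.

```python
from collections.abc import Mapping


class LazyGraph(Mapping):
    """Toy model of a delayed-materialization layer (illustrative only)."""

    def __init__(self, build):
        self._build = build    # function that produces the full task dict
        self._cached = None

    def _materialize(self):
        if self._cached is None:
            # The expensive graph construction happens here, once,
            # instead of when the layer object is created.
            self._cached = self._build()
        return self._cached

    def __getitem__(self, key):
        return self._materialize()[key]

    def __iter__(self):
        return iter(self._materialize())

    def __len__(self):
        return len(self._materialize())


calls = []
g = LazyGraph(lambda: calls.append(1) or {("x", 0): 42})
# Creating the layer is instant; nothing has been built yet.
assert calls == []
# First access triggers (and caches) the expensive build.
assert g[("x", 0)] == 42
assert calls == [1]
```

This is why the benchmark looks so much faster up front: the construction work still happens, just later, when the graph is actually needed.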
I've just tried moving another piece of the computation. Interestingly, it doesn't seem to matter as much as I expected exactly which boundary condition you choose. Maybe a lot of the other slowness is due to all the sanity checks we see around the place? I'll need to do some line profiling to work out where the slowest bits are.
Perhaps this will help us figure out what to target next.
There are actually two related benefits we could have here:
dask/array/overlap.py
self.chunks = chunks
self.numblocks = numblocks
self.token = token
self._cached_keys = None
I feel like this is an ugly way to do things - we're passing in a whole bunch of stuff in the initialization that is really only being used in the `_construct_graph` method. The aim is to get things working and then tidy it up, but I want to point this out now regardless.
There's also some potential confusion with all the different sorts of `name` (there is a name equivalent to the original `x.chunks`, a name that is `"overlap-" + tokenize(x, axes)`, and then a `getitem` name as well). Ideally this would be simplified.
> I feel like this is an ugly way to do things - we're passing in a whole bunch of stuff in the initialization that is really only being used in the `_construct_graph` method. The aim is to get things working and then tidy it up, but I want to point this out now regardless.

Yeah, that's a fair concern -- I would say that it comes down to the fact that we are sort-of reimplementing some of the logic belonging to a dask collection, but in a more restricted setting (no deserialization, no numpy/pandas), so it's both (a) kind of duplicative-feeling, and (b) a little awkward. Perhaps @rjzamora has some ideas about patterns for making things like this less verbose.

> There's also some potential confusion with all the different sorts of `name` (there is a name equivalent to the original `x.chunks`, a name that is `"overlap-" + tokenize(x, axes)`, and then a `getitem` name as well). Ideally this would be simplified.

Yep, I agree! I'm certainly open to ideas for how to name things in a clearer way (so to speak).
At the very least, I think some of these things could be computed in `__init__` rather than passed in. This would help it be more self-consistent as well. For example, `numblocks` is computable from `chunks`, so they aren't really independent parameters to be passed in. You may also be able to tokenize `(x.name, axes)` so you wouldn't have to pass in a `token` (I'm not actually sure if there would be other consequences to changing how the `getitem` name is computed).
That's great, happy to see this!
Unfortunately, while it's awesome to be able to work with dask collections without immediate graph-construction overhead, we still need to be able to slice the input and compute the output in a reasonable amount of time. I've updated the benchmark script to print the time needed to compute a single chunk.
Although, I'm not sure if this is just too much to ask for the current state of things.
There's a related discussion from @bmerry about making slicing scale better (#5918), which I believe is relevant here too. @gjoseph92 has also been thinking about how to improve slicing, so he may have some thoughts on our next steps as well.
@GenevieveBuckley I don't much like this pattern, but for now I'd follow the lead of other layers and take a … (line 321 in b3a8646), conditionally switch between … (lines 797 to 811 in b3a8646), and only set … (line 308 in b3a8646).
Thanks @gjoseph92, I probably should have realized that other people would have run into the same problem and solved it somewhere. Your tip about following the lead of the other layers was very helpful.
And yes, this still works well with LocalCluster :)
Is this ready to be merged in, then?
As a small note, it seems a little strange to me to move all layer definitions up out of the array directory. Is there an established logic behind where the `Layer` classes should be defined? I see that the `Blockwise` class is in `blockwise.py`, for instance.
That was my impression
There is no established logic. This might be a good point in time to establish what the organization should be. (We could do the re-organising in a separate PR, if need be.)
@ian-r-rose @gjoseph92 if either of you get a chance, it would be great to give this a final review
> As a small note, it seems a little strange to me to move all layer definitions up out of the array directory
IIRC we decided to keep most layer classes in a separate `layer.py` module to more easily check that extra libraries aren't imported within the module (there's more context around this over in #7381). This is because we want to avoid importing libraries like NumPy and pandas on the scheduler, where they may not be installed.
A couple of very minor comments, but I think this is a good checkpoint. Follow-up work could include:
- computing layer length without materialization
- implementing `cull` (right now culling forces materialization as well)
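For the `cull` follow-up, the key idea is that an overlap layer can decide which input blocks a requested output block needs from block indices alone, without materializing anything. A minimal 1-D sketch of that reasoning (a hypothetical helper, not dask's actual `Layer.cull` API), assuming each output block depends only on itself and its immediate neighbours:

```python
def cull_block_indices(requested, depth_blocks=1, numblocks=(4,)):
    """Return the set of 1-D input block indices needed to produce the
    `requested` output blocks, given that each output block overlaps
    `depth_blocks` neighbouring blocks on each side (illustrative only)."""
    needed = set()
    for i in requested:
        for j in range(i - depth_blocks, i + depth_blocks + 1):
            # Clip to the valid block range; edge blocks simply have
            # fewer neighbours to pull in.
            if 0 <= j < numblocks[0]:
                needed.add(j)
    return needed


# Asking for the middle block pulls in its two neighbours as well.
assert cull_block_indices({2}, depth_blocks=1, numblocks=(4,)) == {1, 2, 3}
# An edge block only has one neighbour to depend on.
assert cull_block_indices({0}, depth_blocks=1, numblocks=(4,)) == {0, 1}
```

Because this is pure index arithmetic, a metadata-only `cull` could drop unneeded tasks while still deferring graph materialization.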
Now that I've fixed one of the tests so that it actually runs (thanks @jrbourbeau and @ian-r-rose), I have a test failure to fix too 😆
Turns out, I made a typo and forgot a single character.
Ok, now we should be done @jsignell
Thanks @GenevieveBuckley!
Woo!
This PR introduces a high-level graph layer for array overlaps.
Note: there is no actual optimization included in this PR; we've just delayed materializing the whole task graph until a later point. We expect that delaying graph materialization is still a somewhat useful thing to do on its own, and that this PR sets us up for further optimization work in the future.
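For readers unfamiliar with the operation this layer models: conceptually, `map_overlap` expands each block with elements borrowed from its neighbours, applies a function blockwise, then trims the borrowed elements back off. A dask-free 1-D sketch of that idea (illustrative only; the real implementation also handles boundary conditions, rechunking, and n-dimensional arrays):

```python
def map_overlap_1d(blocks, func, depth):
    """Apply `func` blockwise over a list of 1-D blocks, letting each
    block see `depth` elements from its neighbours (illustrative only)."""
    out = []
    for i, block in enumerate(blocks):
        # 1. Grow the block with elements borrowed from its neighbours.
        left = blocks[i - 1][-depth:] if i > 0 else []
        right = blocks[i + 1][:depth] if i < len(blocks) - 1 else []
        expanded = left + block + right
        # 2. Apply the function to the expanded block.
        result = func(expanded)
        # 3. Trim the borrowed elements back off.
        start = len(left)
        out.append(result[start:start + len(block)])
    return out


blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
doubled = map_overlap_1d(blocks, lambda b: [v * 2 for v in b], depth=1)
assert doubled == [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
```

Each of those expand/apply/trim steps contributes tasks to the graph, which is why building it eagerly for large block grids was slow and why a delayed-materialization layer helps.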
- `black dask`
- `flake8 dask`
- `isort dask`
Related discussion: #7404