New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use High Level Graph optimizations for delayed #8316
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ian-r-rose! I just merged #8325 and and then merged main
here, which should result in CI passing for this PR
I had already cherry-picked #8325 here, so I think the merge was a no-op :) |
Ah, I see. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ian-r-rose! Left a few small comments, but overall this looks good
if not isinstance(dsk, HighLevelGraph): | ||
dsk = HighLevelGraph.from_collections(id(dsk), dsk, dependencies=()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was surprised to see we still need to support low-level task graphs here. Can you remind me why this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that it is need anymore, to be honest. This stanza is from 2a51476, which looks like it was copied from array
and dataframe
, both of which have similar defensiveness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out there are still some places where delayed uses low-level graphs -- quite a number of tests fail without this check.
I'll bet it could be fixed by moving the HighLevelGraph cast to those places. Do you have any idea of how much bespoke user code there might be out there manually constructing Delayed graphs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. My guess is there's probably not much manual constructing of Delayed
graphs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it might be interesting to put a WIP PR and see how many tests fail. At a minimum it would provide a good list of things to add HLGs to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look earlier today, and a good many of the failures seemed to come from this manual graph in persist
.
I don't think it would be a huge lift (for delayed, at least).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand, we want to move that to HLG? If so, agreed that looks doable and pretty valuable actually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand the consequences of moving that to HLG. What do you think the win would be there @jakirkham (not that I disagree)? Maybe I'll open up a test PR today with this to have something a bit more concrete to discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ian-r-rose!
In dask#8452 I realized that an incorrect pattern had emerged from dask#6510 of including ```python if not isinstance(dsk, HighLevelGraph): dsk = HighLevelGraph.from_collections(id(dsk), dsk, dependencies=()) ``` in optimization functions. Specifically, `id(dsk)` is incorrect as the layer name here. The layer name must match the `.name` of the resulting collection that gets created by `__dask_postpersist__()`, otherwise `__dask_layers__()` on the optimized collection will be wrong. Since `optimize` doesn't know about collections and isn't passed a layer name, the only reasonable thing to do here is to error when given a low-level graph. This is safe to do for Arrays and DataFrames, since their constructors convert any low-level graphs to HLGs. This PR doesn't really fix anything—the code path removed should be unused—but it eliminates a confusing pattern that has already wandered its way into other places dask#8316 (comment).
pre-commit run --all-files
This dusts of @jrbourbeau 's work in #7298 and fixes the remaining issue there.