Use High Level Graph optimizations for delayed #8316

Merged
merged 8 commits into dask:main on Nov 2, 2021

@ian-r-rose ian-r-rose commented Oct 29, 2021

This dusts off @jrbourbeau's work in #7298 and fixes the remaining issue there.

Member

@jrbourbeau jrbourbeau left a comment

Thanks @ian-r-rose! I just merged #8325 and then merged main here, which should result in CI passing for this PR.

@ian-r-rose
Collaborator Author

I had already cherry-picked #8325 here, so I think the merge was a no-op :)

@jrbourbeau
Member

Ah, sorry, I was referring to the update from #8325 (which I now see wasn't needed due to the cherry-pick) as well as #8314, which should resolve the flaky `test_interrupt` failure that was popping up here.

@ian-r-rose
Collaborator Author

Ah, I see. Thanks!

Member

@jrbourbeau jrbourbeau left a comment

Thanks @ian-r-rose! Left a few small comments, but overall this looks good

Comment on lines +476 to +477
if not isinstance(dsk, HighLevelGraph):
    dsk = HighLevelGraph.from_collections(id(dsk), dsk, dependencies=())
Member

I was surprised to see we still need to support low-level task graphs here. Can you remind me why this is needed?

Collaborator Author

I'm not sure that it is needed anymore, to be honest. This stanza is from 2a51476, which looks like it was copied from array and dataframe, both of which have similar defensiveness.

Collaborator Author

Turns out there are still some places where delayed uses low-level graphs -- quite a number of tests fail without this check.

I'll bet it could be fixed by moving the HighLevelGraph cast to those places. Do you have any idea of how much bespoke user code there might be out there manually constructing Delayed graphs?
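
For illustration only (this sketch is not part of the PR): one way "moving the cast" to the construction site might look, assuming a hand-built graph with a single output key. The function `inc` and the key `"inc-example"` are made up for the example. The only point is that the layer is named after the key the `Delayed` object carries, rather than after `id(dsk)`.

```python
from dask.delayed import Delayed
from dask.highlevelgraph import HighLevelGraph


def inc(x):
    return x + 1


# Hypothetical hand-written low-level graph with a single task.
name = "inc-example"
low_level = {name: (inc, 1)}

# Wrap the dict where it is built, naming the layer after the output key
# instead of relying on a later id(dsk)-based cast.
hlg = HighLevelGraph.from_collections(name, low_level, dependencies=())

d = Delayed(name, hlg)
layer_names = tuple(hlg.layers)  # the single layer is named after the key
result = d.compute()
```

With the graph wrapped up front, downstream code can assume it always receives a `HighLevelGraph`.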

Member

Ah, I see. My guess is there's probably not much manual construction of Delayed graphs.

Member

Yeah, it might be interesting to put up a WIP PR and see how many tests fail. At a minimum it would provide a good list of things to add HLGs to.

Collaborator Author

I took a look earlier today, and a good many of the failures seemed to come from this manual graph in persist.

I don't think it would be a huge lift (for delayed, at least).

Member

Just so I understand, we want to move that to HLG? If so, agreed that looks doable and pretty valuable actually

Collaborator Author

I don't fully understand the consequences of moving that to HLG. What do you think the win would be there @jakirkham (not that I disagree)? Maybe I'll open up a test PR today with this to have something a bit more concrete to discuss.

Member

@jrbourbeau jrbourbeau left a comment

Thanks @ian-r-rose!

@jrbourbeau jrbourbeau merged commit 6353103 into dask:main Nov 2, 2021
gjoseph92 added a commit to gjoseph92/dask that referenced this pull request Dec 13, 2021
In dask#8452 I realized that an incorrect pattern had emerged from dask#6510 of including
```python
    if not isinstance(dsk, HighLevelGraph):
        dsk = HighLevelGraph.from_collections(id(dsk), dsk, dependencies=())
```
in optimization functions. Specifically, `id(dsk)` is incorrect as the layer name here. The layer name must match the `.name` of the resulting collection that gets created by `__dask_postpersist__()`, otherwise `__dask_layers__()` on the optimized collection will be wrong. Since `optimize` doesn't know about collections and isn't passed a layer name, the only reasonable thing to do here is to error when given a low-level graph.
This is safe to do for Arrays and DataFrames, since their constructors convert any low-level graphs to HLGs.

This PR doesn't really fix anything—the code path removed should be unused—but it eliminates a confusing pattern that has already wandered its way into other places dask#8316 (comment).
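
To make the naming problem concrete, here is a stdlib-only toy model, offered as a sketch rather than as dask's implementation. `ToyLayeredGraph`, `wrap_with_id`, and `optimize_strict` are hypothetical names that stand in for a layered graph, the `id(dsk)` wrapping pattern, and an optimizer that refuses low-level graphs. The model assumes only what the commit message states: layers are looked up by name, and that name has to equal the collection's name.

```python
class ToyLayeredGraph:
    """Hypothetical stand-in for a layered graph: layer name -> {key: task}."""

    def __init__(self, layers):
        self.layers = dict(layers)


def wrap_with_id(dsk):
    # The pattern the commit message calls incorrect: the layer is named
    # id(dsk), which has no relation to the collection's name.
    return ToyLayeredGraph({id(dsk): dsk})


def optimize_strict(dsk, keys):
    # The proposed alternative: refuse low-level graphs, because this
    # function has no way to learn the correct layer name.
    if not isinstance(dsk, ToyLayeredGraph):
        raise TypeError(
            f"expected a layered graph, got {type(dsk).__name__}; "
            "wrap it under the collection's name before optimizing"
        )
    return dsk  # real optimizations would go here


low_level = {"x": 1}
collection_name = "x"  # the name the rebuilt collection would report

# Wrapping with id(dsk): the collection's name is not among the layer names.
wrapped = wrap_with_id(low_level)
name_matches = collection_name in wrapped.layers

# Wrapping under the collection's name keeps the two in sync.
properly_named = ToyLayeredGraph({collection_name: low_level})
optimized = optimize_strict(properly_named, keys=["x"])

# A bare dict is rejected instead of being silently mis-named.
try:
    optimize_strict(low_level, keys=["x"])
    rejected = False
except TypeError:
    rejected = True
```

In this model the `id(dsk)` wrapper yields a graph whose only layer name is an integer, so a lookup by the collection's name fails. Raising early moves the responsibility for naming to the caller, which is the one place that knows the name.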
Development

Successfully merging this pull request may close these issues.

Specify worker ressource for dask.Delayed object with dask.annotate
4 participants