
[Dask.order] Ignore data tasks when ordering #10706

Merged: hendrikmakait merged 3 commits into dask:main from fjetter:ignore_data_root_tasks on Dec 19, 2023

Conversation

@fjetter (Member) commented Dec 14, 2023

This supersedes #10619

This may be a little controversial... However, there are frequently topologies (particularly in the array space) that have a dummy task at the bottom of the graph that holds some metadata (e.g. for zarr). In the xarray world, these are frequently embedded numpy arrays.

I believe we should special case such tasks since they can throw off otherwise fine heuristics.

  1. They never have dependencies, so we can schedule them whenever we want.
  2. There is no point in prioritizing their dependents to release them sooner: we cannot release their data, since the data is embedded in the graph/run_spec (even a released task currently holds on to its run_spec).
  3. The data itself is typically very small; otherwise it would not be feasible to embed it in the graph.
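To make the notion of a "data task" concrete, here is a minimal sketch using a plain-dict dask-style graph. The `is_data_task` helper and the toy graph are hypothetical illustrations, not the actual dask.order code; they only rely on the raw-graph convention that a runnable task is a tuple whose first element is callable.

```python
def is_data_task(value):
    """True if the graph value is embedded data rather than a runnable task.

    By dask's raw-graph convention, a task is a tuple whose first element
    is callable; anything else stored under a key is literal data.
    """
    return not (isinstance(value, tuple) and value and callable(value[0]))

# Toy graph: "meta" mimics a zarr-style metadata blob baked into the graph.
graph = {
    "meta": {"chunks": (20,)},             # data task: no dependencies, tiny payload
    "read": (dict.get, "meta", "chunks"),  # real task consuming the embedded data
    "size": (len, "read"),                 # downstream compute
}

data_keys = {k for k, v in graph.items() if is_data_task(v)}
assert data_keys == {"meta"}  # only the embedded blob is ignorable for ordering
```

Such keys satisfy all three properties above: no dependencies, un-releasable payload, and a small footprint.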

So, why is this controversial?

With this change, ordering would differ for, say, da.from_array(np.zeros(100), chunks=20) and da.zeros(100, chunks=20), since the first literally embeds the numpy array into the dask graph while the latter generates the data whenever needed. I'm not sure this is such a bad thing. It may be a little surprising, but I don't think it will have negative effects. This can be seen in one of our tests, where I switched the data source to dask.array.random.
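The distinction can be illustrated with toy plain-dict graphs standing in for what the two constructors would build (these are hypothetical stand-ins, not the graphs dask actually generates):

```python
def is_data_task(value):
    # dask raw-graph convention: a runnable task is a tuple starting with a callable
    return not (isinstance(value, tuple) and value and callable(value[0]))

# from_array-style graph: the chunk values are literals baked into the graph
embedded = {"block": [0.0, 0.0, 0.0], "total": (sum, "block")}

# zeros-style graph: the chunk is produced by a task when the graph runs
generated = {"block": ((lambda n: [0.0] * n), 3), "total": (sum, "block")}

# Only the embedded variant contains a data task, so only it would be
# special-cased by the ordering change described in this PR.
assert is_data_task(embedded["block"])
assert not is_data_task(generated["block"])
```

Both graphs compute the same result, but only the first carries its payload inside the graph itself, which is why ordering may treat them differently.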

@fjetter (Member, Author) commented Dec 14, 2023

I'm not aware of any cases that perform better or worse with this, but I am reasonably certain that this graph transformation is a good thing; see the argument above. The only place I am aware of this pattern being used is the array world, where zarr et al. have this dummy task.

This should put an end to the conversation about whether or not they should inline those tasks, at least for the dask/dask part. There is still the distributed scheduler that could mess this up, of course.

@fjetter (Member, Author) commented Dec 14, 2023

Generally speaking, I could see us adding various equivalence transformations to a graph that we know won't impact performance but will simplify the topology. This is now the second one I've added, after #10697, which is basically the equivalent of this PR but for leaf nodes.
I am not aware of any further such transformations at the moment, but if more problems around ordering pop up, this might be an angle to investigate.

@dcherian (Collaborator) commented:

> This should put an end to the conversation about whether or not they should inline those tasks,

Woot! Woot!

@hendrikmakait (Member) left a comment:

Thanks, @fjetter!

Co-authored-by: Hendrik Makait <hendrik@makait.com>
@hendrikmakait hendrikmakait merged commit 131f89d into dask:main Dec 19, 2023
milesgranger pushed a commit to milesgranger/dask that referenced this pull request Jan 3, 2024
