[Dask.order] Ignore data tasks when ordering #10706
Conversation
I'm not aware of any cases that perform better or worse with this, but I am reasonably certain that this graph transformation is a good thing; see the argument above. The only place I am aware of where this matters in practice is the array world, where zarr et al. have this dummy task. This should put an end to the conversation about whether or not they should inline those tasks, at least for the dask/dask part. There is still the distributed scheduler that could mess this up, of course.
Generally speaking, I could see us adding various equivalency transformations to a graph that we know won't impact performance but will simplify the topology. This is now the second one I've added, after #10697, which is basically the equivalent of this PR but for leaf nodes.
Woot! Woot!
Co-authored-by: Hendrik Makait <hendrik@makait.com>
This supersedes #10619
This may be a little controversial... However, there are frequently topologies (particularly in the array space) that have a dummy task at the bottom of the graph that includes some metadata (e.g. for zarr). In the xarray world, those are frequently embedded numpy arrays.
I believe we should special case such tasks since they can throw off otherwise fine heuristics.
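To make the term concrete: a "data task" is a graph key whose value is plain data rather than a runnable task. A minimal sketch (the graph and keys below are made up for illustration; `dask.order.order` is the real entry point this PR touches, but exactly how a given dask version treats the data task's own priority may vary):

```python
import numpy as np
from dask.order import order

# A hand-written dask graph. "data" is a data task: its value is a plain
# numpy array, not a (callable, *args) tuple, so there is nothing to run.
dsk = {
    "data": np.zeros(3),
    ("x", 0): (np.sum, "data"),   # real tasks that depend on the data task
    ("x", 1): (np.mean, "data"),
}

# order() assigns execution priorities; with this change, the data task
# should no longer influence the relative priorities of the real tasks.
priorities = order(dsk)
```

The point of the transformation is that heuristics in `order` can then reason purely about the computational topology, without the dummy node skewing them.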
So, why is this controversial?
With this, ordering would differ between, say, da.from_array(np.zeros(100), chunks=20) and da.zeros(100, chunks=20), since the former literally embeds the numpy array into the dask graph while the latter generates the data whenever needed. I'm not sure this is such a bad thing. It may be a little surprising, but I don't think it will have negative effects. This can be seen in one of our tests, where I switched to dask.array.random.
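The difference described above is easy to reproduce: both constructions compute the same result, and only the graphs (and hence, potentially, the ordering) differ. A sketch, assuming dask and numpy are installed:

```python
import numpy as np
import dask.array as da

# Embeds the concrete numpy array in the graph as a data task:
x = da.from_array(np.zeros(100), chunks=20)

# Generates the zeros lazily inside each chunk task; no embedded data:
y = da.zeros(100, chunks=20)

# The results are identical even though the graphs are not.
assert (x.compute() == y.compute()).all()
```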