I believe the progress bars on the dashboard are currently sorted by group size (largest first), at line 94 of `distributed/distributed/diagnostics/progress_stream.py` (commit bfc5cfe):

```python
names = sorted(msg["all"], key=msg["all"].get, reverse=True)
```
This is a cheap metric that probably sometimes approximates topological order. Of course, it's wrong for any fan-out operations (repartition, shuffle, etc.).
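To make the fan-out problem concrete, here is a small sketch that applies the same sort expression to made-up task counts. The group names come from the example in this issue, but the counts are purely hypothetical, chosen so that a small input group fans out into larger downstream groups.

```python
# Hypothetical task counts per group; the numbers are invented for illustration.
all_counts = {
    "make-timeseries": 10,   # small input group
    "repartition": 40,       # fan-out: more tasks than its input
    "sub": 40,
    "dataframe-sum": 40,
}

# Same expression as the dashboard sort, applied to the made-up counts.
names = sorted(all_counts, key=all_counts.get, reverse=True)
print(names)
```

With these counts, `make-timeseries` lands at the bottom of the list even though everything else depends on it, so the bar that fills first is displayed last.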
But progress might be easier to watch and decipher if it were in actual topological order. Bars would then be most full at the top and least full at the bottom.
It took me a long time using dask to understand what the progress bars were showing, I think because the order in which they completed felt so random.
It seems doable to maintain topological ordering, but it might require keeping some state between updates to do so efficiently.
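As a rough sketch of what "some state between updates" could look like: the helper below caches a topological order of the groups and only recomputes it when the dependency structure changes. `TopoOrderCache` and the `deps` mapping are hypothetical names of mine, and I am assuming the group-to-group dependencies are available in that shape, which I have not checked against what the progress stream actually carries. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


class TopoOrderCache:
    """Hypothetical helper: remembers the group order between updates."""

    def __init__(self):
        self._key = None    # fingerprint of the last dependency structure seen
        self._order = []    # cached topological order

    def order(self, dependencies):
        # dependencies: {group name: set of group names it depends on}
        key = frozenset(
            (name, frozenset(deps)) for name, deps in dependencies.items()
        )
        if key != self._key:
            # Only re-sort when groups or edges have changed.
            self._order = list(TopologicalSorter(dependencies).static_order())
            self._key = key
        return self._order


# Assumed dependency structure for the example in this issue.
deps = {
    "make-timeseries": set(),
    "repartition": {"make-timeseries"},
    "sub": {"repartition"},
    "dataframe-count": {"sub"},
    "dataframe-sum": {"sub"},
}

cache = TopoOrderCache()
order = cache.order(deps)
print(order)
```

Groups with no dependency between them (here `dataframe-count` and `dataframe-sum`) can come out in either order, so a real version would probably want a tie-breaker to keep the bars from swapping places.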
Current

In the above example, `make-timeseries` comes first in topological order, then the `repartition`s, then `sub`, then `dataframe-count` and `dataframe-sum`.
Proposed
