[WIP] Fixes for dask.order - Remove change of tactical goal in single dep path #10505
Conversation
```diff
-if not singles:
-    continue
+# if not singles:
+#     continue
+process_singles = True
```
Disabling this GOTO now actually generates the optimal order
However, this also breaks some tests. I have to investigate further.
I found two smart switches that actually cause this specific graph to be ordered worse. However, the linear growth described in #10384 (comment) is still a problem.
Note: I'm just documenting observations here. There are no deep insights, and I only recommend reading this if you are really interested and want to follow along.

Considering that #10384 (comment) shows that dataframes scale much better, I started looking into what the difference between the two graphs is.
After the initial pick of the key, the two orderings already diverge starting with the pick of the second key. Effectively, the first key is a random pick that then breaks into the `partition_keys` computation (lines 221 to 233 in 3520077).
This `partition_keys` function is also used to "change the strategic goal" that I disabled earlier. Inspecting the two partition keys for both the dataframe and array workloads reveals an interesting property.
So, not only does the dataframe choose a different path, the partition keys are also vastly different in both cases, even though the graphs are sufficiently similar (I'll get to the topological differences in a moment). To inspect why that is, we look at the calculated graph metrics.
So, all numbers are very different. The `partition_key` is a function of most of those metrics.

In a very, very, very simplified version of this graph, we can see that the *(figure omitted)*

Now, in terms of the overall differences between the graphs, the two (very small) versions are below.

*(DataFrame and Array graph figures omitted)*

The most obvious difference between arrays and DataFrames is that the DataFrame has a single final task as output key while the array graph has three (the three sub-arrays of the xarray dataset). This may be relevant since dask.order actually implements a special case for this scenario (the `min_dependencies` handling, lines 128 to 135 in 3520077).

to be continued...
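To make the discussion above concrete, here is a toy sketch of the *kind* of per-key metrics involved (transitive dependency count and height). This is my own illustration, not dask's actual code: `dask.order` computes more metrics (min/max heights, `min_dependencies`, ...) and does so without recursion.

```python
def graph_metrics(dependencies):
    """For each key, compute (total transitive dependencies, height).

    `dependencies` maps each key to the set of keys it needs.
    Illustrative sketch only -- not dask.order's implementation.
    """
    total = {}
    height = {}

    def _total(key):
        # Number of distinct tasks that must run before `key`.
        if key not in total:
            seen = set()
            stack = list(dependencies[key])
            while stack:
                d = stack.pop()
                if d not in seen:
                    seen.add(d)
                    stack.extend(dependencies[d])
            total[key] = len(seen)
        return total[key]

    def _height(key):
        # Length of the longest dependency chain below `key`.
        if key not in height:
            deps = dependencies[key]
            height[key] = 1 + max(map(_height, deps)) if deps else 0
        return height[key]

    return {k: (_total(k), _height(k)) for k in dependencies}


# A diamond graph: d needs b and c, which both need a.
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(graph_metrics(deps))
# {'a': (0, 0), 'b': (1, 1), 'c': (1, 1), 'd': (3, 2)}
```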
I did a little time traveling and measured pressure for dataframe / array
The good news is that we apparently are not at the worst case (which seems to be 1011 for this example) and this is no clear regression. However, it looks like the version of f4a7531 was doing pretty well for both cases (and it was symmetrical). FWIW, these commits go back to version 0.17.something and represent all notable changes to dask.order since.
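For reference, here is how I read the "pressure" number in this thread: the peak count of task results held in memory at once under a given ordering, assuming a result is released as soon as all of its dependents have run. This is my own sketch of that measurement; the actual script may differ.

```python
def pressure(order, dependencies):
    """Peak number of results simultaneously in memory when running
    tasks in `order` (a list of keys).

    `dependencies` maps each key to the set of keys it needs. A
    result is released once every task that needs it has run.
    """
    # Who still needs each key's result?
    waiting = {k: set() for k in dependencies}
    for key, deps in dependencies.items():
        for dep in deps:
            waiting[dep].add(key)

    in_memory = set()
    peak = 0
    for key in order:
        in_memory.add(key)
        peak = max(peak, len(in_memory))
        for dep in dependencies[key]:
            waiting[dep].discard(key)
            if not waiting[dep]:
                in_memory.discard(dep)  # no dependents left: release
        if not waiting[key]:
            in_memory.discard(key)  # terminal result, released right away
    return peak


# Diamond graph: with this ordering, 'a', 'b', 'c' coexist briefly.
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(pressure(["a", "b", "c", "d"], deps))  # 3
```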
I'd be curious what this would look like applied to the A/B tests. Presumably the recent fixes were made for a reason. It seems like we weren't measuring well, though, except with unit tests. Applying these to a broader set of workflows would be interesting.
Running. This will take a moment.
Just looking over the version of 0.17.2, though, I doubt it is sufficient. There are many relatively obvious failure cases it does not handle. However, it is actually more robust for the cases I'm currently looking at (it even fixes #9995).
This is waaaay too early for a PR, but considering how complex this subject is, I wanted to share intermediate progress asap.
This is specifically looking at the graph taken from #10384. The simplified reproducer I provided over there is part of this test and I started investigating why this is ordered poorly
Main ordering
Pressure 8
Optimal ordering
I ordered the graph manually to what I consider optimal ordering for this case. I don't necessarily strive to reproduce this but this should serve as an orientation. This is likely not unique. Note how source nodes have a homogeneous age and how terminal/end nodes are computed more quickly.
Pressure 6
This PR (WIP)
As a first step, I looked into why `("b", 1, 0)` is scheduled way too late, with priority 22. It should be scheduled immediately after `("a", 1, 0)` to free up `("a", 0, 0)`.

It turns out that this task is scheduled poorly due to a "change of the tactical goal" that is based on the partition keys. Removing this entirely "fixes" this specific section of the graph and allows `("b", 1, 0)` to be scheduled much more quickly, reducing pressure from 8 to 6. That's already nice, but the resulting ordering is still far from optimal and a little too heterogeneous for my taste, considering how symmetrical this problem is.
I haven't found out yet whether this "change of tactical goal" is a bad idea in general or whether the `partition_key` cost function is just bad. FWIW, this specific code branch is effectively untested: while it is covered by `test_map_overlap` and `test_array_store_final_order`, removing this condition entirely does not have an impact on any asserted property. In fact, at least for `test_map_overlap`, removing this condition also reduces pressure and generates a better graph (the other graph is more difficult to analyze).

cc @eriknw
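The "free up" effect described above can be demonstrated on a toy overlap-style graph (my own construction, not the actual reproducer from #10384): running each dependent as soon as its inputs exist lowers the peak number of live results compared to scheduling all sources first.

```python
def pressure(order, dependencies):
    """Peak number of results simultaneously held, assuming a result
    is freed once all of its dependents have run. Sketch only."""
    waiting = {k: set() for k in dependencies}
    for key, deps in dependencies.items():
        for dep in deps:
            waiting[dep].add(key)
    live, peak = set(), 0
    for key in order:
        live.add(key)
        peak = max(peak, len(live))
        for dep in dependencies[key]:
            waiting[dep].discard(key)
            if not waiting[dep]:
                live.discard(dep)
        if not waiting[key]:
            live.discard(key)
    return peak


# Toy overlap-style graph: each ("b", i) needs ("a", i) and ("a", i + 1).
n = 4
deps = {("a", i): set() for i in range(n)}
deps.update({("b", i): {("a", i), ("a", i + 1)} for i in range(n - 1)})

# Scheduling every ("a", i) before any ("b", i) keeps all sources alive:
breadth = [("a", i) for i in range(n)] + [("b", i) for i in range(n - 1)]
print(pressure(breadth, deps))  # 5

# Running each ("b", i) as soon as its inputs exist frees them early:
interleaved = [("a", 0)]
for i in range(n - 1):
    interleaved += [("a", i + 1), ("b", i)]
print(pressure(interleaved, deps))  # 3
```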