Regression in test_join.py::test_join_big #711
Comments
I'm not entirely sure this is 'bad'; if anything it seems like an improvement based on the graph. 🤔
The comparisons to coiled-latest and coiled 0.2.1 are not relevant; both versions are far too old. We basically only compare to upstream, so the relevant charts are the timeseries charts here: https://benchmarks.coiled.io/coiled-upstream-py3.9.html

It looks like something happened on March 10th (last Friday) that caused P2P performance to change. One change we merged that may affect this is dask/distributed#7621. @hendrikmakait for visibility. I don't think this is concerning, since overall memory is still constant even though it's 10-15% higher, and wall time stays the same.
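For readers unfamiliar with the benchmark, the workload shape is a large dataframe join run through the P2P shuffle. A minimal sketch in that spirit (this is not the actual test_join_big; the data sizes are arbitrary and the exact spelling of the option that selects P2P has varied across dask versions):

```python
# Hypothetical sketch of a big-join benchmark in the spirit of
# test_join_big; not the actual test. Assumes a running cluster and
# that the P2P shuffle can be selected via the `shuffle` kwarg.
import dask
from distributed import Client

client = Client()  # connect to the benchmark cluster under test

# Two datasets large enough that the shuffle must spill to disk.
left = dask.datasets.timeseries(
    start="2000-01-01", end="2000-06-30", freq="1s", partition_freq="1d"
)
right = dask.datasets.timeseries(
    start="2000-01-01", end="2000-06-30", freq="1s", partition_freq="1d"
)

# Join on a shared column through the P2P shuffle path.
joined = left.merge(right, on="id", shuffle="p2p")
print(joined.size.compute())
```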
It's actually quite interesting to inspect Grafana for before and after. In the before run we can see a much longer tail in the computation and a relatively unhealthy CPU distribution across the workers, where the second unpack stage appears to only partially utilize the CPU. The after run shows a more even CPU distribution but a much heavier iowait contribution. I suspect this is actually the impact of dask/distributed#7587 💡
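If you want a similar per-worker CPU/iowait breakdown without opening Grafana, one way is to sample psutil on every worker via `client.run`; a rough sketch (the scheduler address and the one-second sampling window are assumptions):

```python
# Rough sketch: sample CPU time shares, including iowait, on every
# worker. Assumes psutil is available on the workers (it is a
# dependency of distributed) and uses an arbitrary one-second window.
from distributed import Client

def cpu_breakdown(interval=1.0):
    import psutil

    t = psutil.cpu_times_percent(interval=interval)
    # iowait is only reported on Linux, hence the getattr fallback.
    return {
        "user": t.user,
        "system": t.system,
        "iowait": getattr(t, "iowait", None),
    }

client = Client("tcp://scheduler-address:8786")  # hypothetical address
for addr, stats in sorted(client.run(cpu_breakdown).items()):
    print(addr, stats)
```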
Yes, this is definitely dask/distributed#7587, since the event loop health is also significantly better on the new run. @ntabris, do we have hardware disk read/write rates in Grafana at the moment, or do we just have the dask-instrumented pieces about spilling?
Here are the hardware network and disk rates. I was curious whether individual workers were getting better I/O rates, or whether it was a different distribution across workers, so I made some charts (before and after) to look at that. The max R+W rate is the same, 127 MiB/s, but the "after" cluster was getting high read rates on all the workers, whereas there was much more variance across workers on the "before" cluster.
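For anyone who wants these hardware numbers without the Grafana charts, per-worker disk rates can be approximated by differencing psutil's cumulative disk counters over a short window; a sketch (the scheduler address and window length are assumptions):

```python
# Sketch: approximate per-worker disk read/write rates in MiB/s by
# sampling psutil's cumulative disk I/O counters over a short window.
from distributed import Client

def disk_rates(window=1.0):
    import time
    import psutil

    before = psutil.disk_io_counters()
    time.sleep(window)
    after = psutil.disk_io_counters()
    mib = 1024**2
    return {
        "read_MiB_s": (after.read_bytes - before.read_bytes) / mib / window,
        "write_MiB_s": (after.write_bytes - before.write_bytes) / mib / window,
    }

client = Client("tcp://scheduler-address:8786")  # hypothetical address
for addr, rates in sorted(client.run(disk_rates).items()):
    print(addr, rates)
```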
Very nice plots, thank you @ntabris! I believe these will be useful (at least for developers, possibly for users as well).
Workflow Run URL