
Regression in test_join.py::test_join_big #711

Open · github-actions bot opened this issue Mar 11, 2023 · 7 comments
Comments

@github-actions

Workflow Run URL

@milesgranger (Contributor)

I'm not entirely sure this is 'bad'; if anything, it seems like an improvement based on the graph. 🤔
@fjetter

```
Traceback (most recent call last):
  File "/home/runner/work/coiled-runtime/coiled-runtime/detect_regressions.py", line 139, in <module>
    regressions_report(regressions_df)
  File "/home/runner/work/coiled-runtime/coiled-runtime/detect_regressions.py", line 126, in regressions_report
    raise Exception(
Exception: Regressions detected 1:
runtime = 'coiled-upstream-py3.9', name = 'test_join_big[1-p2p]', category = 'benchmarks', last_three_peak_memory [GiB] = (8.803901672363281, 9.057891845703125, 12.268653869628906), peak_memory_threshold [GiB] = 8.744792175292968
```

[image]
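(For context, a rough sketch of the kind of check that produces a report like the one above; the column names and the "all of the last three values exceed a history-derived threshold" rule are assumptions for illustration, not the actual detect_regressions.py logic.)

```python
import pandas as pd

def flag_peak_memory_regressions(df: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Hypothetical sketch: flag a test when its three most recent peak-memory
    measurements all exceed a threshold derived from its earlier history."""
    flagged = []
    for name, group in df.sort_values("start").groupby("name"):
        history = group["peak_memory"].iloc[:-3]   # older runs
        recent = group["peak_memory"].iloc[-3:]    # last three runs
        if history.empty or len(recent) < 3:
            continue
        threshold = history.mean() * factor
        if (recent > threshold).all():
            flagged.append(
                {"name": name,
                 "last_three_peak_memory": tuple(recent),
                 "peak_memory_threshold": threshold}
            )
    return pd.DataFrame(flagged)
```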

@fjetter (Member) commented Mar 14, 2023

The comparisons to coiled-latest and coiled 0.2.1 are not relevant; both versions are way too old. We basically only compare to upstream, so the relevant charts are the timeseries charts here: https://benchmarks.coiled.io/coiled-upstream-py3.9.html

Peak memory:
[image]

Looks like something happened on March 10th (last Friday) that caused P2P performance to change. One change we merged that may affect this is dask/distributed#7621.

@hendrikmakait for visibility. I don't think this is concerning, since overall memory use is still roughly constant, even though it is 10-15% higher. Wall time stays the same.

Wall time:
[image]

@fjetter (Member) commented Mar 14, 2023

It's actually quite interesting to inspect Grafana for before and after.

Before:
https://grafana.dev-sandbox.coiledhq.com/d/eU1bT-nVz/cluster-metrics-prometheus?var-datasource=AWS%20Prometheus%20-%20Sandbox%20%28us%20east%202%29&from=1678356404431&to=1678356894794&var-cluster=test_join-e30ef57d&orgId=1

We can see a much longer tail in the computation:

[image]

And a relatively unhealthy CPU distribution across the workers, where the second unpack stage appears to utilize the CPU only partially:

[image]

After:

https://grafana.dev-sandbox.coiledhq.com/d/eU1bT-nVz/cluster-metrics-prometheus?var-datasource=AWS%20Prometheus%20-%20Sandbox%20%28us%20east%202%29&from=1678460091873&to=1678460511532&var-cluster=test_join-da3cf31e&orgId=1

[image]

This shows a more even CPU distribution but a much heavier iowait contribution.

[image]

I suspect this is actually the impact of dask/distributed#7587 💡

@fjetter (Member) commented Mar 14, 2023

Yes, this is definitely dask/distributed#7587, since the event loop health is also significantly better on the new run.
Given this data, I'm curious how this would all look if we offloaded the disk IO to another thread, or if we even optimistically preloaded N partitions to the worker ahead of time. At least theoretically, we could speed up the unpack stage by about 30%.
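
A minimal sketch of that read-ahead idea (purely illustrative, not how the P2P extension is actually structured; `read_partition_from_disk` and `unpack_partition` are hypothetical placeholders):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def read_partition_from_disk(path: str) -> bytes:
    """Hypothetical stand-in for the blocking per-partition disk read."""
    with open(path, "rb") as f:
        return f.read()


def unpack_partition(data: bytes) -> None:
    """Hypothetical stand-in for the CPU-bound unpack work."""


async def unpack_stage(paths: list[str], prefetch: int = 4) -> None:
    """Keep up to `prefetch` disk reads in flight in worker threads so the
    event loop and the CPU are not blocked on one read at a time."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        # Optimistically schedule the first `prefetch` reads ahead of time.
        pending = [
            loop.run_in_executor(pool, read_partition_from_disk, p)
            for p in paths[:prefetch]
        ]
        for i in range(len(paths)):
            data = await pending[i]  # usually already finished by now
            if i + prefetch < len(paths):  # top up the read-ahead window
                pending.append(
                    loop.run_in_executor(
                        pool, read_partition_from_disk, paths[i + prefetch]
                    )
                )
            unpack_partition(data)
```

The point is only that disk reads and the CPU-bound unpack work can overlap instead of alternating.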

@ntabris, do we have hardware disk read/write rates in Grafana at the moment, or do we just have the Dask-instrumented pieces about spilling?

@ntabris (Member) commented Mar 14, 2023

Here are the hardware network and disk rates...

Before:

[image]

After:

[image]

I was curious whether individual workers were getting better IO rates, or if it was a different distribution across workers, so I made some charts to look at that.

Before:

[image]

After:

[image]

So the max R+W rate is the same, 127 MiB/s, but the "after" cluster was getting high read rates on all the workers, whereas there was much more variance across workers on the "before" cluster.
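
(As an aside, a minimal sketch of sampling host-level disk rates with psutil, independent of whatever the Grafana dashboards collect; the helper below is hypothetical, not part of the benchmark suite.)

```python
import time

import psutil  # third-party; used only for host-level IO counters


def sample_disk_rates(interval: float = 1.0) -> tuple[float, float]:
    """Return (read, write) rates in MiB/s over `interval` seconds,
    measured from host-level disk counters rather than Dask's spill metrics."""
    before = psutil.disk_io_counters()
    time.sleep(interval)
    after = psutil.disk_io_counters()
    read_mib = (after.read_bytes - before.read_bytes) / 2**20 / interval
    write_mib = (after.write_bytes - before.write_bytes) / 2**20 / interval
    return read_mib, write_mib


if __name__ == "__main__":
    r, w = sample_disk_rates()
    print(f"read: {r:.1f} MiB/s, write: {w:.1f} MiB/s")
```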

@fjetter (Member) commented Mar 14, 2023

Very nice plots, thank you @ntabris! I believe these will be useful (at least for developers, possibly for users as well).

@j-bennet changed the title from "⚠️ CI failed ⚠️" to "Regression in test_join.py::test_join_big" on May 8, 2023
@j-bennet (Contributor) commented May 8, 2023
