
Regression in test_join.py::test_join_big #711

Open · github-actions bot opened this issue Mar 11, 2023 · 7 comments
Comments

@github-actions

Workflow Run URL

@milesgranger (Contributor)

I'm not entirely sure this is 'bad'; if anything, it seems like an improvement based on the graph. 🤔
@fjetter

```
Traceback (most recent call last):
  File "/home/runner/work/coiled-runtime/coiled-runtime/detect_regressions.py", line 139, in <module>
    regressions_report(regressions_df)
  File "/home/runner/work/coiled-runtime/coiled-runtime/detect_regressions.py", line 126, in regressions_report
    raise Exception(
Exception: Regressions detected 1:
runtime = 'coiled-upstream-py3.9', name = 'test_join_big[1-p2p]', category = 'benchmarks', last_three_peak_memory [GiB] = (8.803901672363281, 9.057891845703125, 12.268653869628906), peak_memory_threshold [GiB] = 8.744792175292968
```

[image]
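(For context, a rough sketch of the kind of check that produces a report like the one above; the column names and the "all of the last three values exceed a history-derived threshold" rule are assumptions for illustration, not the actual detect_regressions.py logic.)

```python
import pandas as pd

def flag_peak_memory_regressions(df: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Hypothetical sketch: flag a test when its three most recent peak-memory
    measurements all exceed a threshold derived from its earlier history."""
    flagged = []
    for name, group in df.sort_values("start").groupby("name"):
        history = group["peak_memory"].iloc[:-3]   # older runs
        recent = group["peak_memory"].iloc[-3:]    # last three runs
        if history.empty or len(recent) < 3:
            continue
        threshold = history.mean() * factor
        if (recent > threshold).all():
            flagged.append(
                {"name": name,
                 "last_three_peak_memory": tuple(recent),
                 "peak_memory_threshold": threshold}
            )
    return pd.DataFrame(flagged)
```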

@fjetter (Member) commented Mar 14, 2023

The comparisons to coiled-latest and coiled 0.2.1 are not relevant; both versions are way too old. We basically only compare to upstream, so the relevant charts are the timeseries charts here: https://benchmarks.coiled.io/coiled-upstream-py3.9.html

Peak memory:
[image]

Looks like something happened on March 10th (last Friday) that caused P2P performance to change. One change we merged that may affect this is dask/distributed#7621.

@hendrikmakait for visibility. I don't think this is concerning, since overall memory use is still roughly constant, even though it is 10-15% higher. Wall time stays the same.

Wall time:
[image]

@fjetter (Member) commented Mar 14, 2023

It's actually quite interesting to inspect Grafana for before and after.

Before:
https://grafana.dev-sandbox.coiledhq.com/d/eU1bT-nVz/cluster-metrics-prometheus?var-datasource=AWS%20Prometheus%20-%20Sandbox%20%28us%20east%202%29&from=1678356404431&to=1678356894794&var-cluster=test_join-e30ef57d&orgId=1

We can see a much longer tail in the computation:

[image]

And a relatively unhealthy CPU distribution across the workers, where the second unpack stage appears to utilize the CPU only partially:

[image]

After:

https://grafana.dev-sandbox.coiledhq.com/d/eU1bT-nVz/cluster-metrics-prometheus?var-datasource=AWS%20Prometheus%20-%20Sandbox%20%28us%20east%202%29&from=1678460091873&to=1678460511532&var-cluster=test_join-da3cf31e&orgId=1

[image]

This shows a more even CPU distribution but a much heavier iowait contribution.

[image]

I suspect this is actually the impact of dask/distributed#7587 💡

@fjetter (Member) commented Mar 14, 2023

Yes, this is definitely dask/distributed#7587, since the event loop health is also significantly better on the new run.
Given this data, I'm curious how this would all look if we offloaded the disk IO to another thread, or if we even optimistically preloaded N partitions to the worker ahead of time. At least theoretically, we could speed up the unpack stage by about 30%.
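
A minimal sketch of that read-ahead idea (purely illustrative, not how the P2P extension is actually structured; `read_partition_from_disk` and `unpack_partition` are hypothetical placeholders):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def read_partition_from_disk(path: str) -> bytes:
    """Hypothetical stand-in for the blocking per-partition disk read."""
    with open(path, "rb") as f:
        return f.read()


def unpack_partition(data: bytes) -> None:
    """Hypothetical stand-in for the CPU-bound unpack work."""


async def unpack_stage(paths: list[str], prefetch: int = 4) -> None:
    """Keep up to `prefetch` disk reads in flight in worker threads so the
    event loop and the CPU are not blocked on one read at a time."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        # Optimistically schedule the first `prefetch` reads ahead of time.
        pending = [
            loop.run_in_executor(pool, read_partition_from_disk, p)
            for p in paths[:prefetch]
        ]
        for i in range(len(paths)):
            data = await pending[i]  # usually already finished by now
            if i + prefetch < len(paths):  # top up the read-ahead window
                pending.append(
                    loop.run_in_executor(
                        pool, read_partition_from_disk, paths[i + prefetch]
                    )
                )
            unpack_partition(data)
```

The point is only that disk reads and the CPU-bound unpack work can overlap instead of alternating.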

@ntabris, do we have hardware disk read/write rates in Grafana at the moment, or do we just have the Dask-instrumented pieces about spilling?

@ntabris (Member) commented Mar 14, 2023

Here are the hardware network and disk rates...

Before:

[image]

After:

[image]

I was curious whether individual workers were getting better IO rates, or if it was a different distribution across workers, so I made some charts to look at that.

Before:

[image]

After:

[image]

So the max R+W rate is the same, 127 MiB/s, but the "after" cluster was getting high read rates on all the workers, whereas there was much more variance across workers on the "before" cluster.
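
(As an aside, a minimal sketch of sampling host-level disk rates with psutil, independent of whatever the Grafana dashboards collect; the helper below is hypothetical, not part of the benchmark suite.)

```python
import time

import psutil  # third-party; used only for host-level IO counters


def sample_disk_rates(interval: float = 1.0) -> tuple[float, float]:
    """Return (read, write) rates in MiB/s over `interval` seconds,
    measured from host-level disk counters rather than Dask's spill metrics."""
    before = psutil.disk_io_counters()
    time.sleep(interval)
    after = psutil.disk_io_counters()
    read_mib = (after.read_bytes - before.read_bytes) / 2**20 / interval
    write_mib = (after.write_bytes - before.write_bytes) / 2**20 / interval
    return read_mib, write_mib


if __name__ == "__main__":
    r, w = sample_disk_rates()
    print(f"read: {r:.1f} MiB/s, write: {w:.1f} MiB/s")
```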

@fjetter (Member) commented Mar 14, 2023

Very nice plots, thank you @ntabris! I believe these will be useful (at least for developers, possibly for users as well).

@j-bennet changed the title from "⚠️ CI failed ⚠️" to "Regression in test_join.py::test_join_big" on May 8, 2023
@j-bennet (Contributor) commented May 8, 2023
