
⚠️ CI failed ⚠️ #271

Closed
github-actions bot opened this issue Aug 20, 2022 · 11 comments
Labels
ci-failure, h2o-benchmark (work related to h2o benchmark), regression

Comments

@github-actions

Workflow Run URL

@ncclementi
Contributor

The two regressions detected are:
test_download_throughput[s3fs] in 0.0.4-py3.9 shows this bump, but it seems that it went back down afterward.
[Screenshot: Screen Shot 2022-08-22 at 12 08 16 PM]

test_q3 5GB (parquet) (red line) in upstream-py3.9 looks like a legitimate regression; note the increase in duration.
[Screenshot: Screen Shot 2022-08-22 at 12 12 14 PM]

@ian-r-rose Do you think this latest one is related to dask/dask#9397 (comment)?

@ian-r-rose
Contributor

Interesting, it doesn't look to me like the timing lines up (that fix was merged on Aug 19). I'm not sure, but it's worth investigating.

@fjetter
Member

fjetter commented Aug 23, 2022

Timing-wise, all dask/dask changes that might be relevant:

git log --since='2022-08-15 14:15' --until='2022-08-18 14:15' --pretty=oneline
097decd5f4914d7cd85b3babd6ecdd20a633bf8d Shuffle-based groupby aggregation for high-cardinality groups (#9302)
c1bc90aa5149d8f515aadd8bcf11712e9ce7cdee Use `entry_points` utility in `sizeof` (#9390)
a9ee6c2fdf0a3093747e675997143e0dbe584bad Add `entry_points` compatibility utility (#9388)
feb40fdc4a20ddcdde3ea07273dd44d89ba7c59b Clarify that ``bind()`` etc. regenerate the keys (#9385)
0a7b58b134ba38e00c97f715030fe15eb84d0bb2 Implement ma.*_like functions (#9378)
b894f72d740425c252e7896089ce328bc56a71fc Upload environment file artifact for each CI build (#9372)
1fcfb7c93e83908eee4449b473f6e325b79e1f1a Fix `SeriesGroupBy` cumulative functions with `axis=1` (#9377)

All dask/distributed changes that may be relevant:

61fca1cff93acbeaebbfd5b5935556e6424b8d6d Clean up `cluster` process reaping (#6840)
7768f6c4ea12a5f0af908bb41ed197eba1f87b17 Don't use `bokeh` `Figure` in tests (#6721)
1d0701b51c1dc354fb22af9b27b8f722f02daf63 Only close scheduler in SpecCluster if it exists (#6888)
3647cfefd004281dac0ffa92349e23bc2a94d68a Work around incompatibility of crick with setuptools 65 (#6887)

Nothing suspicious jumps out.

Of course, this might also be a Coiled-related change.

cc @ntabris @shughes-uk: is there a way for us to see when there were Coiled deployments?

@FabioRosado

> is there a way for us to see when there were Coiled deployments?

Do you have access to Datadog? If so, you can look at the APM services and check the deployments; here's a link that shows the month's deployments.

This will show the last deployment, and the version contains the timestamp:
[Screenshot: Screenshot 2022-08-23 at 13 10 58]

@ntabris
Member

ntabris commented Aug 23, 2022

The only recent platform change that I'd expect to possibly impact perf was setting the cgroup for memory on the container. This means the machine won't freeze from dask using too much memory, but it also means that dask potentially won't use as much memory (e.g., it will restart the worker before hitting the ceiling it was hitting before).

This change went to prod on the evening of August 16.
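
For anyone wanting to check how this interacts with dask's own memory management, here is a minimal sketch (assuming a stock distributed install) that prints the worker memory fractions driving spilling, pausing, and restarts; the cgroup ceiling lowers the memory limit those fractions are applied to:

import dask
import distributed  # importing distributed registers the distributed.* config defaults

# Fractions of the detected memory limit at which a worker starts spilling,
# pauses, or is restarted; a lower cgroup ceiling means these kick in sooner.
for key in ("target", "spill", "pause", "terminate"):
    value = dask.config.get(f"distributed.worker.memory.{key}")
    print(f"distributed.worker.memory.{key} = {value}")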

@fjetter
Member

fjetter commented Aug 24, 2022

> The only recent platform change that I'd expect to possibly impact perf was setting the cgroup for memory on the container. [...] This change went to prod on the evening of August 16.

I guess this is it then. This change not only protects from OOM but also changes a few internal memory monitoring mechanisms. For instance, if these workloads are operating at high memory pressure, chances are that we're now spilling more data, making the entire thing a bit slower.
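
A quick way to see whether the cgroup ceiling changed the memory limit the workers actually detect (and therefore how early spilling starts) would be something like this sketch; the scheduler address is a placeholder for the benchmark cluster:

from dask.distributed import Client

client = Client("tcp://<scheduler-address>:8786")  # placeholder address
workers = client.scheduler_info()["workers"]
for addr, info in workers.items():
    # memory_limit is the ceiling the spill/pause/terminate fractions apply to
    print(addr, "memory_limit:", info["memory_limit"])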

Edit: I ran the workload myself and strongly doubt that this is related to the changes to cgroups/memory limits. This workload is far away from any limiting/spilling. This is something else.

If I drive the workload myself, it also finishes about 25% faster than when it runs in the benchmark suite. We've seen similar systematic problems in other tests, where we suspect that the scheduler is accumulating some state that in turn slows everything down, e.g. #253

Now I'm wondering what kind of additional tests we started to run during that time period. Indeed, we started to run many more workloads during that time:

82d6b21 (main) Integration tests for spilling (#229)
eeb2b65 Benchmarks for some common workloads (#243)

cc @gjoseph92

@fjetter
Member

fjetter commented Aug 24, 2022

I didn't have a chance to test this, but just from skimming our code base, this change could help with dask/distributed#6944

@ncclementi
Contributor

ncclementi commented Aug 24, 2022

Notice that the regression is in the durations but not as much in the memory. @ian-r-rose and I think that, since this query is a groupby, there is a chance that this PR:
097decd5f4914d7cd85b3babd6ecdd20a633bf8d Shuffle-based groupby aggregation for high-cardinality groups (#9302)
could have caused something, even though it is turned off by default. I haven't taken a deeper look at this yet, but I'm planning on opening an issue on dask/dask and pinging Rick (the author of the PR) to see if he has any input.
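
One way to dig into it would be a small, self-contained reproduction of the default (tree-reduction) groupby path, timed under dask checkouts from before and after 097decd; the column names below only mimic the h2o q3 query and the data is synthetic, so treat it as a sketch rather than the benchmark itself:

import time
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Synthetic stand-in for the 5 GB h2o dataset: a high-cardinality id3 column.
n = 5_000_000
df = pd.DataFrame({
    "id3": np.random.randint(0, 100_000, size=n),
    "v1": np.random.randint(1, 6, size=n),
    "v3": np.random.random(n),
})
ddf = dd.from_pandas(df, npartitions=32)

start = time.perf_counter()
ddf.groupby("id3").agg({"v1": "sum", "v3": "mean"}).compute()  # shuffle path stays off (the default)
print(f"q3-style groupby took {time.perf_counter() - start:.2f}s")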

@ian-r-rose
Contributor

@ncclementi you might be interested in reviewing/testing #269 in this context :)

@ncclementi
Contributor

After doing a git bisect, it looks like the regression comes from the Shuffle PR; see dask/dask#9428 for reference. I pinged Rick, since he wrote the PR, to take a look at it and give some feedback.

@ncclementi ncclementi added the h2o-benchmark and regression labels Aug 25, 2022
@ncclementi
Contributor

Closing this as the regression was exposed after fixing a bug.
