Benchmarks for some common workloads #243
Conversation
this is more representative of what tom actually wanted to do (write to zarr). we don't need to keep the whole thing in memory.
force-pushed from 270fac6 to 63b4916
You previously expressed interest in restructuring some of the test layout -- do we still want to rip that bandaid off?
We probably do, but let's do it in a separate PR. I've managed to organize these in a way I'm satisfied with (they're separate from all the other benchmarks right now), but it might be good to eventually mix some of the other ones into these? @ian-r-rose do you think this is ready to merge? The failed tests look like running out of AWS vCPU quota and things like that, not actual failures (so it's hard to know).
Yes, I think this is ready, except they are not being run right now (unless I'm missing something?). The test workflow currently runs the different directories separately in their own jobs, so nobody is running the `workloads` tests.
Seems easier than setting up yet another GitHub Action to run `workloads`.
Another thing you didn't have to worry about with package_sync!
this feels too long.
Okay, I've gotten everything working except for one test.

When I run this test on its own against coiled-runtime 0.0.4, it does fine. The dashboard looks as it should, tasks go quickly. There's a little bit of spillage, but not much. However, it always fails in the full CI job. With a lot of pain, watching GitHub CI logs, and clicking at random on clusters in the Coiled dashboard, I managed to find the cluster that was running the test and watch it. The dashboard looked a bit worse, with more data spilled to disk. Workers kept running out of memory and restarting, so progress was extremely slow and kept rolling back every time a worker died.

I think this demonstrates a real failure case: the workload should work in the allotted time, but it doesn't in a real-world scenario (when other things have run before it).

What do we want to do here? Merge it skipped? How do we want to handle integration tests for things that are currently broken? It's important not to forget about them, and it's also valuable to see when exactly they get fixed. If we skip it, I can try to remember to manually test it and un-skip it when something gets merged upstream that resolves it, but that's easy to mess up, and seeing the clear history gives us both traceability and a good story.

Theories for why it's failing:
@gjoseph92 any chance you have the cluster ID or a link to the details page?
@ntabris sadly no, I closed that tab too long ago. And AFAICT there are no logging statements in the tests that let you see which cluster a test was running on, and the cluster names in tests are random.
Hm, that might be good to change. Regardless, I found it based on the public IP for the scheduler (shown in your screenshot!): https://cloud.coiled.io/dask-engineering/clusters/51386/details. I might poke around a little at the logs tomorrow (and anyone else is of course welcome to as well).
Oh, the first worker I checked shows over 2000 lines of solid:
It's around 10/s for about 5 minutes.
Thanks for writing this up @gjoseph92. I'm trying this out right now, and was able to reproduce some of this locally (well, not locally, but you know what I mean). I'm still looking into it, but I wanted to note that I looked at the scheduler system tab while things were going poorly and didn't see anything overly concerning.
I like the idea of having a good history of memory usage, duration, etc., even for failing tests. I would note, however, that the measurements for tests that don't finish aren't really indicative of anything real. The duration would just show the timeout until things were fixed. The memory metrics would be a bit more meaningful, but I'd still hesitate to interpret them. So I might still advocate for skipping the problematic tests until we can figure out what's going on.
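For reference, one way the "merge it skipped" option could look is a skip marker whose reason points at a tracking issue. This is only a sketch; the test name and reason text below are placeholders, not what was actually committed:

```python
import pytest


# Placeholder test name and reason; in practice the reason would link to a
# tracking issue so it's easy to find and un-skip once the upstream fix lands.
@pytest.mark.skip(
    reason="Runs out of worker memory when run after other workloads; see tracking issue"
)
def test_problematic_workload(client):
    ...
```

An alternative would be a non-strict `pytest.mark.xfail`, which keeps running the test and records when it starts passing, at the cost of spending cluster time on a known failure.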
Co-authored-by: Ian Rose <ian.r.rose@gmail.com>
This is all green -- anything else you'd like to do here before we merge @gjoseph92?
Nope, I think this is good! Thanks for the review.
Closes #191
This adds all benchmarks from dask/distributed#6560 (comment).
They have been slightly rewritten, so the tests pick their data size automatically as a factor of total cluster memory. This is intended to make #241 significantly easier: by just parametrizing over different clusters, we can test both strong and weak scaling of the same tests.
This also adds a couple of utility functions, `scaled_array_shape` and `timeseries_of_size`, to help write dynamically scalable tests like this. I found that utilities like this make translating real-world examples into benchmark tests significantly more pleasant: you just look at `real_world_data_size / real_world_cluster_size`, multiply that factor by `cluster_memory(client)`, and pass the target data size into one of the helper functions.
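To make that concrete, here is a rough, self-contained sketch of the pattern. The helper implementations and the test body below are simplified stand-ins written for illustration; they are not the actual utilities or benchmarks added in this PR:

```python
import dask.array as da
from distributed import Client


def cluster_memory(client: Client) -> int:
    """Total memory (in bytes) across all workers currently in the cluster."""
    return int(
        sum(w["memory_limit"] for w in client.scheduler_info()["workers"].values())
    )


def scaled_array_shape(target_nbytes: int, ndim: int = 2, itemsize: int = 8) -> tuple:
    """A square-ish shape whose array is roughly `target_nbytes` in size."""
    side = int((target_nbytes / itemsize) ** (1 / ndim))
    return (side,) * ndim


def test_double_diff(client):  # `client` would be a fixture pointing at the test cluster
    # The real-world example used roughly 2x the cluster's memory, so keep that ratio
    # here instead of hard-coding a data size.
    shape = scaled_array_shape(2 * cluster_memory(client))
    a = da.random.random(shape, chunks="128 MiB")
    b = da.random.random(shape, chunks="128 MiB")
    ((a - b.T) ** 2).mean().compute()
```

The point is that the same test definition then scales with whatever cluster it's pointed at: data size grows with total memory, which is what should make parametrizing over cluster sizes for #241 cheap.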