Benchmarks for some common workloads #243
Conversation
this is more representative of what tom actually wanted to do (write to zarr). we don't need to keep the whole thing in memory.
force-pushed from 270fac6 to 63b4916
You previously expressed interest in restructuring some of the test layout -- do we still want to rip that bandaid off?
We probably do, but let's do it in a separate PR. I've managed to organize these in a way I'm satisfied with (they're separate from all the other benchmarks right now), but it might be good to eventually mix some of the other ones into these? @ian-r-rose do you think this is ready to merge? The failed tests look like running out of AWS vCPU quota and things like that, not actual failures (so it's hard to know).
Yes, I think this is ready, except they are not being run right now (unless I'm missing something?). The test workflow currently runs the different directories separately in their own jobs, so nobody is running the `workloads` tests.
Seems easier than setting up yet another GitHub Action to run `workloads`.
Another thing you didn't have to worry about with package_sync!
this feels too long.
Okay, I've gotten everything working except for one test.

When I run this test on its own against coiled-runtime 0.0.4, it does fine. The dashboard looks as it should, tasks go quickly. There's a little bit of spillage, but not much. However, it always fails in the full CI job. With a lot of pain, watching GitHub CI logs, and clicking at random on clusters in the Coiled dashboard, I managed to find the cluster that was running the test and watch it. The dashboard looked a bit worse, with more data spilled to disk. Workers kept running out of memory and restarting, so progress was extremely slow and kept rolling back every time a worker died.

I think this demonstrates a real failure case: the workload should work in the allotted time, but it doesn't in a real-world scenario (when other things have run before it).

What do we want to do here? Merge it skipped? How do we want to handle integration tests for things that are currently broken? It's important not to forget about them, and it's also valuable to see when exactly they get fixed. If we skip it, I can try to remember to manually test it and un-skip it when something gets merged upstream that resolves it, but that's easy to mess up, and seeing the clear history gives us both traceability and a good story.

Theories for why it's failing:
@gjoseph92 any chance you have the cluster ID or a link to the details page?
@ntabris sadly no, I closed that tab too long ago. And AFAICT there are no logging statements in the tests that let you see which cluster a test was running on, and the cluster names in tests are random.
Hm, that might be good to change. Regardless, I found it based on the public IP for the scheduler (shown in your screenshot!): https://cloud.coiled.io/dask-engineering/clusters/51386/details. I might poke around a little at the logs tomorrow (and anyone else is of course welcome to as well).
Oh, the first worker I checked shows over 2000 lines of solid:
It's around 10/s for about 5 minutes.
Thanks for writing this up @gjoseph92. I'm trying this out right now, and was able to reproduce some of this locally (well, not locally, but you know what I mean). I'm still looking into it, but I wanted to note that I looked at the scheduler system tab while things were going poorly and didn't see anything overly concerning.
I like the idea of having a good history of memory usage, duration, etc., even for failing tests. I would note, however, that the measurements for tests that don't finish aren't really indicative of anything real. The duration would just show the timeout until things were fixed. The memory metrics would be a bit more meaningful, but I'd still hesitate to interpret them. So I might still advocate for skipping the problematic tests until we can figure out what's going on.
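For reference, one way the "merge it skipped" option could look is a skip marker whose reason points at a tracking issue. This is only a sketch; the test name and reason text below are placeholders, not what was actually committed:

```python
import pytest


# Placeholder test name and reason; in practice the reason would link to a
# tracking issue so it's easy to find and un-skip once the upstream fix lands.
@pytest.mark.skip(
    reason="Runs out of worker memory when run after other workloads; see tracking issue"
)
def test_problematic_workload(client):
    ...
```

An alternative would be a non-strict `pytest.mark.xfail`, which keeps running the test and records when it starts passing, at the cost of spending cluster time on a known failure.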
Co-authored-by: Ian Rose <ian.r.rose@gmail.com>
This is all green -- anything else you'd like to do here before we merge @gjoseph92?
Nope, I think this is good! Thanks for the review.
Closes #191
This adds all benchmarks from dask/distributed#6560 (comment).
They have been slightly rewritten, so the tests pick their data size automatically as a factor of total cluster memory. This is intended to make #241 significantly easier: by just parametrizing over different clusters, we can test both strong and weak scaling of the same tests.
This also adds a couple of utility functions, `scaled_array_shape` and `timeseries_of_size`, to help write dynamically scalable tests like this. I found that utilities like this make translating real-world examples into benchmark tests significantly more pleasant: you just look at `real_world_data_size / real_world_cluster_size`, multiply that factor by `cluster_memory(client)`, and pass the target data size into one of the helper functions.
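To make that concrete, here is a rough, self-contained sketch of the pattern. The helper implementations and the test body below are simplified stand-ins written for illustration; they are not the actual utilities or benchmarks added in this PR:

```python
import dask.array as da
from distributed import Client


def cluster_memory(client: Client) -> int:
    """Total memory (in bytes) across all workers currently in the cluster."""
    return int(
        sum(w["memory_limit"] for w in client.scheduler_info()["workers"].values())
    )


def scaled_array_shape(target_nbytes: int, ndim: int = 2, itemsize: int = 8) -> tuple:
    """A square-ish shape whose array is roughly `target_nbytes` in size."""
    side = int((target_nbytes / itemsize) ** (1 / ndim))
    return (side,) * ndim


def test_double_diff(client):  # `client` would be a fixture pointing at the test cluster
    # The real-world example used roughly 2x the cluster's memory, so keep that ratio
    # here instead of hard-coding a data size.
    shape = scaled_array_shape(2 * cluster_memory(client))
    a = da.random.random(shape, chunks="128 MiB")
    b = da.random.random(shape, chunks="128 MiB")
    ((a - b.T) ** 2).mean().compute()
```

The point is that the same test definition then scales with whatever cluster it's pointed at: data size grows with total memory, which is what should make parametrizing over cluster sizes for #241 cheap.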