Centralized kwargs to coiled.Cluster #549
Conversation
    backend_options=backend_options,
    package_sync=True,
    environ=dask_env_variables,
    tags=gitlab_cluster_tags,
A minor concern with this is that our benchmarks would not be able to pick up a difference here. It'd be nice if these kwargs were hashed and/or stored in the DB somewhere.
The DB stores a link to the stdout/stderr of the run, and that includes a dump of the kwargs.
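For illustration, hashing the kwargs could look something like the sketch below; the helper name and the idea of storing the digest next to the run record are assumptions, not part of this PR:

```python
import hashlib
import json

def hash_cluster_kwargs(kwargs: dict) -> str:
    """Stable digest of the kwargs passed to coiled.Cluster (hypothetical helper)."""
    # sort_keys gives a canonical ordering; default=str covers values that
    # aren't natively JSON-serializable, such as paths or enums.
    blob = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()
```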
    backend_options = merge(
        m.kwargs for m in request.node.iter_markers(name="backend_options")
    )
    backend_options["send_prometheus_metrics"] = True
This could only work with module-level annotations. I could not find it used anywhere, and I wonder whether there's any actual benefit in retaining the extra complexity? @jrbourbeau
This may very well be dead code. If you don't see it being used anywhere, feel free to remove it.
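For reference, the module-level annotation that the merge above would pick up looks like the following; the `spot=False` value is just an illustration:

```python
import pytest

# Module-level marker: applies to every test in this module.
# request.node.iter_markers(name="backend_options") finds it,
# and the merge above folds its kwargs into backend_options.
pytestmark = pytest.mark.backend_options(spot=False)
```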
        m.kwargs.get("spot", False)
        for m in module.iter_markers(name="backend_options")
    ):
        module.add_marker(marker)
@jrbourbeau why did you enable `spot=True, spot_on_demand_fallback=True, multizone=True` for stability tests, but not for benchmarks or runtime tests?
There are some specific tests where I chose not to use spot because they seemed more sensitive to cluster start time, and there's the potential for some small impact from using spot + fallback: we might get some spot instances and then need to make a second request to get some on-demand ones, and (I don't know if this is true, but it's possible) time to provision could differ between spot and on-demand.
Cluster start time is not included in the test runtime measure though... unless spot instances are frequently forcibly shut down in the middle of a test?
Yeah, I had in mind the test that's about work stealing.
Everything works as intended.

This LGTM. I'm not sure if @jrbourbeau wants to give it a last look, since he was involved in the initial review.
    wait_for_workers: true
    scheduler_vm_types: [m6i.xlarge]
    backend_options:
      send_prometheus_metrics: true
I'd remove this; it's now set (by me) at the account level.
But it also doesn't hurt anything if you want to leave it.
Move the kwargs of all calls to `coiled.Cluster` to a centralized config file, which can then be overridden during A/B tests. This reduces repetition and enables running A/B tests where what's being tested is not the software but the infrastructure.
For example:
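Below is a minimal sketch of what an A/B override might look like, assuming a YAML config keyed by cluster name; the file layout and key names are illustrative, not taken from this PR:

```yaml
# Hypothetical A/B override: change only the infrastructure settings,
# leaving the test code and software versions untouched.
small_cluster:
  backend_options:
    spot: true                     # run workers on spot instances
    spot_on_demand_fallback: true  # fall back to on-demand if spot capacity runs out
    multizone: true                # allow instances across availability zones
```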
Out of scope: review all tests to verify that the workload scales dynamically with cluster size.
Practical changes
- In `benchmarks` and `runtime` tests using the `small_client` fixture, set `spot=true, spot_on_demand_fallback=true, multizone=true`
- In `test_parquet.py`, upgrade `scheduler_vm_types` from m5.xlarge to m6i.xlarge
- In `test_parquet.py` and `test_spill.py`, set `send_prometheus_metrics=true`
- In `test_work_stealing_on_scaling_up` and `test_repeated_merge_spill`, set `spot=true, spot_on_demand_fallback=true, multizone=true`
- In `test_work_stealing_on_straggling_worker`, set `spot=true, spot_on_demand_fallback=true, multizone=true, send_prometheus_metrics=True`