
Conversation

@crusaderky (Contributor)

Move the kwargs of all calls to coiled.Cluster to a centralized config file, which can then be overridden during A/B tests.
This reduces repetition and enables running A/B tests where what's being tested is not the software, but the infrastructure.

For example:

  • 10 workers with 2 threads each vs. 5 workers with 4 threads each
  • m6i vs. m6a vs. m5 instances


Practical changes

  • In all benchmarks and runtime tests using the small_client fixture, set spot=true, spot_on_demand_fallback=true, multizone=true
  • In test_parquet.py, upgrade scheduler_vm_types from m5.xlarge to m6i.xlarge
  • In test_parquet.py and test_spill.py, set send_prometheus_metrics=true
  • In test_work_stealing_on_scaling_up and test_repeated_merge_spill, set spot=true, spot_on_demand_fallback=true, multizone=true
  • In test_work_stealing_on_straggling_worker, set spot=true, spot_on_demand_fallback=true, multizone=true, send_prometheus_metrics=true
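
As a rough sketch of the idea (the file name, keys, and loading code below are hypothetical and not necessarily what this PR implements): keep the shared coiled.Cluster kwargs in one place and let an A/B test branch override individual entries before the cluster is built.

import coiled
import yaml

# Assumed central config file holding the default kwargs for a cluster fixture.
with open("cluster_kwargs.yaml") as f:
    defaults = yaml.safe_load(f)["small_cluster"]

# An A/B test overrides only what it wants to compare (e.g. instance family or
# worker count); everything else stays identical between the two branches.
ab_override = {"worker_vm_types": ["m6a.xlarge"], "n_workers": 5}

cluster = coiled.Cluster(**{**defaults, **ab_override})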

@crusaderky self-assigned this Nov 23, 2022

@crusaderky marked this pull request as draft November 23, 2022 14:37
backend_options=backend_options,
package_sync=True,
environ=dask_env_variables,
tags=gitlab_cluster_tags,
Contributor

A minor concern with this is that our benchmarks would not be able to pick up a difference here. It'd be nice if these kwargs were hashed and/or stored in the DB somewhere.
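
A minimal sketch of that suggestion, assuming the kwargs are JSON-serializable (the function name below is made up):

import hashlib
import json

def cluster_kwargs_fingerprint(kwargs: dict) -> str:
    # Serialize deterministically so identical kwargs always produce the same hash.
    payload = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

# The fingerprint (and/or the JSON payload itself) could then be written to the
# benchmark DB next to each run, making infrastructure changes queryable.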

Contributor Author

The DB stores a link to the stdout/stderr of the run, and in that output you have the dump of the kwargs.

backend_options = merge(
    m.kwargs for m in request.node.iter_markers(name="backend_options")
)
backend_options["send_prometheus_metrics"] = True
Contributor Author
@crusaderky Nov 23, 2022

This could only work with module-level annotations. I could not find it used anywhere and I wonder if there's any actual benefit in retaining the extra complexity? @jrbourbeau
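
For reference, the module-level annotation this code path would pick up looks roughly like the sketch below (hypothetical usage; none appears to exist in the repo, which is the point of the question):

import pytest

# A module-level marker; request.node.iter_markers(name="backend_options") in the
# fixture would then see these kwargs and merge them into backend_options.
pytestmark = pytest.mark.backend_options(spot=True, multizone=True)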

Contributor

This may very well be dead code. If you don't see it being used anywhere, feel free to remove it.

    m.kwargs.get("spot", False)
    for m in module.iter_markers(name="backend_options")
):
    module.add_marker(marker)
Contributor Author

@jrbourbeau why did you enable spot=True, spot_on_demand_fallback=True, multizone=True for stability tests, but not for benchmark or runtime?

Member

There are some specific tests where I chose not to use spot, because they seemed more sensitive to cluster start time and there's the potential for some small impact from using spot + fallback: we might get some spot instances, then need to make a second request to get some on-demand ones, and (I don't know if this is true) the time to provision might differ between spot and on-demand.

Contributor Author

Cluster start time is not included in the test runtime measurement though... unless spot instances are frequently forcibly shut down in the middle of a test?

Member

yeah, I had in mind the test that's about work stealing.

@crusaderky marked this pull request as ready for review November 24, 2022 15:58
@crusaderky (Contributor Author)

Everything works as intended.
A/B test evidence is available at #551.
Ready for review and merge.

@ncclementi (Contributor)

This LGTM; I'm not sure whether @jrbourbeau wants to give it a last look, since he was involved in the initial review.

wait_for_workers: true
scheduler_vm_types: [m6i.xlarge]
backend_options:
  send_prometheus_metrics: true
Member

I'd remove this; it's now set (by me) at the account level.

Member

But it also doesn't hurt anything if you want to leave it.

@crusaderky merged commit a8bf7f8 into main Nov 30, 2022
@crusaderky deleted the guido/cluster_kwargs branch November 30, 2022 16:00


Development

Successfully merging this pull request may close these issues.

Configure number of workers and instance types in A/B tests
