
Conversation

@crusaderky (Contributor)

Move the kwargs of all calls to coiled.Cluster to a centralized config file, which can then be overridden during A/B tests.
This reduces repetition and enables running A/B tests where what's being tested is not the software, but the infrastructure.

For example:

  • 10 workers with 2 threads each vs. 5 workers with 4 threads each
  • m6i vs. m6a vs. m5 instances


Practical changes

  • In all benchmarks and runtime tests using the small_client fixture, set spot=true, spot_on_demand_fallback=true, multizone=true
  • In test_parquet.py, upgrade scheduler_vm_types from m5.xlarge to m6i.xlarge
  • In test_parquet.py and test_spill.py, set send_prometheus_metrics=true
  • In test_work_stealing_on_scaling_up and test_repeated_merge_spill, set spot=true, spot_on_demand_fallback=true, multizone=true
  • In test_work_stealing_on_straggling_worker, set spot=true, spot_on_demand_fallback=true, multizone=true, send_prometheus_metrics=true
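
As a rough sketch of the idea (the file name, keys, and loading code below are hypothetical and not necessarily what this PR implements): keep the shared coiled.Cluster kwargs in one place and let an A/B test branch override individual entries before the cluster is built.

import coiled
import yaml

# Assumed central config file holding the default kwargs for a cluster fixture.
with open("cluster_kwargs.yaml") as f:
    defaults = yaml.safe_load(f)["small_cluster"]

# An A/B test overrides only what it wants to compare (e.g. instance family or
# worker count); everything else stays identical between the two branches.
ab_override = {"worker_vm_types": ["m6a.xlarge"], "n_workers": 5}

cluster = coiled.Cluster(**{**defaults, **ab_override})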

@crusaderky self-assigned this Nov 23, 2022

@crusaderky marked this pull request as draft November 23, 2022 14:37
backend_options=backend_options,
package_sync=True,
environ=dask_env_variables,
tags=gitlab_cluster_tags,
Contributor

A minor concern with this is that our benchmarks would not be able to pick up a difference here. It'd be nice if these kwargs were hashed and/or stored in the DB somewhere.
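
A minimal sketch of that suggestion, assuming the kwargs are JSON-serializable (the function name below is made up):

import hashlib
import json

def cluster_kwargs_fingerprint(kwargs: dict) -> str:
    # Serialize deterministically so identical kwargs always produce the same hash.
    payload = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

# The fingerprint (and/or the JSON payload itself) could then be written to the
# benchmark DB next to each run, making infrastructure changes queryable.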

Contributor Author

The DB stores a link to the stdout/stderr of the run, and in that output you have the dump of the kwargs.

backend_options = merge(
    m.kwargs for m in request.node.iter_markers(name="backend_options")
)
backend_options["send_prometheus_metrics"] = True
Contributor Author
@crusaderky Nov 23, 2022

This could only work with module-level annotations. I could not find it used anywhere and I wonder if there's any actual benefit in retaining the extra complexity? @jrbourbeau
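
For reference, the module-level annotation this code path would pick up looks roughly like the sketch below (hypothetical usage; none appears to exist in the repo, which is the point of the question):

import pytest

# A module-level marker; request.node.iter_markers(name="backend_options") in the
# fixture would then see these kwargs and merge them into backend_options.
pytestmark = pytest.mark.backend_options(spot=True, multizone=True)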

Contributor

This may very well be dead code. If you don't see it being used anywhere, feel free to remove it.

    m.kwargs.get("spot", False)
    for m in module.iter_markers(name="backend_options")
):
    module.add_marker(marker)
Contributor Author

@jrbourbeau why did you enable spot=True, spot_on_demand_fallback=True, multizone=True for stability tests, but not for benchmark or runtime?

Member

There are some specific tests where I chose not to use spot, because they seemed more sensitive to cluster start time and there's the potential for some small impact from using spot + fallback: we might get some spot instances, then need to make a second request to get some on-demand ones, and (I don't know if this is true) the time to provision might differ between spot and on-demand.

Contributor Author

Cluster start time is not included in the test runtime measurement though... unless spot instances are frequently forcibly shut down in the middle of a test?

Member

yeah, I had in mind the test that's about work stealing.

@crusaderky marked this pull request as ready for review November 24, 2022 15:58
@crusaderky (Contributor Author)

Everything works as intended.
A/B test evidence is available at #551.
Ready for review and merge.

@ncclementi (Contributor)

This LGTM; I'm not sure whether @jrbourbeau wants to give it a last look, since he was involved in the initial review.

wait_for_workers: true
scheduler_vm_types: [m6i.xlarge]
backend_options:
  send_prometheus_metrics: true
Member

I'd remove this; it's now set (by me) at the account level.

Member

But it also doesn't hurt anything if you want to leave it.

@crusaderky merged commit a8bf7f8 into main Nov 30, 2022
@crusaderky deleted the guido/cluster_kwargs branch November 30, 2022 16:00


Development

Successfully merging this pull request may close these issues.

Configure number of workers and instance types in A/B tests
