Root-task withholding without co-assignment #6631

Closed
fjetter opened this issue Jun 24, 2022 · 2 comments · Fixed by #6614 or #6989
Labels: enhancement, performance, scheduling

Comments

@fjetter
Member

fjetter commented Jun 24, 2022

We made an early attempt at root-task withholding to address the problem of root-task overproduction. Below are a couple of links with additional information (non-exhaustive).

We started experimenting with withholding worker assignment for root tasks, i.e. delaying worker assignment on the scheduler side; see #6560.
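To make the idea concrete, here is a minimal toy sketch of withholding. It is not the code from #6560 or #6614; `ToyWorker`, `ToyScheduler`, and all method names are hypothetical and exist only for illustration. Root tasks stay in a scheduler-side queue and are handed to a worker only when that worker has a free thread, instead of all being assigned up front.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    """Hypothetical stand-in for a worker; tracks only what it is processing."""

    name: str
    nthreads: int
    processing: set = field(default_factory=set)

    def has_free_slot(self) -> bool:
        return len(self.processing) < self.nthreads


class ToyScheduler:
    """Hypothetical scheduler that withholds root tasks until a thread is free."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.queued = deque()  # root tasks that have no worker assigned yet

    def submit_root_tasks(self, keys):
        # Nothing is assigned eagerly: tasks enter the queue first.
        self.queued.extend(keys)
        self._fill_free_slots()

    def task_finished(self, worker, key):
        # A finished task frees a thread, so one queued task can be released.
        worker.processing.discard(key)
        self._fill_free_slots()

    def _fill_free_slots(self):
        for worker in self.workers:
            while self.queued and worker.has_free_slot():
                worker.processing.add(self.queued.popleft())


workers = [ToyWorker("a", nthreads=2), ToyWorker("b", nthreads=2)]
scheduler = ToyScheduler(workers)
scheduler.submit_root_tasks([f"root-{i}" for i in range(10)])

assigned = sum(len(w.processing) for w in workers)
print(assigned, len(scheduler.queued))
```

With two workers of two threads each, only four of the ten root tasks are assigned at once; the rest wait on the scheduler, which is what keeps workers from producing root data faster than downstream tasks can consume it.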

Early prototypes show very promising results that should improve our cluster memory footprint. A prototype is available at #6614 (and should be ready to try for curious users)

The current co-assignment logic has some significant shortcomings (e.g. #6597), and withholding root tasks appears to be sufficient to control our memory footprint (some experimentation on configuration is still required). We should therefore get the root-task withholding logic into a production-ready, i.e. mergeable, state and get rid of the current co-assignment logic.

This should be verified by thorough performance benchmark results; see coiled/benchmarks#191 for work on automated benchmarks.

Once this is solid, we may consider adding a more robust co-assignment logic in a follow-up step, if necessary.

AC

  • The prototype PR is merged and the new assignment logic is hidden behind a feature toggle
  • The feature toggle is disabled by default
  • There is a CI job with an experimental flag, running on Ubuntu on a single Python version, that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job.
  • A follow-up ticket with an overview of all skipped tests is created
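As a rough sketch of how the toggle and the test marking in the list above could fit together: the environment variable name and both helper functions below are hypothetical, since this issue does not specify them. The example uses the standard library's `unittest` only to stay self-contained; the real test suite would more likely use an equivalent pytest marker.

```python
import os
import unittest

# Hypothetical toggle name; the issue does not say how the flag is exposed.
TOGGLE_ENV_VAR = "EXPERIMENTAL_ROOT_TASK_WITHHOLDING"


def withholding_enabled(environ=os.environ) -> bool:
    """Return True only if the toggle is explicitly switched on (off by default)."""
    return environ.get(TOGGLE_ENV_VAR, "0") == "1"


def fails_with_withholding(test_func):
    """Mark a test that is known to fail under the new assignment logic.

    The test still runs on regular CI jobs and is skipped only on the
    experimental job, where the toggle is enabled.
    """
    return unittest.skipIf(
        withholding_enabled(),
        "known failure with root-task withholding enabled",
    )(test_func)


class ExampleTests(unittest.TestCase):
    @fails_with_withholding
    def test_relies_on_co_assignment(self):
        # Placeholder body; a real test would assert on task placement.
        self.assertTrue(True)
```

Using one dedicated marker keeps the skipped tests easy to find, for example by searching the code base for `fails_with_withholding`, which helps when compiling the overview for the follow-up ticket.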
@gjoseph92
Collaborator

#6614 currently implements this behind a feature flag. When the feature flag is turned off (current default), scheduling logic stays as-is, not only keeping co-assignment, but even fixing #6597.

For this ticket, is root task withholding by default the goal, or do we just want to get it in behind a feature flag?

I imagine performance benchmarks will be an important part of answering this question, as well as community input. But there's also the question of getting the entire test suite to pass under a new scheduling approach, and whether that's in scope or should be a follow-up task.

@gjoseph92
Collaborator

gjoseph92 commented Aug 31, 2022

Reopening, since these still need to happen:
