Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Adaptive] Do not allow workers to downscale if they are running long-running tasks (e.g. worker_client) #7481

Merged
merged 9 commits into from Dec 18, 2023

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Jan 17, 2023

Having tasks flagged as long running protects the worker from being classified as idle. However, when scaling adaptively and Adapt recommends fewer workers than there are idle, this can cause workers with seceded tasks to be downscaled even though there are better options.

This suggests to only close workers with long running tasks if there are no other possible options.

Considering that these workers may be driving a computation we may even need to go as far as to prohibit downscaling of those tasks entirely

@github-actions
Copy link
Contributor

github-actions bot commented Jan 17, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       27 files  ±  0         27 suites  ±0   11h 46m 14s ⏱️ + 13m 13s
  3 942 tests +  2    3 832 ✔️ +  2     108 💤 ±0  2 ±0 
49 584 runs  +26  47 312 ✔️ +25  2 270 💤 +1  2 ±0 

For more details on these failures, see this check.

Results for commit 315b337. ± Comparison against base commit 415d4fa.

♻️ This comment has been updated with latest results.

@mrocklin
Copy link
Member

Considering that these workers may be driving a computation we may even need to go as far as to prohibit downscaling of those tasks entirely

I think that if there is a running task, long-running or otherwise, that we should not choose to remove that worker when using adaptive scaling, even if it is mostly-idle.

@fjetter
Copy link
Member Author

fjetter commented Jan 23, 2023

I think that if there is a running task, long-running or otherwise, that we should not choose to remove that worker when using adaptive scaling, even if it is mostly-idle.

I think I agree given the current implementation. There is a possibility to allow for workers to naturally conclude their work by not allowing further assignments of tasks. This would be a similar mechanism as suggested in #3761 (comment) (basically setting a worker to pause and steal the remaining tasks)
For now, I think I don't want to go down that path but merely want to show options if a stricter behavior causes problems.

@fjetter fjetter force-pushed the deprioritize_long_running_from_adaptive branch from 1b101df to d381e34 Compare December 14, 2023 11:20
Comment on lines +6927 to +6928
valid_workers = [ws for ws in self.workers.values() if not ws.long_running]
groups = groupby(key, valid_workers)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this now prohibits downscaling of workers with long-running tasks, e.g. tasks-within-tasks

@fjetter fjetter self-assigned this Dec 14, 2023
@fjetter fjetter force-pushed the deprioritize_long_running_from_adaptive branch from 0f5b0da to a1ae1a7 Compare December 14, 2023 17:06
@fjetter fjetter changed the title Deprioritize long running tasks when adapting Do not allow workers to downscale if they are running long-running tasks (worker-client) Dec 14, 2023
@fjetter fjetter changed the title Do not allow workers to downscale if they are running long-running tasks (worker-client) Do not allow workers to downscale if they are running long-running tasks (e.g. worker_client) Dec 14, 2023
@fjetter fjetter changed the title Do not allow workers to downscale if they are running long-running tasks (e.g. worker_client) [Adaptive] Do not allow workers to downscale if they are running long-running tasks (e.g. worker_client) Dec 14, 2023
@fjetter fjetter added the adaptive All things relating to adaptive scaling label Dec 14, 2023
Copy link
Collaborator

@crusaderky crusaderky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits

distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
@crusaderky crusaderky merged commit 53e95ec into dask:main Dec 18, 2023
30 of 34 checks passed
@fjetter fjetter deleted the deprioritize_long_running_from_adaptive branch January 12, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adaptive All things relating to adaptive scaling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants