
Autoscaler kills workers that have just picked up a new task. #1202

Closed
cthorey opened this issue Feb 13, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@cthorey
Contributor

cthorey commented Feb 13, 2024

Describe the bug

I am not sure it's a bug per se, but given the implementation (auto_scaler.py), which updates idle_workers once at the beginning of the event loop and uses the times reported there to decide which instances to spin down, I end up in cases (albeit not often) where a worker gets spun down even though it has just picked up a new Task.

Would it not be better to check, right before spinning down the worker, whether it is still IDLE? I am referring to line 325 in auto_scaler.py.
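To illustrate, here is a minimal sketch of the guard I have in mind; the names below (get_current_status, spin_down) are hypothetical placeholders, not the actual auto_scaler.py internals:

```python
def maybe_spin_down(worker, get_current_status, spin_down):
    # idle_workers was computed once at the top of the event loop, so it
    # can be stale by the time we reach the spin-down decision.
    # Re-check the worker's live status right before acting.
    if get_current_status(worker) != "IDLE":
        # The worker picked up a Task in the meantime; leave it alone.
        return False
    spin_down(worker)
    return True
```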

@cthorey cthorey added the bug Something isn't working label Feb 13, 2024
@ainoam
Collaborator

ainoam commented Feb 14, 2024

Makes total sense @cthorey - Would you care to issue a PR?

@cthorey
Contributor Author

cthorey commented Feb 20, 2024

I thought about it, but then I realized we still have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider. It would be better to be able to detect when the agent has been taken down and to reschedule Tasks that were interrupted this way.

I raised an issue, allegroai/clearml-agent#188, about what prevents this for now.

Specifically, when an instance is taken down, SIGTERM is sent to the running processes and the running tasks are marked as completed. It would be better to mark them as failed, so that we have the option to reschedule them via the retry_on_failure parameter, which we can pass to the PipelineController.
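For illustration, a sketch of how the retry could then be wired up, assuming interrupted tasks get marked as failed rather than completed; the name, project, and retry count below are arbitrary example values:

```python
from clearml import PipelineController

pipe = PipelineController(
    name="example-pipeline",     # hypothetical names, for illustration only
    project="example-project",
    version="1.0.0",
    # If a step's task ends up in the "failed" state (rather than
    # "completed"), retry it up to 3 times before giving up.
    retry_on_failure=3,
)
```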

@ainoam
Collaborator

ainoam commented Feb 22, 2024

Sounds like we're mixing up a number of points @cthorey.

  1. Your original post - A race condition where an instance's activity status is obsolete by the time the autoscaler takes action to spin it down.
  2. The status of a task once its executing agent is explicitly terminated (which you address in clearml-agent#188), and its effect on pipeline logic.

These should probably be handled independently. WDYT?

@cthorey
Contributor Author

cthorey commented Feb 27, 2024

Yep - I agree they should be handled independently.
Regarding 1., and hence this issue, I think we can reasonably close it given that, as I said above, we have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider.
