Autoscaler kills workers that have just picked up a new task. #1202
Comments
Makes total sense @cthorey - Would you care to issue a PR?
I thought about it, but then I realized we still have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider. What would be better is to detect when the agent has been taken down and reschedule the Tasks that were interrupted this way. I raised an issue, allegroai/clearml-agent#188, which prevents this for now. Specifically, when an instance is taken down, SIGTERM is sent to the running processes and the running tasks are marked as completed. It would be better to mark them as failed, so that we have the option to reschedule them via the retry_on_failure parameter which we can pass to the PipelineController.
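For context, here is a minimal sketch of the rescheduling idea, assuming a clearml version whose PipelineController accepts a retry_on_failure argument; the project, pipeline, and step names are hypothetical:

```python
from clearml import PipelineController


def train_step():
    # Placeholder body; a real step would run the actual training.
    print("training...")


# Hypothetical pipeline: if interrupted steps were marked as failed
# (rather than completed), retry_on_failure could re-launch them.
pipe = PipelineController(
    name="training-pipeline",  # hypothetical names
    project="examples",
    version="1.0.0",
    retry_on_failure=3,        # re-launch a failed step up to 3 times
)
pipe.add_function_step(name="train", function=train_step)
pipe.start()
```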
Sounds like we're mixing up a number of points @cthorey.
These should probably be handled independently. WDYT?
Yep - I agree they should be handled independently.
Describe the bug
I am not sure it's a bug per se, but given the implementation (auto_scaler.py), which updates idle_workers once at the beginning of the event loop and uses the times reported there to decide which instances to spin down, I end up in cases (albeit not often) where a worker gets spun down even though it has just picked up a new Task.
Would it not be better to check, right before spinning down the worker, whether it is still idle? I am referring to line 325 in auto_scaler.py.
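A minimal sketch of the suggested re-check, not the actual auto_scaler.py code; fetch_idle_workers and terminate_instance are hypothetical stand-ins for the scaler's own helpers:

```python
from typing import Callable, Dict


def spin_down_if_still_idle(
    worker_id: str,
    fetch_idle_workers: Callable[[], Dict[str, float]],  # hypothetical: worker id -> seconds idle
    terminate_instance: Callable[[str], None],           # hypothetical: cloud-provider terminate call
    idle_threshold_sec: float,
) -> bool:
    """Terminate worker_id only if a fresh query still reports it idle.

    This closes the window in which the event loop's initial idle_workers
    snapshot has gone stale by the time the spin-down decision is made.
    """
    idle_now = fetch_idle_workers()  # re-query right before acting
    if idle_now.get(worker_id, 0.0) < idle_threshold_sec:
        return False  # worker picked up a Task since the snapshot; keep it
    terminate_instance(worker_id)
    return True
```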