
Autoscaler kills workers that have just picked up a new task. #1202

Closed
cthorey opened this issue Feb 13, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@cthorey
Contributor

cthorey commented Feb 13, 2024

Describe the bug

I am not sure it's a bug per se, but given the implementation (auto_scaler.py), which updates idle_workers once at the beginning of the event loop and uses the times reported there to decide which instances to spin down, I end up in cases (albeit not often) where a worker gets spun down even though it has just picked up a new Task.

Would it not be better to check, right before spinning down the worker, whether it is still IDLE? I am referring to line 325 in auto_scaler.py.
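To illustrate, here is a minimal sketch of the guard I have in mind; the names below (get_current_status, spin_down) are hypothetical placeholders, not the actual auto_scaler.py internals:

```python
def maybe_spin_down(worker, get_current_status, spin_down):
    # idle_workers was computed once at the top of the event loop, so it
    # can be stale by the time we reach the spin-down decision.
    # Re-check the worker's live status right before acting.
    if get_current_status(worker) != "IDLE":
        # The worker picked up a Task in the meantime; leave it alone.
        return False
    spin_down(worker)
    return True
```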

@cthorey cthorey added the bug Something isn't working label Feb 13, 2024
@ainoam
Collaborator

ainoam commented Feb 14, 2024

Makes total sense @cthorey - Would you care to issue a PR?

@cthorey
Contributor Author

cthorey commented Feb 20, 2024

I thought about it, but then I realized we still have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider. It would be better to be able to detect when the agent has been taken down and to reschedule Tasks that were interrupted this way.

I raised an issue, allegroai/clearml-agent#188, about what prevents this for now.

Specifically, when an instance is taken down, SIGTERM is sent to the running processes and the running tasks are marked as completed. It would be better to mark them as failed, so that we have the option to reschedule them via the retry_on_failure parameter, which we can pass to the PipelineController.
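For illustration, a sketch of how the retry could then be wired up, assuming interrupted tasks get marked as failed rather than completed; the name, project, and retry count below are arbitrary example values:

```python
from clearml import PipelineController

pipe = PipelineController(
    name="example-pipeline",     # hypothetical names, for illustration only
    project="example-project",
    version="1.0.0",
    # If a step's task ends up in the "failed" state (rather than
    # "completed"), retry it up to 3 times before giving up.
    retry_on_failure=3,
)
```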

@ainoam
Collaborator

ainoam commented Feb 22, 2024

Sounds like we're mixing up a number of points @cthorey.

  1. Your original post - A race condition where an instance's activity status is obsolete by the time the autoscaler takes action to spin it down.
  2. The status of a task once its executing agent is explicitly terminated (which you address in clearml-agent#188), and its effect on pipeline logic.

These should probably be handled independently. WDYT?

@cthorey
Contributor Author

cthorey commented Feb 27, 2024

Yep - I agree they should be handled independently.
Regarding 1., and hence this issue, I think we can reasonably close it given that, as I said above, we have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider.
