[Feature Request] Fallback queues #493

H4dr1en · 2021-11-18T15:44:10Z

Hi,

I often enqueue an experiment in a queue that uses on-demand GPU instances in AWS, and the ClearML AWS autoscaler keeps failing with the following error:

Error: Failed to start new instance, An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 4): We currently do not have sufficient g4dn.2xlarge capacity in the Availability Zone you requested (eu-west-1a). Our system will be working on provisioning additional capacity. You can currently get g4dn.2xlarge capacity by not specifying an Availability Zone in your request or choosing eu-west-1b, eu-west-1c

I wonder if there is an easy way to extend the AWS autoscaler to detect InsufficientInstanceCapacity errors and use a different availability zone. Since this would mean that some other AWS properties (e.g. subnet, security groups) would have to differ, we could think of having a "fallback to queue" mechanism in the AWS autoscaler. This mechanism would work as follows:

- In the AWS autoscaler configuration, I specify the fallback queues for a specific queue.
- If the autoscaler fails to spin up an instance in that specific queue, it will try to start another instance in one of the fallback queues.

In practice, this would allow having one queue configuration per availability zone. The autoscaler could then spin up an instance sooner, instead of repeatedly retrying a zone that has no capacity.
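The mechanism described above could be sketched roughly as follows. This is only an illustration of the idea, not the ClearML autoscaler's actual code or configuration format: `QueueConfig`, `fallback_queues`, `InsufficientCapacityError`, `launch_with_fallback`, and the queue names are all hypothetical, and the launcher is injected so that the sketch runs without AWS access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for an InsufficientInstanceCapacity failure.

    With boto3, one would typically catch botocore's ClientError and
    inspect err.response["Error"]["Code"] instead (assumption: the
    autoscaler launches instances through boto3).
    """


@dataclass
class QueueConfig:
    # Hypothetical per-queue settings; field names are illustrative only.
    name: str
    availability_zone: str
    fallback_queues: List[str] = field(default_factory=list)


def launch_with_fallback(
    queue_name: str,
    configs: Dict[str, QueueConfig],
    launch: Callable[[QueueConfig], str],
) -> Optional[Tuple[str, str]]:
    """Try the requested queue first, then each of its fallbacks in order.

    Returns (queue name, instance id) for the first successful launch,
    or None if every candidate ran out of capacity.
    """
    primary = configs[queue_name]
    for candidate in [primary.name, *primary.fallback_queues]:
        try:
            return candidate, launch(configs[candidate])
        except InsufficientCapacityError:
            continue  # No capacity here; move on to the next queue.
    return None


# One queue configuration per availability zone, as suggested above.
configs = {
    "gpu-1a": QueueConfig("gpu-1a", "eu-west-1a", ["gpu-1b", "gpu-1c"]),
    "gpu-1b": QueueConfig("gpu-1b", "eu-west-1b"),
    "gpu-1c": QueueConfig("gpu-1c", "eu-west-1c"),
}


def fake_launch(cfg: QueueConfig) -> str:
    # Simulates eu-west-1a being out of capacity.
    if cfg.availability_zone == "eu-west-1a":
        raise InsufficientCapacityError(cfg.availability_zone)
    return f"i-fake-{cfg.availability_zone}"


print(launch_with_fallback("gpu-1a", configs, fake_launch))
# → ('gpu-1b', 'i-fake-eu-west-1b')
```

Only capacity errors trigger the fallback in this sketch; any other exception propagates, so a misconfigured queue would not be silently skipped.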


bmartinn · 2021-11-20T23:17:40Z

Thanks @H4dr1en, this sounds like a great idea.
We are working on expanding the autoscaler to support both GCP and Azure (I think these will be released next month).
Once we have them out, it will make sense to support a "fallback" mode, where if one instance launch fails, the autoscaler tries the next one on the list. Is this what you had in mind?

H4dr1en · 2021-11-21T19:00:29Z

yes, this is exactly what I have in mind 👍

tienduccao · 2022-01-05T19:24:52Z

Hi @bmartinn, do you have any update on the GCP autoscaler?
Thanks.

bmartinn · 2022-01-08T01:31:24Z

Thanks for the ping @tienduccao!
Things were delayed a bit, but I can report that the GCP autoscaler is ready and will be released to the community (SaaS) version and then synced back to the repository. I'm hoping it will not take more than a couple of weeks :)

tienduccao · 2022-01-08T06:40:12Z

Great news, thanks Martin


bmartinn added the "Feature Request" label on Nov 20, 2021
