[Feature Request] Fallback queues #493

Open
H4dr1en opened this issue Nov 18, 2021 · 5 comments
Labels
Feature Request Feature Request - Support w/ :+1: reaction

Comments

@H4dr1en
Contributor

H4dr1en commented Nov 18, 2021

Hi,

I often enqueue an experiment in a queue that uses on-demand GPU instances in AWS, and the ClearML AWS autoscaler keeps failing with the following error:

Error: Failed to start new instance, An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 4): We currently do not have sufficient g4dn.2xlarge capacity in the Availability Zone you requested (eu-west-1a). Our system will be working on provisioning additional capacity. You can currently get g4dn.2xlarge capacity by not specifying an Availability Zone in your request or choosing eu-west-1b, eu-west-1c

I wonder if there is an easy way to extend the AWS autoscaler to detect such InsufficientInstanceCapacity errors and retry in a different availability zone. Since a different zone means that some other AWS properties (e.g. subnet, security groups) must differ as well, we could add a "fallback to queue" mechanism to the AWS autoscaler. This mechanism would work as follows:

  1. In the AWS autoscaler configuration, I specify the fallback queues for a given queue
  2. If the autoscaler fails to spin up an instance for that queue, it tries to start an instance in one of its fallback queues

In practice, this would allow one queue configuration per availability zone. The autoscaler could then spin up an instance faster.
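
To make the proposal concrete, here is a minimal sketch of the fallback logic in plain Python. It is not based on the actual ClearML autoscaler code: `QueueConfig`, `InsufficientCapacityError`, `launch_with_fallback`, the queue names, and the subnet IDs are all hypothetical names made up for illustration, and the `launch` callable stands in for whatever the autoscaler really does to start an instance.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for an InsufficientInstanceCapacity failure."""


@dataclass
class QueueConfig:
    # Hypothetical per-queue settings: one queue per availability zone.
    name: str
    availability_zone: str
    subnet_id: str  # placeholder value, not a real subnet
    fallback_queues: List[str] = field(default_factory=list)


def launch_with_fallback(
    queue_name: str,
    queues: Dict[str, QueueConfig],
    launch: Callable[[QueueConfig], None],
) -> Optional[str]:
    """Try the requested queue first, then its fallbacks in order.

    Returns the name of the queue whose launch succeeded, or None if
    every candidate ran out of capacity.
    """
    candidates = [queue_name] + queues[queue_name].fallback_queues
    for name in candidates:
        try:
            launch(queues[name])
            return name
        except InsufficientCapacityError:
            continue  # capacity problem: move on to the next fallback
    return None


# Example: eu-west-1a has no capacity, so the launch falls back to eu-west-1b.
queues = {
    "gpu-1a": QueueConfig("gpu-1a", "eu-west-1a", "subnet-a", ["gpu-1b", "gpu-1c"]),
    "gpu-1b": QueueConfig("gpu-1b", "eu-west-1b", "subnet-b"),
    "gpu-1c": QueueConfig("gpu-1c", "eu-west-1c", "subnet-c"),
}
attempts: List[str] = []


def fake_launch(cfg: QueueConfig) -> None:
    # Simulated launcher: records the zone and fails only in eu-west-1a.
    attempts.append(cfg.availability_zone)
    if cfg.availability_zone == "eu-west-1a":
        raise InsufficientCapacityError(cfg.availability_zone)


chosen = launch_with_fallback("gpu-1a", queues, fake_launch)
print(chosen, attempts)
```

Only capacity errors trigger the fallback in this sketch; any other failure would propagate, which seems safer than silently moving a job to a different zone on, say, a permissions error.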

@bmartinn bmartinn added the Feature Request Feature Request - Support w/ :+1: reaction label Nov 20, 2021
@bmartinn
Member

Thanks @H4dr1en, this sounds like a great idea.
We are working on expanding the autoscaler to support both GCP and Azure (I think these will be released next month).
Once they are out, it will make sense to support a "fallback" mode, where if one instance launch fails, the autoscaler tries the next one on the list. Is this what you had in mind?

@H4dr1en
Contributor Author

H4dr1en commented Nov 21, 2021

Yes, this is exactly what I have in mind 👍

@tienduccao

Hi @bmartinn, do you have any update on the GCP autoscaler?
Thanks.

@bmartinn
Member

bmartinn commented Jan 8, 2022

Thanks for the ping @tienduccao!
Things were delayed a bit, but I can report that the GCP autoscaler is ready. It will be released to the community (SaaS) version first and then synced back to the repository. I'm hoping it will not take more than a couple of weeks :)

@tienduccao

tienduccao commented Jan 8, 2022 via email


3 participants