[ECS] : Capacity Strategy to Fall back to OD only When No More Spot Capacity Available #773
Comments
We also observe a similar problem, which I describe below. If it sounds like a separate issue, please let me know. We run our ECS cluster with the following default providers: Now let's say we run a service that uses the default providers and uses autoscaling. If the service has a I consider this almost a bug, as it is very counterintuitive that ECS will allocate by providers first and consistently ignore
I fully agree with this. There should be an option to prioritize the desired count over the capacity provider. It would open the door to more flexible use of spot capacity, including for long-running services.
Couldn't agree more. I asked about this when SPOT was launched and had a chat with our TAM and also the service team. I don't think it was on the agenda any time soon back then. Personally, I doubt this will be a priority for AWS, as it makes SPOT just too easy: everyone would choose SPOT instead of FARGATE, and where is the fun in that...
@dactp I'm confused. Does this setting:
allow OD to be used only if SPOT is not available?
It just means it'll run 50% of tasks in Fargate and 50% in Spot. There's no failover if one is not available.
+1
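To make the "no failover" point concrete, here is a minimal sketch of how a weighted strategy divides a desired count. It is a rough model rather than ECS's documented algorithm: the function name `split_by_strategy`, the tie-breaking rule, and the equal-weight strategy in the usage note are assumptions for illustration (the setting quoted in the question did not survive in this thread).

```python
from fractions import Fraction


def split_by_strategy(strategy, desired_count):
    """Return {provider: task_count} for a capacity provider strategy.

    Assumed model: base tasks are assigned first, then each remaining task
    goes to the provider that is furthest below its weighted share.
    Capacity availability is never consulted, so there is no failover.
    """
    counts = {item["capacityProvider"]: 0 for item in strategy}
    remaining = desired_count

    # Step 1: satisfy each provider's base before any weighting applies.
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        counts[item["capacityProvider"]] += take
        remaining -= take

    # Step 2: distribute what is left in proportion to the weights.
    weighted = [item for item in strategy if item.get("weight", 0) > 0]
    extra = {item["capacityProvider"]: 0 for item in weighted}
    if not weighted:
        return counts
    for _ in range(remaining):
        # Lowest tasks-per-weight ratio wins; ties go to the earlier entry.
        target = min(
            weighted,
            key=lambda item: Fraction(extra[item["capacityProvider"]], item["weight"]),
        )
        name = target["capacityProvider"]
        extra[name] += 1
        counts[name] += 1
    return counts
```

Under this model, `split_by_strategy([{"capacityProvider": "FARGATE", "weight": 1}, {"capacityProvider": "FARGATE_SPOT", "weight": 1}], 10)` assigns five tasks to each provider. If one provider has no capacity, its share simply fails to place.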
How would one handle this with Lambda? Trigger a Lambda on a spot allocation failure event which does a
@seanturner026 To do this, I was thinking of deploying ECS services in duplicate: one with spot, the other with the plain capacity provider. When the service events
Ah brilliant! I was thinking this problem is solved by a Lambda concurrently (<3 go)...
Then you would also have an EventBridge rule that runs each morning and replaces the SPOT provider (and perhaps kills some ON_DEMAND tasks? I suppose that's not necessary if your tasks already autoscale).
Have you tested this approach? I'm wondering how long it takes to return to
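The Lambda idea discussed above could look roughly like the sketch below. This is not code from any commenter. It assumes the function is triggered by an EventBridge rule matching ECS service action events with `eventName` set to `SERVICE_TASK_PLACEMENT_FAILURE`, that the event carries the service ARN in `resources` and the cluster ARN in `detail.clusterArn`, and that `FARGATE` is the on-demand provider to fall back to (substitute your own capacity provider name). The names `ON_DEMAND_STRATEGY`, `build_update_request`, and the optional `ecs_client` parameter are hypothetical.

```python
# Assumed fallback strategy: send every task to an on-demand provider.
ON_DEMAND_STRATEGY = [{"capacityProvider": "FARGATE", "weight": 1, "base": 0}]


def is_placement_failure(event):
    """True if the event looks like an ECS task placement failure."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "ECS Service Action"
        and detail.get("eventName") == "SERVICE_TASK_PLACEMENT_FAILURE"
    )


def build_update_request(event):
    """Translate the event into keyword arguments for ecs.update_service."""
    service_arn = event["resources"][0]
    return {
        "cluster": event["detail"]["clusterArn"],
        # The service name is the last path segment of the service ARN.
        "service": service_arn.split("/")[-1],
        "capacityProviderStrategy": ON_DEMAND_STRATEGY,
        # Start a new deployment so tasks are placed under the new strategy.
        "forceNewDeployment": True,
    }


def handler(event, context, ecs_client=None):
    """Lambda entry point; ecs_client can be injected for local testing."""
    if not is_placement_failure(event):
        return None
    if ecs_client is None:
        import boto3  # imported lazily so the module loads without boto3

        ecs_client = boto3.client("ecs")
    request = build_update_request(event)
    ecs_client.update_service(**request)
    return request
```

A second, scheduled rule would be needed to switch the service back to the spot strategy, as suggested above. How quickly spot capacity returns is not something this sketch can tell you.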
That's one way to do it for sure. The construct I linked aims to be a little dumber so that ECS can handle more of the heavy lifting. When we opt to create a discrete ECS service for each type of capacity, ECS is able to wrangle the individual tasks for us. We can just increase the desired capacity on the OD service when the Spot service is degraded. (i.e., task placement error.) We can also decrease the OD service desired capacity when Spot has self-healed to max capacity. (i.e., steady state)
I've tested this with synthetic events. It seems to work for my use case (a desired count of roughly 2-10). If a single spot task can't be placed, the OD-capacity service is spun up. The spot-capacity service returns to steady state as soon as ECS can place all the spot tasks, and the OD service spins back down to zero tasks. I haven't tested it with auto-scaling, but I suspect that this will be trickier, as I don't know how scale-in is handled when spot capacity is unavailable.
If the original request was about Fargate Spot, I've built this tool, fargate-spot-capacity-fail-handler, to switch a service to 100% Fargate if Fargate Spot is not available.
I'm also confused. With this setting: FARGATE_SPOT base=0 weight=4, and a desired count of 1, will OD be launched if SPOT is not available?
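The two-service behavior described above can be sketched as follows. This is an illustration, not the linked construct's code. It assumes the spot service emits `SERVICE_TASK_PLACEMENT_FAILURE` and `SERVICE_STEADY_STATE` service action events, and the function names and the `on_demand_service` and `spot_desired_count` parameters are hypothetical.

```python
def desired_on_demand_count(event_name, spot_desired_count):
    """Map a spot-service event to a desired count for the OD service.

    Returns None when the event should not change anything.
    """
    if event_name == "SERVICE_TASK_PLACEMENT_FAILURE":
        # Spot is degraded: bring the on-demand service up to full size.
        return spot_desired_count
    if event_name == "SERVICE_STEADY_STATE":
        # Spot has self-healed: scale the on-demand service back to zero.
        return 0
    return None


def handle_spot_service_event(event, ecs_client, on_demand_service, spot_desired_count):
    """Adjust the on-demand twin of a spot service based on one event.

    ecs_client is expected to behave like boto3.client("ecs").
    """
    target = desired_on_demand_count(event["detail"]["eventName"], spot_desired_count)
    if target is not None:
        ecs_client.update_service(
            cluster=event["detail"]["clusterArn"],
            service=on_demand_service,
            desiredCount=target,
        )
    return target
```

Because only `desiredCount` changes, ECS still does the task placement for both services. The sketch does not address the auto-scaling case, which remains untested.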
Tell us about your request
Ability to create a capacity strategy that allows you to use spot instances as long as the spot capacity is available, and fall back to on-demand instances only when there is no capacity available for spot.
Which service(s) is this request for?
ECS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I was hoping that "Base" in a capacity strategy would be more of a "strategy", but it seems to be a "constraint". In my use case, I was hoping to use 5 (which is also the total number of tasks in my service) as the base for my capprovider1, which consists entirely of spot instances, and use a 1:1 weight. So, the base would be met as long as spot instances are available; otherwise I was hoping it would ignore the base and fall back to capprovider2, which has OD instances. But even when capprovider2 has instances, the service fails to place tasks because it's trying to satisfy the base.
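The difference between base as a constraint and base as a preference can be shown with a toy placement model. This is a sketch of the requested semantics, not of how ECS is implemented. The function `place_tasks` and its `fallback` parameter are hypothetical, and the capacity numbers are made up for the example.

```python
def place_tasks(targets, capacity, fallback=None):
    """Place tasks per provider and report failures.

    targets:  {provider: tasks the strategy assigns to it}
    capacity: {provider: free task slots}
    fallback: provider that absorbs overflow, or None for strict behavior
    Returns (placed_per_provider, failed_count).
    """
    free = dict(capacity)
    placed = {name: 0 for name in capacity}
    failed = 0
    for provider, wanted in targets.items():
        got = min(wanted, free.get(provider, 0))
        placed[provider] = placed.get(provider, 0) + got
        free[provider] = free.get(provider, 0) - got
        shortfall = wanted - got
        if shortfall and fallback is not None:
            # Requested behavior: overflow lands on the fallback provider.
            extra = min(shortfall, free.get(fallback, 0))
            placed[fallback] = placed.get(fallback, 0) + extra
            free[fallback] = free.get(fallback, 0) - extra
            shortfall -= extra
        failed += shortfall
    return placed, failed


# base=5 on capprovider1 with 5 desired tasks sends all 5 to capprovider1.
targets = {"capprovider1": 5}
capacity = {"capprovider1": 0, "capprovider2": 5}  # spot exhausted, OD free

strict = place_tasks(targets, capacity)
relaxed = place_tasks(targets, capacity, fallback="capprovider2")
```

With these inputs, the strict call leaves all five tasks unplaced even though capprovider2 has room, which matches the behavior reported here. The relaxed call places all five on capprovider2, which is what the request asks for.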
Are you currently working around this issue?
Using lambda
Please let me know if more information is required or in case there is a better alternative.