[ECS] : Capacity Strategy to Fall back to OD only When No More Spot Capacity Available #773

aliabas7 · 2020-02-26T16:35:05Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Ability to create a capacity strategy that allows you to use spot instances as long as the spot capacity is available, and fall back to on-demand instances only when there is no capacity available for spot.

Which service(s) is this request for?
ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I was hoping that "Base" in a capacity strategy would be more of a "strategy", but it seems to be a "constraint". In my use case, I was hoping to use 5 (which is also the total number of tasks in my service) as the base for my capprovider1, which consists entirely of spot instances, and use a 1:1 weight. That way, the base would be met as long as spot instances are available; otherwise I was hoping it would ignore the base and fall back to capprovider2, which has OD instances. But even when capprovider2 has instances, the service fails to place tasks because it is trying to satisfy the base.

Are you currently working around this issue?
Using lambda
Please let me know if more information is required or in case there is a better alternative.

dactp · 2020-03-04T15:26:23Z

We also observe a similar problem that I will describe below. If it sounds like a separate issue please let me know.

We run our ECS cluster with the following default providers:
FARGATE_SPOT base=0 weight=50
FARGATE base=0 weight=50

Now let's say we run a service that uses the default providers and uses autoscaling.

If the service has a desired_count=10 and the fargate_spot capacity is not available, ECS will not use the available fargate capacity to honour desired_count. The service will run with only 5 tasks instead.

I consider this almost a bug, as it is very counterintuitive that ECS will allocate by providers first and consistently ignore desired_count.
We would prefer an integrated spot/non-spot scaling approach like EC2 Fleet does.
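
To make the numbers above concrete, here is a minimal sketch of the behaviour as described in this thread: tasks are first assigned to providers by base and weight, and a provider with no capacity simply leaves its share unplaced. The function names and the largest-remainder rounding are assumptions for illustration only; this is not the actual ECS scheduler.

```python
def split_tasks(desired_count, strategy):
    """Split desired_count across providers.

    strategy is a list of (provider_name, base, weight) tuples.
    Base is filled first; the remainder is shared by weight.
    Rounding uses the largest-remainder method (an assumption).
    """
    alloc = {name: 0 for name, _, _ in strategy}
    remaining = desired_count
    for name, base, _ in strategy:
        take = min(base, remaining)
        alloc[name] += take
        remaining -= take
    total_weight = sum(weight for _, _, weight in strategy)
    if remaining and total_weight:
        shares = [(name, remaining * weight / total_weight)
                  for name, _, weight in strategy]
        for name, share in shares:
            alloc[name] += int(share)
        leftover = remaining - sum(int(share) for _, share in shares)
        # Hand any leftover tasks to the largest fractional shares.
        by_fraction = sorted(shares, key=lambda p: p[1] - int(p[1]), reverse=True)
        for name, _ in by_fraction[:leftover]:
            alloc[name] += 1
    return alloc


def running_tasks(alloc, available_providers):
    """Count tasks that get placed when only some providers have capacity.

    Models the "no failover" behaviour reported in this thread.
    """
    return sum(n for name, n in alloc.items() if name in available_providers)


alloc = split_tasks(10, [("FARGATE_SPOT", 0, 50), ("FARGATE", 0, 50)])
print(alloc)
print(running_tasks(alloc, {"FARGATE"}))
```

With desired_count=10 and a 50/50 split, each provider is assigned 5 tasks, so only 5 run when FARGATE_SPOT has no capacity. The same model reproduces the original report: a base of 5 on an all-spot provider leaves nothing for the on-demand provider to pick up.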

nikovirtala · 2020-04-14T13:16:28Z

> I consider this almost a bug, as it is very counterintuitive that ECS will allocate by providers first and consistently ignore desired_count.

I fully agree with this. There should be an option to prioritize the desired count over the capacity provider. It would open the door to more flexible usage of spot capacity, including for long-running services.

jitesh88 · 2020-07-08T01:49:09Z

Couldn't agree more. I asked about this when SPOT was launched and had a chat with our TAM and also the service team. I don't think it was on the agenda any time soon back then. Personally, I doubt this will be a priority for AWS, as it makes SPOT just too easy and everyone will choose to use SPOT instead of FARGATE, and where is the fun in that...

twigs67 · 2020-08-26T16:16:37Z

@dactp I'm confused, does this setting:

FARGATE_SPOT base=0 weight=50
FARGATE base=0 weight=50

allow OD to be used only if SPOT is not available?

nathanielram · 2020-08-26T16:20:02Z

It just means it'll run 50% of tasks on Fargate and 50% on Fargate Spot; there's no failover if one is not available.

StevePavlin · 2020-09-05T19:34:08Z

+1

Sanyambansal76 · 2020-11-04T14:10:34Z

+1

seanturner026 · 2020-12-04T11:05:24Z

How would one handle this with lambda?

Trigger a lambda on spot allocation failure event which does a run-task api call on the FARGATE capacity-provider?

misterjoshua · 2021-03-16T02:17:07Z

> How would one handle this with lambda?
>
> Trigger a lambda on spot allocation failure event which does a run-task api call on the FARGATE capacity-provider?

@seanturner026 To do this, I was thinking of deploying ECS services in duplicate: one with spot, the other with the plain capacity provider. When the spot service emits SERVICE_TASK_PLACEMENT_FAILURE, I can dial up the desired count on the non-spot service to equal the discrepancy plus some buffer room. When the spot service emits SERVICE_STEADY_STATE, I can dial down the non-spot service. Edit: Here's a CDK construct that takes this approach.
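
The decision logic of the duplicate-service approach described above could be sketched roughly as follows. This is not the code of the linked CDK construct; the function name, its parameters, and the default buffer of 1 are hypothetical, and scaling to zero on steady state is one reading of "dial down".

```python
def on_demand_desired_count(event_name, spot_desired, spot_running,
                            current_on_demand, buffer=1):
    """Pick the desired count for the on-demand twin of a spot service.

    event_name is the eventName of an ECS service action event emitted
    by the spot service. buffer is a hypothetical safety margin.
    """
    if event_name == "SERVICE_TASK_PLACEMENT_FAILURE":
        # Cover the tasks the spot service could not place, plus a buffer.
        shortfall = max(spot_desired - spot_running, 0)
        return shortfall + buffer
    if event_name == "SERVICE_STEADY_STATE":
        # Spot has recovered, so the on-demand service can scale back down.
        return 0
    # Any other event leaves the on-demand service untouched.
    return current_on_demand


print(on_demand_desired_count("SERVICE_TASK_PLACEMENT_FAILURE", 5, 3, 0))
```

In a real handler, the returned value would presumably be applied to the on-demand service with an ECS UpdateService call (for example boto3's `update_service(..., desiredCount=...)`); that wiring is left out here.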

seanturner026 · 2021-03-20T20:25:06Z

> > How would one handle this with lambda?
> > Trigger a lambda on spot allocation failure event which does a run-task api call on the FARGATE capacity-provider?
>
> @seanturner026 To do this, I was thinking of deploying ECS services in duplicate: one with spot, the other with the plain capacity provider. When the spot service emits SERVICE_TASK_PLACEMENT_FAILURE, I can dial up the desired count on the non-spot service to equal the discrepancy plus some buffer room. When the spot service emits SERVICE_STEADY_STATE, I can dial down the non-spot service. Edit: Here's a CDK construct that takes this approach.

Ah brilliant! I was thinking this problem could be solved by a lambda that does the following concurrently (<3 go)...

- removing the spot provider
- describing tasks
- sending SIGTERM to SPOT tasks
- deregistering SPOT tasks from the target group by IP
- starting the same number of ON_DEMAND tasks

Then you would also have an EventBridge rule that runs each morning and puts the SPOT provider back (and perhaps kills some ON_DEMAND tasks? I suppose that's not necessary if your tasks already autoscale).
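
As a rough illustration of the planning part of the steps listed above (describing tasks, picking out the SPOT ones, and counting the replacements), here is a small sketch. It assumes task records shaped like the entries of an ECS DescribeTasks response (`taskArn`, `capacityProviderName`, `lastStatus`); the function name and the default provider name are hypothetical, and the actual stop, deregister, and start calls are left out.

```python
def plan_spot_replacement(tasks, spot_provider="FARGATE_SPOT"):
    """Return (ARNs of running spot tasks to stop, on-demand tasks to start).

    tasks is a list of dicts shaped like DescribeTasks entries.
    spot_provider is an assumed capacity provider name.
    """
    spot_arns = [
        task["taskArn"]
        for task in tasks
        if task.get("capacityProviderName") == spot_provider
        and task.get("lastStatus") == "RUNNING"
    ]
    # One on-demand task is started for every spot task that is stopped.
    return spot_arns, len(spot_arns)


example_tasks = [
    {"taskArn": "task-a", "capacityProviderName": "FARGATE_SPOT", "lastStatus": "RUNNING"},
    {"taskArn": "task-b", "capacityProviderName": "FARGATE", "lastStatus": "RUNNING"},
    {"taskArn": "task-c", "capacityProviderName": "FARGATE_SPOT", "lastStatus": "STOPPED"},
]
print(plan_spot_replacement(example_tasks))
```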

> When the spot service emits SERVICE_STEADY_STATE, I can dial down the non-spot service. Edit: Here's a CDK construct that takes this approach.

Have you tested this approach? I'm wondering how long it takes to return to SERVICE_STEADY_STATE, and if it's worthwhile to maintain the spot compute as well in the background.

misterjoshua · 2021-03-22T23:24:30Z

> > > How would one handle this with lambda?
> > > Trigger a lambda on spot allocation failure event which does a run-task api call on the FARGATE capacity-provider?
> >
> > @seanturner026 To do this, I was thinking of deploying ECS services in duplicate: one with spot, the other with the plain capacity provider. When the spot service emits SERVICE_TASK_PLACEMENT_FAILURE, I can dial up the desired count on the non-spot service to equal the discrepancy plus some buffer room. When the spot service emits SERVICE_STEADY_STATE, I can dial down the non-spot service. Edit: Here's a CDK construct that takes this approach.
>
> Ah brilliant! I was thinking this problem could be solved by a lambda that does the following concurrently (<3 go)...
>
> - removing the spot provider
> - describing tasks
> - sending SIGTERM to SPOT tasks
> - deregistering SPOT tasks from the target group by IP
> - starting the same number of ON_DEMAND tasks
>
> Then you would also have an EventBridge rule that runs each morning and puts the SPOT provider back (and perhaps kills some ON_DEMAND tasks? I suppose that's not necessary if your tasks already autoscale).

That's one way to do it for sure. The construct I linked aims to be a little dumber so that ECS can handle more of the heavy lifting. When we opt to create a discrete ECS service for each type of capacity, ECS is able to wrangle the individual tasks for us. We can just increase the desired capacity on the OD service when the Spot service is degraded (i.e., a task placement error). We can also decrease the OD service's desired capacity when Spot has self-healed to max capacity (i.e., steady state).

> > When the spot service emits SERVICE_STEADY_STATE, I can dial down the non-spot service. Edit: Here's a CDK construct that takes this approach.
>
> Have you tested this approach? I'm wondering how long it takes to return to SERVICE_STEADY_STATE, and if it's worthwhile to maintain the spot compute as well in the background.

I've tested this with synthetic events. It seems to work for my use case (a desired count of roughly 2-10). If a single spot task can't be placed, the OD-capacity service is spun up. The spot-capacity service returns to steady state as soon as ECS can place all the spot tasks, and the OD service spins back down to zero tasks. I haven't tested it with auto-scaling, but I suspect that will be trickier, as I don't know how scale-in is handled when spot capacity is unavailable.

nadaahm · 2021-03-22T23:39:01Z

If the original request was about Fargate Spot, I've built this tool, fargate-spot-capacity-fail-handler, to switch a service to 100% Fargate if Fargate Spot is not available.
But if the original request was about EC2, just follow Spot best practices and you shouldn't have issues with capacity. Interruptions will still occur, but interrupted instances will be replaced by other instances from a different instance pool. Check out this Terraform template to provision an ECS cluster with Spot best practices applied.

harishsambasivam · 2022-09-22T08:21:56Z

I'm also confused. With this setting:

FARGATE_SPOT base=0 weight=4
FARGATE base=0 weight=1

and a desired count of 1, will OD be launched if SPOT is not available?

aliabas7 added the Proposed Community submitted issue label Feb 26, 2020

aliabas7 changed the title ~~[ECS] : Capacity Strategy to Fall back to OD only When No More Spot Instances Are Available~~ [ECS] : Capacity Strategy to Fall back to OD only When No More Spot Capacity Available Feb 26, 2020

je-al mentioned this issue May 29, 2020

[service] [request]: Fargate Spot failover to Fargate #852

Open

SoManyHs mentioned this issue Apr 9, 2021

Fargate Spot Support aws/copilot-cli#2162

Closed
