[ECS] : Capacity Strategy to Fall back to OD only When No More Spot Capacity Available #773

Open
aliabas7 opened this issue Feb 26, 2020 · 13 comments
Labels
Proposed Community submitted issue

Comments

@aliabas7

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Ability to create a capacity strategy that allows you to use spot instances as long as the spot capacity is available, and fall back to on-demand instances only when there is no capacity available for spot.

Which service(s) is this request for?
ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I was hoping that "Base" in a capacity strategy would act as a "strategy", but it seems to be a "constraint". In my use case, I wanted to set the base to 5 (which is also the total number of tasks in my service) for capprovider1, which consists entirely of spot instances, and use a 1:1 weight. The base would then be met as long as spot instances are available; otherwise I was hoping ECS would ignore the base and fall back to capprovider2, which has OD instances. But even when capprovider2 has instances, the service fails to place tasks because it is still trying to satisfy the base.

Are you currently working around this issue?
Yes, using Lambda.
Please let me know if more information is required or in case there is a better alternative.

@aliabas7 aliabas7 added the Proposed Community submitted issue label Feb 26, 2020
@aliabas7 aliabas7 changed the title [ECS] : Capacity Strategy to Fall back to OD only When No More Spot Instances Are Available [ECS] : Capacity Strategy to Fall back to OD only When No More Spot Capacity Available Feb 26, 2020
@dactp

dactp commented Mar 4, 2020

We also observe a similar problem that I will describe below. If it sounds like a separate issue please let me know.

We run our ECS cluster with the following default providers:
FARGATE_SPOT base=0 weight=50
FARGATE base=0 weight=50

Now let's say we run a service that uses the default providers and uses autoscaling.

If the service has a desired_count=10 and the fargate_spot capacity is not available, ECS will not use the available fargate capacity to honour desired_count. The service will run with only 5 tasks instead.

I consider this almost a bug, as it is very counterintuitive that ECS allocates by provider first and consistently ignores desired_count.
We would prefer an integrated spot/non-spot scaling approach like EC2 Fleet does.
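
The behaviour described above can be modelled with a minimal sketch. This is a simplified illustration, not ECS's actual scheduler: the `Provider` class, its `available` flag, and the largest-remainder rounding are all assumptions made for the example. It only shows why tasks assigned to an unavailable provider are not picked up by the other one.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    base: int
    weight: int
    available: bool = True  # hypothetical flag: can this capacity be obtained right now?


def split_tasks(desired_count, providers):
    """Assign tasks per provider: satisfy base first, then split the rest by weight."""
    remaining = desired_count
    plan = {}
    for p in providers:
        take = min(p.base, remaining)
        plan[p.name] = take
        remaining -= take

    total_weight = sum(p.weight for p in providers)
    if remaining and total_weight:
        shares = [(remaining * p.weight / total_weight, p) for p in providers]
        for share, p in shares:
            plan[p.name] += int(share)
        leftover = remaining - sum(int(share) for share, _ in shares)
        # Assumption: leftover tasks go to the largest fractional shares.
        by_fraction = sorted(shares, key=lambda sp: sp[0] - int(sp[0]), reverse=True)
        for _, p in by_fraction[:leftover]:
            plan[p.name] += 1
    return plan


def running_tasks(desired_count, providers):
    """Tasks that actually run when there is no failover between providers."""
    plan = split_tasks(desired_count, providers)
    return sum(plan[p.name] for p in providers if p.available)


providers = [
    Provider("FARGATE_SPOT", base=0, weight=50, available=False),
    Provider("FARGATE", base=0, weight=50),
]
print(split_tasks(10, providers))    # planned split per provider
print(running_tasks(10, providers))  # tasks that end up running
```

With the 50/50 strategy and a desired count of 10, the model assigns 5 tasks to each provider, so only 5 run while FARGATE_SPOT is unavailable. The same model reproduces the original report: a base of 5 on an unavailable spot provider leaves nothing for the on-demand provider to place.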

@nikovirtala

I consider this almost a bug, as it is very counter intuitive that ECS will allocate by providers first and consistently ignore desired_count.

I fully agree with this. – There should be an option to prioritize the desired count over the capacity provider. It would open a door for more flexible usage of spot capacity, also on the long-running services.

@jitesh88

jitesh88 commented Jul 8, 2020

Couldn't agree more. I asked about this when SPOT was launched and had a chat with our TAM and also the service team. I don't think it was on the agenda any time soon back then. Personally, I doubt this will be a priority for AWS, as it makes SPOT just too easy: everyone would choose SPOT instead of FARGATE, and where is the fun in that...

@twigs67

twigs67 commented Aug 26, 2020

@dactp I'm confused, does this setting:

FARGATE_SPOT base=0 weight=50
FARGATE base=0 weight=50

allow OD to be used only if SPOT is not available?

@nathanielram

It just means ECS will run 50% of the tasks on FARGATE and 50% on FARGATE_SPOT; there is no failover if one of them is not available.

@StevePavlin

+1

1 similar comment
@Sanyambansal76

+1

@seanturner026

How would one handle this with lambda?

Trigger a Lambda on a spot allocation failure event that makes a run-task API call against the FARGATE capacity provider?

@misterjoshua

misterjoshua commented Mar 16, 2021

How would one handle this with lambda?

Trigger a lambda on spot allocation failure event which does a run-task api call on the FARGATE capacity-provider?

@seanturner026 To do this, I was thinking of deploying ECS services in duplicate: one with spot, the other with the plain capacity provider. When the spot service emits SERVICE_TASK_PLACEMENT_FAILURE, I can dial up the desired count on the non-spot service to equal the discrepancy plus some buffer room. When the spot service emits SERVICE_STEADY_STATE, I can dial the non-spot service back down. Edit: Here's a CDK construct that takes this approach.
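
A rough sketch of the dual-service idea as a Lambda handler might look like the following. It is not the code of the linked CDK construct. The function names, the `buffer` default, and the `CLUSTER`, `SPOT_SERVICE` and `OD_SERVICE` environment variables are hypothetical, and it assumes an EventBridge rule that forwards only the spot service's "ECS Service Action" events, with the event name in `detail.eventName`.

```python
import os


def od_desired_count(event_name, spot_desired, spot_running, buffer=1):
    """Return the new desired count for the on-demand service, or None to leave it alone."""
    if event_name == "SERVICE_TASK_PLACEMENT_FAILURE":
        shortfall = max(spot_desired - spot_running, 0)
        # Cover the discrepancy plus some buffer room, as described above.
        return shortfall + buffer if shortfall else None
    if event_name == "SERVICE_STEADY_STATE":
        return 0  # spot has recovered, so the on-demand service can scale to zero
    return None


def handle_event(event, ecs, cluster, spot_service, od_service, buffer=1):
    """Look up the spot service's state and adjust the on-demand service accordingly."""
    event_name = event.get("detail", {}).get("eventName")
    spot = ecs.describe_services(cluster=cluster, services=[spot_service])["services"][0]
    target = od_desired_count(event_name, spot["desiredCount"], spot["runningCount"], buffer)
    if target is not None:
        ecs.update_service(cluster=cluster, service=od_service, desiredCount=target)
    return target


def lambda_handler(event, context):
    import boto3  # imported lazily so the pure logic above can be tested without it

    return handle_event(
        event,
        boto3.client("ecs"),
        os.environ["CLUSTER"],       # hypothetical environment variable names
        os.environ["SPOT_SERVICE"],
        os.environ["OD_SERVICE"],
    )
```

Keeping the decision in `od_desired_count` separate from the API calls makes it easy to try out with synthetic events before wiring it to a real cluster.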

@seanturner026

seanturner026 commented Mar 20, 2021

How would one handle this with lambda?
Trigger a lambda on spot allocation failure event which does a run-task api call on the FARGATE capacity-provider?

@seanturner026 To do this, I was thinking of deploying ECS services in duplicate - one with spot, the other with the plain capacity provider. When the service events SERVICE_TASK_PLACEMENT_FAILURE, I can dial up the desired count on the non-spot service to equal the discrepancy plus some buffer room. When the spot service events SERVICE_STEADY_STATE, I can dial down non-spot service. Edit: Here's a CDK construct that takes this approach.

Ah brilliant! I was thinking this problem is solved by a lambda concurrently (<3 go)...

  • removing the spot provider
  • describing tasks
  • sending SIGTERM to SPOT tasks
  • deregistering SPOT tasks from the target group by IP
  • starting the same number of ON_DEMAND tasks

Then you would also have an EventBridge rule that runs each morning and restores the SPOT provider (and perhaps kills some ON_DEMAND tasks? I suppose that's not necessary if your tasks already autoscale).

When the spot service events SERVICE_STEADY_STATE, I can dial down non-spot service. Edit: Here's a CDK construct that takes this approach.

Have you tested this approach? I'm wondering how long it takes to return to SERVICE_STEADY_STATE, and if it's worthwhile to maintain the spot compute as well in the background.

@misterjoshua

misterjoshua commented Mar 22, 2021

How would one handle this with lambda?
Trigger a lambda on spot allocation failure event which does a run-task api call on the FARGATE capacity-provider?

@seanturner026 To do this, I was thinking of deploying ECS services in duplicate - one with spot, the other with the plain capacity provider. When the service events SERVICE_TASK_PLACEMENT_FAILURE, I can dial up the desired count on the non-spot service to equal the discrepancy plus some buffer room. When the spot service events SERVICE_STEADY_STATE, I can dial down non-spot service. Edit: Here's a CDK construct that takes this approach.

Ah brilliant! I was thinking this problem is solved by a lambda concurrently (<3 go)...

  • removing the spot provider
  • describing tasks
  • sending SIGTERM to SPOT tasks
  • deregistering SPOT tasks from the target group by IP
  • starting the same number of ON_DEMAND tasks

Then you would also have an EventBridge rule that runs each morning and restores the SPOT provider (and perhaps kills some ON_DEMAND tasks? I suppose that's not necessary if your tasks already autoscale).

That's one way to do it for sure. The construct I linked aims to be a little dumber so that ECS can handle more of the heavy lifting. When we opt to create a discrete ECS service for each type of capacity, ECS is able to wrangle the individual tasks for us. We can just increase the desired capacity on the OD service when the Spot service is degraded. (i.e., task placement error.) We can also decrease the OD service desired capacity when Spot has self-healed to max capacity. (i.e., steady state)

When the spot service events SERVICE_STEADY_STATE, I can dial down non-spot service. Edit: Here's a CDK construct that takes this approach.

Have you tested this approach? I'm wondering how long it takes to return to SERVICE_STEADY_STATE, and if it's worthwhile to maintain the spot compute as well in the background.

I've tested this with synthetic events. It seems to work for my use case (~2-10 desired count.) If a single spot task can't be placed, the OD-capacity service is spun up. The spot capacity service returns to steady-state as soon as ECS can place all the spot tasks and the OD service spins back down to zero tasks. I haven't tested it with auto-scaling, but I suspect that this will be trickier as I don't know how scale-in is handled when spot capacity is unavailable.

@nadaahm

nadaahm commented Mar 22, 2021

If the original request was about Fargate Spot, I've built this tool, fargate-spot-capacity-fail-handler, to switch a service to 100% Fargate if Fargate Spot is not available.
But if the original request was about EC2, just follow Spot best practices and you shouldn't have issues with capacity. Interruptions will still occur, but interrupted instances will be replaced by others from a different instance pool. Check out this Terraform template to provision an ECS cluster with Spot best practices applied.
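
For the Fargate case, the idea of switching a service to 100% Fargate can be sketched as below. This is not the code of the tool mentioned above, only a hedged illustration: `fallback_strategy` and `apply_strategy` are hypothetical names, the default weights are placeholders, and `forceNewDeployment=True` is passed on the assumption that the strategy change should take effect through a new deployment.

```python
def fallback_strategy(spot_available, spot_weight=1, od_weight=1):
    """Build a capacity provider strategy: mixed when spot works, FARGATE only otherwise."""
    if spot_available:
        return [
            {"capacityProvider": "FARGATE_SPOT", "weight": spot_weight, "base": 0},
            {"capacityProvider": "FARGATE", "weight": od_weight, "base": 0},
        ]
    return [{"capacityProvider": "FARGATE", "weight": 1, "base": 0}]


def apply_strategy(ecs, cluster, service, spot_available):
    """Update the service with the chosen strategy; `ecs` is a boto3 ECS client."""
    strategy = fallback_strategy(spot_available)
    ecs.update_service(
        cluster=cluster,
        service=service,
        capacityProviderStrategy=strategy,
        forceNewDeployment=True,  # assumption: roll tasks onto the new strategy
    )
    return strategy
```

Something would still need to decide when spot is unavailable and when to switch back, for example the service events discussed earlier in this thread.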

@harishsambasivam

I'm also confused, does this setting:

FARGATE_SPOT base=0 weight=4
FARGATE base=0 weight=1

and when we run with a desired count of 1:

Will OD be launched if SPOT is not available?
