[ECS] [request]: Allow configuration on when ECS kills task after made unhealthy by ALB #1271

zbintliff · 2021-02-12T15:07:52Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Our applications can see unpredictable spikes in traffic. During these spikes the backends get overwhelmed with requests and start rejecting new connections. This results in failing health checks and ECS killing the task, which counteracts any scaling ECS is performing. Instead, we would like the task to remain active, finish its requests, and have a chance to become healthy again while the service scales out to handle the load (essentially a circuit breaker in ECS).

ALB doesn't kill targets itself; it leaves that decision to the owning service (ECS or ASG, for example).

Which service(s) is this request for?
Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Essentially, we want a circuit breaker in ECS: allow a task to fail health checks but also give it a chance to recover. During large-scale events the failing health checks make scaling even harder, because many old tasks that could help with the load are recycled.

Are you currently working around this issue?
Teams have tested this on EC2 and are thinking of moving there. They set it up by:

Setting the ASG health check to the instance health check.
Letting the ALB remove a target when it is overwhelmed and failing the ALB health check, then running a local check against some other form of healthiness (maybe another port that exposes metrics) and marking the instance as unhealthy when that check fails.
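
A minimal sketch of the local check described above, assuming the "other form of healthiness" is a TCP port that accepts connections. The function names, the failure threshold, and the idea of injecting the "mark unhealthy" action as a callback are all illustrative assumptions, not something the teams in question are confirmed to use. On a real instance the callback might wrap boto3's `autoscaling.set_instance_health(InstanceId=..., HealthStatus="Unhealthy")`.

```python
import socket
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_local_check(
    probe: Callable[[], bool],
    mark_unhealthy: Callable[[], None],
    max_consecutive_failures: int,
    attempts: int,
) -> bool:
    """Probe up to `attempts` times; mark unhealthy after too many failures in a row.

    Returns True if the instance was marked unhealthy.
    """
    consecutive = 0
    for _ in range(attempts):
        if probe():
            consecutive = 0  # any success resets the failure streak
            continue
        consecutive += 1
        if consecutive >= max_consecutive_failures:
            mark_unhealthy()
            return True
    return False


# Demo with a scripted probe instead of a real port (hypothetical values).
marked = []
scripted = iter([True, False, False, False])
was_marked = run_local_check(
    probe=lambda: next(scripted),
    mark_unhealthy=lambda: marked.append("unhealthy"),
    max_consecutive_failures=3,
    attempts=4,
)
```

A real check would sleep between probes and pass something like `lambda: port_is_open("127.0.0.1", 9100)` as the probe; the port number here is only an example.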

Additional context

Attachments

zbintliff · 2021-02-12T15:09:07Z

Since ECS supports both container health checks and ALB health checks, maybe add a toggle so that ECS only kills a task when it fails the container health check. That would allow the ALB to mark the target unhealthy while leaving it running.

pavneeta · 2021-05-24T21:36:24Z

Hey @zbintliff, thanks for your feedback. I wanted to get your thoughts on a potential solution to this problem: using the existing ECS service parameter healthCheckGracePeriodSeconds as a workaround. This parameter controls the period of time, in seconds, that the Amazon ECS service scheduler should ignore unhealthy Elastic Load Balancing target health checks, container health checks, and Route 53 health checks after a task enters a RUNNING state. You can set this parameter up to 2,147,483,647 seconds, which is roughly 24,855 days. Effectively, ECS will ignore the ELB health checks when making task lifecycle decisions for the service, such as "kill overloaded tasks", and you can use the ECS container health checks as liveness checks to ensure that unhealthy tasks are terminated (and replaced) as needed. The ELB will automatically stop routing to unhealthy (overloaded) tasks and will resume routing to them once they "finish its requests, and become healthy again".
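
As a rough illustration of the workaround above, the sketch below only builds the keyword arguments one might pass to boto3's `update_service` call for ECS; it does not contact AWS. The cluster and service names and the helper function are hypothetical, and the upper bound is the one quoted in the comment.

```python
SECONDS_PER_DAY = 24 * 60 * 60
# Maximum value quoted in the comment above.
MAX_GRACE_PERIOD_SECONDS = 2_147_483_647


def grace_period_update(cluster: str, service: str, grace_seconds: int) -> dict:
    """Return update_service keyword arguments with a range-checked grace period."""
    if not 0 <= grace_seconds <= MAX_GRACE_PERIOD_SECONDS:
        raise ValueError(f"grace period out of range: {grace_seconds}")
    return {
        "cluster": cluster,
        "service": service,
        "healthCheckGracePeriodSeconds": grace_seconds,
    }


# Hypothetical cluster and service names.
params = grace_period_update("my-cluster", "my-service", MAX_GRACE_PERIOD_SECONDS)

# Whole days covered by the maximum grace period.
max_days = MAX_GRACE_PERIOD_SECONDS // SECONDS_PER_DAY

# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("ecs").update_service(**params)
```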

Would this be an acceptable workaround to solve the problem you mentioned?

zbintliff · 2021-05-24T21:49:24Z

@pavneeta , doesn't healthCheckGracePeriod get ignored once the ALB marks a node healthy?

That is, if you have a healthCheckGracePeriod of 100 but the app gets healthy after 30, is it considered healthy and then marked unhealthy once the unhealthyThreshold is hit? I could be conflating EC2 and ECS health checks.

Regardless, this workaround means actually unhealthy apps will not be terminated, which will result in availability issues.

sumitverma · 2021-06-26T17:22:01Z

Another issue here is that when those spikes happen, the container health check will also start failing (in addition to the ALB health check) because of the load. So ideally, when the ALB health check fails, requests should stop being routed to that target and a new target should be started, but the old one should be given a chance to recover for some time instead of being killed immediately. So, something like "Terminate after X seconds of health check failure", and if the container becomes healthy during those X seconds, add it back to the rotation.

Somewhat related: getting a health check from each ALB node (we use 4 AZs) at the same time seems like overkill (we get 4x the health check requests). Ideally they should be sent round robin, not all at once, so that there is only 1 health check per defined interval.
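
The "Terminate after X seconds of health check failure" idea can be sketched as a small decision function. This is purely illustrative of the proposal in this comment, not ECS behavior; the `TaskHealth` class, the `decide` function, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskHealth:
    """Tracks when a task first started failing its health check."""
    unhealthy_since: Optional[float] = None  # seconds; None means currently healthy


def decide(task: TaskHealth, healthy: bool, now: float, recovery_window: float) -> str:
    """Return 'in_rotation', 'draining', or 'terminate' for one health observation."""
    if healthy:
        # Recovery inside the window puts the task back into rotation.
        task.unhealthy_since = None
        return "in_rotation"
    if task.unhealthy_since is None:
        task.unhealthy_since = now  # first failure starts the recovery window
    if now - task.unhealthy_since >= recovery_window:
        return "terminate"
    # Still failing, but within the window: stop routing, keep the task alive.
    return "draining"


# Hypothetical timeline of (time in seconds, health check passed).
task = TaskHealth()
timeline = [(0, True), (10, False), (40, False), (50, True), (60, False), (130, False)]
actions = [decide(task, ok, t, recovery_window=60) for t, ok in timeline]
```

In this timeline the task recovers at 50 seconds and rejoins the rotation, then fails again at 60 seconds and is only terminated once it has stayed unhealthy for the full 60-second window.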

stewartcampbell · 2023-06-27T11:58:22Z

While setting healthCheckGracePeriodSeconds does stop the tasks being removed from the target group, it also stops the tasks being killed if they fail the container health check, which isn't ideal. That means we need to monitor for failed tasks and automate their termination.

genbit · 2023-11-03T17:38:00Z

We have shipped an improvement to the ECS scheduler that prioritizes starting new healthy tasks before killing tasks that were marked unhealthy.
You can read more in this WNP: https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-ecs-applications-resiliency-unpredictable-load-spikes/
Blog post with deep dive: https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/

zbintliff added the Proposed Community submitted issue label Feb 12, 2021

pavneeta self-assigned this Feb 18, 2021

pavneeta added this to Researching in containers-roadmap via automation Feb 18, 2021

pavneeta added the ECS Amazon Elastic Container Service label Feb 18, 2021

raags mentioned this issue May 14, 2021

[ECS] [request]: Ability to disable task restart behaviour on Healthcheck Failure #1373

Open

toricls unassigned pavneeta Aug 24, 2021

vibhav-ag self-assigned this Jul 3, 2022
