[ECS] [request]: Allow configuration on when ECS kills task after made unhealthy by ALB #1271

Open
zbintliff opened this issue Feb 12, 2021 · 6 comments
Assignees
Labels
ECS Amazon Elastic Container Service Proposed Community submitted issue

Comments

@zbintliff

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Our applications can have unpredictable spikes in traffic. During these spikes the backends get overwhelmed with requests and start rejecting new connections. This results in failing health checks and ECS killing the tasks, which counteracts any scaling ECS is performing. Instead, we would like the task to remain active, finish its requests, and have a chance to become healthy again while the service scales out to handle the load (essentially a circuit breaker in ECS).

ALB doesn't automatically kill targets; it leaves that decision to the service (ECS or ASG, for example).

Which service(s) is this request for?
Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Essentially, have a circuit breaker in ECS: allow tasks to fail health checks but also give them a chance to recover. During large-scale events, the failing health checks make scaling even harder because many old tasks that could help with the load are recycled.

Are you currently working around this issue?
Teams have tested this in EC2 and are thinking of moving there. The way they establish it is by:

  1. Setting the ASG health check to the instance health check.
  2. Allowing the ALB to remove a target when it is overwhelmed and failing the ALB health check, then having a local check test some other form of healthiness (maybe another port that exposes metrics, etc.) and mark the instance as unhealthy when that check fails.
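
The local check in step 2 could look roughly like the sketch below. It is only an illustration of the idea, not what the teams actually run: the class name `LocalHealthWatchdog`, the `failure_threshold` parameter, and the injected `probe` and `mark_unhealthy` callables are all hypothetical. In a real setup `mark_unhealthy` might wrap the Auto Scaling `SetInstanceHealth` API (for example boto3's `set_instance_health` with `HealthStatus="Unhealthy"`), and `probe` might poll a metrics port.

```python
from typing import Callable


class LocalHealthWatchdog:
    """Reports the instance as unhealthy only after several consecutive failed probes.

    Hypothetical helper: `probe` returns True when the local check passes and
    `mark_unhealthy` is whatever tells the ASG to replace the instance.
    """

    def __init__(
        self,
        probe: Callable[[], bool],
        mark_unhealthy: Callable[[], None],
        failure_threshold: int = 3,
    ) -> None:
        self.probe = probe
        self.mark_unhealthy = mark_unhealthy
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.reported = False

    def tick(self) -> bool:
        """Run one probe; return True if the instance was reported on this tick."""
        try:
            ok = self.probe()
        except Exception:
            # A probe that raises (timeout, connection refused) counts as a failure.
            ok = False

        if ok:
            # Any success resets the streak, giving the instance a chance to recover.
            self.consecutive_failures = 0
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.reported:
            self.mark_unhealthy()
            self.reported = True
            return True
        return False
```

Calling `tick()` on a timer keeps the ALB health check (which only controls routing) separate from the decision to replace the instance.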

Additional context

Attachments

@zbintliff zbintliff added the Proposed Community submitted issue label Feb 12, 2021
@zbintliff
Author

Since ECS supports both container health checks and ALB health checks, maybe add a toggle so that ECS only kills a task when it fails the container health check. That would allow the ALB to mark the target unhealthy while leaving the task running.

@pavneeta pavneeta self-assigned this Feb 18, 2021
@pavneeta pavneeta added this to Researching in containers-roadmap via automation Feb 18, 2021
@pavneeta pavneeta added the ECS Amazon Elastic Container Service label Feb 18, 2021
@pavneeta

Hey @zbintliff, thanks for your feedback. I wanted to get your thoughts on a potential solution: using the existing ECS service parameter healthCheckGracePeriodSeconds as a workaround. This parameter controls the period of time, in seconds, that the Amazon ECS service scheduler should ignore unhealthy Elastic Load Balancing target health checks, container health checks, and Route 53 health checks after a task enters a RUNNING state. You can set this parameter up to 2,147,483,647 seconds, which is roughly 24,855 days (about 68 years). Effectively, ECS will ignore the ELB health checks when making task lifecycle decisions such as killing overloaded tasks, and you can use the ECS container health checks as liveness checks to ensure that unhealthy tasks are terminated (and replaced) as needed. The ELB will automatically stop routing to unhealthy (overloaded) tasks and will resume routing to them once they "finish its requests, and become healthy again".
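
As a rough sketch of this workaround, the snippet below converts the maximum grace period into days and builds the keyword arguments one might pass to boto3's ECS `update_service` call. The helper names and the cluster and service names are hypothetical placeholders; only the `cluster`, `service`, and `healthCheckGracePeriodSeconds` parameter names come from the ECS API.

```python
MAX_GRACE_PERIOD_SECONDS = 2_147_483_647  # upper bound quoted in the comment above
SECONDS_PER_DAY = 24 * 60 * 60


def grace_period_in_days(seconds: int) -> float:
    """Convert a grace period in seconds to days."""
    return seconds / SECONDS_PER_DAY


def build_update_service_params(cluster: str, service: str, grace_seconds: int) -> dict:
    """Build keyword arguments for an ECS UpdateService request (hypothetical helper)."""
    if not 0 <= grace_seconds <= MAX_GRACE_PERIOD_SECONDS:
        raise ValueError(f"grace period out of range: {grace_seconds}")
    return {
        "cluster": cluster,
        "service": service,
        "healthCheckGracePeriodSeconds": grace_seconds,
    }


if __name__ == "__main__":
    print(int(grace_period_in_days(MAX_GRACE_PERIOD_SECONDS)))  # 24855

    # "my-cluster" and "my-service" are placeholder names.
    params = build_update_service_params("my-cluster", "my-service", MAX_GRACE_PERIOD_SECONDS)
    print(params)
    # With boto3 installed and credentials configured, this could be applied with:
    #   boto3.client("ecs").update_service(**params)
```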

Would this be an acceptable workaround to solve the problem you mentioned?

@zbintliff
Author

@pavneeta, doesn't healthCheckGracePeriod get ignored once the ALB marks a node healthy?

That is, if you have a healthCheckGracePeriod of 100 but the app becomes healthy after 30, is it considered healthy and then marked unhealthy once the unhealthyThreshold is hit? I could be conflating EC2 and ECS health checks.

Regardless, this workaround means that genuinely unhealthy apps will not be terminated, which will result in availability issues.

@sumitverma

Another issue here is that when those spikes happen, the container health check will also start failing (in addition to the ALB health check) because of the load. So, ideally, when the ALB health check fails, requests should stop being routed to that target and a new target should be started, but the old one should be given a chance to recover for some time instead of being killed immediately. Something like "Terminate after X seconds of health check failure", and if the container becomes healthy during those X seconds, add it back to the rotation.

Somewhat related: getting a health check from each ALB (we use 4 AZs) at the same time seems like overkill (we get 4x the health check requests). Ideally they would be issued round-robin, not all at once, so that there is only one health check per defined interval.
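
The "Terminate after X seconds of health check failure" idea can be sketched as a small state tracker. This is a hypothetical model of the requested behavior, not something ECS exposes: `TaskHealthTracker`, `recovery_window_seconds`, and the state names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskHealthTracker:
    """Tracks one task and decides whether it stays, drains, or is terminated."""

    recovery_window_seconds: float  # the "X seconds" from the proposal
    unhealthy_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> str:
        """Record one health check result taken at time `now` (in seconds).

        Returns "in_rotation", "draining" (out of rotation but still running),
        or "terminate".
        """
        if healthy:
            # The task recovered inside the window, so it goes back into rotation.
            self.unhealthy_since = None
            return "in_rotation"

        if self.unhealthy_since is None:
            # First failure of the current streak starts the recovery window.
            self.unhealthy_since = now

        if now - self.unhealthy_since >= self.recovery_window_seconds:
            return "terminate"
        return "draining"


if __name__ == "__main__":
    tracker = TaskHealthTracker(recovery_window_seconds=60)
    for t, ok in [(0, True), (10, False), (50, False), (55, True), (100, False), (160, False)]:
        print(t, tracker.observe(ok, now=t))
```

In this model the task that fails at 10s and recovers at 55s is never killed, while the one that stays unhealthy from 100s to 160s is terminated once the 60-second window has elapsed.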

@vibhav-ag vibhav-ag self-assigned this Jul 3, 2022
@stewartcampbell

While setting healthCheckGracePeriodSeconds does stop the tasks being removed from the target group, it also stops the tasks being killed if they fail the container health check, which isn't ideal. That means we need to monitor for failed tasks and automate their termination.
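
A minimal sketch of that monitoring step, assuming the task list comes from the ECS `DescribeTasks` API, whose response includes `taskArn`, `lastStatus`, and `healthStatus` for each task. The function name and the sample ARNs are hypothetical; stopping the selected tasks (for example with boto3's `stop_task`) is left as a comment so the block runs without AWS access.

```python
from typing import Any, Dict, List


def unhealthy_task_arns(describe_tasks_response: Dict[str, Any]) -> List[str]:
    """Return ARNs of running tasks whose container health check reports UNHEALTHY."""
    return [
        task["taskArn"]
        for task in describe_tasks_response.get("tasks", [])
        if task.get("lastStatus") == "RUNNING" and task.get("healthStatus") == "UNHEALTHY"
    ]


if __name__ == "__main__":
    # Hand-written sample in the shape of a DescribeTasks response; the ARNs are made up.
    sample = {
        "tasks": [
            {"taskArn": "arn:aws:ecs:example:task/a", "lastStatus": "RUNNING", "healthStatus": "HEALTHY"},
            {"taskArn": "arn:aws:ecs:example:task/b", "lastStatus": "RUNNING", "healthStatus": "UNHEALTHY"},
            {"taskArn": "arn:aws:ecs:example:task/c", "lastStatus": "STOPPED", "healthStatus": "UNHEALTHY"},
        ]
    }
    for arn in unhealthy_task_arns(sample):
        # With boto3 this could become:
        #   ecs.stop_task(cluster="my-cluster", task=arn, reason="failed container health check")
        print("would stop", arn)
```

Tasks that have already stopped are skipped so the automation does not try to stop them twice.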

@genbit

genbit commented Nov 3, 2023

We have shipped an improvement to the ECS scheduler that prioritizes starting new healthy tasks before killing tasks that were marked unhealthy.
You can read more in this WNP: https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-ecs-applications-resiliency-unpredictable-load-spikes/
Blog post with deep dive: https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/


6 participants