[ECS] [request]: Allow configuration on when ECS kills task after made unhealthy by ALB #1271
Comments
Since ECS supports both container health checks and ALB health checks, maybe add a toggle so that ECS only kills a task when it fails the container health check. That would allow the ALB to mark a node unhealthy while leaving it running.
Hey @zbintliff, thanks for your feedback. I wanted to get your thoughts on a potential solution to this problem, where you can use the existing ECS service parameter - Would this be an acceptable workaround to solve the problem you mentioned?
@pavneeta, doesn't healthCheckGracePeriod get ignored once the ALB marks a node healthy? That is, if you have a healthCheckGracePeriod of 100 but the app gets healthy after 30, it is considered healthy and will then be marked unhealthy once the unhealthyThreshold is hit? I could be conflating EC2 and ECS health checks. Regardless, this workaround means that actually unhealthy apps will not be terminated, which will result in availability issues.
Another issue here is that when those spikes happen, the container health check will also start failing (in addition to the ALB health check) because of the load. So, ideally, when the ALB health check fails, requests should stop being routed to that target and a new target should be started, but the old one should be given a chance to recover for some time instead of being killed immediately. Something like "Terminate after X seconds of health check failure", and if the container becomes healthy during those X seconds, add it back to the rotation. Somewhat related: getting a health check from each ALB node (we use 4 AZs) at the same time seems like overkill (we get 4x the health check requests). Ideally the nodes would take turns in round-robin fashion rather than all checking at once, so there is only 1 health check per defined interval.
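To make the "Terminate after X seconds of health check failure" idea concrete, here is a minimal sketch of the decision logic in Python. It is not an ECS or ALB API: `TaskHealth`, `Action`, `decide`, and `recovery_window` are hypothetical names used purely for illustration, and the sketch assumes a single pass/fail health signal per check.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP_IN_ROTATION = "keep_in_rotation"  # healthy: route traffic normally
    DRAIN_AND_WAIT = "drain_and_wait"      # stop routing, but leave the task running
    TERMINATE = "terminate"                # the recovery window was exhausted


@dataclass
class TaskHealth:
    # Timestamp (seconds) at which the current run of failed checks began.
    # None while the task is healthy.
    unhealthy_since: Optional[float] = None


def decide(task: TaskHealth, check_passed: bool, now: float,
           recovery_window: float) -> Action:
    """Hypothetical policy: terminate a task only after it has failed health
    checks continuously for `recovery_window` seconds."""
    if check_passed:
        # Recovered (or never failed): reset the timer and rejoin the rotation.
        task.unhealthy_since = None
        return Action.KEEP_IN_ROTATION
    if task.unhealthy_since is None:
        # First failure in this run: start the recovery timer.
        task.unhealthy_since = now
    if now - task.unhealthy_since >= recovery_window:
        return Action.TERMINATE
    return Action.DRAIN_AND_WAIT
```

With `recovery_window=60`, a task that fails at t=10 and passes again at t=50 is returned to the rotation, while a task that keeps failing from t=60 through t=120 is terminated at t=120.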
While setting |
We have shipped an improvement to the ECS scheduler that prioritizes starting new healthy tasks before killing tasks that were marked unhealthy.
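As a rough illustration of that "start replacements first, stop unhealthy tasks afterwards" ordering, the sketch below plans one scheduling step. It is not the actual ECS scheduler implementation; `plan_step` and its parameters are hypothetical, and it assumes one replacement per unhealthy task.

```python
from typing import List, Tuple


def plan_step(unhealthy: List[str], replacements_pending: int,
              replacements_healthy: int) -> Tuple[int, List[str]]:
    """Return (new tasks to start, unhealthy task IDs that may be stopped now).

    unhealthy:            IDs of tasks currently marked unhealthy
    replacements_pending: replacements started but not yet healthy
    replacements_healthy: replacements that are already healthy
    """
    # Start enough replacements so that every unhealthy task has one.
    to_start = max(0, len(unhealthy) - replacements_pending - replacements_healthy)
    # Stop an unhealthy task only once a healthy replacement exists for it.
    to_stop = unhealthy[:replacements_healthy]
    return to_start, to_stop
```

For example, with two unhealthy tasks and no replacements yet, the plan is to start two tasks and stop none; once one replacement is healthy and the other is still pending, one unhealthy task may be stopped.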
Tell us about your request
Our applications can have unpredictable spikes in traffic. During these spikes the backends get overwhelmed with requests and start rejecting new connections. This results in failing health checks and ECS killing the tasks, which counteracts any scaling ECS is performing. Instead, we would like a task to remain active, finish its requests, and get a chance to become healthy again while the service scales out to handle the load (essentially a circuit breaker in ECS).
The ALB doesn't automatically kill targets; it leaves that decision to the service (ECS or an ASG, for example).
Which service(s) is this request for?
Fargate, ECS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Essentially, have a circuit breaker in ECS: allow tasks to fail health checks, but also give them a chance to recover. During large-scale events the failing health checks make scaling even harder, because many old tasks that could help with the load are recycled.
Are you currently working around this issue?
Teams have tested this in EC2 and are thinking of moving there. The way they establish it is by:
Additional context