[ECS] [Deployment]: Allow Custom CircuitBreaker min failures #1247

Lux-CC · 2021-01-29T17:24:41Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Allow a custom minimum amount of failures for circuit breaker.

Which service(s) is this request for?
ECS (fargate)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
For dev environments I'd like to set up quick deployments and fast failures for new task definition versions. Currently the circuit breaker requires a minimum of 10 failures before it marks a deployment as failed. However, dev environments often run only 1 container anyway, and when it fails it usually means a bug in the code. One or two failures are sufficient.

Are you currently working around this issue?
My current workaround is manually initiating a 'cancel update stack' for my CloudFormation stack.

Thanks!

Luuk

elruwen · 2021-03-25T23:17:58Z

I am using ALB health checks. If the health checks are set up to try a couple of times, it might take a while for a task to fail. Multiply that by 10 and it takes way too long. I would suggest something like:

retrymode: auto|manual
retryfactor: 1-10, decimals are allowed

If retrymode is manual, the desired task count is multiplied by retryfactor. The result is always rounded up.

Examples:
Desired Count 1:
retrymode: manual
retryfactor: 1.5
-> two retries

Desired Count 10:
retrymode: manual
retryfactor: 1.5
-> 15 retries for 10 tasks
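The arithmetic in this proposal can be sketched in a few lines of Python. This is only an illustration of the suggestion above, not an existing ECS setting: `manual_retry_count` is a hypothetical helper, and the 1-10 range check mirrors the proposed `retryfactor` bounds. `Decimal` is used so that factors such as 1.1 do not round up an extra step because of floating-point error.

```python
from decimal import Decimal, ROUND_CEILING


def manual_retry_count(desired_count: int, retry_factor: float) -> int:
    """Hypothetical helper: desired count times retryfactor, always rounded up."""
    factor = Decimal(str(retry_factor))
    if not Decimal(1) <= factor <= Decimal(10):
        raise ValueError("retryfactor must be between 1 and 10")
    product = factor * desired_count
    return int(product.to_integral_value(rounding=ROUND_CEILING))


print(manual_retry_count(1, 1.5))   # desired count 1
print(manual_retry_count(10, 1.5))  # desired count 10
```

The two calls correspond to the two examples above and yield 2 and 15 retries respectively.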

vibhav-ag · 2021-08-30T14:59:59Z

Thanks for the input @LRuttenCN @elruwen. Additionally, are there other configurations that you'd like to see for rollbacks for rolling updates in ECS? An example of this might be rollbacks based on CloudWatch Alarms.

elruwen · 2021-09-01T01:38:39Z

Hi @vibhav-ag
If you are in that area, can you also fix #1206?

There is also #1273 which is about rollbacks.

With regard to rollbacks, I am not sure how they currently work in detail, since bug #1206 makes testing rollbacks quite painful.
But let me explain how I would like it to work, using an example:

I run ECS containers on EC2 instances. They are both in the same CloudFormation stack. Imagine I have 2 EC2 instances, A and B: A runs two tasks and is "full", and B runs one task and has space for one more task. Let's imagine I am behind an ALB.

Case 1 - I update only the Task Definitions
Expectation:

one task gets launched on the existing EC2 instance
one new EC2 instance gets launched with two tasks
none of the new tasks turn healthy
traffic stays the whole time on the old tasks
rollback: terminate new task on B and terminate the new EC2 instance

Case 2 - I update the Task Definitions + EC2 AMI
Expectation:

two new EC2 instances are being launched
3 new tasks are spawned on the new EC2 instances (don't spawn on the old one, since the system should know they are being replaced)
none of the new tasks turn healthy
traffic stays the whole time on the old tasks
rollback: terminate the two new EC2 instances

bogdankatishev · 2022-06-30T09:10:23Z

Hello,

We have also had this problem for quite some time. Manually initiating a 'cancel update stack' for the CloudFormation stack is not an option for us because it still takes 30+ minutes for CloudFormation to reach the UPDATE_ROLLBACK_COMPLETE state.

Our current workaround is this: https://aws.amazon.com/blogs/compute/automating-rollback-of-failed-amazon-ecs-deployments/
This workaround is not optimal because it does not cover all the deploy use cases.

Does anyone here have another workaround that they are using to achieve faster feedback/rollback loops?

rmarops · 2023-02-08T14:43:09Z

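Python, using the boto3 ECS client:

import boto3

ecs = boto3.client("ecs")
# Scale the service to zero so ECS stops launching replacement tasks.
ecs.update_service(
    cluster="xx",
    service="xx",
    desiredCount=0,
)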

This is a bigger issue when a build process is wrapped up in any commercial offering that tracks build credits.
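A workaround like the one above could in principle be triggered automatically once a deployment has failed a chosen number of times. The following is a minimal sketch, assuming the `deployments` entries returned by boto3's `describe_services` carry the `status` and `failedTasks` fields. The helper names and the `max_failures` threshold are hypothetical and not part of any ECS API.

```python
def primary_failed_tasks(describe_services_response: dict) -> int:
    """Return failedTasks of the PRIMARY deployment of the first service, or 0."""
    services = describe_services_response.get("services", [])
    if not services:
        return 0
    for deployment in services[0].get("deployments", []):
        if deployment.get("status") == "PRIMARY":
            return deployment.get("failedTasks", 0)
    return 0


def should_abort(describe_services_response: dict, max_failures: int = 2) -> bool:
    """Hypothetical policy: abort once the new deployment reaches max_failures."""
    return primary_failed_tasks(describe_services_response) >= max_failures


# Hand-written sample shaped like a describe_services response (not real output).
sample = {
    "services": [
        {
            "deployments": [
                {"status": "PRIMARY", "failedTasks": 2},
                {"status": "ACTIVE", "failedTasks": 0},
            ]
        }
    ]
}
print(should_abort(sample))
```

With a real client, the response would come from `describe_services` for the cluster and service in question, and a `True` result could be followed by whatever abort step suits the pipeline, such as the scale-to-zero call above.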

genesor · 2023-06-19T12:32:22Z

Hello,

The circuit breaker is something we tried recently and we were a bit disappointed with it.

If you run a single instance service with the current settings it takes around 100 minutes to detect a failed deployment because of the threshold + throttling. It's faster to detect it by hand by simply watching the AWS Console.

Any news on this subject @vibhav-ag @tabern ?

SantiagoSchez · 2023-06-23T22:27:41Z

Same here. The worst part is that even when I detect a failing deployment in advance, I cannot stop an ongoing deployment, and tasks keep being created again and again. If there is a task already working, I don't need a task to be re-created a minimum of 10 times before the deployment is marked as FAILED!

vibhav-ag · 2024-01-12T01:04:56Z

We've enhanced circuit breaker to be more responsive by default. Is there still a need to configure minimum failures for circuit breaker?

https://aws.amazon.com/about-aws/whats-new/2024/01/amazon-ecs-deployment-monitoring-responsiveness-services/

SantiagoSchez · 2024-01-12T01:32:06Z

Thank you, 3 is better than 10. But yes, it would be more convenient if we could just adjust the minimum number of failures.
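For context, here is a sketch of what can be set on the circuit breaker through boto3 today, to the best of my knowledge: the `deploymentCircuitBreaker` block of `deploymentConfiguration` accepts only `enable` and `rollback`, with no field for the failure threshold. The `circuit_breaker_config` helper is hypothetical and only builds the dictionary; it makes no AWS calls.

```python
def circuit_breaker_config(rollback: bool = True) -> dict:
    """Hypothetical helper: build a deploymentConfiguration enabling the circuit breaker."""
    return {
        "deploymentCircuitBreaker": {
            "enable": True,
            "rollback": rollback,
            # No minimum-failures setting is assumed to exist here;
            # that is what this issue requests.
        }
    }


config = circuit_breaker_config()
print(config)
```

The resulting dictionary would be passed as the `deploymentConfiguration` argument of `update_service` or `create_service`.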

Lux-CC added the Proposed Community submitted issue label Jan 29, 2021

pavneeta added this to Researching in containers-roadmap via automation Apr 21, 2021

efekarakus mentioned this issue Jul 13, 2021

Circuit breakers configurable timeout? aws/copilot-cli#2608

Closed

huanjani mentioned this issue Jul 28, 2021

"update in progress" stuck aws/copilot-cli#2672

Closed

vibhav-ag self-assigned this Aug 30, 2021

vibhav-ag added the ECS Amazon Elastic Container Service label Aug 30, 2021

tomelliff mentioned this issue Sep 9, 2021

ECS circuit breaker rollback doesn't work with wait for steady state hashicorp/terraform-provider-aws#19519

Open

efekarakus mentioned this issue Nov 19, 2021

Override circuitbreaker retry count aws/copilot-cli#3061

Closed

mcfadden mentioned this issue Jan 24, 2022

[ECS] [request]: Circuit Breaker #1573

Open

Lou1415926 mentioned this issue Jun 14, 2022

Add command to stop tasks aws/copilot-cli#1397

Open
