[ECS] [Deployment]: Allow Custom CircuitBreaker min failures #1247

Open
Lux-CC opened this issue Jan 29, 2021 · 9 comments
Labels
ECS Amazon Elastic Container Service Proposed Community submitted issue

Comments

Lux-CC commented Jan 29, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Allow a custom minimum number of failures for the circuit breaker.

Which service(s) is this request for?
ECS (Fargate)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
For dev environments I'd like to set up quick deployments and fast failure detection for new task definition versions. Currently the circuit breaker requires a minimum of 10 failures before it marks a deployment as failed. However, dev environments often run only 1 container anyway, and when it fails it usually means a bug in the code. One or two failures are sufficient.

Are you currently working around this issue?
My current workaround is manually initiating a 'cancel update stack' for my CloudFormation stack.

Thanks!

Luuk
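
For reference, the manual workaround above can be scripted. This is a minimal sketch, assuming boto3's CloudFormation client and its `cancel_update_stack` call; the helper name `cancel_stack_update` and the stack name are hypothetical, and the client is passed in so the helper can be exercised without AWS access.

```python
def cancel_stack_update(cfn_client, stack_name: str) -> None:
    """Ask CloudFormation to cancel an in-progress stack update.

    cfn_client is expected to be a boto3 CloudFormation client
    (or anything exposing the same cancel_update_stack method).
    """
    cfn_client.cancel_update_stack(StackName=stack_name)


# Usage (requires boto3 and AWS credentials; the stack name is a placeholder):
#   import boto3
#   cancel_stack_update(boto3.client("cloudformation"), "my-dev-stack")
```

This only requests the cancellation; the stack still has to finish rolling back on its own.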

@Lux-CC Lux-CC added the Proposed Community submitted issue label Jan 29, 2021

elruwen commented Mar 25, 2021

I am using ALB health checks. If the health checks are set up to retry a couple of times, it can take a while for a task to fail. Multiply that by 10 and it takes way too long. I would suggest something like:

retrymode: auto|manual
retryfactor: 1-10, decimals are allowed

If retrymode is manual, the desired task count is multiplied by retryfactor. The result is always rounded up.

Examples:
Desired Count 1:
retrymode: manual
retryfactor: 1.5
-> two retries

Desired Count 10:
retrymode: manual
retryfactor: 1.5
-> 15 retries for 10 tasks
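
The proposed calculation can be sketched as follows. `retrymode` and `retryfactor` are the settings suggested in this comment, not existing ECS options, and the function name `manual_retry_limit` is made up for illustration. Decimal arithmetic is used so that a factor such as 1.1 does not round up one step too far because of floating-point error.

```python
from decimal import Decimal, ROUND_CEILING


def manual_retry_limit(desired_count: int, retry_factor: float) -> int:
    """Retries allowed in the proposed 'manual' mode: desired count times factor, rounded up."""
    if not 1 <= retry_factor <= 10:
        raise ValueError("retryfactor must be between 1 and 10")
    product = Decimal(desired_count) * Decimal(str(retry_factor))
    return int(product.to_integral_value(rounding=ROUND_CEILING))


print(manual_retry_limit(1, 1.5))   # 2
print(manual_retry_limit(10, 1.5))  # 15
```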

@pavneeta pavneeta added this to Researching in containers-roadmap via automation Apr 21, 2021
@vibhav-ag vibhav-ag self-assigned this Aug 30, 2021
@vibhav-ag vibhav-ag added the ECS Amazon Elastic Container Service label Aug 30, 2021
@vibhav-ag

Thanks for the input @LRuttenCN @elruwen. Additionally, are there other configurations that you'd like to see for rollbacks of rolling updates in ECS? An example of this might be rollbacks based on CloudWatch alarms.
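
To make the alarm-based idea concrete, here is a rough sketch of a deployment configuration that enables rollback both from the circuit breaker and from CloudWatch alarms. It assumes the `deploymentCircuitBreaker` and `alarms` blocks of the ECS `deploymentConfiguration` structure as exposed through boto3; the helper name and the alarm name are hypothetical.

```python
def rollback_deployment_configuration(alarm_names: list[str]) -> dict:
    """Build a deploymentConfiguration that rolls back on circuit-breaker or alarm failure."""
    return {
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        "alarms": {
            "alarmNames": list(alarm_names),
            "enable": True,
            "rollback": True,
        },
    }


config = rollback_deployment_configuration(["my-service-5xx-rate"])  # hypothetical alarm name

# Passing it to ECS (requires boto3 and AWS credentials; "xx" values are placeholders):
#   import boto3
#   boto3.client("ecs").update_service(
#       cluster="xx", service="xx", deploymentConfiguration=config
#   )
```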

elruwen commented Sep 1, 2021

Hi @vibhav-ag
If you are in that area, can you also fix #1206?

There is also #1273 which is about rollbacks.

With regard to rollbacks, I am not sure how they currently work in detail, because bug #1206 makes testing rollbacks quite painful.
But let me explain how I would like them to work, using an example:

I run ECS containers on EC2 instances. They are both in the same CloudFormation stack. Imagine I have 2 EC2 instances, A and B: A runs two tasks and is "full", and B runs one task and has space for one more task. Let's imagine I am behind an ALB.

Case 1 - I update only the Task Definitions
Expectation:

  • one task gets launched on the existing EC2 instance
  • one new EC2 instance gets launched with two tasks
  • none of the new tasks turn healthy
  • traffic stays the whole time on the old tasks
  • rollback: terminate new task on B and terminate the new EC2 instance

Case 2 - I update the Task Definitions + EC2 AMI
Expectation:

  • two new EC2 instances are being launched
  • 3 new tasks are spawned on the new EC2 instances (don't spawn on the old one, since the system should know they are being replaced)
  • none of the new tasks turn healthy
  • traffic stays the whole time on the old tasks
  • rollback: terminate the two new EC2 instances

@bogdankatishev

Hello,

We have also had this problem for quite some time. Manually initiating a 'cancel update stack' for the CloudFormation stack is not an option for us because it still takes 30+ minutes for CloudFormation to reach the UPDATE_ROLLBACK_COMPLETE state.

Our current workaround is this: https://aws.amazon.com/blogs/compute/automating-rollback-of-failed-amazon-ecs-deployments/
This workaround is not optimal because it does not cover all the deploy use cases.

Does anyone here have another workaround that they are using to achieve faster feedback/rollback loops?

rmarops commented Feb 8, 2023

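Python, boto3 ECS client:

import boto3

ecs = boto3.client("ecs")
# Set the service's desired task count to 0 ("xx" values are placeholders)
ecs.update_service(
    cluster="xx",
    service="xx",
    desiredCount=0,
)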

This is a bigger issue when a build process is wrapped up in any commercial offering that tracks build credits.

genesor commented Jun 19, 2023

Hello,

The circuit breaker is something we tried recently and we were a bit disappointed with it.

If you run a single-instance service with the current settings, it takes around 100 minutes to detect a failed deployment because of the threshold plus throttling. It's faster to detect it by hand by simply watching the AWS Console.

Any news on this subject @vibhav-ag @tabern ?

SantiagoSchez commented Jun 23, 2023

Same here. The worst part is that even when I detect a failing deployment in advance, I cannot stop the ongoing deployment, and the tasks keep being created again and again. If there is a task already working, I don't need a task to be re-created a minimum of 10 times before the deployment is marked as FAILED!

@vibhav-ag

We've enhanced the circuit breaker to be more responsive by default. Is there still a need to configure the minimum failures for the circuit breaker?

https://aws.amazon.com/about-aws/whats-new/2024/01/amazon-ecs-deployment-monitoring-responsiveness-services/

@SantiagoSchez

Thank you, 3 is better than 10. But yes, it would be more convenient if we could just adjust the minimum number of failures.
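
To illustrate the difference, here is a rough sketch of how the failure threshold could be computed with an adjustable minimum. It assumes the formula as I understand it from the ECS documentation (half the desired task count, rounded up, clamped between a minimum of 3 and a maximum of 200); the configurable `minimum` parameter is exactly what this issue asks for and is not an existing ECS setting.

```python
import math


def circuit_breaker_threshold(desired_count: int, minimum: int = 3, maximum: int = 200) -> int:
    """Failed tasks needed to trip the breaker: ceil(0.5 * desired count), clamped to [minimum, maximum]."""
    calculated = math.ceil(0.5 * desired_count)
    return max(minimum, min(maximum, calculated))


print(circuit_breaker_threshold(1))              # 3  (current default minimum)
print(circuit_breaker_threshold(1, minimum=10))  # 10 (previous minimum)
print(circuit_breaker_threshold(1, minimum=1))   # 1  (what a dev environment might choose)
```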
