[ECS] [RFC]: Automatic management of instance draining in an ASG #256

Open
coultn opened this issue Apr 19, 2019 · 15 comments
Labels
ECS

Comments

@coultn coultn commented Apr 19, 2019

Request for comment: please share your thoughts on this proposed improvement to ECS!

With this improvement, ECS will automate instance and task draining.

  • Customers can opt in to automated instance draining for each of their clusters, using the AWS CLI, API, SDK, CloudFormation, or Console.

  • For ECS cluster instances that are members of an EC2 Auto Scaling Group (ASG), ECS will set the instance to the DRAINING state whenever the ASG initiates termination of the instance. (The existing behavior of DRAINING is that ECS prevents new tasks from being scheduled for placement on the container instance. Service tasks on the draining container instance that are in the PENDING state are stopped immediately. If there are container instances in the cluster that are available, replacement service tasks are started on them. Service tasks on the container instance that are in the RUNNING state are stopped and replaced according to the service's deployment configuration parameters, minimumHealthyPercent and maximumPercent.)

  • ECS will prevent the container instance from terminating until all service tasks on the instance have stopped (up to a maximum of 48 hours, based on how ASG instance lifecycle hooks work).
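
For comparison, the manual flow that this proposal would automate can be sketched roughly as follows. This is a minimal sketch, not how ECS implements the feature: `ecs` and `autoscaling` are assumed to be boto3 clients (for example `boto3.client("ecs")` and `boto3.client("autoscaling")`), the function name and polling interval are hypothetical, and pagination and error handling are left out. It also waits for every task on the instance rather than service tasks only.

```python
import time


def drain_and_release(ecs, autoscaling, cluster, ec2_instance_id,
                      hook_name, asg_name, poll_seconds=30, sleep=time.sleep):
    """Drain one container instance, then let the ASG finish terminating it.

    `ecs` and `autoscaling` are assumed to be boto3 clients; all other
    names are illustrative and not part of the proposal.
    """
    # Map the EC2 instance ID to its ECS container instance ARN.
    arns = ecs.list_container_instances(
        cluster=cluster, filter=f"ec2InstanceId == {ec2_instance_id}"
    )["containerInstanceArns"]
    if not arns:
        return False  # The instance is not registered in this cluster.
    arn = arns[0]

    # DRAINING stops new placements and lets the service scheduler
    # replace the instance's service tasks elsewhere.
    ecs.update_container_instances_state(
        cluster=cluster, containerInstances=[arn], status="DRAINING"
    )

    # list_tasks defaults to tasks whose desired status is RUNNING.
    # Keep the lifecycle hook alive with heartbeats while any remain.
    while ecs.list_tasks(cluster=cluster, containerInstance=arn)["taskArns"]:
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            InstanceId=ec2_instance_id,
        )
        sleep(poll_seconds)

    # Tell the ASG it may proceed with termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=ec2_instance_id,
        LifecycleActionResult="CONTINUE",
    )
    return True
```

A long-running loop like this suits a small worker process; inside a Lambda function the wait would need to be split across invocations because of the execution time limit.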

The functionality is similar to what was published in this blog. The main difference is that ECS will automate it for you.

@Jsmiiii Jsmiiii commented Apr 20, 2019

Yes, this sounds great. Although we can manage removing instances from an ASG ourselves, it is a painful process, especially when the ASG tries to rebalance itself across AZs (when terminating instances in bulk, typically during scale-down actions) if the placement policies are set up to do that.

@tomelliff tomelliff commented Apr 20, 2019

This would be great. I have a slightly gnarly Lambda function triggered by autoscaling termination hooks that loops to SNS over and over until the instance no longer has any tasks running on it. While it works fine, it makes me a little uncomfortable every time I look at it.

This is probably worth raising as a separate feature request but, on a slightly related note, I'd also love to see daemonset services be the last to be evicted from a container instance. Right now our log shipper service runs as a daemonset and is one of the first things to be stopped when a container instance is drained, meaning logs for the tasks that are still being connection-drained from the ALB are dropped.

@ngamradt-turner ngamradt-turner commented Apr 20, 2019

We actually use a slightly modified version of the Lambda solution to do this, and would love to see this become standard functionality.

I think we used the blog post to set it up. There were some bugs in the solution that we had to fix. It has been stable and works well, but it would be great to have ECS manage this instead of needing the Lambda bolt-on solution.

@QuinnyPig QuinnyPig commented Apr 21, 2019

Take it one step further. Imagine being able to drain instances from Fargate into an ECS instance. This unlocks a number of different opportunities for both cost and density stories.

@rgarcia rgarcia commented Apr 22, 2019

We have had to implement our own draining logic since we run a handful of "system" containers that collect metrics, logs, etc. from the host. Without some kind of support to kill these containers last, the native draining mode isn't usable for us.
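
The thread does not show this custom draining logic, so the following is only one possible shape for it: stop application tasks first and host-level "system" tasks last. It assumes the instance is already in the DRAINING state, that `ecs` is a boto3 ECS client, and that the system containers run as services whose names are passed in `system_services` (a hypothetical parameter). It relies on service tasks reporting a `group` of `service:<service-name>`, and it bypasses the service deployment configuration by stopping tasks directly.

```python
def split_for_drain(tasks, system_services):
    """Split tasks into (application_tasks, system_tasks).

    `tasks` are dicts shaped like entries in the ECS DescribeTasks output.
    """
    system_groups = {f"service:{name}" for name in system_services}
    app, system = [], []
    for task in tasks:
        (system if task.get("group") in system_groups else app).append(task)
    return app, system


def stop_in_order(ecs, cluster, tasks, system_services):
    """Stop application tasks, wait for them, then stop system tasks."""
    app, system = split_for_drain(tasks, system_services)
    waiter = ecs.get_waiter("tasks_stopped")
    for batch in (app, system):
        arns = [task["taskArn"] for task in batch]
        for arn in arns:
            ecs.stop_task(cluster=cluster, task=arn, reason="Instance draining")
        if arns:
            # The waiter polls DescribeTasks, which accepts at most 100 tasks
            # per call; larger batches would need chunking.
            waiter.wait(cluster=cluster, tasks=arns)
```

With this ordering, log shippers and metrics collectors keep running until the application tasks they observe have exited.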

@sirocode sirocode commented Apr 22, 2019

Here are my observations:

  1. An ECS service (in a homemade green/blue deployment scheme) runs on an EC2 Spot instance and has a 50-200 policy. A dynamic port is registered in the target group. Once I get the Spot termination notice (120 seconds before instance termination), I run a Lambda (a heavily modified version of the one covered in https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/) that starts draining for the particular instance (with the dynamic port). Because the deregistration delay is set to 30 seconds, draining happens and then, 30 seconds later, I get the same ECS server process running on the same EC2 Spot machine again, registered in the same target group. Obviously, setting the whole EC2 node to the draining state would solve the issue for this particular ECS service.

  2. An ECS service with a 100-200 policy, on 3 nodes running on EC2 Spot instances. When a Spot node stops, I cannot simply set the node to the draining state, because the service's instance will not be deregistered from the target group; I need to force-deregister it using a Lambda. This "force deregister" is not good because it does not drain connections. Ideally, I'd like to have something called "force draining": draining should happen in this scenario.

Also, monitoring containers (being run using DAEMON service type) should leave the instance last.
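
The "50-200" and "100-200" policies above are the service's `minimumHealthyPercent` and `maximumPercent`. The arithmetic behind the difference between the two scenarios can be sketched as below. The rounding (lower bound up, upper bound down) reflects my reading of the ECS documentation, and the function name is hypothetical.

```python
def drain_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_total) task counts for a service.

    min_running: tasks that must stay RUNNING while others are replaced.
    max_total:   most tasks allowed in RUNNING or PENDING at once.
    """
    # Integer arithmetic: ceiling for the lower bound, floor for the upper.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    max_total = desired_count * maximum_percent // 100
    return min_running, max_total


# 50-200 policy, 4 tasks: 2 tasks may be stopped before replacements start.
print(drain_bounds(4, 50, 200))   # (2, 8)
# 100-200 policy, 3 tasks: no task may stop until a replacement is running.
print(drain_bounds(3, 100, 200))  # (3, 6)
```

Under a 100-200 policy the scheduler must start replacements first, which within a 120-second Spot notice leaves little time for connection draining.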

@rothgar rothgar commented Apr 22, 2019

I would also like to see logging/agents able to run in ECS with higher priority so they are evicted last or ignored by draining, similar to priorityClassName and critical pods in Kubernetes:
https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/

@coultn coultn commented Apr 23, 2019

Thanks everyone for your comments! Regarding the scheduling of DAEMON services, that is being tracked under a separate issue: #128

@rgarcia rgarcia commented Apr 23, 2019

I'd also like to see automatic DRAINING support for spot instances, i.e. if an instance gets a spot termination notice, automatically set it to DRAINING.

@Trane9991 Trane9991 commented Apr 25, 2019

We wrote a Lambda that does exactly this: https://github.com/getsocial-rnd/ecs-drain-lambda
It supports the Spot Instance Interruption Notice and Auto Scaling lifecycle terminate hooks.

And yes, this is a complete rewrite of

> The functionality is similar to what was published in this blog
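
A handler that supports both triggers has to normalize two different event shapes before it can drain anything. The sketch below is not taken from the linked project; it assumes the events arrive through EventBridge (CloudWatch Events) with the `detail-type` values and `detail` keys that AWS publishes for Spot interruption warnings and Auto Scaling terminate lifecycle actions. The function name and the returned dict layout are hypothetical.

```python
def termination_target(event):
    """Extract the instance to drain from a Spot or ASG lifecycle event.

    Returns None for events that are not termination triggers.
    """
    detail = event.get("detail", {})
    detail_type = event.get("detail-type")

    if detail_type == "EC2 Spot Instance Interruption Warning":
        # Spot notices carry no lifecycle hook, so there is nothing to complete.
        return {"instance_id": detail["instance-id"], "trigger": "spot",
                "hook_name": None, "asg_name": None}

    if (detail_type == "EC2 Instance-terminate Lifecycle Action"
            and detail.get("LifecycleTransition")
            == "autoscaling:EC2_INSTANCE_TERMINATING"):
        return {"instance_id": detail["EC2InstanceId"], "trigger": "asg",
                "hook_name": detail["LifecycleHookName"],
                "asg_name": detail["AutoScalingGroupName"]}

    return None
```

Only the ASG path can delay termination; on the Spot path the handler can set DRAINING but the instance still goes away when the notice period ends.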

@coultn coultn moved this from Researching to We're Working On It in containers-roadmap May 9, 2019

@casper-gh casper-gh commented May 9, 2019

> I'd also like to see automatic DRAINING support for spot instances, i.e. if an instance gets a spot termination notice, automatically set it to DRAINING.

+1

We recently migrated to a mixed ASG per this blog, and are also using the lifecycle hook to set the ECS host to DRAINING when the ASG scales down an instance. My big question right now is whether the ASG will be able to detect the autoscaling:EC2_INSTANCE_TERMINATING event for Spot instances within this mixed group.

@coultn coultn commented May 9, 2019

> > I'd also like to see automatic DRAINING support for spot instances, i.e. if an instance gets a spot termination notice, automatically set it to DRAINING.
>
> +1
>
> We recently migrated to a mixed ASG per this blog, and are also using the lifecycle hook to set the ECS host to DRAINING when the ASG scales down an instance. My big question right now is whether the ASG will be able to detect the autoscaling:EC2_INSTANCE_TERMINATING event for Spot instances within this mixed group.

The answer is yes - it will handle both ASG lifecycle hooks and Spot termination notices.

@guilhermesmi guilhermesmi commented May 29, 2019

This is very useful. One additional comment: it would help to be able to set a Termination Policy on the ASG so that it prioritizes terminating EC2 instances that have no tasks running. Even though setting the instance to DRAINING ensures that ECS schedules a new task to meet the service's Min Healthy Percent / Desired Count, stopping and starting a new task causes retries and timeouts. Moreover, freshly started tasks may not perform as well as old ones, which benefit from warm local caches, JIT compilation (for the JVM), etc.
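
ASG termination policies have no knowledge of ECS tasks, so one workaround available today is instance scale-in protection: protect instances that host tasks and leave idle ones eligible for termination. The sketch below assumes `container_instances` are dicts shaped like the ECS DescribeContainerInstances output and that `autoscaling` is a boto3 Auto Scaling client; the function names are hypothetical. Daemon tasks count as running tasks here, so a cluster that runs daemons on every instance would need an extra filter.

```python
def plan_scale_in_protection(container_instances):
    """Return (protect, release) lists of EC2 instance IDs."""
    protect, release = [], []
    for ci in container_instances:
        busy = ci.get("runningTasksCount", 0) + ci.get("pendingTasksCount", 0) > 0
        (protect if busy else release).append(ci["ec2InstanceId"])
    return protect, release


def apply_protection(autoscaling, asg_name, protect, release):
    """Protect busy instances from scale-in and unprotect idle ones."""
    if protect:
        autoscaling.set_instance_protection(
            InstanceIds=protect, AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=True,
        )
    if release:
        autoscaling.set_instance_protection(
            InstanceIds=release, AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=False,
        )
```

The plan would have to be recomputed periodically, since task placement changes after the protection flags are set.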

@innokentiyt innokentiyt commented Jul 24, 2019

Yes! This is my most desired feature for ASG and ECS.

@mfortin mfortin commented Oct 25, 2019

Would a DRAINING instance prevent other scaling activities on the ASG?
Say the ASG is scaling down because the desiredCount has decreased; a lifecycle termination hook is triggered, and part of that process is to DRAIN the ECS container instance. The instance state in the ASG is Terminating:Wait, and it waits until the instance has drained. However, if there is a new change and a scale-out rule fires, the ASG would not scale out because the lifecycle hook has not yet completed. Am I right?

Projects
containers-roadmap
We're Working On It