How to preserve currently running ECS Instances during update to the latest AMI #672

vmogilev · 2017-01-17T19:49:54Z

I am in the process of testing the update to the latest Amazon ECS-optimized AMI (amzn-ami-2016.09.d-amazon-ecs-optimized).

Our current ECS Instances are running amzn-ami-2015.09.g-amazon-ecs-optimized which at the time of the launch pulled the following stack:

Docker: 1.9.1
ECS Agent: 1.8.2

I don't think it's a good idea to simply update launch configuration with the new AMI and hope for the best. What if things fail under load, what if we discover a bug with the new AMI/Docker/Agent combo running our containers? These are all possibilities and we need to mitigate the risks by preserving our old instances while the new instances are burning-in under production load. Once we feel solid - we can terminate the old instances.

I can't figure out how to do this. Here's what I tried:

I updated the launch configuration for the Auto Scaling Group and doubled the number of instances in it. End result: I have 4 instances with the OLD AMI and 4 instances with the NEW AMI. Good!
I then updated the ECS Service and increased its number of Tasks from 4 to 8. End result: 4 new tasks were started on the NEW AMI Instances and the 4 original tasks are still running on the OLD AMI Instances. Good!

All good at this point. Next I need to stop the tasks on the 4 OLD AMI Instances and somehow keep these OLD AMI Instances in reserve while we burn in the 4 NEW AMI Instances. Here's what I tried:

I set the status of the 4 OLD AMI Instances to "Standby" (in the ASG). I was expecting the ECS Agent on these OLD AMI Instances to terminate all running ECS Tasks. No dice!
I then reduced ECS Service task number from 8 to 4 hoping that ECS Agent will terminate the Tasks on the OLD AMI Instances. No dice! It terminated TASKS on random instances mixing NEW/OLD in the process.
I then decided to help ECS Agent and manually (one at a time) stopped running TASKS on the OLD AMI Instances hoping that ECS Agent will NOT re-launch the TASKS on the OLD AMI Instances. No dice -- it still managed to launch some tasks on the OLD AMI Instances.

At this point I am lost. Is this even possible?

One option I am considering is using task-placement-constraints, but I am hoping someone here has gone through this basic need and can share their ideas with me.

I feel we should have a way to mark ECS Instances as StandBy and have the ECS Agent not schedule any tasks on them for as long as that status is active. I don't think "Deregister" functionality is sufficient here because there is no way that I know of to bring deregistered instances back into service.

I also don't like that a specific version of Docker/ECS Agent is not pinned to a specific version of Amazon ECS-optimized AMI. If it were - this would not be an issue, I could always bring back a known, good working set of versions into service. But as it is now - even if I used an older AMI - it will pull in the most recent version of ECS Agent and Docker on launch.

Thank you!

vmogilev · 2017-01-18T06:37:19Z

I solved this problem using task-placement-constraints

hridyeshpant · 2017-01-19T23:46:07Z

@vmogilev can you please tell us how you managed to do this using task-placement-constraints? The main problems I am facing while updating the cluster are:
1. How do we make sure new tasks get launched on the new instances (new AMI) when we increase the number of tasks from 4 to 8?
2. How do we make sure the old tasks get scheduled on the new instances (new AMI) when we reduce the number of tasks from 8 to 4?
3. How do we automate this, as we are running more than 50 instances and more than 300 services?

vmogilev · 2017-01-20T00:48:58Z

@hridyeshpant - It took me a day to come up with the solution so I decided to document it before I forget. Hopefully it'll help someone else in the future. Here's the process I came up with:

Blue/Green ECS-optimized AMI Update For ECS Instances

Hope this helps! If you have any questions - don't hesitate to ask.

aaithal · 2017-01-24T17:15:13Z

@vmogilev Thank you so much for writing this up! Please let us know if you need any more assistance here. I'm closing this issue for now.

samuelkarp · 2017-01-24T22:27:09Z

@vmogilev @hridyeshpant Today we launched Container Instance Draining to address this use case. Once you launch your new instances with the new AMI, you can set the old instances to DRAINING and the service will move tasks to the new instances for you automatically.

vmogilev · 2017-01-24T22:29:03Z

Wow @samuelkarp - you guys rock! Thank you

hridyeshpant · 2017-01-26T20:31:45Z

great @samuelkarp :)

hridyeshpant · 2017-01-26T20:45:02Z

@samuelkarp is there a way to always place existing tasks from a DRAINING instance onto new instances (created from the new AMI)? We can't use task-placement-constraints, as this would require creating a new version of each task definition and service. We also can't put all the old AMI instances (60 at this time) into the DRAINING state at once, as this would cause throttling in the Docker registry.

vmogilev · 2017-01-26T21:00:24Z

@hridyeshpant what are the constraints that are preventing you from creating a new task definition?

hridyeshpant · 2017-01-26T23:21:44Z

@vmogilev we are running 20-30 services per instance. So if I need to use constraints, I first need to update all 30 task definitions and then update 30 services to use the new task definitions during a cluster update.
I am looking for something where I can tell the ECS scheduler to deploy tasks to the new instances (based on AMI ID) without having to update a bunch of task definitions and services.

samuelkarp · 2017-01-27T00:08:38Z

@hridyeshpant I'm not sure I understand your question. When you place an instance into the DRAINING status, services with tasks on those instances will stop and replace them according to the service deployment configuration parameters; new tasks are not launched on instances in the DRAINING status.

hridyeshpant · 2017-01-27T00:31:57Z

@samuelkarp my use case is: I always want services running on a DRAINING instance to move to new instances (created from the new AMI), not to just any instance attached to the cluster.
So I am using a combination of the DRAINING feature and task-placement-constraints. The only issue I am seeing is that I need to update all task definitions and services with a new placementConstraints expression that references the new ami_id.

Let's say I am running 20-30 tasks per instance with placementConstraints:

"placementConstraints": [
   {
       "expression": "attribute:ecs.ami-id == old_ami_id",
       "type": "memberOf"
   }
],

Now I put one instance into the DRAINING state and launch a new instance with the new AMI. I then need to update all 20-30 task definitions with a new placementConstraints expression containing the new AMI ID, and also update all the services to the new task definition revision, so that all the services on the DRAINING instance move only to the new instances (with the new AMI).

samuelkarp · 2017-01-27T00:56:42Z

Is there a particular reason you have the AMI ID in the task definition? If you were just using that as a stop-gap before DRAINING came out, do you still need it?

hridyeshpant · 2017-01-27T01:04:53Z

@samuelkarp
If I don't use the AMI ID in placementConstraints, how do I move old tasks to new instances?
If you can point me to a way to solve my use case, that will help.
Use case:
During my cluster update, I always want services running on a DRAINING instance to move to new instances (created from the new ECS AMI), not to any old running instances attached to the cluster.
I don't want to update 20-30 task definitions and services via placementConstraints and update the ami_id during my cluster update process.
We are running more than 500 services per cluster, so placementConstraints doesn't look like a good option, since it requires updating 500 task definitions and then the services.

samuelkarp · 2017-01-27T01:08:00Z

> if i dont use AMI ID in placementConstraints, how do i move old task to new instances.

Mark your old instances as DRAINING and the ECS service scheduler will move them according to the service deployment configuration parameters.

hridyeshpant · 2017-01-27T01:11:57Z

@samuelkarp but if I just mark old instances as DRAINING, ECS can put those services on any running instance, which can be an old instance with the old AMI.
Am I missing something here?

samuelkarp · 2017-01-27T01:14:44Z

> AM i missing something here ?

Maybe we're both speaking past each other?

1. Launch some new instances
2. Mark old instances DRAINING
3. Tasks move off the DRAINING instances
4. After all the tasks are off the drained instances, terminate your old instances

We have a blog post about how to do this.
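
The steps above could be scripted roughly as follows. This is only a minimal sketch, not the procedure from the blog post: it assumes a boto3-style ECS client is passed in as `ecs`, and the function name `drain_instances`, the polling interval, and the timeout are made up for illustration. The batch sizes of 10 and 100 reflect the per-call limits as I understand them from the API documentation and are worth double-checking.

```python
import time


def drain_instances(ecs, cluster, instance_arns, poll_seconds=15, timeout_seconds=1800):
    """Set container instances to DRAINING and wait until they hold no tasks.

    `ecs` is assumed to behave like boto3.client("ecs").
    Returns True once the instances are empty, False if the timeout expires.
    """
    # Assumption: UpdateContainerInstancesState takes at most 10 instances per call.
    for i in range(0, len(instance_arns), 10):
        ecs.update_container_instances_state(
            cluster=cluster,
            containerInstances=instance_arns[i:i + 10],
            status="DRAINING",
        )

    deadline = time.monotonic() + timeout_seconds
    while True:
        remaining = 0
        # Assumption: DescribeContainerInstances takes at most 100 instances per call.
        for i in range(0, len(instance_arns), 100):
            resp = ecs.describe_container_instances(
                cluster=cluster,
                containerInstances=instance_arns[i:i + 100],
            )
            remaining += sum(
                ci["runningTasksCount"] + ci["pendingTasksCount"]
                for ci in resp["containerInstances"]
            )
        if remaining == 0:
            return True  # safe to terminate the old instances now
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

With a real client this might be called as `drain_instances(boto3.client("ecs"), "my-cluster", old_arns)`, where `my-cluster` and `old_arns` are placeholders. Only tasks that belong to a service are moved by DRAINING, so if standalone tasks are running on the instances the loop would keep waiting until the timeout.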

hridyeshpant · 2017-01-27T01:20:27Z

@samuelkarp so when we say tasks move off the DRAINING instances, let's say:

1. I have 3 instances with ami-id 1234.
2. I launch 1 new instance with ami-id 5678.
3. I mark one old instance (ami-id 1234) as DRAINING.

At this stage, will the ECS scheduler always place tasks on the newly launched instance (ami-id 5678), or on any instance (2 old AMI and 1 new AMI)?

samuelkarp · 2017-01-27T01:21:30Z

> At this stage, will the ECS scheduler always place tasks on the newly launched instance (ami-id 5678), or on any instance (2 old AMI and 1 new AMI)?

Any of the non-DRAINING instances.

hridyeshpant · 2017-01-27T01:22:09Z

That is not my use case. I always want to deploy to the newly launched instances (ami-id 5678). That is what I mentioned earlier about my use case and about using a combination of the DRAINING feature and task-placement-constraints.

samuelkarp · 2017-01-27T01:22:59Z

I think I'm misunderstanding your use-case then. It's probably worth opening a new issue if you want to go into this in more depth.

hridyeshpant · 2017-01-27T01:24:17Z

Yeah, maybe I was not clear.
My use case: I always want to deploy to the newly launched instances (ami-id 5678). That is what I mentioned earlier about my use case and about using a combination of the DRAINING feature and task-placement-constraints, but the issue is having to update task definitions and services unnecessarily.

samuelkarp · 2017-01-27T19:18:03Z

@hridyeshpant If I understand your use-case: you have a large number of instances in a cluster. You'd like to replace all the instances in the cluster over a period of time rather than all at once, and you'd like to ensure that tasks only get moved once rather than potentially moving more than once due to starting on instances that you're going to get rid of anyway.

I think you can accomplish this by using DRAINING plus an anti-affinity task placement constraint in your task definition. For example:

Step 1: Add a custom constraint to all task definitions to not place tasks on instances with a custom attribute state=pre-drain. You only need to do this once.

"placementConstraints": [
   {
       "expression": "attribute:state!=pre-drain",
       "type": "memberOf"
   }
]

Step 2: Launch the instances with the new AMI in your cluster.

Step 3: Set the custom attribute state=pre-drain on the old instances in the cluster. New task launches will now only go to the new instances.

Step 4: Set X% of the old instances to the DRAINING state. When tasks are rescheduled they will only end up on new instances.

Step 5: Continue to increase the percentage of instances in the DRAINING state. When you have 100% of the old instances in the DRAINING state and tasks successfully moved, terminate the instances.
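
Steps 3 to 5 could be automated along these lines. This is a rough sketch under stated assumptions rather than a tested rollout tool: `ecs` is assumed to be a boto3-style ECS client, the helper names (`batches_by_percent`, `mark_pre_drain`, `rolling_drain`) are invented for this example, and `wait_until_empty` is a hypothetical callback that blocks until a batch of instances has no tasks left. The limit of 10 items per API call is my reading of the API documentation.

```python
import math


def batches_by_percent(instance_arns, percent):
    """Split instances into consecutive batches of roughly `percent` % each."""
    size = max(1, math.ceil(len(instance_arns) * percent / 100))
    return [instance_arns[i:i + size] for i in range(0, len(instance_arns), size)]


def mark_pre_drain(ecs, cluster, instance_arns):
    """Step 3: set the custom attribute state=pre-drain on the old instances."""
    # Assumption: PutAttributes accepts at most 10 attributes per call.
    for i in range(0, len(instance_arns), 10):
        ecs.put_attributes(
            cluster=cluster,
            attributes=[
                {
                    "name": "state",
                    "value": "pre-drain",
                    "targetType": "container-instance",
                    "targetId": arn,
                }
                for arn in instance_arns[i:i + 10]
            ],
        )


def rolling_drain(ecs, cluster, old_instance_arns, percent, wait_until_empty):
    """Steps 3-5: tag every old instance, then drain them one batch at a time."""
    mark_pre_drain(ecs, cluster, old_instance_arns)
    for batch in batches_by_percent(old_instance_arns, percent):
        # Assumption: UpdateContainerInstancesState takes at most 10 instances per call.
        for i in range(0, len(batch), 10):
            ecs.update_container_instances_state(
                cluster=cluster,
                containerInstances=batch[i:i + 10],
                status="DRAINING",
            )
        # Hypothetical callback: block until this batch holds no tasks.
        wait_until_empty(batch)
```

Tagging all old instances before draining any of them is what keeps rescheduled tasks off instances that are about to be replaced. Draining in batches limits how many tasks restart (and pull images) at the same time.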

hridyeshpant · 2017-01-27T23:08:22Z

@samuelkarp thanks a lot, this will definitely work.

mattcallanan · 2017-11-21T06:33:21Z

Updating this thread to say this approach does work well.
One minor tweak was to add "attribute:state !exists or" to the expression, since instances won't have the state attribute by default:

      placement_constraints = [{
        type: 'memberOf',
        expression: 'attribute:state !exists or attribute:state != pre-drain'
      }]
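
For anyone who has many task definitions to patch, the one-time change from Step 1 could be applied with a small helper like the one below. It is a sketch only: the task definition is assumed to be available as a plain dict in the JSON shape shown earlier in the thread, and the names `PRE_DRAIN_CONSTRAINT` and `with_pre_drain_constraint` are invented for this example. Registering the updated definition and updating the services is left out.

```python
import copy

# The constraint from this thread, including the "!exists" tweak.
PRE_DRAIN_CONSTRAINT = {
    "type": "memberOf",
    "expression": "attribute:state !exists or attribute:state != pre-drain",
}


def with_pre_drain_constraint(task_definition):
    """Return a copy of the task definition that includes the pre-drain constraint.

    The input dict is not modified, and the constraint is added only once,
    so the helper can be run repeatedly over the same definitions.
    """
    updated = copy.deepcopy(task_definition)
    constraints = updated.setdefault("placementConstraints", [])
    if PRE_DRAIN_CONSTRAINT not in constraints:
        constraints.append(dict(PRE_DRAIN_CONSTRAINT))
    return updated
```

For example, `with_pre_drain_constraint({"family": "web"})` would return a dict with a single-entry `placementConstraints` list, where `web` is a placeholder family name.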

aaithal closed this as completed Jan 24, 2017
