How to preserve currently running ECS Instances during update to the latest AMI #672

Closed
vmogilev opened this Issue Jan 17, 2017 · 25 comments


vmogilev commented Jan 17, 2017

I am in the process of testing the update to the latest Amazon ECS-optimized AMI (amzn-ami-2016.09.d-amazon-ecs-optimized).

Our current ECS Instances are running amzn-ami-2015.09.g-amazon-ecs-optimized which at the time of the launch pulled the following stack:

Docker: 1.9.1
ECS Agent: 1.8.2

I don't think it's a good idea to simply update launch configuration with the new AMI and hope for the best. What if things fail under load, what if we discover a bug with the new AMI/Docker/Agent combo running our containers? These are all possibilities and we need to mitigate the risks by preserving our old instances while the new instances are burning-in under production load. Once we feel solid - we can terminate the old instances.

I can't figure out how to do this. Here's what I tried:

  1. I updated the launch configuration for the Auto Scaling Group and doubled the number of instances in it. End result I have 4 instances with OLD AMI and 4 instances with NEW AMI. Good!

  2. I then updated the ECS Service and increased its number of Tasks from 4 to 8. End result: 4 new tasks were started on the NEW AMI Instances and 4 original tasks are still running on the OLD AMI Instances. Good!

All good at this point. Next I need to stop the tasks on the 4 OLD AMI Instances and somehow keep these OLD AMI Instances in reserve while we burn in the 4 NEW AMI Instances. Here's what I tried:

  1. I set the status of the 4 OLD AMI Instances to "Standby" (in the ASG). I was expecting the ECS Agent on these OLD AMI Instances to terminate all running ECS Tasks. No dice!

  2. I then reduced ECS Service task number from 8 to 4 hoping that ECS Agent will terminate the Tasks on the OLD AMI Instances. No dice! It terminated TASKS on random instances mixing NEW/OLD in the process.

  3. I then decided to help ECS Agent and manually (one at a time) stopped running TASKS on the OLD AMI Instances hoping that ECS Agent will NOT re-launch the TASKS on the OLD AMI Instances. No dice -- it still managed to launch some tasks on the OLD AMI Instances.

At this point I am lost. Is this even possible?

One option I am considering is using task-placement-constraints, but I am hoping someone here has gone through this basic need and can share their ideas with me.

I feel we should have a way to mark ECS Instances as StandBy and have the ECS Agent not schedule any tasks on them for as long as that status is active. I don't think "Deregister" functionality is sufficient here because there is no way that I know of to bring deregistered instances back into service.

I also don't like that a specific version of Docker/ECS Agent is not pinned to a specific version of Amazon ECS-optimized AMI. If it were - this would not be an issue, I could always bring back a known, good working set of versions into service. But as it is now - even if I used an older AMI - it will pull in the most recent version of ECS Agent and Docker on launch.

Thank you!

vmogilev commented Jan 18, 2017

I solved this problem using task-placement-constraints

hridyeshpant commented Jan 19, 2017

@vmogilev can you please tell us how you managed to do this using task-placement-constraints? The main problems I am facing while updating the cluster are:
1. How do we make sure new tasks are launched on new instances (new AMI) when we increase the task count from 4 to 8?
2. How do we make sure old tasks are rescheduled onto new instances (new AMI) when we reduce the number of tasks from 8 to 4?
3. How do we automate this, as we are running more than 50 instances and more than 300 services?

vmogilev commented Jan 20, 2017

@hridyeshpant - It took me a day to come up with the solution so I decided to document it before I forget. Hopefully it'll help someone else in the future. Here's the process I came up with:

Blue/Green ECS-optimized AMI Update For ECS Instances

Hope this helps! If you have any questions - don't hesitate to ask.

Member

aaithal commented Jan 24, 2017

@vmogilev Thank you so much for writing this up! Please let us know if you need any more assistance here. I'm closing this issue for now.

aaithal closed this Jan 24, 2017

Member

samuelkarp commented Jan 24, 2017

@vmogilev @hridyeshpant Today we launched Container Instance Draining to address this use case. Once you launch your new instances with the new AMI, you can set the old instances to DRAINING and the service will move tasks to the new instances for you automatically.
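
As a rough illustration of the DRAINING step described above, here is a minimal Python sketch. It assumes a boto3-style ECS client is passed in (for example `boto3.client("ecs")`). The helper names `chunked` and `drain_instances` are hypothetical, and the batch size of 10 instances per call is an assumption; check the `UpdateContainerInstancesState` API reference for the actual limit.

```python
from typing import Any, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def drain_instances(ecs: Any, cluster: str, instance_arns: List[str],
                    batch_size: int = 10) -> int:
    """Set the given container instances to DRAINING.

    `ecs` is expected to behave like a boto3 ECS client.
    Returns the number of API calls made.
    """
    calls = 0
    for batch in chunked(instance_arns, batch_size):
        # Assumed per-call limit: batch_size container instances.
        ecs.update_container_instances_state(
            cluster=cluster,
            containerInstances=batch,
            status="DRAINING",
        )
        calls += 1
    return calls
```

Passing `status="ACTIVE"` in the same call should put an instance back into service, which is what makes this usable as a "keep the old instances in reserve" switch.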

vmogilev commented Jan 24, 2017

Wow @samuelkarp - you guys rock! Thank you

hridyeshpant commented Jan 26, 2017

great @samuelkarp :)

hridyeshpant commented Jan 26, 2017

@samuelkarp is there a way to always place existing tasks from a DRAINING instance onto new instances (created from the new AMI)? We can't use task-placement-constraints, as this would require creating a new version of the task definition and service. We also can't put all old-AMI instances (60 at this time) into the DRAINING state at once, as this would cause throttling in the Docker registry.

vmogilev commented Jan 26, 2017

@hridyeshpant what are the constraints that are preventing you from creating a new task definition?

hridyeshpant commented Jan 26, 2017

@vmogilev we are running 20-30 services per instance. So if I need to use constraints, I first need to update all 30 task definitions and then update 30 services to use the new task definitions during the cluster update.
I am looking for something where I can tell the ECS scheduler to deploy tasks to new instances (based on AMI ID) without worrying about updating a bunch of task definitions and services.

Member

samuelkarp commented Jan 27, 2017

@hridyeshpant I'm not sure I understand your question. When you place an instance into the DRAINING status, services with tasks on those instances will stop and replace them according to the service deployment configuration parameters; new tasks are not launched on instances in the DRAINING status.

hridyeshpant commented Jan 27, 2017

@samuelkarp my use case is: I always want services (running on a DRAINING instance) to move to new instances (created from the new AMI), not to just any instance attached to the cluster.
So I am using a combination of the DRAINING feature with task-placement-constraints. The only issue I am seeing is that I need to update all task definitions and services with new placementConstraints containing the new AMI ID whenever an instance is DRAINING.

Let's say I am running 20-30 tasks per instance with these placementConstraints:

"placementConstraints": [
   {
       "expression": "attribute:ecs.ami-id == old_ami_id",
       "type": "memberOf"
   }
],

Now I put one instance in the DRAINING state and launch a new instance with the new AMI. I now need to update all 20-30 task definitions with new placementConstraints containing the new AMI ID, and also update all the services to the new task definition version, so that all services on the DRAINING instance move only to new instances (with the new AMI).
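
The same `ecs.ami-id` attribute can also be used from the outside, to find which container instances were built from a given AMI without touching any task definition. The following is a minimal sketch, assuming a boto3-style ECS client is passed in and that `list_container_instances` accepts a cluster query language `filter`; the function name `instances_with_ami` is hypothetical.

```python
from typing import Any, Dict, List


def instances_with_ami(ecs: Any, cluster: str, ami_id: str) -> List[str]:
    """Return ARNs of ACTIVE container instances registered with `ami_id`.

    `ecs` is expected to behave like a boto3 ECS client.
    """
    kwargs: Dict[str, str] = {
        "cluster": cluster,
        # Same expression syntax as in placementConstraints.
        "filter": "attribute:ecs.ami-id == %s" % ami_id,
        "status": "ACTIVE",
    }
    arns: List[str] = []
    while True:
        response = ecs.list_container_instances(**kwargs)
        arns.extend(response.get("containerInstanceArns", []))
        token = response.get("nextToken")
        if not token:
            return arns
        # Follow pagination until the service stops returning a token.
        kwargs["nextToken"] = token
```

The resulting list is the natural input for a drain or attribute-tagging step on the old instances.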

Member

samuelkarp commented Jan 27, 2017

Is there a particular reason you have the AMI ID in the task definition? If you were just using that as a stop-gap before DRAINING came out, do you still need it?

hridyeshpant commented Jan 27, 2017

@samuelkarp
If I don't use the AMI ID in placementConstraints, how do I move old tasks to new instances?
If you can point me to a way to solve my use case, that will help.
Use case:
During my cluster update, I always want services (running on a DRAINING instance) to move to new instances (created from the new ECS AMI), not to any old running instances attached to the cluster.
I don't want to update 20-30 task definitions and services via placementConstraints and update the AMI ID during my cluster update process.
We are running more than 500 services per cluster, so placementConstraints doesn't look like a good option, since it requires updating 500 task definitions and then the services.

Member

samuelkarp commented Jan 27, 2017

If I don't use the AMI ID in placementConstraints, how do I move old tasks to new instances?

Mark your old instances as DRAINING and the ECS service scheduler will move them according to the service deployment configuration parameters.

hridyeshpant commented Jan 27, 2017

@samuelkarp but if I just mark old instances as DRAINING, ECS can put those services on any running instance, which can be an old instance with the old AMI.
Am I missing something here?

Member

samuelkarp commented Jan 27, 2017

Am I missing something here?

Maybe we're both speaking past each other?

  • Launch some new instances
  • Mark old instances DRAINING
  • Tasks move off the DRAINING instances
  • After all the tasks are off the drained instances, terminate your old instances

We have a blog post about how to do this.
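
For the last step in the list above (terminate only after the tasks are off), a check along these lines may help. This is a minimal sketch, assuming a boto3-style ECS client is passed in and that `describe_container_instances` returns `status`, `runningTasksCount`, `pendingTasksCount` and `ec2InstanceId` for each instance. The function name `drained_ec2_ids` and the batch size of 100 per call are assumptions.

```python
from typing import Any, List


def drained_ec2_ids(ecs: Any, cluster: str, instance_arns: List[str],
                    batch_size: int = 100) -> List[str]:
    """Return EC2 instance IDs that are DRAINING and have no tasks left.

    `ecs` is expected to behave like a boto3 ECS client.
    """
    drained: List[str] = []
    for start in range(0, len(instance_arns), batch_size):
        batch = instance_arns[start:start + batch_size]
        response = ecs.describe_container_instances(
            cluster=cluster,
            containerInstances=batch,
        )
        for instance in response["containerInstances"]:
            idle = (instance["runningTasksCount"] == 0
                    and instance["pendingTasksCount"] == 0)
            # Only report instances that were actually put into DRAINING.
            if instance["status"] == "DRAINING" and idle:
                drained.append(instance["ec2InstanceId"])
    return drained
```

The returned EC2 instance IDs are the ones that look safe to terminate; anything still running on a DRAINING instance keeps it out of the list.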

hridyeshpant commented Jan 27, 2017

@samuelkarp so when we say tasks move off the DRAINING instances,
let's say:

  • I have 3 instances with ami-id 1234
  • I launch 1 new instance with ami-id 5678
  • I mark one old instance (ami-id 1234) as DRAINING. At this stage, will the ECS scheduler always place tasks on the newly launched instance (ami-id 5678), or on any instance (2 old AMI and 1 new AMI)?

Member

samuelkarp commented Jan 27, 2017

At this stage, will the ECS scheduler always place tasks on the newly launched instance (ami-id 5678), or on any instance (2 old AMI and 1 new AMI)?

Any of the non-DRAINING instances.

hridyeshpant commented Jan 27, 2017

That is not my use case. I always want to deploy to the newly launched instances (ami-id 5678). That is what I mentioned earlier about my use case and using a combination of the DRAINING feature with task-placement-constraints.

Member

samuelkarp commented Jan 27, 2017

I think I'm misunderstanding your use-case then. It's probably worth opening a new issue if you want to go into this in more depth.

hridyeshpant commented Jan 27, 2017

Yeah, maybe I was not clear.
My use case: I always want to deploy to the newly launched instances (ami-id 5678). That is what I mentioned earlier about using a combination of the DRAINING feature with task-placement-constraints, but the issue is unnecessarily updating task definitions and services.

Member

samuelkarp commented Jan 27, 2017

@hridyeshpant If I understand your use-case: you have a large number of instances in a cluster. You'd like to replace all the instances in the cluster over a period of time rather than all at once, and you'd like to ensure that tasks only get moved once rather than potentially moving more than once due to starting on instances that you're going to get rid of anyway.

I think you can accomplish this by using DRAINING plus an anti-affinity task placement constraint in your task definition. For example:

Step 1: Add a custom constraint to all task definitions to not place tasks on instances with a custom attribute state=pre-drain. You only need to do this once.

"placementConstraints": [
   {
       "expression": "attribute:state!=pre-drain",
       "type": "memberOf"
   }
]

Step 2: Launch the instances with the new AMI in your cluster.

Step 3: Set the custom attribute state=pre-drain on the old instances in the cluster. New task launches will now only go to the new instances.

Step 4: Set X% of the old instances to the DRAINING state. When tasks are rescheduled they will only end up on new instances.

Step 5: Continue to increase the percentage of instances in the DRAINING state. When you have 100% of the old instances in the DRAINING state and tasks successfully moved, terminate the instances.
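
Steps 3 to 5 above could be scripted roughly as follows. This is a minimal sketch, assuming a boto3-style ECS client is passed in and that `put_attributes` takes a list of `name`/`value`/`targetType`/`targetId` entries. The helper names `mark_pre_drain` and `select_for_draining` are hypothetical, and the batch size of 10 attributes per call is an assumption.

```python
from typing import Any, List


def mark_pre_drain(ecs: Any, cluster: str, instance_arns: List[str],
                   batch_size: int = 10) -> None:
    """Step 3: set the custom attribute state=pre-drain on old instances."""
    for start in range(0, len(instance_arns), batch_size):
        batch = instance_arns[start:start + batch_size]
        ecs.put_attributes(
            cluster=cluster,
            attributes=[
                {
                    "name": "state",
                    "value": "pre-drain",
                    "targetType": "container-instance",
                    "targetId": arn,
                }
                for arn in batch
            ],
        )


def select_for_draining(old_arns: List[str], draining_arns: List[str],
                        target_percent: int) -> List[str]:
    """Steps 4-5: pick the additional old instances to drain so that
    `target_percent` of `old_arns` (rounded up) are DRAINING."""
    target = (len(old_arns) * target_percent + 99) // 100  # integer ceiling
    already = set(draining_arns)
    candidates = [arn for arn in old_arns if arn not in already]
    in_progress = len(old_arns) - len(candidates)
    needed = max(0, target - in_progress)
    return candidates[:needed]
```

Raising `target_percent` in small increments (for example 10, 25, 50, 100) and waiting for tasks to settle between rounds keeps the number of simultaneous image pulls bounded, which is the registry throttling concern raised earlier in the thread.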

hridyeshpant commented Jan 27, 2017

@samuelkarp thanks a lot, this will definitely work.

mattcallanan commented Nov 21, 2017

Updating this thread to say this approach does work well.
One minor tweak was to add "state !exists or" to the expression, as instances by default won't have the state attribute:

      placement_constraints = [{
          type: 'memberOf',
          expression: 'attribute:state !exists or attribute:state != pre-drain'
      }]
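
For clusters with hundreds of task definitions, the one-time constraint change could be applied programmatically. The sketch below is only an illustration: it takes the dict returned under `taskDefinition` by a `describe_task_definition` call, drops fields that are returned by the service but not accepted on registration, and appends the expression above if it is missing. The function name `with_pre_drain_constraint` is hypothetical, and the `READ_ONLY_KEYS` list is an assumption that may be incomplete.

```python
import copy
from typing import Any, Dict

# Assumed set of fields returned by describe_task_definition that
# register_task_definition does not accept; verify against the API docs.
READ_ONLY_KEYS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
}

PRE_DRAIN_EXPRESSION = "attribute:state !exists or attribute:state != pre-drain"


def with_pre_drain_constraint(task_definition: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `task_definition` suitable for re-registration,
    with the pre-drain anti-affinity constraint added exactly once."""
    new_def = {
        key: copy.deepcopy(value)
        for key, value in task_definition.items()
        if key not in READ_ONLY_KEYS
    }
    constraints = new_def.setdefault("placementConstraints", [])
    if not any(c.get("expression") == PRE_DRAIN_EXPRESSION for c in constraints):
        constraints.append({"type": "memberOf", "expression": PRE_DRAIN_EXPRESSION})
    return new_def
```

The result would then be passed as keyword arguments to `register_task_definition`, followed by an `update_service` call pointing each service at the new revision.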