How to preserve currently running ECS Instances during update to the latest AMI #672

@vmogilev

I am in the process of testing an update to the latest Amazon ECS-optimized AMI (amzn-ami-2016.09.d-amazon-ecs-optimized).

Our current ECS Instances are running amzn-ami-2015.09.g-amazon-ecs-optimized, which at launch time pulled in the following stack:

Docker: 1.9.1
ECS Agent: 1.8.2

I don't think it's a good idea to simply update the launch configuration with the new AMI and hope for the best. What if things fail under load? What if we discover a bug with the new AMI/Docker/Agent combo running our containers? These are all possibilities, and we need to mitigate the risk by preserving our old instances while the new instances burn in under production load. Once we feel confident in the new instances, we can terminate the old ones.

I can't figure out how to do this. Here's what I tried:

  1. I updated the launch configuration for the Auto Scaling Group and doubled the number of instances in it. End result: I have 4 instances with the OLD AMI and 4 instances with the NEW AMI. Good!

  2. I then updated the ECS Service and increased its number of tasks from 4 to 8. End result: 4 new tasks were started on the NEW AMI Instances, and the 4 original tasks are still running on the OLD AMI Instances. Good!
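
For anyone who wants to script the two steps above, here is a minimal sketch using boto3-style clients. The function name and all argument values are hypothetical, it assumes one task per instance, and it assumes the group's MaxSize has to be raised along with DesiredCapacity.

```python
def scale_out_with_new_ami(autoscaling, ecs, *, asg_name, new_launch_config,
                           cluster, service, current_count):
    """Double the ASG and the service so OLD and NEW instances run side by side.

    `autoscaling` and `ecs` are boto3 clients, e.g. boto3.client("autoscaling")
    and boto3.client("ecs"). All names passed in are placeholders.
    """
    doubled = current_count * 2

    # Point the ASG at the new launch configuration. Existing instances are
    # left untouched; only newly launched instances use the new AMI.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        LaunchConfigurationName=new_launch_config,
        MaxSize=doubled,          # assumption: the old MaxSize was too low
        DesiredCapacity=doubled,
    )

    # Raise the service's desired task count to match the new capacity.
    ecs.update_service(cluster=cluster, service=service, desiredCount=doubled)
    return doubled
```

In practice you would probably wait for the new instances to register with the cluster before raising the task count; that wait is left out of the sketch.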

All good at this point. Next I need to stop the tasks on the 4 OLD AMI Instances and somehow keep these OLD AMI Instances in reserve while we burn in the 4 NEW AMI Instances. Here's what I tried:

  1. I set the status of the 4 OLD AMI Instances to "Standby" (in the ASG). I was expecting the ECS Agent on these OLD AMI Instances to terminate all running ECS tasks. No dice!

  2. I then reduced the ECS Service task count from 8 to 4, hoping that the ECS Agent would terminate the tasks on the OLD AMI Instances. No dice! It terminated tasks on random instances, mixing NEW and OLD in the process.

  3. I then decided to help the ECS Agent and manually stopped the running tasks on the OLD AMI Instances (one at a time), hoping that the ECS Agent would NOT re-launch them on the OLD AMI Instances. No dice: it still managed to launch some tasks on the OLD AMI Instances.
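
Steps 1 and 3 above can be scripted roughly as follows. This is only a sketch of what I did by hand, with a hypothetical function name and placeholder arguments. As described above, it does not stop ECS from placing tasks on the OLD AMI Instances again.

```python
def standby_and_stop_tasks(autoscaling, ecs, *, asg_name, cluster,
                           ec2_instance_ids, container_instance_arns):
    """Put OLD instances into ASG Standby, then stop the tasks running on them.

    `autoscaling` and `ecs` are boto3 clients. `ec2_instance_ids` and
    `container_instance_arns` refer to the same OLD instances (placeholders).
    """
    # Step 1: move the OLD EC2 instances to Standby in the Auto Scaling Group.
    autoscaling.enter_standby(
        InstanceIds=ec2_instance_ids,
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=True,
    )

    # Step 3: stop every task currently running on the OLD container instances.
    stopped = []
    for instance_arn in container_instance_arns:
        # list_tasks returns one page of results; pagination is omitted here.
        response = ecs.list_tasks(cluster=cluster, containerInstance=instance_arn)
        for task_arn in response.get("taskArns", []):
            ecs.stop_task(cluster=cluster, task=task_arn,
                          reason="Moving tasks off OLD AMI instance")
            stopped.append(task_arn)
    return stopped
```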

At this point I am lost. Is this even possible?

One option I am considering is using task-placement-constraints, but I am hoping someone here has already dealt with this basic need and can share their ideas with me.
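
To make the placement-constraint idea concrete, here is a small sketch that builds a `memberOf` constraint on the built-in `ecs.ami-id` attribute. The helper name and the AMI ID are hypothetical, and I have not verified that this solves the problem end to end.

```python
def ami_constraint(ami_id, *, exclude=False):
    """Build a placementConstraints list that pins tasks to (or away from) an AMI.

    The expression uses the ECS cluster query language with the built-in
    `ecs.ami-id` container instance attribute.
    """
    operator = "!=" if exclude else "=="
    return [{
        "type": "memberOf",
        "expression": f"attribute:ecs.ami-id {operator} {ami_id}",
    }]


# Placeholder AMI ID, for illustration only.
only_new_ami = ami_constraint("ami-0123456789abcdef0")
```

The resulting list could then be passed as `placementConstraints` when registering a task definition (for example with boto3's `register_task_definition`), so that tasks from that revision land only on instances built from the given AMI.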

I feel we should have a way to mark ECS Instances as Standby and have the ECS Agent not schedule any tasks on them for as long as that status is active. I don't think the "Deregister" functionality is sufficient here, because there is no way that I know of to bring deregistered instances back into service.

I also don't like that a specific version of Docker/ECS Agent is not pinned to a specific version of the Amazon ECS-optimized AMI. If it were, this would not be an issue: I could always bring a known-good set of versions back into service. But as it is now, even if I used an older AMI, it would pull in the most recent version of the ECS Agent and Docker on launch.

Thank you!
