
ECS CodeDeploy canary deployments #229

Closed
clareliguori opened this issue Mar 28, 2019 · 16 comments
Labels
ECS Amazon Elastic Container Service Fargate AWS Fargate

Comments

@clareliguori
Member

Similar to ECS blue-green deployments with AWS CodeDeploy, but shift a percentage of production traffic to the green fleet and monitor rollback alarms, before shifting 100% of traffic.

@abby-fuller abby-fuller added Fargate AWS Fargate ECS Amazon Elastic Container Service labels Mar 28, 2019
@jespersoderlund

There also needs to be a mode where you programmatically choose to promote the canary, not only relying on alarms. We've built this kind of orchestration on top of the existing ECS functionality, but there's a lot of complexity there that would be good to have provided by the service.

@clareliguori
Member Author

@jespersoderlund what gates your programmatic promotions? Integration tests, manual testing, other metric sources, etc?

CodeDeploy Hooks allow for programmatic promotion between each step in the deployment lifecycle. It sounds like you need the ability to invoke a hook when a percentage of production traffic is shifted.
https://docs.aws.amazon.com/codedeploy/latest/userguide/reference-appspec-file-structure-hooks.html#appspec-hooks-ecs
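For the ECS case, such a hook is typically a Lambda function that reports back via the PutLifecycleEventHookExecutionStatus API. A minimal sketch is below; the handler and `run_validation` names are illustrative, and the validation step is a placeholder for whatever gate (integration tests, metric queries) applies:

```python
# Sketch of a CodeDeploy lifecycle hook handler for an ECS deployment.
# run_validation() is a placeholder gate, not a real CodeDeploy construct.

def run_validation():
    """Placeholder: run integration tests or query metrics here."""
    return True

def handler(event, context, client=None):
    if client is None:
        import boto3  # in a real Lambda, create the client at module scope
        client = boto3.client("codedeploy")
    # Report the gate's outcome back to CodeDeploy so the deployment
    # either proceeds to the next lifecycle event or rolls back.
    status = "Succeeded" if run_validation() else "Failed"
    client.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status=status,
    )
    return status
```

The `client` parameter is only there so the function can be exercised without AWS credentials; a deployed hook would use the real boto3 client.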

@jespersoderlund

We have 2 types of gates that we implement today in addition to a basic "promote-if-healthy".

  • Manual, where the deployment will wait until a manual approval is made
  • Metrics, where a set of standard alerts that we define can be combined with custom-defined metrics (we're building and operating a platform for internal dev teams)

The problem with the hooks is that a hook is only called once. In the manual canary promotion case, we want the deployment to stop and wait for input, since it will be a completely async process with an unknown time between "stop" and "promote/rollback".

For the metrics gate, the timing can also be unpredictable, since some services might need longer to gather enough data to make a metrics-based decision on whether to proceed or not.

In both cases there must be a timeout with rollback.

@clareliguori
Member Author

The CodeDeploy hook does stop and wait for input. It does not continue the deployment based on the function's success -- the function can actually go off and trigger some other async workflow, or notify someone that a manual approval is needed. The hook waits for something (a function, an async workflow, a person) to call the PutLifecycleEventHookExecutionStatus API. The hook timeout is configurable up to an hour, default is 30 minutes, and can trigger rollbacks.

@jespersoderlund

Right, that would work then! % of traffic + hooks to allow the other types of canary-promotion triggers.

@dsouzajude

It would also be great to support canary deploys for services that don't need to be behind an ALB or associated with any target group. Currently Blue-Green Deployments with ECS only support services that have an associated target group and are behind an ALB.

In some cases, like ours, we have an API service through which all traffic gets routed down to downstream (backend) services, but these reside in private subnets and have no need of an ALB (traffic is routed to them using HAProxy). We'd like to have canary deployment support for these as well, and these deploys can be monitored by custom metrics that we have in place (such as the ones we get from logs, i.e. errors or HAProxy metrics).

We'd also like CloudFormation support for this.

Just some feedback from my side.

@clareliguori
Member Author

@dsouzajude (and others!) For non-load-balanced services, how would you expect the shifting behavior to look?

For example, a blue-green-ish deployment with initial canary:

  • Prior: old version is at 100% of desired count
  • Step 1: set new version to 10% of desired count (110% total of desired count, new version takes ~9% of traffic)
  • Step 2: set new version to 100% of desired count (200% total of desired count, new version takes 50% of traffic)
  • Step 3: set old version to 0% of desired count (100% total of desired count, new version takes 100% of traffic)

Or perhaps a more linear progression to limit overprovisioning:

  • Prior: old version is at 100%
  • Step 1: set new version to 10% of desired count (110% total of desired count, new version takes ~9% of traffic)
  • Step 2: set old version to 90% of desired count (100% total of desired count, new version takes ~10% of traffic)
  • Step 3: set new version to 20% of desired count (110% total of desired count, new version takes ~18% of traffic)
  • Step 4: set old version to 80% of desired count (100% total of desired count, new version takes ~20% of traffic)
  • And so on, adding 10% more to new version and removing 10% from old version, until new version is at 100% and old version is at 0%
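The linear progression above can be written as a small calculation. This is only a sketch of the example's alternating scale-up/scale-down steps with 10% increments, not an ECS API:

```python
def linear_canary_steps(increment_pct=10):
    """Yield (old_pct, new_pct, approx_new_traffic_pct) for each step,
    alternating scale-up of the new version and scale-down of the old."""
    steps = []
    new_pct = 0
    while new_pct < 100:
        new_pct += increment_pct
        # Scale up the new version first (brief overprovisioning above 100%).
        old_pct = 100 - (new_pct - increment_pct)
        steps.append((old_pct, new_pct, round(100 * new_pct / (old_pct + new_pct))))
        # Then scale down the old version back to 100% total capacity.
        old_pct = 100 - new_pct
        steps.append((old_pct, new_pct, new_pct))
    return steps
```

The first four tuples reproduce Steps 1-4 above: (100, 10, ~9%), (90, 10, 10%), (90, 20, ~18%), (80, 20, 20%).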

@mridehalgh

mridehalgh commented Apr 2, 2019

@clareliguori one thing that would be great would be support for a different baseline. Instead of comparing against the old version, the baseline would run the same version as the old version but receive the same share of traffic as the new canary, for example comparing a 5% baseline against 5% of the new version. The goal with this is to reduce the likelihood of anything interfering with the analysis.

For example, does a newly provisioned service perform more slowly or more quickly than a warm service?

Spinnaker probably explains this better than I can: https://www.spinnaker.io/guides/user/canary/best-practices/#compare-canary-against-baseline-not-against-production

@clareliguori
Member Author

clareliguori commented Apr 2, 2019

@mridehalgh Tell me more about how you would use the baseline and canary metrics, and where your metrics are stored (CloudWatch, other?). Spinnaker's Kayenta system compares baseline vs canary metrics using a threshold for how far apart the metric values can be to promote the deployment, while CodeDeploy uses absolute thresholds specified in CloudWatch alarms.

Btw, Spinnaker does have some support for ECS, see details in issue #234
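The difference between the two comparison styles can be illustrated with a toy check (the threshold values here are made up for the example):

```python
def relative_check(baseline, canary, max_ratio=1.2):
    """Kayenta-style: promote only if the canary's metric is at most
    20% worse than the baseline's, whatever the baseline happens to be."""
    return canary <= baseline * max_ratio

def absolute_check(canary, alarm_threshold=500.0):
    """CloudWatch-alarm-style: a fixed ceiling on the canary's metric,
    independent of any baseline measurement."""
    return canary <= alarm_threshold
```

With a latency metric, for instance, `relative_check(100, 110)` passes while `relative_check(100, 130)` fails, regardless of where the absolute alarm threshold sits.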

@dsouzajude

@dsouzajude (and others!) For non-load-balanced services, how would you expect the shifting behavior to look?

For example, a blue-green-ish deployment with initial canary:

  • Prior: old version is at 100% of desired count
  • Step 1: set new version to 10% of desired count (110% total of desired count, new version takes ~9% of traffic)
  • Step 2: set new version to 100% of desired count (200% total of desired count, new version takes 50% of traffic)
  • Step 3: set old version to 0% of desired count (100% total of desired count, new version takes 100% of traffic)

Or perhaps a more linear progression to limit overprovisioning:

  • Prior: old version is at 100%
  • Step 1: set new version to 10% of desired count (110% total of desired count, new version takes ~9% of traffic)
  • Step 2: set old version to 90% of desired count (100% total of desired count, new version takes ~10% of traffic)
  • Step 3: set new version to 20% of desired count (110% total of desired count, new version takes ~18% of traffic)
  • Step 4: set old version to 80% of desired count (100% total of desired count, new version takes ~20% of traffic)
  • And so on, adding 10% more to new version and removing 10% from old version, until new version is at 100% and old version is at 0%

@clareliguori I would prefer a blue-green-ish deploy over a linear progression.

Another option, which we currently use in our non-ECS environment, is to specify how many instances of the new canary should be allowed to run (i.e. the desired count itself, which could also be expressed as a percentage of the desired count), and to mark this as a "canary" deployment. During this canary deployment, we observe how it performs with respect to performance and functionality (i.e. errors, expected behaviour, and other custom metrics), and we let it run indefinitely for X days (sometimes over the weekend or overnight) to gain more confidence in the canary deploy. Only then do we manually complete the canary deployment by switching traffic over completely to the new canary.

Since we already know the service was deployed as a canary, on the next deploy we could have the option to:

  1. Complete the canary (i.e. complete the blue-green-ish deployment as you mentioned above), or
  2. Manually increase the desired count again and wait for some time to test the canary further with more instances of it (more traffic to it), or
  3. Roll back the canary and shift traffic to the old version if we are not satisfied with the results.

Hope that makes sense. I could explain more if you require more details about my use-case.

@deleugpn

deleugpn commented Apr 3, 2019

For me, if we at least had CloudFormation support for blue/green, that would be fantastic. Where I work, AWS only exists to the extent of its CloudFormation support.

@clareliguori
Member Author

@deleugpn yep, we're tracking that in issue #130

@dsouzajude

Just wanted to confirm: what would desiredCount be in this case? Would it be the desiredCount set when the service was originally configured, or the desiredCount at runtime (i.e. the current desiredCount), which may have been adjusted automatically by service auto scaling?

Just wanted to add that, on promoting the canary (or during the canary), the desiredCount at runtime should be used, not the desiredCount that was set originally. I ask because we've hit this issue before when using the boto3 API: on updating the service we needed to provide a desiredCount, and this needs to be the current desiredCount (which may have changed due to autoscaling), but from what I understand ECS doesn't take this into account automatically.

Thanks!

@nathanpeck
Member

nathanpeck commented May 29, 2019

@dsouzajude CodeDeploy deployments use task sets under the hood, which have a scale attribute that is a percentage of the service's desired count. So if the service's desiredCount is 10 and the service has two task sets at scale = 100%, each task set will have 10 tasks. If autoscaling occurs and increases the service's desiredCount to 11, each task set will launch an additional task so that each task set has 11 tasks.
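That relationship can be sketched as follows (a simplification that assumes a task set's task count scales directly with its `scale` percentage, rounded up):

```python
import math

def task_set_task_count(service_desired_count, scale_pct):
    """Tasks a task set runs: scale% of the service's current desiredCount.
    Autoscaling changes desiredCount, and each task set follows it."""
    return math.ceil(service_desired_count * scale_pct / 100)
```

So with desiredCount 10 and two task sets at scale = 100%, each runs 10 tasks; if autoscaling raises desiredCount to 11, each runs 11.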

@ghost

ghost commented Nov 15, 2019

@coultn do you know if this will be supported by cloudformation when it's released?

@KiamarzFallahi

KiamarzFallahi commented Feb 7, 2020

We are pleased to announce that your containers hosted on Amazon Elastic Container Service (Amazon ECS) can now be updated using canary or linear deployment strategies by using AWS CodeDeploy.

For more information see our announcement, visit our new blog and see the technical documentation.
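Assuming the deployment group is already configured for blue/green, kicking off a deployment with one of the predefined canary configs looks roughly like this. The application and group names are placeholders; `CodeDeployDefault.ECSCanary10Percent5Minutes` is one of the predefined ECS canary configs per the CodeDeploy documentation:

```python
# Sketch only: names below are placeholders for your own resources.
def start_canary_deployment(appspec_text, client=None):
    if client is None:
        import boto3  # real use requires configured AWS credentials/region
        client = boto3.client("codedeploy")
    return client.create_deployment(
        applicationName="my-ecs-app",        # placeholder
        deploymentGroupName="my-ecs-dg",     # placeholder
        deploymentConfigName="CodeDeployDefault.ECSCanary10Percent5Minutes",
        revision={
            "revisionType": "AppSpecContent",
            "appSpecContent": {"content": appspec_text},
        },
    )
```

The `client` parameter is only for exercising the function without AWS access; the appspec content is whatever your ECS task definition and hooks require.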
