
[aws-eks] Unexpected node replacement due to using SSM for AMIs #7273

Closed
stefanolczak opened this issue Apr 9, 2020 · 2 comments · Fixed by #9746
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. effort/small Small work item – less than a day of effort p1

Comments

stefanolczak commented Apr 9, 2020

#4156 introduced retrieving the AMI ID for EKS worker nodes from SSM by referencing an SSM parameter in the CloudFormation template. It's a great feature for new deployments, but it causes trouble when updating already-deployed stacks with EKS worker nodes. The value of the parameter changes whenever AWS releases a new version of the AMI; for example, yesterday it changed from amazon-eks-node-1.14-v20200312 to amazon-eks-node-1.14-v20200406. The main problem is that this change is not shown in cdk diff. But if we change anything else in the stack, even something irrelevant, CloudFormation implicitly updates all launch configurations used by the AutoScalingGroups of the EKS worker nodes. That results in the replacement of every EC2 instance (EKS worker node) in every ASG without any notification before deployment. It's really frustrating because we don't know what will be updated during the deployment, and replacing every node is a big change we should be notified about before the deployment starts. Can we at least detect the change in cdk diff?

Reproduction Steps

  1. Deploy a stack with an EKS cluster. Do not use managed node groups; just add ASGs using addCapacity() (see the sketch after this list).
  2. Wait until AWS changes the recommended AMI ID in the SSM Parameter Store.
  3. Try to deploy the stack and notice that no changes are detected.
  4. Change something else in the stack (e.g. the max capacity of an ASG).
  5. Deploy the stack and notice that CloudFormation updates the launch configurations with the new AMI ID.
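
A minimal stack along these lines is sketched below. This is an assumption-laden sketch against the v1 @aws-cdk/aws-eks API, not taken from the original report: construct IDs and the instance type are placeholders, and cluster options such as the Kubernetes version and VPC are omitted or simplified.

import * as cdk from '@aws-cdk/core';
import * as ec2 from '@aws-cdk/aws-ec2';
import * as eks from '@aws-cdk/aws-eks';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'EksSsmAmiRepro');

// No managed node groups; capacity is added via addCapacity() instead.
const cluster = new eks.Cluster(stack, 'EksCluster', {
  defaultCapacity: 0,
});

// The launch configuration behind this ASG resolves its AMI from the managed
// SSM parameter at deploy time, which is what keeps the AMI drift invisible
// to cdk diff until some other change triggers a launch configuration update.
cluster.addCapacity('Nodes', {
  instanceType: new ec2.InstanceType('t2.medium'),
  minCapacity: 3,
});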

Error Log

2020-04-09 09:41:00 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CLaunchConfigD2E4A6CF | UPDATE_IN_PROGRESS | Requested update requires the creation of a new physical resource; hence creating one.
2020-04-09 09:41:00 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BLaunchConfig5CCFDD66 | UPDATE_IN_PROGRESS | Requested update requires the creation of a new physical resource; hence creating one.
2020-04-09 09:41:01 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1ALaunchConfigC286DD69 | UPDATE_IN_PROGRESS | Requested update requires the creation of a new physical resource; hence creating one.
2020-04-09 09:41:01 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CLaunchConfigD2E4A6CF | UPDATE_IN_PROGRESS | Resource creation Initiated
2020-04-09 09:41:01 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BLaunchConfig5CCFDD66 | UPDATE_IN_PROGRESS | Resource creation Initiated
2020-04-09 09:41:02 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1ALaunchConfigC286DD69 | UPDATE_COMPLETE | -
2020-04-09 09:41:02 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CLaunchConfigD2E4A6CF | UPDATE_COMPLETE | -
2020-04-09 09:41:02 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1ALaunchConfigC286DD69 | UPDATE_IN_PROGRESS | Resource creation Initiated
2020-04-09 09:41:02 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BLaunchConfig5CCFDD66 | UPDATE_COMPLETE | -
2020-04-09 09:41:09 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | -
2020-04-09 09:41:10 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949 | UPDATE_IN_PROGRESS | -
2020-04-09 09:41:10 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | -
2020-04-09 09:41:14 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949 | UPDATE_IN_PROGRESS | Rolling update initiated. Terminating 2 obsolete instance(s) in batches of 1.
2020-04-09 09:41:14 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | Rolling update initiated. Terminating 3 obsolete instance(s) in batches of 1.
2020-04-09 09:41:14 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | Temporarily setting autoscaling group MinSize and DesiredCapacity to 3.
2020-04-09 09:41:14 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | Rolling update initiated. Terminating 3 obsolete instance(s) in batches of 1.
2020-04-09 09:41:15 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | Terminating instance(s) [i-xxxx]; replacing with 1 new instance(s).
2020-04-09 09:41:15 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949 | UPDATE_IN_PROGRESS | Temporarily setting autoscaling group MinSize and DesiredCapacity to 2.
2020-04-09 09:41:15 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | Temporarily setting autoscaling group MinSize and DesiredCapacity to 3.
2020-04-09 09:41:16 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949 | UPDATE_IN_PROGRESS | Terminating instance(s) [i-xxxx]; replacing with 1 new instance(s).
2020-04-09 09:41:16 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | Terminating instance(s) [i-xxxx]; replacing with 1 new instance(s).
2020-04-09 09:42:09 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949 | UPDATE_IN_PROGRESS | Terminating instance(s) [i-xxxx]; replacing with 1 new instance(s).
2020-04-09 09:42:09 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949 | UPDATE_IN_PROGRESS | Successfully terminated instance(s) [i-xxxx] (Progress 50%).
2020-04-09 09:42:33 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | Terminating instance(s) [i-xxxx]; replacing with 1 new instance(s).
2020-04-09 09:42:33 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | Successfully terminated instance(s) [i-xxxx] (Progress 33%).
2020-04-09 09:42:34 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | Successfully terminated instance(s) [i-xxxx] (Progress 33%).
2020-04-09 09:42:35 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | Terminating instance(s) [i-xxxx]; replacing with 1 new instance(s).
2020-04-09 09:43:02 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949 | UPDATE_IN_PROGRESS | Successfully terminated instance(s) [i-xxxx] (Progress 100%).
2020-04-09 09:43:05 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949 | UPDATE_COMPLETE | -
2020-04-09 09:43:26 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | Terminating instance(s) [i-xxxx]; replacing with 1 new instance(s).
2020-04-09 09:43:26 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | Successfully terminated instance(s) [i-xxxx] (Progress 67%).
2020-04-09 09:43:28 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | Terminating instance(s) [i-xxxx]; replacing with 1 new instance(s).
2020-04-09 09:43:28 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | Successfully terminated instance(s) [i-xxxx] (Progress 67%).
2020-04-09 09:44:22 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_IN_PROGRESS | Successfully terminated instance(s) [i-xxxx] (Progress 100%).
2020-04-09 09:44:24 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F | UPDATE_COMPLETE | -
2020-04-09 09:44:44 UTC+0200 | SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460 | UPDATE_IN_PROGRESS | Successfully terminated instance(s) [i-xxxx] (Progress 100%).

Environment

  • CLI Version : 1.28.0

Other


This is 🐛 Bug Report

@stefanolczak stefanolczak added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 9, 2020
@stefanolczak stefanolczak changed the title Retrieving EKS AMI with SSM shows no diff or causes hidden update Retrieving EKS AMI with SSM shows no diff and causes implicit replacement of every node Apr 9, 2020
@SomayaB SomayaB added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Apr 11, 2020
eladb (Contributor) commented Apr 12, 2020

Thanks for reporting. I can see why this can cause a pretty major headache. We will look into this.

@eladb eladb added the p1 label Apr 12, 2020
@SomayaB SomayaB removed the needs-triage This issue or PR still needs to be triaged. label May 19, 2020
@eladb eladb added this to the EKS Developer Preview milestone Jun 24, 2020
@eladb eladb changed the title Retrieving EKS AMI with SSM shows no diff and causes implicit replacement of every node [EKS Feature] Retrieving EKS AMI with SSM shows no diff and causes implicit replacement of every node Jun 24, 2020
@eladb eladb removed this from the EKS Developer Preview milestone Jun 24, 2020
@eladb eladb added this to the EKS Dev Preview milestone Jul 22, 2020
@eladb eladb added the effort/small Small work item – less than a day of effort label Aug 4, 2020
@eladb eladb changed the title [EKS Feature] Retrieving EKS AMI with SSM shows no diff and causes implicit replacement of every node Unexpected node replacement due to using SSM for AMIs Aug 6, 2020
@iliapolo iliapolo assigned iliapolo and unassigned eladb Aug 9, 2020
@iliapolo iliapolo changed the title Unexpected node replacement due to using SSM for AMIs [aws-eks] Unexpected node replacement due to using SSM for AMIs Aug 16, 2020
iliapolo (Contributor) commented

@stefanolczak The root cause is the update policy of the ASG. Currently, the addCapacity method will default to a RollingUpdate policy, which means any change to the launch configuration will cause node replacements.

We are considering changing this default, but in the meantime you can specify an update policy of your own to avoid this:

import * as autoscaling from '@aws-cdk/aws-autoscaling';
import * as ec2 from '@aws-cdk/aws-ec2';

this.cluster.addCapacity('Nodes', {
  instanceType: new ec2.InstanceType('t2.medium'),
  minCapacity: 3,
  // Opt out of rolling updates so an AMI change in the SSM parameter
  // does not replace nodes as a side effect of an unrelated deployment.
  updateType: autoscaling.UpdateType.NONE,
});

@iliapolo iliapolo added the in-progress This issue is being actively worked on. label Aug 16, 2020
@mergify mergify bot closed this as completed in #9746 Aug 17, 2020
mergify bot pushed a commit that referenced this issue Aug 17, 2020
…e of ASG (#9746)

This might be the PR with the highest explanation/code ratio I've ever made :)

When the value of an AMI stored in a managed SSM parameter changes, it should not cause a replacement of the ASG nodes. The reasoning is that managed parameters can change over time outside the user's control. Because of this, the change is not reflected in `cdk diff`, which creates a situation where every deployment can potentially cause node replacement without notice.

There are two scenarios in which the cluster interacts with an `AutoScalingGroup`:

### `addCapacity`

When one uses `cluster.addCapacity`, we implicitly create an `AutoScalingGroup` that uses either the `BottleRocketImage` or the `EksOptimizedImage` as the machine image, with no option to customize it. Both of these images fetch their AMIs from a managed SSM parameter (`/aws/service/eks/optimized-ami` or `/aws/service/bottlerocket`). This means that we create the situation described above by **default**.

https://github.com/aws/aws-cdk/blob/5af718bab8522f1a4e7f70e7221f4878a15aa4a4/packages/%40aws-cdk/aws-eks/lib/cluster.ts#L779-L785

It seems a more reasonable default in this case would be to use `UpdateType.NONE` instead of `UpdateType.ROLLING_UPDATE`.

Note that in such a case, even if the user explicitly changes the machine image configuration (by specifying a different `machineImageType`), node replacement will not occur, even though `cdk diff` will clearly show a configuration change.

In any case, the `updateType` can always be explicitly passed to mitigate any issue caused by the default behavior. 
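
For example, once `UpdateType.NONE` becomes the default, a user who still wants nodes rolled whenever the AMI changes could opt back in explicitly. The following is a sketch only, using the v1 @aws-cdk/aws-autoscaling enum and assuming an existing eks.Cluster bound to `cluster`; the instance type and capacity are placeholders.

import * as autoscaling from '@aws-cdk/aws-autoscaling';
import * as ec2 from '@aws-cdk/aws-ec2';

cluster.addCapacity('Nodes', {
  instanceType: new ec2.InstanceType('t2.medium'),
  minCapacity: 3,
  // Explicit opt-in: any launch configuration change, including an AMI change
  // picked up from the managed SSM parameter, will roll the nodes.
  updateType: autoscaling.UpdateType.ROLLING_UPDATE,
});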

### `addAutoScalingGroup`

When one uses `cluster.addAutoScalingGroup`, the `AutoScalingGroup` is created by the user. The default value for `updateType` in the `AutoScalingGroup` construct is `UpdateType.NONE`, so unless the user explicitly configured `UpdateType.ROLLING_UPDATE`, node replacement should not occur.
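
As a rough illustration, a user-managed group attached this way could look like the sketch below. This is an assumption, not code from the PR: the EksOptimizedImage options and the addAutoScalingGroup call are written against the v1 CDK modules, `cluster` is assumed to be an existing eks.Cluster in scope, and bootstrap details are omitted.

import * as autoscaling from '@aws-cdk/aws-autoscaling';
import * as ec2 from '@aws-cdk/aws-ec2';
import * as eks from '@aws-cdk/aws-eks';

// User-managed ASG: the update policy is whatever the user sets here.
const asg = new autoscaling.AutoScalingGroup(this, 'SelfManagedNodes', {
  vpc: cluster.vpc,
  instanceType: new ec2.InstanceType('t2.medium'),
  // SSM-backed image: the resolved AMI can change between deployments.
  machineImage: new eks.EksOptimizedImage({ kubernetesVersion: '1.14' }),
  minCapacity: 3,
  // updateType defaults to UpdateType.NONE here, so AMI drift alone will not
  // replace nodes unless ROLLING_UPDATE is opted into explicitly.
});

cluster.addAutoScalingGroup(asg, {});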

Having said that, when a user specifies `UpdateType.ROLLING_UPDATE`, it's not super intuitive that this update might happen without any explicit configuration change; in fact, this is documented in the images that use SSM to fetch the AMI:

https://github.com/aws/aws-cdk/blob/5af718bab8522f1a4e7f70e7221f4878a15aa4a4/packages/%40aws-cdk/aws-ec2/lib/machine-image.ts#L216-L226

-------------------------------------------

There is no way for us to selectively apply the update policy: either we don't use it at all, meaning intentional user changes won't replace nodes either, or we use it for everything, meaning implicit changes will cause replacement.

Ideally, we should consider moving away from using these managed SSM params in launch configurations, but that requires some additional investigation. 

The PR simply suggests removing the `UpdateType.ROLLING_UPDATE` default from the `addCapacity` method, as a balance between the considerations mentioned above.

Fixes #7273

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
@iliapolo iliapolo removed the in-progress This issue is being actively worked on. label Aug 17, 2020