Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(eks): AMI changes in managed SSM store param causes rolling update of ASG #9746

Merged
merged 9 commits into from Aug 17, 2020

Conversation

iliapolo
Copy link
Contributor

@iliapolo iliapolo commented Aug 16, 2020

This might be the PR with the highest explanation/code ratio i've ever made :)

When a value changes for an AMI in a managed SSM store parameter, it should not cause a replacement of the ASG nodes. The reasoning is that managed params can change over time with no control on the user's part. Because of this, the change will not be reflected in cdk diff and creates a situation where every deployment can potentially cause node replacement without notice.

There are two scenarios in which the cluster interacts with an AutoScalingGroup

addCapacity

When one uses cluster.addCapacity, we implicitly create an AutoScalingGroup that uses either the BottleRocketImage or the EksOptimizedImage as the machine image, with no option to customize it. Both these images fetch their AMI's from a managed SSM parameter (/aws/service/eks/optimized-ami or /aws/service/bottlerocket). This means that we create the situation described above by default.

machineImage: options.machineImageType === MachineImageType.BOTTLEROCKET ?
new BottleRocketImage() :
new EksOptimizedImage({
nodeType: nodeTypeForInstanceType(options.instanceType),
kubernetesVersion: this.version.version,
}),
updateType: options.updateType || autoscaling.UpdateType.ROLLING_UPDATE,

Seems like a more reasonable default in this case would be to use UpdateType.NONE instead of UpdateType.RollingUpdate.

Note that in such a case, even if the user explicitly changes the machine image configuration (by specifying a different machineImageType), node replacement will not occur, even though cdk diff will clearly show a configuration change.

In any case, the updateType can always be explicitly passed to mitigate any issue caused by the default behavior.

addAutoScalingGroup

When one uses cluster.addAutoScalingGroup, the AutoScalingGroup is created by the user. The default value for updateType in the AutoScalingGroup construct is UpdateType.NONE, so unless the user explicitly configured UpdateType.RollingUpdate - node replacement should not occur.

Having said that, when a user specifies UpdateType.RollingUpdate, its not super intuitive that this update might happen without any explicit configuration change, and in fact this is actually documented in the images that use SSM to fetch the API:

/**
* Selects the latest version of Amazon Linux
*
* This Machine Image automatically updates to the latest version on every
* deployment. Be aware this will cause your instances to be replaced when a
* new version of the image becomes available. Do not store stateful information
* on the instance if you are using this image.
*
* The AMI ID is selected using the values published to the SSM parameter store.
*/
export class AmazonLinuxImage implements IMachineImage {


There is no way for us to selectively apply the update policy, we either dont use it at all, meaning intentional user changes won't replace nodes as well, or we use it for all, meaning implicit changes will cause it.

Ideally, we should consider moving away from using these managed SSM params in launch configurations, but that requires some additional investigation.

The PR simply suggests to remove the UpdateType.RollingUpdate default from the addCapacity method, as a form of balance between all the considerations mentioned above.

Fixes #7273


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Aug 16, 2020
@iliapolo iliapolo marked this pull request as ready for review August 16, 2020 12:53
@iliapolo iliapolo requested a review from eladb August 16, 2020 12:53
@mergify
Copy link
Contributor

mergify bot commented Aug 17, 2020

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 526a8a0
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mergify
Copy link
Contributor

mergify bot commented Aug 17, 2020

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify mergify bot merged commit 44f7753 into master Aug 17, 2020
@mergify mergify bot deleted the epolon/eks-asg-replacement-ssm branch August 17, 2020 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contribution/core This is a PR that came from AWS.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[aws-eks] Unexpected node replacement due to using SSM for AMIs
3 participants