Skip to content

Commit

Permalink
fix(eks): AMI changes in managed SSM store param causes rolling updat…
Browse files Browse the repository at this point in the history
…e of ASG (#9746)

This might be the PR with the highest explanation/code ratio i've ever made :)

When a value changes for an AMI in a managed SSM store parameter, it should not cause a replacement of the ASG nodes. The reasoning is that managed params can change over time with no control on the user's part. Because of this, the change will not be reflected in `cdk diff` and creates a situation where every deployment can potentially cause node replacement without notice.

There are two scenarios in which the cluster interacts with an `AutoScalingGroup`

### `addCapacity`

When one uses `cluster.addCapacity`, we implicitly create an `AutoScalingGroup` that uses either the `BottleRocketImage` or the `EksOptimizedImage` as the machine image, with no option to customize it. Both these images fetch their AMI's from a managed SSM parameter (`/aws/service/eks/optimized-ami` or `/aws/service/bottlerocket`). This means that we create the situation described above by **default**. 

https://github.com/aws/aws-cdk/blob/5af718bab8522f1a4e7f70e7221f4878a15aa4a4/packages/%40aws-cdk/aws-eks/lib/cluster.ts#L779-L785

Seems like a more reasonable default in this case would be to use `UpdateType.NONE` instead of `UpdateType.RollingUpdate`. 

Note that in such a case, even if the user explicitly changes the machine image configuration (by specifying a different `machineImageType`), node replacement will not occur, even though `cdk diff` will clearly show a configuration change.

In any case, the `updateType` can always be explicitly passed to mitigate any issue caused by the default behavior. 

### `addAutoScalingGroup`

When one uses `cluster.addAutoScalingGroup`, the `AutoScalingGroup` is created by the user. The default value for `updateType` in the `AutoScalingGroup` construct is `UpdateType.NONE`, so unless the user explicitly configured `UpdateType.RollingUpdate` - node replacement should not occur. 

Having said that, when a user specifies `UpdateType.RollingUpdate`, its not super intuitive that this update might happen without any explicit configuration change, and in fact this is actually documented in the images that use SSM to fetch the API:

https://github.com/aws/aws-cdk/blob/5af718bab8522f1a4e7f70e7221f4878a15aa4a4/packages/%40aws-cdk/aws-ec2/lib/machine-image.ts#L216-L226

-------------------------------------------

There is no way for us to selectively apply the update policy, we either dont use it at all, meaning intentional user changes won't replace nodes as well, or we use it for all, meaning implicit changes will cause it.

Ideally, we should consider moving away from using these managed SSM params in launch configurations, but that requires some additional investigation. 

The PR simply suggests to remove the `UpdateType.RollingUpdate` default from the `addCapacity` method, as a form of balance between all the considerations mentioned above.  

Fixes #7273

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
  • Loading branch information
iliapolo committed Aug 17, 2020
1 parent 4ee37a4 commit 44f7753
Show file tree
Hide file tree
Showing 4 changed files with 25 additions and 51 deletions.
9 changes: 6 additions & 3 deletions packages/@aws-cdk/aws-eks/lib/cluster.ts
Expand Up @@ -763,9 +763,12 @@ export class Cluster extends Resource implements ICluster {
* The nodes will automatically be configured with the right VPC and AMI
* for the instance type and Kubernetes version.
*
* Note that if you specify `updateType: RollingUpdate` or `updateType: ReplacingUpdate`, your nodes might be replaced at deploy
* time without notice in case the recommended AMI for your machine image type has been updated by AWS.
* The default behavior for `updateType` is `None`, which means only new instances will be launched using the new AMI.
*
* Spot instances will be labeled `lifecycle=Ec2Spot` and tainted with `PreferNoSchedule`.
* If kubectl is enabled, the
* [spot interrupt handler](https://github.com/awslabs/ec2-spot-labs/tree/master/ec2-spot-eks-solution/spot-termination-handler)
* In addition, the [spot interrupt handler](https://github.com/awslabs/ec2-spot-labs/tree/master/ec2-spot-eks-solution/spot-termination-handler)
* daemon will be installed on all spot instances to handle
* [EC2 Spot Instance Termination Notices](https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/).
*/
Expand All @@ -782,7 +785,7 @@ export class Cluster extends Resource implements ICluster {
nodeType: nodeTypeForInstanceType(options.instanceType),
kubernetesVersion: this.version.version,
}),
updateType: options.updateType || autoscaling.UpdateType.ROLLING_UPDATE,
updateType: options.updateType,
instanceType: options.instanceType,
});

Expand Down
4 changes: 0 additions & 4 deletions packages/@aws-cdk/aws-eks/lib/legacy-cluster.ts
Expand Up @@ -230,10 +230,6 @@ export class LegacyCluster extends Resource implements ICluster {
* for the instance type and Kubernetes version.
*
* Spot instances will be labeled `lifecycle=Ec2Spot` and tainted with `PreferNoSchedule`.
* If kubectl is enabled, the
* [spot interrupt handler](https://github.com/awslabs/ec2-spot-labs/tree/master/ec2-spot-eks-solution/spot-termination-handler)
* daemon will be installed on all spot instances to handle
* [EC2 Spot Instance Termination Notices](https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/).
*/
public addCapacity(id: string, options: CapacityOptions): autoscaling.AutoScalingGroup {
if (options.machineImageType === MachineImageType.BOTTLEROCKET && options.bootstrapOptions !== undefined ) {
Expand Down
44 changes: 0 additions & 44 deletions packages/@aws-cdk/aws-eks/test/integ.eks-cluster.expected.json
Expand Up @@ -1551,17 +1551,6 @@
]
},
"UpdatePolicy": {
"AutoScalingRollingUpdate": {
"WaitOnResourceSignals": false,
"PauseTime": "PT0S",
"SuspendProcesses": [
"HealthCheck",
"ReplaceUnhealthy",
"AZRebalance",
"AlarmNotification",
"ScheduledActions"
]
},
"AutoScalingScheduledAction": {
"IgnoreUnmodifiedGroupSizeProperties": true
}
Expand Down Expand Up @@ -1853,17 +1842,6 @@
]
},
"UpdatePolicy": {
"AutoScalingRollingUpdate": {
"WaitOnResourceSignals": false,
"PauseTime": "PT0S",
"SuspendProcesses": [
"HealthCheck",
"ReplaceUnhealthy",
"AZRebalance",
"AlarmNotification",
"ScheduledActions"
]
},
"AutoScalingScheduledAction": {
"IgnoreUnmodifiedGroupSizeProperties": true
}
Expand Down Expand Up @@ -2142,17 +2120,6 @@
]
},
"UpdatePolicy": {
"AutoScalingRollingUpdate": {
"WaitOnResourceSignals": false,
"PauseTime": "PT0S",
"SuspendProcesses": [
"HealthCheck",
"ReplaceUnhealthy",
"AZRebalance",
"AlarmNotification",
"ScheduledActions"
]
},
"AutoScalingScheduledAction": {
"IgnoreUnmodifiedGroupSizeProperties": true
}
Expand Down Expand Up @@ -2462,17 +2429,6 @@
]
},
"UpdatePolicy": {
"AutoScalingRollingUpdate": {
"WaitOnResourceSignals": false,
"PauseTime": "PT0S",
"SuspendProcesses": [
"HealthCheck",
"ReplaceUnhealthy",
"AZRebalance",
"AlarmNotification",
"ScheduledActions"
]
},
"AutoScalingScheduledAction": {
"IgnoreUnmodifiedGroupSizeProperties": true
}
Expand Down
19 changes: 19 additions & 0 deletions packages/@aws-cdk/aws-eks/test/test.cluster.ts
Expand Up @@ -15,6 +15,7 @@ import { testFixture, testFixtureNoVpc } from './util';
const CLUSTER_VERSION = eks.KubernetesVersion.V1_16;

export = {

'a default cluster spans all subnets'(test: Test) {
// GIVEN
const { stack, vpc } = testFixture();
Expand Down Expand Up @@ -201,6 +202,24 @@ export = {
test.done();
},

'adding capacity creates an ASG without a rolling update policy'(test: Test) {
// GIVEN
const { stack, vpc } = testFixture();
const cluster = new eks.Cluster(stack, 'Cluster', {
vpc,
defaultCapacity: 0,
version: CLUSTER_VERSION,
});

// WHEN
cluster.addCapacity('Default', {
instanceType: new ec2.InstanceType('t2.medium'),
});

test.deepEqual(expect(stack).value.Resources.ClusterASG0E4BA723.UpdatePolicy, { AutoScalingScheduledAction: { IgnoreUnmodifiedGroupSizeProperties: true } });
test.done();
},

'adding capacity creates an ASG with tags'(test: Test) {
// GIVEN
const { stack, vpc } = testFixture();
Expand Down

0 comments on commit 44f7753

Please sign in to comment.