
fix(eks): creating a ServiceAccount in a different stack than the Cluster creates circular dependency between the two stacks #9701

Merged: 42 commits merged into master from epolon/eks-service-account-circular-dep on Sep 2, 2020

Conversation

@iliapolo (Contributor) commented Aug 14, 2020

This PR changes a few scenarios related to circular dependencies that arise when some resources are created in a different stack than the cluster stack.

Reviewers: please refer to this description as a first stop for answers to PR questions :)

ServiceAccount

Previously, the ServiceAccount construct used cluster.addManifest to deploy the necessary resource.

cluster.addManifest(`${id}ServiceAccountResource`, {
  apiVersion: 'v1',
  kind: 'ServiceAccount',
  metadata: {
    name: this.serviceAccountName,
    namespace: this.serviceAccountNamespace,
    labels: {
      'app.kubernetes.io/name': this.serviceAccountName,
    },
    annotations: {
      'eks.amazonaws.com/role-arn': this.role.roleArn,
    },
  },
});

This means the manifest resource itself is always added to the cluster stack. When the ServiceAccount is created in a different stack, the manifest (which lives in the cluster stack) references the role defined in the service account stack, creating a dependency from the cluster stack onto the service account stack.

Since the ServiceAccount also depends on the cluster, the service account stack in turn depends on the cluster stack, and hence a circular dependency is formed.
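For illustration, a minimal two-stack setup along these lines (stack and construct names here are hypothetical) would have produced the cycle:

import * as cdk from '@aws-cdk/core';
import * as eks from '@aws-cdk/aws-eks';

const app = new cdk.App();
const clusterStack = new cdk.Stack(app, 'ClusterStack');
const saStack = new cdk.Stack(app, 'ServiceAccountStack');

const cluster = new eks.Cluster(clusterStack, 'Cluster', {
  version: eks.KubernetesVersion.V1_16,
});

// The ServiceAccount (and its IAM role) live in saStack, so saStack depends
// on clusterStack. But cluster.addManifest placed the manifest resource in
// clusterStack, where it referenced the role from saStack - the reverse edge.
new eks.ServiceAccount(saStack, 'ServiceAccount', { cluster });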

Solution

There is no inherent reason to always add the ServiceAccount resource to the cluster stack. If it were added to the service account stack instead, the circular dependency could be avoided.

The solution is to use new KubernetesManifest(this, ...) instead of cluster.addManifest - creating the manifest in the service account stack, which is perfectly fine since that direction of the dependency is the intended one.
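A sketch of the fixed construct code, assuming the manifest contents stay as in the snippet above (the construct id `manifest-${id}ServiceAccountResource` matches the paths discussed later in this thread):

// Inside the ServiceAccount constructor: scoping the manifest to `this`
// places it in the service account's own stack instead of the cluster stack.
new KubernetesManifest(this, `manifest-${id}ServiceAccountResource`, {
  cluster,
  manifest: [{
    apiVersion: 'v1',
    kind: 'ServiceAccount',
    metadata: {
      name: this.serviceAccountName,
      namespace: this.serviceAccountNamespace,
      labels: {
        'app.kubernetes.io/name': this.serviceAccountName,
      },
      annotations: {
        'eks.amazonaws.com/role-arn': this.role.roleArn,
      },
    },
  }],
});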

AutoScalingGroup Capacity

When adding capacity to a cluster using an AutoScalingGroup, we add the role of the ASG to the aws-auth role mappings of the cluster:

if (mapRole) {
  // see https://docs.aws.amazon.com/en_us/eks/latest/userguide/add-user-role.html
  this.awsAuth.addRoleMapping(autoScalingGroup.role, {
    username: 'system:node:{{EC2PrivateDNSName}}',
    groups: [
      'system:bootstrappers',
      'system:nodes',
    ],
  });
}

The ASG depends on the cluster because, among other things, it must carry a tag with the cluster name:

Tags.of(autoScalingGroup).add(`kubernetes.io/cluster/${this.clusterName}`, 'owned', {
  applyToLaunchedInstances: true,
});

This creates a dependency from the ASG stack onto the cluster stack. If the ASG role is defined in the ASG stack, the auth mapping then creates a dependency from the cluster stack onto the ASG stack, forming a circular dependency between the two.
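A hypothetical reproduction (assuming an `asgStack` separate from the cluster's stack, and a `cluster` defined in the latter):

import * as iam from '@aws-cdk/aws-iam';

// asgStack already depends on the cluster stack (e.g. through the
// cluster-name tag above); this mapping would add the reverse edge
// and close the cycle.
const nodeRole = new iam.Role(asgStack, 'NodeRole', {
  assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
});
cluster.awsAuth.addRoleMapping(nodeRole, {
  username: 'system:node:{{EC2PrivateDNSName}}',
  groups: ['system:bootstrappers', 'system:nodes'],
});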

Solution

AwsAuth is a singleton of the cluster, which means it is always defined in the cluster stack. Since we have no control over the stack of the ASG and its role, a circular dependency can always arise. The solution is to simply disallow creating userMappings or roleMappings with a User/Role that is not part of the cluster stack.

This might be a little more restrictive than necessary, but it is less exposed to edge cases and complex dependency cycles. We can always relax the restriction down the road if needed.
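A minimal sketch of the guard, assuming a helper along these lines is invoked from addRoleMapping/addUserMapping (the actual implementation may differ in detail; the error message matches the review thread below):

import { Construct, Stack } from '@aws-cdk/core';

function assertSameStack(awsAuth: Construct, construct: Construct) {
  const thisStack = Stack.of(awsAuth);
  if (Stack.of(construct) !== thisStack) {
    // a User/Role from another stack would force the cluster stack to depend
    // on that stack, while that stack already depends on the cluster stack.
    throw new Error(`${construct.node.uniqueId} should be defined in the scope of the ${thisStack.stackName} stack to prevent circular dependencies`);
  }
}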


Fixes #8884
Fixes #9325


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Aug 14, 2020
@iliapolo iliapolo marked this pull request as draft August 23, 2020 17:38
@iliapolo iliapolo marked this pull request as ready for review August 24, 2020 08:44
// a dependency on the cluster, allowing those resources to be in a different stack,
// will create a circular dependency. granted, it won't always be the case,
// but we opted for the more cautious and restrictive approach for now.
throw new Error(`${construct.node.uniqueId} must be defined in the scope of the ${thisStack.stackName} stack`);
@eladb (Contributor) commented:

Add a reason (why not?) and also what should the user do to fix (use the construct directly?)

@iliapolo (Contributor, Author) commented Sep 1, 2020

@eladb What do you mean by "use the construct directly"? I don't think users are supposed to instantiate this construct themselves since it will override the aws-auth config map. Right?

As far as I can tell the only thing to do is create the role/user in the cluster stack, which is already mentioned. But I can rephrase to something like:

      throw new Error(`${construct.node.uniqueId} should be defined in the scope of the ${thisStack.stackName} stack to prevent circular dependencies.`)

will that be reasonably clear...?

@eladb (Contributor) commented:

Yes, that's good.

@eladb eladb added the pr/do-not-merge This PR should not be merged at this time. label Sep 1, 2020
@eladb (Contributor) commented Sep 1, 2020

Approved with minor

@iliapolo iliapolo removed the pr/do-not-merge This PR should not be merged at this time. label Sep 2, 2020
mergify bot commented Sep 2, 2020

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).


@aws-cdk-automation (Collaborator) commented:

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 9a85c15
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mergify mergify bot merged commit 1e96ebc into master Sep 2, 2020
@mergify mergify bot deleted the epolon/eks-service-account-circular-dep branch September 2, 2020 08:36
@jessecollier commented:

Sorry if this is not the right place to discuss this. I'm curious why this was not labeled a breaking change in the release notes.

This change caused a create of a new resource and a delete of the old one, leaving us in a state where the actual service account was missing. Re-running the template did not detect any changes, and thus would not recreate it. We ended up having to recreate the accounts manually.

Should this not have been labeled a "breaking change" in the release notes? Or is there guidance on upgrading the CDK version across ID changes, so that we don't suffer a disruption due to a race condition?

@iliapolo (Contributor, Author) commented:

Hi @jessecollier

> This change caused a create of a new resource and a delete of the old one, leaving us in a state where the actual service account was missing.

Do you mean that the original service account name used by some pod remained the same, and the service account now does not exist?

In general, these types of changes (stateless replacements) should not cause deployment failures, which is why this isn't strictly considered a breaking change. It seems the behavior you described is caused by a bug/limitation we currently have with service accounts. (Related: #9910.)

Could you share your code? Specifically, the part that creates and uses the service account.

Thanks

@jessecollier commented Sep 15, 2020

@iliapolo I mean the CloudFormation logical name changed.

Here's a simple example to test this: synth with 1.61.0, then with 1.62.0, and compare the CloudFormation logical IDs.

import * as cdk from '@aws-cdk/core';
import * as eks from '@aws-cdk/aws-eks';

export class CdkSaTestStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const cluster = new eks.Cluster(this, 'hello-eks', {
      version: eks.KubernetesVersion.V1_16,
    });

    cluster.addServiceAccount('example', {
      name: 'example-sa',
      namespace: 'non-default-ns',
    });
  }
}

When I run this example, the logical ID of the service account resource changes:
1.61.0: helloeksmanifestexampleServiceAccountResourceEFC172AA
1.62.0: helloeksexamplemanifestexampleServiceAccountResourceEDBEE022

This effectively creates a delete of the first resource and a create of the second. However, when this is executed in CloudFormation, the two run in parallel.

What we witnessed is that helloeksexamplemanifestexampleServiceAccountResourceEDBEE022 is executed in CloudFormation correctly. However, because helloeksmanifestexampleServiceAccountResourceEFC172AA no longer exists, CloudFormation sends a delete command for it, which deletes the SA and IAM role. Because this happens second due to the race condition, I'm left in a state where the SA and IAM role do not exist, while CloudFormation thinks the "new" helloeksexamplemanifestexampleServiceAccountResourceEDBEE022 is in a completed/successful state.

@jessecollier commented:

Looking into it a bit more: the reason the logical ID changes is that the manifest used to be a child of the cluster, whereas now it is nested under the ServiceAccount construct.

1.61.0: CdkSaTestStack/hello-eks/manifest-exampleServiceAccountResource/Resource/Default
1.62.0: CdkSaTestStack/hello-eks/example/manifest-exampleServiceAccountResource/Resource/Default
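(For reference, the logical ID is derived from the path relative to the stack, roughly as in the following sketch of core's unique-ID algorithm; the real implementation also de-duplicates components and trims overly long names:)

import * as crypto from 'crypto';

// Drop hidden 'Default' components, strip 'Resource' from the human-readable
// prefix, and append an 8-character hash of the remaining path.
function makeLogicalId(components: string[]): string {
  const visible = components.filter(c => c !== 'Default');
  const hash = crypto.createHash('md5')
    .update(visible.join('/'))
    .digest('hex')
    .slice(0, 8)
    .toUpperCase();
  const human = visible
    .filter(c => c !== 'Resource')
    .map(c => c.replace(/[^A-Za-z0-9]/g, ''))
    .join('');
  return human + hash;
}

// 1.61.0 path: ['hello-eks', 'manifest-exampleServiceAccountResource', 'Resource', 'Default']
//   => helloeksmanifestexampleServiceAccountResourceEFC172AA
// 1.62.0 path: ['hello-eks', 'example', 'manifest-exampleServiceAccountResource', 'Resource', 'Default']
//   => helloeksexamplemanifestexampleServiceAccountResourceEDBEE022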

So the question is: is a path change (which CloudFormation logical IDs are derived from) considered a breaking change?

And if it is, should there be automatic detection of ID changes/removals in the test fixtures, to flag them as potentially breaking?

@iliapolo (Contributor, Author) commented:

@jessecollier

Not all logical ID changes are considered a breaking change. For example, stateless resources can usually be replaced without repercussions. Logical ID changes that are considered breaking are mentioned in the CHANGELOG.

I do agree, though, that we should be more explicit about what exactly is and is not considered breaking, and I'll make sure we put out some guidance for this.

In this case, the service account change wasn't considered breaking because the resource should have been replaced while keeping the existing configuration. It seems we've encountered a gap that caused this replacement to leave the cluster in a faulty state.

Your analysis of the problem is correct. I also think the DELETE happening second is not due to a race condition, but because CFN executes CREATE on new resources before DELETE on old ones to prevent downtime.

Usually, when a logical ID changes for a resource with a concrete physical name (i.e., not generated from the construct path), CFN will fail to create the new resource because of the name conflict. This causes a stack rollback; not great, but it doesn't leave a faulty state.

Now that I think about it, that is what should have happened after this change. The reason it didn't is that CFN isn't aware of the service account name, and since kubectl apply didn't fail (nor should it have), as far as CFN is concerned there was no conflict.

This was an oversight on our part, apologies and thank you for reporting and diagnosing this.

I've created this issue so that people are aware, and I am working on providing a workaround to preserve the logical ID.
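In the meantime, one possible way to pin the old logical ID during the upgrade is a sketch along these lines (not the official workaround; it assumes the 1.62.0 construct tree shown above, with the repro code's imports in scope):

const sa = cluster.addServiceAccount('example', {
  name: 'example-sa',
  namespace: 'non-default-ns',
});

// Walk down to the underlying CfnResource of the manifest and pin its
// logical ID to the pre-1.62.0 value so CloudFormation sees no replacement.
const manifest = sa.node.findChild('manifest-exampleServiceAccountResource');
const cfnManifest = manifest.node.findChild('Resource').node.defaultChild as cdk.CfnResource;
cfnManifest.overrideLogicalId('helloeksmanifestexampleServiceAccountResourceEFC172AA');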

Thanks again.

Labels
contribution/core This is a PR that came from AWS.