-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(eks): creating a ServiceAccount
in a different stack than the Cluster
creates circular dependency between the two stacks
#9701
Conversation
// a dependency on the cluster, allowing those resources to be in a different stack, | ||
// will create a circular dependency. granted, it won't always be the case, | ||
// but we opted for the more causious and restrictive approach for now. | ||
throw new Error(`${construct.node.uniqueId} must be defined in the scope of the ${thisStack.stackName} stack`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a reason (why not?) and also what should the user do to fix (use the construct directly?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eladb What do you mean by "use the construct directly"? I don't think users are supposed to instantiate this construct themselves since it will override the aws-auth
config map. Right?
As far as I can tell the only thing to do is create the role/user in the cluster stack, which is already mentioned. But I can rephrase to something like:
throw new Error(`${construct.node.uniqueId} should be defined in the scope of the ${thisStack.stackName} stack to prevent circular dependencies.`)
will that be reasonably clear...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's good.
Approved with minor |
Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Sorry if this is not the right place to discuss this. I'm curious why this was not labeled a breaking change in the release notes. This change caused a create and a delete of the old resource leaving in a state where the actual service account was missing. Re-running the template did not detect any changes, and thus would not recreate it. This caused us to have to create them manually. Should this not have been labeled a "breaking change" in the release notes? Or is there guidance on upgrading cdk version with ID changes so that we don't have a disruption due to a race condition? |
Do you mean that the original service account name used by some pod remained the same? and now does not exist? In general these types of changes (stateless replacements) should not cause deployment failures, which is why it's not strictly considered a breaking change. Seems that the behavior you described is caused due to a bug/limitation we currently have with service accounts. (Related #9910). Could you share your code? Specifically the part the creates and uses the service account. Thanks |
@iliapolo I mean the cloudformation logical name changed. Here's a simple code to test this. Synth in 1.61.0 and then synth in 1.62.0 and compare the cloudformation logical IDs.
When I run this example, it's renaming my service account name This effectively creates a delete on the first resource, and a create on the second. However, when this is executed in cloudformation they are run in parallel. What we witnessed is |
Looking into a bit more. The reason the logical ID changes is because the manifest was a child of the cluster before. Now it's a nested child in the ServiceAccount resource. 1.61.0: CdkSaTestStack/hello-eks/manifest-exampleServiceAccountResource/Resource/Default So the question is: Is a path change (which is used for cloudformation logical IDs) considered a breaking change? And if it is a breaking change, should there be an automatic detection in the test fixtures of ID changes/removals to flag as a potential breaking change? |
Not all logical ID changes are considered a breaking change. For example, stateless resources can usually be replaced without repercussions. Logical ID changes that are considered breaking are mentioned in the CHANGELOG. I do agree though that we should be more explicit about what exactly is and is not considered breaking, and i'll make sure we put out some guidance for this. In this case, the service account wasn't considered a breaking change because it should have been replaced and keep the existing configuration. it seems we've encountered a gap that caused this replacement to leave the cluster in a faulty state. You analysis of the problem is correct, I also think the fact the Usually, when a logical id changes for a resource with a concrete physical name (i.e not generated based on the construct path), CFN will fail when trying to create the new resource because there will be a name conflict. This will cause a rollback of the stack, not great, but it won't cause a faulty state. Now that I think about it, this is what should have happened after this change, the reason it didn't is because CFN isn't aware of the service account name, and because This was an oversight on our part, apologies and thank you for reporting and diagnosing this. I've created this issue so that people are aware and am working on providing a workaround to preserve the logical id. Thanks again. |
This PR changes a few scenarios with regards to circular dependencies in cases where some resources are created in a different stack than the cluster stack.
ServiceAccount
Previously, the
ServiceAccount
construct usedcluster.addManifest
to deploy the necessary resource.aws-cdk/packages/@aws-cdk/aws-eks/lib/service-account.ts
Lines 81 to 95 in 25a9cc7
This means that the resource itself is added to the cluster stack. When the
ServiceAccount
is created in a different stack, it creates a dependency between the cluster stack and the service account stack.Since
ServiceAccount
also depends on the cluster, it creates a dependency between the service account stack and the cluster stack. And hence a circular dependency is formed.Solution
There is no inherent reason to always add the
ServiceAccount
resource to the cluster stack. If it was added to the service account stack, the circular dependency could be avoided.The solution is to use
new KubernetesManifest(this, ...)
instead ofcluster.addResource
- creating the manifest in the service account stack, which is perfectly fine since that direction of dependency is the intended one.AutoScalingGroup Capacity
When adding capacity to a cluster using an
AutoScalingGroup
, we add the role of the ASG to theaws-auth
role mappings of the cluster:aws-cdk/packages/@aws-cdk/aws-eks/lib/cluster.ts
Lines 914 to 923 in 25a9cc7
The ASG depends on the cluster because, among others, it requires to have a tag with the cluster name:
aws-cdk/packages/@aws-cdk/aws-eks/lib/cluster.ts
Lines 907 to 909 in 25a9cc7
This creates a dependency between the ASG stack and the cluster stack. In case the ASG role is defined in the ASG stack, the auth mapping now creates a dependency between the cluster stack and the ASG, forming a circular dependency between the stacks.
Solution
AwsAuth
is a singleton of the cluster, which means it is always defined in the cluster stack. Since we have no control over the stack of the ASG and its role, a circular dependency may always be created. The solution is to simply disallow creatinguserMappings
orroleMappings
with aUser/Role
that is not part of the cluster stack.This might be a little more restrictive than necessary, but it has less exposure potential to edge cases and complex dependency cycles. We can always be less restrictive down the road if needed.
Fixes #8884
Fixes #9325
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license