
(Managed)NodeGroups: Handle existing instanceRoleARN with nested resource path #2689

Merged
merged 2 commits on Oct 6, 2020

Conversation


@dmcneil dmcneil commented Sep 29, 2020

Description

When creating a cluster with either unmanaged or managed node groups that use an existing instanceRoleARN containing a nested/deep resource path, the nodes fail to join the cluster and the create process eventually times out somewhere around:

[ℹ]  building managed nodegroup stack "eksctl-<cluster>-nodegroup-<nodegroup>"
[ℹ]  deploying stack "eksctl-<cluster>-nodegroup-<nodegroup>"

Example configuration to reproduce the issue:

# ... truncated for brevity

nodeGroups: # also applies to managedNodeGroups.
  - name: mng-1
    instanceType: m5.large
    iam:
      instanceRoleARN: arn:aws:iam::1234567890:role/foo/bar/baz/custom-eks-role

Currently, when using a role ARN like this, the aws-auth ConfigMap in kube-system gets the ARN copied in verbatim, but it should be normalized to arn:aws:iam::1234567890:role/custom-eks-role:

$ kubectl -n kube-system get configmaps aws-auth -o yaml

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::1234567890:role/foo/bar/baz/custom-eks-role # BAD
      rolearn: arn:aws:iam::1234567890:role/custom-eks-role # GOOD
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system

It's a bit confusing: to the user, the normalized ARN may look invalid, but that's the form EKS expects...🤷‍♂️

The AWS EKS Console shows the node group(s) as failed with the vague "Nodes fail to join cluster" message. The AWS documentation for this falls under the "Container runtime network not ready" troubleshooting section, so you can only determine that this is the issue if you are able to SSH into one of the nodes and inspect the system/process logs, which show a series of Unauthorized messages when the node tries to register with and talk to the Kubernetes API.

This PR allows the user to keep providing the fully-qualified ARN in their configuration while eksctl handles the normalized ARN form with the expected behavior. I initially tried adding the fix at just the AuthConfig level, but the logic there isn't applied during the initial cluster creation. I have verified that the CloudFormation process works correctly even when the normalized ARN is passed in the CF template. Additional test cases that explicitly cover this fix have been added, and all existing tests still pass, so there should be no regressions.
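
Roughly, the normalization looks like this (a minimal sketch only; the helper name normalizeRoleARN and the use of the aws-sdk-go arn package here are illustrative, the actual helper added in this PR is the NormalizeARN function mentioned in the checklist below):

// Sketch only -- illustrative, not the exact implementation in this PR.
package iam

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws/arn"
)

// normalizeRoleARN strips any nested resource path from an IAM role ARN, e.g.
// arn:aws:iam::1234567890:role/foo/bar/baz/custom-eks-role becomes
// arn:aws:iam::1234567890:role/custom-eks-role.
func normalizeRoleARN(roleARN string) (string, error) {
	parsed, err := arn.Parse(roleARN)
	if err != nil {
		return "", fmt.Errorf("invalid ARN %q: %w", roleARN, err)
	}
	parts := strings.Split(parsed.Resource, "/")
	if len(parts) < 2 || parts[0] != "role" {
		// Not a role ARN with a path; leave it unchanged.
		return roleARN, nil
	}
	// Keep only the resource type ("role") and the final path segment (the role name).
	parsed.Resource = parts[0] + "/" + parts[len(parts)-1]
	return parsed.String(), nil
}

The normalized form is what ends up in both the CloudFormation template and the aws-auth ConfigMap entry.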

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
    The godoc comment for NormalizeARN gives a summary of its purpose.
  • Manually tested
    Tested manually with both managed and unmanaged node groups along with no custom role at all for regression.
  • Added labels for change area (e.g. area/nodegroup), target version (e.g. version/0.12.0) and kind (e.g. kind/improvement)
  • Make sure the title of the PR is a good description that can go into the release notes
