
Fix assigning IPs to Windows pods #3067

Merged: 9 commits merged into eksctl-io:release-0.36 from windows-ip-2721 on Jan 20, 2021
Conversation

@cPu1 (Collaborator) commented Jan 13, 2021

Description

The VPC controller requires a subset of the IAM policies of the VPC CNI plugin to work. In a cluster with no OIDC provider, the policies required by the CNI plugin are attached to the node role, so the VPC resource controller ends up using it. When a cluster has an OIDC provider (iam.withOIDC), however, eksctl uses IRSA for the VPC CNI plugin, so those policies are no longer attached to the node role and the VPC controller is unable to access AWS APIs.

This PR makes the VPC controller use IRSA if an OIDC provider is associated with the cluster, attaching only the policies required by the VPC controller instead of using the more permissive AmazonEKS_CNI_Policy managed policy.

Fixes #2721
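
For context, this is the kind of cluster configuration that hits the bug: an OIDC provider plus a Windows nodegroup. A minimal, hypothetical ClusterConfig (names, region, and capacities are placeholders):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster
  region: us-west-2
iam:
  withOIDC: true   # enables IRSA; previously left the VPC controller without permissions
nodeGroups:
  - name: windows-ng
    amiFamily: WindowsServer2019FullContainer
    desiredCapacity: 2
  - name: linux-ng   # Windows support requires at least one Linux nodegroup
    desiredCapacity: 2
```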

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup), target version (e.g. version/0.12.0) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

Review threads on pkg/cfn/manager/iam.go (resolved; one marked outdated)
@cPu1 force-pushed the windows-ip-2721 branch 3 times, most recently from 0e6a131 to a291d39 on January 14, 2021 12:15
Comment on lines +14 to +16:

```go
type serviceAccountCreator interface {
	NewTasksToCreateIAMServiceAccounts(serviceAccounts []*api.ClusterIAMServiceAccount, oidc *iamoidc.OpenIDConnectManager, clientSetGetter kubernetes.ClientSetGetter, replaceExistingRole bool) *tasks.TaskTree
}
```
Contributor:

Can I ask why we have this little interface here?

It took me fractionally longer to parse the composition than if stackcollection (or stackmanager, whatever it is) were simply called from the helper struct on line 51, and I have a preference for code which lets me be lazy 😉

@cPu1 (Collaborator, author):

Because irsaHelper only needs this method and not the entire StackCollection. This makes testing the VPCController and IRSAHelper easier. It's somewhat like asking why, e.g., ioutil.ReadAll takes an io.Reader and not an os.File or a net.TCPConn.
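
A minimal sketch of the pattern under discussion, with hypothetical names (taskRunner and installer are not the PR's actual types): the consumer declares only the method it needs, so a test can substitute a lightweight fake instead of the full concrete type.

```go
package main

import "fmt"

// taskRunner is a narrow, consumer-side interface: the helper declares only
// the single method it actually calls, rather than depending on a large
// concrete type like StackCollection.
type taskRunner interface {
	Run(name string) error
}

// installer depends on the interface, so tests can swap in a fake.
type installer struct {
	runner taskRunner
}

func (i *installer) install(name string) error {
	return i.runner.Run(name)
}

// fakeRunner records calls instead of talking to AWS, making tests cheap.
type fakeRunner struct {
	calls []string
}

func (f *fakeRunner) Run(name string) error {
	f.calls = append(f.calls, name)
	return nil
}

func main() {
	f := &fakeRunner{}
	in := &installer{runner: f}
	_ = in.install("vpc-controller")
	fmt.Println(f.calls) // [vpc-controller]
}
```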

Contributor:

Oh, I couldn't find tests in this PR (I did assume that to start). Are they still coming?

@cPu1 (Collaborator, author):

I meant that as a general comment. I haven't added tests as the bulk of the logic is making Kubernetes and AWS calls.

@cPu1 force-pushed the windows-ip-2721 branch 2 times, most recently from aa6ab08 to dee47f0 on January 15, 2021 12:51
Comment on lines +43 to +48:

```go
logger.Debug("CFN stack for IRSA already exists, replacing it with a new stack")
if err := c.DeleteStackByNameSync(name); err != nil {
	close(errs)
	return errors.Wrap(err, "error deleting stack")
}
return c.createIAMServiceAccountTask(errs, spec, oidc, false)
```
Contributor:

Am I right in thinking that VPC controller installation only occurs during cluster creation? If so, when will the scenario where the SA already exists occur?

@cPu1 (Collaborator, author):

> If so, when will the scenario where the SA already exists occur?

The VPC controller can be installed on an existing cluster using eksctl utils install-vpc-controllers.

Contributor:

So is the scenario here that a user runs eksctl utils install-vpc-controllers on a cluster that already has the VPC controller installed? If so, why do we need to delete and recreate? Can't we just check for its existence and exit early if it exists?

@cPu1 (Collaborator, author) commented Jan 18, 2021:

It is to allow applying any changes to the policy document. If we just check for existence, any changes made to the policy document in code will not be applied.

Contributor:

But is it needed for this PR? I'm just trying to understand when this codepath is used in relation to vpc-controller 😄

It might be that this isn't needed, as we will add the ability to update IAM policies without deleting in #3064.

@cPu1 (Collaborator, author) commented Jan 18, 2021:

> I'm just trying to understand when this codepath is used in relation to vpc-controller

This codepath is invoked when utils install-vpc-controllers is run on a cluster that already has the VPC controller installed (either as part of create cluster or via utils install-vpc-controllers itself). The ServiceAccount created for the role needs the role ARN, so if we skipped creating this stack when it already exists, we would have to fetch the existing stack to get the role ARN, and any changes to the policy document would be ignored, causing the VPC controller to misbehave.
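
A simplified sketch of the recreate-on-exists flow described above (type and method names are illustrative, not eksctl's actual API):

```go
package main

import "fmt"

// stackManager abstracts the CloudFormation operations this flow needs.
// These method names are hypothetical stand-ins for the real StackCollection.
type stackManager interface {
	StackExists(name string) (bool, error)
	DeleteStackSync(name string) error
	CreateStack(name, policyDocument string) (roleARN string, err error)
}

// ensureIRSAStack recreates the IRSA stack when it already exists, so that
// any changes to the policy document in code are always applied, and so the
// fresh stack's outputs yield the role ARN the ServiceAccount annotation needs.
func ensureIRSAStack(sm stackManager, name, policyDocument string) (string, error) {
	exists, err := sm.StackExists(name)
	if err != nil {
		return "", fmt.Errorf("describing stack: %w", err)
	}
	if exists {
		// Delete-and-recreate rather than checking existence and exiting
		// early: an early exit would leave a stale policy document in place.
		if err := sm.DeleteStackSync(name); err != nil {
			return "", fmt.Errorf("deleting stack: %w", err)
		}
	}
	return sm.CreateStack(name, policyDocument)
}

// fakeStacks is a tiny in-memory double used to demonstrate the flow.
type fakeStacks struct{ existing map[string]bool }

func (f *fakeStacks) StackExists(name string) (bool, error) { return f.existing[name], nil }
func (f *fakeStacks) DeleteStackSync(name string) error     { delete(f.existing, name); return nil }
func (f *fakeStacks) CreateStack(name, _ string) (string, error) {
	f.existing[name] = true
	return "arn:aws:iam::123456789012:role/" + name, nil
}

func main() {
	sm := &fakeStacks{existing: map[string]bool{"irsa-stack": true}}
	arn, _ := ensureIRSAStack(sm, "irsa-stack", `{"Version":"2012-10-17"}`)
	fmt.Println(arn) // stack was deleted and recreated, ARN comes from the new stack
}
```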

Contributor:

Ah, I see, so is this just for the scenario in which I have an already-created eksctl cluster with OIDC enabled and the VPC controller installed, but it's missing the policies because I created it with an older version of eksctl? If so, that makes sense 😄

Apart from that scenario, you'd never need to run utils install-vpc-controllers if it's already installed, correct?

@cPu1 (Collaborator, author):

> Ah, I see, so is this just for the scenario in which I have an already-created eksctl cluster with OIDC enabled and the VPC controller installed, but it's missing the policies because I created it with an older version of eksctl?

That's right.

> Apart from that scenario, you'd never need to run utils install-vpc-controllers if it's already installed, correct?

Users may also run it to upgrade to a newer version of the VPC controller.

Contributor:

Thanks! Makes sense. Does this add more downtime for users when they update the VPC controller, since we will always delete and recreate the policies even if they are unchanged? Very minor, but it could be avoided by using the update functionality in #3064.

@cPu1 (Collaborator, author):

> Does this add more downtime for users when they update the VPC controller, since we will always delete and recreate the policies even if they are unchanged?

Yes.

> Very minor, but it could be avoided by using the update functionality in #3064

Yeah, the ideal approach would be to update the policies in the existing stack, but since we didn't already have code that updates the policies for IRSA, I went with recreating it. I was aware of your PR, but it wasn't in master when I was working on this one. After your PR is merged, I can change this code to use the update feature (or it can be done as an improvement later).
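
For context, the update-in-place alternative would look roughly like this with the AWS SDK for Go; this is a sketch under assumptions, not the actual #3064 implementation. The string check handles CloudFormation's quirk of returning an error when the template is unchanged:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

// updateStackInPlace pushes a new template to an existing stack instead of
// deleting and recreating it, avoiding the window where the IAM role is gone.
func updateStackInPlace(svc *cloudformation.CloudFormation, name, templateBody string) error {
	_, err := svc.UpdateStack(&cloudformation.UpdateStackInput{
		StackName:    aws.String(name),
		TemplateBody: aws.String(templateBody),
		Capabilities: []*string{aws.String(cloudformation.CapabilityCapabilityIam)},
	})
	// CloudFormation errors when the template is unchanged; treat that as
	// success, since the stack already matches the desired state.
	if err != nil && strings.Contains(err.Error(), "No updates are to be performed") {
		return nil
	}
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	// Stack name and template body here are placeholders for illustration.
	fmt.Println(updateStackInPlace(cloudformation.New(sess), "eksctl-example-irsa", "{}"))
}
```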

@aclevername (Contributor):

> The VPC controller requires a subset of the IAM policies of the VPC CNI plugin to work. In a cluster with no OIDC provider, the policies required by the CNI plugin are attached to the node role, so the VPC resource controller ends up using it

Is there a way we could make it more explicit in the code that the VPC controller also uses/needs this policy when OIDC is disabled? It feels almost accidental that the VPC controller worked.

@cPu1 (Collaborator, author) commented Jan 18, 2021:

> > The VPC controller requires a subset of the IAM policies of the VPC CNI plugin to work. In a cluster with no OIDC provider, the policies required by the CNI plugin are attached to the node role, so the VPC resource controller ends up using it
>
> Is there a way we could make it more explicit in the code that the VPC controller also uses/needs this policy when OIDC is disabled? It feels almost accidental that the VPC controller worked

There's little value in attaching the VPC controller-specific policies to the node role, because they're a subset of the policies required by the VPC CNI plugin. I agree, though, that a comment documenting this behaviour would be useful.
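
Something along these lines could be added near where the node role's policies are attached (illustrative wording, not the PR's actual comment):

```go
// Note: without an OIDC provider, the AmazonEKS_CNI_Policy attached to the
// node role also serves the VPC resource controller, whose required
// permissions are a subset of the CNI plugin's. With an OIDC provider, both
// use IRSA, and the VPC controller gets its own narrower policy instead.
```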

@cPu1 force-pushed the windows-ip-2721 branch 4 times, most recently from 4af15e4 to 1b0e72e on January 20, 2021 11:11
@cPu1 changed the base branch from master to release-0.36 on January 20, 2021 12:33
@cPu1 force-pushed the windows-ip-2721 branch 2 times, most recently from a43e624 to fab451f on January 20, 2021 12:34
@michaelbeaumont (Contributor) left a comment:

My changes addressed, LGTM

@aclevername (Contributor) left a comment:

LGTM

@cPu1 merged commit ac55ba7 into eksctl-io:release-0.36 on Jan 20, 2021
@cPu1 deleted the windows-ip-2721 branch on January 20, 2021 13:57
Successfully merging this pull request may close these issues:

  • New Windows nodes unable to assign IP to pods (#2721)

4 participants