RolePolicyAttachment.iam is detaching policies every reconciliation #929

Closed
blakebarnett opened this issue Oct 20, 2023 · 4 comments · Fixed by #933
Labels: bug, needs:triage

Comments

@blakebarnett

What happened?

Every time a RolePolicyAttachment.iam.aws.upbound.io resource that has a policyArnSelector is reconciled, the policy is detached and re-attached to the role. This results in IAM permissions being temporarily incorrect, causing unexpected errors and downtime for some applications.

How can we reproduce it?

This happens for all of our resources using a selector similar to this:

spec:
  forProvider:
    policyArnSelector:
      matchControllerRef: true
      matchLabels:
        service: foo
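
For context, here is roughly what one of these resources looks like in full (the resource name below is an illustrative placeholder, not one of ours):

apiVersion: iam.aws.upbound.io/v1beta1
kind: RolePolicyAttachment
metadata:
  name: example-foo-attachment
spec:
  forProvider:
    roleSelector:
      matchControllerRef: true
    policyArnSelector:
      matchControllerRef: true
      matchLabels:
        service: foo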

If we do a kubectl get rolepolicyattachments -w, we see that any resources using this selector become unready on every reconciliation loop, and the policies disappear from the role in the AWS console and CLI. This lasts up to 30 seconds before the attachment is restored. The provider-aws-iam pod, with debug logs enabled, shows the creation event but not whatever is causing the policy to become detached.

What environment did it happen in?

  • Crossplane Version: 1.12.2
  • Provider Version: 0.41.0
  • Kubernetes Version: v1.24.16-eks-2d98532
  • Kubernetes Distribution: EKS
blakebarnett added the bug and needs:triage labels on Oct 20, 2023
@haarchri
Member

I notice the same problem on my end. For instance, if you're using Karpenter, it can cause pods to be unable to schedule due to missing CNI policies.

I can also see that in each reconciliation cycle, the RolePolicyAttachments fluctuate between Ready (True) and not Ready (False):

kubectl get rolepolicyattachments -w
NAME                                    READY   SYNCED   EXTERNAL-NAME                                                      AGE
configuration-aws-eks-84hbt             True    True     configuration-aws-eks-ghzpm-20231022104543979400000005             21h
configuration-aws-eks-99fwx             True    True     configuration-aws-eks-ghzpm-20231022105232293300000007             21h
configuration-aws-eks-fpbjh             True    True     configuration-aws-eks-ghzpm-20231022105025477300000006             21h
configuration-aws-eks-karpenter-4mplv   True    True     configuration-aws-eks-karpenter-752pv-20231022104348446800000003   13h
configuration-aws-eks-karpenter-htgld   True    True     configuration-aws-eks-karpenter-20231021212555995000000004         13h
configuration-aws-eks-karpenter-hw29t   True    True     configuration-aws-eks-karpenter-752pv-20231021210413778700000006   13h
configuration-aws-eks-karpenter-lqbn4   True    True     configuration-aws-eks-karpenter-752pv-20231022104428267400000004   13h
configuration-aws-eks-karpenter-wxswn   True    True     configuration-aws-eks-karpenter-752pv-20231022103826161600000008   13h
configuration-aws-eks-qrprp             True    True     configuration-aws-eks-ghzpm-20231021130217609800000001             21h
configuration-aws-eks-sd6bx             True    True     configuration-aws-eks-nq697-20231021130315164200000004             21h
configuration-aws-eks-karpenter-4mplv   True    True     configuration-aws-eks-karpenter-752pv-20231022104348446800000003   13h
configuration-aws-eks-karpenter-4mplv   True    True     configuration-aws-eks-karpenter-752pv-20231022104348446800000003   13h
configuration-aws-eks-karpenter-4mplv   False   True     configuration-aws-eks-karpenter-752pv-20231022104348446800000003   13h
configuration-aws-eks-karpenter-4mplv   False   True     configuration-aws-eks-karpenter-752pv-20231022105353024900000008   13h
configuration-aws-eks-karpenter-4mplv   True    True     configuration-aws-eks-karpenter-752pv-20231022105353024900000008   13h
configuration-aws-eks-karpenter-lqbn4   True    True     configuration-aws-eks-karpenter-752pv-20231022104428267400000004   13h
configuration-aws-eks-karpenter-lqbn4   True    True     configuration-aws-eks-karpenter-752pv-20231022104428267400000004   13h
configuration-aws-eks-karpenter-lqbn4   False   True     configuration-aws-eks-karpenter-752pv-20231022104428267400000004   13h
configuration-aws-eks-karpenter-lqbn4   False   True     configuration-aws-eks-karpenter-752pv-20231022105445235400000009   13h
configuration-aws-eks-karpenter-lqbn4   True    True     configuration-aws-eks-karpenter-752pv-20231022105445235400000009   13h
configuration-aws-eks-84hbt             True    True     configuration-aws-eks-ghzpm-20231022104543979400000005             21h
configuration-aws-eks-84hbt             True    True     configuration-aws-eks-ghzpm-20231022104543979400000005             21h
configuration-aws-eks-84hbt             False   True     configuration-aws-eks-ghzpm-20231022104543979400000005             21h
configuration-aws-eks-84hbt             False   True     configuration-aws-eks-ghzpm-2023102210553433180000000a             21h
configuration-aws-eks-84hbt             True    True     configuration-aws-eks-ghzpm-2023102210553433180000000a             21h


The main reason for this issue lies in the Role resource: we late-initialize spec.forProvider.managedPolicyArns:

kubectl get roles.iam configuration-aws-eks-karpenter-752pv -o yaml
apiVersion: iam.aws.upbound.io/v1beta1
kind: Role
metadata:
  annotations:
    crossplane.io/composition-resource-name: InstanceNodeRole
    crossplane.io/external-create-pending: "2023-10-21T21:04:10Z"
    crossplane.io/external-create-succeeded: "2023-10-21T21:04:10Z"
    crossplane.io/external-name: configuration-aws-eks-karpenter-752pv
    upjet.crossplane.io/provider-meta: "null"
  creationTimestamp: "2023-10-21T21:04:09Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generateName: configuration-aws-eks-karpenter-
  generation: 5
  labels:
    crossplane.io/claim-name: ""
    crossplane.io/claim-namespace: ""
    crossplane.io/composite: configuration-aws-eks-karpenter
    role: karpenter
  name: configuration-aws-eks-karpenter-752pv
  ownerReferences:
  - apiVersion: aws.platform.upbound.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: XKarpenter
    name: configuration-aws-eks-karpenter
    uid: 4c72a9f4-a3ec-4e42-84a0-a683a49a08be
  resourceVersion: "74374"
  uid: 889002df-e478-4e29-bfa9-9dd69294fcd9
spec:
  deletionPolicy: Delete
  forProvider:
    assumeRolePolicy: |
      {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "ec2.amazonaws.com"
                    ]
                },
                "Action": [
                    "sts:AssumeRole"
                ]
            }
        ]
      }
    managedPolicyArns:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore 
    maxSessionDuration: 3600
    path: /
    tags:
      crossplane-kind: role.iam.aws.upbound.io
      crossplane-name: configuration-aws-eks-karpenter-752pv
      crossplane-providerconfig: default
  initProvider: {}
  managementPolicies:
  - '*'
  providerConfigRef:
    name: default
status:
  atProvider:
    arn: arn:aws:iam::609897127049:role/configuration-aws-eks-karpenter-752pv
    assumeRolePolicy: '{"Statement":[{"Action":["sts:AssumeRole"],"Effect":"Allow","Principal":{"Service":["ec2.amazonaws.com"]}}],"Version":"2012-10-17"}'
    createDate: "2023-10-21T21:04:11Z"
    description: ""
    forceDetachPolicies: false
    id: configuration-aws-eks-karpenter-752pv
    managedPolicyArns:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
    - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
    - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    maxSessionDuration: 3600
    path: /
    roleLastUsed:
    - lastUsedDate: "2023-10-22T10:40:22Z"
      region: us-west-2
    tags:
      crossplane-kind: role.iam.aws.upbound.io
      crossplane-name: configuration-aws-eks-karpenter-752pv
      crossplane-providerconfig: default
    tagsAll:
      crossplane-kind: role.iam.aws.upbound.io
      crossplane-name: configuration-aws-eks-karpenter-752pv
      crossplane-providerconfig: default
    uniqueId: AROAY4AFTTSE6KNVQORHH
  conditions:
  - lastTransitionTime: "2023-10-21T21:04:21Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-10-22T03:36:51Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced
  - lastTransitionTime: "2023-10-21T21:04:12Z"
    reason: Success
    status: "True"
    type: LastAsyncOperation
  - lastTransitionTime: "2023-10-21T21:04:12Z"
    reason: Finished
    status: "True"
    type: AsyncOperation

After I configured all the policies explicitly in the Role resource, the problem no longer occurred, so we need to skip late initialization for spec.forProvider.managedPolicyArns:

    managedPolicyArns:
    - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
    - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
    - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
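To confirm the attachments stay put after this change, you can also watch the role directly in AWS with the standard CLI (role name taken from the Role manifest above):

aws iam list-attached-role-policies --role-name configuration-aws-eks-karpenter-752pv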

@mbbush
Collaborator

mbbush commented Oct 23, 2023

I think I've figured out why we're only seeing this issue with some resources.

This regression was introduced by #745, which was released in provider version 0.40.0. That caused all of our existing Role managed resources to late-initialize into spec.forProvider.managedPolicyArns whatever set of managed policies was attached at the time we upgraded to that provider version. All of those roles now try to enforce that set as the exclusive set of attached policies, but (at least where @blakebarnett and I work) it's mostly the correct set, so we haven't noticed huge problems.

Roles created after the 0.40.0 upgrade late-initialize an array of managed policies that is the result of a race between the Role managed resource and any RolePolicyAttachment managed resources created at the same time. Sometimes that array has all the policies we want, and sometimes it's only a subset. It never appears to be an empty array, probably thanks to some logic in late initialization.
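
The Role that @haarchri posted above shows exactly this: only the policy that won the race was late-initialized into the spec, while the observed state carries the full set that the RolePolicyAttachments manage:

# spec.forProvider (late-initialized, race-dependent subset)
managedPolicyArns:
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
# status.atProvider (what is actually attached in AWS)
managedPolicyArns:
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
- arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
- arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
- arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly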

Thinking about how to mitigate this, I think that users will need to upgrade to a version of the provider that doesn't late initialize spec.forProvider.managedPolicyArns (or simply disable late initialization via granular management policies on all their Role.iam resources) and then null out all the spec.forProvider.managedPolicyArns on their existing Role.iam resources.
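
For the management-policies route, the change on each Role would look roughly like this (a sketch assuming the management policies feature is enabled in your Crossplane installation; every policy except LateInitialize is kept):

spec:
  managementPolicies:
  - Observe
  - Create
  - Update
  - Delete
  # LateInitialize deliberately omitted so managedPolicyArns is never late-initialized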

It also seems like we should at least consider whether this warrants releasing the fix as versions 0.40.1, 0.41.1, and/or 0.42.1.

@haarchri
Member

Opened a PR for this: #933

blakebarnett changed the title from "RolePolicyAttachment.iam w/policyArnSelector is detaching policies every reconciliation" to "RolePolicyAttachment.iam is detaching policies every reconciliation" on Oct 23, 2023
@mbbush
Collaborator

mbbush commented Oct 30, 2023

In case anyone is running into issues and looking here for a fix: even once you upgrade to provider version 0.43, any iam Role managed resources that reconciled while provider version 0.40, 0.41, or 0.42 was installed will still have spec.forProvider.managedPolicyArns set, and the Role will keep detaching any policies not on that list, even if they're attached by a different RolePolicyAttachment Crossplane resource.

To get the Role to stop detaching policies, here's a one-liner that unsets spec.forProvider.managedPolicyArns on all your Role managed resources. This should generally be safe, but it's worth testing; you'll have to remove the --dry-run parameter for it to actually do anything.

kubectl get role.iam -o name | xargs kubectl patch --dry-run=server --patch '[{"op":"remove","path":"/spec/forProvider/managedPolicyArns"}]' --type=json
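
If you'd rather try it on a single resource first, the same JSON patch works per role (the name below is just a placeholder for one of your Role resources):

kubectl patch role.iam <role-name> --type=json --patch '[{"op":"remove","path":"/spec/forProvider/managedPolicyArns"}]'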
