kustomize-controller eventually errors out and stops reconciling #3689

Open

ldinh1984 opened this issue Mar 14, 2023 · 1 comment

Describe the bug

We have a fairly sizable deployment of the kustomize-controller (v0.32.0) with an increased resource quota, deployed to a node with 64 CPU cores and 256 GB of memory, so capacity is not an issue.

We also have concurrency set to 96; everything else is essentially out of the box. Things run well for a good 3-4 hours on our EKS cluster, but eventually the kustomize-controller starts accumulating failed reconciliations, all with the following error:
failed to update status, error: timed out waiting for the condition

From that point on, every subsequent Kustomization reconciliation produces the same error, with CPU and memory usage hovering near zero even though we have up to 20K Kustomization CRs at any given time. The errors don't go away until we restart the kustomize-controller.
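
For anyone hitting the same symptom, a quick way to confirm the controller is stuck in this state is to count the errors in its logs (the deployment name and namespace below assume a default Flux install):

# Count status-update timeouts over the last hour.
kubectl logs -n flux-system deploy/kustomize-controller --since=1h \
  | grep -c 'timed out waiting for the condition'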

Inspecting our Kubernetes API server requests shows a burst of HTTP 409 responses:
[Screenshot: API server request metrics showing HTTP 409 responses, taken 2023-03-14]

Digging further into the API server audit logs shows the typical request it errors on, which appears to be a patch updating the Kustomization CR's status:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"ff85d972-d379-4e2c-b765-f732c728b7b4","stage":"ResponseComplete","requestURI":"/apis/kustomize.toolkit.fluxcd.io/v1beta2/namespaces/nmspc/kustomizations/wrkld/status?fieldManager=gotk-kustomize-controller","verb":"patch","user":{"username":"system:serviceaccount:flux-system:kustomize-controller","uid":"bd13c9b0-b084-4c2a-a27e-1f5c3c35de81","groups":["system:serviceaccounts","system:serviceaccounts:flux-system","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["kustomize-controller-55cf9d7bfc-txhjs"],"authentication.kubernetes.io/pod-uid":["aadbd80f-ded0-4e50-825e-de607ad3235c"]}},"sourceIPs":["10.8.30.16"],"userAgent":"kustomize-controller/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"kustomizations","namespace":"nmspc","name":"wrkld","apiGroup":"kustomize.toolkit.fluxcd.io","apiVersion":"v1beta2","subresource":"status"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Conflict","code":409},"requestReceivedTimestamp":"2023-03-14T18:01:31.905898Z","stageTimestamp":"2023-03-14T18:01:31.916156Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-reconciler-flux-system\" of ClusterRole \"cluster-admin\" to ServiceAccount \"kustomize-controller/flux-system\""}}

Looking at the managed fields of one Kustomization CR shows two owners for the status subresource, which looks suspicious:

  - apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:inventory:
          .: {}
          f:entries: {}
        f:observedGeneration: {}
    manager: kustomize-controller
    operation: Update
    subresource: status
    time: "2022-11-15T23:34:38Z"
  - apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:lastAppliedRevision: {}
        f:lastAttemptedRevision: {}
    manager: gotk-kustomize-controller
    operation: Update
    subresource: status
    time: "2023-02-03T15:53:04Z"
  - apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"finalizers.fluxcd.io": {}
    manager: gotk-kustomize-controller
    operation: Update
    time: "2023-02-03T16:13:51Z"

As you can see above, the status subresource is owned by both gotk-kustomize-controller and kustomize-controller. Is this intended?
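
For reference, the managed fields above can be dumped from any affected object like this (the object name and namespace are taken from the audit entry above; substitute your own):

# kubectl hides managedFields by default; --show-managed-fields reveals them.
kubectl get kustomization wrkld -n nmspc --show-managed-fields -o yaml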

I also noticed this commit that added more patches to update the status more regularly. Could those conflict with patches/updates issued earlier in the reconciliation?

For what it's worth, even if we turn concurrency down to 8 we eventually run into the same cascading failure; it just takes longer. If you have any idea what's going on, or suggestions for things to try, I'm all ears.

Steps to reproduce

  1. install Flux v0.38.3, which ships kustomize-controller v0.32.0
  2. create 15-20K Kustomization CRs with pruning enabled (a load-generation sketch follows this list)
  3. have the Kustomization CRs track a Flux Git repository that receives commits roughly every 10-15 minutes; each commit adds and/or removes resources for 500+ Kustomizations, triggering pruning
  4. leave the controller running for 3-4 hours
  5. eventually every reconciliation errors out with failed to update status, error: timed out waiting for the condition, and the errors don't go away until the kustomize-controller is restarted
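
A rough sketch of the load-generation step (step 2) is below; the names, namespace, paths, and GitRepository ref are all placeholders:

# Hypothetical load generator: create N Kustomization CRs with pruning
# enabled, all tracking a single GitRepository.
for i in $(seq 1 20000); do
cat <<EOF
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: wrkld-${i}
  namespace: nmspc
spec:
  interval: 10m
  prune: true
  path: "./workloads/wrkld-${i}"
  sourceRef:
    kind: GitRepository
    name: flux-repo
EOF
done | kubectl apply -f -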

Expected behavior

The controller should recover from the failure and continue reconciling.

Screenshots and recordings

No response

OS / Distro

Amazon linux (linux/amd64)

Flux version

v0.38.3

Flux check

► checking prerequisites
✗ flux 0.38.3 <0.41.1 (new version is available, please upgrade)
✔ Kubernetes 1.22.16-eks-ffeb93d >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.28.1
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.32.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.30.2
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.33.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta2
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta2
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1beta2
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@darkowlzz (Contributor)

Hi, thanks for reporting this issue.
It would be very helpful to get more information about the observed state of the Kustomization objects when this happens, specifically their status. Can you share what the status of an affected Kustomization looks like? That would help us reason about the issue and perhaps reproduce it without waiting for it to happen.
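
For example, something like the following would capture the status of one affected object (the name and namespace are placeholders taken from the audit log above):

# Dump only the status subresource of an affected Kustomization.
kubectl get kustomization wrkld -n nmspc -o jsonpath='{.status}'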
