Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary application refresh on update event cause high CPU usage on controller #13534

Closed
3 tasks done
agaudreault opened this issue May 10, 2023 · 3 comments · Fixed by #13912
Closed
3 tasks done
Labels
bug Something isn't working

Comments

@agaudreault
Copy link
Member

agaudreault commented May 10, 2023

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

TLDR; When a kubernetes resource watched by argo-cd is updated (whenever resourceVersion changes), argo-cd will receive an update event and trigger a refresh of the owner applications, if any. Resources that are updated frequently by controllers, such as ExternalSecret, will always have a new resourceVersion when information in the status field, such as lastRefreshTime, lastEvaluation, reconciledAt are updated. Although the resources are different, these fields should not cause a refresh in argo-cd. A refresh will cause a reconcile operation and even if the diffing evaluates to no changes, it consumes a lot of CPU.

More details

The cache from the argoproj/gitops-engine only store the resourceVersion and does not store the actual resource (except for crds). When a watch event is received, the cache will call OnResourceUpdated handler with the old and new object whenever the resourceVersion is updated, but the handler will not be able to do any relevant comparison because the old Resource object does not contain the previous version of the object.

When the argocd OnResourceUpdated handler receives an event, it will simply trigger/queue a refresh for the application, eventually causing a reconcile.

This causes a lot of reconciliation for the same application and high CPU usage on both the controller and the repo-server.

To Reproduce

  1. Update a resource deployed with argoCD by changing a value every 10 seconds (do not change a value that will cause an OutOfSync).
  2. Observe the reconcile operation being performed every 10 seconds.

Expected behavior

ArgoCD cache should be configurable to ignore certain fields to be evaluated when there is a watch event received. Even if the resourceVersion has changed, the cache should compare the actual objects and if no relevant changes are detected, it should update the cache with the new version without triggering the OnResourceUpdated handler.

To save memory, a hash of the manifest with excluded fields could be stored in the Resource object.
To configure the cache,

  • argocd could have an additional list of ResourceIgnoreDifferences to ignore fileds before computing the hash.
  • Have Lua scripts similar to the health customization that would receive the object manifest and return the manifest to hash.

Other considered solution/mitigation

  • Use the new argocd.argoproj.io/skip-reconcile: "true" annotations
    • This will block all the reconcile for an application and does not fix the issue.
  • Add a new annotation argocd.argoproj.io/ignore-resource-update: "true" on each resource that we do not want to trigger a reconcile for the application they belong to
    • This will require the users to add that annotations on multiple objects, probably with MutationWebhook, and adds a lot of burden on the ArgoCD users. It also ignores a resource completely instead of ignoring only certain fields
  • Expose configuration for the list of ignoredRefreshResources
    • This will ignore a resource completely instead of ignoring only certain fields.

Screenshots

Log analysis of which resource update are causing frequent applications refresh (test instance with no load)
image

Total number of App reconciliations (60m) when timeout.reconciliation: 3m and expected value should be 20. (test instance with no load)
image

Total number of reconciliations (15m) when timeout.reconciliation: 5m and expected value should be around 3. (instance with real load)
image

ArgoCD many application refresh and reconciliation, but no sync.
image

ArgoCD application refresh per controller
image

ArgoCD constant high cpu usage on controller with the most reconcile load
image

Version

argocd-server: v2.5.2+148d8da
  BuildDate: 2022-11-07T16:42:47Z
  GitCommit: 148d8da7a996f6c9f4d102fdd8e688c2ff3fd8c7
  GitTreeState: clean
  GoVersion: go1.18.8
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.1+g9f88ccb
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0

Logs

Paste any relevant application logs here.

Related to #9014

@agaudreault agaudreault added the bug Something isn't working label May 10, 2023
@agaudreault
Copy link
Member Author

@crenshaw-dev @alexmt @jessesuen sorry to ping you but what is your opinion on the best possible implementation? I created the issue here, but apart from configurations, I think most of the code needs to be done in https://github.com/argoproj/gitops-engine/tree/master/pkg/cache and you all contributed to this code previously.

I would like to get started on the implementation as soon as possible. I am curious how you guys are able to run argocd at scale with this behavior without using an enormous amount of CPU 😅

@todaywasawesome
Copy link
Contributor

@agaudreault-jive lets discuss in sig-scalability.

@agaudreault
Copy link
Member Author

agaudreault commented May 25, 2023

@csantanapr thanks for the suggestion to use pprof to find out which method uses CPU. I did the following step to produce the flame graph below:

  • kubectl port-forward argocd-application-controller-0 8082
  • curl "http://localhost:8082/debug/pprof/profile?seconds=900" > ./argocd-application-controller-0.Profile
  • Install https://graphviz.org/download/
  • Run go tool pprof -http=: ./argocd-application-controller-0.Profile

We can see that the root method using the most CPU is processAppRefreshQueueItem, which is the processing method executed after an item has been enqueued caused by a watched resource. So I believe that adding a mechanism to avoid calling this method on unnecessary watched resource updates will improve argo-cd performance.

The method IterateHierarchy could be improved as well if possible, to improve the reconcile operation overall speed. It seems like most time is consumed by [kube.ResourceKey]

image image

And for which resources are actually causing the refresh events, the most problematic ones are the ArgoCD Application themselves (most likely the status.reconciledAt update) combined with our apps-of-app pattern to have the Applications defined as code. Then ExternalSecret with the status.refreshTime. The other ones would need more investigation, but this shows that argo-cd is affected by the operators used on the clusters and justify the need to have a configurable setting to ignore update events in my opinion.
image

crenshaw-dev added a commit that referenced this issue Jun 25, 2023
) (#13912)

* feat: ignore watched resource update

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add doc and CLI

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc index

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add command

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* revert formatting

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* do not skip on health change

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update logging to use context

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* fix typos. local build broken...

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* change after review

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* manifestHash to string

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* more doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* example for argoproj Application

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add unit test for ignored logs

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* move hash and set log to debug

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* feature flag

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* less aggressive managedFields ignore rule

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* use local settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* latest settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* safety first

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* since it's behind a feature flag, go aggressive on overrides

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
yyzxw pushed a commit to yyzxw/argo-cd that referenced this issue Aug 9, 2023
…oproj#13534) (argoproj#13912)

* feat: ignore watched resource update

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add doc and CLI

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc index

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add command

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* revert formatting

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* do not skip on health change

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update logging to use context

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* fix typos. local build broken...

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* change after review

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* manifestHash to string

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* more doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* example for argoproj Application

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add unit test for ignored logs

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* move hash and set log to debug

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* feature flag

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* less aggressive managedFields ignore rule

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* use local settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* latest settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* safety first

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* since it's behind a feature flag, go aggressive on overrides

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
tesla59 pushed a commit to tesla59/argo-cd that referenced this issue Dec 16, 2023
…oproj#13534) (argoproj#13912)

* feat: ignore watched resource update

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add doc and CLI

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc index

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add command

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* revert formatting

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* do not skip on health change

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update logging to use context

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* fix typos. local build broken...

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* change after review

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* manifestHash to string

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* more doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* example for argoproj Application

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add unit test for ignored logs

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* move hash and set log to debug

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* feature flag

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* less aggressive managedFields ignore rule

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* use local settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* latest settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* safety first

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* since it's behind a feature flag, go aggressive on overrides

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants