-
Notifications
You must be signed in to change notification settings - Fork 5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unnecessary application refresh on update event cause high CPU usage on controller #13534
Comments
@crenshaw-dev @alexmt @jessesuen sorry to ping you but what is your opinion on the best possible implementation? I created the issue here, but apart from configurations, I think most of the code needs to be done in https://github.com/argoproj/gitops-engine/tree/master/pkg/cache and you all contributed to this code previously. I would like to get started on the implementation as soon as possible. I am curious how you guys are able to run argocd at scale with this behavior without using an enormous amount of CPU 😅 |
@agaudreault-jive lets discuss in sig-scalability. |
@csantanapr thanks for the suggestion to use pprof to find out which method uses CPU. I did the following step to produce the flame graph below:
We can see that the root method using the most CPU is The method And for which resources are actually causing the refresh events, the most problematic ones are the ArgoCD Application themselves (most likely the |
) (#13912) * feat: ignore watched resource update Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add doc and CLI Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update doc index Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add command Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * codegen Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * revert formatting Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * do not skip on health change Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update doc Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update logging to use context Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * fix typos. local build broken... Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * change after review Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * manifestHash to string Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * more doc Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * example for argoproj Application Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add unit test for ignored logs Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * codegen Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update docs/operator-manual/reconcile.md Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * move hash and set log to debug Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update util/settings/settings.go Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update util/settings/settings.go Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * feature flag Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * fix Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * less aggressive managedFields ignore rule Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * Update docs/operator-manual/reconcile.md Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * use local settings Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * latest settings Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * safety first Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * since it's behind a feature flag, go aggressive on overrides Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> --------- Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
…oproj#13534) (argoproj#13912) * feat: ignore watched resource update Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add doc and CLI Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update doc index Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add command Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * codegen Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * revert formatting Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * do not skip on health change Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update doc Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update logging to use context Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * fix typos. local build broken... Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * change after review Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * manifestHash to string Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * more doc Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * example for argoproj Application Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add unit test for ignored logs Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * codegen Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update docs/operator-manual/reconcile.md Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * move hash and set log to debug Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update util/settings/settings.go Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update util/settings/settings.go Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * feature flag Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * fix Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * less aggressive managedFields ignore rule Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * Update docs/operator-manual/reconcile.md Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * use local settings Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * latest settings Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * safety first Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * since it's behind a feature flag, go aggressive on overrides Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> --------- Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
…oproj#13534) (argoproj#13912) * feat: ignore watched resource update Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add doc and CLI Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update doc index Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add command Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * codegen Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * revert formatting Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * do not skip on health change Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update doc Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * update logging to use context Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * fix typos. local build broken... Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * change after review Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * manifestHash to string Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * more doc Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * example for argoproj Application Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * add unit test for ignored logs Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * codegen Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update docs/operator-manual/reconcile.md Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * move hash and set log to debug Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update util/settings/settings.go Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * Update util/settings/settings.go Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * feature flag Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * fix Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * less aggressive managedFields ignore rule Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * Update docs/operator-manual/reconcile.md Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> * use local settings Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * latest settings Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * safety first Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> * since it's behind a feature flag, go aggressive on overrides Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> --------- Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com> Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com> Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Checklist:
argocd version
.Describe the bug
TLDR; When a kubernetes resource watched by argo-cd is updated (whenever
resourceVersion
changes), argo-cd will receive an update event and trigger a refresh of the owner applications, if any. Resources that are updated frequently by controllers, such as ExternalSecret, will always have a newresourceVersion
when information in thestatus
field, such aslastRefreshTime
,lastEvaluation
,reconciledAt
are updated. Although the resources are different, these fields should not cause a refresh in argo-cd. A refresh will cause a reconcile operation and even if the diffing evaluates to no changes, it consumes a lot of CPU.More details
The cache from the argoproj/gitops-engine only store the
resourceVersion
and does not store the actual resource (except for crds). When a watch event is received, the cache will call OnResourceUpdated handler with the old and new object whenever theresourceVersion
is updated, but the handler will not be able to do any relevant comparison because the old Resource object does not contain the previous version of the object.When the argocd OnResourceUpdated handler receives an event, it will simply trigger/queue a refresh for the application, eventually causing a reconcile.
This causes a lot of reconciliation for the same application and high CPU usage on both the controller and the repo-server.
To Reproduce
Expected behavior
ArgoCD cache should be configurable to ignore certain fields to be evaluated when there is a watch event received. Even if the resourceVersion has changed, the cache should compare the actual objects and if no relevant changes are detected, it should update the cache with the new version without triggering the OnResourceUpdated handler.
To save memory, a hash of the manifest with excluded fields could be stored in the
Resource
object.To configure the cache,
Other considered solution/mitigation
argocd.argoproj.io/skip-reconcile: "true"
annotationsargocd.argoproj.io/ignore-resource-update: "true"
on each resource that we do not want to trigger a reconcile for the application they belong toScreenshots
Log analysis of which resource update are causing frequent applications refresh (test instance with no load)
Total number of App reconciliations (60m) when
timeout.reconciliation: 3m
and expected value should be 20. (test instance with no load)Total number of reconciliations (15m) when
timeout.reconciliation: 5m
and expected value should be around 3. (instance with real load)ArgoCD many application refresh and reconciliation, but no sync.
ArgoCD application refresh per controller
ArgoCD constant high cpu usage on controller with the most reconcile load
Version
Logs
Related to #9014
The text was updated successfully, but these errors were encountered: