
Argocd out of memory after upgrade #4298

Closed
erezo9 opened this issue Sep 10, 2020 · 27 comments · Fixed by #4328
Labels
bug (Something isn't working) · cherry-pick/1.7 (Candidate for cherry picking into the 1.7 release branch) · regression (Bug is a regression, should be handled with high priority) · type:scalability (Issues related to scalability and performance related issues)
Milestone

Comments

@erezo9

erezo9 commented Sep 10, 2020

If you are trying to resolve an environment-specific issue or have a one-off question about an edge case that does not require a feature, then please consider asking a question in the Argo CD Slack channel.

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

After upgrading to 1.7.4 from 1.6.2 we experience a lot of OOM kills on the application controller, and it restarts frequently, even after increasing the memory limit.

To Reproduce

We have more than 200 applications and a single application controller.

Expected behavior

Behave the same as version 1.6.2, where we didn't experience this issue.

Screenshots

If applicable, add screenshots to help explain your problem.

Version

Paste the output from `argocd version` here.

Unfortunately I cannot share it, since it is confidential.

Paste any relevant application logs here.
@erezo9 erezo9 added the bug label Sep 10, 2020
@martinbeentjes

What difference in resource usage do you see?

@erezo9
Author

erezo9 commented Sep 10, 2020

@martinbeentjes mostly memory; it reaches 6000Mi, and CPU reaches 3000m.
We set resource limits to stop it from consuming all of the node's resources.
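For reference, a minimal sketch of raising those limits (assuming the controller runs as a Deployment named argocd-application-controller in the argocd namespace, as in the stock install manifests; the values are illustrative only):

```sh
# Illustrative values; adjust to your cluster. A higher limit keeps the controller
# from being OOM killed while the root cause is investigated.
kubectl -n argocd patch deployment argocd-application-controller --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"argocd-application-controller","resources":{"limits":{"cpu":"4","memory":"8Gi"}}}]}}}}'
```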

@jessesuen jessesuen added the cherry-pick/1.7, regression, and type:scalability labels Sep 10, 2020
@jessesuen
Member

Do you happen to have Prometheus / Grafana installed for Argo CD? If so, would you be able to check for trends in the controller telemetry? I'm looking for these graphs:

[screenshot: controller telemetry graphs]
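If Prometheus/Grafana are not available, a rough spot-check is possible with metrics-server (a sketch that assumes the controller pod carries the standard app.kubernetes.io/name label from the install manifests):

```sh
# Rough point-in-time view of controller CPU/memory usage (requires metrics-server).
kubectl -n argocd top pod -l app.kubernetes.io/name=argocd-application-controller
```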

@erezo9
Author

erezo9 commented Sep 11, 2020

@jessesuen I'll send an image on Sunday; I don't have it with me right now.

@erezo9
Author

erezo9 commented Sep 11, 2020

@jessesuen I also see that the controller uses 14GiB at some point; is that considered normal behavior?

@jessesuen
Member

jessesuen commented Sep 11, 2020

Yes. Argo CD memory/CPU usage scales linearly with the number of clusters times the number of managed resources in each cluster.

The previous screenshot was taken from our largest Argo CD instance which manages 176 clusters but with only 270 applications.

For comparison, another of our Argo CD instances is at the opposite extreme: it manages only 15 clusters, but with 1700 applications. Its graphs look like:
[screenshot: controller telemetry graphs]

@jessesuen
Member

> After upgrading to 1.7.4 from 1.6.2 we experience a lot of OOM kills on the application controller, and it restarts frequently, even after increasing the memory limit.

Is the controller restarting because it is being OOM killed? If so, I would suggest bumping the memory limits for v1.7 to see whether v1.7 inherently needs more memory, or whether there really is a leak in the controller.
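A quick way to confirm OOM kills (a sketch assuming the default argocd namespace and the standard controller label; adjust to your install):

```sh
# If the last termination reason prints as "OOMKilled", the restarts come from
# the memory limit rather than a crash.
kubectl -n argocd get pods -l app.kubernetes.io/name=argocd-application-controller \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```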

@erezo9
Author

erezo9 commented Sep 11, 2020

@jessesuen Well, I don't know whether there was a change in how out-of-sync apps are looped over, but we have about 8 clusters and 200 applications. 16 of them are not synced due to validation (we need to add a flag to make the sync happen).
Were any changes made to sync waves, by any chance?

@reggie-k
Member

@erezo9 and I are on the same team.
There are a few issues we face after upgrading to 1.7.4:

  1. High memory usage of the controller.
  2. Sync waves stopped working (namely, the sync is stuck forever when there are resources with sync waves in the Argo app), so I had to remove them and rely on the sync auto-retries (we can live with that, but syncing with auto-retries is rather slow); see the diagnostic sketch after this list.
  3. After deleting a child Argo app from the git repo (there is an app-of-apps with auto-prune watching that app), the deletion sync of the child app is stuck, and remains so until I delete it manually from the UI.
  4. The overall performance observed in the UI seems worse than in the previous version (1.6.2).
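For the sync-wave issue, a small diagnostic sketch (hedged: my-app and the manifests path are placeholders; the annotation key is the documented argocd.argoproj.io/sync-wave):

```sh
# Which resource is a stuck sync waiting on, and which manifests declare waves?
argocd app get my-app
grep -rn "argocd.argoproj.io/sync-wave" path/to/manifests/
```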

@alexmt
Collaborator

alexmt commented Sep 14, 2020

Reopening the issue until the fix is released and tested.

@alexmt alexmt reopened this Sep 14, 2020
@erezo9
Author

erezo9 commented Sep 15, 2020

@alexmt
Thanks for fixing the issue. We will try it as soon as the release comes out and will update the issue if relevant.

@reggie-k
Member

@alexmt Thanks a lot for the quick fix. We installed it; memory usage seems slightly lower, but there are still OOM restarts. Deletion of apps behaves better now. Regarding the overall performance and the sync waves issue, I can't say enough yet; I have to test further and will update.

@alexmt
Collaborator

alexmt commented Sep 16, 2020

Sorry for the trouble this upgrade caused, @reggie-k, @erezo9. The fix delivered in 1.7.5 should solve the memory spike during controller initialization. I will keep investigating the reason for the OOM restarts and keep testing the sync issues with app-of-apps, and will update as soon as I find something.

@jessesuen jessesuen added this to the v1.8 milestone Sep 16, 2020
@erezo9
Author

erezo9 commented Sep 21, 2020

@jessesuen just wanted to give an update: after upgrading to 1.7.6 we see the same issue.

@reggie-k
Member

reggie-k commented Sep 22, 2020 via email

@reggie-k
Member

reggie-k commented Sep 22, 2020 via email

@alexmt
Collaborator

alexmt commented Sep 22, 2020

Hello @reggie-k, there are probably two separate issues here.

Can you please share more? Does your umbrella app manage itself? There is a known chicken-and-egg problem: during syncing, Argo CD waits for the umbrella app to become healthy, which cannot happen until all of its resources are synced. #3781

If this is the case, you should see a `waiting for ...` message in the sync status panel:
[screenshot: sync status panel showing the waiting message]

Regarding the crash-loop state: can you please attach logs?

@reggie-k
Member

@alexmt
OK, we have a direction here. For some reason, the controller is creating a huge number of files. We observed the following errors in the UI:
Failed to write manifest: write /dev/shm/862955259: no space left on device
Failed to write kubeconfig: write /dev/shm/550197066: no space left on device

On the node the controller runs on, we see that the partition /var/lib/docker/containers/212e7d626d101e3f4392eea7cc19bf0f595302f77ea8d135ac32425f2084fc78/mounts/shm has used 100% of its available space (4.0K).

This container is the Argo CD controller.
I am copying the log right now and will find a way to send it later (security constraints), and will check whether a restart resolves it for now.
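A hedged way to watch that shm mount from inside the cluster (assuming the controller runs as a Deployment named argocd-application-controller in the argocd namespace):

```sh
# How full is the controller's /dev/shm, and what is accumulating there?
kubectl -n argocd exec deploy/argocd-application-controller -- df -h /dev/shm
kubectl -n argocd exec deploy/argocd-application-controller -- ls -l /dev/shm | head
```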

@alexmt
Collaborator

alexmt commented Sep 25, 2020

@reggie-k , the controller creates a resource manifest file and a kubeconfig in order to execute kubectl apply. Do you have apps with auto-sync and possibly self-heal? I suspect that after upgrading to 1.7 some application is permanently in an out-of-sync state and the controller is continuously trying to synchronize it. Can you check whether this is happening in your case?

In that case the controller should back off and notify about the issue, but this is not implemented yet.
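One way to check for apps stuck in OutOfSync (a sketch: the first command needs a logged-in argocd CLI, the second queries the Application CRs directly and assumes they live in the argocd namespace):

```sh
# Apps the controller may be re-syncing in a tight loop.
argocd app list | grep OutOfSync
kubectl -n argocd get applications.argoproj.io \
  -o jsonpath='{range .items[?(@.status.sync.status=="OutOfSync")]}{.metadata.name}{"\n"}{end}'
```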

@reggie-k
Member

Hmmm, we suspected something of that kind and resolved the vast majority of out-of-sync apps after the upgrade to 1.6. But I will double-check on Tuesday whether some remained that way; it's holiday time here right now. Yes, all of our apps use auto-sync.

alexmt pushed a commit to alexmt/argo-cd that referenced this issue Sep 25, 2020
@alexmt
Collaborator

alexmt commented Sep 26, 2020

The potential fix has been merged into master: #4434

I'm going to run it internally for a few days, just in case, and will release it in 1.7.7.

@reggie-k
Member

reggie-k commented Sep 26, 2020 via email

@reggie-k
Member

I am making all the out-of-sync apps sync.
We will install 1.7.7.
Also working on sending the log.

@reggie-k
Member

All the apps are now synced (some are degraded or progressing, but as I understand it this is not a problem with regard to the controller restarts and slow sync).
We have also installed 1.7.7.
So far there has been only one controller restart, and the sync seems faster now.
I can't tell whether 1.7.7 improved the sync time or whether it's the fact that all apps are now synced.
Will observe further.

@jessesuen
Member

Closing, but please file a new issue if you see further problems

@eddycharly

eddycharly commented Jun 14, 2021

Hello,
We observed the same issue today (/dev/shm: no space left on device), running version 1.8.4.
We have more than 500 apps, and potentially some are quite often out of sync (auto-sync and self-heal are enabled).

Do you think it should already have been fixed, or could this be the same issue?
Should /dev/shm be cleaned up regularly by Argo CD?

@xlanor

xlanor commented Aug 13, 2021

@eddycharly , we are running into the same issue, using the latest Helm chart with Argo CD v2.0.5.

So far, our fix has been to terminate the pods and let new ones spawn (which should happen automatically via the ReplicaSet). It's a hacky workaround, but it may help anyone who urgently needs to get things up and running.
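The workaround above, expressed as commands (a sketch assuming the default argocd namespace and the standard labels from the install manifests; deleting the pod gives it a fresh /dev/shm because the workload controller recreates it):

```sh
# Hacky stop-gap until a real fix: recreate the controller pod(s) to clear /dev/shm.
kubectl -n argocd delete pod -l app.kubernetes.io/name=argocd-application-controller
```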
