
Argocd out of memory after upgrade #4298

Closed
erezo9 opened this issue Sep 10, 2020 · 27 comments · Fixed by #4328
Labels
bug (Something isn't working) · cherry-pick/1.7 (Candidate for cherry picking into the 1.7 release branch) · regression (Bug is a regression, should be handled with high priority) · type:scalability (Issues related to scalability and performance related issues)
Milestone

Comments

@erezo9

erezo9 commented Sep 10, 2020

If you are trying to resolve an environment-specific issue or have a one-off question about an edge case that does not require a feature, then please consider asking a question in the Argo CD Slack channel.

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

After upgrading to 1.7.4 from 1.6.2 we experience a lot of OOM kills on the application controller, and it restarts frequently, even after increasing the memory limit.

To Reproduce

We have more than 200 applications and a single application controller.

Expected behavior

Behave the same as version 1.6.2, where we didn't experience this issue.

Screenshots

If applicable, add screenshots to help explain your problem.

Version

Paste the output from `argocd version` here.

Unfortunately I cannot share it, since it is confidential.

Paste any relevant application logs here.
@erezo9 erezo9 added the bug label Sep 10, 2020
@martinbeentjes

What difference in resource usage do you see?

@erezo9
Author

erezo9 commented Sep 10, 2020

@martinbeentjes mostly memory; it reaches 6000Mi, and CPU reaches 3000m.
We set resource limits to stop it from consuming all of the node's resources.
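For reference, a minimal sketch of raising those limits (assuming the controller runs as a Deployment named argocd-application-controller in the argocd namespace, as in the stock install manifests; the values are illustrative only):

```sh
# Illustrative values; adjust to your cluster. A higher limit keeps the controller
# from being OOM killed while the root cause is investigated.
kubectl -n argocd patch deployment argocd-application-controller --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"argocd-application-controller","resources":{"limits":{"cpu":"4","memory":"8Gi"}}}]}}}}'
```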

@jessesuen jessesuen added the cherry-pick/1.7, regression, and type:scalability labels Sep 10, 2020
@jessesuen
Member

Do you happen to have Prometheus / Grafana installed for Argo CD? If so, would you be able to check for trends in the controller telemetry? I'm looking for these graphs:

[screenshot: controller telemetry graphs]
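If Prometheus/Grafana are not available, a rough spot-check is possible with metrics-server (a sketch that assumes the controller pod carries the standard app.kubernetes.io/name label from the install manifests):

```sh
# Rough point-in-time view of controller CPU/memory usage (requires metrics-server).
kubectl -n argocd top pod -l app.kubernetes.io/name=argocd-application-controller
```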

@erezo9
Author

erezo9 commented Sep 11, 2020

@jessesuen I'll send an image on Sunday; I don't have it with me right now.

@erezo9
Author

erezo9 commented Sep 11, 2020

@jessesuen I also see that the controller uses 14GiB at some point; is that considered normal behavior?

@jessesuen
Member

jessesuen commented Sep 11, 2020

Yes. Argo CD memory/CPU usage scales linearly with the number of clusters times the number of managed resources in each cluster.

The previous screenshot was taken from our largest Argo CD instance which manages 176 clusters but with only 270 applications.

For comparison, another of our Argo CD instances is at the opposite extreme: it manages only 15 clusters, but with 1700 applications. Its graphs look like:
[screenshot: controller telemetry graphs]

@jessesuen
Member

> After upgrading to 1.7.4 from 1.6.2 we experience a lot of OOM kills on the application controller, and it restarts frequently, even after increasing the memory limit.

Is the controller restarting because it is being OOM killed? If so, I would suggest bumping the memory limits for v1.7 to see whether v1.7 inherently needs more memory, or whether there really is a leak in the controller.
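A quick way to confirm OOM kills (a sketch assuming the default argocd namespace and the standard controller label; adjust to your install):

```sh
# If the last termination reason prints as "OOMKilled", the restarts come from
# the memory limit rather than a crash.
kubectl -n argocd get pods -l app.kubernetes.io/name=argocd-application-controller \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```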

@erezo9
Author

erezo9 commented Sep 11, 2020

@jessesuen Well, I don't know whether there was a change in how out-of-sync apps are looped over, but we have about 8 clusters and 200 applications. 16 of them are not synced due to validation (we need to add a flag to make the sync happen).
Were any changes made to sync waves, by any chance?

@reggie-k
Member

@erezo9 and I are on the same team.
There are a few issues we face after upgrading to 1.7.4:

  1. High memory usage of the controller.
  2. Sync waves stopped working (namely, the sync is stuck forever when there are resources with sync waves in the Argo app), so I had to remove them and rely on the sync auto-retries (we can live with that, but syncing with auto-retries is rather slow); see the diagnostic sketch after this list.
  3. After deleting a child Argo app from the git repo (there is an app-of-apps with auto-prune watching that app), the deletion sync of the child app is stuck, and remains so until I delete it manually from the UI.
  4. The overall performance observed in the UI seems worse than in the previous version (1.6.2).
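For the sync-wave issue, a small diagnostic sketch (hedged: my-app and the manifests path are placeholders; the annotation key is the documented argocd.argoproj.io/sync-wave):

```sh
# Which resource is a stuck sync waiting on, and which manifests declare waves?
argocd app get my-app
grep -rn "argocd.argoproj.io/sync-wave" path/to/manifests/
```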

@alexmt
Collaborator

alexmt commented Sep 14, 2020

Reopening the issue until the fix is released and tested.

@alexmt alexmt reopened this Sep 14, 2020
@erezo9
Author

erezo9 commented Sep 15, 2020

@alexmt
Thanks for fixing the issue. We will try it as soon as the release comes out and will update the issue if relevant.

@reggie-k
Member

@alexmt Thanks a lot for the quick fix. We installed it; memory usage seems slightly lower, but there are still OOM restarts. Deletion of apps behaves better now. Regarding the overall performance and the sync waves issue, I can't say enough yet; I have to test further and will update.

@alexmt
Collaborator

alexmt commented Sep 16, 2020

Sorry for the trouble this upgrade caused, @reggie-k, @erezo9. The fix delivered in 1.7.5 should solve the memory spike during controller initialization. I will keep investigating the reason for the OOM restarts and keep testing the sync issues with app-of-apps, and will update as soon as I find something.

@jessesuen jessesuen added this to the v1.8 milestone Sep 16, 2020
@erezo9
Author

erezo9 commented Sep 21, 2020

@jessesuen just wanted to give an update: after upgrading to 1.7.6 we see the same issue.

@reggie-k
Member

reggie-k commented Sep 22, 2020 via email

@reggie-k
Member

reggie-k commented Sep 22, 2020 via email

@alexmt
Collaborator

alexmt commented Sep 22, 2020

Hello @reggie-k, there are probably two separate issues here.

Can you please share more? Does your umbrella app manage itself? There is a known chicken-and-egg problem: during syncing, Argo CD waits for the umbrella app to become healthy, which cannot happen until all of its resources are synced. #3781

If this is the case, you should see a `waiting for ...` message in the sync status panel:
[screenshot: sync status panel showing the waiting message]

Regarding the crash-loop state: can you please attach logs?

@reggie-k
Member

@alexmt
OK, we have a direction here. For some reason, the controller is creating a huge number of files. We observed the following errors in the UI:
Failed to write manifest: write /dev/shm/862955259: no space left on device
Failed to write kubeconfig: write /dev/shm/550197066: no space left on device

On the node the controller runs on, we see that the partition /var/lib/docker/containers/212e7d626d101e3f4392eea7cc19bf0f595302f77ea8d135ac32425f2084fc78/mounts/shm has used 100% of its available space (4.0K).

This container is the Argo CD controller.
I am copying the log right now and will find a way to send it later (security constraints), and will check whether a restart resolves it for now.
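A hedged way to watch that shm mount from inside the cluster (assuming the controller runs as a Deployment named argocd-application-controller in the argocd namespace):

```sh
# How full is the controller's /dev/shm, and what is accumulating there?
kubectl -n argocd exec deploy/argocd-application-controller -- df -h /dev/shm
kubectl -n argocd exec deploy/argocd-application-controller -- ls -l /dev/shm | head
```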

@alexmt
Collaborator

alexmt commented Sep 25, 2020

@reggie-k , the controller creates a resource manifest file and a kubeconfig in order to execute kubectl apply. Do you have apps with auto-sync and possibly self-heal? I suspect that after upgrading to 1.7 some application is permanently in an out-of-sync state and the controller is continuously trying to synchronize it. Can you check whether this is happening in your case?

In that case the controller should back off and notify about the issue, but this is not implemented yet.
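One way to check for apps stuck in OutOfSync (a sketch: the first command needs a logged-in argocd CLI, the second queries the Application CRs directly and assumes they live in the argocd namespace):

```sh
# Apps the controller may be re-syncing in a tight loop.
argocd app list | grep OutOfSync
kubectl -n argocd get applications.argoproj.io \
  -o jsonpath='{range .items[?(@.status.sync.status=="OutOfSync")]}{.metadata.name}{"\n"}{end}'
```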

@reggie-k
Member

Hmmm, we suspected something of that kind and resolved the vast majority of out-of-sync apps after the upgrade to 1.6. But I will double-check on Tuesday whether some remained that way; it's holiday time here right now. Yes, all of our apps use auto-sync.

alexmt pushed a commit to alexmt/argo-cd that referenced this issue Sep 25, 2020
@alexmt
Collaborator

alexmt commented Sep 26, 2020

The potential fix has been merged into master: #4434

I'm going to run it internally for a few days, just in case, and will release it in 1.7.7.

@reggie-k
Member

reggie-k commented Sep 26, 2020 via email

@reggie-k
Member

I am making all the out-of-sync apps sync.
We will install 1.7.7.
Also working on sending the log.

@reggie-k
Member

All the apps are now synced (some are degraded or progressing, but as I understand it this is not a problem with regard to the controller restarts and slow sync).
We have also installed 1.7.7.
So far there has been only one controller restart, and the sync seems faster now.
I can't tell whether 1.7.7 improved the sync time or whether it's the fact that all apps are now synced.
Will observe further.

@jessesuen
Member

Closing, but please file a new issue if you see further problems

@eddycharly

eddycharly commented Jun 14, 2021

Hello,
We observed the same issue today (/dev/shm: no space left on device), running version 1.8.4.
We have more than 500 apps, and potentially some are quite often out of sync (auto-sync and self-heal are enabled).

Do you think it should already have been fixed, or could this be the same issue?
Should /dev/shm be cleaned up regularly by Argo CD?

@xlanor

xlanor commented Aug 13, 2021

@eddycharly , we are running into the same issue, using the latest Helm chart with Argo CD v2.0.5.

So far, our fix has been to terminate the pods and let new ones spawn (which should happen automatically via the ReplicaSet). It's a hacky workaround, but it may help anyone who urgently needs to get things up and running.
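The workaround above, expressed as commands (a sketch assuming the default argocd namespace and the standard labels from the install manifests; deleting the pod gives it a fresh /dev/shm because the workload controller recreates it):

```sh
# Hacky stop-gap until a real fix: recreate the controller pod(s) to clear /dev/shm.
kubectl -n argocd delete pod -l app.kubernetes.io/name=argocd-application-controller
```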
