Prometheus metrics cache causes scaling issues with ephemeral projects and apps #5287

Closed
victorboissiere opened this issue Jan 21, 2021 · 3 comments · Fixed by #5304
Labels: enhancement (New feature or request), workaround (There's a workaround, might not be great, but exists)

Comments

@victorboissiere
Contributor

Summary

When a project and/or application is deleted, it is not removed from the Prometheus metrics. Its series stay in the cache until the next rolling update of the ArgoCD application controller.

Motivation

We use ArgoCD in our QA environment with temporary projects and associated applications.
Each time we create a custom branch on a repository, the resources needed to bootstrap a dedicated QA environment are created in ArgoCD. However, when these ArgoCD resources are deleted, they remain in the Prometheus metrics.
We have about 2000 to 4000 applications, and we run about 500 delete/create operations every day.

The Prometheus /metrics endpoint sometimes times out after 10 seconds because of the huge number of metrics. The stale metrics also put pressure on metrics retention.

Proposal

I see three ways of addressing this:

  • from time to time within the app, resetting metrics with the Prometheus client
  • use a specific API endpoint to allow resetting the Prometheus metrics
  • add support on app/project deletion to also remove the associated metrics

There are probably other solutions as well. In the meantime, as a very naive mitigation, we scheduled a cronjob that deletes the argocd-application-controller pod every night.

@victorboissiere victorboissiere added the enhancement New feature or request label Jan 21, 2021
@jessesuen
Member

We need to support this use case.

from time to time within the app, resetting metrics with the Prometheus client

I spoke with @alexmt, and of all the proposals, option 1 seems the best. Option 2 seems like unnecessary integration effort, and option 3 would be unreliable because it is easy to miss application delete events, so you end up having to implement some form of option 1 anyway.
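The reliability concern with option 3 can be illustrated with a small, self-contained Go simulation. Everything here is hypothetical (the `event` type, `applyEvents`, `rebuild`, and the application names) and only models the failure mode: one lost delete event leaves a stale series behind, while rebuilding from the live state removes it.

```go
package main

import "fmt"

// event is a hypothetical application lifecycle notification.
type event struct {
	app     string
	deleted bool
}

// applyEvents maintains the set of exported series purely from events,
// which is the idea behind option 3.
func applyEvents(series map[string]bool, events []event) {
	for _, e := range events {
		if e.deleted {
			delete(series, e.app)
		} else {
			series[e.app] = true
		}
	}
}

// rebuild discards all series and recreates them from the applications
// that currently exist, which is the idea behind option 1.
func rebuild(live []string) map[string]bool {
	series := make(map[string]bool, len(live))
	for _, app := range live {
		series[app] = true
	}
	return series
}

func main() {
	series := map[string]bool{}

	// Both applications have been deleted, but the delete event for
	// "qa-b" is assumed to have been missed.
	applyEvents(series, []event{
		{app: "qa-a"},
		{app: "qa-b"},
		{app: "qa-a", deleted: true},
	})
	fmt.Println("series tracked from events:", len(series)) // "qa-b" is stale

	// No application exists any more, so a rebuild leaves nothing behind.
	series = rebuild(nil)
	fmt.Println("series after rebuild:", len(series))
}
```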

@victorboissiere would you like to contribute this change? We may not get around to this for v1.9.

@jessesuen
Member

The workaround for this is a daily cronjob that restarts the application-controller.

@jessesuen jessesuen added the workaround There's a workaround, might not be great, but exists label Jan 21, 2021
@victorboissiere
Contributor Author

@jessesuen thanks for the feedback. I saw that a cron package is already used for the sync window.
I'll try to reuse it to reset the Prometheus metrics every 24 hours. I'll submit the PR following the guide in the documentation.

alexmt pushed a commit that referenced this issue Feb 10, 2021
* feat(prom): Add prometheus metrics reset support

Signed-off-by: Victor Boissiere <victor.boissiere@gmail.com>
shubhamagarwal19 pushed a commit to shubhamagarwal19/argo-cd that referenced this issue Apr 15, 2021
…roj#5304)

* feat(prom): Add prometheus metrics reset support

Signed-off-by: Victor Boissiere <victor.boissiere@gmail.com>