Prometheus metrics cache causes scaling issues with ephemeral projects and apps #5287

victorboissiere · 2021-01-21T14:47:29Z

Summary

When a project or application is deleted, its metrics are not removed from the Prometheus metrics; they stay in the cache until the ArgoCD application controller's next rolling update.

Motivation

We use ArgoCD in our QA environment with temporary projects and associated applications.
Each time we create a custom branch on a repository, it creates the resources in ArgoCD to bootstrap a dedicated QA environment. However, when the ArgoCD resources are deleted, they are kept in Prometheus metrics.
We have about 2000 to 4000 applications, and we create and delete applications about 500 times a day.

The Prometheus /metrics endpoint sometimes times out after 10 seconds because of the huge number of metrics. It also puts pressure on metrics retention.

Proposal

I see three ways of addressing this:

1. Periodically reset the metrics from within the app, using the Prometheus client.
2. Expose a specific API endpoint that resets the Prometheus metrics.
3. On app/project deletion, also remove the associated metrics.
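
To make the first and third options concrete, here is a minimal, self-contained Python sketch. It does not use ArgoCD's code or the real Prometheus client; `LabeledGauge`, `collect_app_metrics`, and the project and application names are hypothetical stand-ins for a label-keyed metric vector. It shows how series for deleted applications linger until the vector is reset and repopulated from the current application list.

```python
from typing import Dict, Iterable, Tuple

LabelKey = Tuple[str, str]  # (project, application)


class LabeledGauge:
    """Hypothetical stand-in for a label-keyed metric vector."""

    def __init__(self) -> None:
        self._series: Dict[LabelKey, float] = {}

    def set(self, project: str, app: str, value: float) -> None:
        self._series[(project, app)] = value

    def remove(self, project: str, app: str) -> bool:
        """Third option: drop one series when a delete event is observed."""
        return self._series.pop((project, app), None) is not None

    def reset(self) -> None:
        """First option: drop every series; live ones return on the next collection."""
        self._series.clear()

    def series_count(self) -> int:
        return len(self._series)


def collect_app_metrics(gauge: LabeledGauge, live_apps: Iterable[LabelKey]) -> None:
    """Record one sample per application that currently exists."""
    for project, app in live_apps:
        gauge.set(project, app, 1.0)


gauge = LabeledGauge()
collect_app_metrics(gauge, [("qa", "branch-a"), ("qa", "branch-b")])

# "branch-a" is deleted, but nothing removes its series, so it is now stale.
live_now = [("qa", "branch-b")]
collect_app_metrics(gauge, live_now)
stale_count = gauge.series_count()  # both series are still exported

# A reset followed by a normal collection keeps only live applications.
gauge.reset()
collect_app_metrics(gauge, live_now)
fresh_count = gauge.series_count()  # only the live application remains
```

In this sketch `reset` needs no knowledge of which applications were deleted, whereas `remove` only helps if every delete event is actually seen.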

There are probably other solutions as well. In the meantime, we mitigated the problem in a very naive way, by scheduling a cronjob that deletes the argocd-application-controller pod every night.

jessesuen · 2021-01-21T21:51:42Z

We need to support this use case.

from time to time within the app, resetting metrics with the Prometheus client

Spoke with @alexmt and of all the proposals, Option 1 seems the best way. Option 2 seems like unnecessary integration effort and Option 3 will be unreliable because it's easy to miss delete events of applications, and you end up having to implement some form of option 1 anyways.

@victorboissiere would you like to contribute this change? We may not get around to this for v1.9.

jessesuen · 2021-01-21T21:52:17Z

Workaround for this is: daily cronjob which restarts the application-controller.

victorboissiere · 2021-01-22T15:51:07Z

@jessesuen thanks for the feedback. I saw that a cron package is already used for the sync window.
I'll try to reuse it to reset the Prometheus metrics every 24 hours. I'll submit the PR following the guide in the documentation.
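
The 24-hour reset described here could look roughly like the sketch below. It is not the code from the PR and does not use the cron package mentioned above; `PeriodicResetter`, `maybe_reset`, and the injected clock are hypothetical names chosen for illustration, with a plain interval check standing in for a cron schedule.

```python
import time
from typing import Callable


class PeriodicResetter:
    """Calls `reset` at most once per `interval_seconds` (hypothetical helper)."""

    def __init__(
        self,
        reset: Callable[[], None],
        interval_seconds: float = 24 * 60 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._reset = reset
        self._interval = interval_seconds
        self._clock = clock
        self._last_reset = clock()

    def maybe_reset(self) -> bool:
        """Reset if the interval has elapsed; return True when a reset happened."""
        now = self._clock()
        if now - self._last_reset < self._interval:
            return False
        self._reset()
        self._last_reset = now
        return True


# Usage with a fake clock so the example does not have to wait 24 hours.
fake_now = [0.0]
reset_calls = []
resetter = PeriodicResetter(
    reset=lambda: reset_calls.append(fake_now[0]),
    clock=lambda: fake_now[0],
)

fake_now[0] = 3600.0       # one hour later: too early, nothing happens
early = resetter.maybe_reset()

fake_now[0] = 24 * 3600.0  # a full day later: the reset fires
on_time = resetter.maybe_reset()
```

Injecting the clock keeps the interval logic testable; in a real controller the `reset` callback would clear the metric vectors and `maybe_reset` would be driven by a scheduler or a background loop.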

victorboissiere added the enhancement New feature or request label Jan 21, 2021

jessesuen added the workaround There's a workaround, might not be great, but exists label Jan 21, 2021

victorboissiere mentioned this issue Jan 24, 2021

feat(prom): Add prometheus metrics reset support #5287 #5304

Merged

alexmt closed this as completed in #5304 Feb 10, 2021

alexmt pushed a commit that referenced this issue Feb 10, 2021

feat(prom): Add prometheus metrics reset support #5287 (#5304)

e0f7731

* feat(prom): Add prometheus metrics reset support Signed-off-by: Victor Boissiere <victor.boissiere@gmail.com>
