I am deploying Prometheus and Loki in separate ArgoCD applications (argoproj.io/v1alpha1), both using Helm charts. Both charts install CRDs, and some of those CRDs are shared between them. When I upgraded Prometheus, it upgraded the shared CRDs. The Loki application wasn't expecting these versions and kept trying to revert them to the ones it wanted; Prometheus then re-applied its own versions, creating a reconciliation loop with high CPU and memory usage. Most concerning, the loop appeared to issue many requests per second to the control plane API, and the control plane nodes eventually ran out of memory as well, resulting in an outage.
For now I have applied skipCrds: true to one of the applications to break the loop (see the sketch below). However, this is potentially problematic going forward: although the apps share some CRDs, others are likely unique to each app and would no longer be upgraded. This means I have to keep moving skipCrds: true between the two applications every time I upgrade one of them.
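For reference, a minimal sketch of the workaround described above. The skipCrds flag lives under spec.source.helm in the Application spec; the repo URL, chart version, and namespaces below are illustrative placeholders, not my exact setup:

```yaml
# Sketch: tell ArgoCD's Helm source to skip CRD installation for one
# of the two conflicting applications, so only the other one manages CRDs.
# repoURL, targetRevision, and namespaces are illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: loki
    targetRevision: 5.41.4  # placeholder version
    helm:
      skipCrds: true  # stop this app from fighting over the shared CRDs
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```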
Ideally there would be a feature that allows CRDs shared by different apps to be reconciled without looping. When loops do occur, there should be a limit on the number of reconciliation attempts, or a backoff, to prevent the impact on the control plane.
Since you have two different applications managing the same CRDs, I would highly recommend creating a third application that deploys only the CRDs. The same applies to any cluster-scoped resources that would conflict between two applications. This way, you can manage the update lifecycle of the CRDs independently of the two different controllers.
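A minimal sketch of that pattern, assuming a CRD-only chart such as prometheus-community's prometheus-operator-crds as the source (a plain Git directory of CRD manifests works the same way); the chart name, repo URL, and version are assumptions, not a prescribed setup:

```yaml
# Sketch: a dedicated CRD-only Application, so neither the Prometheus
# nor the Loki app owns the shared CRDs.
# Chart name, repoURL, and targetRevision are illustrative assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring-crds
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: prometheus-operator-crds  # assumed CRD-only chart
    targetRevision: 9.0.0  # placeholder version
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
      - ServerSideApply=true  # large CRDs can exceed client-side apply's annotation size limit
```

Both the Prometheus and Loki applications would then set skipCrds: true permanently, and only this application upgrades the CRDs.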