
Cache inconsistency of child resources #4053

Closed
jessesuen opened this issue Aug 5, 2020 · 4 comments · Fixed by #4202
Labels
bug Something isn't working
Milestone

Comments

@jessesuen
Member

On one Argo CD instance (v1.6.2-f282a33), a pod was deleted as a result of an Argo CD Rollout Restart action. The pod was part of a Rollout's ReplicaSet. Even though the pod had truly disappeared from Kubernetes, it remained visible in Argo CD.

The inconsistent state persisted for roughly 24 hours, which is our default cache invalidation period, after which the state was corrected and the pod disappeared from the UI.

@jessesuen jessesuen added the bug Something isn't working label Aug 5, 2020
@jessesuen jessesuen added this to the v1.8 milestone Aug 5, 2020
@jdfalk
Copy link
Contributor

jdfalk commented Aug 14, 2020

I think the offending line is this:
https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L29

This problem causes everything set to auto-sync to sync continuously, over and over. This is not a bug that should be put off; it needs to be addressed before v1.7 is launched.

We manage 10 clusters with each of our Argo CD instances, and due to the numerous bugs in the 1.6.2 release (extra lines added to manifests generated in the UI when auto-sync is selected, inability to process Helm hooks properly, resources going Unknown/Missing and not syncing, and a host of other issues), we attempted to use the latest release on our test clusters. Due to this cache problem, it has been attempting to reapply the deployments to all clusters repeatedly; only after 24 hours does it update correctly. I would consider this a critical bug, as it defeats the entire purpose of GitOps and undermines trust in the status of any object in the clusters as reported by Argo CD.

@alexmt
Collaborator

alexmt commented Aug 14, 2020

Agree. This issue used to happen once every few months; something has changed and now we are seeing it much more often. To mitigate the problem we've added the ARGOCD_CLUSTER_CACHE_RESYNC_DURATION environment variable, which allows reducing the cluster force-refresh period (e.g. ARGOCD_CLUSTER_CACHE_RESYNC_DURATION=1hr for a one-hour period).
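For anyone wanting to apply this mitigation, a sketch of setting the variable, assuming it is consumed by the argocd-application-controller workload (verify the workload kind and duration format against your Argo CD version):

```yaml
# Sketch: set the resync env var on the application controller.
# Assumptions: the controller is the consumer of this variable, and the
# value follows Go duration syntax (e.g. "1h"); check your manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CLUSTER_CACHE_RESYNC_DURATION
              value: "1h"
```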

@jdfalk
Contributor

jdfalk commented Aug 20, 2020

Which application did you add that env var to? Application-controller I assume?

@victorboissiere
Contributor

We see the same kind of issue with v1.6.2, and applications usually get stuck. We have about 1200-1400 applications.
