Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty objects vs. null objects cause repeated syncs #18213

Closed
pgier opened this issue May 14, 2024 · 26 comments · Fixed by #18840
Closed

Empty objects vs. null objects cause repeated syncs #18213

pgier opened this issue May 14, 2024 · 26 comments · Fixed by #18840
Labels
bug/enhancement component:core Syncing, diffing, cluster state cache enhancement New feature or request type:bug

Comments

@pgier
Copy link

pgier commented May 14, 2024

Summary

When a config option is not set in the a kube manifest (e.g. statefulset) ArgoCD continually tries to re-sync the object because it sees a diff between an empty and undefined object.

Motivation

I have a statefulset that does not define the resources section. ArgoCD continually tries to sync the object because Kubernetes automatically injects an empty object.
Screenshot 2024-05-14 at 7 50 11 AM

Proposal

The diff calculator should ignore the difference between an empty and null object in a kubernetes resource field.

Related

#15554

@pgier pgier added the enhancement New feature or request label May 14, 2024
@blakeromano
Copy link

Is this with Server Side Diff on or off? I would imagine looking at https://argo-cd.readthedocs.io/en/stable/user-guide/diff-strategies/#server-side-diff may help with scenarios like this. I don't believe there's plans to make the default Server Side Diff until a 3.x release but this may be something you want to take a look at...

@evanrich
Copy link

Is this with Server Side Diff on or off? I would imagine looking at https://argo-cd.readthedocs.io/en/stable/user-guide/diff-strategies/#server-side-diff may help with scenarios like this. I don't believe there's plans to make the default Server Side Diff until a 3.x release but this may be something you want to take a look at...

i just tried enabling that and it didn't seem to make a difference on my environment

@evanrich
Copy link

evanrich commented May 15, 2024

FWIW, this works for me for the "resources: {}" diff

      ignoreDifferences:
      - group: apps
        kind: Deployment
        jqPathExpressions:
        - '.spec.template.spec.containers[].resources'
      - group: apps
        kind: StatefulSet
        jqPathExpressions:
        - '.spec.template.spec.containers[].resources'

@pgier
Copy link
Author

pgier commented May 22, 2024

Is this with Server Side Diff on or off?

I haven't set any options for server side diff in the application or the main configmap, so it should be off.
On the application I only have these sync options set.

        syncOptions:
        - CreateNamespace=true
        - ServerSideApply=true

@evanrich Thanks, I was hoping I wouldn't have to ignore resources for the cases where there are real differences, but I can probably use that for now.

@pgier
Copy link
Author

pgier commented May 22, 2024

This is very strange, I have multiple clusters with basically the same configuration.
AWS EKS (same region)
Kube version: 1.28.8
ArgoCD version: v2.10.7+b060053

Same application (external-secrets) deployed to both clusters.
One of the clusters shows a diff on the empty vs. null resources and continually tries to resync.
The other cluster shows the same difference if you look at the live manifest vs. desired manifest, but does not display anything on the diff tab and does not try to re-sync.

I verified that on the cluster which is not showing a diff and therefore not syncing has resources: {} in the live manifest and nothing defined for resources in the desired manifest. So for some reason this cluster is successfully ignoring that difference even though I didn't configure any ignoreDifferences.

So same application running in the same version of Argo and Kubernetes seems to sometimes ignore this difference and sometimes not ignore it.

@batazor
Copy link

batazor commented May 24, 2024

Noticed this problem with version 2.11 quite a lot

Adding this to ignoreDifferences doesn't seem like a good idea to me

@javydekoning
Copy link

This has suddenly emerged in my k3s cluster as well.

Kubernetes version: v1.30.0+k3s1
Argo: v2.11.2

@pgier
Copy link
Author

pgier commented Jun 7, 2024

Similar issue: #13004

@pgier
Copy link
Author

pgier commented Jun 7, 2024

Just wanted to note that I also tried enabling server side diff, but it didn't have any effect on this issue.

@evanrich
Copy link

evanrich commented Jun 7, 2024

I haven't seen any updates in the release notes for recent releases that indicate this has been changed, so I'm staying on helm chart version 6.7.18 (argo v2.10.9) hopefully this gets addressed/commented on soon because it's not just empty objects but the a ton of other resources as well are suddenly throwing sync errors in latest versions.

@vertisan
Copy link

vertisan commented Jun 7, 2024

Same problem here.
Kubernetes: k3s 1.30.0+k3s1
ArgoCD: v2.11.2+25f7504

Like others, I have used ignoreDifferences as a workaround for now, but the fix will be very useful :)
If someone doesn't want to list all kinds, they can use *:

ignoreDifferences:
  - group: apps
    kind: "*"
    jqPathExpressions:
      - ".spec.template.spec.containers[].resources"
  - group: apps
    kind: "*"
    jqPathExpressions:
      - ".spec.template.spec.initContainers[].resources"

@pgier
Copy link
Author

pgier commented Jun 12, 2024

Comparing a working vs. non-working cluster, I found that the gvkParser is nil on the non-working cluster (the one where the empty resources is detected as a difference).
Looks like the gvkParser is not created due to some invalid openapi schema in the cluster. Unfortunately the error message was not printing correctly so it took a while to track this down.
argoproj/gitops-engine#585

With some changes to the code I see this error message in the logs on the cluster that's not working:

"error creating gvk parser: duplicate entry for /v1, Kind=APIResourceList" server="https://kubernetes.default.svc"

See also: argoproj/gitops-engine#425

This seems to be a known issue in the GVKParser: kubernetes/kubernetes#103597

@pgier
Copy link
Author

pgier commented Jun 13, 2024

The issue in my case turned out to be caused by an older version of the kubernetes metrics server installed in the EKS cluster. First I updated updated the application controller using a locally built container (see argoproj/gitops-engine#585) to display the parser error.

duplicate entry for /v1, Kind=APIResourceList"

This duplication comes from the kube api server having multiple versions of the APIResourceList definition, which can be seen in the raw openapi schema.

kubectl get --raw /openapi/v2 | jq -r '.definitions | keys | .[]' | sort | grep io.k8s.apimachinery.pkg.apis.meta.v1

Then I searched for where APIResourceList_v2 is referenced.

kubectl get --raw /openapi/v2 | jq | grep "io.k8s.apimachinery.pkg.apis.meta.v1.APIResourceList_v2" -B 25

The only place it was referenced was the metrics server api endpoint /apis/metrics.k8s.io/v1beta1/.
After applying a recent metrics server config, the issue seems to be resolved.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.1/components.yaml

After the metrics update removed APIResourceList_v2 the issue went away and Argo gives a clean diff. Even though the cluster has some other definitions with a _v2, those don't cause a problem for the parser, probably because they are pruned (kubernetes/kubernetes#110179).
I don't know if this is the same cause for other people that are having this issue, but hopefully these steps can help to debug this.

Edit: also wanted to note that after applying the newer metrics server it takes a few minutes before the openapi schema is updated, and then you'll also need to restart the argocd-application-controller pod.

@pgier pgier closed this as completed Jun 13, 2024
@crenshaw-dev
Copy link
Collaborator

Interesting. I've seen similar duplications in k8s 1.30.0, so I'll be diving into the problem more. Glad you were able to sort it in your instance!

@javydekoning
Copy link

Interesting. I've seen similar duplications in k8s 1.30.0, so I'll be diving into the problem more. Glad you were able to sort it in your instance!

Same here, fresh k3s deployment ships with docker.io/rancher/mirrored-metrics-server:v0.7.0 and still faces this issue.

@pschichtel
Copy link

pschichtel commented Jun 14, 2024

I'm seeing the same since (I think) I upgraded from k8s 1.29.* to 1.30.1. Most of my helm charts render resources: {} when not specifying any, so it only affects two charts for me: kube-prometheus-stack and mariadb-operator.
I can also confirm: server side diff doesn't change anything.

@crenshaw-dev crenshaw-dev reopened this Jun 25, 2024
@pschichtel
Copy link

This disappeared for me with the upgrade to 1.30.2

@crenshaw-dev
Copy link
Collaborator

crenshaw-dev commented Jun 25, 2024

@pschichtel interesting... was there a change in version of metrics-server? (Or of any other aggregated API?)

@alexmt alexmt added bug/enhancement component:core Syncing, diffing, cluster state cache type:bug labels Jun 25, 2024
@pschichtel
Copy link

@crenshaw-dev metrics-server was updated as part of the 1.29.* -> 1.30.1 upgrade, but it hasn't changed when upgrading to 1.30.2

@crenshaw-dev
Copy link
Collaborator

That's strange... are you able to port-forward your metrics-server service and dump the /openapi/v2 output from that service?

@pschichtel
Copy link

not sure that's helpful

curl -k https://localhost:10250/openapi/v2
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/openapi/v2\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

@pschichtel
Copy link

I just checked argo again and the issue actually came back, sorry for the confusion!

@crenshaw-dev
Copy link
Collaborator

Okay that makes more sense, I'm not seeing any relevant changes in k8s or metrics-server.

I spent a bunch of time today trying to de-dupe OpenAPI models before feeding them to GVKParser. It avoids the NewGVKParser error, but apparently some models (like DeploymentSpec are disappearing from the GVKParser's schema), so I'm getting other failures during diff construction.

Working on it though.

@crenshaw-dev
Copy link
Collaborator

I'm working on the fix and have written up a description of the issue here: https://github.com/argoproj/gitops-engine/pull/587/files#diff-bbeadede3a86f27922253617b66c49b7672bb52896a10cf5626b398fa0fa9280

I'm trying to decide between just logging the dropped duplicate GVKs or also surfacing to the user that their diff might be wrong due to duplicate GVKs. So far I'm trying to surface it to the user, but might end up being more code than it's worth.

A couple open questions:

  1. Should aggregated APIs like metrics-server even include definitions for things like APIResourceList in their OpenAPI doc? afaik, that type isn't directly used in any of the API's user-facing functions. Maybe it's for Kubernetes' benefit, doing discovery?
  2. Would constructing proto.Models from an OpenAPI v3 doc make these problems go away? I haven't tested, because I'm not sure how to construct a full OpenAPI v3 doc from the per-type docs currently served by k8s. Even if I did construct that doc, I'm not confident that it wouldn't hit the same duplicate GVK issue when constructing the GVKParser.

crenshaw-dev added a commit to crenshaw-dev/argo-cd that referenced this issue Jun 27, 2024
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
ishitasequeira pushed a commit that referenced this issue Jun 29, 2024
* fix(controller): bad server-side diffs (#18213)

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* use upstream commit

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
@andrii-korotkov-verkada
Copy link
Contributor

andrii-korotkov-verkada commented Jun 30, 2024

The issue seems to still be there for Rollout, analysisRunMetadata field (see the screenshot).
Screenshot 2024-06-30 at 8 34 28 AM
I've build a custom image from latest master (+1 cherry pick), ArgoCD pods were updated, even Redis ones.
Update: I'd wait for a bit in case it's a stale UI issue.

@andrii-korotkov-verkada
Copy link
Contributor

The app eventually got in sync.

crenshaw-dev added a commit to crenshaw-dev/argo-cd that referenced this issue Jul 1, 2024
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
crenshaw-dev added a commit to crenshaw-dev/argo-cd that referenced this issue Jul 1, 2024
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
crenshaw-dev added a commit to crenshaw-dev/argo-cd that referenced this issue Jul 1, 2024
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
crenshaw-dev added a commit to crenshaw-dev/argo-cd that referenced this issue Jul 1, 2024
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
crenshaw-dev added a commit that referenced this issue Jul 1, 2024
* fix(controller): bad server-side diffs (#18213) (2.10)

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix revision

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* hopefully the right hash now

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
ishitasequeira pushed a commit that referenced this issue Jul 2, 2024
* fix(controller): bad server-side diffs (#18213) (2.11)

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* hopefully the right hash now

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
crenshaw-dev added a commit that referenced this issue Jul 2, 2024
* fix(controller): bad server-side diffs (#18213) (2.9)

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix(controller): bad server-side diffs (#18213) (2.9)

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* hopefully the right hash now

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug/enhancement component:core Syncing, diffing, cluster state cache enhancement New feature or request type:bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants