Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipam/crd: Fix spurious CiliumNode update status failures #17856

Merged

Conversation

gandro
Copy link
Member

@gandro gandro commented Nov 11, 2021

When running in CRD-based IPAM modes (Alibaba, Azure, ENI, CRD), it is
possible to observe spurious "Unable to update CiliumNode custom
resource" failures in the cilium-agent.

The full error message is as follows: "Operation cannot be fulfilled on
ciliumnodes.cilium.io : the object has been modified; please apply
your changes to the latest version and try again".

It means that the Kubernetes UpdateStatus call has failed because the
local ObjectMeta.ResourceVersion of submitted CiliumNode version is
out of date. In the presence of races, this error is expected and will
resolve itself once the agent receives a more recent version of the
object with the new resource version.

However, it is possible that the resource version of a CiliumNode
object is bumped even though the Spec or Status of the CiliumNode
remains the same. This for examples happens when
ObjectMeta.ManagedFields is updated by the Kubernetes apiserver.

Unfortunately, CiliumNode.DeepEqual does not consider any
ObjectMeta fields (including the resource version). Therefore two
objects with different resource versions are considered the same by the
CiliumNode watcher used by IPAM.

But to be able to successfully call UpdateStatus we need to know the
most recent resource version. Otherwise, UpdateStatus will always fail
until the CiliumNode object is updated externally for some reason.

Therefore, this commit modifies the logic to always store the most
recent version of the CiliumNode object, even if Spec or Status
has not changed. This in turn allows nodeStore.refreshNode (which
invokes UpdateStatus) to always work on the most recently observed
resource version.

ipam/crd: Fix spurious "Unable to update CiliumNode custom resource" failures in cilium-agent

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. area/eni Impacts ENI based IPAM. sig/ipam IP address management, including cloud IPAM area/azure Impacts Azure based IPAM. labels Nov 11, 2021
@gandro gandro requested review from a team and twpayne November 11, 2021 16:28
@gandro gandro force-pushed the pr/gandro/crd-ipam-fix-updatestatus-failure branch 2 times, most recently from 009713f to 878948c Compare November 11, 2021 16:39
if oldNode.DeepEqual(newNode) {
equal = true
return
}
valid = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there was another bug in the old code here: valid was only set to true if oldNode and newNode were not equal. This PR looks like it fixes that.

pkg/ipam/crd.go Outdated Show resolved Hide resolved
@gandro gandro force-pushed the pr/gandro/crd-ipam-fix-updatestatus-failure branch 2 times, most recently from d236240 to f433b12 Compare November 17, 2021 17:50
When running in CRD-based IPAM modes (Alibaba, Azure, ENI, CRD), it is
possible to observe spurious "Unable to update CiliumNode custom
resource" failures in the cilium-agent.

The full error message is as follows: "Operation cannot be fulfilled on
ciliumnodes.cilium.io <node>: the object has been modified; please apply
your changes to the latest version and try again".

It means that the Kubernetes `UpdateStatus` call has failed because the
local `ObjectMeta.ResourceVersion` of submitted CiliumNode version is
out of date. In the presence of races, this error is expected and will
resolve itself once the agent receives a more recent version of the
object with the new resource version.

However, it is possible that the resource version of a `CiliumNode`
object is bumped even though the `Spec` or `Status` of the `CiliumNode`
remains the same. This for examples happens when
`ObjectMeta.ManagedFields` is updated by the Kubernetes apiserver.

Unfortunately, `CiliumNode.DeepEqual` does _not_ consider any
`ObjectMeta` fields (including the resource version). Therefore two
objects with different resource versions are considered the same by the
`CiliumNode` watcher used by IPAM.

But to be able to successfully call `UpdateStatus` we need to know the
most recent resource version. Otherwise, `UpdateStatus` will always fail
until the `CiliumNode` object is updated externally for some reason.

Therefore, this commit modifies the logic to always store the most
recent version of the `CiliumNode` object, even if `Spec` or `Status`
has not changed.  This in turn allows `nodeStore.refreshNode` (which
invokes `UpdateStatus`) to always work on the most recently observed
resource version.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/crd-ipam-fix-updatestatus-failure branch from f433b12 to 955d5c5 Compare November 17, 2021 17:54
@gandro gandro requested a review from twpayne November 17, 2021 17:54
Copy link
Contributor

@twpayne twpayne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks!

@gandro
Copy link
Member Author

gandro commented Nov 18, 2021

/test

Job 'Cilium-PR-K8s-1.20-kernel-5.4' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sDatapathConfig AutoDirectNodeRoutes Check connectivity with sockops and direct routing

Failure Output

FAIL: Error creating resource /home/jenkins/workspace/Cilium-PR-K8s-1.20-kernel-5.4/src/github.com/cilium/cilium/test/k8sT/manifests/l3-policy-demo.yaml: Cannot retrieve cilium pod cilium-9w7lg policy revision: cannot get revision from json output '': could not parse JSON from command "kubectl exec -n kube-system cilium-9w7lg -- cilium policy get -o json"

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-5.4 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort

Failure Output

FAIL: Request from testclient-jn7zz pod to service tftp://[fd03::218b]:10069/hello failed

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.19 so I can create a new GitHub issue to track it.

@gandro
Copy link
Member Author

gandro commented Nov 23, 2021

ConformanceEKS (the only CI where the code modified here is enabled) failed with the connectivity check with IPSec, which is a known flake: #16938 https://github.com/cilium/cilium/actions/runs/1475670758

@gandro
Copy link
Member Author

gandro commented Nov 23, 2021

/ci-eks

@gandro
Copy link
Member Author

gandro commented Nov 25, 2021

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 25, 2021
@qmonnet
Copy link
Member

qmonnet commented Nov 25, 2021

@gandro Can you also have a look at the failure on k8s-1.22-kernel-4.9 please?
[Please re-label if you find the failure is unrelated.]

@qmonnet qmonnet removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 25, 2021
@gandro
Copy link
Member Author

gandro commented Nov 25, 2021

@qmonnet Apologies, on first glance it looked like a variant of #13839 as mentioned above, but might be a new flake since the symptoms are different (the affected tests are the same).

Will take a closer look and open an issue and then mark this again, but I'm very confident that this is unrelated to the PR since the code modified by this PR is not active in our Jenkins suite at all.

@gandro
Copy link
Member Author

gandro commented Nov 25, 2021

The link for https://jenkins.cilium.io/job/Cilium-PR-K8s-1.22-kernel-4.9/186/ just expired (it was still valid a few hours ago) 😬 Making it impossible to triage it again. Will restart it.

@gandro
Copy link
Member Author

gandro commented Nov 25, 2021

/test-1.22-4.9

@qmonnet
Copy link
Member

qmonnet commented Nov 26, 2021

Based on Sebastian's previous analysis & confidence and on the fact that the new run passed, I'll go ahead and merge. Thanks!

@qmonnet qmonnet merged commit 18b10b4 into cilium:master Nov 26, 2021
@qmonnet qmonnet mentioned this pull request Nov 26, 2021
12 tasks
@joestringer joestringer added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Dec 3, 2021
@gandro
Copy link
Member Author

gandro commented Mar 3, 2022

Marking this for backport to v1.10. Seems like users are hitting this on v1.10, and the backport should be rather trivial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/azure Impacts Azure based IPAM. area/eni Impacts ENI based IPAM. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/ipam IP address management, including cloud IPAM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants