ipam/crd: Fix spurious CiliumNode update status failures #17856
if oldNode.DeepEqual(newNode) {
	equal = true
	return
}
valid = true
It looks like there was another bug in the old code here: `valid` was only set to `true` if `oldNode` and `newNode` were not equal. This PR looks like it fixes that.
When running in CRD-based IPAM modes (Alibaba, Azure, ENI, CRD), it is possible to observe spurious "Unable to update CiliumNode custom resource" failures in the cilium-agent.

The full error message is as follows: "Operation cannot be fulfilled on ciliumnodes.cilium.io <node>: the object has been modified; please apply your changes to the latest version and try again".

It means that the Kubernetes `UpdateStatus` call has failed because the local `ObjectMeta.ResourceVersion` of the submitted `CiliumNode` object is out of date. In the presence of races, this error is expected and will resolve itself once the agent receives a more recent version of the object with the new resource version.

However, it is possible that the resource version of a `CiliumNode` object is bumped even though the `Spec` and `Status` of the `CiliumNode` remain the same. This happens, for example, when `ObjectMeta.ManagedFields` is updated by the Kubernetes apiserver. Unfortunately, `CiliumNode.DeepEqual` does _not_ consider any `ObjectMeta` fields (including the resource version). Therefore, two objects with different resource versions are considered the same by the `CiliumNode` watcher used by IPAM.

But to be able to successfully call `UpdateStatus`, we need to know the most recent resource version. Otherwise, `UpdateStatus` will always fail until the `CiliumNode` object is updated externally for some reason.

Therefore, this commit modifies the logic to always store the most recent version of the `CiliumNode` object, even if `Spec` or `Status` has not changed. This in turn allows `nodeStore.refreshNode` (which invokes `UpdateStatus`) to always work on the most recently observed resource version.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
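The idea in the commit message can be illustrated with a minimal, self-contained sketch. This is not the actual Cilium code: the `node` and `store` types, the `deepEqual` helper, and the `onUpdate` handler are hypothetical stand-ins, assumed here only to show why the newest object has to be stored even when the comparison reports no change.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a CiliumNode object.
type node struct {
	ResourceVersion string // stand-in for ObjectMeta.ResourceVersion
	Spec            string
	Status          string
}

// deepEqual mimics the behavior described in the commit message:
// only Spec and Status are compared, metadata is ignored.
func (n *node) deepEqual(o *node) bool {
	if n == nil || o == nil {
		return n == o
	}
	return n.Spec == o.Spec && n.Status == o.Status
}

// store is a hypothetical stand-in for the IPAM node store.
type store struct {
	ownNode *node
}

// onUpdate always records the most recently observed object, so that a
// later status update is submitted with the latest resource version.
// It returns true only if Spec or Status actually changed.
func (s *store) onUpdate(oldNode, newNode *node) (changed bool) {
	changed = !oldNode.deepEqual(newNode)
	// Store unconditionally: the resource version may have been bumped
	// even though Spec and Status are identical.
	s.ownNode = newNode
	return changed
}

func main() {
	s := &store{}
	v1 := &node{ResourceVersion: "1", Spec: "pool-a", Status: "ok"}
	// Same Spec and Status, but a newer resource version (for example
	// after a metadata-only update).
	v2 := &node{ResourceVersion: "2", Spec: "pool-a", Status: "ok"}

	s.onUpdate(nil, v1)
	changed := s.onUpdate(v1, v2)
	fmt.Println("changed:", changed, "stored version:", s.ownNode.ResourceVersion)
}
```

In this sketch, the second update reports no change, yet the stored object carries the newer resource version. A handler that returned early on equality would have kept the stale version and every subsequent status update would be rejected as a conflict.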
This is great, thanks!
/test
Job 'Cilium-PR-K8s-1.20-kernel-5.4' failed and has not been observed before, so may be related to your PR.
Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed and has not been observed before, so may be related to your PR.
ConformanceEKS (the only CI where the code modified here is enabled) failed the connectivity check with IPSec, which is a known flake: #16938 https://github.com/cilium/cilium/actions/runs/1475670758
/ci-eks
The modified code is only active in ENI mode (i.e. ConformanceEKS, which passed). The failures are therefore unrelated. Marking as ready-to-merge.
@gandro Can you also have a look at the failure on
@qmonnet Apologies, at first glance it looked like a variant of #13839 as mentioned above, but it might be a new flake since the symptoms are different (the affected tests are the same). Will take a closer look, open an issue, and then mark this again, but I'm very confident that this is unrelated to the PR since the code modified by this PR is not active in our Jenkins suite at all.
The link for https://jenkins.cilium.io/job/Cilium-PR-K8s-1.22-kernel-4.9/186/ just expired (it was still valid a few hours ago) 😬 That makes it impossible to triage it again. Will restart it.
/test-1.22-4.9
Based on Sebastian's previous analysis & confidence and on the fact that the new run passed, I'll go ahead and merge. Thanks!
Marking this for backport to v1.10. Seems like users are hitting this on v1.10, and the backport should be rather trivial.