Skip to content

Commit

Permalink
ipam/crd: Fix spurious CiliumNode update status failures
Browse files Browse the repository at this point in the history
When running in CRD-based IPAM modes (Alibaba, Azure, ENI, CRD), it is
possible to observe spurious "Unable to update CiliumNode custom
resource" failures in the cilium-agent.

The full error message is as follows: "Operation cannot be fulfilled on
ciliumnodes.cilium.io <node>: the object has been modified; please apply
your changes to the latest version and try again".

It means that the Kubernetes `UpdateStatus` call has failed because the
local `ObjectMeta.ResourceVersion` of submitted CiliumNode version is
out of date. In the presence of races, this error is expected and will
resolve itself once the agent receives a more recent version of the
object with the new resource version.

However, it is possible that the resource version of a `CiliumNode`
object is bumped even though the `Spec` or `Status` of the `CiliumNode`
remains the same. This for examples happens when
`ObjectMeta.ManagedFields` is updated by the Kubernetes apiserver.

Unfortunately, `CiliumNode.DeepEqual` does _not_ consider any
`ObjectMeta` fields (including the resource version). Therefore two
objects with different resource versions are considered the same by the
`CiliumNode` watcher used by IPAM.

But to be able to successfully call `UpdateStatus` we need to know the
most recent resource version. Otherwise, `UpdateStatus` will always fail
until the `CiliumNode` object is updated externally for some reason.

Therefore, this commit modifies the logic to always store the most
recent version of the `CiliumNode` object, even if `Spec` or `Status`
has not changed.  This in turn allows `nodeStore.refreshNode` (which
invokes `UpdateStatus`) to always work on the most recently observed
resource version.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
  • Loading branch information
gandro authored and qmonnet committed Nov 26, 2021
1 parent ed73a31 commit 18b10b4
Showing 1 changed file with 17 additions and 2 deletions.
19 changes: 17 additions & 2 deletions pkg/ipam/crd.go
Original file line number Diff line number Diff line change
Expand Up @@ -128,12 +128,19 @@ func newNodeStore(nodeName string, conf Configuration, owner Owner, k8sEventReg
defer func() { k8sEventReg.K8sEventReceived("CiliumNode", "update", valid, equal) }()
if oldNode, ok := oldObj.(*ciliumv2.CiliumNode); ok {
if newNode, ok := newObj.(*ciliumv2.CiliumNode); ok {
valid = true
newNode = newNode.DeepCopy()
if oldNode.DeepEqual(newNode) {
// The UpdateStatus call in refreshNode requires an up-to-date
// CiliumNode.ObjectMeta.ResourceVersion. Therefore, we store the most
// recent version here even if the nodes are equal, because
// CiliumNode.DeepEqual will consider two nodes to be equal even if
// their resource version differs.
store.setOwnNodeWithoutPoolUpdate(newNode)
equal = true
return
}
valid = true
store.updateLocalNodeResource(newNode.DeepCopy())
store.updateLocalNodeResource(newNode)
k8sEventReg.K8sEventProcessed("CiliumNode", "update", true)
} else {
log.Warningf("Unknown CiliumNode object type %T received: %+v", oldNode, oldNode)
Expand Down Expand Up @@ -325,6 +332,14 @@ func (n *nodeStore) updateLocalNodeResource(node *ciliumv2.CiliumNode) {
}
}

// setOwnNodeWithoutPoolUpdate overwrites the local node copy (e.g. to update
// its resourceVersion) without updating the available IP pool.
func (n *nodeStore) setOwnNodeWithoutPoolUpdate(node *ciliumv2.CiliumNode) {
n.mutex.Lock()
n.ownNode = node
n.mutex.Unlock()
}

// refreshNodeTrigger is called to refresh the custom resource after taking the
// configured rate limiting into account
//
Expand Down

0 comments on commit 18b10b4

Please sign in to comment.