linux/node: reallocate nodeID upon conflict #33053
Closed
NodeIDs and IPsec state suffer from a lack of reconciliation. If the agent misses a node deletion event, stale state is never cleaned up. This is a known problem (#29822, #26298), but it was generally considered to be of little consequence. Stale XFRM states/policies can accumulate, but will not match traffic - the effect is mostly slower processing in the agent and the kernel. NodeIDs can eventually run out if too many node deletions are missed, but the rate at which these are missed is expected to be low.
Unfortunately, there are large clusters with high node churn in which rare events become common, and hence the following sequence of events is probable enough to actually observe:
Alternatively, if the new node arrives in a full update, but both the NodeInternalIP and the CiliumInternalIP are recycled from nodes whose deletion we missed, we arrive at the same point.
If this occurs, the agent can have a partitioned view of what nodeID this node should have - in the BPF map, the k8s internal IP will map to a different nodeID than the Cilium internal IP. This breaks IPsec traffic towards this node, as BPF applies a mark based on the BPF map nodeID of the tunnel endpoint, but the XFRM states expect to match the mark based on the Cilium internal IP. The result is traffic which doesn't match any XFRM state/policy and falls back to the catch-all block policy.
To work around this, we enforce that all IPs of a node get the same nodeID - even if an IP was already pointing to an existing nodeID. Since this node update is more current than whatever state we held, it seems more correct to ensure all IPs point to the same nodeID than to avoid a BPF map write. We do so by forcing the allocation of a new nodeID (and logging an error).
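For illustration only, here is a minimal sketch of the reallocation idea. It is not the code in this PR: `nodeIDAllocator`, `idsByIP`, `nextID` and `allocate` are hypothetical names, and a plain Go map stands in for the BPF nodeID map. Reusing the existing ID when all known IPs already agree is an assumption of the sketch, not something stated above.

```go
package main

import (
	"fmt"
	"log"
)

// nodeIDAllocator is a hypothetical, simplified stand-in for the agent's
// IP -> nodeID bookkeeping. It does not reclaim stale IDs and ignores
// exhaustion of the ID space.
type nodeIDAllocator struct {
	idsByIP map[string]uint16
	nextID  uint16
}

func newNodeIDAllocator() *nodeIDAllocator {
	return &nodeIDAllocator{idsByIP: make(map[string]uint16), nextID: 1}
}

// allocate returns a single nodeID shared by every IP in ips.
// If the IPs already point at more than one nodeID (a conflict caused by
// stale entries), it forces a fresh nodeID and logs an error.
func (a *nodeIDAllocator) allocate(nodeName string, ips []string) uint16 {
	distinct := make(map[uint16]struct{})
	var id uint16
	for _, ip := range ips {
		if existing, ok := a.idsByIP[ip]; ok {
			distinct[existing] = struct{}{}
			id = existing
		}
	}

	switch len(distinct) {
	case 0:
		// No IP is known yet: allocate a fresh ID.
		id = a.nextID
		a.nextID++
	case 1:
		// Assumption of this sketch: all known IPs agree, so reuse that ID.
	default:
		// Conflict: the IPs disagree, so reallocate.
		id = a.nextID
		a.nextID++
		log.Printf("error: node %s: IPs map to %d different nodeIDs, reallocating as %d",
			nodeName, len(distinct), id)
	}

	// Point every IP of the node at the chosen ID, overwriting stale entries.
	for _, ip := range ips {
		a.idsByIP[ip] = id
	}
	return id
}

func main() {
	a := newNodeIDAllocator()
	// Two old nodes whose deletions were missed leave stale entries behind.
	a.allocate("old-a", []string{"10.0.0.1"})
	a.allocate("old-b", []string{"10.1.0.2"})
	// A new node recycles both IPs; both must end up on the same nodeID.
	id := a.allocate("new-node", []string{"10.0.0.1", "10.1.0.2"})
	fmt.Println("new-node nodeID:", id, "map:", a.idsByIP)
}
```

The sketch trades an extra map write and one consumed nodeID for a consistent view, which mirrors the reasoning above that the latest node update should win over stale state.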
cc @rgo3