linux/node: reallocate nodeID upon conflict #33053

bimmlerd · 2024-06-11T11:49:56Z

NodeIDs and IPsec state suffer from a lack of reconciliation. If the agent misses a node deletion event, stale state is never cleaned up. This is somewhat known (#29822, #26298), but was generally considered not of huge consequence. Stale XFRM states/policies can accumulate, but will not match traffic - the effect is mostly slowing down processing in agent and kernel. nodeIDs can eventually run out if too many node deletions are missed, but the rate at which these are missed is expected to be low.

Unfortunately, there are large clusters with high node churn in which rare events become common, and hence the following sequence of events is probable enough to actually observe:

1. a node is deleted while the agent is down (e.g. due to being upgraded)
2. a new node joins the cluster, which is observed by the agent in partial fashion: that is, the agent receives an update which contains only the k8s node internal IP, but not a cilium internal IP.
3. this new node then receives a cilium internal IP which was previously used.

Alternatively, if the new node arrives in a full update, but both the NodeInternalIP and the CiliumInternalIP are recycled from nodes which we missed the delete for, we arrive at the same point.

If this occurs, the agent can have a partitioned view of what nodeID this node should have - in the BPF map, the k8s internal IP will map to a different nodeID than the cilium internal IP. This breaks IPsec traffic towards this node, as BPF applies a mark based on the BPF map nodeID of the tunnel endpoint, but the XFRM states expect to match the mark based on the cilium internal IP. The result is traffic which doesn't match any XFRM state/policy, falling back to the catch-all block policy.
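To make the split view concrete, here is a minimal sketch of how the inconsistency could be detected. It is not the agent's actual code: `nodeIDMap`, `distinctNodeIDs`, the IP addresses and the nodeID values are all hypothetical stand-ins for the agent's view of the BPF map.

```go
package main

import "fmt"

// nodeIDMap is a hypothetical stand-in for the agent's view of the
// BPF node ID map (IP -> nodeID). It is not a real Cilium type.
type nodeIDMap map[string]uint16

// distinctNodeIDs returns the set of nodeIDs that the IPs of a single
// node currently map to. IPs without an entry are skipped.
func distinctNodeIDs(m nodeIDMap, ips []string) map[uint16]struct{} {
	ids := make(map[uint16]struct{})
	for _, ip := range ips {
		if id, ok := m[ip]; ok {
			ids[id] = struct{}{}
		}
	}
	return ids
}

func main() {
	// Stale entry left behind by a missed node deletion: the recycled
	// cilium internal IP still points at the old nodeID (made-up values).
	m := nodeIDMap{"10.0.1.5": 7}

	// Partial update for the new node: only the k8s internal IP is
	// known, so it is assigned a fresh nodeID.
	m["192.168.0.20"] = 12

	// Later update: the node now carries both IPs, which disagree.
	ids := distinctNodeIDs(m, []string{"192.168.0.20", "10.0.1.5"})
	fmt.Printf("node maps to %d distinct nodeIDs, conflict: %v\n", len(ids), len(ids) > 1)
}
```

In this sketch, more than one distinct nodeID for the IPs of one node corresponds to the situation described above, where the mark applied by BPF and the mark expected by the XFRM states no longer agree.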

To work around this, we enforce that all IPs of a node get the same nodeID - even if an IP was already pointing to an existing nodeID. Since this node update is more current than whatever state we had held, it seems more correct to ensure all IPs point to the same nodeID than to avoid a BPF map write. We do so by forcing the allocation of a new nodeID (and logging an error).
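The workaround can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `allocator` type, its fields and `allocateForNode` are hypothetical, and the sketch ignores nodeID exhaustion, release of the old IDs, and the actual BPF map and XFRM updates.

```go
package main

import (
	"fmt"
	"log"
)

// allocator is a hypothetical, simplified nodeID allocator.
type allocator struct {
	next    uint16            // next unused nodeID
	idsByIP map[string]uint16 // IP -> nodeID, mirroring the BPF map
}

func newAllocator() *allocator {
	return &allocator{next: 1, idsByIP: make(map[string]uint16)}
}

// allocateForNode returns one nodeID shared by all IPs of a node.
// If the IPs already map to conflicting nodeIDs, a new nodeID is
// allocated and every IP is rewritten to point at it.
func (a *allocator) allocateForNode(ips []string) uint16 {
	seen := make(map[uint16]struct{})
	var existing uint16
	for _, ip := range ips {
		if id, ok := a.idsByIP[ip]; ok {
			seen[id] = struct{}{}
			existing = id
		}
	}

	var id uint16
	switch len(seen) {
	case 0: // node is entirely new
		id = a.next
		a.next++
	case 1: // all known IPs agree, reuse their nodeID
		id = existing
	default: // conflict: likely stale state from a missed deletion
		log.Printf("error: IPs %v map to %d different nodeIDs, reallocating", ips, len(seen))
		id = a.next
		a.next++
	}

	// Point every IP of the node at the chosen nodeID.
	for _, ip := range ips {
		a.idsByIP[ip] = id
	}
	return id
}

func main() {
	a := newAllocator()
	a.idsByIP["10.0.1.5"] = 7      // stale entry (made-up values)
	a.idsByIP["192.168.0.20"] = 12 // from the partial update
	a.next = 13

	id := a.allocateForNode([]string{"192.168.0.20", "10.0.1.5"})
	fmt.Println("nodeID after reallocation:", id)
}
```

The design choice mirrored here is that the incoming node update is treated as more authoritative than the held state, so an extra map write is accepted in exchange for a consistent mapping.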

cc @rgo3

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

bimmlerd · 2024-06-11T12:03:46Z

/test-backport-1.13

github-actions · 2024-07-12T01:49:22Z

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

maintainer-s-little-helper bot added the backport/1.13 and kind/backports labels Jun 11, 2024

github-actions bot added the stale label Jul 12, 2024

bimmlerd closed this Jul 12, 2024
