
linux/node: reallocate nodeID upon conflict #33053

Conversation

bimmlerd
Member

NodeIDs and IPsec state suffer from a lack of reconciliation. If the agent misses a node deletion event, stale state is never cleaned up. This is somewhat known (#29822, #26298), but was generally considered not of huge consequence. Stale XFRM states/policies can accumulate, but will not match traffic - the effect is mostly slowing down processing in the agent and kernel. NodeIDs can eventually run out if too many node deletions are missed, but the rate at which these are missed is expected to be low.

Unfortunately, there are large clusters with high node churn in which rare events become common, and hence the following sequence of events is probable enough to actually observe:

  1. a node is deleted while the agent is down (e.g. due to being upgraded)
  2. a new node joins the cluster, which is observed by the agent in partial fashion: that is, the agent receives an update which contains only the k8s node internal IP, but not a cilium internal IP.
  3. this new node then receives a cilium internal IP which was previously used.

Alternatively, if the new node arrives in a full update, but both the NodeInternalIP and the CiliumInternalIP are recycled from nodes which we missed the delete for, we arrive at the same point.
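The sequence above can be sketched as a toy simulation. This is illustrative only - the map, IP addresses, and nodeID values are hypothetical stand-ins, not Cilium's actual data structures:

```go
package main

import "fmt"

// simulate models the failure sequence: a missed node deletion leaves
// stale IP->nodeID entries behind, and a later partial update plus a
// recycled CiliumInternalIP splits one node across two nodeIDs.
func simulate() map[string]uint16 {
	ipToID := map[string]uint16{}

	// Node A: NodeInternalIP 10.0.0.1, CiliumInternalIP 10.1.0.1, nodeID 1.
	ipToID["10.0.0.1"] = 1
	ipToID["10.1.0.1"] = 1

	// Node A is deleted while the agent is down: without reconciliation,
	// its entries are never removed.

	// Partial update: new node B arrives with only NodeInternalIP
	// 10.0.0.2. No prior mapping exists, so it gets a fresh nodeID 2.
	ipToID["10.0.0.2"] = 2

	// Node B then receives the recycled CiliumInternalIP 10.1.0.1, but
	// the stale entry (nodeID 1) is left in place. Node B's two IPs now
	// map to different nodeIDs.
	return ipToID
}

func main() {
	m := simulate()
	fmt.Println(m["10.0.0.2"], m["10.1.0.1"]) // 2 1
}
```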

If this occurs, the agent can have a partitioned view of which nodeID this node should have - in the BPF map, the k8s internal IP will map to a different nodeID than the cilium internal IP. This breaks IPsec traffic towards this node, as BPF applies a mark based on the BPF map nodeID of the tunnel endpoint, but the XFRM states expect to match the mark based on the cilium internal IP. The result is traffic which doesn't match any XFRM state/policy, falling back to the catch-all block policy.
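A toy model of why the partitioned view breaks encryption, assuming a simplified mark encoding (the function name and bit layout here are hypothetical, not Cilium's real mark format): the datapath marks packets using the nodeID looked up via the tunnel endpoint IP, while the XFRM state was installed with a mark derived from the CiliumInternalIP's nodeID.

```go
package main

import "fmt"

// encryptMark derives a packet mark from a nodeID. The encoding
// (nodeID in the upper bits) is a simplification for illustration.
func encryptMark(nodeID uint16) uint32 {
	return uint32(nodeID) << 16
}

func main() {
	// The split view from the scenario: the node's two IPs map to
	// different nodeIDs (values hypothetical).
	ipToID := map[string]uint16{
		"10.0.0.2": 2, // NodeInternalIP (tunnel endpoint)
		"10.1.0.1": 1, // CiliumInternalIP (stale nodeID)
	}

	pktMark := encryptMark(ipToID["10.0.0.2"])   // set by the BPF datapath
	stateMark := encryptMark(ipToID["10.1.0.1"]) // expected by the XFRM state

	if pktMark != stateMark {
		// No XFRM state matches; traffic hits the catch-all block policy.
		fmt.Println("packet dropped: mark mismatch")
	}
}
```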

To work around this, we enforce that all IPs of a node get the same nodeID - even if an IP was already pointing to an existing nodeID. Since this node update is more current than whatever state we had held, it seems more correct to ensure all IPs point to the same nodeID than to avoid a BPF map write. We do so by forcing the allocation of a new nodeID (and logging an error).
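A minimal sketch of the reallocation idea, under the assumption of a simple in-memory allocator. All names here (`allocator`, `updateNode`, `allocateNodeID`) are illustrative, not Cilium's actual implementation:

```go
package main

import "fmt"

// allocator is a toy IP->nodeID allocator.
type allocator struct {
	ipToID map[string]uint16
	next   uint16
}

func newAllocator() *allocator {
	return &allocator{ipToID: map[string]uint16{}, next: 1}
}

func (a *allocator) allocateNodeID() uint16 {
	id := a.next
	a.next++
	return id
}

// updateNode assigns one nodeID to all IPs of a node. If the IPs already
// map to conflicting nodeIDs (stale state from missed deletions), a
// fresh nodeID is allocated and every IP is rewritten to it.
func (a *allocator) updateNode(ips []string) uint16 {
	existing := map[uint16]bool{}
	for _, ip := range ips {
		if id, ok := a.ipToID[ip]; ok {
			existing[id] = true
		}
	}

	var id uint16
	switch len(existing) {
	case 0:
		id = a.allocateNodeID() // no prior state: fresh ID
	case 1:
		for e := range existing {
			id = e // consistent prior state: reuse it
		}
	default:
		// Conflict: the node's IPs point at different nodeIDs. Prefer
		// reallocating a fresh ID for all IPs over keeping either stale
		// mapping, and surface the inconsistency in the log.
		fmt.Println("error: conflicting nodeIDs for node, reallocating")
		id = a.allocateNodeID()
	}

	for _, ip := range ips {
		a.ipToID[ip] = id
	}
	return id
}

func main() {
	a := newAllocator()
	// Stale entries left behind by missed node deletions.
	a.ipToID["10.0.0.2"] = 2
	a.ipToID["10.1.0.1"] = 1
	a.next = 3
	// A new node arrives reusing both IPs: they are unified on a fresh ID.
	fmt.Println("nodeID:", a.updateNode([]string{"10.0.0.2", "10.1.0.1"}))
}
```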

cc @rgo3

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Jun 11, 2024
@bimmlerd
Member Author

/test-backport-1.13


This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jul 12, 2024
@bimmlerd bimmlerd closed this Jul 12, 2024