v1.12 backports 2023-07-28 #27138

Merged
10 commits merged into cilium:v1.12 on Jul 31, 2023

Conversation

pchaigno
Member

Once this PR is merged, you can update the PR labels via:

$ for pr in 27029; do contrib/backporting/set-labels.py $pr done 1.13; done

[ upstream commit 93f70bb ]

[ backporter's notes: All changes conflicted so were reapplied from
  scratch. ]

IPsec isn't used with bpf_xdp and bpf_sock, nor is it compatible with
them, so there is no reason to compile-test it there. This will simplify
subsequent commits where we will introduce code in IPsec that isn't
compatible with non-tc BPF hooks.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6d84882 ]

[ backporter's notes: Trivial conflict on the unusual include of
  encrypt.h and due to macro guards at the top. ]

This commit has no functional changes. It simply moves the logic to
prepare packets for IPsec encryption to the dedicated encrypt.h file.
Function set_ipsec_encrypt_mark is also created to simplify a subsequent
commit (no, it won't remain a function that simply calls another
function :)).

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit f711329 ]

This commit has no functional changes. It simply moves the allocation of
node IDs to a single place in the top calling function.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit d2523ff ]

[ backporter's notes: The mocked node ID BPF map doesn't exist in v1.13
  and changes to the integration tests (see last paragraph below) were
  not necessary, so I skipped them. ]

Before this commit, we would allocate a node ID for new nodes if IPsec
was enabled. That wasn't consistent with (1) allocations triggered by
pod creations, where we allocate even if IPsec is disabled, and (2)
deallocations triggered by node deletions which also happen regardless
of IPsec being enabled.

This commit changes the nodeUpdate logic slightly such that the node ID
is allocated regardless of IPsec being enabled. As a consequence, we
need to update several integration tests to mock the node ID BPF map.
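The change described above can be sketched as follows. This is a hypothetical Python model, not Cilium's actual Go code; the class and method names are invented for illustration:

```python
# Hypothetical sketch of the nodeUpdate change: the node ID is now
# allocated unconditionally, instead of only when IPsec is enabled.
class NodeHandler:
    def __init__(self, ipsec_enabled):
        self.ipsec_enabled = ipsec_enabled
        self.node_ids = {}   # node name -> node ID
        self.next_id = 1

    def node_update(self, node_name):
        # Before this commit, this allocation was guarded by
        # `if self.ipsec_enabled:`. Allocating regardless makes it
        # consistent with pod-triggered allocations and with
        # node-deletion-triggered deallocations.
        if node_name not in self.node_ids:
            self.node_ids[node_name] = self.next_id
            self.next_id += 1
        return self.node_ids[node_name]

h = NodeHandler(ipsec_enabled=False)
print(h.node_update("node-a"))  # allocated even though IPsec is disabled
```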

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 11f7795 ]

[ backporter's notes: Trivial conflicts due to many additional drop
  reasons being added between this branch and v1.14. ]

This new packet drop type/reason will be used by IPsec and Mutual Auth
when a node ID lookup doesn't return anything. Both features rely on the
presence of the node ID to function properly.

A failure to find the node ID can occur if traffic is sent to a pod on
a node for which we didn't receive the NodeAdd or NodeUpdate event yet.
Such out-of-order events (PodAdd received before NodeAdd) are possible
and normal behavior.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6109a38 ]

[ backporter's notes: Many conflicts over large sections of code (in
  particular in encap.h). I resolved by reapplying the changes. ]

The IPsec feature relies on the node IDs encoded in the packet marks to
match packets against encryption rules (XFRM OUT policies and states).
It currently retrieves the ID of the remote node from the ipcache.

In a subsequent commit, however, we will remove this node ID information
from the ipcache (to simplify node ID management in the agent). So
IPsec's datapath must retrieve the node ID from a different place.

Fortunately, we already have a BPF map containing all node IP to node ID
mappings. Thus, in this commit, we use the remote node IP obtained from
the ipcache (ep->tunnel_endpoint) to look up the remote node ID in this
BPF map.

This one additional hashmap lookup is expected to have a negligible cost
compared to the cost of encryption. It should even be small compared to
the cost of the ipcache LPM lookup.
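The lookup-and-drop behavior described in this commit and the previous one can be modeled roughly as below. This is a hedged Python sketch of the datapath semantics, not the actual BPF C code; the map contents, function name, and return strings are invented:

```python
# Models the node IP -> node ID BPF hashmap and the IPsec encryption
# path: if the remote node's ID is missing, drop with DROP_NO_NODE_ID.
DROP_NO_NODE_ID = "DROP_NO_NODE_ID"  # stand-in for the new drop reason

node_id_map = {"192.168.1.2": 1}  # node IP -> node ID (models the BPF map)

def encrypt_packet(tunnel_endpoint):
    # tunnel_endpoint is the remote node IP from the ipcache entry.
    node_id = node_id_map.get(tunnel_endpoint)
    if node_id is None:
        # NodeAdd/NodeUpdate hasn't been processed yet, so the
        # encryption config for this node isn't in place either.
        return DROP_NO_NODE_ID
    # In the real datapath, the node ID is encoded in the packet mark
    # to match the XFRM OUT policies and states.
    return f"encrypt with node ID {node_id}"

print(encrypt_packet("192.168.1.2"))
print(encrypt_packet("10.0.0.99"))
```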

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 4856e38 ]

[ backporter's notes: Many small conflicts due to
  OnIPIdentityCacheChange's prototype changing in v1.14 and the ipcache
  map's structs slightly changing with the new loader. ]

In previous commits, we removed all consumers (IPsec and Mutual Auth)
of the ipcache's node_id field. This commit removes population of those
node IDs in the ipcache.

**Problem**

Needing to populate the ipcache with node IDs causes all sorts of
issues. Node IDs are allocated for and mapped with specific node IP
addresses. In practice, it means node IDs are tied to the lifecycle of
both node and pod objects:
1. We need to allocate node IDs when receiving node objects with new
   node IP addresses (either new node objects or existing node objects
   with changing IP addresses).
2. We need to allocate node IDs when receiving a node IP address via a
   pod (hostIP of the pod).

Node objects have all node IP addresses. So why are we not relying
solely on 1? We also need 2 because pod adds and pod updates (with new
hostIP) can arrive before the related node add or node update.
Basically, kube-apiserver doesn't provide any guarantee of ordering
between node and pod events. Since we need to populate the ipcache with
a node ID for each remote pod, we need to allocate a node ID when we
receive the pod object and not later, when we receive the node object.

Unfortunately, tying the node IDs to two object lifecycles in this way
causes a lot of complications. For example, we can't simply deallocate
node IDs on pod deletions because other pods may still be using the node
ID. We also can't deallocate simply on node deletions because pods may
still be using it. The typical solution to such issues is to maintain a
reference count, but that leads to its own set of issues in this case
[1].
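The lifecycle entanglement above can be made concrete with a small sketch. This is a hypothetical Python model (names invented) of why a reference count is the natural but problematic fix: several pods share one node ID, and pod events can precede node events:

```python
# Hypothetical model: pods pin a node ID via their hostIP, so neither a
# single PodDelete nor a NodeDelete alone can safely free the ID.
from collections import defaultdict

node_id_of_ip = {}           # node IP -> node ID
refcount = defaultdict(int)  # node IP -> number of pods referencing it
next_id = 1

def pod_add(host_ip):
    global next_id
    if host_ip not in node_id_of_ip:  # pod may arrive before its node
        node_id_of_ip[host_ip] = next_id
        next_id += 1
    refcount[host_ip] += 1

def pod_delete(host_ip):
    refcount[host_ip] -= 1
    # We can't free the ID even at refcount 0: the node object may still
    # reference it, and freeing would break other consumers of the ID.

pod_add("10.0.0.1")
pod_add("10.0.0.1")   # second pod on the same node shares the ID
pod_delete("10.0.0.1")
print(node_id_of_ip["10.0.0.1"], refcount["10.0.0.1"])
```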

**Considered Solution**

An alternative to all this is to force events to always be managed in
the same order. For example, NodeAdd -> PodAdd -> PodDelete ->
NodeDelete. The agent unfortunately doesn't easily lend itself to that.
Without a major refactoring, we would have to populate the ipcache as
usual, but skip node IDs. Then, on NodeAdd, we would allocate a node ID
and go through the ipcache again to populate the node IDs we skipped.
In the datapath, we would have an explicit drop for such half-populated
ipcache entries.

To implement this without needing to walk through the entire ipcache for
each NodeAdd or NodeUpdate, we need to keep track of the specific
entries that need to be populated. We need the same sort of mechanism
for deletions. This solution quickly turns into something quite complex
to implement with good performance.

**This Solution**

It should be quite clear at this point that this problem's complexity
stems from having node IDs tied to both the pods' and the nodes'
lifecycles.

To untie node IDs from pod lifecycles, and thus, stop populating the
ipcache with node IDs, we will need the datapath to retrieve them from
some other place. Fortunately, we already have a BPF map that maps node
IP addresses to node IDs. It is currently only used as a restoration
mechanism for agent restarts, but there's no reason we can't use it for
more.

Thus, with the node IP address retrieved from the ipcache
(ep->tunnel_endpoint), we can look up the node ID in the BPF map.

Of course, since pod objects can arrive before their corresponding node
object, the node ID may not always be ready in the BPF map when we need
it. When that happens, we will drop the packet with a specific drop code
(DROP_NO_NODE_ID). These drops simply mean we tried to send traffic to a
node for which we didn't receive the node object yet (hence the
encryption configuration isn't set up yet).

These datapath changes have been implemented in the previous commit.
This commit removes all population of node IDs in the ipcache.

We can also remove node IDs from the tunnel map. It matters less since
the tunnel map is tied to node objects, but let's keep the datapath
consistent between the two maps (ipcache and tunnel map).

1 - cilium#26725 (comment)
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno added the area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. label Jul 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.12 This PR represents a backport for Cilium 1.12.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Jul 28, 2023
[ upstream commit efbba59 ]

[ backporter's notes: Many small conflicts due to some node ID getters
  being added in v1.14 for the Mutual Auth feature. Since that feature is
  absent in previous versions, we can also remove the ipcache's
  NodeIDHandler. Some changes to integration tests that don't exist
  before v1.14 were also not backported. ]

The AllocateNodeID function was used to allocate node IDs from the
ipcache logic. Since we don't populate the ipcache with node IDs
anymore, it's unused and can be removed. The DeallocateNodeID function
was never used (a sibling function was used instead).

This commit has no functional changes.

Reported-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 4d46fa9 ]

This commit fixes a leak of node IP to node ID mappings when the set of
IP addresses of a node is updated. We have logic to allocate and map
node IDs on node creation, logic to unmap and deallocate node IDs on
node deletions, but we didn't fully handle the case of node updates.

On node updates, we would map new IP addresses to the already-allocated
node ID. We forgot to unmap the old IP addresses. This commit fixes that
oversight.

This leak means that some IP addresses that were previously assigned to
the node would remain mapped to the node ID. Those IP addresses could
then be assigned to a new node. The new node would then receive the same
node ID as the old node (because we first check if any of the node IPs
already have an ID). This could lead to encrypted traffic being sent to
the wrong node.
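The fix can be sketched as below. This is a hedged Python model of the update path (function and variable names are invented, not Cilium's actual code): the missing step was removing mappings for IPs the node no longer owns:

```python
# Hypothetical sketch of the node-update fix: map the new IPs and unmap
# the stale ones, so old IPs can't leak the node ID to a future node.
ip_to_node_id = {}

def on_node_update(old_ips, new_ips, node_id):
    for ip in new_ips:
        ip_to_node_id[ip] = node_id
    # The step that was missing before this commit: unmap IPs that are
    # no longer assigned to the node.
    for ip in set(old_ips) - set(new_ips):
        del ip_to_node_id[ip]

on_node_update([], ["10.0.0.1", "10.0.0.2"], 7)
on_node_update(["10.0.0.1", "10.0.0.2"], ["10.0.0.3"], 7)
print(sorted(ip_to_node_id.items()))  # [('10.0.0.3', 7)]
```

Without the unmap step, "10.0.0.1" and "10.0.0.2" would remain mapped to ID 7 and could be inherited by a new node assigned those addresses.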

Before removing the node IDs from the ipcache (see previous commit),
this leak could have another consequence. If after the node ID had been
reassigned to the new node, the old node was deleted, the node ID would
be deallocated, but it would remain in the ipcache. The XFRM policies
and states of the node would also be identified based on the node ID and
removed. Thus, both the old and new nodes' XFRM configs would be
removed. As a consequence, outgoing packets for those nodes would be
dropped with XfrmOutPolBlock (the default drop-all rule).

Fixes: af88b42 ("datapath: Introduce node IDs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit e1b9063 ]

[ backporter's notes: Small conflict at the top of deallocateIDForNode. ]

The previous commit fixed a bug that could lead to two nodes sharing the
same node ID. This commit adds a check, on node deletion, for such
erroneous state. It completes the existing check that a single node
doesn't have multiple node IDs. In both cases, an error is thrown such
that users will notice and our CI will fail.
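The new check could look roughly like this. A hypothetical Python sketch (names invented): after unmapping a deleted node's IPs, verify no other IP still maps to the deallocated ID:

```python
# Hypothetical sketch of the node-deletion consistency check: raise an
# error if another node still maps to the ID being deallocated.
def deallocate_id_for_node(ip_to_node_id, node_ips, node_id):
    for ip in node_ips:
        ip_to_node_id.pop(ip, None)
    leftovers = [ip for ip, nid in ip_to_node_id.items() if nid == node_id]
    if leftovers:
        # Two nodes erroneously shared the same node ID; surface it so
        # users notice and CI fails.
        raise RuntimeError(f"node ID {node_id} still mapped to {leftovers}")

m = {"10.0.0.1": 7, "10.0.0.9": 7}  # erroneous state: ID 7 shared
try:
    deallocate_id_for_node(m, ["10.0.0.1"], 7)
except RuntimeError as e:
    print("detected:", e)
```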

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno force-pushed the pr/v1.12-backport-2023-07-28 branch from d4fe28c to 4ff31fd Compare July 28, 2023 17:42
@pchaigno
Member Author

/test-backport-1.12

@pchaigno
Member Author

pchaigno commented Jul 28, 2023

Provisioning error:
/test-1.20-4.9

@pchaigno pchaigno marked this pull request as ready for review July 31, 2023 08:17
@pchaigno pchaigno requested review from a team as code owners July 31, 2023 08:17
@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2023
@dylandreimerink dylandreimerink merged commit ab43969 into cilium:v1.12 Jul 31, 2023
55 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2023
@pchaigno pchaigno deleted the pr/v1.12-backport-2023-07-28 branch July 31, 2023 11:52