
v1.13 backports 2023-07-27 #27107

Merged
merged 19 commits into cilium:v1.13 from pr/v1.13-backport-2023-07-27 on Jul 31, 2023

Conversation

pchaigno (Member) commented Jul 27, 2023

Once this PR is merged, you can update the PR labels via:

$ for pr in 25114 25191 25699 27029; do contrib/backporting/set-labels.py $pr done 1.13; done

pchaigno added the area/encryption label on Jul 27, 2023
maintainer-s-little-helper bot added the backport/1.13 and kind/backports labels on Jul 27, 2023
pchaigno force-pushed the pr/v1.13-backport-2023-07-27 branch 16 times, most recently from 7119fc4 to 76109c3, on July 28, 2023 12:07
gentoo-root and others added 11 commits July 28, 2023 14:21
[ upstream commit 41d18ff ]

All the usages suggest that the order of parameters of the IPV4 macro in
bpf/tests/pktgen.h is natural: IPV4(192, 168, 0, 1) means 192.168.0.1,
not 1.0.168.192. However, the IPV4 macro assembles the bytes in the
opposite order, i.e. like 1.0.168.192.

Fix the macro by putting the bytes in the intended order. This change
should only have a cosmetic effect when IP addresses are printed in the
failed tests, but it might expose some real issues if some tests rely on
the IP prefixes.
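
For illustration, a minimal sketch of the intended behavior (the exact
macro in bpf/tests/pktgen.h may differ in details):

```
/* Hypothetical sketch of the fixed macro: place the first argument in the
 * most significant byte of the host-order value, then convert to network
 * byte order, so IPV4(192, 168, 0, 1) yields 192.168.0.1 on the wire
 * instead of 1.0.168.192. bpf_htonl comes from bpf/lib/endian.h. */
#define IPV4(a, b, c, d) \
	bpf_htonl(((__u32)(a) << 24) | ((__u32)(b) << 16) | \
		  ((__u32)(c) << 8) | (__u32)(d))
```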

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit ceaa4c4 ]

Running the `TestBPF/ipv6_test.o` test locally with clang v13 and kernel
v5.17 produced the following verifier error:

```
171: (b7) r2 = 1                      ; R2_w=inv1
; authhdr->seq = 1;
172: (63) *(u32 *)(r1 +8) = r2
invalid access to packet, off=8 size=4, R1(id=6,off=8,r=2)
R1 offset is outside of the packet
```

Tracked this down to `ipv6_with_hop_auth_tcp_pktgen`. It gets a pointer
from `pktgen__append_ipv6_extension_header`. All expected bounds checks
were in place. It even has a function called `ctx_data_valid` with some
inline assembly which is supposed to perform additional bounds checks.

I was able to fix this by replacing `ctx_data_valid` with a normal check
and adding a bounds check to `ipv6_with_hop_auth_tcp_pktgen`. This
bounds check seems to be necessary because bounds information doesn't
carry over to copies of the value.

The verifier log reveals the following:

```
85: (69) r2 = *(u16 *)(r10 -40)       ; R2_w=invP14 R10=fp0
; builder->layer_offsets[layer_idx] = builder->cur_off;
86: (6b) *(u16 *)(r4 +10) = r2        ; R2_w=invP14 R4_w=fp-44 fp-40=mmmmmmmm
; layer = ctx_data(ctx) + builder->cur_off;
87: (0f) r1 += r7                     ; R1_w=pkt(id=0,off=14,r=54,imm=0) R7=invP14
; builder->cur_off += hdrsize;
88: (69) r2 = *(u16 *)(r10 -40)       ; R2_w=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R10=fp0
```

At instruction 85 we know the offset is `14` exactly. It is then written
back to the stack. And at 88 we have lost bounds data. Since this is
an offset we use to construct the pointers, this uncertainty becomes
part of all pointers causing the need for additional checks.

To mitigate the result of the above apparent verifier bug/limitation,
I have also changed our offset value from 16 bits to 64 bits, which the
verifier tracks better.
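
For illustration, the shape of the change is roughly as follows (a sketch
only; the identifiers are taken from the log above and the real test code
differs):

```
/* Illustrative sketch: replace the inline-asm ctx_data_valid() trick with
 * a plain bounds check on the very pointer that is about to be written,
 * so the verifier tracks the bounds on this copy of it. */
static __always_inline int write_auth_seq(struct __ctx_buff *ctx,
					  struct pktgen *builder)
{
	void *data_end = ctx_data_end(ctx);
	struct authhdr *authhdr = ctx_data(ctx) + builder->cur_off;

	if ((void *)authhdr + sizeof(*authhdr) > data_end)
		return -1;	/* hypothetical failure path */

	authhdr->seq = 1;	/* now provably within packet bounds */
	return 0;
}
```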

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
LXC_IP was ill-defined in config_replacement.h, leading to errors like:

    In file included from ipsec_from_lxc_test.c:20:
    In file included from /cilium/bpf/bpf_lxc.c:35:
    /cilium/bpf/lib/lxc.h:28:16: error: too many arguments provided to function-like macro invocation
            BPF_V6(valid, LXC_IP);
                          ^
    ./config_replacement.h:17:30: note: expanded from macro 'LXC_IP'
    #define LXC_IP { { LXC_IP_1, LXC_IP_2, LXC_IP_3, LXC_IP_4 } }
                                 ^
    ./config_replacement.h:14:18: note: expanded from macro 'LXC_IP_2'
    #define LXC_IP_2 bpf_htonl((0) << 24 | (0) << 16 | (0) << 8 | (0x01))
                     ^
    /cilium/bpf/lib/endian.h:34:2: note: expanded from macro 'bpf_htonl'
            (__builtin_constant_p(x) ?              \
            ^
    /cilium/bpf/lib/static_data.h:16:9: note: macro 'fetch_ipv6' defined here
    #define fetch_ipv6(x) fetch_u32_i(x, 1), fetch_u32_i(x, 2), fetch_u32_i(x, 3), fetch_u32_i(x, 4)
            ^
    /cilium/bpf/lib/lxc.h:28:2: note: cannot use initializer list at the beginning of a macro argument
            BPF_V6(valid, LXC_IP);
            ^
    /cilium/bpf/lib/common.h:222:40: note: expanded from macro 'BPF_V6'
    #define BPF_V6(dst, ...)        BPF_V6_1(dst, fetch_ipv6(__VA_ARGS__))
                                                  ^

This commit fixes it. As a result, pktgen.h must be included before
config_replacement.h to define bpf_htonl. The only file where that
wasn't the case is tc_nodeport_test.c, also fixed here.
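
The root cause is generic C preprocessor behavior rather than anything
BPF-specific; a standalone illustration (names made up, unrelated to the
Cilium headers):

```
#define LIST	 { 1, 2, 3, 4 }
#define INNER(x) x
#define OUTER(x) INNER(x)

/* OUTER(LIST): LIST is first expanded during argument prescan, then the
 * replacement is rescanned as INNER({ 1, 2, 3, 4 }). Braces, unlike
 * parentheses, do not protect commas from the preprocessor, so INNER
 * appears to receive four arguments and clang reports "too many arguments
 * provided to function-like macro invocation" -- the same failure mode as
 * BPF_V6(valid, LXC_IP) above. */
```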

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 51d7f70 ]

The previous implementation of BPFProgram.Test() only allowed passing
and returning bytes as *skb->data, without the ability to specify input
skb metadata or check output skb metadata.

This commit introduces a new function named runBpfProgram, which passes
an additional []byte as input skb metadata and returns an additional
[]byte as output skb metadata. By utilizing this function, we ensure
that the skb->mark set in PKTGEN can be properly passed to the SETUP,
and any modifications to skb->mark or skb->cb can be accurately examined
during the CHECK for validation purposes.

The input ctx bytes are used as the input skb: roughly, you can expect
`memcpy(skb, ctx, sizeof(*skb))` to happen inside, and you can read
skb->mark via `mark := ctx[offset(skb, mark): sizeof(skb->mark)]`.

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit b760a99 ]

[ backporter's notes: Trivial conflicts due to security identity fields
  being renamed in v1.14.
  Note we also need to properly define LXC_IP and LXC_IPV4 to match the
  source pod IP addresses. This is required to pass the source IP
  verification in bpf_lxc. It isn't required in v1.14 because that
  verification is disabled in the tests. It can't be disabled before
  v1.14. ]

These testcases cover IPSec datapath on `from-container` for both IPv4 and IPv6.

The following behaviors are checked after execution of `from-container`:
1. skb->mark should be set to `node_id << 16 | key << 12 | 0xe00`
2. skb->cb[4] should be set to source sec_id
3. skb should be passed to stack
4. skb->data doesn't change
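
For illustration, the first two assertions boil down to something like the
sketch below (not the actual test code; node_id, key, and the source
identity are passed in for clarity):

```
#include <linux/bpf.h>		/* struct __sk_buff */
#include <linux/types.h>

/* Sketch only. */
static inline int encrypt_mark_ok(const struct __sk_buff *skb,
				  __u16 node_id, __u8 key, __u32 src_sec_id)
{
	__u32 expected = ((__u32)node_id << 16) | ((__u32)key << 12) | 0xe00;

	return skb->mark == expected &&		/* 1. encryption request mark */
	       skb->cb[4] == src_sec_id;	/* 2. source security identity */
}
```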

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit dee0683 ]

[ backporter's notes: Minor conflicts due to security identity fields
  being renamed in v1.14. mock_skb_set_tunnel_key also needed to be
  updated to remove support for local_ipv4. ]

These testcases cover IPSec datapath on `from-host` for both IPv4 and
IPv6.

The input skb should be ESP-encrypted with mark set to `node_id
<< 16 | key << 12 | 0xe00` and cb[4] set to sec_id. We will check the
following behaviors after execution of `from-host`:

1. skb->mark is cleared to 0
2. skb->cb[4] is cleared to 0
3. skb is redirected to cilium_vxlan
4. skb->data doesn't change
5. VxLAN VNI is set to sec_id

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6decd65 ]

These testcases cover IPsec datapath for both IPv4 and IPv6.

An ingress skb will reach `from-overlay` twice: on the first pass the skb
is ESP-encrypted; on the second pass it is ESP-decrypted and carries mark
`0xd00`.

For the first pass, we check the following behaviors:
1. skb is passed to the stack
2. skb->mark is set to 0xd00
3. skb->data doesn't change

For the second pass, we check:
1. skb is redirected to the target lxc veth
2. skb->mark is cleared to 0
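
Sketched as plain checks (return codes and helper shapes are assumptions;
the real CHECK blocks differ):

```
#include <linux/bpf.h>		/* struct __sk_buff */
#include <linux/pkt_cls.h>	/* TC_ACT_OK, TC_ACT_REDIRECT */

/* Sketch only: "verdict" stands for the return value of from-overlay. */
static inline int first_pass_ok(const struct __sk_buff *skb, int verdict)
{
	/* Still encrypted: handed to the stack with the decrypt mark set. */
	return verdict == TC_ACT_OK && skb->mark == 0xd00;
}

static inline int second_pass_ok(const struct __sk_buff *skb, int verdict)
{
	/* Decrypted: redirected (to the target lxc veth) with mark cleared. */
	return verdict == TC_ACT_REDIRECT && skb->mark == 0;
}
```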

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 93f70bb ]

[ backporter's notes: All changes conflicted so were reapplied from
  scratch. ]

IPsec isn't used with bpf_xdp and bpf_sock, nor is it compatible with
them, so there is no reason to compile-test it there. This will simplify
subsequent commits, which introduce some IPsec code that isn't compatible
with non-tc BPF hooks.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6d84882 ]

[ backporter's notes: Trivial conflict on the unusual include of
  encrypt.h. ]

This commit has no functional changes. It simply moves the logic to
prepare packets for IPsec encryption to the dedicated encrypt.h file.
Function set_ipsec_encrypt_mark is also created to simplify a subsequent
commit (no, it won't remain a function that simply calls another
function :)).
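
As a rough sketch of the helper's shape (the real encrypt.h code and
signature may differ; the mark layout matches the one described in the
IPsec test commits above):

```
/* Sketch only: set the IPsec encryption request mark on the packet. */
static __always_inline void set_ipsec_encrypt_mark(struct __ctx_buff *ctx,
						   __u16 node_id, __u8 key)
{
	/* node_id in the upper 16 bits, key/SPI in bits 12-15, 0xe00 magic. */
	ctx->mark = ((__u32)node_id << 16) | ((__u32)key << 12) | 0xe00;
}
```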

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit f711329 ]

This commit has no functional changes. It simply moves the allocation of
node IDs to a single place in the top calling function.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit d2523ff ]

[ backporter's notes: The mocked node ID BPF map doesn't exist in v1.13
  and changes to the integration tests (see last paragraph below) were
  not necessary, so I skipped them. ]

Before this commit, we would allocate a node ID for new nodes if IPsec
was enabled. That wasn't consistent with (1) allocations triggered by
pod creations, where we allocate even if IPsec is disabled, and (2)
deallocations triggered by node deletions which also happen regardless
of IPsec being enabled.

This commit changes the nodeUpdate logic slightly such that the node ID
is allocated regardless of IPsec being enabled. As a consequence, we
need to update several integration tests to mock the node ID BPF map.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 11f7795 ]

This new packet drop type/reason will be used by IPsec and Mutual Auth
when a node ID lookup doesn't return anything. Both features rely on the
presence of the node ID to function properly.

A failure to find the node ID can occur if traffic is sent to a pod on
a node for which we didn't receive the NodeAdd or NodeUpdate event yet.
Such out-of-order events (PodAdd received before NodeAdd) are possible
and normal behavior.
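
The datapath side amounts to a new drop code plus a check at the lookup
site, roughly as follows (the numeric value and helper below are
placeholders, not the real code):

```
#include <linux/types.h>

/* Sketch only: drop reasons in bpf/lib/common.h are negative codes; the
 * value here is a placeholder. */
#define DROP_NO_NODE_ID		(-180)

/* At a use site (IPsec or Mutual Auth), the pattern is roughly: */
static inline int check_node_id(__u16 node_id)
{
	if (!node_id)
		return DROP_NO_NODE_ID;	/* NodeAdd/NodeUpdate not received yet */
	return 0;			/* CTX_ACT_OK in the real code */
}
```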

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6109a38 ]

[ backporter's notes: Many small conflicts due to the rename of security
  identity variables and fields in the C codebase. ]

The IPsec feature relies on the node IDs encoded in the packet marks to
match packets against encryption rules (XFRM OUT policies and states).
It currently retrieves the ID of the remote node from the ipcache.

In a subsequent commit, however, we will remove this node ID information
from the ipcache (to simplify node ID management in the agent). So
IPsec's datapath must retrieve the node ID from a different place.

Fortunately, we already have a BPF map containing all node IP to node ID
mappings. Thus, in this commit, we use the remote node IP obtained from
the ipcache (ep->tunnel_endpoint) to lookup the remote node ID in this
BPF map.

This one additional hashmap lookup is expected to have a negligible cost
compared to the cost of encryption. It should even be small compared to
the cost of the ipcache LPM lookup.
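
A compressed sketch of what the extra lookup amounts to (the map name and
key layout here are assumptions, not the exact bpf/lib definitions):

```
#include <linux/types.h>

/* Hypothetical key layout for the node IP -> node ID hashmap. */
struct node_key {
	__u16 family;		/* IPv4 vs IPv6 discriminator */
	__u8  pad[2];
	union {
		__u32 ip4;
		__u8  ip6[16];
	};
};

/* Sketch only: resolve the remote node ID from the tunnel endpoint IP
 * obtained from the ipcache (ep->tunnel_endpoint). */
static inline __u16 lookup_remote_node_id(__u32 tunnel_endpoint)
{
	struct node_key key = {
		.family = 1,		/* IPv4 tag, illustrative */
		.ip4 = tunnel_endpoint,
	};
	/* Standard BPF hashmap lookup; node_id_map is a hypothetical name. */
	__u16 *node_id = map_lookup_elem(&node_id_map, &key);

	return node_id ? *node_id : 0;	/* 0 => caller drops with DROP_NO_NODE_ID */
}
```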

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 426c424 ]

This will be useful to enforce ordering between different tests, with
the first one preparing part of the maps for subsequent tests. Numbering
starts at 2 because the next commit will introduce a test 1.

"ipsec_" is removed from the test names to shorten it a bit (the IPsec
context should be clear from the test filename).

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 147a130 ]

If the node ID for the destination node of a pod-to-pod packet is not
found in the node ID map, we drop the packet with DROP_NO_NODE_ID. This
can happen when the k8s Node object for that destination hasn't been
received yet. This commit covers that case in BPF unit tests.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 4856e38 ]

[ backporter's notes: Many small conflicts due to
  OnIPIdentityCacheChange's prototype changing in v1.14 and the ipcache
  map's structs slightly changing with the new loader. ]

In previous commits, we removed all consumers (IPsec and Mutual Auth)
of the ipcache's node_id field. This commit removes population of those
node IDs in the ipcache.

**Problem**

Needing to populate the ipcache with node IDs causes all sorts of
issues. Node IDs are allocated for and mapped with specific node IP
addresses. In practice, it means node IDs are tied to the lifecycle of
both node and pod objects:
1. We need to allocate node IDs when receiving node objects with new
   node IP addresses (either new node objects or existing node objects
   with changing IP addresses).
2. We need to allocate node IDs when receiving a node IP address via a
   pod (hostIP of the pod).

Node objects have all node IP addresses. So why are we not relying
solely on 1? We also need 2 because pod adds and pod updates (with new
hostIP) can arrive before the related node add or node update.
Basically, kube-apiserver doesn't provide any guarantee of ordering
between node and pod events. Since we need to populate the ipcache with
a node ID for each remote pod, we need to allocate a node ID when we
receive the pod object and not later, when we receive the node object.

Unfortunately, tying the node IDs to two object lifecycles in this way
causes a lot of complications. For example, we can't simply deallocate
node IDs on pod deletions because other pods may still be using the node
ID. We also can't deallocate simply on node deletions because pods may
still be using it. The typical solution to such issues is to maintain a
reference count, but that leads to its own set of issues in this case
[1].

**Considered Solution**

An alternative to all this is to force events to always be managed in
the same order. For example, NodeAdd -> PodAdd -> PodDelete ->
NodeDelete. The agent unfortunately doesn't easily lend itself to that.
Without a major refactoring, we would have to populate the ipcache as
usual, but skip node IDs. Then, on NodeAdd, we would allocate a node ID
and go through the ipcache again to populate the node IDs we skipped.
In the datapath, we would have an explicit drop for such half-populated
ipcache entries.

To implement this without needing to walk through the entire ipcache for
each NodeAdd or NodeUpdate, we need to keep track of the specific
entries that need to be populated. We need the same sort of mechanism
for deletions. This solution quickly turns into something quite complex
to implement with good performance.

**This Solution**

It should be quite clear at this point that this problem's complexity
stems from having node IDs tied to both the pods and nodes' lifecycles.

To untie node IDs from pod lifecycles, and thus, stop populating the
ipcache with node IDs, we will need the datapath to retrieve them from
some other place. Fortunately, we already have a BPF map that maps node
IP addresses to node IDs. It is currently only used as a restoration
mechanism for agent restarts, but there's no reason we can't use it for
more.

Thus, with the node IP address retrieved from the ipcache
(ep->tunnel_endpoint), we can lookup the node ID in the BPF map.

Of course, since pod objects can arrive before their corresponding node
object, the node ID may not always be ready in the BPF map when we need
it. When that happens, we will drop the packet with a specific drop code
(DROP_NO_NODE_ID). These drops simply mean we tried to send traffic to a
node for which we didn't receive the node object yet (hence the
encryption configuration isn't setup yet).

These datapath changes have been implemented in the previous commit.
This commit removes all population of node IDs in the ipcache.

We can also remove node IDs from the tunnel map. It matters less since
the tunnel map is tied to node objects, but let's keep the datapath
consistent between the two maps (ipcache and tunnel map).

1 - cilium#26725 (comment)
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit efbba59 ]

[ backporter's notes: Many small conflicts due to some node ID getters
  being added in v1.14 for the Mutual Auth feature. Since that feature is
  absent in previous versions, we can also remove the ipcache's
  NodeIDHandler. Some changes to integration tests that don't exist
  before v1.14 were also not backported. ]

The AllocateNodeID function was used to allocate node IDs from the
ipcache logic. Since we don't populate the ipcache with node IDs
anymore, it's unused and can be removed. The DeallocateNodeID function
was never used (a sibling function was used instead).

This commit has no functional changes.

Reported-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 4d46fa9 ]

This commit fixes a leak of node IP to node ID mappings when the set of
IP addresses of a node is updated. We have logic to allocate and map
node IDs on node creation, logic to unmap and deallocate node IDs on
node deletions, but we didn't fully handle the case of node updates.

On node updates, we would map new IP addresses to the already-allocated
node ID. We forgot to unmap the old IP addresses. This commit fixes that
oversight.

This leak means that some IP addresses that were previously assigned to
the node would remain mapped to the node ID. Those IP addresses could
then be assigned to a new node. The new node would then receive the same
node ID as the old node (because we first check if any of the node IPs
already have an ID). This could lead to encrypted traffic being sent to
the wrong node.

Before removing the node IDs from the ipcache (see previous commit),
this leak could have another consequence. If after the node ID had been
reassigned to the new node, the old node was deleted, the node ID would
be deallocated, but it would remain in the ipcache. The XFRM policies
and states of the node would also be identified based on the node ID and
removed. Thus, both the old and new nodes' XFRM configs would be
removed. As a consequence, outgoing packets for those nodes would be
dropped with XfrmOutPolBlock (the default drop-all rule).

Fixes: af88b42 ("datapath: Introduce node IDs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit e1b9063 ]

The previous commit fixed a bug that could lead to two nodes sharing the
same node ID. This commit adds a check, on node deletion, for such an
erroneous state. It completes the existing check that a single node
doesn't have multiple node IDs. In both cases, an error is thrown such
that users will notice and our CI will fail.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
pchaigno force-pushed the pr/v1.13-backport-2023-07-27 branch from 76109c3 to 510330f on July 28, 2023 12:51
pchaigno (Member, Author) commented Jul 28, 2023

/test-backport-1.13

k8s-1.18-kernel-4.19 failed with #13071. Travis CI failing with #25272.

pchaigno marked this pull request as ready for review on July 28, 2023 13:50
pchaigno requested review from a team as code owners on July 28, 2023 13:50
jschwinger233 (Member) left a comment:

Thank you Paul!

jschwinger233 (Member) commented:

/test-1.18-4.19

gentoo-root (Contributor) left a comment:

+1 for my commit.

pchaigno added the ready-to-merge label on Jul 31, 2023
pchaigno (Member, Author) commented Jul 31, 2023

Marking ready to merge to summon the tophat who also happens to be the last missing reviewer. Travis CI failed with known flake #25272.

dylandreimerink merged commit 7aae5cb into cilium:v1.13 on Jul 31, 2023
60 of 61 checks passed
pchaigno deleted the pr/v1.13-backport-2023-07-27 branch on July 31, 2023 12:54
Labels: area/encryption, backport/1.13, kind/backports, ready-to-merge