
v1.13 backports 2023-07-27 #27107

Merged
merged 19 commits into cilium:v1.13 from pr/v1.13-backport-2023-07-27 on Jul 31, 2023

Conversation

pchaigno (Member) commented Jul 27, 2023

Once this PR is merged, you can update the PR labels via:

$ for pr in 25114 25191 25699 27029; do contrib/backporting/set-labels.py $pr done 1.13; done

pchaigno added the area/encryption label on Jul 27, 2023
maintainer-s-little-helper bot added the backport/1.13 and kind/backports labels on Jul 27, 2023
pchaigno force-pushed the pr/v1.13-backport-2023-07-27 branch 16 times, most recently from 7119fc4 to 76109c3, on July 28, 2023 12:07
gentoo-root and others added 11 commits July 28, 2023 14:21
[ upstream commit 41d18ff ]

All the usages suggest that the order of parameters of the IPV4 macro in
bpf/tests/pktgen.h is natural: IPV4(192, 168, 0, 1) means 192.168.0.1,
not 1.0.168.192. However, the IPV4 macro assembles the bytes in the
opposite order, i.e. like 1.0.168.192.

Fix the macro by putting the bytes in the intended order. This change
should only have a cosmetic effect when IP addresses are printed in the
failed tests, but it might expose some real issues if some tests rely on
the IP prefixes.
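
For illustration, a minimal sketch of the intended behavior (the exact
macro in bpf/tests/pktgen.h may differ in details):

```
/* Hypothetical sketch of the fixed macro: place the first argument in the
 * most significant byte of the host-order value, then convert to network
 * byte order, so IPV4(192, 168, 0, 1) yields 192.168.0.1 on the wire
 * instead of 1.0.168.192. bpf_htonl comes from bpf/lib/endian.h. */
#define IPV4(a, b, c, d) \
	bpf_htonl(((__u32)(a) << 24) | ((__u32)(b) << 16) | \
		  ((__u32)(c) << 8) | (__u32)(d))
```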

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit ceaa4c4 ]

Running the `TestBPF/ipv6_test.o` test locally with clang v13 and kernel
v5.17 produced the following verifier error:

```
171: (b7) r2 = 1                      ; R2_w=inv1
; authhdr->seq = 1;
172: (63) *(u32 *)(r1 +8) = r2
invalid access to packet, off=8 size=4, R1(id=6,off=8,r=2)
R1 offset is outside of the packet
```

Tracked this down to `ipv6_with_hop_auth_tcp_pktgen`. It gets a pointer
from `pktgen__append_ipv6_extension_header`. All expected bounds checks
were in place. It even has a function called `ctx_data_valid` with some
inline assembly which is supposed to perform additional bounds checks.

I was able to fix this by replacing `ctx_data_valid` with a normal check
and adding a bounds check to `ipv6_with_hop_auth_tcp_pktgen`. This
bounds check seems to be necessary because bounds information doesn't
carry over to copies of the value.

The verifier log reveals the following:

```
85: (69) r2 = *(u16 *)(r10 -40)       ; R2_w=invP14 R10=fp0
; builder->layer_offsets[layer_idx] = builder->cur_off;
86: (6b) *(u16 *)(r4 +10) = r2        ; R2_w=invP14 R4_w=fp-44 fp-40=mmmmmmmm
; layer = ctx_data(ctx) + builder->cur_off;
87: (0f) r1 += r7                     ; R1_w=pkt(id=0,off=14,r=54,imm=0) R7=invP14
; builder->cur_off += hdrsize;
88: (69) r2 = *(u16 *)(r10 -40)       ; R2_w=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R10=fp0
```

At instruction 85 we know the offset is `14` exactly. It is then written
back to the stack. And at 88 we have lost bounds data. Since this is
an offset we use to construct the pointers, this uncertainty becomes
part of all pointers causing the need for additional checks.

To mitigate the result of the above apparent verifier bug/limitation,
I have also changed our offset value from 16 bits to 64 bits, which the
verifier tracks better.
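
For illustration, the shape of the change is roughly as follows (a sketch
only; the identifiers are taken from the log above and the real test code
differs):

```
/* Illustrative sketch: replace the inline-asm ctx_data_valid() trick with
 * a plain bounds check on the very pointer that is about to be written,
 * so the verifier tracks the bounds on this copy of it. */
static __always_inline int write_auth_seq(struct __ctx_buff *ctx,
					  struct pktgen *builder)
{
	void *data_end = ctx_data_end(ctx);
	struct authhdr *authhdr = ctx_data(ctx) + builder->cur_off;

	if ((void *)authhdr + sizeof(*authhdr) > data_end)
		return -1;	/* hypothetical failure path */

	authhdr->seq = 1;	/* now provably within packet bounds */
	return 0;
}
```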

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
LXC_IP was ill-defined in config_replacement.h, leading to errors like:

    In file included from ipsec_from_lxc_test.c:20:
    In file included from /cilium/bpf/bpf_lxc.c:35:
    /cilium/bpf/lib/lxc.h:28:16: error: too many arguments provided to function-like macro invocation
            BPF_V6(valid, LXC_IP);
                          ^
    ./config_replacement.h:17:30: note: expanded from macro 'LXC_IP'
    #define LXC_IP { { LXC_IP_1, LXC_IP_2, LXC_IP_3, LXC_IP_4 } }
                                 ^
    ./config_replacement.h:14:18: note: expanded from macro 'LXC_IP_2'
    #define LXC_IP_2 bpf_htonl((0) << 24 | (0) << 16 | (0) << 8 | (0x01))
                     ^
    /cilium/bpf/lib/endian.h:34:2: note: expanded from macro 'bpf_htonl'
            (__builtin_constant_p(x) ?              \
            ^
    /cilium/bpf/lib/static_data.h:16:9: note: macro 'fetch_ipv6' defined here
    #define fetch_ipv6(x) fetch_u32_i(x, 1), fetch_u32_i(x, 2), fetch_u32_i(x, 3), fetch_u32_i(x, 4)
            ^
    /cilium/bpf/lib/lxc.h:28:2: note: cannot use initializer list at the beginning of a macro argument
            BPF_V6(valid, LXC_IP);
            ^
    /cilium/bpf/lib/common.h:222:40: note: expanded from macro 'BPF_V6'
    #define BPF_V6(dst, ...)        BPF_V6_1(dst, fetch_ipv6(__VA_ARGS__))
                                                  ^

This commit fixes it. As a result, pktgen.h must be included before
config_replacement.h to define bpf_htonl. The only file where that
wasn't the case is tc_nodeport_test.c, also fixed here.
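
The root cause is generic C preprocessor behavior rather than anything
BPF-specific; a standalone illustration (names made up, unrelated to the
Cilium headers):

```
#define LIST	 { 1, 2, 3, 4 }
#define INNER(x) x
#define OUTER(x) INNER(x)

/* OUTER(LIST): LIST is first expanded during argument prescan, then the
 * replacement is rescanned as INNER({ 1, 2, 3, 4 }). Braces, unlike
 * parentheses, do not protect commas from the preprocessor, so INNER
 * appears to receive four arguments and clang reports "too many arguments
 * provided to function-like macro invocation" -- the same failure mode as
 * BPF_V6(valid, LXC_IP) above. */
```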

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 51d7f70 ]

The previous implementation of BPFProgram.Test() only allowed passing
and returning bytes as *skb->data, without the ability to specify input
skb metadata or check output skb metadata.

This commit introduces a new function named runBpfProgram, which passes
an additional []byte as input skb metadata and returns an additional
[]byte as output skb metadata. By utilizing this function, we ensure
that the skb->mark set in PKTGEN can be properly passed to the SETUP,
and any modifications to skb->mark or skb->cb can be accurately examined
during the CHECK for validation purposes.

The input ctx bytes are used as the input skb: roughly, you can expect
`memcpy(skb, ctx, sizeof(*skb))` to happen inside, and you can read
skb->mark via `mark := ctx[offset(skb, mark): sizeof(skb->mark)]`.

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit b760a99 ]

[ backporter's notes: Trivial conflicts due to security identity fields
  being renamed in v1.14.
  Note we also need to properly define LXC_IP and LXC_IPV4 to match the
  source pod IP addresses. This is required to pass the source IP
  verification in bpf_lxc. It isn't required in v1.14 because that
  verification is disabled in the tests. It can't be disabled before
  v1.14. ]

These testcases cover IPSec datapath on `from-container` for both IPv4 and IPv6.

The following behaviors are checked after execution of `from-container`:
1. skb->mark should be set to `node_id << 16 | key << 12 | 0xe00`
2. skb->cb[4] should be set to source sec_id
3. skb should be passed to stack
4. skb->data doesn't change
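
For illustration, the first two assertions boil down to something like the
sketch below (not the actual test code; node_id, key, and the source
identity are passed in for clarity):

```
#include <linux/bpf.h>		/* struct __sk_buff */
#include <linux/types.h>

/* Sketch only. */
static inline int encrypt_mark_ok(const struct __sk_buff *skb,
				  __u16 node_id, __u8 key, __u32 src_sec_id)
{
	__u32 expected = ((__u32)node_id << 16) | ((__u32)key << 12) | 0xe00;

	return skb->mark == expected &&		/* 1. encryption request mark */
	       skb->cb[4] == src_sec_id;	/* 2. source security identity */
}
```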

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit dee0683 ]

[ backporter's notes: Minor conflicts due to security identity fields
  being renamed in v1.14. mock_skb_set_tunnel_key also needed to be
  updated to remove support for local_ipv4. ]

These testcases cover IPSec datapath on `from-host` for both IPv4 and
IPv6.

The input skb should be ESP-encrypted with mark set to `node_id
<< 16 | key << 12 | 0xe00` and cb[4] set to sec_id. We will check the
following behaviors after execution of `from-host`:

1. skb->mark is cleared to 0
2. skb->cb[4] is cleared to 0
3. skb is redirected to cilium_vxlan
4. skb->data doesn't change
5. VxLAN VNI is set to sec_id

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6decd65 ]

These testcases cover IPsec datapath for both IPv4 and IPv6.

An ingress skb will reach `from-overlay` twice: on the first pass the skb
is ESP-encrypted; on the second pass it is ESP-decrypted and carries mark
`0xd00`.

For the first pass, we check the following behaviors:
1. skb is passed to the stack
2. skb->mark is set to 0xd00
3. skb->data doesn't change

For the second pass, we check:
1. skb is redirected to the target lxc veth
2. skb->mark is cleared to 0
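
Sketched as plain checks (return codes and helper shapes are assumptions;
the real CHECK blocks differ):

```
#include <linux/bpf.h>		/* struct __sk_buff */
#include <linux/pkt_cls.h>	/* TC_ACT_OK, TC_ACT_REDIRECT */

/* Sketch only: "verdict" stands for the return value of from-overlay. */
static inline int first_pass_ok(const struct __sk_buff *skb, int verdict)
{
	/* Still encrypted: handed to the stack with the decrypt mark set. */
	return verdict == TC_ACT_OK && skb->mark == 0xd00;
}

static inline int second_pass_ok(const struct __sk_buff *skb, int verdict)
{
	/* Decrypted: redirected (to the target lxc veth) with mark cleared. */
	return verdict == TC_ACT_REDIRECT && skb->mark == 0;
}
```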

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 93f70bb ]

[ backporter's notes: All changes conflicted so were reapplied from
  scratch. ]

IPsec isn't used with bpf_xdp and bpf_sock, nor is it compatible with
them, so there is no reason to compile-test it there. This will simplify
subsequent commits, which introduce some IPsec code that isn't compatible
with non-tc BPF hooks.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6d84882 ]

[ backporter's notes: Trivial conflict on the unusual include of
  encrypt.h. ]

This commit has no functional changes. It simply moves the logic to
prepare packets for IPsec encryption to the dedicated encrypt.h file.
Function set_ipsec_encrypt_mark is also created to simplify a subsequent
commit (no, it won't remain a function that simply calls another
function :)).
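
As a rough sketch of the helper's shape (the real encrypt.h code and
signature may differ; the mark layout matches the one described in the
IPsec test commits above):

```
/* Sketch only: set the IPsec encryption request mark on the packet. */
static __always_inline void set_ipsec_encrypt_mark(struct __ctx_buff *ctx,
						   __u16 node_id, __u8 key)
{
	/* node_id in the upper 16 bits, key/SPI in bits 12-15, 0xe00 magic. */
	ctx->mark = ((__u32)node_id << 16) | ((__u32)key << 12) | 0xe00;
}
```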

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit f711329 ]

This commit has no functional changes. It simply moves the allocation of
node IDs to a single place in the top calling function.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit d2523ff ]

[ backporter's notes: The mocked node ID BPF map doesn't exist in v1.13
  and changes to the integration tests (see last paragraph below) were
  not necessary, so I skipped them. ]

Before this commit, we would allocate a node ID for new nodes if IPsec
was enabled. That wasn't consistent with (1) allocations triggered by
pod creations, where we allocate even if IPsec is disabled, and (2)
deallocations triggered by node deletions which also happen regardless
of IPsec being enabled.

This commit changes the nodeUpdate logic slightly such that the node ID
is allocated regardless of IPsec being enabled. As a consequence, we
need to update several integration tests to mock the node ID BPF map.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 11f7795 ]

This new packet drop type/reason will be used by IPsec and Mutual Auth
when a node ID lookup doesn't return anything. Both features rely on the
presence of the node ID to function properly.

A failure to find the node ID can occur if traffic is sent to a pod on
a node for which we didn't receive the NodeAdd or NodeUpdate event yet.
Such out-of-order events (PodAdd received before NodeAdd) are possible
and normal behavior.
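
The datapath side amounts to a new drop code plus a check at the lookup
site, roughly as follows (the numeric value and helper below are
placeholders, not the real code):

```
#include <linux/types.h>

/* Sketch only: drop reasons in bpf/lib/common.h are negative codes; the
 * value here is a placeholder. */
#define DROP_NO_NODE_ID		(-180)

/* At a use site (IPsec or Mutual Auth), the pattern is roughly: */
static inline int check_node_id(__u16 node_id)
{
	if (!node_id)
		return DROP_NO_NODE_ID;	/* NodeAdd/NodeUpdate not received yet */
	return 0;			/* CTX_ACT_OK in the real code */
}
```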

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6109a38 ]

[ backporter's notes: Many small conflicts due to the rename of security
  identity variables and fields in the C codebase. ]

The IPsec feature relies on the node IDs encoded in the packet marks to
match packets against encryption rules (XFRM OUT policies and states).
It currently retrieves the ID of the remote node from the ipcache.

In a subsequent commit, however, we will remove this node ID information
from the ipcache (to simplify node ID management in the agent). So
IPsec's datapath must retrieve the node ID from a different place.

Fortunately, we already have a BPF map containing all node IP to node ID
mappings. Thus, in this commit, we use the remote node IP obtained from
the ipcache (ep->tunnel_endpoint) to lookup the remote node ID in this
BPF map.

This one additional hashmap lookup is expected to have a negligible cost
compared to the cost of encryption. It should even be small compared to
the cost of the ipcache LPM lookup.
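
A compressed sketch of what the extra lookup amounts to (the map name and
key layout here are assumptions, not the exact bpf/lib definitions):

```
#include <linux/types.h>

/* Hypothetical key layout for the node IP -> node ID hashmap. */
struct node_key {
	__u16 family;		/* IPv4 vs IPv6 discriminator */
	__u8  pad[2];
	union {
		__u32 ip4;
		__u8  ip6[16];
	};
};

/* Sketch only: resolve the remote node ID from the tunnel endpoint IP
 * obtained from the ipcache (ep->tunnel_endpoint). */
static inline __u16 lookup_remote_node_id(__u32 tunnel_endpoint)
{
	struct node_key key = {
		.family = 1,		/* IPv4 tag, illustrative */
		.ip4 = tunnel_endpoint,
	};
	/* Standard BPF hashmap lookup; node_id_map is a hypothetical name. */
	__u16 *node_id = map_lookup_elem(&node_id_map, &key);

	return node_id ? *node_id : 0;	/* 0 => caller drops with DROP_NO_NODE_ID */
}
```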

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 426c424 ]

This will be useful to enforce ordering between different tests, with
the first one preparing part of the maps for subsequent tests. Numbering
starts at 2 because the next commit will introduce a test 1.

"ipsec_" is removed from the test names to shorten it a bit (the IPsec
context should be clear from the test filename).

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 147a130 ]

If the node ID for the destination node of a pod-to-pod packet is not
found in the node ID map, we drop the packet with DROP_NO_NODE_ID. This
can happen when the k8s Node object for that destination hasn't been
received yet. This commit covers that case in BPF unit tests.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 4856e38 ]

[ backporter's notes: Many small conflicts due to
  OnIPIdentityCacheChange's prototype changing in v1.14 and the ipcache
  map's structs slightly changing with the new loader. ]

In previous commits, we removed all consumers (IPsec and Mutual Auth)
of the ipcache's node_id field. This commit removes population of those
node IDs in the ipcache.

**Problem**

Needing to populate the ipcache with node IDs causes all sorts of
issues. Node IDs are allocated for and mapped with specific node IP
addresses. In practice, it means node IDs are tied to the lifecycle of
both node and pod objects:
1. We need to allocate node IDs when receiving node objects with new
   node IP addresses (either new node objects or existing node objects
   with changing IP addresses).
2. We need to allocate node IDs when receiving a node IP address via a
   pod (hostIP of the pod).

Node objects have all node IP addresses. So why are we not relying
solely on 1? We also need 2 because pod adds and pod updates (with new
hostIP) can arrive before the related node add or node update.
Basically, kube-apiserver doesn't provide any guarantee of ordering
between node and pod events. Since we need to populate the ipcache with
a node ID for each remote pod, we need to allocate a node ID when we
receive the pod object and not later, when we receive the node object.

Unfortunately, tying the node IDs to two object lifecycles in this way
causes a lot of complications. For example, we can't simply deallocate
node IDs on pod deletions because other pods may still be using the node
ID. We also can't deallocate simply on node deletions because pods may
still be using it. The typical solution to such issues is to maintain a
reference count, but that leads to its own set of issues in this case
[1].

**Considered Solution**

An alternative to all this is to force events to always be managed in
the same order. For example, NodeAdd -> PodAdd -> PodDelete ->
NodeDelete. The agent unfortunately doesn't easily lend itself to that.
Without a major refactoring, we would have to populate the ipcache as
usual, but skip node IDs. Then, on NodeAdd, we would allocate a node ID
and go through the ipcache again to populate the node IDs we skipped.
In the datapath, we would have an explicit drop for such half-populated
ipcache entries.

To implement this without needing to walk through the entire ipcache for
each NodeAdd or NodeUpdate, we need to keep track of the specific
entries that need to be populated. We need the same sort of mechanism
for deletions. This solution quickly turns into something quite complex
to implement with good performance.

**This Solution**

It should be quite clear at this point that this problem's complexity
stems from having node IDs tied to both the pods and nodes' lifecycles.

To untie node IDs from pod lifecycles, and thus, stop populating the
ipcache with node IDs, we will need the datapath to retrieve them from
some other place. Fortunately, we already have a BPF map that maps node
IP addresses to node IDs. It is currently only used as a restoration
mechanism for agent restarts, but there's no reason we can't use it for
more.

Thus, with the node IP address retrieved from the ipcache
(ep->tunnel_endpoint), we can lookup the node ID in the BPF map.

Of course, since pod objects can arrive before their corresponding node
object, the node ID may not always be ready in the BPF map when we need
it. When that happens, we will drop the packet with a specific drop code
(DROP_NO_NODE_ID). These drops simply mean we tried to send traffic to a
node for which we didn't receive the node object yet (hence the
encryption configuration isn't setup yet).

These datapath changes have been implemented in the previous commit.
This commit removes all population of node IDs in the ipcache.

We can also remove node IDs from the tunnel map. It matters less since
the tunnel map is tied to node objects, but let's keep the datapath
consistent between the two maps (ipcache and tunnel map).

1 - cilium#26725 (comment)
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit efbba59 ]

[ backporter's notes: Many small conflicts due to some node ID getters
  being added in v1.14 for the Mutual Auth feature. Since that feature is
  absent in previous versions, we can also remove the ipcache's
  NodeIDHandler. Some changes to integration tests that don't exist
  before v1.14 were also not backported. ]

The AllocateNodeID function was used to allocate node IDs from the
ipcache logic. Since we don't populate the ipcache with node IDs
anymore, it's unused and can be removed. The DeallocateNodeID function
was never used (a sibling function was used instead).

This commit has no functional changes.

Reported-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 4d46fa9 ]

This commit fixes a leak of node IP to node ID mappings when the set of
IP addresses of a node is updated. We have logic to allocate and map
node IDs on node creation, logic to unmap and deallocate node IDs on
node deletions, but we didn't fully handle the case of node updates.

On node updates, we would map new IP addresses to the already-allocated
node ID. We forgot to unmap the old IP addresses. This commit fixes that
oversight.

This leak means that some IP addresses that were previously assigned to
the node would remain mapped to the node ID. Those IP addresses could
then be assigned to a new node. The new node would then receive the same
node ID as the old node (because we first check if any of the node IPs
already have an ID). This could lead to encrypted traffic being sent to
the wrong node.

Before removing the node IDs from the ipcache (see previous commit),
this leak could have another consequence. If after the node ID had been
reassigned to the new node, the old node was deleted, the node ID would
be deallocated, but it would remain in the ipcache. The XFRM policies
and states of the node would also be identified based on the node ID and
removed. Thus, both the old and new nodes' XFRM configs would be
removed. As a consequence, outgoing packets for those nodes would be
dropped with XfrmOutPolBlock (the default drop-all rule).

Fixes: af88b42 ("datapath: Introduce node IDs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit e1b9063 ]

The previous commit fixed a bug that could lead to two nodes sharing the
same node ID. This commit adds a check, on node deletion, for such an
erroneous state. It completes the existing check that a single node
doesn't have multiple node IDs. In both cases, an error is thrown such
that users will notice and our CI will fail.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
pchaigno force-pushed the pr/v1.13-backport-2023-07-27 branch from 76109c3 to 510330f on July 28, 2023 12:51
pchaigno (Member, Author) commented Jul 28, 2023

/test-backport-1.13

k8s-1.18-kernel-4.19 failed with #13071. Travis CI failing with #25272.

pchaigno marked this pull request as ready for review on July 28, 2023 13:50
pchaigno requested review from a team as code owners on July 28, 2023 13:50
jschwinger233 (Member) left a comment:

Thank you Paul!

jschwinger233 (Member) commented:

/test-1.18-4.19

gentoo-root (Contributor) left a comment:

+1 for my commit.

pchaigno added the ready-to-merge label on Jul 31, 2023
pchaigno (Member, Author) commented Jul 31, 2023

Marking ready to merge to summon the tophat who also happens to be the last missing reviewer. Travis CI failed with known flake #25272.

dylandreimerink merged commit 7aae5cb into cilium:v1.13 on Jul 31, 2023
60 of 61 checks passed
pchaigno deleted the pr/v1.13-backport-2023-07-27 branch on July 31, 2023 12:54
Labels: area/encryption, backport/1.13, kind/backports, ready-to-merge