-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.12 backports 2023-07-28 #27138
Merged
dylandreimerink
merged 10 commits into
cilium:v1.12
from
pchaigno:pr/v1.12-backport-2023-07-28
Jul 31, 2023
Merged
v1.12 backports 2023-07-28 #27138
dylandreimerink
merged 10 commits into
cilium:v1.12
from
pchaigno:pr/v1.12-backport-2023-07-28
Jul 31, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[ upstream commit 93f70bb ] [ backporter's notes: All changes conflicted so were reapplied from scratch. ] IPsec isn't used or compatible with bpf_xdp and bpf_sock, so there is no reason to compile test it there. This will simplify subsequent commits where we will introduce some code in IPsec that isn't compatible with non-tc BPF hooks. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6d84882 ] [ backporter's notes: Trivial conflict on the unusual include of encrypt.h and due to macro guards at the top. ] This commit has no functional changes. It simply moves the logic to prepare packets for IPsec encryption to the dedicated encrypt.h file. Function set_ipsec_encrypt_mark is also created to simplify a subsequent commit (no, it won't remain a function that simply calls another function :)). Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit f711329 ] This commit has no functional changes. It simply moves the allocation of node IDs to a single place in the top calling function. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit d2523ff ] [ backporter's notes: The mocked node ID BPF map doesn't exist in v1.13 and changes to the integration tests (see last paragraph below) were not necessary, so I skipped them. ] Before this commit, we would allocate a node ID for new nodes if IPsec was enabled. That wasn't consistent with (1) allocations triggered by pod creations, where we allocate even if IPsec is disabled, and (2) deallocations triggered by node deletions which also happen regardless of IPsec being enabled. This commit changes the nodeUpdate logic slightly such that the node ID is allocated regardless of IPsec being enabled. As a consequence, we need to update several integration tests to mock the node ID BPF map. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 11f7795 ] [ backporter's notes: Trivial conflicts due to many additional drop reasons being added between this branch and v1.14. ] This new packet drop type/reason will be used by IPsec and Mutual Auth when a node ID lookup doesn't return anything. Both features rely on the presence of the node ID to function properly. A failure to find the node ID can occur if traffic is sent to a pod on a node for which we didn't receive the NodeAdd or NodeUpdate event yet. Such out-of-order events (PodAdd received before NodeAdd) are possible and normal behavior. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6109a38 ] [ backporter's notes: Many conflicts over large sections of code (in particular in encap.h). I resolved by reapplying the changes. ] The IPsec feature relies on the node IDs encoded in the packet marks to match packets against encryption rules (XFRM OUT policies and states). It currently retrieves the ID of the remote node from the ipcache. In a subsequent commit, however, we will remove this node ID information from the ipcache (to simplify node ID management in the agent). So IPsec's datapath must retrieve the node ID from a different place. Fortunately, we already have a BPF map containing all node IP to node ID mappings. Thus, in this commit, we use the remote node IP obtained from the ipcache (ep->tunnel_endpoint) to lookup the remote node ID in this BPF map. This one additional hashmap lookup is expected to have a negligible cost compared to the cost of encryption. It should even be small compared to the cost of the ipcache LPM lookup. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 4856e38 ] [ backporter's notes: Many small conflicts due to OnIPIdentityCacheChange's prototype changing in v1.14 and the ipcache map's structs slightly changing with the new loader. ] In previous commits, we removed all consumers (IPsec and Mutual Auth) of the ipcache's node_id field. This commit removes population of those node IDs in the ipcache. **Problem** Needing to populate the ipcache with node IDs causes all sorts of issues. Node IDs are allocated for and mapped with specific node IP addresses. In practice, it means node IDs are tied to the lifecycle of both node and pod objects: 1. We need to allocate node IDs when receiving node objects with new node IP addresses (either new node objects or existing node objects with changing IP addresses). 2. We need to allocate node IDs when receiving a node IP address via a pod (hostIP of the pod). Node objects have all node IP addresses. So why are we not relying solely on 1? We also need 2 because pod adds and pod updates (with new hostIP) can arrive before the related node add or node update. Basically, kube-apiserver doesn't provide any guarantee of ordering between node and pod events. Since we need to populate the ipcache with a node ID for each remote pod, we need to allocate a node ID when we receive the pod object and not later, when we receive the node object. Unfortunately, tying the node IDs to two object lifecycles in this way causes a lot of complications. For example, we can't simply deallocate node IDs on pod deletions because other pods may still be using the node ID. We also can't deallocate simply on node deletions because pods may still be using it. The typical solution to such issues is to maintain a reference count, but that leads to its own set of issues in this case [1]. **Considered Solution** An alternative to all this is to force events to always be managed in the same order. For example, NodeAdd -> PodAdd -> PodDelete -> NodeDelete. The agent unfortunately doesn't easily lend itself to that. Without a major refactoring, we would have to populate the ipcache as usual, but skip node IDs. Then, on NodeAdd, we would allocate a node ID and go through the ipcache again to populate the node IDs we skipped. In the datapath, we would have an explicit drop for such half-populated ipcache entries. To implement this without needing to walk through the entire ipcache for each NodeAdd or NodeUpdate, we need to keep track of the specific entries that need to be populated. We need the same sort of mechanism for deletions. This solution quickly turns into something quite complex to implement with good performance. **This Solution** It should be quite clear at this point that this problem's complexity stems from having node IDs tied to both the pods and nodes' lifecycles. To untie node IDs from pod lifecycles, and thus, stop populating the ipcache with node IDs, we will need the datapath to retrieve them from some other place. Fortunately, we already have a BPF map that maps node IP addresses to node IDs. It is currently only used as a restoration mechanism for agent restarts, but there's no reason we can't use it for more. Thus, with the node IP address retrieved from the ipcache (ep->tunnel_endpoint), we can lookup the node ID in the BPF map. Of course, since pod objects can arrive before their corresponding node object, the node ID may not always be ready in the BPF map when we need it. When that happens, we will drop the packet with a specific drop code (DROP_NO_NODE_ID). These drops simply mean we tried to send traffic to a node for which we didn't receive the node object yet (hence the encryption configuration isn't setup yet). These datapath changes have been implemented in the previous commit. This commit removes all population of node IDs in the ipcache. We can also remove node IDs from the tunnel map. It matters less since the tunnel map is tied to node objects, but let's keep the datapath consistent between the two maps (ipcache and tunnel map). 1 - cilium#26725 (comment) Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
pchaigno
added
the
area/encryption
Impacts encryption support such as IPSec, WireGuard, or kTLS.
label
Jul 28, 2023
maintainer-s-little-helper
bot
added
backport/1.12
This PR represents a backport for Cilium 1.12.x of a PR that was merged to main.
kind/backports
This PR provides functionality previously merged into master.
labels
Jul 28, 2023
[ upstream commit efbba59 ] [ backporter's notes: Many small conflicts due to some node ID getters being added in v1.14 for the Mutual Auth feature. Since that feature is absent in previous versions, we can also remove the ipcache's NodeIDHandler. Some changes to integration tests that don't exist before v1.14 were also not backported. ] The AllocateNodeID function was used to allocate node IDs from the ipcache logic. Since we don't populate the ipcache with node IDs anymore, it's unused and can be removed. The DeallocateNodeID function was never used (a sibling function was used instead). This commit has no functional changes. Reported-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 4d46fa9 ] This commit fixes a leak of node IP to node ID mappings when the set of IP addresses of a node is updated. We have logic to allocate and map node IDs on node creation, logic to unmap and deallocate node IDs on node deletions, but we didn't fully handle the case of node updates. On node updates, we would map new IP addresses to the already-allocated node ID. We forgot to unmap the old IP addresses. This commit fixes that oversight. This leak means that some IP addresses that were previously assigned to the node would remain mapped to the node ID. Those IP addresses could then be assigned to a new node. The new node would then receive the same node ID as the old node (because we first check if any of the node IPs already have an ID). This could lead to encrypted traffic being sent to the wrong node. Before removing the node IDs from the ipcache (see previous commit), this leak could have another consequence. If after the node ID had been reassigned to the new node, the old node was deleted, the node ID would be deallocated, but it would remain in the ipcache. The XFRM policies and states of the node would also be identified based on the node ID and removed. Thus, both the old and new nodes' XFRM configs would be removed. As a consequence, outgoing packets for those nodes would be dropped with XfrmOutPolBlock (the default drop-all rule). Fixes: af88b42 ("datapath: Introduce node IDs") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit e1b9063 ] [ backporter's notes: Small conflict at the top of deallocateIDForNode. ] The previous commit fixed a bug that could lead to two nodes sharing the same node ID. This commit adds a check, on node deletion, for such erroneous state. It completes the existing check that a single node doesn't have multiple node IDs. In both cases, an error is thrown such that users will notice and our CI will fail. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
pchaigno
force-pushed
the
pr/v1.12-backport-2023-07-28
branch
from
July 28, 2023 17:42
d4fe28c
to
4ff31fd
Compare
/test-backport-1.12 |
Provisioning error: |
pchaigno
added
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
Jul 31, 2023
maintainer-s-little-helper
bot
removed
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
Jul 31, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/encryption
Impacts encryption support such as IPSec, WireGuard, or kTLS.
backport/1.12
This PR represents a backport for Cilium 1.12.x of a PR that was merged to main.
kind/backports
This PR provides functionality previously merged into master.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Once this PR is merged, you can update the PR labels via: