-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ipsec: Clear node ID and SPI with output-mark #28258
Merged
pchaigno
merged 1 commit into
cilium:main
from
pchaigno:fix-ipsec-mark-conflict-iptables-nft
Oct 3, 2023
Merged
ipsec: Clear node ID and SPI with output-mark #28258
pchaigno
merged 1 commit into
cilium:main
from
pchaigno:fix-ipsec-mark-conflict-iptables-nft
Oct 3, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
fd42c29
to
21f5ac5
Compare
TL;DR. This commit works around a Cilium iptables bug that can affect IPsec. The root of the issue for IPsec is a conflict between the packet mark we use and some iptables-nft rules. The workaround changes the packet mark to avoid the conflict. Cilium's IPsec implementation relies on the packet mark to match packets against XFRM rules. In particular, the packet mark holds the node ID (0xffff0000), the SPI (0xf000), and whether to encrypt or decrypt (0xf00). On egress, the packet mark is written in bpf_lxc before the packet is sent to the stack for encryption. The encryption layer retains the packet mark, except for the encryption bit (0xf00) in some cases. bpf_host then processes the encrypted packet and clears the mark. In some corner cases (what causes this isn't clear yet), some OSes can end up with kube-proxy rules in both iptables-nft and iptables-legacy. Cilium is currently unable to handle that situation properly: it should install iptables rules to skip some of kube-proxy's rules, but only does that for iptables-legacy. We can therefore end up with the following rules matching outgoing packets: [31061:1736670] -A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING [31059:1736390] -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN [2:280] -A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0 [2:280] -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully We can see that kube-proxy install a KUBE-POSTROUTING chain in the POSTROUTING netfilter chain. That chain skips all packets not marked with 0x4XXX. Other packets see the mark removed and are then masqueraded. In the above example, 2 packets had the 0x4XXX mark and were masqueraded. These rules can end up matching the mark of our encrypted packets, if the SPI is 4. Encrypted packets are then incorrectly masqueraded. Depending on the node and network configuration, they may end up dropped. The SPI is incremented on key rotation (from 1 to 15, then going back to 1 again). Users performing key rotations are therefore highly likely to hit this bug if they have iptables-nft kube-proxy rules installed. We can work around this bug by clearing all packet mark bits we don't need anymore after the encryption. Once the packet is encrypted, bpf_host only needs the 0xf00 part of the mark to determine if a packet is encrypted or not. The node ID and SPI can be cleared from the mark. This commit therefore clears those bits right after encryption, with output-mark, on egress and ingress. Note this only avoids the IPsec impact of this bug, but doesn't fix the underlying issue with iptables-nft. More work is needed to have the agent correctly handle this situation. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
21f5ac5
to
80fb65d
Compare
/test |
jschwinger233
approved these changes
Oct 3, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This was referenced Oct 17, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/encryption
Impacts encryption support such as IPSec, WireGuard, or kTLS.
backport-done/1.12
The backport for Cilium 1.12.x for this PR is done.
backport-done/1.13
The backport for Cilium 1.13.x for this PR is done.
backport-done/1.14
The backport for Cilium 1.14.x for this PR is done.
kind/bug
This is a bug in the Cilium logic.
release-blocker/1.12
This issue will prevent the release of the next version of Cilium.
release-blocker/1.13
This issue will prevent the release of the next version of Cilium.
release-blocker/1.14
This issue will prevent the release of the next version of Cilium.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
sig/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
TL;DR. This pull request works around a Cilium iptables bug that can affect IPsec. The root of the issue for IPsec is a conflict between the packet mark we use and some iptables-nft rules. The workaround changes the packet mark to avoid the conflict.
Cilium's IPsec implementation relies on the packet mark to match packets against XFRM rules. In particular, the packet mark holds the node ID (
0xffff0000
), the SPI (0xf000
), and whether to encrypt or decrypt (0xf00
). On egress, the packet mark is written in bpf_lxc before the packet is sent to the stack for encryption. The encryption layer retains the packet mark, except for the encryption bit (0xf00
) in some cases. bpf_host then processes the encrypted packet and clears the mark.In some corner cases (what causes this isn't clear yet), some OSes can end up with kube-proxy rules in both iptables-nft and iptables-legacy. Cilium is currently unable to handle that situation properly: it should install iptables rules to skip some of kube-proxy's rules, but only does that for iptables-legacy. We can therefore end up with the following rules matching outgoing packets:
We can see that kube-proxy install a KUBE-POSTROUTING chain in the POSTROUTING netfilter chain. That chain skips all packets not marked with
0x4XXX
. Other packets see the mark removed and are then masqueraded. In the above example, 2 packets had the0x4XXX
mark and were masqueraded.These rules can end up matching the mark of our encrypted packets, if the SPI is 4. Encrypted packets are then incorrectly masqueraded. Depending on the node and network configuration, they may end up dropped.
The SPI is incremented on key rotation (from 1 to 15, then going back to 1 again). Users performing key rotations are therefore highly likely to hit this bug if they have iptables-nft kube-proxy rules installed.
We can work around this bug by clearing all packet mark bits we don't need anymore after the encryption. Once the packet is encrypted, bpf_host only needs the
0xf00
part of the mark to determine if a packet is encrypted or not. The node ID and SPI can be cleared from the mark. This pull request therefore clears those bits right after encryption, with output-mark, on egress and ingress.Note this only avoids the IPsec impact of this bug, but doesn't fix the underlying issue with iptables-nft. More work is needed to have the agent correctly handle this situation.