v1.9 backports 2021-02-12 #14961
Conversation
[ upstream commit 32e41f6 ] If a pod in host networking and/or the node itself sends traffic to a pod with encryption+vxlan enabled and encryptNode disabled, we may see dropped packets. The flow is the following. First, a packet is sent from host networking with srcIP=nodeIP, dstIP=podIP. Next a source IP is chosen for the packet, so the route src will not be used. The routing table then routes the packet to cilium_host using the route rule matching the podIP subnet:

```
podIPSubnet via $IP dev cilium_host src $SrcIP mtu 1410
```

Then we drop into BPF space with srcIP=nodeIP, dstIP=podIP as above. Here tohost=true, and we do an ipcache lookup. The ipcache lookup will have a hit for the podIP, with both a key and a tunnelIP, for example:

```
10.128.5.129/32 1169 3 10.0.0.4
```

Using the above key identifier ('3' in the example), the bpf_host program will mark the packet for encryption. It passes the encryption parameters through the ctx, where the ingress program attached to cilium_net will in turn use those parameters to set skb->mark before passing the packet up to the stack for encryption. At this point we have an skb ingressing the stack with skb->mark=0xe00, srcIP=nodeIP, dstIP=podIP. Here is where the trouble starts. This will miss the encryption policy rules, because unless encryptNode is enabled we do not have a rule matching srcIP=nodeIP. So the unencrypted packet is sent back to cilium_host using a route in the encryption routing table, and finally bpf_host sends the skb to the overlay, cilium_vxlan.

The packet is then received by the podIP node and sent to cilium_vxlan because it's a vxlan packet. The vxlan header is popped off and the inner (unencrypted) packet is sent to the pod; remember, srcIP=nodeIP still. Assuming the above was a SYN packet, the pod will respond with a SYN/ACK with srcIP=podIP, dstIP=nodeIP. This is handled by the bpf_lxc program. First, we do an ipcache lookup and get an entry, but this entry will not have a tunnelIP. It will have a key though. Because no tunnelIP is available, a tunnel map lookup is done. This fails because the dstIP is a nodeIP and not in a tunnel subnet. (The tunnel map key is computed by masking the dstIP with the subnet.) Because the program did not find a valid tunnel, the packet is sent to the stack. The stack runs it through iptables, but because the packet flow is asymmetric (SYN over cilium_vxlan, SYN/ACK over the nic) the MASQUERADE rules are skipped. The end result is that we try to send a packet with srcIP=podIP.

Here a couple of different outcomes are possible. If the network infra is strict, the packet may be dropped at the sending node, which refuses to send a packet with an unknown srcIP. If the receiving node has rp_filter=1 on the receiving interface, the packet will be dropped with a martianIP message in dmesg. Or, if rp_filter=0 and the network somehow knows how to route the srcIP, it might work; for example, some of my test systems managed to work without any issues.

In order to fix this we can ensure reply packets are symmetric with the request. To do this we inject tunnel info into the nodeIP ipcache entry. Once this is done, the replying pod sends a packet with srcIP=podIP, dstIP=nodeIP, which is picked up by the bpf_lxc program. The ipcache lookup now finds a complete entry with the tunnel info. Instead of sending that packet to the stack, it is encapsulated in the vxlan header and sent to (and received on) the original requesting node.

A quick observation: some of the passes through the xfrm stack are useless here. We know the miss is going to happen and can signal this to the BPF layer by clearing the encryption key in the ipcache. We will do this as a follow-up patch. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
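The reply-path decision described above can be sketched as follows. The structures and names here are hypothetical illustrations of the behavior, not Cilium's actual datapath API:

```python
# Illustrative model of the reply path in bpf_lxc: names and structures are
# made up for this sketch; they are not Cilium's real code.

def reply_path(entry, tunnel_map):
    """Decide how a pod's reply packet leaves the node."""
    if entry.get("tunnel_ip"):
        # Complete ipcache entry: encapsulate, symmetric with the request.
        return "encap"
    # No tunnel endpoint in the ipcache entry: fall back to a tunnel map
    # lookup keyed by the destination's masked subnet.
    if entry.get("dst_subnet") in tunnel_map:
        return "encap"
    # A nodeIP is not inside any pod subnet, so the lookup misses and the
    # packet goes to the stack: the asymmetric path that gets dropped.
    return "stack"

tunnels = {"10.128.5.128/26": "10.0.0.4"}  # pod subnets only

# Before the fix: the nodeIP entry has a key but no tunnel endpoint.
before = {"dst_subnet": None, "tunnel_ip": None}
# After the fix: tunnel info is injected into the nodeIP ipcache entry.
after = {"dst_subnet": None, "tunnel_ip": "10.0.0.4"}

print(reply_path(before, tunnels))  # stack
print(reply_path(after, tunnels))   # encap
```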
[ upstream commit d869108 ] [SG]etInternalIPv4Router - This is the CiliumInternalIP of the node. It is only relevant within the Cilium-managed network (that is, within the node for direct routing mode and on the overlay for tunnel mode). In order not to confuse it with the InternalIP of a k8s node, we renamed `[SG]etInternalIPv4` to `[SG]etInternalIPv4Router`. [SG]etIPv4 - This returns one of the node's IPv4 addresses based on the following priority: NodeInternalIP, then NodeExternalIP, then any other IP address type. These are reachable on the network. Since it can be an External or Internal IP address, we renamed `[SG]etExternalIPv4` to `[SG]etIPv4`. Signed-off-by: Anish Shah <anishshah@google.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
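The address priority described for [SG]etIPv4 can be illustrated with a small helper. The function and its inputs are illustrative only, not the actual Cilium code:

```python
def pick_node_ip(addresses):
    """Pick a node address by priority: NodeInternalIP, then NodeExternalIP,
    then any other address type (illustrative helper, not Cilium's code).

    `addresses` is a list of (address_type, ip) pairs.
    """
    for wanted in ("NodeInternalIP", "NodeExternalIP"):
        for addr_type, ip in addresses:
            if addr_type == wanted:
                return ip
    # Fall back to any other address type that is reachable on the network.
    return addresses[0][1] if addresses else None

print(pick_node_ip([("NodeExternalIP", "203.0.113.7"),
                    ("NodeInternalIP", "192.168.1.10")]))  # 192.168.1.10
print(pick_node_ip([("NodeExternalIP", "203.0.113.7")]))   # 203.0.113.7
```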
[ upstream commit b60f8b7 ] Previously, accessing a NodePort using ExternalIP of the local K8s node would result in connection timeout. This is because ExternalIP of the local K8s node is not added to ipcache and hence no service resolution happens on egress. This commit resolves this issue by adding the ExternalIP of local k8s node to the ipcache. We introduce two functions `SetExternalIPv[46]` and `GetExternalIPv[46]` to store/fetch the External IPv4/IPv6 address of a k8s node. If a node doesn't have any External IP address, then we store/return `nil`. Signed-off-by: Anish Shah <anishshah@google.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
… cluster [ upstream commit 265d677 ] This test checks connectivity to NodePort service from inside a cluster by sending request to ExternalIP:NodePort. Signed-off-by: Anish Shah <anishshah@google.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 2ceb386 ] Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 55caedc ] Strictly speaking this is a general truth for all services, but we don't have a dedicated section explaining services, and users have begun reporting it specifically in relation to AWS / ENI mode. Put this somewhere in the docs; we can always move it somewhere more generic when we have a better location for these links to live. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 83cb610 ] Setting this option makes Cilium run in a mode that allows it to reduce its process' RSS even if the system is not under memory pressure. By default, i.e. without 'madvdontneed' being set, Cilium will only release its process' RSS when the system is under pressure. This may be confusing for users, since they will assume Cilium needs the memory reported in RSS, which is not true. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 317a671 ] Kube-proxy installs the following rule in the iptables FORWARD chain:

```
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
```

That rule causes iptables to drop packets whenever we have a path asymmetry, with for example a SYN packet going through iptables and a SYN+ACK packet skipping iptables via BPF's redirect(). In that example, the next ACK packet would be dropped by that rule because the connection is invalid (SYN+ACK is missing) from iptables' point of view. To avoid such drops, in the IPv4 case, Cilium installs rules in the FORWARD chain (via its own CILIUM_FORWARD chain) to bypass the subsequent -j DROP rule from kube-proxy. For example, with per-endpoint routes enabled:

```
-A FORWARD -m comment --comment "cilium-feeder: CILIUM_FORWARD" -j CILIUM_FORWARD
-A CILIUM_FORWARD -o cilium_host -m comment --comment "cilium: any->cluster on cilium_host forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_host -m comment --comment "cilium: cluster->any on cilium_host forward accept (nodeport)" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_net -m comment --comment "cilium: cluster->any on cilium_net forward accept (nodeport)" -j ACCEPT
-A CILIUM_FORWARD -o lxc+ -m comment --comment "cilium: any->cluster on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept (nodeport)" -j ACCEPT
```

The same rules are however missing from the IPv6 tables. This commit fixes this inconsistency. Signed-off-by: Paul Chaignon <paul@cilium.io>
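As a rough sketch of the fix, the same set of accept rules can be rendered once and emitted for both address families. The rule comments are omitted and the helper names are made up; this is not the actual Cilium implementation:

```python
# Simplified rule set (comments dropped); hypothetical sketch only.
CILIUM_FORWARD_RULES = [
    ("FORWARD", '-m comment --comment "cilium-feeder: CILIUM_FORWARD" -j CILIUM_FORWARD'),
    ("CILIUM_FORWARD", "-o cilium_host -j ACCEPT"),
    ("CILIUM_FORWARD", "-i cilium_host -j ACCEPT"),
    ("CILIUM_FORWARD", "-i lxc+ -j ACCEPT"),
    ("CILIUM_FORWARD", "-i cilium_net -j ACCEPT"),
    ("CILIUM_FORWARD", "-o lxc+ -j ACCEPT"),
]

def render_forward_rules():
    """Render the accept rules for both iptables and ip6tables.

    Emitting the same rules under ip6tables is the fix: previously only the
    IPv4 tables carried them.
    """
    cmds = []
    for prog in ("iptables", "ip6tables"):
        for chain, rule in CILIUM_FORWARD_RULES:
            cmds.append(f"{prog} -A {chain} {rule}")
    return cmds

for cmd in render_forward_rules():
    print(cmd)
```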
[ upstream commit ed68d2c ] This commit fixes a broken packet path on pod --> remote pod via VIP when running with encapsulation and per-endpoint routes, with kube-proxy handling DNATing. The following details the broken packet path:

- SYN leaves the lxc device for the stack with pod1IP -> VIP
- SYN goes through cilium_host and cilium_net with pod1IP -> pod2IP
- SYN is sent to the second node through the tunnel with pod1IP -> pod2IP
- SYN+ACK arrives from the tunnel on cilium_vxlan with pod2IP -> pod1IP
- SYN+ACK is processed by ipvX_local_delivery and redirected in BPF directly to the lxc device
- SYN+ACK arrives on the lxc device with pod2IP -> pod1IP
- RST is sent because pod2IP doesn't match VIP

Because we use redirect() in BPF, the SYN+ACK packet bypasses most of the stack (including conntrack) on its way from cilium_vxlan to the lxc device. By bypassing conntrack, we also bypass the reverse DNAT. The bug doesn't occur when per-endpoint routes are disabled because, in that case, the SYN is SNATed to the cilium_host IP before leaving through the tunnel. The SYN+ACK packet therefore takes a different path and goes through conntrack. Fixes: #14646 Co-authored-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit ae96b92 ] Signed-off-by: Yosh de Vos <yosh@elzorro.nl> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c8052a1 ] When IPSec or the host firewall is enabled in tunneling mode, traffic from a pod to a remote node is sent through the stack. This introduces an asymmetry in kube-proxy DNAT and breaks connections from a node (or hostns pod) to remote pods via externalTrafficPolicy=Local services. The following details the broken packet path node1 --VIP--> pod@node2 (VIP is node2IP):

- SYN leaves node1 via the native device with node1IP -> VIP
- SYN is DNATed on node2 to node1IP -> podIP
- SYN is delivered to the lxc device with node1IP -> podIP
- SYN+ACK is sent from the lxc device with podIP -> node1IP
- SYN+ACK is redirected in BPF directly to cilium_vxlan
- SYN+ACK arrives on node1 via the tunnel with podIP -> node1IP
- RST is sent because podIP doesn't match VIP

Because we use redirect() in BPF, the SYN+ACK packet bypasses most of the stack (including conntrack) on its way from cilium_vxlan to the lxc device. By bypassing conntrack, we also bypass the reverse DNAT. The bug only occurs for externalTrafficPolicy=Local services because, without that annotation, the SYN packet is SNATed when arriving on node2. The SYN+ACK is therefore first sent via the stack to node2's IP, where it is both reverse SNATed and reverse DNATed. Fixes: #14674 Fixes: #12542 Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c4356bd ] For IPSec and the host firewall, in order to apply kube-proxy's reverse DNAT we want to send traffic through the tunnel via the stack instead of doing a redirect call. Such traffic therefore goes from the lxc device to cilium_host where it is redirected to the tunneling device. We however SNAT that traffic on its way to cilium_host. We don't need to apply SNAT there for the pod CIDR as any traffic from the pod CIDR will either leave directly through the native device (masquerading happening there) or go through cilium_host and be redirected to the tunnel. Signed-off-by: Paul Chaignon <paul@cilium.io>
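The delivery choice described in the two commits above can be summarized as a tiny decision function. This is an illustration of the stated behavior, not Cilium's actual code:

```python
def tunnel_delivery_mode(ipsec_enabled, host_firewall_enabled):
    """Pick how pod-to-remote-node tunnel traffic is delivered.

    With IPSec or the host firewall enabled, the packet goes through the
    stack so that conntrack (and thus kube-proxy's reverse DNAT) sees it.
    Otherwise a BPF redirect() is used, which bypasses both.
    Hypothetical sketch only.
    """
    if ipsec_enabled or host_firewall_enabled:
        return "stack"
    return "redirect"

print(tunnel_delivery_mode(True, False))   # stack
print(tunnel_delivery_mode(False, False))  # redirect
```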
[ upstream commit 2303803 ] This test validates a potential regression on the path node1 -> pod@node2 via service with externalTrafficPolicy=Local. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1db771f ] This test validates a potential regression on the path node1 -> pod@node2 via service with externalTrafficPolicy=Local when kube-proxy, tunneling and the host firewall are enabled. This commit also removes the explicit disabling of the host firewall in externalTrafficPolicy=Local tests. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 5440f9f ] This sets the `GOPS_CONFIG_DIR` environment variable in all images which use gops and are based on a scratch image. This fixes an issue with crun where, due to missing environment variables, cilium-operator and hubble-relay would error out when starting the gops agent with the following error:

```
Error: failed to start gops agent: unable to get current user home directory: os/user lookup failed; $HOME is empty
```

This only seems to occur on `crun`, which sets `HOME` to an empty variable in scratch images. In non-scratch images, it sets HOME according to `/etc/passwd`, so this currently only affects the cilium-operator and hubble-relay images. Most other container runtimes seem to set `HOME` to `/` in scratch images, which works for gops. Fixes: #14301 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
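The failure mode can be modeled with a small sketch of how a config directory might be resolved. This illustrates the described behavior (explicit env var wins, empty HOME errors out); it is not gops' actual lookup code:

```python
import os

def gops_config_dir(env):
    """Resolve a config dir for a gops-like agent (illustrative logic).

    crun sets HOME to an empty string in scratch images, which is what
    triggers the error path below; setting GOPS_CONFIG_DIR avoids it.
    """
    if env.get("GOPS_CONFIG_DIR"):
        return env["GOPS_CONFIG_DIR"]
    home = env.get("HOME", "")
    if not home:
        raise RuntimeError("unable to get current user home directory: "
                           "os/user lookup failed; $HOME is empty")
    return os.path.join(home, ".config", "gops")

print(gops_config_dir({"GOPS_CONFIG_DIR": "/tmp/gops"}))  # /tmp/gops
print(gops_config_dir({"HOME": "/root"}))                 # /root/.config/gops
```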
[ upstream commit 773b1d2 ] Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3ce38c9 ] This is a preparatory commit that introduces ENI-related information for cilium_host (aka router). Specifically, the information includes the primary ENI interface that the secondary cilium_host IP is associated with, the primary ENI MAC address, and the VPC CIDRs. Also, refactor the current IPAM internal IP allocation methods so that the ENI information can be retrieved for cilium_host (aka router). This information can be useful for setting up encryption-related state, such as ip rules and the interface to attach BPF programs to. Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit f34371c ] Previously, we weren't adding policy-based routing rules for the cilium_host secondary IP address. These rules ensure that traffic from the source node's cilium_host interface destined to a remote node in the cluster VPC is routed through the primary ENI interface for cilium_host. From there on, the traffic is routed to the AWS infrastructure router, which in turn routes it to the destination node. The fix involves setting up ip rules in the per-ENI routing table. Verified that the correct rule is added, and encrypted traffic is reaching the destination node:

```
root@ip-192-168-38-245:/home/cilium# ip addr show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether da:61:19:32:0b:37 brd ff:ff:ff:ff:ff:ff
    inet 192.168.148.108/32 scope link cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::d861:19ff:fe32:b37/64 scope link
       valid_lft forever preferred_lft forever
root@ip-192-168-38-245:/home/cilium# ip rule show | grep 192.168.148.108
111:    from 192.168.148.108 to 192.168.0.0/16 lookup 11
```

Tcpdump on the remote node shows encrypted packets getting received:

```
tcpdump -eni eth1 dst 192.168.142.218
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
17:32:52.216989 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 130: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x64), length 96
17:32:52.217694 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 122: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x65), length 88
17:32:52.217896 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 206: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x66), length 172
```

Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
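The installed rule can be rendered from its parts with a small helper (the helper and its parameter names are hypothetical); with the values from the output above it reproduces the `ip rule show` entry:

```python
def eni_rule(router_ip, vpc_cidr, table, pref=111):
    """Render the policy-routing rule installed for cilium_host's secondary
    IP: traffic from that IP toward the VPC is looked up in the per-ENI
    routing table. Illustrative helper, not Cilium's code."""
    return f"ip rule add from {router_ip} to {vpc_cidr} lookup {table} pref {pref}"

print(eni_rule("192.168.148.108", "192.168.0.0/16", 11))
# ip rule add from 192.168.148.108 to 192.168.0.0/16 lookup 11 pref 111
```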
…r ENI mode [ upstream commit 2c5e192 ] Problem: auto-detection of encryption.interface causes the bpf_network program that does decryption to be attached to the wrong network interface. As a result, packets were not decrypted on the destination remote node. If users don't specify the cilium `encryption.interface` configuration, the interface is auto-detected based on the main routing table. This breaks in ENI IPAM mode, where cilium_host is allocated an IP address from the pool of secondary IP addresses of a primary ENI interface. The IP allocation is influenced by the specified `first-interface-index` configuration. Hence, retrieve the correct primary interface for cilium_host, so that the bpf_network program is attached to it. The fix is verified based on the xfrm stats.

Packet stats on the source node:

```
src 192.168.117.26 dst 192.168.142.218
    proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
    replay-window 0 seq 0x00000000 flag (0x00000000)
    mark 0x3e00/0xff00 output-mark 0xe00
    aead rfc4106(gcm(aes)) 0xea5799388f441ba52b8c60c4f11c084bcdf5373e (160 bits) 128
    anti-replay context: seq 0x0, oseq 0x9, bitmap 0x00000000
    sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
    lifetime config:
      limit: soft (INF)(bytes), hard (INF)(bytes)
      limit: soft (INF)(packets), hard (INF)(packets)
      expire add: soft 0(sec), hard 0(sec)
      expire use: soft 0(sec), hard 0(sec)
    lifetime current:
      540(bytes), 9(packets)
      add 2021-02-05 18:12:08 use 2021-02-05 18:13:01
    stats:
      replay-window 0 replay 0 failed 0
```

Packet stats on the destination node (ip -s xfrm s):

```
src 192.168.142.218 dst 192.168.117.26
    proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
    replay-window 0 seq 0x00000000 flag (0x00000000)
    mark 0x3e00/0xff00 output-mark 0xe00
    aead rfc4106(gcm(aes)) 0xea5799388f441ba52b8c60c4f11c084bcdf5373e (160 bits) 128
    anti-replay context: seq 0x0, oseq 0x6, bitmap 0x00000000
    sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
    lifetime config:
      limit: soft (INF)(bytes), hard (INF)(bytes)
      limit: soft (INF)(packets), hard (INF)(packets)
      expire add: soft 0(sec), hard 0(sec)
      expire use: soft 0(sec), hard 0(sec)
    lifetime current:
      360(bytes), 6(packets)
      add 2021-02-05 18:12:08 use 2021-02-05 23:19:07
    stats:
      replay-window 0 replay 0 failed 0
```

Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1e80ec0 ] This commit updates the list of default label filters and clarifies the default behavior, as well as how it changes when a first inclusive label is added. The example is also completed based on the new unit test added in the previous commit. Signed-off-by: Paul Chaignon <paul@cilium.io>
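A generic sketch of the filtering behavior described above (with no claim to match Cilium's exact label-filter syntax or matching semantics): without any inclusive filter, every label is kept unless excluded; once a first inclusive filter is added, only labels matching an inclusive filter survive.

```python
def keep_label(label, include_prefixes, exclude_prefixes):
    """Decide whether a label survives filtering (illustrative only)."""
    if any(label.startswith(p) for p in exclude_prefixes):
        return False
    if include_prefixes:
        # A first inclusive filter flips the default: labels must now
        # match an inclusive prefix to be kept.
        return any(label.startswith(p) for p in include_prefixes)
    return True

print(keep_label("k8s:app", [], []))            # True: default keeps it
print(keep_label("k8s:app", ["k8s:team"], []))  # False: inclusive filter set
```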
[ upstream commit 3b0f6e8 ] Signed-off-by: Paul Chaignon <paul@cilium.io>
LGTM for my commits. I confirmed that the EnableIPv4Masquerade flag was changed from Masquerade (v1.9). Thank you!
BPF checkpatch seems safe to ignore.
Smoketest failure should be addressed by adding a backport for #14954.
Looks good for my commit.
LGTM.
LG for my commit
Reviews are in and all required CI jobs are green. K8s-1.20-kernel-4.9 was triggered by mistake but isn't required for backports. Cilium-Ginkgo-Test-k8s and Cilium-PR-Ginkgo-Tests-K8s failed to provision but are now replaced by the K8s-1.X-kernel-4.9 jobs anyway. Cilium-Ginkgo-Tests failed with what looks like a flake; that job also seems to have been triggered by mistake (wrong trigger phrase config) and is in any case a duplicate of K8s-1.19-kernel-4.9. Marking as ready to merge.
pkg/datapath/iptables/iptables.go
`EnableIPv4Masquerade` to `Masquerade` in daemon/cmd/ipam.go and in tests. Please double check.
Once this PR is merged, you can update the PR labels via: