v1.8 backports 2021-02-12 #14953
[ upstream commit 32e41f6 ]

If a pod in host networking and/or the node itself sends traffic to a pod with encryption+vxlan enabled and encryptNode disabled, we may see dropped packets. The flow is the following.

First, a packet is sent from host networking with srcIP=nodeIP, dstIP=podIP. A source IP is already chosen for the packet, so the route's src hint will not be used. The routing table then routes the packet to cilium_host using the route matching the podIP subnet:

```
podIPSubnet via $IP dev cilium_host src $SrcIP mtu 1410
```

We then drop into BPF space with srcIP=nodeIP, dstIP=podIP as above. Here tohost=true, and we do an ipcache lookup. The ipcache lookup hits for the podIP and returns both a key and a tunnelIP, for example:

```
10.128.5.129/32   1169 3 10.0.0.4
```

Using the above key identifier ('3' in the example), the bpf_host program marks the packet for encryption. The encryption parameters are passed through the ctx, and the ingress program attached to cilium_net in turn uses those parameters to set skb->mark before passing the packet up to the stack for encryption.

At this point an skb ingresses the stack with skb->mark=0xe00, srcIP=nodeIP, dstIP=podIP. Here is where the trouble starts: the packet misses the encryption policy rules, because unless encryptNode is enabled there is no rule matching srcIP=nodeIP. The unencrypted packet is therefore sent back to cilium_host using a route in the encryption routing table, and finally bpf_host sends the skb to the overlay, cilium_vxlan.

The packet is then received by the podIP's node and handed to cilium_vxlan because it is a vxlan packet. The vxlan header is popped off and the inner (unencrypted) packet is delivered to the pod; note that srcIP=nodeIP still.

Assuming the above was a SYN packet, the pod responds with a SYN/ACK with srcIP=podIP, dstIP=nodeIP. This is handled by the bpf_lxc program. First, we do an ipcache lookup and get an entry, but this entry has no tunnelIP. It does have a key. Because no tunnelIP is available, a tunnel map lookup is done. This fails because the dstIP is a nodeIP and not in a tunnel subnet (the tunnel map key is derived by masking the dstIP with the subnet). Because the program did not find a valid tunnel, the packet is sent to the stack. The stack runs through iptables, but because the flow is asymmetric (SYN over cilium_vxlan, SYN/ACK over the nic) the MASQUERADE rules are skipped. The end result is that we try to send a packet with srcIP=podIP.

Here a couple of different outcomes are possible. If the network infra is strict, the packet may be dropped at the sending node, which refuses to send a packet with an unknown srcIP. If the receiving node has rp_filter=1 on the receiving interface, the packet is dropped with a martian IP message in dmesg. Or, if rp_filter=0 and the network somehow knows how to route the srcIP, it might work; for example, some of my test systems managed to work without any issues.

To fix this, we ensure reply packets are symmetric with the request by injecting tunnel info into the nodeIP ipcache entry. Once this is done, the replying pod sends a packet with srcIP=podIP, dstIP=nodeIP, which is picked up by the bpf_lxc program. The ipcache lookup now finds a complete entry with the tunnel info, so instead of sending the packet to the stack it is encapsulated in a vxlan header and sent to (and received on) the original requesting node.

A quick observation: some of the passes through the xfrm stack are useless here. We know the miss is going to happen and can signal this to the BPF layer by clearing the encryption key in the ipcache. We will do this as a follow-up patch.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 16e8f2f ]

Fix #14100.

The identity-relevant label list is a label prefix list combining two parts:

1. Base part:
   1.1. Read from a user-specified JSON file if `--label-prefix-file` is provided (default: `--label-prefix-file=""`).
   1.2. If `--label-prefix-file=""`, read from a hardcoded default list (`func defaultLabelPrefixCfg()`).
2. Additional part: read from user input via `--labels` (default: `--labels=""`).

When `--label-prefix-file=""` (the default) but `--labels=<custom-list>` is provided, and `reserved:host` (or `reserved:.*`) is not included in `<custom-list>`, the `cilium_host` endpoint loses its `reserved:host` label. When rolling back to the default configuration, that is, setting `--labels=""` and restarting the agent, the cilium agent raises errors like the following:

```
level=warning msg="Regeneration of endpoint failed" .. error="Exposing new BPF failed: invalid LXC MAC: invalid MAC address "
level=error msg="endpoint regeneration failed" .. error="Exposing new BPF failed: invalid LXC MAC: invalid MAC address "
```

Subsequently, all pods' traffic on this node is interrupted. This is because the agent relies on this label to distinguish the `cilium_host` endpoint from normal endpoints, and the former has no `lxcMAC`. We should never exclude reserved labels from the default label list; adding the reserved labels to the default label list solves the problem.

Appendix: sample custom label file (`--label-prefix-file`) to overwrite the default base label list:

```json
{
  "version": 1,
  "valid-prefixes": [
    { "source": "k8s", "prefix": "io.kubernetes.pod.namespace" },
    { "source": "k8s", "prefix": ":io.cilium.k8s.namespace.labels" },
    { "source": "k8s", "prefix": "app.kubernetes.io" },
    { "source": "k8s", "prefix": "k8s!:io.kubernetes" },
    { "source": "k8s", "prefix": "!kubernetes.io" },
    { "source": "k8s", "prefix": "!.*beta.kubernetes.io" },
    { "source": "k8s", "prefix": "!k8s.io" },
    { "source": "k8s", "prefix": "!pod-template-generation" },
    { "source": "k8s", "prefix": "!pod-template-hash" },
    { "source": "k8s", "prefix": "!controller-revision-hash" },
    { "source": "k8s", "prefix": "!annotation.*" },
    { "source": "k8s", "prefix": "!etcd_node" }
  ]
}
```

Signed-off-by: ArthurChiao <arthurchiao@hotmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
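The shape of the fix can be sketched in a few lines of Go: regardless of where the base list comes from, reserved labels are appended unconditionally so that no `--labels` value can strip `reserved:host` from `cilium_host`. This is an illustrative sketch, not Cilium's actual `defaultLabelPrefixCfg` code:

```go
package main

import "fmt"

// buildLabelPrefixes sketches the fix: whether the base list comes from a
// user file (--label-prefix-file) or the hardcoded default, reserved
// labels are always appended, so user-supplied --labels can never cause
// reserved:host to be dropped. All names here are illustrative.
func buildLabelPrefixes(fromFile, userLabels []string) []string {
	base := fromFile
	if len(base) == 0 {
		// Stand-in for the hardcoded default list.
		base = []string{"k8s:io.kubernetes.pod.namespace", "k8s:app.kubernetes.io"}
	}
	out := append(append([]string{}, base...), userLabels...)
	// The fix: reserved labels are unconditionally part of the list.
	return append(out, "reserved:.*")
}

func main() {
	// Even a custom --labels list that omits reserved:host keeps it.
	fmt.Println(buildLabelPrefixes(nil, []string{"k8s:team"}))
}
```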
[ upstream commit d869108 ]

`[SG]etInternalIPv4Router`: this is the CiliumInternalIP of the node. It is only relevant within the Cilium-managed network (that means within the node for direct routing mode, and on the overlay for tunnel mode). In order not to confuse it with the InternalIP of a k8s node, we renamed `[SG]etInternalIPv4` to `[SG]etInternalIPv4Router`.

`[SG]etIPv4`: this returns one of the node's IPv4 addresses based on the following priority:
- NodeInternalIP
- NodeExternalIP
- other IP address types

These are reachable on the network. Since the result can be an external or an internal IP address, we renamed `[SG]etExternalIPv4` to `[SG]etIPv4`.

Signed-off-by: Anish Shah <anishshah@google.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit b60f8b7 ] Previously, accessing a NodePort using ExternalIP of the local K8s node would result in connection timeout. This is because ExternalIP of the local K8s node is not added to ipcache and hence no service resolution happens on egress. This commit resolves this issue by adding the ExternalIP of local k8s node to the ipcache. We introduce two functions `SetExternalIPv[46]` and `GetExternalIPv[46]` to store/fetch the External IPv4/IPv6 address of a k8s node. If a node doesn't have any External IP address, then we store/return `nil`. Signed-off-by: Anish Shah <anishshah@google.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
… cluster [ upstream commit 265d677 ] This test checks connectivity to NodePort service from inside a cluster by sending request to ExternalIP:NodePort. Signed-off-by: Anish Shah <anishshah@google.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 463e0dc ] Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 2ceb386 ] Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 55caedc ] Strictly speaking this is a general truth for all services, but we don't have a dedicated section for explaining services and users have begun reporting specifically in relation to AWS / ENI mode. Put this somewhere in the docs, we can always move it around to somewhere more generic when we have a better location for these links to live. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 83cb610 ] Setting this option makes Cilium run in a mode that allows it to reduce its process' RSS even if the system is not under memory pressure. By default, i.e. without 'madvdontneed' being set, Cilium only releases its process' RSS when the system is under pressure. This may be confusing for users, since they will assume Cilium needs the memory reported in RSS, which is not true. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 317a671 ]

Kube-proxy installs the following rule in the iptables FORWARD chain:

```
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
```

That rule causes iptables to drop packets whenever we have a path asymmetry, with for example a SYN packet going through iptables and a SYN+ACK packet skipping iptables via BPF's redirect(). In that example, the next ACK packet would be dropped by that rule because the connection is invalid (SYN+ACK is missing) from iptables' point of view.

To avoid such drops, in the IPv4 case, Cilium installs rules in the FORWARD chain (via its own CILIUM_FORWARD chain) to bypass the subsequent -j DROP rule from kube-proxy. For example, with per-endpoint routes enabled:

```
-A FORWARD -m comment --comment "cilium-feeder: CILIUM_FORWARD" -j CILIUM_FORWARD
-A CILIUM_FORWARD -o cilium_host -m comment --comment "cilium: any->cluster on cilium_host forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_host -m comment --comment "cilium: cluster->any on cilium_host forward accept (nodeport)" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_net -m comment --comment "cilium: cluster->any on cilium_net forward accept (nodeport)" -j ACCEPT
-A CILIUM_FORWARD -o lxc+ -m comment --comment "cilium: any->cluster on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept (nodeport)" -j ACCEPT
```

The same rules are however missing from the IPv6 tables. This commit fixes this inconsistency.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit ed68d2c ]

This commit fixes a broken packet path on pod --> remote pod via VIP when running with encapsulation and per-endpoint routes, with kube-proxy handling the DNATing. The following details the broken packet path:

- SYN leaves the lxc device for the stack with pod1IP -> VIP
- SYN goes through cilium_host and cilium_net with pod1IP -> pod2IP
- SYN is sent to the second node through the tunnel with pod1IP -> pod2IP
- SYN+ACK arrives from the tunnel on cilium_vxlan with pod2IP -> pod1IP
- SYN+ACK is processed by ipvX_local_delivery and redirected in BPF directly to the lxc device
- SYN+ACK arrives on the lxc device with pod2IP -> pod1IP
- RST is sent because pod2IP doesn't match VIP

Because we use redirect() in BPF, the SYN+ACK packet bypasses most of the stack (including conntrack) on its way from cilium_vxlan to the lxc device. By bypassing conntrack, we also bypass the reverse DNAT.

The bug doesn't occur when per-endpoint routes are disabled because, in that case, the SYN is SNATed to the cilium_host IP before leaving through the tunnel. The SYN+ACK packet therefore takes a different path and goes through conntrack.

Fixes: #14646
Co-authored-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit ae96b92 ] Signed-off-by: Yosh de Vos <yosh@elzorro.nl> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c8052a1 ]

When IPSec or the host firewall are enabled in tunneling mode, traffic from a pod to a remote node is sent through the stack. This introduces an asymmetry in kube-proxy DNAT and breaks connections from a node (or hostns pod) to remote pods via externalTrafficPolicy=Local services. The following details the broken packet path node1 --VIP--> pod@node2 (VIP is node2IP):

- SYN leaves node1 via the native device with node1IP -> VIP
- SYN is DNATed on node2 to node1IP -> podIP
- SYN is delivered to the lxc device with node1IP -> podIP
- SYN+ACK is sent from the lxc device with podIP -> node1IP
- SYN+ACK is redirected in BPF directly to cilium_vxlan
- SYN+ACK arrives on node1 via the tunnel with podIP -> node1IP
- RST is sent because podIP doesn't match VIP

Because we use redirect() in BPF, the SYN+ACK packet bypasses most of the stack (including conntrack) on its way from cilium_vxlan to the lxc device. By bypassing conntrack, we also bypass the reverse DNAT.

The bug only occurs for externalTrafficPolicy=Local services because, without that annotation, the SYN packet is SNATed when arriving on node2. The SYN+ACK is therefore first sent via the stack to node2's IP, where it is both reverse SNATed and reverse DNATed.

Fixes: #14674
Fixes: #12542
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c4356bd ] For IPSec and the host firewall, in order to apply kube-proxy's reverse DNAT we want to send traffic through the tunnel via the stack instead of doing a redirect call. Such traffic therefore goes from the lxc device to cilium_host where it is redirected to the tunneling device. We however SNAT that traffic on its way to cilium_host. We don't need to apply SNAT there for the pod CIDR as any traffic from the pod CIDR will either leave directly through the native device (masquerading happening there) or go through cilium_host and be redirected to the tunnel. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 2303803 ] This test validates a potential regression on the path node1 -> pod@node2 via service with externalTrafficPolicy=Local. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1db771f ] This test validates a potential regression on the path node1 -> pod@node2 via service with externalTrafficPolicy=Local when kube-proxy, tunneling and the host firewall are enabled. This commit also removes the explicit disabling of the host firewall in externalTrafficPolicy=Local tests. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 773b1d2 ] Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3ce38c9 ] This is a preparatory commit that introduces ENI related information for cilium_host (aka router). Specifically, the information includes primary ENI interface that the secondary cilium_host IP is associated with, primary ENI MAC address and VPC CIDRs. Also, refactor the current IPAM internal IP allocating methods so that the ENI information can be retrieved for cilium_host (aka router). This information can be useful for setting up encryption related state such as ip rules and interface to attach BPF programs to. Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit f34371c ]

Previously, we weren't adding policy-based routing rules for the cilium_host secondary IP address. These rules ensure that traffic from the source node's cilium_host interface destined to a remote node in the cluster VPC is routed through the primary ENI interface for cilium_host. From there on, the traffic is routed to the AWS infrastructure router, which in turn routes it to the destination node. The fix involves setting up ip rules pointing to the per-ENI routing table.

Verified that the correct rule is added, and that encrypted traffic reaches the destination node:

```
root@ip-192-168-38-245:/home/cilium# ip addr show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether da:61:19:32:0b:37 brd ff:ff:ff:ff:ff:ff
    inet 192.168.148.108/32 scope link cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::d861:19ff:fe32:b37/64 scope link
       valid_lft forever preferred_lft forever
root@ip-192-168-38-245:/home/cilium# ip rule show | grep 192.168.148.108
111:    from 192.168.148.108 to 192.168.0.0/16 lookup 11
```

Tcpdump on the remote node shows encrypted packets being received:

```
tcpdump -eni eth1 dst 192.168.142.218
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
17:32:52.216989 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 130: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x64), length 96
17:32:52.217694 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 122: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x65), length 88
17:32:52.217896 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 206: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x66), length 172
```

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
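The rule being installed has a fixed shape: from the cilium_host secondary IP, to the VPC CIDR, look up the per-ENI routing table. A small Go sketch rendering the equivalent ip-rule command (illustrative only; Cilium programs this via netlink rather than shelling out):

```go
package main

import "fmt"

// eniRule renders the policy-routing rule installed for the cilium_host
// secondary IP: traffic from that IP to the VPC CIDR is looked up in the
// per-ENI routing table (table 11 in the example above).
func eniRule(priority int, routerIP, vpcCIDR string, table int) string {
	return fmt.Sprintf("ip rule add from %s to %s lookup %d pref %d",
		routerIP, vpcCIDR, table, priority)
}

func main() {
	// Matches the "111: from 192.168.148.108 to 192.168.0.0/16 lookup 11"
	// rule shown in the commit message.
	fmt.Println(eniRule(111, "192.168.148.108", "192.168.0.0/16", 11))
}
```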
…r ENI mode [ upstream commit 2c5e192 ]

Problem: auto-detection of encryption.interface leads to the bpf_network program that does decryption getting attached to a wrong network interface. As a result, packets were not decrypted on the destination remote node.

If users don't specify the cilium `encryption.interface` configuration, the interface can be auto-detected based on the main routing table. This breaks in ENI IPAM mode, where cilium_host is allocated an IP address from the pool of secondary IP addresses of a primary ENI interface. The IP allocation is influenced by the specified `first-interface-index` configuration. Hence, retrieve the correct primary interface for cilium_host for the bpf_network program to be attached to.

The fix is verified based on the xfrm stats. Packet stats on the source node:

```
src 192.168.117.26 dst 192.168.142.218
    proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
    replay-window 0 seq 0x00000000 flag (0x00000000)
    mark 0x3e00/0xff00 output-mark 0xe00
    aead rfc4106(gcm(aes)) 0xea5799388f441ba52b8c60c4f11c084bcdf5373e (160 bits) 128
    anti-replay context: seq 0x0, oseq 0x9, bitmap 0x00000000
    sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
    lifetime config:
      limit: soft (INF)(bytes), hard (INF)(bytes)
      limit: soft (INF)(packets), hard (INF)(packets)
      expire add: soft 0(sec), hard 0(sec)
      expire use: soft 0(sec), hard 0(sec)
    lifetime current:
      540(bytes), 9(packets)
      add 2021-02-05 18:12:08 use 2021-02-05 18:13:01
    stats:
      replay-window 0 replay 0 failed 0
```

Packet stats on the destination node (`ip -s xfrm s`):

```
src 192.168.142.218 dst 192.168.117.26
    proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
    replay-window 0 seq 0x00000000 flag (0x00000000)
    mark 0x3e00/0xff00 output-mark 0xe00
    aead rfc4106(gcm(aes)) 0xea5799388f441ba52b8c60c4f11c084bcdf5373e (160 bits) 128
    anti-replay context: seq 0x0, oseq 0x6, bitmap 0x00000000
    sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
    lifetime config:
      limit: soft (INF)(bytes), hard (INF)(bytes)
      limit: soft (INF)(packets), hard (INF)(packets)
      expire add: soft 0(sec), hard 0(sec)
      expire use: soft 0(sec), hard 0(sec)
    lifetime current:
      360(bytes), 6(packets)
      add 2021-02-05 18:12:08 use 2021-02-05 23:19:07
    stats:
      replay-window 0 replay 0 failed 0
```

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1e80ec0 ] This commit updates the list of default label filters and clarifies the default behavior, as well as how it changes when a first inclusive label is added. The example is also completed based on the new unit test added in the previous commit. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3b0f6e8 ] Signed-off-by: Paul Chaignon <paul@cilium.io>
37efd4b to 212598c
LGTM for my commits. I confirmed that the EnableIPv4Masquerade flag was changed from Masquerade (v1.8). Thank you!
Docs changes LGTM. I also glanced at the non-core contributor commits, looks straightforward.
Looks good for my commit.
LGTM.
Cilium-Ginkgo-Test-k8s can be ignored as all the (equivalent) K8s-1.X-kernel-4.9 jobs are passing.
pkg/node/manager/manager.go, pkg/endpointmanager/manager.go, clustermesh-apiserver.Dockerfile, which doesn't exist in v1.8. EnableIPv4Masquerade to Masquerade in daemon/cmd/ipam.go. Please double check.

Once this PR is merged, you can update the PR labels via: