v1.9 backports 2021-11-26 #18025
Conversation
[ upstream commit 573159d ]

Note: this commit was applied before as 7ef59aa and reverted in commit 346713b due to a broken test.

When endpoint routes are enabled, we attach a BPF program on the way to the container and add a Linux route to the lxc interface. So when coming from bpf_network with IPsec, we should use that route to go directly to the lxc device and its attached BPF program.

In contrast, when endpoint routes are disabled, we run the BPF program for ingress pod policies from cilium_host, via a tail call in bpf_host. Therefore, in that case, we need to jump from bpf_network to cilium_host first, to follow the correct path to the lxc interface.

That's what commit 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled") attempted to implement for when endpoint routes are enabled. Its goal was to go directly from bpf_network to the stack in that case, to use the per-endpoint Linux routes to the lxc device. That commit however implements a no-op change: ENABLE_ENDPOINT_ROUTES is defined as a per-endpoint setting, but then used in bpf_network, which is not tied to any endpoint. In practice, that means the macro is defined in the ep_config.h header files used by bpf_lxc, whereas bpf_network (where the macro is used) relies on the node_config.h header file.

The fix is therefore simple: we need to define ENABLE_ENDPOINT_ROUTES as a global config, written in node_config.h.
To reproduce the bug and validate the fix, I deployed Cilium on GKE (where endpoint routes are enabled by default) with:

```shell
helm install cilium ./cilium --namespace kube-system \
  --set nodeinit.enabled=true \
  --set nodeinit.reconfigureKubelet=true \
  --set nodeinit.removeCbrBridge=true \
  --set cni.binPath=/home/kubernetes/bin \
  --set gke.enabled=true \
  --set ipam.mode=kubernetes \
  --set nativeRoutingCIDR=$NATIVE_CIDR \
  --set nodeinit.restartPods=true \
  --set image.repository=docker.io/pchaigno/cilium-dev \
  --set image.tag=fix-ipsec-ep-routes \
  --set operator.image.repository=quay.io/cilium/operator \
  --set operator.image.suffix="-ci" \
  --set encryption.enabled=true \
  --set encryption.type=ipsec
```

I then deployed the manifest below and attempted a curl request from pod client to the service echo-a.

```yaml
metadata:
  name: echo-a
  labels:
    name: echo-a
spec:
  template:
    metadata:
      labels:
        name: echo-a
    spec:
      containers:
      - name: echo-a-container
        env:
        - name: PORT
          value: "8080"
        ports:
        - containerPort: 8080
        image: quay.io/cilium/json-mock:v1.3.0
        imagePullPolicy: IfNotPresent
        readinessProbe:
          timeoutSeconds: 7
          exec:
            command:
            - curl
            - -sS
            - --fail
            - --connect-timeout
            - "5"
            - -o
            - /dev/null
            - localhost:8080
  selector:
    matchLabels:
      name: echo-a
  replicas: 1
apiVersion: apps/v1
kind: Deployment
---
metadata:
  name: echo-a
  labels:
    name: echo-a
spec:
  ports:
  - name: http
    port: 8080
  type: ClusterIP
  selector:
    name: echo-a
apiVersion: v1
kind: Service
---
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l3-rule"
spec:
  endpointSelector:
    matchLabels:
      name: client
  ingress:
  - fromEndpoints:
    - matchLabels:
        name: echo-a
---
apiVersion: v1
kind: Pod
metadata:
  name: client
  labels:
    name: client
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - echo-a
        topologyKey: kubernetes.io/hostname
  containers:
  - name: netperf
    args:
    - sleep
    - infinity
    image: cilium/netperf
```

Fixes: 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
[ upstream commit 3ffe49e ]

With the previous patch, when IPsec and endpoint routes are enabled, packets flow directly from bpf_network to bpf_lxc via the Linux stack, instead of going through bpf_host. However, we noticed that, when L7 policies are applied, connectivity fails between the proxy and the destination:

```
43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: SYN)
43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 UNKNOWN 5 (TCP Flags: SYN)
43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: SYN)
43.808: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-proxy FORWARDED (TCP Flags: SYN, ACK)
43.808: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-endpoint FORWARDED (TCP Flags: SYN, ACK)
43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: ACK)
43.808: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: ACK)
43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-endpoint FORWARDED (TCP Flags: ACK, PSH)
43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-proxy FORWARDED (TCP Flags: ACK, PSH)
43.810: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-proxy FORWARDED (TCP Flags: ACK)
43.810: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-endpoint FORWARDED (TCP Flags: ACK)
43.810: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 http-request FORWARDED (HTTP/1.1 GET http://10.240.0.55:8080/)
43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-network FORWARDED (TCP Flags: SYN)
43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-stack FORWARDED (TCP Flags: SYN)
43.812: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-endpoint FORWARDED (TCP Flags: SYN)
43.812: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-endpoint FORWARDED (TCP Flags: SYN, ACK)
43.812: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 to-stack FORWARDED (TCP Flags: SYN, ACK)
43.813: cilium-test/echo-other-node-697d5d69b7-lglxf:8080 -> cilium-test/client2-6dd75b74c6-9nxhx:45704 from-network FORWARDED (TCP Flags: SYN, ACK)
44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-network FORWARDED (TCP Flags: SYN)
44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 from-stack FORWARDED (TCP Flags: SYN)
44.827: cilium-test/client2-6dd75b74c6-9nxhx:45704 -> cilium-test/echo-other-node-697d5d69b7-lglxf:8080 to-endpoint FORWARDED (TCP Flags: SYN)
```

We can see the TCP handshake between the client and proxy, followed by an attempt to perform the TCP handshake between the proxy and server. That second part fails, as the SYN+ACK packets sent by the server never seem to reach the proxy; they are dropped somewhere after the from-network observation point.

At the same time, we can see the IPsec error counter XfrmInNoPols increasing on the client node. This indicates that SYN+ACK packets were dropped after decryption, because the XFRM state used for decryption doesn't match any XFRM policy.
The XFRM state used for decryption is:

```
src 0.0.0.0 dst 10.240.0.18
	proto esp spi 0x00000003 reqid 1 mode tunnel
	replay-window 0
	mark 0xd00/0xf00 output-mark 0xd00/0xf00
	aead rfc4106(gcm(aes)) 0x3ec3e84ac118c3baf335392e7a4ea24ee3aecb2b 128
	anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0
```

And it should match the following XFRM policy:

```
src 0.0.0.0/0 dst 10.240.0.0/16
	dir in priority 0
	mark 0xd00/0xf00
	tmpl src 0.0.0.0 dst 10.240.0.18
		proto esp reqid 1 mode tunnel
```

After the packet is decrypted however, we hit the following rule in iptables because we're going to the proxy:

```
-A CILIUM_PRE_mangle -m socket --transparent -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

As a result, the packet mark is set to 0x200 and we don't match the 0xd00 packet mark of the XFRM policy anymore. The packet is therefore dropped with XfrmInNoPols.

To avoid this, we can simply mark the XFRM policy optional when endpoint routes are enabled, in the same way we do for tunneling.

Fixes: 287f49c ("cilium: encryption, fix redirect when endpoint routes enabled")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
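The mark/mask comparison behind this drop can be sketched in a few lines of Go. This is an illustration of the kernel's selector logic, not Cilium code; `markMatches` is a hypothetical helper mirroring how an XFRM policy's `mark 0xd00/0xf00` selector is checked against a packet's mark:

```go
package main

import "fmt"

// markMatches reports whether a packet mark satisfies an XFRM policy's
// mark/mask selector: the policy matches when the masked packet mark
// equals the masked policy value.
func markMatches(packetMark, value, mask uint32) bool {
	return packetMark&mask == value&mask
}

func main() {
	const value, mask = 0xd00, 0xf00 // policy selector: mark 0xd00/0xf00

	// Decrypted packet still carrying the IPsec mark: matches.
	fmt.Println(markMatches(0xd00, value, mask)) // true

	// Same packet after the iptables rule rewrites the mark to 0x200
	// for the proxy redirect: no longer matches, hence XfrmInNoPols.
	fmt.Println(markMatches(0x200, value, mask)) // false
}
```

Marking the policy optional sidesteps exactly this failed comparison for proxy-bound packets.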
[ upstream commit b385f0f ]

This patch modifies a forward policy update introduced by 0b52fd7 so that the template source matches the source, which is 0.0.0.0/0 (wildcard). This modification addresses an issue of intermittent packet drops, as discussed in detail below.

During an investigation of intermittent packet drops in AKS (kernel 5.4.0-1061-azure) with IPsec enabled, there was an increase of XfrmInTmplMismatch errors in /proc/net/xfrm_stat as packets were dropped. Tracing revealed that the packets were dropped due to an __xfrm_policy_check failure when the packet was redirected from eth0 (after decryption) to the LXC device of a pod. Further investigation attributed the drops to changes in the forwarding policy. Specifically, the forwarding policy would change as:

```
src 0.0.0.0/0 dst 10.240.0.0/16
	dir fwd priority 2975
-	tmpl src 0.0.0.0 dst 10.240.0.19
+	tmpl src 10.240.0.19 dst 10.240.0.61
		proto esp reqid 1 mode tunnel
		level use
```

And back:

```
src 0.0.0.0/0 dst 10.240.0.0/16
	dir fwd priority 2975
-	tmpl src 10.240.0.19 dst 10.240.0.61
+	tmpl src 0.0.0.0 dst 10.240.0.19
		proto esp reqid 1 mode tunnel
		level use
```

The above change was caused by func (n *linuxNodeHandler) enableIPsec(newNode *nodeTypes.Node) in pkg/datapath/linux/node.go. Modifying the code to avoid changing the policy eliminated the packet drops.
There are two places where the XFRM policy is updated in enableIPsec(): (1) inside UpsertIPsecEndpoint(), when an IN policy is specified (as happens if newNode.IsLocal()); (2) in enableIPsec() itself, as introduced by 0b52fd7. For example, adding log messages to IpSecReplacePolicyFwd and UpsertIPsecEndpoint produced:

```
level=info msg="IpSecReplacePolicyFwd: src=0.0.0.0/0 dst=10.240.0.61/16 tmplSrc=10.240.0.19/16 tmplDst=10.240.0.61/16" subsys=ipsec
level=info msg="UpsertIPsecEndpoint: local:10.240.0.19/16 remote:0.0.0.0/0 fowrard:10.240.0.19/16" subsys=ipsec
level=info msg="IpSecReplacePolicyFwd: src=0.0.0.0/0 dst=10.240.0.19/16 tmplSrc=0.0.0.0/0 tmplDst=10.240.0.19/16" subsys=ipsec
level=info msg="UpsertIPsecEndpoint: exit" subsys=ipsec
```

Additional testing revealed that the update resulting in a template with tmplSrc=10.240.0.19/16 was the culprit for the packet drops. Making the template source match the source, which is a wildcard, in update (2) eliminated the packet drops.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
[ upstream commit 690c112 ]

A version of the classic closure-and-concurrency gotcha: goroutines started inside a loop all capture the same loop variable. Bind the value of cmd in the loop to a new variable to address the issue.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
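The gotcha and its fix look like this in a generic sketch (not the actual code from this PR; `runAll` and the command names are illustrative). Before Go 1.22, all goroutines launched in a `for` loop shared one `cmd` variable, so without the rebinding they could all observe the last element:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runAll starts one goroutine per command and records which command
// each goroutine observed. Rebinding cmd inside the loop gives every
// goroutine its own copy instead of the shared loop variable.
func runAll(cmds []string) []string {
	var (
		mu  sync.Mutex
		got []string
		wg  sync.WaitGroup
	)
	for _, cmd := range cmds {
		cmd := cmd // the fix: bind the value to a new variable
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			got = append(got, cmd)
			mu.Unlock()
		}()
	}
	wg.Wait()
	sort.Strings(got) // goroutines finish in any order
	return got
}

func main() {
	fmt.Println(runAll([]string{"config", "status", "version"}))
	// [config status version]
}
```

Without the `cmd := cmd` line, the unsynchronized reads of the shared variable are a data race, and the commit closes exactly that hole.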
/test-backport-1.9
LGTM for my patch 👍
LGTM, thanks!
GKE failure seems unrelated - cilium pods were stuck in
GKE tests are in fact broken on v1.9 at the moment (#17797 for context and WIP fix). All other tests are passing, I'll go ahead and merge the PR.