CI: K8sDatapathConfig Encapsulation Check vxlan connectivity with per-endpoint routes #13774

Closed
nebril opened this issue Oct 27, 2020 · 12 comments · Fixed by #14913
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

Comments


nebril commented Oct 27, 2020

Every build on https://jenkins.cilium.io/job/cilium-master-K8s-all/ fails with:

/home/jenkins/workspace/cilium-master-K8s-all/1.14-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:514
Kubernetes DNS did not become ready in time
/home/jenkins/workspace/cilium-master-K8s-all/1.14-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:634
nebril added the area/CI and ci/flake labels Oct 27, 2020
nebril self-assigned this Oct 27, 2020

nebril commented Oct 27, 2020

A focused test run also fails in the same way on 1.14, so it's unlikely this is infra/CI related: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Kernel-Focus/100/


pchaigno commented Jan 6, 2021

It looks like this test sometimes also fails in pipelines other than k8s-all. It failed before in #14097 with the same error message.

It also just failed in #14525. That last PR should only affect the IPsec code path, so I'm fairly confident it's a flake.
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.20-kernel-4.9/316/testReport/junit/Suite-k8s-1/20/K8sDatapathConfig_Encapsulation_Check_vxlan_connectivity_with_per_endpoint_routes/
ae88713e_K8sDatapathConfig_Encapsulation_Check_vxlan_connectivity_with_per-endpoint_routes.zip

pchaigno changed the title from "CI: K8sDatapathConfig Encapsulation Check vxlan connectivity with per-endpoint routes fails on k8s-all" to "CI: K8sDatapathConfig Encapsulation Check vxlan connectivity with per-endpoint routes" Jan 11, 2021

jibi commented Feb 1, 2021

I managed to reproduce this locally with:

NFS=1 NETNEXT=1 KUBEPROXY=0 ginkgo -v --focus "K8sDatapathConfig.*Check vxlan connectivity with per-endpoint routes" -- -cilium.provision=false -cilium.holdEnvironment=true -cilium.skipLogs -cilium.runQuarantined

Although I'm seeing a different failure:

17:47:17 STEP: Applying policy /home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/l3-policy-demo.yaml
17:47:25 STEP: Waiting for 4m0s for 5 pods of deployment demo_ds.yaml to become ready
17:47:25 STEP: WaitforNPods(namespace="202102011747k8sdatapathconfigencapsulationcheckvxlanconnectivit", filter="")
17:47:30 STEP: WaitforNPods(namespace="202102011747k8sdatapathconfigencapsulationcheckvxlanconnectivit", filter="") => <nil>
17:47:30 STEP: Checking pod connectivity between nodes
17:47:30 STEP: WaitforPods(namespace="202102011747k8sdatapathconfigencapsulationcheckvxlanconnectivit", filter="-l zgroup=testDSClient")
17:47:30 STEP: WaitforPods(namespace="202102011747k8sdatapathconfigencapsulationcheckvxlanconnectivit", filter="-l zgroup=testDSClient") => <nil>
17:47:30 STEP: WaitforPods(namespace="202102011747k8sdatapathconfigencapsulationcheckvxlanconnectivit", filter="-l zgroup=testDS")
17:47:30 STEP: WaitforPods(namespace="202102011747k8sdatapathconfigencapsulationcheckvxlanconnectivit", filter="-l zgroup=testDS") => <nil>

---
K8sDatapathConfig Encapsulation Check vxlan connectivity with per-endpoint routes
at /home/jibi/go/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:514

[Connectivity test between nodes failed
Expected
    <bool>: false
to be true]


jibi commented Feb 4, 2021

Reproduced the actual failure:

K8sDatapathConfig Encapsulation Check vxlan connectivity with per-endpoint routes
at /home/jibi/go/src/github.com/cilium/cilium-test/test/ginkgo-ext/scopes.go:514

[Kubernetes DNS did not become ready in time]

The deployment of coredns seems to be failing due to:

15:16:24 STEP: Kubernetes DNS is not ready yet: unable to resolve service name kubernetes.default.svc.cluster.local with DNS server 10.96.0.10 by running 'dig +short kubernetes.default.svc.cluster.local @10.96.0.10' Cilium pod: Exitcode: 9
Err: Process exited with status 9
Stdout:
 	 ;; connection timed out; no servers could be reached
	
Stderr:
 	 command terminated with exit code 9

although coredns looks healthy:

vagrant@k8s1:~$ ks get pods -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-cc45bff6b-7zqsw   1/1     Running   0          18m
vagrant@k8s1:~$ ks describe pod -l k8s-app=kube-dns
Name:               coredns-cc45bff6b-7zqsw
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               k8s2/192.168.36.12
Start Time:         Thu, 04 Feb 2021 14:12:32 +0000
Labels:             k8s-app=kube-dns
                    pod-template-hash=cc45bff6b
Annotations:        seccomp.security.alpha.kubernetes.io/pod: docker/default
Status:             Running
IP:                 10.0.1.244
Controlled By:      ReplicaSet/coredns-cc45bff6b
Containers:
  coredns:
    Container ID:  docker://edc002608d440ee4f771b653f4aea55be12e55f989b0a307253d46583bc691bd
    Image:         k8s.gcr.io/coredns:1.3.1
    Image ID:      docker-pullable://k8s.gcr.io/coredns@sha256:02382353821b12c21b062c59184e227e001079bb13ebd01f9d3270ba0fcbf1e4
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Running
      Started:      Thu, 04 Feb 2021 14:12:49 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-r4mkp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-r4mkp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-r4mkp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  20m   default-scheduler  Successfully assigned kube-system/coredns-cc45bff6b-7zqsw to k8s2
  Normal  Pulling    19m   kubelet, k8s2      Pulling image "k8s.gcr.io/coredns:1.3.1"
  Normal  Pulled     19m   kubelet, k8s2      Successfully pulled image "k8s.gcr.io/coredns:1.3.1"
  Normal  Created    19m   kubelet, k8s2      Created container coredns
  Normal  Started    19m   kubelet, k8s2      Started container coredns

The IP of the resolver is correct:

vagrant@k8s1:~$ ks get svc kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   38m

But the dig command is failing:

vagrant@k8s1:~$ dig +short test.local @10.96.0.10
;; connection timed out; no servers could be reached
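
The same readiness check can be reproduced without dig. A minimal, hypothetical sketch using Go's stdlib resolver pointed at the kube-dns ClusterIP shown above (10.96.0.10); on the affected node it times out the same way:

// Hedged sketch: resolve the service name through the kube-dns ClusterIP,
// roughly what the test's readiness check does with dig.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			// Ignore the address from resolv.conf and talk to kube-dns directly.
			return d.DialContext(ctx, "udp", "10.96.0.10:53")
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "kubernetes.default.svc.cluster.local")
	if err != nil {
		// On the broken node this times out, matching the dig failure above.
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}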

Looking at the coredns logs, we can see that it is receiving the requests:

vagrant@k8s1:~$ ks logs -l k8s-app=kube-dns | grep test
2021-02-04T14:36:40.884Z [INFO] 10.0.2.15:39865 - 19766 "A IN test.local. udp 51 false 4096" NXDOMAIN qr,rd,ra,ad 114 0.023345241s
2021-02-04T14:36:45.881Z [INFO] 10.0.2.15:39865 - 19766 "A IN test.local. udp 51 false 4096" NXDOMAIN qr,rd,ra,ad 114 0.020891984s
2021-02-04T14:36:50.886Z [INFO] 10.0.2.15:39865 - 19766 "A IN test.local. udp 51 false 4096" NXDOMAIN qr,rd,ra,ad 114 0.02507274s

So the responses are getting dropped for some reason.

Nothing interesting from tcpdump running on the hostns:

vagrant@k8s1:~$ sudo tcpdump -i any -n udp and port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
14:44:07.769082 IP 10.0.2.15.55554 > 10.0.1.244.53: 49592+ [1au] A? test.local. (51)
14:44:12.771207 IP 10.0.2.15.55554 > 10.0.1.244.53: 49592+ [1au] A? test.local. (51)
14:44:17.771222 IP 10.0.2.15.55554 > 10.0.1.244.53: 49592+ [1au] A? test.local. (51)
^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel

cilium monitor:

root@k8s1:/home/cilium# cilium monitor | grep :53
level=info msg="Initializing dissection cache..." subsys=monitor
-> overlay flow 0x0 identity remote-node->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:51549 -> 10.0.1.244:53 udp
-> overlay flow 0x0 identity remote-node->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:51549 -> 10.0.1.244:53 udp
-> overlay flow 0x0 identity remote-node->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:51549 -> 10.0.1.244:53 udp


jibi commented Feb 4, 2021

Running the same dig command on the other node (k8s2) works:

vagrant@k8s2:~$ dig +short kubernetes.default.svc.cluster.local @10.96.0.10
10.96.0.1

And looking at tcpdump on k8s2 (where coredns is running) I can also see the response when I run dig from the k8s1 node:

vagrant@k8s2:~$ sudo tcpdump -i any -n udp and port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
15:15:50.480991 IP 10.0.2.15.50097 > 10.0.1.244.53: 20781+ [1au] A? google.com. (51)
15:15:50.481074 IP 10.0.2.15.50097 > 10.0.1.244.53: 20781+ [1au] A? google.com. (51)
15:15:50.481356 IP 10.0.1.244.33091 > 8.8.8.8.53: 20781+ [1au] A? google.com. (51)
15:15:50.481412 IP 10.0.2.15.33091 > 8.8.8.8.53: 20781+ [1au] A? google.com. (51)
15:15:50.506223 IP 8.8.8.8.53 > 10.0.2.15.33091: 20781 1/0/1 A 216.58.198.14 (55)
15:15:50.506330 IP 8.8.8.8.53 > 10.0.1.244.33091: 20781 1/0/1 A 216.58.198.14 (55)
15:15:50.506510 IP 10.0.1.244.53 > 10.0.2.15.50097: 20781 1/0/1 A 216.58.198.14 (65) <--

So the response is somehow getting lost while being tunneled from k8s2 to k8s1.

Edit: the response we are seeing above is from the lxc device. If we dump the traffic on the cilium_vxlan device, we only see the request:

vagrant@k8s2:~$ sudo tcpdump -i cilium_vxlan -n udp and port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on cilium_vxlan, link-type EN10MB (Ethernet), capture size 262144 bytes
15:33:10.668417 IP 10.0.2.15.53680 > 10.0.1.244.53: 15943+ [1au] A? kubernetes.default.svc.cluster.local. (77)
^C
1 packet captured


jibi commented Feb 4, 2021

Restarted Cilium with monitor-aggregation: none and reran cilium monitor:

k8s1:

vagrant@k8s1:~$ ks exec $(cilium_pod k8s1) -it cilium monitor | grep :53
<- host flow 0x0 identity host->unknown state new ifindex 0 orig-ip 0.0.0.0: 10.0.2.15:34111 -> 10.0.1.244:53 udp
-> overlay flow 0x0 identity remote-node->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:34111 -> 10.0.1.244:53 udp

and on k8s2:

vagrant@k8s1:~$ ks exec $(cilium_pod k8s2) -it cilium monitor | grep :53
<- overlay flow 0x0 identity unknown->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:52390 -> 10.0.1.244:53 udp
-> endpoint 359 flow 0x0 identity remote-node->593 state new ifindex lxce71012cea7d0 orig-ip 10.0.2.15: 10.0.2.15:52390 -> 10.0.1.244:53 udp
<- stack flow 0x0 identity world->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:52390 -> 10.0.1.244:53 udp
-> endpoint 359 flow 0x0 identity world->593 state established ifindex 0 orig-ip 10.0.2.15: 10.0.2.15:52390 -> 10.0.1.244:53 udp
<- endpoint 359 flow 0x0 identity 593->unknown state new ifindex 0 orig-ip 0.0.0.0: 10.0.1.244:53 -> 10.0.2.15:52390 udp
-> stack flow 0x0 identity 593->host state reply ifindex 0 orig-ip 0.0.0.0: 10.0.1.244:53 -> 10.0.2.15:52390 udp


jibi commented Feb 4, 2021

Underlying problem: both nodes have the same IP for the enp0s3 interface:

vagrant@k8s1:~$ ip a s dev enp0s3 | grep inet
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
    inet6 fe80::a00:27ff:fe4e:92d0/64 scope link
vagrant@k8s2:~$ ip a s dev enp0s3 | grep inet
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
    inet6 fe80::a00:27ff:fe4e:92d0/64 scope link

So k8s2 is blackholing the traffic destined for k8s1.

Possible explanation for the flakiness: the test fails only when coredns is scheduled on k8s2.
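
To make the blackholing mechanism concrete: because 10.0.2.15 is also a local address on k8s2, the reply from coredns is delivered to k8s2's own stack instead of being tunneled back. A minimal, hypothetical Go sketch of that check (not part of the test suite):

// Hedged sketch: reports whether the reply's destination IP is assigned to a
// local interface. If it is (as with 10.0.2.15 on every Vagrant VM), the
// kernel delivers the reply locally and it never reaches the other node.
package main

import (
	"fmt"
	"net"
)

func isLocalAddress(ip net.IP) bool {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return false
	}
	for _, a := range addrs {
		if ipNet, ok := a.(*net.IPNet); ok && ipNet.IP.Equal(ip) {
			return true
		}
	}
	return false
}

func main() {
	replyDst := net.ParseIP("10.0.2.15") // source IP the unmasqueraded query arrived with
	if isLocalAddress(replyDst) {
		fmt.Println(replyDst, "is local to this node: the reply is consumed here instead of being sent back")
	} else {
		fmt.Println(replyDst, "is not local: the reply would be routed off-node")
	}
}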


jibi commented Feb 5, 2021

Setting enable-endpoint-routes to false stops k8s2 from blackholing the response traffic for k8s1:

vagrant@k8s1:~$ dig +short kubernetes.default.svc.cluster.local @10.96.0.10
10.96.0.1


pchaigno commented Feb 8, 2021

I validated that the following diff fixes the flake locally:

$ git diff
diff --git a/pkg/datapath/iptables/iptables.go b/pkg/datapath/iptables/iptables.go
index 787a527b3..10a68b361 100644
--- a/pkg/datapath/iptables/iptables.go
+++ b/pkg/datapath/iptables/iptables.go
@@ -980,7 +980,7 @@ func (m *IptablesManager) installMasqueradeRules(prog, ifName, localDeliveryInte
                m.waitArgs,
                "-t", "nat",
                "-A", ciliumPostNatChain,
-               "!", "-o", localDeliveryInterface,
+               "!", "-o", ifName,
                "-m", "comment", "--comment", "exclude non-"+ifName+" traffic from masquerade",
                "-j", "RETURN"), false); err != nil {
                return err

Matching this iptables rule causes packets to bypass masquerading and leave the node with enp0s3's IP as the source. The change in the rule's interface matcher (from ! -o cilium_host to ! -o lxc+ when per-endpoint routes are enabled) was introduced by commit c496e25.
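
For illustration, the shape of that early-exit rule can be sketched in Go. This is a hedged approximation, not the actual installMasqueradeRules code: it only shows how the output-interface matcher and the comment are assembled, and how the matcher flips from cilium_host to lxc+ once per-endpoint routes change the local delivery interface.

// Illustrative sketch only, not the real cilium code: builds the arguments of
// the CILIUM_POST_nat early-exit rule for the two configurations discussed
// above. The comment is derived from ifName while the matcher uses
// localDeliveryInterface, which is why the installed rule keeps saying
// "exclude non-cilium_host ..." even when it matches lxc+.
package main

import (
	"fmt"
	"strings"
)

func returnRuleArgs(ifName, localDeliveryInterface string) []string {
	return []string{
		"-t", "nat",
		"-A", "CILIUM_POST_nat",
		"!", "-o", localDeliveryInterface,
		"-m", "comment", "--comment",
		"exclude non-" + ifName + " traffic from masquerade",
		"-j", "RETURN",
	}
}

func main() {
	// Without per-endpoint routes: early exit only for traffic not leaving
	// via cilium_host, so the host->cluster SNAT rules still apply.
	fmt.Println("iptables " + strings.Join(returnRuleArgs("cilium_host", "cilium_host"), " "))
	// With per-endpoint routes: early exit for everything not leaving via
	// lxc+, which lets host traffic routed to cilium_host skip the SNAT rules.
	fmt.Println("iptables " + strings.Join(returnRuleArgs("cilium_host", "lxc+"), " "))
}

With the lxc+ matcher in place, a host-sourced packet that leaves via cilium_host hits the RETURN rule and never reaches the SNAT rules further down the chain, which matches the unmasqueraded traffic observed above.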

pchaigno assigned jibi and pchaigno and unassigned nebril Feb 9, 2021
pchaigno added a commit that referenced this issue Feb 9, 2021
--- Analysis ---

In tunneling mode, our CILIUM_POST_nat chain is currently as follows.

1. -A CILIUM_POST_nat -s 10.0.1.0/24 ! -d 10.0.0.0/8 ! -o cilium_+ -m comment --comment "cilium masquerade non-cluster" -j MASQUERADE
2. -A CILIUM_POST_nat ! -o cilium_host -m comment --comment "exclude non-cilium_host traffic from masquerade" -j RETURN
3. -A CILIUM_POST_nat -m mark --mark 0xa00/0xe00 -m comment --comment "exclude proxy return traffic from masquarade" -j ACCEPT
4. -A CILIUM_POST_nat ! -s 10.0.1.6/32 ! -d 10.0.1.0/24 -o cilium_host -m comment --comment "cilium host->cluster masquerade" -j SNAT --to-source 10.0.1.6
5. -A CILIUM_POST_nat -s 127.0.0.1/32 -o cilium_host -m comment --comment "cilium host->cluster from 127.0.0.1 masquerade" -j SNAT --to-source 10.0.1.6

The second rule implements an early exit from the chain, as none of the
subsequent rules match on output interfaces other than cilium_host.

Once per-endpoint routes are enabled in addition to tunneling, the chain
changes. The second and fifth rules now match on lxc+ as the output
interface:

1. -A CILIUM_POST_nat -s 10.0.1.0/24 ! -d 10.0.0.0/8 ! -o cilium_+ -m comment --comment "cilium masquerade non-cluster" -j MASQUERADE
2. -A CILIUM_POST_nat ! -o lxc+ -m comment --comment "exclude non-cilium_host traffic from masquerade" -j RETURN
3. -A CILIUM_POST_nat -m mark --mark 0xa00/0xe00 -m comment --comment "exclude proxy return traffic from masquarade" -j ACCEPT
4. -A CILIUM_POST_nat ! -s 10.0.1.6/32 ! -d 10.0.1.0/24 -o cilium_host -m comment --comment "cilium host->cluster masquerade" -j SNAT --to-source 10.0.1.6
5. -A CILIUM_POST_nat -s 127.0.0.1/32 -o lxc+ -m comment --comment "cilium host->cluster from 127.0.0.1 masquerade" -j SNAT --to-source 10.0.1.6

Commit c496e25 ("eni: Support masquerading") implemented that change,
based on the fact that with per-endpoint routes, packets are routed
directly to lxc devices without going through cilium_host.

Nevertheless, the fourth rule still matches on cilium_host and therefore
becomes a no-op. At the time c496e25 was implemented, this change was
correct because the fourth rule is only present when tunneling is
enabled and per-endpoint routes were not compatible with tunneling.
Commit 3179a47 ("datapath: Support enable-endpoint-routes with
encapsulation") however made those options compatible and the above
chain possible.

--- Fix ---

Ideally, we would update the second rule when running with tunneling and
per-endpoint routes, to be '! -o lxc+ ! -o cilium_host'. Iptables
however doesn't support multiple output interface matchers. This commit
implements a different fix and drops the second rule. Since subsequent
SNATing rules already match on an output interface, the second rule is
unnecessary. With tunneling and per-endpoint routes, the table now looks
like:

1. -A CILIUM_POST_nat -s 10.0.1.0/24 ! -d 10.0.0.0/8 ! -o cilium_+ -m comment --comment "cilium masquerade non-cluster" -j MASQUERADE
2. -A CILIUM_POST_nat -m mark --mark 0xa00/0xe00 -m comment --comment "exclude proxy return traffic from masquerade" -j ACCEPT
3. -A CILIUM_POST_nat ! -s 10.0.1.6/32 ! -d 10.0.1.0/24 -o cilium_host -m comment --comment "cilium host->cluster masquerade" -j SNAT --to-source 10.0.1.6
4. -A CILIUM_POST_nat -s 127.0.0.1/32 -o lxc+ -m comment --comment "cilium host->cluster from 127.0.0.1 masquerade" -j SNAT --to-source 10.0.1.6

--- Bug Impact ---

This lack of masquerading can cause issues for example when trying to
connect to a VIP with a remote backend from the hostns in our test VMs:

1. DNS request is made to VIP 10.96.0.10.
2. 10.0.2.15, the IP of enp0s3 (default route), is assigned as source IP.
3. kube-proxy translates VIP to backend IP on different node, e.g.
   10.0.0.87.
4. Packet is sent to cilium_host as per the ip routes.
5. The packet is not masqueraded because it matches rule 2 in the bogus
   iptables chain (i.e., cilium_host != lxc+).
6. The packet arrives as 10.0.2.15 -> 10.0.0.87 on the second node.
7. Second node tries to answer to 10.0.2.15 unsuccessfully (all nodes
   have the same IP 10.0.2.15 for enp0s3; that IP isn't routable across
   nodes).

This bug is described in #13774.

Fixes: #13774
Co-authored-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
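
A quick way to check whether a node still carries the problematic early-exit rule described above is to dump the nat table and look for the lxc+ RETURN rule. A hedged diagnostic sketch (requires root and assumes iptables-save is on the PATH):

// Diagnostic sketch only: flags the CILIUM_POST_nat early-exit rule that,
// with per-endpoint routes and tunneling enabled, lets host traffic leave
// unmasqueraded.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("iptables-save", "-t", "nat").Output()
	if err != nil {
		fmt.Println("failed to dump nat table:", err)
		return
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "-A CILIUM_POST_nat") &&
			strings.Contains(line, "! -o lxc+") &&
			strings.Contains(line, "-j RETURN") {
			fmt.Println("found early-exit rule that bypasses masquerading:")
			fmt.Println("  " + line)
		}
	}
}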

pchaigno added a commit that referenced this issue Feb 9, 2021 (with the same commit message as above)
pchaigno added a commit that referenced this issue Feb 9, 2021 (with the same commit message as above)
nathanjsweet pushed a commit that referenced this issue Feb 10, 2021 (with the same commit message as above)
lyveng pushed a commit to lyveng/cilium that referenced this issue Mar 4, 2021 (with the same commit message as above)