
CI: K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully fails on k8s-all #13773

Closed
nebril opened this issue Oct 27, 2020 · 13 comments · Fixed by #15205 or #15228

nebril commented Oct 27, 2020

every build on https://jenkins.cilium.io/job/cilium-master-K8s-all/ fails

/home/jenkins/workspace/cilium-master-K8s-all/1.14-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:514
Pod "testclient-gzk8w" can not connect to "http://google.com"
Expected command: kubectl exec -n 202010260914k8sdatapathconfigencapsulationcheckiptablesmasquera testclient-gzk8w -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://google.com -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" --retry 10 
To succeed, but it failed:
Exitcode: 1 
Err: context deadline exceeded
Stdout:
 	 
Stderr:
@nebril nebril added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Oct 27, 2020
nebril commented Oct 27, 2020

A focused test run also fails in the same way on 1.14, so it's unlikely this is infra/CI related: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Kernel-Focus/100/

@pchaigno pchaigno added this to In Progress (Cilium) in CI Force Oct 27, 2020
pchaigno commented Nov 19, 2020

This test validates a new Cilium flag (iptablesRandomFully) that is propagated to our masquerading iptables rules (as --random-fully). That flag should only affect the selection of the new source port when doing SNAT.
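
For reference, with the option enabled the masquerading rule installed by the agent should end up looking roughly like this (the pod CIDR shown is illustrative):

-A CILIUM_POST_nat -s 10.0.1.0/24 ! -d 10.0.0.0/8 ! -o cilium_+ -m comment --comment "cilium masquerade non-cluster" -j MASQUERADE --random-fully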

The test doesn't fail on our PRs but fails in k8s-all, which indicates the failure is somehow related to the Kubernetes version. So, it doesn't fail for 1.12, 1.19, and 1.18, but does fail for 1.14, 1.15, 1.16 at least. That could be due to some weird interaction with kube-proxy for those versions.

We weren't able to reproduce locally with the same K8s versions and kernel (4.9), even after multiple runs. There might be some timing factor that makes this harder.

Two things we can check next:

  • Compare the iptables rules for K8s versions that work and don't work (a possible way to do that is sketched after this list). We have enough versions of each that it may be possible to spot a difference that matters.
  • I don't think we have an equivalent test in CI without the new iptablesRandomFully option (i.e., Check iptables masquerading with random-fully without that option). It might be worth adding such a test to check if it also fails. If it also fails, then the flake has nothing to do with the new option.
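
A possible approach for the first point, assuming access to the clusters and using hypothetical Cilium pod names:

$ kubectl -n kube-system exec cilium-xxxxx -- iptables-save > iptables-1.18.txt   # working version
$ kubectl -n kube-system exec cilium-yyyyy -- iptables-save > iptables-1.14.txt   # failing version
$ diff -u iptables-1.18.txt iptables-1.14.txt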

@aditighag aditighag self-assigned this Nov 19, 2020
@jibi jibi self-assigned this Dec 15, 2020
@aditighag aditighag removed their assignment Dec 15, 2020
jibi added a commit that referenced this issue Dec 22, 2020
As suggested in the ticket, let's add a test for iptables masquerading
_without_ the random-fully option. This should tell us if the
"K8sDatapathConfig Encapsulation Check iptables masquerading with
random-fully" test is failing due to the random-fully option or not.

Related-to: #13773

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
aanm pushed a commit that referenced this issue Dec 28, 2020
jibi commented Jan 5, 2021

I don't think we have an equivalent test in CI without the new iptablesRandomFully option (i.e., Check iptables masquerading with random-fully without that option). It might be worth adding such a test to check if it also fails. If it also fails, then the flake has nothing to do with the new option.

In #14476 I added a test for masquerading without the random-fully iptables option, and the test runs fine on the 1.14, 1.15, and 1.16 k8s versions.

jibi commented Jan 7, 2021

Some updates here: I took a look at the iptables configuration to see if I can spot any difference between a k8s version that is known to work and one that does not.

I started from 1.18, which is known to work.

I reproduced the test locally with:

$ NFS=1 K8S_VERSION=1.18 KUBEPROXY=1 ginkgo --focus "K8sDatapathConfig.*Check iptables masquerading with random-fully" -- -cilium.provision=true -cilium.holdEnvironment=true -cilium.runQuarantined

Before that, I added a helpers.HoldEnvironment("") call at the end of the relevant test:

--- a/test/k8sT/DatapathConfiguration.go
+++ b/test/k8sT/DatapathConfiguration.go
@@ -279,6 +279,7 @@ var _ = Describe("K8sDatapathConfig", func() {
                        By("Test iptables masquerading")
                        Expect(testPodHTTPToOutside(kubectl, "http://google.com", false, false)).
                                Should(BeTrue(), "Connectivity test to http://google.com failed")
+                       helpers.HoldEnvironment("")
                })

After the test run I dumped the iptables configuration (1.18.txt) and it looks like there are no rules with the --random-fully option set:

$ rg random-fully 1.18.txt
$

although if I grep for --random-fully in the Cilium logs I get:

level=debug msg="Considering removing iptables rule" obj="-A CILIUM_POST_nat -s 10.0.1.0/24 ! -d 10.0.0.0/8 ! -o cilium_+ -m comment --comment \"cilium masquerade non-cluster\" -j MASQUERADE --random-fully" subsys=iptables
level=debug msg="Removing iptables rule" obj="[-w 5 -t nat -D CILIUM_POST_nat -s 10.0.1.0/24 ! -d 10.0.0.0/8 ! -o cilium_+ -m comment --comment cilium masquerade non-cluster -j MASQUERADE --random-fully]" subsys=iptables
level=debug msg="Considering removing iptables rule" obj="-A KUBE-POSTROUTING -m comment --comment \"kubernetes service traffic requiring SNAT\" -j MASQUERADE --random-fully" subsys=iptables
level=debug msg="Considering removing ip6tables rule" obj="-A CILIUM_POST_nat -s fd02::100/120 ! -d fd02::100/120 ! -o cilium_+ -m comment --comment \"cilium masquerade non-cluster\" -j MASQUERADE --random-fully" subsys=iptables
level=debug msg="Removing ip6tables rule" obj="[-w 5 -t nat -D CILIUM_POST_nat -s fd02::100/120 ! -d fd02::100/120 ! -o cilium_+ -m comment --comment cilium masquerade non-cluster -j MASQUERADE --random-fully]" subsys=iptables
level=debug msg="Considering removing ip6tables rule" obj="-A KUBE-POSTROUTING -m comment --comment \"kubernetes service traffic requiring SNAT\" -j MASQUERADE --random-fully" subsys=iptables

(I don't have context on why Cilium is suggesting that).

Repeating the same test on 1.14 (which is known not to work), I got the following iptables configuration: 1.14.txt. There is still no trace of rules with --random-fully:

$ rg random-fully 1.14.txt
$

But this time Cilium was not suggesting to remove the rules:

vagrant@k8s1:~$ ks logs $(cilium_pod k8s1) | grep random-fully
level=info msg="  --iptables-random-fully='true'" subsys=daemon

pchaigno commented Jan 7, 2021

That netfilter option has had somewhat lacking support in iptables in the past, so it may be worth checking that it is even supposed to appear in iptables rule dumps. It could be that in the 1.14 case, there are iptables rules with --random-fully but they are dumped without it? (Which I'd consider a bug upstream.)

The Considering removing ... messages come from removeCiliumRules. It might be worth checking if that is expected. If --random-fully rules are for some reason removed on 1.18, it could explain why 1.18 succeeds and 1.14 fails.

jibi commented Jan 8, 2021

Good shout @pchaigno 🙌
It looks like the iptables 1.6.1 binary shipped with the test VM (based on Ubuntu 18.04) does not support --random-fully:

vagrant@k8s1:~$ sudo iptables -t nat -A CILIUM_POST_nat -s 10.0.0.0/24 ! -d 10.0.0.0/8 ! -o cilium_+ -m comment --comment 'cilium masquerade non-cluster' -j MASQUERADE --random-fully
iptables v1.6.1: unknown option "--random-fully"
Try `iptables -h' or 'iptables --help' for more information.

There's even a bug tracked for that but it doesn't look like the fix landed in 18.04.

That said, I'm still unsure why this test runs fine on some k8s versions but not on others :/ are all CI images for 4.9 based on Ubuntu 18.04?
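
A quick way to check that on each VM would be something like the following (just a sketch; the output needs to be compared across the CI images):

$ grep PRETTY_NAME /etc/os-release
$ iptables -V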

pchaigno commented Jan 8, 2021

@jibi In this case, aren't we using the iptables binary shipped in our cilium-agent Docker image?

$ iptables -V
iptables v1.6.1
vagrant@k8s1:~$ ks exec cilium-9czg5 -- iptables -V
iptables v1.8.4 (legacy)

jibi commented Jan 8, 2021

Ah, that's right, I was looking at the output of the wrong iptables binary. Running it again from inside the container shows that the rule is there:

vagrant@k8s1:~$ ks exec -it $(cilium_pod k8s1) -- iptables-save | grep random-fully
-A CILIUM_POST_nat -s 10.0.0.0/24 ! -d 10.0.0.0/8 ! -o cilium_+ -m comment --comment "cilium masquerade non-cluster" -j MASQUERADE --random-fully

jibi commented Jan 29, 2021

Some updates here: I opened #14562 to try to catch this test failing in CI (as I am unable to do so in the local test environment), but without success. I think the next step would be to unquarantine the test so that maybe we can catch it failing in the k8s-all pipeline?

@pchaigno

It might make sense to wait for the split of k8s-all into multiple pipelines before unquarantining. That split might be enough to fix the flakes we have/had.

tklauser commented Mar 8, 2021

Not sure whether this is exactly the same flake, but I've just seen K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully and K8sDatapathConfig Encapsulation Check iptables masquerading without random-fully fail on #15241: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-4.19/4894/
7352980d_K8sDatapathConfig_Encapsulation_Check_iptables_masquerading_with_random-fully.zip
f6b24e3d_K8sDatapathConfig_Encapsulation_Check_iptables_masquerading_without_random-fully.zip

pchaigno commented Mar 8, 2021

Looks similar, let's reopen. I might have a fix anyway :-)

@pchaigno pchaigno reopened this Mar 8, 2021
CI Force automation moved this from Fixed / Done to In Progress (Cilium) Mar 8, 2021
pchaigno commented Mar 9, 2021

Here is what I think is happening:

The tests are failing because a client pod fails to resolve a domain name. If you reproduce locally, you'll see that connections from the CoreDNS pods to 8.8.8.8 time out.

The test that ran just before our failing tests is Check vxlan connectivity with per-endpoint routes. That should always be the case. That previous test enabled per-endpoint routes in the agent (enable-endpoint-routes). When per-endpoint routes are enabled, we install a route for each endpoint and attach a BPF program on egress of the lxc devices (i.e., on the path to the container). We also change the rules in CILIUM_FORWARD to:

-A CILIUM_FORWARD -o cilium_host -m comment --comment "cilium: any->cluster on cilium_host forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_host -m comment --comment "cilium: cluster->any on cilium_host forward accept (nodeport)" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_net -m comment --comment "cilium: cluster->any on cilium_net forward accept (nodeport)" -j ACCEPT
# Specific to endpoint routes:
-A CILIUM_FORWARD -o lxc+ -m comment --comment "cilium: any->cluster on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept (nodeport)" -j ACCEPT
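
(To inspect that per-endpoint state on a node, one can look for the per-endpoint routes and the tc egress programs; lxcXXXXX is a placeholder device name:)

$ ip route show | grep lxc               # one route per endpoint IP via its lxc device
$ tc filter show dev lxcXXXXX egress     # the BPF program attached on the path to the container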

Then, our failing test is started and disables per-endpoint routes. The above two rules are removed. Endpoints are restored from disk after the restart with their original datapath configuration, which includes the per-endpoint routes setting on a per-endpoint basis. So for restored endpoints (including the CoreDNS pods), per-endpoint routes remain enabled, with the egress lxc program and the Linux route still in place.

Reply DNS packets going to the CoreDNS pods match the per-endpoint route and are therefore routed to output interface lxcXXXXX. They don't match any rule in CILIUM_FORWARD because the -o lxc+ rule was removed when per-endpoint routes were disabled. Since the FORWARD chain defaults to DROP, those packets are dropped. The non-zero drop counter below (796 packets) validates this.

Chain FORWARD (policy DROP 796 packets, 130K bytes)
 pkts bytes target     prot opt in     out     source               destination         
 1635  173K CILIUM_FORWARD  all  --  any    any     anywhere             anywhere             /* cilium-feeder: CILIUM_FORWARD */
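
(For reference, counters like these can be read on the node with something like:)

$ sudo iptables -L FORWARD -v -n | head -3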

I'm not yet sure why the test failure doesn't happen 100% of the time. It could be that the CoreDNS pod is sometimes restarted and its stale endpoint route removed.
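
One way to check for that on an affected node (assuming the usual k8s-app=kube-dns label; COREDNS_IP is a placeholder for the CoreDNS pod IP):

$ kubectl -n kube-system get pod -l k8s-app=kube-dns -o wide   # find the CoreDNS pod IP and its node
$ ip route show | grep "$COREDNS_IP"                           # a stale per-endpoint route shows up as "<IP> dev lxc..."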

#15228 fixes this flake by:

  • Not restoring the endpoints' datapath configuration. Instead the current agent configuration is used.
  • Removing any existing endpoint route and egress lxc BPF program if per-endpoint routes are disabled in the endpoint (a quick way to spot-check this on a node is sketched below).
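
A rough sketch of that spot-check after the agent restart (device name is again a placeholder):

$ ip route show | grep lxc               # should no longer list per-endpoint routes via the lxc devices
$ tc filter show dev lxcXXXXX egress     # should no longer show an egress BPF program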

CI Force automation moved this from In Progress (Cilium) to Fixed / Done Mar 10, 2021