CI: K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully fails on k8s-all #13773

Every build on https://jenkins.cilium.io/job/cilium-master-K8s-all/ fails.

Comments
A focused test run also fails in the same way on 1.14, so it's unlikely this is infra/CI related: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Kernel-Focus/100/ |
This test validates a new Cilium flag (`--iptables-random-fully`; a sketch of the resulting rule is below). The test doesn't fail on our PRs but fails in k8s-all, which indicates the failure is somehow related to the Kubernetes version. So it doesn't fail for 1.12, 1.18, and 1.19, but does fail for at least 1.14, 1.15, and 1.16. That could be due to some weird interaction with kube-proxy in those versions. We weren't able to reproduce locally with the same K8s versions and kernel (4.9), even after multiple runs. There might be some timing factor that makes this harder. Two things we can check next:
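Separately, for reference, a minimal sketch of the kind of rule the flag is expected to install; the chain name, CIDR, and match details below are illustrative, not taken from this run:

```
# Illustrative iptables-save excerpt: masquerading with fully randomized
# source-port allocation (needs kernel >= 3.13 and iptables >= 1.6.2)
-A CILIUM_POST -s 10.0.0.0/8 ! -d 10.0.0.0/8 -m comment --comment "cilium masquerade non-cluster" -j MASQUERADE --random-fully
```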
|
As suggested in the ticket, let's add a test for iptables masquerading _without_ the random-fully option. This should tell us if the "K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully" test is failing due to the random-fully option or not. Related-to: #13773 Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
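The two variants can then be compared back to back; a sketch reusing the focus strings from this thread (the flags mirror the ginkgo invocation quoted later in this issue; the second run reuses the provisioned cluster):

```
$ K8S_VERSION=1.14 ginkgo --focus "K8sDatapathConfig.*Check iptables masquerading with random-fully" -- -cilium.provision=true
$ K8S_VERSION=1.14 ginkgo --focus "K8sDatapathConfig.*Check iptables masquerading without random-fully" -- -cilium.provision=false
```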
In #14476 I added a test for masquerading without the random-fully iptables option, and the test is running fine on the 1.14, 1.15, and 1.16 k8s versions:
|
Some updates here: I took a look at the iptables configuration to see if I could spot any difference between a k8s version that is known to work and one that is not. I started from 1.18, which is known to work. I reproduced the test locally with:

```
$ NFS=1 K8S_VERSION=1.18 KUBEPROXY=1 ginkgo --focus "K8sDatapathConfig.*Check iptables masquerading with random-fully" -- -cilium.provision=true -cilium.holdEnvironment=true -cilium.runQuarantined
```

Before that, I added a `helpers.HoldEnvironment("")` call to keep the environment around after the test:

```diff
--- a/test/k8sT/DatapathConfiguration.go
+++ b/test/k8sT/DatapathConfiguration.go
@@ -279,6 +279,7 @@ var _ = Describe("K8sDatapathConfig", func() {
 			By("Test iptables masquerading")
 			Expect(testPodHTTPToOutside(kubectl, "http://google.com", false, false)).
 				Should(BeTrue(), "Connectivity test to http://google.com failed")
+			helpers.HoldEnvironment("")
 		})
```

After the test run I dumped the iptables configuration (1.18.txt), and it looks like there are no rules with `--random-fully`:

```
$ rg random-fully 1.18.txt
$
```

although if I grep for random-fully in the cilium-agent logs, Cilium is suggesting to remove the rules (I don't have context on why Cilium is suggesting that).

Repeating the same test on 1.14 (which is known for not working), I got the following iptables configuration: 1.14.txt. There is still no trace of rules with `--random-fully`:

```
$ rg random-fully 1.14.txt
$
```

But this time Cilium was not suggesting to remove the rules:

```
vagrant@k8s1:~$ ks logs $(cilium_pod k8s1) | grep random-fully
level=info msg="  --iptables-random-fully='true'" subsys=daemon
```
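To look for differences beyond random-fully, the two dumps can also be diffed directly. A sketch (the sed pass zeroes per-rule counters in case the dumps were taken with `iptables-save -c`; otherwise it's a no-op):

```
$ sed 's/\[[0-9]*:[0-9]*\]/[0:0]/' 1.18.txt > 1.18.norm
$ sed 's/\[[0-9]*:[0-9]*\]/[0:0]/' 1.14.txt > 1.14.norm
$ diff -u 1.18.norm 1.14.norm
```
|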
That netfilter option has had somewhat lacking support in iptables in the past, so it may be worth checking that it is even supposed to appear in iptables rule dumps. It could be that in the 1.14 case, there are iptables rules with `--random-fully` that the iptables binary simply doesn't print.
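A quick way to check what the iptables binary under test understands is to ask the MASQUERADE extension for its help text; a sketch (as far as I know, `--random-fully` first shipped in iptables 1.6.2):

```
$ iptables --version
$ iptables -t nat -j MASQUERADE --help 2>&1 | grep -i random
```
|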
Good shout @pchaigno 🙌
There's even a bug tracked for that, but it doesn't look like the fix landed in 18.04. That said, I'm still unsure why this test runs fine on some k8s versions but not on others :/ Are all CI images for 4.9 based on Ubuntu 18.04?
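That should be easy to verify on the VMs themselves; a sketch, assuming shell access to a 4.9 CI node:

```
$ lsb_release -ds    # distro the image is based on
$ uname -r           # confirm the 4.9 kernel
$ iptables --version # iptables shipped by the host
```
|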
@jibi In this case, aren't we using the iptables binary shipped in our cilium-agent Docker image?
|
Ah, that's right, I was looking at the wrong iptables output. Running it again from inside the container shows that the rule is there:
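For the record, a sketch of that check (the pod name is hypothetical; the agent runs in kube-system):

```
# Dump nat rules using the iptables binary shipped in the cilium-agent image
$ kubectl -n kube-system exec cilium-abc12 -- iptables-save -t nat | grep random-fully
```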
|
Some updates here: I opened #14562 to try to catch this test failing in CI (as I am unable to do so in the local test environment), without success. I think the next step would be to unquarantine the test so that maybe we can catch it failing in the k8s-all pipeline? |
It might make sense to wait for the split of k8s-all into multiple pipelines before unquarantining. That split might be enough to fix the flakes we have/had. |
Not sure whether this is exactly the same flake, but I've just seen K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully and K8sDatapathConfig Encapsulation Check iptables masquerading without random-fully fail on #15241: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-4.19/4894/ |
Looks similar, let's reopen. I might have a fix anyway :-) |
Here is what I think is happening. The tests are failing because a client pod fails to resolve a domain name. If you reproduce locally, you'll see that connections from the CoreDNS pods to 8.8.8.8 time out.

The test that ran just before our failing tests runs with per-endpoint routes enabled and leaves the following CILIUM_FORWARD rules installed:

```
-A CILIUM_FORWARD -o cilium_host -m comment --comment "cilium: any->cluster on cilium_host forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_host -m comment --comment "cilium: cluster->any on cilium_host forward accept (nodeport)" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_net -m comment --comment "cilium: cluster->any on cilium_net forward accept (nodeport)" -j ACCEPT
# Specific to endpoint routes:
-A CILIUM_FORWARD -o lxc+ -m comment --comment "cilium: any->cluster on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept (nodeport)" -j ACCEPT
```

Then, our failing test is started and disables per-endpoint routes. The two rules above that are specific to endpoint routes are removed. Endpoints are restored from disk after the restart with their original datapath configuration, which includes the endpoint-routes setting on a per-endpoint basis. So for restored endpoints (including the CoreDNS pods), per-endpoint routes remain enabled, with the egress lxc program and the Linux route. Reply DNS packets going to CoreDNS match the endpoint route and are therefore routed to the lxc output interface; since the `-o lxc+` accept rule is now gone, they presumably get dropped in the FORWARD chain.
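A sketch of how to confirm the stale state on an affected node (the pod name is hypothetical):

```
$ ip route | grep lxc            # per-endpoint routes that survived the config change
$ iptables -S CILIUM_FORWARD     # the endpoint-routes accept rules should be gone
$ kubectl -n kube-system exec cilium-abc12 -- cilium endpoint list
```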
I'm not yet sure why the test failure doesn't happen 100% of the time. It could be that the CoreDNS pod is sometimes restarted and its stale endpoint route removed. #15228 fixes this flake by:
|