Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: prevent panic on k8s services host fw test on some runs. #25747

Merged

Conversation

tommyp1ckles
Copy link
Contributor

@tommyp1ckles tommyp1ckles commented May 29, 2023

Change introduced 0e20d30 in order to provide workaround fix for flake can panic depending on the order which tests are run if the deploymentManager is not setup with a kubectl object.

k8s/services already deploys Cilium with the hostfirewall enabled, so this moves installing Cilium out of the host-few preperation step and defers that to the caller (such as in datapath_configuration).

Fixes #25775

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 29, 2023
@tommyp1ckles tommyp1ckles marked this pull request as ready for review May 29, 2023 18:27
@tommyp1ckles tommyp1ckles requested review from a team as code owners May 29, 2023 18:27
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-panic-on-svc-hostfw-test branch from 0ef3120 to ce7d345 Compare May 29, 2023 18:31
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles added the release-note/ci This PR makes changes to the CI. label May 29, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 29, 2023
Copy link
Contributor

@viktor-kurchenko viktor-kurchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@julianwiedmann
Copy link
Member

Do we have an example of such a panic? Or a related issue? So that we can trace past occurrences back to this PR and know that they should be fixed now.

Change introduced 0e20d30 in order to
provide workaround fix for flake can panic depending on the order which
tests are run if the deploymentManager is not setup with a kubectl
object.

k8s/services already deploys Cilium with the hostfirewall enabled, so
this moves installing Cilium out of the host-few preparation step and
defers that to the caller (such as in datapath_configuration).

As well, this was causing failures with hostfw K8sServices test because
the deploymentManager Cilium install procedure is more strict regarding
the ensure the liveness of Agents health endpoints.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-panic-on-svc-hostfw-test branch from ce7d345 to 5865474 Compare May 30, 2023 21:11
@tommyp1ckles
Copy link
Contributor Author

Do we have an example of such a panic? Or a related issue? So that we can trace past occurrences back to this PR and know that they should be fixed now.

Good point - was working off a log file someone sent me, ill open up an issue to track.

@tommyp1ckles
Copy link
Contributor Author

/test

@pchaigno
Copy link
Member

Do we have an example of such a panic? Or a related issue? So that we can trace past occurrences back to this PR and know that they should be fixed now.

I don't think we can see this in CI. It happened to Maxim (IIRC) locally because he wasn't running the tests in the same order as the CI or because he was running only this test.

@tommyp1ckles
Copy link
Contributor Author

@pchaigno We can see it here

@pchaigno
Copy link
Member

Oh, I stand corrected. So what is the random part here that makes this not fail 100% of the time?

@tommyp1ckles
Copy link
Contributor Author

So what is the random part here that makes this not fail 100% of the time?

I'm having a hard time figuring that out, I initially presumed that there was no set order to the tests running but that seemed to be wrong. In all cases it appears as though none of the tests that would call SetKubectl are called prior to the K8sDatapathServices ... with host policy test.

@tommyp1ckles tommyp1ckles added the needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch label May 31, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.3 May 31, 2023
@tommyp1ckles
Copy link
Contributor Author

I'll see if I can figure it out here: #25748

@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

tommyp1ckles commented May 31, 2023

/test-backport-1.13

Job 'Cilium-PR-K8s-1.24-kernel-4.19' failed:

Click to show.

Test Name

K8sAgentPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath

Failure Output

FAIL: cannot install connectivity-check

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-4.19/12/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

@julianwiedmann
Copy link
Member

@pchaigno We can see it here

Quoting so we don't lose it:

09:31:01  •! Panic in Spec Setup (BeforeEach) [161.815 seconds]
09:31:01  K8sDatapathServicesTest
09:31:01  /home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
09:31:01    Checks N/S loadbalancing
09:31:01    /home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
09:31:01      With host policy
09:31:01      /home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
09:31:01        Tests NodePort [BeforeEach]
09:31:01        /home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
09:31:01  
09:31:01        Test Panicked
09:31:01        runtime error: invalid memory address or nil pointer dereference
09:31:01        /home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:443
09:31:01  
09:31:01        Full Stack Trace
09:31:01        github.com/cilium/cilium/test/ginkgo-ext.beforeEach.func1.1()
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:443 +0x47
09:31:01        panic({0x20dcbc0, 0x3acecf0})
09:31:01        	/usr/local/go/src/runtime/panic.go:884 +0x213
09:31:01        github.com/cilium/cilium/test/k8s.DeployCiliumOptionsAndDNS(0x0, {0xc0009ea140?, 0x279e201?}, 0xc0031fecd0?)
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/k8s/assertion_helpers.go:204 +0x27
09:31:01        github.com/cilium/cilium/test/helpers.(*DeploymentManager).DeployCilium(0x3b22dc0, 0xc0005f3800?, 0x247d0a1?)
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/helpers/manifest.go:305 +0x3c
09:31:01        github.com/cilium/cilium/test/k8s.prepareHostPolicyEnforcement(0xc0005f32c0)
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/k8s/datapath_configuration.go:548 +0xa5
09:31:01        github.com/cilium/cilium/test/k8s.glob..func22.7.21.1()
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/k8s/services.go:589 +0xb2
09:31:01        github.com/cilium/cilium/test/ginkgo-ext.BeforeAll.func1()
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:195 +0x69
09:31:01        github.com/cilium/cilium/test/ginkgo-ext.beforeEach.func1()
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:447 +0xe6
09:31:01        github.com/cilium/cilium/test/ginkgo-ext.applyAdvice.func1({0x3b548a8, 0x0, 0x0})
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:558 +0x8a
09:31:01        github.com/cilium/cilium/test.Test(0xc00046ad00)
09:31:01        	/home/jenkins/workspace/cilium-v1.13-k8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/test_suite_test.go:112 +0x484
09:31:01        testing.tRunner(0xc00046ad00, 0x2597978)
09:31:01        	/usr/local/go/src/testing/testing.go:1576 +0x10b
09:31:01        created by testing.(*T).Run
09:31:01        	/usr/local/go/src/testing/testing.go:1629 +0x3ea

@julianwiedmann
Copy link
Member

This looks like we're just working around the missing SetKubectl(), no?

diff --git a/test/k8s/services.go b/test/k8s/services.go
index ea09d8869c..342a8ddc23 100644
--- a/test/k8s/services.go
+++ b/test/k8s/services.go
@@ -41,6 +41,7 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sDatapathServicesTest", func()
 
        BeforeAll(func() {
                kubectl = helpers.CreateKubectl(helpers.K8s1VMName(), logger)
+               deploymentManager.SetKubectl(kubectl)
 
                ni, err = helpers.GetNodesInfo(kubectl)
                Expect(err).Should(BeNil(), "Cannot get nodes info")
@@ -158,7 +159,6 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sDatapathServicesTest", func()
                        return helpers.RunsWithKubeProxyReplacement()
                }, "Tests NodePort inside cluster (kube-proxy)", func() {
                        SkipItIf(helpers.DoesNotRunOn419OrLaterKernel, "with IPSec and externalTrafficPolicy=Local", func() {
-                               deploymentManager.SetKubectl(kubectl)
                                deploymentManager.Deploy(helpers.CiliumNamespace, IPSecSecret)
                                DeployCiliumOptionsAndDNS(kubectl, ciliumFilename, map[string]string{
                                        "encryption.enabled": "true",

@tommyp1ckles
Copy link
Contributor Author

@julianwiedmann

This looks like we're just working around the missing SetKubectl(), no?

Basically, alternatively I could've added the SetKubectl call to that test but that didn't make much sense to me since that test doesn't use "deploymentManager" (unlike the hostfw DatapathConfig tests).

What I don't know is why there are the different approaches.

@tommyp1ckles
Copy link
Contributor Author

/test-1.26-net-next

@tommyp1ckles
Copy link
Contributor Author

tommyp1ckles commented May 31, 2023

/test-1.24-4.19

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathServicesTest Checks N/S loadbalancing Tests with XDP, direct routing, DSR with Geneve and Maglev

Failure Output

FAIL: Can not connect to service "http://[fd04::11]:30417" from outside cluster (1/10)

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/337/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@tommyp1ckles tommyp1ckles added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 1, 2023
@julianwiedmann julianwiedmann merged commit c05e6a4 into cilium:main Jun 1, 2023
70 of 71 checks passed
@sayboras sayboras mentioned this pull request Jun 2, 2023
8 tasks
@sayboras sayboras added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Jun 2, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.3 Jun 2, 2023
@YutaroHayakawa YutaroHayakawa added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Jun 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.3 Jun 7, 2023
@julianwiedmann
Copy link
Member

julianwiedmann commented Jun 8, 2023

This looks like we're just working around the missing SetKubectl(), no?

diff --git a/test/k8s/services.go b/test/k8s/services.go
index ea09d8869c..342a8ddc23 100644
--- a/test/k8s/services.go
+++ b/test/k8s/services.go
@@ -41,6 +41,7 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sDatapathServicesTest", func()
 
        BeforeAll(func() {
                kubectl = helpers.CreateKubectl(helpers.K8s1VMName(), logger)
+               deploymentManager.SetKubectl(kubectl)
 
                ni, err = helpers.GetNodesInfo(kubectl)
                Expect(err).Should(BeNil(), "Cannot get nodes info")
@@ -158,7 +159,6 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sDatapathServicesTest", func()
                        return helpers.RunsWithKubeProxyReplacement()
                }, "Tests NodePort inside cluster (kube-proxy)", func() {
                        SkipItIf(helpers.DoesNotRunOn419OrLaterKernel, "with IPSec and externalTrafficPolicy=Local", func() {
-                               deploymentManager.SetKubectl(kubectl)
                                deploymentManager.Deploy(helpers.CiliumNamespace, IPSecSecret)
                                DeployCiliumOptionsAndDNS(kubectl, ciliumFilename, map[string]string{
                                        "encryption.enabled": "true",

@aanm now also applied this variant of the fix via a2c2be5 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
No open projects
1.13.3
Backport done to v1.13
Development

Successfully merging this pull request may close these issues.

CI: Suite-k8s-1.26.K8sDatapathServicesTest Checks N/S loadbalancing With host policy Tests NodePort
7 participants