Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: RuntimePrivilegedUnitTests TestArpPingHandling failed, timeout reached while waiting for condition #16221

Closed
tklauser opened this issue May 19, 2021 · 7 comments · Fixed by #16432
Assignees
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!
Projects

Comments

@tklauser
Copy link
Member

tklauser commented May 19, 2021

	 node_linux_test.go:1207:
	     wait(nodev1.Identity(), nil, true)
	 node_linux_test.go:1048:
	     c.Assert(err, check.IsNil)
	 ... value *errors.errorString = &errors.errorString{s:"timeout reached while waiting for condition"} ("timeout reached while waiting for condition")
	 
	 START: node_linux_test.go:138: linuxPrivilegedIPv4OnlyTestSuite.TearDownTest
	 PASS: node_linux_test.go:138: linuxPrivilegedIPv4OnlyTestSuite.TearDownTest	0.052s
	 
	 FAIL: node_linux_test.go:956: linuxPrivilegedIPv4OnlyTestSuite.TestArpPingHandling

This failed on #16210 (a 1.10 backport PR) and it looks like #16072 was in place.

https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/4731/

test_results_Cilium-PR-Runtime-4.9_4731_BDD-Test-PR.zip

Previously: #16075 (looks like this one appeared again in another PR), #16040, #14640, #14125

/cc @brb

@tklauser tklauser added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels May 19, 2021
This was referenced May 19, 2021
@jibi
Copy link
Member

jibi commented Jun 4, 2021

I can reproduce this on a 4.9 dev VM:

vagrant@k8s1:~/go/src/github.com/cilium/cilium/pkg/datapath/linux$ while true; do sudo go test . -count=1 -tags=privileged_tests; done
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.123s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.848s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.791s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.562s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.968s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.763s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.837s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.844s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.558s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.144s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.722s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.922s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.899s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.127s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.784s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.709s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.577s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.542s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.522s
ok  	github.com/cilium/cilium/pkg/datapath/linux	9.027s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.844s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.807s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.689s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.831s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.105s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.834s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.651s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.013s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.519s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.765s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.796s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.721s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.925s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.958s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.699s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.658s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.711s
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.084s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.555s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.784s
ok  	github.com/cilium/cilium/pkg/datapath/linux	7.717s

----------------------------------------------------------------------
FAIL: node_linux_test.go:956: linuxPrivilegedIPv4OnlyTestSuite.TestArpPingHandling

node_linux_test.go:1207:
    wait(nodev1.Identity(), nil, true)
node_linux_test.go:1048:
    c.Assert(err, check.IsNil)
... value *errors.errorString = &errors.errorString{s:"timeout reached while waiting for condition"} ("timeout reached while waiting for condition")

OOPS: 21 passed, 1 FAILED
--- FAIL: Test (12.39s)
FAIL
FAIL	github.com/cilium/cilium/pkg/datapath/linux	12.403s
FAIL
ok  	github.com/cilium/cilium/pkg/datapath/linux	8.040s

@brb do you have any pointers/context from previous investigations about this flake?

@tklauser
Copy link
Member Author

tklauser commented Jun 4, 2021

@aditighag I assume this was closed by mistake? Or is this fixed already?

@jibi jibi reopened this Jun 4, 2021
@aditighag
Copy link
Member

I'm not sure how that happened. Sorry! 😕

@brb
Copy link
Member

brb commented Jun 4, 2021

do you have any pointers/context from previous investigations about this flake?

@jibi I think I know what's happening. We don't wait for goroutines (started at https://github.com/cilium/cilium/blob/v1.10.0/pkg/datapath/linux/node_linux_test.go#L1161-L1174) to finish. So they can keep updating the ARP entries making the wait to time out.

Could you try the following patch?

diff --git a/pkg/datapath/linux/node_linux_test.go b/pkg/datapath/linux/node_linux_test.go
index 20f965bf6..e7f3e287c 100644
--- a/pkg/datapath/linux/node_linux_test.go
+++ b/pkg/datapath/linux/node_linux_test.go
@@ -1201,6 +1201,7 @@ func (s *linuxPrivilegedIPv4OnlyTestSuite) TestArpPingHandling(c *check.C) {
        }
        // Cleanup
        close(done)
+       wg.Wait()
        now = time.Now()
        err = linuxNodeHandler.NodeDelete(nodev1)
        c.Assert(err, check.IsNil)

@jibi
Copy link
Member

jibi commented Jun 4, 2021

That seems to work 💥 the entry from linuxNodeHandler.neighNextHopByNode is now properly removed each time and the wait function does not timeout anymore. I'll fill a PR

@brb
Copy link
Member

brb commented Jun 4, 2021

Cool, thanks!

jibi added a commit that referenced this issue Jun 4, 2021
In TestArpPingHandling, wait for all goroutines that are inserting the
new neighbors to finish before deleting the node.

Fixes: #16221
Suggested-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
aanm pushed a commit that referenced this issue Jun 4, 2021
In TestArpPingHandling, wait for all goroutines that are inserting the
new neighbors to finish before deleting the node.

Fixes: #16221
Suggested-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
gandro pushed a commit to gandro/cilium that referenced this issue Jun 14, 2021
[ upstream commit 5a418a3 ]

In TestArpPingHandling, wait for all goroutines that are inserting the
new neighbors to finish before deleting the node.

Fixes: cilium#16221
Suggested-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro pushed a commit to gandro/cilium that referenced this issue Jun 14, 2021
[ upstream commit 5a418a3 ]

In TestArpPingHandling, wait for all goroutines that are inserting the
new neighbors to finish before deleting the node.

Fixes: cilium#16221
Suggested-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
aanm pushed a commit that referenced this issue Jun 16, 2021
[ upstream commit 5a418a3 ]

In TestArpPingHandling, wait for all goroutines that are inserting the
new neighbors to finish before deleting the node.

Fixes: #16221
Suggested-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
aditighag pushed a commit that referenced this issue Jun 16, 2021
[ upstream commit 5a418a3 ]

In TestArpPingHandling, wait for all goroutines that are inserting the
new neighbors to finish before deleting the node.

Fixes: #16221
Suggested-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
pchaigno pushed a commit to pchaigno/cilium that referenced this issue Jun 22, 2021
[ upstream commit 5a418a3 ]

In TestArpPingHandling, wait for all goroutines that are inserting the
new neighbors to finish before deleting the node.

Fixes: cilium#16221
Suggested-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
jibi added a commit that referenced this issue Jun 23, 2021
[ upstream commit 5a418a3 ]

In TestArpPingHandling, wait for all goroutines that are inserting the
new neighbors to finish before deleting the node.

Fixes: #16221
Suggested-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno assigned jibi and brb Nov 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!
Projects
No open projects
CI Force
  
Awaiting triage
Development

Successfully merging a pull request may close this issue.

5 participants