CI: RuntimeDatapathPrivilegedUnitTests Run Tests / TestControlPlane/IdentityGC #22470

Closed
jrajahalme opened this issue Dec 1, 2022 · 8 comments · Fixed by #22491
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

Comments

@jrajahalme
Member

Test Name

RuntimeDatapathPrivilegedUnitTests Run Tests
github.com/cilium/cilium/test/controlplane • TestControlPlane/IdentityGC

Failure Output

[19:26:19] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
	 [19:26:19] ┃   PANIC  package: github.com/cilium/cilium/test/controlplane • TestControlPlane/IdentityGC   ┃
	 [19:26:19] ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
	 [19:26:19] panic: close of closed channel                                                                  
	 [19:26:19]                                                                                                 
	 [19:26:19] goroutine 508 [running]:                                                                        
	 [19:26:19] github.com/cilium/cilium/operator/cmd.startSynchronizingCiliumNodes.func9()                     
	 [19:26:19] 	/home/vagrant/go/src/github.com/cilium/cilium/operator/cmd/cilium_node.go:225 +0x129            
	 [19:26:19] created by github.com/cilium/cilium/operator/cmd.startSynchronizingCiliumNodes                  
	 [19:26:19] 	/home/vagrant/go/src/github.com/cilium/cilium/operator/cmd/cilium_node.go:223 +0xd18            
	 [19:26:19] FAIL	github.com/cilium/cilium/test/controlplane	0.544s                                            
	 [19:26:19]                                                                                                 
	 [19:26:19] ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
	 [19:26:19] │  STATUS │ ELAPSED │                             PACKAGE                             │ COVER  │ PASS │ FAIL │ SKIP  │
	 [19:26:19] │─────────┼─────────┼─────────────────────────────────────────────────────────────────┼────────┼──────┼──────┼───────│
	 [19:26:19] │  PANIC  │  0.00s  │ github.com/cilium/cilium/test/controlplane                      │   --   │  --  │  --  │  --   │

Stack Trace

/home/jenkins/workspace/Cilium-PR-Runtime-net-next/runtime-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
Failed to run privileged unit tests
Expected command: bash -c 'sudo make -C /home/vagrant/go/src/github.com/cilium/cilium/ tests-privileged | ts "[%H:%M:%S]"; exit "${PIPESTATUS[0]}"' 
To succeed, but it failed:
Exitcode: 2 
Err: Process exited with status 2

Standard Output

[19:26:19] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
	 [19:26:19] ┃   PANIC  package: github.com/cilium/cilium/test/controlplane • TestControlPlane/IdentityGC   ┃
	 [19:26:19] ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
	 [19:26:19] panic: close of closed channel                                                                  
	 [19:26:19]                                                                                                 
	 [19:26:19] goroutine 508 [running]:                                                                        
	 [19:26:19] github.com/cilium/cilium/operator/cmd.startSynchronizingCiliumNodes.func9()                     
	 [19:26:19] 	/home/vagrant/go/src/github.com/cilium/cilium/operator/cmd/cilium_node.go:225 +0x129            
	 [19:26:19] created by github.com/cilium/cilium/operator/cmd.startSynchronizingCiliumNodes                  
	 [19:26:19] 	/home/vagrant/go/src/github.com/cilium/cilium/operator/cmd/cilium_node.go:223 +0xd18            
	 [19:26:19] FAIL	github.com/cilium/cilium/test/controlplane	0.544s                                            
	 [19:26:19]                                                                                                 
	 [19:26:19] ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
	 [19:26:19] │  STATUS │ ELAPSED │                             PACKAGE                             │ COVER  │ PASS │ FAIL │ SKIP  │
	 [19:26:19] │─────────┼─────────┼─────────────────────────────────────────────────────────────────┼────────┼──────┼──────┼───────│
	 [19:26:19] │  PANIC  │  0.00s  │ github.com/cilium/cilium/test/controlplane                      │   --   │  --  │  --  │  --   │

Standard Error

make: *** [Makefile:124: tests-privileged] Error 1

Resources

Anything else?

The "close of closed channel" panic happens here in operator/cmd/cilium_node.go:

	go func() {
		cache.WaitForCacheSync(ctx.Done(), ciliumNodeInformer.HasSynced)
		close(k8sCiliumNodesCacheSynced)
@jrajahalme jrajahalme added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Dec 1, 2022
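
For readers unfamiliar with this failure mode: the Go runtime panics whenever close is called on a channel that is already closed. The following is a minimal, standalone sketch of that behaviour, not the operator code; the function closeTwice and the local channel synced are hypothetical stand-ins for illustration only.

```go
package main

import "fmt"

// closeTwice closes the same channel twice and returns the recovered
// panic message. The function and the channel are hypothetical; they only
// illustrate the runtime behaviour seen in the log above.
func closeTwice() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	synced := make(chan struct{}) // stand-in for a "cache synced" channel
	close(synced)                 // first close succeeds
	close(synced)                 // second close panics
	return "no panic"
}

func main() {
	fmt.Println(closeTwice())
}
```

In the test, the two closes come from goroutines belonging to different operator starts rather than from two adjacent statements.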
@jrajahalme
Member Author

Looks like ControlPlaneTest.StopOperator() should wait for the goroutine at https://github.com/cilium/cilium/blob/master/operator/cmd/cilium_node.go#L223-L251 to finish before allowing StartOperator() to be called again?

@jrajahalme
Member Author

Another failure on the same test, panic on close of another channel: https://jenkins.cilium.io/job/Cilium-PR-Runtime-net-next/4131/

@joamaki
Contributor

joamaki commented Dec 1, 2022

Will take a look at fixing this. Btw, also related is: #21818.
EDIT: OK, added a fix. It's by no means the proper way to fix this. We should instead look into what Dylan suggested in #21818, but that is much more involved and applies to many other places as well.

joamaki added a commit to joamaki/cilium that referenced this issue Dec 2, 2022
The control-plane tests start and stop the operator multiple times,
which sometimes leads to double-close of k8sCiliumNodesCacheSynced as
startSynchronizingCiliumNodes forks off a goroutine that we do not wait
for. This fixes the issue by moving k8sCiliumNodesCacheSynced and
ciliumNodeManagerQueueSynced into a struct and passing that onto use-sites.

Fixes: cilium#22470
Signed-off-by: Jussi Maki <jussi@isovalent.com>
pippolo84 pushed a commit that referenced this issue Dec 2, 2022
The control-plane tests start and stop the operator multiple times,
which sometimes leads to double-close of k8sCiliumNodesCacheSynced as
startSynchronizingCiliumNodes forks off a goroutine that we do not wait
for. This fixes the issue by moving k8sCiliumNodesCacheSynced and
ciliumNodeManagerQueueSynced into a struct and passing that onto use-sites.

Fixes: #22470
Signed-off-by: Jussi Maki <jussi@isovalent.com>
@gandro
Member

gandro commented Dec 5, 2022

@joamaki @pippolo84 I think a similar issue is still occurring here:

https://app.travis-ci.com/github/cilium/cilium/jobs/590240001#L538
The branch already has your fix applied: https://github.com/tanberBro/cilium/tree/const

@jrajahalme jrajahalme reopened this Dec 6, 2022
@jrajahalme
Member Author

Still happens: https://github.com/cilium/cilium/pull/22570/checks?check_run_id=9914257029

Different spot, same problem (close of closed channel) at: https://github.com/cilium/cilium/blob/master/operator/watchers/pod.go#L140

@gandro
Member

gandro commented Dec 7, 2022

This seems very common on Travis now (also hit on master) and decreases the signal-to-noise ratio. Can we disable the test for now?

@pippolo84
Member

This PR should solve this issue, too.

@joestringer
Member

Following @pippolo84's pointer above, I'll close this as "fixed". If anyone observes this again later, we can always choose to reopen the investigation.

5 participants