
daemon: Fix the "close of closed channel" panic #11056

Merged
merged 1 commit into from Apr 23, 2020

Conversation

Sen666666

When IPv4 and IPv6 are both enabled with the CRD IPAM option, cilium panics
with "close of closed channel" after restoring old endpoints.

We should check whether the channel has already been closed before closing
it.

Signed-off-by: yangxingwu yangxingwu@kuaishou.com
Signed-off-by: huangxuesen huangxuesen@kuaishou.com
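
In Go, closing a channel that has already been closed always panics; a minimal standalone reproduction of that behavior (independent of Cilium's actual types):

```go
package main

func main() {
	ch := make(chan struct{})
	close(ch) // first close succeeds
	close(ch) // panics: "close of closed channel"
}
```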

@Sen666666 Sen666666 requested a review from a team as a code owner April 20, 2020 09:37
@maintainer-s-little-helper

Please set the appropriate release note label.

@aanm
Member

aanm commented Apr 20, 2020

@Sen666666 Thank you for your PR. Do you have the golang's stack trace for this panic?

@aanm aanm added kind/bug This is a bug in the Cilium logic. pending-review release-note/bug This PR fixes an issue in a previous release of Cilium. labels Apr 20, 2020
@Sen666666
Author

@aanm yeap

cilium-agent[25601]: panic: close of closed channel
cilium-agent[25601]: goroutine 1 [running]:
cilium-agent[25601]: github.com/cilium/cilium/pkg/ipam.(*crdAllocator).RestoreFinished(0xc000963540)
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/pkg/ipam/crd.go:685 +0x32
cilium-agent[25601]: github.com/cilium/cilium/daemon/cmd.NewDaemon(0x2773260, 0xc0000ca140, 0x27b6680, 0xc00017d8f0, 0x2790960, 0xc0003fbc60, 0x27b6680, 0xc00017d8f0)
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/daemon/cmd/daemon.go:489 +0x21f4
cilium-agent[25601]: github.com/cilium/cilium/daemon/cmd.runDaemon()
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/daemon/cmd/daemon_main.go:1202 +0x34a
cilium-agent[25601]: github.com/cilium/cilium/daemon/cmd.glob..func1(0x3aa0aa0, 0xc0004ee7e0, 0x0, 0x11)
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/daemon/cmd/daemon_main.go:103 +0x7b
cilium-agent[25601]: github.com/spf13/cobra.(*Command).execute(0x3aa0aa0, 0xc0000cc010, 0x11, 0x11, 0x3aa0aa0, 0xc0000cc010)
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:830 +0x2aa
cilium-agent[25601]: github.com/spf13/cobra.(*Command).ExecuteC(0x3aa0aa0, 0xc00009cde0, 0x0, 0x0)
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
cilium-agent[25601]: github.com/spf13/cobra.(*Command).Execute(...)
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:864
cilium-agent[25601]: github.com/cilium/cilium/daemon/cmd.Execute()
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/daemon/cmd/daemon_main.go:134 +0x184
cilium-agent[25601]: main.main()
cilium-agent[25601]: /home/vagrant/go/src/github.com/cilium/cilium/daemon/main.go:22 +0x20

@coveralls

coveralls commented Apr 20, 2020


Coverage increased (+0.02%) to 44.683% when pulling 11e6809 on Sen666666:master into 5c0b426 on cilium:master.

Member

@aanm aanm left a comment


@Sen666666 thanks for the stack trace. I think using a once.Do will be more readable.

@@ -623,5 +623,10 @@ func (a *crdAllocator) Dump() (map[string]string, string) {

// RestoreFinished marks the status of restoration as done
func (a *crdAllocator) RestoreFinished() {
close(a.store.restoreFinished)
Member

Please use a Once.Do here, it will guarantee that this call is only executed once. (The Once will be a field of nodeStore.)
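
For reference, a minimal sketch of the suggested pattern, with simplified field and type names that are assumptions for illustration rather than the actual Cilium code:

```go
package main

import (
	"fmt"
	"sync"
)

// nodeStore is a simplified stand-in for the real nodeStore in pkg/ipam;
// only the fields needed to illustrate the pattern are shown.
type nodeStore struct {
	restoreFinished  chan struct{} // closed once endpoint restoration is done
	restoreCloseOnce sync.Once     // guards the close so it runs exactly once
}

// markRestoreFinished (hypothetical name) can safely be called by both the
// IPv4 and IPv6 allocators sharing the same store: sync.Once guarantees the
// channel is closed exactly once, so a second call no longer panics.
func (s *nodeStore) markRestoreFinished() {
	s.restoreCloseOnce.Do(func() {
		close(s.restoreFinished)
	})
}

func main() {
	s := &nodeStore{restoreFinished: make(chan struct{})}
	s.markRestoreFinished() // e.g. called for the IPv6 allocator
	s.markRestoreFinished() // second call (IPv4 allocator) is a no-op
	<-s.restoreFinished
	fmt.Println("restore finished, channel closed exactly once")
}
```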

Author

Thanks for the suggestion, I will make the modification.

@aanm aanm added kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. needs-backport/1.7 labels Apr 20, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.3 Apr 20, 2020
@aanm
Member

aanm commented Apr 20, 2020

Regression introduced by #9888

@Sen666666
Author

use once.Do to make the check more readable

When IPv4 and IPv6 are both enabled with the CRD IPAM option, cilium panics
with "close of closed channel" after restoring old endpoints.

Use once.Do to make sure the close happens only once.

Signed-off-by: yangxingwu <yangxingwu@kuaishou.com>
Signed-off-by: huangxuesen <huangxuesen@kuaishou.com>
@pchaigno
Member

pchaigno commented Apr 21, 2020

test-me-please

GKE failed with:

Cilium cannot be installed
Expected
    <*errors.errorString | 0xc0004c6100>: {
        s: "could not find pods in namespace 'cilium' with label 'io.cilium/app=operator': signal: killed",
    }
to be nil

@pchaigno
Member

test-gke

Contributor

@jaffcheng jaffcheng left a comment

Sorry for the regression I introduced. I think we should ensure the endpoints of both stacks are restored.

@@ -623,5 +624,7 @@ func (a *crdAllocator) Dump() (map[string]string, string) {

// RestoreFinished marks the status of restoration as done
func (a *crdAllocator) RestoreFinished() {
Contributor

The aim here is to restore all endpoints before triggering the sync with upstream, so I think we should close a.store.restoreFinished only after both the IPv6Allocator and IPv4Allocator restorations have finished. They are called from here:

cilium/daemon/cmd/daemon.go

Lines 491 to 498 in 90b256e

if option.Config.IPAM == option.IPAMCRD || option.Config.IPAM == option.IPAMENI || option.Config.IPAM == option.IPAMAzure {
	if option.Config.EnableIPv6 {
		d.ipam.IPv6Allocator.RestoreFinished()
	}
	if option.Config.EnableIPv4 {
		d.ipam.IPv4Allocator.RestoreFinished()
	}
}
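
As an illustration of this follow-up idea (not what this PR implements), one minimal sketch would count down the allocators sharing the store and close the channel only when the last one finishes; all names here are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// nodeStore is a simplified, hypothetical stand-in for the shared store.
type nodeStore struct {
	mutex            sync.Mutex
	restoreFinished  chan struct{} // closed only after all allocators restored
	restorationsLeft int           // allocators that still have to finish
}

func newNodeStore(allocators int) *nodeStore {
	return &nodeStore{
		restoreFinished:  make(chan struct{}),
		restorationsLeft: allocators,
	}
}

// allocatorRestoreFinished records that one allocator (IPv4 or IPv6) is done
// and closes the channel once the last one has reported in.
func (s *nodeStore) allocatorRestoreFinished() {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	if s.restorationsLeft == 0 {
		return // already closed; extra calls are harmless
	}
	s.restorationsLeft--
	if s.restorationsLeft == 0 {
		close(s.restoreFinished)
	}
}

func main() {
	store := newNodeStore(2) // IPv4 and IPv6 allocators share the same store

	store.allocatorRestoreFinished() // IPv6 allocator done
	store.allocatorRestoreFinished() // IPv4 allocator done

	<-store.restoreFinished // unblocks only after both allocators finished
	fmt.Println("sync with upstream can start")
}
```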

Member

@jaffcheng @Sen666666 Looking at the code, this comment seems to be a potential further fix that should go in at some point, but does not directly interact with this specific PR #11056 that just fixes a specific crash issue. I'd like to merge this PR as-is to fix the crash, then you can follow up to address these comments. Does that sound reasonable?

Contributor

Yes, let's address this comment in a follow-up PR.

@pchaigno
Member

test-gke

@pchaigno
Member

Merging despite the GKE build being stuck because 1) GKE builds are no longer required to merge, and 2) failure n-1 was flake #11117 and affected a different test case than failure n-2.

@pchaigno pchaigno merged commit 2412167 into cilium:master Apr 23, 2020
1.8.0 automation moved this from In progress to Merged Apr 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.3 Apr 28, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.3 Apr 29, 2020