Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

contexthelpers: Fix deadlock when nobody recvs on success channel #13408

Merged
merged 1 commit into from
Oct 7, 2020

Conversation

brb
Copy link
Member

@brb brb commented Oct 5, 2020

It has been observed that the sessionSuccess <- true statement can block forever. E.g. with Cilium v1.7.9:

    goroutine 742 [chan send, 547 minutes]:
    github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000f66000, 0x2790b40, 0xc000e247c0, 0x0, 0x0)
            /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:657 +0x3e2
    github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func6(0x2790b40, 0xc000e247c0, 0x3aae820, 0x2)
            /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:819 +0x3e
    github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000f42500)
            /go/src/github.com/cilium/cilium/pkg/controller/controller.go:205 +0xa2a
    created by github.com/cilium/cilium/pkg/controller.(*Manager).updateController
            /go/src/github.com/cilium/cilium/pkg/controller/manager.go:120 +0xb09

This can happen when the context cancellation timer fires after a function which consumes the context has returned, but before sessionSuccess <- true is executed. After the timeout, the receiving goroutine is closed, making the sending block forever.

Fix this by making the sessionSuccess channel buffered. Note that after sending we don't check whether the context has been cancelled, as we expect that any subsequent functions which consume the context will check for the cancellation.

Fixes: 0262854 ("etcd: Fix incorrect context usage in session renewal")

@brb brb added kind/bug This is a bug in the Cilium logic. pending-review release-note/bug This PR fixes an issue in a previous release of Cilium. labels Oct 5, 2020
@brb brb requested review from a team as code owners October 5, 2020 13:39
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 5, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.13 Oct 5, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.11 Oct 5, 2020
@brb brb force-pushed the pr/brb/fix-etcd-session-success branch from 4384690 to 9a697ed Compare October 5, 2020 13:44
It has been observed that the "sessionSuccess <- true" statement can
block forever. E.g. Cilium v1.7.9:

    goroutine 742 [chan send, 547 minutes]:
    github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000f66000, 0x2790b40, 0xc000e247c0, 0x0, 0x0)
	    /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:657 +0x3e2
    github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func6(0x2790b40, 0xc000e247c0, 0x3aae820, 0x2)
	    /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:819 +0x3e
    github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000f42500)
	    /go/src/github.com/cilium/cilium/pkg/controller/controller.go:205 +0xa2a
    created by github.com/cilium/cilium/pkg/controller.(*Manager).updateController
	    /go/src/github.com/cilium/cilium/pkg/controller/manager.go:120 +0xb09

This can happen when the context cancellation timer fires after a
function which consumes the context has returned, but before
"sessionSuccess <- true" is executed. After the timeout, the receiving
goroutine is closed, making the sending block forever.

Fix this by making the sessionSuccess channel buffered. Note that after
sending we don't check whether the context has been cancelled, as we
expect that any subsequent functions which consume the context will
check for the cancellation.

Fixes: 0262854 ("etcd: Fix incorrect context usage in session renewal")
Signed-off-by: Martynas Pumputis <m@lambda.lt>
pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
@brb brb force-pushed the pr/brb/fix-etcd-session-success branch from 9a697ed to 746674e Compare October 5, 2020 13:59
@brb brb requested a review from aanm October 5, 2020 13:59
@brb
Copy link
Member Author

brb commented Oct 5, 2020

test-me-please

@brb brb requested a review from tgraf October 6, 2020 08:58
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 7, 2020
@aanm aanm merged commit 4a66961 into master Oct 7, 2020
@aanm aanm deleted the pr/brb/fix-etcd-session-success branch October 7, 2020 09:40
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 7, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.6 in 1.6.13 Oct 7, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.11 Oct 7, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.11 Oct 7, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.6 to Backport done to v1.6 in 1.6.13 Oct 7, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 9, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.6.13
Backport done to v1.6
1.7.11
Backport done to v1.7
1.8.5
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

5 participants