
identity: cache: close channel in writing party #25353

Merged
merged 1 commit into from May 11, 2023

Conversation

@bimmlerd (Member)

As part of the shutdown procedure involving IPCache and the identity allocation components, it was possible to hit a 'send on closed channel' panic, caused by writes of the localIdentityCache to the events channel, which is closed as part of the shutdown of the identity allocator.

Instead of directly closing the channel, call into the localIdentityCache (the only writer) to do so, with proper mutual exclusion guaranteed by the mutex.

The offending writes happened in lookupOrCreate as well as release, both of which take the mutex, and hence are correctly synchronised with the new close() method.
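The pattern described above can be sketched as follows. This is a minimal, hypothetical illustration of "only the writer closes, under the mutex" — the names `localIdentityCache` and `lookupOrCreate` mirror the PR, but the code is not Cilium's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

type event struct{ id int }

// localIdentityCache is the only party that sends on events, so it is
// also the only party allowed to close the channel.
type localIdentityCache struct {
	mu     sync.Mutex
	closed bool
	events chan event
}

// lookupOrCreate sends under the mutex; once close() has run, the send
// is skipped instead of panicking with 'send on closed channel'.
func (c *localIdentityCache) lookupOrCreate(id int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return
	}
	c.events <- event{id: id}
}

// close is called by the allocator's shutdown path instead of the
// allocator closing the channel directly.
func (c *localIdentityCache) close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return
	}
	c.closed = true
	close(c.events)
}

func main() {
	c := &localIdentityCache{events: make(chan event, 8)}
	c.lookupOrCreate(1)
	c.close()
	c.lookupOrCreate(2) // no panic: the send is skipped after close
	fmt.Println(len(c.events))
}
```

Because both senders and the close path acquire the same mutex, a send can never race with the close.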

Fixes: #25235

Since this affects the shutdown procedure, I believe this does not impact actual Cilium deployments, merely our integration tests; hence release-note/misc.

@bimmlerd bimmlerd added the release-note/misc This PR makes changes that have no direct user impact. label May 10, 2023
@bimmlerd (Member Author)

CI triage:

=== RUN   TestConformance/TLSRouteSimpleSameNamespace
    suite.go:258: Applying tests/tlsroute-simple-same-namespace.yaml
    apply.go:211: Creating gateway-conformance-infra-test TLSRoute
    apply.go:211: Creating gateway-tlsroute Gateway
    helpers.go:290: Gateway expected observedGeneration to be updated for all conditions, only 0/1 were updated. stale conditions are: Accepted
    helpers.go:441: Accepted condition set to Status False with Reason InvalidTLSRoute, expected Status True
    helpers.go:441: Accepted was not in conditions list [[{Accepted False 1 2023-05-10 12:36:01 +0000 UTC InvalidTLSRoute Gateway.gateway.networking.k8s.io "gateway-tlsroute" not found}]]
    helpers.go:441: Accepted condition set to Status False with Reason InvalidTLSRoute, expected Status True
    helpers.go:441: Accepted was not in conditions list [[{Accepted False 1 2023-05-10 12:36:01 +0000 UTC InvalidTLSRoute Gateway.gateway.networking.k8s.io "gateway-tlsroute" not found}]]
    helpers.go:441: Accepted condition set to Status False with Reason InvalidTLSRoute, expected Status True
    [SNIP, lots of repetition]
    helpers.go:441: Accepted was not in conditions list [[{Accepted False 1 2023-05-10 12:36:01 +0000 UTC InvalidTLSRoute Gateway.gateway.networking.k8s.io "gateway-tlsroute" not found}]]
    helpers.go:441: Accepted condition set to Status False with Reason InvalidTLSRoute, expected Status True
    helpers.go:441: Accepted was not in conditions list [[{Accepted False 1 2023-05-10 12:36:01 +0000 UTC InvalidTLSRoute Gateway.gateway.networking.k8s.io "gateway-tlsroute" not found}]]
    tlsroute-simple-same-namespace.go:51: 
        	Error Trace:	/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/kubernetes/helpers.go:445
        	            				/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/kubernetes/helpers.go:606
        	            				/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/tests/tlsroute-simple-same-namespace.go:51
        	            				/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/suite/suite.go:262
        	            				/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/suite/suite.go:220
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	Test:       	TestConformance/TLSRouteSimpleSameNamespace
        	Messages:   	error waiting for TLSRoute to have parents matching expectations
    apply.go:219: Deleting gateway-tlsroute Gateway
    apply.go:219: Deleting gateway-conformance-infra-test TLSRoute


     --- FAIL: TestConformance/TLSRouteSimpleSameNamespace (61.07s)
FAIL
FAIL	github.com/cilium/cilium/operator/pkg/gateway-api	117.095s
FAIL

rerunning to see if it's flaky or just broken by me

@bimmlerd (Member Author)

/test

@bimmlerd (Member Author)

bimmlerd commented May 10, 2023

/ci-verifier

GitHub Actions encountered an internal error

@bimmlerd bimmlerd marked this pull request as ready for review May 10, 2023 13:27
@bimmlerd bimmlerd requested a review from a team as a code owner May 10, 2023 13:27
@bimmlerd bimmlerd requested a review from nebril May 10, 2023 13:27
@gandro (Member) left a comment:

Thanks! One nit and one potentially blocking issue.

pkg/identity/cache/local.go
@@ -273,7 +273,9 @@ func (m *CachingIdentityAllocator) Close() {

m.IdentityAllocator.Delete()
if m.events != nil {
close(m.events)
@gandro (Member) May 10, 2023:
nit: Since we don't close m.events anymore, could we change type AllocatorEventChan chan AllocatorEvent into type AllocatorEventChan <-chan AllocatorEvent to disallow any code from accidentally closing it?

You might need to change the signature of newLocalIdentityCache, but all other receivers should hopefully be fine with a receive-only channel.
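The suggestion here relies on Go's directional channel types: a receive-only channel cannot be sent on or closed, and the compiler enforces this. A minimal sketch (the type and function names are hypothetical, not the final Cilium types):

```go
package main

import "fmt"

type AllocatorEvent struct{ ID int }

// A receive-only channel type: holders of this type can read events
// but cannot send on or close the underlying channel.
type AllocatorEventRecvChan <-chan AllocatorEvent

// consume drains the channel; attempting close(ch) here would be a
// compile-time error ("cannot close receive-only channel").
func consume(ch AllocatorEventRecvChan) int {
	n := 0
	for range ch {
		n++
	}
	return n
}

func main() {
	// The owning writer keeps the bidirectional channel and is the
	// only party able to close it; everyone else gets the recv view.
	events := make(chan AllocatorEvent, 2)
	events <- AllocatorEvent{ID: 1}
	events <- AllocatorEvent{ID: 2}
	close(events)
	fmt.Println(consume(events))
}
```

A bidirectional `chan AllocatorEvent` converts implicitly to the receive-only named type, so call sites that only read need no changes.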

@bimmlerd (Member Author):

Hmm, this is actually a bit more involved than I thought - looking into other uses of that type.

@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-id-alloc-panic branch from a8a86c6 to fb59470 Compare May 11, 2023 07:29
@bimmlerd bimmlerd requested a review from a team as a code owner May 11, 2023 07:29
@bimmlerd bimmlerd requested a review from giorio94 May 11, 2023 07:29
@bimmlerd bimmlerd marked this pull request as draft May 11, 2023 07:37
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-id-alloc-panic branch 2 times, most recently from 6b878f8 to 7c9c3b8 Compare May 11, 2023 08:28
@bimmlerd (Member Author)

bimmlerd commented May 11, 2023

I've redone the PR somewhat, to try to implement Sebastian's suggestion of encoding who writes/reads from the channel in the type system. It's a bit messy, however, since the caching identity allocator needs to pass its channel on to the actual allocator, hence needs to maintain both a recv/send variant of the same channel (or a bidirectional reference, hence losing the type safety). I'm on the fence in terms of which variant is better, feedback welcome.
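The "maintain both variants" option described above can be sketched like this: the caching allocator holds two directional views of one underlying channel, so the writer only ever sees a send side and the downstream allocator only a receive side. Field names here are hypothetical, illustrating the trade-off rather than the PR's actual code:

```go
package main

import "fmt"

type AllocatorEvent struct{ ID int }

// cachingIdentityAllocator keeps both directional views of the same
// underlying channel: a send side handed to the local cache (the
// writer) and a receive side passed on to the actual allocator.
type cachingIdentityAllocator struct {
	eventsSend chan<- AllocatorEvent
	eventsRecv <-chan AllocatorEvent
}

func newCachingIdentityAllocator() *cachingIdentityAllocator {
	// One channel, two typed references; neither side can misuse the
	// other's direction, at the cost of storing it twice.
	events := make(chan AllocatorEvent, 4)
	return &cachingIdentityAllocator{
		eventsSend: events,
		eventsRecv: events,
	}
}

func main() {
	a := newCachingIdentityAllocator()
	a.eventsSend <- AllocatorEvent{ID: 42}
	close(a.eventsSend) // closing via the send side is still legal
	ev := <-a.eventsRecv
	fmt.Println(ev.ID)
}
```

Note that a send-only channel can still be closed, which is why the close must remain confined to the single writing party rather than being prevented by the type system alone.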

@bimmlerd bimmlerd marked this pull request as ready for review May 11, 2023 08:56
@bimmlerd bimmlerd requested a review from gandro May 11, 2023 08:56
@gandro (Member) left a comment:

Awesome, thanks!

pkg/identity/cache/allocator.go (outdated)
As part of the shutdown procedure involving IPCache and the identity
allocation components, it was possible to hit a 'send on closed
channel' panic, caused by writes of the localIdentityCache to the events
channel, which is closed as part of the shutdown of the identity
allocator.

Instead of directly closing the channel after shutting down the
allocator, call into the localIdentityCache to do so, with proper mutual
exclusion guaranteed by its mutex.

The offending writes happened in 'lookupOrCreate' as well as 'release',
both of which take the mutex, and hence are correctly synchronised with
the new 'close()' method. The other writer is the allocator, but we
block on its shutdown before 'close()'.

Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-id-alloc-panic branch from 7c9c3b8 to fb9291f Compare May 11, 2023 11:07
@bimmlerd (Member Author)

/test

@bimmlerd (Member Author)

CI triage

  • ci-multicluster
Errors: 1 pods of DaemonSet cilium are not ready
Warnings:         cilium             cilium-tq5q9    pod is pending

Error: Unable to install Cilium: timeout while waiting for status to become successful: context deadline exceeded

Didn't really find anything interesting; in the k8s events there are some references to the node not being ready. No logs, cilium's init containers didn't even run.

@bimmlerd (Member Author)

/ci-multicluster

@@ -49,6 +49,9 @@ type CachingIdentityAllocator struct {

identitiesPath string

// This field exists is to hand out references that are either for sending
A Member commented:
Suggested change
// This field exists is to hand out references that are either for sending
// This field exists to hand out references that are either for sending

Nit.

Comment on lines +279 to +280
// Have the now only remaining writing party close the events channel,
// to ensure we don't panic with 'send on closed channel'.
@giorio94 (Member) May 11, 2023:
I'm not confident that this statement is guaranteed to be true in a clustermesh scenario. The events channel is also propagated to remote watchers, which might still be running when the main allocator is closed (the clustermesh subsystem is never stopped on shutdown). Likely not a big deal though, since that should be extremely rare and we are already shutting down. For this reason I wonder whether it is actually worth solving in this PR (it might be quite complex to fix).

@bimmlerd (Member Author):

Yeah, you're right, that's still broken.

It's kind of unclear to me what the ownership relations are in the clustermesh scenario. Does a m.IdentityAllocator own its remote caches (and hence, should Allocator.Delete() call .Close() on them)?

A Member replied:

Currently the cache corresponding to a given remote cluster is closed here, which happens when the associated cluster configuration is removed, but is not triggered when the agent is shut down.

The two possible alternatives seem to be either ensuring that the clustermesh subsystem is stopped before the allocator is closed, or making sure that m.IdentityAllocator also stops the remote caches when closed. I'm not sure how the first approach plays with the current structure, given that clustermesh has not yet been converted to a Cell module. The second approach instead requires protecting RemoteCache.Close() from being called twice, as it would otherwise panic.

Does a m.IdentityAllocator own its remote caches?

I would tend to say so, since a remote cache is pointless to exist without the main allocator (but their lifecycle is currently managed externally).
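The double-close protection mentioned above is commonly done with sync.Once, which makes Close idempotent so that both the clustermesh teardown and an allocator-driven shutdown could call it safely. A hypothetical sketch, not Cilium's actual RemoteCache:

```go
package main

import (
	"fmt"
	"sync"
)

// remoteCache sketches idempotent shutdown: sync.Once guarantees the
// done channel is closed at most once, so a second Close is a no-op
// rather than a 'close of closed channel' panic.
type remoteCache struct {
	closeOnce sync.Once
	done      chan struct{}
}

func (rc *remoteCache) Close() {
	rc.closeOnce.Do(func() {
		close(rc.done)
	})
}

func main() {
	rc := &remoteCache{done: make(chan struct{})}
	rc.Close()
	rc.Close() // safe: the Once has already fired
	<-rc.done  // does not block on a closed channel
	fmt.Println("closed")
}
```

With this in place, both owners (clustermesh config removal and allocator shutdown) could call Close without coordinating who goes first.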

@giorio94 (Member) left a comment:

One minor concern inline, but I feel that it might also be left as a follow-up, as it will likely require some refactoring around clustermesh.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 11, 2023
@tommyp1ckles tommyp1ckles merged commit efb6f56 into cilium:main May 11, 2023
57 checks passed
@bimmlerd bimmlerd deleted the pr/bimmlerd/fix-id-alloc-panic branch May 16, 2023 08:19
@bimmlerd bimmlerd self-assigned this May 17, 2023
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Development

Successfully merging this pull request may close these issues.

CI: daemon/cmd test panic'ed in TravisCI
5 participants