identity: cache: close channel in writing party #25353
Conversation
CI triage:
rerunning to see if it's flaky or just broken by me
/test
/ci-verifier — GitHub Actions encountered an internal error
Thanks! One nit and one potentially blocking issue.
```diff
@@ -273,7 +273,9 @@ func (m *CachingIdentityAllocator) Close() {

 	m.IdentityAllocator.Delete()
 	if m.events != nil {
 		close(m.events)
```
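For context, the underlying hazard is closing a channel that another goroutine still sends on. A generic, deliberately crashing illustration (plain Go, not the actual Cilium code):

```go
package main

import "time"

func main() {
	events := make(chan int)

	// Writer goroutine: keeps sending events.
	go func() {
		for i := 0; ; i++ {
			events <- i // panics with "send on closed channel" after the close below
		}
	}()

	// Reader goroutine: drains events.
	go func() {
		for range events {
		}
	}()

	time.Sleep(10 * time.Millisecond)
	close(events) // closing on the non-writing side races with the send above
	time.Sleep(10 * time.Millisecond)
}
```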
nit: Since we don't close m.events anymore, could we make type AllocatorEventChan chan AllocatorEvent into type AllocatorEventChan <-chan AllocatorEvent to disallow any code from accidentally closing it?
You might need to change the signature of newLocalIdentityCache, but all other receivers should hopefully be fine with a receive-only channel.
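A minimal sketch of that suggestion (illustrative names, not Cilium's actual definitions), showing how a receive-only channel type turns an accidental close into a compile-time error:

```go
package main

// AllocatorEvent stands in for the real event type.
type AllocatorEvent struct{}

// Receive-only channel type: holders can receive events, but any attempt to
// send on or close the channel is rejected by the compiler.
type AllocatorEventChan <-chan AllocatorEvent

func consume(events AllocatorEventChan) {
	for range events {
		// handle the event
	}
	// close(events) // compile error: cannot close receive-only channel
}

func main() {
	// The owning side keeps the bidirectional handle and is the only party
	// allowed to close it.
	ch := make(chan AllocatorEvent, 1)
	ch <- AllocatorEvent{}
	close(ch)
	consume(ch) // callers only ever see the receive-only view
}
```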
Hmm, this is actually a bit more involved than I thought - looking into other uses of that type.
Force-pushed from a8a86c6 to fb59470
Force-pushed from 6b878f8 to 7c9c3b8
I've redone the PR somewhat, to try to implement Sebastian's suggestion of encoding who writes to/reads from the channel in the type system. It's a bit messy, however, since the caching identity allocator needs to pass its channel on to the actual allocator, and hence needs to maintain both a receive and a send variant of the same channel (or a single bidirectional reference, thereby losing the type safety). I'm on the fence in terms of which variant is better; feedback welcome.
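Roughly, the "both variants of one channel" approach looks like this sketch (hypothetical names; the actual PR code differs):

```go
package main

// AllocatorEvent stands in for the real event type; all names here are
// hypothetical.
type AllocatorEvent struct{}

type (
	AllocatorEventSendChan = chan<- AllocatorEvent // for the writing party
	AllocatorEventRecvChan = <-chan AllocatorEvent // for consumers
)

// cachingAllocator keeps two typed views of one underlying channel.
type cachingAllocator struct {
	eventsSend AllocatorEventSendChan // passed down to the actual allocator
	eventsRecv AllocatorEventRecvChan // handed out to readers
}

func newCachingAllocator() *cachingAllocator {
	events := make(chan AllocatorEvent, 1024)
	return &cachingAllocator{
		eventsSend: events, // same channel, send-only view
		eventsRecv: events, // same channel, receive-only view
	}
}

func main() {
	a := newCachingAllocator()
	a.eventsSend <- AllocatorEvent{}
	<-a.eventsRecv
}
```

Handing out eventsSend only to the writer and eventsRecv only to consumers lets the compiler enforce the direction, at the cost of storing two references to the same channel.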
Awesome, thanks!
As part of the shutdown procedure involving IPCache and the identity allocation components, it was possible to hit a 'send on closed channel' panic, caused by writes of the localIdentityCache to the events channel, which is closed as part of the shutdown of the identity allocator.

Instead of directly closing the channel after shutting down the allocator, call into the localIdentityCache to do so, with proper mutual exclusion guaranteed by its mutex. The offending writes happened in 'lookupOrCreate' as well as 'release', both of which take the mutex and hence are correctly synchronised with the new 'close()' method. The other writer is the allocator, but we block on its shutdown before calling 'close()'.

Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
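The pattern the commit message describes, as a self-contained sketch (assumed names and simplified structure; the real localIdentityCache differs):

```go
package main

import "sync"

// AllocatorEvent stands in for the real event type.
type AllocatorEvent struct{}

type localIdentityCache struct {
	mu     sync.Mutex
	closed bool
	events chan AllocatorEvent
}

// notify corresponds to the sends in lookupOrCreate/release: it holds the
// mutex, so it can never race with close below.
func (c *localIdentityCache) notify(ev AllocatorEvent) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return // already shut down; drop the event
	}
	c.events <- ev
}

// close is invoked from the allocator's shutdown path once all other writers
// have stopped; the mutex serialises it against in-flight notify calls.
func (c *localIdentityCache) close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.closed {
		close(c.events)
		c.closed = true
	}
}

func main() {
	c := &localIdentityCache{events: make(chan AllocatorEvent, 16)}
	c.notify(AllocatorEvent{})
	c.close()
	c.close() // idempotent: no 'close of closed channel' panic
}
```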
Force-pushed from 7c9c3b8 to fb9291f
/test
CI triage:
Didn't really find anything interesting; in the k8s events there are some references to the node not being ready. No logs; cilium's init containers didn't even run.
/ci-multicluster
```diff
@@ -49,6 +49,9 @@ type CachingIdentityAllocator struct {

 	identitiesPath string

 	// This field exists is to hand out references that are either for sending
```
Suggested change:
```diff
-// This field exists is to hand out references that are either for sending
+// This field exists to hand out references that are either for sending
```
Nit.
```go
// Have the now only remaining writing party close the events channel,
// to ensure we don't panic with 'send on closed channel'.
```
I'm not confident that this statement is guaranteed to be true in a clustermesh scenario. The events channel is also propagated to remote watchers, which might still be running when the main allocator is closed (the clustermesh subsystem is never stopped on shutdown). Likely not a big deal, though, since that should be extremely rare and we are already shutting down. For this reason I wonder whether it is actually worth solving in this PR (it might be quite complex to fix).
Yeah, you're right, that's still broken.
It's kind of unclear to me what the ownership relations are in the clustermesh scenario. Does m.IdentityAllocator own its remote caches (and hence, should Allocator.Delete() call .Close() on them)?
Currently the cache corresponding to a given remote cluster is closed here, which happens when the associated cluster configuration is removed, but is not triggered when the agent is shut down.
The two possible alternatives seem to be either ensuring that the clustermesh subsystem is stopped before the allocator is closed, or making sure that m.IdentityAllocator also stops the remote caches when closed. I'm not sure how the first approach plays with the current structure, given that clustermesh has not yet been converted to a Cell module. The second approach instead requires protecting RemoteCache.Close() from being called twice, as it would panic (see the sketch below).

Does m.IdentityAllocator own its remote caches?

I would tend to say so, since a remote cache is pointless without the main allocator (but their lifecycle is currently managed externally).
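One common way to make such a Close idempotent is sync.Once; a hypothetical sketch (RemoteCache here is a stand-in, not the actual Cilium type):

```go
package main

import "sync"

// RemoteCache is a stand-in for the real remote-cluster cache.
type RemoteCache struct {
	closeOnce sync.Once
	events    chan struct{}
}

// Close may now be called both when the cluster configuration is removed and
// again from the allocator's shutdown path, without the second call panicking
// with 'close of closed channel'.
func (rc *RemoteCache) Close() {
	rc.closeOnce.Do(func() {
		close(rc.events)
	})
}

func main() {
	rc := &RemoteCache{events: make(chan struct{})}
	rc.Close()
	rc.Close() // safe: the close runs only once
}
```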
One minor concern inline, but I feel that it might also be left as a followup, as it will likely require some refactoring around clustermesh.
As part of the shutdown procedure involving IPCache and the identity allocation components, it was possible to hit a 'send on closed channel' panic, caused by writes of the localIdentityCache to the events channel, which is closed as part of the shutdown of the identity allocator.
Instead of directly closing the channel, call into the localIdentityCache (the only writer) to do so, with proper mutual exclusion guaranteed by the mutex.
The offending writes happened in lookupOrCreate as well as release, both of which take the mutex, and hence are correctly synchronised with the new close() method.

Fixes: #25235
Since this affects the shutdown procedure, I believe this does not impact actual cilium deployments, merely our integration tests; hence release-note/misc.