
identity: cache: close channel in writing party #25353

Merged
merged 1 commit into from May 11, 2023

Conversation

@bimmlerd (Member)

As part of the shutdown procedure involving IPCache and the identity allocation components, it was possible to hit a 'send on closed channel' panic, caused by writes of the localIdentityCache to the events channel, which is closed as part of the shutdown of the identity allocator.

Instead of directly closing the channel, call into the localIdentityCache (the only writer) to do so, with proper mutual exclusion guaranteed by the mutex.

The offending writes happened in lookupOrCreate as well as release, both of which take the mutex, and hence are correctly synchronised with the new close() method.
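The pattern described above can be sketched as follows. This is a minimal, hypothetical illustration of "only the writer closes, under the mutex" — the names `localIdentityCache` and `lookupOrCreate` mirror the PR, but the code is not Cilium's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

type event struct{ id int }

// localIdentityCache is the only party that sends on events, so it is
// also the only party allowed to close the channel.
type localIdentityCache struct {
	mu     sync.Mutex
	closed bool
	events chan event
}

// lookupOrCreate sends under the mutex; once close() has run, the send
// is skipped instead of panicking with 'send on closed channel'.
func (c *localIdentityCache) lookupOrCreate(id int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return
	}
	c.events <- event{id: id}
}

// close is called by the allocator's shutdown path instead of the
// allocator closing the channel directly.
func (c *localIdentityCache) close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return
	}
	c.closed = true
	close(c.events)
}

func main() {
	c := &localIdentityCache{events: make(chan event, 8)}
	c.lookupOrCreate(1)
	c.close()
	c.lookupOrCreate(2) // no panic: the send is skipped after close
	fmt.Println(len(c.events))
}
```

Because both senders and the close path acquire the same mutex, a send can never race with the close.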

Fixes: #25235

Since this affects the shutdown procedure, I believe this does not impact actual Cilium deployments, merely our integration tests; hence release-note/misc.

@bimmlerd bimmlerd added the release-note/misc This PR makes changes that have no direct user impact. label May 10, 2023
@bimmlerd (Member Author)

CI triage:

=== RUN   TestConformance/TLSRouteSimpleSameNamespace
    suite.go:258: Applying tests/tlsroute-simple-same-namespace.yaml
    apply.go:211: Creating gateway-conformance-infra-test TLSRoute
    apply.go:211: Creating gateway-tlsroute Gateway
    helpers.go:290: Gateway expected observedGeneration to be updated for all conditions, only 0/1 were updated. stale conditions are: Accepted
    helpers.go:441: Accepted condition set to Status False with Reason InvalidTLSRoute, expected Status True
    helpers.go:441: Accepted was not in conditions list [[{Accepted False 1 2023-05-10 12:36:01 +0000 UTC InvalidTLSRoute Gateway.gateway.networking.k8s.io "gateway-tlsroute" not found}]]
    helpers.go:441: Accepted condition set to Status False with Reason InvalidTLSRoute, expected Status True
    helpers.go:441: Accepted was not in conditions list [[{Accepted False 1 2023-05-10 12:36:01 +0000 UTC InvalidTLSRoute Gateway.gateway.networking.k8s.io "gateway-tlsroute" not found}]]
    helpers.go:441: Accepted condition set to Status False with Reason InvalidTLSRoute, expected Status True
    [SNIP, lots of repetition]
    helpers.go:441: Accepted was not in conditions list [[{Accepted False 1 2023-05-10 12:36:01 +0000 UTC InvalidTLSRoute Gateway.gateway.networking.k8s.io "gateway-tlsroute" not found}]]
    helpers.go:441: Accepted condition set to Status False with Reason InvalidTLSRoute, expected Status True
    helpers.go:441: Accepted was not in conditions list [[{Accepted False 1 2023-05-10 12:36:01 +0000 UTC InvalidTLSRoute Gateway.gateway.networking.k8s.io "gateway-tlsroute" not found}]]
    tlsroute-simple-same-namespace.go:51: 
        	Error Trace:	/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/kubernetes/helpers.go:445
        	            				/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/kubernetes/helpers.go:606
        	            				/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/tests/tlsroute-simple-same-namespace.go:51
        	            				/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/suite/suite.go:262
        	            				/home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/suite/suite.go:220
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	Test:       	TestConformance/TLSRouteSimpleSameNamespace
        	Messages:   	error waiting for TLSRoute to have parents matching expectations
    apply.go:219: Deleting gateway-tlsroute Gateway
    apply.go:219: Deleting gateway-conformance-infra-test TLSRoute


     --- FAIL: TestConformance/TLSRouteSimpleSameNamespace (61.07s)
FAIL
FAIL	github.com/cilium/cilium/operator/pkg/gateway-api	117.095s
FAIL

rerunning to see if it's flaky or just broken by me

@bimmlerd (Member Author)

/test

@bimmlerd (Member Author)

bimmlerd commented May 10, 2023

/ci-verifier

GitHub Actions encountered an internal error

@bimmlerd bimmlerd marked this pull request as ready for review May 10, 2023 13:27
@bimmlerd bimmlerd requested a review from a team as a code owner May 10, 2023 13:27
@bimmlerd bimmlerd requested a review from nebril May 10, 2023 13:27
@gandro (Member) left a comment:

Thanks! One nit and one potentially blocking issue.

pkg/identity/cache/local.go
@@ -273,7 +273,9 @@ func (m *CachingIdentityAllocator) Close() {

m.IdentityAllocator.Delete()
if m.events != nil {
close(m.events)
@gandro (Member) May 10, 2023:
nit: Since we don't close m.events anymore, could we change type AllocatorEventChan chan AllocatorEvent into type AllocatorEventChan <-chan AllocatorEvent to disallow any code from accidentally closing it?

You might need to change the signature of newLocalIdentityCache, but all other receivers should hopefully be fine with a receive-only channel.
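The suggestion here relies on Go's directional channel types: a receive-only channel cannot be sent on or closed, and the compiler enforces this. A minimal sketch (the type and function names are hypothetical, not the final Cilium types):

```go
package main

import "fmt"

type AllocatorEvent struct{ ID int }

// A receive-only channel type: holders of this type can read events
// but cannot send on or close the underlying channel.
type AllocatorEventRecvChan <-chan AllocatorEvent

// consume drains the channel; attempting close(ch) here would be a
// compile-time error ("cannot close receive-only channel").
func consume(ch AllocatorEventRecvChan) int {
	n := 0
	for range ch {
		n++
	}
	return n
}

func main() {
	// The owning writer keeps the bidirectional channel and is the
	// only party able to close it; everyone else gets the recv view.
	events := make(chan AllocatorEvent, 2)
	events <- AllocatorEvent{ID: 1}
	events <- AllocatorEvent{ID: 2}
	close(events)
	fmt.Println(consume(events))
}
```

A bidirectional `chan AllocatorEvent` converts implicitly to the receive-only named type, so call sites that only read need no changes.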

@bimmlerd (Member Author):

Hmm, this is actually a bit more involved than I thought - looking into other uses of that type.

@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-id-alloc-panic branch from a8a86c6 to fb59470 Compare May 11, 2023 07:29
@bimmlerd bimmlerd requested a review from a team as a code owner May 11, 2023 07:29
@bimmlerd bimmlerd requested a review from giorio94 May 11, 2023 07:29
@bimmlerd bimmlerd marked this pull request as draft May 11, 2023 07:37
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-id-alloc-panic branch 2 times, most recently from 6b878f8 to 7c9c3b8 Compare May 11, 2023 08:28
@bimmlerd (Member Author)

bimmlerd commented May 11, 2023

I've redone the PR somewhat, to try to implement Sebastian's suggestion of encoding who writes/reads from the channel in the type system. It's a bit messy, however, since the caching identity allocator needs to pass its channel on to the actual allocator, hence needs to maintain both a recv/send variant of the same channel (or a bidirectional reference, hence losing the type safety). I'm on the fence in terms of which variant is better, feedback welcome.
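The "maintain both variants" option described above can be sketched like this: the caching allocator holds two directional views of one underlying channel, so the writer only ever sees a send side and the downstream allocator only a receive side. Field names here are hypothetical, illustrating the trade-off rather than the PR's actual code:

```go
package main

import "fmt"

type AllocatorEvent struct{ ID int }

// cachingIdentityAllocator keeps both directional views of the same
// underlying channel: a send side handed to the local cache (the
// writer) and a receive side passed on to the actual allocator.
type cachingIdentityAllocator struct {
	eventsSend chan<- AllocatorEvent
	eventsRecv <-chan AllocatorEvent
}

func newCachingIdentityAllocator() *cachingIdentityAllocator {
	// One channel, two typed references; neither side can misuse the
	// other's direction, at the cost of storing it twice.
	events := make(chan AllocatorEvent, 4)
	return &cachingIdentityAllocator{
		eventsSend: events,
		eventsRecv: events,
	}
}

func main() {
	a := newCachingIdentityAllocator()
	a.eventsSend <- AllocatorEvent{ID: 42}
	close(a.eventsSend) // closing via the send side is still legal
	ev := <-a.eventsRecv
	fmt.Println(ev.ID)
}
```

Note that a send-only channel can still be closed, which is why the close must remain confined to the single writing party rather than being prevented by the type system alone.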

@bimmlerd bimmlerd marked this pull request as ready for review May 11, 2023 08:56
@bimmlerd bimmlerd requested a review from gandro May 11, 2023 08:56
@gandro (Member) left a comment:

Awesome, thanks!

pkg/identity/cache/allocator.go (outdated)
As part of the shutdown procedure involving IPCache and the identity
allocation components, it was possible to hit a 'send on closed
channel' panic, caused by writes of the localIdentityCache to the events
channel, which is closed as part of the shutdown of the identity
allocator.

Instead of directly closing the channel after shutting down the
allocator, call into the localIdentityCache to do so, with proper mutual
exclusion guaranteed by its mutex.

The offending writes happened in 'lookupOrCreate' as well as 'release',
both of which take the mutex, and hence are correctly synchronised with
the new 'close()' method. The other writer is the allocator, but we
block on its shutdown before 'close()'.

Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-id-alloc-panic branch from 7c9c3b8 to fb9291f Compare May 11, 2023 11:07
@bimmlerd (Member Author)

/test

@bimmlerd (Member Author)

CI triage

  • ci-multicluster
Errors: 1 pods of DaemonSet cilium are not ready
Warnings:         cilium             cilium-tq5q9    pod is pending

Error: Unable to install Cilium: timeout while waiting for status to become successful: context deadline exceeded

Didn't really find anything interesting; in the k8s events there are some references to the node not being ready. No logs, cilium's init containers didn't even run.

@bimmlerd (Member Author)

/ci-multicluster

@@ -49,6 +49,9 @@ type CachingIdentityAllocator struct {

identitiesPath string

// This field exists is to hand out references that are either for sending
A Member commented:
Suggested change
// This field exists is to hand out references that are either for sending
// This field exists to hand out references that are either for sending

Nit.

Comment on lines +279 to +280
// Have the now only remaining writing party close the events channel,
// to ensure we don't panic with 'send on closed channel'.
@giorio94 (Member) May 11, 2023:
I'm not confident that this statement is guaranteed to be true in a clustermesh scenario. The events channel is also propagated to remote watchers, which might still be running when the main allocator is closed (the clustermesh subsystem is never stopped on shutdown). Likely not a big deal though, since that should be extremely rare and we are already shutting down. For this reason I wonder whether it is actually worth solving in this PR (it might be quite complex to fix).

@bimmlerd (Member Author):

Yeah, you're right, that's still broken.

It's kind of unclear to me what the ownership relations are in the clustermesh scenario. Does a m.IdentityAllocator own its remote caches (and hence, should Allocator.Delete() call .Close() on them)?

A Member replied:

Currently the cache corresponding to a given remote cluster is closed here, which happens when the associated cluster configuration is removed, but is not triggered when the agent is shut down.

The two possible alternatives seem to be either ensuring that the clustermesh subsystem is stopped before the allocator is closed, or making sure that m.IdentityAllocator also stops the remote caches when closed. I'm not sure how the first approach plays with the current structure, given that clustermesh has not yet been converted to a Cell module. The second approach instead requires protecting RemoteCache.Close() from being called twice, as it would otherwise panic.

Does a m.IdentityAllocator own its remote caches?

I would tend to say so, since a remote cache is pointless to exist without the main allocator (but their lifecycle is currently managed externally).
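The double-close protection mentioned above is commonly done with sync.Once, which makes Close idempotent so that both the clustermesh teardown and an allocator-driven shutdown could call it safely. A hypothetical sketch, not Cilium's actual RemoteCache:

```go
package main

import (
	"fmt"
	"sync"
)

// remoteCache sketches idempotent shutdown: sync.Once guarantees the
// done channel is closed at most once, so a second Close is a no-op
// rather than a 'close of closed channel' panic.
type remoteCache struct {
	closeOnce sync.Once
	done      chan struct{}
}

func (rc *remoteCache) Close() {
	rc.closeOnce.Do(func() {
		close(rc.done)
	})
}

func main() {
	rc := &remoteCache{done: make(chan struct{})}
	rc.Close()
	rc.Close() // safe: the Once has already fired
	<-rc.done  // does not block on a closed channel
	fmt.Println("closed")
}
```

With this in place, both owners (clustermesh config removal and allocator shutdown) could call Close without coordinating who goes first.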

@giorio94 (Member) left a comment:

One minor concern inline, but I feel that it might also be left as a follow-up, as it will likely require some refactoring around clustermesh.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 11, 2023
@tommyp1ckles tommyp1ckles merged commit efb6f56 into cilium:main May 11, 2023
57 checks passed
@bimmlerd bimmlerd deleted the pr/bimmlerd/fix-id-alloc-panic branch May 16, 2023 08:19
@bimmlerd bimmlerd self-assigned this May 17, 2023
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Development

Successfully merging this pull request may close these issues.

CI: daemon/cmd test panic'ed in TravisCI
5 participants