-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional FQDN selector identity tracking fixes #17788
Additional FQDN selector identity tracking fixes #17788
Commits on Nov 18, 2021
-
identity: Add notifyOwner flag to identity Release
The AllocateIdentity() method accepts a 'notifyOwner' boolean to determine whether to notify the SelectorCache about the change in identities. To balance the APIs into this structure, this commit adds the corresponding notifyOwner flag to the Release function. No other changes are made at this time; all existing cases which notify the owner will continue to notify the owner, and all cases which do not notify the owner will continue to not notify the owner. The next commit will change the behaviour and argue why that is correct. Signed-off-by: Joe Stringer <joe@cilium.io>
Configuration menu - View commit details
-
Copy full SHA for 3357b1b - Browse repository at this point
Copy the full SHA 3357b1bView commit details -
endpoint: Document SelectorCache notification
Document why the endpoint's AllocateIdentity() function triggers notification into the SelectorCache and yet the Release does not. Signed-off-by: Joe Stringer <joe@cilium.io>
Configuration menu - View commit details
-
Copy full SHA for ab1e307 - Browse repository at this point
Copy the full SHA ab1e307View commit details -
identity: Remove disbalanced SelectorCache notify code
While attempting to address identity garbage collection issues in upcoming commits, I hit the stack trace below. This is not currently possible to hit, because CIDR identity references from the SelectorCache are never released. However, after introducing the proper identity reference count release implementation, we observe that the SelectorCache attempts to release a CIDR identity, which then triggers a call back into the SelectorCache to notify it about the release of the identities. Since the outer calling code must hold the SelectorCache mutex in order to protect the writes against the SelectorCache itself, when the release is executed and attempts to call back into the SelectorCache, UpdateIdentities() attempts to re-acquire the mutex that the goroutine is already holding. This causes deadlocks which then propagate through to other subsystems, for instance locking Endpoints out from applying policy changes and preventing CLI calls for commands such as `cilium policy selectors list` from gathering and returning the internal state of the system. Upon further investigation, the notification of the "owner", ie SelectorCache, is also disbalanced here between the AllocateIdentity() call and the Release() call. The AllocateIdentity() call will only perform the notification when it is called from endpoint (restore or identity update) or clustermesh-apiserver packages. It does not notify the SelectorCache when called from CIDR allocation routines. On the other hand, the Release function inexplicably only notifies the SelectorCache when CIDR identities are released. When this functionality was initially introduced, it seems that we did not recognize and address this disbalance. The CIDR identity allocation routines would not notify the owner (because the callers take on that responsibility), and yet the release _would_ notify the owner. To fix this, this commit disables the owner notification for the Release() function so that it matches the AllocateIdentity() function. This will also prevent the caller from notifying itself again and triggering a deadlock when we apply the fixes in upcoming commits. sync.runtime_SemacquireMutex(0x44, 0x18, 0xc000a9a2d8) /usr/local/go/src/runtime/sema.go:71 +0x25 sync.(*Mutex).lockSlow(0xc000532d90) /usr/local/go/src/sync/mutex.go:138 +0x165 sync.(*Mutex).Lock(...) /usr/local/go/src/sync/mutex.go:81 sync.(*RWMutex).Lock(0x580) /usr/local/go/src/sync/rwmutex.go:111 +0x36 github.com/cilium/cilium/pkg/policy.(*SelectorCache).UpdateIdentities(0xc000532d90, 0x0, 0xc0016540f0, 0x24bc800) /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:950 +0x65 github.com/cilium/cilium/daemon/cmd.(*Daemon).UpdateIdentities(0xc0000c5440, 0xc0016540f0, 0xc001000002) /go/src/github.com/cilium/cilium/daemon/cmd/policy.go:88 +0x5a github.com/cilium/cilium/pkg/identity/cache.(*CachingIdentityAllocator).Release(0xc0007a18c0, {0x2b86420, 0xc0009e7140}, 0xc001c42540) /go/src/github.com/cilium/cilium/pkg/identity/cache/allocator.go:423 +0x324 github.com/cilium/cilium/pkg/ipcache.releaseCIDRIdentities({0x2b86420, 0xc0009e7140}, 0xc0006b08c5) /go/src/github.com/cilium/cilium/pkg/ipcache/cidr.go:141 +0x296 github.com/cilium/cilium/pkg/ipcache.ReleaseCIDRIdentitiesByID({0x2b86420, 0xc0009e7140}, {0xc001597450, 0x41de7a0, 0x1bf08eb000}) /go/src/github.com/cilium/cilium/pkg/ipcache/cidr.go:200 +0x4af github.com/cilium/cilium/daemon/cmd.cachingIdentityAllocator.ReleaseCIDRIdentitiesByID({0x2b863e8}, {0x2b86420, 0xc0009e7140}, {0xc001597450, 0x3, 0x10}) /go/src/github.com/cilium/cilium/daemon/cmd/identity.go:117 +0x37 github.com/cilium/cilium/pkg/policy.(*fqdnSelector).releaseIdentityMappings(0xc001c24e10, {0x7f0588917fb8, 0xc0007a18c0}) /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:526 +0x1aa github.com/cilium/cilium/pkg/policy.(*SelectorCache).removeSelectorLocked(0xc000532d90, {0x2ba5660, 0xc001c24e10}, {0x2b33100, 0xc000953880}) /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:901 +0xd7 github.com/cilium/cilium/pkg/policy.(*SelectorCache).RemoveSelectors(0xc000532d90, {0xc000f255c0, 0x2, 0xc0026e65c8}, {0x2b33100, 0xc000953880}) /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:919 +0xb4 github.com/cilium/cilium/pkg/policy.(*L4Filter).removeSelectors(0xc000953880, 0xc0026e6690) /go/src/github.com/cilium/cilium/pkg/policy/l4.go:624 +0x185 github.com/cilium/cilium/pkg/policy.(*L4Filter).detach(0x0, 0x0) /go/src/github.com/cilium/cilium/pkg/policy/l4.go:631 +0x1e github.com/cilium/cilium/pkg/policy.L4PolicyMap.Detach(0x0, 0xc000a80160) /go/src/github.com/cilium/cilium/pkg/policy/l4.go:801 +0x7e github.com/cilium/cilium/pkg/policy.(*L4Policy).Detach(0xc000635c80, 0x1) /go/src/github.com/cilium/cilium/pkg/policy/l4.go:1011 +0x45 github.com/cilium/cilium/pkg/policy.(*selectorPolicy).Detach(...) /go/src/github.com/cilium/cilium/pkg/policy/resolve.go:107 github.com/cilium/cilium/pkg/policy.(*cachedSelectorPolicy).setPolicy(0xc000532e70, 0xc0009e63c0) /go/src/github.com/cilium/cilium/pkg/policy/distillery.go:188 +0x3b github.com/cilium/cilium/pkg/policy.(*PolicyCache).updateSelectorPolicy(0xc000131410, 0xc0009e63c0) /go/src/github.com/cilium/cilium/pkg/policy/distillery.go:124 +0x195 github.com/cilium/cilium/pkg/policy.(*PolicyCache).UpdatePolicy(...) /go/src/github.com/cilium/cilium/pkg/policy/distillery.go:153 github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regeneratePolicy(0xc00078b500) /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:230 +0x22b github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runPreCompilationSteps(0xc00078b500, 0xc000363400) /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:815 +0x2dd github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerateBPF(0xc00078b500, 0xc000363400) /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:584 +0x19d github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0xc00078b500, 0xc000363400) /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:405 +0x7b3 github.com/cilium/cilium/pkg/endpoint.(*EndpointRegenerationEvent).Handle(0xc000d8e060, 0x40568a) /go/src/github.com/cilium/cilium/pkg/endpoint/events.go:53 +0x32c github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run.func1() /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:245 +0x13b sync.(*Once).doSlow(0x1, 0x43f325) /usr/local/go/src/sync/once.go:68 +0xd2 sync.(*Once).Do(...) /usr/local/go/src/sync/once.go:59 github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run(0x0) /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:233 +0x45 created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:229 +0x7b Fixes: 83e576e ("identity: create `CachingIdentityAllocator` type") Signed-off-by: Joe Stringer <joe@cilium.io>
Configuration menu - View commit details
-
Copy full SHA for 6d83d26 - Browse repository at this point
Copy the full SHA 6d83d26View commit details -
fqdn: Move identity allocation to FQDN selector
Previously, the CIDR identity alloction logic was tied directly into the NameManager in RegisterForIdentityUpdatesLocked(), however these identities were never cleaned up. In order to properly release CIDR identities that are allocated upon creation of an FQDN selector, we need to track them and eventually call into release functions. Similar to recent changes that looked at identity leak during updates of FQDN selectors, we have two options - continue to tie the allocation into the NameManager (and hence teach UnregisterForIdentityUpdatesLocked() about identities and their associations), or we can tie them more closely to the cached FQDN selectors that require the allocations to occur. Given that not all cached selectors need this treatment (in particular, cached label selectors do not hold identity references), this code opts to refactor the allocation routines up one layer into the selectorcache's 'fqdnSelector' initialization handling logic. This will pave the way for fqdnSelector cleanup logic to then release those references, thereby balancing the allocation and release of identities. This commit should have no functional changes. Technically the identity allocation is moved inside the SelectorCache critical section while holding the mutex, but CIDR identities are only ever locally allocated within memory anyway so this is not expected to block for a long time. Regarding the dropped test, by name it seemed like it's intended to test the identity notifier, except that at its core it is using a mock implementation of the identity notifier. It was also depending on the assumption that the identity notifier is responsible for identity allocation, whereas this patch removes that responsibility and pushes it back to the FQDN selector. We perhaps could have expanded the way that it prepares the identity allocator and the identity notifier in order to validate that when adding a new FQDN selector, existing IPs from the DNS cache are properly propagated into the selector via the mock selection user that's also present in this test file. However, by this time we mock out this many implementation details around the DNS cache, FQDN selectors, cached selection users, Identity allocation and Identity notifier, it seemed like the test is getting too deep into implementation details to be able to robustly validate the functionality it's intending to validate. Rather than fight this structure and attempt to build yet more implementation-specific testing, it seemed like a simpler path to just remove the test. If this selection notification doesn't occur properly here, then I'd hope we can pick this up during later integration tests rather than solely relying on this existing test. Signed-off-by: Joe Stringer <joe@cilium.io>
Configuration menu - View commit details
-
Copy full SHA for 51126bf - Browse repository at this point
Copy the full SHA 51126bfView commit details -
fqdn: Handle duplicate ids during selector caching
Previously this code didn't consider that this function could be called against a selector that is already cached (and therefore has identities associated with its internal 'cachedSelections' list). If a selector with the same string representation was added twice into the cache and identities associated with it in between, then that could potentially cause those identities to have a reference allocated without eventually being freed. Account for this case by detecting duplicates here and releasing any identities that are already referenced by this selector. Signed-off-by: Joe Stringer <joe@cilium.io>
Configuration menu - View commit details
-
Copy full SHA for 088aa05 - Browse repository at this point
Copy the full SHA 088aa05View commit details -
selectorcache: Release identities on selector removal
FQDN selectors hold references to CIDR identities, one for each 'fqdnSelector.cachedSelections' entry. Previously, these would be associated with the fqdnSelector during creation of the selector but they were not released when the fqdnSelector is deleted from the cache. Commit de10e82 ("fqdn: Move identity allocation to FQDN selector") fixed a similar bug recently during the update of existing selectors; this code is balancing the release for identities allocated during the initial creation of each FQDN selector. Signed-off-by: Joe Stringer <joe@cilium.io>
Configuration menu - View commit details
-
Copy full SHA for 9fd6593 - Browse repository at this point
Copy the full SHA 9fd6593View commit details