Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional FQDN selector identity tracking fixes #17788

Merged
merged 6 commits into from
Nov 22, 2021

Commits on Nov 18, 2021

  1. identity: Add notifyOwner flag to identity Release

    The AllocateIdentity() method accepts a 'notifyOwner' boolean to
    determine whether to notify the SelectorCache about the change in
    identities.
    
    To balance the APIs into this structure, this commit adds the
    corresponding notifyOwner flag to the Release function. No other changes
    are made at this time; all existing cases which notify the owner will
    continue to notify the owner, and all cases which do not notify the
    owner will continue to not notify the owner.
    
    The next commit will change the behaviour and argue why that is correct.
    
    Signed-off-by: Joe Stringer <joe@cilium.io>
    joestringer committed Nov 18, 2021
    Configuration menu
    Copy the full SHA
    3357b1b View commit details
    Browse the repository at this point in the history
  2. endpoint: Document SelectorCache notification

    Document why the endpoint's AllocateIdentity() function triggers
    notification into the SelectorCache and yet the Release does not.
    
    Signed-off-by: Joe Stringer <joe@cilium.io>
    joestringer committed Nov 18, 2021
    Configuration menu
    Copy the full SHA
    ab1e307 View commit details
    Browse the repository at this point in the history
  3. identity: Remove disbalanced SelectorCache notify code

    While attempting to address identity garbage collection issues in
    upcoming commits, I hit the stack trace below. This is not currently
    possible to hit, because CIDR identity references from the SelectorCache
    are never released. However, after introducing the proper identity
    reference count release implementation, we observe that the
    SelectorCache attempts to release a CIDR identity, which then triggers a
    call back into the SelectorCache to notify it about the release of the
    identities. Since the outer calling code must hold the SelectorCache
    mutex in order to protect the writes against the SelectorCache itself,
    when the release is executed and attempts to call back into the
    SelectorCache, UpdateIdentities() attempts to re-acquire the mutex that
    the goroutine is already holding. This causes deadlocks which then
    propagate through to other subsystems, for instance locking Endpoints
    out from applying policy changes and preventing CLI calls for commands
    such as `cilium policy selectors list` from gathering and returning the
    internal state of the system.
    
    Upon further investigation, the notification of the "owner", ie
    SelectorCache, is also disbalanced here between the AllocateIdentity()
    call and the Release() call. The AllocateIdentity() call will only
    perform the notification when it is called from endpoint (restore or
    identity update) or clustermesh-apiserver packages. It does not notify
    the SelectorCache when called from CIDR allocation routines. On the
    other hand, the Release function inexplicably only notifies the
    SelectorCache when CIDR identities are released.
    
    When this functionality was initially introduced, it seems that we did
    not recognize and address this disbalance. The CIDR identity allocation
    routines would not notify the owner (because the callers take on that
    responsibility), and yet the release _would_ notify the owner.
    
    To fix this, this commit disables the owner notification for the
    Release() function so that it matches the AllocateIdentity() function.
    This will also prevent the caller from notifying itself again and
    triggering a deadlock when we apply the fixes in upcoming commits.
    
        sync.runtime_SemacquireMutex(0x44, 0x18, 0xc000a9a2d8)
                /usr/local/go/src/runtime/sema.go:71 +0x25
        sync.(*Mutex).lockSlow(0xc000532d90)
                /usr/local/go/src/sync/mutex.go:138 +0x165
        sync.(*Mutex).Lock(...)
                /usr/local/go/src/sync/mutex.go:81
        sync.(*RWMutex).Lock(0x580)
                /usr/local/go/src/sync/rwmutex.go:111 +0x36
        github.com/cilium/cilium/pkg/policy.(*SelectorCache).UpdateIdentities(0xc000532d90, 0x0, 0xc0016540f0, 0x24bc800)
                /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:950 +0x65
        github.com/cilium/cilium/daemon/cmd.(*Daemon).UpdateIdentities(0xc0000c5440, 0xc0016540f0, 0xc001000002)
                /go/src/github.com/cilium/cilium/daemon/cmd/policy.go:88 +0x5a
        github.com/cilium/cilium/pkg/identity/cache.(*CachingIdentityAllocator).Release(0xc0007a18c0, {0x2b86420, 0xc0009e7140}, 0xc001c42540)
                /go/src/github.com/cilium/cilium/pkg/identity/cache/allocator.go:423 +0x324
        github.com/cilium/cilium/pkg/ipcache.releaseCIDRIdentities({0x2b86420, 0xc0009e7140}, 0xc0006b08c5)
                /go/src/github.com/cilium/cilium/pkg/ipcache/cidr.go:141 +0x296
        github.com/cilium/cilium/pkg/ipcache.ReleaseCIDRIdentitiesByID({0x2b86420, 0xc0009e7140}, {0xc001597450, 0x41de7a0, 0x1bf08eb000})
                /go/src/github.com/cilium/cilium/pkg/ipcache/cidr.go:200 +0x4af
        github.com/cilium/cilium/daemon/cmd.cachingIdentityAllocator.ReleaseCIDRIdentitiesByID({0x2b863e8}, {0x2b86420, 0xc0009e7140}, {0xc001597450, 0x3, 0x10})
                /go/src/github.com/cilium/cilium/daemon/cmd/identity.go:117 +0x37
        github.com/cilium/cilium/pkg/policy.(*fqdnSelector).releaseIdentityMappings(0xc001c24e10, {0x7f0588917fb8, 0xc0007a18c0})
                /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:526 +0x1aa
        github.com/cilium/cilium/pkg/policy.(*SelectorCache).removeSelectorLocked(0xc000532d90, {0x2ba5660, 0xc001c24e10}, {0x2b33100, 0xc000953880})
                /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:901 +0xd7
        github.com/cilium/cilium/pkg/policy.(*SelectorCache).RemoveSelectors(0xc000532d90, {0xc000f255c0, 0x2, 0xc0026e65c8}, {0x2b33100, 0xc000953880})
                /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:919 +0xb4
        github.com/cilium/cilium/pkg/policy.(*L4Filter).removeSelectors(0xc000953880, 0xc0026e6690)
                /go/src/github.com/cilium/cilium/pkg/policy/l4.go:624 +0x185
        github.com/cilium/cilium/pkg/policy.(*L4Filter).detach(0x0, 0x0)
                /go/src/github.com/cilium/cilium/pkg/policy/l4.go:631 +0x1e
        github.com/cilium/cilium/pkg/policy.L4PolicyMap.Detach(0x0, 0xc000a80160)
                /go/src/github.com/cilium/cilium/pkg/policy/l4.go:801 +0x7e
        github.com/cilium/cilium/pkg/policy.(*L4Policy).Detach(0xc000635c80, 0x1)
                /go/src/github.com/cilium/cilium/pkg/policy/l4.go:1011 +0x45
        github.com/cilium/cilium/pkg/policy.(*selectorPolicy).Detach(...)
                /go/src/github.com/cilium/cilium/pkg/policy/resolve.go:107
        github.com/cilium/cilium/pkg/policy.(*cachedSelectorPolicy).setPolicy(0xc000532e70, 0xc0009e63c0)
                /go/src/github.com/cilium/cilium/pkg/policy/distillery.go:188 +0x3b
        github.com/cilium/cilium/pkg/policy.(*PolicyCache).updateSelectorPolicy(0xc000131410, 0xc0009e63c0)
                /go/src/github.com/cilium/cilium/pkg/policy/distillery.go:124 +0x195
        github.com/cilium/cilium/pkg/policy.(*PolicyCache).UpdatePolicy(...)
                /go/src/github.com/cilium/cilium/pkg/policy/distillery.go:153
        github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regeneratePolicy(0xc00078b500)
                /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:230 +0x22b
        github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runPreCompilationSteps(0xc00078b500, 0xc000363400)
                /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:815 +0x2dd
        github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerateBPF(0xc00078b500, 0xc000363400)
                /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:584 +0x19d
        github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0xc00078b500, 0xc000363400)
                /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:405 +0x7b3
        github.com/cilium/cilium/pkg/endpoint.(*EndpointRegenerationEvent).Handle(0xc000d8e060, 0x40568a)
                /go/src/github.com/cilium/cilium/pkg/endpoint/events.go:53 +0x32c
        github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run.func1()
                /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:245 +0x13b
        sync.(*Once).doSlow(0x1, 0x43f325)
                /usr/local/go/src/sync/once.go:68 +0xd2
        sync.(*Once).Do(...)
                /usr/local/go/src/sync/once.go:59
        github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run(0x0)
                /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:233 +0x45
        created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run
                /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:229 +0x7b
    
    Fixes: 83e576e ("identity: create `CachingIdentityAllocator` type")
    Signed-off-by: Joe Stringer <joe@cilium.io>
    joestringer committed Nov 18, 2021
    Configuration menu
    Copy the full SHA
    6d83d26 View commit details
    Browse the repository at this point in the history
  4. fqdn: Move identity allocation to FQDN selector

    Previously, the CIDR identity alloction logic was tied directly into the
    NameManager in RegisterForIdentityUpdatesLocked(), however these
    identities were never cleaned up. In order to properly release
    CIDR identities that are allocated upon creation of an FQDN selector, we
    need to track them and eventually call into release functions.
    
    Similar to recent changes that looked at identity leak during updates of
    FQDN selectors, we have two options - continue to tie the allocation
    into the NameManager (and hence teach UnregisterForIdentityUpdatesLocked()
    about identities and their associations), or we can tie them more
    closely to the cached FQDN selectors that require the allocations to
    occur.
    
    Given that not all cached selectors need this treatment (in particular,
    cached label selectors do not hold identity references), this code opts
    to refactor the allocation routines up one layer into the
    selectorcache's 'fqdnSelector' initialization handling logic. This will
    pave the way for fqdnSelector cleanup logic to then release those
    references, thereby balancing the allocation and release of identities.
    
    This commit should have no functional changes. Technically the identity
    allocation is moved inside the SelectorCache critical section while
    holding the mutex, but CIDR identities are only ever locally allocated
    within memory anyway so this is not expected to block for a long time.
    
    Regarding the dropped test, by name it seemed like it's intended to test
    the identity notifier, except that at its core it is using a mock
    implementation of the identity notifier. It was also depending on the
    assumption that the identity notifier is responsible for identity
    allocation, whereas this patch removes that responsibility and pushes it
    back to the FQDN selector. We perhaps could have expanded the way that
    it prepares the identity allocator and the identity notifier in order to
    validate that when adding a new FQDN selector, existing IPs from the DNS
    cache are properly propagated into the selector via the mock selection
    user that's also present in this test file. However, by this time we
    mock out this many implementation details around the DNS cache, FQDN
    selectors, cached selection users, Identity allocation and Identity
    notifier, it seemed like the test is getting too deep into
    implementation details to be able to robustly validate the functionality
    it's intending to validate. Rather than fight this structure and attempt
    to build yet more implementation-specific testing, it seemed like a
    simpler path to just remove the test. If this selection notification
    doesn't occur properly here, then I'd hope we can pick this up during
    later integration tests rather than solely relying on this existing
    test.
    
    Signed-off-by: Joe Stringer <joe@cilium.io>
    joestringer committed Nov 18, 2021
    Configuration menu
    Copy the full SHA
    51126bf View commit details
    Browse the repository at this point in the history
  5. fqdn: Handle duplicate ids during selector caching

    Previously this code didn't consider that this function could be called
    against a selector that is already cached (and therefore has identities
    associated with its internal 'cachedSelections' list). If a selector
    with the same string representation was added twice into the cache and
    identities associated with it in between, then that could potentially
    cause those identities to have a reference allocated without eventually
    being freed.
    
    Account for this case by detecting duplicates here and releasing any
    identities that are already referenced by this selector.
    
    Signed-off-by: Joe Stringer <joe@cilium.io>
    joestringer committed Nov 18, 2021
    Configuration menu
    Copy the full SHA
    088aa05 View commit details
    Browse the repository at this point in the history
  6. selectorcache: Release identities on selector removal

    FQDN selectors hold references to CIDR identities, one for each
    'fqdnSelector.cachedSelections' entry. Previously, these would be
    associated with the fqdnSelector during creation of the selector but
    they were not released when the fqdnSelector is deleted from the cache.
    
    Commit de10e82 ("fqdn: Move identity allocation to FQDN selector")
    fixed a similar bug recently during the update of existing selectors;
    this code is balancing the release for identities allocated during the
    initial creation of each FQDN selector.
    
    Signed-off-by: Joe Stringer <joe@cilium.io>
    joestringer committed Nov 18, 2021
    Configuration menu
    Copy the full SHA
    9fd6593 View commit details
    Browse the repository at this point in the history