
Fix identity leak via FQDN selectors #17699

Merged

Conversation

joestringer
Member

@joestringer joestringer commented Oct 25, 2021

Review commit by commit.

Main bugfix commit message included here for review convenience:

Users recently reported an issue where pods could not be deployed into
their Cilium cluster, because the datapath regeneration for new Cilium
endpoints would fail. The reason for the failure was that while
attempting to initially provision the datapath for the new endpoint,
Cilium would attempt to populate the endpoint's PolicyMap, but reach the
limit of the total number of entries that the PolicyMap could contain.
As such, in an abundance of caution, Cilium backed off from provisioning
the network for that pod and returned an error to the Kubelet,
preventing subsequent pod provisioning.

The reason for reaching the PolicyMap size limit was partly that the
node contained a large number of Identities, and partly that an overly
permissive network policy programmed the new endpoints to allow
traffic from each of these identities individually. Typically this sort
of problem is not present if the user specifies a `toEntities: all`
policy to explicitly allow everything, but it can occasionally occur if
the policy is very permissive but more complex than this explicit
allow-all policy. One such example would be using a `toFQDNs` policy
with a `matchPattern` covering all domains.
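As an illustration, a policy of roughly this shape (a hypothetical example, not taken from the bug report) matches all domains via `toFQDNs` and therefore allocates a CIDR identity per resolved IP, rather than a single allow-all entry:

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-all-fqdns
spec:
  endpointSelector: {}
  egress:
    - toFQDNs:
        # Matches every domain, so every IP observed via DNS gets its
        # own CIDR identity in the endpoint's PolicyMap.
        - matchPattern: "*"
```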

One of the curious aspects that we discovered while investigating this
issue was that if the cilium-agent pods were restarted, the issue would
disappear. This is indeed one potential mitigation for the issue.
Furthermore, when comparing `cilium policy selectors` output from
sysdumps of a node where the problem was present vs. one where it was
not, we discovered that there were CIDR identities derived from
`toFQDNs` policy even when no `toFQDNs` policy was present.

Upon further code inspection along with Chris Tarazi, we observed that
the NameManager is responsible for handling allocation of Identities to
associate with CIDR policies, and that the name manager then hands those
identities over to the SelectorCache in order to be associated with
individual toFQDNs policy selectors, and then passes a final list of
identities up to the Daemon in order to plumb the underlying IPCache
map. However, none of these layers were taking responsibility for
releasing references to these Identities upon expiry of the entry that
triggered the CIDR Identity allocation.

Allocation path:

    Daemon.notifyOnDNSMsg()
    -> NameManager.UpdateGenerateDNS()
       -> NameManager.Config.UpdateSelectors -> Daemon.updateSelectors()
          -> identitiesForFQDNSelectorIPs() (in daemon/cmd/fqdn.go)

This leaves us with a couple of options for releasing these references:
* Associate each identity with the NameManager and have the NameManager
  release the references when the entry expires
* Associate each identity with the FQDN selectors that are actively
  using it, and release the references when the FQDN selectors are
  released.

Although I can see arguments for perhaps associating these references
with the NameManager itself as it tracks the usage of each IP for FQDN
policy and hence could tie the reference to the lifetime of the usage of
the IP overall in the agent, so far the code has implicitly tied the
allocation of each identity to its usage in each selector. As such, this
commit attempts to simply acknowledge this reality and extend the
update/release of the selectors in the NameManager to also release those
same references that each selector holds.

Practically speaking this means passing the allocated references from:
identitiesForFQDNSelectorIPs() through to the sibling function
updateSelectorCacheFQDNs():

    Daemon.notifyOnDNSMsg()
    -> NameManager.UpdateGenerateDNS()
       -> NameManager.Config.UpdateSelectors -> Daemon.updateSelectors()
          -> identitiesForFQDNSelectorIPs() (allocation)
          -> updateSelectorCacheFQDNs()
             -> SelectorCache.{U,u}pdateFQDNSelector()

Here we're basically just taking the new complete set of identities to
be associated with the selector, and merging them into the existing
selector in the cache. As such, we need to handle each case where an
identity may have a duplicate reference between the existing cache and
the new list, or any case where an identity is removed from the
selector, and then collect those for subsequent release of the identity.
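The merge described above amounts to a set difference over the selector's identities. A minimal, self-contained sketch of that step (the type and function names here are illustrative, not the actual SelectorCache API):

```go
package main

import (
	"fmt"
	"sort"
)

// NumericIdentity stands in for Cilium's identity.NumericIdentity.
type NumericIdentity uint32

// diffSelectorIdentities compares the identities currently cached for a
// selector against the complete new set, returning which identities were
// added (need a retained reference) and which were removed (their
// references can now be released). An identity present in both sets
// keeps its single existing reference.
func diffSelectorIdentities(cached, updated []NumericIdentity) (added, removed []NumericIdentity) {
	oldSet := make(map[NumericIdentity]struct{}, len(cached))
	for _, id := range cached {
		oldSet[id] = struct{}{}
	}
	newSet := make(map[NumericIdentity]struct{}, len(updated))
	for _, id := range updated {
		newSet[id] = struct{}{}
		if _, ok := oldSet[id]; !ok {
			added = append(added, id)
		}
	}
	for _, id := range cached {
		if _, ok := newSet[id]; !ok {
			removed = append(removed, id)
		}
	}
	sort.Slice(added, func(i, j int) bool { return added[i] < added[j] })
	sort.Slice(removed, func(i, j int) bool { return removed[i] < removed[j] })
	return added, removed
}

func main() {
	cached := []NumericIdentity{100, 101, 102}
	updated := []NumericIdentity{101, 102, 103}
	added, removed := diffSelectorIdentities(cached, updated)
	fmt.Println(added, removed) // [103] [100]
}
```

The removed set is what the bugfix collects for subsequent release; the duplicate-reference case falls out naturally because only genuinely new identities appear in `added`.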

Conveniently for the above approach, the release of the selectors is
handled through the same core function, just coming from the caller
SelectorCache.RemoveIdentitiesFQDNSelectors(). So this means we can
contain the modifications necessary to resolve this bug to the
selectorcache code.

@joestringer joestringer added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Oct 25, 2021
@joestringer joestringer force-pushed the submit/fqdn-cidr-identity-leak branch 4 times, most recently from 6edab40 to b02d727 Compare October 25, 2021 21:09
@joestringer
Member Author

/test


@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 27, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 27, 2021
@joestringer
Member Author

/test

Comment on lines 124 to 157
```go
// Before: each CIDR identity was released individually, with the
// release bounded by the kvstore connectivity timeout.
releaseCtx, cancel := context.WithTimeout(context.Background(), option.Config.KVstoreConnectivityTimeout)
defer cancel()
released, err := IdentityAllocator.Release(releaseCtx, id)
if err != nil {
	log.WithError(err).Warningf("Unable to release identity for CIDR %s. Ignoring error. Identity may be leaked", prefix.String())
}

// After: the identities are collected and then released in one batch.
identities = append(identities, id)
} else {
	log.Errorf("Unable to find identity of previously used CIDR %s", prefix.String())
}
releaseCIDRIdentities(context.Background(), identities)
```
Member

Why don't we need the previous context.WithTimeout bounded by the kvstore connectivity timeout anymore?

Member Author

Yeah, that's a little bit awkward, but basically CIDR identities never require kvstore/k8s allocation, so technically it doesn't matter whether we have a context here or not. It also can't fail (as far as I understand).

We had initially relied on clusterwide coordination of these identities for CIDRs, but at some point we switched it to local-only and we haven't gone back through and established clear patterns for how to handle this allocation interface.

Member

Ah I see, that makes sense, thanks. It would be nice in the future to split regular identity allocation from CIDR identity allocation via a different call. That separation itself is useful for documentation purposes too.

Member Author

I think we resisted this in the past because abstractly, you just need an identity for a set of labels. The notion that CIDR labels are different and don't require clusterwide co-ordination is more of an optimization.

pkg/policy/selectorcache.go Outdated Show resolved Hide resolved
Comment on lines +154 to +158
```go
AllocateCIDRsForIPs(ips []net.IP, newlyAllocatedIdentities map[string]*identity.Identity) ([]*identity.Identity, error)

// ReleaseCIDRIdentitiesByID() is a wrapper for ReleaseSlice() that
// also handles ipcache entries.
ReleaseCIDRIdentitiesByID(context.Context, []identity.NumericIdentity)
```
Member Author

Adding these functions to the pkg/identity/cache:IdentityAllocator interface here is a bit awkward, given that the intent is for these calls to go via ipcache in the real implementation. I think we need to come up with a clearer ownership/structure/relationship between the daemon, fqdn, policy, ipcache, and identity subsystems in order to tidy this up properly for the longer term. I'm not planning to address that in this bugfix.

I think this is also why the 'newlyAllocatedIdentities' parameter is necessary; if we make the relationship clearer, then that parameter should be able to go away.

@joestringer joestringer force-pushed the submit/fqdn-cidr-identity-leak branch 2 times, most recently from df99f69 to bdeec70 Compare November 2, 2021 01:30
@joestringer
Member Author

/test

@joestringer
Member Author

Travis, Netlify, and runtime builds failed due to GitHub shutting down the `git://` protocol, which we were using for docs builds. This is already fixed by #17756 .

The other ci-eks failure looks like known issue #16938 .

Marking ready for review.

@joestringer joestringer marked this pull request as ready for review November 2, 2021 04:28
@joestringer joestringer requested a review from a team November 2, 2021 04:28
@joestringer joestringer requested review from a team as code owners November 2, 2021 04:28
@joestringer joestringer requested review from a team November 2, 2021 04:28
@joestringer joestringer requested a review from a team as a code owner November 2, 2021 04:28
@joestringer
Member Author

/test-runtime

Member

@christarazi christarazi left a comment

Thanks for the very thorough explanation and commit msgs. Great to see that the code docs are getting updated with what the caller should expect after making the call 👍.

Member

@jrajahalme jrajahalme left a comment

LGTM, thanks for splitting up the changes into bite-size commits :-)

@joestringer
Member Author

/ci-eks

@joestringer
Member Author

From further testing, I found that this is not quite enough to cover all of the cases, but these steps are correct and necessary to partially resolve the original bug report. I think we should merge this as-is and then I can follow up separately with another PR to cover the other case.

The mock identity notifier and identity allocator make sense together
under pkg/testutils/identity, and refactoring like this will be required
by upcoming commits in order to avoid circular imports.

Signed-off-by: Joe Stringer <joe@cilium.io>
This constructor will simplify upcoming commits by performing all of the
necessary initialization for the FakeIdentityAllocator in all tests that
make use of this structure.

Signed-off-by: Joe Stringer <joe@cilium.io>
Upcoming commits will rely on the daemon passing the identityAllocator
module down to the fqdn subsystem via the policy repository. For now,
adjust the function signatures to pass this through, but avoid actually
using the new parameter yet.

This commit is a functional no-op but reduces the noise in terms of code
churn from upcoming commits.

Signed-off-by: Joe Stringer <joe@cilium.io>
In order to test an upcoming bugfix, the internal FQDN handling
functions that are triggered from the unit tests must be able to refer
to an abstract identity allocator. This will then be mocked out by the
tests.

This commit extends the existing cache.IdentityAllocator interface to
support allocating identities for CIDRs. A future commit will also add
Identity release functions to the same abstraction.

This also paves the way for some improvements to code structure: Some of
the FQDN logic under `daemon/cmd/fqdn.go` may be able to be pushed down
into pkg/fqdn now that it doesn't directly reference the globals from
the ipcache package.

There's another remaining open question around code structure for how
the IPcache and identity allocation packages handle the pattern for
allocating CIDR identities vs. injecting those identities into the
underlying IPcache maps (e.g. in eBPF). This commit shies away from
solving that problem right now in order to keep the changes minimal as
this code is likely to be proposed for backport.

Signed-off-by: Joe Stringer <joe@cilium.io>
An upcoming commit will introduce a new variation on the release
function, so just rename this one here to simplify the upcoming changes.

Signed-off-by: Joe Stringer <joe@cilium.io>
Everything that goes up must eventually come down. And so it is also
with everything that is allocated; it must be deallocated.

Extend the IdentityAllocator interface to also support CIDR release
functions and plumb through an initial implementation. This is currently
not hooked into the agent itself; that will come with an upcoming
bugfix.

Signed-off-by: Joe Stringer <joe@cilium.io>
(Main bugfix commit message; quoted in full in the PR description above.)

Signed-off-by: Joe Stringer <joe@cilium.io>
Add a new unit test that tests the integration between the daemon DNS
event handling logic and the core FQDN and policy (selector) subsystems.

The core idea here is to validate that when DNS events are handled that
associate FQDNs to IPs, ensure that the reference counting on CIDR
identities remains sane. To this end, we put a mock identity reference
counter into the Daemon (using pkg/counter), then trigger DNS events to
occur. By exercising this core Daemon / FQDN / selector code with our
own "known-good" reference counting identity allocator backend, we can
check that the intermediate code is appropriately balancing the identity
allocation and release.

The resulting test ensures that even if we trigger DNS events multiple
times, the core agent code should not leak identity references.

Signed-off-by: Joe Stringer <joe@cilium.io>
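The reference-counting check at the heart of that test can be sketched as follows (a toy illustration; `countingAllocator` is hypothetical, while the real test uses pkg/counter and the FakeIdentityAllocator):

```go
package main

import "fmt"

// countingAllocator is a toy stand-in for a mock identity allocator
// that tracks a reference count per CIDR prefix. After any sequence of
// paired allocate/release calls, every count must return to zero;
// otherwise the caller leaked a reference.
type countingAllocator struct {
	refs map[string]int
}

func newCountingAllocator() *countingAllocator {
	return &countingAllocator{refs: make(map[string]int)}
}

// Allocate records one reference taken on the identity for prefix.
func (a *countingAllocator) Allocate(prefix string) {
	a.refs[prefix]++
}

// Release records one reference dropped on the identity for prefix.
func (a *countingAllocator) Release(prefix string) {
	a.refs[prefix]--
}

// Leaked reports prefixes whose references were not balanced.
func (a *countingAllocator) Leaked() []string {
	var leaked []string
	for prefix, n := range a.refs {
		if n != 0 {
			leaked = append(leaked, prefix)
		}
	}
	return leaked
}

func main() {
	a := newCountingAllocator()
	// Simulate two DNS events resolving a name to the same IP,
	// followed by release of both references on selector removal.
	a.Allocate("10.0.0.1/32")
	a.Allocate("10.0.0.1/32")
	a.Release("10.0.0.1/32")
	a.Release("10.0.0.1/32")
	fmt.Println(a.Leaked()) // []
}
```

Exercising the daemon/FQDN/selector code against a backend like this makes any unbalanced allocation show up as a non-empty leak set, which is the property the new unit test asserts.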
@joestringer
Member Author

I rebased the PR against a minor conflict in the tree (go includes ordering). Should be good to merge as long as the smoke tests pass.

@joestringer joestringer merged commit 7e1056d into cilium:master Nov 4, 2021
@joestringer joestringer deleted the submit/fqdn-cidr-identity-leak branch November 4, 2021 21:10
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.6 Nov 12, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.6 Nov 23, 2021
@christarazi christarazi added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Jun 10, 2022