Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.14] Fix bug where restored CIDR identities would be incorrectly released, leading to packet loss #27305

Closed
wants to merge 4 commits into from

Conversation

joestringer
Copy link
Member

This is a v1.14 branch backport for #27255 -- Fix bug where restored CIDR identities would be incorrectly released, leading to packet loss.

Once this PR is merged, you can update the PR labels via:

$ for pr in 27255; do contrib/backporting/set-labels.py $pr done 1.14; done

The first use case for this is truncating a slice of netip.Prefix for
logging purposes. Ideally it'd be sorted, but netip.Prefix isn't yet
comparable. There's upstream work going on for that, so in the mean time
let's just pick any random keys from the map and return N of them.

Signed-off-by: Joe Stringer <joe@cilium.io>
Some of the lines of this function will be shared by an upcoming commit.
Refactor this separately to minimize diff. No functional changes.

Signed-off-by: Joe Stringer <joe@cilium.io>
On startup the agent restores CIDR identities from the ipcache map and
allocates references to those Security Identities temporarily, until the
identity-restore-grace-period is complete. At the end of this period,
the CIDR Identities may be released. Although many of these CIDR
identities will be referenced from the FQDN selectors associated with
the DNS Name Manager, none of the references will be held by the name
manager or its FQDN selectors due to the way that internal reference
counting works for Identities in such rules. See commit de10e82
("Fix identity leak via FQDN selectors") for more history on the
structure regarding identity reference counts in the tree today.

The unfortunate consequence of this structure is that once the
identity-restore-grace-period completes, these identities will be
forgotten about until the next time a DNS lookup occurs for the names
associated with these IP addresses. This can cause connectivity
disruption for applications following an upgrade or restart of the
agent, if that connectivity is enabled by ToFQDNs policy.

One might be tempted to think, well what if we just don't release those
CIDR Identity references? If we don't release them, then Cilium won't
forget how to properly enforce policy for those Identities, and traffic
will continue to flow. However, in this scenario, regardless of whether
applications continue to leverage the CIDR identities, Cilium will keep
these Identities alive persistently for the lifetime of the current
agent, and even persist them on to the next time the agent starts up.
These identities are generated based on DNS lookup activity during the
agent lifetime, and there is currently no limit on how many of these
Identities there may be. If these are not appropriately garbage
collected, then endpoint policymaps may fill more easily, leading to
incorrect policy enforcement or even rejecting the scheduling of new
endpoints.

Well, then why don't we identify exactly which IP addresses are being
used by the FQDN selectors and allocate additional references to the
corresponding CIDR Identities on startup? The current code structure
doesn't have any sort of bootstrap phase where these restored CIDRs are
identified with the context of the corresponding FQDNs. Additionally,
any particular restored Identity for an IP could be selected by multiple
selectors, and the name manager expects to retain a reference to the
identity for each selector. So it's not enough just to reference count
the identities once per CIDR. Furthermore, this means that the policy
engine must have done a full policy calculation first in order to
initialize the FQDN selectors and attach them to the name manager, so
that we can determine how many references there may be to each CIDR
Identity!

At some point we should probably take a longer harder look at this and
maybe revisit some of the design decisions from commit de10e82
("Fix identity leak via FQDN selectors") around reference counting. In
the interests of preparing a safe, small, backportable fix, this commit
explores alternate short term options.

With that in mind, this commit proposes the following solution: wait
until the identity-restore-grace-period is completed and the identities
have their references released, then attempt to re-allocate the identity
references used by the name manager at that time. Any identities that
are already referenced will have referenced allocated and then removed
again by this new logic, resulting in a no-op. Any newly allocated
identities will be reinserted into the ipcache. This may cause brief
connectivity disruption to workloads impacted by the latter category of
identities, but this is better than having persistent disruption.

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer added the backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. label Aug 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 6, 2023
@joestringer joestringer changed the base branch from main to v1.14 August 6, 2023 18:32
@joestringer joestringer added kind/backports This PR provides functionality previously merged into master. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 6, 2023
@joestringer
Copy link
Member Author

/test-backport-1.14

Boris, Jason and Kim-Eirik report that ten minutes after startup, Cilium
releases CIDR identities that are allowing Cilium to implement network
policy, which leads to connectivity outages. Boris traced this down to
commit c96b9d8 ("ipcache: Remove superfluous if condition") which
attempted to solve a reference leak issue in the core of ipcache. Upon
deeper inspection, the agent's CIDR Identity restore logic was not
playing well with network policy allocations of CIDR Identity in the
ipcache, due to a misalignment of expectations of ownership over
Identity and IPcache entry structures between the two agent modules.
This commit resolves the misalignment by moving the CIDR Identity
restore logic over to the newer IPCache APIs, thereby balancing the
reference counting ownership between the different code paths.

Previously the daemon on startup would:
1. Read previously-used CIDRs from the bpf ipcache map
2. Allocate the old numeric identities for the corresponding CIDRs
3. Directly inject entries into the new bpf ipcachemap for these CIDR to
   Identity mappings, to be used by an upcoming datapath reinitialization.
4. Wait for the identity-restore-grace-period (default 10m)
5. Release its references to the identities
6. If these are the final references on the identities, attempt to remove
   the corresponding prefix->identity mappings from the ipcache.

The goal is to ensure that after Cilium starts up, it tries to retain
the same identity numbers for the same prefixes in order to minimize the
diff between policymaps as endpoints are progressively migrated from the
datapath configured by the old Cilium agent towards the new one
understood by this newly started agent.

Step (3) above uses the old ipcache APIs that delegate responsibility to
callers to properly reference count the Identities. However, nothing was
stopping other parts of Cilium from configuring network policies for the
same IP prefixes into the ipcache using the new ipcache APIs, where the
ipcache package itself takes care of reference counting.

If another subsystem like the policy engine requests the ipcache to
allocate & inject an identity + IP mapping into the datapath during step
(4) above, it would hit the inner loop of InjectLabels(). That logic
attempts to (re-)allocate an identity for the CIDR, which would coincide
with the identity allocated in Step (2) above. Subsequently, it checks
whether the Identity is already pushed into the datapath. If yes, then
it will release the (re-)allocated identity reference for the CIDR,
thinking that it had previously grabbed a reference on the Identity.
This is actually incorrect, the ipcache never acquired a reference on
the Identity for itself! In the short term connectivity is kept alive
because the CIDR retains a reference from the startup logic, but when
the identity restore grace period runs out, this triggers persistent
drops on the datapath until a user re-inserts their network policies or
otherwise triggers re-evaluation of the policy paths that inject
mappings into the ipcache.

In this scenario, when the agent gets to step (5), it will release the
final reference to each such CIDR. This releases the identity internally
across the agent so that the policy engine no longer has any knowledge
of it, which triggers policymap entry deletion on the datapath, leading
to policy drops.

As it turns out, due to the way that the ipcache entry is inserted into
the datapath with a higher priority source, the logic in step (6)
actually fails to clean up the datapath's ipcache entries since the
entry is not really owned by the daemon restore logic, but by the
ipcache at this stage. The result is that the datapath's ipcache maps
certain CIDRs to identities that do not exist in the agent.

In order to resolve this conflict, this commit moves steps (3) and (6)
for the identity restore logic over to the new ipcache APIs, so that the
ipcache can correctly allocate its own references for the identities and
keep them alive. The daemon startup logic for identity restore still
separately allocates and releases its own references to the underlying
identities in order to ensure that the identities retain the prefix
correspondence across upgrades, but this reference is no longer tied up
with the old ipcache interfaces that caused the reference count
ambiguity.

As a matter of process to clearly delineate the codepaths for step (5),
this commit also moves that code into the daemon so that there is no
ambiguity over whether the daemon's identity references are tied to the
ipcache or not.

Fixes: c96b9d8 ("ipcache: Remove superfluous if condition")
Reported-by: Boris Petrovic <carnerito.b@gmail.com>
Reported-by: Kim-Eirik Karlsen <kim.eirik@gmail.com>
Reported-by: Jason Witkowski <jason@witkow.ski>
Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer
Copy link
Member Author

/test-backport-1.14

@joestringer
Copy link
Member Author

joestringer commented Aug 14, 2023

Superseded by an alternative proposal (#27327).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant