Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructure IPCache to handle metadata merging #19765

Merged
merged 11 commits into from
Sep 1, 2022
Merged

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented May 10, 2022

Meta: #21142

IPCache has historically been based around a concept of associating IP prefixes with a certain Identity, where each caller independently determines the exact identity that a IP Prefix should have. Each of these callers has a corresponding source.Source which decides the priority between these different sources of Identity.

More recently however, features like #14724 have driven the need to combine multiple sources of information, such as associating both the remote-node labels (from k8s CiliumNode resources) and kube-apiserver labels (from k8s Endpoints / EndpointSlices resources). Having an exclusionary policy where only one source of information can be correct can have implications like causing conflicts between policies that match on information from each source. Up until now, we've resolved these conflicts somewhat manually by forcing the callers to handle the merging of this information. For most callers, they're independent enough that users would only encounter bugs in rare conditions of using two features together in an unexpected way. We have made some attempts in certain paths to resolve these conflicts where necessary, including the example above. However, over time this code becomes more and more 🍝 with each caller attempting to resolve conflicts with the other callers.

This PR reworks the metadata map inside the IPCache to handle multiple sources of information at once, keying them in the map first by IP prefix, then by a {resourceType, resourceNamespace, resourceName} tuple (ResourceID). When merging the information, a new prefixInfo structure provides convenient merging of the labels from all of the sources of information. In effect, this introduced one additional layer of indirection with a series of chained structures represented in the mermaid chart below:

flowchart
  subgraph resourceInfo
  labels.Labels
  source.Source
  end
  subgraph prefixInfo
  UA[ResourceID]-->LA[resourceInfo]
  UB[ResourceID]-->LB[resourceInfo]
  ...
  end
  subgraph identityMetadata
  IP_Prefix-->prefixInfo
  end

Furthemore, rather than the previous implementation where adds into the metadata map were asynchronous and deletes were synchronous, this PR refactors the code to force both paths to be asynchronous. The first step which is synchronous is to update the desired state of the IPCache metadata, ie upserting the labels into the metadata map. This is called directly from various watchers like k8s resource watchers or the node manager. These callers then TriggerLabelInjection() to apply the changes out-of-band. This causes a goroutine to separately iterate through the changes that need to be made, then ensure that the control plane policy engine (in the SelectorCache) to be updated, followed by the dataplane policy maps, then finally followed by the dataplane ipcache. This ensures that policy transitions for newly allocated identities will occur in the correct order to prevent packet drops when different labels are associated with (or removed from) IP prefixes.

By making the deletes also asynchronous and combining the core logic, we are able to also delete a bunch of code, and more consistently handle the various cases of changing the sets of labels associated with prefixes.

See the individual commits for more details.

Combined design and implementation with @christarazi .

Supersedes: #19758
Related: #18301

@joestringer joestringer requested review from a team May 10, 2022 17:46
@joestringer joestringer requested a review from a team as a code owner May 10, 2022 17:46
@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label May 10, 2022
@joestringer joestringer requested review from jibi and sayboras May 10, 2022 17:46
@maintainer-s-little-helper
Copy link

Commit 690dd50 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@joestringer joestringer requested a review from jrfastab May 10, 2022 17:46
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels May 10, 2022
@joestringer joestringer marked this pull request as draft May 10, 2022 17:47
@joestringer joestringer added this to the 1.13 milestone May 10, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 8, 2022
@joestringer
Copy link
Member Author

/test

pkg/ipcache/metadata.go Outdated Show resolved Hide resolved
pkg/ipcache/metadata.go Outdated Show resolved Hide resolved
pkg/ipcache/metadata.go Outdated Show resolved Hide resolved
pkg/k8s/watchers/endpoint.go Outdated Show resolved Hide resolved
pkg/k8s/watchers/endpoint.go Outdated Show resolved Hide resolved
@joestringer
Copy link
Member Author

/test

@christarazi
Copy link
Member

Looks like the failure from earlier is repeatable:

2022-06-18T23:28:17.549530025Z level=error msg="Failed to replace ipcache entry with new identity after label removal. Traffic may be disrupted." error="unable to overwrite source \"kvstore\" with source \"custom-resource\"" identity="{remote-node custom-resource false}" ipAddr=10.0.0.191 subsys=ipcache

@joestringer
Copy link
Member Author

joestringer commented Jun 19, 2022

OK, it looks like in etcd mode the Cilium agent receives updates for the node IPs both via etcd and via k8s. In this case we get two events for the ipcache update with different sources, both attempting to associate the IP with 'remote-node'. The second event doesn't make a meaningful difference on the ipcache side since the first event already pushes this update down to the datapath.

I think we can just handle that case by attempting to allocate the identity for the "new" set of labels, then looking to see if the allocator just picked the existing identity. If so, then rather than upserting into the IPCache, we can just release the reference in the ipcache and continue:

diff --git a/pkg/ipcache/metadata.go b/pkg/ipcache/metadata.go
index 3c18db14f05f..c8dbb30ea7da 100644                                                               
--- a/pkg/ipcache/metadata.go
+++ b/pkg/ipcache/metadata.go          
@@ -218,6 +218,16 @@ func (ipc *IPCache) InjectLabels(ctx context.Context, modifiedPrefixes []netip.P
                                break
                        }                          
                                                   
+                       // It's plausible to pull the same information twice
+                       // from different sources, for instance in etcd mode
+                       // where node information is propagated both via the
+                       // kvstore and via the k8s control plane. If the new
+                       // security identity is the same as the noe currently
+                       // being used, then no need to update it.
+                       if id.ID == newID.ID {                                                        
+                               goto releaseIdentity
+                       }                          
+                                                  
                        idsToAdd[newID.ID] = lbls.LabelArray()
                        entriesToReplace[prefix] = Identity{
                                ID:     newID.ID,
@@ -233,6 +243,7 @@ func (ipc *IPCache) InjectLabels(ctx context.Context, modifiedPrefixes []netip.P
                                forceIPCacheUpdate[prefix] = true
                        }                          
                }                                  
+       releaseIdentity:                           
                if entryExists {
                        // 'prefix' is being removed or modified, so some prior
                        // iteration of this loop hit the 'injectLabels' case

joestringer and others added 7 commits August 31, 2022 22:18
Move the logic to change the host label into injectLabels() so that the
caller doesn't need to worry about this detail. This allows injectLabels
to fully handle ensuring that the specified labels are present in the
returned identity, which will simplify further upcoming changes to
generalize this code. In particular it will make it easier to handle
identity (re-)allocations for other label-based features.

Signed-off-by: Joe Stringer <joe@cilium.io>
Previously, the InjectLabels() function which reacts to updates to the
IPCache metadata would iterate through the entire metadata map in order
to detect changes that occur, and subsequently react to those updates.
The reaction includes potentially allocating new identities, then
triggering updates into the datapath policymaps and the datapath's ipcache.

Given that there was no way to track which prefixes had been updated,
this meant iterating over the entire map, perhaps re-allocating
identities for prefixes that had not changed. This also made it more
difficult in future to support different types of updates to the ipcache
from other sources, since each feature would need to have special logic
encoded in this function to correctly handle the identity allocation.

Rework the InjectLabels() implementation to handle a queue of prefixes
to update instead. This commit only modifies the path used while
associating new sets of labels with a given prefix. A follow up commit
will handle the delete cases. However, given that adding new sets of
labels could also cause an existing prefix -> Labels (and hence Identity)
mapping to change, even this case must handle the cleanup of old
identities, which is operationally similar to deleting labels (and hence
identities).

This new code is also built around asynchronous updates to the label
associations for the prefixes. When triggered it only tracks which
prefixes have been changed, and not yet what it must do to ensure that
these prefixes have the correct identity. It uses two key pieces of
information to determine what must be done in order to ensure the
correct association:
* The desired state of the labels, prepared in the IPCache.metadata map
  during a previous UpsertMetadata() call, and
* The realized state of the ipcache itself.

The first time that a set of labels is associated with a prefix, the
ipcache will likely not have any corresponding entry, so this case is
simple: Check the desired state of the labels, allocate a corresponding
identity, and then update the policy selectorcache, datapath policymaps,
and finally the ipcache.

The second time that a (new) set of labels is associated with a prefix,
it gets a bit more complicated: The previous set of labels may still be
relevant. First, a lookup into the existing ipcache tells the function
that this prefix is already associated with some set of labels.

* First, determine the complete set of labels that should be associated
  with this prefix. In this commit, the new labels are likely a
  superset of the existing set. A subsequent commit will handle disjoint
  sets.
* Allocate a new identity corresponding to this complete set of labels.
* Update the SelectorCache and endpoint policy maps to potentially allow
  this new identity. Traffic will not yet be identified with this new
  identity, but this prepares the datapath policy engine for that
  eventuality.
* Update the IPcache to identify this prefix as having the new identity.
* Now that the new identity is in use in the datapath, we can remove
  references to the old identity (in the SelectorCache and policymaps)
* Finally, given that the IPCache no longer associates this prefix with
  the old identity, release the reference on the old identity to ensure
  that the old identity is returned back to the identity allocation pool.

Previously, the "identity release" logic was performed as part of the
core loop, and primarily based on whether any allocated identity was
considered "new" or had the "kube-apiserver" labels associated with it,
which was a bit difficult to reason about. That code is now simpler,
being tracked via an `idsToDelete` map during the processing of modified
prefixes, and the release is moved to the end of the function to ensure
that it is not prematurely released.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Now that the IPCache handles metadata updates incrementally, we can
switch the deletion path over to use this same incremental update logic.
This allows the users of the deletion APIs to inject updates into the
ipcache in a similar manner to the users of the add APIs, and have the
updates incrementally triggered into the subsequent subsystems (policy,
datapath) consistently. This means that updates into the ipcache from
both add and delete paths for the kube-apiserver policy feature will
only actually occur from a single goroutine, triggered by the
TriggerLabelInjection call.

This removes the need to reason about multiple concurrent adds or
deletions into the IPcache occurring from the kube-apiserver policy
feature, and it also lays the path to do the same in future for CIDR
policy and other subsystems. This way, each user of the IPCache can
propagate the information that it intends to push into the IPCache, then
the IPCache itself can decide how to handle those updates and how to
combine information from the various subsystems.

Some key observations here:
* The previous commit actually includes 90% of the logic required to
  implement deletions, based partially on the previous 'add' code and
  partially on the code being deleted here.
* What changes here is adding support in InjectLabels() for the case
  where a set of labels is removed from a prefix. If all labels are
  removed, then this results in the 'IPCache.metadata' map having no
  corresponding entry, so in this case the corresponding old identity
  currently in 'IPCache' should be removed from the selectorcache,
  policymaps, and the IPCache.
  * Caveat: This is only the case if _only_ the metadata map has
    references to the identity. At this point in time, CIDR policies for
    instance are not yet converted over to the metadata map approach for
    associating labels with prefixes, so that path may independently
    allocate their own identities. If those are still referenced from
    CIDR policy, then the label injector should simply remove references
    to the corresponding identities but not remove it from the ipcache.
* Another case is when there are some set of labels (eg A, B) associated
  with a prefix, then one set (eg B) is removed. The result is that a
  previous identity with labels (A, B) must be removed, and a new
  identity with labels (A) should be allocated / associated with the
  prefix. In general, this is very similar to the existing case where a
  set of labels is expanded by associating new labels with the prefix
  (already handled in the previous commit).
  * This also has a curly case: Each set of labels has a source
    associated, for instance initially there could be "remote-node"
    (source: custom-resource) and "kube-apiserver" (source:
    kube-apiserver). When previously upserting into the IPCache, the
    source will be kube-apiserver. If the kube apiserver is no longer
    associated with the IP, and hence that label removed, then the
    resulting set of labels will only be "remote-node", with source
    "custom-resource". Given that the source "custom-resource" has a
    lower priority in pkg/source than kube-apiserver source, we cannot
    update the ipcache directly with the new set of labels using the
    "custom-resource" source. However, the label removal is still
    legitimate. To work around the clunky APIs, the function here just
    overrides the source check in the IPCache.Upsert(). We should be
    able to remove this clunkiness over time when the metadata map is
    the primary source of information for prefixes, but more refactoring
    is necessary to get to that point.
* Now that the label removal doesn't have its own independent logic from
  a separate goroutine, there is no longer a need to use the
  'applyChangesMU' mutex in the metadata cache to ensure safety around
  the critical section. Furthermore, the core InjectLabels() call
  doesn't modify the metadata map. So, we can remove one lock and reduce
  the other lock down to a read mutex rather than holding it for write.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Most actual ipcache updates will in future be occurring as a result of
triggering the label injection controller, so propagate the context from
that controller down into the various child functions where they need a
context for identity allocation and datapath map updates.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Back in the commit "ipcache/md: Rework metadata cache for any labels",
the underlying mechanism for storing IP metadata was made more generic
to store any information that 'IPMetadata' represents. At the time, the
upsert codepaths were made generic to handle different types of incoming
information, but the remove paths remained specific to labels.

Now that the remove paths have been tidied up, we can more easily
abstract out the remove path as well. This should mean that future
patches which wish to store alternative types of information in the
ipcache should only need to update pkg/ipcache/types.go for the new data
type, then pull that information out in the main InjectLabels() loop in
order to create the new IPCache entry using the new info.

Signed-off-by: Joe Stringer <joe@cilium.io>
This commit moves most of the newer ipcache logic over to "net/ipnet" so
that we can have stronger types when dealing with later additions to the
core IP structures. Previously with the string keys, the expectation was
that users of the metadata cache would insert IP addresses rather than
prefixes. However, features like CIDR policy need to insert using
prefixes. We could ambiguously insert with either IPs or CIDRs like the
main IPCache structure does, but this can easily introduce subtle bugs.
Instead, swap out the key for the ipcache.metadata structure for
netip.Prefix to force use of prefixes as the key even for existing use
cases that are based around IP addresses. As a side benefit, netip can
be used as map keys and it's supposed to be more efficient than the
prior net types.

This commit does not currently swap the main IPCache structure over to
net/ipnet, as this would force us to deal with conflicts between IP and
prefix keys in the main map. We can first introduce this in the metadata
cache, then introduce a newer IPCache metadata API that handles
conflicts (upcoming commit), migrate the existing codepaths that inject
into the ipcache to the new API, and then finally migrate the main
IPCache structure over to net/netip types.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
@maintainer-s-little-helper

This comment was marked as resolved.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Sep 1, 2022
Introduce a new generic API for associating information with the
IPs in the IPCache, which accounts for multiple sources of information
such as labels coming from different resources (eg Services ->
reserved:kube-apiserver, NetPol -> CIDR labels).

The primary core of this API is the UpsertMetadata(...) function, which
takes the following parameters:
- prefix: IP (range) that this info applies to;
- src: info source of the information;
- resource: specific resource name in the information source,
- aux: variable length list of information to associate with the prefix.

'aux' is typed as IPMetadata, which is effectively just an interface{}
to allow any information to be associated with the IPCache. Developers
should read the comments around IPMetadata and expand the IPCache
package in the relevant areas (particularly pkg/ipcache/types.go and the
InjectLabels() call) to ensure that the IPCache package correctly
handles the new information and effectively merges the different sources
of info correctly.

After info is upserted into the IPCache via this new API, it will
automatically trigger an out-of-band resolution of what the new IPCache
entry for the prefix should look like, taking into account each piece of
source information from various resources.

In this patch, we switch the current kube-apiserver logic over to the
new API as an initial example, removing the need for the caller to
trigger label injection since the new API will automatically schedule
this. Future work will expand this to switch other subsystems over to
the new APIs, introducing new resourceInfo fields and merging logic in
the ipcache package to decide how complementary (or even conflicting)
information should be combined in order to generate IPCache entries.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Sep 1, 2022
@joestringer
Copy link
Member Author

joestringer commented Sep 1, 2022

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Click to show.

Test Name

K8sFQDNTest Restart Cilium validate that FQDN is still working

Failure Output

FAIL: Unable to restart unmanaged pods with 'kubectl -n kube-system delete pods coredns-8cfc78c54-bdd48': Exitcode: 1 

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

@joestringer
Copy link
Member Author

joestringer commented Sep 1, 2022

/mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9

👍 created #21176 #21177 #21178

@joestringer
Copy link
Member Author

Triage:

  • 4.9 hit some weird issues where core k8s components started crashing. Doesn't appear related to the PR.
  • net-next failed during provisioning.
  • external-workloads failed one of the to-fqdns test, it seems unlikely this is a real issue since all of the other tests are passing.

@joestringer
Copy link
Member Author

/test-1.16-4.9

@joestringer
Copy link
Member Author

/test-1.25-net-next

@joestringer
Copy link
Member Author

/ci-external-workloads

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. kind/enhancement This would improve or streamline existing functionality. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants