Restructure IPCache to handle metadata merging #19765

joestringer · 2022-05-10T17:46:57Z

IPCache has historically been based around a concept of associating IP prefixes with a certain Identity, where each caller independently determines the exact identity that a IP Prefix should have. Each of these callers has a corresponding source.Source which decides the priority between these different sources of Identity.

More recently however, features like #14724 have driven the need to combine multiple sources of information, such as associating both the remote-node labels (from k8s CiliumNode resources) and kube-apiserver labels (from k8s Endpoints / EndpointSlices resources). Having an exclusionary policy where only one source of information can be correct can have implications like causing conflicts between policies that match on information from each source. Up until now, we've resolved these conflicts somewhat manually by forcing the callers to handle the merging of this information. For most callers, they're independent enough that users would only encounter bugs in rare conditions of using two features together in an unexpected way. We have made some attempts in certain paths to resolve these conflicts where necessary, including the example above. However, over time this code becomes more and more 🍝 with each caller attempting to resolve conflicts with the other callers.

This PR reworks the metadata map inside the IPCache to handle multiple sources of information at once, keying them in the map first by IP prefix, then by a {resourceType, resourceNamespace, resourceName} tuple (ResourceID). When merging the information, a new prefixInfo structure provides convenient merging of the labels from all of the sources of information. In effect, this introduced one additional layer of indirection with a series of chained structures represented in the mermaid chart below:

flowchart
  subgraph resourceInfo
  labels.Labels
  source.Source
  end
  subgraph prefixInfo
  UA[ResourceID]-->LA[resourceInfo]
  UB[ResourceID]-->LB[resourceInfo]
  ...
  end
  subgraph identityMetadata
  IP_Prefix-->prefixInfo
  end

Furthemore, rather than the previous implementation where adds into the metadata map were asynchronous and deletes were synchronous, this PR refactors the code to force both paths to be asynchronous. The first step which is synchronous is to update the desired state of the IPCache metadata, ie upserting the labels into the metadata map. This is called directly from various watchers like k8s resource watchers or the node manager. These callers then TriggerLabelInjection() to apply the changes out-of-band. This causes a goroutine to separately iterate through the changes that need to be made, then ensure that the control plane policy engine (in the SelectorCache) to be updated, followed by the dataplane policy maps, then finally followed by the dataplane ipcache. This ensures that policy transitions for newly allocated identities will occur in the correct order to prevent packet drops when different labels are associated with (or removed from) IP prefixes.

By making the deletes also asynchronous and combining the core logic, we are able to also delete a bunch of code, and more consistently handle the various cases of changing the sets of labels associated with prefixes.

See the individual commits for more details.

Combined design and implementation with @christarazi .

Supersedes: #19758
Related: #18301

maintainer-s-little-helper · 2022-05-10T17:46:59Z

Commit 690dd50 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

joestringer · 2022-06-08T16:03:17Z

/test

pkg/ipcache/metadata.go

pkg/k8s/watchers/endpoint.go

joestringer · 2022-06-18T22:30:37Z

/test

christarazi · 2022-06-19T02:04:10Z

Looks like the failure from earlier is repeatable:

2022-06-18T23:28:17.549530025Z level=error msg="Failed to replace ipcache entry with new identity after label removal. Traffic may be disrupted." error="unable to overwrite source \"kvstore\" with source \"custom-resource\"" identity="{remote-node custom-resource false}" ipAddr=10.0.0.191 subsys=ipcache

joestringer · 2022-06-19T16:42:46Z

OK, it looks like in etcd mode the Cilium agent receives updates for the node IPs both via etcd and via k8s. In this case we get two events for the ipcache update with different sources, both attempting to associate the IP with 'remote-node'. The second event doesn't make a meaningful difference on the ipcache side since the first event already pushes this update down to the datapath.

I think we can just handle that case by attempting to allocate the identity for the "new" set of labels, then looking to see if the allocator just picked the existing identity. If so, then rather than upserting into the IPCache, we can just release the reference in the ipcache and continue:

diff --git a/pkg/ipcache/metadata.go b/pkg/ipcache/metadata.go
index 3c18db14f05f..c8dbb30ea7da 100644                                                               
--- a/pkg/ipcache/metadata.go
+++ b/pkg/ipcache/metadata.go          
@@ -218,6 +218,16 @@ func (ipc *IPCache) InjectLabels(ctx context.Context, modifiedPrefixes []netip.P
                                break
                        }                          
                                                   
+                       // It's plausible to pull the same information twice
+                       // from different sources, for instance in etcd mode
+                       // where node information is propagated both via the
+                       // kvstore and via the k8s control plane. If the new
+                       // security identity is the same as the noe currently
+                       // being used, then no need to update it.
+                       if id.ID == newID.ID {                                                        
+                               goto releaseIdentity
+                       }                          
+                                                  
                        idsToAdd[newID.ID] = lbls.LabelArray()
                        entriesToReplace[prefix] = Identity{
                                ID:     newID.ID,
@@ -233,6 +243,7 @@ func (ipc *IPCache) InjectLabels(ctx context.Context, modifiedPrefixes []netip.P
                                forceIPCacheUpdate[prefix] = true
                        }                          
                }                                  
+       releaseIdentity:                           
                if entryExists {
                        // 'prefix' is being removed or modified, so some prior
                        // iteration of this loop hit the 'injectLabels' case

Move the logic to change the host label into injectLabels() so that the caller doesn't need to worry about this detail. This allows injectLabels to fully handle ensuring that the specified labels are present in the returned identity, which will simplify further upcoming changes to generalize this code. In particular it will make it easier to handle identity (re-)allocations for other label-based features. Signed-off-by: Joe Stringer <joe@cilium.io>

Previously, the InjectLabels() function which reacts to updates to the IPCache metadata would iterate through the entire metadata map in order to detect changes that occur, and subsequently react to those updates. The reaction includes potentially allocating new identities, then triggering updates into the datapath policymaps and the datapath's ipcache. Given that there was no way to track which prefixes had been updated, this meant iterating over the entire map, perhaps re-allocating identities for prefixes that had not changed. This also made it more difficult in future to support different types of updates to the ipcache from other sources, since each feature would need to have special logic encoded in this function to correctly handle the identity allocation. Rework the InjectLabels() implementation to handle a queue of prefixes to update instead. This commit only modifies the path used while associating new sets of labels with a given prefix. A follow up commit will handle the delete cases. However, given that adding new sets of labels could also cause an existing prefix -> Labels (and hence Identity) mapping to change, even this case must handle the cleanup of old identities, which is operationally similar to deleting labels (and hence identities). This new code is also built around asynchronous updates to the label associations for the prefixes. When triggered it only tracks which prefixes have been changed, and not yet what it must do to ensure that these prefixes have the correct identity. It uses two key pieces of information to determine what must be done in order to ensure the correct association: * The desired state of the labels, prepared in the IPCache.metadata map during a previous UpsertMetadata() call, and * The realized state of the ipcache itself. The first time that a set of labels is associated with a prefix, the ipcache will likely not have any corresponding entry, so this case is simple: Check the desired state of the labels, allocate a corresponding identity, and then update the policy selectorcache, datapath policymaps, and finally the ipcache. The second time that a (new) set of labels is associated with a prefix, it gets a bit more complicated: The previous set of labels may still be relevant. First, a lookup into the existing ipcache tells the function that this prefix is already associated with some set of labels. * First, determine the complete set of labels that should be associated with this prefix. In this commit, the new labels are likely a superset of the existing set. A subsequent commit will handle disjoint sets. * Allocate a new identity corresponding to this complete set of labels. * Update the SelectorCache and endpoint policy maps to potentially allow this new identity. Traffic will not yet be identified with this new identity, but this prepares the datapath policy engine for that eventuality. * Update the IPcache to identify this prefix as having the new identity. * Now that the new identity is in use in the datapath, we can remove references to the old identity (in the SelectorCache and policymaps) * Finally, given that the IPCache no longer associates this prefix with the old identity, release the reference on the old identity to ensure that the old identity is returned back to the identity allocation pool. Previously, the "identity release" logic was performed as part of the core loop, and primarily based on whether any allocated identity was considered "new" or had the "kube-apiserver" labels associated with it, which was a bit difficult to reason about. That code is now simpler, being tracked via an `idsToDelete` map during the processing of modified prefixes, and the release is moved to the end of the function to ensure that it is not prematurely released. Co-authored-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>

Now that the IPCache handles metadata updates incrementally, we can switch the deletion path over to use this same incremental update logic. This allows the users of the deletion APIs to inject updates into the ipcache in a similar manner to the users of the add APIs, and have the updates incrementally triggered into the subsequent subsystems (policy, datapath) consistently. This means that updates into the ipcache from both add and delete paths for the kube-apiserver policy feature will only actually occur from a single goroutine, triggered by the TriggerLabelInjection call. This removes the need to reason about multiple concurrent adds or deletions into the IPcache occurring from the kube-apiserver policy feature, and it also lays the path to do the same in future for CIDR policy and other subsystems. This way, each user of the IPCache can propagate the information that it intends to push into the IPCache, then the IPCache itself can decide how to handle those updates and how to combine information from the various subsystems. Some key observations here: * The previous commit actually includes 90% of the logic required to implement deletions, based partially on the previous 'add' code and partially on the code being deleted here. * What changes here is adding support in InjectLabels() for the case where a set of labels is removed from a prefix. If all labels are removed, then this results in the 'IPCache.metadata' map having no corresponding entry, so in this case the corresponding old identity currently in 'IPCache' should be removed from the selectorcache, policymaps, and the IPCache. * Caveat: This is only the case if _only_ the metadata map has references to the identity. At this point in time, CIDR policies for instance are not yet converted over to the metadata map approach for associating labels with prefixes, so that path may independently allocate their own identities. If those are still referenced from CIDR policy, then the label injector should simply remove references to the corresponding identities but not remove it from the ipcache. * Another case is when there are some set of labels (eg A, B) associated with a prefix, then one set (eg B) is removed. The result is that a previous identity with labels (A, B) must be removed, and a new identity with labels (A) should be allocated / associated with the prefix. In general, this is very similar to the existing case where a set of labels is expanded by associating new labels with the prefix (already handled in the previous commit). * This also has a curly case: Each set of labels has a source associated, for instance initially there could be "remote-node" (source: custom-resource) and "kube-apiserver" (source: kube-apiserver). When previously upserting into the IPCache, the source will be kube-apiserver. If the kube apiserver is no longer associated with the IP, and hence that label removed, then the resulting set of labels will only be "remote-node", with source "custom-resource". Given that the source "custom-resource" has a lower priority in pkg/source than kube-apiserver source, we cannot update the ipcache directly with the new set of labels using the "custom-resource" source. However, the label removal is still legitimate. To work around the clunky APIs, the function here just overrides the source check in the IPCache.Upsert(). We should be able to remove this clunkiness over time when the metadata map is the primary source of information for prefixes, but more refactoring is necessary to get to that point. * Now that the label removal doesn't have its own independent logic from a separate goroutine, there is no longer a need to use the 'applyChangesMU' mutex in the metadata cache to ensure safety around the critical section. Furthermore, the core InjectLabels() call doesn't modify the metadata map. So, we can remove one lock and reduce the other lock down to a read mutex rather than holding it for write. Co-authored-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>

Most actual ipcache updates will in future be occurring as a result of triggering the label injection controller, so propagate the context from that controller down into the various child functions where they need a context for identity allocation and datapath map updates. Signed-off-by: Joe Stringer <joe@cilium.io>

Signed-off-by: Joe Stringer <joe@cilium.io>

Back in the commit "ipcache/md: Rework metadata cache for any labels", the underlying mechanism for storing IP metadata was made more generic to store any information that 'IPMetadata' represents. At the time, the upsert codepaths were made generic to handle different types of incoming information, but the remove paths remained specific to labels. Now that the remove paths have been tidied up, we can more easily abstract out the remove path as well. This should mean that future patches which wish to store alternative types of information in the ipcache should only need to update pkg/ipcache/types.go for the new data type, then pull that information out in the main InjectLabels() loop in order to create the new IPCache entry using the new info. Signed-off-by: Joe Stringer <joe@cilium.io>

This commit moves most of the newer ipcache logic over to "net/ipnet" so that we can have stronger types when dealing with later additions to the core IP structures. Previously with the string keys, the expectation was that users of the metadata cache would insert IP addresses rather than prefixes. However, features like CIDR policy need to insert using prefixes. We could ambiguously insert with either IPs or CIDRs like the main IPCache structure does, but this can easily introduce subtle bugs. Instead, swap out the key for the ipcache.metadata structure for netip.Prefix to force use of prefixes as the key even for existing use cases that are based around IP addresses. As a side benefit, netip can be used as map keys and it's supposed to be more efficient than the prior net types. This commit does not currently swap the main IPCache structure over to net/ipnet, as this would force us to deal with conflicts between IP and prefix keys in the main map. We can first introduce this in the metadata cache, then introduce a newer IPCache metadata API that handles conflicts (upcoming commit), migrate the existing codepaths that inject into the ipcache to the new API, and then finally migrate the main IPCache structure over to net/netip types. Co-authored-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>

Introduce a new generic API for associating information with the IPs in the IPCache, which accounts for multiple sources of information such as labels coming from different resources (eg Services -> reserved:kube-apiserver, NetPol -> CIDR labels). The primary core of this API is the UpsertMetadata(...) function, which takes the following parameters: - prefix: IP (range) that this info applies to; - src: info source of the information; - resource: specific resource name in the information source, - aux: variable length list of information to associate with the prefix. 'aux' is typed as IPMetadata, which is effectively just an interface{} to allow any information to be associated with the IPCache. Developers should read the comments around IPMetadata and expand the IPCache package in the relevant areas (particularly pkg/ipcache/types.go and the InjectLabels() call) to ensure that the IPCache package correctly handles the new information and effectively merges the different sources of info correctly. After info is upserted into the IPCache via this new API, it will automatically trigger an out-of-band resolution of what the new IPCache entry for the prefix should look like, taking into account each piece of source information from various resources. In this patch, we switch the current kube-apiserver logic over to the new API as an initial example, removing the need for the caller to trigger label injection since the new API will automatically schedule this. Future work will expand this to switch other subsystems over to the new APIs, introducing new resourceInfo fields and merging logic in the ipcache package to decide how complementary (or even conflicting) information should be combined in order to generate IPCache entries. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Chris Tarazi <chris@isovalent.com>

joestringer · 2022-09-01T05:37:06Z

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Click to show.

Test Name

K8sFQDNTest Restart Cilium validate that FQDN is still working

Failure Output

FAIL: Unable to restart unmanaged pods with 'kubectl -n kube-system delete pods coredns-8cfc78c54-bdd48': Exitcode: 1

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

joestringer · 2022-09-01T15:24:04Z

/mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9

👍 created #21176 #21177 #21178

joestringer · 2022-09-01T15:25:25Z

Triage:

4.9 hit some weird issues where core k8s components started crashing. Doesn't appear related to the PR.
net-next failed during provisioning.
external-workloads failed one of the to-fqdns test, it seems unlikely this is a real issue since all of the other tests are passing.

joestringer · 2022-09-01T15:25:35Z

/test-1.16-4.9

joestringer · 2022-09-01T15:25:43Z

/test-1.25-net-next

joestringer · 2022-09-01T15:25:50Z

/ci-external-workloads

joestringer requested review from a team May 10, 2022 17:46

joestringer requested a review from a team as a code owner May 10, 2022 17:46

joestringer added the release-note/misc This PR makes changes that have no direct user impact. label May 10, 2022

joestringer requested review from jibi and sayboras May 10, 2022 17:46

joestringer requested a review from jrfastab May 10, 2022 17:46

joestringer marked this pull request as draft May 10, 2022 17:47

joestringer removed request for jibi, jrfastab and sayboras May 10, 2022 17:47

joestringer added this to the 1.13 milestone May 10, 2022

joestringer force-pushed the pr/joe/ipcache-new branch from 690dd50 to f9e1d09 Compare June 8, 2022 06:54

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 8, 2022

joestringer force-pushed the pr/joe/ipcache-new branch from f9e1d09 to 9f1fbc0 Compare June 8, 2022 07:01

christarazi reviewed Jun 14, 2022

View reviewed changes

joestringer mentioned this pull request Jun 14, 2022

Remove TUNNEL_MAP lookup #20170

Open

joestringer force-pushed the pr/joe/ipcache-new branch 3 times, most recently from fd71399 to 536453b Compare June 18, 2022 17:49

This was referenced Jun 18, 2022

Move policy package over to asynchronous IPCache API #20116

Merged

Move NodeManager over to asynchronous IPCache API #20117

Closed

joestringer and others added 7 commits August 31, 2022 22:18

ipcache/md: Update label injection diagram

ad2d3c8

Signed-off-by: Joe Stringer <joe@cilium.io>

joestringer force-pushed the pr/joe/ipcache-new branch from c8ab347 to 179d2e8 Compare September 1, 2022 05:18

This comment was marked as resolved.

Sign in to view

maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Sep 1, 2022

joestringer force-pushed the pr/joe/ipcache-new branch from 179d2e8 to 3d191bc Compare September 1, 2022 05:30

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Sep 1, 2022

joestringer merged commit e637814 into master Sep 1, 2022

joestringer deleted the pr/joe/ipcache-new branch September 1, 2022 18:43

pchaigno mentioned this pull request Oct 31, 2022

ipcache: Fix IPcache leak of remote-node IP addresses #21932

Merged

joestringer mentioned this pull request Apr 14, 2023

ipcache don't short-circuit InjectLabels if source differs #24875

Merged

christarazi added kind/enhancement This would improve or streamline existing functionality. area/daemon Impacts operation of the Cilium daemon. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Apr 24, 2023

jaredledvina mentioned this pull request Aug 10, 2023

CiliumInternalIP removed & re-added on cilium-agent restart #27439

Closed

2 tasks

squeed mentioned this pull request Oct 4, 2023

"kube-apiserver" policy entity is broken on dual-stack clusters #28259

Closed

schlosna mentioned this pull request Oct 4, 2023

ipcache: fix flapping labels in SelectorCache when reserved:host identity has multiple IPs #28332

Merged

julianwiedmann mentioned this pull request Dec 19, 2023

[v1.13] node: Fix IP removal from ipset on node updates #29898

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restructure IPCache to handle metadata merging #19765

Restructure IPCache to handle metadata merging #19765

joestringer commented May 10, 2022 •

edited

maintainer-s-little-helper bot commented May 10, 2022

joestringer commented Jun 8, 2022

joestringer commented Jun 18, 2022

christarazi commented Jun 19, 2022

joestringer commented Jun 19, 2022 •

edited

This comment was marked as resolved.

joestringer commented Sep 1, 2022 •

edited by maintainer-s-little-helper bot

Test Name

Failure Output

joestringer commented Sep 1, 2022 •

edited by maintainer-s-little-helper bot

joestringer commented Sep 1, 2022

joestringer commented Sep 1, 2022

joestringer commented Sep 1, 2022

joestringer commented Sep 1, 2022

Restructure IPCache to handle metadata merging #19765

Restructure IPCache to handle metadata merging #19765

Conversation

joestringer commented May 10, 2022 • edited

maintainer-s-little-helper bot commented May 10, 2022

joestringer commented Jun 8, 2022

joestringer commented Jun 18, 2022

christarazi commented Jun 19, 2022

joestringer commented Jun 19, 2022 • edited

This comment was marked as resolved.

joestringer commented Sep 1, 2022 • edited by maintainer-s-little-helper bot

Test Name

Failure Output

joestringer commented Sep 1, 2022 • edited by maintainer-s-little-helper bot

joestringer commented Sep 1, 2022

joestringer commented Sep 1, 2022

joestringer commented Sep 1, 2022

joestringer commented Sep 1, 2022

joestringer commented May 10, 2022 •

edited

joestringer commented Jun 19, 2022 •

edited

joestringer commented Sep 1, 2022 •

edited by maintainer-s-little-helper bot

joestringer commented Sep 1, 2022 •

edited by maintainer-s-little-helper bot