Restructure IPCache to handle metadata merging #19765
Commit 690dd50 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
/test
/test
Looks like the failure from earlier is repeatable:
|
OK, it looks like in etcd mode the Cilium agent receives updates for the node IPs both via etcd and via k8s. In this case we get two events for the ipcache update with different sources, both attempting to associate the IP with 'remote-node'. The second event doesn't make a meaningful difference on the ipcache side since the first event already pushes this update down to the datapath. I think we can just handle that case by attempting to allocate the identity for the "new" set of labels, then looking to see if the allocator just picked the existing identity. If so, then rather than upserting into the IPCache, we can just release the reference in the ipcache and continue:
|
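A minimal sketch of the idea in the comment above, not the code from this PR: allocate for the "new" label set, and if the allocator hands back the identity the prefix already has, drop the extra reference and skip the upsert. The `allocator` type, `resolve` function, and the string-keyed `ipcache` map are hypothetical stand-ins for illustration only.

```go
package main

import "fmt"

// Identity is a hypothetical stand-in for a numeric security identity.
type Identity uint32

// allocator is a toy reference-counted allocator keyed by a label string.
type allocator struct {
	next  Identity
	byKey map[string]Identity
	refs  map[Identity]int
}

func newAllocator() *allocator {
	return &allocator{next: 1, byKey: map[string]Identity{}, refs: map[Identity]int{}}
}

// Allocate returns the identity for the labels and takes one reference on it.
func (a *allocator) Allocate(labels string) Identity {
	id, ok := a.byKey[labels]
	if !ok {
		id = a.next
		a.next++
		a.byKey[labels] = id
	}
	a.refs[id]++
	return id
}

// Release drops one reference on the identity.
func (a *allocator) Release(id Identity) { a.refs[id]-- }

// resolve allocates an identity for the labels and reports whether the
// ipcache entry had to be written. A duplicate event (same prefix, same
// resulting identity) only releases the extra reference.
func resolve(a *allocator, ipcache map[string]Identity, prefix, labels string) bool {
	id := a.Allocate(labels)
	if old, ok := ipcache[prefix]; ok && old == id {
		a.Release(id) // the allocator picked the existing identity
		return false
	}
	ipcache[prefix] = id
	return true
}

func main() {
	a := newAllocator()
	ipcache := map[string]Identity{}
	fmt.Println(resolve(a, ipcache, "10.0.0.1/32", "remote-node")) // first event (e.g. via etcd)
	fmt.Println(resolve(a, ipcache, "10.0.0.1/32", "remote-node")) // duplicate event (e.g. via k8s)
}
```

The sketch leaves out releasing a previous, different identity; that cleanup is a separate step in the commits below.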
Move the logic to change the host label into injectLabels() so that the caller doesn't need to worry about this detail. This allows injectLabels to fully handle ensuring that the specified labels are present in the returned identity, which will simplify further upcoming changes to generalize this code. In particular it will make it easier to handle identity (re-)allocations for other label-based features.

Signed-off-by: Joe Stringer <joe@cilium.io>
Previously, the InjectLabels() function which reacts to updates to the IPCache metadata would iterate through the entire metadata map in order to detect changes that occur, and subsequently react to those updates. The reaction includes potentially allocating new identities, then triggering updates into the datapath policymaps and the datapath's ipcache. Given that there was no way to track which prefixes had been updated, this meant iterating over the entire map, perhaps re-allocating identities for prefixes that had not changed. This also made it more difficult in future to support different types of updates to the ipcache from other sources, since each feature would need to have special logic encoded in this function to correctly handle the identity allocation.

Rework the InjectLabels() implementation to handle a queue of prefixes to update instead. This commit only modifies the path used while associating new sets of labels with a given prefix. A follow up commit will handle the delete cases. However, given that adding new sets of labels could also cause an existing prefix -> Labels (and hence Identity) mapping to change, even this case must handle the cleanup of old identities, which is operationally similar to deleting labels (and hence identities).

This new code is also built around asynchronous updates to the label associations for the prefixes. When triggered it only tracks which prefixes have been changed, and not yet what it must do to ensure that these prefixes have the correct identity. It uses two key pieces of information to determine what must be done in order to ensure the correct association:

* The desired state of the labels, prepared in the IPCache.metadata map during a previous UpsertMetadata() call, and
* The realized state of the ipcache itself.

The first time that a set of labels is associated with a prefix, the ipcache will likely not have any corresponding entry, so this case is simple: Check the desired state of the labels, allocate a corresponding identity, and then update the policy selectorcache, datapath policymaps, and finally the ipcache.

The second time that a (new) set of labels is associated with a prefix, it gets a bit more complicated: The previous set of labels may still be relevant. First, a lookup into the existing ipcache tells the function that this prefix is already associated with some set of labels.

* First, determine the complete set of labels that should be associated with this prefix. In this commit, the new labels are likely a superset of the existing set. A subsequent commit will handle disjoint sets.
* Allocate a new identity corresponding to this complete set of labels.
* Update the SelectorCache and endpoint policy maps to potentially allow this new identity. Traffic will not yet be identified with this new identity, but this prepares the datapath policy engine for that eventuality.
* Update the IPcache to identify this prefix as having the new identity.
* Now that the new identity is in use in the datapath, we can remove references to the old identity (in the SelectorCache and policymaps)
* Finally, given that the IPCache no longer associates this prefix with the old identity, release the reference on the old identity to ensure that the old identity is returned back to the identity allocation pool.

Previously, the "identity release" logic was performed as part of the core loop, and primarily based on whether any allocated identity was considered "new" or had the "kube-apiserver" labels associated with it, which was a bit difficult to reason about. That code is now simpler, being tracked via an `idsToDelete` map during the processing of modified prefixes, and the release is moved to the end of the function to ensure that it is not prematurely released.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
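The ordering described in this commit message can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the actual InjectLabels() code: the `agent` and `update` types are hypothetical, identities are assumed to be already allocated for each modified prefix, and the subsystems are reduced to maps plus a log that records the order of operations. Only the `idsToDelete` name comes from the commit message.

```go
package main

import "fmt"

// Identity is a hypothetical stand-in for a numeric security identity.
type Identity int

// update pairs a modified prefix with the identity already allocated for
// its complete set of labels.
type update struct {
	prefix string
	id     Identity
}

// agent is a toy model of the subsystems touched by label injection.
type agent struct {
	policy  map[Identity]bool   // identities known to the policy engine
	ipcache map[string]Identity // realized prefix -> identity state
	log     []string            // order in which operations happened
}

// inUse reports whether any ipcache entry still maps to the identity.
func (a *agent) inUse(id Identity) bool {
	for _, cur := range a.ipcache {
		if cur == id {
			return true
		}
	}
	return false
}

// inject applies the modified prefixes in the order described above.
func (a *agent) inject(modified []update) {
	idsToDelete := map[Identity]bool{}
	// 1. Let the policy engine learn the new identities first.
	for _, u := range modified {
		a.policy[u.id] = true
		a.log = append(a.log, fmt.Sprintf("policy add %d", u.id))
	}
	// 2. Switch the ipcache over to the new identities.
	for _, u := range modified {
		if old, ok := a.ipcache[u.prefix]; ok && old != u.id {
			idsToDelete[old] = true
		}
		a.ipcache[u.prefix] = u.id
		a.log = append(a.log, fmt.Sprintf("ipcache %s -> %d", u.prefix, u.id))
	}
	// 3. Only now drop policy references to old identities and release them.
	for old := range idsToDelete {
		if a.inUse(old) {
			continue // another prefix still carries this identity
		}
		delete(a.policy, old)
		a.log = append(a.log, fmt.Sprintf("policy remove %d", old))
		a.log = append(a.log, fmt.Sprintf("release %d", old))
	}
}

func main() {
	a := &agent{
		policy:  map[Identity]bool{1: true},
		ipcache: map[string]Identity{"10.0.0.1/32": 1},
	}
	a.inject([]update{{prefix: "10.0.0.1/32", id: 2}})
	for _, line := range a.log {
		fmt.Println(line)
	}
}
```

Deferring the release to the end is what keeps an old identity from being returned to the pool while the datapath could still be using it.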
Now that the IPCache handles metadata updates incrementally, we can switch the deletion path over to use this same incremental update logic. This allows the users of the deletion APIs to inject updates into the ipcache in a similar manner to the users of the add APIs, and have the updates incrementally triggered into the subsequent subsystems (policy, datapath) consistently.

This means that updates into the ipcache from both add and delete paths for the kube-apiserver policy feature will only actually occur from a single goroutine, triggered by the TriggerLabelInjection call. This removes the need to reason about multiple concurrent adds or deletions into the IPcache occurring from the kube-apiserver policy feature, and it also lays the path to do the same in future for CIDR policy and other subsystems. This way, each user of the IPCache can propagate the information that it intends to push into the IPCache, then the IPCache itself can decide how to handle those updates and how to combine information from the various subsystems.

Some key observations here:

* The previous commit actually includes 90% of the logic required to implement deletions, based partially on the previous 'add' code and partially on the code being deleted here.
* What changes here is adding support in InjectLabels() for the case where a set of labels is removed from a prefix. If all labels are removed, then this results in the 'IPCache.metadata' map having no corresponding entry, so in this case the corresponding old identity currently in 'IPCache' should be removed from the selectorcache, policymaps, and the IPCache.
  * Caveat: This is only the case if _only_ the metadata map has references to the identity. At this point in time, CIDR policies for instance are not yet converted over to the metadata map approach for associating labels with prefixes, so that path may independently allocate their own identities. If those are still referenced from CIDR policy, then the label injector should simply remove references to the corresponding identities but not remove it from the ipcache.
* Another case is when there are some set of labels (eg A, B) associated with a prefix, then one set (eg B) is removed. The result is that a previous identity with labels (A, B) must be removed, and a new identity with labels (A) should be allocated / associated with the prefix. In general, this is very similar to the existing case where a set of labels is expanded by associating new labels with the prefix (already handled in the previous commit).
* This also has a curly case: Each set of labels has a source associated, for instance initially there could be "remote-node" (source: custom-resource) and "kube-apiserver" (source: kube-apiserver). When previously upserting into the IPCache, the source will be kube-apiserver. If the kube apiserver is no longer associated with the IP, and hence that label removed, then the resulting set of labels will only be "remote-node", with source "custom-resource". Given that the source "custom-resource" has a lower priority in pkg/source than kube-apiserver source, we cannot update the ipcache directly with the new set of labels using the "custom-resource" source. However, the label removal is still legitimate. To work around the clunky APIs, the function here just overrides the source check in the IPCache.Upsert(). We should be able to remove this clunkiness over time when the metadata map is the primary source of information for prefixes, but more refactoring is necessary to get to that point.
* Now that the label removal doesn't have its own independent logic from a separate goroutine, there is no longer a need to use the 'applyChangesMU' mutex in the metadata cache to ensure safety around the critical section. Furthermore, the core InjectLabels() call doesn't modify the metadata map. So, we can remove one lock and reduce the other lock down to a read mutex rather than holding it for write.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
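To make the source-priority corner case concrete, here is a small sketch assuming a numeric priority per source. The `Source` constants, the `entry` type, and the `force` parameter are hypothetical and do not reflect the real pkg/source or IPCache.Upsert() signatures; only the relative priority of the two sources is taken from the commit message.

```go
package main

import "fmt"

// Source is a hypothetical priority-ordered source; a higher value wins.
type Source int

const (
	CustomResource Source = iota + 1 // lower priority
	KubeAPIServer                    // higher priority
)

// entry is a toy ipcache entry: merged labels plus the source that wrote them.
type entry struct {
	labels string
	src    Source
}

// upsert writes e unless an existing entry came from a higher-priority
// source. force skips that check, which is what a legitimate label removal
// needs when the remaining labels come from a lower-priority source.
func upsert(ipcache map[string]entry, prefix string, e entry, force bool) bool {
	if old, ok := ipcache[prefix]; ok && !force && e.src < old.src {
		return false
	}
	ipcache[prefix] = e
	return true
}

func main() {
	ipcache := map[string]entry{
		"10.0.0.1/32": {labels: "remote-node,kube-apiserver", src: KubeAPIServer},
	}
	remaining := entry{labels: "remote-node", src: CustomResource}
	fmt.Println(upsert(ipcache, "10.0.0.1/32", remaining, false)) // rejected by the priority check
	fmt.Println(upsert(ipcache, "10.0.0.1/32", remaining, true))  // accepted with the override
}
```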
Most actual ipcache updates will in future be occurring as a result of triggering the label injection controller, so propagate the context from that controller down into the various child functions where they need a context for identity allocation and datapath map updates.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Back in the commit "ipcache/md: Rework metadata cache for any labels", the underlying mechanism for storing IP metadata was made more generic to store any information that 'IPMetadata' represents. At the time, the upsert codepaths were made generic to handle different types of incoming information, but the remove paths remained specific to labels.

Now that the remove paths have been tidied up, we can more easily abstract out the remove path as well. This should mean that future patches which wish to store alternative types of information in the ipcache should only need to update pkg/ipcache/types.go for the new data type, then pull that information out in the main InjectLabels() loop in order to create the new IPCache entry using the new info.

Signed-off-by: Joe Stringer <joe@cilium.io>
This commit moves most of the newer ipcache logic over to "net/netip" so that we can have stronger types when dealing with later additions to the core IP structures.

Previously with the string keys, the expectation was that users of the metadata cache would insert IP addresses rather than prefixes. However, features like CIDR policy need to insert using prefixes. We could ambiguously insert with either IPs or CIDRs like the main IPCache structure does, but this can easily introduce subtle bugs. Instead, swap out the key for the ipcache.metadata structure for netip.Prefix to force use of prefixes as the key even for existing use cases that are based around IP addresses. As a side benefit, netip types can be used as map keys and they are supposed to be more efficient than the prior net types.

This commit does not currently swap the main IPCache structure over to net/netip, as this would force us to deal with conflicts between IP and prefix keys in the main map. We can first introduce this in the metadata cache, then introduce a newer IPCache metadata API that handles conflicts (upcoming commit), migrate the existing codepaths that inject into the ipcache to the new API, and then finally migrate the main IPCache structure over to net/netip types.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
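For readers less familiar with net/netip, a short sketch of the keying idea: a single IP address becomes a full-length prefix, so IPs and CIDRs share one key type. `prefixForIP` and the `metadata` map here are hypothetical helpers for illustration; the netip calls are from the Go standard library.

```go
package main

import (
	"fmt"
	"net/netip"
)

// prefixForIP converts a single IP address into a full-length prefix
// (/32 for IPv4, /128 for IPv6) so it can share a key type with CIDRs.
func prefixForIP(s string) (netip.Prefix, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return netip.Prefix{}, err
	}
	return netip.PrefixFrom(addr, addr.BitLen()), nil
}

func main() {
	// netip.Prefix is comparable, so it can be used directly as a map key.
	metadata := map[netip.Prefix][]string{}

	p, err := prefixForIP("10.0.0.1")
	if err != nil {
		panic(err)
	}
	metadata[p] = []string{"reserved:remote-node"}

	// A CIDR uses the same key type, so the two cannot be confused.
	cidr := netip.MustParsePrefix("10.0.0.0/24")
	metadata[cidr] = []string{"cidr:10.0.0.0/24"}

	fmt.Println(p, cidr, len(metadata))
}
```

With string keys, "10.0.0.1" and "10.0.0.1/32" would be two different map entries; with the prefix type there is only one representation.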
Introduce a new generic API for associating information with the IPs in the IPCache, which accounts for multiple sources of information such as labels coming from different resources (eg Services -> reserved:kube-apiserver, NetPol -> CIDR labels).

The primary core of this API is the UpsertMetadata(...) function, which takes the following parameters:

- prefix: IP (range) that this info applies to;
- src: info source of the information;
- resource: specific resource name in the information source,
- aux: variable length list of information to associate with the prefix.

'aux' is typed as IPMetadata, which is effectively just an interface{} to allow any information to be associated with the IPCache. Developers should read the comments around IPMetadata and expand the IPCache package in the relevant areas (particularly pkg/ipcache/types.go and the InjectLabels() call) to ensure that the IPCache package correctly handles the new information and effectively merges the different sources of info correctly.

After info is upserted into the IPCache via this new API, it will automatically trigger an out-of-band resolution of what the new IPCache entry for the prefix should look like, taking into account each piece of source information from various resources.

In this patch, we switch the current kube-apiserver logic over to the new API as an initial example, removing the need for the caller to trigger label injection since the new API will automatically schedule this. Future work will expand this to switch other subsystems over to the new APIs, introducing new resourceInfo fields and merging logic in the ipcache package to decide how complementary (or even conflicting) information should be combined in order to generate IPCache entries.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
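The shape of the API described above might look roughly like the sketch below. The parameter names and the `IPMetadata` / `resourceInfo` names follow the commit message; everything else (the `metadata` struct, the simplified `Labels`, `Source`, and `ResourceID` types, the slice-based queue, the resource name string) is an assumption for illustration and not the real pkg/ipcache code.

```go
package main

import (
	"fmt"
	"net/netip"
)

// IPMetadata is effectively interface{}, as the commit message describes.
type IPMetadata interface{}

// Labels, Source and ResourceID are simplified, hypothetical stand-ins.
type Labels []string
type Source string
type ResourceID string

// resourceInfo holds what one resource has said about one prefix.
type resourceInfo struct {
	source Source
	labels Labels
}

// metadata is a toy desired-state store plus a queue of prefixes that
// still need out-of-band resolution.
type metadata struct {
	byPrefix map[netip.Prefix]map[ResourceID]*resourceInfo
	queue    []netip.Prefix
}

func newMetadata() *metadata {
	return &metadata{byPrefix: map[netip.Prefix]map[ResourceID]*resourceInfo{}}
}

// UpsertMetadata records aux for (prefix, resource) and schedules the
// prefix for resolution; the caller does not trigger injection itself.
func (m *metadata) UpsertMetadata(prefix netip.Prefix, src Source, resource ResourceID, aux ...IPMetadata) {
	if m.byPrefix[prefix] == nil {
		m.byPrefix[prefix] = map[ResourceID]*resourceInfo{}
	}
	info := m.byPrefix[prefix][resource]
	if info == nil {
		info = &resourceInfo{}
		m.byPrefix[prefix][resource] = info
	}
	info.source = src
	for _, a := range aux {
		switch v := a.(type) {
		case Labels:
			info.labels = v
		default:
			// A new kind of information would get its own case here.
		}
	}
	m.queue = append(m.queue, prefix)
}

// example upserts one set of labels and returns what was stored and how
// many prefixes are waiting for resolution.
func example() (Labels, int) {
	m := newMetadata()
	p := netip.MustParsePrefix("192.0.2.10/32")
	m.UpsertMetadata(p, "kube-apiserver", "endpoints/default/kubernetes",
		Labels{"reserved:kube-apiserver"})
	return m.byPrefix[p]["endpoints/default/kubernetes"].labels, len(m.queue)
}

func main() {
	labels, queued := example()
	fmt.Println(labels, queued)
}
```

The type switch is where the "expand the IPCache package in the relevant areas" advice would apply when a new kind of `aux` information is introduced.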
/test
Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed.
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
Triage:
|
/test-1.16-4.9
/test-1.25-net-next
/ci-external-workloads
Meta: #21142
IPCache has historically been based around a concept of associating IP prefixes with a certain Identity, where each caller independently determines the exact identity that an IP prefix should have. Each of these callers has a corresponding `source.Source` which decides the priority between these different sources of Identity.

More recently however, features like #14724 have driven the need to combine multiple sources of information, such as associating both the `remote-node` labels (from k8s `CiliumNode` resources) and `kube-apiserver` labels (from k8s `Endpoints`/`EndpointSlices` resources). Having an exclusionary policy where only one source of information can be correct can have implications like causing conflicts between policies that match on information from each source. Up until now, we've resolved these conflicts somewhat manually by forcing the callers to handle the merging of this information. For most callers, they're independent enough that users would only encounter bugs in rare conditions of using two features together in an unexpected way. We have made some attempts in certain paths to resolve these conflicts where necessary, including the example above. However, over time this code becomes more and more 🍝 with each caller attempting to resolve conflicts with the other callers.

This PR reworks the metadata map inside the IPCache to handle multiple sources of information at once, keying them in the map first by IP prefix, then by a `{resourceType, resourceNamespace, resourceName}` tuple (`ResourceID`). When merging the information, a new `prefixInfo` structure provides convenient merging of the labels from all of the sources of information. In effect, this introduces one additional layer of indirection with a series of chained structures represented in the mermaid chart below:

[mermaid chart: chained IPCache metadata structures]

Furthermore, rather than the previous implementation where adds into the metadata map were asynchronous and deletes were synchronous, this PR refactors the code to force both paths to be asynchronous. The first step, which is synchronous, is to update the desired state of the IPCache metadata, ie upserting the labels into the metadata map. This is called directly from various watchers like k8s resource watchers or the node manager. These callers then call `TriggerLabelInjection()` to apply the changes out-of-band. This causes a goroutine to separately iterate through the changes that need to be made, then ensure that the control plane policy engine (in the `SelectorCache`) is updated, followed by the dataplane policy maps, then finally the dataplane ipcache. This ensures that policy transitions for newly allocated identities will occur in the correct order to prevent packet drops when different labels are associated with (or removed from) IP prefixes.

By making the deletes also asynchronous and combining the core logic, we are able to also delete a bunch of code, and more consistently handle the various cases of changing the sets of labels associated with prefixes.
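As a rough illustration of the two-level keying and label merging described above (a sketch only: the field names of `ResourceID`, the `mergedLabels` method, and the example resource names are assumptions, and the real `prefixInfo` holds more than labels):

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// ResourceID identifies the resource that contributed information.
// The field names are hypothetical.
type ResourceID struct {
	Type, Namespace, Name string
}

// prefixInfo maps each contributing resource to its labels for one prefix.
type prefixInfo map[ResourceID][]string

// mergedLabels returns the sorted union of labels from all resources.
func (pi prefixInfo) mergedLabels() []string {
	seen := map[string]bool{}
	var out []string
	for _, labels := range pi {
		for _, l := range labels {
			if !seen[l] {
				seen[l] = true
				out = append(out, l)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	node := ResourceID{Type: "CiliumNode", Name: "node-1"}
	eps := ResourceID{Type: "Endpoints", Namespace: "default", Name: "kubernetes"}

	// First level: prefix. Second level: the resource that supplied the info.
	metadata := map[netip.Prefix]prefixInfo{}
	p := netip.MustParsePrefix("192.0.2.10/32")
	metadata[p] = prefixInfo{
		node: {"reserved:remote-node"},
		eps:  {"reserved:kube-apiserver"},
	}
	fmt.Println(metadata[p].mergedLabels()) // both sources contribute

	// Removing one resource leaves the other resource's labels intact.
	delete(metadata[p], eps)
	fmt.Println(metadata[p].mergedLabels())
}
```

Because each resource owns its own slot under the prefix, removing one source never requires the caller to know what the other sources contributed.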
See the individual commits for more details.
Combined design and implementation with @christarazi .
Supersedes: #19758
Related: #18301