-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restructure IPCache to handle metadata merging #19758
Conversation
Previously, the modification of ipcache.metadata from the node manager and k8s watchers was not locking the map before making changes. If two events happened at around the same time, this could cause weird behaviour. In general, this is fairly unlikely given the relatively static usage of the metadata map in the current case (plus that InjectLabels() waits for k8s to be fully synced before running), however this is a good cleanup and will be required in future when the ipcache metadata map plays a more central role in managing IPcache label association. Signed-off-by: Joe Stringer <joe@cilium.io>
Rework the internals of the "IDMD" Identity <-> Metadata cache so that it can store multiple pieces of information coming from different sources, rather than only labels coming from the kube-apiserver identity feature. Initially in this commit, this only includes one feature source which only stores labels internally, which is what's necessary for the current user of this code. In future, we anticipate extending the core info structure here to include not just labels, but also auxiliary data like port name mappings and encryption indexes shared by other nodes. Information is keyed by the ResourceID, typically a {resource, name, namespace} tuple of a resource from kubernetes, which allows convenient insertion/deletion, and provides a way to trace the source of the information for debugging purposes. This can mean that the same information may be stored multiple times by different ResourceID inside the structure, but this is not expected to be the normal case so the memory cost for this is expected to be minimal. Co-authored-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
Now that each individual PrefixInfo (labels + source) is held together in one big happy identityMetadata map, we can rely directly on the map for the correct source for updating the IPCache, rather than forcing each caller of the trigger to pass it down. Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
Add a queue into the metadata structure for handling the set of prefixes which need to be processed as part of the next label injection trigger. Signed-off-by: Joe Stringer <joe@cilium.io>
Move the logic to change the host label into injectLabels() so that the caller doesn't need to worry about this detail. This allows injectLabels to fully handle ensuring that the specified labels are present in the returned identity, so the caller doesn't have to. This will simplify further upcoming changes to generalize this code to handle other newer identity allocations for other label-based features. Signed-off-by: Joe Stringer <joe@cilium.io>
Previously, the InjectLabels() function which reacts to updates to the IPCache metadata would iterate through the entire metadata map in order to detect changes that occur, and subsequently react to those updates. The reaction includes potentially allocating new identities, then triggering updates into the datapath policymaps and the datapath's ipcache. Given that there was no way to track which prefixes had been updated, this meant iterating over the entire map, perhaps re-allocating identities for prefixes that had not changed. This also made it more difficult in future to support different types of updates to the ipcache from other sources, since each feature would need to have special logic encoded in this function to correctly handle the identity allocation. Rework the InjectLabels() implementation to handle a queue of prefixes to update instead. This commit only modifies the path used while associating new sets of labels with a given prefix. A follow up commit will handle the delete cases. However, given that adding new sets of labels could also cause an existing prefix -> Labels (and hence Identity) mapping to change, even this case must handle the cleanup of old identities, which is operationally similar to deleting labels (and hence identities). This new code is also built around asynchronous updates to the label associations for the prefixes. When triggered it only tracks which prefixes have been changed, and not yet what it must do to ensure that these prefixes have the correct identity. It uses two key pieces of information to determine what must be done in order to ensure the correct association: * The desired state of the labels, prepared in the IPCache.metadata map during a previous UpsertMetadata() call, and * The realized state of the ipcache itself. The first time that a set of labels is associated with a prefix, the ipcache will likely not have any corresponding entry, so this case is simple: Check the desired state of the labels, allocate a corresponding identity, and then update the policy selectorcache, datapath policymaps, and finally the ipcache. The second time that a (new) set of labels is associated with a prefix, it gets a bit more complicated: The previous set of labels may still be relevant. First, a lookup into the existing ipcache tells the function that this prefix is already associated with some set of labels. * First, determine the complete set of labels that should be associated with this prefix. In this commit, the new labels are likely a superset of the existing set. A subsequent commit will handle disjoint sets. * Allocate a new identity corresponding to this complete set of labels. * Update the SelectorCache and endpoint policy maps to potentially allow this new identity. Traffic will not yet be identified with this new identity, but this prepares the datapath policy engine for that eventuality. * Update the IPcache to identify this prefix as having the new identity. * Now that the new identity is in use in the datapath, we can remove references to the old identity (in the SelectorCache and policymaps) * Finally, given that the IPCache no longer associates this prefix with the old identity, release the reference on the old identity to ensure that the old identity is returned back to the identity allocation pool. Previously, the "identity release" logic was performed as part of the core loop, and primarily based on whether any allocated identity was considered "new" or had the "kube-apiserver" labels associated with it, which was a bit difficult to reason about. That code is now simpler, being tracked via an `idsToDelete` map during the processing of modified prefixes, and the release is moved to the end of the function to ensure that it is not prematurely released. Co-authored-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
Now that the IPCache handles metadata updates incrementally, we can switch the deletion path over to use this same incremental update logic. This allows the users of the deletion APIs to inject updates into the ipcache in a similar manner to the users of the add APIs, and have the updates incrementally triggered into the subsequent subsystems (policy, datapath) consistently. This means that updates into the ipcache from both add and delete paths for the kube-apiserver policy feature will only actually occur from a single goroutine, triggered by the TriggerLabelInjection call. This removes the need to reason about multiple concurrent adds or deletions into the IPcache occurring from the kube-apiserver policy feature, and it also lays the path to do the same in future for CIDR policy and other subsystems. This way, each user of the IPCache can propagate the information that it intends to push into the IPCache, then the IPCache itself can decide how to handle those updates and how to combine information from the various subsystems. Some key observations here: * The previous commit actually includes 90% of the logic required to implement deletions, based partially on the previous 'add' code and partially on the code being deleted here. * What changes here is adding support in InjectLabels() for the case where a set of labels is removed from a prefix. If all labels are removed, then this results in the 'IPCache.metadata' map having no corresponding entry, so in this case the corresponding old identity currently in 'IPCache' should be removed from the selectorcache, policymaps, and the IPCache. * Caveat: This is only the case if _only_ the metadata map has references to the identity. At this point in time, CIDR policies for instance are not yet converted over to the metadata map approach for associating labels with prefixes, so that path may independently allocate their own identities. If those are still referenced from CIDR policy, then the label injector should simply remove references to the corresponding identities but not remove it from the ipcache. * Another case is when there are some set of labels (eg A, B) associated with a prefix, then one set (eg B) is removed. The result is that a previous identity with labels (A, B) must be removed, and a new identity with labels (A) should be allocated / associated with the prefix. In general, this is very similar to the existing case where a set of labels is expanded by associating new labels with the prefix (already handled in the previous commit). * This also has a curly case: Each set of labels has a source associated, for instance initially there could be "remote-node" (source: custom-resource) and "kube-apiserver" (source: kube-apiserver). When previously upserting into the IPCache, the source will be kube-apiserver. If the kube apiserver is no longer associated with the IP, and hence that label removed, then the resulting set of labels will only be "remote-node", with source "custom-resource". Given that the source "custom-resource" has a lower priority in pkg/source than kube-apiserver source, we cannot update the ipcache directly with the new set of labels using the "custom-resource" source. However, the label removal is still legitimate. To work around the clunky APIs, the function here just overrides the source check in the IPCache.Upsert(). We should be able to remove this clunkiness over time when the metadata map is the primary source of information for prefixes, but more refactoring is necessary to get to that point. * Now that the label removal doesn't have its own independent logic from a separate goroutine, there is no longer a need to use the 'applyChangesMU' mutex in the metadata cache to ensure safety around the critical section. Furthermore, the core InjectLabels() call doesn't modify the metadata map. So, we can remove one lock and reduce the other lock down to a read mutex rather than holding it for write. Co-authored-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
Most actual ipcache updates will in future be occurring as a result of triggering the label injection controller, so propagate the context from that controller down into the various child functions where they need a context for identity allocation and datapath map updates. Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Commit 1119ae5 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
FWIW, I'm still struggling with getting #19318 to work (there is still an ordering issue, and I haven't had the cycles to track it down). Since this PR supersedes it, it might be easier to close #19318. The only reason not to close #19318 would be that it makes it possible to backport the node-to-node encryption regression fix. |
@gandro If we ever think it may be useful to independently have #19318 in older releases, then it probably makes sense to get #19318 in first. I can potentially try to take another look there if this is important. While some of the code from #19318 is superseded by this implementation, the piece for handling encryption info is not yet part of this PR, which is the primary motivation for basing this PR on top of #19318. That said, this PR is standalone and doesn't have direct dependencies against that PR. So we do alternatively have the option to merge this in first and then rework #19318 on top to inject the encryption info into the ipcache entries. Broadly I anticipate that this PR provides some structure in place that can then be reused/extended from:
That said, I also think that we should not merge this PR at least until we branch 1.12 off the main branch, to minimize the impacts of the refactor here on the upcoming release. |
Woops, did not mean to delete the branch/close the PR. I'll file a new one. |
That would be my preferred approach too. |
WIP, also up for debate how far this refactor should go before merge.
Depends on:
IPCache has historically been based around a concept of associating IP prefixes with a certain Identity, where each caller independently determines the exact identity that a IP Prefix should have. Each of these callers has a corresponding
source.Source
which decides the priority between these different sources of Identity.More recently however, features like #14724 have driven the need to combine multiple sources of information, such as associating both the
remote-node
labels (from k8sCiliumNode
resources) andkube-apiserver
labels (from k8sEndpoints
/EndpointSlices
resources). Having an exclusionary policy where only one source of information can be correct can have implications like causing conflicts between policies that match on information from each source. Up until now, we've resolved these conflicts somewhat manually by forcing the callers to handle the merging of this information. For most callers, they're independent enough that users would only encounter bugs in rare conditions of using two features together in an unexpected way. We have made some attempts in certain paths to resolve these conflicts where necessary, including the example above. However, over time this code becomes more and more 馃崫 with each caller attempting to resolve conflicts with the other callers.This PR reworks the metadata map inside the IPCache to handle multiple sources of information at once, keying them in the map first by IP prefix, then by a
{resourceType, resourceNamespace, resourceName}
tuple (ResourceID
). When merging the information, a newprefixInfo
structure provides convenient merging of the labels from all of the sources of information. In effect, this introduced one additional layer of indirection with a series of chained structures represented in the mermaid chart below:Furthemore, rather than the previous implementation where adds into the metadata map were asynchronous and deletes were synchronous, this PR refactors the code to force both paths to be asynchronous. The first step which is synchronous is to update the desired state of the IPCache metadata, ie upserting the labels into the metadata map. This is called directly from various watchers like k8s resource watchers or the node manager. These callers then
TriggerLabelInjection()
to apply the changes out-of-band. This causes a goroutine to separately iterate through the changes that need to be made, then ensure that the control plane policy engine (in theSelectorCache
) to be updated, followed by the dataplane policy maps, then finally followed by the dataplane ipcache. This ensures that policy transitions for newly allocated identities will occur in the correct order to prevent packet drops when different labels are associated with (or removed from) IP prefixes.By making the deletes also asynchronous and combining the core logic, we are able to also delete a bunch of code, and more consistently handle the various cases of changing the sets of labels associated with prefixes.
See the individual commits for more details.
Combined design and implementation with @christarazi .
Related: #18301