Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructure IPCache to handle metadata merging #19758

Closed
wants to merge 10 commits into from

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented May 10, 2022

WIP, also up for debate how far this refactor should go before merge.

Depends on:

IPCache has historically been based around a concept of associating IP prefixes with a certain Identity, where each caller independently determines the exact identity that a IP Prefix should have. Each of these callers has a corresponding source.Source which decides the priority between these different sources of Identity.

More recently however, features like #14724 have driven the need to combine multiple sources of information, such as associating both the remote-node labels (from k8s CiliumNode resources) and kube-apiserver labels (from k8s Endpoints / EndpointSlices resources). Having an exclusionary policy where only one source of information can be correct can have implications like causing conflicts between policies that match on information from each source. Up until now, we've resolved these conflicts somewhat manually by forcing the callers to handle the merging of this information. For most callers, they're independent enough that users would only encounter bugs in rare conditions of using two features together in an unexpected way. We have made some attempts in certain paths to resolve these conflicts where necessary, including the example above. However, over time this code becomes more and more 馃崫 with each caller attempting to resolve conflicts with the other callers.

This PR reworks the metadata map inside the IPCache to handle multiple sources of information at once, keying them in the map first by IP prefix, then by a {resourceType, resourceNamespace, resourceName} tuple (ResourceID). When merging the information, a new prefixInfo structure provides convenient merging of the labels from all of the sources of information. In effect, this introduced one additional layer of indirection with a series of chained structures represented in the mermaid chart below:

flowchart                                                                       
  subgraph labelsWithSource                                                     
  labels.Labels                                                                 
  source.Feature                                                                
  end                                                                           
  subgraph prefixInfo                                                           
  UA[ResourceID]-->LA[labelsWithSource]                                         
  UB[ResourceID]-->LB[labelsWithSource]                                         
  ...                                                                           
  end                                                                           
  subgraph identityMetadata                                                     
  IP_Prefix-->prefixInfo                                                        
  end                                                                           
Loading

Furthemore, rather than the previous implementation where adds into the metadata map were asynchronous and deletes were synchronous, this PR refactors the code to force both paths to be asynchronous. The first step which is synchronous is to update the desired state of the IPCache metadata, ie upserting the labels into the metadata map. This is called directly from various watchers like k8s resource watchers or the node manager. These callers then TriggerLabelInjection() to apply the changes out-of-band. This causes a goroutine to separately iterate through the changes that need to be made, then ensure that the control plane policy engine (in the SelectorCache) to be updated, followed by the dataplane policy maps, then finally followed by the dataplane ipcache. This ensures that policy transitions for newly allocated identities will occur in the correct order to prevent packet drops when different labels are associated with (or removed from) IP prefixes.

By making the deletes also asynchronous and combining the core logic, we are able to also delete a bunch of code, and more consistently handle the various cases of changing the sets of labels associated with prefixes.

See the individual commits for more details.

Combined design and implementation with @christarazi .

Related: #18301

joestringer and others added 10 commits May 9, 2022 17:50
Previously, the modification of ipcache.metadata from the node manager
and k8s watchers was not locking the map before making changes. If
two events happened at around the same time, this could cause weird
behaviour. In general, this is fairly unlikely given the relatively
static usage of the metadata map in the current case (plus that
InjectLabels() waits for k8s to be fully synced before running), however
this is a good cleanup and will be required in future when the ipcache
metadata map plays a more central role in managing IPcache label
association.

Signed-off-by: Joe Stringer <joe@cilium.io>
Rework the internals of the "IDMD" Identity <-> Metadata cache so that
it can store multiple pieces of information coming from different
sources, rather than only labels coming from the kube-apiserver identity
feature. Initially in this commit, this only includes one feature source
which only stores labels internally, which is what's necessary for the
current user of this code.

In future, we anticipate extending the core info structure here to
include not just labels, but also auxiliary data like port name mappings
and encryption indexes shared by other nodes.

Information is keyed by the ResourceID, typically a {resource, name,
namespace} tuple of a resource from kubernetes, which allows convenient
insertion/deletion, and provides a way to trace the source of the
information for debugging purposes. This can mean that the same
information may be stored multiple times by different ResourceID inside
the structure, but this is not expected to be the normal case so the
memory cost for this is expected to be minimal.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Now that each individual PrefixInfo (labels + source) is held together
in one big happy identityMetadata map, we can rely directly on the map
for the correct source for updating the IPCache, rather than forcing
each caller of the trigger to pass it down.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Add a queue into the metadata structure for handling the set of prefixes
which need to be processed as part of the next label injection trigger.

Signed-off-by: Joe Stringer <joe@cilium.io>
Move the logic to change the host label into injectLabels() so that the
caller doesn't need to worry about this detail. This allows injectLabels
to fully handle ensuring that the specified labels are present in the
returned identity, so the caller doesn't have to. This will simplify
further upcoming changes to generalize this code to handle other newer
identity allocations for other label-based features.

Signed-off-by: Joe Stringer <joe@cilium.io>
Previously, the InjectLabels() function which reacts to updates to the
IPCache metadata would iterate through the entire metadata map in order
to detect changes that occur, and subsequently react to those updates.
The reaction includes potentially allocating new identities, then
triggering updates into the datapath policymaps and the datapath's ipcache.

Given that there was no way to track which prefixes had been updated,
this meant iterating over the entire map, perhaps re-allocating
identities for prefixes that had not changed. This also made it more
difficult in future to support different types of updates to the ipcache
from other sources, since each feature would need to have special logic
encoded in this function to correctly handle the identity allocation.

Rework the InjectLabels() implementation to handle a queue of prefixes
to update instead. This commit only modifies the path used while
associating new sets of labels with a given prefix. A follow up commit
will handle the delete cases. However, given that adding new sets of
labels could also cause an existing prefix -> Labels (and hence Identity)
mapping to change, even this case must handle the cleanup of old
identities, which is operationally similar to deleting labels (and hence
identities).

This new code is also built around asynchronous updates to the label
associations for the prefixes. When triggered it only tracks which
prefixes have been changed, and not yet what it must do to ensure that
these prefixes have the correct identity. It uses two key pieces of
information to determine what must be done in order to ensure the
correct association:
* The desired state of the labels, prepared in the IPCache.metadata map
  during a previous UpsertMetadata() call, and
* The realized state of the ipcache itself.

The first time that a set of labels is associated with a prefix, the
ipcache will likely not have any corresponding entry, so this case is
simple: Check the desired state of the labels, allocate a corresponding
identity, and then update the policy selectorcache, datapath policymaps,
and finally the ipcache.

The second time that a (new) set of labels is associated with a prefix,
it gets a bit more complicated: The previous set of labels may still be
relevant. First, a lookup into the existing ipcache tells the function
that this prefix is already associated with some set of labels.

* First, determine the complete set of labels that should be associated
  with this prefix. In this commit, the new labels are likely a
  superset of the existing set. A subsequent commit will handle disjoint
  sets.
* Allocate a new identity corresponding to this complete set of labels.
* Update the SelectorCache and endpoint policy maps to potentially allow
  this new identity. Traffic will not yet be identified with this new
  identity, but this prepares the datapath policy engine for that
  eventuality.
* Update the IPcache to identify this prefix as having the new identity.
* Now that the new identity is in use in the datapath, we can remove
  references to the old identity (in the SelectorCache and policymaps)
* Finally, given that the IPCache no longer associates this prefix with
  the old identity, release the reference on the old identity to ensure
  that the old identity is returned back to the identity allocation pool.

Previously, the "identity release" logic was performed as part of the
core loop, and primarily based on whether any allocated identity was
considered "new" or had the "kube-apiserver" labels associated with it,
which was a bit difficult to reason about. That code is now simpler,
being tracked via an `idsToDelete` map during the processing of modified
prefixes, and the release is moved to the end of the function to ensure
that it is not prematurely released.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Now that the IPCache handles metadata updates incrementally, we can
switch the deletion path over to use this same incremental update logic.
This allows the users of the deletion APIs to inject updates into the
ipcache in a similar manner to the users of the add APIs, and have the
updates incrementally triggered into the subsequent subsystems (policy,
datapath) consistently. This means that updates into the ipcache from
both add and delete paths for the kube-apiserver policy feature will
only actually occur from a single goroutine, triggered by the
TriggerLabelInjection call.

This removes the need to reason about multiple concurrent adds or
deletions into the IPcache occurring from the kube-apiserver policy
feature, and it also lays the path to do the same in future for CIDR
policy and other subsystems. This way, each user of the IPCache can
propagate the information that it intends to push into the IPCache, then
the IPCache itself can decide how to handle those updates and how to
combine information from the various subsystems.

Some key observations here:
* The previous commit actually includes 90% of the logic required to
  implement deletions, based partially on the previous 'add' code and
  partially on the code being deleted here.
* What changes here is adding support in InjectLabels() for the case
  where a set of labels is removed from a prefix. If all labels are
  removed, then this results in the 'IPCache.metadata' map having no
  corresponding entry, so in this case the corresponding old identity
  currently in 'IPCache' should be removed from the selectorcache,
  policymaps, and the IPCache.
  * Caveat: This is only the case if _only_ the metadata map has
    references to the identity. At this point in time, CIDR policies for
    instance are not yet converted over to the metadata map approach for
    associating labels with prefixes, so that path may independently
    allocate their own identities. If those are still referenced from
    CIDR policy, then the label injector should simply remove references
    to the corresponding identities but not remove it from the ipcache.
* Another case is when there are some set of labels (eg A, B) associated
  with a prefix, then one set (eg B) is removed. The result is that a
  previous identity with labels (A, B) must be removed, and a new
  identity with labels (A) should be allocated / associated with the
  prefix. In general, this is very similar to the existing case where a
  set of labels is expanded by associating new labels with the prefix
  (already handled in the previous commit).
  * This also has a curly case: Each set of labels has a source
    associated, for instance initially there could be "remote-node"
    (source: custom-resource) and "kube-apiserver" (source:
    kube-apiserver). When previously upserting into the IPCache, the
    source will be kube-apiserver. If the kube apiserver is no longer
    associated with the IP, and hence that label removed, then the
    resulting set of labels will only be "remote-node", with source
    "custom-resource". Given that the source "custom-resource" has a
    lower priority in pkg/source than kube-apiserver source, we cannot
    update the ipcache directly with the new set of labels using the
    "custom-resource" source. However, the label removal is still
    legitimate. To work around the clunky APIs, the function here just
    overrides the source check in the IPCache.Upsert(). We should be
    able to remove this clunkiness over time when the metadata map is
    the primary source of information for prefixes, but more refactoring
    is necessary to get to that point.
* Now that the label removal doesn't have its own independent logic from
  a separate goroutine, there is no longer a need to use the
  'applyChangesMU' mutex in the metadata cache to ensure safety around
  the critical section. Furthermore, the core InjectLabels() call
  doesn't modify the metadata map. So, we can remove one lock and reduce
  the other lock down to a read mutex rather than holding it for write.

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Most actual ipcache updates will in future be occurring as a result of
triggering the label injection controller, so propagate the context from
that controller down into the various child functions where they need a
context for identity allocation and datapath map updates.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label May 10, 2022
@maintainer-s-little-helper
Copy link

Commit 1119ae5 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label May 10, 2022
@joestringer joestringer added this to the 1.13 milestone May 10, 2022
@gandro
Copy link
Member

gandro commented May 10, 2022

FWIW, I'm still struggling with getting #19318 to work (there is still an ordering issue, and I haven't had the cycles to track it down).

Since this PR supersedes it, it might be easier to close #19318. The only reason not to close #19318 would be that it makes it possible to backport the node-to-node encryption regression fix.

@joestringer
Copy link
Member Author

@gandro If we ever think it may be useful to independently have #19318 in older releases, then it probably makes sense to get #19318 in first. I can potentially try to take another look there if this is important.

While some of the code from #19318 is superseded by this implementation, the piece for handling encryption info is not yet part of this PR, which is the primary motivation for basing this PR on top of #19318. That said, this PR is standalone and doesn't have direct dependencies against that PR. So we do alternatively have the option to merge this in first and then rework #19318 on top to inject the encryption info into the ipcache entries.

Broadly I anticipate that this PR provides some structure in place that can then be reused/extended from:

  • CIDR policy
  • Encryption
  • Named Ports
  • NodeDiscovery (for local node info)
  • Possibly FQDN policy (tbd, I recall there were some quirks that made this case somehow a bit tricky)

That said, I also think that we should not merge this PR at least until we branch 1.12 off the main branch, to minimize the impacts of the refactor here on the upcoming release.

@joestringer joestringer deleted the pr/joe/ipcache-new branch May 10, 2022 17:43
@joestringer
Copy link
Member Author

joestringer commented May 10, 2022

Woops, did not mean to delete the branch/close the PR. I'll file a new one.

@gandro
Copy link
Member

gandro commented May 11, 2022

So we do alternatively have the option to merge this in first and then rework #19318 on top to inject the encryption info into the ipcache entries.

That would be my preferred approach too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants