Various improvements and bugfixes for kube-apiserver policy matching #18150
Conversation
/test

Job 'Cilium-PR-Runtime-net-next' failed and has not been observed before, so it may be related to this PR. If it is a flake, comment to retrigger:

/test-runtime

/test-gke
Only nits. My codeowners look good.
LGTM, thanks :)
Force-pushed from cebc447 to 76b953a.
/test

Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed and has not been observed before, so it may be related to this PR. If it is a flake, comment to retrigger:

/test-1.23-net-next
Nice work. I've posted a few questions mainly around the second-to-last commit where it seems like we can potentially simplify the code a little further and also eliminate one source of policy drops when labels are injected or removed.
Force-pushed from 3fa174e to 1795722.
Relevant diff addressing Joe's comments above:

diff --git a/pkg/ipcache/ipcache.go b/pkg/ipcache/ipcache.go
index 7e1107c6fe..6ebc0466e0 100644
--- a/pkg/ipcache/ipcache.go
+++ b/pkg/ipcache/ipcache.go
@@ -223,10 +223,23 @@ func (ipc *IPCache) updateNamedPorts() (namedPortsChanged bool) {
func (ipc *IPCache) Upsert(ip string, hostIP net.IP, hostKey uint8, k8sMeta *K8sMetadata, newIdentity Identity) (namedPortsChanged bool, err error) {
ipc.mutex.Lock()
defer ipc.mutex.Unlock()
- return ipc.upsertLocked(ip, hostIP, hostKey, k8sMeta, newIdentity)
-}
-
-func (ipc *IPCache) upsertLocked(ip string, hostIP net.IP, hostKey uint8, k8sMeta *K8sMetadata, newIdentity Identity) (namedPortsChanged bool, err error) {
+ return ipc.upsertLocked(ip, hostIP, hostKey, k8sMeta, newIdentity, false /* !force */)
+}
+
+// upsertLocked adds / updates the provided IP and identity into the IPCache,
+// assuming that the IPCache lock has been taken. Warning, do not use force
+// unless you know exactly what you're doing. Forcing adding / updating the
+// IPCache will not take into account the source of the identity and bypasses
+// the overwrite logic! Once GH-18301 is addressed, there will be no need for
+// any force logic.
+func (ipc *IPCache) upsertLocked(
+ ip string,
+ hostIP net.IP,
+ hostKey uint8,
+ k8sMeta *K8sMetadata,
+ newIdentity Identity,
+ force bool,
+) (namedPortsChanged bool, err error) {
var newNamedPorts policy.NamedPortMap
if k8sMeta != nil {
newNamedPorts = k8sMeta.NamedPorts
@@ -258,7 +271,7 @@ func (ipc *IPCache) upsertLocked(ip string, hostIP net.IP, hostKey uint8, k8sMet
cachedIdentity, found := ipc.ipToIdentityCache[ip]
if found {
- if !source.AllowOverwrite(cachedIdentity.Source, newIdentity.Source) {
+ if !force && !source.AllowOverwrite(cachedIdentity.Source, newIdentity.Source) {
return false, NewErrOverwrite(cachedIdentity.Source, newIdentity.Source)
}
diff --git a/pkg/ipcache/metadata.go b/pkg/ipcache/metadata.go
index 22d2415423..ecdd4a80ed 100644
--- a/pkg/ipcache/metadata.go
+++ b/pkg/ipcache/metadata.go
@@ -73,12 +73,15 @@ func GetIDMetadataByIP(prefix string) labels.Labels {
// InjectLabels injects labels from the identityMetadata (IDMD) map into the
// identities used for the prefixes in the IPCache. The given source is the
// source of the caller, as inserting into the IPCache requires knowing where
-// this updated information comes from.
+// this updated information comes from. Conversely, RemoveLabelsExcluded()
+// performs the inverse: removes labels from the IDMD map and releases
+// identities allocated by this function.
//
// Note that as this function iterates through the IDMD, if it detects a change
// in labels for a given prefix, then this might allocate a new identity. If a
// prefix was previously associated with an identity, it will get deallocated,
-// so a balance is kept.
+// so a balance is kept, ensuring a one-to-one mapping between prefix and
+// identity.
func InjectLabels(src source.Source, updater identityUpdater, triggerer policyTriggerer) error {
if IdentityAllocator == nil || !IdentityAllocator.IsLocalIdentityAllocatorInitialized() {
return ErrLocalIdentityAllocatorUninitialized
@@ -131,8 +134,8 @@ func InjectLabels(src source.Source, updater identityUpdater, triggerer policyTr
newLbls := id.Labels
tmpSrc = source.KubeAPIServer
trigger = true
- // If any reserved ID has changed, update its labels.
- if id.IsReserved() {
+ // If host identity has changed, update its labels.
+ if id.ID == identity.ReservedIdentityHost {
identity.AddReservedIdentityWithLabels(id.ID, newLbls)
}
idsToPropagate[id.ID] = newLbls.LabelArray()
@@ -189,7 +192,7 @@ func InjectLabels(src source.Source, updater identityUpdater, triggerer policyTr
if _, err := IPIdentityCache.upsertLocked(ip, hIP, key, meta, Identity{
ID: id.ID,
Source: id.Source,
- }); err != nil {
+ }, false /* !force */); err != nil {
return fmt.Errorf("failed to upsert %s into ipcache with identity %d: %w", ip, id.ID, err)
}
}
@@ -305,12 +308,9 @@ func filterMetadataByLabels(filter labels.Labels) []string {
return matching
}
-// removeLabelsFromIPs provides a convenient method for the caller to remove
-// all given prefixes at once. This function should be the inverse of
-// InjectLabels(). This function will trigger policy update and recalculation
-// if necessary on behalf of the caller.
-//
-// Identities allocated by InjectLabels() may be released by removeLabels().
+// removeLabelsFromIPs removes all given prefixes at once. This function will
+// trigger policy update and recalculation if necessary on behalf of the
+// caller.
//
// A prefix will only be removed from the IDMD if the set of labels becomes
// empty.
@@ -325,8 +325,8 @@ func removeLabelsFromIPs(
var (
idsToAdd = make(map[identity.NumericIdentity]labels.LabelArray)
idsToDelete = make(map[identity.NumericIdentity]labels.LabelArray)
- // toReplace stores identity pairs to replace in the ipcache.
- toReplace = make(map[string]idPair)
+ // toReplace stores the identity to replace in the ipcache.
+ toReplace = make(map[string]Identity)
)
IPIdentityCache.Lock()
@@ -338,48 +338,25 @@ func removeLabelsFromIPs(
continue
}
+ idsToDelete[id.ID] = nil // labels for deletion don't matter to UpdateIdentities()
+
// Insert to propagate the updated set of labels after removal.
- //
- // There are two concurring cases here and they are annotated in the
- // code below:
- // 1) When labels are disassociated with an IP, the remaining set
- // either contains some labels or it's empty. In the former, the
- // identity is both marked for deletion and addition, because
- // there are labels still associated with the IP. In the latter,
- // the identity is marked for deletion only.
- // 2) If removeLabels() indicates that a new identity needs to be
- // generated (newIdentity), a new identity is allocated (with the
- // appropriate labels removed), marked for addition, and lastly,
- // upserted into the ipcache (replacing the existing identity
- // entry). This can happen when an existing identity was *not*
- // released during removeLabels() due to the identity's refcount
- // being >1.
- var (
- newIdentity bool
- la labels.LabelArray
- )
- l, newIdentity := removeLabels(prefix, lbls, src)
- if l != nil {
- la = l.LabelArray()
- }
- idsToDelete[id.ID] = la
- if len(la) > 0 { // (1)
- // If for example kube-apiserver label is removed from
- // a remote-node, then removeLabels() will return a
- // non-empty set representing the new full set of
- // labels to associate with the node. In order to
- // propagate the new identity, we must emit a delete
- // event for the old identity and then an add event for
- // the new identity.
- idsToAdd[id.ID] = la
- }
- if newIdentity { // (2)
- // If for example kube-apiserver label is removed from a CIDR
- // identity (contains CIDR labels), then removeLabels() will return
- // a non-empty set representing just the CIDR labels. We do the
- // same as (1) from above, but additionally, we must also mark the
- // new identity for addition and upsert the new identity into the
- // IPCache.
+ l := removeLabels(prefix, lbls, src)
+ if len(l) > 0 {
+ // If for example kube-apiserver label is removed from the local
+ // host identity (when the kube-apiserver is running within the
+ // cluster and it is no longer running on the current local host),
+ // then removeLabels() will return a non-empty set representing the
+ // new full set of labels to associate with the node (in this
+ // example, just the local host label). In order to propagate the
+ // new identity, we must emit a delete event for the old identity
+ // and then an add event for the new identity.
+
+ // If host identity has changed, update its labels.
+ if id.ID == identity.ReservedIdentityHost {
+ identity.AddReservedIdentityWithLabels(id.ID, l)
+ }
+
newID, _, err := IdentityAllocator.AllocateIdentity(context.TODO(), l, false)
if err != nil {
log.WithError(err).WithFields(logrus.Fields{
@@ -392,16 +369,10 @@ func removeLabelsFromIPs(
)
continue
}
- idsToAdd[newID.ID] = la
- toReplace[prefix] = idPair{
- old: Identity{
- ID: id.ID,
- Source: sourceByLabels(src, lbls),
- },
- new: Identity{
- ID: newID.ID,
- Source: sourceByLabels(src, l),
- },
+ idsToAdd[newID.ID] = l.LabelArray()
+ toReplace[prefix] = Identity{
+ ID: newID.ID,
+ Source: sourceByLabels(src, l),
}
}
}
@@ -417,64 +388,89 @@ func removeLabelsFromIPs(
triggerer.TriggerPolicyUpdates(false, "kube-apiserver identity updated by removal")
}
- for ip, pair := range toReplace {
+ for ip, id := range toReplace {
hIP, key := IPIdentityCache.getHostIPCache(ip)
meta := IPIdentityCache.getK8sMetadata(ip)
- IPIdentityCache.deleteLocked(ip, pair.old.Source) // remove old entry and replace
- if _, err := IPIdentityCache.upsertLocked(ip, hIP, key, meta, Identity{
- ID: pair.new.ID,
- Source: pair.new.Source,
- }); err != nil {
+ if _, err := IPIdentityCache.upsertLocked(
+ ip,
+ hIP,
+ key,
+ meta,
+ id,
+ true, /* force upsert */
+ ); err != nil {
log.WithError(err).WithFields(logrus.Fields{
logfields.IPAddr: ip,
- logfields.Identity: pair.new,
+ logfields.Identity: id,
}).Error("Failed to replace ipcache entry with new identity after label removal. Traffic may be disrupted.")
}
}
}
-type idPair struct {
- old, new Identity
-}
-
// removeLabels removes the given labels association with the given prefix. The
-// leftover labels are returned, if any, and a boolean indicating whether the
-// caller must generate a new identity with the leftover labels. If the caller
-// must do so, then the following must occur *in order* to avoid drops:
-// 1) policy recalculation must be triggered
+// leftover labels are returned, if any. If there are leftover labels, the
+// caller must allocate a new identity and do the following *in order* to avoid
+// drops:
+// 1) policy recalculation must be implemented into the datapath and
// 2) new identity must have a new entry upserted into the IPCache
+// Note: GH-17962, triggering policy recalculation doesn't actually *implement*
+// the changes into datapath (because it's an async call), this is a known
+// issue. There's a very small window for drops when two policies select the
+// same traffic and the identity changes. For example, this is possible if an
+// IP is associated with the kube-apiserver and referenced inside a ToCIDR
+// policy, and then the IP is no longer associated with the kube-apiserver.
//
// Identities are deallocated and their subsequent entry in the IPCache is
// removed if the prefix is no longer associated with any labels.
//
// This function assumes that the IDMD and the IPIdentityCache locks are taken!
-//
-// It is the responsibility of the caller to trigger policy recalculation after
-// calling this function.
-func removeLabels(prefix string, lbls labels.Labels, src source.Source) (labels.Labels, bool) {
- var needNewIdentity bool
-
+func removeLabels(prefix string, lbls labels.Labels, src source.Source) labels.Labels {
l, ok := identityMetadata[prefix]
if !ok {
- return nil, needNewIdentity
+ return nil
}
l = l.Remove(lbls)
- if len(l) != 0 { // Labels left over, do not deallocate
+ if len(l) == 0 { // Labels empty, delete
+ // No labels left. Example case: when the kube-apiserver is running
+ // outside of the cluster, meaning that the IDMD only ever had the
+ // kube-apiserver label (CIDR labels are not added) and it's now being
+ // removed.
+ delete(identityMetadata, prefix)
+ } else {
+ // This case is only hit if the prefix for the kube-apiserver is
+ // running within the cluster. Therefore, the IDMD can only ever have
+ // the following labels:
+ // * kube-apiserver + remote-node
+ // * kube-apiserver + host
identityMetadata[prefix] = l
- return l, needNewIdentity
}
- // No labels left, perform deallocation
-
- delete(identityMetadata, prefix)
id, exists := IPIdentityCache.LookupByIPRLocked(prefix)
if !exists {
- return nil, needNewIdentity
+ log.WithFields(logrus.Fields{
+ logfields.CIDR: prefix,
+ logfields.Labels: lbls,
+ }).Warn(
+ "Identity for prefix was unexpectedly not found in ipcache, unable " +
+ "to remove labels from prefix. If a network policy is applied, check " +
+ "for any drops. It's possible that insertion or removal from " +
+ "the ipcache failed.",
+ )
+ return nil
}
realID := IdentityAllocator.LookupIdentityByID(context.TODO(), id.ID)
if realID == nil {
- return nil, needNewIdentity
+ log.WithFields(logrus.Fields{
+ logfields.CIDR: prefix,
+ logfields.Labels: lbls,
+ logfields.Identity: id,
+ }).Warn(
+ "Identity unexpectedly not found within the identity allocator, " +
+ "unable to remove labels from prefix. It's possible that insertion " +
+ "or removal from the ipcache failed.",
+ )
+ return nil
}
released, err := IdentityAllocator.Release(context.TODO(), realID, false)
if err != nil {
@@ -486,21 +482,21 @@ func removeLabels(prefix string, lbls labels.Labels, src source.Source) (labels.
}).Error(
"Failed to release assigned identity to IP while removing label association, this might be a leak.",
)
+ return nil
}
if released {
IPIdentityCache.deleteLocked(prefix, sourceByLabels(src, lbls))
- return nil, needNewIdentity
+ return nil
}
// Generate new identity with the label removed. This should be the case
// where the existing identity had >1 refcount, meaning that something was
// referring to it.
- needNewIdentity = true
allLbls := labels.NewLabelsFromModel(nil)
allLbls.MergeLabels(realID.Labels)
allLbls = allLbls.Remove(lbls)
- return allLbls, needNewIdentity
+ return allLbls
}
func sourceByLabels(d source.Source, lbls labels.Labels) source.Source {
Clarify that label injection is also triggered from the node discovery package.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity metadata")
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Previously, the label injection logic used the source passed into InjectLabels() and, on top of that, modified it instead of taking a copy. This matters because if we modify the source to upgrade it to the KubeAPIServer source (highest priority), then all subsequent entries into the ipcache through the IDMD (from the same InjectLabels() invocation) will have the KubeAPIServer source! This can cause entries to never be modifiable again on a subsequent invocation of InjectLabels(). Here's an example error from the ipcache-inject-labels controller:

```
ipcache-inject-labels 37m4s ago 14s ago 63 failed to inject labels into ipcache: failed to upsert 10.24.1.96 into ipcache with identity 6: unable to overwrite source "kube-apiserver" with source "custom-resource"
```

In the above error, we are failing to update ID 6! This meant that any updates to the remote-node-associated IPs would never be updatable (unless they somehow were made with the KubeAPIServer source).

This commit fixes this by creating a copy of the source for each iteration through the IDMD instead of sharing the global, as well as fixing a typo where the source was not taken from `toUpsert`.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity metadata")
Signed-off-by: Chris Tarazi <chris@isovalent.com>
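The shared-source bug described in this commit message can be sketched in a few lines of Go. This is a minimal illustration, not Cilium's actual code; the `Source` type, constants, and function names are hypothetical.

```go
package main

import "fmt"

// Source models an ipcache source precedence label; the constants mirror
// the names from the commit message but are otherwise illustrative.
type Source string

const (
	SourceCustomResource Source = "custom-resource"
	SourceKubeAPIServer  Source = "kube-apiserver"
)

// buggyInject mutates the single src variable in place: once one prefix
// upgrades it to kube-apiserver, every later prefix in the same pass
// inherits the upgraded source.
func buggyInject(src Source, isAPIServer map[string]bool, prefixes []string) map[string]Source {
	out := make(map[string]Source)
	for _, p := range prefixes {
		if isAPIServer[p] {
			src = SourceKubeAPIServer // leaks into later iterations
		}
		out[p] = src
	}
	return out
}

// fixedInject takes a per-iteration copy, so the upgrade for one prefix
// never bleeds into the next, mirroring the fix in this commit.
func fixedInject(src Source, isAPIServer map[string]bool, prefixes []string) map[string]Source {
	out := make(map[string]Source)
	for _, p := range prefixes {
		tmpSrc := src // fresh copy each iteration
		if isAPIServer[p] {
			tmpSrc = SourceKubeAPIServer
		}
		out[p] = tmpSrc
	}
	return out
}

func main() {
	prefixes := []string{"172.20.0.1", "10.24.1.96"} // apiserver IP first, node IP second
	apiServer := map[string]bool{"172.20.0.1": true}
	fmt.Println(buggyInject(SourceCustomResource, apiServer, prefixes)["10.24.1.96"]) // kube-apiserver (wrong)
	fmt.Println(fixedInject(SourceCustomResource, apiServer, prefixes)["10.24.1.96"]) // custom-resource
}
```

With the buggy version, the node IP ends up pinned to the strongest source and can never be overwritten again, which is exactly the controller failure quoted above.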
Previously, any IP that mapped to a reserved ID was considered as to-be-propagated to the ipcache in InjectLabels(). This was meant to ensure that any reserved IDs associated with the kube-apiserver would be properly accounted for. To be specific, with an example: if the kube-apiserver is running within the cluster and the local host contains the kube-apiserver, then the local host reserved ID (1) has the `reserved:host` + `reserved:kube-apiserver` labels. Cilium starts up with ID 1 containing only the `reserved:host` label, and only adds the `reserved:kube-apiserver` label once the K8s Endpoint watcher triggers the InjectLabels() logic. Hence, this condition existed to update reserved ID 1 when the `reserved:kube-apiserver` label was associated with it.

The problem with the scope of the condition is that it leads to unnecessary updates to the ipcache. Specifically, any call to InjectLabels() will cause ipcache update failures for reserved IDs 1 & 6 after InjectLabels() runs for the very first time. This is because node discovery will trigger InjectLabels() to insert all known IPs of the local node with source=Local. After this, the CiliumNode watcher will trigger InjectLabels() with source=CustomResource, which causes the following failing controller:

```
ipcache-inject-labels 4m15s ago 17s ago 22 failed to inject labels into ipcache: failed to upsert 10.168.15.192 into ipcache with identity 6: unable to overwrite source "local" with source "custom-resource"
```

This commit fixes this by narrowing the condition under which updates are propagated to the ipcache, to only those IDMD entries which contain the kube-apiserver label. This still includes entries such as reserved ID 1.

In addition, ensure that the final set of labels attached to the identity object itself is used when updating the identity selector cache (UpdateIdentities()), rather than the labels associated with the prefix from the IDMD. If the final set is not used, then a race is possible where the next call to UpdateIdentities() will revert the work done in the first call, causing the selectors to lose which identities they selected.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity metadata")
Signed-off-by: Chris Tarazi <chris@isovalent.com>
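The narrowed condition can be sketched as a simple label check. This is a hypothetical illustration of the predicate described in the commit message, not Cilium's actual code; only the label names are taken from the message.

```go
package main

import "fmt"

// kubeAPIServerLabel mirrors the label name from the commit message.
const kubeAPIServerLabel = "reserved:kube-apiserver"

// shouldPropagate reports whether an IDMD entry's labels warrant an
// ipcache update: only entries carrying the kube-apiserver label qualify,
// instead of every reserved identity.
func shouldPropagate(lbls []string) bool {
	for _, l := range lbls {
		if l == kubeAPIServerLabel {
			return true
		}
	}
	return false
}

func main() {
	hostWithAPIServer := []string{"reserved:host", kubeAPIServerLabel} // reserved ID 1 when apiserver is local
	remoteNode := []string{"reserved:remote-node"}                     // reserved ID 6, no apiserver label
	fmt.Println(shouldPropagate(hostWithAPIServer)) // true: propagated
	fmt.Println(shouldPropagate(remoteNode))        // false: skipped, avoiding the source-overwrite error
}
```

Under the old condition both entries would be re-upserted on every InjectLabels() call, and the ID 6 upsert would fail the source-precedence check as shown in the controller error above.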
Previously, there existed logic to handle "transient identities" while Cilium is bootstrapping and receiving cluster information from the EP & CN watchers, such that it might generate transient identities related to the kube-apiserver IPs. This logic would release the transient identity and re-allocate a new one if needed.

Given that InjectLabels() already errors early if the K8s caches are not synced, it is actually not possible to have transient identities generated from incomplete information. In reality, by the time InjectLabels() is running, it has the full set of cluster information needed from the EP & CN watchers, so the identities generated are "stable".

This logic needed to be removed because it was causing a bug reported by a community user on Slack. The symptoms were that any update to the CN (CiliumNode) resource triggered the label injection, which released and re-allocated the CIDR identity for a kube-apiserver running outside of the cluster. This unnecessary identity churn is problematic, as it can cause policy drops and exhaust the local identity range. Additionally, the question of what happens to the CIDR identity reference counts is already handled inside InjectLabels(): when the identity is neither reserved nor new, it releases the identity (a refcount decrement).

Hence, this commit fixes the identity churn by removing the extraneous, unneeded logic from injectLabels().

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity metadata")
Reported-by: Alexander Block <ablock84@gmail.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
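The refcount behavior that makes the removed churn harmful can be sketched with a minimal refcounted allocator. This is a hypothetical toy, not Cilium's actual allocator API; it only shows why releasing an identity that still has other references must not free (or recreate) it.

```go
package main

import "fmt"

// allocator tracks per-identity reference counts, sketching the
// allocate/release balance the commit message relies on.
type allocator struct {
	refs map[int]int
}

func newAllocator() *allocator { return &allocator{refs: map[int]int{}} }

// allocate takes (or adds) a reference on the given numeric identity.
func (a *allocator) allocate(id int) { a.refs[id]++ }

// release drops one reference and reports whether the identity was
// actually freed, which only happens on the last release.
func (a *allocator) release(id int) bool {
	a.refs[id]--
	if a.refs[id] <= 0 {
		delete(a.refs, id)
		return true
	}
	return false
}

func main() {
	a := newAllocator()
	a.allocate(16777217) // held via the kube-apiserver association
	a.allocate(16777217) // and via a ToCIDR policy selecting the same IP
	fmt.Println(a.release(16777217)) // false: still referenced, no churn
	fmt.Println(a.release(16777217)) // true: last reference gone, safe to free
}
```

The removed "transient identity" logic effectively forced a release/re-allocate cycle on every CiliumNode update, even while other references (such as a ToCIDR policy) still held the identity.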
The previous variable will always be non-empty, as the callers from RemoveLabelsFromIPs() will provide labels to be removed from IPs; otherwise the call would be useless. This would cause identities to always be refreshed in the selector cache for no reason.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity metadata")
Signed-off-by: Chris Tarazi <chris@isovalent.com>
This will be useful in the next commit, where the unit tests will rely on the changes in this commit to validate refcounts.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This handles the case where there is a kube-apiserver (outside of the cluster) and a CIDR policy selecting the same IP(s) and identity. In this case, there will be one identity that both policies select (identity refcount == 2). When the kube-apiserver changes IP(s), the previous IP(s) will no longer be associated with the kube-apiserver label inside the IDMD, which causes the kube-apiserver and ToCIDR policies to now select two different identities.

One of the identities from above will become a pure CIDR identity, because it will no longer have the kube-apiserver label (due to the IPs changing) and will only have CIDR labels left over. Once that identity is generated, we need to replace the existing ipcache entry for the previously-associated IP(s) of the kube-apiserver with the new identity without the kube-apiserver label (just CIDR labels).

However, in order to "replace" the ipcache entry, we needed to allow force-upserting into the ipcache. This is because the ipcache entry for the kube-apiserver IPs would be created with src=kube-apiserver (which is the strongest source), while the ipcache entry for the new identity will be created with src=generated, which is weaker than src=kube-apiserver.

Unfortunately, the alternatives aren't great either. One of them is to delete the ipcache entry with src=kube-apiserver and then re-add the CIDR ipcache entry with src=generated. However, this is not atomic and introduces an undesirable transient state in the datapath. The other alternative is to address GH-18301, however that is non-trivial and was avoided in this commit.

This commit allows the aforementioned case to be handled, and a unit test was added to validate that the behavior is correct.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity metadata")
Co-authored-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
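The force-upsert escape hatch can be sketched as a source-precedence check with a bypass flag. This is a minimal hypothetical model of the behavior described above, not Cilium's actual ipcache or source package; the ranks are illustrative.

```go
package main

import "fmt"

// rank gives each source an illustrative strength; higher wins.
var rank = map[string]int{
	"generated":       1,
	"custom-resource": 2,
	"kube-apiserver":  3,
}

// allowOverwrite permits an update only when the new source is at least
// as strong as the existing one.
func allowOverwrite(existing, next string) bool {
	return rank[next] >= rank[existing]
}

// upsert refuses to let a weaker source overwrite a stronger one unless
// force is set. Force replaces the entry in one atomic step, avoiding
// the non-atomic delete-then-add alternative and its transient state.
func upsert(cache map[string]string, ip, src string, force bool) error {
	if cur, ok := cache[ip]; ok && !force && !allowOverwrite(cur, src) {
		return fmt.Errorf("unable to overwrite source %q with source %q", cur, src)
	}
	cache[ip] = src
	return nil
}

func main() {
	cache := map[string]string{"172.20.0.1": "kube-apiserver"} // old apiserver IP
	fmt.Println(upsert(cache, "172.20.0.1", "generated", false)) // error: generated is weaker
	fmt.Println(upsert(cache, "172.20.0.1", "generated", true))  // nil: force replaces the entry
	fmt.Println(cache["172.20.0.1"])                             // now owned by the CIDR identity
}
```

Without force, the pure-CIDR replacement (src=generated) could never displace the stale src=kube-apiserver entry, which is exactly the situation this commit addresses.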
Previously, this code suffered from a TOCTOU bug. The IDMD lock would be taken by two different calls into the IDMD logic, on the assumption that the underlying IDMD would stay in the same state across both calls. That is incorrect: because the two calls took the IDMD lock separately, it is possible for the IDMD to change in between them, hence TOCTOU.

This commit fixes this by amending the API of the IDMD logic to expose one function that performs the work of the previous two functions. Those two functions are now unexported.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity metadata")
Reported-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
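The shape of the fix is the classic merge of check-then-act into one critical section. The sketch below is hypothetical (names and types are not Cilium's); it shows the fixed API shape where a single exported function holds the lock across both the filter and the removal, with the helper left unexported.

```go
package main

import (
	"fmt"
	"sync"
)

// metadata sketches a lock-protected prefix-to-labels map, standing in
// for the IDMD.
type metadata struct {
	mu   sync.Mutex
	lbls map[string][]string
}

// filterLocked returns the prefixes carrying the given label. It assumes
// the caller holds mu; exporting this and the removal separately was the
// TOCTOU window.
func (m *metadata) filterLocked(want string) []string {
	var out []string
	for prefix, ls := range m.lbls {
		for _, l := range ls {
			if l == want {
				out = append(out, prefix)
			}
		}
	}
	return out
}

// FilterAndRemove holds the lock across the whole check-then-act
// sequence, so no concurrent writer can change the map between the
// filter and the removal.
func (m *metadata) FilterAndRemove(want string) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	matched := m.filterLocked(want)
	for _, prefix := range matched {
		delete(m.lbls, prefix)
	}
	return matched
}

func main() {
	m := &metadata{lbls: map[string][]string{
		"172.20.0.1": {"reserved:kube-apiserver"},
		"10.0.0.1":   {"cidr:10.0.0.0/24"},
	}}
	fmt.Println(m.FilterAndRemove("reserved:kube-apiserver")) // matched prefix removed
	fmt.Println(len(m.lbls))                                  // only the CIDR entry remains
}
```

With the old two-call API, another goroutine could mutate the map after the filter but before the removal, so the removal could act on a stale view.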
Force-pushed from 1795722 to a256ebf.
/test
LGTM.
Please see individual commit msgs.
Related: #17962