
Various improvements and bugfixes for kube-apiserver policy matching #18150

Conversation


@christarazi christarazi commented Dec 7, 2021

Please see the individual commit messages.

  • ipcache: Update IDMD diagram
  • ipcache: Fix incorrect ipcache source used inside IDMD
  • ipcache: Reduce scope for propagating updates to ipcache
  • ipcache: Remove incorrect identity release upon label injection
  • ipcache: Fix incorrect variable used in RemoveLabelsFromIPs
  • testutils/identity: Implement proper refcounting for mock
  • ipcache: Generate new identity if label is removed
  • ipcache, k8s/watchers: Fix non-atomic window when removing labels

Related: #17962

Stabilize the kube-apiserver policy matching feature, namely by fixing unnecessary identity churn when the kube-apiserver is running outside of the cluster.

@christarazi christarazi requested review from a team as code owners December 7, 2021 07:34
@christarazi christarazi requested a review from a team December 7, 2021 07:34
@christarazi christarazi added release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. kind/bug This is a bug in the Cilium logic. labels Dec 7, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Dec 7, 2021
@christarazi christarazi requested a review from aanm December 7, 2021 07:34
@christarazi

/test

@christarazi

christarazi commented Dec 7, 2021

/test

Job 'Cilium-PR-Runtime-net-next' failed and has not been observed before, so may be related to your PR:

Test Name: RuntimePrivilegedUnitTests Run Tests
Failure Output: FAIL: Failed to run privileged unit tests

If it is a flake, comment /mlh new-flake Cilium-PR-Runtime-net-next so I can create a new GitHub issue to track it.

@christarazi

/test-runtime

@christarazi

/test-gke


@pchaigno pchaigno left a comment


Only nits. My codeowners look good.

Review threads on pkg/ipcache/metadata.go (resolved)

@nbusseneau nbusseneau left a comment


LGTM, thanks :)

@christarazi christarazi force-pushed the pr/christarazi/fix-transient-identity-kube-apiserver branch 3 times, most recently from cebc447 to 76b953a on December 8, 2021 21:47
@christarazi

/test

@christarazi

christarazi commented Dec 16, 2021

/test

Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed and has not been observed before, so may be related to your PR:

Test Name: K8sBandwidthTest Checks Bandwidth Rate-Limiting Checks Pod to Pod bandwidth, vxlan tunneling
Failure Output: FAIL: failed to ensure kubectl version: failed to run kubectl version

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-net-next so I can create a new GitHub issue to track it.

@christarazi

/test-1.23-net-next


@joestringer joestringer left a comment


Nice work. I've posted a few questions mainly around the second-to-last commit where it seems like we can potentially simplify the code a little further and also eliminate one source of policy drops when labels are injected or removed.

Review threads on pkg/testutils/identity/allocator.go and pkg/ipcache/metadata.go (resolved)
@christarazi christarazi force-pushed the pr/christarazi/fix-transient-identity-kube-apiserver branch from 3fa174e to 1795722 on December 21, 2021 22:50
@christarazi

Relevant diff addressing Joe's comments above:

diff --git a/pkg/ipcache/ipcache.go b/pkg/ipcache/ipcache.go
index 7e1107c6fe..6ebc0466e0 100644
--- a/pkg/ipcache/ipcache.go
+++ b/pkg/ipcache/ipcache.go
@@ -223,10 +223,23 @@ func (ipc *IPCache) updateNamedPorts() (namedPortsChanged bool) {
 func (ipc *IPCache) Upsert(ip string, hostIP net.IP, hostKey uint8, k8sMeta *K8sMetadata, newIdentity Identity) (namedPortsChanged bool, err error) {
 	ipc.mutex.Lock()
 	defer ipc.mutex.Unlock()
-	return ipc.upsertLocked(ip, hostIP, hostKey, k8sMeta, newIdentity)
-}
-
-func (ipc *IPCache) upsertLocked(ip string, hostIP net.IP, hostKey uint8, k8sMeta *K8sMetadata, newIdentity Identity) (namedPortsChanged bool, err error) {
+	return ipc.upsertLocked(ip, hostIP, hostKey, k8sMeta, newIdentity, false /* !force */)
+}
+
+// upsertLocked adds / updates the provided IP and identity into the IPCache,
+// assuming that the IPCache lock has been taken. Warning, do not use force
+// unless you know exactly what you're doing. Forcing adding / updating the
+// IPCache will not take into account the source of the identity and bypasses
+// the overwrite logic! Once GH-18301 is addressed, there will be no need for
+// any force logic.
+func (ipc *IPCache) upsertLocked(
+	ip string,
+	hostIP net.IP,
+	hostKey uint8,
+	k8sMeta *K8sMetadata,
+	newIdentity Identity,
+	force bool,
+) (namedPortsChanged bool, err error) {
 	var newNamedPorts policy.NamedPortMap
 	if k8sMeta != nil {
 		newNamedPorts = k8sMeta.NamedPorts
@@ -258,7 +271,7 @@ func (ipc *IPCache) upsertLocked(ip string, hostIP net.IP, hostKey uint8, k8sMet
 
 	cachedIdentity, found := ipc.ipToIdentityCache[ip]
 	if found {
-		if !source.AllowOverwrite(cachedIdentity.Source, newIdentity.Source) {
+		if !force && !source.AllowOverwrite(cachedIdentity.Source, newIdentity.Source) {
 			return false, NewErrOverwrite(cachedIdentity.Source, newIdentity.Source)
 		}
 
diff --git a/pkg/ipcache/metadata.go b/pkg/ipcache/metadata.go
index 22d2415423..ecdd4a80ed 100644
--- a/pkg/ipcache/metadata.go
+++ b/pkg/ipcache/metadata.go
@@ -73,12 +73,15 @@ func GetIDMetadataByIP(prefix string) labels.Labels {
 // InjectLabels injects labels from the identityMetadata (IDMD) map into the
 // identities used for the prefixes in the IPCache. The given source is the
 // source of the caller, as inserting into the IPCache requires knowing where
-// this updated information comes from.
+// this updated information comes from. Conversely, RemoveLabelsExcluded()
+// performs the inverse: removes labels from the IDMD map and releases
+// identities allocated by this function.
 //
 // Note that as this function iterates through the IDMD, if it detects a change
 // in labels for a given prefix, then this might allocate a new identity. If a
 // prefix was previously associated with an identity, it will get deallocated,
-// so a balance is kept.
+// so a balance is kept, ensuring a one-to-one mapping between prefix and
+// identity.
 func InjectLabels(src source.Source, updater identityUpdater, triggerer policyTriggerer) error {
 	if IdentityAllocator == nil || !IdentityAllocator.IsLocalIdentityAllocatorInitialized() {
 		return ErrLocalIdentityAllocatorUninitialized
@@ -131,8 +134,8 @@ func InjectLabels(src source.Source, updater identityUpdater, triggerer policyTr
 				newLbls := id.Labels
 				tmpSrc = source.KubeAPIServer
 				trigger = true
-				// If any reserved ID has changed, update its labels.
-				if id.IsReserved() {
+				// If host identity has changed, update its labels.
+				if id.ID == identity.ReservedIdentityHost {
 					identity.AddReservedIdentityWithLabels(id.ID, newLbls)
 				}
 				idsToPropagate[id.ID] = newLbls.LabelArray()
@@ -189,7 +192,7 @@ func InjectLabels(src source.Source, updater identityUpdater, triggerer policyTr
 		if _, err := IPIdentityCache.upsertLocked(ip, hIP, key, meta, Identity{
 			ID:     id.ID,
 			Source: id.Source,
-		}); err != nil {
+		}, false /* !force */); err != nil {
 			return fmt.Errorf("failed to upsert %s into ipcache with identity %d: %w", ip, id.ID, err)
 		}
 	}
@@ -305,12 +308,9 @@ func filterMetadataByLabels(filter labels.Labels) []string {
 	return matching
 }
 
-// removeLabelsFromIPs provides a convenient method for the caller to remove
-// all given prefixes at once. This function should be the inverse of
-// InjectLabels(). This function will trigger policy update and recalculation
-// if necessary on behalf of the caller.
-//
-// Identities allocated by InjectLabels() may be released by removeLabels().
+// removeLabelsFromIPs removes all given prefixes at once. This function will
+// trigger policy update and recalculation if necessary on behalf of the
+// caller.
 //
 // A prefix will only be removed from the IDMD if the set of labels becomes
 // empty.
@@ -325,8 +325,8 @@ func removeLabelsFromIPs(
 	var (
 		idsToAdd    = make(map[identity.NumericIdentity]labels.LabelArray)
 		idsToDelete = make(map[identity.NumericIdentity]labels.LabelArray)
-		// toReplace stores identity pairs to replace in the ipcache.
-		toReplace = make(map[string]idPair)
+		// toReplace stores the identity to replace in the ipcache.
+		toReplace = make(map[string]Identity)
 	)
 
 	IPIdentityCache.Lock()
@@ -338,48 +338,25 @@ func removeLabelsFromIPs(
 			continue
 		}
 
+		idsToDelete[id.ID] = nil // labels for deletion don't matter to UpdateIdentities()
+
 		// Insert to propagate the updated set of labels after removal.
-		//
-		// There are two concurring cases here and they are annotated in the
-		// code below:
-		//   1) When labels are disassociated with an IP, the remaining set
-		//      either contains some labels or it's empty. In the former, the
-		//      identity is both marked for deletion and addition, because
-		//      there are labels still associated with the IP. In the latter,
-		//      the identity is marked for deletion only.
-		//   2) If removeLabels() indicates that a new identity needs to be
-		//      generated (newIdentity), a new identity is allocated (with the
-		//      appropriate labels removed), marked for addition, and lastly,
-		//      upserted into the ipcache (replacing the existing identity
-		//      entry). This can happen when an existing identity was *not*
-		//      released during removeLabels() due to the identity's refcount
-		//      being >1.
-		var (
-			newIdentity bool
-			la          labels.LabelArray
-		)
-		l, newIdentity := removeLabels(prefix, lbls, src)
-		if l != nil {
-			la = l.LabelArray()
-		}
-		idsToDelete[id.ID] = la
-		if len(la) > 0 { // (1)
-			// If for example kube-apiserver label is removed from
-			// a remote-node, then removeLabels() will return a
-			// non-empty set representing the new full set of
-			// labels to associate with the node. In order to
-			// propagate the new identity, we must emit a delete
-			// event for the old identity and then an add event for
-			// the new identity.
-			idsToAdd[id.ID] = la
-		}
-		if newIdentity { // (2)
-			// If for example kube-apiserver label is removed from a CIDR
-			// identity (contains CIDR labels), then removeLabels() will return
-			// a non-empty set representing just the CIDR labels. We do the
-			// same as (1) from above, but additionally, we must also mark the
-			// new identity for addition and upsert the new identity into the
-			// IPCache.
+		l := removeLabels(prefix, lbls, src)
+		if len(l) > 0 {
+			// If for example kube-apiserver label is removed from the local
+			// host identity (when the kube-apiserver is running within the
+			// cluster and it is no longer running on the current local host),
+			// then removeLabels() will return a non-empty set representing the
+			// new full set of labels to associate with the node (in this
+			// example, just the local host label). In order to propagate the
+			// new identity, we must emit a delete event for the old identity
+			// and then an add event for the new identity.
+
+			// If host identity has changed, update its labels.
+			if id.ID == identity.ReservedIdentityHost {
+				identity.AddReservedIdentityWithLabels(id.ID, l)
+			}
+
 			newID, _, err := IdentityAllocator.AllocateIdentity(context.TODO(), l, false)
 			if err != nil {
 				log.WithError(err).WithFields(logrus.Fields{
@@ -392,16 +369,10 @@ func removeLabelsFromIPs(
 				)
 				continue
 			}
-			idsToAdd[newID.ID] = la
-			toReplace[prefix] = idPair{
-				old: Identity{
-					ID:     id.ID,
-					Source: sourceByLabels(src, lbls),
-				},
-				new: Identity{
-					ID:     newID.ID,
-					Source: sourceByLabels(src, l),
-				},
+			idsToAdd[newID.ID] = l.LabelArray()
+			toReplace[prefix] = Identity{
+				ID:     newID.ID,
+				Source: sourceByLabels(src, l),
 			}
 		}
 	}
@@ -417,64 +388,89 @@ func removeLabelsFromIPs(
 
 		triggerer.TriggerPolicyUpdates(false, "kube-apiserver identity updated by removal")
 	}
-	for ip, pair := range toReplace {
+	for ip, id := range toReplace {
 		hIP, key := IPIdentityCache.getHostIPCache(ip)
 		meta := IPIdentityCache.getK8sMetadata(ip)
-		IPIdentityCache.deleteLocked(ip, pair.old.Source) // remove old entry and replace
-		if _, err := IPIdentityCache.upsertLocked(ip, hIP, key, meta, Identity{
-			ID:     pair.new.ID,
-			Source: pair.new.Source,
-		}); err != nil {
+		if _, err := IPIdentityCache.upsertLocked(
+			ip,
+			hIP,
+			key,
+			meta,
+			id,
+			true, /* force upsert */
+		); err != nil {
 			log.WithError(err).WithFields(logrus.Fields{
 				logfields.IPAddr:   ip,
-				logfields.Identity: pair.new,
+				logfields.Identity: id,
 			}).Error("Failed to replace ipcache entry with new identity after label removal. Traffic may be disrupted.")
 		}
 	}
 }
 
-type idPair struct {
-	old, new Identity
-}
-
 // removeLabels removes the given labels association with the given prefix. The
-// leftover labels are returned, if any, and a boolean indicating whether the
-// caller must generate a new identity with the leftover labels. If the caller
-// must do so, then the following must occur *in order* to avoid drops:
-//   1) policy recalculation must be triggered
+// leftover labels are returned, if any. If there are leftover labels, the
+// caller must allocate a new identity and do the following *in order* to avoid
+// drops:
+//   1) policy recalculation must be implemented into the datapath and
 //   2) new identity must have a new entry upserted into the IPCache
+// Note: GH-17962, triggering policy recalculation doesn't actually *implement*
+// the changes into datapath (because it's an async call), this is a known
+// issue. There's a very small window for drops when two policies select the
+// same traffic and the identity changes. For example, this is possible if an
+// IP is associated with the kube-apiserver and referenced inside a ToCIDR
+// policy, and then the IP is no longer associated with the kube-apiserver.
 //
 // Identities are deallocated and their subequent entry in the IPCache is
 // removed if the prefix is no longer associated with any labels.
 //
 // This function assumes that the IDMD and the IPIdentityCache locks are taken!
-//
-// It is the responsibility of the caller to trigger policy recalculation after
-// calling this function.
-func removeLabels(prefix string, lbls labels.Labels, src source.Source) (labels.Labels, bool) {
-	var needNewIdentity bool
-
+func removeLabels(prefix string, lbls labels.Labels, src source.Source) labels.Labels {
 	l, ok := identityMetadata[prefix]
 	if !ok {
-		return nil, needNewIdentity
+		return nil
 	}
 
 	l = l.Remove(lbls)
-	if len(l) != 0 { // Labels left over, do not deallocate
+	if len(l) == 0 { // Labels empty, delete
+		// No labels left. Example case: when the kube-apiserver is running
+		// outside of the cluster, meaning that the IDMD only ever had the
+		// kube-apiserver label (CIDR labels are not added) and it's now being
+		// removed.
+		delete(identityMetadata, prefix)
+	} else {
+		// This case is only hit if the prefix for the kube-apiserver is
+		// running within the cluster. Therefore, the IDMD can only ever has
+		// the following labels:
+		//   * kube-apiserver + remote-node
+		//   * kube-apiserver + host
 		identityMetadata[prefix] = l
-		return l, needNewIdentity
 	}
 
-	// No labels left, perform deallocation
-
-	delete(identityMetadata, prefix)
 	id, exists := IPIdentityCache.LookupByIPRLocked(prefix)
 	if !exists {
-		return nil, needNewIdentity
+		log.WithFields(logrus.Fields{
+			logfields.CIDR:   prefix,
+			logfields.Labels: lbls,
+		}).Warn(
+			"Identity for prefix was unexpectedly not found in ipcache, unable " +
+				"to remove labels from prefix. If a network policy is applied, check " +
+				"for any drops. It's possible that insertion or removal from " +
+				"the ipcache failed.",
+		)
+		return nil
 	}
 	realID := IdentityAllocator.LookupIdentityByID(context.TODO(), id.ID)
 	if realID == nil {
-		return nil, needNewIdentity
+		log.WithFields(logrus.Fields{
+			logfields.CIDR:     prefix,
+			logfields.Labels:   lbls,
+			logfields.Identity: id,
+		}).Warn(
+			"Identity unexpectedly not found within the identity allocator, " +
+				"unable to remove labels from prefix. It's possible that insertion " +
+				"or removal from the ipcache failed.",
+		)
+		return nil
 	}
 	released, err := IdentityAllocator.Release(context.TODO(), realID, false)
 	if err != nil {
@@ -486,21 +482,21 @@ func removeLabels(prefix string, lbls labels.Labels, src source.Source) (labels.
 		}).Error(
 			"Failed to release assigned identity to IP while removing label association, this might be a leak.",
 		)
+		return nil
 	}
 	if released {
 		IPIdentityCache.deleteLocked(prefix, sourceByLabels(src, lbls))
-		return nil, needNewIdentity
+		return nil
 	}
 
 	// Generate new identity with the label removed. This should be the case
 	// where the existing identity had >1 refcount, meaning that something was
 	// referring to it.
-	needNewIdentity = true
 	allLbls := labels.NewLabelsFromModel(nil)
 	allLbls.MergeLabels(realID.Labels)
 	allLbls = allLbls.Remove(lbls)
 
-	return allLbls, needNewIdentity
+	return allLbls
 }
 
 func sourceByLabels(d source.Source, lbls labels.Labels) source.Source {

christarazi and others added 8 commits December 21, 2021 14:55
Clarify that the label injection is also triggered from the node
discovery package.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity
metadata")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Previously, the label injection logic was using the source passed into
InjectLabels(), and on top of that, modifying it instead of taking a
copy. This is important because if we modify the source to upgrade it to
the KubeAPIServer source (highest priority), then all subsequent entries
into the ipcache through the IDMD (from the same InjectLabels()
invocation) will have the KubeAPIServer source! This can cause entries to
never be modifiable again on a subsequent invocation of InjectLabels().

Here's an example error from the ipcache-inject-labels controller:

```
  ipcache-inject-labels                         37m4s ago      14s ago      63      failed to inject labels into ipcache: failed to upsert 10.24.1.96 into ipcache with identity 6: unable to overwrite source "kube-apiserver" with source "custom-resource"
```

In the above error, we are failing to update ID 6! This meant that any
updates to the remote-node-associated IPs would never be updateable
(unless they somehow were made with the KubeAPIServer source).

This commit fixes this by creating a copy of the source for each
iteration through the IDMD instead of sharing the global, as well as
fixing a typo where the source was not used from `toUpsert`.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity
metadata")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
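To illustrate the bug pattern, here is a minimal Go sketch (all names hypothetical; the real logic lives in InjectLabels()): a loop that upgrades a shared source variable leaks the upgrade into every later iteration, while the fix copies the source per iteration.

```go
package main

import "fmt"

type Source string

const (
	CustomResource Source = "custom-resource"
	KubeAPIServer  Source = "kube-apiserver" // highest-priority source
)

type entry struct {
	ip           string
	hasAPIServer bool
}

// injectBuggy mirrors the bug: the upgrade to KubeAPIServer persists
// across loop iterations, so unrelated IPs are upserted with the
// strongest source and can never be overwritten by weaker sources.
func injectBuggy(entries []entry, src Source) {
	for _, e := range entries {
		if e.hasAPIServer {
			src = KubeAPIServer // BUG: upgrade leaks into later iterations
		}
		fmt.Printf("buggy: %s <- %q\n", e.ip, src)
	}
}

// injectFixed works on a per-iteration copy, matching the tmpSrc
// approach in the commit.
func injectFixed(entries []entry, src Source) {
	for _, e := range entries {
		tmpSrc := src // per-iteration copy
		if e.hasAPIServer {
			tmpSrc = KubeAPIServer
		}
		fmt.Printf("fixed: %s <- %q\n", e.ip, tmpSrc)
	}
}

func main() {
	entries := []entry{
		{"192.0.2.10", true},  // kube-apiserver backend
		{"192.0.2.20", false}, // ordinary remote node
	}
	injectBuggy(entries, CustomResource) // 192.0.2.20 wrongly gets "kube-apiserver"
	injectFixed(entries, CustomResource) // 192.0.2.20 keeps "custom-resource"
}
```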
Previously, any IP that mapped to a reserved ID was marked for
propagation to the ipcache in InjectLabels(). This was meant to
ensure any reserved IDs associated with the kube-apiserver would be
properly accounted for. To give a specific example: if the
kube-apiserver is running within the cluster and the local host contains
the kube-apiserver, then the local host reserved ID (1) has the
`reserved:host` + `reserved:kube-apiserver` labels. Cilium starts up
with ID 1 only containing the `reserved:host` label, and only adds the
`reserved:kube-apiserver` label once the K8s Endpoint watcher triggers
the InjectLabels() logic. Hence this condition existed to update
reserved ID 1 when the `reserved:kube-apiserver` label was associated
with it.

The problem with the scope of the condition is that it leads to
unnecessary updates to the ipcache. Specifically, any call to
InjectLabels() will cause ipcache update failures to occur for reserved
IDs 1 & 6, after InjectLabels() runs for the very first time. This is
because node discovery will trigger InjectLabels() to insert all known
IPs of the local node with source=Local. After this, the CiliumNode
watcher will trigger InjectLabels() with source=CustomResource which
causes the following failing controller:

```
ipcache-inject-labels    4m15s ago   17s ago 22    failed to inject labels into ipcache: failed to upsert 10.168.15.192 into ipcache with identity 6: unable to overwrite source "local" with source "custom-resource"
```

This commit fixes this by reducing the scope of the condition for
propagating updates to the ipcache to only those IDMD entries which
contain the kube-apiserver label. This still includes entries such as
reserved ID 1.

In addition, ensure that the final set of labels attached to the
identity object itself is used when updating the identity selector
cache (UpdateIdentities()), rather than the labels associated with the
prefix from the IDMD. If the final set is not used, then a race is
possible where the next call to UpdateIdentities() will revert the
work done in the first call, causing the selectors to lose track of
which identities they selected.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity
metadata")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
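A rough sketch of the narrowed propagation condition, using a hypothetical label-set type rather than the real pkg/labels API:

```go
package main

import "fmt"

// labelSet is a hypothetical stand-in for the IDMD's per-prefix labels.
type labelSet map[string]struct{}

func (l labelSet) has(key string) bool {
	_, ok := l[key]
	return ok
}

func main() {
	identityMetadata := map[string]labelSet{
		"192.0.2.1/32": {"reserved:kube-apiserver": {}, "reserved:remote-node": {}},
		"192.0.2.2/32": {"reserved:remote-node": {}},
	}

	for prefix, lbls := range identityMetadata {
		// Narrowed condition: propagate only entries carrying the
		// kube-apiserver label, instead of every reserved identity, to
		// avoid the needless upserts that produced the overwrite errors.
		if !lbls.has("reserved:kube-apiserver") {
			continue
		}
		fmt.Println("propagate to ipcache:", prefix)
	}
}
```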
Previously, there existed logic to handle "transient identities" while
Cilium is bootstrapping and receiving cluster information from EP & CN
watchers, such that it might generate transient identities related to
the kube-apiserver IPs. This logic would release the transient identity
and re-allocate a new one if needed. Given that InjectLabels() already
errors early if the K8s caches are not synced, it is actually not
possible to have transient identities generated from incomplete
information. In reality, by the time InjectLabels() is running, it has
the full set of cluster information needed from EP & CN watchers, so the
identities generated are "stable".

This logic needed to be removed because it was causing a bug reported
by a community user on Slack. The symptom was that any update to the CN
(CiliumNode) resource triggered the label injection, which released and
re-allocated the CIDR identity for the kube-apiserver running outside
of the cluster. This unnecessary identity churn is problematic as it
can cause policy drops and exhaust the local identity range.

Additionally, the question of what happens to the CIDR identity
reference counts is already handled inside InjectLabels(): when the
identity is neither reserved nor new, it releases the identity
(refcount decrement). Hence, this commit fixes the identity churn by
removing the extraneous, unneeded logic from injectLabels().

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity
metadata")

Reported-by: Alexander Block <ablock84@gmail.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
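To make the churn concrete, here is a toy refcounted allocator (purely illustrative, not the real Cilium allocator API) showing why a release/re-allocate cycle on every CiliumNode update can hand back a new numeric identity:

```go
package main

import "fmt"

// allocator is a toy refcounted identity allocator.
type allocator struct {
	refs   map[string]int
	ids    map[string]int
	nextID int
}

func (a *allocator) allocate(lbls string) int {
	if a.refs[lbls] == 0 {
		a.nextID++ // refcount was zero: a fresh numeric ID is minted
		a.ids[lbls] = a.nextID
	}
	a.refs[lbls]++
	return a.ids[lbls]
}

func (a *allocator) release(lbls string) {
	if a.refs[lbls] == 0 {
		return
	}
	a.refs[lbls]--
	if a.refs[lbls] == 0 {
		delete(a.ids, lbls) // identity freed; policy gap until re-added
	}
}

func main() {
	a := &allocator{refs: map[string]int{}, ids: map[string]int{}}
	lbls := "cidr:192.0.2.0/24;reserved:kube-apiserver"

	id1 := a.allocate(lbls)
	a.release(lbls) // the removed logic did this on every CN update...
	id2 := a.allocate(lbls)
	fmt.Println(id1, id2, id1 == id2) // 1 2 false: the identity churned
}
```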
The previous variable was always non-empty, as callers of
RemoveLabelsFromIPs() will provide labels to be removed from IPs;
otherwise the call would be useless. This caused identities to
always be refreshed in the selector cache for no reason.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity
metadata")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This will be useful in the next commit where the unit tests will rely on
the changes in this commit to validate refcounts.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This handles the case where there is a kube-apiserver (outside of the
cluster) and a CIDR policy selecting the same IP(s) and identity. In
this case, there will be one identity that both policies select
(identity refcount == 2). When the kube-apiserver changes IP(s), the
previous IP(s) will no longer be associated with the kube-apiserver
label inside the IDMD, which causes the kube-apiserver and ToCIDR
policy to now select two different identities.

One of the identities from above will become a pure CIDR identity
because it will no longer have the kube-apiserver label (due to the IPs
changing) and only have CIDR labels left over. Once that identity is
generated, we need to replace the existing ipcache entry for the
previously-associated IP(s) of the kube-apiserver with the new identity
without the kube-apiserver label (just CIDR labels).

However, in order to "replace" the ipcache entry, we needed to allow
force upserting into the ipcache. This is because the ipcache entry for
the kube-apiserver IPs would be created with src=kube-apiserver (which
is the strongest source). The ipcache entry for the new identity will be
created with src=generated, which is weaker than src=kube-apiserver.

Unfortunately, the alternatives aren't great either. One of them is to
delete the ipcache entry with src=kube-apiserver and then re-add the
CIDR ipcache entry with src=generated. However, this is not atomic and
introduces an undesirable transient state in the datapath. The other
alternative is to address GH-18301; however, that is non-trivial and was
avoided in this commit.

This commit allows for the aforementioned case to be handled and a unit
test was added to validate that the behavior is correct.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity
metadata")

Co-authored-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
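A rough sketch of the source-precedence check and the force escape hatch (ranks and names are illustrative, not the real pkg/source ordering):

```go
package main

import "fmt"

// rank is an illustrative priority ordering; higher wins.
var rank = map[string]int{
	"generated":       1,
	"custom-resource": 2,
	"kube-apiserver":  3, // strongest source
}

func allowOverwrite(existing, next string) bool {
	return rank[next] >= rank[existing]
}

// upsert refuses weaker-source overwrites unless force is set,
// mirroring the force parameter added to upsertLocked() in this PR.
func upsert(existing, next string, force bool) error {
	if !force && !allowOverwrite(existing, next) {
		return fmt.Errorf("unable to overwrite source %q with source %q",
			existing, next)
	}
	return nil
}

func main() {
	// Replacing the kube-apiserver entry with a generated CIDR identity:
	fmt.Println(upsert("kube-apiserver", "generated", false)) // refused
	fmt.Println(upsert("kube-apiserver", "generated", true))  // forced: <nil>
}
```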
Previously, this code suffered from a TOCTOU bug. The IDMD lock would be
taken by two different calls into the IDMD logic, which assumed that the
underlying IDMD would stay in the same state across both calls. That
assumption is incorrect: because two separate calls each took the IDMD
lock, the IDMD could change in between the two calls, hence TOCTOU.

This commit fixes this by amending the API of the IDMD logic to expose
one function to perform the work of the previous two functions. Those
two functions are now unexported.

Fixes: 2b17d4d ("ipcache, policy: Inject labels from identity
metadata")

Reported-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
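A minimal sketch of the TOCTOU shape and the fix, using a hypothetical locked store in place of the IDMD:

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu   sync.Mutex
	data map[string]string
}

// lookup takes and drops the lock, so its result may be stale by the
// time the caller acts on it.
func (s *store) lookup(k string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[k]
	return v, ok
}

// removeBuggy is the TOCTOU shape: check under one lock acquisition,
// act under another; the entry can change in between.
func (s *store) removeBuggy(k, expected string) {
	if v, ok := s.lookup(k); ok && v == expected { // time of check
		s.mu.Lock()
		delete(s.data, k) // time of use: k may no longer map to expected
		s.mu.Unlock()
	}
}

// Remove is the fixed shape: one exported function holds the lock
// across both the check and the mutation, as this commit does for the
// IDMD logic.
func (s *store) Remove(k, expected string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v, ok := s.data[k]; ok && v == expected {
		delete(s.data, k)
	}
}

func main() {
	s := &store{data: map[string]string{"10.0.0.1": "id:16777217"}}
	s.removeBuggy("10.0.0.1", "id:16777217")
	s.Remove("10.0.0.1", "id:16777217")
	fmt.Println(len(s.data)) // 0
}
```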
@christarazi christarazi force-pushed the pr/christarazi/fix-transient-identity-kube-apiserver branch from 1795722 to a256ebf on December 21, 2021 22:55
@christarazi

/test


@joestringer joestringer left a comment


LGTM.

@joestringer joestringer merged commit 69e843c into cilium:master Jan 4, 2022
@christarazi christarazi deleted the pr/christarazi/fix-transient-identity-kube-apiserver branch January 5, 2022 00:00
@joestringer joestringer added this to Needs backport from master in 1.11.1 Jan 5, 2022
@pchaigno pchaigno added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed needs-backport/1.11 labels Jan 10, 2022
@christarazi christarazi moved this from Needs backport from master to Backport done to v1.11 in 1.11.1 Jan 12, 2022