Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

identity: stop double-update of selector cache and regenerate when a local identity is allocated #29865

Merged
merged 3 commits into from
Dec 20, 2023

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Dec 13, 2023

As I was looking in to some issues, I discovered that we are still regenerating all endpoints whenever we allocated a new local identity. This is not necessary, since we already have an incremental policy update engine.

This has been the case for some time, but it is a lot more relevant since #29036, which moved FQDN policy over to this code-path.

Background:

The AllocateIdentity() function takes a parameter, notifyOwner. When this is true, any newly allocated identities are passed to the policy engine by the identity allocator and an endpoint regeneration is triggered. However, there is a funny inconsistency: if the identity is local (i.e. a CIDR identity), then notifyOwner is ignored and the policy engine is always updated. Incidentally, as part of updating the policy engine, all endpoints are regenerated.

However, the only source of local (CIDR) identities now is the IPCache, which already has its own policy update mechanism. There is one missed case -- fixed in this PR -- so there really is no need to ignore notifyOwner. This is, also, the only caller where AllocateIdentity(notifyOwner = false).

The fix:

  1. Push allocated identities in to the policy engine with the legacy / synchronous ipcache AllocateCIDRs() / releaseCIDRIdentities() calls too. These methods are only have one caller: the ServiceHandler that translates toServices: policy rules to a CIDR selector.
  2. Make the local identity allocator respect notifyOwner, and don't pass policy updates when it is false.

What's changed?

It's useful to look at the flow, just to understand the full context of what's changed

Before:

  1. Something calls ipcache.UpsertMetadata(prefix, ...)
  2. The ipcache calls AllocateIdentity(labels, notifyOwner = false)
  3. The allocator calls SelectorCache.UpdateIdentities(...) and RegenerateAllEndpoints()
  4. The ipcache also calls SelectorCache.UpdateIdentities() and UpdatePolicyMaps()

After:

  1. Something calls ipcache.UpsertMetadata(prefix, ...)
  2. The ipcache calls AllocateIdentity(labels, notifyOwner = false)
  3. The ipcache calls SelectorCache.UpdateIdentities() and UpdatePolicyMaps()

So, this change is safe, in that it is not possible for a new identity to be missed.

Fixes: #29681
Fixes: #29846

@squeed squeed requested review from a team as code owners December 13, 2023 16:30
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 13, 2023
@squeed squeed added kind/bug This is a bug in the Cilium logic. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Dec 13, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Dec 13, 2023
@squeed squeed added sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. release-note/misc This PR makes changes that have no direct user impact. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Dec 13, 2023
@squeed squeed force-pushed the localalloc-stop-update-policy branch from f5908e7 to d45c917 Compare December 13, 2023 16:34
@squeed
Copy link
Contributor Author

squeed commented Dec 13, 2023

This may also fix #29846, which is an error message due to slow FQDN policy updates. Regeneration blocks incremental policy updates, which causes delays, which causes that error message.

@squeed
Copy link
Contributor Author

squeed commented Dec 13, 2023

/test
(edit: all green!)

@squeed squeed changed the title identity: stop double-update of selector cache and regenerate when local identity allocated identity: stop double-update of selector cache and regenerate when a local identity is allocated Dec 13, 2023
@squeed
Copy link
Contributor Author

squeed commented Dec 13, 2023

/test
(Previous run was 100% green, just paranoia)
result: a few red. One bing.com flake, one case of #29532 (run), one possible flake that involved nothing related to this codepath - no policy, pod-to-pod.

@squeed squeed force-pushed the localalloc-stop-update-policy branch from d45c917 to f76a756 Compare December 14, 2023 08:52
@squeed squeed requested a review from a team as a code owner December 14, 2023 08:52
@squeed
Copy link
Contributor Author

squeed commented Dec 14, 2023

/test

@gandro
Copy link
Member

gandro commented Dec 14, 2023

Given that this is a follow-up to a refactor which (to my knowledge) landed in v1.15, should we also backport this?

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for IPCache (I mainly focused if the new usage is consistent with how we do things in the async path), thanks. The overall reasoning makes sense, but I'll defer the identity allocator review to other reviewers.

@squeed squeed force-pushed the localalloc-stop-update-policy branch from f76a756 to e8362ff Compare December 14, 2023 09:47
@squeed
Copy link
Contributor Author

squeed commented Dec 14, 2023

Given that this is a follow-up to a refactor which (to my knowledge) landed in v1.15, should we also backport this?

Definitely, yes, this should be backported. It is causing CI flakes and annoying lock contention.

It's also not actually directly related to the FQDN refactor. This is a long-standing bug. However, we have the confidence to fix this now that there are only a very limited set of users of the synchronous APIs.

@squeed
Copy link
Contributor Author

squeed commented Dec 14, 2023

/test
(result: lots of failures to 1.1.1.1; we must have hit a rate limit)

@squeed squeed added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Dec 14, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in v1.15.0-rc.0 Dec 14, 2023
With the newer, asynchronous APIs, the ipcache takes on the
responsibility of updating the policy engine (i.e. the SelectorCache)
whenever it allocates an identity. However, the legacy APIs don't do
that. This commit changes that.

By centralizing this in the ipcache, we can (in a subsequent commit)
stop regenerating all endpoints whenever an identity is allocated. This
is wasteful, since local identities can only ever be allocated by the
ipcache now.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
AllocateIdentity() takes a parameter, `notifyOwner`, that when true,
tells the allocator to propagate those updates to the policy subsystem.

However, the local (i.e. CIDR) identity allocator ignores this field,
and *always* propagates identity updates to the policy subsystem as well
as **always regenerating all endpoints**. This is needless, since the
ipcache always updates the policy system as well.

This change is safe since the ipcache is the only caller that passes
`notifyOwner = false`, and the ipcache updates the policy system itself.
All other callers to `AllocateIdentity()` set `notifyOwner = true`.

Fixes: cilium#29681

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This should be fixed now.

Fixes: cilium#29681

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed squeed force-pushed the localalloc-stop-update-policy branch from 8206a2c to 31339e2 Compare December 19, 2023 09:40
@squeed
Copy link
Contributor Author

squeed commented Dec 19, 2023

/test

@squeed
Copy link
Contributor Author

squeed commented Dec 19, 2023

All green!

@squeed squeed added the affects/v1.15 This issue affects v1.15 branch label Dec 19, 2023
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for ci-structure.

@pchaigno pchaigno removed the request for review from christarazi December 19, 2023 11:47
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is correct, but I raised a couple of points below for additional due diligence. The one conclusion I can come to here is that we can delete quite a lot of complicated logic once we finally migrate the k8sServiceHandler over to the new ipcache model 🎉

I also took a look over the callers who are passing the notifyOwner flag to the identity subsystem and I agree with your assessment that this should be safe / only removing a double-trigger.

I suspect that there's also a subsequent improvement to re-evaluate the role of notifications in the identity subsystem and potentially even remove them. Maybe we could prepare a short CFP to describe the design impacts of that change and float it around for discussion to confirm that's the right path forward.

pkg/ipcache/cidr.go Show resolved Hide resolved
pkg/ipcache/cidr.go Show resolved Hide resolved
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 19, 2023
@nathanjsweet nathanjsweet added this pull request to the merge queue Dec 20, 2023
Merged via the queue into cilium:main with commit 7ebaf63 Dec 20, 2023
62 checks passed
@pippolo84 pippolo84 mentioned this pull request Jan 2, 2024
17 tasks
@pippolo84 pippolo84 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Jan 2, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.15 in v1.15.0-rc.1 Jan 2, 2024
@aanm aanm added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Jan 16, 2024
@aanm aanm added this to Backport pending to v1.15 in 1.15.1 Jan 31, 2024
@aanm aanm removed this from Backport pending to vv1.15.0-rc in v1.15.0-rc.1 Jan 31, 2024
@aanm aanm removed this from Backport pending to v1.15 in 1.15.1 Jan 31, 2024
@aanm aanm added this to Backport done to v1.15 in v1.15.0-rc.1 Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.15 This issue affects v1.15 branch backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
No open projects
v1.15.0-rc.1
Backport done to v1.15
8 participants