identity: Initialize local identity allocator early #19556

Merged

Conversation

@jrajahalme (Member) commented Apr 25, 2022

Move local identity allocator initialization into
NewCachingIdentityAllocator() so that the local allocator is ready by the
time the allocator is returned to the caller. Also create the events
channel and start the watcher in NewCachingIdentityAllocator(). Close()
no longer GCs the local identity allocator or stops the watcher: now that
locally allocated identities are persisted across restarts via the bpf
ipcache map, recycling them at runtime via Close() would be inappropriate.
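
A minimal, self-contained Go sketch of this lifecycle; the type shape, field names, and channel details below are assumptions for illustration, not Cilium's actual code.

package sketch

// identityEvent stands in for the allocator's identity change events.
type identityEvent struct{ numericID uint32 }

// CachingIdentityAllocator: assumed, simplified shape.
type CachingIdentityAllocator struct {
	events          chan identityEvent
	localIdentities map[string]uint32 // local (e.g. CIDR) identity cache
}

// NewCachingIdentityAllocator does the local setup eagerly: the local
// identity cache, the events channel, and the watcher all exist by the
// time the allocator is returned to the caller.
func NewCachingIdentityAllocator() *CachingIdentityAllocator {
	m := &CachingIdentityAllocator{
		events:          make(chan identityEvent, 1024),
		localIdentities: make(map[string]uint32),
	}
	go m.watch() // watcher runs for the agent's lifetime
	return m
}

func (m *CachingIdentityAllocator) watch() {
	for range m.events {
		// forward identity additions/deletions to the policy subsystem
	}
}

// Close tears down only global allocator state. The local cache and the
// watcher are deliberately left running: locally allocated identities are
// persisted across restarts via the bpf ipcache map, so recycling them
// here would be inappropriate.
func (m *CachingIdentityAllocator) Close() {
	// global allocator teardown only; no local GC, watcher keeps running
}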

This is then used in the daemon bootstrap to restore locally allocated
identities before new policies can be received via the Cilium API or the
k8s API.

This fixes an issue where CIDR policies were received from k8s before the
locally allocated (CIDR) identities were restored, causing the identities
derived from the received policy to be newly allocated with different
numeric values, ultimately leading to policy drops during Cilium restart.
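
Continuing the sketch above, the bootstrap ordering this enables might look like the following; the helper names here are invented for illustration.

// Hypothetical daemon bootstrap ordering enabled by this change.
func bootstrapSketch() {
	alloc := NewCachingIdentityAllocator() // local allocator usable immediately
	restoreLocalIdentities(alloc)          // reuse numeric IDs persisted in the bpf ipcache map
	startPolicyWatchers(alloc)             // only now accept policies via the Cilium API / k8s API
}

// Stubs standing in for the real restore and policy watcher logic.
func restoreLocalIdentities(a *CachingIdentityAllocator) {}
func startPolicyWatchers(a *CachingIdentityAllocator)    {}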

Fixes: #19360
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>

@jrajahalme jrajahalme added the sig/policy, release-note/misc, and sig/agent labels Apr 25, 2022
@jrajahalme jrajahalme requested review from a team as code owners April 25, 2022 13:10
@jrajahalme jrajahalme requested review from a team, nebril, jrfastab and joestringer April 25, 2022 13:10
@jrajahalme jrajahalme force-pushed the restore-cidr-identities-earlier branch from 97042ec to 99e5623 April 25, 2022 13:11
@jrajahalme jrajahalme marked this pull request as draft April 25, 2022 13:12
@jrajahalme (Member Author)

/test

@jrajahalme jrajahalme force-pushed the restore-cidr-identities-earlier branch from 99e5623 to 078017f April 25, 2022 14:50
@jrajahalme (Member Author)

Resolved Go lint issue.

@jrajahalme (Member Author)

Verified locally this works as intended, opening for reviews.

@jrajahalme jrajahalme marked this pull request as ready for review April 25, 2022 14:51
@jrajahalme (Member Author) commented Apr 25, 2022

/test

Job 'Cilium-PR-K8s-GKE' failed:

Test Name: K8sDatapathConfig Host firewall With VXLAN
Failure Output: FAIL: Failed to reach 10.128.0.16:80 from testclient-jq8jz

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@jrajahalme (Member Author)

Investigating a runtime test failure that seems relevant.

@jrajahalme jrajahalme force-pushed the restore-cidr-identities-earlier branch 2 times, most recently from 689b12e to e5c9337 April 25, 2022 19:49
@jrajahalme jrajahalme requested a review from a team April 25, 2022 19:49
@joestringer (Member) left a comment

Looks good, just a couple of minor nits.

Comment on lines -166 to -167
case <-w.stopChan:
return
Member

Why should the identityWatcher continue to emit events into the policy subsystem if the agent is shutting down and the CachingIdentityAllocator is Close()d?

Member

Hmm, it doesn't look like that Close() function was called from the daemon yet anyway 🤔 This may have more to do with the tests than the runtime operation of the agent.

Member

Still, it would be nice to include the rationale for this aspect, since it seems like this moves to a pattern of:

foo := NewCachingIdentityAllocator(...) // set up the local identity allocator
foo.InitIdentityAllocator(...) // set up the global identity allocator, start handling those events
foo.Close() // Close the identity allocator but keep handling events from the global identity source

Member

Ehh, hmm, not sure I've quite got the logic right here. The core question I'm thinking about is how we ensure that the identityWatcher gets properly stopped. I'm going to put this down for now. I'm not sure of the right answer here; maybe there's some way to stop the identityWatcher by closing its events channel from the sender through some other codepath.
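
One concrete shape of that idea, sketched with assumed names rather than the actual Cilium code: if the watcher ranges over its events channel, close(events) on the sender side becomes the stop signal and no separate stopChan is needed.

type event struct{}

type identityWatcher struct{}

func (w *identityWatcher) handle(ev event) {} // stub: notify the policy subsystem

// watch drains events until the sender closes the channel; that close is
// the stop signal, so no separate stopChan is required.
func (w *identityWatcher) watch(events <-chan event) {
	go func() {
		for ev := range events {
			w.handle(ev)
		}
	}()
}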

Member Author

Yes, Close() is currently called only from daemon test code, but it nonetheless deletes the global identity allocator, so events from there should stop coming. So this change essentially makes the local identity allocator never go away, which is in line with making locally allocated identities more persistent.

We could remove Close() altogether, I suppose, to make this less confusing. The way I see it, when the local identity allocator was added, the code was patterned after the global one. For local identity restoration we need the local allocator initialized earlier than the global one can be (the global allocator needs node IP info for its kvstore prefix). I figured making the initialization of the local allocator more "static" is the simplest approach, as it does not have the same constraints as the global allocator.

Member

I feel like the right answer is to properly hook Close() into the shutdown sequence; that way we can also start to address the problem where warnings/errors show up during shutdown because some subsystems expect other subsystems to be in a certain state after the agent gets the signal to stop. That said, Jussi is working on improving the structure around agent initialization, and that may provide a way to naturally figure out the ordering and dependencies to shut these modules down, so I don't think we necessarily have to take any particular action on this in the short term.

pkg/identity/cache/local.go (outdated, resolved)
@joestringer (Member)

Heads up, I merged #19501.

@joestringer (Member)

Oh, and one more: this is also fixing #19360, right? As much as I am not fond of the idea of backporting this change, because it shifts around init logic further, we should probably either revert that patch on v1.10 or also backport this patch.

@jrajahalme jrajahalme force-pushed the restore-cidr-identities-earlier branch 2 times, most recently from 1d9170c to 66c5970 April 25, 2022 21:40
@jrajahalme jrajahalme requested a review from a team as a code owner April 25, 2022 21:40
@jrajahalme (Member Author)

net-next failed on pods not being deleted in time and other unrelated reasons; restarting.

@jrajahalme (Member Author)

/test-1.23-net-next

@jrajahalme (Member Author)

gke-stable failed on "kube-dns was not able to get into ready state"; restarting.

@jrajahalme (Member Author)

/test-gke

@jrajahalme jrajahalme added the kind/bug and ready-to-merge labels Apr 27, 2022
@joestringer joestringer merged commit 2e5f35b into cilium:master Apr 27, 2022
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Apr 29, 2022
InitK8sSubsystem() starts all k8s watchers concurrently, some of which
call into K8sCacheIsSynced() via ipcache/metadata.InjectLabels(), and
possibly from elsewhere as well. Initialize k8sCachesSynced before any
watchers are started to make this access safe. This fixes a data race
detected by race-detection builds.

Fixes: cilium#19614
Fixes: cilium#19556
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
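
A minimal Go illustration of the fix described in the commit message above; the names mirror the commit message, but the code shape is an assumption, not the actual Cilium implementation.

// The sync channel must be created before any goroutine that may read it
// is started; otherwise the write to k8sCachesSynced races with the reads.
type k8sWatcher struct {
	k8sCachesSynced chan struct{}
}

func (w *k8sWatcher) InitK8sSubsystem() {
	w.k8sCachesSynced = make(chan struct{}) // initialize first...
	for i := 0; i < 3; i++ {
		go w.runWatcher() // ...then start the watchers that read it
	}
}

func (w *k8sWatcher) runWatcher() {
	_ = w.K8sCacheIsSynced() // a watcher may check sync state at any time
}

// K8sCacheIsSynced reports whether the caches-synced channel has been closed.
func (w *k8sWatcher) K8sCacheIsSynced() bool {
	select {
	case <-w.k8sCachesSynced:
		return true
	default:
		return false
	}
}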
joestringer pushed a commit that referenced this pull request Apr 29, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.11 Apr 29, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge label Apr 29, 2022
@joestringer joestringer moved this from Backport pending to v1.10 to Backport pending to v1.11 in 1.11.5 Apr 29, 2022
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request May 2, 2022
[ upstream commit 916765b ]

joestringer pushed a commit that referenced this pull request May 4, 2022
[ upstream commit 916765b ]

@joestringer joestringer added the backport-done/1.11 label and removed the backport-pending/1.11 label May 4, 2022
@aanm aanm moved this from Backport pending to v1.11 to Backport done to v1.11 in 1.11.5 May 10, 2022
Labels
backport-done/1.11: The backport for Cilium 1.11.x for this PR is done.
kind/bug: This is a bug in the Cilium logic.
release-note/misc: This PR makes changes that have no direct user impact.
sig/agent: Cilium agent related.
sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
No open projects
1.10.11: Backport done to v1.10
1.11.5: Backport done to v1.11
Development

Successfully merging this pull request may close these issues: None yet

3 participants