identity: Initialize local identity allocator early #19556
Conversation
Force-pushed from 97042ec to 99e5623.
/test
Force-pushed from 99e5623 to 078017f.
Resolved Go lint issue.
Verified locally this works as intended, opening for reviews.
/test

Job 'Cilium-PR-K8s-GKE' failed. If it is a flake and a GitHub issue doesn't already exist to track it, comment …
Investigating a runtime test failure that seems relevant.
Force-pushed from 689b12e to e5c9337.
Looks good, just a couple of minor nits.
case <-w.stopChan:
	return
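For context, a rough sketch of the kind of select loop these lines belong to; the type, field, and event names here are assumptions for illustration, not the exact Cilium code:

package sketch

type identityEvent struct{}

type identityWatcher struct {
	stopChan chan struct{}
}

// watch drains events until the sender closes the channel or stopChan fires.
func (w *identityWatcher) watch(events <-chan identityEvent) {
	go func() {
		for {
			select {
			case _, ok := <-events:
				if !ok {
					return // sender closed the events channel
				}
				// ... forward the event to the policy subsystem ...
			case <-w.stopChan:
				return // agent is shutting down
			}
		}
	}()
}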
Why should the identityWatcher continue to emit events into the policy subsystem if the agent is shutting down and the CachingIdentityAllocator is Close()d?
Hmm, it doesn't look like that Close() function was called from the daemon yet anyway 🤔 This may have more to do with the tests than the runtime operation of the agent.
Still, it would be nice to include the rationale for this aspect, since it seems like this moves to a pattern of:
foo := NewCachingIdentityAllocator(...) // set up the local identity allocator
foo.InitIdentityAllocator(...) // set up the global identity allocator, start handling those events
foo.Close() // Close the identity allocator but keep handling events from the global identity source
Ehh, hmm, not sure I've quite got the logic right here. The core question I'm thinking about is how we ensure that the identityWatcher gets properly stopped. I'm going to put this down for now. Not sure of the right answer here; maybe there's some answer around stopping the identityWatcher by closing its events channel from the sender through some other codepath.
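One way to read the "closing its events channel from the sender" idea is the usual Go pattern below. This is a minimal, hypothetical sketch, not Cilium's actual code:

package sketch

type identityEvent struct{}

// The sender owns the events channel and closes it on shutdown; the
// watcher's range loop then terminates on its own, with no separate
// stop channel required.
func startWatcher(events <-chan identityEvent) {
	go func() {
		for range events { // exits once the sender calls close()
			// ... handle the event ...
		}
	}()
}

func shutdownSender(events chan identityEvent) {
	close(events) // stops the watcher goroutine started above
}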
Yes, Close() is currently called only from daemon test code, but it nonetheless still deletes the global identity allocator, so events from there should stop coming. So this change essentially makes the local identity allocator never go away, which is in line with making locally allocated identities more persistent.

We could remove Close() altogether, I suppose, to make this less confusing. The way I see it, when the local identity allocator was added, the code was patterned after the global one. For local identity restoration we need the local allocator initialized earlier than the global one can be (the global one needs node IP info for the kvstore prefix). I figured making the initialization of the local allocator more "static" is the simplest approach, as it does not have the same constraints as the global allocator.
I feel like the right answer is properly hooking Close() into the shutdown sequence. That way we can also start to address the problem where warnings/errors show up during shutdown because some subsystems expect other subsystems to be in a certain state after the agent gets the signal to stop. That said, Jussi is working on improving the structure around agent initialization, and that may provide a way to naturally figure out the ordering & dependencies to shut these modules down, so I don't think we necessarily have to take any particular action on this in the short term.
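Hooking Close() into the shutdown sequence could look roughly like the sketch below; the cleanup-stack structure and names are hypothetical, not the daemon's actual code:

package sketch

// cleanups collects teardown functions in initialization order and runs
// them in reverse on shutdown, so consumers stop before the subsystems
// they depend on (e.g. the policy subsystem before the identity allocator).
type cleanups struct {
	fns []func()
}

func (c *cleanups) add(f func()) { c.fns = append(c.fns, f) }

func (c *cleanups) run() {
	for i := len(c.fns) - 1; i >= 0; i-- {
		c.fns[i]()
	}
}

During bootstrap one would c.add(allocator.Close); when the agent gets the signal to stop, c.run() tears everything down in a well-defined order.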
Heads up, I merged #19501.
Oh, and one more: this is also fixing #19360, right? As much as I am not fond of the idea of backporting this change because it shifts init logic around further, we should probably either revert that patch on v1.10 or also backport this patch.
Force-pushed from 1d9170c to 66c5970.
net-next failed on pods not being deleted in time and other unrelated reasons, restarting.
/test-1.23-net-next
gke-stable failed on "kube-dns was not able to get into ready state", restarting.
/test-gke |
InitK8sSubsystem() starts all k8s watchers concurrently, some of which do call into K8sCacheIsSynced() via ipcache/metadata.InjectLabels(), and possibly also from elsewhere. Initialize k8sCachesSynced before any watchers are started to make this access safe. This fixes a data race detected by race detection builds.

Fixes: cilium#19614
Fixes: cilium#19556
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
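The race described in that commit is an initialization-ordering problem. Below is a minimal, hypothetical sketch of the safe ordering; only the names k8sCachesSynced and K8sCacheIsSynced are taken from the commit message, the rest is illustrative:

package sketch

var k8sCachesSynced chan struct{}

// Create the channel before starting any watcher goroutine, so the write
// to the package-level variable happens-before every concurrent read.
func initK8sSubsystem(startWatchers ...func()) {
	k8sCachesSynced = make(chan struct{})
	for _, start := range startWatchers {
		go start()
	}
	// ... close(k8sCachesSynced) once all caches have synced ...
}

// k8sCacheIsSynced reports whether the caches have synced; watchers may
// call this concurrently while initialization continues.
func k8sCacheIsSynced() bool {
	select {
	case <-k8sCachesSynced:
		return true
	default:
		return false
	}
}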
Move local identity allocator initialization to
NewCachingIdentityAllocator() so that it is initialized when the
allocator is returned to the caller. Also make the events channel and
start the watcher in NewCachingIdentityAllocator(). Close() will no
longer GC the local identity allocator or stop the watcher. Now that the
locally allocated identities are persisted via the bpf ipcache map across
restarts, recycling them at runtime via Close() would be inappropriate.
This is then used in daemon bootstrap to restore locally allocated
identities before new policies can be received via Cilium API or k8s API.
This fixes the issue where CIDR policies were received from k8s before
locally allocated (CIDR) identities were restored, causing the identities
derived from the received policy to be newly allocated with different
numeric identity values, ultimately causing policy drops during Cilium
restart.
Fixes: #19360
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
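Purely as an illustration of the bootstrap ordering the commit message describes; the function names below are hypothetical, not the daemon's actual API:

package sketch

type identityAllocator struct{}

// Local allocator, events channel, and watcher are ready on return.
func newCachingIdentityAllocator() *identityAllocator { return &identityAllocator{} }

// Re-use the numeric identities persisted in the bpf ipcache map.
func (a *identityAllocator) restoreLocalIdentities() {}

// Needs node IP information for the kvstore prefix, so it runs later.
func (a *identityAllocator) initGlobalIdentityAllocator() {}

// Cilium API and k8s watchers that may deliver (CIDR) policies.
func startPolicySources() {}

// bootstrap restores locally allocated identities before any policy can
// arrive, so identities derived from received policy keep the numeric
// values they had before the restart.
func bootstrap() {
	a := newCachingIdentityAllocator() // local allocator usable immediately
	a.restoreLocalIdentities()         // must happen before policies can arrive
	startPolicySources()
	a.initGlobalIdentityAllocator() // exact position relative to policy sources is illustrative
}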