identity: Initialize local identity allocator early #19556

Merged

Conversation

@jrajahalme (Member) commented Apr 25, 2022

Move local identity allocator initialization into
NewCachingIdentityAllocator() so that the local allocator is ready by the
time the allocator is returned to the caller. Also create the events
channel and start the watcher in NewCachingIdentityAllocator(). Close()
no longer GCs the local identity allocator or stops the watcher: now that
locally allocated identities are persisted across restarts via the bpf
ipcache map, recycling them at runtime via Close() would be inappropriate.
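
A minimal, self-contained Go sketch of this lifecycle; the type shape, field names, and channel details below are assumptions for illustration, not Cilium's actual code.

package sketch

// identityEvent stands in for the allocator's identity change events.
type identityEvent struct{ numericID uint32 }

// CachingIdentityAllocator: assumed, simplified shape.
type CachingIdentityAllocator struct {
	events          chan identityEvent
	localIdentities map[string]uint32 // local (e.g. CIDR) identity cache
}

// NewCachingIdentityAllocator does the local setup eagerly: the local
// identity cache, the events channel, and the watcher all exist by the
// time the allocator is returned to the caller.
func NewCachingIdentityAllocator() *CachingIdentityAllocator {
	m := &CachingIdentityAllocator{
		events:          make(chan identityEvent, 1024),
		localIdentities: make(map[string]uint32),
	}
	go m.watch() // watcher runs for the agent's lifetime
	return m
}

func (m *CachingIdentityAllocator) watch() {
	for range m.events {
		// forward identity additions/deletions to the policy subsystem
	}
}

// Close tears down only global allocator state. The local cache and the
// watcher are deliberately left running: locally allocated identities are
// persisted across restarts via the bpf ipcache map, so recycling them
// here would be inappropriate.
func (m *CachingIdentityAllocator) Close() {
	// global allocator teardown only; no local GC, watcher keeps running
}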

This is then used in the daemon bootstrap to restore locally allocated
identities before new policies can be received via the Cilium API or the
k8s API.

This fixes an issue where CIDR policies were received from k8s before the
locally allocated (CIDR) identities were restored, causing the identities
derived from the received policy to be newly allocated with different
numeric values, ultimately leading to policy drops during Cilium restart.
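
Continuing the sketch above, the bootstrap ordering this enables might look like the following; the helper names here are invented for illustration.

// Hypothetical daemon bootstrap ordering enabled by this change.
func bootstrapSketch() {
	alloc := NewCachingIdentityAllocator() // local allocator usable immediately
	restoreLocalIdentities(alloc)          // reuse numeric IDs persisted in the bpf ipcache map
	startPolicyWatchers(alloc)             // only now accept policies via the Cilium API / k8s API
}

// Stubs standing in for the real restore and policy watcher logic.
func restoreLocalIdentities(a *CachingIdentityAllocator) {}
func startPolicyWatchers(a *CachingIdentityAllocator)    {}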

Fixes: #19360
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>

@jrajahalme jrajahalme added the sig/policy, release-note/misc, and sig/agent labels Apr 25, 2022
@jrajahalme jrajahalme requested review from a team as code owners April 25, 2022 13:10
@jrajahalme jrajahalme requested review from a team, nebril, jrfastab and joestringer April 25, 2022 13:10
@jrajahalme jrajahalme force-pushed the restore-cidr-identities-earlier branch from 97042ec to 99e5623 April 25, 2022 13:11
@jrajahalme jrajahalme marked this pull request as draft April 25, 2022 13:12
@jrajahalme (Member Author)

/test

@jrajahalme jrajahalme force-pushed the restore-cidr-identities-earlier branch from 99e5623 to 078017f April 25, 2022 14:50
@jrajahalme (Member Author)

Resolved Go lint issue.

@jrajahalme (Member Author)

Verified locally this works as intended, opening for reviews.

@jrajahalme jrajahalme marked this pull request as ready for review April 25, 2022 14:51
@jrajahalme (Member Author) commented Apr 25, 2022

/test

Job 'Cilium-PR-K8s-GKE' failed:

Test Name: K8sDatapathConfig Host firewall With VXLAN
Failure Output: FAIL: Failed to reach 10.128.0.16:80 from testclient-jq8jz

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@jrajahalme (Member Author)

Investigating a runtime test failure that seems relevant.

@jrajahalme jrajahalme force-pushed the restore-cidr-identities-earlier branch 2 times, most recently from 689b12e to e5c9337 April 25, 2022 19:49
@jrajahalme jrajahalme requested a review from a team April 25, 2022 19:49
@joestringer (Member) left a comment

Looks good, just a couple of minor nits.

Comment on lines -166 to -167
case <-w.stopChan:
return
Member

Why should the identityWatcher continue to emit events into the policy subsystem if the agent is shutting down and the CachingIdentityAllocator is Close()d?

Member

Hmm, it doesn't look like that Close() function was called from the daemon yet anyway 🤔 This may have more to do with the tests than the runtime operation of the agent.

Member

Still, it would be nice to include the rationale for this aspect, since it seems like this moves to a pattern of:

foo := NewCachingIdentityAllocator(...) // set up the local identity allocator
foo.InitIdentityAllocator(...) // set up the global identity allocator, start handling those events
foo.Close() // Close the identity allocator but keep handling events from the global identity source

Member

Ehh, hmm, not sure I've quite got the logic right here. The core question I'm thinking about is how we ensure that the identityWatcher gets properly stopped. I'm going to put this down for now. I'm not sure of the right answer here; maybe there's some way to stop the identityWatcher by closing its events channel from the sender through some other codepath.
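
One concrete shape of that idea, sketched with assumed names rather than the actual Cilium code: if the watcher ranges over its events channel, close(events) on the sender side becomes the stop signal and no separate stopChan is needed.

type event struct{}

type identityWatcher struct{}

func (w *identityWatcher) handle(ev event) {} // stub: notify the policy subsystem

// watch drains events until the sender closes the channel; that close is
// the stop signal, so no separate stopChan is required.
func (w *identityWatcher) watch(events <-chan event) {
	go func() {
		for ev := range events {
			w.handle(ev)
		}
	}()
}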

Member Author

Yes, Close() is currently called only from daemon test code, but it nonetheless deletes the global identity allocator, so events from there should stop coming. So this change essentially makes the local identity allocator never go away, which is in line with making locally allocated identities more persistent.

We could remove Close() altogether, I suppose, to make this less confusing. The way I see it, when the local identity allocator was added, the code was patterned after the global one. For local identity restoration we need the local allocator initialized earlier than the global one can be (the global allocator needs node IP info for its kvstore prefix). I figured making the initialization of the local allocator more "static" is the simplest approach, as it does not have the same constraints as the global allocator.

Member

I feel like the right answer is to properly hook Close() into the shutdown sequence; that way we can also start to address the problem where warnings/errors show up during shutdown because some subsystems expect other subsystems to be in a certain state after the agent gets the signal to stop. That said, Jussi is working on improving the structure around agent initialization, and that may provide a way to naturally figure out the ordering and dependencies to shut these modules down, so I don't think we necessarily have to take any particular action on this in the short term.

pkg/identity/cache/local.go (outdated, resolved)
@joestringer (Member)

Heads up, I merged #19501.

@joestringer (Member)

Oh, and one more: this is also fixing #19360, right? As much as I am not fond of the idea of backporting this change, because it shifts around init logic further, we should probably either revert that patch on v1.10 or also backport this patch.

@jrajahalme jrajahalme force-pushed the restore-cidr-identities-earlier branch 2 times, most recently from 1d9170c to 66c5970 April 25, 2022 21:40
@jrajahalme jrajahalme requested a review from a team as a code owner April 25, 2022 21:40
@jrajahalme (Member Author)

net-next failed on pods not being deleted in time and other unrelated reasons; restarting.

@jrajahalme (Member Author)

/test-1.23-net-next

@jrajahalme (Member Author)

gke-stable failed on "kube-dns was not able to get into ready state"; restarting.

@jrajahalme (Member Author)

/test-gke

@jrajahalme jrajahalme added the kind/bug and ready-to-merge labels Apr 27, 2022
@joestringer joestringer merged commit 2e5f35b into cilium:master Apr 27, 2022
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Apr 29, 2022
InitK8sSubsystem() starts all k8s watchers concurrently, some of which
call into K8sCacheIsSynced() via ipcache/metadata.InjectLabels(), and
possibly from elsewhere as well. Initialize k8sCachesSynced before any
watchers are started to make this access safe. This fixes a data race
detected by race-detection builds.

Fixes: cilium#19614
Fixes: cilium#19556
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
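
A minimal Go illustration of the fix described in the commit message above; the names mirror the commit message, but the code shape is an assumption, not the actual Cilium implementation.

// The sync channel must be created before any goroutine that may read it
// is started; otherwise the write to k8sCachesSynced races with the reads.
type k8sWatcher struct {
	k8sCachesSynced chan struct{}
}

func (w *k8sWatcher) InitK8sSubsystem() {
	w.k8sCachesSynced = make(chan struct{}) // initialize first...
	for i := 0; i < 3; i++ {
		go w.runWatcher() // ...then start the watchers that read it
	}
}

func (w *k8sWatcher) runWatcher() {
	_ = w.K8sCacheIsSynced() // a watcher may check sync state at any time
}

// K8sCacheIsSynced reports whether the caches-synced channel has been closed.
func (w *k8sWatcher) K8sCacheIsSynced() bool {
	select {
	case <-w.k8sCachesSynced:
		return true
	default:
		return false
	}
}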
joestringer pushed a commit that referenced this pull request Apr 29, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.11 Apr 29, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge label Apr 29, 2022
@joestringer joestringer moved this from Backport pending to v1.10 to Backport pending to v1.11 in 1.11.5 Apr 29, 2022
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request May 2, 2022
[ upstream commit 916765b ]

joestringer pushed a commit that referenced this pull request May 4, 2022
[ upstream commit 916765b ]

@joestringer joestringer added the backport-done/1.11 label and removed the backport-pending/1.11 label May 4, 2022
@aanm aanm moved this from Backport pending to v1.11 to Backport done to v1.11 in 1.11.5 May 10, 2022
Labels
backport-done/1.11: The backport for Cilium 1.11.x for this PR is done.
kind/bug: This is a bug in the Cilium logic.
release-note/misc: This PR makes changes that have no direct user impact.
sig/agent: Cilium agent related.
sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
No open projects
1.10.11: Backport done to v1.10
1.11.5: Backport done to v1.11
Development

Successfully merging this pull request may close these issues: None yet

3 participants