daemon: Initialize k8sCachesSynced channel before calling Initk8sSubsystem() #19626
Conversation
While a backport does not seem necessary at this point, it is possible that later backports will reintroduce the fixed data race on release branches. Adding backport labels to avoid that in any future backports.
/test
/test-race
Thanks! I think that this is most important for v1.11 since it's mostly the ipcache that depends on this, but the watcher -> ipcache logic was only introduced in that version. I'm not sure if we have data race pipelines for the older releases to confirm though 🤔
This looks like a regression? Please add the label, and a Fixes: tag pointing to the offending commit.
I think it's a little hard to say; this code was always like this (in terms of starting to run the k8s watchers before initializing the "k8s synced" channel). Until v1.11, Cilium didn't have any logic that caused watchers to trigger logic that waits for k8s to be synced. This was introduced as part of commit 2b17d4d ("ipcache, policy: Inject labels from identity metadata"). However, at that time, the ipcache also waited for the local identity allocator to be initialized, and I believe some other logic was ensuring that the k8s synced channel was initialized before the local identity allocator was, so the race condition did not occur. Only recently, in commit 2e5f35b ("identity: Initialize local identity allocator early"), was this additional constraint dropped, revealing the underlying initialization race. So I don't think we have any releases with this regression in them; it's only on master. But PR #19556 introduced the race condition and it will be backported, so we should also backport this PR to match.
k8s-1.23-kernel-net-next-race hit a provisioning error, re-running.
/test-net-next-race
Job 'Cilium-PR-K8s-GKE' failed.
If it is a flake and a GitHub issue doesn't already exist to track it, comment
@aditighag Figured out with Joe that #19556 revealed this data race, so added Fixes: #19556.
LGTM
Force-pushed from 26ead48 to e217a26.
Updated commit message title to be shorter to keep bpf checks happy.
/test-race
test-1.21-5.4 hit known flake #16928
/test-1.21
/test-1.23-net-next
/test-gke
/test-1.21-5.4
Some unrelated flakes, all race tests passed.
@joestringer Thanks for the merge, and feel free to drop the backport labels for 1.10 and 1.9 at will. Those backports would only be defensive against possible later backports hitting this issue.
v1.9 will become EOL soon when we release v1.12, so I'll drop it there. We can do best-effort for the v1.10 backport; I think the risk is low, but it could end up helping us avoid digging through this kind of thing in the future. If it turns out that the v1.10 backport doesn't work cleanly, then we can drop that one too.
@jrajahalme @joestringer I tried resolving the conflicts, but there are other changes in this code path that differ from upstream/master, so I dropped the backport on v1.10. Resetting the labels accordingly.
@aditighag OK, fair enough. I'm not too worried about this for v1.10. I'll just drop the backport label.
InitK8sSubsystem() starts all k8s watchers concurrently, some of which do
call into K8sCacheIsSynced() via ipcache/metadata.InjectLabels(), and
possibly also from elsewhere. Initialize k8sCachesSynced before any
watchers are started to make this access safe. This fixes data race
detected by race detection builds.
Fixes: #19614
Fixes: #19556
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>