Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: Initialize k8sCachesSynced channel before calling Initk8sSubsystem() #19626

Merged
merged 1 commit into from
Apr 29, 2022

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Apr 28, 2022

InitK8sSubsystem() starts all k8s watchers concurrently, some of which do
call into K8sCacheIsSynced() via ipcache/metadata.InjectLabels(), and
possibly also from elsewhere. Initialize k8sCachesSynced before any
watchers are started to make this access safe. This fixes data race
detected by race detection builds.

Fixes: #19614
Fixes: #19556
Signed-off-by: Jarno Rajahalme jarno@isovalent.com

@jrajahalme jrajahalme added release-note/misc This PR makes changes that have no direct user impact. kind/bug/race-detector labels Apr 28, 2022
@jrajahalme jrajahalme requested review from a team and nathanjsweet April 28, 2022 16:08
@jrajahalme
Copy link
Member Author

While backporting this does not seem necessary at this point, it is possible that later backports will introduce the fixed data race to release branches. Adding backport labels to avoid introducing the fixed data race in any future backports.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.11 Apr 28, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.5 Apr 28, 2022
@jrajahalme
Copy link
Member Author

/test

@jrajahalme
Copy link
Member Author

/test-race

@joestringer
Copy link
Member

Thanks! I think that this is most important for v1.11 since it's mostly the ipcache that depends on this, but the watcher -> ipcache logic was only introduced in that version. I'm not sure if we have data race pipelines for the older releases to confirm though 🤔

Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a regression? Please add the label, and Fixes: tag pointing to the offending commit.

@joestringer
Copy link
Member

joestringer commented Apr 28, 2022

I think it's a little bit hard to say; this code was always like this (in terms of starting to run the k8s watchers before initializing the "k8s synced" channel). Until v1.11, Cilium didn't have any logic that caused watchers to trigger logic that waits for k8s to be synced. This was introduced as part of commit 2b17d4d ("ipcache, policy: Inject labels from identity metadata"). However, at that time, the ipcache also waited for the local identity allocator to be initialized, and I believe that some other logic was ensuring that the k8s synced channel was initialized before the local identity allocator was initialized. This meant that the race condition did not occur. Only recently in commit 2e5f35b ("identity: Initialize local identity allocator early") this additional constraint was dropped, revealing the underlying initialization race. So I don't think we have any releases with this regression in it, it's only on master, but PR #19556 introduced the race condition and it will be backported. Hence we should also backport this PR to match.

@joestringer
Copy link
Member

k8s-1.23-kernel-net-next-race hit a provisioning error, re-running.

@joestringer
Copy link
Member

joestringer commented Apr 28, 2022

/test-net-next-race

Job 'Cilium-PR-K8s-GKE' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With native routing and endpoint routes

Failure Output

FAIL: Failed to reach 10.128.0.19:80 from testclient-mrvw2

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@jrajahalme
Copy link
Member Author

@aditighag Figured out with Joe that #19556 revealed this data race, so added Fixes 19556.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.16 Apr 28, 2022
@joestringer joestringer added the release-blocker/1.11 This issue will prevent the release of the next version of Cilium. label Apr 28, 2022
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

InitK8sSubsystem() starts all k8s watchers concurrently, some of which do
call into K8sCacheIsSynced() via ipcache/metadata.InjectLabels(), and
possibly also from elsewhere. Initialize k8sCachesSynced before any
watchers are started to make this access safe. This fixes data race
detected by race detection builds.

Fixes: cilium#19614
Fixes: cilium#19556
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme
Copy link
Member Author

Updated commit message title to be shorter to keep bpf checks happy.

@jrajahalme
Copy link
Member Author

/test-race

@jrajahalme
Copy link
Member Author

test-1.21-5.4 hit known flake #16928

@jrajahalme
Copy link
Member Author

/test-1.21

@jrajahalme
Copy link
Member Author

/test-1.23-net-next

@jrajahalme
Copy link
Member Author

/test-gke

@jrajahalme
Copy link
Member Author

/test-1.21-5.4

@jrajahalme
Copy link
Member Author

Some unrelated flakes, all race tests passed.

@joestringer joestringer merged commit 916765b into cilium:master Apr 29, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.5 Apr 29, 2022
@jrajahalme
Copy link
Member Author

@joestringer Thanks for the merge, and feel free to drop the backport labels for 1.10 and 1.9 at will. Those backports would only be defensive against possible later backports hitting this issue.

@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.9.16 Apr 29, 2022
@joestringer
Copy link
Member

v1.9 will become EOL soon when we release v1.12, so I'll drop there. We can do best-effort for v1.10 backport, I think the risk is low but it could end up helping us avoid digging through this kind of thing in future. If it turns out that the v1.10 backport doesn't cleanly work, then we can drop that one too.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.11 May 2, 2022
@aditighag
Copy link
Member

v1.9 will become EOL soon when we release v1.12, so I'll drop there. We can do best-effort for v1.10 backport, I think the risk is low but it could end up helping us avoid digging through this kind of thing in future. If it turns out that the v1.10 backport doesn't cleanly work, then we can drop that one too.

@jrajahalme @joestringer I tried resolving the conflicts, but there are other changes in this code path that differ from upstream/master so I dropped the backport on v1.10. Resetting the labels accordingly.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Needs backport from master in 1.10.11 May 2, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Needs backport from master in 1.10.11 May 2, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.10.11 May 2, 2022
@joestringer
Copy link
Member

@aditighag OK, fair enough. I'm not too worried about this for v1.10. I'll just drop the backport label.

@joestringer joestringer added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels May 4, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.11 in 1.11.5 May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. kind/bug/race-detector release-blocker/1.11 This issue will prevent the release of the next version of Cilium. release-note/misc This PR makes changes that have no direct user impact.
Projects
No open projects
1.11.5
Backport done to v1.11
Development

Successfully merging this pull request may close these issues.

Race between K8s initialization and K8sCacheIsSynced()
4 participants