k8s: Fix data race between ServiceCache and K8sWatcher #25087
wait, are these objects from the informer cache? what is writing to them?
While the release note is correct in describing the impact of the change, it doesn't describe what behavior / impact it is improving for users. As of now, this release note sounds quite scary for any user, so either we identify what specific symptoms this bug introduces (if we know), or we downgrade the release note to
//var buf bytes.Buffer
//json.Indent(&buf, jsonBytes, "", "\t")
//fmt.Printf("JSON spec:\n%s\n", buf.String())
Why not just remove the code?
These 3 lines are in many many places. @jrajahalme Can we clean these up?
I'd guess these are meant for debugging, so let's just remove them and file an issue to clean up the rest as a followup.
Ended up just removing them from everywhere.
Given that #19521 introduced this bug, we'll need to backport to v1.12 and v1.13.
They're derived from them, but the cache objects are not mutated. What is mutating is the
Expanded the comment to mention the impact. Are we ok with this? Could also consider downgrading as the impact is minor.
The release note sounds better now. I don't think we've had any bug reports about backend preferred selection, but then again it's probably hard to pinpoint from a user's perspective that something went wrong there. I think we ship it, given that we've done our due diligence as far as we know.
Rebasing to run CI
2712e52 to b784ff8
/test
pkg/k8s/service_cache.go (outdated)
@@ -531,7 +531,7 @@ func (s *ServiceCache) correlateEndpoints(id ServiceID) (*Endpoints, bool) {

 	// Report the service as ready if a local endpoints object exists or if
 	// external endpoints have been identified
-	return endpoints, hasLocalEndpoints || hasExternalEndpoints
+	return endpoints.DeepCopy(), hasLocalEndpoints || hasExternalEndpoints
Discussed with @squeed that it'd be cleaner to DeepCopy the backend that's mutated rather than the whole thing here.
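The idea in this comment can be sketched roughly as follows: build a fresh result from the cached backends, share the untouched ones by pointer, and copy only the backend whose field has to change. This is a minimal sketch, not the code from the PR. `Backend` is a simplified stand-in for the Cilium type, and `correlate` and its `preferred` argument are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// Backend is a simplified, hypothetical stand-in for Cilium's Backend type;
// only the fields needed for this illustration are modeled.
type Backend struct {
	Preferred bool
	Ports     map[string]uint16
}

// DeepCopy returns a copy that shares no memory with b.
func (b *Backend) DeepCopy() *Backend {
	out := &Backend{Preferred: b.Preferred, Ports: make(map[string]uint16, len(b.Ports))}
	for name, port := range b.Ports {
		out.Ports[name] = port
	}
	return out
}

// correlate (hypothetical) builds a new result map from cached backends.
// Backends that need no change are shared by pointer. A backend whose
// Preferred flag must change is copied first, so the cached object that
// other goroutines may be reading is never written to.
func correlate(cached map[string]*Backend, preferred map[string]bool) map[string]*Backend {
	out := make(map[string]*Backend, len(cached))
	for ip, b := range cached {
		if want := preferred[ip]; b.Preferred != want {
			b = b.DeepCopy()
			b.Preferred = want
		}
		out[ip] = b
	}
	return out
}

func main() {
	cached := map[string]*Backend{
		"10.0.0.1": {Ports: map[string]uint16{"http": 80}},
		"10.0.0.2": {Ports: map[string]uint16{"http": 80}},
	}
	out := correlate(cached, map[string]bool{"10.0.0.1": true})

	fmt.Println(cached["10.0.0.1"].Preferred)             // cached object is untouched
	fmt.Println(out["10.0.0.1"].Preferred)                // the copy carries the change
	fmt.Println(out["10.0.0.2"] == cached["10.0.0.2"])    // unchanged backend is shared
}
```

The trade-off is fewer allocations than copying the whole Endpoints object, at the cost of the result still aliasing the cached backends that were not modified.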
Ended up flipping this. Now the backends are mutated in-place (with lock held), but all public methods in ServiceCache will always return a copy of the Backends.
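The ownership rule described here can be illustrated with a small sketch: the cache mutates its own objects in place while holding the lock, and every public accessor hands out a deep copy. The types below are simplified stand-ins, and `Cache`, `Upsert`, `SetPreferred`, and `Get` are hypothetical names chosen for this example, not the actual ServiceCache API.

```go
package main

import (
	"fmt"
	"sync"
)

// Backend and Endpoints are simplified, hypothetical stand-ins for the
// Cilium types of the same name.
type Backend struct {
	Preferred bool
	Ports     map[string]uint16
}

func (b *Backend) DeepCopy() *Backend {
	out := &Backend{Preferred: b.Preferred, Ports: make(map[string]uint16, len(b.Ports))}
	for name, port := range b.Ports {
		out.Ports[name] = port
	}
	return out
}

type Endpoints struct {
	Backends map[string]*Backend // keyed by backend IP
}

func (e *Endpoints) DeepCopy() *Endpoints {
	out := &Endpoints{Backends: make(map[string]*Backend, len(e.Backends))}
	for ip, b := range e.Backends {
		out.Backends[ip] = b.DeepCopy()
	}
	return out
}

// Cache (hypothetical) owns its Endpoints. Internal code may mutate them in
// place while holding mu; no pointer into cache state escapes to callers.
type Cache struct {
	mu        sync.RWMutex
	endpoints map[string]*Endpoints // keyed by service name
}

func NewCache() *Cache { return &Cache{endpoints: map[string]*Endpoints{}} }

// Upsert stores a private copy so the caller cannot alias cache state.
func (c *Cache) Upsert(svc string, eps *Endpoints) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.endpoints[svc] = eps.DeepCopy()
}

// SetPreferred mutates the cached backend in place, under the write lock.
func (c *Cache) SetPreferred(svc, ip string, preferred bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if eps, ok := c.endpoints[svc]; ok {
		if b, ok := eps.Backends[ip]; ok {
			b.Preferred = preferred
		}
	}
}

// Get returns a deep copy, so readers never observe later in-place writes.
func (c *Cache) Get(svc string) (*Endpoints, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	eps, ok := c.endpoints[svc]
	if !ok {
		return nil, false
	}
	return eps.DeepCopy(), true
}

func main() {
	c := NewCache()
	c.Upsert("svc", &Endpoints{Backends: map[string]*Backend{
		"10.0.0.1": {Ports: map[string]uint16{"http": 80}},
	}})

	snapshot, _ := c.Get("svc")
	c.SetPreferred("svc", "10.0.0.1", true) // does not touch snapshot
	latest, _ := c.Get("svc")

	fmt.Println(snapshot.Backends["10.0.0.1"].Preferred, latest.Backends["10.0.0.1"].Preferred)
}
```

In this sketch a reader that holds `snapshot` can use it without any lock, because nothing else can reach that memory. The cost is one allocation per backend on every read, which is the memory concern a design like this has to weigh.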
b784ff8 to 1905319
/test

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/1934/
If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/2028/
If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue.
Remove leftover debugging code in envoy-related tests. Signed-off-by: Jussi Maki <jussi@isovalent.com>
correlateEndpoints was modifying a Backend object that was read by K8sWatcher:

    WARNING: DATA RACE
    Read at 0x00c000942278 by goroutine 71:
      github.com/cilium/cilium/pkg/k8s/watchers.genCartesianProduct()
          /home/chris/code/cilium/cilium/pkg/k8s/watchers/watcher.go:787 +0xb56
      github.com/cilium/cilium/pkg/k8s/watchers.datapathSVCs()
          /home/chris/code/cilium/cilium/pkg/k8s/watchers/watcher.go:852 +0x77e
      github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).addK8sSVCs()
      ...

    Previous write at 0x00c000942278 by goroutine 70:
      github.com/cilium/cilium/pkg/k8s.(*ServiceCache).correlateEndpoints()
          /home/chris/code/cilium/cilium/pkg/k8s/service_cache.go:501 +0x409
      github.com/cilium/cilium/pkg/k8s.(*ServiceCache).UpdateService()
          /home/chris/code/cilium/cilium/pkg/k8s/service_cache.go:214 +0x3e4
      github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcherSuite).TestChangeSVCPort()
          /home/chris/code/cilium/cilium/pkg/k8s/watchers/watcher_test.go:685 +0x1d09
      ...

Since the backends are owned by the ServiceCache, and being allowed to mutate them while holding a lock is something one might expect, fix the issue by making all public ServiceCache methods return a DeepCopy'd *Endpoint and *Backends.

Fixes: cilium#25071
Fixes: aa3c85f ("k8s: Add preferred attribute in Endpoints")
Signed-off-by: Jussi Maki <jussi@isovalent.com>
6f1e356 to 4cf827b
/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/2090/
If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue.
/test-1.26-net-next
I am worried about memory usage, but this all looks good.
Looks like Chart CI Push hit #25274, merging.