Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

endpoint: Fix goroutine leak when EP is deleted #13683

Conversation

christarazi
Copy link
Member

@christarazi christarazi commented Oct 21, 2020

The metadata resolver controller can be put in a scenario where it can
never succeed if the Cilium K8s pod cache is out-of-sync with the
current view of pods in the cluster. The cache can be out-of-sync if
there's heavy load on the cluster and especially on the apiserver. This
leaves the Cilium pod cache starved. When the pod cache is starved,
Cilium may never have an up-to-date view of the pods running in the
cluster, and therefore may never be able to resolve labels for pods
because the fetch will return "pod not found".

Eventually, kubelet (or maybe even the user) will give up on the pod and
remove it, thereby Cilium begins removing the endpoint. The controller
is stopped, but the goroutine waiting on the done channel will never
receive, hence the leak.

This commit fixes the goroutine leak when the controller never succeeds
and therefore, never closes the done channel. The fix is to add a
select statement to also watch the endpoint's aliveCtx which is
cancelled (and the context's Done channel is closed) when the endpoint
is deleted.

This commit was validated by forcing the metadata resolver to never find
a pod (manually hardcode the wrong pod name), and ensuring that the
gops stack output does not contain an entry like:

goroutine 1244259 [chan receive, 1434 minutes]:
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).RunMetadataResolver.func1(0xc0055d2a20, 0xc0049ff600, 0xc0055d2ae0, 0x5d)
	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:1531 +0x34
created by github.com/cilium/cilium/pkg/endpoint.(*Endpoint).RunMetadataResolver
	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:1530 +0x11e

Fixes: #13680

Fix bug where Cilium leaks a goroutine when an endpoint is deleted. This leak, if left running in a high pod churn environment, can cause Cilium to exceed its memory usage and get OOM killed.

@christarazi christarazi requested a review from a team as a code owner October 21, 2020 21:33
@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Oct 21, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Oct 21, 2020
@christarazi christarazi force-pushed the pr/christarazi/fix-goroutine-leak-metadata-resolver branch from 6f0e342 to f118e0d Compare October 21, 2020 21:33
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.13 Oct 21, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.0-rc3 Oct 21, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.11 Oct 21, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 21, 2020
@christarazi
Copy link
Member Author

test-me-please

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. :shipit:

pkg/endpoint/endpoint.go Show resolved Hide resolved
@christarazi
Copy link
Member Author

test-gke

The metadata resolver controller can be put in a scenario where it can
never succeed if the Cilium K8s pod cache is out-of-sync with the
current view of pods in the cluster. The cache can be out-of-sync if
there's heavy load on the cluster and especially on the apiserver. This
leaves the Cilium pod cache starved. When the pod cache is starved,
Cilium may never have an up-to-date view of the pods running in the
cluster, and therefore may never be able to resolve labels for pods
because the fetch will return "pod not found".

Eventually, kubelet (or maybe even the user) will give up on the pod and
remove it, thereby Cilium begins removing the endpoint. The controller
is stopped, but the goroutine waiting on the `done` channel will never
receive, hence the leak.

This commit fixes the goroutine leak when the controller never succeeds
and therefore, never closes the `done` channel. The fix is to add a
`select` statement to also watch the endpoint's `aliveCtx` which is
cancelled (and the context's `Done` channel is closed) when the endpoint
is deleted.

This commit was validated by forcing the metadata resolver to never find
a pod (manually hardcode the wrong pod name), and ensuring that the
`gops stack` output does not contain an entry like:

```
goroutine 1244259 [chan receive, 1434 minutes]:
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).RunMetadataResolver.func1(0xc0055d2a20, 0xc0049ff600, 0xc0055d2ae0, 0x5d)
	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:1531 +0x34
created by github.com/cilium/cilium/pkg/endpoint.(*Endpoint).RunMetadataResolver
	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:1530 +0x11e
```

Fixes: cilium#13680

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This controller's run interval is actually unnecessary as this
controller is intended to only run once as long as it succeeds. This is
evidenced by the fact that we have a `done` channel that's closed when
the controller succeeds and a goroutine waiting `done` to be closed to
remove the controller, thereby stopping it.

In addition, in the very unlikely case that the controller is scheduled
again to run _and_ the goroutine waiting for the `done` channel to be
closed has not run yet, the controller could panic because it would
close the channel twice. This commit prevents that as the interval
defaults to 10m if not provided. If the goroutine waiting on the channel
doesn't run within 10m, then we very likely have much worse problems at
hand.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/fix-goroutine-leak-metadata-resolver branch from f118e0d to e96f581 Compare October 22, 2020 00:29
@christarazi
Copy link
Member Author

test-me-please

@joestringer joestringer merged commit cb9aa3e into cilium:master Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 22, 2020
@christarazi christarazi deleted the pr/christarazi/fix-goroutine-leak-metadata-resolver branch October 22, 2020 05:28
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.11 Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.11 Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.11 Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.0-rc3 Oct 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.0-rc3 Oct 23, 2020
@joestringer joestringer moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.0-rc3 Oct 31, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 31, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.6 in 1.6.13 Nov 3, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.6 in 1.6.13 Nov 3, 2020
@tklauser
Copy link
Member

tklauser commented Nov 3, 2020

Not backporting this to 1.6 after out-of-band discussion with @christarazi. The build of backport PR #13870 currently fails. and 1.6 will become unsupported with the release of 1.9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.7.11
Backport done to v1.7
1.8.5
Backport done to v1.8
1.9.0-rc3
Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

Goroutine leak when waiting on endpoint's metadata resolver to terminate
6 participants