daemon: Init endpoint queue during validation #13608

Merged

christarazi
Member

See commit msgs.

Fixes: #13398

Fix bug where events cannot be enqueued during endpoint restoration

@christarazi christarazi requested a review from a team as a code owner October 16, 2020 19:15
@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. kind/bug This is a bug in the Cilium logic. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium. area/endpoint labels Oct 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Oct 16, 2020
@christarazi
Member Author

test-me-please

Member

@pchaigno pchaigno left a comment

LGTM!

pkg/endpoint/manager.go (outdated)
@pchaigno
Member

These error messages also happen in v1.8 CI.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 18, 2020
This is useful during endpoint validation when endpoints are being
restored. When they are being restored, their event queue is not yet
initialized because they haven't been exposed to the endpoint manager.
It is important to initialize an endpoint's event queue so that events
are not missed during their restoration.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This commit fixes the following errors:

```
level=error msg="Unable to enqueue endpoint policy visibility event"
containerID=9f680a5847 datapathPolicyRevision=0 desiredPolicyRevision=0
endpointID=3479 error="unable to Enqueue event" identity=8771
ipv4=10.116.2.10 ipv6=
k8sPodName=cilium-monitoring/grafana-6d49bd9ff7-s8zsd subsys=endpoint
```

These errors occurred because, during endpoint validation (when the
endpoint is being restored), its event queue has not been initialized
yet. Once the endpoint is eventually exposed to the endpoint manager
(after restoration), it will begin processing the events off the queue.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
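
The split described in the two commit messages above (initialize the queue early, start processing only once the endpoint is exposed) can be illustrated with a minimal sketch. This is not the Cilium implementation: the `endpoint` and `queue` types and the `initEventQueue`, `enqueue`, and `expose` names are hypothetical stand-ins chosen for illustration, and `expose` here is a one-shot drain rather than a long-running consumer.

```go
package main

import (
	"errors"
	"fmt"
)

// queue is a hypothetical stand-in for an endpoint's event queue.
// It is NOT the Cilium eventqueue API.
type queue struct {
	events chan string
}

// endpoint is a hypothetical, stripped-down endpoint.
type endpoint struct {
	q *queue // nil until initEventQueue is called
}

// initEventQueue allocates the queue so that events can be buffered.
// It deliberately does not start a consumer.
func (e *endpoint) initEventQueue(size int) {
	e.q = &queue{events: make(chan string, size)}
}

// enqueue buffers an event, or fails if the queue is missing or full.
func (e *endpoint) enqueue(ev string) error {
	if e.q == nil {
		return errors.New("unable to enqueue event: queue not initialized")
	}
	select {
	case e.q.events <- ev:
		return nil
	default:
		return errors.New("unable to enqueue event: queue full")
	}
}

// expose stands in for the point where processing begins. In this sketch it
// closes the queue and returns everything that was buffered beforehand.
func (e *endpoint) expose() []string {
	close(e.q.events)
	var out []string
	for ev := range e.q.events {
		out = append(out, ev)
	}
	return out
}

func main() {
	ep := &endpoint{}

	// Before initialization the event is rejected, as in the error above.
	fmt.Println(ep.enqueue("policy-visibility"))

	// After initialization the event is buffered until the endpoint is exposed.
	ep.initEventQueue(4)
	fmt.Println(ep.enqueue("policy-visibility"))
	fmt.Println(ep.expose())
}
```

In this model, an event raised during validation is rejected only when the queue has not been allocated yet; once it is allocated, the event waits in the buffer until processing starts.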
@christarazi
Member Author

christarazi commented Oct 18, 2020

CI passed and the code owners approved. Pushing to resolve the typo referenced above:

```
commit 6435f18a44a6354e1fec57dccf44a5333def349e
Author: Chris Tarazi <chris@isovalent.com>
Date:   Sun Oct 18 14:32:14 2020 -0700

    fixup! endpoint: Add function to initialize event queue
    
    Signed-off-by: Chris Tarazi <chris@isovalent.com>

diff --git a/pkg/endpoint/manager.go b/pkg/endpoint/manager.go
index c241d4300..8472f0459 100644
--- a/pkg/endpoint/manager.go
+++ b/pkg/endpoint/manager.go
@@ -172,9 +172,9 @@ func (e *Endpoint) Unexpose(mgr endpointManager) <-chan struct{} {
 
 // InitEventQueue initializes the endpoint's event queue. Note that this
 // function does not begin processing events off the queue, as that's left up
-// to the caller when should call Expose in order to allow other subsystems to
-// access the endpoint. This function assumes that the endpoint ID has already
-// been allocated!
+// to the caller to call Expose in order to allow other subsystems to access
+// the endpoint. This function assumes that the endpoint ID has already been
+// allocated!
 //
 // Having this be a separate function allows us to prepare
 // the event queue while the endpoint is being validated (during restoration)
```

@christarazi christarazi force-pushed the pr/christarazi/fix-unable-to-enqueue branch from 4afdfa4 to 5cbdc91 Compare October 18, 2020 21:35
@tklauser tklauser merged commit 290d9e9 into cilium:master Oct 19, 2020
@christarazi christarazi deleted the pr/christarazi/fix-unable-to-enqueue branch October 19, 2020 15:07
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 19, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 20, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport done to v1.8 to Backport pending to v1.8 in 1.8.5 Oct 20, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Needs backport from master in 1.8.5 Oct 20, 2020
@rolinh
Member

rolinh commented Oct 20, 2020

@christarazi Backporting this PR to v1.8 leads to a deadlock situation when some endpoints are being deleted on restore (thanks @aanm for digging and finding out!):

```
goroutine 682 [chan receive, 3 minutes]:
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).WaitToBeDrained(...)
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:312
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Delete(0xc000aa2c00, 0x27fab80, 0xc0007be000, 0x27fb700, 0xc000acc6c0, 0x2876de0, 0xc0000aea80, 0x430101, 0x0, 0x0, ...)
	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:2194 +0x10f9
github.com/cilium/cilium/daemon/cmd.(*Daemon).deleteEndpointQuiet(...)
	/go/src/github.com/cilium/cilium/daemon/cmd/endpoint.go:674
github.com/cilium/cilium/daemon/cmd.(*Daemon).regenerateRestoredEndpoints.func2(0xc0007be000, 0xc001027f94, 0xc000aa2c00)
	/go/src/github.com/cilium/cilium/daemon/cmd/state.go:302 +0x7c
created by github.com/cilium/cilium/daemon/cmd.(*Daemon).regenerateRestoredEndpoints
	/go/src/github.com/cilium/cilium/daemon/cmd/state.go:296 +0x8a0
```

A 1.8 specific version of this PR seems to be required.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Needs backport from master in 1.8.5 Oct 22, 2020
@christarazi
Member Author

Removing needs-backport/1.8 for now until the investigation of the potential regression is complete.

Development

Successfully merging this pull request may close these issues.

Unable to enqueue endpoint policy {policy|bandwidth} event
6 participants