v1.6 backports 2020-06-30 #12341
Merged
Conversation
[ upstream commit 2ed8851 ] Include Envoy-supplied error detail in NACK warning logs to facilitate debugging. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit 477c487 ] The backport PRs shouldn't add the label requires-janitor-review since the review is automatically requested by the CODEOWNERS file. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit df114a6 ] `ginkgoext.Writer` is used as the output buffer behind `ginkgoext.GinkgoPrint`. `GinkgoPrint` can be called from multiple goroutines concurrently. For example, `kubectl.validateCilium` spawns `cilium{Status,Controllers,Health}PreFlightCheck` in three separate goroutines, all of which call `ginkgoext.By`, which calls `ginkgoext.GinkgoPrint`, which in turn calls `ginkgoext.Writer.Write`. This hopefully fixes a spurious panic observed in https://jenkins.cilium.io/job/Cilium-PR-K8s-newest-kernel-4.9/719, as the "slice bounds out of range" panic is a common failure when `bytes.Buffer.Write` is called unsynchronized from different goroutines.

```
09:55:43 STEP: Performing Cilium controllers preflight check
panic: runtime error: slice bounds out of range [532609:532604]

goroutine 871 [running]:
bytes.(*Buffer).Write(0x2f09ca0, 0xc0006f4000, 0x2e, 0x4000, 0x2e, 0x0, 0x0)
	/usr/local/go/src/bytes/buffer.go:174 +0x101
github.com/cilium/cilium/test/ginkgo-ext.(*Writer).Write(0xc000373b00, 0xc0006f4000, 0x2e, 0x4000, 0x2eca680, 0xc000210c58, 0x40be5b)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/writer.go:46 +0xb2
fmt.Fprintln(0x1f3aaa0, 0xc000373b00, 0xc000210ca8, 0x1, 0x1, 0xc000221b90, 0x2d, 0x2d)
	/usr/local/go/src/fmt/print.go:265 +0x8b
github.com/cilium/cilium/test/ginkgo-ext.GinkgoPrint(0xc000221b90, 0x2d, 0x0, 0x0, 0x0)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:178 +0xa7
github.com/cilium/cilium/test/ginkgo-ext.By(0x1c9e224, 0x1e, 0x0, 0x0, 0x0)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:170 +0x12f
github.com/cilium/cilium/test/helpers.(*Kubectl).ciliumHealthPreFlightCheck(0xc00099fa40, 0x1f392a0, 0xc000652120)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/helpers/kubectl.go:3496 +0x7a
github.com/cilium/cilium/test/helpers.(*Kubectl).validateCilium.func3(0xc0000645a0, 0xc000206470)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/helpers/kubectl.go:3386 +0x2e
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc00074cd80, 0xc000796bc0)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66

Ginkgo ran 1 suite in 3m21.079336902s
Test Suite Failed
```

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit 0ee0458 ] Running Cilium with K8sEventHandover enabled but without a KVStore configured prevented it from deleting any CNP after creation. Fixes: e4e83e8 ("daemon: Allow kvstore to be unconfigured") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit 462e08d ] In an IPv6-only cluster, the operator host was set to `[::1]` in the Helm chart. This is a problem because the value is bracketed again when concatenated with the port, forming an invalid address: `[[::1]]:9234`. In turn, this prevented the liveness probe from working:

```
Warning  Unhealthy  10m (x11 over 14m)  kubelet, kind-worker2  Liveness probe failed: Get http://[[::1]]:9234/healthz: dial tcp: address [[::1]]:9234: missing port in address
```

This commit fixes the issue by removing the extraneous brackets in the Helm chart of the Cilium operator. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit 8524fca ] A user reported an issue where Cilium would become unable to provision new endpoints, with the following messages recurring in the agent logs:

```
[T=0s] ... level=info msg="New endpoint"
[T=0s] ... level=info msg="Resolving identity labels (blocking)"
[T=43s]... level=error msg="Forcefully unlocking local kvstore lock"
[T=89s]... level=info msg="Removed endpoint"
```

Upon further investigation of "gops stack" output from a sysdump in the environment, the following stack trace was observed on Cilium 1.6.8:

```
goroutine 589 [select, 14 minutes]:
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00214d500, 0x8, 0xc000740d10)
	/go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:245 +0xcc
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
	/go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:256
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000740d10, 0x32ec340, 0xc001caf890, 0x0, 0xc001e52650, 0x0)
	/go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:850 +0x70a
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000740d10, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:715 +0x46
...
github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb.(*leaseClient).LeaseGrant(0xc00042a238, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0xc001fbaea0, 0x4
	/go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3792 +0xd3
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*retryLeaseClient).LeaseGrant(0xc000ec0510, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0x53bcba0, 0x3, 0x3, 0x1319
	/go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/retry.go:144 +0xeb
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*lessor).Grant(0xc0006a8640, 0x38e6ae0, 0xc00005cb40, 0x19, 0x0, 0xc000abd680, 0xc000abd708)
	/go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/lease.go:216 +0x98
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency.NewSession(0xc000c15860, 0xc001e52f50, 0x1, 0x1, 0x0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency/session.go:46 +0x308
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000b00790, 0xc0227a56a6, 0x14dffe3633def)
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:467 +0xde
github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func4(0x38e6ae0, 0xc00005d640, 0x53bd3e0, 0x2)
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:583 +0x2a
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000c06ff0)
	/go/src/github.com/cilium/cilium/pkg/controller/controller.go:203 +0x9e1
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController
	/go/src/github.com/cilium/cilium/pkg/controller/manager.go:112 +0xae0
```

In particular, it's useful to note:

* The goroutine has been blocking for 14 minutes attempting to make progress on re-establishing the etcd session.
* This path is hit by a controller responsible for handling etcd session renewal in the event of etcd connectivity issues. It would not be triggered by initial etcd connectivity failures, only when etcd connectivity was successfully established and then subsequently lost.
* NewSession() attempts to pick one of the configured etcd peers to connect to, and will block until connectivity to that peer is restored. It will not by itself back off and re-attempt connectivity to another node in the etcd cluster.
* This path is in the critical section for the write lock of `etcdClient.RWMutex`, so it will block out all etcd client reads.

The above is consistent with the agent logs: effectively, a new endpoint is scheduled and Cilium attempts to allocate an identity for it. This process requires the kvstore and will block on the lock. Given that it makes no progress, first we see the lock liveness check kick in to forcefully unlock the path lock (`Forcefully unlocking local kvstore lock`), and later we see cilium-cni / kubelet give up on creating the endpoint and remove it again.

This patch fixes the issue by introducing a timeout when attempting to renew the session. Each time `NewSession()` is called, the etcd client library picks an etcd peer to connect to. The intent behind the timeout is to give Cilium a way to detect that it is attempting to connect to an unhealthy peer and to reconnect to one of the other peers in the etcd cluster, in the hope that the other peer will be healthy. In the event of an etcd connectivity outage where one of three etcd peers is unhealthy, we expect that the remaining two can retain quorum and continue to operate despite the outage of the third peer. In this case, if Cilium was attempting to connect to the third (unhealthy) peer, Cilium would previously block indefinitely. With this patch, Cilium will time out after statusCheckTimeout (10s) and re-establish a session to the remaining etcd peers, thereby unblocking subsequent endpoint provisioning.

Signed-off-by: Joe Stringer <joe@cilium.io> Co-authored-by: André Martins <andre@cilium.io> Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit b99c7b8 ] When renewSession() attempts to re-establish a session with an etcd peer, it may block for some period of time, currently 10s. At the moment, this timeout is added to the LockPath() timeout because the LockPath() context timeout is created *after* acquiring the lock. Move the context creation before the lock acquisition to allow these timers to overlap to provide more accurate bounding on the timeouts here. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
maintainer-s-little-helper bot added labels backport/1.6 and kind/backports (This PR provides functionality previously merged into master.) on Jun 30, 2020.
joestringer approved these changes on Jun 30, 2020.
maintainer-s-little-helper bot added label ready-to-merge (This PR has passed all tests and received consensus from code owners to merge.) on Jun 30, 2020.
Skipped prs due to conflicts:
Once this PR is merged, you can update the PR labels via: