v1.6 backports 2020-06-30 #12341

Merged 7 commits into v1.6 from pr/1.6-backports-30-06.2020 on Jun 30, 2020
Conversation

nebril (Member) commented Jun 30, 2020

Skipped PRs due to conflicts:

Once this PR is merged, you can update the PR labels via:

$ for pr in 12016 11986 12025 12146 12223 12292 ; do contrib/backporting/set-labels.py $pr done 1.6; done

jrajahalme and others added 7 commits June 30, 2020 18:44
[ upstream commit 2ed8851 ]

Include Envoy-supplied error detail in NACK warning logs to facilitate debugging.

Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit 477c487 ]

The backport PRs shouldn't add the label requires-janitor-review since
the review is automatically requested by the CODEOWNERS file.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit df114a6 ]

`ginkgoext.Writer` is used as the output buffer behind
`ginkgoext.GinkgoPrint`. `GinkgoPrint` can be called from multiple
goroutines concurrently. For example, `kubectl.validateCilium`
spawns `cilium{Status,Controllers,Health}PreFlightCheck` in three
separate goroutines, all of which call `ginkgoext.By`, which calls
`ginkgoext.GinkgoPrint`, which calls `ginkgoext.Writer.Write`.

This hopefully fixes a spurious panic which was observed in
https://jenkins.cilium.io/job/Cilium-PR-K8s-newest-kernel-4.9/719,
as the "slice bounds out of range" panic seems to be a common
failure when `bytes.Buffer.Write` is called unsynchronized from
different goroutines.

```
09:55:43 STEP: Performing Cilium controllers preflight check
panic: runtime error: slice bounds out of range [532609:532604]

goroutine 871 [running]:
bytes.(*Buffer).Write(0x2f09ca0, 0xc0006f4000, 0x2e, 0x4000, 0x2e, 0x0, 0x0)
	/usr/local/go/src/bytes/buffer.go:174 +0x101
github.com/cilium/cilium/test/ginkgo-ext.(*Writer).Write(0xc000373b00, 0xc0006f4000, 0x2e, 0x4000, 0x2eca680, 0xc000210c58, 0x40be5b)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/writer.go:46 +0xb2
fmt.Fprintln(0x1f3aaa0, 0xc000373b00, 0xc000210ca8, 0x1, 0x1, 0xc000221b90, 0x2d, 0x2d)
	/usr/local/go/src/fmt/print.go:265 +0x8b
github.com/cilium/cilium/test/ginkgo-ext.GinkgoPrint(0xc000221b90, 0x2d, 0x0, 0x0, 0x0)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:178 +0xa7
github.com/cilium/cilium/test/ginkgo-ext.By(0x1c9e224, 0x1e, 0x0, 0x0, 0x0)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:170 +0x12f
github.com/cilium/cilium/test/helpers.(*Kubectl).ciliumHealthPreFlightCheck(0xc00099fa40, 0x1f392a0, 0xc000652120)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/helpers/kubectl.go:3496 +0x7a
github.com/cilium/cilium/test/helpers.(*Kubectl).validateCilium.func3(0xc0000645a0, 0xc000206470)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/helpers/kubectl.go:3386 +0x2e
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc00074cd80, 0xc000796bc0)
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
	/home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66

Ginkgo ran 1 suite in 3m21.079336902s
Test Suite Failed
```
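
One straightforward way to make such a writer safe for concurrent use is to serialize access to the underlying buffer with a mutex. The sketch below is illustrative only, with assumed type and field names; it is not the actual `ginkgoext.Writer` implementation.

```
package ginkgoext

import (
	"bytes"
	"sync"
)

// Writer sketches a concurrency-safe wrapper around bytes.Buffer: the mutex
// serializes Write calls so that concurrent writers from different
// goroutines cannot corrupt the buffer's internal slice, which is the
// failure mode behind the "slice bounds out of range" panic above.
type Writer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

// Write implements io.Writer while holding the lock.
func (w *Writer) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.buf.Write(p)
}
```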

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
…tore

[ upstream commit 0ee0458 ]

Running Cilium with K8sEventHandover and without a KVStore configured
causes it to be unable to delete any CNP after it has been created.

Fixes: e4e83e8 ("daemon: Allow kvstore to be unconfigured")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit 462e08d ]

In an IPv6-only cluster, the operator host was set to `[::1]` in the
Helm chart. This is a problem because the address is bracketed again when
concatenated with the port, forming an invalid address: `[[::1]]:9234`.
In turn, this prevented the liveness probe from working:

    Warning  Unhealthy         10m (x11 over 14m)   kubelet, kind-worker2  Liveness probe failed: Get http://[[::1]]:9234/healthz: dial tcp: address [[::1]]:9234: missing port in address

This commit fixes the issue by removing the extraneous brackets from the
Cilium operator's Helm chart.
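
As an aside, the same class of bug is avoided in Go code by letting `net.JoinHostPort` add the brackets; the standalone snippet below only illustrates the formatting pitfall and is not part of the Helm chart fix itself.

```
package main

import (
	"fmt"
	"net"
)

func main() {
	// JoinHostPort brackets IPv6 literals exactly once, so passing the bare
	// address "::1" yields "[::1]:9234" rather than the invalid "[[::1]]:9234".
	fmt.Println(net.JoinHostPort("::1", "9234"))       // [::1]:9234
	fmt.Println(net.JoinHostPort("127.0.0.1", "9234")) // 127.0.0.1:9234
}
```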

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit 8524fca ]

A user reported an issue where Cilium would become unable to provision
new endpoints, with the following messages recurring in the agent logs:

[T=0s] ... level=info msg="New endpoint"
[T=0s] ... level=info msg="Resolving identity labels (blocking)
[T=43s]... level=error msg="Forcefully unlocking local kvstore lock"
[T=89s]... level=info msg="Removed endpoint"

Upon further investigation of "gops stack" output from a sysdump in the
environment, the following stack trace was observed on Cilium 1.6.8:

goroutine 589 [select, 14 minutes]:
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00214d500, 0x8, 0xc000740d10)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:245 +0xcc
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:256
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000740d10, 0x32ec340, 0xc001caf890, 0x0, 0xc001e52650, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:850 +0x70a
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000740d10, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:715 +0x46
...
github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb.(*leaseClient).LeaseGrant(0xc00042a238, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0xc001fbaea0, 0x4
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3792 +0xd3
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*retryLeaseClient).LeaseGrant(0xc000ec0510, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0x53bcba0, 0x3, 0x3, 0x1319
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/retry.go:144 +0xeb
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*lessor).Grant(0xc0006a8640, 0x38e6ae0, 0xc00005cb40, 0x19, 0x0, 0xc000abd680, 0xc000abd708)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/lease.go:216 +0x98
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency.NewSession(0xc000c15860, 0xc001e52f50, 0x1, 0x1, 0x0, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency/session.go:46 +0x308
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000b00790, 0xc0227a56a6, 0x14dffe3633def)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:467 +0xde
github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func4(0x38e6ae0, 0xc00005d640, 0x53bd3e0, 0x2)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:583 +0x2a
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000c06ff0)
        /go/src/github.com/cilium/cilium/pkg/controller/controller.go:203 +0x9e1
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController
        /go/src/github.com/cilium/cilium/pkg/controller/manager.go:112 +0xae0

In particular, it's useful to note:
* The goroutine has been blocking for 14 minutes attempting to make progress
  on re-establishing the etcd session.
* This path is hit by a controller responsible for handling etcd session
  renewal in the event of etcd connectivity issues. It would not be triggered
  by initial etcd connectivity failures, only when etcd connectivity was
  successfully established then subsequently lost.
* NewSession() attempts to pick one of the configured etcd peers to connect to,
  and will block until connectivity to that peer is restored. It will not by
  itself back off and re-attempt connectivity to another node in the etcd
  cluster.
* This path is in the critical section for writelock of `etcdClient.RWMutex`
  so will block out all etcd client reads.

The above is consistent with the agent logs: Effectively, a new endpoint is
scheduled and Cilium attempts to allocate an identity for it. This process
requires the kvstore and will block on the lock. Given that it makes no
progress, first we see the lock liveness check kick in to forcefully unlock
the path lock (`Forcefully unlocking local kvstore lock`), then later we see
cilium-cni / kubelet give up on creating the endpoint and remove it again.

This patch fixes the issue by introducing a timeout when attempting to renew
the session. Each time `NewSession()` is called, the etcd client library will
pick an etcd peer to connect to. The intent behind the timeout is to provide
a way for Cilium to detect when it is attempting to connect to an unhealthy
peer and retry against one of the other peers in the etcd cluster, in the
hope that the other peer is healthy.

In the event that there is an etcd connectivity outage where one of
three etcd peers is unhealthy, we expect that the remaining two can
retain quorum and continue to operate despite the outage of the third
peer. In this case if Cilium was attempting to connect to the third
(unhealthy) peer, Cilium would previously block indefinitely. With this
patch Cilium will time out after statusCheckTimeout (10s) and
re-establish a session to the remaining etcd peers, thereby unblocking
subsequent endpoint provisioning.
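
A minimal sketch of one way to bound session renewal, assuming the vendored `clientv3/concurrency` API; the function name, constant name, and surrounding wiring are assumptions for illustration and not the exact Cilium change.

```
package kvstore

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

// statusCheckTimeout mirrors the 10s value mentioned above; the constant
// name is an assumption for this sketch.
const statusCheckTimeout = 10 * time.Second

// newSessionWithTimeout bounds how long we wait for concurrency.NewSession,
// so that an attempt against an unhealthy etcd peer fails fast and the caller
// can retry (letting the client pick another peer) instead of blocking forever.
func newSessionWithTimeout(ctx context.Context, client *clientv3.Client) (*concurrency.Session, error) {
	type result struct {
		session *concurrency.Session
		err     error
	}
	ch := make(chan result, 1)

	go func() {
		// The session is created against the long-lived parent context so
		// that a successful session is not torn down when the timer fires.
		s, err := concurrency.NewSession(client, concurrency.WithContext(ctx))
		ch <- result{s, err}
	}()

	select {
	case r := <-ch:
		return r.session, r.err
	case <-time.After(statusCheckTimeout):
		go func() {
			// If the slow attempt eventually succeeds, close the stray
			// session so its lease is not kept alive in the background.
			if r := <-ch; r.err == nil {
				r.session.Close()
			}
		}()
		return nil, fmt.Errorf("timed out after %s establishing etcd session", statusCheckTimeout)
	}
}
```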

Signed-off-by: Joe Stringer <joe@cilium.io>
Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
[ upstream commit b99c7b8 ]

When renewSession() attempts to re-establish a session with an etcd
peer, it may block for some period of time, currently 10s. At the
moment, this timeout is added to the LockPath() timeout because the
LockPath() context timeout is created *after* acquiring the lock.
Move the context creation before the lock acquisition to allow these
timers to overlap to provide more accurate bounding on the timeouts
here.
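
A hedged sketch of the reordering, with hypothetical names and values; the point is only that the timeout starts ticking before the potentially slow lock acquisition, so the two delays overlap rather than add up.

```
package kvstore

import (
	"context"
	"sync"
	"time"
)

// lockPathTimeout is illustrative; it is not the real constant name or value.
const lockPathTimeout = 1 * time.Minute

// lockPathSketch creates the context, and thus the deadline, *before*
// acquiring the local lock. Previously the context was created after the
// lock, so a slow renewSession() holding the lock added its own delay on top
// of the LockPath() timeout; with this ordering both count against one bound.
func lockPathSketch(mu *sync.Mutex, op func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), lockPathTimeout)
	defer cancel()

	mu.Lock()
	defer mu.Unlock()

	return op(ctx)
}
```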

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
nebril requested a review from a team as a code owner June 30, 2020 16:58
maintainer-s-little-helper bot added the backport/1.6 and kind/backports labels Jun 30, 2020
nebril (Member, Author) commented Jun 30, 2020

test-backport-1.6

maintainer-s-little-helper bot added the ready-to-merge label Jun 30, 2020
joestringer merged commit abeeb88 into v1.6 Jun 30, 2020
joestringer deleted the pr/1.6-backports-30-06.2020 branch June 30, 2020 20:45