Fix bug where etcd session renew would block indefinitely, causing endpoint provision to fail #12292
Conversation
A user reported an issue where, after an etcd cluster outage, Cilium would become unable to provision new endpoints, with the following messages recurring in the agent logs:

```
[T=0s]  ... level=info msg="New endpoint"
[T=0s]  ... level=info msg="Resolving identity labels (blocking)"
[T=43s] ... level=error msg="Forcefully unlocking local kvstore lock"
[T=89s] ... level=info msg="Removed endpoint"
```

Upon further investigation of `gops stack` output from a sysdump in the environment, the following stack trace was observed on Cilium 1.6.8:

```
goroutine 589 [select, 14 minutes]:
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00214d500, 0x8, 0xc000740d10)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:245 +0xcc
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:256
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000740d10, 0x32ec340, 0xc001caf890, 0x0, 0xc001e52650, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:850 +0x70a
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000740d10, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:715 +0x46
...
github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb.(*leaseClient).LeaseGrant(0xc00042a238, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0xc001fbaea0, 0x4
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3792 +0xd3
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*retryLeaseClient).LeaseGrant(0xc000ec0510, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0x53bcba0, 0x3, 0x3, 0x1319
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/retry.go:144 +0xeb
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*lessor).Grant(0xc0006a8640, 0x38e6ae0, 0xc00005cb40, 0x19, 0x0, 0xc000abd680, 0xc000abd708)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/lease.go:216 +0x98
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency.NewSession(0xc000c15860, 0xc001e52f50, 0x1, 0x1, 0x0, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency/session.go:46 +0x308
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000b00790, 0xc0227a56a6, 0x14dffe3633def)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:467 +0xde
github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func4(0x38e6ae0, 0xc00005d640, 0x53bd3e0, 0x2)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:583 +0x2a
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000c06ff0)
        /go/src/github.com/cilium/cilium/pkg/controller/controller.go:203 +0x9e1
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController
        /go/src/github.com/cilium/cilium/pkg/controller/manager.go:112 +0xae0
```

In particular, it's useful to note:

* The goroutine has been blocked for 14 minutes attempting to make progress on re-establishing the etcd session.
* This path is hit by a controller responsible for handling etcd session renewal in the event of etcd connectivity issues. It would not be triggered by initial etcd connectivity failures, only when etcd connectivity was successfully established and subsequently lost.
* `NewSession()` picks one of the configured etcd peers to connect to, and will block until connectivity to that peer is restored. It will not by itself back off and re-attempt connectivity to another node in the etcd cluster.
* This path is in the critical section for the write lock of `etcdClient.RWMutex`, so it blocks out all etcd client reads.

The gops output also contained multiple goroutines blocked on this lock, with call frames originating in endpoint create API calls.

The above is consistent with the agent logs: a new endpoint is scheduled and Cilium attempts to allocate an identity for it. This process requires the kvstore and blocks on the lock. Since no progress is made, we first see the lock liveness check kick in to forcefully unlock the path lock (`Forcefully unlocking local kvstore lock`), and later cilium-cni / kubelet give up on creating the endpoint and remove it again.

This patch fixes the issue by introducing a timeout when attempting to renew the session. Each time `NewSession()` is called, the etcd client library picks an etcd peer to connect to. The intent behind the timeout is to give Cilium a way to detect that it is attempting to connect to an unhealthy peer, and to try reconnecting to one of the other peers in the etcd cluster in the hope that that other peer is healthy.

In an etcd connectivity outage where one of three etcd peers is unhealthy, we expect the remaining two to retain quorum and continue to operate despite the outage of the third peer. In this case, if Cilium was attempting to connect to the third (unhealthy) peer, it would previously block indefinitely. With this patch, Cilium times out after `statusCheckTimeout` (10s) and re-establishes a session to the remaining etcd peers, thereby unblocking subsequent endpoint provisioning.

This PR also aligns the etcd path locking timeouts with the potential session renew timeouts introduced by the above logic.

Signed-off-by: Joe Stringer <joe@cilium.io>
Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
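For illustration, a minimal sketch of bounding a session-renewal attempt with a timed context is shown below. It is not the exact change in this PR; the helper name `renewSessionWithTimeout`, the goroutine/select mechanics, and the error handling are assumptions, while `statusCheckTimeout`, `client.Ctx()`, and `concurrency.NewSession()` follow the description above.

```go
package etcdsession // illustrative package, not Cilium's pkg/kvstore

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

// statusCheckTimeout mirrors the 10s timeout referenced in the description.
const statusCheckTimeout = 10 * time.Second

// renewSessionWithTimeout is an illustrative helper (not the actual Cilium
// function): it bounds how long a single NewSession() attempt may block on an
// unhealthy etcd peer, so the caller can retry and let the client library pick
// another peer.
func renewSessionWithTimeout(client *clientv3.Client) (*concurrency.Session, error) {
	// Derive the deadline from the client's own context so that client
	// shutdown still cancels the wait.
	timedCtx, cancel := context.WithTimeout(client.Ctx(), statusCheckTimeout)
	defer cancel()

	type result struct {
		session *concurrency.Session
		err     error
	}
	resCh := make(chan result, 1)
	go func() {
		s, err := concurrency.NewSession(client)
		resCh <- result{session: s, err: err}
	}()

	select {
	case res := <-resCh:
		return res.session, res.err
	case <-timedCtx.Done():
		// If the blocked attempt eventually succeeds, close the now-unwanted
		// session so its lease is not leaked.
		go func() {
			if res := <-resCh; res.err == nil {
				res.session.Close()
			}
		}()
		return nil, timedCtx.Err()
	}
}
```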
When `renewSession()` attempts to re-establish a session with an etcd peer, it may block for some period of time, currently 10s. At the moment, this timeout is added on top of the `LockPath()` timeout because the `LockPath()` context timeout is created *after* acquiring the lock. Move the context creation before the lock acquisition so that these timers overlap, providing a more accurate bound on the overall timeout. Signed-off-by: Joe Stringer <joe@cilium.io>
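To make the reordering concrete, here is a minimal sketch, assuming `LockPath()` takes the client mutex as a reader (the write lock is held by session renewal, per the description). The package name, struct layout, and `lockPathTimeout` constant are illustrative, not the actual Cilium code.

```go
package kvstore // illustrative, not the real pkg/kvstore

import (
	"context"
	"sync"
	"time"
)

// lockPathTimeout is an illustrative constant; the real timeout comes from
// Cilium's configuration.
const lockPathTimeout = time.Minute

type etcdClient struct {
	sync.RWMutex // stands in for etcdClient.RWMutex from the description
}

// LockPath sketch: the timeout context is created *before* acquiring the
// client mutex, so any time spent waiting here (for example while
// renewSession() holds the write lock for up to ~10s) counts against the same
// deadline instead of being added on top of it.
func (ec *etcdClient) LockPath(ctx context.Context, path string) error {
	timedCtx, cancel := context.WithTimeout(ctx, lockPathTimeout)
	defer cancel()

	// Previously the context was created after this point, so the lock wait
	// and the context timeout accumulated.
	ec.RLock()
	defer ec.RUnlock()

	// ... acquire the etcd lock for `path` using timedCtx ...
	_ = timedCtx
	return nil
}
```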
Thanks for providing all of the context in this PR :)
test-me-please
Nice write-up! Code LGTM
// integrate better with any code on the etcd
// client side.
parentCtx := ec.client.Ctx()
timedCtx, cancel := context.WithTimeout(parentCtx, statusCheckTimeout)
We could potentially consider using something like ClusterSizeDependantInterval()
to scale this based on the cluster size.
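For illustration only, a rough sketch of the idea behind this suggestion follows; the helper name `scaledTimeout` and the exact growth formula are assumptions, not Cilium's actual ClusterSizeDependantInterval().

```go
package kvstore // illustrative

import (
	"math"
	"time"
)

// scaledTimeout is a hypothetical stand-in for the suggestion above: grow a
// base timeout logarithmically with the number of nodes, so larger clusters
// (where etcd is under more load) get more headroom before giving up.
func scaledTimeout(base time.Duration, numNodes int) time.Duration {
	if numNodes <= 1 {
		return base
	}
	// log2(n) + 1 gives gentle, bounded growth: 2 nodes -> 2x base,
	// 16 nodes -> 5x base, 1024 nodes -> 11x base.
	factor := math.Log2(float64(numNodes)) + 1
	return time.Duration(float64(base) * factor)
}
```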