Fix bug where etcd session renew would block indefinitely, causing endpoint provision to fail #12292
Conversation
A user reported an issue where, after an etcd cluster outage, Cilium would become unable to provision new endpoints, with the following messages recurring in the agent logs:

```
[T=0s]  ... level=info msg="New endpoint"
[T=0s]  ... level=info msg="Resolving identity labels (blocking)"
[T=43s] ... level=error msg="Forcefully unlocking local kvstore lock"
[T=89s] ... level=info msg="Removed endpoint"
```

Upon further investigation of `gops stack` output from a sysdump in the environment, the following stack trace was observed on Cilium 1.6.8:

```
goroutine 589 [select, 14 minutes]:
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00214d500, 0x8, 0xc000740d10)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:245 +0xcc
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:256
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000740d10, 0x32ec340, 0xc001caf890, 0x0, 0xc001e52650, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:850 +0x70a
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000740d10, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:715 +0x46
...
github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb.(*leaseClient).LeaseGrant(0xc00042a238, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0xc001fbaea0, 0x4
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3792 +0xd3
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*retryLeaseClient).LeaseGrant(0xc000ec0510, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0x53bcba0, 0x3, 0x3, 0x1319
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/retry.go:144 +0xeb
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*lessor).Grant(0xc0006a8640, 0x38e6ae0, 0xc00005cb40, 0x19, 0x0, 0xc000abd680, 0xc000abd708)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/lease.go:216 +0x98
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency.NewSession(0xc000c15860, 0xc001e52f50, 0x1, 0x1, 0x0, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency/session.go:46 +0x308
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000b00790, 0xc0227a56a6, 0x14dffe3633def)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:467 +0xde
github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func4(0x38e6ae0, 0xc00005d640, 0x53bd3e0, 0x2)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:583 +0x2a
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000c06ff0)
        /go/src/github.com/cilium/cilium/pkg/controller/controller.go:203 +0x9e1
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController
        /go/src/github.com/cilium/cilium/pkg/controller/manager.go:112 +0xae0
```

In particular, it's useful to note:

* The goroutine has been blocked for 14 minutes attempting to make progress on re-establishing the etcd session.
* This path is hit by a controller responsible for handling etcd session renewal in the event of etcd connectivity issues. It would not be triggered by initial etcd connectivity failures, only when etcd connectivity was successfully established and subsequently lost.
* `NewSession()` picks one of the configured etcd peers to connect to, and will block until connectivity to that peer is restored. It will not by itself back off and re-attempt connectivity to another node in the etcd cluster.
* This path is in the critical section for the write lock of `etcdClient.RWMutex`, so it blocks out all etcd client reads.

The gops output also contained multiple goroutines blocked on this lock, with call frames originating in endpoint create API calls.

The above is consistent with the agent logs: a new endpoint is scheduled and Cilium attempts to allocate an identity for it. This process requires the kvstore and blocks on the lock. Since no progress is made, we first see the lock liveness check kick in to forcefully unlock the path lock (`Forcefully unlocking local kvstore lock`), and later cilium-cni / kubelet give up on creating the endpoint and remove it again.

This patch fixes the issue by introducing a timeout when attempting to renew the session. Each time `NewSession()` is called, the etcd client library picks an etcd peer to connect to. The intent behind the timeout is to give Cilium a way to detect that it is attempting to connect to an unhealthy peer, and to try reconnecting to one of the other peers in the etcd cluster in the hope that that other peer is healthy.

In an etcd connectivity outage where one of three etcd peers is unhealthy, we expect the remaining two to retain quorum and continue to operate despite the outage of the third peer. In this case, if Cilium was attempting to connect to the third (unhealthy) peer, it would previously block indefinitely. With this patch, Cilium times out after `statusCheckTimeout` (10s) and re-establishes a session to the remaining etcd peers, thereby unblocking subsequent endpoint provisioning.

This PR also aligns the etcd path locking timeouts with the potential session renew timeouts introduced by the above logic.

Signed-off-by: Joe Stringer <joe@cilium.io>
Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
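For illustration, a minimal sketch of bounding a session-renewal attempt with a timed context is shown below. It is not the exact change in this PR; the helper name `renewSessionWithTimeout`, the goroutine/select mechanics, and the error handling are assumptions, while `statusCheckTimeout`, `client.Ctx()`, and `concurrency.NewSession()` follow the description above.

```go
package etcdsession // illustrative package, not Cilium's pkg/kvstore

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

// statusCheckTimeout mirrors the 10s timeout referenced in the description.
const statusCheckTimeout = 10 * time.Second

// renewSessionWithTimeout is an illustrative helper (not the actual Cilium
// function): it bounds how long a single NewSession() attempt may block on an
// unhealthy etcd peer, so the caller can retry and let the client library pick
// another peer.
func renewSessionWithTimeout(client *clientv3.Client) (*concurrency.Session, error) {
	// Derive the deadline from the client's own context so that client
	// shutdown still cancels the wait.
	timedCtx, cancel := context.WithTimeout(client.Ctx(), statusCheckTimeout)
	defer cancel()

	type result struct {
		session *concurrency.Session
		err     error
	}
	resCh := make(chan result, 1)
	go func() {
		s, err := concurrency.NewSession(client)
		resCh <- result{session: s, err: err}
	}()

	select {
	case res := <-resCh:
		return res.session, res.err
	case <-timedCtx.Done():
		// If the blocked attempt eventually succeeds, close the now-unwanted
		// session so its lease is not leaked.
		go func() {
			if res := <-resCh; res.err == nil {
				res.session.Close()
			}
		}()
		return nil, timedCtx.Err()
	}
}
```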
When `renewSession()` attempts to re-establish a session with an etcd peer, it may block for some period of time, currently 10s. At the moment, this timeout is added on top of the `LockPath()` timeout because the `LockPath()` context timeout is created *after* acquiring the lock. Move the context creation before the lock acquisition so that these timers overlap, providing a more accurate bound on the overall timeout. Signed-off-by: Joe Stringer <joe@cilium.io>
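To make the reordering concrete, here is a minimal sketch, assuming `LockPath()` takes the client mutex as a reader (the write lock is held by session renewal, per the description). The package name, struct layout, and `lockPathTimeout` constant are illustrative, not the actual Cilium code.

```go
package kvstore // illustrative, not the real pkg/kvstore

import (
	"context"
	"sync"
	"time"
)

// lockPathTimeout is an illustrative constant; the real timeout comes from
// Cilium's configuration.
const lockPathTimeout = time.Minute

type etcdClient struct {
	sync.RWMutex // stands in for etcdClient.RWMutex from the description
}

// LockPath sketch: the timeout context is created *before* acquiring the
// client mutex, so any time spent waiting here (for example while
// renewSession() holds the write lock for up to ~10s) counts against the same
// deadline instead of being added on top of it.
func (ec *etcdClient) LockPath(ctx context.Context, path string) error {
	timedCtx, cancel := context.WithTimeout(ctx, lockPathTimeout)
	defer cancel()

	// Previously the context was created after this point, so the lock wait
	// and the context timeout accumulated.
	ec.RLock()
	defer ec.RUnlock()

	// ... acquire the etcd lock for `path` using timedCtx ...
	_ = timedCtx
	return nil
}
```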
Thanks for providing all of the context in this PR :)
test-me-please
Nice write-up! Code LGTM
// integrate better with any code on the etcd
// client side.
parentCtx := ec.client.Ctx()
timedCtx, cancel := context.WithTimeout(parentCtx, statusCheckTimeout)
We could potentially consider using something like ClusterSizeDependantInterval()
to scale this based on the cluster size.
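For illustration only, a rough sketch of the idea behind this suggestion follows; the helper name `scaledTimeout` and the exact growth formula are assumptions, not Cilium's actual ClusterSizeDependantInterval().

```go
package kvstore // illustrative

import (
	"math"
	"time"
)

// scaledTimeout is a hypothetical stand-in for the suggestion above: grow a
// base timeout logarithmically with the number of nodes, so larger clusters
// (where etcd is under more load) get more headroom before giving up.
func scaledTimeout(base time.Duration, numNodes int) time.Duration {
	if numNodes <= 1 {
		return base
	}
	// log2(n) + 1 gives gentle, bounded growth: 2 nodes -> 2x base,
	// 16 nodes -> 5x base, 1024 nodes -> 11x base.
	factor := math.Log2(float64(numNodes)) + 1
	return time.Duration(float64(base) * factor)
}
```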