
Fix bug where etcd session renew would block indefinitely, causing endpoint provision to fail #12292

Merged
merged 2 commits into cilium:master from joestringer:submit/etcd-session-retry Jun 26, 2020

Conversation


@joestringer joestringer commented Jun 25, 2020

A user reported an issue where, after an etcd cluster outage, Cilium would become unable to provision new endpoints, with the following messages recurring in the agent logs:

[T=0s] ... level=info msg="New endpoint"
[T=0s] ... level=info msg="Resolving identity labels (blocking)
[T=43s]... level=error msg="Forcefully unlocking local kvstore lock"
[T=89s]... level=info msg="Removed endpoint"

Upon further investigation of "gops stack" output from a sysdump in the
environment, the following stack trace was observed on Cilium 1.6.8:

goroutine 589 [select, 14 minutes]:
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00214d500, 0x8, 0xc000740d10)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:245 +0xcc
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:256
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000740d10, 0x32ec340, 0xc001caf890, 0x0, 0xc001e52650, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:850 +0x70a
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000740d10, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:715 +0x46
...
github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb.(*leaseClient).LeaseGrant(0xc00042a238, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0xc001fbaea0, 0x4
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3792 +0xd3
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*retryLeaseClient).LeaseGrant(0xc000ec0510, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0x53bcba0, 0x3, 0x3, 0x1319
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/retry.go:144 +0xeb
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*lessor).Grant(0xc0006a8640, 0x38e6ae0, 0xc00005cb40, 0x19, 0x0, 0xc000abd680, 0xc000abd708)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/lease.go:216 +0x98
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency.NewSession(0xc000c15860, 0xc001e52f50, 0x1, 0x1, 0x0, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency/session.go:46 +0x308
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000b00790, 0xc0227a56a6, 0x14dffe3633def)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:467 +0xde
github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func4(0x38e6ae0, 0xc00005d640, 0x53bd3e0, 0x2)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:583 +0x2a
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000c06ff0)
        /go/src/github.com/cilium/cilium/pkg/controller/controller.go:203 +0x9e1
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController
        /go/src/github.com/cilium/cilium/pkg/controller/manager.go:112 +0xae0

In particular, it's useful to note:

  • The goroutine has been blocking for 14 minutes attempting to make progress
    on re-establishing the etcd session.
  • This path is hit by a controller responsible for handling etcd session
    renewal in the event of etcd connectivity issues. It would not be triggered
    by initial etcd connectivity failures, only when etcd connectivity was
    successfully established then subsequently lost.
  • NewSession() attempts to pick one of the configured etcd peers to connect to,
    and will block until connectivity to that peer is restored. It will not by
    itself back off and re-attempt connectivity to another node in the etcd
    cluster.
  • This path is inside the write-lock critical section of etcdClient.RWMutex,
    so it blocks all etcd client reads (see the sketch below).
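
The last point is what turns one hung etcd connection into an agent-wide symptom. As a minimal, runnable illustration of the pattern (hypothetical names, not Cilium's actual code): if session renewal holds the write side of a sync.RWMutex while a blocking call is in flight, every reader of that mutex stalls with it.

package main

import (
	"fmt"
	"sync"
	"time"
)

type fakeClient struct {
	mu sync.RWMutex
}

// renew stands in for renewLockSession(): it takes the write lock and then
// blocks on a call that only returns once the etcd peer answers again.
func (c *fakeClient) renew(hang time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	time.Sleep(hang) // stand-in for concurrency.NewSession() hanging
}

// get stands in for any kvstore read (e.g. identity allocation); it cannot
// proceed while renew() holds the write lock.
func (c *fakeClient) get(key string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return "value-of-" + key
}

func main() {
	c := &fakeClient{}
	go c.renew(2 * time.Second)
	time.Sleep(100 * time.Millisecond) // let renew() grab the write lock first
	start := time.Now()
	v := c.get("identity")
	fmt.Printf("%s (blocked for ~%v)\n", v, time.Since(start).Round(time.Second))
}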

The above is consistent with the agent logs: Effectively, a new endpoint is
scheduled and Cilium attempts to allocate an identity for it. This process
requires the kvstore and will block on the lock. Given that it makes no
progress, first we see the lock liveness check kick in to forcefully unlock
the path lock (Forcefully unlocking local kvstore lock), then later we see
cilium-cni / kubelet give up on creating the endpoint and remove it again.
The gops output also contained multiple goroutines blocked on this
lock with a call frame originating in an endpoint create API call.

This patch fixes the issue by introducing a timeout when attempting to renew
the session. Each time NewSession() is called, the etcd client library will
pick an etcd peer to connect to. The intent behind the timeout is to provide
a way for Cilium to detect when it is attempting to connect to an unhealthy
peer and retry against one of the other peers in the etcd cluster, in the
hope that the other peer is healthy.
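
As a rough sketch of the shape of that change (simplified, with a hypothetical function name; a hunk of the actual change is quoted in the review discussion below), the renewal derives a bounded context from the etcd client's own context and passes it to NewSession():

package kvstoresketch

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

const statusCheckTimeout = 10 * time.Second

// renewSessionWithTimeout bounds how long session re-establishment may block.
// NewSession() picks one configured peer and blocks until it responds; with a
// timed context the call returns an error after statusCheckTimeout instead of
// hanging, so the calling controller can retry, potentially against a
// different, healthy peer.
func renewSessionWithTimeout(client *clientv3.Client) (*concurrency.Session, error) {
	// Derive the deadline from the client's own context so that client
	// shutdown still cancels the attempt promptly.
	timedCtx, cancel := context.WithTimeout(client.Ctx(), statusCheckTimeout)
	defer cancel()

	// Note: the session created here inherits timedCtx for its lease
	// keepalives, so a real implementation has to manage the session and
	// context lifetimes more carefully than this sketch does.
	return concurrency.NewSession(client, concurrency.WithContext(timedCtx))
}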

In the event that there is an etcd connectivity outage where one of
three etcd peers is unhealthy, we expect that the remaining two can
retain quorum and continue to operate despite the outage of the third
peer. In this case, if Cilium was attempting to connect to the third
(unhealthy) peer, Cilium would previously block indefinitely. With this
patch Cilium will time out after statusCheckTimeout (10s) and
re-establish a session to the remaining etcd peers, thereby unblocking
subsequent endpoint provisioning.

This PR also aligns the etcd path locking timeouts with the session renewal timeout introduced above.
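
That alignment is an ordering change: the LockPath() deadline now starts before the lock acquisition, so time spent waiting (for example, behind a concurrent session renewal) counts against the same deadline instead of being added on top of it. A minimal, self-contained sketch of the idea (hypothetical names, not the actual LockPath() code):

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// lockPath sketches the fixed ordering: start the deadline first, then wait
// for the client lock, so both waits share one bound instead of adding up.
func lockPath(ctx context.Context, mu *sync.RWMutex, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	mu.RLock() // may block while a session renewal holds the write lock
	defer mu.RUnlock()

	// Stand-in for the etcd lock acquisition, which respects the deadline.
	select {
	case <-time.After(200 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	var mu sync.RWMutex
	mu.Lock() // simulate a session renewal holding the write lock for 1s
	go func() {
		time.Sleep(1 * time.Second)
		mu.Unlock()
	}()
	// Previously the timeout was created after the lock was acquired, so a
	// caller could wait for the renewal plus the full timeout; here the 1s
	// lock wait eats into the 2s budget instead.
	fmt.Println(lockPath(context.Background(), &mu, 2*time.Second))
}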

joestringer and others added 2 commits June 25, 2020 10:40
When renewSession() attempts to re-establish a session with an etcd
peer, it may block for some period of time, currently 10s. At the
moment, this timeout is added to the LockPath() timeout because the
LockPath() context timeout is created *after* acquiring the lock.
Move the context creation before the lock acquisition so that these
timers overlap, providing a more accurate bound on the total timeout
here.

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer added the priority/high, release-note/bug, and needs-backport/1.6 labels Jun 25, 2020
@joestringer joestringer requested a review from aanm June 25, 2020 17:45
@joestringer joestringer requested a review from a team as a code owner June 25, 2020 17:45
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.1 Jun 25, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.6 Jun 25, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.10 Jun 25, 2020
@ianvernon ianvernon (Member) left a comment

Thanks for providing all of the context in this PR :)

@joestringer joestringer (Member, Author) commented

test-me-please

@christarazi christarazi (Member) left a comment

Nice write-up! Code LGTM

@coveralls

Coverage Status

Coverage decreased (-0.02%) to 37.166% when pulling 54c46a9 on joestringer:submit/etcd-session-retry into 8789513 on cilium:master.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge label Jun 25, 2020
// integrate better with any code on the etcd
// client side.
parentCtx := ec.client.Ctx()
timedCtx, cancel := context.WithTimeout(parentCtx, statusCheckTimeout)
@joestringer joestringer (Member, Author) commented

We could potentially consider using something like ClusterSizeDependantInterval() to scale this based on the cluster size.

@aanm aanm merged commit b99c7b8 into cilium:master Jun 26, 2020
@joestringer joestringer deleted the submit/etcd-session-retry branch June 29, 2020 02:25
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.1 Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.6 Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.6 in 1.6.10 Jun 30, 2020
@joestringer joestringer moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.1 Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.6 to Backport done to v1.6 in 1.6.10 Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.6 Jun 30, 2020
Labels
  • priority/high: This is considered vital to an upcoming release.
  • ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
  • release-note/bug: This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
  • 1.6.10: Backport done to v1.6
  • 1.7.6: Backport done to v1.7
  • 1.8.1: Backport done to v1.8

6 participants