
Watch does not fail on complete loss of connectivity #8980

Closed

devnev opened this issue Dec 6, 2017 · 14 comments
@devnev commented Dec 6, 2017

Steps to reproduce:

  • Build etcd and etcdctl from master, or take the latest release
  • Run etcd
  • Run ETCDCTL_API=3 etcdctl watch --prefix ""
  • SIGSTOP etcd, e.g. using Ctrl+Z in the terminal

Expected outcome:
After some timeout, the watch will fail. This would allow the caller to propagate an error further up the stack, telling clients to retry elsewhere where there might be a healthy connection.

Observed outcome:
The watch blocks forever.

@gyuho (Contributor) commented Dec 6, 2017

telling clients to retry elsewhere where there might be a healthy connection.

This is already working (since v3.2.10+ and on the master branch, pre-v3.3.0).

You are running a single node, so the watch client is expected to stay stuck on that single node and keep retrying, even with keepalive settings (e.g. ETCDCTL_API=3 etcdctl --debug --keepalive-time=5s --keepalive-timeout=5s watch --prefix "").

As of today (pre-v3.3.0, gRPC v1.7.4), the retry logic is entirely manual: just try again on transient failure. This was partly because gRPC keepalive did not expose the connection state (which seems resolved now; see grpc/grpc-go#1444). In the future, we may improve this for the single-node case.
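
For reference, the same keepalive behaviour can be configured from a Go client. A minimal sketch (the endpoint and durations are illustrative, using the pre-v3.4 import path):

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Client-side keepalive pings the endpoint every 5s and drops the
	// connection if no response arrives within 5s; the balancer then
	// retries or switches endpoints instead of surfacing a watch error.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:            []string{"127.0.0.1:2379"},
		DialTimeout:          5 * time.Second,
		DialKeepAliveTime:    5 * time.Second, // like --keepalive-time=5s
		DialKeepAliveTimeout: 5 * time.Second, // like --keepalive-timeout=5s
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```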

@gyuho closed this as completed Dec 6, 2017
@devnev (Author) commented Dec 6, 2017

Sorry, but it's not clear to me what you're saying. Is it intended behaviour that the broken watch just sits there doing nothing?

I downloaded v3.2.11, which exhibits the same problematic behaviour with my command and does not support the keepalive flags in yours.

I do not understand what you mean by "the retry logic is entirely manual: just try again on transient failure". Manual in what sense? What counts as a transient failure?

To be clear, the behaviour I expect to see at some point is: given the correct connection timeout options, a watch that gets disconnected will fail and return to the caller.

@xiang90 (Contributor) commented Dec 6, 2017

A temporary connection issue should not fail the watcher unless you explicitly specify a timeout (but the timeout will also fail long-running healthy watchers, so probably no one would want to do that) or a keepalive mechanism to kill it.
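
For illustration, a minimal sketch of the explicit-timeout approach, assuming an already configured *clientv3.Client named cli; note that the deadline also ends a perfectly healthy long-running watch, which is the trade-off described above:

```go
import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func watchWithDeadline(cli *clientv3.Client) {
	// The watch is bounded by an explicit deadline; it ends after 30s
	// whether or not the connection is healthy.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	for wresp := range cli.Watch(ctx, "foo", clientv3.WithPrefix()) {
		for _, ev := range wresp.Events {
			log.Printf("%s %q -> %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
	// The channel closes once the deadline expires; ctx.Err() carries the reason.
	log.Printf("watch ended: %v", ctx.Err())
}
```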

@xiang90 (Contributor) commented Dec 6, 2017

After some timeout, the watch will fail. This would allow the caller to propagate an error further up the stack, telling clients to retry elsewhere where there might be a healthy connection.

You should provide the etcd client all available endpoints, and the client is expected to reconnect by itself. If you want to control reconnection yourself, you basically need to build your own balancer, which is a non-trivial effort to get right.

@devnev (Author) commented Dec 6, 2017

or a keepalive mechanism to kill it.

This is exactly what I'm looking for and can't find. Trying your etcdctl watch command with all the timeouts, the command does not terminate when I SIGSTOP etcd.

Edit: this is with etcd/etcdctl built from master.

@devnev (Author) commented Dec 6, 2017

You should provide the etcd client all available endpoints, and the client is expected to reconnect by itself. If you want to control reconnection yourself, you basically need to build your own balancer, which is a non-trivial effort to get right.

I do not want to control reconnection. I want my own service's requests to fail when the service is completely unable to reach etcd, because in that state the service cannot operate correctly.

@devnev (Author) commented Dec 7, 2017

Hi - any updates? As far as I can tell this is still unresolved, despite the rather premature closing of the issue.

@gyuho (Contributor) commented Dec 7, 2017

@devnev Sorry! I was unclear, and we need better docs around this.

Is it intended behaviour that the broken watch just sits there doing nothing?

It's retrying the single endpoint, since your command only runs a single-node cluster. Again, if you want balancer failover, provide the other endpoints via the --endpoints flag.

the command does not terminate when I SIGSTOP etcd.

This is expected. The watch command won't exit on connection loss or server shutdown. Keepalive in watch was added to detect connection failures and trigger endpoint switches, not to error out. We will document this clearly in the v3.3 release.

the service is completely unable to reach etcd

On the client side, there's no way to tell whether the failure is transient or permanent. That's why the client retries among the available endpoints, in case the server comes back.

When the etcd server goes down, the gRPC client receives rpc error: code = Unavailable desc = transport is closing. Its error code is Unavailable, indicating a transient failure from gRPC's point of view. When the client receives an obvious server error (e.g. etcdserver: mvcc: required revision has been compacted), the watch client terminates. We might expose the connection state to the client (tracked in #8776), so that the client can decide whether to retry depending on the connection state.
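
To make that concrete, a small consumer-side sketch (assuming a configured *clientv3.Client named cli): transport-level Unavailable errors are retried internally and never surface on the watch channel, while terminal server errors such as compaction arrive as a final canceled response:

```go
import (
	"context"
	"log"

	"github.com/coreos/etcd/clientv3"
)

func consume(cli *clientv3.Client) {
	rch := cli.Watch(context.Background(), "foo", clientv3.WithPrefix())
	for wresp := range rch {
		if wresp.Canceled {
			// Terminal server-side error, e.g. the compacted-revision error above.
			log.Printf("watch canceled: %v (compact revision: %d)", wresp.Err(), wresp.CompactRevision)
			return
		}
		for _, ev := range wresp.Events {
			log.Printf("%s %q", ev.Type, ev.Kv.Key)
		}
	}
	// Transient transport failures are retried internally and never close the channel here.
}
```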

Again, watch is not really meant for detecting disconnects. If you want to detect connection loss, I would try concurrency.Session with a TTL. Example usage in the lock command:

https://github.com/coreos/etcd/blob/7e0fc6136eb4c5dc3c8c39ff48eb29e87b7158ab/etcdctl/ctlv3/command/lock_command.go#L56
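
Along those lines, a rough sketch of pairing a watch with a concurrency.Session (the function name and 5-second TTL are illustrative): the session's lease keepalives stop succeeding once connectivity is lost, and its Done() channel closes after roughly the TTL, giving the caller an error to propagate:

```go
import (
	"context"
	"errors"
	"log"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func watchOrFail(cli *clientv3.Client, prefix string) error {
	// The session keeps a 5s-TTL lease alive in the background.
	s, err := concurrency.NewSession(cli, concurrency.WithTTL(5))
	if err != nil {
		return err
	}
	defer s.Close()

	rch := cli.Watch(clientv3.WithRequireLeader(context.Background()), prefix, clientv3.WithPrefix())
	for {
		select {
		case wresp, ok := <-rch:
			if !ok {
				return errors.New("watch channel closed")
			}
			if err := wresp.Err(); err != nil {
				return err
			}
			for _, ev := range wresp.Events {
				log.Printf("%s %q", ev.Type, ev.Kv.Key)
			}
		case <-s.Done():
			// Lease keepalives stopped arriving: treat etcd as unreachable.
			return errors.New("etcd session expired: connectivity lost")
		}
	}
}
```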

@devnev (Author) commented Dec 7, 2017

Thank you for the clarification and the workaround.

As a final note, even with multiple etcd members, there can be a total loss of connectivity to the etcd cluster (in our case caused by an as-yet-undiagnosed routing issue affecting only specific Kubernetes pods). One of the benefits I expected from watches was receiving updates quickly, making the system more reactive. If a watch gets completely disconnected, the system loses this property. I am glad to hear there's a workaround using sessions, but I will now have to wrap every single one of my watches in a session to ensure the reactivity of the system.

@gyuho (Contributor) commented Dec 7, 2017

As a final note, even with multiple etcd members, there can be a total loss of connectivity to the etcd cluster (in our case caused by an as-yet-undiagnosed routing issue affecting only specific Kubernetes pods).

Agreed. We will address this in the next release cycle (hopefully v3.4), and document the current watch limitations in the v3.3 release.

@xiang90 (Contributor) commented Dec 7, 2017

@devnev @gyuho

A retry option should be provided to users for watch: if the watcher cannot reconnect for, say, 5 seconds, the watcher can return an error.
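
Until such an option exists, one client-side approximation (a sketch only, not an etcd API; the function name and the budget are illustrative) is to monitor the underlying gRPC connection state, which recent gRPC versions expose, and cancel the watch context if the client stays disconnected past the budget:

```go
import (
	"context"
	"time"

	"github.com/coreos/etcd/clientv3"
	"google.golang.org/grpc/connectivity"
)

// cancelWatchOnDisconnect cancels `cancel` (and hence any watch created from
// the associated context) if the client's connection stays down for `budget`.
func cancelWatchOnDisconnect(ctx context.Context, cancel context.CancelFunc, cli *clientv3.Client, budget time.Duration) {
	conn := cli.ActiveConnection()
	for {
		state := conn.GetState()
		if state == connectivity.TransientFailure || state == connectivity.Shutdown {
			waitCtx, waitCancel := context.WithTimeout(ctx, budget)
			recovered := conn.WaitForStateChange(waitCtx, state)
			waitCancel()
			if !recovered {
				cancel() // still disconnected after the budget: fail the watch
				return
			}
			continue
		}
		if !conn.WaitForStateChange(ctx, state) {
			return // parent context canceled
		}
	}
}
```

Such a monitor would run in its own goroutine alongside the watch that shares the same context.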

@gyuho added this to the v3.4.0 milestone Dec 7, 2017
jgraettinger pushed a commit to gazette/core that referenced this issue Feb 8, 2018
For an etcd Watch with context, as long as the context hasn't been
canceled or timed out, the Watch will retry on recoverable errors
forever until connected. This includes Watches during significant
etcd connection loss, which can result in missed events.

To upper bound the effects of such missed events, use a periodic full refresh
of the full tree to interrupt every Watch, preventing any long-lived Watch.

Currently the periodic full tree refresh is 10 minutes.

Since this affects all Watches, we'll have to keep an eye on etcd
performance and configure the period accordingly.

See etcd-io/etcd#8980.

Also make sure to return all etcd errors.
@gyuho modified the milestones: etcd-v3.4, etcd-v3.5 Aug 5, 2019
@embano1 commented Mar 31, 2020

Thank you for this useful discussion. I am seeing the same behavior when simulating partitions (e.g. cutting the [virtual] network connection or using iptables). Even though my client had all cluster endpoints and tight timeout options configured, Watch() (even with WithRequireLeader) would hang forever. The underlying gRPC connection remains in TRANSIENT_FAILURE and the corresponding watch channel is not closed even though lost leader == true.

My observation only holds when the client itself is partitioned from the majority, i.e. it cannot switch (balance) to another endpoint. This could happen when the client runs on the same etcd instance/node or in the same failure domain/zone. This is in line with what is discussed here and is an open issue according to the Watch() docs:

// TODO(v3.4): configure watch retry policy, limit maximum retry number
// (see https://github.com/coreos/etcd/issues/8980)

I see the same behavior with the etcdctl CLI. For reference, this is my client setup with the corresponding timeouts:

// Packages used (assuming the v3.4 client): context, log, time,
// go.etcd.io/etcd/clientv3, google.golang.org/grpc.
cli, err := clientv3.New(clientv3.Config{
	Endpoints:            []string{"etcd1:2379", "etcd2:2379", "etcd3:2379"},
	DialTimeout:          2 * time.Second,
	DialKeepAliveTime:    5 * time.Second,
	DialKeepAliveTimeout: 2 * time.Second,
	DialOptions:          []grpc.DialOption{grpc.WithBlock()},
})
if err != nil {
	// With grpc.WithBlock(), dial errors surface here rather than later.
	log.Fatal(err)
}

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

leaderCtx := clientv3.WithRequireLeader(ctx)
rch := cli.Watch(leaderCtx, "", clientv3.WithPrefix())
[...]

It would be great to update the docs for WithRequireLeader(), as they can be misread. Having a retry policy for Watch(), as discussed here and mentioned in the TODO comment, would also be great.

For more details, please also see kubernetes/kubernetes#89488.

k8s-publishing-bot pushed a commit to kubernetes/apiserver that referenced this issue Apr 8, 2020
Watches against etcd in the API server can hang forever if the etcd
cluster loses quorum, e.g. the majority of nodes crashes. This fix
improves responsiveness (detection and reaction time) of API server
watches against etcd in some rare (but still possible) edge cases so
that watches are terminated with `"etcdserver: no leader"
(ErrNoLeader)`.

Implementation behavior described by jingyih:

```
The etcd server waits until it cannot find a leader for 3 election
timeouts to cancel existing streams. 3 is currently a hard coded
constant. The election timeout defaults to 1000ms.

If the cluster is healthy, when the leader is stopped, the leadership
transfer should be smooth. (leader transfers its leadership before
stopping). If leader is hard killed, other servers will take an election
timeout to realize leader lost and start campaign.
```

For further details, discussion and validation see
kubernetes/kubernetes#89488 (comment)
and etcd-io/etcd#8980.

Closes: kubernetes/kubernetes#89488

Signed-off-by: Michael Gasch <mgasch@vmware.com>

Kubernetes-commit: 70c9f770d7aa2194bfd3f58fe01756a7d200b866
stale bot commented Jun 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@chaochn47 (Member) commented Jul 1, 2023

While digging through this old issue, I found that the symptom described by @embano1 is not reproducible, at least on the main branch. I tested manually, end to end, by injecting iptables rules across a 3-node etcd cluster, with the client running both remotely and locally (pinned to the server node the watch connects to).

Moreover, etcd has TestWatchWithRequireLeader and TestBalancerUnderNetworkPartitionWatchFollower to ensure the watch is cancelled with the etcdserver: no leader cancel reason, prompting the client to reconnect to an endpoint with quorum after around 3 seconds.

Implementation:

if noLeaderCnt >= maxNoLeaderCnt {
	smap.mu.Lock()
	for ss := range smap.streams {
		if ssWithCtx, ok := ss.(serverStreamWithCtx); ok {
			ssWithCtx.ctx.Cancel(rpctypes.ErrGRPCNoLeader)
			<-ss.Context().Done()
		}
	}
	smap.streams = make(map[grpc.ServerStream]struct{})
	smap.mu.Unlock()
}

However, the original issue is not resolved: if a single-node etcd is abruptly stopped, the client watch will hang forever.

A retry option should be provided to users for watch: if the watcher cannot reconnect for, say, 5 seconds, the watcher can return an error.

That looks promising, and it could be optional for users to enable.
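
For completeness, since v3.4 a client can approximate this itself with manual progress requests. A rough sketch (names and intervals are illustrative, not an etcd option): periodically call RequestProgress and fail the watch if neither events nor progress notifications arrive within a grace period:

```go
import (
	"context"
	"errors"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchWithLiveness fails the watch if neither events nor requested
// progress notifications arrive within `grace`.
func watchWithLiveness(ctx context.Context, cli *clientv3.Client, prefix string, probe, grace time.Duration) error {
	ctx, cancel := context.WithCancel(clientv3.WithRequireLeader(ctx))
	defer cancel()

	rch := cli.Watch(ctx, prefix, clientv3.WithPrefix())
	ticker := time.NewTicker(probe)
	defer ticker.Stop()

	lastSeen := time.Now()
	for {
		select {
		case wresp, ok := <-rch:
			if !ok {
				return errors.New("watch channel closed")
			}
			if err := wresp.Err(); err != nil {
				return err
			}
			lastSeen = time.Now() // events and progress notifies both count as liveness
			for _, ev := range wresp.Events {
				log.Printf("%s %q", ev.Type, ev.Kv.Key)
			}
		case <-ticker.C:
			if time.Since(lastSeen) > grace {
				return errors.New("no watch progress: etcd looks unreachable")
			}
			// Best-effort probe; the reply shows up on rch as a progress notify.
			_ = cli.RequestProgress(ctx)
		}
	}
}
```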
