
Watch does not fail on complete loss of connectivity #8980

Closed

devnev opened this issue Dec 6, 2017 · 14 comments
@devnev commented Dec 6, 2017

Steps to reproduce:

  • Build etcd and etcdctl from master, or take the latest release
  • Run etcd
  • Run ETCDCTL_API=3 etcdctl watch --prefix ""
  • SIGSTOP etcd, e.g. using Ctrl+Z in the terminal

Expected outcome:
After some timeout, the watch will fail. This would allow the caller to propagate an error further up the stack, telling clients to retry elsewhere where there might be a healthy connection.

Observed outcome:
The watch blocks forever.

@gyuho (Contributor) commented Dec 6, 2017

telling clients to retry elsewhere where there might be a healthy connection.

This is already working (since v3.2.10+ and on the master branch, pre-v3.3.0).

You are running a single node, so the watch client is expected to stay stuck on that single node and keep retrying, even with keepalive settings (e.g. ETCDCTL_API=3 etcdctl --debug --keepalive-time=5s --keepalive-timeout=5s watch --prefix "").

As of today (pre-v3.3.0, gRPC v1.7.4), the retry logic is entirely manual: just try again on transient failure. This was partly because gRPC keepalive did not expose the connection state (which seems resolved now; see grpc/grpc-go#1444). In the future, we may improve this for the single-node case.
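
For reference, the same keepalive behaviour can be configured from a Go client. A minimal sketch (the endpoint and durations are illustrative, using the pre-v3.4 import path):

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Client-side keepalive pings the endpoint every 5s and drops the
	// connection if no response arrives within 5s; the balancer then
	// retries or switches endpoints instead of surfacing a watch error.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:            []string{"127.0.0.1:2379"},
		DialTimeout:          5 * time.Second,
		DialKeepAliveTime:    5 * time.Second, // like --keepalive-time=5s
		DialKeepAliveTimeout: 5 * time.Second, // like --keepalive-timeout=5s
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```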

@gyuho closed this as completed Dec 6, 2017
@devnev (Author) commented Dec 6, 2017

Sorry, but it's not clear to me what you're saying. Is it intended behaviour that the broken watch just sits there doing nothing?

I downloaded v3.2.11, which exhibits the same problematic behaviour with my command and does not support the keepalive flags in yours.

I do not understand what you mean by "the retry logic is entirely manual: just try again on transient failure". Manual in what sense? What counts as a transient failure?

To be clear, the behaviour I expect to see at some point is: given the correct connection timeout options, a watch that gets disconnected will fail and return to the caller.

@xiang90 (Contributor) commented Dec 6, 2017

A temporary connection issue should not fail the watcher unless you explicitly specify a timeout (but the timeout will also fail long-running healthy watchers, so probably no one would want to do that) or a keepalive mechanism to kill it.
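
For illustration, a minimal sketch of the explicit-timeout approach, assuming an already configured *clientv3.Client named cli; note that the deadline also ends a perfectly healthy long-running watch, which is the trade-off described above:

```go
import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func watchWithDeadline(cli *clientv3.Client) {
	// The watch is bounded by an explicit deadline; it ends after 30s
	// whether or not the connection is healthy.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	for wresp := range cli.Watch(ctx, "foo", clientv3.WithPrefix()) {
		for _, ev := range wresp.Events {
			log.Printf("%s %q -> %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
	// The channel closes once the deadline expires; ctx.Err() carries the reason.
	log.Printf("watch ended: %v", ctx.Err())
}
```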

@xiang90 (Contributor) commented Dec 6, 2017

After some timeout, the watch will fail. This would allow the caller to propagate an error further up the stack, telling clients to retry elsewhere where there might be a healthy connection.

You should provide the etcd client all available endpoints, and the client is expected to reconnect by itself. If you want to control reconnection yourself, you basically need to build your own balancer, which is a non-trivial effort to get right.

@devnev (Author) commented Dec 6, 2017

or a keepalive mechanism to kill it.

This is exactly what I'm looking for and can't find. Trying your etcdctl watch command with all the timeouts, the command does not terminate when I SIGSTOP etcd.

Edit: this is with etcd/etcdctl built from master.

@devnev (Author) commented Dec 6, 2017

You should provide the etcd client all available endpoints, and the client is expected to reconnect by itself. If you want to control reconnection yourself, you basically need to build your own balancer, which is a non-trivial effort to get right.

I do not want to control reconnection. I want my own service's requests to fail when the service is completely unable to reach etcd, because in that state the service cannot operate correctly.

@devnev (Author) commented Dec 7, 2017

Hi - any updates? As far as I can tell this is still unresolved, despite the rather premature closing of the issue.

@gyuho (Contributor) commented Dec 7, 2017

@devnev Sorry! I was unclear, and we need better docs around this.

Is it intended behaviour that the broken watch just sits there doing nothing?

It's retrying the single endpoint, since your command only runs a single-node cluster. Again, if you want balancer failover, provide the other endpoints via the --endpoints flag.

the command does not terminate when I SIGSTOP etcd.

This is expected. The watch command won't exit on connection loss or server shutdown. Keepalive in watch was added to detect connection failures and trigger endpoint switches, not to error out. We will document this clearly in the v3.3 release.

the service is completely unable to reach etcd

On the client side, there's no way to tell whether the failure is transient or permanent. That's why the client retries among the available endpoints, in case the server comes back.

When the etcd server goes down, the gRPC client receives rpc error: code = Unavailable desc = transport is closing. Its error code is Unavailable, indicating a transient failure from gRPC's point of view. When the client receives an obvious server error (e.g. etcdserver: mvcc: required revision has been compacted), the watch client terminates. We might expose the connection state to the client (tracked in #8776), so that the client can decide whether to retry depending on the connection state.
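
To make that concrete, a small consumer-side sketch (assuming a configured *clientv3.Client named cli): transport-level Unavailable errors are retried internally and never surface on the watch channel, while terminal server errors such as compaction arrive as a final canceled response:

```go
import (
	"context"
	"log"

	"github.com/coreos/etcd/clientv3"
)

func consume(cli *clientv3.Client) {
	rch := cli.Watch(context.Background(), "foo", clientv3.WithPrefix())
	for wresp := range rch {
		if wresp.Canceled {
			// Terminal server-side error, e.g. the compacted-revision error above.
			log.Printf("watch canceled: %v (compact revision: %d)", wresp.Err(), wresp.CompactRevision)
			return
		}
		for _, ev := range wresp.Events {
			log.Printf("%s %q", ev.Type, ev.Kv.Key)
		}
	}
	// Transient transport failures are retried internally and never close the channel here.
}
```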

Again, watch is not really meant for detecting disconnects. If you want to detect connection loss, I would try concurrency.Session with a TTL. Example usage in the lock command:

https://github.com/coreos/etcd/blob/7e0fc6136eb4c5dc3c8c39ff48eb29e87b7158ab/etcdctl/ctlv3/command/lock_command.go#L56
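
Along those lines, a rough sketch of pairing a watch with a concurrency.Session (the function name and 5-second TTL are illustrative): the session's lease keepalives stop succeeding once connectivity is lost, and its Done() channel closes after roughly the TTL, giving the caller an error to propagate:

```go
import (
	"context"
	"errors"
	"log"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func watchOrFail(cli *clientv3.Client, prefix string) error {
	// The session keeps a 5s-TTL lease alive in the background.
	s, err := concurrency.NewSession(cli, concurrency.WithTTL(5))
	if err != nil {
		return err
	}
	defer s.Close()

	rch := cli.Watch(clientv3.WithRequireLeader(context.Background()), prefix, clientv3.WithPrefix())
	for {
		select {
		case wresp, ok := <-rch:
			if !ok {
				return errors.New("watch channel closed")
			}
			if err := wresp.Err(); err != nil {
				return err
			}
			for _, ev := range wresp.Events {
				log.Printf("%s %q", ev.Type, ev.Kv.Key)
			}
		case <-s.Done():
			// Lease keepalives stopped arriving: treat etcd as unreachable.
			return errors.New("etcd session expired: connectivity lost")
		}
	}
}
```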

@devnev (Author) commented Dec 7, 2017

Thank you for the clarification and the workaround.

As a final note, even with multiple etcd members, there can be a total loss of connectivity to the etcd cluster (in our case caused by an as-yet-undiagnosed routing issue affecting only specific Kubernetes pods). One of the benefits I expected from watches was receiving updates quickly, making the system more reactive. If a watch gets completely disconnected, the system loses this property. I am glad to hear there's a workaround using sessions, but I will now have to wrap every single one of my watches in a session to ensure the reactivity of the system.

@gyuho (Contributor) commented Dec 7, 2017

As a final note, even with multiple etcd members, there can be a total loss of connectivity to the etcd cluster (in our case caused by an as-yet-undiagnosed routing issue affecting only specific Kubernetes pods).

Agreed. We will address this in the next release cycle (hopefully v3.4), and document the current watch limitations in the v3.3 release.

@xiang90 (Contributor) commented Dec 7, 2017

@devnev @gyuho

A retry option should be provided to users for watch: if the watcher cannot reconnect for, say, 5 seconds, the watcher can return an error.
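
Until such an option exists, one client-side approximation (a sketch only, not an etcd API; the function name and the budget are illustrative) is to monitor the underlying gRPC connection state, which recent gRPC versions expose, and cancel the watch context if the client stays disconnected past the budget:

```go
import (
	"context"
	"time"

	"github.com/coreos/etcd/clientv3"
	"google.golang.org/grpc/connectivity"
)

// cancelWatchOnDisconnect cancels `cancel` (and hence any watch created from
// the associated context) if the client's connection stays down for `budget`.
func cancelWatchOnDisconnect(ctx context.Context, cancel context.CancelFunc, cli *clientv3.Client, budget time.Duration) {
	conn := cli.ActiveConnection()
	for {
		state := conn.GetState()
		if state == connectivity.TransientFailure || state == connectivity.Shutdown {
			waitCtx, waitCancel := context.WithTimeout(ctx, budget)
			recovered := conn.WaitForStateChange(waitCtx, state)
			waitCancel()
			if !recovered {
				cancel() // still disconnected after the budget: fail the watch
				return
			}
			continue
		}
		if !conn.WaitForStateChange(ctx, state) {
			return // parent context canceled
		}
	}
}
```

Such a monitor would run in its own goroutine alongside the watch that shares the same context.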

@gyuho added this to the v3.4.0 milestone Dec 7, 2017
jgraettinger pushed a commit to gazette/core that referenced this issue Feb 8, 2018
For an etcd Watch with context, as long as the context hasn't been
canceled or timed out, the Watch will retry on recoverable errors
forever until connected. This includes Watches during significant
etcd connection loss, which can result in missed events.

To upper bound the effects of such missed events, use a periodic full refresh
of the full tree to interrupt every Watch, preventing any long-lived Watch.

Currently the periodic full tree refresh is 10 minutes.

Since this affects all Watches, we'll have to keep an eye on etcd
performance and configure the period accordingly.

See etcd-io/etcd#8980.

Also make sure to return all etcd errors.
@gyuho modified the milestones: etcd-v3.4, etcd-v3.5 Aug 5, 2019
@embano1 commented Mar 31, 2020

Thank you for this useful discussion. I am seeing the same behavior when simulating partitions (e.g. cutting the [virtual] network connection or using iptables). Even though my client had all cluster endpoints and tight timeout options configured, Watch() (even with WithRequireLeader) would hang forever. The underlying gRPC connection remains in TRANSIENT_FAILURE and the corresponding watch channel is not closed even though lost leader == true.

My observation only holds when the client itself is partitioned from the majority, i.e. it cannot switch (balance) to another endpoint. This could happen when the client runs on the same etcd instance/node or in the same failure domain/zone. This is in line with what is discussed here and is an open issue according to the Watch() docs:

// TODO(v3.4): configure watch retry policy, limit maximum retry number
// (see https://github.com/coreos/etcd/issues/8980)

I see the same behavior with the etcdctl CLI. For reference, this is my client setup with the corresponding timeouts:

// Packages used (assuming the v3.4 client): context, log, time,
// go.etcd.io/etcd/clientv3, google.golang.org/grpc.
cli, err := clientv3.New(clientv3.Config{
	Endpoints:            []string{"etcd1:2379", "etcd2:2379", "etcd3:2379"},
	DialTimeout:          2 * time.Second,
	DialKeepAliveTime:    5 * time.Second,
	DialKeepAliveTimeout: 2 * time.Second,
	DialOptions:          []grpc.DialOption{grpc.WithBlock()},
})
if err != nil {
	// With grpc.WithBlock(), dial errors surface here rather than later.
	log.Fatal(err)
}

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

leaderCtx := clientv3.WithRequireLeader(ctx)
rch := cli.Watch(leaderCtx, "", clientv3.WithPrefix())
[...]

It would be great to update the docs for WithRequireLeader(), as they can be misread. Having a retry policy for Watch(), as discussed here and mentioned in the TODO comment, would also be great.

For more details, please also see kubernetes/kubernetes#89488.

k8s-publishing-bot pushed a commit to kubernetes/apiserver that referenced this issue Apr 8, 2020
Watches against etcd in the API server can hang forever if the etcd
cluster loses quorum, e.g. the majority of nodes crashes. This fix
improves responsiveness (detection and reaction time) of API server
watches against etcd in some rare (but still possible) edge cases so
that watches are terminated with `"etcdserver: no leader"
(ErrNoLeader)`.

Implementation behavior described by jingyih:

```
The etcd server waits until it cannot find a leader for 3 election
timeouts to cancel existing streams. 3 is currently a hard coded
constant. The election timeout defaults to 1000ms.

If the cluster is healthy, when the leader is stopped, the leadership
transfer should be smooth. (leader transfers its leadership before
stopping). If leader is hard killed, other servers will take an election
timeout to realize leader lost and start campaign.
```

For further details, discussion and validation see
kubernetes/kubernetes#89488 (comment)
and etcd-io/etcd#8980.

Closes: kubernetes/kubernetes#89488

Signed-off-by: Michael Gasch <mgasch@vmware.com>

Kubernetes-commit: 70c9f770d7aa2194bfd3f58fe01756a7d200b866
stale bot commented Jun 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@chaochn47 (Member) commented Jul 1, 2023

While digging through this old issue, I found that the symptom described by @embano1 is not reproducible, at least on the main branch. I tested manually, end to end, by injecting iptables rules across a 3-node etcd cluster, with the client running both remotely and locally (pinned to the server node the watch connects to).

Moreover, etcd has TestWatchWithRequireLeader and TestBalancerUnderNetworkPartitionWatchFollower to ensure the watch is cancelled with the etcdserver: no leader cancel reason, prompting the client to reconnect to an endpoint with quorum after around 3 seconds.

Implementation:

if noLeaderCnt >= maxNoLeaderCnt {
	smap.mu.Lock()
	for ss := range smap.streams {
		if ssWithCtx, ok := ss.(serverStreamWithCtx); ok {
			ssWithCtx.ctx.Cancel(rpctypes.ErrGRPCNoLeader)
			<-ss.Context().Done()
		}
	}
	smap.streams = make(map[grpc.ServerStream]struct{})
	smap.mu.Unlock()
}

However, the original issue is not resolved: if a single-node etcd is abruptly stopped, the client watch will hang forever.

A retry option should be provided to users for watch: if the watcher cannot reconnect for, say, 5 seconds, the watcher can return an error.

That looks promising, and it could be optional for users to enable.
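
For completeness, since v3.4 a client can approximate this itself with manual progress requests. A rough sketch (names and intervals are illustrative, not an etcd option): periodically call RequestProgress and fail the watch if neither events nor progress notifications arrive within a grace period:

```go
import (
	"context"
	"errors"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchWithLiveness fails the watch if neither events nor requested
// progress notifications arrive within `grace`.
func watchWithLiveness(ctx context.Context, cli *clientv3.Client, prefix string, probe, grace time.Duration) error {
	ctx, cancel := context.WithCancel(clientv3.WithRequireLeader(ctx))
	defer cancel()

	rch := cli.Watch(ctx, prefix, clientv3.WithPrefix())
	ticker := time.NewTicker(probe)
	defer ticker.Stop()

	lastSeen := time.Now()
	for {
		select {
		case wresp, ok := <-rch:
			if !ok {
				return errors.New("watch channel closed")
			}
			if err := wresp.Err(); err != nil {
				return err
			}
			lastSeen = time.Now() // events and progress notifies both count as liveness
			for _, ev := range wresp.Events {
				log.Printf("%s %q", ev.Type, ev.Kv.Key)
			}
		case <-ticker.C:
			if time.Since(lastSeen) > grace {
				return errors.New("no watch progress: etcd looks unreachable")
			}
			// Best-effort probe; the reply shows up on rch as a progress notify.
			_ = cli.RequestProgress(ctx)
		}
	}
}
```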
