Network glitch reducing etcd cluster availability seriously #7321
It seems that issue #4606 is related to this one.
1. If a member is partitioned from the rest of the network, a linearized Get shouldn't succeed; it'll keep retrying. I suspect the auto sync is causing the balancer to select a connection to a new member which is letting the request eventually go through.
2. The client doesn't do this yet. It will only detect a disconnect between the client and the server; it won't blacklist partitioned members.
3. See 1 & 2.
4. Partitioned endpoint blacklisting won't solve this entirely. The client should follow the same path as losing a connection between the client and server if the lease expires.
I am sure it has nothing to do with auto sync. At first I did not set 'AutoSyncInterval', and the request still went through after about 16 minutes.
Perhaps I wasn't clear about what I meant; let me clear this up a bit. First, I call GrantLease and get a lease with a TTL of 15s. Then I have a goroutine call KeepAliveOnce (with a timeout context) every 5s to keep the lease from expiring. If KeepAliveOnce fails, it just retries until the lease expires. So if the client could blacklist partitioned members in time, KeepAliveOnce would succeed and the lease would stay alive. I think that's the key point for solving the problem.
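To make that scenario concrete, here is a minimal sketch of the loop being described: a 15s lease kept alive by calling KeepAliveOnce every 5s with a per-call timeout. The endpoint addresses are placeholders, the structure is illustrative rather than the original code, and the import path is the one used at the time of this thread (newer releases use go.etcd.io/etcd/client/v3).

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// keepLeaseAlive renews the lease every 5s until ctx is cancelled. A failed
// renewal is simply retried on the next tick, so if the client stays stuck on
// a partitioned member for longer than the 15s TTL, the lease expires anyway.
func keepLeaseAlive(ctx context.Context, cli *clientv3.Client, id clientv3.LeaseID) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			reqCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
			_, err := cli.KeepAliveOnce(reqCtx, id)
			cancel()
			if err != nil {
				log.Printf("KeepAliveOnce failed, retrying on next tick: %v", err)
			}
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Placeholder addresses, one per etcd member.
		Endpoints:   []string{"http://node1:2379", "http://node2:2379", "http://node3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Grant a lease with a 15s TTL, as in the scenario above.
	grant, err := cli.Grant(context.Background(), 15)
	if err != nil {
		log.Fatal(err)
	}
	go keepLeaseAlive(context.Background(), cli, grant.ID)

	select {} // block forever; real code would do useful work here
}
```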
OK, then it's getting disconnected somehow from some lower layer and choosing a new connection.
No, that is still not robust. The application must handle lease expiration that's out of its control, otherwise it won't gracefully failover when there is client network loss or arbitrary client delay. Blacklisting does not completely solve it.
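One way to illustrate handling expiration that is outside the client's control is a sketch (not from this thread) that uses the streaming KeepAlive and treats the closing of its response channel as "lease lost", forcing the application to re-register its state instead of trusting the old lease. `maintainLease` is a made-up helper name, and `cli` is assumed to be a client built as elsewhere in this thread.

```go
package example

import (
	"context"
	"fmt"

	"github.com/coreos/etcd/clientv3"
)

// maintainLease keeps the lease alive with the streaming KeepAlive and returns
// once the response channel closes. At that point the application must assume
// the lease is gone (or soon will be) and rebuild whatever depended on it.
func maintainLease(ctx context.Context, cli *clientv3.Client, id clientv3.LeaseID) error {
	ch, err := cli.KeepAlive(ctx, id)
	if err != nil {
		return err
	}
	for range ch {
		// Each message is a successful renewal; nothing to do here.
	}
	// Channel closed: ctx was cancelled, the keep-alive stream broke for too
	// long, or the lease expired. All cases are treated as "lease lost".
	return fmt.Errorf("lease %x lost; re-establish application state", id)
}
```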
@heyitsanthony Could you give me some simple suggestions for handling a broken network? How should the client deal with it effectively?
@shuting-yst the easiest workaround now through the clientv3 client is to have a watch open with WithRequireLeader, so the client gets an error when the member it is connected to loses its leader.
Sorry, I don't think that method works:
That's not how it works.
@heyitsanthony sorry, I did an experiment to show that it does not work. The Watch operation hangs just like other operations such as Get. Simple code below:
In addition, I did another experiment. When an etcd node's network is broken and the Get operation fails with 'context deadline exceeded', I get the endpoints by calling client.Endpoints() and use client.SetEndpoints() to set them back, but the client doesn't cycle through the endpoints and the Get operation never succeeds no matter how long it retries. client.Endpoints() always returns the value I set when creating the client; it never updates. Finally, I tried one client per endpoint (each client connected to just one etcd member) and got an equally sad result. The steps:
Any other help? I'm looking forward to hearing from you.
This does not "cycle through the endpoints until it finds an active leader":

```go
eps := client.Endpoints()
logger.Infof("The current endpoints is %v", eps)
client.SetEndpoints(eps...)
```

It would need to set the endpoints one by one, not all at once; setting them all at once is treated as a no-op since the endpoint list doesn't change.
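A rough sketch of the one-by-one switching being suggested; `tryEachEndpoint` and the probe key are made-up names for illustration, not an etcd API:

```go
package example

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// tryEachEndpoint pins the client to one endpoint at a time and probes it with
// a short linearized Get; a member that serves the Get can reach the leader.
// The probe key is arbitrary and only used to exercise the connection.
func tryEachEndpoint(cli *clientv3.Client, eps []string) error {
	for _, ep := range eps {
		cli.SetEndpoints(ep) // pin to a single endpoint
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		_, err := cli.Get(ctx, "health-probe-key")
		cancel()
		if err == nil {
			return nil // this endpoint is reachable and backed by a leader
		}
		log.Printf("endpoint %s not usable: %v", ep, err)
	}
	return errors.New("no usable endpoint found")
}
```

Because SetEndpoints with the unchanged full list is a no-op, pinning to a single endpoint at a time is what actually forces the client to dial somewhere else.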
@heyitsanthony I tried setting the endpoints one by one; it does not work, same as before. The code is:
The result is:
I think this is a really serious bug.
@shuting-yst this is still with the 3.1.0 client? I see a bug in the balancer on master that breaks endpoint switching if the connection can be established (#7392). I'll write some test cases tomorrow. |
Yes, it's with version v3.1.0. Thanks! I'm waiting for your response.
@shuting-yst Thanks for your patience and for creating tests on this issue. If you feel confident and comfortable, you can actually contribute to the etcd project and help us fix the issue next time.
@xiang90 I will be focusing on projects related to etcd for a long time, and I'd be very glad to help fix this with your guidance.
Sorry, my mistake. I fixed the ConnectNet method and redid all the experiments; the results:
Huh? WithRequireLeader will detect if the server is partitioned from the rest of the cluster; that is precisely what it is designed to do. The server knows if it loses its leader since it will eventually stop receiving heartbeats (this is the same mechanism that triggers leader election). If the client can connect to the server it will be notified that the leader is lost when using WithRequireLeader. For an example of how to properly use WithRequireLeader to detect leader loss please see https://github.com/coreos/etcd/blob/master/proxy/grpcproxy/leader.go |
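A condensed sketch of that pattern: a watch on a key that is never expected to change, opened with WithRequireLeader purely so the server cancels it with a "no leader" error when the connected member loses its leader. The helper and key names are hypothetical; see the linked leader.go for the real implementation.

```go
package example

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// watchLeaderLoss blocks until the member this client is connected to reports
// that it has lost its leader (or until ctx is cancelled).
func watchLeaderLoss(ctx context.Context, cli *clientv3.Client) error {
	wctx := clientv3.WithRequireLeader(ctx)
	for wresp := range cli.Watch(wctx, "leader-canary-key") {
		if err := wresp.Err(); err != nil {
			return err // e.g. "etcdserver: no leader"
		}
	}
	return ctx.Err()
}
```

When this returns, the application can switch endpoints or fail over, the same way it would after losing its connection to the server.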
@heyitsanthony Is there any update on this issue? I am hitting a similar problem. When the etcd leader node goes down, clients on other nodes fail to access etcd even after a new leader is elected.
@shilpamayanna that's not what this issue is about. If the current leader goes down and the clients are configured for multiple endpoints, then they will reconnect to different members. This issue is about clients connecting to partitioned members. The fix for this issue is slated for the 3.3 release. |
@heyitsanthony I guess it is a similar issue. The leader node is down, either because it was partitioned from the network or because the node was rebooted. At this point, a client on another node that is configured with multiple endpoints cannot access etcd; it has to be reconnected from the application. etcd version 3.0.17.
@shilpamayanna please file a separate issue explaining how to reproduce the problem. Thanks! |
Superseded by #8660. |
Hi,
I hit a problem using the etcd API when the network is not stable: I use a single etcd client (with 3 endpoints, one for each of the 3 etcd nodes) to get some value from the cluster. But if one of the etcd nodes is on a broken network, the request hangs for at least 16 minutes before it finally succeeds.
It's easy to reproduce:
The code looks like this:
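Roughly, with placeholder addresses and key, a sketch of the setup described above (a client with one endpoint per member, issuing a Get with context.Background()):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// One endpoint per etcd member; addresses are placeholders.
		Endpoints:   []string{"http://node1:2379", "http://node2:2379", "http://node3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Break the network of the member the client happens to be connected to,
	// then issue a linearized Get; with context.Background() this call can
	// block for a very long time instead of failing over to another member.
	resp, err := cli.Get(context.Background(), "foo")
	if err != nil {
		log.Fatalf("Get failed: %v", err)
	}
	log.Printf("Get succeeded: %v", resp.Kvs)
}
```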
Start the etcd containers like this:
When I change context.Background() to a context.WithTimeout of 5*time.Second, the result looks like this:
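That change amounts to something like the following, continuing the sketch above (`getWithDeadline` is a made-up name):

```go
package example

import (
	"context"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// getWithDeadline issues the same Get as in the sketch above, but with a 5s
// deadline, so the call fails quickly with "context deadline exceeded"
// instead of hanging while the client is stuck on the partitioned member.
func getWithDeadline(cli *clientv3.Client) (*clientv3.GetResponse, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return cli.Get(ctx, "foo")
}
```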
Questions:
etcd version: 3.1.0
In my scenario, I grant a lease and call KeepAliveOnce (with a 5s-timeout context) for the lease, but the etcd client doesn't switch to another etcd node when I retry KeepAliveOnce, so the lease expires. It's horrible! How should I deal with it? Please help.