etcd client v3 can switch addr in endpoints like v2? #7941

Closed
xiaoyulei opened this issue May 17, 2017 · 31 comments

@xiaoyulei
Contributor

The etcd v2 client switches to another address in its endpoint list when a request fails. I am using etcd v3, and it does not appear to switch on failure.

Is there a parameter I need to set to enable this?

@gyuho
Contributor

gyuho commented May 17, 2017

@gyuho gyuho closed this as completed May 17, 2017
@heyitsanthony
Contributor

What sort of request failures?

@xiaoyulei
Contributor Author

xiaoyulei commented May 18, 2017

@gyuho I passed three endpoints when I created the client. Do I need to call SetEndpoints again?

@xiaoyulei
Contributor Author

xiaoyulei commented May 18, 2017

@heyitsanthony I passed three endpoints when I created the client. After I shut down the network on one of the etcd servers, requests from the v3 client kept failing.

@heyitsanthony
Contributor

heyitsanthony commented May 18, 2017

@YuleiXiao what version of etcd server / client? Steps to reproduce? What errors? clientv3 has test cases for this; it should reconnect to another node if one is down...

@heyitsanthony heyitsanthony reopened this May 18, 2017
@xiaoyulei
Contributor Author

xiaoyulei commented May 18, 2017

@heyitsanthony

  1. The client and server versions are both 3.1.5.
  2. There are three etcd server nodes and one client, each on a different machine. The client writes or reads in a loop.
    1. Run ifconfig eth1 down on the etcd leader's machine to disable its network card.
    2. The client reports "context deadline exceeded" for about 15 minutes and then succeeds again. Why does the client not connect to one of the other etcd nodes during those 15 minutes?

In my tests, the problem is more likely to occur when I create the client with three endpoints and auto-sync disabled.
When I create the client with only one endpoint and auto-sync enabled, it has not happened so far. Is this related to auto-sync?
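
The failover expected here can be sketched with the Go standard library alone. This is only an illustration under stated assumptions, not how the clientv3 balancer is implemented: dialFirstHealthy and runDemo are hypothetical names, and the one-second per-attempt timeout is an arbitrary choice. Each endpoint gets one bounded dial attempt, and a failure moves on to the next address instead of retrying the same one.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialFirstHealthy tries each endpoint in order and returns the first
// connection established within perAttempt, together with its address.
// Hypothetical helper for illustration; it is not part of clientv3.
func dialFirstHealthy(endpoints []string, perAttempt time.Duration) (net.Conn, string, error) {
	if len(endpoints) == 0 {
		return nil, "", fmt.Errorf("no endpoints given")
	}
	var lastErr error
	for _, ep := range endpoints {
		conn, err := net.DialTimeout("tcp", ep, perAttempt)
		if err != nil {
			lastErr = fmt.Errorf("%s: %w", ep, err)
			continue // move on instead of retrying the same address
		}
		return conn, ep, nil
	}
	return nil, "", fmt.Errorf("all endpoints failed, last error: %w", lastErr)
}

// runDemo opens two local listeners, closes one to simulate a node that is
// down, and returns the endpoint the helper picked plus the live address.
func runDemo() (string, string, error) {
	deadLn, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	liveLn, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		deadLn.Close()
		return "", "", err
	}
	defer liveLn.Close()
	dead, live := deadLn.Addr().String(), liveLn.Addr().String()
	deadLn.Close() // nothing listens on this address any more

	conn, picked, err := dialFirstHealthy([]string{dead, live}, time.Second)
	if err != nil {
		return "", live, err
	}
	conn.Close()
	return picked, live, nil
}

func main() {
	picked, live, err := runDemo()
	fmt.Println("picked:", picked, "live:", live, "err:", err)
}
```

A closed local port is refused immediately, which is the easy case. A machine whose interface is taken down does not answer at all, so the per-attempt timeout is what bounds the wait in that situation.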

@fanminshi
Member

@YuleiXiao could you try https://github.com/coreos/etcd/releases/tag/v3.2.0-rc.1 to see if that solves your issue? There have been some bug fixes related to SetEndpoints since 3.1.5.

@xiaoyulei
Contributor Author

@fanminshi I tried 3.2 and see the same problem.

@adityadani

I am facing a similar issue: the client reports context deadline exceeded if one of the etcd server nodes goes down.
With the etcd 3.2 release:

  1. Does clientv3 automatically switch endpoints?
  2. Or is the client expected to call SetEndpoints() with the updated set of endpoints?

@fanminshi
Member

@adityadani
Does clientv3 automatically switch endpoints?
Yes, clientv3 automatically switches to another endpoint if the endpoint it was using goes down.

@fanminshi
Member

fanminshi commented Jun 15, 2017

@adityadani could you easily reproduce the issue you were describing?

@adityadani

I am using clientv3 at v3.1.8.
I will try out v3.2.0 and will get back to you.
Thanks! @fanminshi

@mwf

mwf commented Sep 23, 2017

Hi, @fanminshi!
Is it expected that the automatic endpoint switch occurs only ~15 minutes after the failure? Why does it take so long? Can it be configured in some way?

If it matters, I'm talking about Watch() automatic failover.

@fanminshi
Member

@mwf 15 minutes seems too long. I expect it to switch automatically as soon as a connection issue is detected. What version of etcd are you using? Are you able to reproduce this issue easily? Also, could you test your code against the latest master branch? I guess #8545 should fix the endpoint issue you have.

@mwf

mwf commented Sep 23, 2017

Ohh, finally!

I hope #8545 will fix all the issues we have.
I'm curious whether it fixes the freezing Watch calls. I will test it next week :)

We are using 3.1.5 server (it should not matter here) and 3.2.5 client.

About 15 minutes seems to be the "standard" delay in all the issues people report about Watch freezes and network partitions with the current stable releases. I hope that changes in master. It is easily reproducible.

Thanks! If 15 minutes failover still exists in master I will let you know.

@gyuho
Contributor

gyuho commented Sep 28, 2017

Closing this in favor of #8022.

  1. We've added a health check to the client balancer (clientv3: health balancer #8545).
  2. clientv3 now supports HTTP/2 keepalive pings to detect connection issues (see *: add watch with client keepalive test #8626 for example usage).

This will be shipped in our upcoming v3.3 release.

@gyuho gyuho closed this as completed Sep 28, 2017
@xiang90
Contributor

xiang90 commented Sep 28, 2017

@YuleiXiao

Please give etcd/client master a try and let us know if the problem is fixed. Thank you.

@xiaoyulei
Contributor Author

@xiang90 OK, I will test it and report the results later. Thanks very much.

@xiaoyulei
Contributor Author

xiaoyulei commented Sep 30, 2017

@xiang90 It looks like the same problem.
There are three etcd nodes: 192.168.128.36:13000, 192.168.128.37:12998 and 192.168.128.38:12996.

I watch a key. The client connects to 192.168.128.37 first.

./etcdctl --endpoints=[192.168.128.36:13000,192.168.128.37:12998,192.168.128.38:12996] watch /test/watchkey --prev-kv=true -w fields --debug=true
ETCDCTL_CACERT=
ETCDCTL_CERT=
ETCDCTL_COMMAND_TIMEOUT=5s
ETCDCTL_DEBUG=true
ETCDCTL_DIAL_TIMEOUT=2s
ETCDCTL_DISCOVERY_SRV=
ETCDCTL_ENDPOINTS=[[192.168.128.36:13000,192.168.128.37:12998,192.168.128.38:12996]]
ETCDCTL_HEX=false
ETCDCTL_INSECURE_DISCOVERY=true
ETCDCTL_INSECURE_SKIP_TLS_VERIFY=false
ETCDCTL_INSECURE_TRANSPORT=true
ETCDCTL_KEY=
ETCDCTL_USER=
ETCDCTL_WRITE_OUT=fields
WARNING: 2017/09/30 10:51:19 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp: address 192.168.128.38:12996]: unexpected ']' in address"; Reconnecting to {192.168.128.38:12996] <nil>}
WARNING: 2017/09/30 10:51:19 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp: address 192.168.128.38:12996]: unexpected ']' in address"; Reconnecting to {192.168.128.38:12996] <nil>}
WARNING: 2017/09/30 10:51:19 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp: address [192.168.128.36:13000: missing ']' in address"; Reconnecting to {[192.168.128.36:13000 <nil>}
WARNING: 2017/09/30 10:51:19 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp: address [192.168.128.36:13000: missing ']' in address"; Reconnecting to {[192.168.128.36:13000 <nil>}
INFO: 2017/09/30 10:51:19 clientv3: balancer pins endpoint to 192.168.128.37:12998
WARNING: 2017/09/30 10:51:19 Failed to dial [192.168.128.36:13000: context canceled; please retry.
WARNING: 2017/09/30 10:51:19 Failed to dial [192.168.128.36:13000: context canceled; please retry.
WARNING: 2017/09/30 10:51:19 Failed to dial 192.168.128.38:12996]: context canceled; please retry.
WARNING: 2017/09/30 10:51:19 Failed to dial 192.168.128.38:12996]: context canceled; please retry.
"ClusterID" : 12582185902153712179
"MemberID" : 4283685097908224549
"Revision" : 1248
"RaftTerm" : 3
"Type" : PUT
"PrevKey" : "/test/watchkey"
"PrevCreateRevision" : 1241
"PrevModRevision" : 1247
"PrevVersion" : 7
"PrevValue" : "cc"
"PrevLease" : 0
"Key" : "/test/watchkey"
"CreateRevision" : 1241
"ModRevision" : 1248
"Version" : 8
"Value" : "dd"
"Lease" : 0

Shut down the network interface on 192.168.128.37:

sudo ifconfig eth5 down

Put a value. The put fails at first, and the watching client still does not switch to another server.

./etcdctl --endpoints=[192.168.128.36:13000,192.168.128.37:12998,192.168.128.38:12996] put /test/watchkey ee
Error: dial tcp: address 192.168.128.38:12996]: unexpected ']' in address

@xiang90
Contributor

xiang90 commented Sep 30, 2017

Error: dial tcp: address 192.168.128.38:12996]: unexpected ']' in address

it seems that the endpoint is malformed.

@xiaoyulei
Contributor Author

@xiang90 I think the format is OK, because the put succeeded before eth5 went down.

@gyuho
Contributor

gyuho commented Sep 30, 2017

@YuleiXiao Try without brackets in --endpoints flag?

./etcdctl --endpoints=192.168.128.36:13000,192.168.128.37:12998,192.168.128.38:12996

> format is OK because the put succeeded before eth5 went down

The WARNINGs in your output show that it didn't work.
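
The "missing ']' in address" and "unexpected ']' in address" messages in the log above are the wording Go's net.SplitHostPort uses for unbalanced brackets. A minimal sketch of checking an endpoint list before using it follows. parseEndpoints is a hypothetical helper, not an etcdctl function, and it assumes schemeless host:port entries like the ones in this thread (an http:// prefix would be rejected).

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseEndpoints splits a comma-separated endpoint list and rejects any entry
// that is not a plain host:port pair. Hypothetical helper for illustration.
func parseEndpoints(list string) ([]string, error) {
	var eps []string
	for _, ep := range strings.Split(list, ",") {
		ep = strings.TrimSpace(ep)
		// SplitHostPort reports stray or unbalanced brackets as errors.
		if _, _, err := net.SplitHostPort(ep); err != nil {
			return nil, fmt.Errorf("invalid endpoint %q: %w", ep, err)
		}
		eps = append(eps, ep)
	}
	return eps, nil
}

func main() {
	for _, list := range []string{
		"[192.168.128.36:13000,192.168.128.37:12998,192.168.128.38:12996]",
		"192.168.128.36:13000,192.168.128.37:12998,192.168.128.38:12996",
	} {
		eps, err := parseEndpoints(list)
		fmt.Println(eps, err)
	}
}
```

With the brackets, the first and last entries carry a stray bracket and are rejected. Only the middle entry is well formed, which matches the log line where the balancer pins 192.168.128.37:12998.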

@xiaoyulei
Contributor Author

xiaoyulei commented Oct 7, 2017

@gyuho I removed the brackets and it no longer reports a malformed endpoint, but the watching client still does not switch.

  1. Watch /test/watchkey:
./etcdctl --endpoints=192.168.128.33:13000,192.168.128.37:12998,192.168.128.38:12996 watch /test/watchkey --prev-kv=true -w fields --debug=true
ETCDCTL_CACERT=
ETCDCTL_CERT=
ETCDCTL_COMMAND_TIMEOUT=5s
ETCDCTL_DEBUG=true
ETCDCTL_DIAL_TIMEOUT=2s
ETCDCTL_DISCOVERY_SRV=
ETCDCTL_ENDPOINTS=[192.168.128.33:13000,192.168.128.37:12998,192.168.128.38:12996]
ETCDCTL_HEX=false
ETCDCTL_INSECURE_DISCOVERY=true
ETCDCTL_INSECURE_SKIP_TLS_VERIFY=false
ETCDCTL_INSECURE_TRANSPORT=true
ETCDCTL_KEY=
ETCDCTL_USER=
ETCDCTL_WRITE_OUT=fields
INFO: 2017/10/07 10:35:07 clientv3: balancer pins endpoint to 192.168.128.37:12998
  2. Log in to 192.168.128.37 and shut down eth5:
sudo ifconfig eth5 down
  3. Put a value. Nothing happens on the watch:
./etcdctl --endpoints=192.168.128.33:13000,192.168.128.37:12998,192.168.128.38:12996 put /test/watchkey bbb --debug=true
ETCDCTL_CACERT=
ETCDCTL_CERT=
ETCDCTL_COMMAND_TIMEOUT=5s
ETCDCTL_DEBUG=true
ETCDCTL_DIAL_TIMEOUT=2s
ETCDCTL_DISCOVERY_SRV=
ETCDCTL_ENDPOINTS=[192.168.128.33:13000,192.168.128.37:12998,192.168.128.38:12996]
ETCDCTL_HEX=false
ETCDCTL_INSECURE_DISCOVERY=true
ETCDCTL_INSECURE_SKIP_TLS_VERIFY=false
ETCDCTL_INSECURE_TRANSPORT=true
ETCDCTL_KEY=
ETCDCTL_USER=
ETCDCTL_WRITE_OUT=simple
INFO: 2017/10/07 10:36:54 clientv3: balancer pins endpoint to 192.168.128.38:12996
OK

@xiang90 xiang90 reopened this Oct 7, 2017
@xiang90
Contributor

xiang90 commented Oct 7, 2017

@gyuho can you take a look?

@gyuho gyuho self-assigned this Oct 7, 2017
@xiaoyulei
Contributor Author

@xiang90 @gyuho It looks like etcdctl does not set DialKeepAliveTime and DialKeepAliveTimeout. I will add them and try again.

@xiang90
Contributor

xiang90 commented Oct 8, 2017

@YuleiXiao Oh, that is probably the cause. Can you also send a PR to fix it if you can confirm it is the root cause?

@xiaoyulei
Contributor Author

@xiang90 sure

@xiaoyulei
Contributor Author

@gyuho @xiang90 It works after adding DialKeepAliveTime and DialKeepAliveTimeout. I will send a PR later.
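
Why the two settings matter can be shown with a small simulation. It is a sketch under assumptions, not gRPC's or clientv3's implementation: it takes DialKeepAliveTime to be the interval between pings and DialKeepAliveTimeout to be how long to wait for each reply, and watchConn, pingFunc, healthy, blackHoled and runDemo are hypothetical names. A peer whose interface was taken down never answers and never closes the connection, so only a probe with a deadline notices it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pingFunc is a hypothetical stand-in for one keepalive ping round trip.
type pingFunc func(ctx context.Context) error

// watchConn sends a probe every keepAliveTime and waits at most
// keepAliveTimeout for each reply. It returns the first probe error,
// or nil after maxProbes successful probes.
func watchConn(ping pingFunc, keepAliveTime, keepAliveTimeout time.Duration, maxProbes int) error {
	ticker := time.NewTicker(keepAliveTime)
	defer ticker.Stop()
	for i := 0; i < maxProbes; i++ {
		<-ticker.C
		ctx, cancel := context.WithTimeout(context.Background(), keepAliveTimeout)
		err := ping(ctx)
		cancel()
		if err != nil {
			// This is the point where a client would drop the
			// connection and move to another endpoint.
			return fmt.Errorf("probe %d failed: %w", i+1, err)
		}
	}
	return nil
}

// healthy answers every probe immediately.
func healthy(ctx context.Context) error { return nil }

// blackHoled never answers, like a peer whose interface was taken down
// without the connection being closed.
func blackHoled(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// runDemo reports whether each simulated connection was declared dead.
func runDemo() (healthyDead, blackHoledDead bool) {
	const interval, timeout = 10 * time.Millisecond, 20 * time.Millisecond
	healthyDead = watchConn(healthy, interval, timeout, 3) != nil
	blackHoledDead = watchConn(blackHoled, interval, timeout, 3) != nil
	return healthyDead, blackHoledDead
}

func main() {
	h, b := runDemo()
	fmt.Println("healthy declared dead:", h)
	fmt.Println("black-holed declared dead:", b)
}
```

The millisecond values only keep the demo fast. Real settings would be on the order of seconds.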

@xiang90
Contributor

xiang90 commented Oct 8, 2017

@YuleiXiao awesome. thank you.

@zyf0330

zyf0330 commented Nov 13, 2017

Does the etcd v3 Go client have this option?

@xiaoyulei
Contributor Author

xiaoyulei commented Dec 4, 2017

@gyuho Is this shipped in v3.3 or v3.4?
