etcdctl cluster-health are not consistent #2650

holmes86 · 2015-04-10T03:53:59Z

I run a two-member etcd cluster on CentOS 7. Running "etcdctl cluster-health" on one member reports the cluster as healthy, but running it on the other member reports the cluster as unhealthy. The cluster health was OK when etcd started.

test-0-1:
# etcdctl cluster-health
cluster is unhealthy

test-0-2:
# etcdctl cluster-health
cluster is healthy
member 6fafd2d770d2396 is healthy
member f2882b486b73afb9 is healthy

The etcd version is 2.0.3. The config files are as follows:

test-0-1:/etc/etcd/etcd.conf:

[member]
ETCD_NAME=test-0-1
ETCD_SNAPSHOT_COUNTER="10000"
ETCD_HEARTBEAT_INTERVAL="100"
ETCD_ELECTION_TIMEOUT="1000"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380,http://0.0.0.0:7001"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:4001,http://0.0.0.0:2379"
ETCD_MAX_WALS="0"


[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.128.0.1:2380"
# if you use different ETCD_NAME (e.g. test), set ETCD_INITIAL_CLUSTER value for this name, i.e. "test=http://..."
ETCD_INITIAL_CLUSTER="test-0-1=http://10.128.0.1:2380,test-0-2=http://10.128.0.2:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="test-etcd-cluster"
ETCD_ADVERTISE_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"

test-0-2:/etc/etcd/etcd.conf:

[member]
ETCD_NAME=test-0-2
ETCD_SNAPSHOT_COUNTER="10000"
ETCD_HEARTBEAT_INTERVAL="100"
ETCD_ELECTION_TIMEOUT="1000"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380,http://0.0.0.0:7001"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:4001,http://0.0.0.0:2379"
ETCD_MAX_WALS="0"

[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.128.0.2:2380"
# if you use different ETCD_NAME (e.g. test), set ETCD_INITIAL_CLUSTER value for this name, i.e. "test=http://..."
ETCD_INITIAL_CLUSTER="test-0-1=http://10.128.0.1:2380,test-0-2=http://10.128.0.2:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="test-etcd-cluster"
ETCD_ADVERTISE_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"

thanks

xiang90 · 2015-04-10T03:58:58Z

etcdctl needs to reach the leader to verify the cluster information. My best guess is that etcdctl on machine 1 cannot reach the etcd running on machine 2 for some reason.

We might make the local etcd able to check the cluster info on behalf of the client.

holmes86 · 2015-04-10T04:58:09Z

@xiang90 How can I check the cluster info for the client? I found that the cluster leader info does not exist on the local etcd. Why is the etcd cluster data not consistent? Thanks.

test-0-1

# curl -L http://127.0.0.1:4001/v2/stats/leader
{"message":"not current leader"}

test-0-2

# curl -L http://127.0.0.1:4001/v2/stats/leader
{"leader":"f2882b486b73afb9","followers":{"6fafd2d770d2396":{"latency":{"current":0.033491,"average":0.03256292351481814,"standardDeviation":0.11327880677627736,"minimum":0.004243,"maximum":217.301155},"counts":{"fail":37315,"success":10272082}}}}
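The two responses above can be read mechanically. Below is a minimal Python sketch that takes a /v2/stats/leader response body and reports whether the queried member is the leader and, if so, the fraction of failed requests per follower. The function name summarize_leader_stats and the shape of the returned dictionary are illustrative assumptions, not part of etcd or etcdctl; only the JSON fields shown in the output above are used.

```python
import json


def summarize_leader_stats(body: str) -> dict:
    """Summarize a /v2/stats/leader response body (hypothetical helper)."""
    data = json.loads(body)
    # A follower answers with only a "message" field, as on test-0-1.
    if "leader" not in data:
        return {"is_leader": False, "message": data.get("message", "")}
    fail_ratio = {}
    for follower_id, stats in data.get("followers", {}).items():
        counts = stats["counts"]
        total = counts["fail"] + counts["success"]
        # Fraction of failed requests to this follower, as counted by the leader.
        fail_ratio[follower_id] = counts["fail"] / total if total else 0.0
    return {"is_leader": True, "leader": data["leader"], "fail_ratio": fail_ratio}


# Bodies taken from the thread (latency fields omitted for brevity).
follower_body = '{"message":"not current leader"}'
leader_body = (
    '{"leader":"f2882b486b73afb9","followers":{"6fafd2d770d2396":'
    '{"counts":{"fail":37315,"success":10272082}}}}'
)

print(summarize_leader_stats(follower_body))
print(summarize_leader_stats(leader_body))
```

Under this reading, a "not current leader" answer on test-0-1 only means the request hit a follower; it does not by itself show that the cluster data is inconsistent.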

SpencerBrown · 2015-05-13T14:44:47Z

I am seeing a very similar problem with 2.0.10 (CoreOS 675), and the machines can reach each other fine (they are running locally on VirtualBox on OS X).

Situation:
node 1 started with

coreos:
  etcd2:
    name: unittest-1
    initial-advertise_peer_urls: http://10.42.8.5:2380
    listen_peer_urls: http://10.42.8.5:2380
    initial_cluster_token: unittest
    initial_cluster: unittest-1=http://10.42.8.5:2380

node 2 started with:

coreos:
  etcd2:
    name: unittest-2
    initial-advertise_peer_urls: http://10.42.8.6:2380
    listen_peer_urls: http://10.42.8.6:2380
    initial_cluster_token: unittest
    initial_cluster: unittest-2=http://10.42.8.6:2380,unittest-1=http://10.42.8.5:2380
    initial_cluster_state: existing

Then on node 1 I run etcdctl member add unittest-2 http://10.42.8.6:2380

At this point, node 1 reports "cluster healthy" and lists both members. Node 2 reports "cluster unhealthy", but etcdctl member list shows both members. Also, I can set data on one node and see it on the other.

So it seems that only etcdctl cluster-health on node 2 is reporting things incorrectly.

yichengq · 2015-05-13T21:05:35Z

@holmes86 @SpencerBrown You need to set correct client URLs to make it work well. Here are more details (from #2567 (comment)):

The problem is that -advertise-client-urls is not set correctly on each etcd member. Internally, cluster-health first finds the leader ID, then gets the leader member's clientURLs, then uses those URLs to make requests. So it needs the correct client URLs registered in the cluster.

In your case, it is actually saying 'I tried to fetch leader stats from the leader, but I failed to do that, so the cluster must be unhealthy'. It may need a better log message to let users know what is actually happening.
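To make the flow described above concrete, here is a rough Python sketch. It is not the etcdctl implementation. The names cluster_health, Members, Fetcher and fetch_from are hypothetical, and the network call is replaced by an injected function so the example runs without a cluster. It assumes that a URL of 0.0.0.0 ends up reaching the etcd member on the local machine, which is one plausible reading of the first report in this thread.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical types: member ID -> advertised client URLs.
Members = Dict[str, List[str]]
# Hypothetical fetcher: returns leader stats for a URL, or None on any failure.
Fetcher = Callable[[str], Optional[dict]]


def cluster_health(leader_id: str, members: Members, fetch_leader_stats: Fetcher) -> str:
    """Look up the leader's registered client URLs, then query them for leader stats."""
    for url in members.get(leader_id, []):
        if fetch_leader_stats(url) is not None:
            return "cluster is healthy"
    # No registered URL of the leader returned stats.
    return "cluster is unhealthy"


# Both members advertise 0.0.0.0, as in the first config in this thread.
members = {
    "6fafd2d770d2396": ["http://0.0.0.0:2379"],
    "f2882b486b73afb9": ["http://0.0.0.0:2379"],
}
leader = "f2882b486b73afb9"


def fetch_from(local_member: str) -> Fetcher:
    """Simulate a machine where 0.0.0.0 reaches the local etcd member (an assumption)."""
    def fetch(url: str) -> Optional[dict]:
        # Only the leader answers /v2/stats/leader with stats; a follower
        # answers "not current leader", which is treated as a failure here.
        return {"leader": leader} if local_member == leader else None
    return fetch


print(cluster_health(leader, members, fetch_from("6fafd2d770d2396")))  # follower machine
print(cluster_health(leader, members, fetch_from("f2882b486b73afb9")))  # leader machine
```

In this simulation the follower machine prints "cluster is unhealthy" and the leader machine prints "cluster is healthy", which matches the asymmetry reported on test-0-1 and test-0-2.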

SpencerBrown · 2015-05-13T21:36:05Z

@yichengq You are absolutely correct, and the doc is incomplete in its example. I will submit a PR for the doc. (btw was good to see you at CoreOS Fest)

I added advertise-client-urls and listen-client-urls parameters to the nodes and everything works perfectly now in my scenario.

Node 1:

coreos:
  etcd2:
    name: unittest-1
    initial-advertise_peer_urls: http://10.42.8.5:2380
    listen_peer_urls: http://10.42.8.5:2380
    listen_client_urls: http://10.42.8.5:2379,http://127.0.0.1:2379
    advertise_client_urls: http://10.42.8.5:2379
    initial_cluster_token: unittest
    initial_cluster: unittest-1=http://10.42.8.5:2380
    initial_cluster_state: new

Node 2:

coreos:
  etcd2:
    name: unittest-2
    initial-advertise_peer_urls: http://10.42.8.6:2380
    listen_peer_urls: http://10.42.8.6:2380
    listen_client_urls: http://10.42.8.6:2379,http://127.0.0.1:2379
    advertise_client_urls: http://10.42.8.6:2379
    initial_cluster_token: unittest
    initial_cluster: unittest-2=http://10.42.8.6:2380,unittest-1=http://10.42.8.5:2380
    initial_cluster_state: existing
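For anyone wanting to catch this before starting a member, a small check along these lines may help. It is only a sketch: the helper name unroutable_client_urls is made up for illustration, and it flags just the obvious cases (0.0.0.0, loopback addresses, and the literal name localhost) that other machines cannot use to reach this member. It does not verify that the address is actually reachable.

```python
import ipaddress
from typing import List
from urllib.parse import urlsplit


def unroutable_client_urls(urls: str) -> List[str]:
    """Return the advertised client URLs that other machines cannot use (hypothetical helper)."""
    bad = []
    for url in urls.split(","):
        url = url.strip()
        if not url:
            continue
        host = urlsplit(url).hostname
        if host is None or host == "localhost":
            bad.append(url)
            continue
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            # Not an IP literal, so it is a DNS name; this sketch does not resolve it.
            continue
        if ip.is_unspecified or ip.is_loopback:
            bad.append(url)
    return bad


# Value from the first config in this thread: every URL is flagged.
print(unroutable_client_urls("http://0.0.0.0:2379,http://0.0.0.0:4001"))
# Value from the working config above: nothing is flagged.
print(unroutable_client_urls("http://10.42.8.5:2379"))
```

Listening on 0.0.0.0 or 127.0.0.1 is fine for listen_client_urls; the check is meant only for the advertised URLs, since those are what gets registered in the cluster.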

yichengq · 2015-05-14T00:03:45Z

@SpencerBrown I think your PR is a good one. Could you send it to upstream?

soichih · 2015-05-16T21:59:25Z

Thank you for this info. I've added -advertise-client-urls and this problem went away.

xiang90 added this to the v2.2.0 milestone Apr 10, 2015

xiang90 added the etcdctl label Apr 10, 2015

yichengq mentioned this issue May 13, 2015

etcdctl cluster-health problem #2567

Closed

yichengq mentioned this issue May 14, 2015

etcdctl/cluster_health: improve output if failed to get leader stats #2828

Merged

barakmich assigned yichengq May 14, 2015

yichengq closed this as completed May 22, 2015

philips unassigned yichengq Aug 28, 2018
