etcdctl cluster-health are not consistent #2650

Closed
holmes86 opened this issue Apr 10, 2015 · 7 comments

@holmes86

I run a two-member etcd cluster on CentOS 7. Running "etcdctl cluster-health" on one member reports the cluster as healthy, but running it on the other member reports the cluster as unhealthy. The cluster health was fine when etcd first started.

test-0-1:
# etcdctl cluster-health
cluster is unhealthy

test-0-2:
# etcdctl cluster-health
cluster is healthy
member 6fafd2d770d2396 is healthy
member f2882b486b73afb9 is healthy

The etcd version is 2.0.3. The config files are as follows:

test-0-1:/etc/etcd/etcd.conf:

[member]
ETCD_NAME=test-0-1
ETCD_SNAPSHOT_COUNTER="10000"
ETCD_HEARTBEAT_INTERVAL="100"
ETCD_ELECTION_TIMEOUT="1000"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380,http://0.0.0.0:7001"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:4001,http://0.0.0.0:2379"
ETCD_MAX_WALS="0"


[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.128.0.1:2380"
# if you use different ETCD_NAME (e.g. test), set ETCD_INITIAL_CLUSTER value for this name, i.e. "test=http://..."
ETCD_INITIAL_CLUSTER="test-0-1=http://10.128.0.1:2380,test-0-2=http://10.128.0.2:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="test-etcd-cluster"
ETCD_ADVERTISE_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"

test-0-2:/etc/etcd/etcd.conf:

[member]
ETCD_NAME=test-0-2
ETCD_SNAPSHOT_COUNTER="10000"
ETCD_HEARTBEAT_INTERVAL="100"
ETCD_ELECTION_TIMEOUT="1000"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380,http://0.0.0.0:7001"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:4001,http://0.0.0.0:2379"
ETCD_MAX_WALS="0"

[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.128.0.2:2380"
# if you use different ETCD_NAME (e.g. test), set ETCD_INITIAL_CLUSTER value for this name, i.e. "test=http://..."
ETCD_INITIAL_CLUSTER="test-0-1=http://10.128.0.1:2380,test-0-2=http://10.128.0.2:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="test-etcd-cluster"
ETCD_ADVERTISE_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"

thanks

@xiang90 xiang90 added this to the v2.2.0 milestone Apr 10, 2015
@xiang90
Contributor

xiang90 commented Apr 10, 2015

etcdctl needs to reach the leader to verify the cluster information. My best guess is that etcdctl on machine 1 cannot reach the etcd running on machine 2 for some reason.

We might make the local etcd able to check the cluster info on behalf of the client.

@holmes86
Author

@xiang90 How can I check the cluster info for the client? I found that the cluster leader info does not exist on the local etcd. Why is the etcd cluster data not consistent? Thanks.

test-0-1

# curl -L http://127.0.0.1:4001/v2/stats/leader
{"message":"not current leader"}

test-0-2

# curl -L http://127.0.0.1:4001/v2/stats/leader
{"leader":"f2882b486b73afb9","followers":{"6fafd2d770d2396":{"latency":{"current":0.033491,"average":0.03256292351481814,"standardDeviation":0.11327880677627736,"minimum":0.004243,"maximum":217.301155},"counts":{"fail":37315,"success":10272082}}}}
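The two responses above are consistent with test-0-2 being the leader and test-0-1 being a follower: the leader stats endpoint only returns follower statistics on the member that currently leads. A minimal sketch for reading such responses follows. The function names are hypothetical, and the sample bodies are taken from the output above (the leader body is trimmed to its "counts" fields).

```python
import json

# Sample bodies from the thread; LEADER_BODY is trimmed (latency fields omitted).
FOLLOWER_BODY = '{"message":"not current leader"}'
LEADER_BODY = (
    '{"leader":"f2882b486b73afb9","followers":{"6fafd2d770d2396":'
    '{"counts":{"fail":37315,"success":10272082}}}}'
)


def leader_stats_role(body: str) -> str:
    """Classify a /v2/stats/leader response body (hypothetical helper)."""
    doc = json.loads(body)
    if not isinstance(doc, dict):
        return "unknown"
    if "leader" in doc and "followers" in doc:
        return "leader"
    if doc.get("message") == "not current leader":
        return "follower"
    return "unknown"


def follower_failure_ratios(body: str) -> dict:
    """Map each follower ID to its fraction of failed requests.

    Returns an empty dict for a follower response, which has no "followers" key.
    """
    doc = json.loads(body)
    ratios = {}
    for member_id, stats in doc.get("followers", {}).items():
        counts = stats["counts"]
        total = counts["fail"] + counts["success"]
        ratios[member_id] = counts["fail"] / total if total else 0.0
    return ratios


if __name__ == "__main__":
    print(leader_stats_role(FOLLOWER_BODY))
    print(leader_stats_role(LEADER_BODY))
    print(follower_failure_ratios(LEADER_BODY))
```

A "follower" result here does not by itself mean the cluster data is inconsistent; it only says which member was asked.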

@SpencerBrown
Contributor

I am seeing a very similar problem with 2.0.10 (CoreOS 675), and the machines can reach each other fine (they are running locally on VirtualBox on OS X).

Situation:
Node 1 started with:

coreos:
  etcd2:
    name: unittest-1
    initial-advertise_peer_urls: http://10.42.8.5:2380
    listen_peer_urls: http://10.42.8.5:2380
    initial_cluster_token: unittest
    initial_cluster: unittest-1=http://10.42.8.5:2380

Node 2 started with:

coreos:
  etcd2:
    name: unittest-2
    initial-advertise_peer_urls: http://10.42.8.6:2380
    listen_peer_urls: http://10.42.8.6:2380
    initial_cluster_token: unittest
    initial_cluster: unittest-2=http://10.42.8.6:2380,unittest-1=http://10.42.8.5:2380
    initial_cluster_state: existing

Then on node 1 I run: etcdctl member add unittest-2 http://10.42.8.6:2380

At this point, node 1 reports "cluster healthy" and lists both members. Node 2 reports "cluster unhealthy", but etcdctl member list shows both members. I can also set data on one node and read it on the other.

So it seems that only etcdctl cluster-health on node 2 is reporting things incorrectly.

@yichengq
Contributor

@holmes86 @SpencerBrown You need to set the correct client URLs to make it work. Here are more details (#2567 (comment)):

The problem is that -advertise-client-urls is not set correctly on each etcd member. Internally, cluster-health first finds the leader ID, then gets the leader member's clientURLs, then uses those URLs to make requests. So it needs the correct client URLs registered in the cluster.

In your case, it is actually saying 'I tried to fetch leader stats from the leader, but failed, so the cluster must be unhealthy'. It may need a better log message to let users know what is actually happening.
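The three steps described above can be sketched roughly as follows. This is not etcdctl's actual code: the function names are hypothetical, and the use of /v2/stats/self and /v2/members to find the leader ID and its clientURLs is an assumption about how the lookup could be done over the v2 HTTP API. The fetch function is injectable so the logic can be exercised without a running cluster.

```python
import json
import urllib.request
from typing import Callable, List

# fetch_json(url) returns the parsed JSON body; failures surface as OSError.
FetchJSON = Callable[[str], dict]


def http_fetch_json(url: str, timeout: float = 2.0) -> dict:
    """Fetch a URL and parse the body as JSON (network/HTTP errors are OSError)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def leader_client_urls(members: dict, leader_id: str) -> List[str]:
    """Return the clientURLs registered for the member whose ID is leader_id."""
    for member in members.get("members", []):
        if member.get("id") == leader_id:
            return list(member.get("clientURLs", []))
    return []


def cluster_health(local_url: str, fetch_json: FetchJSON = http_fetch_json) -> str:
    """Sketch of the described flow: leader ID -> leader clientURLs -> leader stats."""
    try:
        # Step 1 (assumed endpoint): ask the local member who the leader is.
        self_stats = fetch_json(local_url + "/v2/stats/self")
        # Step 2 (assumed endpoint): read the registered member list.
        members = fetch_json(local_url + "/v2/members")
    except (OSError, ValueError):
        return "unhealthy"
    leader_id = self_stats.get("leaderInfo", {}).get("leader", "")
    # Step 3: request leader stats through the leader's advertised client URLs.
    for url in leader_client_urls(members, leader_id):
        try:
            stats = fetch_json(url.rstrip("/") + "/v2/stats/leader")
        except (OSError, ValueError):
            continue  # unreachable URL, HTTP error, or non-JSON body
        if stats.get("leader") == leader_id:
            return "healthy"
    return "unhealthy"
```

Under this sketch, an advertised URL such as http://0.0.0.0:2379 would explain the asymmetry reported in this thread: a client on the follower that dials 0.0.0.0 typically ends up talking to its own local etcd, which answers "not current leader", while the same URL on the leader's machine reaches the leader itself.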

@SpencerBrown
Contributor

@yichengq You are absolutely correct, and the doc is incomplete in its example. I will submit a PR for the doc. (btw was good to see you at CoreOS Fest)

I added advertise-client-urls and listen-client-urls parameters to the nodes and everything works perfectly now in my scenario.

Node 1:

coreos:
  etcd2:
    name: unittest-1
    initial-advertise_peer_urls: http://10.42.8.5:2380
    listen_peer_urls: http://10.42.8.5:2380
    listen_client_urls: http://10.42.8.5:2379,http://127.0.0.1:2379
    advertise_client_urls: http://10.42.8.5:2379
    initial_cluster_token: unittest
    initial_cluster: unittest-1=http://10.42.8.5:2380
    initial_cluster_state: new

Node 2:

coreos:
  etcd2:
    name: unittest-2
    initial-advertise_peer_urls: http://10.42.8.6:2380
    listen_peer_urls: http://10.42.8.6:2380
    listen_client_urls: http://10.42.8.6:2379,http://127.0.0.1:2379
    advertise_client_urls: http://10.42.8.6:2379
    initial_cluster_token: unittest
    initial_cluster: unittest-2=http://10.42.8.6:2380,unittest-1=http://10.42.8.5:2380
    initial_cluster_state: existing
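As a rough pre-flight check along the same lines, the sketch below flags advertised client URLs that other machines cannot use to reach this member, such as the wildcard address 0.0.0.0 or a loopback address. The function name is hypothetical, and the rule it applies (reject unspecified and loopback hosts) is an assumption drawn from this thread rather than anything etcd enforces. DNS names other than "localhost" are not resolved and are left unflagged.

```python
import ipaddress
from typing import List
from urllib.parse import urlsplit


def unusable_advertise_urls(advertise_client_urls: str) -> List[str]:
    """Return the comma-separated URLs whose host is a wildcard or loopback address."""
    bad = []
    for raw in advertise_client_urls.split(","):
        url = raw.strip()
        if not url:
            continue
        host = urlsplit(url).hostname
        if host is None or host == "localhost":
            bad.append(url)
            continue
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            continue  # a DNS name; cannot be judged without resolving it
        if ip.is_unspecified or ip.is_loopback:
            bad.append(url)
    return bad


if __name__ == "__main__":
    # Value from the original report in this thread.
    print(unusable_advertise_urls("http://0.0.0.0:2379,http://0.0.0.0:4001"))
    # Value from the working configuration above.
    print(unusable_advertise_urls("http://10.42.8.5:2379"))
```

Listening on 0.0.0.0 or 127.0.0.1 is a separate matter; the check applies only to the URLs a member advertises to the rest of the cluster.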

@yichengq
Contributor

@SpencerBrown I think your PR is a good one. Could you send it upstream?

@soichih

soichih commented May 16, 2015

Thank you for this info. I've added -advertise-client-urls and this problem went away.

5 participants