Unable to recover etcd cluster after loss of nodes #863

Closed
sqs opened this Issue Jun 20, 2014 · 41 comments

@sqs

sqs commented Jun 20, 2014

I am running the CoreOS alpha channel (349.0) on EC2 (non-VPC) with no modifications to the CloudFormation cloud-config section for etcd. I updated my CloudFormation stack, which requires the EC2 instances to be terminated and recreated; the update is performed as a rolling update with MinInstancesInService=3 and should maintain the etcd cluster.

However, after the CF stack update, the etcd cluster was left in a bad state. One node (call it A), at least, appears to have the snapshot:

Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.079 INFO      | 3195460fb2fe42bd92439e261c84b859 attempted to join via 10.220.1.118:7001 failed: fail checking join version: Client Internal Error (Get http://10.220.1.118:7001/version: dial tcp 10.220.1.118:7001: i/o timeout)
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.452 INFO      | Send Join Request to http://10.41.9.209:7001/join
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.454 INFO      | 3195460fb2fe42bd92439e261c84b859 attempted to join via 10.41.9.209:7001 failed: fail on join request: Raft Internal Error ()
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.470 WARNING   | 3195460fb2fe42bd92439e261c84b859 cannot connect to previous cluster [10.238.41.24:7001 10.41.3.254:7001 10.220.27.188:7001 10.41.7.202:7001 10.41.3.254:7001 10.41.3.28:7001 10.41.11.168:7001 10.220.27.188:7001 10.220.145.240:7001 10.41.3.250:7001 10.41.3.28:7001 10.217.200.55:7001 10.238.39.228:7001 10.238.12.241:7001 10.238.41.24:7001 10.231.156.16:7001 10.220.146.186:7001 10.220.1.118:7001 10.41.9.209:7001]: fail joining the cluster via given peers after 1 retries
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO      | etcd server [name 3195460fb2fe42bd92439e261c84b859, listen on :4001, advertised url http://10.234.59.141:4001]
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO      | peer server [name 3195460fb2fe42bd92439e261c84b859, listen on :7001, advertised url http://10.234.59.141:7001]
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO      | 3195460fb2fe42bd92439e261c84b859 starting in peer mode
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO      | 3195460fb2fe42bd92439e261c84b859: state changed from 'initialized' to 'follower'.
Jun 20 01:00:18 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:18.029 INFO      | 3195460fb2fe42bd92439e261c84b859: state changed from 'follower' to 'candidate'.
Jun 20 01:00:19 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:19.472 INFO      | 3195460fb2fe42bd92439e261c84b859: snapshot of 179979 events at index 179979 completed

But the other 2 (call them B and C) continually fail with Raft Internal Error () when attempting to join the cluster composed of A:

Jun 20 01:27:36 ip-10-41-9-209 etcd[1912]: [etcd] Jun 20 01:27:36.097 INFO      | 39e1d196a13d42dd8cccfb8d3cc3040f attempted to join via 10.234.59.141:7001 failed: fail on join request: Raft Internal Error ()

@philips walked me through a recovery process that was unsuccessful and asked me to post this issue here. The recovery process consisted of (see the command sketch after the list):

  1. Finding previous machine IDs by running curl 127.0.0.1:4001/v2/keys/_etcd/machines on A
  2. On B and C, wiping the /var/lib/etcd dirs and changing their etcd -name= to machine IDs that existed in the machines JSON obtained in (1), by editing the file /var/run/systemd/system/etcd.service.d/20-cloudinit.conf
  3. Restarting B and C
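
Roughly, those steps correspond to commands like the following (a sketch only; <OLD_MACHINE_ID> is a placeholder for one of the IDs returned in step 1):

# 1. On A: list the machine IDs the cluster still knows about
curl 127.0.0.1:4001/v2/keys/_etcd/machines

# 2. On B and C: stop etcd, wipe its data dir, and set -name= to <OLD_MACHINE_ID>
#    in /var/run/systemd/system/etcd.service.d/20-cloudinit.conf
sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd/*

# 3. Restart B and C
sudo systemctl daemon-reload
sudo systemctl restart etcd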

After performing these steps and verifying that the other two nodes' logs show the changed names, one of them fails with the same 'Raft Internal Error ()' and the other fails with 'is not allowed to join the cluster with existing URL http://10.41.9.209:7001' (its own IP). Node A did not emit any more log lines during this whole process.

The data dir on node A contains various development credentials. I can revoke most of them but some of them I'm not sure about. Folks from CoreOS or other maintainers can email me at sqs@sourcegraph.com if you want me to send it to you (but I don't want to post it on the web).

@stevenschlansker

Contributor

stevenschlansker commented Jul 8, 2014

+1, I am having similar issues.

I run a cluster of 3 etcd servers. I want to update the machine image backing them (we are in AWS, so a new AMI) so I intend to do a rolling restart of the cluster. But eventually I always end up in a situation where some number of nodes are serving client requests, but refusing to let new cluster members join:

[etcd] Jul  8 21:20:22.739 INFO      | Discovery via https://discovery.etcd.io using prefix /b0d0e7798e882d327af1953f0eb977bf.
[etcd] Jul  8 21:20:23.000 INFO      | Discovery found peers [http://ip-10-70-6-77.us-west-2.compute.internal:31908 http://ip-10-70-6-249.us-west-2.compute.internal:31801 http://ip-10-70-6-77.us-west-2.compute.internal:31156 http://ip-10-70-6-235.us-west-2.compute.internal:31023 http://ip-10-70-6-233.us-west-2.compute.internal:31878 http://ip-10-70-6-235.us-west-2.compute.internal:31004]
[etcd] Jul  8 21:20:23.000 INFO      | Discovery fetched back peer list: [ip-10-70-6-77.us-west-2.compute.internal:31908 ip-10-70-6-249.us-west-2.compute.internal:31801 ip-10-70-6-77.us-west-2.compute.internal:31156 ip-10-70-6-235.us-west-2.compute.internal:31023 ip-10-70-6-233.us-west-2.compute.internal:31878 ip-10-70-6-235.us-west-2.compute.internal:31004]
[etcd] Jul  8 21:20:23.008 INFO      | Send Join Request to http://ip-10-70-6-77.us-west-2.compute.internal:31908/join
[etcd] Jul  8 21:20:23.012 INFO      | 8f5d66c2eb0d attempted to join via ip-10-70-6-77.us-west-2.compute.internal:31908 failed: fail on join request: Raft Internal Error ()
[etcd] Jul  8 21:20:23.015 INFO      | 8f5d66c2eb0d attempted to join via ip-10-70-6-249.us-west-2.compute.internal:31801 failed: fail checking join version: Client Internal Error (Get http://ip-10-70-6-249.us-west-2.compute.internal:31801/version: dial tcp 10.70.6.249:31801: connection refused)
[etcd] Jul  8 21:20:23.018 INFO      | 8f5d66c2eb0d attempted to join via ip-10-70-6-77.us-west-2.compute.internal:31156 failed: fail checking join version: Client Internal Error (Get http://ip-10-70-6-77.us-west-2.compute.internal:31156/version: dial tcp 10.70.6.77:31156: connection refused)
[etcd] Jul  8 21:20:23.022 INFO      | 8f5d66c2eb0d attempted to join via ip-10-70-6-235.us-west-2.compute.internal:31023 failed: fail checking join version: Client Internal Error (Get http://ip-10-70-6-235.us-west-2.compute.internal:31023/version: dial tcp 10.70.6.235:31023: connection refused)
[etcd] Jul  8 21:20:23.030 INFO      | Send Join Request to http://ip-10-70-6-233.us-west-2.compute.internal:31878/join
[etcd] Jul  8 21:20:23.035 INFO      | 8f5d66c2eb0d attempted to join via ip-10-70-6-233.us-west-2.compute.internal:31878 failed: fail on join request: Raft Internal Error ()
[etcd] Jul  8 21:20:23.043 INFO      | Send Join Request to http://ip-10-70-6-235.us-west-2.compute.internal:31004/join
[etcd] Jul  8 21:20:23.047 INFO      | 8f5d66c2eb0d attempted to join via ip-10-70-6-235.us-west-2.compute.internal:31004 failed: fail on join request: Raft Internal Error ()
[etcd] Jul  8 21:20:23.047 INFO      | 8f5d66c2eb0d is unable to join the cluster using any of the peers [ip-10-70-6-77.us-west-2.compute.internal:31908 ip-10-70-6-249.us-west-2.compute.internal:31801 ip-10-70-6-77.us-west-2.compute.internal:31156 ip-10-70-6-235.us-west-2.compute.internal:31023 ip-10-70-6-233.us-west-2.compute.internal:31878 ip-10-70-6-235.us-west-2.compute.internal:31004] at 0th time. Retrying in 3.7 seconds

Soon thereafter the existing nodes started to have serious troubles too:

[etcd] Jul  8 21:26:49.923 INFO      | f86a9a869d1c: state changed from 'follower' to 'candidate'.
[etcd] Jul  8 21:26:50.496 INFO      | f86a9a869d1c: state changed from 'candidate' to 'follower'.
[etcd] Jul  8 21:26:50.496 INFO      | f86a9a869d1c: term #8234 started.
[etcd] Jul  8 21:26:50.711 INFO      | f86a9a869d1c: term #8235 started.
[etcd] Jul  8 21:26:50.927 INFO      | f86a9a869d1c: term #8236 started.
[etcd] Jul  8 21:26:51.214 INFO      | f86a9a869d1c: term #8237 started.
[etcd] Jul  8 21:26:51.418 INFO      | f86a9a869d1c: term #8238 started.
[etcd] Jul  8 21:26:51.684 INFO      | f86a9a869d1c: term #8239 started.
[etcd] Jul  8 21:26:51.943 INFO      | f86a9a869d1c: term #8240 started.
[etcd] Jul  8 21:26:52.186 INFO      | f86a9a869d1c: term #8241 started.
[etcd] Jul  8 21:26:52.518 INFO      | f86a9a869d1c: term #8242 started.
[etcd] Jul  8 21:26:52.787 INFO      | f86a9a869d1c: state changed from 'follower' to 'candidate'.
[etcd] Jul  8 21:26:53.003 INFO      | f86a9a869d1c: state changed from 'candidate' to 'follower'.
@jbrandstetter

jbrandstetter commented Aug 13, 2014

+1 same issue here

@SergeyZh

SergeyZh commented Aug 14, 2014

+1 same issue.
Fortunately it's not a production stack, but I don't know what I would do if this happened in production.

@charlesmarshall

charlesmarshall commented Aug 14, 2014

+1, currently failing to see the other hosts with a clean discovery token on both Vagrant and AWS.

@philips

Contributor

philips commented Aug 14, 2014

@jbrandstetter @SergeyZh @charlesmarshall What versions are you all running?

@SergeyZh

SergeyZh commented Aug 14, 2014

I use the etcd bundled with the latest CoreOS:
/usr/bin/etcd --version
etcd version 0.4.4

@SergeyZh

SergeyZh commented Aug 14, 2014

I have a server exhibiting the problem right now. I can give you access for a day so you can take a look.

@charlesmarshall

charlesmarshall commented Aug 14, 2014

I've added a thread to the Google group: https://groups.google.com/forum/#!topic/coreos-user/B0yoBo9-agA

I added journalctl output for fleet and etcd there. That's from the current alpha version of CoreOS.

@charlesmarshall

charlesmarshall commented Aug 18, 2014

Any ideas why?

@jbrandstetter

jbrandstetter commented Aug 18, 2014

I'm on etcd version 0.4.4, latest stable channel.
For now I have reconfigured the machines to use a new discovery URL whenever this happens. Very ugly, but it works for me.
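
Roughly, the workaround looks like this (a sketch; https://discovery.etcd.io/new just mints a fresh token):

# mint a fresh discovery URL
curl -s https://discovery.etcd.io/new

# put the returned URL into the etcd section of the cloud-config
# (coreos: etcd: discovery: <new URL>) and reprovision the affected machines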

@charlesmarshall

charlesmarshall commented Aug 18, 2014

At the moment, for me, it fails on a totally clean install with a brand-new discovery token and the automated etcd.service.

@SergeyZh

SergeyZh commented Aug 18, 2014

I played a bit with an etcd cluster and found a problem with removing old nodes. It produces an unavailable cluster even with 3 nodes online: #941

@charlesmarshall

charlesmarshall commented Aug 18, 2014

sorry, mine was a typo in a regex causing it to sub discovery for discover .. doh!

yichengq added the bug label Aug 28, 2014

@Telvary

Telvary commented Aug 28, 2014

+1 here. I'll try to run more tests, but any "big" change in the cluster's membership results in this situation.

@MasashiTeruya

MasashiTeruya commented Sep 3, 2014

+1 same issue
$ etcd --version
etcd version 0.4.6

$ cat /etc/os-release | grep PRETTY
PRETTY_NAME="CoreOS 410.0.0"

@stevenschlansker

Contributor

stevenschlansker commented Sep 3, 2014

Hi, what would it take to get some movement forward on this bug? At this point this has crashed our etcd cluster a few tens of times, and we are going to have to start evaluating other options if we don't get a fix here. Is there something I can do to help move this forward? (Unfortunately I am not much of a Go guy, so trying to fix it myself isn't much of an option)

@yichengq

Contributor

yichengq commented Sep 3, 2014

@stevenschlansker We are working hard on a rewrite, and the new version is able to take care of this elegantly.
Thread: #874
It will be public very soon.

@tracker1

tracker1 commented Sep 5, 2014

+1 on the issue here... :-( ...this is just in our test/dev environment... won't need it for production for another month, but it's a pretty serious issue...

@jhiemer

jhiemer commented Sep 14, 2014

It seems that I am having the same issue here in my CloudFoundry instance. cloudfoundry-attic/cf-release#500

Is the only way to solve this to remove and re-add etcd as a new deployment?

@sbward

sbward commented Sep 22, 2014

+1, this needs to be resolved before we can rely on etcd. Serious issue. We saw this problem during our tests at Iron.io.

@tracker1

tracker1 commented Sep 23, 2014

Any updates? This has happened several times now, and I can't rely on etcd in production as it stands. Any alternatives to suggest?

@threetee

threetee commented Sep 30, 2014

+1, I ran into this when resizing storage on a coreos cluster node (vanilla three-node cluster, created from the official CoreOS CloudFormation template).

@asiragusa

asiragusa commented Oct 3, 2014

+1

Are there any workarounds to avoid this bug?

@redlines

redlines commented Oct 5, 2014

Interesting that this bug is posted as limited to node removal. We are running into this exact issue trying to set up a new cluster (with -stable). The very first node up works fine, but all subsequent nodes fail in the documented manner.

@gus

gus commented Oct 6, 2014

Same behavior right now, and we're using 410.2.0

@gregory90

gregory90 commented Oct 6, 2014

Same here on 459.0.0. Two-host setup, new cluster.
Logs:

Everything seems to work on 444.3.0 so far.

@gabrtv

gabrtv commented Oct 7, 2014

We are also seeing this while performing HA testing.

$ curl http://10.21.2.171:4001/v2/stats/leader
{"errorCode":300,"message":"Raft Internal Error","index":32580}

$ cat /etc/lsb-release 
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=459.0.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 459.0.0"

$ etcd --version
etcd version 0.4.6

Environment is EC2/VPC with 3 hosts specified via Cloud Formation.

I should also note that we can reproduce this fairly reliably by killing off the current etcd leader.
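
For anyone trying to reproduce it the same way: roughly, we find which host is the leader and stop etcd there (a sketch; it assumes /v2/stats/self reports each node's raft state as it does on 0.4.x, and the host list is just an example):

# hosts below are examples; print the reported state of each member
for ip in 10.21.2.171 10.21.2.172 10.21.2.173; do
  echo "$ip: $(curl -s http://$ip:4001/v2/stats/self | grep -o '"state":"[^"]*"')"
done

# then, on the node reporting itself as leader:
sudo systemctl stop etcd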

@madsurgeon

madsurgeon commented Oct 7, 2014

Same behaviour here on 444.3.0 on a single bare metal system after reboot.

journalctl -u etcd starts like this:
Oct 03 19:17:40 localhost systemd[1]: Starting etcd...
Oct 03 19:17:40 localhost systemd[1]: Started etcd.
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.113 INFO | The path /var/lib/etcd/log is in btrfs
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.124 INFO | Set NOCOW to path /var/lib/etcd/log succeeded
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.124 INFO | f92851674e784fbe983ff49263610c65 is starting a new cluster
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.128 INFO | etcd server [name f92851674e784fbe983ff49263610c65, listen on :4001, advertised url http://127.0.0.1:4001]
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.128 INFO | peer server [name f92851674e784fbe983ff49263610c65, listen on :7001, advertised url http://127.0.0.1:7001]
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.128 INFO | f92851674e784fbe983ff49263610c65 starting in peer mode
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.128 INFO | f92851674e784fbe983ff49263610c65: state changed from 'initialized' to 'follower'.
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.128 INFO | f92851674e784fbe983ff49263610c65: state changed from 'follower' to 'leader'.
Oct 03 19:17:41 localhost etcd[905]: [etcd] Oct 3 19:17:41.128 INFO | f92851674e784fbe983ff49263610c65: leader changed from '' to 'f92851674e784fbe983ff49263610c65'.
Oct 03 20:40:59 localhost etcd[905]: [etcd] Oct 3 20:40:59.741 INFO | f92851674e784fbe983ff49263610c65: snapshot of 10001 events at index 10001 completed
[...]
Oct 05 19:55:13 localhost etcd[905]: [etcd] Oct 5 19:55:13.815 INFO | f92851674e784fbe983ff49263610c65: snapshot of 10003 events at index 350108 completed
-- Reboot --
Oct 06 19:30:58 coreos0 systemd[1]: Starting etcd...
Oct 06 19:30:58 coreos0 systemd[1]: Started etcd.
Oct 06 19:30:59 coreos0 etcd[555]: [etcd] Oct 6 19:30:59.306 INFO | coreos0: peer added: 'f92851674e784fbe983ff49263610c65'
Oct 06 19:30:59 coreos0 etcd[555]: [etcd] Oct 6 19:30:59.776 INFO | The path /var/lib/etcd/log is in btrfs
Oct 06 19:30:59 coreos0 etcd[555]: [etcd] Oct 6 19:30:59.776 WARNING | Failed setting NOCOW: skip nonempty file
Oct 06 19:30:59 coreos0 etcd[555]: [etcd] Oct 6 19:30:59.776 CRITICAL | coreos0 is not allowed to join the cluster with existing URL http://127.0.0.1:7001
Oct 06 19:30:59 coreos0 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Oct 06 19:30:59 coreos0 systemd[1]: Unit etcd.service entered failed state.
Oct 06 19:31:10 coreos0 systemd[1]: etcd.service holdoff time over, scheduling restart.
Oct 06 19:31:10 coreos0 systemd[1]: Stopping etcd...

Seems to have something to do with the state.
Using a new token doesn't help.

@lukescott

lukescott commented Oct 8, 2014

I have the same issue on the stable release. I removed 2 of the 3 nodes in my cluster to update my launch config (adding an IAM role), and the new node is in an endless cycle of trying to join the cluster:

[etcd] Oct 8 22:00:06.213 INFO | e26cce6e300a47b79f360acede836fda attempted to join via 172.31.7.23:7001 failed: fail on join request: Raft Internal Error ()

The last log lines on the other node show:

Oct 08 21:45:39 ip-172-31-7-23.us-west-2.compute.internal etcd[582]: [etcd] Oct 8 21:45:39.110 INFO | d2b168399c8143d282ce1be692a0777b: leader changed from 'e8cc95a2a5fc4b7cad4a31369fcb88c6' to ''.
Oct 08 21:45:40 ip-172-31-7-23.us-west-2.compute.internal etcd[582]: [etcd] Oct 8 21:45:40.424 INFO | d2b168399c8143d282ce1be692a0777b: state changed from 'follower' to 'candidate'.

I think at this point I'm going to have to recreate the cluster. But I also have auto scaling, so this is sure to crop up on its own.

@mfischer-zd

Contributor

mfischer-zd commented Oct 14, 2014

I'm in this situation too today. What's the best way to resync a failed peer until the official fix is ready?

@sbward

sbward commented Oct 18, 2014

Can we have a status update on this? Thanks

@asiragusa

asiragusa commented Oct 26, 2014

I managed to work around this by changing the ports etcd listens on for each install. This forces etcd to believe that the reinstalled node is a brand-new machine, so it has to do a full sync with the existing nodes.

By the way, this should be the default etcd behavior; it shouldn't assume that the ip:port pair alone identifies a machine's state.

To make it work I had to tamper with the cloud-config, /etc/environment and the .bashrc. Thankfully I have scripted all of this, so it's transparent for me...
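
In cloud-config terms it boils down to something like this (a rough sketch; the bumped port numbers are just examples and the discovery token is elided):

#cloud-config
coreos:
  etcd:
    discovery: https://discovery.etcd.io/<token>
    # bump these two ports on every reinstall so the node looks brand new
    addr: $private_ipv4:4101
    peer-addr: $private_ipv4:7101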

@xiang90

Contributor

xiang90 commented Oct 26, 2014

Sorry for not replying to this issue in time. Most of the problems I found in the logs are caused by misconfiguration. The configuration in 0.4.x is somewhat confusing. We are aware of this problem, and we will have much better configuration management and documentation for the coming etcd 0.5.

Here are a few things we need to take care of when restarting etcd.

After rebooting, you MUST restart etcd using the same data dir, or it will be considered a NEW node. A new node needs to be approved by a majority of the existing cluster. If the existing cluster is unavailable, the new node will not be able to join.

If you are seeing 'fail on join request: Raft Internal Error ()' when restarting a node, and that node exits after several attempts, then you probably need to check the configuration.

If you want to change the advertised peer URL of a node when restarting it, you need to make sure the cluster is still available after you take down that etcd node. Changing the advertised peer URL needs to be approved by a majority of the existing cluster.

Moreover, etcd verifies that each node MUST have a unique advertised peer URL. If this verification fails, it will not start/restart.

To sum up:

  1. Make sure the cluster itself is available (a majority is alive) when you want to make configuration changes (add/remove a node, change a peer URL).
  2. Make sure the node you want to restart points to the old data dir it used before.

There is a known bug in go-raft which can cause problems on restart. That happens if the snapshot is very large.
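
To illustrate point 2: restarting an existing 0.4.x member should reuse the same name and data dir as its previous run, roughly like this (a sketch reusing the name and addresses from the original report; the peers list is just an example):

# same -name and -data-dir as the node's previous run, so this counts as a
# restart rather than a join attempt by a brand-new node
/usr/bin/etcd -name 3195460fb2fe42bd92439e261c84b859 \
  -data-dir /var/lib/etcd \
  -addr 10.234.59.141:4001 \
  -peer-addr 10.234.59.141:7001 \
  -peers 10.41.9.209:7001,10.220.1.118:7001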

@lukescott

lukescott commented Oct 27, 2014

@xiangli-cmu So what happens when the majority (at least 3?) is not alive? I understand that the cluster cannot function without the majority, but is there any way for it to recover after it loses the majority and gets new members? Is this something that will be addressed in 0.5?

@mfischer-zd

Contributor

mfischer-zd commented Oct 27, 2014

@xiangli-cmu understood, but there is currently no way to change the node name in the event of an initial misconfiguration without losing all data. This is Bad.

@xiang90

Contributor

xiang90 commented Oct 27, 2014

@lukescott @mfischer-zd

  1. Is etcd trying to prevent the misconfiguration problem in 0.5?

Yes, we are. etcd 0.5 supports static bootstrap for a multi-node cluster, requires all configuration changes to be done explicitly via the API, and does cluster-ID checking.

  2. Can etcd prevent all the misconfiguration problems?

No. We are not able to prevent ALL user errors.

  3. Is there a plan for misconfiguration/node-loss recovery?

Yes. Please keep an eye on #1242
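
For those who haven't followed the new branch: static bootstrap there looks roughly like the following (a sketch based on the 0.5 development docs; the names and URLs are placeholders, and the exact flag spellings may change before release):

# names and URLs below are placeholders for a three-node static cluster
etcd -name infra0 \
  -initial-advertise-peer-urls http://10.0.0.10:2380 \
  -listen-peer-urls http://10.0.0.10:2380 \
  -initial-cluster infra0=http://10.0.0.10:2380,infra1=http://10.0.0.11:2380,infra2=http://10.0.0.12:2380 \
  -initial-cluster-state new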

@mfischer-zd

Contributor

mfischer-zd commented Oct 27, 2014

Excellent, looking forward to it.

@kelseyhightower

Contributor

kelseyhightower commented Oct 27, 2014

I've done some basic testing on etcd 0.5.x and I was able to recover after the loss of a node and even a misconfigured one. While there may still be issues on 0.5.x, it's where the fix for these types of issues will land.

We've also tried to note, where possible, that etcd is not production ready; this was especially true of etcd 0.4.x, and many of the issues have been resolved on the master branch. I'm closing this ticket since these types of issues are being, or have been, addressed on master.

Please kick the tires on 0.5.0 and file new issues to help us produce an awesome release.

@gegere

gegere commented Nov 10, 2014

I wanted to add more information to this thread. Do not use name: in the #cloud-config file! http://cl.ly/image/2m2R1a1W0M3a

If you do, the nodes will go round and round trying to join the cluster, because a "leader" node will never be established.

Nov 10 08:56:47 coreos-20 etcd[23696]: [etcd] Nov 10 08:56:47.464 INFO | coreos20 attempted to join via 10.132.188.206:7001 failed: fail checking join version: Client Internal Error (Get http://10.132.188.206:7001/version: dial tcp 10.132.188.206:7001: no route to host)

Nov 10 08:56:48 coreos-20 etcd[23696]: [etcd] Nov 10 08:56:48.815 INFO | coreos20 attempted to join via 10.132.189.117:7001 failed: fail checking join version: Client Internal Error (Get http://10.132.189.117:7001/version: dial tcp 10.132.189.117:7001: i/o timeout)

Nov 10 08:56:48 coreos-20 etcd[23696]: [etcd] Nov 10 08:56:48.820 INFO | Send Join Request to http://10.132.189.116:7001/join

Nov 10 08:56:48 coreos-20 etcd[23696]: [etcd] Nov 10 08:56:48.821 INFO | coreos20 attempted to join via 10.132.189.116:7001 failed: fail on join request: Raft Internal Error ()
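
In other words, the etcd section of the cloud-config should look roughly like this (a sketch; the discovery token is elided):

#cloud-config
coreos:
  etcd:
    # do NOT set name: here
    discovery: https://discovery.etcd.io/<token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001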

@gegere

gegere commented Nov 18, 2014

@AlessandroEmm there is a bug in etcd when using names in the configuration: it can't correctly pass the leader role because the hash doesn't match.

If you tail the journal ("journalctl -f"), do you see the cluster trying to form but failing continuously?

@dougbtv

dougbtv commented Aug 22, 2015

I'm still running into this with the latest stable qemu image, which runs etcd 0.4.6 (unless you specify etcd2 in your cloud-config; I haven't gotten etcd2 to work properly for me yet).

But I found a solution that at least works for me, and have documented it further here. I'd noticed that when I first spun up a cluster it would work, but if I tried to have the cluster rediscover itself from scratch -- e.g. all CoreOS boxes down -- it'd start failing with these errors, and I noticed that etcd appeared to cache who its peers were. I didn't feel that was particularly appropriate, so I hacked a way to clear its "cache", if you will, by clearing items in the /var/lib/etcd dir. Specifically, on each CoreOS machine I would issue:

systemctl stop etcd
rm /var/lib/etcd/log
rm /var/lib/etcd/conf
shutdown now

Then I'd feed a new discovery URL into the cloud-configs and start it back up. And maybe it's witchcraft, but I feel like I had better luck if I booted one box first and gave it 30 seconds before I booted up the rest (which I'd boot quickly in succession).
