New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race condition in cluster join #716

Closed
aphyr opened this Issue Apr 13, 2014 · 11 comments

Comments

5 participants
@aphyr

aphyr commented Apr 13, 2014

When joining fresh nodes to the cluster, I've found a reproducible but somewhat random crash where the node claims that either the version or join request timed out, or its connection was refused. Running the joins simultaneously reliably crashes all but one joiner; running them with a 5-second delay usually works but still crashes sometimes. I'm not sure whether this is an artifact of aggressive IO timeouts or what; it looks like the nodes gave up after only a few hundred milliseconds. Re-running the join process from the crashed node has a chance to reconnect, so this whole thing smells of a timing bug or race condition.

n1

ubuntu@n1:/opt/etcd$ sudo rm -rf /var/lib/etcd/
ubuntu@n1:/opt/etcd$ sudo bin/etcd -peer-addr n1:7001 -addr n1:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n1
[etcd] Apr 13 21:16:46.841 INFO      | etcd server [name n1, listen on 0.0.0.0:4001, advertised url http://n1:4001]
[etcd] Apr 13 21:16:46.843 INFO      | URLs: /_etcd/machines:  / n1 ()
[etcd] Apr 13 21:16:46.843 INFO      | This peer is starting a brand new cluster now.
[etcd] Apr 13 21:16:46.844 INFO      | n1: state changed from 'initialized' to 'follower'.
[etcd] Apr 13 21:16:46.844 INFO      | n1: state changed from 'follower' to 'leader'.
[etcd] Apr 13 21:16:46.844 INFO      | n1: leader changed from '' to 'n1'.
[etcd] Apr 13 21:16:46.909 INFO      | peer server [name n1, listen on 0.0.0.0:7001, advertised url http://n1:7001]

[etcd] Apr 13 21:16:50.164 INFO      | n1: peer added: 'n2'
[etcd] Apr 13 21:16:50.620 INFO      | n1: peer added: 'n5'
[etcd] Apr 13 21:16:50.737 INFO      | n1: peer added: 'n3'
[etcd] Apr 13 21:16:50.796 INFO      | n1: peer added: 'n4'
[etcd] Apr 13 21:16:50.898 INFO      | n1: warning: heartbeat timed out: 'n5'
[etcd] Apr 13 21:16:50.898 INFO      | n1: warning: heartbeat timed out: 'n3'
[etcd] Apr 13 21:16:50.898 INFO      | n1: warning: heartbeat timed out: 'n4'
[etcd] Apr 13 21:16:50.899 INFO      | n1: warning: heartbeat timed out: 'n3'
[etcd] Apr 13 21:16:50.900 INFO      | n1: warning: heartbeat timed out: 'n5'
[etcd] Apr 13 21:16:50.900 INFO      | n1: warning: heartbeat timed out: 'n4'
[etcd] Apr 13 21:16:50.921 INFO      | n1: warning: heartbeat timed out: 'n5'
[etcd] Apr 13 21:16:50.938 INFO      | n1: warning: heartbeat timed out: 'n3'
[etcd] Apr 13 21:16:50.947 INFO      | n1: warning: heartbeat timed out: 'n4'
[etcd] Apr 13 21:16:50.971 INFO      | n1: warning: heartbeat timed out: 'n2'
[etcd] Apr 13 21:16:50.972 INFO      | n1: warning: heartbeat timed out: 'n5'
[etcd] Apr 13 21:16:50.988 INFO      | n1: warning: heartbeat timed out: 'n3'
[etcd] Apr 13 21:16:50.997 INFO      | n1: warning: heartbeat timed out: 'n4'
[etcd] Apr 13 21:16:51.021 INFO      | n1: warning: heartbeat timed out: 'n5'
[etcd] Apr 13 21:16:51.038 INFO      | n1: warning: heartbeat timed out: 'n3'
[etcd] Apr 13 21:16:51.047 INFO      | n1: warning: heartbeat timed out: 'n4'
[etcd] Apr 13 21:16:51.071 INFO      | n1: warning: heartbeat timed out: 'n5'
[etcd] Apr 13 21:16:51.088 INFO      | n1: warning: heartbeat timed out: 'n3'
[etcd] Apr 13 21:16:51.097 INFO      | n1: warning: heartbeat timed out: 'n4'
[etcd] Apr 13 21:16:51.121 INFO      | n1: warning: heartbeat timed out: 'n5'
[etcd] Apr 13 21:16:51.139 INFO      | n1: warning: heartbeat timed out: 'n3'
[etcd] Apr 13 21:16:51.147 INFO      | n1: warning: heartbeat timed out: 'n4'
[etcd] Apr 13 21:16:51.171 INFO      | n1: warning: heartbeat timed out: 'n5'

n2

ubuntu@n2:/opt/etcd$ sudo rm -rf /var/lib/etcd/
ubuntu@n2:/opt/etcd$ sudo bin/etcd -peer-addr n2:7001 -addr n2:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n2 -peers n1:7001,n2:7001,n3:7001,n4:7001,n5:7001
[etcd] Apr 13 21:15:47.665 INFO      | etcd server [name n2, listen on 0.0.0.0:4001, advertised url http://n2:4001]
[etcd] Apr 13 21:15:47.672 INFO      | URLs: /_etcd/machines:  / n2 ()
ubuntu@n2:/opt/etcd$ sudo rm -rf /var/lib/etcd/
ubuntu@n2:/opt/etcd$ sudo bin/etcd -peer-addr n2:7001 -addr n2:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n2 -peers n1:7001,n2:7001,n3:7001,n4:7001,n5:7001
[etcd] Apr 13 21:16:50.160 INFO      | etcd server [name n2, listen on 0.0.0.0:4001, advertised url http://n2:4001]
[etcd] Apr 13 21:16:50.161 INFO      | URLs: /_etcd/machines:  / n2 ()
[etcd] Apr 13 21:16:50.163 INFO      | Send Join Request to http://n1:7001/v2/admin/machines/n2
[etcd] Apr 13 21:16:50.260 INFO      | »»»» 200
[etcd] Apr 13 21:16:50.260 INFO      | n2: state changed from 'initialized' to 'follower'.
[etcd] Apr 13 21:16:50.260 INFO      | peer server [name n2, listen on 0.0.0.0:7001, advertised url http://n2:7001]
[etcd] Apr 13 21:16:50.344 INFO      | n2: peer added: 'n1'
[etcd] Apr 13 21:16:50.897 INFO      | n2: warning: heartbeat near election timeout: 277.209878ms
[etcd] Apr 13 21:16:50.898 INFO      | n2: peer added: 'n5'
[etcd] Apr 13 21:16:50.981 INFO      | n2: peer added: 'n3'
[etcd] Apr 13 21:16:51.065 INFO      | n2: peer added: 'n4'
[etcd] Apr 13 21:16:57.571 INFO      | n2: state changed from 'follower' to 'candidate'.
[etcd] Apr 13 21:16:57.572 INFO      | n2: leader changed from 'n1' to ''.

n3

ubuntu@n3:/opt/etcd$ sudo rm -rf /var/lib/etcd/
ubuntu@n3:/opt/etcd$ sudo bin/etcd -peer-addr n3:7001 -addr n3:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n3 -peers n1:7001,n2:7001,n3:7001,n4:7001,n5:7001
[etcd] Apr 13 21:16:50.161 INFO      | etcd server [name n3, listen on 0.0.0.0:4001, advertised url http://n3:4001]
[etcd] Apr 13 21:16:50.164 INFO      | URLs: /_etcd/machines:  / n3 ()
[etcd] Apr 13 21:16:50.166 INFO      | Send Join Request to http://n1:7001/v2/admin/machines/n3
[etcd] Apr 13 21:16:50.516 WARNING   | Attempt to join via n1:7001 failed: Unable to join: Put http://n1:7001/v2/admin/machines/n3: net/http: timeout awaiting response headers
[etcd] Apr 13 21:16:50.517 INFO      | Send Join Request to http://n2:7001/v2/admin/machines/n3
[etcd] Apr 13 21:16:50.518 INFO      | »»»» 307
[etcd] Apr 13 21:16:50.519 INFO      | »»»» 404
[etcd] Apr 13 21:16:50.519 WARNING   | Attempt to join via n2:7001 failed: Unable to join
[etcd] Apr 13 21:16:50.520 WARNING   | Attempt to join via n4:7001 failed: Error during join version check: Get http://n4:7001/version: dial tcp 192.168.122.14:7001: connection refused
[etcd] Apr 13 21:16:50.520 WARNING   | Attempt to join via n5:7001 failed: Error during join version check: Get http://n5:7001/version: dial tcp 192.168.122.15:7001: connection refused
[etcd] Apr 13 21:16:50.520 WARNING   | No living peers are found!
[etcd] Apr 13 21:16:50.521 CRITICAL  | No available peers in backup list, and no log data
ubuntu@n3:/opt/etcd

n4

ubuntu@n4:/opt/etcd$ sudo rm -rf /var/lib/etcd/
ubuntu@n4:/opt/etcd$ sudo bin/etcd -peer-addr n4:7001 -addr n4:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n4 -peers n1:7001,n2:7001,n3:7001,n4:7001,n5:7001
[etcd] Apr 13 21:16:50.162 INFO      | etcd server [name n4, listen on 0.0.0.0:4001, advertised url http://n4:4001]
[etcd] Apr 13 21:16:50.172 INFO      | URLs: /_etcd/machines:  / n4 ()
[etcd] Apr 13 21:16:50.174 INFO      | Send Join Request to http://n1:7001/v2/admin/machines/n4
[etcd] Apr 13 21:16:50.524 WARNING   | Attempt to join via n1:7001 failed: Unable to join: Put http://n1:7001/v2/admin/machines/n4: net/http: timeout awaiting response headers
[etcd] Apr 13 21:16:50.525 INFO      | Send Join Request to http://n2:7001/v2/admin/machines/n4
[etcd] Apr 13 21:16:50.526 INFO      | »»»» 307
[etcd] Apr 13 21:16:50.529 INFO      | »»»» 404
[etcd] Apr 13 21:16:50.529 WARNING   | Attempt to join via n2:7001 failed: Unable to join
[etcd] Apr 13 21:16:50.530 WARNING   | Attempt to join via n3:7001 failed: Error during join version check: Get http://n3:7001/version: dial tcp 192.168.122.13:7001: connection refused
[etcd] Apr 13 21:16:50.530 WARNING   | Attempt to join via n5:7001 failed: Error during join version check: Get http://n5:7001/version: dial tcp 192.168.122.15:7001: connection refused
[etcd] Apr 13 21:16:50.530 WARNING   | No living peers are found!
[etcd] Apr 13 21:16:50.530 CRITICAL  | No available peers in backup list, and no log data
ubuntu@n4:/opt/etcd$

n5

ubuntu@n5:/opt/etcd$ sudo rm -rf /var/lib/etcd/
ubuntu@n5:/opt/etcd$ sudo bin/etcd -peer-addr n5:7001 -addr n5:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n5 -peers n1:7001,n2:7001,n3:7001,n4:7001,n5:7001
[etcd] Apr 13 21:16:50.161 INFO      | etcd server [name n5, listen on 0.0.0.0:4001, advertised url http://n5:4001]
[etcd] Apr 13 21:16:50.163 INFO      | URLs: /_etcd/machines:  / n5 ()
[etcd] Apr 13 21:16:50.164 INFO      | Send Join Request to http://n1:7001/v2/admin/machines/n5
[etcd] Apr 13 21:16:50.514 WARNING   | Attempt to join via n1:7001 failed: Unable to join: Put http://n1:7001/v2/admin/machines/n5: read tcp 192.168.122.11:7001: i/o timeout
[etcd] Apr 13 21:16:50.515 INFO      | Send Join Request to http://n2:7001/v2/admin/machines/n5
[etcd] Apr 13 21:16:50.516 INFO      | »»»» 307
[etcd] Apr 13 21:16:50.517 INFO      | »»»» 404
[etcd] Apr 13 21:16:50.517 WARNING   | Attempt to join via n2:7001 failed: Unable to join
[etcd] Apr 13 21:16:50.518 WARNING   | Attempt to join via n3:7001 failed: Error during join version check: Get http://n3:7001/version: dial tcp 192.168.122.13:7001: connection refused
[etcd] Apr 13 21:16:50.518 WARNING   | Attempt to join via n4:7001 failed: Error during join version check: Get http://n4:7001/version: dial tcp 192.168.122.14:7001: connection refused
[etcd] Apr 13 21:16:50.518 WARNING   | No living peers are found!
[etcd] Apr 13 21:16:50.518 CRITICAL  | No available peers in backup list, and no log data
ubuntu@n5:/opt/etcd$
@xiang90

This comment has been minimized.

Show comment
Hide comment
@xiang90

xiang90 Apr 13, 2014

Contributor

@aphyr There is a known timing issue. Join is also a command, which need to be broadcasted. When you join a bunch of nodes at the same time, there will be a problem for now. We might need to loose the timeout for the join command.

Contributor

xiang90 commented Apr 13, 2014

@aphyr There is a known timing issue. Join is also a command, which need to be broadcasted. When you join a bunch of nodes at the same time, there will be a problem for now. We might need to loose the timeout for the join command.

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Apr 13, 2014

A possibly related issue: if the join process crashes on multiple machines, leaving the leader spinning complaining about missing heartbeats from down nodes, the leader will never respond to a join request again. Every join attempt, concurrent or not, results in

ubuntu@n2:/opt/etcd$ sudo bin/etcd -peer-addr n2:7001 -addr n2:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n2 -peers n1:7001,n2:7001,n3:7001,n4:7001,n5:7001
[etcd] Apr 13 23:00:11.748 INFO      | etcd server [name n2, listen on 0.0.0.0:4001, advertised url http://n2:4001]
[etcd] Apr 13 23:00:11.751 INFO      | URLs: /_etcd/machines:  / n2 ()
[etcd] Apr 13 23:00:11.753 INFO      | Send Join Request to http://n1:7001/v2/admin/machines/n2
[etcd] Apr 13 23:00:12.103 WARNING   | Attempt to join via n1:7001 failed: Unable to join: Put http://n1:7001/v2/admin/machines/n2: read tcp 192.168.122.11:7001: i/o timeout
[etcd] Apr 13 23:00:12.104 WARNING   | Attempt to join via n3:7001 failed: Error during join version check: Get http://n3:7001/version: dial tcp 192.168.122.13:7001: connection refused
[etcd] Apr 13 23:00:12.104 WARNING   | Attempt to join via n4:7001 failed: Error during join version check: Get http://n4:7001/version: dial tcp 192.168.122.14:7001: connection refused
[etcd] Apr 13 23:00:12.105 INFO      | Send Join Request to http://n5:7001/v2/admin/machines/n2
[etcd] Apr 13 23:00:12.106 INFO      | »»»» 307
[etcd] Apr 13 23:00:12.107 INFO      | »»»» 404
[etcd] Apr 13 23:00:12.107 WARNING   | Attempt to join via n5:7001 failed: Unable to join
[etcd] Apr 13 23:00:12.107 WARNING   | No living peers are found!
[etcd] Apr 13 23:00:12.107 CRITICAL  | No available peers in backup list, and no log data

These are in LXC containers; ping time between nodes is on the order of a tenth of a millisecond.

aphyr commented Apr 13, 2014

A possibly related issue: if the join process crashes on multiple machines, leaving the leader spinning complaining about missing heartbeats from down nodes, the leader will never respond to a join request again. Every join attempt, concurrent or not, results in

ubuntu@n2:/opt/etcd$ sudo bin/etcd -peer-addr n2:7001 -addr n2:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n2 -peers n1:7001,n2:7001,n3:7001,n4:7001,n5:7001
[etcd] Apr 13 23:00:11.748 INFO      | etcd server [name n2, listen on 0.0.0.0:4001, advertised url http://n2:4001]
[etcd] Apr 13 23:00:11.751 INFO      | URLs: /_etcd/machines:  / n2 ()
[etcd] Apr 13 23:00:11.753 INFO      | Send Join Request to http://n1:7001/v2/admin/machines/n2
[etcd] Apr 13 23:00:12.103 WARNING   | Attempt to join via n1:7001 failed: Unable to join: Put http://n1:7001/v2/admin/machines/n2: read tcp 192.168.122.11:7001: i/o timeout
[etcd] Apr 13 23:00:12.104 WARNING   | Attempt to join via n3:7001 failed: Error during join version check: Get http://n3:7001/version: dial tcp 192.168.122.13:7001: connection refused
[etcd] Apr 13 23:00:12.104 WARNING   | Attempt to join via n4:7001 failed: Error during join version check: Get http://n4:7001/version: dial tcp 192.168.122.14:7001: connection refused
[etcd] Apr 13 23:00:12.105 INFO      | Send Join Request to http://n5:7001/v2/admin/machines/n2
[etcd] Apr 13 23:00:12.106 INFO      | »»»» 307
[etcd] Apr 13 23:00:12.107 INFO      | »»»» 404
[etcd] Apr 13 23:00:12.107 WARNING   | Attempt to join via n5:7001 failed: Unable to join
[etcd] Apr 13 23:00:12.107 WARNING   | No living peers are found!
[etcd] Apr 13 23:00:12.107 CRITICAL  | No available peers in backup list, and no log data

These are in LXC containers; ping time between nodes is on the order of a tenth of a millisecond.

@bmizerany

This comment has been minimized.

Show comment
Hide comment
@bmizerany

bmizerany Apr 14, 2014

Contributor

@aphyr Thank you for all of the help. We're looking into the issue. If you find anything else, please let us know.

Contributor

bmizerany commented Apr 14, 2014

@aphyr Thank you for all of the help. We're looking into the issue. If you find anything else, please let us know.

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Apr 14, 2014

Will do! Happy to provide logs, try test branches, or anything else that might be of use. :)

aphyr commented Apr 14, 2014

Will do! Happy to provide logs, try test branches, or anything else that might be of use. :)

@xiang90

This comment has been minimized.

Show comment
Hide comment
@xiang90

xiang90 Apr 14, 2014

Contributor

@aphyr The 2nd issue should be fix after #717 get merged.

Contributor

xiang90 commented Apr 14, 2014

@aphyr The 2nd issue should be fix after #717 get merged.

@philips

This comment has been minimized.

Show comment
Hide comment
@philips

philips Apr 14, 2014

Contributor

#717 is merged.

Contributor

philips commented Apr 14, 2014

#717 is merged.

@yichengq

This comment has been minimized.

Show comment
Hide comment
@yichengq

yichengq Apr 14, 2014

Contributor

@xiangli-cmu @philips
To fix this bug, we could merge #707 for a short-term solution. It would try 3 times for new node to join the cluster, and n3, n4, n5 could be added to the cluster then.
And our Request Timeout is incorrect now, please review this one #624, which should set requestTimeout more reasonable.
For tests, when creating cluster, currently it starts the first instance before starting the others in parallel to avoid time waste, but this makes the tests less powerful. I would try to remove this part.
For long-term solution, the nodes should wait longer if the cluster is under leader election.

Contributor

yichengq commented Apr 14, 2014

@xiangli-cmu @philips
To fix this bug, we could merge #707 for a short-term solution. It would try 3 times for new node to join the cluster, and n3, n4, n5 could be added to the cluster then.
And our Request Timeout is incorrect now, please review this one #624, which should set requestTimeout more reasonable.
For tests, when creating cluster, currently it starts the first instance before starting the others in parallel to avoid time waste, but this makes the tests less powerful. I would try to remove this part.
For long-term solution, the nodes should wait longer if the cluster is under leader election.

@yichengq

This comment has been minimized.

Show comment
Hide comment
@yichengq

yichengq Apr 14, 2014

Contributor

And I would increase the timeout for join commands, because this is an important command for the cluster, and deserves a longer timeout.

Contributor

yichengq commented Apr 14, 2014

And I would increase the timeout for join commands, because this is an important command for the cluster, and deserves a longer timeout.

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Apr 15, 2014

I can confirm that master dramatically reduces the probability of getting the cluster into a broken state during startup; I can reliably stand up a cluster on the first try now (staggered start, 2 seconds between each process launch), instead of taking 5-6 tries.

aphyr commented Apr 15, 2014

I can confirm that master dramatically reduces the probability of getting the cluster into a broken state during startup; I can reliably stand up a cluster on the first try now (staggered start, 2 seconds between each process launch), instead of taking 5-6 tries.

@xiang90

This comment has been minimized.

Show comment
Hide comment
@xiang90

xiang90 Apr 15, 2014

Contributor

@aphyr Thanks for your feedback.
We still have several pending pull requests for completely solving this issue. Hope we can merge these soon.

Contributor

xiang90 commented Apr 15, 2014

@aphyr Thanks for your feedback.
We still have several pending pull requests for completely solving this issue. Hope we can merge these soon.

@yichengq

This comment has been minimized.

Show comment
Hide comment
@yichengq

yichengq Apr 18, 2014

Contributor

@aphyr I think this should work totally fine. Close this now, and let me know if this is still a problem.

Contributor

yichengq commented Apr 18, 2014

@aphyr I think this should work totally fine. Close this now, and let me know if this is still a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment