standby nodes in 0.5.0 #1416
We have deprecated the standby feature in 0.5 by introducing a new proxy mode. The proxy will not promote itself automatically. We found that automatic demotion/promotion confuses people, since its behavior is not very controllable or deterministic. A configuration change is also an important event for an etcd cluster: it should not happen frequently, and we do expect human involvement. We may introduce a deterministic auto-recovery strategy to replace dead nodes.
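(For readers unfamiliar with proxy mode, here is a minimal conceptual sketch of what "forward requests, never promote" means. This is not etcd's actual proxy code; the member address is a placeholder and the whole thing only illustrates the idea that a proxy-mode node relays client traffic to a real member and takes no part in Raft.)

```go
// Conceptual sketch only: a proxy-mode node forwards client requests to a
// real cluster member and contains no election or promotion logic, so it can
// never turn itself into a voting member.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Placeholder address of an actual etcd member.
	member, err := url.Parse("http://10.0.1.10:4001")
	if err != nil {
		log.Fatal(err)
	}
	// Forward every request unchanged; nothing here participates in Raft.
	proxy := httputil.NewSingleHostReverseProxy(member)
	log.Fatal(http.ListenAndServe(":4001", proxy))
}
```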
Auto recovery would be a great feature!
@vany-egorov etcd, like other datastores, consumes CPU/network resources. It is even more important than a normal DB service, since it provides the source of truth for your whole cluster. We do not think people should ever put an actual etcd server on an arbitrary machine in the cluster. Also, the recovery process comes with a big cost. We have a very rough plan to share with you at this moment:
@vany-egorov Moreover, we have two assumptions:
So the case you described should not happen in most cases. Otherwise you have a relatively high probability of losing all your etcd machines (if all the etcd machines are among the 15 machines that went down). If that is really the case, then you need to consider putting etcd on a more stable and dedicated cluster.
@xiangli-cmu Maybe you are right. We shall try to implement recovery by adding/removing nodes to/from the etcd cluster, automatically or through a web UI. Thanks!
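(For context, a hedged sketch of what "recovery by adding a node" could look like against the v2 members HTTP API that etcd 0.5/2.0 exposes. The endpoint path, ports, and addresses below are assumptions for illustration only.)

```go
// Sketch: ask a healthy member to add a replacement peer to the cluster via
// POST /v2/members, then the new node can be started pointing at the cluster.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// peerURLs of the replacement node (placeholder address).
	body := []byte(`{"peerURLs": ["http://10.0.1.13:7001"]}`)
	resp, err := http.Post(
		"http://10.0.1.10:4001/v2/members", // any healthy member (placeholder)
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("add member:", resp.Status)
	// Removing the dead member would be a DELETE to /v2/members/<memberID>;
	// a web UI or script could drive both steps once a human decides to act.
}
```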
@xiangli-cmu What about automatic updates? During an update the machine is considered failed and can't be controlled, since updates happen in the background and ALL the nodes are involved in the operation.
@asiragusa Automatic updates fall into case 1, so they have nothing to do with dynamic configuration change. Moreover, if you are using CoreOS, they can be controlled.
Yep, I see, but you stated that
What is the impact of this in the case of a whole-cluster update? And what about a node reinstall (#863)? I am getting scared of etcd/fleet, as it seems quite unstable to me ATM...
If you still have the data on all the members, restarting etcd comes with nearly zero cost.
Can you estimate this cost please? Is it based on the snapshot size + the ops since the last snapshot?
And does it freeze the whole cluster during this operation? Sorry about that, I know that it's a lot of questions :)
Can you please tell me first what exact problem you want to solve?
I need a reliable platform to work with, and I have had a lot of trouble with etcd / fleet. I have to start a good number of short-lived units on my cluster, let's say 1/s, each living for 15 minutes. I know this involves fleet too, but during my tests etcd became unavailable too often, and this caused trouble for fleet as well. Moreover, while rebooting my machines (one at a time), the cluster stopped working and I had to create a new one with a new discovery URL. Due to the high load on etcd I disabled snapshots, as stated in the docs, but this probably made things worse.

Now I am considering using Mesos / Marathon for that job and CoreOS / etcd just to deploy the Mesos slaves on each server, because of the easy and fast setup. However, I still don't know if it will be able to handle that load. For sure etcd / fleet are still not ready, and I have little hope that they will be soon :/
@asiragusa I think the first thing you need to do is to isolate the problems you ran into with etcd.
Sorry, I'd rather have put this comment on that issue.
In etcd v0.4.6 a standby node was able to act as a peer node if one of the peer nodes died. Standbys are not part of the Raft cluster themselves.
A standby node was able to replace a dead peer node automatically.
This was very useful for big cluster sizes (>20 machines).
It was possible to set cluster-active-size, cluster-remove-delay, and cluster-sync-interval.
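(As a reminder of how those knobs were tuned in v0.4.x, here is a hedged sketch. It assumes the /v2/admin/config endpoint and the activeSize/removeDelay/syncInterval field names used for the standby cluster configuration; the values, port, and address are placeholders.)

```go
// Sketch: update the v0.4.x standby/cluster configuration by PUT-ing JSON to
// the admin config endpoint of a running member.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Assumed field names corresponding to cluster-active-size,
	// cluster-remove-delay and cluster-sync-interval.
	cfg := []byte(`{"activeSize": 9, "removeDelay": 1800, "syncInterval": 5}`)
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:4001/v2/admin/config", bytes.NewReader(cfg))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("cluster config update:", resp.Status)
}
```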