Support transfer leader feature in elasticsearch #60032
Conversation
Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)
@boicehuang Thanks for the PR. For changes like these, it's best to discuss your change in a GitHub issue before spending much time on its implementation. We sometimes have to reject contributions that duplicate other efforts, take the wrong approach to solving a problem, or solve a problem which does not need solving. An up-front discussion often saves a good deal of wasted time in these cases (see also https://github.com/elastic/elasticsearch/blob/master/CONTRIBUTING.md#reviewing-and-accepting-your-contribution).

A first step here would be to better understand why elections are taking 3 seconds on a clean node shutdown. A second step here would be to better understand why this 3-second window is affecting your cluster that much (Is it that searches / indexing is blocked? If so, there's the

Regarding 3: I would like to question the usefulness of this. Dedicated master nodes exist to provide cluster stability, because data and master capabilities are not shared by the same node. A cluster that has a single dedicated master node and 2 mixed master/data nodes will not have that resiliency if just a single node goes down. If the dedicated master node fails, another node that already has the data capability will take on the role of master. There is a risk that that node will not be able to handle both its load as a master and as a data node. If it were to handle the load fine, you might as well have started out with 3 mixed master/data nodes instead of having one dedicated one. The feature to use here is voting-only master-eligible nodes.
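The voting-only setup Yannick suggests could look roughly like this in elasticsearch.yml (a sketch using the 7.x-era node settings that this PR targets; the node names and exact layout are illustrative, and the syntax differs in later versions that use node.roles):

```yaml
# Sketch: two mixed master/data nodes plus one voting-only tie-breaker.

# elasticsearch.yml on each of the two mixed nodes:
node.master: true
node.data: true

# elasticsearch.yml on the cheap tie-breaker node:
node.master: true
node.voting_only: true   # participates in elections but is never elected master
node.data: false
```

A voting-only node only casts votes, so a low-spec machine can safely fill this role without ever taking on the master's workload.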
@ywelsch Thanks for your reply. Let me try to explain more about it.
re 1: In the case of a clean socket break, re-election should be triggered instantly (see LeaderChecker.handleDisconnectedNode). Perhaps there's something not so clean about your shutdowns?

re 2: It sounds scary that a second-level unavailability can have a great impact on your payment business (especially given how frequently we can expect rolling restarts to happen). Perhaps that's something to be addressed at a larger architectural level. A small possible improvement on the ES side is that we don't necessarily need to block indexing when a master is unavailable; we only need the master to reconfigure (e.g. for shard allocation or failures) or for metadata-level operations (mapping changes).
As Yannick says, a clean shutdown doesn't need any timeouts; it causes an immediate reaction in Elasticsearch. In the case of an unclean shutdown you should rely on the OS to detect that the connection has dropped, since the OS can react to this much more quickly than Elasticsearch. If you are not detecting unclean shutdowns fast enough then I suspect you have not configured

The default for
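The OS-level detection David refers to is, on Linux, bounded mainly by TCP retransmission behavior; the Elasticsearch reference documentation suggests lowering net.ipv4.tcp_retries2 for this reason. As a sketch (exact timings depend on kernel version and RTT):

```shell
# On Linux, detection of a dead peer on an established TCP connection is
# bounded by net.ipv4.tcp_retries2 (default 15, which keeps retransmitting
# for roughly 15 minutes or more before the connection is reported broken).
sysctl net.ipv4.tcp_retries2

# Lowering it to 5 caps retransmission at roughly six seconds, so the OS
# reports the broken connection (and Elasticsearch reacts) much sooner.
sudo sysctl -w net.ipv4.tcp_retries2=5
```

To persist the change across reboots, the value would normally go in /etc/sysctl.conf or a file under /etc/sysctl.d/.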
With such short timeouts your cluster is very sensitive to transient disruptions and likely does more work reacting to those disruptions than it needs to. Also, reiterating Yannick's comment above, it is very risky to set such a short indexing timeout. I don't think we should be considering adding complexity to Elasticsearch to address such unreasonably tight latency goals. This is certainly not the only reason why indexing might take longer than a second sometimes.
Apologies, I misread the setting name. Can you set
No further details were forthcoming so I am closing this PR. |
Issue
Leadership transfer is an important extension to the Raft protocol. It is similar to leader abdication, but it aims at switching to a pre-designated node as fast as possible. It can improve the availability of Elasticsearch in the following situations:
In the case of a rolling restart, we shut down one node at a time. When we shut down the leader node, the cluster suffers at least 3 seconds of downtime while it detects the leader failure and elects a new leader. Instead, before shutting down the original leader, we can transfer leadership to a node that has already been restarted, which takes about a hundred milliseconds, so the cluster never loses its leader.
When the leader node is under heavy load but still connected to the other nodes, the health of the cluster suffers. We want to transfer leadership to a new node as soon as possible, instead of shutting the leader down and waiting for slow leader fault detection and re-election.
Two master/data nodes plus one dedicated master node is a cheap and reasonable setup for many users. The master/data nodes are only elected leader temporarily when the dedicated master node fails. With the leadership transfer extension, the low-spec dedicated master node can always be restored as leader under normal conditions.
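To make the rolling-restart arithmetic concrete, here is an illustrative sketch (plain Python, not Elasticsearch code) comparing the downtime paid for timeout-based failure detection against a proactive transfer. The timing constants are the figures discussed in this thread, not measurements:

```python
# Illustrative model of leader downtime during a rolling restart.
# Assumed figures: followers check the leader once per second and
# declare it lost after 3 failed checks; an election or a proactive
# handover each take on the order of 100 ms.

LEADER_CHECK_INTERVAL = 1.0   # seconds between leader health checks
LEADER_CHECK_RETRIES = 3      # failed checks before the leader is declared lost
ELECTION_TIME = 0.1           # time to run an election once triggered
TRANSFER_TIME = 0.1           # assumed time for a proactive leadership handover

def downtime_without_transfer() -> float:
    """Leader is shut down; followers must time out before re-electing."""
    detection = LEADER_CHECK_INTERVAL * LEADER_CHECK_RETRIES
    return detection + ELECTION_TIME

def downtime_with_transfer() -> float:
    """Leadership is handed to a designated node before shutdown,
    so no failure-detection timeout is paid."""
    return TRANSFER_TIME

if __name__ == "__main__":
    print(f"without transfer: {downtime_without_transfer():.1f}s")
    print(f"with transfer:    {downtime_with_transfer():.1f}s")
```

Under these assumptions the leaderless window shrinks from about 3.1 seconds to about 0.1 seconds, which is the gap the PR aims to close.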
Solution
The process of the leadership transfer extension is as follows.

1. Set preferred_master_name to designate a follower as the future leader.
2. Send a _cluster/reelect request to the cluster.

Testing
We used the leadership transfer extension to move leadership away before shutting down the active leader. Here is the test result.
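The restart flow being tested could be driven roughly like this. Note that these requests are hypothetical: the _cluster/reelect endpoint and the exact key of the preferred_master_name setting are proposals from this PR, not upstream Elasticsearch APIs, and the node name is made up:

```shell
# Hypothetical sketch of the API proposed in this PR.

# 1. Designate the node that should become the next leader
#    (the exact setting key is assumed here).
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"preferred_master_name": "node-1"}}'

# 2. Ask the cluster to re-elect, handing leadership to node-1
#    before the current leader is shut down.
curl -X POST "localhost:9200/_cluster/reelect"
```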