
Support transfer leader feature in elasticsearch #60032

Closed · wants to merge 1 commit

Conversation

@boicehuang (Contributor) commented Jul 22, 2020

Issue

Leadership transfer is an important extension of the Raft protocol. It is similar to leader abdication, but it aims to switch to a pre-selected node as quickly as possible. It can improve the availability of Elasticsearch in the following situations:

  1. During a rolling restart, we shut down one node at a time. When we shut down the leader node, it takes 3 seconds or more of downtime for the cluster to detect the leader failure and elect a new leader. Instead, we can transfer leadership to a node that has already been restarted, within about a hundred milliseconds, before shutting down the original leader.

  2. When the leader node is under heavy load but still connected to the other nodes, the health of the cluster suffers. We need to switch to a new leader as soon as possible, instead of shutting the old one down and waiting for the slow leader fault detection and re-election.

  3. Two master/data nodes plus one dedicated master node is a cheap and reasonable setup for many users. The master/data nodes should only be elected as leader temporarily, when the dedicated master node fails. With the leadership transfer extension, the low-spec dedicated master node can always be elected as leader under normal conditions.

Solution

The leadership transfer process is as follows (a request-level sketch follows the list).

  1. Update the cluster setting `preferred_master_name` to designate a follower as the future leader.
  2. POST `_cluster/reelect` to the cluster.
  3. The designated node has priority in the next election round and starts requesting votes from the other nodes to elect itself.
  4. If the designated node is elected as leader, the process ends. If it fails, it loses its priority and a normal election is held.
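
A minimal sketch of what this flow could look like against the REST API, assuming Python with `requests`. The `preferred_master_name` setting and the `_cluster/reelect` endpoint are the ones proposed in this PR; the `cluster.` prefix, the node name, and the request shapes are illustrative assumptions:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Step 1: designate a follower as the future leader via a dynamic
# cluster setting (full setting path assumed; the PR names it
# preferred_master_name).
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.preferred_master_name": "node-2"}},
).raise_for_status()

# Step 2: trigger an election in which the designated node has
# priority (endpoint proposed by this PR).
requests.post(f"{ES}/_cluster/reelect").raise_for_status()
```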

Testing

We used the leadership transfer extension to avoid the election triggered by shutting down the active leader. Here are the test results.

|          | before optimization | after optimization | after optimization, but first election failed |
| -------- | ------------------- | ------------------ | --------------------------------------------- |
| downtime | 3.0-3.1s            | 0.5-0.6s           | 1.0-1.1s                                      |

@ywelsch added the :Distributed/Cluster Coordination label on Jul 22, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@elasticmachine added the Team:Distributed label on Jul 22, 2020
@ywelsch (Contributor) commented Jul 22, 2020

@boicehuang Thanks for the PR. For changes like these, it's best to discuss your change in a GitHub issue before spending much time on its implementation. We sometimes have to reject contributions that duplicate other efforts, take the wrong approach to solving a problem, or solve a problem which does not need solving. An up-front discussion often saves a good deal of wasted time in these cases (see also https://github.com/elastic/elasticsearch/blob/master/CONTRIBUTING.md#reviewing-and-accepting-your-contribution).

A first step here would be to better understand why elections are taking 3 seconds on a clean node shutdown.

A second step here would be to better understand why this 3-second window affects your cluster so much (is it that searches / indexing are blocked? If so, there's the `cluster.no_master_block` setting, and potential improvements to be made in that area).
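
For reference, `cluster.no_master_block` is a dynamic setting, so it can be changed through the cluster settings API; a minimal sketch, where the localhost address is an assumption and the accepted values vary by version (check the docs for yours):

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# cluster.no_master_block controls which operations are rejected while
# the cluster has no elected master: "all" rejects reads and writes,
# "write" (the default) rejects only writes.
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.no_master_block": "write"}},
).raise_for_status()
```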

Regarding 3: I would like to question the usefulness of this. Dedicated master nodes exist to provide cluster stability, since the data and master responsibilities are not shared by the same node. A cluster that has a single dedicated master node and 2 mixed master/data nodes will not have that resiliency if just a single node goes down. If the dedicated master node fails, another node that already has the data capability will take on the role of a master, and there is a risk that it will not be able to handle its combined load as both a master and a data node. If it were to handle that load fine, you might as well have started out with 3 mixed master/data nodes instead of having one dedicated one. The feature to use here is voting-only master-eligible nodes.

@boicehuang (Contributor, Author) commented Jul 22, 2020

@ywelsch Thanks for your reply. Let me try to explain more about it.

  1. `cluster.fault_detection.leader_check.interval` defaults to 1s and `cluster.fault_detection.leader_check.retry_count` defaults to 3, so 3 failed checks must happen before the leader is considered to have failed (see the sketch after this list).
  2. In our case, Elasticsearch is used as the storage layer for billing data. Since the timeout for writing payment data to storage is 1 second, re-election has a great impact on our payment business.
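
A back-of-the-envelope check of the 3-second figure from item 1, assuming the leader is only declared failed after `retry_count` consecutive failed checks, one per `interval` (per-check timeouts ignored for simplicity):

```python
# Rough worst-case leader-failure detection time under the defaults
# quoted above; per-check timeouts are ignored for simplicity.
interval_s = 1.0   # cluster.fault_detection.leader_check.interval
retry_count = 3    # cluster.fault_detection.leader_check.retry_count

detection_s = interval_s * retry_count
print(f"~{detection_s:.0f}s before a new election can start")  # ~3s
```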

@ywelsch (Contributor) commented Jul 22, 2020

re 1: in the case of a clean socket break, re-election should be triggered instantly (see `LeaderChecker.handleDisconnectedNode`). Perhaps there's something not so clean about your shutdowns?

re 2: it sounds scary that second-level unavailability can have a great impact on your payment business (especially given the frequency at which we expect rolling restarts to happen). Perhaps that's something to be addressed at a larger architectural level. A small possible improvement on the ES side is that we don't necessarily need to block indexing when a master is unavailable; we only need the master for reconfiguration (e.g. shard allocation or failures) and for metadata-level operations (mapping changes).

@DaveCTurner (Contributor)

As Yannick says, a clean shutdown doesn't need any timeouts; it causes an immediate reaction in Elasticsearch. In the case of an unclean shutdown you should be relying on the OS to detect that the connection has dropped, since the OS can react to this much more quickly than Elasticsearch. If you are not detecting unclean shutdowns fast enough then I suspect you have not configured `net.ipv4.tcp_retries2` appropriately; it should be 3 if you want to detect a dropped connection in under a second. Relates #59222.
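
A minimal sketch, assuming a Linux host and root privileges, of inspecting and lowering this kernel knob via procfs (equivalent to `sysctl -w net.ipv4.tcp_retries2=3`; whether 3 is the right value for your network is the judgment call above):

```python
from pathlib import Path

# net.ipv4.tcp_retries2 bounds how many times an established TCP
# connection is retransmitted before the kernel gives up and reports
# the connection as broken.
knob = Path("/proc/sys/net/ipv4/tcp_retries2")

print("current:", knob.read_text().strip())  # Linux default is 15
knob.write_text("3\n")                       # requires root; not persistent
# To persist across reboots, set net.ipv4.tcp_retries2 = 3 in sysctl.conf.
```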

The default for `cluster.fault_detection.leader_check.interval` is 10s not 1s, and you should not adjust this setting:

> If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.

With such short timeouts your cluster is very sensitive to transient disruptions and likely does more work reacting to those disruptions than it needs to.

Also, reiterating Yannick's comment above, it is very risky to set such a short indexing timeout. I don't think we should be considering adding complexity to Elasticsearch to address such unreasonably tight latency goals. This is certainly not the only reason why indexing might take longer than a second sometimes.

@DaveCTurner (Contributor)

Apologies, I misread the setting name: `cluster.fault_detection.leader_check.timeout` defaults to 10s, but the interval is indeed 1s by default. I'm confused, however: an unclean shutdown would require the leader checks to time out (taking over 30 seconds by default), and a clean shutdown would trigger a new election immediately, so I do not understand what is taking 3 seconds.

Can you set `logger.org.elasticsearch.discovery: TRACE` and `logger.org.elasticsearch.cluster.coordination: TRACE` and share the resulting logs from one of these 3-second-long failovers?
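
For reference, the `logger.*` settings are dynamic, so they can also be flipped at runtime through the cluster settings API; a minimal sketch, assuming a node reachable at localhost:9200:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Raise discovery/coordination logging to TRACE for the experiment;
# set the values back to null afterwards to restore the defaults.
requests.put(
    f"{ES}/_cluster/settings",
    json={
        "transient": {
            "logger.org.elasticsearch.discovery": "TRACE",
            "logger.org.elasticsearch.cluster.coordination": "TRACE",
        }
    },
).raise_for_status()
```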

@DaveCTurner (Contributor)

No further details were forthcoming so I am closing this PR.
