Accumulated improvements to ZenDiscovery #7493

bleskes · 2014-08-28T07:06:53Z

This PR contains the accumulated work from the feature/improve_zen branch. Here are the highlights of the changes:

Testing infra

- Networking:
  - all symmetric partitioning
  - dropping packets
  - hard disconnects
  - Jepsen Tests
- Single node service disruptions:
  - Long GC / Halt
  - Slow cluster state updates
- Discovery settings
  - Easy to set up unicast with partial host list

Zen Discovery

- Pinging after master loss (no local elects)
- Fixes the split brain issue: minimum_master_nodes does not prevent split-brain if splits are intersecting #2488
- Batching join requests
- More resilient joining process (wait on a publish from master)

…odes that are not in the latestDiscoNodes list. Only the previous master node has been removed, so only shards allocated to that node will get failed. This would have happened anyway later on when AllocationService#reroute is invoked (for example when a cluster setting changes or on another cluster event), but by cleaning the routing table proactively, the stale routing table is fixed sooner and therefore the shards that are not accessible anyway (because the node these shards were on has left the cluster) will get re-assigned sooner.

… longer master

When a node steps down from being a master (because, for example, min_master_node is breached), it may still have cluster state update tasks queued up. Most (but not all) are tasks that should no longer be executed, as the node no longer has the authority to do so. Other cluster state updates, like electing the current node as master, should be executed even if the current node is no longer master. This commit makes sure that, by default, `ClusterStateUpdateTask` is not executed if the node is no longer master. Tasks that should run on non-masters are changed to implement a new interface called `ClusterStateNonMasterUpdateTask`.

Closes #6230

… a first update from a new master

We have an optimization which compares the routing/meta data versions of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_node breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens when the current master updates its state, doesn't manage to broadcast it to other nodes due to the disconnect, and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of its state can be equal while the content is different.

Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements).

Closes #6466
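
The failure mode above can be illustrated with a small sketch. This is not the Elasticsearch implementation: `StateSnapshot`, `masterId`, and both `canReuse…` methods are hypothetical names, and the guard shown (reuse only when the version matches and the state comes from the same master) is one assumed way to avoid the problem, not necessarily what the commit does.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for the routing/meta data part of a cluster state.
final class StateSnapshot {
    final String masterId; // node that published this state
    final long version;    // routing/meta data version
    final String content;  // placeholder for the actual data

    StateSnapshot(String masterId, long version, String content) {
        this.masterId = masterId;
        this.version = version;
        this.content = content;
    }
}

final class ReuseGuardSketch {
    // Version-only comparison: the optimization that can pick stale content.
    static boolean canReuseByVersionOnly(StateSnapshot current, StateSnapshot incoming) {
        return current.version == incoming.version;
    }

    // Assumed guard: additionally require that both states come from the same master.
    static boolean canReuse(StateSnapshot current, StateSnapshot incoming) {
        return current.version == incoming.version
                && Objects.equals(current.masterId, incoming.masterId);
    }

    public static void main(String[] args) {
        // The old master bumped its local state to version 8 but never published it.
        StateSnapshot local = new StateSnapshot("old-master", 8, "shard 0 on node A");
        // The new master started from version 7 and independently reached version 8.
        StateSnapshot incoming = new StateSnapshot("new-master", 8, "shard 0 on node B");

        System.out.println(canReuseByVersionOnly(local, incoming)); // true
        System.out.println(canReuse(local, incoming));              // false
    }
}
```

With the version-only check, the rejoining node would keep its own, diverged content even though the new master's state differs.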

…disconnect default to false

The previous default was true, which means that after a node disconnected event we try to connect to it as an extra validation. This can result in slow detection of network partitions if the extra reconnect times out before failure. Also added tests to verify the settings' behaviour.

This commit adds the notion of ServiceDisruptionScheme, allowing for introducing disruptions in our test cluster. This abstraction is used in a couple of wrappers around the functionality offered by MockTransportService to simulate various network partitions. There is also one implementation for causing a node to be slow in processing cluster state updates. This new mechanism is integrated into the existing DiscoveryWithNetworkFailuresTests. A new test called testAckedIndexing is added to verify retrieval of documents whose indexing was acked during various disruptions. Closes #6505
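
To make the idea of a disruption scheme concrete, here is a toy sketch. It does not reproduce the real ServiceDisruptionScheme or MockTransportService APIs: `DisruptionSchemeSketch`, `ToyNetwork`, and `TwoSidedPartitionSketch` are hypothetical names, and the "network" is just a set of blocked sender/receiver pairs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, pared-down disruption abstraction: something that can be switched on and off.
interface DisruptionSchemeSketch {
    void startDisrupting();
    void stopDisrupting();
}

// Toy network: a message is deliverable unless the (from, to) pair is blocked.
final class ToyNetwork {
    private final Set<String> blocked = new HashSet<>();

    void block(String from, String to) { blocked.add(from + "->" + to); }
    void unblock(String from, String to) { blocked.remove(from + "->" + to); }
    boolean canDeliver(String from, String to) { return !blocked.contains(from + "->" + to); }
}

// Symmetric partition: no traffic in either direction between the two sides.
final class TwoSidedPartitionSketch implements DisruptionSchemeSketch {
    private final ToyNetwork network;
    private final List<String> sideA;
    private final List<String> sideB;

    TwoSidedPartitionSketch(ToyNetwork network, List<String> sideA, List<String> sideB) {
        this.network = network;
        this.sideA = sideA;
        this.sideB = sideB;
    }

    @Override
    public void startDisrupting() {
        for (String a : sideA) {
            for (String b : sideB) {
                network.block(a, b);
                network.block(b, a);
            }
        }
    }

    @Override
    public void stopDisrupting() {
        for (String a : sideA) {
            for (String b : sideB) {
                network.unblock(a, b);
                network.unblock(b, a);
            }
        }
    }
}
```

A test would call `startDisrupting()`, exercise the cluster, call `stopDisrupting()`, and then wait for the cluster to heal before checking its state.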

…dIndexing test:
- waiting time should be long enough depending on the type of the disruption scheme
- MockTransportService#addUnresponsiveRule: if remaining delay is smaller than 0, don't double execute transport logic

…ion on a non master

The previous implementation used a marker interface and had no explicit failure callback for the case where an update task was run on a non-master (i.e., the master stepped down after it was submitted). That led to a couple of instanceof checks. This approach moves ClusterStateUpdateTask from an interface to an abstract class, which allows adding a flag to indicate whether it should only run on master nodes (defaults to true). It also adds an explicit onNoLongerMaster callback to allow different error handling for that case. This also removed the need for the NoLongerMaster.

Closes #7511
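
A minimal sketch of the shape described above, assuming a much simplified task runner. `UpdateTaskSketch` and `UpdateTaskRunnerSketch` are hypothetical names, the cluster state is reduced to a `String`, and the method names `runOnlyOnMaster` and `onNoLongerMaster` follow the wording of the commit message rather than a verified signature.

```java
// Simplified stand-in for the abstract task class described above.
// The cluster state is reduced to a String to keep the sketch self-contained.
abstract class UpdateTaskSketch {
    abstract String execute(String currentState);

    // Defaults to true; tasks that may run on a non-master override this.
    boolean runOnlyOnMaster() {
        return true;
    }

    // Explicit callback for a task that finds the local node is no longer master.
    void onNoLongerMaster(String source) {
    }
}

final class UpdateTaskRunnerSketch {
    private String state;
    private boolean localNodeIsMaster;

    UpdateTaskRunnerSketch(String initialState, boolean localNodeIsMaster) {
        this.state = initialState;
        this.localNodeIsMaster = localNodeIsMaster;
    }

    void setLocalNodeIsMaster(boolean value) {
        this.localNodeIsMaster = value;
    }

    String state() {
        return state;
    }

    // A single flag check replaces instanceof checks against a marker interface.
    void run(String source, UpdateTaskSketch task) {
        if (task.runOnlyOnMaster() && !localNodeIsMaster) {
            task.onNoLongerMaster(source);
            return;
        }
        state = task.execute(state);
    }

    public static void main(String[] args) {
        UpdateTaskRunnerSketch runner = new UpdateTaskRunnerSketch("v1", false);
        runner.run("reroute", new UpdateTaskSketch() {
            @Override
            String execute(String s) {
                return s + "+reroute";
            }

            @Override
            void onNoLongerMaster(String source) {
                System.out.println("skipped [" + source + "]: no longer master");
            }
        });
        runner.run("elect-self", new UpdateTaskSketch() {
            @Override
            String execute(String s) {
                return s + "+elected";
            }

            @Override
            boolean runOnlyOnMaster() {
                return false;
            }
        });
        System.out.println(runner.state()); // v1+elected
    }
}
```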

…e cluster for the first time

With the change in #7493, we introduced a pinging round when a master node goes down. That pinging round helps validate the current state of the cluster and takes, by default, 3 seconds. It may be that during that window, a new node tries to join the cluster and starts pinging (this is typical when you quickly restart the current master). If this node gets elected as the new master, it will force recovery from the gateway (it has no in-memory cluster state), which in turn will cause a full cluster shard synchronisation. While this is not a problem on its own, it's a shame. This commit demotes "new" nodes during master election so they will only be elected if really needed.

Closes #7558
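
The demotion can be sketched as an ordering over election candidates. This is an illustration under assumptions, not the actual election code: `Candidate`, `hasJoinedOnce`, and the tie-break on node id are hypothetical choices made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical election candidate; hasJoinedOnce is false for a node that
// is trying to join the cluster for the first time.
final class Candidate {
    final String nodeId;
    final boolean hasJoinedOnce;

    Candidate(String nodeId, boolean hasJoinedOnce) {
        this.nodeId = nodeId;
        this.hasJoinedOnce = hasJoinedOnce;
    }
}

final class ElectionSketch {
    // Nodes that already joined the cluster sort first; ties are broken by node id (assumed).
    static int compare(Candidate a, Candidate b) {
        if (a.hasJoinedOnce != b.hasJoinedOnce) {
            return a.hasJoinedOnce ? -1 : 1;
        }
        return a.nodeId.compareTo(b.nodeId);
    }

    // The preferred candidate wins; a "new" node is elected only if no other candidate exists.
    static Optional<Candidate> elect(List<Candidate> candidates) {
        return candidates.stream().min(ElectionSketch::compare);
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("a-restarted-master", false),
                new Candidate("z-existing-node", true));
        System.out.println(elect(candidates).get().nodeId); // z-existing-node
    }
}
```

Without the `hasJoinedOnce` check, the freshly restarted node would win on node id alone and trigger the gateway recovery described above.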

…ping on master gone introduced in #7493

The change in #7558 adds a flag to PingResponse. However, when unicast discovery is used, this extra flag cannot be serialized by the very initial pings, as they do not yet know what node version they ping (i.e., they have to default to 1.0.0, which excludes changing the serialization format). This commit bypasses this problem by adding a dedicated action which only exists on nodes of version 1.4 or up. Nodes first try to ping this endpoint using 1.4.0 as a serialization version. If that fails, they fall back to the pre-1.4.0 action. This is optimal if all nodes are on 1.4.0 or higher, with a small downside if the cluster has mixed versions, but this is a temporary state.

Furthermore, two bwc protections are added:
1) Disable the preference for nodes that previously joined the cluster if some of the pings are on version < 1.4.0
2) Disable the rejoin on master gone functionality if some nodes in the cluster are on version < 1.4.0

Closes #7694
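
The try-new-then-fall-back pattern can be sketched as follows. The action names, the `ActionNotFoundException` type, and the `Function`-based transport are all hypothetical stand-ins for illustration; they are not the real transport API or the real action names.

```java
import java.util.function.Function;

final class PingFallbackSketch {
    // Hypothetical action names, for illustration only.
    static final String NEW_ACTION = "ping/v1_4";
    static final String LEGACY_ACTION = "ping/legacy";

    // Thrown by the (hypothetical) transport when the target node has no handler for an action.
    static final class ActionNotFoundException extends RuntimeException {
        ActionNotFoundException(String action) {
            super("no handler for " + action);
        }
    }

    // Try the dedicated new action first; fall back to the legacy one if the node lacks it.
    static String ping(Function<String, String> transport) {
        try {
            return transport.apply(NEW_ACTION);
        } catch (ActionNotFoundException e) {
            return transport.apply(LEGACY_ACTION);
        }
    }

    public static void main(String[] args) {
        // An older node only understands the legacy action.
        Function<String, String> oldNode = action -> {
            if (action.equals(NEW_ACTION)) {
                throw new ActionNotFoundException(action);
            }
            return "pong via " + action;
        };
        System.out.println(ping(oldNode)); // pong via ping/legacy
    }
}
```

The cost of the fallback is one extra failed request per old node, which matches the "small downside" for mixed-version clusters noted above.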

kimchy and others added 30 commits August 27, 2014 15:45

[Discovery] lightweight minimum master node recovery

63d0406

don't perform full recovery when minimum master nodes are not met, keep the state around and use it once elected as master

[Internal] make no master lock an instance var so it can be configured

4824f05

[Discovery] add rejoin on master gone flag, defaults to false

6ede83a

defaults to false since there is still work left to properly make it work

[Discovery] Make noMasterBlock configurable and added simple test that shows reads do execute (partially) when m_m_n isn't met

97bdc8f

[Discovery] Enable discovery.zen.rejoin_on_master_gone setting in DiscoveryWithNetworkFailuresTests only.

3cdbb1a

[Discovery] Changed the default for the 'rejoin_on_master_gone' option from false to true in zen discovery. Added AwaitFix for the FullRollingRestartTests.

549076e

[Discovery] If available newly elected master node should take over previous known nodes.

89a50f6

Updated to use ClusterBlocks new constructor signature

a9aa10a

Introduced with: 11a3201

[TEST] It may take a little bit before the unlucky node deals with the fact the master left

2c9ef63

[TEST] Added test that verifies data integrity during and after a simulated network split.

fc8ae4d

[TEST] Make sure there no initializing shards when network partition is simulated

e7d24ec

[TEST] Added test that exposes a shard consistency problem when isolated node(s) rejoin the cluster after network segmentation and when the elected master node ended up on the lesser side of the network segmentation.

4828e78

[Discovery] Removed METADATA block

424a2f6

[Discovery] Made 'discovery.zen.rejoin_on_master_gone' setting updatable at runtime.

1849d09

[TEST] Remove 'index.routing.allocation.total_shards_per_node' setting in data consistency test

f3d90cd

[Test] testIsolateMasterAndVerifyClusterStateConsensus didn't wait on initializing shards before comparing cluster states

e39ac7e

[Discovery] Improved logging when a join request is not executed because local node is no longer master

8b85d97

[Discovery] when master is gone, flush all pending cluster states

5d13571

If the master FD flags the master as gone while there are still pending cluster states, the processing of those cluster states will re-instate that node as master again. Closes #6526

[TEST] Check if worker if null to prevent NPE on double stopping

8aed9ee

[TEST] Renamed afterDistribution timeout to expectedTimeToHeal

f7b962a

Accumulate expected shard failures to log later

[Test] ensureStableCluster failed to pass viaNode parameter correctly

a7a61a0

Also improved timeouts & logs

[Tests] Disabling testAckedIndexing

1af82fd

The test is currently unstable and needs some more work

Fixed compilation issue caused by the lack of a thread pool name

c3e84eb

[TEST] Added test to verify if 'discovery.zen.rejoin_on_master_gone' is updatable at runtime.

98084c0

bleskes added 2 commits September 1, 2014 15:51

[Internal] Extract a common base class for (Master|Nodes)FaultDetection

596a4a0

They share a lot of settings and some logic. Closes #7512

bleskes merged commit 34f4ca7 into master Sep 1, 2014

bleskes mentioned this pull request Sep 1, 2014

minimum_master_nodes does not prevent split-brain if splits are intersecting #2488

Closed

bleskes deleted the feature/improve_zen branch September 2, 2014 17:59

This was referenced Sep 3, 2014

Master election should demotes nodes which try to join the cluster for the first time #7558

Closed

[Indexing] A network partition can cause in flight documents to be lost #7572

Closed

clintongormley changed the title ~~[Discovery] accumulated improvements to ZenDiscovery~~ Resiliency: Accumulated improvements to ZenDiscovery Sep 8, 2014

clintongormley added the release highlight label Sep 8, 2014

clintongormley added the >enhancement label Sep 11, 2014

bleskes mentioned this pull request Sep 29, 2014

When starting up, recovery of shards takes up to 50 minutes #6372

Closed

jpountz removed the review label Oct 21, 2014

bleskes mentioned this pull request Nov 17, 2014

ElasticSearch 1.3.4 recovery slow on larger clusters (50+ total nodes) #8487

Closed

clintongormley changed the title ~~Resiliency: Accumulated improvements to ZenDiscovery~~ Accumulated improvements to ZenDiscovery Jun 7, 2015

clintongormley added the :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure label Jun 7, 2015

LiangShang mentioned this pull request Jul 6, 2016

An Issue Named “minimum_master_nodes does not prevent split-brain if splits are intersecting” LiangShang/liangshang.github.com#19

Open
