
Accumulated improvements to ZenDiscovery #7493

Merged
merged 74 commits into from Sep 1, 2014

7 participants
@bleskes
Member

bleskes commented Aug 28, 2014

This PR contains the accumulated work from the feature/improve_zen branch. Here are the highlights of the changes:

Testing infra

  • Networking:
    • all symmetric partitioning
    • dropping packets
    • hard disconnects
    • Jepsen Tests
  • Single node service disruptions:
    • Long GC / Halt
    • Slow cluster state updates
  • Discovery settings
    • Easy to setup unicast with partial host list

Zen Discovery

  • Pinging after master loss (no local elects)
  • Fixes the split brain issue: #2488
  • Batching join requests
  • More resilient joining process (wait on a publish from master)

kimchy and others added some commits Apr 10, 2014

[Discovery] lightweight minimum master node recovery
don't perform full recovery when minimum master nodes are not met, keep the state around and use it once elected as master
[Discovery] add rejoin on master gone flag, defaults to false
defaults to false since there is still work left to properly make it work
[Discovery] Make noMasterBlock configurable and added simple test that shows reads do execute (partially) when m_m_n isn't met
[Discovery] Changed the default for the 'rejoin_on_master_gone' option from false to true in zen discovery.

Added AwaitFix for the FullRollingRestartTests.
[Discovery] Eagerly clean the routing table of shards that exist on nodes that are not in the latestDiscoNodes list.

Only the previous master node has been removed, so only shards allocated to that node will get failed.
This would have happened anyhow later on when AllocationService#reroute is invoked (for example when a cluster setting changes or on another cluster event),
but by cleaning the routing table proactively, the stale routing table is fixed sooner and therefore the shards
that are not accessible anyhow (because the node these shards were on has left the cluster) will get re-assigned sooner.
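The proactive cleanup described above can be sketched as follows. This is an illustrative Python model, not Elasticsearch code; the function and field names are hypothetical:

```python
# Proactively fail shard routings that point at nodes missing from the
# latest discovered-nodes list, instead of waiting for the next reroute.
def clean_routing_table(shard_routings, latest_disco_nodes):
    """Split routings into (kept, failed) based on node membership."""
    kept, failed = [], []
    for shard in shard_routings:
        if shard["node"] in latest_disco_nodes:
            kept.append(shard)       # node still in the cluster
        else:
            failed.append(shard)     # node left; shard is unreachable
    return kept, failed

routings = [
    {"shard": 0, "node": "node_a"},
    {"shard": 1, "node": "old_master"},  # old master has left the cluster
]
kept, failed = clean_routing_table(routings, {"node_a", "node_b"})
```

Failing the second routing immediately lets re-assignment start without waiting for an unrelated cluster event to trigger a reroute.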
[Internal] Do not execute cluster state changes if current node is no longer master

When a node steps down from being a master (because, for example, min_master_node is breached), it may still have
cluster state update tasks queued up. Most (but not all) are tasks that should no longer be executed as the node
no longer has authority to do so. Other cluster states updates, like electing the current node as master, should be
executed even if the current node is no longer master.

This commit makes sure that, by default, `ClusterStateUpdateTask` is not executed if the node is no longer master. Tasks
that should run on non-master nodes are changed to implement a new interface called `ClusterStateNonMasterUpdateTask`

Closes #6230
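The marker-interface behavior described above can be modeled in a few lines. This is an illustrative sketch under assumed names, not the actual Java implementation:

```python
# By default a queued cluster state update task is skipped once the node is
# no longer master; tasks carrying the non-master marker still run.
class ClusterStateUpdateTask:
    runs_on_non_master = False       # default: only execute while master
    def __init__(self, name):
        self.name = name

class ClusterStateNonMasterUpdateTask(ClusterStateUpdateTask):
    runs_on_non_master = True        # marker: safe to run after stepping down

def drain_queue(queue, is_master):
    executed = []
    for task in queue:
        if is_master or task.runs_on_non_master:
            executed.append(task.name)   # node has authority to run this
        # otherwise: skip silently, the node lost its authority
    return executed

queue = [ClusterStateUpdateTask("reroute"),
         ClusterStateNonMasterUpdateTask("elect_local_as_master")]
```

Electing the local node as master is the canonical task that must still run on a node that is not (yet) master.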
[TEST] Added test that exposes a shard consistency problem when isolated node(s) rejoin the cluster after network segmentation and when the elected master node ended up on the lesser side of the network segmentation.
[Discovery] do not use versions to optimize cluster state copying for a first update from a new master

We have an optimization which compares the routing/meta data versions of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_node breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens when the current master updates its state, doesn't manage to broadcast it to other nodes due to the disconnect, and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of its state can be equal but the content is different.

Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements)

Closes #6466
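The failure mode being fixed can be shown with a toy model. This is an illustrative sketch of the optimization being removed, not Elasticsearch code:

```python
# Two masters can each produce a state with the same version number but
# different content, so "same version => reuse local copy" keeps stale data.
def reuse_if_same_version(local, incoming):
    # the unsafe optimization: trust the version number alone
    return local if incoming["version"] == local["version"] else incoming

old_master_state = {"version": 5, "routing": "stale"}   # never broadcast
new_master_state = {"version": 5, "routing": "fresh"}   # rebuilt after election

applied = reuse_if_same_version(old_master_state, new_master_state)
# the rejoining old master keeps its own stale state despite different content
```

Hence the fix: for the first update from a new master, copy the state instead of trusting version equality.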
[Test] testIsolateMasterAndVerifyClusterStateConsensus didn't wait on initializing shards before comparing cluster states
[Discovery] Change (Master|Nodes)FaultDetection's connect_on_network_disconnect default to false

The previous default was true, which means that after a node disconnected event we try to connect to it as an extra validation. This can result in slow detection of network partitions if the extra reconnect times out before failure.

Also added tests to verify the settings' behaviour
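The latency argument above can be put in a back-of-the-envelope model. The numbers and function below are illustrative assumptions, not Elasticsearch defaults:

```python
# With connect_on_network_disconnect=true, a disconnect event triggers an
# extra reconnect attempt whose timeout is paid before the node is declared
# gone; with false, the disconnect event is trusted immediately.
def time_to_detect_failure(connect_on_disconnect, reconnect_timeout,
                           base_delay=0.1):
    extra = reconnect_timeout if connect_on_disconnect else 0.0
    return base_delay + extra

slow = time_to_detect_failure(True, reconnect_timeout=30.0)   # old default
fast = time_to_detect_failure(False, reconnect_timeout=30.0)  # new default
```

With the new default, detection time no longer depends on the reconnect timeout at all.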
[Discovery] when master is gone, flush all pending cluster states
If the master FD flags the master as gone while there are still pending cluster states, processing those cluster states may re-instate that node as master again.

Closes #6526
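The race being fixed can be sketched as follows. This is an illustrative model with hypothetical names, not the actual implementation:

```python
# If pending cluster states from the old master are still applied after it
# was flagged as gone, the node re-appears as master. The fix flushes
# (drops) the pending queue when handling "master gone".
def process_master_gone(cluster, pending_states, flush_pending):
    if flush_pending:
        pending_states.clear()           # the fix: drop stale pending states
    cluster["master"] = None             # master is flagged as gone
    for state in pending_states:         # old behavior: stale state wins
        cluster["master"] = state["master"]
    return cluster

buggy = process_master_gone({"master": "node_a"},
                            [{"master": "node_a"}], flush_pending=False)
fixed = process_master_gone({"master": "node_a"},
                            [{"master": "node_a"}], flush_pending=True)
```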
[Tests] Added ServiceDisruptionScheme(s) and testAckedIndexing
This commit adds the notion of a ServiceDisruptionScheme, allowing for introducing disruptions in our test cluster. This
abstraction is used in a couple of wrappers around the functionality offered by MockTransportService to simulate various
network partitions. There is also one implementation for causing a node to be slow in processing cluster state updates.

This new mechanism is integrated into the existing DiscoveryWithNetworkFailuresTests tests.

A new test called testAckedIndexing is added to verify retrieval of documents whose indexing was acked during various disruptions.

Closes #6505
[TEST] Reduced failures in DiscoveryWithNetworkFailuresTests#testAckedIndexing test:

* waiting time should be long enough depending on the type of the disruption scheme
* MockTransportService#addUnresponsiveRule: if the remaining delay is smaller than 0, don't execute the transport logic twice
[TEST] Renamed afterDistribution timeout to expectedTimeToHeal
Accumulate expected shard failures to log later
[Tests] Disabling testAckedIndexing
The test is currently unstable and needs some more work
[Cluster] Refactored ClusterStateUpdateTask protection against execution on a non master

The previous implementation used a marker interface and had no explicit failure callback for the case where the update task was run on a non-master (i.e., the master stepped down after it was submitted). That led to a couple of instanceof checks.

This approach moves ClusterStateUpdateTask from an interface to an abstract class, which allows adding a flag to indicate whether it should only run on master nodes (defaults to true). It also adds an explicit onNoLongerMaster callback to allow different error handling for that case. This also removed the need for the NoLongerMaster.

Closes #7511
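The refactored design described above can be sketched like this. All names and details are a model mirroring the description, not Elasticsearch's actual code:

```python
# ClusterStateUpdateTask becomes an abstract base class with a
# run-only-on-master flag (defaulting to true) and an explicit
# onNoLongerMaster callback, replacing the marker interface.
class ClusterStateUpdateTask:
    def run_only_on_master(self):
        return True                        # default: skip once demoted

    def execute(self, state):
        raise NotImplementedError

    def on_no_longer_master(self, log):
        # explicit failure callback for the "master stepped down" case
        log.append(type(self).__name__ + ": no longer master")

class RerouteTask(ClusterStateUpdateTask):
    def execute(self, state):
        state["rerouted"] = True
        return state

class ElectLocalMasterTask(ClusterStateUpdateTask):
    def run_only_on_master(self):
        return False                       # must run even on a non-master
    def execute(self, state):
        state["master"] = "local"
        return state

def run_task(task, state, is_master, log):
    if task.run_only_on_master() and not is_master:
        task.on_no_longer_master(log)      # reject instead of executing
        return state
    return task.execute(state)
```

The explicit callback replaces scattered type checks: callers get one well-defined hook for the stepped-down case.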

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 1, 2014

[Discovery] accumulated improvements to ZenDiscovery
Merging the accumulated work from the feature/improve_zen branch. Here are the highlights of the changes:

__Testing infra__
- Networking:
    - all symmetric partitioning
    - dropping packets
    - hard disconnects
    - Jepsen Tests
- Single node service disruptions:
    - Long GC / Halt
    - Slow cluster state updates
- Discovery settings
    - Easy to setup unicast with partial host list

__Zen Discovery__
- Pinging after master loss (no local elects)
- Fixes the split brain issue: #2488
- Batching join requests
- More resilient joining process (wait on a publish from master)

Closes #7493

@bleskes bleskes merged commit 34f4ca7 into master Sep 1, 2014

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 1, 2014

[Discovery] accumulated improvements to ZenDiscovery
@javanna
Member

javanna commented Sep 1, 2014

typo s/removeDistruptionSchemeFromNode/removeDisruptionSchemeFromNode

@bleskes bleskes deleted the feature/improve_zen branch Sep 2, 2014

@clintongormley clintongormley changed the title from [Discovery] accumulated improvements to ZenDiscovery to Resiliency: Accumulated improvements to ZenDiscovery Sep 8, 2014

bleskes added a commit that referenced this pull request Sep 11, 2014

Resiliency: Master election should demote nodes which try to join the cluster for the first time

With the change in #7493, we introduced a pinging round when a master node goes down. That pinging round helps validate the current state of the cluster and takes, by default, 3 seconds. It may be that during that window, a new node tries to join the cluster and starts pinging (this is typical when you quickly restart the current master). If this node gets elected as the new master it will force recovery from the gateway (it has no in-memory cluster state), which in turn will cause a full cluster shard synchronisation. While this is not a problem on its own, it's a shame. This commit demotes "new" nodes during master election so they will only be elected if really needed.

Closes #7558
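The demotion described above amounts to a tie-breaker in the election ordering. The sketch below is an illustrative Python model; the sorting keys and names are hypothetical:

```python
# Candidates that have never joined the cluster (no in-memory cluster state)
# sort after candidates that have, so a freshly restarted master only wins
# the election if no established node is available.
def elect_master(candidates):
    # prefer nodes that have joined before; break ties by node id
    return min(candidates, key=lambda c: (c["never_joined"], c["id"]))

candidates = [
    {"id": "a_restarted_master", "never_joined": True},
    {"id": "z_established_node", "never_joined": False},
]
winner = elect_master(candidates)
```

The established node wins even though its id sorts later, avoiding the needless gateway recovery and full shard synchronisation.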

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 16, 2014

Resiliency: Master election should demote nodes which try to join the cluster for the first time

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 16, 2014

Discovery: back port #7558 to 1.x and add bwc protections of the new ping on master gone introduced in #7493

The change in #7558 adds a flag to PingResponse. However, when unicast discovery is used, this extra flag can not be serialized by the very initial pings as they do not yet know what node version they ping (i.e., they have to default to 1.0.0, which excludes changing the serialization format). This commit bypasses this problem by adding a dedicated action which only exists on nodes of version 1.4 or up. Nodes first try to ping this endpoint using 1.4.0 as a serialization version. If that fails they fall back to the pre-1.4.0 action. This is optimal if all nodes are on 1.4.0 or higher, with a small downside if the cluster has mixed versions - but this is a temporary state.

Two further bwc protections are added:
1) Disable the preference for nodes that previously joined the cluster if some of the pings are on version < 1.4.0
2) Disable the rejoin on master gone functionality if some nodes in the cluster are on version < 1.4.0
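The try-new-then-fall-back strategy described above can be sketched as follows. This is an illustrative model; the action names are hypothetical, not Elasticsearch's real transport action names:

```python
# Try the dedicated >=1.4 ping action first (serialized at 1.4.0); fall
# back to the old action if the target node doesn't expose the new one.
def ping(node, send):
    try:
        return send(node, action="internal:discovery/zen/unicast_gte_1_4")
    except KeyError:                  # endpoint absent on pre-1.4 nodes
        return send(node, action="internal:discovery/zen/unicast")

def make_transport(supported_actions):
    """Fake transport that rejects unknown actions (names hypothetical)."""
    def send(node, action):
        if action not in supported_actions:
            raise KeyError(action)
        return action                 # echo the action that was handled
    return send

new_node = make_transport({"internal:discovery/zen/unicast",
                           "internal:discovery/zen/unicast_gte_1_4"})
old_node = make_transport({"internal:discovery/zen/unicast"})
```

When every node supports the new action, the fallback never fires; with mixed versions, each ping pays one failed attempt, which matches the "small downside" noted above.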

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 16, 2014

Resiliency: Master election should demote nodes which try to join the cluster for the first time

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 16, 2014

Discovery: back port #7558 to 1.x and add bwc protections of the new ping on master gone introduced in #7493

Closes #7694

bleskes added a commit that referenced this pull request Sep 16, 2014

Resiliency: Master election should demote nodes which try to join the cluster for the first time

bleskes added a commit that referenced this pull request Sep 16, 2014

Discovery: back port #7558 to 1.x and add bwc protections of the new ping on master gone introduced in #7493

@jpountz jpountz removed the review label Oct 21, 2014

@clintongormley clintongormley changed the title from Resiliency: Accumulated improvements to ZenDiscovery to Accumulated improvements to ZenDiscovery Jun 7, 2015

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Resiliency: Master election should demote nodes which try to join the cluster for the first time

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Discovery: back port #7558 to 1.x and add bwc protections of the new ping on master gone introduced in #7493