Accumulated improvements to ZenDiscovery #7493

Merged
merged 74 commits into from Sep 1, 2014

Projects

None yet

7 participants

@bleskes
Member
bleskes commented Aug 28, 2014

This PR contains the accumulated work from the feautre/improve_zen branch. Here are the highlights of the changes:

Testing infra

  • Networking:
    • all symmetric partitioning
    • dropping packets
    • hard disconnects
    • Jepsen Tests
  • Single node service disruptions:
    • Long GC / Halt
    • Slow cluster state updates
  • Discovery settings
    • Easy to setup unicast with partial host list

Zen Discovery

  • Pinging after master loss (no local elects)
  • Fixes the split brain issue: #2488
  • Batching join requests
  • More resilient joining process (wait on a publish from master)
kimchy and others added some commits Apr 10, 2014
@kimchy @bleskes kimchy [Discovery] lightweight minimum master node recovery
don't perform full recovery when minimum master nodes are not met, keep the state around and use it once elected as master
63d0406
@kimchy @bleskes kimchy [Internal] make no master lock an instance var so it can be configured 4824f05
@kimchy @bleskes kimchy [Discovery] add rejoin on master gone flag, defaults to false
defaults to false since there is still work left to properly make it work
6ede83a
@martijnvg @bleskes martijnvg [Discovery] Make noMasterBlock configurable and added simple test tha…
…t shows reads do execute (partially) when m_m_n isn't met
97bdc8f
@martijnvg @bleskes martijnvg [Discovery] Enable `discovery.zen.rejoin_on_master_gone` setting in D…
…iscoveryWithNetworkFailuresTests only.
3cdbb1a
@martijnvg @bleskes martijnvg [Discovery] Changed the default for the 'rejoin_on_master_gone' optio…
…n from false to true in zen discovery.

Added AwaitFix for the FullRollingRestartTests.
549076e
@martijnvg @bleskes martijnvg [Discovery] If available newly elected master node should take over p…
…revious known nodes.
89a50f6
@martijnvg @bleskes martijnvg [Discovery] Eagerly clean the routing table of shards that exist on n…
…odes that are not in the latestDiscoNodes list.

Only the previous master node has been removed, so only shards allocated to that node will get failed.
This would have happened anyhow on later on when AllocationService#reroute is invoked (for example when a cluster setting changes or another cluster event),
but by cleaning the routing table pro-actively, the stale routing table is fixed sooner and therefor the shards
that are not accessible anyhow (because the node these shards were on has left the cluster) will get re-assigned sooner.
2220c66
@bleskes bleskes Updated to use ClusterBlocks new constructor signature
Introduced with: 11a3201
a9aa10a
@bleskes bleskes [Internal] Do not execute cluster state changes if current node is no…
… longer master

When a node steps down from being a master (because, for example, min_master_node is breached), it may still have
cluster state update tasks queued up. Most (but not all) are tasks that should no longer be executed as the node
no longer has authority to do so. Other cluster states updates, like electing the current node as master, should be
executed even if the current node is no longer master.

This commit make sure that, by default, `ClusterStateUpdateTask` is not executed if the node is no longer master. Tasks
that should run on non masters are changed to implement a new interface called `ClusterStateNonMasterUpdateTask`

Closes #6230
d44bed5
@martijnvg @bleskes martijnvg [TEST] It may take a little bit before the unlucky node deals with th…
…e fact the master left
2c9ef63
@martijnvg @bleskes martijnvg [TEST] Added test that verifies data integrity during and after a sim…
…ulated network split.
fc8ae4d
@martijnvg @bleskes martijnvg [TEST] Make sure there no initializing shards when network partition …
…is simulated
e7d24ec
@martijnvg @bleskes martijnvg [TEST] Added test that exposes a shard consistency problem when isola…
…ted node(s) rejoin the cluster after network segmentation and when the elected master node ended up on the lesser side of the network segmentation.
4828e78
@martijnvg @bleskes martijnvg [Discovery] Removed METADATA block 424a2f6
@martijnvg @bleskes martijnvg [Discovery] Made 'discovery.zen.rejoin_on_master_gone' setting updata…
…ble at runtime.
1849d09
@bleskes bleskes [Discovery] do not use versions to optimize cluster state copying for…
… a first update from a new master

We have an optimization which compares routing/meta data version of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_node breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens where the current master updates it's state, doesn't manage to broadcast it to other nodes due to the disconnect and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of it's state can equal but the content is different.

Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements)

Closes #6466
58f8774
@martijnvg @bleskes martijnvg [TEST] Remove 'index.routing.allocation.total_shards_per_node' settin…
…g in data consistency test
f3d90cd
@bleskes bleskes [Test] testIsolateMasterAndVerifyClusterStateConsensus didn't wait on…
… initializing shards before comparing cluster states
e39ac7e
@bleskes bleskes [Discovery] Change (Master|Nodes)FaultDetection's connect_on_network_…
…disconnect default to false

The previous default was true, which means that after a node disconnected event we try to connect to it as an extra validation. This can result in slow detection of network partitions if the extra reconnect times out before failure.

Also added tests to verify the settings' behaviour
7db9e98
@bleskes bleskes [Discovery] Improved logging when a join request is not executed beca…
…use local node is no longer master
8b85d97
@bleskes bleskes [Discovery] when master is gone, flush all pending cluster states
If the master FD flags master as gone while there are still pending cluster states, the processing of those cluster states we re-instate that node a master again.

Closes #6526
5d13571
@bleskes bleskes [Tests] Added ServiceDisruptionScheme(s) and testAckedIndexing
This commit adds the notion of ServiceDisruptionScheme allowing for introducing disruptions in our test cluster. This
abstraction as used in a couple of wrappers around the functionality offered by MockTransportService to simulate various
network partions. There is also one implementation for causing a node to be slow in processing cluster state updates.

This new mechnaism is integrated into existing tests DiscoveryWithNetworkFailuresTests.

A new test called testAckedIndexing is added to verify retrieval of documents whose indexing was acked during various disruptions.

Closes #6505
28489ce
@martijnvg @bleskes martijnvg [TEST] Check if worker if null to prevent NPE on double stopping 8aed9ee
@martijnvg @bleskes martijnvg [TEST] Reduced failures in DiscoveryWithNetworkFailuresTests#testAcke…
…dIndexing test:

* waiting time should be long enough depending on the type of the disruption scheme
* MockTransportService#addUnresponsiveRule if remaining delay is smaller than 0 don't double execute transport logic
785d0e5
@martijnvg @bleskes martijnvg [TEST] Renamed afterDistribution timeout to expectedTimeToHeal
Accumulate expected shard failures to log later
f7b962a
@bleskes bleskes [Test] ensureStableCluster failed to pass viaNode parameter correctly
Also improved timeouts & logs
a7a61a0
@bleskes bleskes [Tests] Disabling testAckedIndexing
The test is currently unstable and needs some more work
1af82fd
@bleskes bleskes Fixed compilation issue caused by the lack of a thread pool name c3e84eb
@martijnvg @bleskes martijnvg [TEST] Added test to verify if 'discovery.zen.rejoin_on_master_gone' …
…is updatable at runtime.
98084c0
@martijnvg @bleskes martijnvg [TEST] Verify no master block during partition for read and write apis 52f69c6
@martijnvg @bleskes martijnvg [TEST] Make sure get request is always local 77dae63
@bleskes bleskes Added java docs to all tests in DiscoveryWithNetworkFailuresTests
Moved testVerifyApiBlocksDuringPartition to test blocks rather then rely on specific API rejections.
Did some cleaning while at it.
5e5f8a9
@bleskes bleskes [Tests] improved automatic disruption healing after tests e897dcc
@martijnvg @bleskes martijnvg [TEST] Properly clear the disruption schemes after test completed. d99ca80
@bleskes bleskes [Test] testVerifyApiBlocksDuringPartition - wait for stable cluster a…
…fter partition
48c7da1
@bleskes bleskes [Discovery] immediately start Master|Node fault detection pinging
After a node joins the clusters, it starts pinging the master to verify it's health. Before, the cluster join request was processed async and we had to give some time to complete. With  #6480 we changed this to wait for the join process to complete on the master. We can therefore start pinging immediately for fast detection of failures. Similar change can be made to the Node fault detection from the master side.

Closes #6706
5302a53
@bleskes bleskes [Discovery] Start master fault detection after pingInterval
This is to allow the master election to complete on the chosen master.

 Relates to #6706
3586e38
@bleskes bleskes [Tests] Use local gateway
This is important to for proper primary allocation decisions
522d4af
@bleskes bleskes [Tests] Don't log about restoring a partition if the partition is not…
… active.
7b6e194
@bleskes bleskes [Tests] Increase timeout when waiting for partitions to heal
the current 30s addition is tricky because we use 30s as timeout in many places...
c12d090
@bleskes bleskes [Internal] Migrate new initial state cluster update task to a Cluster…
…StateNonMasterUpdateTask
e0543b3
@bleskes bleskes [Tests] Introduced ClusterDiscoveryConfiguration
Closes #6890
ea27837
@bleskes bleskes [logging] don't log an error if scheduled reroute is rejected because…
… local node is no longer master

Since it runs in a background thread after a node is added, or submits a cluster state update when a node leaves, it may be that by the time it is executed the local node is no longer master.
7fa3d70
@bleskes bleskes Remove unneeded reference to DiscoveryService which potentially cause…
…s circular references
ccabb4a
@bleskes bleskes [Tests] stability improvements
added explicit cleaning of temp unicast ping results
reduce gateway local.list_timeout to 10s.
testVerifyApiBlocksDuringPartition: verify master node has stepped down before restoring partition
bebaf97
@martijnvg @bleskes martijnvg Discovery: If not enough possible masters are found, but there are ma…
…sters to ping (ping responses did include master node) then these nodes should be resolved.

After the findMaster() call we try to connect to the node and if it isn't the master we start looking for a new master via pinging again.

Closes #6904
67685cb
@bleskes bleskes [Store] migrate non-allocated shard deletion to use ClusterStateNonMa…
…sterUpdateTask
f029a24
@martijnvg @bleskes martijnvg Discovery: Only add local node to possibleMasterNodes if it is a mast…
…er node.
5e38e9e
@martijnvg @bleskes martijnvg Discovery: Don't include local node to pingMasters list. We might end…
… up electing ourselves without any form of verification.
c2142c0
@bleskes bleskes [Discovery] join master after first election
Currently, pinging results are only used if the local node is elected master or if they detect another *already* active master. This has the effect that master election requires two pinging rounds - one for the elected master to take is role and another for the other nodes to detect it and join the cluster. We can be smarter and use the election of the first round on other nodes as well. Those nodes can try to join the elected master immediately. There is a catch though - the elected master node may still be processing the election and may reject the join request if not ready yet. To compensate a retry mechanism is introduced to try again (up to 3 times by default) if this happens.

Closes #6943
ffcf107
@bleskes bleskes [Tests] Fixed some issues with SlowClusterStateProcessing
Reduced expected time to heal to 0 (we interrupt and wait on stop disruption). It was also  wrongly indicated in seconds.
We didn't properly wait between slow cluster state tasks
a409848
@bleskes bleskes [Discovery] verify we have a master after a successful join request
After master election, nodes send join requests to the elected master. Master is then responsible for publishing a new cluster state which sets the master on the local node's cluster state. If something goes wrong with the cluster state publishing, this process will not successfully complete. We should check it after the join request returns and if it failed, retry pinging.

Closes #6969
cccd060
@kimchy @bleskes kimchy retry logic to unwrap exception to check for illegal state
it probably comes wrapped in a remote exception, which we should unwrap in order to detect it..., also, simplified a bit the retry logic
0244ddb
@martijnvg @bleskes martijnvg [Discovery] Made the handeling of the join request batch oriented.
In large clusters when a new elected master is chosen, there are many join requests to handle. By batching them up the the cluster state doesn't get published for each individual join request, but many handled at the same time, which results into a single new cluster state which ends up be published.

Closes #6984
130e680
@martijnvg @bleskes martijnvg [TEST] Added test that verifies that no shard relocations happen duri…
…ng / after a master re-election.
364374d
@martijnvg @bleskes martijnvg [Discovery] Master fault detection and nodes fault detection should t…
…ake cluster name into account.

Both master fault detection and nodes fault detection request should also send the cluster name, so that on the receiving side the handling of these requests can be failed with an error. This error can be caught on the sending side and for master fault detection the node can fail the master locally and for nodes fault detection the node can be failed.

Note this validation will most likely never fail in a production cluster, but in during automated tests where cluster / nodes are created and destroyed very frequently.
4b8456e
@bleskes bleskes [TEST] Added LongGCDisruption and a test simulating GC on master nodes
Also rename DiscoveryWithNetworkFailuresTests to DiscoveryWithServiceDisruptions which better suites what we do.
50f852f
@martijnvg @bleskes martijnvg [Discovery] Added cluster version and master node to the nodes fault …
…detecting ping request

The cluster state version allows resolving the case where a old master node become unresponsive and later wakes up and pings all the nodes in the cluster, allowing the newly elected master to decide whether it should step down or ask the old master to rejoin.
403ebc9
@martijnvg @bleskes martijnvg [TEST] Make sure all shards are allocated before killing a random dat…
…a node.
47326ad
@martijnvg @bleskes martijnvg Typo: s/Recieved/Received 966a55d
@bleskes bleskes [Transport] Introduced worker threads to prevent alien threads of ent…
…ering a node.

Requests are handled by the worked thread pool of the target node instead of the generic thread pool of the source node.
Also this change is required in order to make GC disruption work with local transport. Previously the handling of the a request was performed on on a node that that was being GC disrupted, resulting in some actions being performed while GC was being simulated.
26d9088
@martijnvg @bleskes martijnvg [TEST] Remove the forceful `network.mode` setting in DiscoveryWithSer…
…viceDisruptions#testMasterNodeGCs now local transport use worker threads.
702890e
@martijnvg @bleskes martijnvg [TEST] Changed action names. c8919e4
@martijnvg @bleskes martijnvg [TEST] Adapt testNoMasterActions since metadata isn't cleared if ther…
…e is a no master block
5932371
@bleskes bleskes [Discovery] add a debug log if a node responds to a publish request a…
…fter publishing timed out.
ff8b740
@bleskes bleskes [Discovery] UnicastZenPing should also ping last known discoNodes
At the moment, when a node looses connection to the master (due to a partition or the master was stopped), we ping the unicast hosts in order to discover other nodes and elect a new master or get of another master than has been elected in the mean time. This can go wrong if all unicast targets are on the same side of a minority partition and therefore will never rejoin once the partition is healed.

Closes #7336
d5552a9
@s1monw s1monw commented on an outdated diff Aug 28, 2014
@@ -1219,6 +1219,11 @@
<bundledSignature>jdk-unsafe</bundledSignature>
<bundledSignature>jdk-deprecated</bundledSignature>
</bundledSignatures>
+ <excludes>
+ <!-- start exclude for test GC simulation using Thread.suspend -->
+ <exclude>org/elasticsearch/test/disruption/LongGCDisruption.class</exclude>
+ <!-- end exclude for Channels -->
@s1monw
s1monw Aug 28, 2014 Contributor

this end exclude comment is wrong copy/paste?

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...icsearch/cluster/ClusterStateNonMasterUpdateTask.java
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.cluster;
+
+/**
+ * This is a marker interface to indicate that the task should be executed
+ * even if the current node is not a master.
+ */
+public interface ClusterStateNonMasterUpdateTask extends ClusterStateUpdateTask {
@s1monw
s1monw Aug 28, 2014 Contributor

maybe call this ForcedClusterStateUpdateTask?

@s1monw
s1monw Aug 28, 2014 Contributor

I also think instead of all the instanceof checks we should maybe make this an abstract class and add a method

public boolean force() {
  return true|false;
}

to ClusterStateUpdateTask maybe?

@bleskes
bleskes Aug 28, 2014 Member

I like the fact that you indicate the criteria (master or not) in the class name. I also don't like the instanceof checks. Perhaps keep the current name and add a method call runOnlyIfMasterand the default implementation will return true while the ClusterStateNonMasterUpdateTask will override it to false?

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...va/org/elasticsearch/cluster/block/ClusterBlocks.java
@@ -108,6 +108,19 @@ public boolean hasGlobalBlock(ClusterBlock block) {
return global.contains(block);
}
+ public boolean hasGlobalBlock(int blockId) {
+ for (ClusterBlock clusterBlock : global) {
+ if (clusterBlock.id() == blockId) {
+ return true;
+ }
+ }
+ return false;
+ }
+
+ public boolean hasGlobalBlock(ClusterBlockLevel level) {
+ return !global(level).isEmpty();
@s1monw
s1monw Aug 28, 2014 Contributor

can we maybe do global(level).size() > 0 ?

@bleskes
bleskes Aug 28, 2014 Member

sure. will do.

@s1monw s1monw commented on an outdated diff Aug 28, 2014
...org/elasticsearch/cluster/routing/RoutingService.java
@@ -151,8 +151,10 @@ public ClusterState execute(ClusterState currentState) {
@Override
public void onFailure(String source, Throwable t) {
- ClusterState state = clusterService.state();
- logger.error("unexpected failure during [{}], current state:\n{}", t, source, state.prettyPrint());
+ if (!(t instanceof ClusterService.NoLongerMasterException)) {
@s1monw
s1monw Aug 28, 2014 Contributor

I'd appreciate if we can do == false it's just so much easier to read

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...org/elasticsearch/cluster/routing/RoutingService.java
@@ -151,8 +151,10 @@ public ClusterState execute(ClusterState currentState) {
@Override
public void onFailure(String source, Throwable t) {
- ClusterState state = clusterService.state();
- logger.error("unexpected failure during [{}], current state:\n{}", t, source, state.prettyPrint());
+ if (!(t instanceof ClusterService.NoLongerMasterException)) {
+ ClusterState state = clusterService.state();
@s1monw
s1monw Aug 28, 2014 Contributor

are we sure the clusterService.state() can never be null?

@bleskes
bleskes Aug 28, 2014 Member

It can't. many things rely on that fact...

@s1monw
s1monw Aug 28, 2014 Contributor

ok fine! :)

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...ava/org/elasticsearch/discovery/DiscoveryService.java
super(settings);
+ this.discoverySettings = discoverySettings;
this.discovery = discovery;
this.initialStateTimeout = componentSettings.getAsTime("initial_state_timeout", TimeValue.timeValueSeconds(30));
@s1monw
s1monw Aug 28, 2014 Contributor

can we maybe make this a constant? the setting I mean?

@bleskes
bleskes Aug 28, 2014 Member

unrelated but will do (will also convert it to a full path)

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...va/org/elasticsearch/discovery/DiscoverySettings.java
@@ -59,6 +78,23 @@ public void onRefreshSettings(Settings settings) {
publishTimeout = newPublishTimeout;
}
}
+ String newNoMasterBlockValue = settings.get(NO_MASTER_BLOCK);
+ if (newNoMasterBlockValue != null) {
+ ClusterBlock newNoMasterBlock = parseNoMasterBlock(newNoMasterBlockValue);
+ if (newNoMasterBlock != noMasterBlock) {
+ noMasterBlock = newNoMasterBlock;
+ }
+ }
+ }
+ }
+
+ private ClusterBlock parseNoMasterBlock(String value) {
+ if ("all".equals(value)) {
@s1monw
s1monw Aug 28, 2014 Contributor

can we use a switch / case statement here it's easier to read and 1.7 supports strings

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
// also support direct discovery.zen settings, for cases when it gets extended
this.pingTimeout = settings.getAsTime("discovery.zen.ping.timeout", settings.getAsTime("discovery.zen.ping_timeout", componentSettings.getAsTime("ping_timeout", componentSettings.getAsTime("initial_ping_timeout", timeValueSeconds(3)))));
this.joinTimeout = settings.getAsTime("discovery.zen.join_timeout", TimeValue.timeValueMillis(pingTimeout.millis() * 20));
+ this.joinRetryAttempts = settings.getAsInt("discovery.zen.join_retry_attempts", 3);
@s1monw
s1monw Aug 28, 2014 Contributor

can we make those all constants? and does it make sense to randomize some of them?

@bleskes
bleskes Aug 28, 2014 Member

I don't think we should randomize this one. We do want to rarely randomize the discovery.zen.rejoin_on_master_gone which is already a constant.

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
// also support direct discovery.zen settings, for cases when it gets extended
this.pingTimeout = settings.getAsTime("discovery.zen.ping.timeout", settings.getAsTime("discovery.zen.ping_timeout", componentSettings.getAsTime("ping_timeout", componentSettings.getAsTime("initial_ping_timeout", timeValueSeconds(3)))));
this.joinTimeout = settings.getAsTime("discovery.zen.join_timeout", TimeValue.timeValueMillis(pingTimeout.millis() * 20));
+ this.joinRetryAttempts = settings.getAsInt("discovery.zen.join_retry_attempts", 3);
+ this.joinRetryDelay = settings.getAsTime("discovery.zen.join_retry_delay", TimeValue.timeValueMillis(100));
@s1monw
s1monw Aug 28, 2014 Contributor

should we validate this setting? ie >= 0?

@bleskes
bleskes Aug 28, 2014 Member

can add.

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
@@ -178,7 +201,7 @@ protected void doStart() throws ElasticsearchException {
final String nodeId = DiscoveryService.generateNodeId(settings);
localNode = new DiscoveryNode(settings.get("name"), nodeId, transportService.boundAddress().publishAddress(), nodeAttributes, version);
latestDiscoNodes = new DiscoveryNodes.Builder().put(localNode).localNodeId(localNode.id()).build();
- nodesFD.updateNodes(latestDiscoNodes);
+ nodesFD.updateNodes(latestDiscoNodes, -1);
@s1monw
s1monw Aug 28, 2014 Contributor

can we make the unknown cluster state version a constant?

@bleskes
bleskes Aug 28, 2014 Member

will do.

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
}
@Override
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
- callback.onSuccess();
+ for (Tuple<DiscoveryNode, MembershipAction.JoinCallback> drainedTask : drainedTasks) {
+ drainedTask.v2().onSuccess();
@s1monw
s1monw Aug 28, 2014 Contributor

should we do this in a try / catch fashion? and make sure we call it on all of them?

@s1monw s1monw commented on the diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
@@ -807,35 +921,33 @@ private DiscoveryNode findMaster() {
List<DiscoveryNode> pingMasters = newArrayList();
for (ZenPing.PingResponse pingResponse : pingResponses) {
if (pingResponse.master() != null) {
- pingMasters.add(pingResponse.master());
+ // We can't include the local node in pingMasters list, otherwise we may up electing ourselves without
+ // any check / verifications from other nodes in ZenDiscover#innerJoinCluster()
+ if (!localNode.equals(pingResponse.master())) {
+ pingMasters.add(pingResponse.master());
@s1monw
s1monw Aug 28, 2014 Contributor

it might be good to have an assertion somewhere that make sure it's not there?

@bleskes
bleskes Aug 28, 2014 Member

will see where it makes sense to add this.

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
if (pingMasters.isEmpty()) {
- // lets tie break between discovered nodes
- DiscoveryNode electedMaster = electMaster.electMaster(possibleMasterNodes);
- if (localNode.equals(electedMaster)) {
- return localNode;
+ // if we don't have enough master nodes, we bail, because there are not enough master to elect from
+ if (!electMaster.hasEnoughMasterNodes(possibleMasterNodes)) {
@s1monw
s1monw Aug 28, 2014 Contributor

maybe we can turn this around and do

if (electMaster.hasEnoughMasterNodes(possibleMasterNodes)) {
  // lets tie break between discovered nodes
  return electMaster.electMaster(possibleMasterNodes);
} else {
  logger.trace("not enough master nodes [{}]", possibleMasterNodes);
  return null;
}
@bleskes
bleskes Aug 28, 2014 Member

will do

@s1monw s1monw commented on the diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
@Override
public void onNodeFailure(DiscoveryNode node, String reason) {
handleNodeFailure(node, reason);
}
+
+ @Override
+ public void onPingReceived(final NodesFaultDetection.PingRequest pingRequest) {
+ // if we are master, we don't expect any fault detection from another node. If we get it
+ // means we potentially have two masters in the cluster.
+ if (!master) {
+ pingsWhileMaster.set(0);
+ return;
+ }
+
+ // nodes pre 1.4.0 do not send this information
+ if (pingRequest.masterNode() == null) {
@s1monw
s1monw Aug 28, 2014 Contributor

maybe add a logging statement here?

@bleskes
bleskes Aug 29, 2014 Member

rather not because this ping happens every second. I think it will be rarely useful..

@s1monw s1monw commented on an outdated diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
@@ -968,7 +1137,11 @@ public ClusterState execute(ClusterState currentState) {
@Override
public void onFailure(String source, Throwable t) {
- logger.error("unexpected failure during [{}]", t, source);
+ if (t instanceof ClusterService.NoLongerMasterException) {
@s1monw
s1monw Aug 28, 2014 Contributor

I start to see this a lot, can we have a static helper somewhere?

@s1monw s1monw commented on the diff Aug 28, 2014
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
+
+ @Override
+ public void onPingReceived(final NodesFaultDetection.PingRequest pingRequest) {
+ // if we are master, we don't expect any fault detection from another node. If we get it
+ // means we potentially have two masters in the cluster.
+ if (!master) {
+ pingsWhileMaster.set(0);
+ return;
+ }
+
+ // nodes pre 1.4.0 do not send this information
+ if (pingRequest.masterNode() == null) {
+ return;
+ }
+
+ if (pingsWhileMaster.incrementAndGet() < maxPingsFromAnotherMaster) {
@s1monw
s1monw Aug 28, 2014 Contributor

it doens't matter if that one overflows no? I mean grows larger than maxPingsFromAnotherMaster?

@bleskes
bleskes Aug 29, 2014 Member

Not really because it doesn't hurt to do the check twice. Rather be more resilient to errors exceptions etc.

@s1monw s1monw commented on the diff Aug 28, 2014
...ticsearch/discovery/zen/elect/ElectMasterService.java
@@ -70,6 +69,18 @@ public boolean hasEnoughMasterNodes(Iterable<DiscoveryNode> nodes) {
}
/**
+ * Returns the given nodes sorted by likelyhood of being elected as master, most likely first.
+ * Non-master nodes are not removed but are rather put in the end
+ * @param nodes
+ * @return
+ */
+ public List<DiscoveryNode> sortByMasterLikelihood(Iterable<DiscoveryNode> nodes) {
@s1monw
s1monw Aug 28, 2014 Contributor

can't this be list from the beginning?

@bleskes
bleskes Aug 28, 2014 Member

I just followed the existing pattern. Will double check why it is so.

@bleskes
bleskes Aug 29, 2014 Member

I think it's OK? We want to prefer safety and copy the list anyway because we changing it's content. If that's the case might as well accept any Iterable. This code is rarely called so performance is secondary imho.

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...sticsearch/discovery/zen/fd/MasterFaultDetection.java
- this.connectOnNetworkDisconnect = componentSettings.getAsBoolean("connect_on_network_disconnect", true);
+ this.connectOnNetworkDisconnect = componentSettings.getAsBoolean("connect_on_network_disconnect", false);
@s1monw
s1monw Aug 28, 2014 Contributor

can we make this a constant? and Ideally not using componentSettings it's so confusing

@s1monw s1monw commented on the diff Aug 28, 2014
...sticsearch/discovery/zen/fd/MasterFaultDetection.java
@@ -200,7 +207,8 @@ private void handleTransportDisconnect(DiscoveryNode node) {
masterPinger.stop();
}
this.masterPinger = new MasterPinger();
- threadPool.schedule(pingInterval, ThreadPool.Names.SAME, masterPinger);
+ // we use schedule with a 0 time value to run the pinger on the pool as it will run on later
@s1monw
s1monw Aug 28, 2014 Contributor

I am not sure I understand this change here?

@bleskes
bleskes Aug 29, 2014 Member

This done to detect master problems faster. Before we waited 1 second (default ping interval) but we know here that the master is already active - we can start pinging immediately.

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...asticsearch/discovery/zen/fd/NodesFaultDetection.java
@@ -296,6 +321,13 @@ public void messageReceived(PingRequest request, TransportChannel channel) throw
if (!latestNodes.localNodeId().equals(request.nodeId)) {
throw new ElasticsearchIllegalStateException("Got pinged as node [" + request.nodeId + "], but I am node [" + latestNodes.localNodeId() + "]");
}
+ if (request.clusterName != null && !request.clusterName.equals(clusterName)) {
@s1monw
s1monw Aug 28, 2014 Contributor

maybe it makes sense to put this logic somehwere else to make sure we know it can be null?

@bleskes
bleskes Aug 29, 2014 Member

I can put it on the request deserialization but I think in this case it's better not to hide the fact that the request didn't have a clusterName with it? I'll add a comment as to how it can be null..

@s1monw s1monw commented on the diff Aug 28, 2014
...sticsearch/discovery/zen/fd/MasterFaultDetection.java
@@ -400,13 +424,15 @@ public String executor() {
private String nodeId;
private String masterNodeId;
+ private ClusterName clusterName;
@s1monw
s1monw Aug 28, 2014 Contributor

I can't see who uses this?

@s1monw s1monw and 1 other commented on an outdated diff Aug 28, 2014
...search/discovery/zen/ping/unicast/UnicastZenPing.java
}
+ // sort the nodes by likelihood of being an active master
+ List<DiscoveryNode> sortedNodesToPing = electMasterService.sortByMasterLikelihood(nodesToPingSet);
+
+ // new add the the unicast targets first
+ ArrayList<DiscoveryNode> nodesToPing = Lists.newArrayList(nodes);
@s1monw
s1monw Aug 28, 2014 Contributor

it took me a while to figure out what all these different node lists / sets are maybe you can find a better name for this.nodes?

@bleskes
bleskes Aug 29, 2014 Member

agreed. Will change.

@s1monw s1monw commented on the diff Aug 28, 2014
...org/elasticsearch/transport/local/LocalTransport.java
@@ -58,13 +62,20 @@
private static final AtomicLong transportAddressIdGenerator = new AtomicLong();
private final ConcurrentMap<DiscoveryNode, LocalTransport> connectedNodes = newConcurrentMap();
- public static final String TRANSPORT_LOCAL_ADDRESS = "transport.local_address";
+ public static final String TRANSPORT_LOCAL_ADDRESS = "transport.local.address";
@s1monw
s1monw Aug 28, 2014 Contributor

oh is this change BW compatible?

@bleskes
bleskes Aug 29, 2014 Member

The LocalTransport is only used for testing and only works within a single JVMs - so I think bwc is not an issue and we should prefer cleaner code?

@s1monw
Contributor
s1monw commented Aug 28, 2014

I did a first review round and left some mostly cosmetic comments. I don't know that code well enough to really review it deeply but given the fact that the development on the branch went through several review rounds is a good indicator. I think we are close here/

@bleskes
Member
bleskes commented Aug 29, 2014

@s1monw thx for the review. I pushed the minor things as commits to this branch and put the bigger changes into separate PRs (#7511 & #7512 ) to make the review easier...

@bleskes bleskes [Internal] introduce ClusterState.UNKNOWN_VERSION constant
Used as null value for cluster state versions.
d8a5ff0
@s1monw
Contributor
s1monw commented Sep 1, 2014

the changes here LGTM thanks!

bleskes added some commits Aug 29, 2014
@bleskes bleskes [Internal] Extract a common base class for (Master|Nodes)FaultDetection
They share a lot of settings and some logic.

Closes #7512
596a4a0
@bleskes bleskes [Cluster] Refactored ClusterStateUpdateTask protection against execut…
…ion on a non master

Previous implementation used a marker interface and had no explicit failure call back for the case update task was run on a non master (i.e., the master stepped down after it was submitted). That lead to a couple of instance of checks.

This approach moves ClusterStateUpdateTask from an interface to an abstract class, which allows adding a flag to indicate whether it should only run on master nodes (defaults to true). It also adds an explicit onNoLongerMaster call back to allow different error handling for that case. This also removed the need for the  NoLongerMaster.

Closes #7511
34f4ca7
@bleskes bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 1, 2014
@bleskes bleskes [Discovery] accumulated improvements to ZenDiscovery
Merging the accumulated work from the feautre/improve_zen branch. Here are the highlights of the changes:

__Testing infra__
- Networking:
    - all symmetric partitioning
    - dropping packets
    - hard disconnects
    - Jepsen Tests
- Single node service disruptions:
    - Long GC / Halt
    - Slow cluster state updates
- Discovery settings
    - Easy to setup unicast with partial host list

__Zen Discovery__
- Pinging after master loss (no local elects)
- Fixes the split brain issue: #2488
- Batching join requests
- More resilient joining process (wait on a publish from master)

Closes #7493
598854d
@bleskes bleskes merged commit 34f4ca7 into master Sep 1, 2014
@bleskes bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 1, 2014
@bleskes bleskes [Discovery] accumulated improvements to ZenDiscovery
Merging the accumulated work from the feature/improve_zen branch. Here are the highlights of the changes:

__Testing infra__
- Networking:
    - all symmetric partitioning
    - dropping packets
    - hard disconnects
    - Jepsen Tests
- Single node service disruptions:
    - Long GC / Halt
    - Slow cluster state updates
- Discovery settings
    - Easy to setup unicast with partial host list

__Zen Discovery__
- Pinging after master loss (no local elects)
- Fixes the split brain issue: #2488
- Batching join requests
- More resilient joining process (wait on a publish from master)

Closes #7493
51b89f7
@javanna

typo s/removeDistruptionSchemeFromNode/removeDisruptionSchemeFromNode

@bleskes bleskes deleted the feature/improve_zen branch Sep 2, 2014
@clintongormley clintongormley changed the title from [Discovery] accumulated improvements to ZenDiscovery to Resiliency: Accumulated improvements to ZenDiscovery Sep 8, 2014
@bleskes bleskes added a commit that referenced this pull request Sep 11, 2014
@bleskes bleskes Resiliency: Master election should demotes nodes which try to join th…
…e cluster for the first time

With the change in #7493,  we introduced a pinging round when a master nodes goes down. That pinging round helps validating the current state of the cluster and takes, by default, 3 seconds. It may be that during that window, a new node tries to join the cluster and starts pinging (this is typical when you quickly restart the current master).  If this node gets elected as the new master it will force recovery from the gateway (it has no in memory cluster state), which in turn will cause a full cluster shard synchronisation. While this is not a problem on it's own, it's a shame. This commit demotes "new" nodes during master election so the will only be elected if really needed.

Closes #7558
a50934e
@bleskes bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 16, 2014
@bleskes bleskes Resiliency: Master election should demotes nodes which try to join th…
…e cluster for the first time

With the change in #7493,  we introduced a pinging round when a master nodes goes down. That pinging round helps validating the current state of the cluster and takes, by default, 3 seconds. It may be that during that window, a new node tries to join the cluster and starts pinging (this is typical when you quickly restart the current master).  If this node gets elected as the new master it will force recovery from the gateway (it has no in memory cluster state), which in turn will cause a full cluster shard synchronisation. While this is not a problem on it's own, it's a shame. This commit demotes "new" nodes during master election so the will only be elected if really needed.

Closes #7558
1f39d43
@bleskes bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 16, 2014
@bleskes bleskes Discovery: back port #7558 to 1.x and add bwc protections of the new …
…ping on master gone introduced in #7493

The change in #7558 adds a flag to PingResponse. However, when unicast discovery is used,  this extra flag can not be serialized by the very initial pings as they do not know yet what node version they ping (i.e., they have to default to 1.0.0, which excludes changing the serialization format). This commit bypasses this problem by adding a dedicated action which only exist on nodes of version 1.4 or up. Nodes first try to ping this endpoint using 1.4.0 as a serialization version. If that fails they fall back to the pre 1.4.0 action. This is optimal if all nodes are on 1.4.0 or higher, with a small down side if the cluster has mixed versions - but this is a temporary state.

Further two bwc protections are added:
1) Disable the preference to nodes who previously joined the cluster if some of the pings are on version < 1.4.0
2) Disable the rejoin on master gone functionality if some nodes in the cluster or version < 1.4.0
f9667ee
@bleskes bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 16, 2014
@bleskes bleskes Resiliency: Master election should demotes nodes which try to join th…
…e cluster for the first time

With the change in #7493,  we introduced a pinging round when a master nodes goes down. That pinging round helps validating the current state of the cluster and takes, by default, 3 seconds. It may be that during that window, a new node tries to join the cluster and starts pinging (this is typical when you quickly restart the current master).  If this node gets elected as the new master it will force recovery from the gateway (it has no in memory cluster state), which in turn will cause a full cluster shard synchronisation. While this is not a problem on it's own, it's a shame. This commit demotes "new" nodes during master election so the will only be elected if really needed.

Closes #7558
e9d14b3
@bleskes bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 16, 2014
@bleskes bleskes Discovery: back port #7558 to 1.x and add bwc protections of the new …
…ping on master gone introduced in #7493

The change in #7558 adds a flag to PingResponse. However, when unicast discovery is used,  this extra flag can not be serialized by the very initial pings as they do not know yet what node version they ping (i.e., they have to default to 1.0.0, which excludes changing the serialization format). This commit bypasses this problem by adding a dedicated action which only exist on nodes of version 1.4 or up. Nodes first try to ping this endpoint using 1.4.0 as a serialization version. If that fails they fall back to the pre 1.4.0 action. This is optimal if all nodes are on 1.4.0 or higher, with a small down side if the cluster has mixed versions - but this is a temporary state.

Further two bwc protections are added:
1) Disable the preference to nodes who previously joined the cluster if some of the pings are on version < 1.4.0
2) Disable the rejoin on master gone functionality if some nodes in the cluster or version < 1.4.0

Closes #7694
19adaf8
@bleskes bleskes added a commit that referenced this pull request Sep 16, 2014
@bleskes bleskes Resiliency: Master election should demotes nodes which try to join th…
…e cluster for the first time

With the change in #7493,  we introduced a pinging round when a master nodes goes down. That pinging round helps validating the current state of the cluster and takes, by default, 3 seconds. It may be that during that window, a new node tries to join the cluster and starts pinging (this is typical when you quickly restart the current master).  If this node gets elected as the new master it will force recovery from the gateway (it has no in memory cluster state), which in turn will cause a full cluster shard synchronisation. While this is not a problem on it's own, it's a shame. This commit demotes "new" nodes during master election so the will only be elected if really needed.

Closes #7558
d9ea628
@bleskes bleskes added a commit that referenced this pull request Sep 16, 2014
@bleskes bleskes Discovery: back port #7558 to 1.x and add bwc protections of the new …
…ping on master gone introduced in #7493

The change in #7558 adds a flag to PingResponse. However, when unicast discovery is used,  this extra flag can not be serialized by the very initial pings as they do not know yet what node version they ping (i.e., they have to default to 1.0.0, which excludes changing the serialization format). This commit bypasses this problem by adding a dedicated action which only exist on nodes of version 1.4 or up. Nodes first try to ping this endpoint using 1.4.0 as a serialization version. If that fails they fall back to the pre 1.4.0 action. This is optimal if all nodes are on 1.4.0 or higher, with a small down side if the cluster has mixed versions - but this is a temporary state.

Further two bwc protections are added:
1) Disable the preference to nodes who previously joined the cluster if some of the pings are on version < 1.4.0
2) Disable the rejoin on master gone functionality if some nodes in the cluster or version < 1.4.0

Closes #7694
e5de47d
@jpountz jpountz removed the review label Oct 21, 2014
@clintongormley clintongormley changed the title from Resiliency: Accumulated improvements to ZenDiscovery to Accumulated improvements to ZenDiscovery Jun 7, 2015
@mute mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@bleskes bleskes Resiliency: Master election should demotes nodes which try to join th…
…e cluster for the first time

With the change in #7493,  we introduced a pinging round when a master nodes goes down. That pinging round helps validating the current state of the cluster and takes, by default, 3 seconds. It may be that during that window, a new node tries to join the cluster and starts pinging (this is typical when you quickly restart the current master).  If this node gets elected as the new master it will force recovery from the gateway (it has no in memory cluster state), which in turn will cause a full cluster shard synchronisation. While this is not a problem on it's own, it's a shame. This commit demotes "new" nodes during master election so the will only be elected if really needed.

Closes #7558
cd6dfa4
@mute mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@bleskes bleskes Discovery: back port #7558 to 1.x and add bwc protections of the new …
…ping on master gone introduced in #7493

The change in #7558 adds a flag to PingResponse. However, when unicast discovery is used,  this extra flag can not be serialized by the very initial pings as they do not know yet what node version they ping (i.e., they have to default to 1.0.0, which excludes changing the serialization format). This commit bypasses this problem by adding a dedicated action which only exist on nodes of version 1.4 or up. Nodes first try to ping this endpoint using 1.4.0 as a serialization version. If that fails they fall back to the pre 1.4.0 action. This is optimal if all nodes are on 1.4.0 or higher, with a small down side if the cluster has mixed versions - but this is a temporary state.

Further two bwc protections are added:
1) Disable the preference to nodes who previously joined the cluster if some of the pings are on version < 1.4.0
2) Disable the rejoin on master gone functionality if some nodes in the cluster or version < 1.4.0

Closes #7694
54a4b90
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment