
Better join retry #6970

Closed · wants to merge 64 commits into improve_zen from kimchy:better_join_retry

kimchy (Owner) commented Jul 22, 2014

Note, this is against improve_zen branch

kimchy and others added some commits Apr 10, 2014

lightweight minimum master node recovery
don't perform full recovery when minimum master nodes are not met, keep the state around and use it once elected as master
add rejoin on master gone flag, defaults to false
defaults to false since there is still work left to make it work properly
Make noMasterBlock configurable and added simple test that shows reads do execute (partially) when m_m_n isn't met
Changed the default for the 'rejoin_on_master_gone' option from false to true in zen discovery.

Added AwaitFix for the FullRollingRestartTests.
Do not start a recovery process if the primary shard is currently allocated on a node which is not part of the cluster state

If a source node disconnects during recovery, the target node will respond by cancelling the recovery. Typically the master will respond by removing the disconnected node from the cluster state and promoting another shard to become primary. This is sent to all nodes and the target node will start recovering from the new primary. However, if the drop of a node caused the node count to go below min_master_node, the master will step down and will not promote a shard immediately. When a new master is elected, we may publish a new cluster state (whose point is to notify of a new master) which is not yet updated. This caused the node to start a recovery to a non-existent node. Previously we aborted the recovery without cleaning up the shard, causing subsequent correct cluster states to be ignored. Instead, we should not start the recovery process but wait for another cluster state to come in.
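For illustration, a minimal self-contained sketch of that guard; `ClusterStateView`, `RecoveryStarter` and their methods are hypothetical stand-ins, not the actual Elasticsearch classes.

```java
import java.util.Set;

// Hypothetical stand-ins for the real cluster-state / recovery classes.
interface ClusterStateView {
    Set<String> nodeIds();                 // ids of nodes currently in the cluster state
    String primaryNodeId(String shardId);  // node the primary shard is allocated on
}

final class RecoveryStarter {
    private final ClusterStateView state;

    RecoveryStarter(ClusterStateView state) {
        this.state = state;
    }

    /**
     * Only start recovering a shard if the node holding the primary is still
     * part of the cluster state; otherwise skip and wait for the next cluster
     * state, which will point at the newly promoted primary.
     */
    boolean maybeStartRecovery(String shardId) {
        String sourceNode = state.primaryNodeId(shardId);
        if (sourceNode == null || !state.nodeIds().contains(sourceNode)) {
            // Do not abort or clean up the shard here; just don't start a
            // recovery against a node that is no longer in the cluster state.
            return false;
        }
        startRecoveryFrom(sourceNode, shardId);
        return true;
    }

    private void startRecoveryFrom(String sourceNode, String shardId) {
        // ... actual peer recovery would be kicked off here ...
    }
}
```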
RoutingService background reroute task should check if node is still master when it is run

It may be that between the moment the task was added to the queue and the moment it runs, the current node has stepped down from being master (due to minimum_master_nodes settings).
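A small sketch of that re-check, with a hypothetical `ClusterStateSupplier` standing in for the real cluster service:

```java
// Hypothetical stand-in: exposes whether the local node is currently master.
interface ClusterStateSupplier {
    boolean localNodeIsMaster();
}

final class BackgroundRerouteTask implements Runnable {
    private final ClusterStateSupplier clusterState;
    private final Runnable reroute;

    BackgroundRerouteTask(ClusterStateSupplier clusterState, Runnable reroute) {
        this.clusterState = clusterState;
        this.reroute = reroute;
    }

    @Override
    public void run() {
        // Between enqueueing and execution the node may have stepped down
        // (e.g. minimum_master_nodes was breached), so re-check before rerouting.
        if (!clusterState.localNodeIsMaster()) {
            return; // silently skip; a future master will reroute
        }
        reroute.run();
    }
}
```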
Eagerly clean the routing table of shards that exist on nodes that are not in the latestDiscoNodes list.

Only the previous master node has been removed, so only shards allocated to that node will get failed.
This would have happened anyhow later on when AllocationService#reroute is invoked (for example when a cluster setting changes or on another cluster event),
but by cleaning the routing table proactively, the stale routing table is fixed sooner and therefore the shards
that are not accessible anyway (because the node these shards were on has left the cluster) will get re-assigned sooner.
Use just AllocationService#reroute to eagerly clean the routing table of nodes that don't exist, instead of using AllocationService#applyFailedShards directly.
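A hedged sketch of that eager cleanup via a reroute call; `RoutingView`, `Allocation` and `NodeRemovalCleaner` are hypothetical stand-ins for the real routing table and AllocationService:

```java
import java.util.Set;

// Hypothetical stand-ins for cluster routing state and the allocation service.
interface RoutingView {
    Set<String> nodesWithAssignedShards();
}

interface Allocation {
    void reroute(String reason); // re-runs allocation, unassigning shards on missing nodes
}

final class NodeRemovalCleaner {
    private final Allocation allocation;

    NodeRemovalCleaner(Allocation allocation) {
        this.allocation = allocation;
    }

    /**
     * When the set of live nodes shrinks, trigger a reroute eagerly so that
     * shards assigned to departed nodes are unassigned and re-allocated now,
     * instead of waiting for the next unrelated cluster event.
     */
    void onNodesChanged(Set<String> liveNodes, RoutingView routing) {
        for (String nodeId : routing.nodesWithAssignedShards()) {
            if (!liveNodes.contains(nodeId)) {
                allocation.reroute("node [" + nodeId + "] left the cluster");
                return; // one reroute pass cleans all stale assignments
            }
        }
    }
}
```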
Do not execute cluster state changes if current node is no longer master
When a node steps down from being a master (because, for example, min_master_node is breached), it may still have
cluster state update tasks queued up. Most (but not all) are tasks that should no longer be executed as the node
no longer has authority to do so. Other cluster states updates, like electing the current node as master, should be
executed even if the current node is no longer master.

This commit makes sure that, by default, `ClusterStateUpdateTask` is not executed if the node is no longer master. Tasks that should run on non-master nodes are changed to implement a new interface called `ClusterStateNonMasterUpdateTask`

Closes #6230
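The following is a simplified, hypothetical rendering of that default; the real `ClusterStateUpdateTask` and `ClusterStateNonMasterUpdateTask` types carry more state, but the gating idea is roughly this:

```java
// Hypothetical, simplified version of the task hierarchy described above.
interface ClusterStateUpdateTask {
    void execute();
}

// Marker interface: tasks that must run even when the local node is not master
// (e.g. the task that elects the local node as master in the first place).
interface ClusterStateNonMasterUpdateTask extends ClusterStateUpdateTask {
}

final class UpdateTaskRunner {
    private final java.util.function.BooleanSupplier localNodeIsMaster;

    UpdateTaskRunner(java.util.function.BooleanSupplier localNodeIsMaster) {
        this.localNodeIsMaster = localNodeIsMaster;
    }

    void process(ClusterStateUpdateTask task) {
        // By default, drop queued tasks once the node has lost mastership;
        // only explicitly marked tasks are still allowed to run.
        if (!localNodeIsMaster.getAsBoolean()
                && !(task instanceof ClusterStateNonMasterUpdateTask)) {
            return;
        }
        task.execute();
    }
}
```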
[TEST] Added test that exposes a shard consistency problem when isolated node(s) rejoin the cluster after network segmentation and when the elected master node ended up on the lesser side of the network segmentation.
[Discovery] do not use versions to optimize cluster state copying for a first update from a new master

We have an optimization which compares the routing/meta data version of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_nodes breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens when the current master updates its state, doesn't manage to broadcast it to other nodes due to the disconnect and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of its state can be equal but the content is different.

Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements)

Closes #6466
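One way to picture the fix, as a hypothetical helper that refuses to reuse local objects when the publishing master changed; names and signatures here are illustrative only:

```java
// Hypothetical sketch: only reuse the local routing/meta data objects when the
// versions match AND the incoming state comes from the master we already know
// about; a first state from a *new* master may reuse a version number with
// different content, so it must be taken as published.
final class ClusterStateDiffGuard {

    static <T> T pickRoutingTable(long localVersion, T localTable,
                                  long incomingVersion, T incomingTable,
                                  String localMasterId, String incomingMasterId) {
        boolean sameMaster = localMasterId != null && localMasterId.equals(incomingMasterId);
        if (sameMaster && localVersion == incomingVersion) {
            return localTable;   // safe to reuse: same master, same version
        }
        return incomingTable;    // new master or newer version: take the published copy
    }
}
```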
[Test] testIsolateMasterAndVerifyClusterStateConsensus didn't wait on initializing shards before comparing cluster states
[TEST] Increased and improved logging in broadcast operations to know which cluster state version is used to resolve the shards
Change (Master|Nodes)FaultDetection's connect_on_network_disconnect default to false

The previous default was true, which means that after a node disconnected event we try to connect to it as an extra validation. This can result in slow detection of network partitions if the extra reconnect times out before failure.

Also added tests to verify the settings' behaviour
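A hedged example of reading such a flag from a flat settings map; the key name is an assumption based on the zen fault detection prefix used at the time, not taken from this PR:

```java
import java.util.Map;

// Hedged example: reading the fault detection flag from a flat settings map.
// The key name is an assumption, not confirmed by this PR.
final class FaultDetectionSettings {
    static boolean connectOnNetworkDisconnect(Map<String, String> settings) {
        // New default is false: a disconnect event is treated as a failure right
        // away instead of first attempting an extra (possibly slow) reconnect.
        return Boolean.parseBoolean(
                settings.getOrDefault("discovery.zen.fd.connect_on_network_disconnect", "false"));
    }
}
```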
Added ServiceDisruptionScheme(s) and testAckedIndexing
This commit adds the notion of a ServiceDisruptionScheme, allowing for introducing disruptions in our test cluster. This abstraction is used in a couple of wrappers around the functionality offered by MockTransportService to simulate various network partitions. There is also one implementation for causing a node to be slow in processing cluster state updates.

This new mechanism is integrated into the existing DiscoveryWithNetworkFailuresTests tests.

A new test called testAckedIndexing is added to verify retrieval of documents whose indexing was acked during various disruptions.

Closes #6505
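A trimmed-down, hypothetical shape of that abstraction (the real test classes are wired to the test cluster and carry more state):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical, trimmed-down shape of the disruption abstraction described above.
interface ServiceDisruptionScheme {
    void applyToCluster();                  // install the disruption (drop links, slow a node, ...)
    void removeFromCluster();               // undo it
    long expectedTimeToHeal(TimeUnit unit); // how long the cluster needs to recover afterwards
}

// Illustration of one implementation idea: a scheme that makes a node slow at
// applying cluster state updates while it is active.
final class SlowClusterStateProcessingSketch implements ServiceDisruptionScheme {
    private volatile boolean active;

    @Override public void applyToCluster() { active = true; }
    @Override public void removeFromCluster() { active = false; }
    @Override public long expectedTimeToHeal(TimeUnit unit) { return unit.convert(0, TimeUnit.SECONDS); }

    boolean isActive() { return active; }
}
```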
[Discovery] when master is gone, flush all pending cluster states
If the master fault detection flags the master as gone while there are still pending cluster states, the processing of those cluster states may re-instate that node as master again.

Closes #6526
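A minimal sketch of one possible reading of "flush": discard the states still queued from the departed master so applying them later cannot re-instate it. Whether the real fix drops or drains them differently is not spelled out here, so treat this as an assumption:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: pending cluster states are discarded once the master
// that published them is declared gone. (Assumption: "flush" means dropping them.)
final class PendingClusterStates<T> {
    private final Queue<T> pending = new ArrayDeque<>();

    synchronized void add(T clusterState) {
        pending.add(clusterState);
    }

    synchronized void onMasterGone() {
        pending.clear(); // drop everything queued from the departed master
    }

    synchronized T next() {
        return pending.poll();
    }
}
```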
[TEST] Reduced failures in the DiscoveryWithNetworkFailuresTests#testAckedIndexing test:

* waiting time should be long enough depending on the type of the disruption scheme
* MockTransportService#addUnresponsiveRule: if the remaining delay is smaller than 0, don't double-execute the transport logic
[TEST] Renamed afterDistribution timeout to expectedTimeToHeal
Accumulate expected shard failures to log later
Added java docs to all tests in DiscoveryWithNetworkFailuresTests
Moved testVerifyApiBlocksDuringPartition to test blocks rather than relying on specific API rejections.
Did some cleaning while at it.
[Discovery] immediately start Master|Node fault detection pinging
After a node joins the cluster, it starts pinging the master to verify its health. Before, the cluster join request was processed asynchronously and we had to give it some time to complete. With #6480 we changed this to wait for the join process to complete on the master. We can therefore start pinging immediately for fast detection of failures. A similar change can be made to the node fault detection from the master side.

Closes #6706
[Discovery] Start master fault detection after pingInterval
This is to allow the master election to complete on the chosen master.

 Relates to #6706
Use local gateway
This is important for proper primary allocation decisions
Increase timeout when waiting for partitions to heal
the current 30s addition is tricky because we use 30s as a timeout in many places...
Test stability improvements
added explicit cleaning of temp unicast ping results
reduce gateway local.list_timeout to 10s.
testVerifyApiBlocksDuringPartition: verify master node has stepped down before restoring partition
[logging] don't log an error if scheduled reroute is rejected because local node is no longer master

Since the reroute runs in a background thread after a node is added, or is submitted as a cluster state update when a node leaves, it may be that by the time it is executed the local node is no longer master.
Discovery: If not enough possible masters are found, but there are masters to ping (ping responses did include a master node), then these nodes should be resolved.

After the findMaster() call we try to connect to the node and if it isn't the master we start looking for a new master via pinging again.

Closes #6904
Discovery: Don't include the local node in the pingMasters list. We might end up electing ourselves without any form of verification.
[Discovery] join master after first election
Currently, pinging results are only used if the local node is elected master or if they detect another *already* active master. This has the effect that master election requires two pinging rounds: one for the elected master to take on its role and another for the other nodes to detect it and join the cluster. We can be smarter and use the election of the first round on the other nodes as well. Those nodes can try to join the elected master immediately. There is a catch though: the elected master node may still be processing the election and may reject the join request if it is not ready yet. To compensate, a retry mechanism is introduced to try again (up to 3 times by default) if this happens.

Closes #6943
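A self-contained sketch of the retry idea; the interface and timings are invented for illustration, only the "up to 3 attempts by default" comes from the commit message:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the freshly elected master may still be processing its
// own election and can reject early join requests, so joining nodes retry.
final class MasterJoiner {
    interface JoinSender {
        void sendJoinRequest(String masterNodeId) throws Exception; // throws if master not ready
    }

    private final JoinSender sender;
    private final int maxRetries;          // the commit describes a default of 3 attempts
    private final long retryDelayMillis;   // illustrative; the real delay is not specified here

    MasterJoiner(JoinSender sender, int maxRetries, long retryDelayMillis) {
        this.sender = sender;
        this.maxRetries = maxRetries;
        this.retryDelayMillis = retryDelayMillis;
    }

    boolean joinElectedMaster(String masterNodeId) throws InterruptedException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                sender.sendJoinRequest(masterNodeId);
                return true;               // joined successfully
            } catch (Exception e) {
                if (attempt == maxRetries) {
                    return false;          // give up; caller falls back to pinging again
                }
                TimeUnit.MILLISECONDS.sleep(retryDelayMillis);
            }
        }
        return false;
    }
}
```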
[Tests] Fixed some issues with SlowClusterStateProcessing
Reduced expected time to heal to 0 (we interrupt and wait on stop disruption). It was also wrongly indicated in seconds.
We didn't properly wait between slow cluster state tasks
[Discovery] verify we have a master after a successful join request
After master election, nodes send join requests to the elected master. The master is then responsible for publishing a new cluster state which sets the master on the local node's cluster state. If something goes wrong with the cluster state publishing, this process will not complete successfully. We should check for this after the join request returns and, if it failed, retry pinging.

Closes #6969
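Sketch of the post-join verification, with a hypothetical `Discovery` interface standing in for the real zen discovery plumbing:

```java
// Hypothetical sketch of the flow described above: send the join request, then
// verify the resulting cluster state actually carries a master; if not, the
// node restarts pinging rather than assuming it has joined a healthy cluster.
final class JoinThenVerify {
    interface Discovery {
        boolean sendJoinRequest(String masterId);  // false if the join failed
        String observedMasterId();                 // null if no master is set locally
        void restartPinging();                     // go back to master election
    }

    private final Discovery discovery;

    JoinThenVerify(Discovery discovery) {
        this.discovery = discovery;
    }

    void joinCluster(String electedMasterId) {
        boolean joined = discovery.sendJoinRequest(electedMasterId);
        if (!joined || discovery.observedMasterId() == null) {
            // Join returned but no master was published to us (or the request
            // failed outright): do not stay in limbo, ping and elect again.
            discovery.restartPinging();
        }
    }
}
```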
retry logic to unwrap exception to check for illegal state
It probably comes wrapped in a remote exception, which we should unwrap in order to detect it. Also simplified the retry logic a bit.
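A minimal illustration of walking the cause chain; the real code checks Elasticsearch's own exception types, here plain `IllegalStateException` is used as a stand-in:

```java
// Hypothetical sketch: the "not yet master" failure usually arrives wrapped
// (e.g. inside a remote transport exception), so walk the cause chain before
// deciding whether to retry the join.
final class JoinFailureClassifier {

    /** Returns true if any exception in the cause chain is an IllegalStateException. */
    static boolean isIllegalState(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof IllegalStateException) {
                return true;
            }
            if (t.getCause() == t) {
                break; // guard against self-referencing causes
            }
        }
        return false;
    }
}
```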
kimchy (Owner) commented Jul 22, 2014

failed pull request :)

@kimchy kimchy closed this Jul 22, 2014

@kimchy kimchy deleted the kimchy:better_join_retry branch Jul 22, 2014

@kimchy kimchy restored the kimchy:better_join_retry branch Jul 22, 2014

@kimchy kimchy deleted the kimchy:better_join_retry branch Aug 18, 2014
