
Accumulated improvements to ZenDiscovery #7493

kimchy: [Discovery] lightweight minimum master node recovery (63d0406)
Don't perform full recovery when minimum master nodes are not met; keep the state around and use it once elected as master.

kimchy: [Internal] make no master lock an instance var so it can be configured (4824f05)

kimchy: [Discovery] add rejoin on master gone flag, defaults to false (6ede83a)
Defaults to false since there is still work left to properly make it work.

martijnvg: [Discovery] Make noMasterBlock configurable and added simple test that shows reads do execute (partially) when m_m_n isn't met (97bdc8f)

martijnvg: [Discovery] Enable `discovery.zen.rejoin_on_master_gone` setting in DiscoveryWithNetworkFailuresTests only (3cdbb1a)

martijnvg: [Discovery] Changed the default for the 'rejoin_on_master_gone' option from false to true in zen discovery. Added AwaitFix for the FullRollingRestartTests. (549076e)

martijnvg: [Discovery] If available, the newly elected master node should take over previously known nodes (89a50f6)

martijnvg: [Discovery] Eagerly clean the routing table of shards that exist on nodes that are not in the latestDiscoNodes list (2220c66)
Only the previous master node has been removed, so only shards allocated to that node will get failed. This would have happened anyhow later on when AllocationService#reroute is invoked (for example when a cluster setting changes or on another cluster event), but by cleaning the routing table proactively, the stale routing table is fixed sooner, and therefore the shards that are not accessible anyhow (because the node these shards were on has left the cluster) will get reassigned sooner.

bleskes: Updated to use ClusterBlocks new constructor signature (a9aa10a)
Introduced with: 11a3201

bleskes: [Internal] Do not execute cluster state changes if current node is no longer master (d44bed5)
When a node steps down from being master (because, for example, min_master_node is breached), it may still have cluster state update tasks queued up. Most (but not all) are tasks that should no longer be executed, as the node no longer has the authority to do so. Other cluster state updates, like electing the current node as master, should be executed even if the current node is no longer master. This commit makes sure that, by default, a `ClusterStateUpdateTask` is not executed if the node is no longer master. Tasks that should run on non-masters are changed to implement a new interface called `ClusterStateNonMasterUpdateTask`. Closes #6230

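A minimal sketch of the guard described in d44bed5, with the marker interface named as in the commit message; the runner and its signatures are simplified assumptions, not the actual cluster service internals:

```java
/** Marker interface from the commit message: tasks that are allowed to run
 *  even when the local node is no longer the elected master. */
interface ClusterStateNonMasterUpdateTask {}

interface ClusterStateUpdateTask {
    void execute() throws Exception;      // simplified: the real task transforms a ClusterState
    void onFailure(String source, Throwable t);
}

final class UpdateTaskRunner {
    /** Executor-side guard (a simplification; the real check lives in the
     *  cluster service's update-task processing loop). */
    static void runTask(String source, ClusterStateUpdateTask task, boolean localNodeIsMaster) {
        if (!localNodeIsMaster && !(task instanceof ClusterStateNonMasterUpdateTask)) {
            // The node stepped down after the task was queued: drop it rather
            // than mutate cluster state without authority.
            task.onFailure(source, new IllegalStateException("no longer master. source: " + source));
            return;
        }
        try {
            task.execute();
        } catch (Exception e) {
            task.onFailure(source, e);
        }
    }
}
```
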
martijnvg: [TEST] It may take a little bit before the unlucky node deals with the fact the master left (2c9ef63)

martijnvg: [TEST] Added test that verifies data integrity during and after a simulated network split (fc8ae4d)

martijnvg: [TEST] Make sure there are no initializing shards when a network partition is simulated (e7d24ec)

martijnvg: [TEST] Added test that exposes a shard consistency problem when isolated node(s) rejoin the cluster after network segmentation and the elected master node ended up on the lesser side of the network segmentation (4828e78)

martijnvg: [Discovery] Removed METADATA block (424a2f6)

martijnvg: [Discovery] Made the 'discovery.zen.rejoin_on_master_gone' setting updatable at runtime (1849d09)

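A sketch of how a runtime-updatable setting like this can be wired. The `DynamicSettings` registry below is a hypothetical stand-in for Elasticsearch's node settings service; only the setting key and the true default (from 549076e) come from the commits:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical stand-in for a dynamic-settings registry. */
final class DynamicSettings {
    private final Map<String, Consumer<String>> listeners = new ConcurrentHashMap<>();

    void addListener(String key, Consumer<String> onChange) {
        listeners.put(key, onChange);
    }

    void applyUpdate(Map<String, String> changedSettings) {
        changedSettings.forEach((key, value) -> {
            Consumer<String> listener = listeners.get(key);
            if (listener != null) listener.accept(value); // push the new value to the component
        });
    }
}

final class ZenDiscoverySketch {
    static final String SETTING_REJOIN_ON_MASTER_GONE = "discovery.zen.rejoin_on_master_gone";

    private volatile boolean rejoinOnMasterGone = true; // default true since 549076e

    ZenDiscoverySketch(DynamicSettings settings) {
        // Re-read the flag whenever a cluster settings update changes it.
        settings.addListener(SETTING_REJOIN_ON_MASTER_GONE,
                value -> rejoinOnMasterGone = Boolean.parseBoolean(value));
    }
}
```
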
bleskes: [Discovery] do not use versions to optimize cluster state copying for a first update from a new master (58f8774)
We have an optimization which compares the routing/meta data version of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_node breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens where the current master updates its state, doesn't manage to broadcast it to other nodes due to the disconnect, and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of its state can be equal but the content different. Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements). Closes #6466

martijnvg: [TEST] Remove the 'index.routing.allocation.total_shards_per_node' setting in the data consistency test (f3d90cd)

bleskes: [Test] testIsolateMasterAndVerifyClusterStateConsensus didn't wait on initializing shards before comparing cluster states (e39ac7e)

bleskes: [Discovery] Change (Master|Nodes)FaultDetection's connect_on_network_disconnect default to false (7db9e98)
The previous default was true, which means that after a node-disconnected event we try to connect to it as an extra validation. This can result in slow detection of network partitions if the extra reconnect times out before failure. Also added tests to verify the settings' behaviour.

bleskes: [Discovery] Improved logging when a join request is not executed because the local node is no longer master (8b85d97)

bleskes: [Discovery] when master is gone, flush all pending cluster states (5d13571)
If the master fault detection flags the master as gone while there are still pending cluster states, the processing of those cluster states may re-instate that node as master again. Closes #6526

bleskes: [Tests] Added ServiceDisruptionScheme(s) and testAckedIndexing (28489ce)
This commit adds the notion of a ServiceDisruptionScheme, allowing for introducing disruptions into our test cluster. This abstraction is used in a couple of wrappers around the functionality offered by MockTransportService to simulate various network partitions. There is also one implementation for causing a node to be slow in processing cluster state updates. This new mechanism is integrated into the existing DiscoveryWithNetworkFailuresTests. A new test called testAckedIndexing is added to verify retrieval of documents whose indexing was acked during various disruptions. Closes #6505

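The abstraction might look roughly like this; the method names below are guesses informed by the commit messages (including the later `expectedTimeToHeal` rename), not the actual test classes:

```java
import java.util.Random;

/** Hypothetical reduction of the abstraction: a scheme is applied to the test
 *  cluster, disrupts it on demand, and reports how long healing may take. */
interface ServiceDisruptionScheme {
    void applyToCluster(TestCluster cluster);
    void removeFromCluster(TestCluster cluster);
    void startDisrupting();
    void stopDisrupting();
    /** Upper bound that tests wait for after the disruption is stopped. */
    long expectedTimeToHealMillis();
}

/** Stand-in for the test cluster type; assumed for this sketch. */
interface TestCluster {}

/** One family of implementations wraps MockTransportService rules to build
 *  network partitions; another slows cluster state processing (bodies elided). */
abstract class NetworkPartitionScheme implements ServiceDisruptionScheme {
    protected final Random random;

    NetworkPartitionScheme(Random random) {
        this.random = random;
    }
}
```
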
martijnvg: [TEST] Check if worker is null to prevent NPE on double stopping (8aed9ee)

martijnvg: [TEST] Reduced failures in the DiscoveryWithNetworkFailuresTests#testAckedIndexing test (785d0e5)
* waiting time should be long enough depending on the type of the disruption scheme
* MockTransportService#addUnresponsiveRule: if the remaining delay is smaller than 0, don't double-execute the transport logic

martijnvg: [TEST] Renamed afterDistribution timeout to expectedTimeToHeal (f7b962a)
Accumulate expected shard failures to log later.

bleskes: [Test] ensureStableCluster failed to pass the viaNode parameter correctly (a7a61a0)
Also improved timeouts & logs.

bleskes: [Tests] Disabling testAckedIndexing (1af82fd)
The test is currently unstable and needs some more work.

bleskes: Fixed compilation issue caused by the lack of a thread pool name (c3e84eb)

martijnvg: [TEST] Added test to verify that 'discovery.zen.rejoin_on_master_gone' is updatable at runtime (98084c0)

martijnvg: [TEST] Verify no master block during partition for read and write APIs (52f69c6)

martijnvg: [TEST] Make sure get request is always local (77dae63)

bleskes: Added javadocs to all tests in DiscoveryWithNetworkFailuresTests (5e5f8a9)
Moved testVerifyApiBlocksDuringPartition to test blocks rather than rely on specific API rejections. Did some cleaning while at it.

bleskes: [Tests] improved automatic disruption healing after tests (e897dcc)

martijnvg: [TEST] Properly clear the disruption schemes after the test completed (d99ca80)

bleskes: [Test] testVerifyApiBlocksDuringPartition - wait for a stable cluster after the partition (48c7da1)

bleskes: [Discovery] immediately start Master|Node fault detection pinging (5302a53)
After a node joins the cluster, it starts pinging the master to verify its health. Before, the cluster join request was processed asynchronously and we had to give it some time to complete. With #6480 we changed this to wait for the join process to complete on the master. We can therefore start pinging immediately for fast detection of failures. A similar change can be made to the node fault detection from the master side. Closes #6706

bleskes: [Discovery] Start master fault detection after pingInterval (3586e38)
This is to allow the master election to complete on the chosen master. Relates to #6706

bleskes: [Tests] Use local gateway (522d4af)
This is important for proper primary allocation decisions.

bleskes: [Tests] Don't log about restoring a partition if the partition is not active (7b6e194)

bleskes: [Tests] Increase timeout when waiting for partitions to heal (c12d090)
The current 30s addition is tricky because we use 30s as the timeout in many places...

bleskes: [Internal] Migrate the new initial-state cluster update task to a ClusterStateNonMasterUpdateTask (e0543b3)

bleskes: [Tests] Introduced ClusterDiscoveryConfiguration (ea27837)
Closes #6890

bleskes: [logging] don't log an error if a scheduled reroute is rejected because the local node is no longer master (7fa3d70)
Since it runs in a background thread after a node is added, or submits a cluster state update when a node leaves, it may be that by the time it is executed the local node is no longer master.

bleskes: Remove unneeded reference to DiscoveryService which potentially causes circular references (ccabb4a)

bleskes: [Tests] stability improvements (bebaf97)
Added explicit cleaning of temp unicast ping results; reduced gateway local.list_timeout to 10s. testVerifyApiBlocksDuringPartition: verify the master node has stepped down before restoring the partition.

martijnvg: Discovery: If not enough possible masters are found, but there are masters to ping (ping responses did include a master node), then these nodes should be resolved (67685cb)
After the findMaster() call we try to connect to the node, and if it isn't the master we start looking for a new master via pinging again. Closes #6904

bleskes: [Store] migrate non-allocated shard deletion to use ClusterStateNonMasterUpdateTask (f029a24)

martijnvg: Discovery: Only add the local node to possibleMasterNodes if it is a master node (5e38e9e)

martijnvg: Discovery: Don't include the local node in the pingMasters list (c2142c0)
We might end up electing ourselves without any form of verification.

bleskes: [Discovery] join master after first election (ffcf107)
Currently, pinging results are only used if the local node is elected master or if they detect another *already* active master. This has the effect that master election requires two pinging rounds - one for the elected master to take its role and another for the other nodes to detect it and join the cluster. We can be smarter and use the election of the first round on other nodes as well. Those nodes can try to join the elected master immediately. There is a catch though - the elected master node may still be processing the election and may reject the join request if not ready yet. To compensate, a retry mechanism is introduced to try again (up to 3 times by default) if this happens. Closes #6943

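A sketch of the retry loop the message describes; the three-attempt default is from the commit, while the class and method names are assumptions:

```java
final class JoinElectedMaster {
    static final int MAX_JOIN_RETRIES = 3; // default per the commit message

    interface Transport {
        /** Throws if the master rejects the join (e.g. still processing its own election). */
        void sendJoinRequest(String masterNodeId) throws Exception;
    }

    /** Try to join the master elected in the first ping round, retrying while
     *  the master may still be finishing its own election. */
    static boolean joinWithRetry(Transport transport, String masterNodeId) {
        for (int attempt = 1; attempt <= MAX_JOIN_RETRIES; attempt++) {
            try {
                transport.sendJoinRequest(masterNodeId);
                return true; // joined; the master will publish a cluster state to us
            } catch (Exception e) {
                // Master not ready yet; retry, then fall back to a new ping round.
            }
        }
        return false; // caller restarts pinging / election
    }
}
```
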
bleskes: [Tests] Fixed some issues with SlowClusterStateProcessing (a409848)
Reduced expected time to heal to 0 (we interrupt and wait on stop disruption); it was also wrongly indicated in seconds. We didn't properly wait between slow cluster state tasks.

bleskes: [Discovery] verify we have a master after a successful join request (cccd060)
After master election, nodes send join requests to the elected master. The master is then responsible for publishing a new cluster state which sets the master on the local node's cluster state. If something goes wrong with the cluster state publishing, this process will not successfully complete. We should check it after the join request returns and, if it failed, retry pinging. Closes #6969

kimchy: retry logic to unwrap exception to check for illegal state (0244ddb)
It probably comes wrapped in a remote exception, which we should unwrap in order to detect it. Also simplified the retry logic a bit.

martijnvg: [Discovery] Made the handling of the join request batch oriented (130e680)
In large clusters, when a newly elected master is chosen there are many join requests to handle. By batching them up, the cluster state doesn't get published for each individual join request; many are handled at the same time, resulting in a single new cluster state which ends up being published. Closes #6984

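Roughly, the batching idea can be sketched as below; all types are simplified stand-ins, and only the drain-then-publish-once behaviour comes from the commit message:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: queue join requests and fold all pending ones into a single
 *  cluster state update, so one publish round covers the whole batch. */
final class JoinRequestBatcher {
    private final List<String> pendingJoins = new ArrayList<>(); // joining node ids

    synchronized void onJoinRequest(String nodeId) {
        boolean firstInBatch = pendingJoins.isEmpty();
        pendingJoins.add(nodeId);
        if (firstInBatch) {
            // Only the first request submits an update task; requests arriving
            // before it runs are drained into the same batch.
            submitClusterStateUpdate(this::processPendingJoins);
        }
    }

    private synchronized List<String> drain() {
        List<String> batch = new ArrayList<>(pendingJoins);
        pendingJoins.clear();
        return batch;
    }

    private void processPendingJoins() {
        for (String nodeId : drain()) {
            // ... add nodeId to the DiscoveryNodes of the new cluster state ...
        }
        // ... publish once for the whole batch ...
    }

    // Placeholder: the real code hands the task to the cluster service queue.
    private void submitClusterStateUpdate(Runnable task) {
        task.run();
    }
}
```
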
martijnvg: [TEST] Added test that verifies that no shard relocations happen during/after a master re-election (364374d)

martijnvg: [Discovery] Master fault detection and nodes fault detection should take the cluster name into account (4b8456e)
Both master fault detection and nodes fault detection requests should also send the cluster name, so that on the receiving side the handling of these requests can be failed with an error. This error can be caught on the sending side: for master fault detection the node can fail the master locally, and for nodes fault detection the node can be failed. Note this validation will most likely never fail in a production cluster, but it can during automated tests where clusters/nodes are created and destroyed very frequently.

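The receiving-side validation might look like this; the exception type and handler shape are assumptions:

```java
/** Sketch: reject fault-detection pings that carry a different cluster name. */
final class PingRequestHandler {
    private final String localClusterName;

    PingRequestHandler(String localClusterName) {
        this.localClusterName = localClusterName;
    }

    void handlePing(String requestClusterName) {
        if (requestClusterName != null && !requestClusterName.equals(localClusterName)) {
            // The sender catches this error and fails the master (or node) locally.
            throw new IllegalStateException("ping from another cluster ["
                    + requestClusterName + "], expected [" + localClusterName + "]");
        }
        // ... normal ping handling ...
    }
}
```
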
bleskes: [TEST] Added LongGCDisruption and a test simulating GC on master nodes (50f852f)
Also renamed DiscoveryWithNetworkFailuresTests to DiscoveryWithServiceDisruptions, which better suits what we do.

martijnvg: [Discovery] Added the cluster state version and master node to the nodes fault detection ping request (403ebc9)
The cluster state version allows resolving the case where an old master node becomes unresponsive and later wakes up and pings all the nodes in the cluster, allowing the newly elected master to decide whether it should step down or ask the old master to rejoin.

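A sketch of the tie-break this information enables; only the version comparison is taken from the commit message, the surrounding shape is assumed:

```java
/** Sketch: when a ping from another node claiming mastership arrives,
 *  compare cluster state versions so exactly one master remains. */
final class MasterPingConflictResolver {
    enum Action { STEP_DOWN_LOCALLY, ASK_OTHER_MASTER_TO_REJOIN }

    static Action onPingFromOtherMaster(long localClusterStateVersion,
                                        long pingClusterStateVersion) {
        // The master with the older state yields to the one with the newer state.
        return localClusterStateVersion < pingClusterStateVersion
                ? Action.STEP_DOWN_LOCALLY
                : Action.ASK_OTHER_MASTER_TO_REJOIN;
    }
}
```
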
martijnvg: [TEST] Make sure all shards are allocated before killing a random data node (47326ad)

martijnvg: Typo: s/Recieved/Received (966a55d)

bleskes: [Transport] Introduced worker threads to prevent alien threads from entering a node (26d9088)
Requests are handled by the worker thread pool of the target node instead of the generic thread pool of the source node. This change is also required in order to make GC disruption work with the local transport: previously the handling of a request was performed on a node that was being GC-disrupted, resulting in some actions being performed while GC was being simulated.

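For an in-JVM transport the change can be sketched as dispatching each handler onto the target node's own pool; this is an illustrative reduction, not the actual LocalTransport code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: each simulated node owns its worker pool, so a GC-style disruption
 *  can suspend exactly the threads belonging to that node. */
final class LocalTransportNode {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    void receiveRequest(Runnable handler) {
        // Never run alien (sender) threads inside this node.
        workers.submit(handler);
    }

    void shutdown() {
        workers.shutdownNow();
    }
}
```
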
martijnvg: [TEST] Remove the forceful `network.mode` setting in DiscoveryWithServiceDisruptions#testMasterNodeGCs now that the local transport uses worker threads (702890e)

martijnvg: [TEST] Changed action names (c8919e4)

martijnvg: [TEST] Adapt testNoMasterActions since metadata isn't cleared if there is a no master block (5932371)

bleskes: [Discovery] add a debug log if a node responds to a publish request after publishing timed out (ff8b740)

bleskes: [Discovery] UnicastZenPing should also ping the last known discoNodes (d5552a9)
At the moment, when a node loses its connection to the master (due to a partition or because the master was stopped), we ping the unicast hosts in order to discover other nodes and elect a new master, or learn of another master that has been elected in the meantime. This can go wrong if all unicast targets are on the same side of a minority partition, in which case the node will never rejoin once the partition is healed. Closes #7336
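
A sketch of the resulting target selection; addresses are plain strings here and the helper is hypothetical:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: ping the configured unicast hosts plus the nodes from the last
 *  known cluster state, de-duplicated. */
final class UnicastPingTargets {
    static Set<String> buildTargets(List<String> configuredUnicastHosts,
                                    List<String> lastKnownDiscoNodes) {
        Set<String> targets = new LinkedHashSet<>(configuredUnicastHosts);
        // Also ping nodes we knew before the partition, so the cluster can be
        // rediscovered even if every configured host is on the wrong side.
        targets.addAll(lastKnownDiscoNodes);
        return targets;
    }
}
```
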
bleskes: Add an assertion to ZenDiscovery checking that the local node is never elected if pings indicate an active master

Inline review comments, with replies where one was given:

- Reviewer: maybe call this ... I also think instead of all the instanceof checks we should maybe make this an abstract class and add a method to ClusterStateUpdateTask, maybe? Reply: I like the fact that you indicate the criteria (master or not) in the class name. I also don't like the instanceof checks. Perhaps keep the current name and add a method call.
- Reviewer: can we maybe do ... Reply: sure, will do.
- Reviewer: I'd appreciate if we can do ...
- Reviewer: are we sure the ... Reply: It can't; many things rely on that fact... Reviewer: ok fine! :)
- Reviewer: can we maybe make this a constant? The setting, I mean. Reply: unrelated, but will do (will also convert it to a full path).
- Reviewer: can we use a switch/case statement here? It's easier to read, and 1.7 supports strings. Reply: ++
- Reviewer: can we make those all constants? And does it make sense to randomize some of them? Reply: I don't think we should randomize this one. We do want to rarely randomize the ...
- Reviewer: should we validate this setting? Reply: can add.
- Reviewer: can we make the ... Reply: will do.
- Reviewer: should we do this in a try/catch fashion, and make sure we call it on all of them? Reply: ++
- Reviewer: it might be good to have an assertion somewhere that makes sure it's not there? Reply: will see where it makes sense to add this.
- Reviewer: maybe we can turn this around and do:

      if (electMaster.hasEnoughMasterNodes(possibleMasterNodes)) {
          // lets tie break between discovered nodes
          return electMaster.electMaster(possibleMasterNodes);
      } else {
          logger.trace("not enough master nodes [{}]", possibleMasterNodes);
          return null;
      }

  Reply: will do.
- Reviewer: maybe add a logging statement here? Reply: rather not, because this ping happens every second. I think it will be rarely useful.
- Reviewer: I start to see this a lot; can we have a static helper somewhere?
- Reviewer: it doesn't matter if that one overflows, no? I mean, grows larger than ... Reply: Not really, because it doesn't hurt to do the check twice. Rather be more resilient to errors, exceptions, etc.
- Reviewer: can't this be a list from the beginning? Reply: I just followed the existing pattern; will double-check why it is so. I think it's OK? We want to prefer safety and copy the list anyway because we are changing its content. Reviewer: If that's the case we might as well accept any Iterable. This code is rarely called, so performance is secondary imho.
- Reviewer: can we make this a constant? And ideally not using componentSettings, it's so confusing. Reply: +1
- Reviewer: I am not sure I understand this change here? Reply: This is done to detect master problems faster. Before, we waited 1 second (the default ping interval), but we know here that the master is already active, so we can start pinging immediately.
- Reviewer: maybe it makes sense to put this logic somewhere else, to make sure we know it can be null? Reply: I can put it on the request deserialization, but I think in this case it's better not to hide the fact that the request didn't have a clusterName with it. I'll add a comment as to how it can be null.
- Reviewer: I can't see who uses this? Reply: The GitHub diff is hard to read, but it's used here: https://github.com/elasticsearch/elasticsearch/pull/7493/files#diff-b6157c9d93b838bb6020bbab99bfe7b6R399
- Reviewer: it took me a while to figure out what all these different node lists/sets are; maybe you can find a better name for ... Reply: agreed, will change.
- Reviewer: oh, is this change BW compatible? Reply: The LocalTransport is only used for testing and only works within a single JVM, so I think bwc is not an issue and we should prefer cleaner code?

Overall review: I did a first review round and left some mostly cosmetic comments. I don't know that code well enough to really review it deeply, but the fact that the development on the branch went through several review rounds is a good indicator. I think we are close here.

bleskes: [Internal] introduce ClusterState.UNKNOWN_VERSION constant (d8a5ff0)
Used as a null value for cluster state versions.

Reviewer: the changes here LGTM, thanks!

bleskes: [Internal] Extract a common base class for (Master|Nodes)FaultDetection (596a4a0)
They share a lot of settings and some logic. Closes #7512

bleskes: [Cluster] Refactored ClusterStateUpdateTask protection against execution on a non-master (34f4ca7)
The previous implementation used a marker interface and had no explicit failure callback for the case where the update task was run on a non-master (i.e., the master stepped down after it was submitted). That led to a couple of instanceof checks. This change moves ClusterStateUpdateTask from an interface to an abstract class, which allows adding a flag to indicate whether it should only run on master nodes (defaults to true). It also adds an explicit onNoLongerMaster callback to allow different error handling for that case. This also removed the need for the NoLongerMaster. Closes #7511

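Based on the commit message, the refactored shape is roughly the following; signatures are simplified, and `runOnlyOnMaster` is an assumed name for the flag the message mentions:

```java
/** Sketch following the commit message: an abstract class with a master-only
 *  flag and an explicit onNoLongerMaster callback (the real class works on
 *  ClusterState objects; execute() is simplified here). */
abstract class ClusterStateUpdateTask {
    /** Defaults to true; subclasses that may run on non-masters override this. */
    public boolean runOnlyOnMaster() {
        return true;
    }

    /** Called instead of execute() when the node stepped down after submission. */
    public void onNoLongerMaster(String source) {
        onFailure(source, new IllegalStateException("no longer master. source: " + source));
    }

    public abstract void execute() throws Exception;

    public abstract void onFailure(String source, Throwable t);
}

final class TaskRunner {
    /** The executor-side check that replaces the earlier instanceof tests. */
    static void run(String source, ClusterStateUpdateTask task, boolean localNodeIsMaster)
            throws Exception {
        if (task.runOnlyOnMaster() && !localNodeIsMaster) {
            task.onNoLongerMaster(source);
            return;
        }
        task.execute();
    }
}
```
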
bleskes: [Discovery] accumulated improvements to ZenDiscovery (598854d, 51b89f7)
Merging the accumulated work from the feature/improve_zen branch. Here are the highlights of the changes:
__Testing infra__
- Networking:
  - all symmetric partitioning
  - dropping packets
  - hard disconnects
  - Jepsen Tests
- Single node service disruptions:
  - Long GC / Halt
  - Slow cluster state updates
- Discovery settings:
  - Easy to set up unicast with a partial host list
__Zen Discovery__
- Pinging after master loss (no local elects)
- Fixes the split brain issue: #2488
- Batching join requests
- More resilient joining process (wait on a publish from master)
Closes #7493

bleskes: Resiliency: Master election should demote nodes which try to join the cluster for the first time (a50934e, d35e887, 1f39d43, e9d14b3, d9ea628, cd6dfa4)
With the change in #7493, we introduced a pinging round when a master node goes down. That pinging round helps validate the current state of the cluster and takes, by default, 3 seconds. It may be that during that window a new node tries to join the cluster and starts pinging (this is typical when you quickly restart the current master). If this node gets elected as the new master it will force recovery from the gateway (it has no in-memory cluster state), which in turn will cause a full cluster shard synchronisation. While this is not a problem on its own, it's a shame. This commit demotes "new" nodes during master election so they will only be elected if really needed. Closes #7558

bleskes: Discovery: back-port #7558 to 1.x and add bwc protections for the new ping on master gone introduced in #7493 (4754ca6, 9a2809b, f9667ee, 19adaf8, e5de47d, 54a4b90)
The change in #7558 adds a flag to PingResponse. However, when unicast discovery is used, this extra flag cannot be serialized by the very initial pings, as they do not yet know what node version they ping (i.e., they have to default to 1.0.0, which excludes changing the serialization format). This commit bypasses the problem by adding a dedicated action which only exists on nodes of version 1.4 or up. Nodes first try to ping this endpoint using 1.4.0 as the serialization version. If that fails, they fall back to the pre-1.4.0 action. This is optimal if all nodes are on 1.4.0 or higher, with a small downside if the cluster has mixed versions - but this is a temporary state. Two further bwc protections are added: 1) Disable the preference for nodes which previously joined the cluster if some of the pings are from versions < 1.4.0. 2) Disable the rejoin-on-master-gone functionality if some nodes in the cluster are on versions < 1.4.0. Closes #7694