
Conversation

@bleskes (Contributor) commented Nov 29, 2016

Since the removal of local discovery in #20960, we rely on minimum master nodes being set in our test cluster. The setting is automatically managed by the cluster (by default), but the current management doesn't work with concurrent async starting of single nodes. On the other hand, with MockZenPing and discovery.initial_state_timeout set to 0s, node starting and joining is very fast, making async starting an unneeded complexity. Tests that still need async starting could, in theory, still do so themselves via background threads.
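For illustration, a rough sketch of the pattern the description refers to - a test node started with the initial-state wait disabled (the class and test names here are hypothetical, not code from this PR):

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.discovery.DiscoverySettings;
import org.elasticsearch.test.ESIntegTestCase;

public class FastNodeStartIT extends ESIntegTestCase {
    public void testFastNodeStart() throws Exception {
        // With MockZenPing in the picture, disabling the initial-state wait means
        // startNode returns almost immediately instead of blocking until the node
        // has joined a cluster; cluster formation is then verified separately.
        Settings fastStart = Settings.builder()
                .put(DiscoverySettings.INITIAL_STATE_TIMEOUT_SETTING.getKey(), "0s")
                .build();
        internalCluster().startNode(fastStart);
        ensureStableCluster(1);
    }
}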

@bleskes bleskes added >test Issues or PRs that are addressing/adding tests v5.1.1 v5.2.0 v6.0.0-alpha1 labels Nov 29, 2016
@bleskes (Contributor Author) commented Nov 29, 2016

I marked this as going to 5.1 as well, but I will only backport it after it has proven stable on the other branches.

@bleskes bleskes changed the title Remove InternalTestCluster.startNodesAsync Remove InternalTestCluster.startNode(s)Async Nov 29, 2016
@ywelsch (Contributor) left a comment


Left 2 comments. I'm pleasantly surprised this turned out to be so simple.

import static org.hamcrest.Matchers.equalTo;

@ClusterScope(scope = Scope.TEST, numDataNodes = 0, autoMinMasterNodes = false)
public class ZenUnicastDiscoveryIT extends ESIntegTestCase {
@ywelsch (Contributor):

why remove this test?

@bleskes (Contributor Author):

I didn't see much added value in it now that we use unicast by default in all tests?

@ywelsch (Contributor):

ok, makes sense.

.put("discovery.zen.join_timeout", "10s") // still long to induce failures but to long so test won't time out
.put(DiscoverySettings.PUBLISH_TIMEOUT_SETTING.getKey(), "1s") // <-- for hitting simulated network failures quickly
.put(ElectMasterService.DISCOVERY_ZEN_MINIMUM_MASTER_NODES_SETTING.getKey(), 2)
.put(DiscoverySettings.INITIAL_STATE_TIMEOUT_SETTING.getKey(), "0") // <-- we wait for cluster formation at the end
@ywelsch (Contributor):

I don't like this setting being misused now in multiple places just because we lack the async versions of starting up nodes (I don't think it should be used by autoManageMinMasterNodes either). Instead, startNodes could be smarter and start up the nodes in parallel, i.e. just call node.start() in parallel and wait for all nodes to return from that method before continuing.
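For reference, a minimal sketch of what this suggestion could look like - a hypothetical helper, not the PR's actual code, and it assumes the nodes have already been built sequentially elsewhere:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.elasticsearch.node.Node;

class ParallelStartSketch {
    // Start all nodes in parallel and block until every Node#start() has returned.
    static void startNodesInParallel(List<Node> nodes) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(nodes.size());
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (Node node : nodes) {
                futures.add(executor.submit(() -> {
                    node.start(); // mostly schedules work and kicks off the join process
                    return null;
                }));
            }
            for (Future<Void> future : futures) {
                future.get(); // propagate any startup failure
            }
        } finally {
            executor.shutdown();
        }
    }
}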

@bleskes (Contributor Author):

I don't see why this is a misuse of the setting. It's there for the out-of-the-box experience, so that when you start a node it's ready and has successfully joined the cluster, but if you don't need or want that, you can disable it. That's why it's a setting?

@ywelsch (Contributor):

It exposes the sequential startup used by startNodes to users of this API, which is completely unnecessary (I've even outlined a solution) and is the only reason this setting is required. Calling startNodes(3) should not trip up because some internals require an extra setting to be passed. Let's have a clean API that does the right thing.

@bleskes (Contributor Author):

answering out of order:

> which is completely unnecessary (I've even outlined a solution)

My goal here is to remove the complexity and asynchronicity of node starting. Adding that back would defeat what I wanted to achieve. I went through the tests and didn't see any test that actually relies on this async nature (except for bypassing this waiting).

> Calling startNodes(3) should not trip up because some internals require an extra setting to be passed.

To be clear: if you use that method when min_master_nodes is auto-managed, you don't need to do anything. It's only if you "own" that setting that you also have to deal with the sequential nature of starting (unless you use this setting, which IMO is a valid use case and the reason for its existence).

> It exposes the sequential startup used by startNodes to users of this API,

I can document that and make it official? I very much prefer that to doing async work in the test cluster.

@ywelsch (Contributor):

> My goal here is to remove the complexity and asynchronicity of node starting. Adding that back would defeat what I wanted to achieve.

I'm not arguing that the startNode(s) APIs should expose asynchronicity. I'm talking about how the APIs should work internally. There, I think they should behave the way you would normally start up nodes in real life. An API call startNodes(3) should start 3 nodes and, after the method returns, have all 3 nodes in a state where they're started (their Node#start() methods have returned). Internally, the API should follow what I would do in a real-world scenario: start the 3 nodes and wait for all 3 to be ready. I wouldn't use INITIAL_STATE_TIMEOUT_SETTING to achieve that; I would just boot them up in parallel.

> To be clear: if you use that method when min_master_nodes is auto-managed, you don't need to do anything. It's only if you "own" that setting that you also have to deal with the sequential nature of starting (unless you use this setting, which IMO is a valid use case and the reason for its existence).

It does not matter whether min_master_nodes is auto-managed or not. In both cases, there is no need for INITIAL_STATE_TIMEOUT_SETTING if this is implemented in the way I outlined.

> the reason for its existence

No. It's used as an internal setting to start up tribe nodes. It's not even mentioned in our documentation.

I think that this is the wrong approach and that there is a cleaner way to do this.

@bleskes (Contributor Author) commented Dec 1, 2016:

> No. It's used as an internal setting to start up tribe nodes.

Just an FYI - that setting has been there at least since 0.90.

> I think that this is the wrong approach and that there is a cleaner way to do this.

I understand you feel differently about it. It's a judgment call - I see your argument about having nodes start as they normally would. On the other hand, going through the tests, all the async behavior they need is covered by just waiting for the join to complete. On top of that, sync starting is easier to debug, (sometimes) makes randomness easier, etc. One of my goals here was to remove async starting because of that.

As this is a judgment call and we can go both ways, I will bring it up for discussion in Fix it Friday and see what people think.

@ywelsch (Contributor):

> On top of that, sync starting is easier to debug, (sometimes) makes randomness easier, etc. One of my goals here was to remove async starting because of that.

I don't get that argument. I said above that the nodes can still be initialized sequentially. I'm just advocating calling the start method in parallel. That method does very little and is mostly about scheduling some stuff on the thread pool and starting the joining process (all of which is done in separate threads anyhow).

@bleskes (Contributor Author):

> That method does very little and is mostly about scheduling some stuff on the thread pool and starting the joining process (all of which is done in separate threads anyhow).

Agreed - I see that side of the argument as well. We discussed this, and the majority feels that limiting the async behavior to the start method is the right trade-off (in order to not have custom settings). I'll adapt the PR.

@bleskes bleskes added discuss and removed discuss labels Dec 1, 2016
@bleskes bleskes force-pushed the remove_nodes_async branch from 546c2db to 284f3c1 Compare December 4, 2016 15:50
@bleskes (Contributor Author) commented Dec 4, 2016

@ywelsch I pushed some more commits. Can you take another look?

@ywelsch (Contributor) left a comment

The change looks good. I've left two minor comments.

    }
} catch (InterruptedException e) {
    Thread.interrupted();
    return;
@ywelsch (Contributor):

I wonder in which cases this can happen, as the close method on InternalTestCluster that calls shutdown on the executor is synchronized (same as this method). Have you observed this exception being thrown? If we don't expect this to occur under normal operations, I would prefer not to swallow the exception here.

@bleskes (Contributor Author):

The interrupted exception is always a battle as to what to do with it. I have never seen it in practice, but I have to deal with it because it's a checked exception. Note though that I didn't mean to swallow it, but rather intended to set the thread's interrupt flag (which I failed miserably to do and will fix). The alternative is to throw a `RuntimeException`, but that feels ugly as well (and I didn't want to force everyone to deal with InterruptedException by adding it to the signature). Which option do you prefer?
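(For context, a standalone demo of the flag mix-up under discussion - plain JDK behavior, not project code: Thread.interrupted() tests and clears the current thread's interrupt flag, while Thread.currentThread().interrupt() sets it.)

public class InterruptFlagDemo {
    public static void main(String[] args) {
        Thread.currentThread().interrupt();                          // set the flag
        System.out.println(Thread.interrupted());                    // true - but this call also CLEARS the flag
        System.out.println(Thread.currentThread().isInterrupted());  // false - the signal has been lost
        Thread.currentThread().interrupt();                          // the way to actually (re)set the flag
        System.out.println(Thread.currentThread().isInterrupted());  // true - the signal is preserved
    }
}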

@ywelsch (Contributor):

Maybe we can throw an AssertionError to make it clear that reaching this point is not expected (i.e. it will serve as documentation and validate our assumptions).

@bleskes (Contributor Author):

sure thing - just need to come up with a proper error message :)
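A minimal sketch of the agreed handling, with a hypothetical helper name and message (the committed code may differ):

// Reaching the catch block would violate our assumption that nothing interrupts
// this thread, so fail loudly instead of swallowing the exception or wrapping it
// in a RuntimeException.
private static void awaitStarter(Thread starter) {
    try {
        starter.join();
    } catch (InterruptedException e) {
        throw new AssertionError("node start was unexpectedly interrupted", e);
    }
}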

* Starts multiple nodes (based on the number of settings provided) in an async manner, with explicit settings for each node.
* The order of the node names returned matches the order of the settings provided.
*/
public synchronized Async<List<String>> startNodesAsync(final Settings... settings) {
@ywelsch (Contributor):

Can the Async interface also be removed? I see it's an interface defined inside InternalTestCluster.

@bleskes bleskes merged commit a7050b2 into elastic:master Dec 6, 2016
@bleskes (Contributor Author) commented Dec 6, 2016

Thx @ywelsch. I pushed this to master; will wait a day before backporting.

@bleskes bleskes deleted the remove_nodes_async branch December 6, 2016 11:08
bleskes added a commit that referenced this pull request Dec 9, 2016
Since the removal of local discovery in #20960, we rely on minimum master nodes being set in our test cluster. The setting is automatically managed by the cluster (by default), but the current management doesn't work with concurrent async starting of single nodes. On the other hand, with `MockZenPing` and `discovery.initial_state_timeout` set to `0s`, node starting and joining is very fast, making async starting an unneeded complexity. Tests that still need async starting could, in theory, still do so themselves via background threads.

Note that this change also removes the usage of `INITIAL_STATE_TIMEOUT_SETTING`, as the starting of nodes is done concurrently (though building them is sequential).
@bleskes (Contributor Author) commented Dec 9, 2016

This is now backported to 5.1 as well.

javanna pushed a commit to javanna/elasticsearch that referenced this pull request Dec 12, 2016
Since the removal of local discovery in elastic#20960, we rely on minimum master nodes being set in our test cluster. The setting is automatically managed by the cluster (by default), but the current management doesn't work with concurrent async starting of single nodes. On the other hand, with `MockZenPing` and `discovery.initial_state_timeout` set to `0s`, node starting and joining is very fast, making async starting an unneeded complexity. Tests that still need async starting could, in theory, still do so themselves via background threads.

Note that this change also removes the usage of `INITIAL_STATE_TIMEOUT_SETTING`, as the starting of nodes is done concurrently (though building them is sequential).
bleskes added a commit that referenced this pull request Dec 14, 2016
In order to start clusters with min master nodes set without setting `discovery.initial_state_timeout`, #21846 has changed the way we start nodes. Instead of the previous serial startup, we now always start the nodes in an async fashion (internally). This means that starting a cluster is unsafe without `min_master_nodes` being set. We should therefore make it mandatory.
bleskes added a commit that referenced this pull request Dec 14, 2016
In order to start clusters with min master nodes set without setting `discovery.initial_state_timeout`, #21846 has changed the way we start nodes. Instead of the previous serial startup, we now always start the nodes in an async fashion (internally). This means that starting a cluster is unsafe without `min_master_nodes` being set. We should therefore make it mandatory.
bleskes added a commit that referenced this pull request Dec 14, 2016
In order to start clusters with min master nodes set without setting `discovery.initial_state_timeout`, #21846 has changed the way we start nodes. Instead of the previous serial startup, we now always start the nodes in an async fashion (internally). This means that starting a cluster is unsafe without `min_master_nodes` being set. We should therefore make it mandatory.