Remove InternalTestCluster.startNode(s)Async
#21846
Conversation
I marked this as going to 5.1 as well, but I will only do so after it has proven stable on the other branches.
Title changed: InternalTestCluster.startNodesAsync → InternalTestCluster.startNode(s)Async
Left 2 comments. I'm positively surprised this turned out so simple.
```java
import static org.hamcrest.Matchers.equalTo;

@ClusterScope(scope = Scope.TEST, numDataNodes = 0, autoMinMasterNodes = false)
public class ZenUnicastDiscoveryIT extends ESIntegTestCase {
```
why remove this test?
I didn't see much added value in it now that we use unicast by default on all tests?
ok, makes sense.
| .put("discovery.zen.join_timeout", "10s") // still long to induce failures but to long so test won't time out | ||
| .put(DiscoverySettings.PUBLISH_TIMEOUT_SETTING.getKey(), "1s") // <-- for hitting simulated network failures quickly | ||
| .put(ElectMasterService.DISCOVERY_ZEN_MINIMUM_MASTER_NODES_SETTING.getKey(), 2) | ||
| .put(DiscoverySettings.INITIAL_STATE_TIMEOUT_SETTING.getKey(), "0") // <-- we wait for cluster formation at the end |
I don't like this setting being misused now in multiple places as we lack the async versions of starting up nodes (I don't think that it should be used by autoManageMinMasterNodes either). Instead startNodes could be smarter and start up the nodes in parallel, i.e. just call node.start() in parallel and wait for all nodes to return from that method before continuing.
I don't see why this is a misuse of the setting? It's there for the OOB experience so when you start a node it's ready and has successfully joined the cluster, but if you don't need it/want it you can disable it. That's why it's a setting?
It exposes the sequential nature that is used by startNodes to users of this API, which is completely unnecessary (I've even outlined a solution) and is the only reason this setting is required. Calling startNodes(3) should not trip up because some internals require an extra setting to be passed. Let's have a clean API that does the right thing.
answering out of order:
which is completely unnecessary (I've even outlined a solution)
My goal here is to remove the complexity and asynchronicity of node starting. Adding that back will defeat what I wanted to achieve. I went through the tests and didn't see any test that actually relies on this async nature (except for bypassing this waiting).
Calling startNodes(3) should not trip up because of some internals require an extra setting to be passed.
To be clear - if you use that method when min_master_nodes is auto managed you don't need to do anything. It's only if you "own" that setting that you have to also deal with the sequential nature of starting (unless you use this setting, which IMO is a valid use case and the reason for its existence).
It exposes the sequential nature that is used by startNodes to users of this API,
I can document that and make it official? I very much prefer that to doing async work in the test cluster.
My goal here is to remove the complexity and asynchronicity of node starting. Adding that back will defeat what I wanted to achieve.
I'm not arguing that the startNode(s) APIs should expose asynchronicity. I'm talking about how the APIs should work internally. There, I think they should behave the way you would normally start up nodes in real life. An API call startNodes(3) should start 3 nodes and, after the method returns, have all 3 nodes in a state where they're started (their Node#start() methods have returned). Internally, the API should follow what I would do in a real-world scenario: start the 3 nodes and wait for all 3 to be ready. I wouldn't go and use the INITIAL_STATE_TIMEOUT_SETTING to achieve that but just boot them up in parallel.
To be clear - if you use that method when min_master_nodes is auto managed you don't need to do anything. It's only if you "own" that setting than you have to also deal with the sequential nature of starting (unless you use this setting which IMO a valid use case and the reason for it's existence).
It does not matter whether min_master_nodes is auto-managed or not. In both those cases, there is no need for INITIAL_STATE_TIMEOUT_SETTING if it's implemented in the way I outlined.
the reason for its existence
No. It's used as an internal setting to start up tribe nodes. It's not even mentioned in our documentation.
I think that this is the wrong approach and that there is a cleaner way to do this.
No. It's used as an internal setting to start up tribe node
Just an FYI - that setting has been there at least since 0.90.
I think that this is the wrong approach and that there is a cleaner way to do this.
I understand you feel differently about it. It's a judgment call - I see your argument about having nodes start as they would normally do. On the other hand, going through the tests, all the asynchronicity they need is covered by just waiting for the join to complete. On the flip side of that, sync starting is easier to debug, makes randomness (sometimes) easier, etc. One of my goals here was to remove it because of that.
As this is a judgment call and we can go both ways - I will bring it up for discussion in fix it friday and see what people think.
On the flip side of that, sync starting is easier to debug, makes randomness (sometimes) easier, etc. One of my goals here was to remove it because of that.
I don't get that argument. I said above that the nodes can still be initialized sequentially. I'm just advocating calling the startNode method in parallel. That method does very little and is mostly about scheduling some stuff on the thread pool and starting the joining process (all of which is done in separate threads anyhow).
That method does very little and is mostly about scheduling some stuff on the thread pool and starting the joining process (all of which is done in separate threads anyhow).
Agreed and I see that side of the argument as well. We discussed this and the majority feels that limiting the async to the start method is the right trade off (in order to not have custom settings). I'll adapt the PR.
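For reference, a minimal sketch of what "limiting the async to the start method" could look like: nodes are still built one by one, and only the start() calls run in parallel. The `NodeHandle` type and `startAll` helper below are hypothetical stand-ins, not the actual InternalTestCluster code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a node that has been built but not yet started. */
interface NodeHandle {
    void start();
}

class ParallelStartSketch {
    /**
     * Given nodes that were built sequentially (deterministic construction), call start()
     * on all of them in parallel and block until every start() call has returned.
     */
    static void startAll(List<NodeHandle> builtNodes) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, builtNodes.size()));
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (NodeHandle node : builtNodes) {
                // start() is cheap: it mostly schedules work and kicks off the join process
                futures.add(executor.submit(node::start));
            }
            for (Future<?> future : futures) {
                future.get(); // propagates startup failures and waits for every start() to return
            }
        } finally {
            executor.shutdown();
        }
    }
}
```

In a sketch like this no INITIAL_STATE_TIMEOUT_SETTING is involved; the caller simply regains control once all start() calls have returned.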
This reverts commit 2b10e31.
…sync fashion internally
546c2db to 284f3c1
@ywelsch I pushed some more commits. Can you take another look?
The change looks good. I've left two minor comments.
```java
    }
} catch (InterruptedException e) {
    Thread.interrupted();
    return;
```
I wonder about the cases where this can happen as the close method on InternalTestCluster that calls shutdown on the executor is synchronized (same as this method). Have you observed this exception being thrown? If we don't expect this to occur under normal operations, I would prefer not to swallow the exception here.
The interrupted exception is always a battle as to what to do. I've never seen it in practice but I have to deal with it because of checked exceptions. Note though that I didn't swallow it but rather intended to set the thread flag (which I failed miserably to do and will fix). The alternative is to throw a `RuntimeException` but that feels ugly as well (and I didn't want to force everyone to deal with InterruptedException by adding it to the signature). Which option do you prefer?
maybe we can throw an AssertionError to make it clear that getting to this place is not expected (i.e. it will serve as documentation and validate our assumptions).
sure thing - just need to come up with a proper error message :)
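For illustration, a sketch of the two options weighed in this sub-thread, under the assumption (stated above) that interruption is not expected here; `await` is a hypothetical helper name.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

class InterruptHandlingSketch {
    /** Wait on a future in a place where interruption is not expected under normal operation. */
    static void await(Future<?> future) throws ExecutionException {
        try {
            future.get();
        } catch (InterruptedException e) {
            // Note: Thread.interrupted() *clears* the interrupt flag and returns its old value;
            // restoring the flag would be Thread.currentThread().interrupt().
            // The option agreed on above: treat interruption here as a violated assumption.
            throw new AssertionError("unexpected interruption while waiting for node startup", e);
        }
    }
}
```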
```java
 * Starts multiple nodes (based on the number of settings provided) in an async manner, with explicit settings for each node.
 * The order of the node names returned matches the order of the settings provided.
 */
public synchronized Async<List<String>> startNodesAsync(final Settings... settings) {
```
Can the Async interface also be removed? I see it's an interface defined inside InternalTestCluster.
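As a rough sketch of the shape this question is getting at — the signatures below are stand-ins, not the actual InternalTestCluster API — once startNodes blocks until all nodes have started, the nested Async handle has nothing left to carry.

```java
import java.util.List;

/** Hypothetical before/after shape; names and generics are stand-ins, not the real code. */
interface StartNodesApiSketch {
    /** Before: callers received an Async handle and had to resolve it themselves. */
    interface Async<T> {
        T get() throws InterruptedException;
    }

    Async<List<String>> startNodesAsync(String... nodeNames);

    /** After: the call blocks until every node's start() has returned, so no Async handle is needed. */
    List<String> startNodes(String... nodeNames);
}
```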
Thx @ywelsch. I pushed this to master. Will wait a day before backporting.
Since the removal of local discovery in #20960 we rely on minimum master nodes to be set in our test cluster. The setting is automatically managed by the cluster (by default) but the current management doesn't work with concurrent single node async starting. On the other hand, with `MockZenPing` and the `discovery.initial_state_timeout` set to `0s`, node starting and joining is very fast, making async starting an unneeded complexity. Tests that still need async starting could, in theory, still do so themselves via background threads. Note that this change also removes the usage of `INITIAL_STATE_TIMEOUT_SETTINGS` as the starting of nodes is done concurrently (but building them is sequential).
this is now backported to 5.1 as well
In order to start clusters with min master nodes set without setting `discovery.initial_state_timeout`, #21846 has changed the way we start nodes. Instead of the previous serial start up, we now always start the nodes in an async fashion (internally). This means that starting a cluster is unsafe without `min_master_nodes` being set. We should therefore make it mandatory.