Defer reroute when nodes join #42855

Merged

@DaveCTurner
Contributor

commented Jun 4, 2019

Today the master eagerly reroutes the cluster as part of processing node joins.
However, it is not necessary to do this reroute straight away, and it is
sometimes preferable to defer it until later. For instance, when the master
wins its election it processes joins and performs a reroute, but it would be
better to defer the reroute until after the master has become properly
established.

This change defers this reroute into a separate task.

DaveCTurner added 3 commits Jun 4, 2019
Today the master eagerly reroutes the cluster as part of processing node joins.
However, it is not necessary to do this reroute straight away, and it is
sometimes preferable to defer it until later. For instance, when the master
wins its election it processes joins and performs a reroute, but it would be
better to defer the reroute until after the master has become properly
established.

This change defers this reroute into a separate task, and batches multiple such
tasks together.
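
As a rough illustration of the idea in this description (defer the reroute into its own task, and let requests that arrive before that task runs share it), here is a minimal sketch. It is not the code from this PR: `BatchedRerouter`, `requestReroute` and the `Consumer<List<String>>` callback are hypothetical names chosen for the example, and the real change works through the master service's task machinery rather than a plain `Executor`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

/** Hypothetical sketch: coalesces reroute requests into a single deferred task. */
final class BatchedRerouter {
    private final Executor executor;              // runs the deferred task (assumed, not the master service)
    private final Consumer<List<String>> reroute; // performs one reroute for a whole batch of reasons
    private final Object mutex = new Object();
    private List<String> pendingReasons = null;   // null means no deferred task is currently scheduled

    BatchedRerouter(Executor executor, Consumer<List<String>> reroute) {
        this.executor = executor;
        this.reroute = reroute;
    }

    /** Called while processing a join: records the request and returns without rerouting. */
    void requestReroute(String reason) {
        synchronized (mutex) {
            if (pendingReasons != null) {
                // A deferred task is already scheduled, so this request joins its batch.
                pendingReasons.add(reason);
                return;
            }
            pendingReasons = new ArrayList<>();
            pendingReasons.add(reason);
        }
        // First request of a new batch: schedule exactly one deferred task for it.
        executor.execute(this::runBatch);
    }

    private void runBatch() {
        final List<String> batch;
        synchronized (mutex) {
            batch = pendingReasons;
            pendingReasons = null; // requests arriving from now on start a new batch
        }
        reroute.accept(batch);
    }
}
```

With an executor that only queues tasks, two calls to `requestReroute` made before the queued task runs result in a single invocation of the callback carrying both reasons; a call made after the task has run schedules a fresh task.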
@elasticmachine

Collaborator

commented Jun 4, 2019

@DaveCTurner

Contributor Author

commented Jun 5, 2019

@ywelsch I managed to synthesise a failure by changing some of the randomBoolean()s and rarely()s in IndicesClusterStateServiceRandomUpdatesTests#testRandomClusterStateUpdates, but the probability of hitting it seemed very low otherwise.

DaveCTurner added 2 commits Jun 5, 2019
@@ -141,15 +142,17 @@
public Coordinator(String nodeName, Settings settings, ClusterSettings clusterSettings, TransportService transportService,

@andrershov

andrershov Jun 5, 2019

Contributor

I think we're at the point where the number of Coordinator constructor parameters is unmanageable. Could we add a JavaDoc to the constructor describing how each dependency is used by Coordinator?

@DaveCTurner

DaveCTurner Jun 6, 2019

Author Contributor

I know what you mean, but most of them are of very specific types. I added docs for the ones whose types don't make their meaning so clear in d9dbf2d (including the one added here).

equalTo(false));

final AtomicBoolean stopRerouting = new AtomicBoolean();
final Thread rerouteThread = new Thread(() -> {

@andrershov

andrershov Jun 5, 2019

Contributor

I'm not sure I understand what the reroute thread is doing here. I understand that it continuously performs reroutes, but why is it needed in the test? Can you add a comment, please?

@DaveCTurner

DaveCTurner Jun 6, 2019

Author Contributor

I found a simpler way to test the same thing now that I understand what's going on a bit better. See 563ea02.

DaveCTurner added 3 commits Jun 6, 2019
@DaveCTurner DaveCTurner requested a review from andrershov Jun 6, 2019
Contributor

left a comment

One nit; looking good otherwise.

@@ -1243,7 +1243,8 @@ public void start(ClusterState initialState) {
 allocationService, masterService, () -> persistedState,
 hostsResolver -> testClusterNodes.nodes.values().stream().filter(n -> n.node.isMasterNode())
 .map(n -> n.node.getAddress()).collect(Collectors.toList()),
-clusterService.getClusterApplierService(), Collections.emptyList(), random());
+clusterService.getClusterApplierService(), Collections.emptyList(), random(),
+s -> {});

@ywelsch

ywelsch Jun 11, 2019

Contributor

Perhaps plug in the actual RoutingService here. This test cares about shards and realistic mocking, so I'm worried that a lot of effort will be spent in the future figuring out why shards are not allocated.

@DaveCTurner DaveCTurner merged commit ddedf80 into elastic:master Jun 11, 2019
8 checks passed
CLA: All commits in pull request signed
elasticsearch-ci/1: Build finished.
elasticsearch-ci/2: Build finished.
elasticsearch-ci/bwc: Build finished.
elasticsearch-ci/default-distro: Build finished.
elasticsearch-ci/docbldesx: Build finished.
elasticsearch-ci/oss-distro-docs: Build finished.
elasticsearch-ci/packaging-sample: Build finished.
@DaveCTurner DaveCTurner deleted the DaveCTurner:2019-06-04-deferred-reroute-on-join branch Jun 11, 2019
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 11, 2019
DaveCTurner added a commit that referenced this pull request Jun 11, 2019