Reduce cluster update reroutes with async fetch #11421

kimchy · 2015-05-29T15:15:15Z

When using async fetch, we can end up with a number of cluster updates and reroutes proportional to the number of shards. While not disastrous, we can optimize it, since a single reroute is enough to apply all the async fetch results that arrived during that time.
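The idea of letting one pending reroute cover every fetch result that arrives in the meantime can be sketched as below. This is a minimal illustration, not the code from this PR: `CoalescingRerouter`, `onFetchCompleted`, `runPendingTasks`, and the in-memory task queue are hypothetical stand-ins for the gateway allocator and the cluster state update queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: coalesce many "fetch completed" events into one reroute. */
class CoalescingRerouter {
    // True while a reroute task has been submitted but has not started running yet.
    private final AtomicBoolean rerouting = new AtomicBoolean(false);
    // Stand-in for the cluster state update queue (assumed single-threaded here).
    private final Queue<Runnable> pendingTasks = new ArrayDeque<>();
    final AtomicInteger reroutesExecuted = new AtomicInteger();

    /** Called once per shard when its async fetch result arrives; returns true if a reroute was submitted. */
    boolean onFetchCompleted(String shardId) {
        // Only the caller that flips the flag from false to true submits a task.
        if (!rerouting.compareAndSet(false, true)) {
            return false; // a reroute is already pending and will pick up this shard's result too
        }
        pendingTasks.add(() -> {
            // Clear the flag first, so results arriving from now on trigger a fresh reroute.
            rerouting.set(false);
            reroutesExecuted.incrementAndGet();
        });
        return true;
    }

    /** Drains the queue, running each submitted reroute task in order. */
    void runPendingTasks() {
        Runnable task;
        while ((task = pendingTasks.poll()) != null) {
            task.run();
        }
    }
}
```

With this shape, results for three shards arriving before the queue is drained produce one submitted task instead of three.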

dakrone · 2015-05-29T17:24:52Z

src/main/java/org/elasticsearch/gateway/GatewayAllocator.java

+                logger.trace("{} already has pending reroute, ignoring {}", shardId, reason);
+                return;
+            }
+            clusterService.submitStateUpdateTask("async_shard_fetch", Priority.HIGH, new ClusterStateUpdateTask() {


I think it would be valuable to have the original type, shardId, and reason in the message. Did you remove them on purpose?

kimchy · 2015-05-29

My thought was that it becomes less relevant, since a single reroute now represents several events. Seeing one shard id in the pending task information can be misleading when the reroute actually covers multiple shards.

dakrone · 2015-05-29

Makes sense, thanks!

dakrone · 2015-05-29T17:25:15Z

left one comment about removing info from the message, other than that LGTM

kimchy · 2015-05-29T20:23:55Z

@dakrone I added a comment back, tell me if it makes sense

After asynchronously fetching shard information, the gateway allocator issues a reroute via a cluster state update task. #11421 introduced an optimization that tries to avoid submitting unneeded reroutes when results for many shards come in together. It does this with a rerouting flag, which indicates that a reroute is already pending, so any newly arriving shard info doesn't need to issue another one. This flag wasn't reset when the reroute update task failed. Most notably, if a master node had to step down due to a min_master_node violation, it could reject an ongoing reroute. Because the flag was never reset, the node skipped every future reroute once it became master again. Closes #11519
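The failure mode described above can be reproduced in a small model. This is a hedged sketch rather than the actual patch: `RerouteFlagReset`, `UpdateTask`, `requestReroute`, and the `resetOnFailure` switch are hypothetical names, and rejection by a node that is no longer master is simulated by calling `onFailure` directly.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of the rerouting flag, with and without a reset on failure. */
class RerouteFlagReset {
    /** Stand-in for a cluster state update task with a success path and a failure path. */
    interface UpdateTask {
        void execute();
        void onFailure(Throwable t);
    }

    private final AtomicBoolean rerouting = new AtomicBoolean(false);
    private final boolean resetOnFailure;
    int executedReroutes = 0;

    RerouteFlagReset(boolean resetOnFailure) {
        this.resetOnFailure = resetOnFailure;
    }

    /** Returns the task to submit, or null when a reroute is already considered pending. */
    UpdateTask requestReroute() {
        if (!rerouting.compareAndSet(false, true)) {
            return null;
        }
        return new UpdateTask() {
            @Override
            public void execute() {
                rerouting.set(false); // success path: always clears the flag
                executedReroutes++;
            }

            @Override
            public void onFailure(Throwable t) {
                // Without this reset the flag stays true forever and
                // every later requestReroute() call returns null.
                if (resetOnFailure) {
                    rerouting.set(false);
                }
            }
        };
    }
}
```

With `resetOnFailure` set to `false`, one rejected task leaves the allocator permanently believing a reroute is pending. With `true`, the next fetch result can submit a new reroute.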

kimchy added the v2.0.0-beta1, review, and v1.6.0 labels May 29, 2015

dakrone reviewed May 29, 2015

clintongormley added the >enhancement and :Distributed/Recovery labels May 29, 2015

kimchy closed this May 29, 2015

kevinkluge removed the review label May 29, 2015

kimchy deleted the minimize_reroute branch May 29, 2015 21:40

bleskes mentioned this pull request Jun 5, 2015

GatewayAllocator: reset rerouting flag after error #11519

Closed
