GatewayAllocator: reset rerouting flag after error #11519

Closed

@bleskes
Member

commented Jun 5, 2015

After asynchronously fetching shard information, the gateway allocator issues a reroute via a cluster state update task. #11421 introduced an optimization that avoids submitting unneeded reroutes when results for many shards come in together. It does so with a rerouting flag, which indicates that a reroute is already pending, so newly arriving shard info doesn't need to issue another one. This flag wasn't reset when the reroute update task hit an error. Most notably, if a master node had to step down due to a min_master_node violation, it could reject an ongoing reroute. Because the flag was never reset, the node skipped all future reroutes once it became master again.

Example failure: http://build-us-00.elastic.co/job/es_core_1x_metal/9122/testReport/junit/org.elasticsearch.cluster/MinimumMasterNodesTests/multipleNodesShutdownNonMasterNodes/
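
For readers unfamiliar with the pattern, here is a minimal sketch of the flag handling described above. It is not the Elasticsearch code: `UpdateTask`, `TaskQueue`, and `RerouteGuard` are hypothetical stand-ins for the cluster state update task, the cluster service, and the allocator, and the `master` field only simulates a node that has stepped down. The point it illustrates is that the flag has to be cleared on the failure path as well as the success path.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a cluster state update task (not the Elasticsearch API).
interface UpdateTask {
    void execute() throws Exception;
    void onFailure(Exception e);
}

// Hypothetical single-threaded queue standing in for the cluster service.
class TaskQueue {
    private final Queue<UpdateTask> pending = new ArrayDeque<>();
    boolean master = true; // simulates whether this node is currently master
    int submitted = 0;     // number of tasks submitted so far

    void submit(UpdateTask task) {
        submitted++;
        pending.add(task);
    }

    // Runs all pending tasks; a non-master node rejects them via onFailure.
    void runAll() {
        UpdateTask task;
        while ((task = pending.poll()) != null) {
            try {
                if (!master) {
                    throw new IllegalStateException("no longer master");
                }
                task.execute();
            } catch (Exception e) {
                task.onFailure(e);
            }
        }
    }
}

// Hypothetical allocator fragment: at most one reroute is pending at a time.
class RerouteGuard {
    private final AtomicBoolean rerouting = new AtomicBoolean();
    private final TaskQueue queue;
    int reroutesRun = 0;

    RerouteGuard(TaskQueue queue) {
        this.queue = queue;
    }

    void onShardInfoFetched() {
        // Skip if a reroute is already pending; it will pick up this result too.
        if (!rerouting.compareAndSet(false, true)) {
            return;
        }
        queue.submit(new UpdateTask() {
            @Override
            public void execute() {
                rerouting.set(false); // clear first so later results can trigger a new reroute
                reroutesRun++;
            }

            @Override
            public void onFailure(Exception e) {
                rerouting.set(false); // without this reset, every future reroute is skipped
            }
        });
    }
}
```

In this sketch, removing the reset in `onFailure` reproduces the symptom: after one rejected task, `onShardInfoFetched()` never submits again, even once `master` is set back to `true`.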

GatewayAllocator: reset rerouting flag after error
@@ -167,7 +167,6 @@ public void run() {

@Test @LuceneTestCase.Slow
@TestLogging("cluster.routing.allocation.allocator:TRACE")

s1monw Jun 5, 2015

Contributor

trace can go away too?

bleskes Jun 5, 2015

Author Member

I didn't add it but yeah...

@s1monw

Contributor

commented Jun 5, 2015

LGTM

@kimchy

Member

commented Jun 5, 2015

Nice catch @bleskes! LGTM

bleskes added a commit that referenced this pull request Jun 5, 2015

GatewayAllocator: reset rerouting flag after error

Closes #11519

bleskes added a commit that referenced this pull request Jun 5, 2015

GatewayAllocator: reset rerouting flag after error

Closes #11519

@bleskes bleskes closed this in 6aa27a1 Jun 5, 2015

@kevinkluge kevinkluge removed the review label Jun 5, 2015
