Cluster state delay can cause endless index request loop #12573

brwe · 2015-07-31T11:41:34Z

When a primary is relocating from node_1 to node_2, there can be a short time where the old primary is removed from the node already (closed, not deleted) but the new primary is still in POST_RECOVERY. In this state indexing requests might be sent back and forth between node_1 and node_2 endlessly.

Course of events:

1. Primary ([index][0]) relocates from node_1 to node_2.

2. node_2 is done recovering, moves its shard to IndexShardState.POST_RECOVERY and sends a message to master that the shard is ShardRoutingState.STARTED.

   Cluster state 1:
   node_1: [index][0] RELOCATING (ShardRoutingState), (STARTED from IndexShardState perspective on node_1)
   node_2: [index][0] INITIALIZING (ShardRoutingState), (at this point already POST_RECOVERY from IndexShardState perspective on node_2)

3. master receives shard started and updates cluster state to:

   Cluster state 2:
   node_1: [index][0] no shard
   node_2: [index][0] STARTED (ShardRoutingState), (at this point still in POST_RECOVERY from IndexShardState perspective on node_2)

4. master sends this to node_1 and node_2.

5. node_1 receives the new cluster state and removes its shard because it is not allocated on node_1 anymore.

6. Index a document.

At this point node_1 is already on cluster state 2 and does not have the shard anymore so it forwards the request to node_2. But node_2 is behind with cluster state processing, is still on cluster state 1 and therefore has the shard in IndexShardState.POST_RECOVERY and thinks node_1 has the primary. So it will send the request back to node_1. This goes on until either node_2 finally catches up and processes cluster state 2 or both nodes OOM.
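The ping-pong described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Elasticsearch code: the `Node` class, the `route` function and the `max_hops` limit are hypothetical names introduced for illustration, and each node is reduced to a single field recording which node it believes holds the primary according to its local cluster state.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy stand-in for a node; not an Elasticsearch class."""
    name: str
    primary_owner: str  # node that this node's local cluster state names as primary holder


def route(nodes, start, max_hops):
    """Follow an index request until a node accepts it or max_hops is exhausted.

    Returns (path, indexed): the node names visited in order, and whether
    some node considered itself the primary and accepted the request.
    """
    current = nodes[start]
    path = [current.name]
    for _ in range(max_hops):
        if current.primary_owner == current.name:
            return path, True
        # The local view says another node holds the primary, so forward there.
        current = nodes[current.primary_owner]
        path.append(current.name)
    # One last check for the node reached on the final hop.
    return path, current.primary_owner == current.name


# node_1 has applied cluster state 2 (primary on node_2);
# node_2 is still on cluster state 1 (primary on node_1).
nodes = {
    "node_1": Node("node_1", primary_owner="node_2"),
    "node_2": Node("node_2", primary_owner="node_1"),
}
stale_path, stale_indexed = route(nodes, "node_1", max_hops=6)
print(stale_path, stale_indexed)

# Once node_2 applies cluster state 2, it sees itself as the primary holder.
nodes["node_2"].primary_owner = "node_2"
fresh_path, fresh_indexed = route(nodes, "node_1", max_hops=6)
print(fresh_path, fresh_indexed)
```

In the toy model the hop limit is what stops the loop; in the scenario reported here nothing bounds it, so the request keeps bouncing until node_2 processes cluster state 2.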

I will make a pull request with a test shortly.


brwe · 2015-07-31T11:44:17Z

here is a test that reproduces this: #12574

clintongormley · 2016-01-26T17:30:31Z

I think this will be closed by #15900

ywelsch · 2016-01-27T18:26:35Z

I've opened #16274 to address this issue.


martijnvg mentioned this issue Aug 4, 2015

Cleanup TransportReplicationAction #12395

Closed

clintongormley added the :Distributed/Recovery and >bug labels Jan 26, 2016

ywelsch mentioned this issue Jan 27, 2016

Prevent TransportReplicationAction to route request based on stale local routing table #16274

Merged

ywelsch closed this as completed in af1f637 Feb 2, 2016

bleskes added a commit that referenced this issue Apr 7, 2016

Update resliency page

557a3d1

#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.

bleskes mentioned this issue Apr 7, 2016

Update resliency page #17586

Merged

bleskes added a commit that referenced this issue Apr 7, 2016

Update resiliency page (#17586)

8eee28e

#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.

ywelsch mentioned this issue Jun 30, 2016

Nested RemoteTransportExceptions flood the logs and fill the disk #19187

Closed

ywelsch pushed a commit to ywelsch/elasticsearch that referenced this issue Jul 7, 2016

Prevent TransportReplicationAction to route request based on stale local routing table

7f14f4b

Closes elastic#16274 Closes elastic#12573 Closes elastic#12574
