Delayed allocation can miss a reroute #14445
Comments
Let me see if I understand this:
However, assuming shard A is actually assigned, there will be a new cluster change event (a shard started event), which is why this doesn't seem to occur very frequently. I think perhaps we can remove the

    if (event.source().startsWith(CLUSTER_UPDATE_TASK_SOURCE)) {
        // that's us, ignore this event
        return;
    }

from the cluster changed event, what do you think?
After a delayed reroute of a shard, RoutingService fails to schedule a new delayed reroute of other delayed shards. Closes elastic#14494 Closes elastic#14010 Closes elastic#14445
The issue I focus on is described in #14010 and #14011. I reproduced it using an integration test, and I think I found a program flow leading to this issue.

Assume a cluster where the delayed allocation setting is 1 minute. Now a node goes down and a shard becomes unassigned. In `RoutingService.clusterChanged()` (on the master node), `nextDelaySetting` is 1 minute (= smallest delayed allocation setting). The variable `registeredNextDelaySetting` is initially set to `Long.MAX_VALUE`, hence we schedule a new reroute in a minute, and `registeredNextDelaySetting` is set to 1 minute.

After half a minute, a second node goes down, and a shard of the second node becomes unassigned. In this case, `registeredNextDelaySetting` is still 1 minute, and `nextDelaySetting` is also 1 minute, so nothing happens in `RoutingService.clusterChanged()`.

The previously scheduled reroute gets executed at some point. This goes as follows: first, `registeredNextDelaySetting` is set back to `Long.MAX_VALUE`, and then we submit a cluster update task to do the reroute. Let's assume that `routingResult.changed()` yields true (but only because it is set to true in `ReplicaShardAllocator` due to the second shard still being delayed). Now, after the reroute is successfully applied, `InternalClusterService` calls back to `RoutingService.clusterChanged()`. Here we check whether we were the reason for the cluster change event. If so (and that is the case here), we do not schedule a delayed allocation until the next cluster change event. This means that if no new cluster change event happens, the check never gets reevaluated for the remaining delayed shards.
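The flow above can be replayed with a small, self-contained Java sketch. It is a toy model under stated assumptions, not the actual `RoutingService` code: the class `DelayedRerouteSketch`, its methods, the `"node_left"` event source, the value of `CLUSTER_UPDATE_TASK_SOURCE`, and the logical clock in milliseconds are all made up for illustration. Only the names `registeredNextDelaySetting`, `nextDelaySetting`, and the `startsWith` check come from the discussion. The model also assumes that no further cluster change event (such as a shard started event) arrives after the reroute.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of the scheduling flow described in this issue.
// NOT the real Elasticsearch RoutingService; structure and values are assumptions.
class DelayedRerouteSketch {
    // Placeholder value; only the startsWith check matters for the model.
    static final String CLUSTER_UPDATE_TASK_SOURCE = "cluster_reroute";
    static final long DELAY_MILLIS = 60_000L; // delayed allocation setting: 1 minute
    static final long NOT_SCHEDULED = -1L;

    final boolean ignoreOwnEvents;
    // Logical time at which each delayed shard's delay expires.
    final List<Long> delayedShardExpiries = new ArrayList<>();
    long registeredNextDelaySetting = Long.MAX_VALUE;
    long scheduledRerouteAt = NOT_SCHEDULED;

    DelayedRerouteSketch(boolean ignoreOwnEvents) {
        this.ignoreOwnEvents = ignoreOwnEvents;
    }

    // A node leaves: one shard becomes unassigned and delayed.
    void nodeLeft(long now) {
        delayedShardExpiries.add(now + DELAY_MILLIS);
        clusterChanged("node_left", now); // hypothetical event source
    }

    void clusterChanged(String source, long now) {
        if (ignoreOwnEvents && source.startsWith(CLUSTER_UPDATE_TASK_SOURCE)) {
            return; // that's us, ignore this event
        }
        // Smallest delay setting among delayed shards (all share one setting here).
        long nextDelaySetting = delayedShardExpiries.isEmpty() ? 0L : DELAY_MILLIS;
        if (nextDelaySetting > 0 && nextDelaySetting < registeredNextDelaySetting) {
            registeredNextDelaySetting = nextDelaySetting;
            scheduledRerouteAt = Collections.min(delayedShardExpiries);
        }
    }

    // The scheduled reroute fires and reports back through clusterChanged().
    void runScheduledReroute(long now) {
        registeredNextDelaySetting = Long.MAX_VALUE;
        scheduledRerouteAt = NOT_SCHEDULED;
        delayedShardExpiries.removeIf(expiry -> expiry <= now); // expired shards get assigned
        clusterChanged(CLUSTER_UPDATE_TASK_SOURCE + " (delayed)", now);
    }

    // Returns the time of the next scheduled reroute after the first one ran.
    static long simulate(boolean ignoreOwnEvents) {
        DelayedRerouteSketch service = new DelayedRerouteSketch(ignoreOwnEvents);
        service.nodeLeft(0L);      // first node goes down
        service.nodeLeft(30_000L); // second node goes down half a minute later
        service.runScheduledReroute(service.scheduledRerouteAt);
        return service.scheduledRerouteAt;
    }

    public static void main(String[] args) {
        System.out.println("with check:    " + simulate(true));  // -1 (nothing scheduled)
        System.out.println("without check: " + simulate(false)); // 90000
    }
}
```

In this model, keeping the check leaves the second shard without any scheduled reroute, while dropping it schedules a new reroute for when the second shard's delay expires. Whether removing the check is safe in the real service is a separate question that the sketch does not answer.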