-
Notifications
You must be signed in to change notification settings - Fork 24.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify delayed shard allocation #14808
Conversation
@@ -56,9 +56,8 @@ | |||
private final AllocationService allocationService; | |||
|
|||
private AtomicBoolean rerouting = new AtomicBoolean(); | |||
private volatile long registeredNextDelaySetting = Long.MAX_VALUE; | |||
private volatile long registeredNextDelaySetting = Long.MAX_VALUE; // in milliseconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename the setting registeredNextDelayMillis
to make the unit explicit
I like the decoupling in this! One thing that is concerning though, is the APIs in |
@@ -273,13 +273,13 @@ private ClusterHealthResponse clusterHealth(ClusterHealthRequest request, Cluste | |||
} catch (IndexNotFoundException e) { | |||
// one of the specified indices is not there - treat it as RED. | |||
ClusterHealthResponse response = new ClusterHealthResponse(clusterName.value(), Strings.EMPTY_ARRAY, clusterState, | |||
numberOfPendingTasks, numberOfInFlightFetch, UnassignedInfo.getNumberOfDelayedUnassigned(System.currentTimeMillis(), settings, clusterState), | |||
numberOfPendingTasks, numberOfInFlightFetch, UnassignedInfo.getNumberOfDelayedUnassigned(clusterState), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nice side effect :)
Updated PR based on review comments. Naming is hard ;-) |
this.message = in.readOptionalString(); | ||
this.failure = in.readThrowable(); | ||
} | ||
|
||
public void writeTo(StreamOutput out) throws IOException { | ||
out.writeByte((byte) reason.ordinal()); | ||
out.writeLong(timestamp); | ||
out.writeLong(timestampMillis); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment about not serializing the timestamp in nanos
Thanks @ywelsch . left some more comments. |
Pushed another set of changes. |
minDelaySettingAtLastScheduling = minDelaySetting; | ||
TimeValue nextDelay = TimeValue.timeValueNanos(UnassignedInfo.findNextDelayedAllocationIn(event.state())); | ||
assert nextDelay.nanos() > 0 : "next delay must be non 0 as minDelaySetting is [" + minDelaySetting + "]"; | ||
int unassignedDelayedShards = UnassignedInfo.getNumberOfDelayedUnassigned(event.state()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the unassignedDelayedShards anymore, right? now that findSmallestDelayedAllocationSetting only takes delayed shards into account.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking if (unassignedDelayedShards > 0) {
is superfluous, I agree, but I would still leave this method so we can have nice logging information about how many shards are delayed.
unassignedIterator.removeAndIgnore(); | ||
} | ||
// if we didn't manage to find *any* data (regardless of matching sizes), check if the allocation of the replica shard needs to be delayed | ||
changed |= ignoreUnassignedIfDelayed(System.nanoTime(), allocation, unassignedIterator, shard); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we capture System.nanoTime() at the beginning of this method so all shards use the same? it's not broken now, but will make it easier to reason about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to capture System.nanoTime()
at the beginning of the method
Left some minor comments. O.w. LGTM. @dakrone can you take a look as well? |
newCalculatedDelayNanos = 0l; | ||
} else { | ||
assert nanoTimeNow >= unassignedTimeNanos; | ||
newCalculatedDelayNanos = Math.max(0l, (delayTimeoutMillis * 1_000_000l) - (nanoTimeNow - unassignedTimeNanos)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the 1,000,001 instead of 1,000,000 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh hahahah, I can't read, that's an L
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that something like TimeUnit.NANOSECONDS.convert(delayTimeoutMillis, TimeUnit.MILLISECONDS)
might be clearer anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be clearer
Left a minor comment and echoed one of Boaz's minor comments, other than that, LGTM also. Thanks for changing the naming, it's easier to read now. |
thanks @bleskes, @dakrone and @jasontedor for the comments. I pushed once more and will merge this in tomorrow. |
Thx @ywelsch . Can we give it a couple of days under CI and push to 2.2 as well? |
- moves calculation of the delay to a single place (ReplicaShardAllocator) - reduces coupling between GatewayAllocator and RoutingService - in master failover situations, elapsed delay time is forgotten Closes elastic#14808
d35d987
to
2084df8
Compare
Simplify delayed shard allocation
Removing the 2.2 label until this PR is merged into 2.x |
…ry reroute elastic#14808 changed the way we calculate the remaining delay of unassigned shards to make sure that all components use the same basic details for making decision and don't rely on System.currentTimeStamp. The calculation was made whenever the ReplicaShardAllocator couldn't assign a shard. However we did it too late so, for example, if some shard had some in flight store fetch the delay information wasn't updated causing some tests to fail and making reasoning about time left tricky (some shards were updated, some not), causing issues with our reporting. Instead we should update the delay indication with every iteration. For example: if a node left the cluster and an async store fetch was triggered. In that time no shard is marked as delayed (and strictly speaking it's not yet delayed). This caused test for shard delays post node left to fail. see : http://build-us-00.elastic.co/job/es_core_master_windows-2012-r2/2074/testReport/ To fix this, the delay update is now done by the Allocation Service, based of a fixed time stamp that is determined at the beginning of the reroute. Also, this commit fixes a bug where unassigned info instances were reused across shard routings, causing calculated delays to be leaked.
- moves calculation of the delay to a single place (ReplicaShardAllocator) - reduces coupling between GatewayAllocator and RoutingService - in master failover situations, elapsed delay time is forgotten Closes #14808
backported to 2.x |
This PR simplifies delayed shard allocation by moving the calculation of the delay to a single place (ReplicaShardAllocator) and removing the bridge between RoutingService and GatewayAllocator. A consequence of the simplification is that the delay can be slightly less accurate.