Simplify delayed shard allocation #14808

Merged (1 commit) on Nov 19, 2015

Conversation

ywelsch
Contributor

@ywelsch ywelsch commented Nov 17, 2015

This PR simplifies delayed shard allocation by moving the calculation of the delay to a single place (ReplicaShardAllocator) and removing the bridge between RoutingService and GatewayAllocator. A consequence of the simplification is that the delay can be slightly less accurate.

@@ -56,9 +56,8 @@
private final AllocationService allocationService;

private AtomicBoolean rerouting = new AtomicBoolean();
- private volatile long registeredNextDelaySetting = Long.MAX_VALUE;
+ private volatile long registeredNextDelaySetting = Long.MAX_VALUE; // in milliseconds
Member

Let's rename this field to registeredNextDelayMillis to make the unit explicit.

@dakrone
Member

dakrone commented Nov 17, 2015

I like the decoupling in this!

One thing that concerns me, though, is the API in UnassignedInfo: there are two getTimestamp methods. I think we should be as explicit as possible by putting both unit and purpose in the name (i.e., getDelayedTimestampNanos, getUnassignedTimestampMillis). What do you think, @ywelsch?

@@ -273,13 +273,13 @@ private ClusterHealthResponse clusterHealth(ClusterHealthRequest request, Cluste
} catch (IndexNotFoundException e) {
// one of the specified indices is not there - treat it as RED.
ClusterHealthResponse response = new ClusterHealthResponse(clusterName.value(), Strings.EMPTY_ARRAY, clusterState,
- numberOfPendingTasks, numberOfInFlightFetch, UnassignedInfo.getNumberOfDelayedUnassigned(System.currentTimeMillis(), settings, clusterState),
+ numberOfPendingTasks, numberOfInFlightFetch, UnassignedInfo.getNumberOfDelayedUnassigned(clusterState),
Contributor

a nice side effect :)

@ywelsch
Contributor Author

ywelsch commented Nov 18, 2015

Updated PR based on review comments. Naming is hard ;-)

this.message = in.readOptionalString();
this.failure = in.readThrowable();
}

public void writeTo(StreamOutput out) throws IOException {
out.writeByte((byte) reason.ordinal());
- out.writeLong(timestamp);
+ out.writeLong(timestampMillis);
Contributor

Can we add a comment about not serializing the timestamp in nanos?
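
A minimal sketch of why only the millisecond timestamp would go over the wire, using a hypothetical `UnassignedStamp` class (an assumption for illustration, not the actual `UnassignedInfo`). `System.nanoTime()` has an arbitrary origin that is only meaningful within one JVM, so a nano reading taken on another node cannot be compared locally. The sketch writes the wall-clock millis and re-captures nanos on read, which would be consistent with the commit note that elapsed delay time is forgotten on master failover.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for illustration; not the Elasticsearch UnassignedInfo class.
final class UnassignedStamp {
    // Wall-clock time: comparable across nodes (modulo clock skew), safe to serialize.
    private final long unassignedTimeMillis;
    // System.nanoTime() reading: only valid inside the JVM that produced it.
    private final long unassignedTimeNanos;

    UnassignedStamp(long unassignedTimeMillis, long unassignedTimeNanos) {
        this.unassignedTimeMillis = unassignedTimeMillis;
        this.unassignedTimeNanos = unassignedTimeNanos;
    }

    long getUnassignedTimeInMillis() {
        return unassignedTimeMillis;
    }

    long getUnassignedTimeInNanos() {
        return unassignedTimeNanos;
    }

    // Serializes the millis only. The nano timestamp is intentionally NOT
    // written, because its origin is arbitrary per JVM.
    byte[] toBytes() {
        return ByteBuffer.allocate(Long.BYTES).putLong(unassignedTimeMillis).array();
    }

    // Restores the millis and takes a fresh local nanoTime reading, so any
    // delay that elapsed on the sending side is not carried over.
    static UnassignedStamp fromBytes(byte[] bytes) {
        long millis = ByteBuffer.wrap(bytes).getLong();
        return new UnassignedStamp(millis, System.nanoTime());
    }
}
```

Putting both the unit and the purpose in the accessor names, as suggested earlier in the review, makes it harder to mix the two clocks by accident.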

@bleskes
Contributor

bleskes commented Nov 18, 2015

Thanks @ywelsch. Left some more comments.

@ywelsch
Contributor Author

ywelsch commented Nov 18, 2015

Pushed another set of changes.

minDelaySettingAtLastScheduling = minDelaySetting;
TimeValue nextDelay = TimeValue.timeValueNanos(UnassignedInfo.findNextDelayedAllocationIn(event.state()));
assert nextDelay.nanos() > 0 : "next delay must be non 0 as minDelaySetting is [" + minDelaySetting + "]";
int unassignedDelayedShards = UnassignedInfo.getNumberOfDelayedUnassigned(event.state());
Contributor

I don't think we need the unassignedDelayedShards anymore, right? Now that findSmallestDelayedAllocationSetting only takes delayed shards into account.

Contributor Author

Checking if (unassignedDelayedShards > 0) { is superfluous, I agree, but I would still leave this method so we can have nice logging information about how many shards are delayed.
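
To illustrate the trade-off discussed here, a rough sketch with hypothetical helpers (`DelaySummary`, `findNextDelayNanos`, `countDelayed` are assumptions, not the real `UnassignedInfo` methods). It assumes each unassigned shard carries a remaining delay in nanoseconds, with 0 meaning "not delayed". The smallest positive value is what drives scheduling; the count adds nothing to that decision but is cheap to compute for a log line.

```java
// Hypothetical helpers for illustration; not the Elasticsearch implementation.
final class DelaySummary {
    // Smallest positive remaining delay, or 0 if no shard is delayed.
    static long findNextDelayNanos(long[] remainingDelaysNanos) {
        long next = Long.MAX_VALUE;
        for (long delay : remainingDelaysNanos) {
            if (delay > 0 && delay < next) {
                next = delay;
            }
        }
        return next == Long.MAX_VALUE ? 0L : next;
    }

    // Number of shards that are still delayed; used only for logging here.
    static int countDelayed(long[] remainingDelaysNanos) {
        int count = 0;
        for (long delay : remainingDelaysNanos) {
            if (delay > 0) {
                count++;
            }
        }
        return count;
    }
}
```

With this shape, a positive result from `findNextDelayNanos` already implies that `countDelayed` is greater than zero, which is why a separate guard on the count would be redundant.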

unassignedIterator.removeAndIgnore();
}
// if we didn't manage to find *any* data (regardless of matching sizes), check if the allocation of the replica shard needs to be delayed
changed |= ignoreUnassignedIfDelayed(System.nanoTime(), allocation, unassignedIterator, shard);
Contributor

Can we capture System.nanoTime() at the beginning of this method so all shards use the same value? It's not broken now, but it will make it easier to reason about.

Member

+1 to capture System.nanoTime() at the beginning of the method
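
The suggestion above can be sketched as follows. This is an illustration under assumptions, not the actual `ReplicaShardAllocator` code: `DelayedShardScan` and its methods are hypothetical names, and shards are reduced to the nano timestamp at which they became unassigned. The clock is read once and passed down, so every shard in the same pass is judged against the same instant.

```java
// Hypothetical sketch; names and data shapes are assumptions for illustration.
final class DelayedShardScan {
    // Decides, per shard, whether allocation should still be delayed, using a
    // single clock reading supplied by the caller.
    static boolean[] shouldDelay(long nanoTimeNow, long[] unassignedTimeNanos, long delayNanos) {
        boolean[] delayed = new boolean[unassignedTimeNanos.length];
        for (int i = 0; i < unassignedTimeNanos.length; i++) {
            long elapsed = nanoTimeNow - unassignedTimeNanos[i];
            delayed[i] = elapsed < delayNanos;
        }
        return delayed;
    }

    // Entry point: captures System.nanoTime() once, before the loop, instead
    // of calling it again for every shard.
    static boolean[] scan(long[] unassignedTimeNanos, long delayNanos) {
        final long nanoTimeNow = System.nanoTime();
        return shouldDelay(nanoTimeNow, unassignedTimeNanos, delayNanos);
    }
}
```

Passing the timestamp as a parameter also makes the decision deterministic in tests, since a fixed value can be supplied instead of the real clock.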

@bleskes
Contributor

bleskes commented Nov 18, 2015

Left some minor comments. Otherwise LGTM. @dakrone can you take a look as well?

newCalculatedDelayNanos = 0l;
} else {
assert nanoTimeNow >= unassignedTimeNanos;
newCalculatedDelayNanos = Math.max(0l, (delayTimeoutMillis * 1_000_000l) - (nanoTimeNow - unassignedTimeNanos));
Member

Why the 1,000,001 instead of 1,000,000 here?

Member

oh hahahah, I can't read, that's an L

Member

I think that something like TimeUnit.NANOSECONDS.convert(delayTimeoutMillis, TimeUnit.MILLISECONDS) might be clearer anyway?

Member

Yes, that would be clearer
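
A minimal sketch of the remaining-delay calculation with the conversion suggested above. `RemainingDelay` and `remainingDelayNanos` are hypothetical names, and the assumption that a non-positive timeout means "no delay" is mine; only the `TimeUnit.NANOSECONDS.convert(...)` call and the `Math.max(0, ...)` shape come from the thread.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not the Elasticsearch implementation.
final class RemainingDelay {
    static long remainingDelayNanos(long delayTimeoutMillis, long unassignedTimeNanos, long nanoTimeNow) {
        if (delayTimeoutMillis <= 0) {
            // Assumption: a non-positive timeout disables delayed allocation.
            return 0L;
        }
        // Explicit unit conversion instead of multiplying by 1_000_000l, whose
        // lowercase 'l' suffix is easy to misread as the digit 1.
        long delayTimeoutNanos = TimeUnit.NANOSECONDS.convert(delayTimeoutMillis, TimeUnit.MILLISECONDS);
        long elapsedNanos = nanoTimeNow - unassignedTimeNanos;
        // Never report a negative remaining delay once the timeout has passed.
        return Math.max(0L, delayTimeoutNanos - elapsedNanos);
    }
}
```

Besides readability, `TimeUnit.convert` saturates at `Long.MAX_VALUE` rather than overflowing when the millisecond value is very large, which a plain multiplication would not do.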

@dakrone
Member

dakrone commented Nov 18, 2015

Left a minor comment and echoed one of Boaz's minor comments, other than that, LGTM also. Thanks for changing the naming, it's easier to read now.

@ywelsch
Contributor Author

ywelsch commented Nov 18, 2015

thanks @bleskes, @dakrone and @jasontedor for the comments. I pushed once more and will merge this in tomorrow.

@bleskes
Contributor

bleskes commented Nov 19, 2015

Thx @ywelsch. Can we give it a couple of days under CI and push to 2.2 as well?

@ywelsch ywelsch added the v2.2.0 label Nov 19, 2015
- moves calculation of the delay to a single place (ReplicaShardAllocator)
- reduces coupling between GatewayAllocator and RoutingService
- in master failover situations, elapsed delay time is forgotten

Closes elastic#14808
ywelsch pushed a commit that referenced this pull request Nov 19, 2015
@ywelsch ywelsch merged commit 6a2fa73 into elastic:master Nov 19, 2015
@clintongormley

Removing the 2.2 label until this PR is merged into 2.x

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Nov 20, 2015
…ry reroute

elastic#14808 changed the way we calculate the remaining delay of unassigned shards so that all components use the same basic details for making decisions and don't rely on System.currentTimeMillis. The calculation was made whenever the ReplicaShardAllocator couldn't assign a shard. However, we did it too late: for example, if some shard had an in-flight store fetch, the delay information wasn't updated. This caused some tests to fail and made reasoning about the time left tricky (some shards were updated, some not), which in turn caused issues with our reporting. Instead, we should update the delay indication with every iteration.

For example, if a node left the cluster and an async store fetch was triggered, no shard is marked as delayed during that time (and strictly speaking it's not yet delayed). This caused the test for shard delays after a node left to fail. See: http://build-us-00.elastic.co/job/es_core_master_windows-2012-r2/2074/testReport/

To fix this, the delay update is now done by the Allocation Service, based on a fixed timestamp that is determined at the beginning of the reroute.

Also, this commit fixes a bug where unassigned info instances were reused across shard routings, causing calculated delays to be leaked.
ywelsch pushed a commit that referenced this pull request Nov 27, 2015
@ywelsch ywelsch added the v2.2.0 label Nov 27, 2015
@ywelsch
Contributor Author

ywelsch commented Nov 27, 2015

backported to 2.x

@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 14, 2018
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement v2.2.0 v5.0.0-alpha1
6 participants