Simplify delayed shard allocation #14808

ywelsch · 2015-11-17T19:22:04Z

This PR simplifies delayed shard allocation by moving the calculation of the delay to a single place (ReplicaShardAllocator) and removing the bridge between RoutingService and GatewayAllocator. A consequence of the simplification is that the delay can be slightly less accurate.

dakrone · 2015-11-17T21:13:52Z

core/src/main/java/org/elasticsearch/cluster/routing/RoutingService.java

@@ -56,9 +56,8 @@
    private final AllocationService allocationService;

    private AtomicBoolean rerouting = new AtomicBoolean();
-    private volatile long registeredNextDelaySetting = Long.MAX_VALUE;
+    private volatile long registeredNextDelaySetting = Long.MAX_VALUE; // in milliseconds


Let's rename this field to registeredNextDelayMillis to make the unit explicit.

dakrone · 2015-11-17T21:37:04Z

I like the decoupling in this!

One thing that concerns me, though, is the API in UnassignedInfo: there are two getTimestamp methods. I think we should be as explicit as possible by putting both unit and purpose in the name (i.e., getDelayedTimestampNanos, getUnassignedTimestampMillis). What do you think, @ywelsch?

bleskes · 2015-11-18T10:33:27Z

...rc/main/java/org/elasticsearch/action/admin/cluster/health/TransportClusterHealthAction.java

@@ -273,13 +273,13 @@ private ClusterHealthResponse clusterHealth(ClusterHealthRequest request, Cluste
        } catch (IndexNotFoundException e) {
            // one of the specified indices is not there - treat it as RED.
            ClusterHealthResponse response = new ClusterHealthResponse(clusterName.value(), Strings.EMPTY_ARRAY, clusterState,
-                    numberOfPendingTasks, numberOfInFlightFetch, UnassignedInfo.getNumberOfDelayedUnassigned(System.currentTimeMillis(), settings, clusterState),
+                    numberOfPendingTasks, numberOfInFlightFetch, UnassignedInfo.getNumberOfDelayedUnassigned(clusterState),


a nice side effect :)

ywelsch · 2015-11-18T11:25:53Z

Updated PR based on review comments. Naming is hard ;-)

bleskes · 2015-11-18T12:55:36Z

core/src/main/java/org/elasticsearch/cluster/routing/UnassignedInfo.java

        this.message = in.readOptionalString();
        this.failure = in.readThrowable();
    }

    public void writeTo(StreamOutput out) throws IOException {
        out.writeByte((byte) reason.ordinal());
-        out.writeLong(timestamp);
+        out.writeLong(timestampMillis);


Can we add a comment about not serializing the timestamp in nanos?
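A minimal sketch of why only the millisecond timestamp goes over the wire, assuming a simplified stand-in class (UnassignedTimes and its methods are hypothetical, not the PR's code). A System.nanoTime() reading has an arbitrary origin per JVM, so it cannot be compared on another node. The sketch resets it on the receiving side, which is consistent with the commit note that elapsed delay time is forgotten in master failover situations:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical, simplified stand-in for illustration; not an Elasticsearch class.
public final class UnassignedTimes {
    private final long unassignedTimeMillis; // wall-clock time, meaningful across nodes
    private final long unassignedTimeNanos;  // System.nanoTime() reading, only meaningful inside one JVM

    public UnassignedTimes(long unassignedTimeMillis, long unassignedTimeNanos) {
        this.unassignedTimeMillis = unassignedTimeMillis;
        this.unassignedTimeNanos = unassignedTimeNanos;
    }

    public void writeTo(DataOutput out) throws IOException {
        // Deliberately do not serialize the nano timestamp: its origin is
        // arbitrary per JVM, so the receiver could not interpret it.
        out.writeLong(unassignedTimeMillis);
    }

    public static UnassignedTimes readFrom(DataInput in, long localNanoTime) throws IOException {
        // The receiver substitutes its own clock reading, so any delay that
        // had already elapsed on the sender is forgotten.
        return new UnassignedTimes(in.readLong(), localNanoTime);
    }

    public long unassignedTimeMillis() {
        return unassignedTimeMillis;
    }

    public long unassignedTimeNanos() {
        return unassignedTimeNanos;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new UnassignedTimes(System.currentTimeMillis(), System.nanoTime())
                .writeTo(new DataOutputStream(bytes));
        UnassignedTimes copy = readFrom(
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())),
                System.nanoTime());
        System.out.println("bytes on the wire: " + bytes.size()); // prints "bytes on the wire: 8"
        System.out.println("millis survived: " + copy.unassignedTimeMillis());
    }
}
```

Passing the receiver's clock reading in as a parameter, rather than calling System.nanoTime() inside readFrom, is a choice made here only to keep the sketch testable.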

bleskes · 2015-11-18T13:11:33Z

Thanks @ywelsch. Left some more comments.

ywelsch · 2015-11-18T14:58:08Z

Pushed another set of changes.

bleskes · 2015-11-18T16:06:17Z

core/src/main/java/org/elasticsearch/cluster/routing/RoutingService.java

+                minDelaySettingAtLastScheduling = minDelaySetting;
+                TimeValue nextDelay = TimeValue.timeValueNanos(UnassignedInfo.findNextDelayedAllocationIn(event.state()));
+                assert nextDelay.nanos() > 0 : "next delay must be non 0 as minDelaySetting is [" + minDelaySetting + "]";
+                int unassignedDelayedShards = UnassignedInfo.getNumberOfDelayedUnassigned(event.state());


I don't think we need unassignedDelayedShards anymore, right? findSmallestDelayedAllocationSetting now only takes delayed shards into account.

Checking if (unassignedDelayedShards > 0) { is superfluous, I agree, but I would still leave this method so we can have nice logging information about how many shards are delayed.

bleskes · 2015-11-18T16:10:34Z

core/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

-                    unassignedIterator.removeAndIgnore();
-                }
+                // if we didn't manage to find *any* data (regardless of matching sizes), check if the allocation of the replica shard needs to be delayed
+                changed |= ignoreUnassignedIfDelayed(System.nanoTime(), allocation, unassignedIterator, shard);


Can we capture System.nanoTime() at the beginning of this method so all shards use the same value? It's not broken now, but it will make the code easier to reason about.

+1 to capture System.nanoTime() at the beginning of the method
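A minimal sketch of the suggestion, assuming simplified stand-in types (SingleClockReading and Shard are hypothetical, not Elasticsearch classes). The clock is read once and passed down, so every shard in one pass is judged against the same instant:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; these types are simplified stand-ins, not Elasticsearch classes.
public final class SingleClockReading {

    /** Simplified stand-in for an unassigned shard with a configured delay. */
    public static final class Shard {
        public final String name;
        public final long unassignedTimeNanos;
        public final long delayTimeoutNanos;

        public Shard(String name, long unassignedTimeNanos, long delayTimeoutNanos) {
            this.name = name;
            this.unassignedTimeNanos = unassignedTimeNanos;
            this.delayTimeoutNanos = delayTimeoutNanos;
        }
    }

    /** Takes "now" as a parameter instead of reading the clock itself. */
    public static boolean isDelayed(Shard shard, long nanoTimeNow) {
        return nanoTimeNow - shard.unassignedTimeNanos < shard.delayTimeoutNanos;
    }

    /** Returns the names of all shards still delayed at the given instant. */
    public static List<String> delayedShards(List<Shard> shards, long nanoTimeNow) {
        List<String> delayed = new ArrayList<>();
        for (Shard shard : shards) {
            if (isDelayed(shard, nanoTimeNow)) {
                delayed.add(shard.name);
            }
        }
        return delayed;
    }

    public static void main(String[] args) {
        // Read the clock once at the start of the pass and reuse it for every shard.
        long nanoTimeNow = System.nanoTime();
        List<Shard> shards = Arrays.asList(
                new Shard("replica-0", nanoTimeNow - TimeUnit.SECONDS.toNanos(5), TimeUnit.MINUTES.toNanos(1)),
                new Shard("replica-1", nanoTimeNow - TimeUnit.MINUTES.toNanos(2), TimeUnit.MINUTES.toNanos(1)));
        System.out.println(delayedShards(shards, nanoTimeNow)); // prints [replica-0]
    }
}
```

Taking the timestamp as a parameter also makes the decision deterministic in tests, since a fixed value can be supplied.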

bleskes · 2015-11-18T16:15:49Z

Left some minor comments. Otherwise LGTM. @dakrone can you take a look as well?

dakrone · 2015-11-18T16:29:12Z

core/src/main/java/org/elasticsearch/cluster/routing/UnassignedInfo.java

+            newCalculatedDelayNanos = 0l;
+        } else {
+            assert nanoTimeNow >= unassignedTimeNanos;
+            newCalculatedDelayNanos = Math.max(0l, (delayTimeoutMillis * 1_000_000l) - (nanoTimeNow - unassignedTimeNanos));


Why the 1,000,001 instead of 1,000,000 here?

oh hahahah, I can't read, that's an L

I think that something like TimeUnit.NANOSECONDS.convert(delayTimeoutMillis, TimeUnit.MILLISECONDS) might be clearer anyway?

Yes, that would be clearer
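A minimal sketch of the suggested conversion, assuming a standalone helper (DelayMath and remainingDelayNanos are hypothetical names, not from the PR). It follows the arithmetic in the diff above but lets TimeUnit do the millisecond-to-nanosecond conversion, which avoids the hard-to-read 1_000_000l literal:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; names are not part of the PR.
public final class DelayMath {
    private DelayMath() {}

    /**
     * Returns the remaining delay in nanoseconds, clamped at zero once the
     * timeout has fully elapsed.
     */
    public static long remainingDelayNanos(long delayTimeoutMillis, long unassignedTimeNanos, long nanoTimeNow) {
        // Explicit unit conversion instead of multiplying by a literal.
        long delayTimeoutNanos = TimeUnit.NANOSECONDS.convert(delayTimeoutMillis, TimeUnit.MILLISECONDS);
        long elapsedNanos = nanoTimeNow - unassignedTimeNanos;
        return Math.max(0L, delayTimeoutNanos - elapsedNanos);
    }

    public static void main(String[] args) {
        // 60 second timeout, 10 seconds already elapsed.
        long remaining = remainingDelayNanos(60_000L, 0L, TimeUnit.SECONDS.toNanos(10));
        System.out.println(TimeUnit.NANOSECONDS.toSeconds(remaining)); // prints 50
    }
}
```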

dakrone · 2015-11-18T16:37:31Z

Left a minor comment and echoed one of Boaz's minor comments; other than that, LGTM also. Thanks for changing the naming, it's easier to read now.

ywelsch · 2015-11-18T17:41:31Z

thanks @bleskes, @dakrone and @jasontedor for the comments. I pushed once more and will merge this in tomorrow.

bleskes · 2015-11-19T07:32:36Z

Thx @ywelsch. Can we give it a couple of days under CI and push to 2.2 as well?

- moves calculation of the delay to a single place (ReplicaShardAllocator)
- reduces coupling between GatewayAllocator and RoutingService
- in master failover situations, elapsed delay time is forgotten

Closes elastic#14808

Simplify delayed shard allocation

clintongormley · 2015-11-19T12:31:33Z

Removing the 2.2 label until this PR is merged into 2.x

…ry reroute

elastic#14808 changed the way we calculate the remaining delay of unassigned shards to make sure that all components use the same basic details for making decisions and don't rely on System.currentTimeStamp. The calculation was made whenever the ReplicaShardAllocator couldn't assign a shard. However, we did it too late: for example, if a shard had an in-flight store fetch, the delay information wasn't updated. This caused some tests to fail and made reasoning about the time left tricky (some shards were updated, some not), causing issues with our reporting. Instead we should update the delay indication with every iteration.

For example, if a node left the cluster and an async store fetch was triggered, no shard is marked as delayed during that time (and strictly speaking it's not yet delayed). This caused the test for shard delays after a node left to fail. See: http://build-us-00.elastic.co/job/es_core_master_windows-2012-r2/2074/testReport/

To fix this, the delay update is now done by the Allocation Service, based on a fixed timestamp that is determined at the beginning of the reroute. Also, this commit fixes a bug where unassigned info instances were reused across shard routings, causing calculated delays to be leaked.

ywelsch · 2015-11-27T16:34:22Z

backported to 2.x

ywelsch added >enhancement review v5.0.0-alpha1 labels Nov 17, 2015

ywelsch assigned bleskes Nov 17, 2015

dakrone reviewed Nov 17, 2015

bleskes reviewed Nov 18, 2015

dakrone reviewed Nov 18, 2015

ywelsch added the v2.2.0 label Nov 19, 2015

Simplify delayed shard allocation

2084df8

ywelsch force-pushed the refactor/delayed-allocation branch from d35d987 to 2084df8 Compare November 19, 2015 08:57

ywelsch pushed a commit that referenced this pull request Nov 19, 2015

Merge pull request #14808 from ywelsch/refactor/delayed-allocation

6a2fa73

Simplify delayed shard allocation

ywelsch merged commit 6a2fa73 into elastic:master Nov 19, 2015

clintongormley removed the v2.2.0 label Nov 19, 2015

bleskes mentioned this pull request Nov 20, 2015

Make sure the remaining delay of unassigned shard is updated with every reroute #14890

Closed

ywelsch added the v2.2.0 label Nov 27, 2015

s1monw mentioned this pull request May 19, 2016

Add a notice note to README alicegoldfuss/shardnado#1

Closed

lcawl added :Distributed/Distributed and removed :Allocation labels Feb 13, 2018

clintongormley added :Distributed/Allocation and removed :Distributed/Distributed labels Feb 14, 2018
