Prepares allocator decision objects for use with the allocation explain API #21691

abeyad · 2016-11-21T05:45:13Z

This commit enhances the allocator decision result objects (namely,
AllocateUnassignedDecision, MoveDecision, and RebalanceDecision)
to enable them to be used directly by the cluster allocation explain API. In
particular, this commit does the following:

- Adds serialization and toXContent methods to the response objects,
  which will form the explain API responses.
- Moves the calculation of the final explanation to the response
  object itself, removing it from the responsibility of the
  allocators.
- Adds shard store information to the NodeAllocationResult, so that
  store information is available for each node, when explaining a
  shard allocation by the PrimaryShardAllocator or the
  ReplicaShardAllocator.
- Removes RebalanceDecision in favor of using MoveDecision for both
  moving and rebalancing shards.
- Removes NodeRebalanceResult in favor of using NodeAllocationResult.
- Changes the notion of weight ranking to be relative to the current node,
  instead of an absolute weight that doesn't convey any added value to the
  API user and can be confusing.
- Introduces a new enum AllocationDecision to convey the decision type,
  which enables conveying unassigned, moving, and rebalancing scenarios
  with more detail as opposed to just Decision.Type and AllocationStatus.

abeyad · 2016-11-21T14:09:08Z

retest this please

ywelsch

I left a few comments. Will do another iteration tomorrow.

ywelsch · 2016-11-21T15:08:14Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+                explanation = "shard assigned to node [" + assignedNodeId + "]";
+            }
+        } else if (finalDecision == Type.THROTTLE) {
+            explanation = "allocation throttled as each node with an existing copy of the shard is busy with other recoveries";


it's hard to give a good explanation here that fits all cases: I wonder if this should be kept more abstract

ywelsch · 2016-11-21T15:15:38Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+    @Override
+    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
+        builder.field("final_decision", finalDecision.toString());
+        builder.field("final_explanation", getFinalExplanation());


I don't like the word "final" in "final_decision" and "final_explanation". I wonder if we can have something like node-level decisions and cluster-level decisions or come up with better names for these?

ywelsch · 2016-11-21T15:16:19Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+            builder.field("allocation_status", allocationStatus.value());
+        }
+        if (assignedNodeId != null) {
+            builder.field("assigned_node_id", assignedNodeId);


the node name might be useful here as well

ywelsch · 2016-11-21T15:19:56Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

+        }
+        Decision.writeTo(canAllocateDecision, out);
+        out.writeFloat(weight);
+        innerWriteTo(out);


I think it's ok in this case to not have innerWriteTo and have the subclass just call super.writeTo(out);

ywelsch · 2016-11-21T15:22:25Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+        }
+
+        if (nodeDecisions == null && allocationStatus != null) {
+            // use cached version - there are no detailed decisions or explanations


Is it useful to reuse cached instances after serialization? The use case we wanted to cover with CACHED_DECISIONS never includes serialized instances.

abeyad · 2016-11-23T03:48:51Z

@ywelsch I pushed a few commits to address the code review comments and some other things we discussed. Still outstanding is to add a "minimal_mode" JSON output.

ywelsch

Left another round of comments. This is tricky stuff.

ywelsch · 2016-11-23T19:53:58Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+        if (in.readBoolean()) {
+            allocationStatus = AllocationStatus.readFrom(in);
+        } else {
+            allocationStatus = null;


allocationStatus = in.readOptionalWriteable(AllocationStatus::readFrom);
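For readers unfamiliar with the pattern behind this suggestion, here is a minimal sketch of an optional read and write pair: a presence flag followed by the value. It uses plain java.io streams as stand-ins for StreamInput and StreamOutput. The OptionalWireSketch class and its Reader and Writer interfaces are hypothetical names for illustration, not Elasticsearch API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustration only: java.io streams stand in for StreamInput/StreamOutput.
class OptionalWireSketch {

    /** Hypothetical reader callback, similar in shape to a Writeable reader. */
    interface Reader<T> {
        T read(DataInputStream in) throws IOException;
    }

    /** Hypothetical writer callback. */
    interface Writer<T> {
        void write(DataOutputStream out, T value) throws IOException;
    }

    // Writes a presence flag, then the value only when it is non-null.
    static <T> void writeOptional(DataOutputStream out, T value, Writer<T> writer) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            writer.write(out, value);
        }
    }

    // Reads the presence flag and returns null when no value follows it.
    static <T> T readOptional(DataInputStream in, Reader<T> reader) throws IOException {
        return in.readBoolean() ? reader.read(in) : null;
    }

    // Round-trips a possibly-null string through a byte array.
    static String roundTrip(String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                writeOptional(out, value, (o, v) -> o.writeUTF(v));
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readOptional(in, i -> i.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The one-line form in the comment folds the if/else from the diff into the read helper, so the caller no longer handles the flag itself.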

ywelsch · 2016-11-23T19:57:34Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

-            return new AllocateUnassignedDecision(Type.NO, AllocationStatus.DECIDERS_NO, explanation, null, null, null, shardDecision);
+    public AllocateUnassignedDecision(StreamInput in) throws IOException {
+        if (in.readBoolean()) {
+            decision = Type.readFrom(in);


decision = in.readOptionalWriteable(Type::readFrom); and make Type implement Writeable (converting static writeTo method to instance method)

ywelsch · 2016-11-23T20:00:13Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+    /**
+     * Returns the explanation behind the {@link #getDecision()} that is returned for this decision.
+     */
+    public String getFinalExplanation() {


maybe just "getExplanation"?

ywelsch · 2016-11-23T20:03:41Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+        if (assignedNodeId != null) {
+            builder.startObject("assigned_node");
+            builder.field("id", assignedNodeId);
+            builder.field("name", assignedNodeName);


I wonder if we should store the full DiscoveryNode so that we have more freedom to display whatever we like here.

I was debating that myself. I am fine with storing the full DiscoveryNode

ywelsch · 2016-11-23T20:04:53Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+            builder.field("allocation_id", allocationId);
+        }
+        if (allocationStatus == AllocationStatus.DELAYED_ALLOCATION) {
+            builder.field("remaining_delay_in_millis", remainingDelayInMillis);


the previous explain API also had information about how long the total delay was (i.e. configured delay in index settings).

ywelsch · 2016-11-23T21:25:45Z

core/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

+        Tuple<Decision, Map<String, NodeAllocationResult>> result = canBeAllocatedToAtLeastOneNode(unassignedShard, allocation);
+        Decision allocateDecision = result.v1();
+        if (allocateDecision.type() != Decision.Type.YES) {
+            // only return early if we are not in explain mode, because if we are in explain mode,


I don't understand this explanation. we return early here, even in explain mode?

The issue is with fetching. Suppose canBeAllocatedToAtLeastOneNode returns a non-YES decision. If this has been the case since the ES instance started, then we will never have fetched shard data (b/c canBeAllocatedToAtLeastOneNode kept us from doing so). Now suppose we execute an explain request. If we didn't abort early here, then the explain call would've triggered a shard fetch, and the response we would get from the explain API is "pending shard fetch".

I don't believe this is accurate, as the explain API would not be reflecting reality (i.e. it's telling us it's pending shard fetching when that's not the reality of allocation; the reality is that allocation never bothered trying to fetch shard data due to a NO or THROTTLE decision from canBeAllocatedToAtLeastOneNode). I don't think the explain API should be the cause of a shard fetch for either the PrimaryShardAllocator or the ReplicaShardAllocator.

That was the reasoning here for aborting early, even in explain mode. One way to avoid the explain API triggering a shard fetch and still return node decisions even if canBeAllocatedToAtLeastOneNode returns NO is to add a method to the ReplicaShardAllocator to check if a shard data fetch has ever been triggered. I tried this strategy in this commit: aa31f4c72bd4f1b0febab835b73542c00364db32. Let me know what you think.
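The idea of checking whether a fetch was ever triggered can be sketched roughly as follows. This is not the code from the referenced commit: FetchTrackingSketch and all of its method names are hypothetical, and shards are identified by plain strings to keep the example self-contained.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: a hypothetical tracker, not the actual allocator code.
class FetchTrackingSketch {
    // Shards for which the allocator itself has started a store fetch.
    private final Set<String> fetchInitiated = new HashSet<>();

    // Called on the real allocation path only.
    void startFetch(String shardId) {
        fetchInitiated.add(shardId);
    }

    // Read-only check that the explain path can call without side effects.
    boolean hasInitiatedFetching(String shardId) {
        return fetchInitiated.contains(shardId);
    }

    // Explain reports store details only if allocation already fetched them.
    String explain(String shardId) {
        if (hasInitiatedFetching(shardId)) {
            return "node decisions with shard store info";
        }
        return "no shard store fetch was attempted by allocation";
    }
}
```

The point of the design is that explain only reads state that allocation has already produced, so an explain request never changes what the allocator would do next.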

ywelsch · 2016-11-23T21:27:32Z

core/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

        }

        ShardRouting primaryShard = routingNodes.activePrimary(unassignedShard.shardId());
-        assert primaryShard != null : "the replica shard can be allocated on at least one node, so there must be an active primary";
+        if (primaryShard == null) {
+            assert explain : "primary should only be null here if we are in explain mode, so we didn't " +


does that still hold true?

ywelsch · 2016-11-23T21:32:20Z

core/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

-            if (decision.type() == Decision.Type.YES) {
-                return Tuple.tuple(decision, null);
+            if (decision.type() == Decision.Type.YES && madeDecision.type() != Decision.Type.YES) {
+                if (allocation.debugDecision()) {


if (explain)

ywelsch · 2016-11-23T21:40:02Z

core/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java

+                } else {
+                    shardStore = new ShardStore(StoreStatus.UNKNOWN, matchingBytes);
+                }
+                nodeDecisions.put(node.nodeId(), new NodeAllocationResult(discoNode, shardStore, decision));
            }

            if (decision.type() == Decision.Type.NO) {
                continue;


why not calculate the nodesToSize when in explain mode?

Not sure I follow? nodesToSize is used for the actual decision, to store the number of matching bytes for nodes that canAllocate returns a YES/THROTTLE decision for. The nodeDecisions are populated regardless of whether they end up in nodesToSize, b/c we want the shard store status for every node, even if canAllocate returns a NO decision.

ywelsch · 2016-11-23T21:40:40Z

core/src/main/java/org/elasticsearch/search/aggregations/bucket/BestDocsDeferringCollector.java

@@ -215,7 +215,7 @@ public int getDocCount() {
        PerSegmentCollects(LeafReaderContext readerContext) throws IOException {
            // The publisher behaviour for Reader/Scorer listeners triggers a
            // call to this constructor with a null scorer so we can't call
-            // scorer.getWeight() and pass the Weight to our base class.
+            // scorer.getWeightRanking() and pass the Weight to our base class.


abeyad · 2016-11-25T21:41:19Z

@ywelsch I pushed some more commits that address the latest round of feedback, and I've left comments/questions regarding outstanding issues. It's ready for another look. I'm working on a minimal_mode JSON output which may be better served in a separate PR after this one.

ywelsch

I've left another round of comments. The three main things to address are the sorting of results, whether we should drop the StoreReadability enum, and what the output of NodeRebalanceResult should be.

ywelsch · 2016-11-28T11:51:36Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+                    explanation = "cannot allocate because a previous copy of the shard existed, but could not be found";
+                }
+            } else if (allocationStatus == AllocationStatus.DELAYED_ALLOCATION) {
+                explanation = "cannot allocate because the cluster is waiting " + Long.toString(remainingDelayInMillis / 1000L) +


maybe it's simpler to use TimeValue's toString() to give nice output here.

ywelsch · 2016-11-28T11:53:49Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+        }
+        if (assignedNode != null) {
+            builder.startObject("assigned_node");
+            builder.field("id", assignedNode.getId());


should we put the whole DiscoveryNode here?

Do you mean a single field assigned_node with the toString() of the DiscoveryNode? Or calling DiscoveryNode#toXContent? I feel it may be too verbose and provide unnecessary information, especially when any of that extra information is easily obtained via /_cat/nodes or getting the cluster state. But if you feel strongly that we should put the full object there, I don't mind.

ywelsch · 2016-11-28T11:57:37Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+            builder.startObject("node_decisions");
+            {
+                List<String> nodeIds = new ArrayList<>(nodeDecisions.keySet());
+                Collections.sort(nodeIds);


why sort by node id? Should we sort by other properties? e.g. yes before throttle before no and also take the weights into account as secondary sort criterion.
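The ordering suggested here could look roughly like the comparator below. It is a sketch under assumptions: DecisionType and NodeResult are hypothetical stand-ins for Decision.Type and NodeAllocationResult, a lower weight ranking is assumed to be better, and node id is used only as a final tie-breaker.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: stand-in types, not the real allocation classes.
class NodeDecisionSortSketch {

    // Declared in the desired display order: yes, then throttle, then no.
    enum DecisionType { YES, THROTTLE, NO }

    static final class NodeResult {
        final String nodeId;
        final DecisionType type;
        final int weightRanking;

        NodeResult(String nodeId, DecisionType type, int weightRanking) {
            this.nodeId = nodeId;
            this.type = type;
            this.weightRanking = weightRanking;
        }
    }

    // Decision type first, weight ranking second, node id as a stable tie-breaker.
    static final Comparator<NodeResult> DISPLAY_ORDER =
        Comparator.comparing((NodeResult r) -> r.type)
            .thenComparingInt(r -> r.weightRanking)
            .thenComparing(r -> r.nodeId);

    // Returns node ids in display order without mutating the input list.
    static List<String> sortedNodeIds(List<NodeResult> results) {
        List<NodeResult> copy = new ArrayList<>(results);
        copy.sort(DISPLAY_ORDER);
        List<String> ids = new ArrayList<>();
        for (NodeResult result : copy) {
            ids.add(result.nodeId);
        }
        return ids;
    }
}
```

Sorting on the enum directly depends on its declaration order, which is why that order has to be pinned by a test if it is used this way.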

ywelsch · 2016-11-28T12:01:34Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

-    public float getWeight() {
-        return weight;
+    public boolean isWeightRanked() {
+        return weightRanking != -1;


we could use 0 instead of -1 for this and start the ranking with 1,2,3,...

ywelsch · 2016-11-28T12:07:57Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

+                    builder.field("matching_bytes", new ByteSizeValue(matchingBytes).toString());
+                }
+                if (storeException != null) {
+                    builder.field("store_exception", ExceptionsHelper.detailedMessage(storeException));


ElasticsearchException.toXContent()?

thanks, i didn't know about that!

ywelsch · 2016-11-28T12:33:48Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/RelocationDecision.java

+        }
+        builder.field("final_explanation", getExplanation());
+        if (assignedNode != null) {
+            builder.startObject("assigned_node");


put full node here?

ywelsch · 2016-11-28T12:37:10Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

@@ -371,10 +363,11 @@ private RebalanceDecision decideRebalance(final ShardRouting shard) {
            final String idxName = shard.getIndexName();
            Map<String, NodeRebalanceResult> nodeDecisions = new HashMap<>(modelNodes.length - 1);
            Type rebalanceDecisionType = Type.NO;
-            String assignedNodeId = null;
+            DiscoveryNode assignedNode = null;


maybe just store ModelNode here and later resolve when passing it to RebalanceDecision.

ywelsch · 2016-11-28T12:38:20Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

                    rebalanceConditionsMet ? canAllocate.type() : Type.NO,
                    canAllocate,
+                    ++weightRanking,


ok, let's leave it like this for now.

ywelsch · 2016-11-28T12:48:05Z

core/src/main/java/org/elasticsearch/gateway/PrimaryShardAllocator.java

+        if (storeErr != null) {
+            Throwable unwrapped = ExceptionsHelper.unwrapCause(storeErr);
+            if (unwrapped instanceof CorruptIndexException) {
+                storeReadability = StoreReadability.CORRUPT;


I wonder if we need an enum for this as we already have the exception at hand in ShardStore.

I agree with you on this and thought the same when separating these two enums out. If the storeException is non-null, we already have details on the exception and don't need any further confirmation on whether it was "corrupt" or "shard-lock". The exception tells us that and more. If the storeException is null, we know we don't have any reading errors on the shard. I am happy to remove this enum.

ywelsch · 2016-11-28T12:48:36Z

core/src/main/java/org/elasticsearch/gateway/PrimaryShardAllocator.java

+            // The ids are only empty if dealing with a legacy index
+            storeStatus = StoreStatus.UNKNOWN;
+        } else if (nodeShardState.allocationId() != null && inSyncAllocationIds.contains(nodeShardState.allocationId())) {
+            storeStatus = StoreStatus.CURRENT;


instead of CURRENT we could use IN_SYNC?

abeyad · 2016-11-29T02:44:55Z

@ywelsch I pushed 8343e227a95787cc97894344fe90a79e6d6e9875 to address the latest round.

ywelsch

I've left another round of comments.

ywelsch · 2016-11-30T10:39:26Z

core/src/main/java/org/elasticsearch/cluster/node/DiscoveryNode.java

+     * A toXContent implementation that leaves off some of the non-critical fields, and assumes the outer object
+     * is created outside of this method call.
+     */
+    public XContentBuilder toXContentLight(XContentBuilder builder, Params params) throws IOException {


I think this method should be moved to the allocation explain code somewhere. It is a bit too specific for this class.

ywelsch · 2016-11-30T10:41:10Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+            builder.field("allocation_id", allocationId);
+        }
+        if (allocationStatus == AllocationStatus.DELAYED_ALLOCATION) {
+            builder.field("remaining_delay", TimeValue.timeValueSeconds(remainingDelayInMillis / 1000L).toString());


use timeValueMillis?

ywelsch · 2016-11-30T10:41:20Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+        }
+        if (allocationStatus == AllocationStatus.DELAYED_ALLOCATION) {
+            builder.field("remaining_delay", TimeValue.timeValueSeconds(remainingDelayInMillis / 1000L).toString());
+            builder.field("total_delay", TimeValue.timeValueSeconds(totalDelayInMillis / 1000L).toString());


use timeValueMillis?
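The reason for preferring a millisecond-based value is that dividing by 1000 first truncates the delay. The small sketch below shows the arithmetic only; it does not use the TimeValue class, and DelayRenderingSketch and its method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Illustration only: shows what is lost by converting millis to whole seconds.
class DelayRenderingSketch {

    // Mirrors the reviewed code: integer division drops the sub-second part.
    static long millisKeptViaSeconds(long delayInMillis) {
        long wholeSeconds = delayInMillis / 1000L;
        return TimeUnit.SECONDS.toMillis(wholeSeconds);
    }

    // Number of milliseconds lost by going through whole seconds.
    static long millisLost(long delayInMillis) {
        return delayInMillis - millisKeptViaSeconds(delayInMillis);
    }
}
```

With a remaining delay of 1500 ms the seconds-based path keeps 1000 ms, and with 999 ms it keeps nothing, so a delay that is about to expire would be reported as zero.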

ywelsch · 2016-11-30T10:57:06Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+                });
+                List<String> nodeIds = new ArrayList<>(nodeDecisions.keySet());
+                Collections.sort(nodeIds);
+                for (String nodeId : nodeIds) {


shouldn't you iterate over the entries here?

ywelsch · 2016-11-30T10:58:16Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

+     * deciders ran.  For example, suppose there are 3 nodes which the allocation deciders
+     * decided upon: node1, node2, and node3.  If node2 had the best weight for holding the
+     * shard, followed by node3, followed by node1, then node2's weight will be 1, node3's
+     * weight will be 2, and node1's weight will be 1.  A value of -1 means the weight was


update javadoc. 0 is the new -1
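To make the ranking convention concrete, here is a small sketch that assigns rankings starting at 1 and reserves 0 for "not ranked". It is only an illustration: WeightRankingSketch is a hypothetical class, a lower weight is assumed to be better, and ties are broken by input position, which may differ from what the allocator does.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustration only: ranking by weight with 0 reserved for "not ranked".
class WeightRankingSketch {
    static final int UNRANKED = 0;

    static boolean isWeightRanked(int ranking) {
        return ranking != UNRANKED;
    }

    // Returns ranks[i], the 1-based position of weights[i] in ascending weight order.
    static int[] rank(float[] weights) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Stable sort, so equal weights keep their input order.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> weights[i]));
        int[] ranks = new int[weights.length];
        for (int position = 0; position < order.length; position++) {
            ranks[order[position]] = position + 1;
        }
        return ranks;
    }
}
```

For three nodes where the second has the best weight, followed by the third and then the first, the rankings come out as 3, 1, and 2 respectively.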

ywelsch · 2016-11-30T12:58:35Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

+        // The copy matches sync ids with the primary
+        MATCHING_SYNC_ID((byte) 2),
+        // It's unknown what the copy of the data is
+        UNKNOWN((byte) 3);


should we have an enum value for "no data found"?

if a node had no data, it would never be considered by the PrimaryShardAllocator nor the ReplicaShardAllocator, so that node would not appear in the node decisions map. Do you think we should add all nodes to the map even if the Primary or Replica ShardsAllocator took the decision, and for those nodes that had no copy, just have a NO decision with ShardStatus = NO_DATA_FOUND? I thought that would be overkill, but can add it if you think it will make it more clear that all the other nodes just didn't have any copies of the data, so weren't in consideration.

let's leave it like this for now and revisit when we provide the API endpoint.

ywelsch · 2016-11-30T13:02:27Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeRebalanceResult.java

+    protected void innerToXContent(XContentBuilder builder, ToXContent.Params params) throws IOException {
+        builder.field("delta_above_threshold", deltaAboveThreshold);
+        if (deltaAboveThreshold) {
+            builder.field("better_weight_with_shard_added", betterWeightWithShardAdded);


When is the weight delta above threshold but weight not better? I don't understand this conditional output.

Let's take two nodes, A and B, and we are trying to balance shard S0 belonging to index idx which currently resides on node A. Let's say the current weight of node A with respect to idx is 25 and the current weight of node B with respect to idx is 10. The current weight delta between the two nodes is 15. Let's assume the configured threshold for prompting rebalances is 10, so node B will have deltaAboveThreshold=true with respect to shard S0 on node A.

Now, we simulate moving S0 from A to B. Let's say the new weight of A with respect to idx is now 10 and the new weight of B with respect to idx is 26. The new delta between the two nodes with respect to idx is now 16 which is greater than the previous delta of 15, so better_balance would be false, even though deltaAboveThreshold is true.
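The two checks in this example can be written out as plain arithmetic. The sketch below reproduces only the numbers above; RebalanceDeltaSketch and its method names are hypothetical, and the real balancer's weight function is not modeled.

```java
// Illustration only: the two conditions from the worked example.
class RebalanceDeltaSketch {

    // True when the current weight difference between the two nodes exceeds the threshold.
    static boolean deltaAboveThreshold(float currentNodeWeight, float targetNodeWeight, float threshold) {
        return Math.abs(currentNodeWeight - targetNodeWeight) > threshold;
    }

    // True when the simulated move shrinks the weight difference between the two nodes.
    static boolean betterBalance(float currentBefore, float targetBefore,
                                 float currentAfter, float targetAfter) {
        float deltaBefore = Math.abs(currentBefore - targetBefore);
        float deltaAfter = Math.abs(currentAfter - targetAfter);
        return deltaAfter < deltaBefore;
    }
}
```

With A=25, B=10, and a threshold of 10, the first check passes because 15 > 10. After the simulated move (A=10, B=26) the delta grows from 15 to 16, so the second check fails.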

ywelsch · 2016-11-30T13:04:27Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeRebalanceResult.java

+    protected void innerToXContent(XContentBuilder builder, ToXContent.Params params) throws IOException {
+        builder.field("delta_above_threshold", deltaAboveThreshold);
+        if (deltaAboveThreshold) {
+            builder.field("better_weight_with_shard_added", betterWeightWithShardAdded);


better_weight_with_shard_added reads so weird. Wouldn't better_weight be more accurate? Or maybe just better_balance?

better_balance sounds good to me. I liked better_weight_with_shard_added because it essentially told us exactly the computation that occurred (simulating moving the shard and checking the weights), but that's probably meaningless to the user and they just want to know if the balance is better by moving the shard.

ywelsch · 2016-11-30T13:10:58Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/RebalanceDecision.java

-    public RebalanceDecision(Decision canRebalanceDecision, Type finalDecision, String finalExplanation) {
-        this(canRebalanceDecision, finalDecision, finalExplanation, null, null, Float.POSITIVE_INFINITY);
-    }
+    private final float currentWeight;


do we still need the current weight if there is no other weight to compare it to?

that's true, it's not needed anymore as the node results no longer maintain their weights as a comparison point (just using weight ranking now) so I'll remove this

ywelsch · 2016-11-30T13:18:44Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/Decision.java

+
+        @Override
+        public void writeTo(StreamOutput out) throws IOException {
+            out.writeBoolean(true); // flag indicating it is a multi decision


maybe we can use the NamedWriteable infrastructure?

We could, but I didn't want to break the wire protocol for Decision (it's the same reason I didn't reorder the ids in the Type enum which would've made the comparing easier). The Decision class (and therefore, Type as well) can be serialized over the wire protocol in case of running a reroute command with explanations turned on. RoutingExplanations relies on serializing the Decision class. I don't think the benefit outweighs breaking the wire protocol here, even though it may be a rare case. WDYT?

abeyad · 2016-12-01T04:09:27Z

@ywelsch I pushed e0cb904d986614432acd6465d3c61bb370130ddc and 44ac196883b6a4bbff8e547bdd65e1aac46339c3 to address the latest feedback.

ywelsch · 2016-12-01T14:20:05Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+     * A toXContent implementation that leaves off some of the non-critical fields, and assumes the outer object
+     * is created outside of this method call.
+     */
+    public static XContentBuilder discoveryNodeToXContent(XContentBuilder builder, Params params, DiscoveryNode node) throws IOException {


super nitty, but DiscoveryNode should be the first parameter of this method, so it reads more nicely.

ywelsch · 2016-12-01T14:26:54Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

-        }
-        this.nodeDecisions = (nodeDecisions != null) ? Collections.unmodifiableMap(nodeDecisions) : null;
-
+        nodeDecisions = in.readBoolean() ? sortNodeDecisions(in.readMap(StreamInput::readString, NodeAllocationResult::new)) : null;


is sorting needed when we read the map? Can we assume it's sorted already and use a LinkedHashMap? This could be done by generalizing readMap so that it also takes a Map supplier.

I pushed 42aad8bf40c41cef68a689907bc5a6a07033fe7a which adds a readMap method to StreamInput that takes a map supplier. I used the same code instead of having the older readMap method call into readMap with a hash map supplier because of the ability of the old readMap method to use the size param to properly size the map from the start. In practice, this is likely not going to cost us much, so I'm also fine with having just one method that takes a map supplier and is not able to size the map from the beginning. WDYT?
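The map-supplier idea can be sketched as follows, with java.io streams standing in for StreamInput and StreamOutput. OrderedMapWireSketch and its methods are hypothetical; the real readMap also takes key and value readers, which are hard-coded here for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Illustration only: a map reader that lets the caller pick the map implementation.
class OrderedMapWireSketch {

    // Writes the size, then each entry in the map's iteration order.
    static void writeMap(DataOutputStream out, Map<String, Integer> map) throws IOException {
        out.writeInt(map.size());
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeInt(entry.getValue());
        }
    }

    // Reads entries into whatever map the supplier creates; it cannot pre-size the map.
    static Map<String, Integer> readMap(DataInputStream in, Supplier<Map<String, Integer>> mapSupplier)
            throws IOException {
        int size = in.readInt();
        Map<String, Integer> map = mapSupplier.get();
        for (int i = 0; i < size; i++) {
            String key = in.readUTF();
            map.put(key, in.readInt());
        }
        return map;
    }

    // Round-trips a map and returns the key order seen on the reading side.
    static List<String> roundTripKeys(Map<String, Integer> alreadySorted) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                writeMap(out, alreadySorted);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return new ArrayList<>(readMap(in, LinkedHashMap::new).keySet());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because entries are written in iteration order and a LinkedHashMap preserves insertion order, a map that was sorted before serialization does not need to be sorted again after reading.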

ywelsch · 2016-12-01T14:38:14Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/DecisionUtils.java

+/**
+ * A class that holds utility methods for serializing and working with allocation decisions.
+ */
+public class DecisionUtils {


I wonder if we need a class like this (the stuff it provides feels a bit arbitrary). The comparator, sortNodeDecisions and nodeDecisionsToXContent could go into the NodeAllocationResult class. This leaves discoveryNodeToXContent which could go into an AbstractAllocationDecision class that's a superclass for all decision classes?
We could also have NodeAllocationResult implement Comparable.

I contemplated this. Given our discussion on the node rebalancing output, I realized that NodeRebalanceResult is not needed - all the info is in NodeAllocationResult. Likewise, after iterations on this PR, I realized that an AbstractAllocationDecision can encompass a few common elements between all the decisions, so I've done quite a bit of refactoring there. I also made NodeAllocationResult implement Comparable. I implemented all this in 553eb77677fe696fb7d6b314bfed2fe4392a727b. WDYT?

ywelsch · 2016-12-01T14:40:50Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/Decision.java

        YES(1),
-        THROTTLE(2);
+        THROTTLE(2),


now you rely on the natural order? You should then also have a test for it so that it does not get reordered by anyone else.
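A test of the kind requested here might pin the declaration order roughly as follows. The enum below is a stand-in, not the real Decision.Type: the YES, THROTTLE, NO declaration order with wire ids 1, 2, 0 is an assumption made for illustration, based on the remark elsewhere in this thread that the ids were not reordered.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: pins both the declaration order and the wire ids of a stand-in enum.
class DecisionTypeOrderSketch {

    // Assumed layout: declaration order drives compareTo, ids drive serialization.
    enum Type {
        YES(1),
        THROTTLE(2),
        NO(0);

        final int id;

        Type(int id) {
            this.id = id;
        }
    }

    // Fails loudly if someone reorders the constants or changes their ids.
    static void checkOrderIsPinned() {
        List<Type> expected = Arrays.asList(Type.YES, Type.THROTTLE, Type.NO);
        if (!expected.equals(Arrays.asList(Type.values()))) {
            throw new AssertionError("declaration order changed: " + Arrays.toString(Type.values()));
        }
        if (Type.YES.id != 1 || Type.THROTTLE.id != 2 || Type.NO.id != 0) {
            throw new AssertionError("wire ids changed");
        }
    }
}
```

Keeping the two concerns in one check makes it clear that sorting relies on declaration order while the wire format relies on the ids.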

Done in 1f304d706afd001ebed72ceac57684a4be884d4a

ywelsch · 2016-12-01T14:49:32Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

+        // The copy matches sync ids with the primary
+        MATCHING_SYNC_ID((byte) 2),
+        // It's unknown what the copy of the data is
+        UNKNOWN((byte) 3);


let's leave it like this for now and revisit when we provide the API endpoint.

ywelsch · 2016-12-01T14:56:01Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeRebalanceResult.java

-        return weightWithShardAdded;
+    @Override
+    protected void innerToXContent(XContentBuilder builder, ToXContent.Params params) throws IOException {
+        builder.field("delta_above_threshold", deltaAboveThreshold);


I'm still a bit on the fence of exposing this. I'll reach out to discuss.

abeyad · 2016-12-02T04:14:51Z

I pushed 16896e726291e9ad6c95eb6012e805440e3b061d to use relative weight ranking in the rebalance decision.

abeyad · 2016-12-02T04:16:13Z

@ywelsch I pushed commits that did some refactoring based on your review and our discussion.

ywelsch · 2016-12-02T16:57:08Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AbstractAllocationDecision.java

+     * Generates X-Content for a {@link DiscoveryNode} that leaves off some of the non-critical fields,
+     * and assumes the outer object is created outside of this method call.
+     */
+    public static XContentBuilder discoveryNodeToXContent(XContentBuilder builder, ToXContent.Params params, DiscoveryNode node)


DiscoveryNode should be the first parameter of this method, so it reads more nicely.

ywelsch · 2016-12-02T17:00:55Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/MoveDecision.java

        super.toXContent(builder, params);
        builder.startObject("can_remain_decision");
        {
            builder.field("decision", canRemainDecision.type().toString());
            canRemainDecision.toXContent(builder, params);
        }
        builder.endObject();
-        nodeDecisionsToXContent(builder, params, nodeDecisions);


this reorders the output. I think that the StreamOutput/StreamInput can call super() but that the toXContent should be completely done by the subclass so that we can render output in the optimal way. I would for example prefer to see the can_remain_decision before the node decisions where to allocate.

innerToXContent is called before nodeDecisionsToXContent in AbstractAllocationDecision, so the can_remain_decision comes before the node decisions. I was basically able to do it such that the order was kept the same and sensible throughout by having the innerToXContent so I abstracted it out in this way. Do you mean you'd rather see can_remain_decision before assigned_node?

yeah, that was what I meant

ywelsch · 2016-12-02T17:05:38Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

@@ -83,6 +83,8 @@ public NodeAllocationResult(StreamInput in) throws IOException {
        weightRanking = in.readVInt();
    }

+    @Override public String toString() { return weightRanking+""; }


Integer.toString(int i). Also, give this a few newlines.

this was left over from testing and I forgot to remove it; it's removed now

ywelsch · 2016-12-02T17:07:36Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/RebalanceDecision.java

@@ -111,7 +123,7 @@ public String getExplanation() {

    @Override
    public void innerToXContent(XContentBuilder builder, Params params) throws IOException {
-        super.toXContent(builder, params);
+        builder.field("current_node_ranking", currentNodeRanking);


I wonder if we should just call this weight_ranking and put it under the toXContent for the currently assigned node.

we don't currently output the current node to which the shard is assigned. would you like to add this? this would only apply to RebalanceDecision and MoveDecision, and only the RebalanceDecision would have a "weight_ranking" on its current node. This would be a more compelling reason to have each individual decision implement its own toXContent.

sounds good

abeyad · 2016-12-02T20:09:34Z

@ywelsch I pushed commits to improve the ranking algorithm code and address the code review comments

abeyad · 2016-12-02T21:52:23Z

retest this please

ywelsch · 2016-12-02T21:38:31Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

-                // the current node is the worst ranked, so its ranking never got assigned above
-                currentNodeWeightRanking = weightRanking + 1;
+            int weightRanking = 0;
+            Map<String, NodeAllocationResult> nodeDecisions = new HashMap<>(modelNodes.length - 1);


do we need a map here? As we sort anyway in the RebalanceDecision constructor, is it good enough to just pass a list of NodeRebalanceResult to it?

ywelsch · 2016-12-02T21:42:29Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+        builder.field("explanation", getExplanation());
+        if (getAssignedNode() != null) {
+            builder.startObject("assigned_node");
+            discoveryNodeToXContent(getAssignedNode(), builder, params);


I prefer to make the fields protected and access them in that way.

ywelsch · 2016-12-02T21:51:33Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/RebalanceDecision.java

+        builder.field("decision", getDecisionType().toString());
+        builder.field("explanation", getExplanation());
+        if (getAssignedNode() != null) {
+            builder.startObject("assigned_node");


I wonder if we should name this target_node?

ywelsch · 2016-12-02T21:52:42Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/RebalanceDecision.java

@@ -111,7 +123,7 @@ public String getExplanation() {

    @Override
    public void innerToXContent(XContentBuilder builder, Params params) throws IOException {
-        super.toXContent(builder, params);
+        builder.field("current_node_ranking", currentNodeRanking);


sounds good

abeyad · 2016-12-02T22:58:49Z

@ywelsch I've addressed the latest round



abeyad · 2016-12-03T03:08:39Z

retest this please


abeyad · 2016-12-03T17:36:04Z

@ywelsch As I was reworking the rest layer, I realized we will want to know the canRemain decision even in the RebalanceDecision step. When adding it, I also realized there are lots of commonalities between MoveDecision and RebalanceDecision, and I don't feel we even need a separate class for expressing "rebalancing". It's simply a move decision where the move was either forced (depending on the canRemain decision) or not. So I experimented with this and pushed 8f390c2.

This will make it much easier to express the canRemain decision alongside a canRebalance decision, if indeed we aren't required to force move the shard. I also added checks to ensure isDecisionTaken returns true before getting the various decision property objects, otherwise an IllegalStateException is thrown. Let me know what you think.
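
A minimal sketch of the idea described above, assuming a single class covers both forced moves and rebalancing and that accessors are guarded by an `isDecisionTaken` check. `MoveDecisionSketch`, its factory methods, and its fields are hypothetical simplifications; the actual class in commit 8f390c2 carries more state (node-level decisions, explanations, serialization).

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for the merged decision class; not the actual Elasticsearch code.
final class MoveDecisionSketch {

    enum Type { YES, NO, THROTTLE }

    // Shared instance representing "no decision was taken".
    static final MoveDecisionSketch NOT_TAKEN = new MoveDecisionSketch(null, null, null);

    private final Type canRemain;      // null when no decision was taken
    private final Type canRebalance;   // null when the move is forced
    private final String targetNodeId; // null when there is no target node

    private MoveDecisionSketch(Type canRemain, Type canRebalance, String targetNodeId) {
        this.canRemain = canRemain;
        this.canRebalance = canRebalance;
        this.targetNodeId = targetNodeId;
    }

    // The shard cannot stay on its current node, so the move is forced.
    static MoveDecisionSketch forcedMove(String targetNodeId) {
        return new MoveDecisionSketch(Type.NO, null, Objects.requireNonNull(targetNodeId));
    }

    // The shard may stay; moving it is only a question of rebalancing.
    static MoveDecisionSketch rebalance(Type canRebalance, String targetNodeId) {
        return new MoveDecisionSketch(Type.YES, Objects.requireNonNull(canRebalance), targetNodeId);
    }

    boolean isDecisionTaken() {
        return canRemain != null;
    }

    boolean cannotRemain() {
        checkDecisionState();
        return canRemain == Type.NO;
    }

    Type getCanRebalance() {
        checkDecisionState();
        return canRebalance;
    }

    String getTargetNodeId() {
        checkDecisionState();
        return targetNodeId;
    }

    // Guard used by every accessor: reading a decision that was never taken is a programming error.
    private void checkDecisionState() {
        if (isDecisionTaken() == false) {
            throw new IllegalStateException("decision was not taken, individual object fields cannot be accessed");
        }
    }
}
```

Callers are expected to test `isDecisionTaken()` before reading any property, which turns an accidental read of an empty decision into an immediate failure instead of a silent `null`.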

abeyad · 2016-12-05T06:27:23Z

@ywelsch I pushed 4a0ed53 and a579b19 based on our analysis of the JSON output

ywelsch · 2016-12-05T12:21:49Z

core/src/main/java/org/elasticsearch/action/admin/cluster/allocation/NodeExplanation.java

@@ -108,7 +108,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
            builder.field("final_decision", finalDecision.toString());


where is this "final_decision" coming from? I thought we got rid of those...

This is the old explain API code. Once we plug the new explain API into the rest layer, this whole class will disappear.

ywelsch · 2016-12-05T12:23:00Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AbstractAllocationDecision.java

    }

    protected AbstractAllocationDecision(StreamInput in) throws IOException {
        decision = in.readOptionalWriteable(Type::readFrom);
        targetNode = in.readOptionalWriteable(DiscoveryNode::new);
-        nodeDecisions = in.readBoolean() ? Collections.unmodifiableMap(
-            in.readMap(StreamInput::readString, NodeAllocationResult::new, LinkedHashMap::new)) : null;


the readMap method can go away again now?

ywelsch · 2016-12-05T12:26:37Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+        out.writeVLong(configuredDelayInMillis);
+    }
+
+    private boolean atleastOneNodeWithYesDecision() {


ywelsch · 2016-12-05T17:43:20Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AbstractAllocationDecision.java

+public abstract class AbstractAllocationDecision implements ToXContent, Writeable {
+
+    @Nullable
+    protected final Type decision;


I see this variable being called "decision", "finalDecision", "decisionType" in different places. Let's make this a bit more uniform. I also wonder if this is the right class type to expose as overall "result". It can just have the values "YES/NO/THROTTLE" which feel limited w.r.t AllocationStatus.

ywelsch · 2016-12-05T17:47:45Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/MoveDecision.java

-     * @param explain true if in explain mode
-     * @param currentNodeId the current node id where the shard is assigned
-     * @param assignedNodeId the node id for where the shard can move to
+     * @param assignedNode the node for where the shard can move to


the node where the shard should move to

ywelsch · 2016-12-05T17:51:49Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/MoveDecision.java

-     * Gets the individual node-level decisions that went into making the final decision as represented by
-     * {@link #getFinalDecisionType()}.  The map that is returned has the node id as the key and a {@link NodeAllocationResult}.
+     * Returns the decision for being allowed to rebalance the shard.  Invoking this method will return a
+     * {@code null} if {@link #cannotRemain()} returns {@code true}, which means the node is not allowed to


return a null?

ywelsch · 2016-12-05T17:55:29Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/MoveDecision.java

+                        explanation = "can rebalance shard";
+                    }
+                } else {
+                    explanation = "cannot rebalance as no target node node exists that can both allocate this shard " +


ywelsch · 2016-12-05T17:56:07Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/MoveDecision.java

+        } else {
+            // it was a decision to force move the shard
+            if (cannotRemain() == false) {
+                explanation = "can remain on its current node";


shard can remain

ywelsch · 2016-12-05T17:58:35Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

+                // if the node decision is NO, despite the canAllocate decision returning YES, it might not seem
+                // intuitive why the node has a NO decision, so we provide an extra explanation in this case
+                // to denote the reason for the NO decision was that the balance was not improved
+                builder.field("explanation", "not rebalancing to this node because the weight ranking is the same or worse " +


instead of "the same or worse" you can use "not better"

ywelsch · 2016-12-05T18:07:28Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/MoveDecision.java

    private final Decision canRemainDecision;
    @Nullable
-    private final Map<String, NodeAllocationResult> nodeDecisions;
+    private final Decision canRebalanceDecision;
+    private final boolean fetchPending;


here we use a boolean fetchPending whereas AllocateUnassignedDecision uses the allocationStatus to determine if fetch is pending. As said in an earlier comment, I think we should have a uniform result object for this.
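
One way to picture the uniform result object requested here is a single enum that folds the decision type and the pending-fetch state together. The sketch below is only an illustration under that assumption: `UniformDecisionSketch`, the constant names, and the `from` helper are hypothetical, and the `AllocationDecision` enum eventually added in this PR may define different values.

```java
// Hypothetical illustration of one enum replacing a (Type, fetchPending) pair.
final class UniformDecisionSketch {

    // Simplified stand-in for the YES/NO/THROTTLE decision type.
    enum Type { YES, NO, THROTTLE }

    // Illustrative constants only; the enum merged with this PR may differ.
    enum AllocationDecisionSketch {
        YES, THROTTLED, NO, AWAITING_INFO;

        // A pending shard-data fetch takes precedence over the decider outcome.
        static AllocationDecisionSketch from(Type type, boolean fetchPending) {
            if (fetchPending) {
                return AWAITING_INFO;
            }
            switch (type) {
                case YES:
                    return YES;
                case THROTTLE:
                    return THROTTLED;
                default:
                    return NO;
            }
        }
    }
}
```

With one enum, both the unassigned and the move paths can report their outcome through the same field rather than through a boolean in one class and a status in another.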


ywelsch · 2016-12-05T19:02:08Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationResult.java

@@ -152,7 +152,7 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
                // if the node decision is NO, despite the canAllocate decision returning YES, it might not seem
                // intuitive why the node has a NO decision, so we provide an extra explanation in this case
                // to denote the reason for the NO decision was that the balance was not improved
-                builder.field("explanation", "not rebalancing to this node because the weight ranking is the same or worse " +
+                builder.field("explanation", "not rebalancing to this node because the weight ranking is not better" +


space missing at the end


ywelsch · 2016-12-06T21:01:54Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocateUnassignedDecision.java

+            explanation = "can allocate the shard";
+        } else if (allocationStatus == AllocationStatus.DECIDERS_THROTTLED) {
+            explanation = "allocation temporarily throttled";
+        } else {
            if (allocationStatus == AllocationStatus.FETCHING_SHARD_DATA) {


} else if { ?

ywelsch

LGTM. Thanks @abeyad

abeyad · 2016-12-07T18:05:15Z

Thanks for your patience with the review, @ywelsch, and for helping to come up with a much leaner and easier-to-follow output!


abeyad · 2016-12-08T02:15:03Z

5.x commit: b7bdc1c

abeyad added :Allocation v5.1.1 v6.0.0-alpha1 labels Nov 21, 2016

clintongormley added the >enhancement label Nov 21, 2016

ywelsch suggested changes Nov 21, 2016


abeyad force-pushed the explain_api_refactor_responses branch from 451d0b9 to d52373d Compare November 23, 2016 16:27

ywelsch suggested changes Nov 23, 2016


abeyad added v5.2.0 and removed v5.1.1 labels Nov 24, 2016

ywelsch suggested changes Nov 28, 2016


ywelsch suggested changes Nov 30, 2016


ywelsch reviewed Dec 1, 2016


ywelsch reviewed Dec 2, 2016


Ali Beyad added 5 commits December 2, 2016 21:28

Improve explanations and move RebalanceDecision explanation calculation to the class itself.

f3cc960

Serialization shouldn't care about cached decisions

d0fab3b

remove innerWriteTo in favor of subclass calling super.writeTo(out)

937e9a0

Adds assigned node name in addition to assigned node id

e2c97cc

Ali Beyad added 3 commits December 2, 2016 21:31

assigned_node -> target_node

c7fc15a

Decision constructor takes list of node decisions instead of map

7294058

fix rebasing issue

aa02030

(1) Merges RebalanceDecision into MoveDecision - now we have just one class and (2) throws IllegalStateException if fields called on decision object where decision was not taken

8f390c2

abeyad force-pushed the explain_api_refactor_responses branch from dbc082c to 8f390c2 Compare December 3, 2016 17:35

Ali Beyad added 2 commits December 5, 2016 00:56

improvements to the JSON output

4a0ed53

improves decider explanations

a579b19

ywelsch reviewed Dec 5, 2016


Ali Beyad added 2 commits December 5, 2016 09:23

remove unneeded readMap

021fd81

atleast -> atLeast

e306175

ywelsch reviewed Dec 5, 2016


Ali Beyad added 2 commits December 5, 2016 13:08

remove currentNode from MoveDecision and improve rebalance not allowed message

ee0ae6e

addresses code review comments

37cf32c

ywelsch reviewed Dec 5, 2016


Ali Beyad added 2 commits December 5, 2016 14:04

fix spacing

e86080a

Use an AllocationDecision enum to convey decisions to provide consistency

1efae15

ywelsch reviewed Dec 6, 2016


Ali Beyad added 2 commits December 6, 2016 23:02

else if in AllocateUnassignedDecision

81bbceb

can_remain, can_move -> "yes|no"

bd7cd33

ywelsch approved these changes Dec 7, 2016


abeyad merged commit e6e7bab into elastic:master Dec 7, 2016

abeyad mentioned this pull request Dec 14, 2016

Cluster Explain API uses the allocation process to explain shard allocation decisions #22182

Merged

lcawl added :Distributed/Distributed and removed :Allocation labels Feb 13, 2018
