Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepares allocator decision objects for use with the allocation explain API #21691

Merged
merged 53 commits into from Dec 7, 2016

Conversation

abeyad
Copy link

@abeyad abeyad commented Nov 21, 2016

This commit enhances the allocator decision result objects (namely,
AllocateUnassignedDecision, MoveDecision, and RebalanceDecision)
to enable them to be used directly by the cluster allocation explain API. In
particular, this commit does the following:

  1. Adds serialization and toXContent methods to the response objects,
    which will form the explain API responses.
  2. Moves the calculation of the final explanation to the response
    object itself, removing it from the responsibility of the
    allocators.
  3. Adds shard store information to the NodeAllocationResult, so that
    store information is available for each node, when explaining a
    shard allocation by the PrimaryShardAllocator or the
    ReplicaShardAllocator.
  4. Removes RebalanceDecision in favor of using MoveDecision for both
    moving and rebalancing shards.
  5. Removes NodeRebalanceResult in favor of using NodeAllocationResult.
  6. Changes the notion of weight ranking to be relative to the current node,
    instead of an absolute weight that doesn't convey any added value to the
    API user and can be confusing.
  7. Introduces a new enum AllocationDecision to convey the decision type,
    which enables conveying unassigned, moving, and rebalancing scenarios
    with more detail as opposed to just Decision.Type and AllocationStatus.

@abeyad
Copy link
Author

abeyad commented Nov 21, 2016

retest this please

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments. Will do another iteration tomorrow.

explanation = "shard assigned to node [" + assignedNodeId + "]";
}
} else if (finalDecision == Type.THROTTLE) {
explanation = "allocation throttled as each node with an existing copy of the shard is busy with other recoveries";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's hard to give a good explanation here that fits all cases: I wonder if this should be kept more abstract

@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.field("final_decision", finalDecision.toString());
builder.field("final_explanation", getFinalExplanation());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the word "final" in "final_decision" and "final_explanation". I wonder if we can have something like node-level decisions and cluster-level decisions or come up with better names for these?

builder.field("allocation_status", allocationStatus.value());
}
if (assignedNodeId != null) {
builder.field("assigned_node_id", assignedNodeId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the node name might be useful here as well

}
Decision.writeTo(canAllocateDecision, out);
out.writeFloat(weight);
innerWriteTo(out);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok in this case to not have innerWriteTo and have the subclass just call super.writeTo(out);

}

if (nodeDecisions == null && allocationStatus != null) {
// use cached version - there are no detailed decisions or explanations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it useful to reuse cached instances after serialization? The use case we wanted to cover with CACHED_DECISIONS does never include serialized instances.

@abeyad
Copy link
Author

abeyad commented Nov 23, 2016

@ywelsch I pushed a few commits to address the code review comments and some other things we discussed. Still outstanding is to add a "minimal_mode" JSON output.

@abeyad abeyad force-pushed the explain_api_refactor_responses branch from 451d0b9 to d52373d Compare November 23, 2016 16:27
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left another round of comments. This is tricky stuff.

if (in.readBoolean()) {
allocationStatus = AllocationStatus.readFrom(in);
} else {
allocationStatus = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allocationStatus = in.readOptionalWriteable(AllocationStatus::readFrom);

return new AllocateUnassignedDecision(Type.NO, AllocationStatus.DECIDERS_NO, explanation, null, null, null, shardDecision);
public AllocateUnassignedDecision(StreamInput in) throws IOException {
if (in.readBoolean()) {
decision = Type.readFrom(in);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decision = in.readOptionalWriteable(Type::readFrom); and make Type implement Writeable (converting static writeTo method to instance method)

/**
* Returns the explanation behind the {@link #getDecision()} that is returned for this decision.
*/
public String getFinalExplanation() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just "getExplanation"?

if (assignedNodeId != null) {
builder.startObject("assigned_node");
builder.field("id", assignedNodeId);
builder.field("name", assignedNodeName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should store the full DiscoveryNode so that we have more freedom to display whatever we like here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debating that myself. I am fine with storing the full DiscoveryNode

builder.field("allocation_id", allocationId);
}
if (allocationStatus == AllocationStatus.DELAYED_ALLOCATION) {
builder.field("remaining_delay_in_millis", remainingDelayInMillis);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous explain API also had information about how long the total delay was (i.e. configured delay in index settings).

Tuple<Decision, Map<String, NodeAllocationResult>> result = canBeAllocatedToAtLeastOneNode(unassignedShard, allocation);
Decision allocateDecision = result.v1();
if (allocateDecision.type() != Decision.Type.YES) {
// only return early if we are not in explain mode, because if we are in explain mode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this explanation. we return early here, even in explain mode?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is with fetching. Suppose canBeAllocatedToAtleastOneNode returns a non-YES decision. If this is the case from the start of the ES instance, then we will have never before fetched shard data (b/c canBeAllocatedToAtleastOneNode) aborted us from doing so. Now suppose we execute an explain request. If we didn't abort early here, then the explain call would've triggered a shard fetch, and the response we would get from the explain API is "pending shard fetch".

I don't believe this is accurate, as the explain API is not reflecting reality (i.e. its telling us its pending shard fetching when that's not the reality of allocation, the reality of allocation is that it never bothered trying to fetch shard data due to a NO or THROTTLE decision from canBeAllocatedToAtLeastOneNode). I don't think the explain API should be the cause of a shard fetch for the PrimaryShardAllocator nor the ReplicaShardAllocator.

That was the reasoning here for aborting early, even in explain mode. One way to avoid the explain API triggering a shard fetch and still return node decisions even if canBeAllocatedToAtleastOneNode returns NO is to add a method to the ReplicaShardsAllocator to check if a shard data fetch has ever been triggered. I tried this strategy in this commit: aa31f4c72bd4f1b0febab835b73542c00364db32. Let me know what you think.

}

ShardRouting primaryShard = routingNodes.activePrimary(unassignedShard.shardId());
assert primaryShard != null : "the replica shard can be allocated on at least one node, so there must be an active primary";
if (primaryShard == null) {
assert explain : "primary should only be null here if we are in explain mode, so we didn't " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does that still hold true?

if (decision.type() == Decision.Type.YES) {
return Tuple.tuple(decision, null);
if (decision.type() == Decision.Type.YES && madeDecision.type() != Decision.Type.YES) {
if (allocation.debugDecision()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (explain)

} else {
shardStore = new ShardStore(StoreStatus.UNKNOWN, matchingBytes);
}
nodeDecisions.put(node.nodeId(), new NodeAllocationResult(discoNode, shardStore, decision));
}

if (decision.type() == Decision.Type.NO) {
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not calculate the nodesToSize when in explain mode?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow? nodesToSize is used for the actual decision, to store the number of matching bytes for nodes that canAllocate returns a YES/THROTTLE decision for. The nodeDecisions are populated regardless of whether they end up in nodesToSize, b/c we want the shard store status for every node, even if canAllocate returns a NO decision.

@@ -215,7 +215,7 @@ public int getDocCount() {
PerSegmentCollects(LeafReaderContext readerContext) throws IOException {
// The publisher behaviour for Reader/Scorer listeners triggers a
// call to this constructor with a null scorer so we can't call
// scorer.getWeight() and pass the Weight to our base class.
// scorer.getWeightRanking() and pass the Weight to our base class.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol

@abeyad abeyad added v5.2.0 and removed v5.1.1 labels Nov 24, 2016
@abeyad
Copy link
Author

abeyad commented Nov 25, 2016

@ywelsch I pushed some more commits that address the latest round of feedback, and i've left comments/questions regarding outstanding issues. Its ready for another look. I'm working on a minimal_mode JSON output which may be better served in a separate PR after this one.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left another round of comments. The three main things to address are the sorting of results, whether we should not have the StoreReadability enum and what the output of NodeRebalanceResult should be.

explanation = "cannot allocate because a previous copy of the shard existed, but could not be found";
}
} else if (allocationStatus == AllocationStatus.DELAYED_ALLOCATION) {
explanation = "cannot allocate because the cluster is waiting " + Long.toString(remainingDelayInMillis / 1000L) +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's simpler to use TimeValue's toString() to give nice output here.

}
if (assignedNode != null) {
builder.startObject("assigned_node");
builder.field("id", assignedNode.getId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we put the whole DiscoveryNode here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean a single field assigned_node with to toString() of the DiscoveryNode? Or calling DiscoveryNode#toXContent? I feel it may be too verbose and provide unnecessary information, especially when any of that extra information is easily obtained via /_cat/nodes or getting the cluster state. But if you feel strongly that we should put the full object there, I don't mind.

builder.startObject("node_decisions");
{
List<String> nodeIds = new ArrayList<>(nodeDecisions.keySet());
Collections.sort(nodeIds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why sort by node id? Should we sort by other properties? e.g. yes before throttle before no and also take the weights into account as secondary sort criterium.

public float getWeight() {
return weight;
public boolean isWeightRanked() {
return weightRanking != -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could use 0 instead of 1 for this and start the ranking with 1,2,3,...

builder.field("matching_bytes", new ByteSizeValue(matchingBytes).toString());
}
if (storeException != null) {
builder.field("store_exception", ExceptionsHelper.detailedMessage(storeException));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ElasticsearchException.toXContent()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, i didn't know about that!

}
builder.field("final_explanation", getExplanation());
if (assignedNode != null) {
builder.startObject("assigned_node");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put full node here?

@@ -371,10 +363,11 @@ private RebalanceDecision decideRebalance(final ShardRouting shard) {
final String idxName = shard.getIndexName();
Map<String, NodeRebalanceResult> nodeDecisions = new HashMap<>(modelNodes.length - 1);
Type rebalanceDecisionType = Type.NO;
String assignedNodeId = null;
DiscoveryNode assignedNode = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just store ModelNode here and later resolve when passing it to RebalanceDecision.

rebalanceConditionsMet ? canAllocate.type() : Type.NO,
canAllocate,
++weightRanking,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's leave it like this for now.

if (storeErr != null) {
Throwable unwrapped = ExceptionsHelper.unwrapCause(storeErr);
if (unwrapped instanceof CorruptIndexException) {
storeReadability = StoreReadability.CORRUPT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need an enum for this as we already have the exception at hand in ShardStore.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you on this and thought the same when separating these two enums out. If the storeException is non-null, we already have details on the exception and don't need any further confirmation on whether it was "corrupt" or "shard-lock". The exception tells us that and more. If the storeException is null, we know we don't have any reading errors on the shard. I am happy to remove this enum.

// The ids are only empty if dealing with a legacy index
storeStatus = StoreStatus.UNKNOWN;
} else if (nodeShardState.allocationId() != null && inSyncAllocationIds.contains(nodeShardState.allocationId())) {
storeStatus = StoreStatus.CURRENT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of CURRENT we could use IN_SYNC?

@abeyad
Copy link
Author

abeyad commented Nov 29, 2016

@ywelsch I pushed 8343e227a95787cc97894344fe90a79e6d6e9875 to address the latest round.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left another round of comments.

* A toXContent implementation that leaves off some of the non-critical fields, and assumes the outer object
* is created outside of this method call.
*/
public XContentBuilder toXContentLight(XContentBuilder builder, Params params) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method should be moved to the allocation explain code somewhere. It is a bit too specific for this class.

builder.field("allocation_id", allocationId);
}
if (allocationStatus == AllocationStatus.DELAYED_ALLOCATION) {
builder.field("remaining_delay", TimeValue.timeValueSeconds(remainingDelayInMillis / 1000L).toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use timeValueMillis?

}
if (allocationStatus == AllocationStatus.DELAYED_ALLOCATION) {
builder.field("remaining_delay", TimeValue.timeValueSeconds(remainingDelayInMillis / 1000L).toString());
builder.field("total_delay", TimeValue.timeValueSeconds(totalDelayInMillis / 1000L).toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use timeValueMillis?

});
List<String> nodeIds = new ArrayList<>(nodeDecisions.keySet());
Collections.sort(nodeIds);
for (String nodeId : nodeIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't you iterate over the entries here?

* deciders ran. For example, suppose there are 3 nodes which the allocation deciders
* decided upon: node1, node2, and node3. If node2 had the best weight for holding the
* shard, followed by node3, followed by node1, then node2's weight will be 1, node3's
* weight will be 2, and node1's weight will be 1. A value of -1 means the weight was
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update javadoc. 0 is the new -1

// The copy matches sync ids with the primary
MATCHING_SYNC_ID((byte) 2),
// It's unknown what the copy of the data is
UNKNOWN((byte) 3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have an enum value for "no data found"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a node had no data, it would never be considered by the PrimaryShardAllocator nor the ReplicaShardAllocator, so that node would not appear in the node decisions map. Do you think we should add all nodes to the map even if the Primary or Replica ShardsAllocator took the decision, and for those nodes that had no copy, just have a NO decision with ShardStatus = NO_DATA_FOUND? I thought that would be overkill, but can add it if you think it will make it more clear that all the other nodes just didn't have any copies of the data, so weren't in consideration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave it like this for now and revisit when we provide the API endpoint.

protected void innerToXContent(XContentBuilder builder, ToXContent.Params params) throws IOException {
builder.field("delta_above_threshold", deltaAboveThreshold);
if (deltaAboveThreshold) {
builder.field("better_weight_with_shard_added", betterWeightWithShardAdded);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is the weight delta above threshold but weight not better? I don't understand this conditional output.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets take two nodes, A and B, and we are trying to balance shard S0 belonging to index idx which current resides on node A. Lets say the current weight of node A with respect to idx is 25 and the current weight of node B with respect to idx is 10. The current weight delta between the two nodes is 15. Lets assume the configured threshold for prompting rebalances is 10, so node B will have deltaAboveThreshold=true with respect to shard S0 on node A.

Now, we simulate moving S0 from A to B. Lets say the new weight of A with respect to idx is now 10 and the new weight of B with respect to idx is 26. The new delta between the two nodes with respect to idx is now 16 which is greater than the previous delta of 15, so better_balance would be false, even though deltaAboveThreshold is true.

protected void innerToXContent(XContentBuilder builder, ToXContent.Params params) throws IOException {
builder.field("delta_above_threshold", deltaAboveThreshold);
if (deltaAboveThreshold) {
builder.field("better_weight_with_shard_added", betterWeightWithShardAdded);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better_weight_with_shard_added reads so weird. Wouldn't better_weight be more accurate. Or maybe just better_balance?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better_balance sounds good to me. I liked better_weight_with_shard_added because it essentially told us exactly the computation that occurred (simulating moving the shard and checking the weights), but that's probably meaningless to the user and they just want to know if the balance is better by moving the shard.

public RebalanceDecision(Decision canRebalanceDecision, Type finalDecision, String finalExplanation) {
this(canRebalanceDecision, finalDecision, finalExplanation, null, null, Float.POSITIVE_INFINITY);
}
private final float currentWeight;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need the current weight if there is no other weight to compare it to?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's true, its not needed anymore as the node results no longer maintain their weights as a comparison point (just using weight ranking now) so i'll remove this


@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeBoolean(true); // flag indicating it is a multi decision
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use the NamedWriteable infrastructure?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I didn't want to break the wire protocol for Decision (its the same reason I didn't reorder the ids in the Type enum which would've made the comparing easier). The Decision class (and therefore, Type as well) can be serialized over the wire protocol in case of running a reroute command with explanations turned on. RoutingExplanations relies on serializing the Decision class. I don't think the benefit outweighs breaking the wire protocol here, even though it may be a rare case. WDYT?

@abeyad
Copy link
Author

abeyad commented Dec 1, 2016

@ywelsch I pushed e0cb904d986614432acd6465d3c61bb370130ddc and 44ac196883b6a4bbff8e547bdd65e1aac46339c3 to address the latest feedback.

* A toXContent implementation that leaves off some of the non-critical fields, and assumes the outer object
* is created outside of this method call.
*/
public static XContentBuilder discoveryNodeToXContent(XContentBuilder builder, Params params, DiscoveryNode node) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nitty, but DiscoveryNode should be the first parameter of this method, so it reads more nicely.

}
this.nodeDecisions = (nodeDecisions != null) ? Collections.unmodifiableMap(nodeDecisions) : null;

nodeDecisions = in.readBoolean() ? sortNodeDecisions(in.readMap(StreamInput::readString, NodeAllocationResult::new)) : null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is sorting needed when we read the map? Can we assume it's sorted already and use a LinkedHashMap? This could be done by generalizing readMap so that it also takes a Map supplier.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 42aad8bf40c41cef68a689907bc5a6a07033fe7a which adds a readMap method to StreamInput that takes a map supplier. I used the same code instead of having the older readMap method call into readMap with a hash map supplier because of the ability of the old readMap method to use the size param to properly size the map from the start. In practice, this is likely not going to cost us much, so I'm also fine with having just one method that takes a map supplier and is not able to size the map from the beginning. WDYT?

/**
* A class that holds utility methods for serializing and working with allocation decisions.
*/
public class DecisionUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need a class like this (the stuff it provides feels a bit arbitrary). The comparator, sortNodeDecisions and nodeDecisionsToXContent could go into the NodeAllocationResult class. This leaves discoveryNodeToXContent which could go into an AbstractAllocationDecision class that's a superclass for all decision classes?
We could also have NodeAllocationResult implement Comparable.

Copy link
Author

@abeyad abeyad Dec 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I contemplated this. Given our discussion on the node rebalancing output, I realized that NodeRebalanceResult is not needed - all the info is in NodeAllocationResult. Likewise, after iterations on this PR, I realized that an AbstractAllocationDecision can encompass a few common elements between all the decisions, so I've done quite a bit of refactoring there. I also made NodeAllocationResult implement Comparable. I implemented all this in 553eb77677fe696fb7d6b314bfed2fe4392a727b. WDYT?

YES(1),
THROTTLE(2);
THROTTLE(2),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now you rely on the natural order? You should then also have a test for it so that it does not get reordered by anyone else.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1f304d706afd001ebed72ceac57684a4be884d4a

// The copy matches sync ids with the primary
MATCHING_SYNC_ID((byte) 2),
// It's unknown what the copy of the data is
UNKNOWN((byte) 3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave it like this for now and revisit when we provide the API endpoint.

return weightWithShardAdded;
@Override
protected void innerToXContent(XContentBuilder builder, ToXContent.Params params) throws IOException {
builder.field("delta_above_threshold", deltaAboveThreshold);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a bit on the fence of exposing this. I'll reach out to discuss.

@abeyad
Copy link
Author

abeyad commented Dec 2, 2016

I pushed 16896e726291e9ad6c95eb6012e805440e3b061d to use relative weight ranking in the rebalance decision.

@abeyad
Copy link
Author

abeyad commented Dec 2, 2016

@ywelsch I pushed commits that did some refactoring based on your review and our discussion.

* Generates X-Content for a {@link DiscoveryNode} that leaves off some of the non-critical fields,
* and assumes the outer object is created outside of this method call.
*/
public static XContentBuilder discoveryNodeToXContent(XContentBuilder builder, ToXContent.Params params, DiscoveryNode node)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DiscoveryNode should be the first parameter of this method, so it reads more nicely.

super.toXContent(builder, params);
builder.startObject("can_remain_decision");
{
builder.field("decision", canRemainDecision.type().toString());
canRemainDecision.toXContent(builder, params);
}
builder.endObject();
nodeDecisionsToXContent(builder, params, nodeDecisions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this reorders the output. I think that the StreamOutput/StreamInput can call super() but that the toXContent should be completely done by the subclass so that we can render output in the optimal way. I would for example prefer to see the can_remain_decision before the node decisions where to allocate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

innerToXContent is called before nodeDecisionsToXContent in AbstractAllocationDecision, so the can_remain_decision comes before the node decisions. I was basically able to do it such that the order was kept the same and sensible throughout by having the innerToXContent so I abstracted it out in this way. Do you mean you'd rather see can_remain_decision before assigned_node?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that was what I meant

@@ -83,6 +83,8 @@ public NodeAllocationResult(StreamInput in) throws IOException {
weightRanking = in.readVInt();
}

@Override public String toString() { return weightRanking+""; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer.toString(int i). Also, give this a few newlines.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was leftover from testing, i forgot to remove it, i removed it now

@@ -111,7 +123,7 @@ public String getExplanation() {

@Override
public void innerToXContent(XContentBuilder builder, Params params) throws IOException {
super.toXContent(builder, params);
builder.field("current_node_ranking", currentNodeRanking);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just call this weight_ranking and put it under the toXContent for the currently assigned node.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't currently output the current node to which the shard is assigned. would you like to add this? this would only apply to RebalanceDecision and MoveDecision, and only the RebalanceDecision would have a "weight_ranking" on its current node. This would be a more compelling reason to have each individual decision implement its own toXContent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

@abeyad
Copy link
Author

abeyad commented Dec 2, 2016

@ywelsch I pushed commits to improve the ranking algorithm code and address the code review comments

@abeyad
Copy link
Author

abeyad commented Dec 2, 2016

retest this please

// the current node is the worst ranked, so its ranking never got assigned above
currentNodeWeightRanking = weightRanking + 1;
int weightRanking = 0;
Map<String, NodeAllocationResult> nodeDecisions = new HashMap<>(modelNodes.length - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a map here? As we sort anyway in the RebalanceDecision constructor, is it good enough to just pass a list of NodeRebalanceResult to it?

builder.field("explanation", getExplanation());
if (getAssignedNode() != null) {
builder.startObject("assigned_node");
discoveryNodeToXContent(getAssignedNode(), builder, params);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to make the fields protected and access them in that way.

builder.field("decision", getDecisionType().toString());
builder.field("explanation", getExplanation());
if (getAssignedNode() != null) {
builder.startObject("assigned_node");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should name this target_node?

@@ -111,7 +123,7 @@ public String getExplanation() {

@Override
public void innerToXContent(XContentBuilder builder, Params params) throws IOException {
super.toXContent(builder, params);
builder.field("current_node_ranking", currentNodeRanking);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

@abeyad
Copy link
Author

abeyad commented Dec 2, 2016

@ywelsch I've addressed the latest round

Ali Beyad added 5 commits December 2, 2016 21:28
them to be used directly by the cluster allocation explain API.  In
particular, this commit does the following:

 1. Adds serialization and toXContent methods to the response objects,
    which will form the explain API responses.
 2. Moves the calculation of the final explanation to the response
    object itself, removing it from the responsibility of the
    allocators.
 3. Adds shard store information to the NodeAllocationResult, so that
    store information is available for each node, when explaining a
    shard allocation by the PrimaryShardAllocator or the
    ReplicaShardAllocator.
 4. Moves NodeRebalanceResult to its own top-level class, as that
    was forgotten to be done in elastic#21662.
 5. Adds delta information to the NodeRebalanceResult.
@abeyad
Copy link
Author

abeyad commented Dec 3, 2016

retest this please

class and (2) throws IllegalStateException if fields called on decision
object where decision was not taken
@abeyad abeyad force-pushed the explain_api_refactor_responses branch from dbc082c to 8f390c2 Compare December 3, 2016 17:35
@abeyad
Copy link
Author

abeyad commented Dec 3, 2016

@ywelsch As I was reworking the rest layer, I realized we will want to know the canRemain decision even in the RebalanceDecision step. When adding it, I also realized there are lots of commonalities between MoveDecision and RebalanceDecision and I don't even feel we need a separate class for expressing "rebalancing". Its simply a move decision where either the move was forced (depending on the canRemain decision) or not. So I experimented with this and pushed 8f390c2.

This will make it much easier to express the canRemain decision alongside a canRebalance decision, if indeed we aren't required to force move the shard. I also added checks to ensure isDecisionTaken returns true before getting the various decision property objects, otherwise an IllegalStateException is thrown. Let me know what you think.

@abeyad
Copy link
Author

abeyad commented Dec 5, 2016

@ywelsch I pushed 4a0ed53 and a579b19 based on our analysis of the JSON output

@@ -108,7 +108,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
builder.field("final_decision", finalDecision.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this "final_decision" coming from? I thought we got rid of those...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the old explain API code. Once we plug the new explain API into the rest layer, this whole class will disappear.

}

protected AbstractAllocationDecision(StreamInput in) throws IOException {
decision = in.readOptionalWriteable(Type::readFrom);
targetNode = in.readOptionalWriteable(DiscoveryNode::new);
nodeDecisions = in.readBoolean() ? Collections.unmodifiableMap(
in.readMap(StreamInput::readString, NodeAllocationResult::new, LinkedHashMap::new)) : null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the readMap method can go away again now?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

out.writeVLong(configuredDelayInMillis);
}

private boolean atleastOneNodeWithYesDecision() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atLeast...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public abstract class AbstractAllocationDecision implements ToXContent, Writeable {

@Nullable
protected final Type decision;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this variable being called "decision", "finalDecision", "decisionType" in different places. Let's make this a bit more uniform. I also wonder if this is the right class type to expose as overall "result". It can just have the values "YES/NO/THROTTLE" which feel limited w.r.t AllocationStatus.

* @param explain true if in explain mode
* @param currentNodeId the current node id where the shard is assigned
* @param assignedNodeId the node id for where the shard can move to
* @param assignedNode the node for where the shard can move to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the node where the shard should move to

* Gets the individual node-level decisions that went into making the final decision as represented by
* {@link #getFinalDecisionType()}. The map that is returned has the node id as the key and a {@link NodeAllocationResult}.
* Returns the decision for being allowed to rebalance the shard. Invoking this method will return a
* {@code null} if {@link #cannotRemain()} returns {@code true}, which means the node is not allowed to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return a null?

explanation = "can rebalance shard";
}
} else {
explanation = "cannot rebalance as no target node node exists that can both allocate this shard " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node node

} else {
// it was a decision to force move the shard
if (cannotRemain() == false) {
explanation = "can remain on its current node";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shard can remain

// if the node decision is NO, despite the canAllocate decision returning YES, it might not seem
// intuitive why the node has a NO decision, so we provide an extra explanation in this case
// to denote the reason for the NO decision was that the balance was not improved
builder.field("explanation", "not rebalancing to this node because the weight ranking is the same or worse " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of "the same or worse" you can use "not better"

private final Decision canRemainDecision;
@Nullable
private final Map<String, NodeAllocationResult> nodeDecisions;
private final Decision canRebalanceDecision;
private final boolean fetchPending;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we use a boolean fetchPending whereas AllocateUnassignedDecision uses the allocationStatus to determine if fetch is pending. As said in an earlier comment, I think we should have a uniform result object for this.

@@ -152,7 +152,7 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
// if the node decision is NO, despite the canAllocate decision returning YES, it might not seem
// intuitive why the node has a NO decision, so we provide an extra explanation in this case
// to denote the reason for the NO decision was that the balance was not improved
builder.field("explanation", "not rebalancing to this node because the weight ranking is the same or worse " +
builder.field("explanation", "not rebalancing to this node because the weight ranking is not better" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space missing at the end

explanation = "can allocate the shard";
} else if (allocationStatus == AllocationStatus.DECIDERS_THROTTLED) {
explanation = "allocation temporarily throttled";
} else {
if (allocationStatus == AllocationStatus.FETCHING_SHARD_DATA) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} else if { ?

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @abeyad

@abeyad
Copy link
Author

abeyad commented Dec 7, 2016

Thanks for the patience with the review @ywelsch and helping to come up with a much leaner and easy to follow output!

@abeyad abeyad merged commit e6e7bab into elastic:master Dec 7, 2016
abeyad pushed a commit that referenced this pull request Dec 8, 2016
…in API (#21691)

This commit enhances the allocator decision result objects (namely,
AllocateUnassignedDecision, MoveDecision, and RebalanceDecision)
to enable them to be used directly by the cluster allocation explain API. In
particular, this commit does the following:

- Adds serialization and toXContent methods to the response objects,
which will form the explain API responses.
- Moves the calculation of the final explanation to the response
object itself, removing it from the responsibility of the allocators.
- Adds shard store information to the NodeAllocationResult, so that
store information is available for each node, when explaining a
shard allocation by the PrimaryShardAllocator or the ReplicaShardAllocator.
- Removes RebalanceDecision in favor of using MoveDecision for both
moving and rebalancing shards.
- Removes NodeRebalanceResult in favor of using NodeAllocationResult.
- Changes the notion of weight ranking to be relative to the current node,
instead of an absolute weight that doesn't convey any added value to the
API user and can be confusing.
- Introduces a new enum AllocationDecision to convey the decision type,
which enables conveying unassigned, moving, and rebalancing scenarios
with more detail as opposed to just Decision.Type and AllocationStatus.
@abeyad
Copy link
Author

abeyad commented Dec 8, 2016

5.x commit: b7bdc1c

@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement v5.2.0 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants