-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Cluster Explain API uses the allocation process to explain shard allocation decisions #22182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some comments. Will give it a more thorough look tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need to specify full path, ShardRoutingState is imported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shard has unassignedInfo if it is in state "initializing".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_state
might be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not change AllocationStatus.value() to print "not_permitted"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to change how the AllocationStatus
is represented in the ClusterState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe convert it to an AllocationDecision and use the toString()
provided by that one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that will be fine, since every possible value of AllocationStatus
is covered by AllocationDecision
, and this rendering for UnassignedInfo
is specific to the explain API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why catch this and swallow exception? I think it's ok if the API does not work if that bit is not implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nicer to have:
StringBuilder buf = new StringBuilder();
if (usedIndexSetting) {
buf.append("index setting [");
buf.append(INDEX_ROUTING_ALLOCATION_ENABLE_SETTING.getKey());
} else {
buf.append("cluster setting [");
buf.append(CLUSTER_ROUTING_ALLOCATION_ENABLE_SETTING.getKey());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, other deciders have same issue with messages (e.g. filter allocation decider). Let's have a separate PR to fix these messages to make clear that they refer to index / cluster settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleaner to just use
AsyncShardFetch<TransportNodesListGatewayStartedShards.NodeGatewayStartedShards> fetch = asyncFetchStarted.computeIfAbsent(shard.shardId(),
shardId -> new InternalAsyncFetch<>(logger, "shard_started", shard.shardId(), startedAction));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does Clint think about this? I also wonder if we should only enable this when ?human
is not explicitly set on the request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clintongormley what do you think about setting the human
readable flag for the REST API here? it only affects the time fields (e.g. remaining_delay
vs remaining_delay_in_millis
for the delayed allocation scenario) and byte fields for shard stores. Should the user just be required to put ?human
to get the more user friendly outputs similar to the other APIs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I was advocating for was to enable the human flag by default, but if someone sets human = false to respect that as well...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, yes that makes sense. @clintongormley should the human
flag be enabled by default for the explain API? For example, the GET indices API takes the human flag but the default is false
in that case. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everywhere that we support the human
flag, it defaults to false
. Even though this API will primarily consumed by humans, I think I'd prefer to keep the consistency of defaulting to false
. remaining_delay_in_millis
is still quite readable for a human, and rerunning the request with ?human
is easy enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comes for free if you use auto-managed min-master-nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
@ywelsch I pushed 02dc012353ab8325ac9036fd3e2b728cf1339842 to address your initial comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm not sure how we should handle initializing / relocating shards. Giving the same output as for started shards is highly confusing (e.g. SameShardAllocator). There are also no tests for these 2 scenarios.
- The API should support selecting a specific shard if there are multiple replica shards that are allocated. One possibility would be to pass the node id of the node the selected shard is allocated to. This will help answer the question "Why is that replica shard allocated to that node".
- The docs also require updates with the changed output format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should have an extra flag to set "explain mode".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, but felt that there really is no concrete distinction between "debug" mode and "explain" mode in this case. Would changing the naming of the enum values help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the change regarding relocated/initializing shard I think it's fine now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this throw an UnsupportedOperationException?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it would help to catch incorrect usages, as getExplanation()
should never be called on Multi
I am not sure myself. Thinking about it some more, I prefer if we don't try to explain initializing / relocating shards. It doesn't make sense in context - an initializing shard just got assigned, so there is nothing to report about in terms of "why are my shards unassigned" and whichever node it got assigned to is either (1) the most optimal node from a cluster balance perspective or (2) it had a copy of the data. In any case, nothing to really explain. In the case of relocating shards, the same thing - we are relocating due to a better balance or being required to move, so there is nothing new to explain that the allocators didn't already take into account by starting the relocation process. I think if we returned a
This is a good idea. I'll add this.
Yes, I was going to do this in a follow up PR. |
I pushed 56573f99d6518996d98eefa6b7a5a56288e218d1 to add the option to specify a node id for an assigned replica shard to explain. |
Sounds reasonable to me |
@ywelsch I pushed 7c960c4f81ae2819834d1d3367c5217e881aa395 to not try to explain initializing or relocating shards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why default to false? This API is to be consumed primarily by humans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per Clint's comment here #22182 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClinT_T
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this variable is only used in one place. Just put it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was a mistake, i meant to use it in the line below as well, so I changed the line below to use it (its used twice now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe swap the branches, no need to have an == false
condition here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just make NOT_TAKEN public static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_node
might suffice (see reroute API)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return null as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if replicaNode is specified, we have to check that a shard on that node can be found. Also I think we should use DiscoveryNodes.resolveNode
so that the user can specify the node by name or id or ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code already fails if a replica shard on the specified node cannot be found. foundShard
will be null so we will throw an exception below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the change regarding relocated/initializing shard I think it's fine now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more comments.
@ywelsch I pushed 9a5134af5c92120ed362efc4c156c060d9a9f2e6 and e0d6df724b6d489129bbc4d0e382cf16e8114b98 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor comments but looks good o.w.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IllegalStateException?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just let it bubble up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not true anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this also include this.currentNode == null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not true anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer relocationTargetNode
to relocatingTargetNode
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClinT_T
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testInitializingOrRelocatingShardExplanation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_state?
81c16f5
to
02213ed
Compare
@ywelsch thanks for your review! I pushed another commit to address the last round of feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @abeyad
retest this please |
1 similar comment
retest this please |
9648ff4
to
0089cc1
Compare
retest this please |
1 similar comment
retest this please |
This commit changes the REST and transport layers to use the allocation deciders refactoring and new formats for presenting the allocation explanations. The API is changed from both the Java API layer as well as the REST layer. This is a breaking change from 5.0.x and 5.1.x explain API.
the coordinating node and master node differ in version numbers.
aa5aae2
to
4e7bcff
Compare
"cluster shard allocation explanation test": | ||
- skip: | ||
version: " - 5.2.0" | ||
reason: allocation explain api format is different in versions <= 5.2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean 5.1.99? The explain format is different in versions < 5.2.0?
try { | ||
orig = serialize(version, streamable); | ||
} catch (IllegalArgumentException e) { | ||
} catch (IllegalArgumentException | IllegalStateException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use IllegalArgumentException for checkVersion functionality as well. It fits better than IllegalStateException I think (the node does not need to look at current state to reject serializing the message but just at the arguments).
|
||
@Override | ||
public void writeTo(StreamOutput out) throws IOException { | ||
checkVersion(out.getVersion()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can have a REST test that checks this exception. This could be done with a skip section of the form 5.2.0 -
so that the test is only active in BWC tests where the minimum es version is < 5.2.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with doing this is that such a test will sometimes pass (having triggered the IllegalStateException
) and sometimes it won't. In particular, if the REST request is routed directly to the master node, there will be no exception. The exception will only occur if the REST request is routed to the non-master node which is on a different version from master (and not backward compatible w.r.t. the explain API), in which case the exception will be triggered when routed to the master node.
I did manually test these scenarios. We can discuss options tomorrow.
retest this please |
…cation decisions (#22182) This PR completes the refactoring of the cluster allocation explain API and improves it in the following two high-level ways: 1. The explain API now uses the same allocators that the AllocationService uses to make shard allocation decisions. Prior to this PR, the explain API would run the deciders against each node for the shard in question, but this was not executed on the same code path as the allocators, and many of the scenarios in shard allocation were not captured due to not executing through the same code paths as the allocators. 2. The APIs have changed, both on the Java and JSON level, to accurately capture the decisions made by the system. The APIs also now report on shard moving and rebalancing decisions, whereas the previous API did not report decisions for moving shards which cannot remain on their current node or rebalancing shards to form a more balanced cluster. Note: this change affects plugin developers who may have a custom implementation of the ShardsAllocator interface. The method weighShards has been removed and no longer has any utility. In order to support the new explain API, however, a custom implementation of ShardsAllocator must now implement ShardAllocationDecision decideShardAllocation(ShardRouting shard, RoutingAllocation allocation) which provides a decision and explanation for allocating a single shard. For implementations that do not support explaining a single shard allocation via the cluster allocation explain API, this method can simply return an UnsupportedOperationException.
5.x commit: 7c879ef |
Removes ensureStableCluster() calls in ClusterAllocationExplainIT testing after startNode*() calls. The stabilization checks were originally necessary when the tests were added in #22182. However, #49248 later enhanced validateClusterFormed(), which startNode*() implicitly calls, to perform similar checks to ensureStableCluster()
This PR completes the refactoring of the cluster allocation explain API and improves it in the following two high-level ways:
The explain API now uses the same allocators that the
AllocationService
uses to make shard allocation decisions. Prior to this PR, the explain API would run the deciders against each node for the shard in question, but this was not executed on the same code path as the allocators, and many of the scenarios in shard allocation were not captured due to not executing through the same code paths as the allocators.The APIs have changed, both on the Java and JSON level, to accurately capture the decisions made by the system. The APIs also now report on shard moving and rebalancing decisions, whereas the previous API did not report decisions for moving shards which cannot remain on their current node or rebalancing shards to form a more balanced cluster.
Note: this change affects plugin developers who may have a custom implementation of the
ShardsAllocator
interface. The methodweighShards
has been removed and no longer has any utility. In order to support the new explain API, however, a custom implementation ofShardsAllocator
must now implementShardAllocationDecision decideShardAllocation(ShardRouting shard, RoutingAllocation allocation)
which provides a decision and explanation for allocating a single shard. For implementations that do not support explaining a single shard allocation via the cluster allocation explain API, this method can simply return anUnsupportedOperationException
.Relates #21691 #21662 #21103 #20634 #20347