
Conversation

@DiannaHohensee (Contributor) commented Sep 4, 2025

Closes ES-12716

I've left some TODOs pointing to tickets I've filed as a result of code exploration, particularly to test some balancer changes I'm making for canRemain response handling.

In case it makes reviewing the test file changes easier, I created a branch without the new test, so you can see the test refactor changes separately from the new test -- the new test makes the diff harder to read, unfortunately.

@DiannaHohensee self-assigned this Sep 4, 2025
@DiannaHohensee added the >non-issue, :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.2.0 labels (and removed v9.2.0) Sep 4, 2025
@DiannaHohensee force-pushed the 2025/09/02/ES-12716-balancer branch from 2d70a0f to 5c04938 on September 9, 2025 19:36
@DiannaHohensee marked this pull request as ready for review September 9, 2025 19:59
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

  RoutingNode routingNode = sourceNode.getRoutingNode();
  Decision canRemain = allocation.deciders().canRemain(shardRouting, routingNode, allocation);
- if (canRemain.type() != Decision.Type.NO) {
+ if (canRemain.type() != Decision.Type.NO && canRemain.type() != Decision.Type.NOT_PREFERRED) {
@DiannaHohensee (Contributor, Author) commented Sep 9, 2025:

Originally this was an early return whenever canRemain was not NO. Excluding NOT_PREFERRED from the early return means we'll go on to try to move the shard.

@nicktindall (Contributor) replied:

I wonder if we want to keep this as-is, pending the introduction of the proposed moveNotPreferred phase. For example, if a node is hot-spotting and all its shards are returning NOT_PREFERRED, we probably want to delay dealing with those until moveNotPreferred, when we'll move them in preferential order.

I have a PR for moveNotPreferred which I'll put up for review shortly to get feedback.

@DiannaHohensee (Contributor, Author) replied:

All of our work is feature-gated, so in that respect I'm not worried about waiting for other code first. I can't test without this change: I've got the canRemain work done and I'm waiting on this PR so I can rebase before publishing that work. Once your feature is in place, however it turns out, you'll also be able to take advantage of this testing/functionality, which might be appealing. This logic can be changed easily since it's a single line of code.

If you're comfortable with that, I'd like to go ahead with getting the dumb case (of picking any shard) working, so we don't bottleneck work. I was actually expecting moveNotPreferred to run before moveShards. In that case, though, we would not actually exercise this check. We might even turn this into an assert that not-preferred never occurs.
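
To make the assert idea concrete, a rough sketch (the helper name and surrounding structure are hypothetical; only the canRemain call is taken from the diff above):

    // Hypothetical sketch: if moveNotPreferred ran before moveShards, moveShards could
    // assert that it never sees a NOT_PREFERRED canRemain answer instead of special-casing it.
    private boolean shardMustMove(ShardRouting shardRouting, RoutingNode routingNode, RoutingAllocation allocation) {
        Decision canRemain = allocation.deciders().canRemain(shardRouting, routingNode, allocation);
        assert canRemain.type() != Decision.Type.NOT_PREFERRED
            : "NOT_PREFERRED shards should already have been handled by moveNotPreferred";
        return canRemain.type() == Decision.Type.NO;
    }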

@mhl-b (Contributor) left a review:

LGTM with nits

Comment on lines 833 to 837
} else if (bestDecision == Type.NOT_PREFERRED) {
assert remainDecision.type() != Type.NOT_PREFERRED;
assert bestDecision != Type.YES;
// If we don't ever find a YES decision, we'll settle for NOT_PREFERRED as preferable to NO.
targetNode = target;
@mhl-b (Contributor):

If you change Type bestDecision = Type.NO; to Type bestDecision = remainDecision.type(); then if (allocationDecision.type().higherThan(bestDecision)) { will work for all cases without new if conditions.

When both source and target are NOT_PREFERRED, allocationDecision.type().higherThan(bestDecision) returns false, so the first condition -- if (allocationDecision.type() == Type.NOT_PREFERRED && remainDecision.type() == Type.NOT_PREFERRED) { -- is redundant.

Basically only a one-line change is needed: Type bestDecision = remainDecision.type();
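
A rough sketch of the suggested change (loop structure and names approximate the surrounding method rather than reproduce it):

    // Seed bestDecision with the canRemain result so higherThan() only accepts targets that are
    // strictly better than staying put; the extra NOT_PREFERRED branch then becomes unnecessary.
    private RoutingNode pickBetterNode(ShardRouting shardRouting, Decision remainDecision,
                                       Iterable<RoutingNode> candidateNodes, RoutingAllocation allocation) {
        Decision.Type bestDecision = remainDecision.type();   // was: Type bestDecision = Type.NO;
        RoutingNode targetNode = null;
        for (RoutingNode target : candidateNodes) {
            Decision allocationDecision = allocation.deciders().canAllocate(shardRouting, target, allocation);
            if (allocationDecision.type().higherThan(bestDecision)) {
                bestDecision = allocationDecision.type();
                targetNode = target;
            }
        }
        return targetNode;   // null means no candidate beats the current node
    }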

@DiannaHohensee (Contributor, Author) replied:

That is clever :) I'd worry about clarity, though, with implicit logic making the code fragile.

IIUC, the method could then return NOT_PREFERRED as the bestDecision, which gets treated as YES later in the code but with a null target node. Or canRemain could be NOT_PREFERRED while all the target options are NO, and we'd return NOT_PREFERRED, which could be overridden (except there's no target node). These may or may not have consequences; I can't say without experimentation. But my primary concern is too much implicit logic that's easy to break in the future.

@mhl-b (Contributor) replied:

I think that's how it was supposed to be in the first place, before your change; then you wouldn't need to change anything. The whole method simply asks whether there is a better decision than the current one (remainDecision). It should start with best = remainDecision, not NO.

It's a nit comment. I don't find the explicitness here helps with clarity; rather it's a redundancy that takes a few extra moments to understand what's going on and why we need the extra ifs.

Comment on lines 101 to 107
Strings.format(
"Shard [%s] in index [%s] can be assigned to node [%s]. The node's utilization would become [%s]",
shardRouting.shardId(),
shardRouting.index(),
node.nodeId(),
newWriteThreadPoolUtilization
)
@mhl-b (Contributor):

Other deciders tend to use String explanation = Strings.format(...) and pass it to both the logger and allocation.decision -- DRY :D
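
For illustration, a minimal sketch of that pattern (logger and NAME are assumed from the usual decider layout, not quoted from this PR):

    // Build the explanation once, then reuse it for both the log line and the returned decision.
    String explanation = Strings.format(
        "Shard [%s] in index [%s] can be assigned to node [%s]. The node's utilization would become [%s]",
        shardRouting.shardId(),
        shardRouting.index(),
        node.nodeId(),
        newWriteThreadPoolUtilization
    );
    logger.trace(explanation);
    return allocation.decision(Decision.YES, NAME, explanation);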

@DiannaHohensee (Contributor, Author) replied:

Yep, done 👍

@DiannaHohensee (Contributor, Author) left a review:

Thanks for the review. Updated.


@nicktindall (Contributor) left a review:

Looking good; my only real concern is the treatment of NOT_PREFERRED in canMove. I will tidy up what I have for moveNotPreferred and put it up today to add context to my comment.


if (canRemainDecision.type() != Decision.Type.NO && canRemainDecision.type() != Decision.Type.NOT_PREFERRED) {
// If movement is throttled, a future reconciliation round will see a resolution. For now, leave it alone.
// Reconciliation treats canRemain NOT_PREFERRED answers as YES because the DesiredBalance computation already decided
// how to handle the situation.
@nicktindall (Contributor):

This will mean we move NO and NOT_PREFERRED shards with the same priority in the reconciler. Not obviously wrong to me, but I wonder if we want to do NO first then NOT_PREFERRED? Seems easier to treat them as the same for now.

@DiannaHohensee (Contributor, Author) replied:

I was actually thinking the other way: prioritizing moves that address hot-spots first would be more ideal, since that addresses a performance problem. Though actually, that would deprioritize shutdown moves... but then again, a timeout during shutdown is often because something other than allocation is going on.

This bit of code, though, doesn't control ordering -- that would have to be a new feature that organizes shard selection based on NO vs NOT_PREFERRED, probably hard -- rather, it's an early exit if canRemain says YES or THROTTLE.

But yeah, perhaps we'll see a motivation later for something fancier.

@nicktindall (Contributor) replied:

It does control ordering in that shards moved in this phase will consume the limited incoming/outgoing recovery slots, right? So shards eligible for movement in this phase will be prioritised ahead of undesired allocations that are eligible for movement only in the balance() phase.

That said, it probably makes sense to prioritise NOT_PREFERRED moves ahead of merely undesired allocations.

* off of Node1 while Node2 and Node3 are hot-spotting, resulting in overriding not-preferred and relocating shards to Node2 and Node3.
*/
public void testShardsAreAssignedToNotPreferredWhenAlternativeIsNo() {
TestHarness harness = setUpThreeTestNodesAndAllIndexShardsOnFirstNode();
@nicktindall (Contributor):

This test feels more like a test for the balancer logic than for the write load constraint decider specifically. I wonder if it would be less verbose to do this by putting in dummy deciders whose canRemain/canAllocate values you can directly control, rather than creating the conditions where WriteLoadConstraintDecider returns NO and NOT_PREFERRED? Perhaps there are already similar tests for the existing balancer logic?

I might have missed something though.

@DiannaHohensee (Contributor, Author) replied:

> This test feels more like a test for the balancer logic than the write load constraint decider specifically.

Hmm, I agree with you. Though this is the easiest place for me to write the test, since I have the shared setup logic already: it'd be duplicative otherwise. I'm inclined to let it slide because we typically aren't that concerned where we test features, and this behavior is very relevant to this decider. Perhaps I've phrased the test as more generic than it actually is: there's a good deal of write load decider specifics.

> I wonder if it would be less verbose to do this by putting in dummy deciders which you can directly control the canRemain/canAllocate values for, rather than creating the conditions where WriteLoadConstraintDecider returns NO and NOT_PREFERRED? Perhaps there are already similar tests for testing the existing balancer logic?

This does test the write load decider decisions in particular with the thresholds and such. So my first thought is that we would lose WriteLoadDecider test coverage by using dummy deciders. Perhaps what you suggest with dummy deciders would be a fitting balancer unit test, rather than a decider integration test.
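
For reference, a rough sketch of the dummy-decider idea for a balancer-level unit test (the node IDs, per-node answers, and the exact AllocationDecider method signatures here are assumptions for illustration, not the real test or API):

    // A test-only decider with fixed canRemain/canAllocate answers per node, so a balancer test
    // can drive NO vs NOT_PREFERRED directly instead of constructing write-load conditions
    // that make WriteLoadConstraintDecider produce them.
    AllocationDecider fixedDecider = new AllocationDecider() {
        @Override
        public Decision canRemain(IndexMetadata indexMetadata, ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
            // Pretend node-1 is hot-spotting: its shards would prefer to move off.
            return "node-1".equals(node.nodeId())
                ? Decision.single(Decision.Type.NOT_PREFERRED, "test", "node-1 is hot-spotting")
                : Decision.YES;
        }

        @Override
        public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
            // node-2 refuses new shards outright; node-3 accepts them only reluctantly.
            return "node-2".equals(node.nodeId())
                ? Decision.NO
                : Decision.single(Decision.Type.NOT_PREFERRED, "test", "node-3 is hot-spotting");
        }
    };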

@DiannaHohensee (Contributor, Author) left a review:

Thanks for the review, Nick. I've responded to the comment threads, let me know what you think.


@nicktindall (Contributor) left a review:

LGTM


@DiannaHohensee merged commit 6f96ea3 into elastic:main Sep 23, 2025 (34 checks passed)