
Conversation

@DiannaHohensee (Contributor) commented Sep 4, 2025

Closes ES-12716

I've left some TODOs pointing to tickets I've filed as a result of code exploration, particularly to test some balancer changes I'm making for canRemain response handling.

In case it makes reviewing the test file changes easier, I created a branch without the new test, so you can see the test refactor changes separately from the new test -- the new test makes the diff harder to read, unfortunately.

@DiannaHohensee self-assigned this Sep 4, 2025
@DiannaHohensee added the >non-issue, :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.2.0 labels (and removed v9.2.0) Sep 4, 2025
@DiannaHohensee force-pushed the 2025/09/02/ES-12716-balancer branch from 2d70a0f to 5c04938 on September 9, 2025 19:36
@DiannaHohensee marked this pull request as ready for review September 9, 2025 19:59
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

  RoutingNode routingNode = sourceNode.getRoutingNode();
  Decision canRemain = allocation.deciders().canRemain(shardRouting, routingNode, allocation);
- if (canRemain.type() != Decision.Type.NO) {
+ if (canRemain.type() != Decision.Type.NO && canRemain.type() != Decision.Type.NOT_PREFERRED) {
@DiannaHohensee (Contributor, Author) commented Sep 9, 2025:

Originally this was an early return whenever canRemain was not NO. Excluding NOT_PREFERRED from the early return means we'll go on to try to move the shard.

@nicktindall (Contributor) replied:

I wonder if we want to keep this as-is, pending the introduction of the proposed moveNotPreferred phase. For example, if a node is hot-spotting and all its shards are returning NOT_PREFERRED, we probably want to delay dealing with those until moveNotPreferred, when we'll move them in preferential order.

I have a PR for moveNotPreferred which I'll put up for review shortly to get feedback.

@DiannaHohensee (Contributor, Author) replied:

All of our work is feature-gated, so in that respect I'm not worried about waiting for other code first. I can't test without this change: I've got the canRemain work done and I'm waiting on this PR so I can rebase before publishing that work. Once your feature is in place, however it turns out, you'll also be able to take advantage of this testing/functionality, which might be appealing. This logic can be changed easily since it's a single line of code.

If you're comfortable with that, I'd like to go ahead with getting the dumb case (of picking any shard) working, so we don't bottleneck work. I was actually expecting moveNotPreferred to run before moveShards. In that case, though, we would not actually exercise this check. We might even turn this into an assert that not-preferred never occurs.
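
To make the assert idea concrete, a rough sketch (the helper name and surrounding structure are hypothetical; only the canRemain call is taken from the diff above):

    // Hypothetical sketch: if moveNotPreferred ran before moveShards, moveShards could
    // assert that it never sees a NOT_PREFERRED canRemain answer instead of special-casing it.
    private boolean shardMustMove(ShardRouting shardRouting, RoutingNode routingNode, RoutingAllocation allocation) {
        Decision canRemain = allocation.deciders().canRemain(shardRouting, routingNode, allocation);
        assert canRemain.type() != Decision.Type.NOT_PREFERRED
            : "NOT_PREFERRED shards should already have been handled by moveNotPreferred";
        return canRemain.type() == Decision.Type.NO;
    }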

@mhl-b (Contributor) left a review:

LGTM with nits

Comment on lines 833 to 837
} else if (bestDecision == Type.NOT_PREFERRED) {
assert remainDecision.type() != Type.NOT_PREFERRED;
assert bestDecision != Type.YES;
// If we don't ever find a YES decision, we'll settle for NOT_PREFERRED as preferable to NO.
targetNode = target;
@mhl-b (Contributor):

If you change Type bestDecision = Type.NO; to Type bestDecision = remainDecision.type(); then if (allocationDecision.type().higherThan(bestDecision)) { will work for all cases without new if conditions.

When both source and target are NOT_PREFERRED, allocationDecision.type().higherThan(bestDecision) returns false, so the first condition -- if (allocationDecision.type() == Type.NOT_PREFERRED && remainDecision.type() == Type.NOT_PREFERRED) { -- is redundant.

Basically only a one-line change is needed: Type bestDecision = remainDecision.type();
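
A rough sketch of the suggested change (loop structure and names approximate the surrounding method rather than reproduce it):

    // Seed bestDecision with the canRemain result so higherThan() only accepts targets that are
    // strictly better than staying put; the extra NOT_PREFERRED branch then becomes unnecessary.
    private RoutingNode pickBetterNode(ShardRouting shardRouting, Decision remainDecision,
                                       Iterable<RoutingNode> candidateNodes, RoutingAllocation allocation) {
        Decision.Type bestDecision = remainDecision.type();   // was: Type bestDecision = Type.NO;
        RoutingNode targetNode = null;
        for (RoutingNode target : candidateNodes) {
            Decision allocationDecision = allocation.deciders().canAllocate(shardRouting, target, allocation);
            if (allocationDecision.type().higherThan(bestDecision)) {
                bestDecision = allocationDecision.type();
                targetNode = target;
            }
        }
        return targetNode;   // null means no candidate beats the current node
    }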

@DiannaHohensee (Contributor, Author) replied:

That is clever :) I'd worry about clarity, though, with implicit logic making the code fragile.

IIUC, the method could then return NOT_PREFERRED as the bestDecision, which gets treated as YES later in the code but with a null target node. Or canRemain could be NOT_PREFERRED while all the target options are NO, and we'd return NOT_PREFERRED, which could be overridden (except there's no target node). These may or may not have consequences; I can't say without experimentation. But my primary concern is too much implicit logic that's easy to break in the future.

@mhl-b (Contributor) replied:

I think that's how it was supposed to be in the first place, before your change; then you wouldn't need to change anything. The whole method simply asks whether there is a better decision than the current one (remainDecision). It should start with best = remainDecision, not NO.

It's a nit comment. I don't find the explicitness here helps with clarity; rather it's a redundancy that takes a few extra moments to understand what's going on and why we need the extra ifs.

Comment on lines 101 to 107
Strings.format(
"Shard [%s] in index [%s] can be assigned to node [%s]. The node's utilization would become [%s]",
shardRouting.shardId(),
shardRouting.index(),
node.nodeId(),
newWriteThreadPoolUtilization
)
@mhl-b (Contributor):

Other deciders tend to use String explanation = Strings.format(...) and pass it to both the logger and allocation.decision -- DRY :D
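
For illustration, a minimal sketch of that pattern (logger and NAME are assumed from the usual decider layout, not quoted from this PR):

    // Build the explanation once, then reuse it for both the log line and the returned decision.
    String explanation = Strings.format(
        "Shard [%s] in index [%s] can be assigned to node [%s]. The node's utilization would become [%s]",
        shardRouting.shardId(),
        shardRouting.index(),
        node.nodeId(),
        newWriteThreadPoolUtilization
    );
    logger.trace(explanation);
    return allocation.decision(Decision.YES, NAME, explanation);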

@DiannaHohensee (Contributor, Author) replied:

Yep, done 👍

@DiannaHohensee (Contributor, Author) left a review:

Thanks for the review. Updated.


@nicktindall (Contributor) left a review:

Looking good; my only real concern is the treatment of NOT_PREFERRED in canMove. I will tidy up what I have for moveNotPreferred and put it up today to add context to my comment.


if (canRemainDecision.type() != Decision.Type.NO && canRemainDecision.type() != Decision.Type.NOT_PREFERRED) {
// If movement is throttled, a future reconciliation round will see a resolution. For now, leave it alone.
// Reconciliation treats canRemain NOT_PREFERRED answers as YES because the DesiredBalance computation already decided
// how to handle the situation.
@nicktindall (Contributor):

This will mean we move NO and NOT_PREFERRED shards with the same priority in the reconciler. Not obviously wrong to me, but I wonder if we want to do NO first then NOT_PREFERRED? Seems easier to treat them as the same for now.

@DiannaHohensee (Contributor, Author) replied:

I was actually thinking the other way: prioritizing moves that address hot-spots first would be more ideal, since that addresses a performance problem. Though actually, that would deprioritize shutdown moves... but then again, a timeout during shutdown is often because something other than allocation is going on.

This bit of code, though, doesn't control ordering -- that would have to be a new feature that organizes shard selection based on NO vs NOT_PREFERRED, probably hard -- rather, it's an early exit if canRemain says YES or THROTTLE.

But yeah, perhaps we'll see a motivation later for something fancier.

@nicktindall (Contributor) replied:

It does control ordering in that shards moved in this phase will consume the limited incoming/outgoing recovery slots, right? So shards eligible for movement in this phase will be prioritised ahead of undesired allocations that are eligible for movement only in the balance() phase.

That said, it probably makes sense to prioritise NOT_PREFERRED moves ahead of merely undesired allocations.

* off of Node1 while Node2 and Node3 are hot-spotting, resulting in overriding not-preferred and relocating shards to Node2 and Node3.
*/
public void testShardsAreAssignedToNotPreferredWhenAlternativeIsNo() {
TestHarness harness = setUpThreeTestNodesAndAllIndexShardsOnFirstNode();
@nicktindall (Contributor):

This test feels more like a test for the balancer logic than for the write load constraint decider specifically. I wonder if it would be less verbose to do this by putting in dummy deciders whose canRemain/canAllocate values you can directly control, rather than creating the conditions where WriteLoadConstraintDecider returns NO and NOT_PREFERRED? Perhaps there are already similar tests for the existing balancer logic?

I might have missed something though.

@DiannaHohensee (Contributor, Author) replied:

> This test feels more like a test for the balancer logic than the write load constraint decider specifically.

Hmm, I agree with you. Though this is the easiest place for me to write the test, since I have the shared setup logic already: it'd be duplicative otherwise. I'm inclined to let it slide because we typically aren't that concerned where we test features, and this behavior is very relevant to this decider. Perhaps I've phrased the test as more generic than it actually is: there's a good deal of write load decider specifics.

> I wonder if it would be less verbose to do this by putting in dummy deciders which you can directly control the canRemain/canAllocate values for, rather than creating the conditions where WriteLoadConstraintDecider returns NO and NOT_PREFERRED? Perhaps there are already similar tests for testing the existing balancer logic?

This does test the write load decider decisions in particular with the thresholds and such. So my first thought is that we would lose WriteLoadDecider test coverage by using dummy deciders. Perhaps what you suggest with dummy deciders would be a fitting balancer unit test, rather than a decider integration test.
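
For reference, a rough sketch of the dummy-decider idea for a balancer-level unit test (the node IDs, per-node answers, and the exact AllocationDecider method signatures here are assumptions for illustration, not the real test or API):

    // A test-only decider with fixed canRemain/canAllocate answers per node, so a balancer test
    // can drive NO vs NOT_PREFERRED directly instead of constructing write-load conditions
    // that make WriteLoadConstraintDecider produce them.
    AllocationDecider fixedDecider = new AllocationDecider() {
        @Override
        public Decision canRemain(IndexMetadata indexMetadata, ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
            // Pretend node-1 is hot-spotting: its shards would prefer to move off.
            return "node-1".equals(node.nodeId())
                ? Decision.single(Decision.Type.NOT_PREFERRED, "test", "node-1 is hot-spotting")
                : Decision.YES;
        }

        @Override
        public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
            // node-2 refuses new shards outright; node-3 accepts them only reluctantly.
            return "node-2".equals(node.nodeId())
                ? Decision.NO
                : Decision.single(Decision.Type.NOT_PREFERRED, "test", "node-3 is hot-spotting");
        }
    };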

@DiannaHohensee (Contributor, Author) left a review:

Thanks for the review, Nick. I've responded to the comment threads, let me know what you think.


@nicktindall (Contributor) left a review:

LGTM


@DiannaHohensee merged commit 6f96ea3 into elastic:main Sep 23, 2025 (34 checks passed)