
Conversation

@DiannaHohensee (Contributor) commented Oct 28, 2025

Adds a NOT_PREFERRED option to numerous allocation decision / status types.

Adds a NOT_PREFERRED option to AllocationDecision (resolving ES-12729).

A significant change is to re-order comparison of Decision.Type enum
values, such that THROTTLE is chosen over NOT_PREFERRED. Functionally
this change should not matter because simulation (DesiredBalance
computation) does not throttle and reconciliation (real shard movement)
treats not-preferred essentially as a YES: they are not compared.

Closes ES-12833, ES-13288, ES-12729
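
For illustration, a minimal sketch of the ordering described above (not the actual Elasticsearch source; the ordinal-based higherThan shown here is an assumption made only for this sketch):

// Sketch only: NO < NOT_PREFERRED < THROTTLE < YES, per the description above.
public enum Type {
    NO,
    NOT_PREFERRED,
    THROTTLE,
    YES;

    // Assumes "higher" follows declaration order; the real implementation may differ.
    public boolean higherThan(Type other) {
        return this.ordinal() > other.ordinal();
    }
}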

@DiannaHohensee DiannaHohensee self-assigned this Oct 28, 2025
@DiannaHohensee DiannaHohensee force-pushed the 2025/10/23/ES-12833 branch 4 times, most recently from b03f11f to be58da7 on October 30, 2025 20:21
@DiannaHohensee DiannaHohensee force-pushed the 2025/10/23/ES-12833 branch 6 times, most recently from 13322ec to 17de33c on November 8, 2025 02:07
Adds numerous NOT_PREFERRED options
to allocation decision / status types.

Adds NOT_PREFERRED option to
AllocationDecision (resolving ES-12729).

Closes ES-12833, ES-13288, ES-12729
@DiannaHohensee DiannaHohensee changed the title from "[DRAFT] Return NOT_PREFERRED decisions in allocation explain" to "Return NOT_PREFERRED decisions in allocation explain" Nov 8, 2025
@DiannaHohensee DiannaHohensee marked this pull request as ready for review November 8, 2025 02:28
@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Nov 8, 2025
@DiannaHohensee DiannaHohensee added the >non-issue, :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes), :Distributed Coordination/Distributed (A catch all label for anything in the Distributed Coordination area. Please avoid if you can.), and Team:Distributed Coordination (Meta label for Distributed Coordination team) labels, and removed the needs:triage (Requires assignment of a team area label) and :Distributed Coordination/Distributed labels Nov 8, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DiannaHohensee (Contributor Author) commented Nov 8, 2025

Adding @DaveCTurner as a reviewer in case he wants to take a high-level look for anything I'm missing; I saw in git-blame that he wrote the Explanations.java file a while ago. But it's optional.

@DiannaHohensee (Contributor Author) commented Nov 8, 2025

I think I might be missing test coverage (if not functionality) of the allocate-unassigned code path. However, I've been hacking on the rebalancing code path for a while now to get this all to work sensibly, so, if agreeable, I might just file another ticket to explore that.

.getShardAllocationDecision()
.getMoveDecision()
.getNodeDecisions();
assertThat(canAllocateDecisions.size(), equalTo(2));
Contributor:

Can we give 2 a name somewhere? I assume it's {number of cluster nodes} - 1. Perhaps we should make our expectedNumberOfClusterNodes explicit and assert that it's accurate after we run the other test logic?

Contributor Author:

Yep, updated to indicate purpose and make dynamic 👍
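
Something along these lines would make the expectation explicit (a sketch with hypothetical names; the actual test may derive the node count differently):

// Hypothetical sketch: derive the expected count from the cluster instead of hard-coding 2.
final int expectedNumberOfClusterNodes = internalCluster().numDataNodes();
// The shard's current node is not a move target, hence the -1.
assertThat(canAllocateDecisions.size(), equalTo(expectedNumberOfClusterNodes - 1));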

TimeValue.timeValueMillis(queueLatencyThresholdMillis)
)
// Keep all the debug logging, no throttling of decider messages.
.put(WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_MINIMUM_LOGGING_INTERVAL.getKey(), TimeValue.timeValueMinutes(0))
Contributor:

Nit: is this necessary? I don't think we're asserting on these messages anywhere, are we? Could also use TimeValue.ZERO.

Contributor Author:

I no longer recall why I added this. I think it was perhaps for general debug purposes, in case of a test failure.

There don't appear to be any tests in this file with *Decider DEBUG logging turned on, so I'll remove it 👍

int shard,
boolean primary,
@Nullable String currentNode
) {
Contributor:

Nit: I don't think this is necessary; you can call

ClusterAllocationExplainRequest allocationExplainRequest =
    new ClusterAllocationExplainRequest(TEST_REQUEST_TIMEOUT)
        .setIndex(harness.indexName)
        .setShard(0)
        .setPrimary(true);

Contributor Author:

Ah, thank you! Reverted.

}
if (allocationDecision.type().higherThan(bestDecision)) {
assert allocationDecision.type() != Type.THROTTLE
: "DesiredBalance computations run in a simulation mode and should not encounter throttling";
Contributor:

Sorry, how do we know we're simulating here?

Contributor Author:

I added this as extra protection because I changed the Decision.Type enum ordering, such that the higherThan gate above would now prefer (return true for) THROTTLE over NOT_PREFERRED.

Because this is simulation, we should never see THROTTLE.

assert decider.canRebalance(shardRouting, allocation).type() != Decision.Type.THROTTLE
: decider.getClass().getSimpleName() + " throttled unexpectedly in canRebalance";
return decider.canRebalance(shardRouting, allocation);
}, (decider, decision) -> Strings.format("Can not rebalance [%s]. [%s]: %s", shardRouting, decider, decision));
Contributor:

ConcurrentRebalanceAllocationDecider appears to return THROTTLE from canRebalance?

Nit: I'd prefer if we stored the result of canRebalance and ran the assertion on it, rather than running canRebalance twice.
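
A sketch of the suggested restructuring, with behavior unchanged apart from avoiding the duplicate call:

// Store the decision once, assert on it, then return it.
Decision canRebalanceDecision = decider.canRebalance(shardRouting, allocation);
assert canRebalanceDecision.type() != Decision.Type.THROTTLE
    : decider.getClass().getSimpleName() + " throttled unexpectedly in canRebalance";
return canRebalanceDecision;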

Contributor Author:

You're right, reverted 👍

NOT_PREFERRED,
// Temporarily throttled is a better choice than choosing a not-preferred node,
// but NOT_PREFERRED and THROTTLE are generally not comparable.
THROTTLE,
Contributor:

I don't think we can do this, can we? It'll mean in the presence of a THROTTLE and a NOT_PREFERRED, canAllocate will return NOT_PREFERRED, which I think is bad for non-desired-balance allocation, because there we should wait for a THROTTLE to become a YES before allocating to a NOT_PREFERRED.

Contributor Author:

I don't believe there's currently any direct comparison between THROTTLE and NOT_PREFERRED. But we still need to specify an ordering here, to not break how one or the other gets compared to YES and NO.

"It'll mean in the presence of a THROTTLE and a NOT_PREFERRED, canAllocate will return NOT_PREFERRED"

I think it would be the other way around: THROTTLE would be the higher value, closer to YES, and chosen over NOT_PREFERRED. So, IIUC, we both want it the current way.

I've a little blurb for the commit message:

"A significant change is to re-order comparison of Decision.Type enum
values, such that THROTTLE is chosen over NOT_PREFERRED. Functionally
this change should not matter because simulation (DesiredBalance
computation) does not throttle and reconciliation (real shard movement)
treats not-preferred essentially as a YES: they are not compared."
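
Concretely, under the ordering quoted above, pairwise comparisons with the higherThan method seen in the diff would behave like this (a sketch, assuming the ordering is exactly as described):

// Illustrative only: what the re-ordering implies for pairwise comparisons.
assert Decision.Type.YES.higherThan(Decision.Type.THROTTLE);
assert Decision.Type.THROTTLE.higherThan(Decision.Type.NOT_PREFERRED);
assert Decision.Type.NOT_PREFERRED.higherThan(Decision.Type.NO);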

@Override
public String getExplanation() {
checkDecisionState();
return switch (getAllocationDecision()) {
Contributor:

Looking at AllocateUnassignedDecision#getAllocationDecision, it appears as though it'll never return NOT_PREFERRED; see AllocationDecision#fromAllocationStatus.

Contributor Author:

You're right, fixed 👍

public static final String NOT_PREFERRED = """
Elasticsearch will not rebalance this shard to another node because all other eligible nodes have high resource usage. The \
total cluster balance weights might improve, were the shard relocated, but it would push one resource usage dimension \
too high and threaten performance. See the node-by-node explanation to understand what resource usage is high.""";
Contributor:

This and the above description sound over-fitted to the write-load decider. The index balance decider will also return NOT_PREFERRED, right? And that has nothing to do with resources?

Contributor Author:

The index balance decider is also interested in resources, in the sense that a spike in an index's ingest would be more safely distributed across as many nodes as possible. Those are the lines along which I was thinking.

I've tweaked the phrasing a little, but I'm not sure it's as much as desirable. If you have any suggestions, let me know.

}
// TODO (ES-13482): clusterRebalanceDecision is set to the result of AllocationDecider#canRebalance, which does not return
// NOT_PREFERRED or THROTTLE. This switch statement, and how MoveDecision uses clusterRebalanceDecision, should be
// refactored.
Contributor:

I think it still does? See org/elasticsearch/cluster/routing/allocation/decider/ConcurrentRebalanceAllocationDecider.java:182.

Contributor Author:

Looks like Simon committed some changes concurrently with my PR.

But I think I was wrong, anyway: I didn't realize the complexities of canRebalance.

Removed 👍 Thanks!

@DiannaHohensee DiannaHohensee removed the request for review from DaveCTurner November 20, 2025 19:58
@DiannaHohensee (Contributor Author) left a comment:

Thanks for the review, I'm finally returning to this. Fixed up the code per your feedback.
