@nicktindall
Contributor

This might have been a hangover from before we did the work to prioritise shard movement. It's redundant now.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.3.0 labels Oct 31, 2025
@nicktindall nicktindall added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Oct 31, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Oct 31, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Oct 31, 2025
Member

@ywangd ywangd left a comment

LGTM

Comment on lines 236 to 245
assertEquals(
    "A shard without write load should remain on a node with queuing above the threshold",
    Decision.Type.YES,
    writeLoadDecider.canRemain(
        testHarness.clusterState.metadata().getProject().index(indexName),
        testHarness.shardRoutingNoWriteLoad,
        testHarness.aboveQueuingThresholdRoutingNode,
        testHarness.routingAllocation
    ).type()
);
Member

Do we have a test to show a shard without write load can still trigger NOT_PREFERRED?

Contributor Author

I will add one

Contributor Author

Added in 6fc953f

@ywangd
Member

ywangd commented Oct 31, 2025

I assume we have protection against a non-existent shard write load in BalancedShardAllocator, so that such a shard is either ignored or ranked at the bottom when picking a movement to mitigate a hotspot?

@nicktindall
Contributor Author

> I assume we have protection against a non-existent shard write load in BalancedShardAllocator, so that such a shard is either ignored or ranked at the bottom when picking a movement to mitigate a hotspot?

Yep:

final double lhsWriteLoad = shardWriteLoads.getOrDefault(lhs.shardId(), MISSING_WRITE_LOAD);
final double rhsWriteLoad = shardWriteLoads.getOrDefault(rhs.shardId(), MISSING_WRITE_LOAD);
// prefer any known write-load over any unknown write-load
final var rhsIsMissing = rhsWriteLoad == MISSING_WRITE_LOAD;
final var lhsIsMissing = lhsWriteLoad == MISSING_WRITE_LOAD;
if (rhsIsMissing && lhsIsMissing) {
    return 0;
}
if (rhsIsMissing ^ lhsIsMissing) {
    return lhsIsMissing ? 1 : -1;
}
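
For readers without the surrounding source, here is a minimal, self-contained sketch of the ordering that excerpt implies. It is not the actual BalancedShardAllocator code: the class name, the use of plain string shard IDs, the sentinel value of `MISSING_WRITE_LOAD`, and the "highest known write load first" tie-break among known loads are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class MissingWriteLoadOrdering {
    // Hypothetical sentinel meaning "no write load reported" (the real value is not shown in this thread).
    public static final double MISSING_WRITE_LOAD = -1.0;

    /** Builds a comparator that orders shards with a known write load before shards without one. */
    public static Comparator<String> byWriteLoad(Map<String, Double> shardWriteLoads) {
        return (lhs, rhs) -> {
            final double lhsWriteLoad = shardWriteLoads.getOrDefault(lhs, MISSING_WRITE_LOAD);
            final double rhsWriteLoad = shardWriteLoads.getOrDefault(rhs, MISSING_WRITE_LOAD);
            final boolean lhsIsMissing = lhsWriteLoad == MISSING_WRITE_LOAD;
            final boolean rhsIsMissing = rhsWriteLoad == MISSING_WRITE_LOAD;
            // Two unknown write loads are considered equal.
            if (lhsIsMissing && rhsIsMissing) {
                return 0;
            }
            // Exactly one side is unknown: the unknown side sorts last.
            if (lhsIsMissing ^ rhsIsMissing) {
                return lhsIsMissing ? 1 : -1;
            }
            // Assumption for this sketch: among known loads, the higher load sorts first.
            return Double.compare(rhsWriteLoad, lhsWriteLoad);
        };
    }

    /** Returns a sorted copy of the shard IDs, best movement candidate first. */
    public static List<String> rank(List<String> shardIds, Map<String, Double> shardWriteLoads) {
        final List<String> sorted = new ArrayList<>(shardIds);
        sorted.sort(byWriteLoad(shardWriteLoads));
        return sorted;
    }

    public static void main(String[] args) {
        // "shard-b" has no entry, so it is treated as having a missing write load.
        final Map<String, Double> loads = Map.of("shard-a", 0.5, "shard-c", 2.0);
        System.out.println(rank(List.of("shard-a", "shard-b", "shard-c"), loads));
        // prints [shard-c, shard-a, shard-b]
    }
}
```

The point of the sketch is only that a shard with an unknown write load never outranks one with a known write load, which is the protection being asked about.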

@nicktindall nicktindall merged commit c986867 into elastic:main Nov 1, 2025
34 checks passed
@nicktindall nicktindall deleted the dont_require_shard_write_load_canRemain branch November 1, 2025 04:06
@DiannaHohensee
Contributor

I don't understand why we'd want to remove this? Logically, if a shard has no write load, there's no benefit in moving it away. The decider seems more logically complete with the check.

The Allocator may not pass a null/0 value write load today, but there's no testing to protect against it in future, and then the behavior will be strange.

@ywangd
Member

ywangd commented Nov 5, 2025

> if a shard has no write load, there's no benefit in moving it away.

My reasoning is that deciding whether the shard can remain and picking exactly which shard to move are two separate things. The latter is done elsewhere, where shard write loads are already compared, so we don't need the check in canRemain. If, for some reason, all shards incorrectly report 0 write load, this change still lets us move one of them instead of taking no action.

> there's no testing to protect against it in future

There is some protection as Nick commented here. We could also add a test for it.
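
To make the two-step argument concrete, here is a rough Java sketch. It is not the Elasticsearch implementation: the class, the method names, the queue-latency parameters and the simple "highest write load wins" selection are hypothetical and exist only to show the separation between the can-remain decision and the choice of which shard to move.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class HotspotMitigationSketch {
    public enum Decision { YES, NOT_PREFERRED }

    // Step 1 (hypothetical): the can-remain decision looks only at the node's queuing,
    // not at the write load of the individual shard.
    public static Decision canRemain(double nodeQueueLatencyMillis, double thresholdMillis) {
        return nodeQueueLatencyMillis > thresholdMillis ? Decision.NOT_PREFERRED : Decision.YES;
    }

    // Step 2 (hypothetical): only when the node is hot-spotting, compare write loads to pick a shard to move.
    public static Optional<String> shardToMove(
        double nodeQueueLatencyMillis,
        double thresholdMillis,
        List<String> shardsOnNode,
        Map<String, Double> shardWriteLoads
    ) {
        if (canRemain(nodeQueueLatencyMillis, thresholdMillis) == Decision.YES) {
            return Optional.empty();
        }
        // Shards with no reported write load are treated as 0.0 in this sketch.
        return shardsOnNode.stream().max(Comparator.comparingDouble(s -> shardWriteLoads.getOrDefault(s, 0.0)));
    }

    public static void main(String[] args) {
        // Every shard reports 0 write load, yet the node is queuing above the threshold.
        final Map<String, Double> allZero = Map.of("shard-a", 0.0, "shard-b", 0.0);
        final Optional<String> candidate = shardToMove(500.0, 100.0, List.of("shard-a", "shard-b"), allZero);
        System.out.println("candidate present: " + candidate.isPresent());
    }
}
```

Under these assumptions a movement candidate is still produced when every shard reports 0 write load, whereas a per-shard write-load check inside `canRemain` would have left every shard in place.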


Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0

4 participants