Log NOT_PREFERRED shard movements #138069
Conversation
```diff
     """
     Node [%s] has a queue latency of [%d] millis that exceeds the queue latency threshold of [%s]. This node is \
-    hot-spotting. Current thread pool utilization [%f]. Moving shard(s) away.""",
+    hot-spotting. Current thread pool utilization [%f]. Shard write load [%s]. Moving shard(s) away.""",
```
Add shard write load into explanation, so we can see it when we log the movement
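As a plain-Java illustration (a hypothetical helper, not the decider's actual code), the expanded explanation could be built like this:

```java
// Sketch of building the hot-spot explanation, now including the shard write
// load. All parameter names and values are hypothetical stand-ins for the
// decider's real inputs.
public class HotSpotExplanation {
    static String explain(String nodeId, long queueLatencyMillis, String threshold,
                          double utilization, String shardWriteLoad) {
        return String.format(
            "Node [%s] has a queue latency of [%d] millis that exceeds the queue latency "
                + "threshold of [%s]. This node is hot-spotting. Current thread pool "
                + "utilization [%f]. Shard write load [%s]. Moving shard(s) away.",
            nodeId, queueLatencyMillis, threshold, utilization, shardWriteLoad);
    }

    public static void main(String[] args) {
        System.out.println(explain("node-1", 5000L, "2s", 0.95, "12.3"));
    }
}
```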
```java
        shardRouting,
        moveDecision.getCanRemainDecision().getExplanation()
    );
}
```
If we have debug logging turned on for the WriteLoadConstraintDecider the explanation will include the shard write load and the node utilisation.
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
```java
final var moveDecision = shardMoved ? decideMove(index, shardRouting) : storedShardMovement.moveDecision();
if (moveDecision.isDecisionTaken() && moveDecision.cannotRemainAndCanMove()) {
    if (notPreferredLogger.isDebugEnabled()) {
        logMoveNotPreferred.maybeExecute(
```
Can we do away with the throttled logging? Rather, just log everything.
Conceptually, we should only be picking the best shard to move away when a node is hot-spotting. We fix the hot-spot with one move, and then we don't have a hot-spot and this code doesn't run. So logging once per node per 30 seconds. I don't think it needs throttling?
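For context, the throttling under discussion behaves roughly like this minimal sketch (a hypothetical class, not Elasticsearch's actual helper): `maybeExecute` runs the wrapped action at most once per interval, so removing the cap means the action simply runs on every invocation.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a frequency-capped action (hypothetical, illustrative
// only): the action runs at most once per configured interval; calls inside
// the interval are silently skipped.
public class FrequencyCapped {
    private final long intervalNanos;
    private final AtomicLong nextAllowed = new AtomicLong(Long.MIN_VALUE);

    FrequencyCapped(long interval, TimeUnit unit) {
        this.intervalNanos = unit.toNanos(interval);
    }

    boolean maybeExecute(Runnable action) {
        long now = System.nanoTime();
        long next = nextAllowed.get();
        if (now >= next && nextAllowed.compareAndSet(next, now + intervalNanos)) {
            action.run();
            return true;
        }
        return false; // throttled: this invocation was skipped
    }

    public static void main(String[] args) {
        FrequencyCapped cap = new FrequencyCapped(30, TimeUnit.SECONDS);
        System.out.println(cap.maybeExecute(() -> System.out.println("logged")));
        System.out.println(cap.maybeExecute(() -> System.out.println("suppressed")));
    }
}
```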
OK, removed in 3ba31a0 🤞
```java
    "Moving shard [{}] from a NOT_PREFERRED allocation, explanation is [{}]",
    shardRouting,
    moveDecision.getCanRemainDecision().getExplanation()
)
```
I think it would be helpful to see where the shard is being moved (moveDecision), not only why it cannot remain. Then we can check whether the shard moved where intended, or got derailed later (either not moved at all, or moved to a different node, perhaps, than the original target).
A canAllocate YES will also give us information about the target node utilization, which might be interesting: "Shard [%s] in index [%s] can be assigned to node [%s]. The node's utilization would become [%s]"
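A sketch of what such a message could look like once the target node is included (hypothetical helper, plain Java):

```java
// Hypothetical sketch of including the move target in the log line, so a
// later log scan can check whether the shard actually landed on that node.
public class MoveLog {
    static String moveMessage(String shard, String targetNode, String canRemainExplanation) {
        return String.format(
            "Moving shard [%s] from a NOT_PREFERRED allocation to node [%s], explanation is [%s]",
            shard, targetNode, canRemainExplanation);
    }

    public static void main(String[] args) {
        System.out.println(moveMessage("[idx][0]", "node-2", "queue latency exceeded threshold"));
    }
}
```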
Good call, added in 7dfedbe
I don't think we can print the allocate decisions because nodeDecisions won't be populated under normal circumstances (unless we debugDecision)
> I don't think we can print the allocate decisions because nodeDecisions won't be populated under normal circumstances (unless we debugDecision)
Is this something to do with Multi vs Single Decision types? nodeDecisions looks like something in the explain path, yes. But the Decision returned from canAllocate for the chosen target node should have an explanation string. The Multi type I recall obfuscating some things, though.
Oh, maybe this is the problem (line 1019 in e0fcab7):

```java
AllocationDecision.fromDecisionType(bestDecision),
```

We lose that information when converting the Decision into an AllocationDecision.
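The information loss can be sketched with stand-in types (hypothetical, only loosely mirroring the shape of the real Decision and AllocationDecision classes):

```java
// Sketch of the information loss being discussed: a rich Decision carries an
// explanation string alongside its type, but converting it to a bare enum
// keeps only the type. Types here are hypothetical stand-ins.
public class DecisionLoss {
    enum AllocationDecision { YES, NO, THROTTLE }

    static final class Decision {
        final AllocationDecision type;
        final String explanation;
        Decision(AllocationDecision type, String explanation) {
            this.type = type;
            this.explanation = explanation;
        }
    }

    // Mirrors the fromDecisionType conversion: only the enum survives; the
    // explanation string is dropped here.
    static AllocationDecision fromDecisionType(Decision d) {
        return d.type;
    }

    public static void main(String[] args) {
        Decision best = new Decision(AllocationDecision.YES,
            "node's utilization would become [0.42]");
        System.out.println(fromDecisionType(best)); // explanation no longer reachable
    }
}
```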
Ooph. Okay cool 👍
Yeah, if allocation.debugDecisions() was turned on we'd preserve them in nodeResults, but it won't be for normal allocation (lines 991 to 993 in e0fcab7):

```java
if (explain) {
    nodeResults.add(new NodeAllocationResult(currentNode.getRoutingNode().node(), allocationDecision, ++weightRanking));
}
```
DiannaHohensee left a comment:
lgtm 👍
* main: (135 commits)
Mute org.elasticsearch.upgrades.IndexSortUpgradeIT testIndexSortForNumericTypes {upgradedNodes=1} elastic#138130
Mute org.elasticsearch.upgrades.IndexSortUpgradeIT testIndexSortForNumericTypes {upgradedNodes=2} elastic#138129
Mute org.elasticsearch.search.basic.SearchWithRandomDisconnectsIT testSearchWithRandomDisconnects elastic#138128
[DiskBBQ] avoid EsAcceptDocs bug by calling cost before building iterator (elastic#138127)
Log NOT_PREFERRED shard movements (elastic#138069)
Improve bulk loading of binary doc values (elastic#137995)
Add internal action for getting inference fields and inference results for those fields (elastic#137680)
Address issue with DateFieldMapper#isFieldWithinQuery(...) (elastic#138032)
WriteLoadConstraintDecider: Have separate rate limiting for canRemain and canAllocate decisions (elastic#138067)
Adding NodeContext to TransportBroadcastByNodeAction (elastic#138057)
Mute org.elasticsearch.simdvec.ESVectorUtilTests testSoarDistanceBulk elastic#138117
Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#137909
Backport batched_response_might_include_reduction_failure version to 8.19 (elastic#138046)
Add summary metrics for tdigest fields (elastic#137982)
Add gp-llm-v2 model ID and inference endpoint (elastic#138045)
Various tracing fixes (elastic#137908)
[ML] Fixing KDE evaluate() to return correct ValueAndMagnitude object (elastic#128602)
Mute org.elasticsearch.xpack.shutdown.NodeShutdownIT testStalledShardMigrationProperlyDetected elastic#115697
[ML] Fix Flaky Audit Message Assertion in testWithDatastream for RegressionIT and ClassificationIT (elastic#138065)
[ML] Fix Non-Deterministic Training Set Selection in RegressionIT testTwoJobsWithSameRandomizeSeedUseSameTrainingSet (elastic#138063)
...
# Conflicts:
# rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/200_dense_vector_docvalue_fields.yml
It is interesting to see which shards the write-load constraint decider is nominating for movement, and what their write load is. I made this a separate logger from the BalancedShardsAllocator's because turning debug on for that would be very noisy.

Relates: ES-13491
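The design choice can be sketched with java.util.logging standing in for Elasticsearch's logging (names illustrative, not the real logger names): a dedicated logger name lets operators raise verbosity for just these movement logs without enabling debug for the whole allocator.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of a dedicated logger for NOT_PREFERRED shard movements. The logger
// name is hypothetical; the point is that it is distinct from the allocator's
// own logger, so its level can be raised independently.
public class SeparateLogger {
    static final Logger NOT_PREFERRED_LOGGER =
        Logger.getLogger("allocator.not_preferred_movements");

    public static void main(String[] args) {
        // Enable verbose output for just this logger, leaving the (noisier)
        // allocator logger at its default level.
        NOT_PREFERRED_LOGGER.setLevel(Level.FINE);
        NOT_PREFERRED_LOGGER.fine("Moving shard [shard-0] from a NOT_PREFERRED allocation");
    }
}
```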