
Additional trace logging for desired balance computer #105910

Merged

Conversation

@idegtiarenko (Contributor) commented on Mar 4, 2024

I believe the following sequence of events is happening and causing the test failure:

  • First, 5 of the 6 shard sizes are fetched from the snapshot repository and shard allocation starts.
  • The smallest of those shards is started on the node with the restricted disk size; however, the cluster info has not been refreshed yet, so that node still appears empty.
  • The remaining shard size is then fetched. That shard happens to be the smallest or second smallest, and because of the outdated cluster info it is assigned to the smallest node (which still looks the emptiest, shard-wise).

I am adding more logs to confirm this theory.
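
As a rough sketch of the idea (illustrative only, not the exact diff: the accessors are the ones visible in the fragments below, but the message and placement are assumptions), the extra logging dumps the inputs of the desired balance computation at trace level, so a failing run shows whether stale cluster info was used:

// Illustrative sketch, not the committed change: dump the computation inputs
// at TRACE so the allocation decisions can be correlated with the cluster info
// the computer actually saw.
if (logger.isTraceEnabled()) {
    logger.trace(
        "computing desired balance for [{}] with cluster info [{}] and snapshot shard sizes [{}]",
        desiredBalanceInput.index(),
        desiredBalanceInput.routingAllocation().clusterInfo(),
        desiredBalanceInput.routingAllocation().snapshotShardSizeInfo()
    );
}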

I think it is possible to fix this test by either:

  • setting cluster.snapshot.info.max_concurrent_fetches to use 6 threads (sketched below), or
  • using 5 or fewer shards.

However, I am not sure it is possible to avoid using outdated cluster info during allocation (given the assumption that we want to allocate shards as fast as possible).
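
For the first option, a minimal sketch of the settings change, assuming the test can pass this as a node or cluster setting (the setting name comes from this discussion; where exactly the test applies it is illustrative):

// Hypothetical test tweak: allow all 6 snapshot shard sizes to be fetched
// concurrently, so none of them arrives after allocation has already started.
Settings settings = Settings.builder()
    .put("cluster.snapshot.info.max_concurrent_fetches", 6)
    .build();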

Related to: #105331

@idegtiarenko added the >test, :Distributed/Allocation, Team:Distributed, and v8.14.0 labels on Mar 4, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@@ -283,7 +293,6 @@ public DesiredBalance compute(
 hasChanges = true;
 clusterInfoSimulator.simulateShardStarted(shardRouting);
 routingNodes.startShard(logger, shardRouting, changes, 0L);
-logger.trace("starting shard {}", shardRouting);
@idegtiarenko (Contributor, Author)

Removed, as it is easy to deduce that the shard is starting from the large log entry above.

@DaveCTurner (Contributor) left a comment

LGTM (one suggestion)

desiredBalanceInput.routingAllocation().snapshotShardSizeInfo().toString()
);
} else {
logger.debug("Recomputing desired balance for [{}]", desiredBalanceInput.index());
@DaveCTurner (Contributor)

I think it'd be useful to see the .index() even in the trace-level logging.
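
A sketch of what that suggestion could look like (illustrative, not the committed change): carry the input index into the trace-level message as well, mirroring the debug branch:

// Illustrative: include .index() in the TRACE message too, so trace output
// can still be tied to a specific desired balance computation.
logger.trace(
    "Recomputing desired balance for [{}] with snapshot shard sizes [{}]",
    desiredBalanceInput.index(),
    desiredBalanceInput.routingAllocation().snapshotShardSizeInfo()
);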

@idegtiarenko merged commit eeecdbf into elastic:main on Mar 6, 2024
14 checks passed
@idegtiarenko deleted the debug_desired_balance_computation branch on March 6, 2024 at 10:28