
Additional trace logging for desired balance computer #105910

Merged

Conversation

@idegtiarenko (Contributor) commented on Mar 4, 2024

I believe the following sequence of events is happening and causing the test failure:

  • First, 5 of the 6 shard sizes are fetched from the snapshot repository and shard allocation starts.
  • The smallest of those shards is started on the node with the restricted disk size; however, the cluster info has not been refreshed yet, so that node still appears empty.
  • The remaining shard size is then fetched. That shard happens to be the smallest or second smallest, and because of the outdated cluster info it is assigned to the smallest node (which still looks the emptiest, shard-wise).

I am adding more logs to confirm this theory.
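
As a rough sketch of the idea (illustrative only, not the exact diff: the accessors are the ones visible in the fragments below, but the message and placement are assumptions), the extra logging dumps the inputs of the desired balance computation at trace level, so a failing run shows whether stale cluster info was used:

// Illustrative sketch, not the committed change: dump the computation inputs
// at TRACE so the allocation decisions can be correlated with the cluster info
// the computer actually saw.
if (logger.isTraceEnabled()) {
    logger.trace(
        "computing desired balance for [{}] with cluster info [{}] and snapshot shard sizes [{}]",
        desiredBalanceInput.index(),
        desiredBalanceInput.routingAllocation().clusterInfo(),
        desiredBalanceInput.routingAllocation().snapshotShardSizeInfo()
    );
}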

I think it is possible to fix this test by either:

  • setting cluster.snapshot.info.max_concurrent_fetches to use 6 threads (sketched below), or
  • using 5 or fewer shards.

However, I am not sure it is possible to avoid using outdated cluster info during allocation (given the assumption that we want to allocate shards as fast as possible).
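
For the first option, a minimal sketch of the settings change, assuming the test can pass this as a node or cluster setting (the setting name comes from this discussion; where exactly the test applies it is illustrative):

// Hypothetical test tweak: allow all 6 snapshot shard sizes to be fetched
// concurrently, so none of them arrives after allocation has already started.
Settings settings = Settings.builder()
    .put("cluster.snapshot.info.max_concurrent_fetches", 6)
    .build();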

Related to: #105331

@idegtiarenko added the >test, :Distributed/Allocation, Team:Distributed, and v8.14.0 labels on Mar 4, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@@ -283,7 +293,6 @@ public DesiredBalance compute(
 hasChanges = true;
 clusterInfoSimulator.simulateShardStarted(shardRouting);
 routingNodes.startShard(logger, shardRouting, changes, 0L);
-logger.trace("starting shard {}", shardRouting);
@idegtiarenko (Contributor, Author)

Removed, as it is easy to deduce that the shard is starting from the large log entry above.

@DaveCTurner (Contributor) left a comment

LGTM (one suggestion)

desiredBalanceInput.routingAllocation().snapshotShardSizeInfo().toString()
);
} else {
logger.debug("Recomputing desired balance for [{}]", desiredBalanceInput.index());
@DaveCTurner (Contributor)

I think it'd be useful to see the .index() even in the trace-level logging.
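
A sketch of what that suggestion could look like (illustrative, not the committed change): carry the input index into the trace-level message as well, mirroring the debug branch:

// Illustrative: include .index() in the TRACE message too, so trace output
// can still be tied to a specific desired balance computation.
logger.trace(
    "Recomputing desired balance for [{}] with snapshot shard sizes [{}]",
    desiredBalanceInput.index(),
    desiredBalanceInput.routingAllocation().snapshotShardSizeInfo()
);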

@idegtiarenko merged commit eeecdbf into elastic:main on Mar 6, 2024
14 checks passed
@idegtiarenko deleted the debug_desired_balance_computation branch on March 6, 2024 at 10:28