Conversation

@DiannaHohensee
Contributor

Adds a new cluster setting to allow the ThrottlingAllocationDecider
to bypass replica shard throttling during balancer simulation. Primary
shards already always bypass throttling during simulation so that new
index shards are assigned (and made available) as quickly as possible.
Replicas need the same quick availability in some environments.
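
As a rough illustration of the behaviour described above, here is a minimal, self-contained sketch. It is not the actual ThrottlingAllocationDecider code: the class name, the `replicaBypassEnabled` flag (standing in for the new cluster setting), and the single recovery counter are all hypothetical simplifications.

```java
// Hypothetical, simplified model of the decision; not the Elasticsearch classes.
public class ThrottleBypassSketch {

    public enum Decision { YES, THROTTLE }

    /**
     * @param primary                  whether the unassigned shard copy is a primary
     * @param simulating               true for balancer simulation, false for reconciliation
     * @param replicaBypassEnabled     hypothetical stand-in for the new cluster setting
     * @param ongoingRecoveries        recoveries already running on the target node (simplified)
     * @param maxConcurrentRecoveries  recovery limit for the target node (simplified)
     */
    public static Decision canAllocate(
        boolean primary,
        boolean simulating,
        boolean replicaBypassEnabled,
        int ongoingRecoveries,
        int maxConcurrentRecoveries
    ) {
        if (simulating) {
            if (primary) {
                // Primaries always bypass throttling during simulation.
                return Decision.YES;
            }
            if (replicaBypassEnabled) {
                // With the setting on, replicas bypass throttling in simulation too.
                return Decision.YES;
            }
        }
        // Outside simulation (or with the bypass off), throttling still applies.
        return ongoingRecoveries < maxConcurrentRecoveries ? Decision.YES : Decision.THROTTLE;
    }

    public static void main(String[] args) {
        System.out.println(canAllocate(false, true, true, 5, 2));
        System.out.println(canAllocate(false, false, true, 5, 2));
    }
}
```

The point of the sketch is that the bypass is keyed on simulation only, so reconciliation keeps its throttling behaviour regardless of the setting.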

Relates ES-12942


I'm splitting the work for ES-12942 into pieces. Next, I'll need to make sure that BalancedShardsAllocator#allocate() can assign all the replicas in one call, and then change the DesiredBalanceComputer's early-return logic to pick up new assignments of unassigned replicas.

@DiannaHohensee DiannaHohensee requested a review from a team as a code owner November 24, 2025 21:34
@elasticsearchmachine elasticsearchmachine added v9.3.0 needs:triage Requires assignment of a team area label labels Nov 24, 2025
@DiannaHohensee DiannaHohensee force-pushed the 2025/11/21/throttle-decider-unthrottles-replica branch from aef7911 to bd3d442 Compare November 24, 2025 21:35
@DiannaHohensee DiannaHohensee removed the request for review from a team November 24, 2025 21:39
@DiannaHohensee DiannaHohensee force-pushed the 2025/11/21/throttle-decider-unthrottles-replica branch from bd3d442 to 4ade9fa Compare November 24, 2025 21:57
@DiannaHohensee DiannaHohensee force-pushed the 2025/11/21/throttle-decider-unthrottles-replica branch from 4ade9fa to 4b5ef05 Compare November 24, 2025 21:58
@DiannaHohensee DiannaHohensee added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team >non-issue and removed needs:triage Requires assignment of a team area label labels Nov 24, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

// During simulation, this supports early publishing DesiredBalance, with all unassigned shards assigned.
// Notably, this bypass is only in simulation decisions. Reconciliation will continue to obey throttling, in particular the
// requirement to assign a primary before allowing its replicas to begin initializing.
return allocation.decision(Decision.YES, NAME, "replica allocation is not throttled when simulating");
Contributor

@nicktindall nicktindall Nov 25, 2025

Is this change possibly redundant now that we've implemented #134786? If I understand correctly, the ThrottlingAllocationDecider won't ever kick in now that we do a single move per balancing round?

I thought that ES-12942 was more similar to the change in #115511.

Never mind, I see now that we'll need this with the subsequent changes.

ShardRouting shardRouting1Primary,
ShardRouting shardRouting1Replica,
ShardRouting shardRouting2Primary,
ShardRouting shardRouting2Replica
Contributor

Should these have "unassigned" in the name (e.g. unassignedShard1Primary) or similar?

Contributor Author

Yep, that'd be clearer, done 👍

String[] indices,
int numberOfShards,
List<ShardRouting.Role> replicaRoles
) {
Contributor

Should we validate that numberOfShards == replicaRoles.size()?

Contributor Author

I don't think that has to be true, AFAICT. The number of replicaRoles is equal to the number of replicas, which is not limited to the number of shards.
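
To make the distinction concrete, here is a small sketch of the arithmetic. It assumes that each shard gets one primary plus one replica per entry in replicaRoles; the class and method names are hypothetical, and `List<String>` stands in for `List<ShardRouting.Role>` so the example is self-contained.

```java
import java.util.List;

// Hypothetical helper illustrating why replicaRoles.size() is independent of numberOfShards.
public class ShardCopyCountSketch {

    /** Total shard copies (primaries plus replicas) across all indices. */
    public static int totalShardCopies(int numberOfIndices, int numberOfShards, List<String> replicaRoles) {
        // Assumption: every shard has one primary and replicaRoles.size() replicas.
        int copiesPerShard = 1 + replicaRoles.size();
        return numberOfIndices * numberOfShards * copiesPerShard;
    }

    public static void main(String[] args) {
        // One index, three shards, one replica each: 3 primaries + 3 replicas.
        System.out.println(totalShardCopies(1, 3, List.of("DEFAULT")));
        // One index, one shard, two replicas: 1 primary + 2 replicas.
        System.out.println(totalShardCopies(1, 1, List.of("DEFAULT", "DEFAULT")));
    }
}
```

Under that assumption, a single shard with two replicas is a perfectly valid input, so an equality check between the two values would reject legitimate configurations.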

} else if (i == numberOfDataNodes) {
discoBuilder.masterNodeId(node.getId());
}
}
Contributor

If I'm reading it right, the actual number of data nodes ends up being numberOfDataNodes + 1, which seems counter-intuitive. Could we change numberOfDataNodes to hold the actual number of data nodes?

Contributor Author

Yeah, this is pretty convoluted. I've never found a reason for it in the other helpers. Perhaps it served some historical testing need that no longer exists.

Fixing 👍
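
For illustration only, one possible shape for such a fix is sketched below. The loop bounds, node ID format, and method names are assumptions and not the helper's actual code; the idea is simply that the parameter should equal the number of nodes created.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: numberOfDataNodes means exactly what it says.
public class DataNodeCountSketch {

    /** Builds exactly numberOfDataNodes node IDs (an inclusive bound would create one extra). */
    public static List<String> buildNodeIds(int numberOfDataNodes) {
        List<String> nodeIds = new ArrayList<>();
        for (int i = 0; i < numberOfDataNodes; i++) {
            nodeIds.add("node_" + i);
        }
        return nodeIds;
    }

    /** Picks the last created node as master; an arbitrary choice for this sketch. */
    public static String masterNodeId(List<String> nodeIds) {
        return nodeIds.get(nodeIds.size() - 1);
    }

    public static void main(String[] args) {
        List<String> nodeIds = buildNodeIds(3);
        System.out.println(nodeIds.size() + " nodes, master is " + masterNodeId(nodeIds));
    }
}
```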

ShardRouting shardRouting1Primary = TestShardRouting.newShardRouting(testShardId1, null, null, true, ShardRoutingState.UNASSIGNED);
ShardRouting shardRouting2Primary = TestShardRouting.newShardRouting(testShardId2, null, null, true, ShardRoutingState.UNASSIGNED);
ShardRouting shardRouting1Replica = TestShardRouting.newShardRouting(testShardId1, null, null, false, ShardRoutingState.UNASSIGNED);
ShardRouting shardRouting2Replica = TestShardRouting.newShardRouting(testShardId2, null, null, false, ShardRoutingState.UNASSIGNED);
Contributor

Maybe it's not important, but I think we should be able to pull these out of the routing table with:

        RoutingTable routingTable = clusterState.routingTable(ProjectId.DEFAULT);
        routingTable.shardRoutingTable(shardId).primaryShard();
        routingTable.shardRoutingTable(shardId).replicaShards().get(0);

Then perhaps we can avoid the change to TestShardRouting?

Contributor Author

Yeah, routing table is probably better. I just copy-pasted.

The TestShardRouting change is just label tidying; other callers provide null.

Contributor

@nicktindall nicktindall left a comment

LGTM with some minor queries

@DiannaHohensee DiannaHohensee added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Nov 25, 2025
@elasticsearchmachine elasticsearchmachine merged commit ee72be0 into elastic:main Dec 1, 2025
34 checks passed
@DiannaHohensee DiannaHohensee deleted the 2025/11/21/throttle-decider-unthrottles-replica branch December 1, 2025 19:18
Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0

3 participants