
Conversation

@zhubotang-wq
Contributor

@zhubotang-wq zhubotang-wq commented Oct 2, 2025

For a cluster with n data nodes hosting an index with m shards, each node should ideally host not significantly more than m / n shards of that index. This new allocation decider acts on that principle and responds with NOT_PREFERRED when a node would otherwise be allocated more shards of the index than the threshold.

Nodes in shutdown are excluded when computing the fair workload per node. A load skew tolerance setting is added to permit nodes to own more than the ideal number of shards for the index.
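Roughly, the check amounts to the following (a minimal sketch with hypothetical names, not the exact code in this PR; the real decider works against RoutingAllocation and returns a Decision rather than a boolean, and the tolerance is expressed here simply as extra shards allowed):

    // Hypothetical sketch of the per-index check: m shards of the index, n eligible nodes
    // (nodes in shutdown already excluded), excessShardsAllowed = skew tolerance as extra shards.
    static boolean allocationNotPreferred(int totalShardsOfIndex, int eligibleNodeCount, int excessShardsAllowed, int shardsOfIndexOnNode) {
        // fair share per node via ceiling division, plus the configured allowance
        int threshold = Math.ceilDiv(totalShardsOfIndex + excessShardsAllowed, eligibleNodeCount);
        return shardsOfIndexOnNode >= threshold;
    }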

Relates ES-12080

In a balanced allocation, for an index with n shards on a cluster
of m nodes, each node should host not significantly more than n / m
shards. This decider enforces this principle.
@zhubotang-wq zhubotang-wq requested a review from a team as a code owner October 2, 2025 20:11
@zhubotang-wq zhubotang-wq added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. labels Oct 2, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0 labels Oct 2, 2025
@zhubotang-wq
Contributor Author

This is still very much a work in progress (additional unit tests still need to be consolidated). I published this PR early to start gathering feedback sooner.

@elasticsearchmachine
Collaborator

Hi @zhubotang-wq, I've created a changelog YAML for you.

Contributor

@DiannaHohensee DiannaHohensee left a comment


I took a first pass. I didn't have time to take a peek at the testing yet.

@zhubotang-wq zhubotang-wq removed the :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. label Oct 14, 2025
@zhubotang-wq
Contributor Author

Noticed an issue during testing: in a perfectly balanced cluster, if the cluster is asked to split an index further, the index shard count balance decider needs to use the newly proposed (post-split) shard count to calculate the fair workload. However, this new shard count is not available in the cluster state.

At least `allocation.getClusterState().routingTable(ProjectId.DEFAULT).index(index).size()` still gives the old shard count.

This creates a case where all nodes are asked to host additional shards, but all consider themselves already doing enough.

In the FullClusterRestartIT integration test case, when the split is requested via

client().performRequest(newXContentRequest(HttpMethod.PUT, "/" + index + "/_split/" + target, (builder, params) -> {

all nodes say that they have already done enough:

[2025-10-16T04:56:00,074][DEBUG][o.e.c.r.a.d.IndexShardCountAllocationDecider] [test-cluster-0] For index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]] with [6] shards, Node [Ex1StRCOQwG_ghw3EJugoA] is expected to hold [3] shards for index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]], based on the total of [2] nodes available. The configured load skew tolerance is [1.50], which yields an allocation threshold of Math.ceil([3] × [1.50]) = [5] shards. Currently, node [Ex1StRCOQwG_ghw3EJugoA] is assigned [5] shards of index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]]. Therefore, assigning additional shards is not preferred.

[2025-10-16T04:55:59,560][DEBUG][o.e.c.r.a.d.IndexShardCountAllocationDecider] [test-cluster-0] For index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]] with [6] shards, Node [Ex1StRCOQwG_ghw3EJugoA] is expected to hold [3] shards for index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]], based on the total of [2] nodes available. The configured load skew tolerance is [1.50], which yields an allocation threshold of Math.ceil([3] × [1.50]) = [5] shards. Currently, node [Ex1StRCOQwG_ghw3EJugoA] is assigned [5] shards of index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]]. Therefore, assigning additional shards is not preferred.

Code changes are needed so that, when a split is requested, the deciders use the proposed new primary shard count.

@ywangd
Member

ywangd commented Oct 16, 2025

It would be helpful if you could share the build scan link for the failed test so that we have clear context. Without it, I assume the failure is from FullClusterRestartIT.testResize.

this new shard count is not available in Cluster State

This is not true. The original index is created with 3 primaries and 1 replica. The error message you shared above says

For index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]] with [6] shards,

Note that the 6 shards actually means 6 primary shards, which is double the original 3 due to the split. This is because the PR computes the number as

var totalShards = allocation.getClusterState().routingTable(ProjectId.DEFAULT).index(index).size();

This is counting the size of an IndexRoutingTable, which is determined by its array of IndexShardRoutingTable entries. Each IndexShardRoutingTable in turn manages all copies (primary and replicas) of the same shard. Therefore the total number of shards (primaries and replicas) should be IndexRoutingTable.size() * IndexShardRoutingTable.size() = 12. But you might as well use IndexMetadata.getTotalNumberOfShards() for it.
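To spell the arithmetic out with the numbers from this test (an illustration only; the concrete values are read off the log above):

    int indexRoutingTableSize = 6;   // IndexRoutingTable.size(): one entry per shard id, i.e. primaries (3 doubled by the split)
    int copiesPerShardId = 2;        // IndexShardRoutingTable.size(): primary + 1 replica
    int totalShardCopies = indexRoutingTableSize * copiesPerShardId; // 12, matching IndexMetadata.getTotalNumberOfShards()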

I think using the total number of shards makes sense for stateful. But in stateless, we should probably differentiate between primaries and replicas, since their corresponding nodes are two disjoint sets (not sure what the plan is for this PR; I can see this as a 2nd step).

Also, IIUC, the new decider does not return NO, so why would shards remain unassigned when it says NOT_PREFERRED? That does not sound right, especially for newly created shards.

Contributor

@nicktindall nicktindall left a comment


Looking good, a bunch of minor questions/comments from me

public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
    if (indexBalanceConstraintSettings.isDeciderEnabled() == false || isStateless == false || hasNoFilters() == false) {
        return Decision.single(Decision.Type.YES, NAME, "Decider is disabled.");
    }
Contributor


If it's disabled for stateful, I wonder if we could just configure it to be added in the stateless plugin? Then we wouldn't need to check isStateless every time.

Contributor


Also hasNoFilters() == false is harder to read than filtersAreConfigured() or similar. It's Friday here, I can't process a double-negative.

Contributor Author

@zhubotang-wq zhubotang-wq Nov 21, 2025


Refactored to remove the double negative. Given the follow-up plans to add stateful logic, and the protracted nature of this iteration, I am inclined to keep its current placement.

I completely agree that placing this in the stateless plugin is a much better choice than the current location. Since the decision to make this a stateless-only decider was made nearly two months ago, I'm a bit surprised this option didn't come up earlier in the previous dozens of comments.

Maybe the reviewers could concentrate more on the architectural and logic aspects here; those have a greater impact than variable-naming details, and raising them earlier would have let me take this approach much sooner.


if (node.node().getRoles().contains(SEARCH_ROLE) && shardRouting.primary()) {
    return Decision.single(Decision.Type.YES, NAME, "A search node cannot own primary shards. Decider inactive.");
}
Contributor


Could we perhaps combine this into a single check like co.elastic.elasticsearch.stateless.allocation.StatelessAllocationDecider#canAllocateShardToNode so it doesn't distract from the focus of this decider? (lines 84-92 could be replaced by such a check maybe?)

Contributor Author


Ack. In hindsight, the deciders ought to have been placed in the stateless repo as part of the stateless plugin.

As I mentioned earlier, this feedback makes absolute sense, since canAllocateShardToNode deals with precisely the requirement here.

At this stage, I am inclined to leave this refactoring to a follow-up PR when canRemain() is added.


final double idealAllocation = Math.ceil((double) totalShards / eligibleNodes.size());
final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards()) / eligibleNodes.size();
Contributor


I don't understand this formula; what am I missing? Does it deserve an explanation?

currentAllocation
);

logger.trace(explanation);
Contributor


Should this be logger.debug?

Contributor Author


For this logging statement, 4 different reviewers offered different opinions. At this stage, I am inclined to leave it unchanged.

addAllocationDecider(deciders, new ThrottlingAllocationDecider(clusterSettings));
addAllocationDecider(deciders, new ShardsLimitAllocationDecider(clusterSettings));
addAllocationDecider(deciders, new AwarenessAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new IndexBalanceAllocationDecider(settings, clusterSettings));
Contributor


As suggested above, is this something we could configure only in the serverless plugin?

Contributor Author


Ack. My previous replies agree this is sound advice. At this stage, I am inclined to leave this refactoring to follow-up iterations.

}

public void testCanAllocateUnderThresholdWithExcessShards() {
setup(false, true);
Contributor


Nit: I think this is hard to read, especially outside of an IDE (methods with consecutive boolean params like this). Can we just pass in the settings or similar, so the starting point is clear just by looking at the test? Or replace the booleans with enums or something.

Could make a settings builder or something to reduce repetition if you think it's worthwhile.

Contributor Author


Agree. Fixed to use settings and clearly named parameters.

nomenclature = "search";
}

assert eligibleNodes.isEmpty() == false;
Contributor


I wondered about this also: if the cluster were shutting down, all nodes would be marked as shutting down and we'd get an empty array here, right?

Contributor

@nicktindall nicktindall left a comment


I'm happy to approve once we're using allocation.decision; that's my only real comment that's not just opinion.

private List<RoutingNode> indexTier;
private List<RoutingNode> searchIier;

private void setup(Settings settings) {
Contributor


This isn't quite what I meant... what I was thinking is you could write utility methods for populating those settings e.g.

    public Settings addRandomFilterSetting(Settings settings) {
        String setting = randomFrom(
            CLUSTER_ROUTING_REQUIRE_GROUP_PREFIX,
            CLUSTER_ROUTING_INCLUDE_GROUP_PREFIX,
            CLUSTER_ROUTING_EXCLUDE_GROUP_PREFIX
        );
        String attribute = randomFrom("_value", "name");
        String name = randomFrom("indexNodeOne", "indexNodeTwo", "searchNodeOne", "searchNodeTwo");
        String ip = randomFrom("192.168.0.1", "192.168.0.2", "192.168.7.1", "10.17.0.1");
        return Settings.builder()
            .put(settings)
            .put(setting + "." + attribute, attribute.equals("name") ? name : ip)
            .build();
    }

    public Settings allowExcessShards(Settings settings) {
        int excessShards = randomIntBetween(1, 5);

        return Settings.builder()
            .put(settings)
            .put(IndexBalanceConstraintSettings.INDEX_BALANCE_DECIDER_EXCESS_SHARDS.getKey(), excessShards)
            .build();
    }

Then you could do e.g.

     Settings settings = addRandomFilterSetting(Settings.EMPTY);
     settings = allowExcessShards(settings);
     setup(settings);

     //...

Then you just use the provided settings as a starting point in the setup:

    Settings effectiveSettings = Settings.builder()
        .put(settings)
        .put("stateless.enabled", "true")
        .put(IndexBalanceConstraintSettings.INDEX_BALANCE_DECIDER_ENABLED_SETTING.getKey(), "true")
        .build();

That way the configuration of excess shards/random filters is done only in the places you need it, in the test itself rather than in the setup method?

Settings.builder().put(settings) makes a builder with all the settings in the passed settings object copied.

@zhubotang-wq
Contributor Author

I'm happy to approve once we're using allocation.decision, that's my only real comment that's not just opinion

Made the following fixes:

  • Decision.single has been replaced.
  • Added a defensive if block for eligibleNodes.isEmpty(); retained the assertion based on the previous discussion.
  • Added extra comments on

    final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards()) / eligibleNodes.size();

  • Integer division is simpler to reason about than double.
  • Adding eligibleNodes.size() - 1 ensures the threshold is not rounded down.
  • eligibleNodes.size() - 1 is not so large that the threshold gets an automatic extra shard of allowance without touching the excess-shards setting.

// The built-in "eligibleNodes.size() - 1" offers just enough buffer that the threshold is not rounded down by integer division,
// but not so much that the threshold gets an automatic extra shard of allowance.
final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards()) / eligibleNodes.size();
Contributor

@nicktindall nicktindall Nov 25, 2025


Can we put this in a function called ceilingDivision or something?

I see that it's a common way of doing integer ceiling division. There's also Math.ceilDiv(int, int) which I think would be even better. Leaving it as-is is just a bit mind-boggling unless you're familiar with the trick.

Contributor

@nicktindall nicktindall Nov 25, 2025


I really think we should use Math#ceilDiv; it has very detailed javadoc.

Contributor Author

@zhubotang-wq zhubotang-wq Nov 25, 2025


The current form of the calculation was suggested in earlier feedback.

#135875 (comment)

The reviewer's rationale is as follows:

Staying in int simplifies I think, though for common numbers, doubles do have exact precision, this helps me not speculate ;-). And adding the skew tolerance before division ensures that with tolerance 1 we get:

2 shards, 2 nodes, allow 2 on each
3 shards, 2 nodes, allow 2 on each (prior version 3, but there is already 1 shard wiggle room).
etc.

The threshold computation has gone through several iterations:

final int threshold = (int) Math.ceil(idealAllocation * indexBalanceConstraintSettings.getExcessShards());

final int threshold = (int) Math.ceil(idealAllocation) + indexBalanceConstraintSettings.getExcessShards();

final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards()) / eligibleNodes.size();

There’s a broad spectrum of perspectives on this topic. I’ve made an effort to incorporate as much feedback as possible, though aligning all perspectives has proven challenging. It’s been a bit difficult to understand the group’s decision-making dynamics in this process.

At this stage, I am inclined to keep it in its current form for this iteration.

Contributor

@nicktindall nicktindall Nov 25, 2025


I don't think using the integer ceilDiv function from java.lang.Math contradicts any of the prior advice. For me, we want a pretty good reason to rewrite logic that exists in the standard library, and I don't see that here. It's just a straightforward readability/maintainability thing.

I think I do feel strongly about this one.

final int threshold = Math.ceilDiv(totalShards + indexBalanceConstraintSettings.getExcessShards(), eligibleNodes.size());

is much easier to understand than

final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards()) / eligibleNodes.size();

Math.ceilDiv(int n, int d) seems to be equivalent to (n + d - 1) / d, except that the former is very nicely documented. I don't think @henningandersen would disagree?
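For the worked numbers quoted earlier (2 nodes, tolerance/excess of 1), the two forms do agree; a quick illustration, not code from the PR:

    int nodes = 2, excess = 1;
    for (int totalShards : new int[] { 2, 3 }) {
        int manual = (totalShards + nodes - 1 + excess) / nodes;    // (2+2-1+1)/2 = 2 and (3+2-1+1)/2 = 2
        int viaCeilDiv = Math.ceilDiv(totalShards + excess, nodes); // ceil(3/2) = 2 and ceil(4/2) = 2
        assert manual == viaCeilDiv;                                // both allow 2 shards per node
    }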

Contributor


Yeah, I agree with this; using a standard function is better.

Contributor

@nicktindall nicktindall left a comment


This LGTM

@zhubotang-wq zhubotang-wq merged commit b698d38 into elastic:main Nov 26, 2025
34 checks passed
@zhubotang-wq zhubotang-wq deleted the 12080-index-shard-count-allocation-decider branch November 26, 2025 18:04