Allocation: introduce a new decider that balances the index shard count among nodes #135875
Conversation
In a balanced allocation, for an index with n shards on a cluster of m nodes, each node should host not significantly more than n / m shards. This decider enforces this principle.
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

This is still very much a work in progress (additional unit tests need to be consolidated). Published this PR prematurely to accelerate feedback.

Hi @zhubotang-wq, I've created a changelog YAML for you.
DiannaHohensee
left a comment
I took a first pass. I didn't have time to take a peek at the testing yet.
Noticed an issue during testing: in a perfectly balanced cluster, if an index is then split, the index shard count balance decider needs to use the newly proposed (post-split) shard count to calculate the fair workload. However, this new shard count is not available in the cluster state; at least `allocation.getClusterState().routingTable(ProjectId.DEFAULT).index(index).size()` still gives the old shard count. This creates a case where all nodes are asked to host additional shards, but each considers itself already doing enough. In the FullClusterRestartIT integration test, all nodes report that they have done enough:

```
[2025-10-16T04:56:00,074][DEBUG][o.e.c.r.a.d.IndexShardCountAllocationDecider] [test-cluster-0] For index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]] with [6] shards, Node [Ex1StRCOQwG_ghw3EJugoA] is expected to hold [3] shards for index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]], based on the total of [2]
[2025-10-16T04:55:59,560][DEBUG][o.e.c.r.a.d.IndexShardCountAllocationDecider] [test-cluster-0] For index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]] with [6] shards, Node [Ex1StRCOQwG_ghw3EJugoA] is expected to hold [3] shards for index [[testresize_split/UqXRWqRfRNSrBFZ_Yhzqfw]], based on the total of [2]
```

Need to make code changes so that when a split is requested, the decider uses the proposed new primary shard count.
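One possible direction (a sketch only; the metadata accessor shown is an assumption and may be project-scoped in newer cluster-state APIs) is to read the shard count from the index metadata, which already reflects the resize target's `number_of_shards`, rather than from the routing table:

```java
// Sketch, not the actual fix: IndexMetadata carries the resize target's shard count,
// while the routing table may still describe the pre-split index.
// The metadata lookup below is an assumption; the exact accessor varies by version.
IndexMetadata indexMetadata = allocation.metadata().index(shardRouting.index());
int primaries = indexMetadata.getNumberOfShards();                       // post-split primary count
int totalShards = primaries * (1 + indexMetadata.getNumberOfReplicas()); // all shard copies
```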
It would be helpful if you could share the build-scan link for the failed test so that we have clear context. Without it, I assume the failure is from FullClusterRestartIT.testResize.
This is not true. The original index is created with 3 primaries and 1 replica. Note the `with [6] shards` in the error message you shared above: this is counting the size of the index's routing table with replicas included (3 primaries × 2 copies), not a stale shard count. I think using the total number of shards makes sense for stateful. But in stateless, we probably should differentiate between primaries and replicas, since their corresponding nodes are two disjoint sets (not sure what the plan is for this PR; I can see this as a 2nd step). Also, IIUC, the new decider does not return …
nicktindall
left a comment
Looking good, a bunch of minor questions/comments from me
```java
public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
    if (indexBalanceConstraintSettings.isDeciderEnabled() == false || isStateless == false || hasNoFilters() == false) {
        return Decision.single(Decision.Type.YES, NAME, "Decider is disabled.");
    }
```
If it's disabled for stateful, I wonder if we could just configure it to be added in the stateless plugin? Then we wouldn't need to check `isStateless` every time.
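A minimal sketch of what that could look like (the plugin class here is hypothetical; `ClusterPlugin#createAllocationDeciders` is an existing extension point):

```java
import java.util.Collection;
import java.util.List;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.common.settings.ClusterSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.plugins.ClusterPlugin;
import org.elasticsearch.plugins.Plugin;

// Hypothetical: the stateless plugin contributes the decider itself, so the decider
// never needs an isStateless check; it simply is not installed on stateful clusters.
public class StatelessAllocationPlugin extends Plugin implements ClusterPlugin {
    @Override
    public Collection<AllocationDecider> createAllocationDeciders(Settings settings, ClusterSettings clusterSettings) {
        return List.of(new IndexBalanceAllocationDecider(settings, clusterSettings));
    }
}
```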
Also, `hasNoFilters() == false` is harder to read than `filtersAreConfigured()` or similar. It's Friday here; I can't process a double negative.
Refactored to remove the double negative. Given the follow-up plans to add stateful logic, as well as the protracted nature of this iteration, I am inclined to keep the decider's current placement.
I completely agree that placing this in the stateless plugin is a much better choice than the current location. Since the decision to make this a stateless-only decider was made nearly two months ago, I'm a bit surprised this option didn't come up earlier in the previous dozens of comments.
Maybe the reviewers could concentrate more on the architectural/logic aspects here; those will have a greater impact than variable naming details. That would have enabled me to take this approach far earlier.
```java
if (node.node().getRoles().contains(SEARCH_ROLE) && shardRouting.primary()) {
    return Decision.single(Decision.Type.YES, NAME, "A search node cannot own primary shards. Decider inactive.");
}
```
Could we perhaps combine this into a single check like `co.elastic.elasticsearch.stateless.allocation.StatelessAllocationDecider#canAllocateShardToNode`, so it doesn't distract from the focus of this decider? (Lines 84-92 could be replaced by such a check, maybe?)
Ack. In hindsight, the deciders ought to have been placed in the stateless repo as part of the stateless plugin.
As I mentioned earlier, this feedback makes absolute sense, since `canAllocateShardToNode` deals with the precise requirement here.
At this stage, I am inclined to leave this refactoring to a follow-up PR, when `canRemain()` is added.
```java
final double idealAllocation = Math.ceil((double) totalShards / eligibleNodes.size());
final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards())
    / eligibleNodes.size();
```
I don't understand this formula; what am I missing? Does it deserve an explanation?
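For readers of the thread: the trick is integer ceiling division with the skew tolerance added to the numerator. A minimal worked example (variable names mirror the excerpt above):

```java
// (n + d - 1) / d rounds the integer quotient up instead of down.
int totalShards = 3;
int nodes = 2;           // eligibleNodes.size()
int excessShards = 1;    // indexBalanceConstraintSettings.getExcessShards()
int threshold = (totalShards + nodes - 1 + excessShards) / nodes;
// = (3 + 2 - 1 + 1) / 2 = 2, i.e. ceil((3 + 1) / 2): each node may hold up to 2 shards.
// With excessShards = 0: (3 + 2 - 1) / 2 = 2 = ceil(3 / 2).
```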
```java
        currentAllocation
    );
    logger.trace(explanation);
```
Should this be `logger.debug`?
For this logging statement, four different reviewers offered different opinions. At this stage, I am inclined to leave it unchanged.
```java
addAllocationDecider(deciders, new ThrottlingAllocationDecider(clusterSettings));
addAllocationDecider(deciders, new ShardsLimitAllocationDecider(clusterSettings));
addAllocationDecider(deciders, new AwarenessAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new IndexBalanceAllocationDecider(settings, clusterSettings));
```
As suggested above, is this something we could configure only in the serverless plugin?
Ack. My previous replies agree this is sound advice. At this stage, I am inclined to leave this refactoring to follow-up iterations.
```java
}

public void testCanAllocateUnderThresholdWithExcessShards() {
    setup(false, true);
```
Nit: I think this is hard to read, especially outside of an IDE (methods with consecutive boolean params like this). Can we just pass in the settings or similar, so the starting point is clear just by looking at the test? Or even replace the booleans with enums or something.
You could make a settings builder or something to reduce repetition if you think it's worthwhile.
Agree. Fixed to use settings and clearly named parameters.
```java
    nomenclature = "search";
}

assert eligibleNodes.isEmpty() == false;
```
I wondered about this also; if the cluster were shutting down, all nodes would be marked as shutting down and we'd get an empty array here, right?
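If that scenario is reachable, a defensive guard instead of the assert might look like this (a sketch, not the PR's code):

```java
// Sketch: degrade gracefully rather than trip the assert when every node is shutting down.
if (eligibleNodes.isEmpty()) {
    return Decision.single(Decision.Type.YES, NAME, "No eligible nodes; decider inactive.");
}
```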
nicktindall
left a comment
I'm happy to approve once we're using `allocation.decision`; that's my only real comment that's not just opinion.
```java
private List<RoutingNode> indexTier;
private List<RoutingNode> searchTier;

private void setup(Settings settings) {
```
This isn't quite what I meant... What I was thinking is you could write utility methods for populating those settings, e.g.

```java
public Settings addRandomFilterSetting(Settings settings) {
    String setting = randomFrom(
        CLUSTER_ROUTING_REQUIRE_GROUP_PREFIX,
        CLUSTER_ROUTING_INCLUDE_GROUP_PREFIX,
        CLUSTER_ROUTING_EXCLUDE_GROUP_PREFIX
    );
    String attribute = randomFrom("_value", "name");
    String name = randomFrom("indexNodeOne", "indexNodeTwo", "searchNodeOne", "searchNodeTwo");
    String ip = randomFrom("192.168.0.1", "192.168.0.2", "192.168.7.1", "10.17.0.1");
    return Settings.builder()
        .put(settings)
        .put(setting + "." + attribute, attribute.equals("name") ? name : ip)
        .build();
}

public Settings allowExcessShards(Settings settings) {
    int excessShards = randomIntBetween(1, 5);
    return Settings.builder()
        .put(settings)
        .put(IndexBalanceConstraintSettings.INDEX_BALANCE_DECIDER_EXCESS_SHARDS.getKey(), excessShards)
        .build();
}
```

Then you could do e.g.

```java
Settings settings = addRandomFilterSetting(Settings.EMPTY);
settings = allowExcessShards(settings);
setup(settings);
// ...
```

Then you just use the provided settings as a starting point in the setup:

```java
Settings nodeSettings = Settings.builder()
    .put(settings)
    .put("stateless.enabled", "true")
    .put(IndexBalanceConstraintSettings.INDEX_BALANCE_DECIDER_ENABLED_SETTING.getKey(), "true")
    .build();
```

That way the configuration of excess shards/random filters is done only in the places you need it, in the test rather than the setup method. `Settings.builder().put(settings)` makes a builder with all the settings from the passed settings object copied.
Made the suggested fixes.
```java
// The built-in "eligibleNodes.size() - 1" offers just enough buffer so the threshold is not rounded down
// by integer division, but not so much that the threshold gets an automatic one-shard extra allowance.
final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards())
    / eligibleNodes.size();
```
Can we put this in a function called `ceilingDivision` or something?
I see that it's a common idiom for integer ceiling division. There's also `Math.ceilDiv(int, int)`, which I think would be even better. Leaving it as-is is just a bit mind-boggling unless you're familiar with the trick.
I really think we should use `Math#ceilDiv`; it has very detailed javadoc.
The current form of the calculation was suggested in earlier feedback. The reviewer's rationale was as follows:

> Staying in `int` simplifies, I think (though for common numbers, doubles do have exact precision; this helps me not speculate ;-)). And adding the skew tolerance before division ensures that with tolerance 1 we get:
> - 2 shards, 2 nodes: allow 2 on each
> - 3 shards, 2 nodes: allow 2 on each (prior version 3, but there is already 1 shard of wiggle room)
> - etc.

The threshold computation has gone through several iterations:

```java
final int threshold = (int) Math.ceil(idealAllocation * indexBalanceConstraintSettings.getExcessShards());
final int threshold = (int) Math.ceil(idealAllocation) + indexBalanceConstraintSettings.getExcessShards();
final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards()) / eligibleNodes.size();
```

There's a broad spectrum of perspectives on this topic. I've made an effort to incorporate as much feedback as possible, though aligning all perspectives has proven challenging. It's been a bit difficult to understand the group's decision-making dynamics in this process.
At this stage, I am inclined to keep it in its current form for this iteration.
I don't think using the integer `ceilDiv` function from `java.lang.Math` contradicts any of the prior advice. For me, we want to have a pretty good reason to re-write logic that exists in the standard library, and I don't see that here. It's just a straightforward readability/maintainability thing.
I think I do feel strongly about this one.

```java
final int threshold = Math.ceilDiv(totalShards + indexBalanceConstraintSettings.getExcessShards(), eligibleNodes.size());
```

is much easier to understand than

```java
final int threshold = (totalShards + eligibleNodes.size() - 1 + indexBalanceConstraintSettings.getExcessShards())
    / eligibleNodes.size();
```

`Math.ceilDiv(int n, int d)` seems to be equivalent to `(n + d - 1) / d` (for non-negative `n` and positive `d`), except that the former is very nicely documented. I don't think @henningandersen would disagree?
Yeah, I agree with this; using a standard function is better.
nicktindall
left a comment
This LGTM
For a cluster with n data nodes hosting an index with m shards, each node ideally should host not significantly more than m / n shards. This new allocation decider acts on this principle, responding with NOT_PREFERRED when a node would be allocated more shards than the threshold.
Nodes that are shutting down are excluded when computing the fair workload. A load-skew tolerance setting is added to permit nodes to own more than the ideal number of shards for an index.
Relates ES-12080
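Pulling the excerpts in this thread together, the decider's core check amounts to something like the following reconstruction (for illustration only; `numberOfOwningShardsForIndex` and the exact message formatting are assumptions, while `Math.ceilDiv`, `allocation.decision`, and the NOT_PREFERRED response follow the review comments and PR description above):

```java
// Reconstructed sketch of the core check, based on the excerpts discussed in this PR.
int threshold = Math.ceilDiv(totalShards + indexBalanceConstraintSettings.getExcessShards(), eligibleNodes.size());
int currentAllocation = node.numberOfOwningShardsForIndex(shardRouting.index()); // assumed accessor
if (currentAllocation >= threshold) {
    return allocation.decision(
        Decision.NOT_PREFERRED,
        NAME,
        "node already holds [%d] shards of index [%s]; the balance threshold is [%d]",
        currentAllocation,
        shardRouting.getIndexName(),
        threshold
    );
}
return allocation.decision(Decision.YES, NAME, "node is under the per-index shard count threshold");
```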