Merged

113 commits
8d4d036
Allocation: Include index shard counts as a criteria
zhubotang-wq Sep 28, 2025
19dea87
Allocation: Include index shard counts as a criteria
zhubotang-wq Sep 29, 2025
f1bff0d
Merge remote-tracking branch 'upstream/main' into 12080-index-shard-c…
zhubotang-wq Sep 30, 2025
3016a65
Allocation: Include index shard counts as a criteria
zhubotang-wq Sep 30, 2025
25b5bca
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 2, 2025
7913808
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 2, 2025
b0e7186
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 2, 2025
0d79752
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 2, 2025
5fc01c8
Update docs/changelog/135875.yaml
zhubotang-wq Oct 2, 2025
6d95721
[CI] Auto commit changes from spotless
Oct 2, 2025
f4610f3
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 6, 2025
faf8187
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 14, 2025
5212d02
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 14, 2025
264a5ab
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 18, 2025
11aaace
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 18, 2025
0365a62
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 18, 2025
4a55cce
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 18, 2025
760b878
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 19, 2025
eb32f72
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 19, 2025
81cf332
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 19, 2025
6cc0845
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 19, 2025
ab2bc00
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 19, 2025
45c11b6
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 19, 2025
3a9f656
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 20, 2025
206f215
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 20, 2025
850a7e7
[CI] Auto commit changes from spotless
Oct 20, 2025
25d467d
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 21, 2025
ff29c07
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 22, 2025
aabb099
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 22, 2025
b54975c
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 22, 2025
79e1805
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 22, 2025
2fb51c3
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 23, 2025
304c4ce
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 23, 2025
d14661e
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 23, 2025
277e4d6
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 23, 2025
dc8ac5b
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 23, 2025
f69bc6f
[CI] Auto commit changes from spotless
Oct 23, 2025
868fc33
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 23, 2025
5fc5c6c
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 23, 2025
be83b88
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 23, 2025
09f2602
Allocation: Include index shard counts as a criteria
zhubotang-wq Oct 23, 2025
7d6da1f
[CI] Auto commit changes from spotless
Oct 23, 2025
ef43160
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 24, 2025
ff666be
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 28, 2025
27c5f8a
fix test regressions
zhubotang-wq Oct 28, 2025
9358387
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 28, 2025
e66278d
Update server/src/main/java/org/elasticsearch/cluster/routing/allocat…
zhubotang-wq Oct 30, 2025
b95a7ba
Update server/src/main/java/org/elasticsearch/cluster/routing/allocat…
zhubotang-wq Oct 30, 2025
d45908f
Update server/src/main/java/org/elasticsearch/cluster/routing/allocat…
zhubotang-wq Oct 30, 2025
83c6101
fix test regressions
zhubotang-wq Oct 31, 2025
a6bb01f
Merge branch '12080-index-shard-count-allocation-decider' of github.c…
zhubotang-wq Oct 31, 2025
67105f2
fix test regressions
zhubotang-wq Oct 31, 2025
5f59868
fix test regressions
zhubotang-wq Oct 31, 2025
a21e2d6
fix test regressions
zhubotang-wq Oct 31, 2025
7057bf0
fix test regressions
zhubotang-wq Oct 31, 2025
aebd924
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Oct 31, 2025
89aea86
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 2, 2025
292c550
fix test regressions
zhubotang-wq Nov 3, 2025
1e0b9f8
Merge branch '12080-index-shard-count-allocation-decider' of github.c…
zhubotang-wq Nov 3, 2025
14c50e8
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 3, 2025
fdc69c7
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 3, 2025
4f83a9e
fix test regressions
zhubotang-wq Nov 3, 2025
e0db4ac
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 7, 2025
3314a01
Update server/src/main/java/org/elasticsearch/cluster/routing/allocat…
zhubotang-wq Nov 7, 2025
f6a1500
[CI] Auto commit changes from spotless
Nov 7, 2025
c2a0d75
Address feedbacks
zhubotang-wq Nov 7, 2025
12b3e27
[CI] Auto commit changes from spotless
Nov 7, 2025
937bdad
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 7, 2025
aba5dcf
Address feedbacks
zhubotang-wq Nov 7, 2025
2693b32
Merge branch '12080-index-shard-count-allocation-decider' of github.c…
zhubotang-wq Nov 7, 2025
0860cca
Update docs/changelog/135875.yaml
zhubotang-wq Nov 7, 2025
76688d6
Address feedbacks
zhubotang-wq Nov 7, 2025
333331e
Merge remote-tracking branch 'origin/12080-index-shard-count-allocati…
zhubotang-wq Nov 7, 2025
03e5d1b
[CI] Auto commit changes from spotless
Nov 7, 2025
d75c2cf
Address feedbacks
zhubotang-wq Nov 7, 2025
a902ede
Merge remote-tracking branch 'origin/12080-index-shard-count-allocati…
zhubotang-wq Nov 7, 2025
3e52b29
Address feedbacks
zhubotang-wq Nov 7, 2025
d8911aa
Address feedbacks
zhubotang-wq Nov 7, 2025
b081af2
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 7, 2025
608d5c0
Address feedbacks
zhubotang-wq Nov 8, 2025
7451a90
Merge remote-tracking branch 'origin/12080-index-shard-count-allocati…
zhubotang-wq Nov 8, 2025
21c403c
Address feedbacks
zhubotang-wq Nov 8, 2025
e7671a8
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 10, 2025
f1eb4f0
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 10, 2025
de4cb7e
Address feedbacks
zhubotang-wq Nov 10, 2025
902a8d8
[CI] Auto commit changes from spotless
Nov 10, 2025
ee3739e
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 10, 2025
06c1e0e
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 12, 2025
bf230b4
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 13, 2025
25b656e
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 13, 2025
f96e988
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 14, 2025
f5e8a09
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 19, 2025
dc08221
Changes to support stateless test
zhubotang-wq Nov 19, 2025
cbe8a5b
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 19, 2025
27bcff8
Changes to support stateless test
zhubotang-wq Nov 19, 2025
350c2db
Changes to support stateless test
zhubotang-wq Nov 20, 2025
4d56505
Changes to support stateless test
zhubotang-wq Nov 20, 2025
ba6db1b
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 20, 2025
68fb74b
Changes to support stateless test
zhubotang-wq Nov 20, 2025
cdd2ecc
address feedbacks.
zhubotang-wq Nov 21, 2025
406c6ea
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 21, 2025
1abb626
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 24, 2025
e9d0d32
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 24, 2025
09d8714
address feedbacks.
zhubotang-wq Nov 25, 2025
f56aa5f
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 25, 2025
719b8fb
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 25, 2025
7fe2155
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 25, 2025
da55db3
address feedbacks.
zhubotang-wq Nov 25, 2025
c43babe
Merge remote-tracking branch 'origin/12080-index-shard-count-allocati…
zhubotang-wq Nov 25, 2025
e2f576d
address feedbacks.
zhubotang-wq Nov 25, 2025
4526700
address feedbacks.
zhubotang-wq Nov 25, 2025
4fd29ee
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 25, 2025
860d514
Merge branch 'main' into 12080-index-shard-count-allocation-decider
zhubotang-wq Nov 26, 2025
@@ -58,6 +58,7 @@
import org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider;
import org.elasticsearch.cluster.routing.allocation.decider.EnableAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.FilterAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.IndexBalanceAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.IndexVersionAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.MaxRetryAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.NodeReplacementAllocationDecider;
@@ -497,6 +498,7 @@ public static Collection<AllocationDecider> createAllocationDeciders(
addAllocationDecider(deciders, new ThrottlingAllocationDecider(clusterSettings));
addAllocationDecider(deciders, new ShardsLimitAllocationDecider(clusterSettings));
addAllocationDecider(deciders, new AwarenessAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new IndexBalanceAllocationDecider(settings, clusterSettings));
Contributor

as suggested above, is this something we could add/configure only in the serverless plugin?

Contributor Author

Ack. My previous replies agree this is sound advice. At this stage, I am inclined to leave this refactoring to follow-up iterations.


clusterPlugins.stream()
.flatMap(p -> p.createAllocationDeciders(settings, clusterSettings).stream())
@@ -250,6 +250,10 @@ private boolean isSingleNodeFilterInternal() {
|| (filters.size() > 1 && opType == OpType.AND && NON_ATTRIBUTE_NAMES.containsAll(filters.keySet()));
}

public boolean hasFilters() {
return filters.isEmpty() == false;
}

/**
* Generates a human-readable string for the DiscoverNodeFilters.
* Example: {@code _id:"id1 OR blah",name:"blah OR name2"}
@@ -0,0 +1,59 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the "Elastic License
* 2.0", the "GNU Affero General Public License v3.0 only", and the "Server Side
* Public License v 1"; you may not use this file except in compliance with, at
* your election, the "Elastic License 2.0", the "GNU Affero General Public
* License v3.0 only", or the "Server Side Public License, v 1".
*/

package org.elasticsearch.cluster.routing.allocation;

import org.elasticsearch.common.settings.ClusterSettings;
import org.elasticsearch.common.settings.Setting;

/**
* Settings definitions for the index shard count allocation decider and associated infrastructure
*/
public class IndexBalanceConstraintSettings {

private static final String SETTING_PREFIX = "cluster.routing.allocation.index_balance_decider.";

public static final Setting<Boolean> INDEX_BALANCE_DECIDER_ENABLED_SETTING = Setting.boolSetting(
SETTING_PREFIX + "enabled",
false,
Setting.Property.Dynamic,
Setting.Property.NodeScope
);

/**
 * This setting permits nodes to host more than the ideally balanced number of index shards.
* Maximum tolerated index shard count = ideal + skew_tolerance
* i.e. ideal = 4 shards, skew_tolerance = 1
Contributor

replace 'skew' in comment

Contributor Author

I am inclined to keep this unchanged, since skew tolerance is synonymous with excess shards.

* maximum tolerated index shards = 4 + 1 = 5.
*/
public static final Setting<Integer> INDEX_BALANCE_DECIDER_EXCESS_SHARDS = Setting.intSetting(
SETTING_PREFIX + "excess_shards",
0,
0,
Setting.Property.Dynamic,
Setting.Property.NodeScope
);

private volatile boolean deciderEnabled;
private volatile int excessShards;

public IndexBalanceConstraintSettings(ClusterSettings clusterSettings) {
clusterSettings.initializeAndWatch(INDEX_BALANCE_DECIDER_ENABLED_SETTING, enabled -> this.deciderEnabled = enabled);
clusterSettings.initializeAndWatch(INDEX_BALANCE_DECIDER_EXCESS_SHARDS, value -> this.excessShards = value);
}

public boolean isDeciderEnabled() {
return this.deciderEnabled;
}

public int getExcessShards() {
return this.excessShards;
}

}
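
Putting the two settings together: with the decider enabled, a node's cap on shards of one index is ceil((totalShards + excess_shards) / eligibleNodes). Below is a minimal test-style sketch (not part of the diff; method name assumed) of how this holder could be exercised, assuming the usual ClusterSettings(Settings, Set) constructor and that applySettings re-fires consumers registered via initializeAndWatch:

import java.util.Set;
import org.elasticsearch.cluster.routing.allocation.IndexBalanceConstraintSettings;
import org.elasticsearch.common.settings.ClusterSettings;
import org.elasticsearch.common.settings.Settings;

// Hypothetical unit-test sketch: register the two new settings,
// then flip the dynamic value at runtime.
public void testIndexBalanceSettingsAreDynamic() {
    Settings initial = Settings.builder()
        .put("cluster.routing.allocation.index_balance_decider.enabled", true)
        .put("cluster.routing.allocation.index_balance_decider.excess_shards", 1)
        .build();
    ClusterSettings clusterSettings = new ClusterSettings(
        initial,
        Set.of(
            IndexBalanceConstraintSettings.INDEX_BALANCE_DECIDER_ENABLED_SETTING,
            IndexBalanceConstraintSettings.INDEX_BALANCE_DECIDER_EXCESS_SHARDS
        )
    );
    // The constructor reads the initial values via initializeAndWatch.
    IndexBalanceConstraintSettings balanceSettings = new IndexBalanceConstraintSettings(clusterSettings);
    assert balanceSettings.isDeciderEnabled() && balanceSettings.getExcessShards() == 1;

    // Dynamic update: the registered watcher picks up the changed value.
    clusterSettings.applySettings(
        Settings.builder().put(initial).put("cluster.routing.allocation.index_balance_decider.excess_shards", 2).build()
    );
    assert balanceSettings.getExcessShards() == 2;
}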
@@ -1581,6 +1581,11 @@ private boolean tryRelocateShard(ModelNode minNode, ModelNode maxNode, ProjectIn
logger.trace("No shards of [{}] can relocate from [{}] to [{}]", idx, maxNode.getNodeId(), minNode.getNodeId());
return false;
}

// Visible for testing.
public RoutingAllocation getAllocation() {
return this.allocation;
}
}

public static class ModelNode implements Iterable<ModelIndex> {
@@ -1824,7 +1829,8 @@ public WeightFunction getWeightFunction() {
}
}

record ProjectIndex(ProjectId project, String indexName) {
// Visible for testing.
public record ProjectIndex(ProjectId project, String indexName) {
ProjectIndex(RoutingAllocation allocation, ShardRouting shard) {
this(allocation.metadata().projectFor(shard.index()).id(), shard.getIndexName());
}
@@ -60,7 +60,8 @@ public WeightFunction(float shardBalance, float indexBalance, float writeLoadBal
theta3 = diskUsageBalance / sum;
}

float calculateNodeWeightWithIndex(
// Visible for testing
public float calculateNodeWeightWithIndex(
BalancedShardsAllocator.Balancer balancer,
BalancedShardsAllocator.ModelNode node,
ProjectIndex index
@@ -62,9 +62,9 @@ public class FilterAllocationDecider extends AllocationDecider {

public static final String NAME = "filter";

private static final String CLUSTER_ROUTING_REQUIRE_GROUP_PREFIX = "cluster.routing.allocation.require";
private static final String CLUSTER_ROUTING_INCLUDE_GROUP_PREFIX = "cluster.routing.allocation.include";
private static final String CLUSTER_ROUTING_EXCLUDE_GROUP_PREFIX = "cluster.routing.allocation.exclude";
public static final String CLUSTER_ROUTING_REQUIRE_GROUP_PREFIX = "cluster.routing.allocation.require";
public static final String CLUSTER_ROUTING_INCLUDE_GROUP_PREFIX = "cluster.routing.allocation.include";
public static final String CLUSTER_ROUTING_EXCLUDE_GROUP_PREFIX = "cluster.routing.allocation.exclude";

public static final Setting.AffixSetting<List<String>> CLUSTER_ROUTING_REQUIRE_GROUP_SETTING = Setting.prefixKeySetting(
CLUSTER_ROUTING_REQUIRE_GROUP_PREFIX + ".",
@@ -0,0 +1,172 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the "Elastic License
* 2.0", the "GNU Affero General Public License v3.0 only", and the "Server Side
* Public License v 1"; you may not use this file except in compliance with, at
* your election, the "Elastic License 2.0", the "GNU Affero General Public
* License v3.0 only", or the "Server Side Public License, v 1".
*/

package org.elasticsearch.cluster.routing.allocation.decider;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.metadata.IndexMetadata;
import org.elasticsearch.cluster.metadata.ProjectId;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.node.DiscoveryNodeFilters;
import org.elasticsearch.cluster.node.DiscoveryNodeRole;
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.IndexBalanceConstraintSettings;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.common.settings.ClusterSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.core.Strings;
import org.elasticsearch.index.Index;

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import static org.elasticsearch.cluster.node.DiscoveryNodeFilters.OpType.AND;
import static org.elasticsearch.cluster.node.DiscoveryNodeFilters.OpType.OR;
import static org.elasticsearch.cluster.node.DiscoveryNodeRole.INDEX_ROLE;
import static org.elasticsearch.cluster.node.DiscoveryNodeRole.SEARCH_ROLE;
import static org.elasticsearch.cluster.routing.allocation.decider.FilterAllocationDecider.CLUSTER_ROUTING_EXCLUDE_GROUP_SETTING;
import static org.elasticsearch.cluster.routing.allocation.decider.FilterAllocationDecider.CLUSTER_ROUTING_INCLUDE_GROUP_SETTING;
import static org.elasticsearch.cluster.routing.allocation.decider.FilterAllocationDecider.CLUSTER_ROUTING_REQUIRE_GROUP_SETTING;

/**
* For an index of n shards hosted by a cluster of m nodes, a node should not host
* significantly more than n / m shards. This allocation decider enforces this principle.
* This allocation decider excludes any nodes flagged for shutdown from consideration
* when computing optimal shard distributions.
*/
public class IndexBalanceAllocationDecider extends AllocationDecider {

private static final Logger logger = LogManager.getLogger(IndexBalanceAllocationDecider.class);
private static final String EMPTY = "";

public static final String NAME = "index_balance";

private final IndexBalanceConstraintSettings indexBalanceConstraintSettings;
private final boolean isStateless;

private volatile DiscoveryNodeFilters clusterRequireFilters;
private volatile DiscoveryNodeFilters clusterIncludeFilters;
private volatile DiscoveryNodeFilters clusterExcludeFilters;

public IndexBalanceAllocationDecider(Settings settings, ClusterSettings clusterSettings) {
this.indexBalanceConstraintSettings = new IndexBalanceConstraintSettings(clusterSettings);
setClusterRequireFilters(CLUSTER_ROUTING_REQUIRE_GROUP_SETTING.getAsMap(settings));
setClusterExcludeFilters(CLUSTER_ROUTING_EXCLUDE_GROUP_SETTING.getAsMap(settings));
setClusterIncludeFilters(CLUSTER_ROUTING_INCLUDE_GROUP_SETTING.getAsMap(settings));
clusterSettings.addAffixMapUpdateConsumer(CLUSTER_ROUTING_REQUIRE_GROUP_SETTING, this::setClusterRequireFilters, (a, b) -> {});
clusterSettings.addAffixMapUpdateConsumer(CLUSTER_ROUTING_EXCLUDE_GROUP_SETTING, this::setClusterExcludeFilters, (a, b) -> {});
clusterSettings.addAffixMapUpdateConsumer(CLUSTER_ROUTING_INCLUDE_GROUP_SETTING, this::setClusterIncludeFilters, (a, b) -> {});
isStateless = DiscoveryNode.isStateless(settings);
}

@Override
public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
if (indexBalanceConstraintSettings.isDeciderEnabled() == false || isStateless == false || hasFilters()) {
return allocation.decision(Decision.YES, NAME, "Decider is disabled.");
}
Contributor

If it's disabled for stateful, I wonder if we could just configure it to be added in the stateless plugin? Then we wouldn't need to check isStateless every time.

Contributor

Also hasNoFilters() == false is harder to read than filtersAreConfigured() or similar. It's Friday here, I can't process a double-negative.

Contributor Author @zhubotang-wq Nov 21, 2025

Refactored to remove the double negative. Given the follow-up plans to add stateful logic, as well as the protracted nature of this iteration, I am inclined to keep its current placement.

I completely agree that placing this in the stateless plugin is a much better choice than the current location. Since the decision to make this a stateless-only decider was made nearly two months ago, I'm a bit surprised this option didn't come up earlier in the dozens of previous comments.

Maybe the reviewers could concentrate more on the architectural/logic aspects here; those will have a greater impact than the variable naming details. That would have enabled me to take this approach far earlier.


Index index = shardRouting.index();
if (node.hasIndex(index) == false) {
return allocation.decision(Decision.YES, NAME, "Node does not currently host this index.");
}

assert node.node() != null;
assert node.node().getRoles().contains(INDEX_ROLE) || node.node().getRoles().contains(SEARCH_ROLE);
Contributor

Suggested change
assert node.node().getRoles().contains(INDEX_ROLE) || node.node().getRoles().contains(SEARCH_ROLE);
assert node.node().getRoles().contains(INDEX_ROLE) || node.node().getRoles().contains(SEARCH_ROLE)
    : "Unexpected role found: " + node.node().getRoles();

Contributor Author @zhubotang-wq Nov 14, 2025

I am inclined to keep it unchanged. In testing, this assert message is superfluous.

Contributor

If this assert were to fail (if there is a bug), it would not be clear what event caused the code to fail. I believe all the assert will currently tell you is that the expression was false overall. Supplying what type of node leaked through would make a potential test failure faster to debug, is the idea.

Not a hill I'll die on, though.


if (node.node().getRoles().contains(INDEX_ROLE) && shardRouting.primary() == false) {
return allocation.decision(Decision.YES, NAME, "An index node cannot own search shards. Decider inactive.");
}

if (node.node().getRoles().contains(SEARCH_ROLE) && shardRouting.primary()) {
return allocation.decision(Decision.YES, NAME, "A search node cannot own primary shards. Decider inactive.");
}
Contributor

Could we perhaps combine this into a single check like co.elastic.elasticsearch.stateless.allocation.StatelessAllocationDecider#canAllocateShardToNode so it doesn't distract from the focus of this decider? (lines 84-92 could be replaced by such a check maybe?)

Contributor Author

Ack. In hindsight, the deciders ought to have been placed in the stateless repo as part of the stateless plugin.

As I mentioned earlier, this feedback makes absolute sense, since canAllocateShardToNode deals with the precise requirement here.

At this stage, I am inclined to leave this refactoring to a follow-up PR, when canRemain() is added.


final ProjectId projectId = allocation.getClusterState().metadata().projectFor(index).id();
final Set<DiscoveryNode> eligibleNodes = new HashSet<>();
int totalShards = 0;
String nomenclature = EMPTY;

if (node.node().getRoles().contains(INDEX_ROLE)) {
collectEligibleNodes(allocation, eligibleNodes, INDEX_ROLE);
// Primary shards only.
totalShards = allocation.getClusterState().routingTable(projectId).index(index).size();
nomenclature = "index";
} else if (node.node().getRoles().contains(SEARCH_ROLE)) {
collectEligibleNodes(allocation, eligibleNodes, SEARCH_ROLE);
// Replicas only.
final IndexMetadata indexMetadata = allocation.getClusterState().metadata().getProject(projectId).index(index);
totalShards = indexMetadata.getNumberOfShards() * indexMetadata.getNumberOfReplicas();
nomenclature = "search";
}

assert eligibleNodes.isEmpty() == false;
Contributor

We haven't already discussed this, have we? I'd think we'd want to exit early rather than assert here. We could say YES, and that there are no non-shutting-down nodes to consider.

We know that the node in question has a shard of the index in question. But what if that were the only index node in the cluster, and it's also shutting down?

Contributor Author

If I'm remembering correctly, this assertion was introduced based on earlier review guidance. The new feedback seems to take a different position, and I'm concerned the shifting expectations may be introducing avoidable delays. Could we clarify the preferred direction?

Contributor

> If I'm remembering correctly, this assertion was introduced based on earlier review guidance.

There have been multiple evolutions of this code, including changes to the logic in the method filtering down to the eligibleNodes, and many of the early exit checks above this. I think we did discuss this verbally at one point before, maybe a couple weeks ago, though I don't recall the surrounding context.

> Could we clarify the preferred direction?

In this case, the preferred direction is making the code robust against failure. I reviewed the code and this seems like an issue. It should be fixed if you agree it's possible, or otherwise discussed why it is not.

> The new feedback seems to take a different position

I could have been wrong in what I previously advised, I'm not sure. I do try to be correct. The best way to avoid this would be to try to make sure you understand why I recommend something: trust, but verify.

Contributor

I wondered about this also; if the cluster were shutting down, all nodes would be marked as shutting down and we'd get an empty set here, right?

if (eligibleNodes.isEmpty()) {
return allocation.decision(Decision.YES, NAME, "There are no eligible nodes available.");
}
assert totalShards > 0;
final double idealAllocation = Math.ceil((double) totalShards / eligibleNodes.size());

// Adding the excess shards before division ensures that with tolerance 1 we get:
// 2 shards, 2 nodes, allow 2 on each
// 3 shards, 2 nodes, allow 2 on each etc.
final int threshold = Math.ceilDiv(totalShards + indexBalanceConstraintSettings.getExcessShards(), eligibleNodes.size());
final int currentAllocation = node.numberOfOwningShardsForIndex(index);

if (currentAllocation >= threshold) {
String explanation = Strings.format(
"There are [%d] eligible nodes in the [%s] tier for assignment of [%d] shards in index [%s]. Ideally no more than [%.0f] "
+ "shard would be assigned per node (the index balance excess shards setting is [%d]). This node is already assigned"
+ " [%d] shards of the index.",
eligibleNodes.size(),
nomenclature,
totalShards,
index,
idealAllocation,
indexBalanceConstraintSettings.getExcessShards(),
currentAllocation
);

logger.trace(explanation);
Contributor

Should this be logger.debug?

Contributor Author

For this logging statement, four different reviewers offered different opinions. At this stage, I am inclined to leave it unchanged.


return allocation.decision(Decision.NOT_PREFERRED, NAME, explanation);
}

return allocation.decision(Decision.YES, NAME, "Node index shard allocation is under the threshold.");
}

private void collectEligibleNodes(RoutingAllocation allocation, Set<DiscoveryNode> eligibleNodes, DiscoveryNodeRole role) {
for (DiscoveryNode discoveryNode : allocation.nodes()) {
if (discoveryNode.getRoles().contains(role) && allocation.metadata().nodeShutdowns().contains(discoveryNode.getId()) == false) {
eligibleNodes.add(discoveryNode);
}
}
}

private void setClusterRequireFilters(Map<String, List<String>> filters) {
clusterRequireFilters = DiscoveryNodeFilters.trimTier(DiscoveryNodeFilters.buildFromKeyValues(AND, filters));
}

private void setClusterIncludeFilters(Map<String, List<String>> filters) {
clusterIncludeFilters = DiscoveryNodeFilters.trimTier(DiscoveryNodeFilters.buildFromKeyValues(OR, filters));
}

private void setClusterExcludeFilters(Map<String, List<String>> filters) {
clusterExcludeFilters = DiscoveryNodeFilters.trimTier(DiscoveryNodeFilters.buildFromKeyValues(OR, filters));
}

private boolean hasFilters() {
return (clusterExcludeFilters != null && clusterExcludeFilters.hasFilters())
|| (clusterIncludeFilters != null && clusterIncludeFilters.hasFilters())
|| (clusterRequireFilters != null && clusterRequireFilters.hasFilters());
}
}
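
For reference, the per-node cap computed in canAllocate above can be reproduced in isolation. A standalone sketch (not part of the diff; the helper name is hypothetical, and Math.ceilDiv requires Java 18+) of the rounding behavior described in the in-code comment:

// Hypothetical helper mirroring the decider's per-node cap:
// at most ceil((totalShards + excessShards) / eligibleNodes) shards of one index.
static int threshold(int totalShards, int excessShards, int eligibleNodes) {
    return Math.ceilDiv(totalShards + excessShards, eligibleNodes);
}

public static void main(String[] args) {
    // The cases from the in-code comment, with excess_shards = 1:
    System.out.println(threshold(2, 1, 2)); // 2 -> 2 shards, 2 nodes, allow 2 on each
    System.out.println(threshold(3, 1, 2)); // 2 -> 3 shards, 2 nodes, allow 2 on each
    // With the default excess_shards = 0, the cap equals the ideal ceil(n / m):
    System.out.println(threshold(4, 0, 3)); // 2
}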
@@ -46,6 +46,7 @@
import org.elasticsearch.cluster.routing.OperationRouting;
import org.elasticsearch.cluster.routing.allocation.DataTier;
import org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings;
import org.elasticsearch.cluster.routing.allocation.IndexBalanceConstraintSettings;
import org.elasticsearch.cluster.routing.allocation.WriteLoadConstraintSettings;
import org.elasticsearch.cluster.routing.allocation.allocator.AllocationBalancingRoundSummaryService;
import org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator;
@@ -659,6 +660,8 @@ public void apply(Settings value, Settings current, Settings previous) {
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_HIGH_UTILIZATION_DURATION_SETTING,
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_QUEUE_LATENCY_THRESHOLD_SETTING,
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_REROUTE_INTERVAL_SETTING,
IndexBalanceConstraintSettings.INDEX_BALANCE_DECIDER_ENABLED_SETTING,
IndexBalanceConstraintSettings.INDEX_BALANCE_DECIDER_EXCESS_SHARDS,
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_MINIMUM_LOGGING_INTERVAL,
SamplingService.TTL_POLL_INTERVAL_SETTING,
BlobStoreRepository.MAX_HEAP_SIZE_FOR_SNAPSHOT_DELETION_SETTING,
@@ -23,6 +23,7 @@
import org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider;
import org.elasticsearch.cluster.routing.allocation.decider.EnableAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.FilterAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.IndexBalanceAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.IndexVersionAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.MaxRetryAllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.NodeReplacementAllocationDecider;
@@ -286,7 +287,8 @@ public void testAllocationDeciderOrder() {
DiskThresholdDecider.class,
ThrottlingAllocationDecider.class,
ShardsLimitAllocationDecider.class,
AwarenessAllocationDecider.class
AwarenessAllocationDecider.class,
IndexBalanceAllocationDecider.class
);
Collection<AllocationDecider> deciders = ClusterModule.createAllocationDeciders(
Settings.EMPTY,
@@ -1277,7 +1277,7 @@ private static class NodeNameDrivenWeightFunction extends WeightFunction {
}

@Override
float calculateNodeWeightWithIndex(
public float calculateNodeWeightWithIndex(
BalancedShardsAllocator.Balancer balancer,
BalancedShardsAllocator.ModelNode node,
BalancedShardsAllocator.ProjectIndex index