
Conversation

@pxsalehi (Member) commented Sep 10, 2025

This is needed for a new Stateless decider that limits concurrent recoveries based on node heap size.

Relates ES-12554

@pxsalehi pxsalehi added the >non-issue and :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes) labels on Sep 10, 2025
@pxsalehi pxsalehi force-pushed the ps250908-limitConcRecoveries branch from a538165 to d83e803 on September 10, 2025 09:23
Comment on lines 169 to 170
if (shardRouting.isPromotableToPrimary()
&& shardRouting.isSearchable() == false
@pxsalehi (Member Author):

I've chosen to do this so that it only impacts primary relocation in stateless, since that is the only place where this is needed. I didn't want to complicate the stateful flow, as there is no need for it there.

@pxsalehi (Member Author):

There could also be different ways of doing this, e.g. passing Settings to the decider and using STATELESS_ENABLED. I did it this way because in other places, e.g. when deciding whether to use TransportStatelessPrimaryRelocationAction, we do the same kind of check on the shard.
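
For illustration, the kind of shard-level check being referred to could be factored out like this (helper name is hypothetical; an index-only shard, i.e. promotable to primary but not searchable, only exists in stateless):

// Hypothetical helper; mirrors the kind of shard check used when choosing
// TransportStatelessPrimaryRelocationAction, as mentioned above.
private static boolean isStatelessIndexShard(ShardRouting shardRouting) {
    // promotable-to-primary but not searchable => an index-only (stateless) shard copy
    return shardRouting.isPromotableToPrimary() && shardRouting.isSearchable() == false;
}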

@nicktindall (Contributor):

I wonder if we could do this as a completely separate decider that's added by the Stateless plugin? It might be nice to keep it separate from the existing throttling logic, since it's just an additional constraint, right? I don't think there's any reason it needs to be in here.

@pxsalehi (Member Author):

It's an interesting idea, but this seems too niche for a separate decider. It seems very relevant to where it is now, and if we decide to apply the same limitation to stateful it can be done here. Frankly, I don't have a strong argument for either option. It seemed like a small enough change to add it to the existing decider to begin with. We might also change or remove it if we go in the direction of defining some relationship between concurrent relocations and node size.
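
For reference, a minimal sketch of the separate-decider idea (which the PR later adopts); the class name, decider name, and messages are hypothetical, and only the shard check and the incoming-recoveries lookup come from this PR:

import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

// Hypothetical standalone decider: adds an extra constraint on top of the existing
// ThrottlingAllocationDecider instead of changing it.
public class StatelessPrimaryRecoveryThrottlingDecider extends AllocationDecider {
    private static final String NAME = "stateless_primary_recovery_throttling"; // assumed name

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        // Only index-only (stateless) primaries are subject to the extra limit.
        if (shardRouting.isPromotableToPrimary() == false || shardRouting.isSearchable()) {
            return allocation.decision(Decision.YES, NAME, "not a stateless primary recovery");
        }
        int incoming = allocation.routingNodes().getIncomingRecoveries(node.nodeId());
        int limit = limitForNode(node, allocation); // heap-based limit, see the discussion below
        return incoming >= limit
            ? allocation.decision(Decision.THROTTLE, NAME, "too many incoming recoveries [%d >= %d]", incoming, limit)
            : allocation.decision(Decision.YES, NAME, "below the concurrent recovery limit");
    }

    private static int limitForNode(RoutingNode node, RoutingAllocation allocation) {
        return 2; // placeholder; the actual limit is out of scope for this sketch
    }
}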

@pxsalehi pxsalehi marked this pull request as ready for review September 10, 2025 10:06
@pxsalehi pxsalehi requested a review from a team as a code owner September 10, 2025 10:06
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (Meta label for Distributed Coordination team) label on Sep 10, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall (Contributor) left a comment:

Looks good, just a question about whether we should split this out into its own decider.


// Allocating a shard to this node will increase the incoming recoveries
int currentInRecoveries = allocation.routingNodes().getIncomingRecoveries(node.nodeId());
if (currentInRecoveries >= concurrentIncomingRecoveries) {
@mhl-b (Contributor) commented Sep 11, 2025:

Should we change concurrentIncomingRecoveries to be a function of maxHeapSize? There are projects with a large number of shards (10k+ per node) and large heaps that take a long time (1h+) to relocate.

Something like

int concurrentRecoveriesThreshold = concurrentIncomingRecoveries;
if (dynamicConcurrentRecoveriesSetting) {
  concurrentRecoveriesThreshold = fn(maxHeap);
}
if (currentInRecoveries >= concurrentRecoveriesThreshold) {
...

As a heuristic I would use 1 recovery for every 2GB of max heap. That would solve the problem for 4GB (2GB heap, 1 recovery) and 64GB (32GB heap, 16 recoveries) nodes as well.
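
If that heuristic were adopted, the fn(maxHeap) above could look roughly like this (method name hypothetical):

import org.elasticsearch.common.unit.ByteSizeValue;

// Hypothetical: derive the concurrent-recovery threshold from the node's max heap,
// roughly one recovery per 2GB of heap, never less than one.
static int recoveriesForHeap(ByteSizeValue maxHeap) {
    long perRecovery = ByteSizeValue.ofGb(2).getBytes();
    return Math.max(1, Math.toIntExact(maxHeap.getBytes() / perRecovery));
}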

@pxsalehi (Member Author):

Once we decide that's what we want and how to do it, we can extend the new stateless decider.

@henningandersen (Contributor) left a comment:

I am slightly worried about adding a setting that we may want to remove again, and also that it is settable for non-stateless. Otherwise I am good with this, if we can find a way around that to avoid any BWC implications of removing the setting.

Setting.memorySizeSetting(
"cluster.routing.allocation.min_heap_required_for_concurrent_primary_recoveries",
ByteSizeValue.ZERO,
Property.Dynamic,
Contributor:

Can we make it OperatorDynamic? (Seems more appropriate, but it may not matter too much.)
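
i.e. roughly the following (property list assumed; only the setting name and default come from the diff above):

Setting.memorySizeSetting(
    "cluster.routing.allocation.min_heap_required_for_concurrent_primary_recoveries",
    ByteSizeValue.ZERO,
    Setting.Property.OperatorDynamic, // operator-only dynamic updates instead of plain Dynamic
    Setting.Property.NodeScope
);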

ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_INCOMING_RECOVERIES_SETTING,
ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_OUTGOING_RECOVERIES_SETTING,
ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING,
ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_MIN_HEAP_REQUIRED_FOR_CONCURRENT_PRIMARY_RECOVERIES_SETTING,
Contributor:

I wonder if we can register this in stateless only (and in tests), to avoid any BWC implications if we end up removing this again and replacing it with a more elaborate mechanism. Hmm, that might make any lookup fail, which complicates things.

Perhaps we can "bomb" the validation by only allowing this when running stateless?

@pxsalehi (Member Author):

I'd rather go with Nick's suggestion. I've updated the PR.

@pxsalehi pxsalehi changed the title from "Make concurrent primary recoveries dependant on heap size" to "Expose node heap size in cluster info" on Sep 12, 2025
@pxsalehi (Member Author):

This PR now only exposes the node max heap size in cluster info. I've added a new stateless decider that does the throttling.
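
For context, a rough sketch of how a decider could consume the newly exposed value (the null-fallback behavior is an assumption):

// Hypothetical usage inside an allocation decider: look up the target node's max heap
// from the ClusterInfo that this PR now populates.
ByteSizeValue maxHeap = allocation.clusterInfo().getMaxHeapSizePerNode().get(node.nodeId());
if (maxHeap == null) {
    // heap stats not collected yet for this node; fall back to the static limit
}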

* This method returns the corresponding initializing shard that would be allocated to this node.
*/
- private static ShardRouting initializingShard(ShardRouting shardRouting, String currentNodeId) {
+ public static ShardRouting initializingShard(ShardRouting shardRouting, String currentNodeId) {
@pxsalehi (Member Author):

This is needed to do the same assertion that is done in this decider, but in the stateless decider.

@henningandersen (Contributor) left a comment:

LGTM.

nodeThreadPoolUsageStatsPerNode,
- indicesStatsSummary.shardWriteLoads()
+ indicesStatsSummary.shardWriteLoads(),
+ maxHeapPerNode
Contributor:

nit: maxHeapPerNode is a volatile field. While it will not matter in the current work, I would find it slightly better to capture the value of maxHeapPerNode once in this method, so that maxHeapPerNode and the estimatedHeapUsages are guaranteed to be created from the same maxHeapPerNode instance.
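
i.e. roughly (local variable name assumed):

// Read the volatile field once so both derived structures use the same snapshot.
final Map<String, ByteSizeValue> maxHeapPerNode = this.maxHeapPerNode; // single volatile read
// ... estimatedHeapUsages and the ClusterInfo are then built from this local ...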

@pxsalehi (Member Author):

sure. 75f7cd0.

return result == null ? ReservedSpace.EMPTY : result;
}

public Map<String, ByteSizeValue> getMaxHeapSizePerNode() {
Contributor:

Can we add a test similar to IndexShardIT.testHeapUsageEstimateIsPresent? We can postpone it to follow-up work if that matches the timing better.
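
Such a test could look roughly like this (modeled on the heap-usage-estimate test mentioned above rather than copied from it; test name and helper usage are assumptions):

public void testMaxHeapSizePerNodeIsPresent() {
    // Hypothetical sketch: refresh cluster info, then assert every node reports a max heap size.
    InternalClusterInfoService clusterInfoService = (InternalClusterInfoService) getInstanceFromNode(ClusterInfoService.class);
    ClusterInfo clusterInfo = ClusterInfoServiceUtils.refresh(clusterInfoService);
    for (DiscoveryNode node : getInstanceFromNode(ClusterService.class).state().nodes()) {
        assertNotNull("expected a max heap size for node " + node.getId(),
            clusterInfo.getMaxHeapSizePerNode().get(node.getId()));
    }
}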

@pxsalehi (Member Author):

Added a test in 44dcc78.

@DaveCTurner (Contributor) left a comment:

LGTM2

final Map<String, EstimatedHeapUsage> estimatedHeapUsages;
final Map<String, NodeUsageStatsForThreadPools> nodeUsageStatsForThreadPools;
final Map<ShardId, Double> shardWriteLoads;
final Map<String, ByteSizeValue> maxHeapSizePerNode;
Contributor:

nit: not really a problem introduced here but it'd be nice to indicate what the opaque String keys are -- presumably persistent node IDs?
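
The fix could be as small as a field-level comment, e.g. (assuming the keys are indeed node IDs, as presumed above):

/** Max heap size per node, keyed by the persistent node ID (DiscoveryNode#getId()). */
final Map<String, ByteSizeValue> maxHeapSizePerNode;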

@pxsalehi (Member Author):

Added a comment in 44dcc78.

@pxsalehi pxsalehi added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) label on Sep 12, 2025
@elasticsearchmachine elasticsearchmachine merged commit 4db0011 into elastic:main Sep 12, 2025
34 checks passed
@pxsalehi pxsalehi deleted the ps250908-limitConcRecoveries branch September 12, 2025 13:09
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025
This is needed for a new Stateless decider that limits concurrent
recoveries based on node heap size.

Relates ES-12554
Labels: auto-merge-without-approval, :Distributed Coordination/Allocation, >non-issue, Team:Distributed Coordination, v9.2.0