Expose node heap size in cluster info #134436
Conversation
Force-pushed from a538165 to d83e803
if (shardRouting.isPromotableToPrimary()
    && shardRouting.isSearchable() == false
I've chosen to do this so that it only affects primary relocation in stateless, since that is the only place where it is needed. I didn't want to complicate the stateful flow, as there is no need for that.
There are also other ways of doing this, e.g. passing Settings to the decider and using STATELESS_ENABLED. I did it this way because in other places, e.g. when deciding whether to use TransportStatelessPrimaryRelocationAction, we do the same kind of check on the shard.
I wonder if we could do this as a completely separate decider that's added by the Stateless plugin? It might be nice to keep it separate from the existing throttling logic, because it's just an additional constraint, right? I don't think there's any reason it needs to be in here.
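A rough sketch of that shape, purely for illustration: a standalone AllocationDecider that the Stateless plugin could contribute via ClusterPlugin#createAllocationDeciders, keeping the new constraint out of ThrottlingAllocationDecider. The class name and decision messages below are hypothetical, not the decider that was eventually added.

// Hypothetical sketch only; names and messages are made up for illustration.
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

public class StatelessPrimaryRecoveryThrottlingDecider extends AllocationDecider {

    private static final String NAME = "stateless_primary_recovery_throttling";

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        // only indexing-only primaries are constrained; everything else falls through to YES
        if (shardRouting.isPromotableToPrimary() == false || shardRouting.isSearchable()) {
            return allocation.decision(Decision.YES, NAME, "shard is not an indexing-only primary");
        }
        // the additional heap-based throttling check would live here
        return allocation.decision(Decision.YES, NAME, "within the concurrent primary recovery limit");
    }
}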
It's an interesting idea, but this seems too niche for a separate decider. It feels very relevant to where it is now, and if we decide to apply the same limitation to stateful it can be done here. Frankly, I don't have a strong argument for either option; it seemed like a small enough change to add to the existing decider to begin with. We might also change or remove it if we end up defining some relationship between concurrent relocations and node size.
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Looks good, just a question about whether we should split this out into its own decider.
// Allocating a shard to this node will increase the incoming recoveries
int currentInRecoveries = allocation.routingNodes().getIncomingRecoveries(node.nodeId());
if (currentInRecoveries >= concurrentIncomingRecoveries) {
Should we change concurrentIncomingRecoveries to be a function of maxHeapSize? There are projects with a large number of shards (10k+ per node) and large heaps that take a long time (1h+) to relocate.
Something like:
int concurrentRecoveriesThreshold = concurrentIncomingRecoveries;
if (dynamicConcurrentRecoveriesSetting) {
    concurrentRecoveriesThreshold = fn(maxHeap);
}
if (currentInRecoveries >= concurrentRecoveriesThreshold) {
    ...
As a heuristic I would use 1 recovery for every 2GB of max heap. That would solve the problem for 4GB (2GB heap, 1 recovery) and 64GB (32GB heap, 16 recoveries) nodes as well.
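A minimal sketch of that heuristic under the 2GB-per-recovery assumption above (the helper name is hypothetical; ByteSizeValue is the existing Elasticsearch size type):

// Hypothetical helper: one concurrent incoming recovery per 2GB of max heap,
// never going below a single recovery.
static int heapBasedConcurrentRecoveries(ByteSizeValue maxHeap) {
    long perRecoveryBytes = ByteSizeValue.ofGb(2).getBytes();
    return (int) Math.max(1, maxHeap.getBytes() / perRecoveryBytes);
}

Plugging this in as fn(maxHeap) in the pseudocode above gives 1 recovery for a 2GB heap and 16 for a 32GB heap.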
Once we decide that's what we want and how to do it, we can extend the new stateless decider.
I am slightly worried about adding a setting that we may want to remove again, and also that it is settable for non-stateless. Otherwise I am good with this, if we can find a way around that to avoid any bwc implications of removing the setting.
Setting.memorySizeSetting(
    "cluster.routing.allocation.min_heap_required_for_concurrent_primary_recoveries",
    ByteSizeValue.ZERO,
    Property.Dynamic,
Can we make it OperatorDynamic? It seems more appropriate, though it may not matter too much.
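For illustration, the registration above with the suggested property would look like this (a sketch of the proposal only, not what was merged, since the PR later dropped this setting; Property.NodeScope is assumed, as it is the usual scope for cluster-level settings):

// Same setting as above, but operator-only dynamic instead of plainly dynamic.
Setting.memorySizeSetting(
    "cluster.routing.allocation.min_heap_required_for_concurrent_primary_recoveries",
    ByteSizeValue.ZERO,
    Property.OperatorDynamic,
    Property.NodeScope
);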
ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_INCOMING_RECOVERIES_SETTING,
ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_OUTGOING_RECOVERIES_SETTING,
ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING,
ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_MIN_HEAP_REQUIRED_FOR_CONCURRENT_PRIMARY_RECOVERIES_SETTING,
I wonder if we can register this in stateless only (and tests), to avoid any bwc implications if we end up removing it again and replacing it with a more elaborate mechanism? Hmm, that might make any lookup fail, which complicates things.
Perhaps we can "bomb" the validation by only allowing this when running stateless?
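One hedged way that "bombing" could look (a sketch only, and not what the PR ended up doing; STATELESS_ENABLED_SETTING is a placeholder for whatever boolean setting marks a stateless deployment):

import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.ByteSizeValue;

// Hypothetical cross-setting validator: reject any non-zero value when stateless is not enabled.
class StatelessOnlyValidator implements Setting.Validator<ByteSizeValue> {

    // placeholder for the real stateless-enabled flag; the key here is made up
    static final Setting<Boolean> STATELESS_ENABLED_SETTING =
        Setting.boolSetting("stateless.enabled", false, Setting.Property.NodeScope);

    @Override
    public void validate(ByteSizeValue value) {}

    @Override
    public void validate(ByteSizeValue value, Map<Setting<?>, Object> settings) {
        boolean statelessEnabled = (Boolean) settings.get(STATELESS_ENABLED_SETTING);
        if (statelessEnabled == false && value.getBytes() > 0) {
            throw new IllegalArgumentException("this setting is only supported in stateless deployments");
        }
    }

    @Override
    public Iterator<Setting<?>> settings() {
        return List.<Setting<?>>of(STATELESS_ENABLED_SETTING).iterator();
    }
}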
I'd rather go with Nick's suggestion. I've updated the PR.
This PR now only exposes the node max heap size in cluster info. I've added a new stateless decider that does the throttling.
  * This method returns the corresponding initializing shard that would be allocated to this node.
  */
-private static ShardRouting initializingShard(ShardRouting shardRouting, String currentNodeId) {
+public static ShardRouting initializingShard(ShardRouting shardRouting, String currentNodeId) {
This is needed to do the same assertion that is done in this decider, but in the stateless decider.
LGTM.
 nodeThreadPoolUsageStatsPerNode,
-indicesStatsSummary.shardWriteLoads()
+indicesStatsSummary.shardWriteLoads(),
+maxHeapPerNode
nit: maxHeapPerNode is a volatile field. While it will not matter in the current work, I would find it slightly better to capture the value of maxHeapPerNode once in this method, such that the maxHeapPerNode and the estimatedHeapUsages are guaranteed to be created based on the same maxHeapPerNode instance.
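For illustration, the pattern being asked for looks roughly like this (a toy example, not the actual InternalClusterInfoService code):

import java.util.HashMap;
import java.util.Map;

class SnapshotExample {
    // stands in for the volatile maxHeapPerNode field
    private volatile Map<String, Long> maxHeapBytesPerNode = Map.of();

    Map<String, Long> buildEstimates() {
        // read the volatile field exactly once; later reads could observe a newer map
        final Map<String, Long> maxHeapSnapshot = this.maxHeapBytesPerNode;
        Map<String, Long> estimatedUsage = new HashMap<>();
        // every derived value now comes from the same snapshot that will be reported
        maxHeapSnapshot.forEach((nodeId, maxHeapBytes) -> estimatedUsage.put(nodeId, maxHeapBytes / 2));
        return estimatedUsage;
    }
}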
Sure: 75f7cd0.
    return result == null ? ReservedSpace.EMPTY : result;
}

public Map<String, ByteSizeValue> getMaxHeapSizePerNode() {
Can we add a test similar to IndexShardIT.testHeapUsageEstimateIsPresent? We can postpone it to follow-up work if that matches the timing better.
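For reference, such a test could look roughly like this (a hedged sketch assuming an ESSingleNodeTestCase like IndexShardIT and the getMaxHeapSizePerNode getter added in this PR; it is not the test that was actually added):

public void testMaxHeapSizeIsPresent() {
    // force a cluster-info refresh so the per-node max heap values are collected
    InternalClusterInfoService clusterInfoService = (InternalClusterInfoService) getInstanceFromNode(ClusterInfoService.class);
    ClusterInfo clusterInfo = ClusterInfoServiceUtils.refresh(clusterInfoService);

    Map<String, ByteSizeValue> maxHeapPerNode = clusterInfo.getMaxHeapSizePerNode();
    assertFalse(maxHeapPerNode.isEmpty());
    for (ByteSizeValue maxHeap : maxHeapPerNode.values()) {
        assertThat(maxHeap.getBytes(), greaterThan(0L));
    }
}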
Added a test in 44dcc78.
LGTM2
final Map<String, EstimatedHeapUsage> estimatedHeapUsages;
final Map<String, NodeUsageStatsForThreadPools> nodeUsageStatsForThreadPools;
final Map<ShardId, Double> shardWriteLoads;
final Map<String, ByteSizeValue> maxHeapSizePerNode;
nit: not really a problem introduced here, but it'd be nice to indicate what the opaque String keys are -- presumably persistent node IDs?
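Something along these lines would do it (hypothetical wording, not the exact comment that was added in 44dcc78):

/** Maximum configured heap size per node, keyed by the persistent node ID. */
final Map<String, ByteSizeValue> maxHeapSizePerNode;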
Added a comment in 44dcc78.
This is needed for a new Stateless decider that limits concurrent recoveries based on node heap size.
Relates ES-12554