DiskThresholdMonitor retains large heap objects unnecessarily #94914

Open
DaveCTurner opened this issue Mar 30, 2023 · 1 comment
Labels
>bug · :Distributed/Allocation (all issues relating to the decision making around placing a shard, both master logic & on the nodes) · Team:Distributed (meta label for distributed team)

Comments

@DaveCTurner
Contributor

The DiskThresholdMonitor enqueues a batched reroute if the watermark situation in the cluster changes:

rerouteService.reroute(
    "disk threshold monitor",
    Priority.HIGH,
    ActionListener.releaseAfter(ActionListener.runAfter(ActionListener.wrap(reroutedClusterState -> {
        for (DiskUsage diskUsage : usagesOverHighThreshold) {
            final RoutingNode routingNode = reroutedClusterState.getRoutingNodes().node(diskUsage.getNodeId());
            final DiskUsage usageIncludingRelocations;
            final long relocatingShardsSize;
            if (routingNode != null) { // might be temporarily null if ClusterInfoService and ClusterService are out of step
                relocatingShardsSize = sizeOfRelocatingShards(routingNode, diskUsage, info, reroutedClusterState);
                usageIncludingRelocations = new DiskUsage(
                    diskUsage.getNodeId(),
                    diskUsage.getNodeName(),
                    diskUsage.getPath(),
                    diskUsage.getTotalBytes(),
                    diskUsage.getFreeBytes() - relocatingShardsSize
                );
            } else {
                usageIncludingRelocations = diskUsage;
                relocatingShardsSize = 0L;
            }
            final ByteSizeValue total = ByteSizeValue.ofBytes(usageIncludingRelocations.getTotalBytes());
            if (usageIncludingRelocations.getFreeBytes() < diskThresholdSettings.getFreeBytesThresholdHighStage(total)
                .getBytes()) {
                nodesOverHighThresholdAndRelocating.remove(diskUsage.getNodeId());
                logger.warn("""
                    high disk watermark [{}] exceeded on {}, shards will be relocated away from this node; currently \
                    relocating away shards totalling [{}] bytes; the node is expected to continue to exceed the high disk \
                    watermark when these relocations are complete\
                    """, diskThresholdSettings.describeHighThreshold(total, false), diskUsage, -relocatingShardsSize);
            } else if (nodesOverHighThresholdAndRelocating.add(diskUsage.getNodeId())) {
                logger.info("""
                    high disk watermark [{}] exceeded on {}, shards will be relocated away from this node; currently \
                    relocating away shards totalling [{}] bytes; the node is expected to be below the high disk watermark \
                    when these relocations are complete\
                    """, diskThresholdSettings.describeHighThreshold(total, false), diskUsage, -relocatingShardsSize);
            } else {
                logger.debug("""
                    high disk watermark [{}] exceeded on {}, shards will be relocated away from this node; currently \
                    relocating away shards totalling [{}] bytes\
                    """, diskThresholdSettings.describeHighThreshold(total, false), diskUsage, -relocatingShardsSize);
            }
        }
    }, e -> logger.debug("reroute failed", e)), this::setLastRunTimeMillis), asyncRefs.acquire())
);

The listener may be called quite some time later, but it retains the original ClusterInfo (potentially tens of MiBs) for its computations. Could we discard that and use a more recent ClusterInfo instead?
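To make the retention difference concrete, here is a minimal standalone sketch (plain Java, not the actual DiskThresholdMonitor code; Snapshot and InfoService are hypothetical stand-ins for ClusterInfo and ClusterInfoService): a callback that captures the snapshot it was given at enqueue time pins it on the heap until it runs, whereas a callback that only holds a supplier can fetch the freshest snapshot when it finally executes.

import java.util.function.Consumer;
import java.util.function.Supplier;

class RetentionSketch {
    // Hypothetical stand-ins for ClusterInfo / ClusterInfoService.
    record Snapshot(byte[] largePayload) {}

    interface InfoService {
        Snapshot latest(); // always returns the most recently published snapshot
    }

    // Capturing variant: the returned callback keeps `infoAtEnqueueTime` reachable until it runs.
    static Runnable capturing(Snapshot infoAtEnqueueTime, Consumer<Snapshot> work) {
        return () -> work.accept(infoAtEnqueueTime);
    }

    // Deferring variant: only a small supplier (e.g. infoService::latest) is retained; the
    // possibly-newer snapshot is fetched at execution time, so nothing pins the old one.
    static Runnable deferring(Supplier<Snapshot> freshInfo, Consumer<Snapshot> work) {
        return () -> work.accept(freshInfo.get());
    }
}

In DiskThresholdMonitor terms, the deferring variant would mean not closing over the `info` argument in the listener lambda, but asking for the current ClusterInfo (e.g. via something like a `Supplier<ClusterInfo>` wired to the ClusterInfoService) when the listener runs.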

That's not all that bad, because we only ever have one such task pending. The worst bit is that this listener takes the reroutedClusterState, which means the BatchedRerouteService captures the newState here ...

public void clusterStateProcessed(ClusterState oldState, ClusterState newState) {
    future.addListener(ActionListener.running(() -> ActionListener.onResponse(currentListeners, newState)));
}

... and keeps hold of it until the reroute completes (i.e. we converge on a new balance and start to reconcile it), which could be quite some time later. That means every other batched reroute listener also retains the state until reconciliation starts, which could represent a rather large amount of heap.
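The fix direction taken by the commits referenced below is for the listener to stop accepting the reroute's exact output and instead re-read the last-applied cluster state when it runs. As a hypothetical plain-Java sketch of why that removes the retention, with State standing in for ClusterState and the AtomicReference standing in for reading the applied state afresh from the ClusterService:

import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class BatchedNotifySketch {
    record State(long version /* plus a large routing table in the real thing */) {}

    // Stand-in for re-reading the last-applied cluster state at the moment the listener runs.
    static final AtomicReference<State> lastApplied = new AtomicReference<>(new State(0));

    // Retaining variant: the callback pins the exact state the reroute produced until it runs,
    // which may be long after a newer state has been applied.
    static Runnable completeWithProducedState(State produced, Consumer<State> listener) {
        return () -> listener.accept(produced);
    }

    // Non-retaining variant: the callback only carries a completion signal and looks up the
    // freshest applied state at execution time, so nothing pins `produced` for the listener.
    static Runnable completeWithSignalOnly(Consumer<State> listener) {
        return () -> listener.accept(lastApplied.get());
    }
}

Since the reroutedClusterState the listener currently receives can already be stale by the time it runs (as the commit message below notes), reading the latest applied state instead loses nothing.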

DaveCTurner added the >bug and :Distributed/Allocation labels Mar 30, 2023
elasticsearchmachine added the Team:Distributed label Mar 30, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 30, 2023
Today when the `DiskThresholdMonitor` triggers a reroute, it supplies a
listener which accepts the exact cluster state that the reroute
produced. However it may be that there have been some other cluster
state updates in between the reroute computation and the completion of
this listener, so the cluster state it receives is potentially stale.
Moreover by accepting the exact cluster state that the reroute produced,
we require the `BatchedRerouteService` to retain every resulting state
for this and all other batched reroute listeners. There's no need for
this; we can instead retrieve the last-applied cluster state afresh, and
avoid this unnecessary retention of cluster states.

Relates elastic#94914
elasticsearchmachine pushed a commit that referenced this issue Mar 31, 2023
DaveCTurner added a commit that referenced this issue Mar 31, 2023
rjernst pushed a commit to rjernst/elasticsearch that referenced this issue Apr 6, 2023
saarikabhasi pushed a commit to saarikabhasi/elasticsearch that referenced this issue Apr 10, 2023