DiskThresholdMonitor retains large heap objects unnecessarily #94914
Labels:
>bug
:Distributed/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes)
Team:Distributed (Meta label for distributed team)
Comments
DaveCTurner added the >bug and :Distributed/Allocation labels on Mar 30, 2023
Pinging @elastic/es-distributed (Team:Distributed)
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Mar 30, 2023:
Today when the `DiskThresholdMonitor` triggers a reroute, it supplies a listener which accepts the exact cluster state that the reroute produced. However it may be that there have been some other cluster state updates in between the reroute computation and the completion of this listener, so the cluster state it receives is potentially stale. Moreover by accepting the exact cluster state that the reroute produced, we require the `BatchedRerouteService` to retain every resulting state for this and all other batched reroute listeners. There's no need for this, we can instead retrieve the last-applied cluster state afresh, and avoid this unnecessary retention of cluster states. Relates elastic#94914
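As a concrete illustration of the pattern the commit message describes, here is a minimal sketch in which the listener ignores the state the reroute produced and reads the last-applied state afresh when it runs; all class and method names below are illustrative stand-ins, not the actual Elasticsearch APIs.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Minimal sketch (not the actual Elasticsearch code): the wrapped listener ignores
// the state the reroute produced and instead asks for the last-applied state when
// it finally runs, so nothing forces the produced state to stay reachable.
final class RerouteListenerSketch {

    // Wraps a completion callback so it no longer captures the reroute's result
    // state; lastAppliedState is an illustrative stand-in for "ask the cluster
    // applier for its current state".
    static <S> Consumer<S> dropResultState(Supplier<S> lastAppliedState, Consumer<S> onDone) {
        return ignoredResultState -> onDone.accept(lastAppliedState.get());
    }
}
```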
elasticsearchmachine pushed a commit that referenced this issue on Mar 31, 2023 (same commit message as above).
DaveCTurner added a commit that referenced this issue on Mar 31, 2023 (same commit message as above).
rjernst pushed a commit to rjernst/elasticsearch that referenced this issue on Apr 6, 2023 (same commit message as above).
saarikabhasi pushed a commit to saarikabhasi/elasticsearch that referenced this issue on Apr 10, 2023 (same commit message as above).
The `DiskThresholdMonitor` enqueues a batched reroute if the watermarks situation in the cluster changes:

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/DiskThresholdMonitor.java, lines 317 to 363 in b4712a5

The listener may be called quite some time later, but it retains the original `ClusterInfo` (potentially 10s of MiBs) for its computations. Could we discard that and use a more recent `ClusterInfo`?

That's not all that bad, because we only ever have one such task pending. The worst bit is that this listener takes the rerouted `ClusterState`, which means the `BatchedRerouteService` captures the `newState` here ...

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/BatchedRerouteService.java, lines 154 to 156 in 405a53b

... and keeps hold of it until the reroute completes (i.e. we converge on a new balance and start to reconcile it), which could be quite some time later. That means that every other batched reroute also retains the state until reconciliation starts, which could represent a rather large amount of heap.
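As an illustration of that retention pattern, here is a toy model (illustrative names only, not the real `BatchedRerouteService` or its API): whoever promises the batched listeners the exact state produced by the reroute must keep that state strongly reachable until the last listener has run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of the retention issue; names and structure are illustrative only.
final class BatchedRerouteSketch<S> {
    private final List<Consumer<S>> pendingListeners = new ArrayList<>();
    private S computedState; // strongly referenced until the listeners run

    // Listeners batched while a reroute is pending.
    void enqueue(Consumer<S> listener) {
        pendingListeners.add(listener);
    }

    // The reroute computation produced a new state; it must be kept around
    // because the batched listeners were promised exactly this state.
    void onRerouteComputed(S newState) {
        computedState = newState;
    }

    // Only once the reroute fully completes (e.g. reconciliation starts) are the
    // listeners notified; until then computedState pins the whole state in heap.
    void onRerouteCompleted() {
        for (Consumer<S> listener : pendingListeners) {
            listener.accept(computedState);
        }
        pendingListeners.clear();
        computedState = null;
    }
}
```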