DiskThresholdMonitor retains large heap objects unnecessarily #94914

Open
DaveCTurner opened this issue Mar 30, 2023 · 1 comment
Labels
>bug · :Distributed/Allocation (all issues relating to the decision making around placing a shard, both master logic & on the nodes) · Team:Distributed (meta label for distributed team)

Comments

@DaveCTurner
Contributor

The DiskThresholdMonitor enqueues a batched reroute if the watermark situation in the cluster changes:

rerouteService.reroute(
    "disk threshold monitor",
    Priority.HIGH,
    ActionListener.releaseAfter(ActionListener.runAfter(ActionListener.wrap(reroutedClusterState -> {
        for (DiskUsage diskUsage : usagesOverHighThreshold) {
            final RoutingNode routingNode = reroutedClusterState.getRoutingNodes().node(diskUsage.getNodeId());
            final DiskUsage usageIncludingRelocations;
            final long relocatingShardsSize;
            if (routingNode != null) { // might be temporarily null if ClusterInfoService and ClusterService are out of step
                relocatingShardsSize = sizeOfRelocatingShards(routingNode, diskUsage, info, reroutedClusterState);
                usageIncludingRelocations = new DiskUsage(
                    diskUsage.getNodeId(),
                    diskUsage.getNodeName(),
                    diskUsage.getPath(),
                    diskUsage.getTotalBytes(),
                    diskUsage.getFreeBytes() - relocatingShardsSize
                );
            } else {
                usageIncludingRelocations = diskUsage;
                relocatingShardsSize = 0L;
            }
            final ByteSizeValue total = ByteSizeValue.ofBytes(usageIncludingRelocations.getTotalBytes());
            if (usageIncludingRelocations.getFreeBytes() < diskThresholdSettings.getFreeBytesThresholdHighStage(total)
                .getBytes()) {
                nodesOverHighThresholdAndRelocating.remove(diskUsage.getNodeId());
                logger.warn("""
                    high disk watermark [{}] exceeded on {}, shards will be relocated away from this node; currently \
                    relocating away shards totalling [{}] bytes; the node is expected to continue to exceed the high disk \
                    watermark when these relocations are complete\
                    """, diskThresholdSettings.describeHighThreshold(total, false), diskUsage, -relocatingShardsSize);
            } else if (nodesOverHighThresholdAndRelocating.add(diskUsage.getNodeId())) {
                logger.info("""
                    high disk watermark [{}] exceeded on {}, shards will be relocated away from this node; currently \
                    relocating away shards totalling [{}] bytes; the node is expected to be below the high disk watermark \
                    when these relocations are complete\
                    """, diskThresholdSettings.describeHighThreshold(total, false), diskUsage, -relocatingShardsSize);
            } else {
                logger.debug("""
                    high disk watermark [{}] exceeded on {}, shards will be relocated away from this node; currently \
                    relocating away shards totalling [{}] bytes\
                    """, diskThresholdSettings.describeHighThreshold(total, false), diskUsage, -relocatingShardsSize);
            }
        }
    }, e -> logger.debug("reroute failed", e)), this::setLastRunTimeMillis), asyncRefs.acquire())
);

The listener may be called quite some time later, but it retains the original ClusterInfo (potentially tens of MiBs) for its computations. Could we discard that and use a more recent ClusterInfo instead?
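To make the retention difference concrete, here is a minimal standalone sketch (plain Java, not the actual DiskThresholdMonitor code; Snapshot and InfoService are hypothetical stand-ins for ClusterInfo and ClusterInfoService): a callback that captures the snapshot it was given at enqueue time pins it on the heap until it runs, whereas a callback that only holds a supplier can fetch the freshest snapshot when it finally executes.

import java.util.function.Consumer;
import java.util.function.Supplier;

class RetentionSketch {
    // Hypothetical stand-ins for ClusterInfo / ClusterInfoService.
    record Snapshot(byte[] largePayload) {}

    interface InfoService {
        Snapshot latest(); // always returns the most recently published snapshot
    }

    // Capturing variant: the returned callback keeps `infoAtEnqueueTime` reachable until it runs.
    static Runnable capturing(Snapshot infoAtEnqueueTime, Consumer<Snapshot> work) {
        return () -> work.accept(infoAtEnqueueTime);
    }

    // Deferring variant: only a small supplier (e.g. infoService::latest) is retained; the
    // possibly-newer snapshot is fetched at execution time, so nothing pins the old one.
    static Runnable deferring(Supplier<Snapshot> freshInfo, Consumer<Snapshot> work) {
        return () -> work.accept(freshInfo.get());
    }
}

In DiskThresholdMonitor terms, the deferring variant would mean not closing over the `info` argument in the listener lambda, but asking for the current ClusterInfo (e.g. via something like a `Supplier<ClusterInfo>` wired to the ClusterInfoService) when the listener runs.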

That's not all that bad, because we only ever have one such task pending. The worst bit is that this listener takes the reroutedClusterState, which means the BatchedRerouteService captures the newState here ...

public void clusterStateProcessed(ClusterState oldState, ClusterState newState) {
    future.addListener(ActionListener.running(() -> ActionListener.onResponse(currentListeners, newState)));
}

... and keeps hold of it until the reroute completes (i.e. we converge on a new balance and start to reconcile it), which could be quite some time later. That means every other batched reroute listener also retains the state until reconciliation starts, which could represent a rather large amount of heap.
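The fix direction taken by the commits referenced below is for the listener to stop accepting the reroute's exact output and instead re-read the last-applied cluster state when it runs. As a hypothetical plain-Java sketch of why that removes the retention, with State standing in for ClusterState and the AtomicReference standing in for reading the applied state afresh from the ClusterService:

import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class BatchedNotifySketch {
    record State(long version /* plus a large routing table in the real thing */) {}

    // Stand-in for re-reading the last-applied cluster state at the moment the listener runs.
    static final AtomicReference<State> lastApplied = new AtomicReference<>(new State(0));

    // Retaining variant: the callback pins the exact state the reroute produced until it runs,
    // which may be long after a newer state has been applied.
    static Runnable completeWithProducedState(State produced, Consumer<State> listener) {
        return () -> listener.accept(produced);
    }

    // Non-retaining variant: the callback only carries a completion signal and looks up the
    // freshest applied state at execution time, so nothing pins `produced` for the listener.
    static Runnable completeWithSignalOnly(Consumer<State> listener) {
        return () -> listener.accept(lastApplied.get());
    }
}

Since the reroutedClusterState the listener currently receives can already be stale by the time it runs (as the commit message below notes), reading the latest applied state instead loses nothing.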

DaveCTurner added the >bug and :Distributed/Allocation labels Mar 30, 2023
elasticsearchmachine added the Team:Distributed label Mar 30, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 30, 2023
Today when the `DiskThresholdMonitor` triggers a reroute, it supplies a
listener which accepts the exact cluster state that the reroute
produced. However it may be that there have been some other cluster
state updates in between the reroute computation and the completion of
this listener, so the cluster state it receives is potentially stale.
Moreover by accepting the exact cluster state that the reroute produced,
we require the `BatchedRerouteService` to retain every resulting state
for this and all other batched reroute listeners. There's no need for
this; we can instead retrieve the last-applied cluster state afresh, and
avoid this unnecessary retention of cluster states.

Relates elastic#94914
elasticsearchmachine pushed a commit that referenced this issue Mar 31, 2023
DaveCTurner added a commit that referenced this issue Mar 31, 2023
rjernst pushed a commit to rjernst/elasticsearch that referenced this issue Apr 6, 2023
saarikabhasi pushed a commit to saarikabhasi/elasticsearch that referenced this issue Apr 10, 2023