
Introduce max headroom for disk watermark stages (#88639)
Introduce max headroom settings for the low, high, and flood disk watermark stages, similar to the existing max headroom setting for the flood stage of the frozen tier. Introduce the new max headrooms in HealthMetadata and in ReactiveStorageDeciderService. Add multiple tests in DiskThresholdDeciderUnitTests, DiskThresholdDeciderTests, and DiskThresholdMonitorTests. Moreover, add addition, subtraction, and min operations for ByteSizeValue.
kingherc committed Sep 19, 2022
1 parent fa654b9 commit 34471b1
Showing 22 changed files with 2,067 additions and 392 deletions.
6 changes: 6 additions & 0 deletions docs/changelog/88639.yaml
@@ -0,0 +1,6 @@
pr: 88639
summary: Introduce max headroom for disk watermark stages
area: Infra/Settings
type: enhancement
issues:
- 81406
46 changes: 28 additions & 18 deletions docs/reference/how-to/fix-common-cluster-issues.asciidoc
@@ -51,8 +51,13 @@ PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
"cluster.routing.allocation.disk.watermark.low.max_headroom": "100GB",
"cluster.routing.allocation.disk.watermark.high": "95%",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
"cluster.routing.allocation.disk.watermark.high.max_headroom": "20GB",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%",
"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5GB",
"cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5GB"
}
}
@@ -82,8 +87,13 @@ PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": null,
"cluster.routing.allocation.disk.watermark.low.max_headroom": null,
"cluster.routing.allocation.disk.watermark.high": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null
"cluster.routing.allocation.disk.watermark.high.max_headroom": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null,
"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
"cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
}
}
----
@@ -674,8 +684,8 @@ for tips on diagnosing and preventing them.
[[task-queue-backlog]]
=== Task queue backlog

A backlogged task queue can prevent tasks from completing and
put the cluster into an unhealthy state.
Resource constraints, a large number of tasks being triggered at once,
and long running tasks can all contribute to a backlogged task queue.

@@ -685,11 +695,11 @@ and long running tasks can all contribute to a backlogged task queue.

**Check the thread pool status**

A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.

You can use the <<cat-thread-pool,cat thread pool API>> to
see the number of active threads in each thread pool and
how many tasks are queued, how many have been rejected, and how many have completed.

[source,console]
----
@@ -698,9 +708,9 @@ GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,comple

**Inspect the hot threads on each node**

If a particular thread pool queue is backed up,
you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
to determine if the thread has sufficient
resources to progress and gauge how quickly it is progressing.

[source,console]
@@ -710,9 +720,9 @@ GET /_nodes/hot_threads

**Look for long running tasks**

Long-running tasks can also cause a backlog.
You can use the <<tasks,task management>> API to get information about the tasks that are running.
Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.

[source,console]
----
Expand All @@ -723,16 +733,16 @@ GET /_tasks?filter_path=nodes.*.tasks
[[resolve-task-queue-backlog]]
==== Resolve a task queue backlog

**Increase available resources**

If tasks are progressing slowly and the queue is backing up,
you might need to take steps to <<reduce-cpu-usage>>.

In some cases, increasing the thread pool size might help.
For example, the `force_merge` thread pool defaults to a single thread.
Increasing the size to 2 might help reduce a backlog of force merge requests.
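
A minimal sketch of this change (illustrative, not part of the original text); `thread_pool.force_merge.size` is a static node-level setting, so it belongs in `elasticsearch.yml` and takes effect after a node restart:

[source,yaml]
----
# elasticsearch.yml: grow the force_merge thread pool from its default of 1.
# Static setting: each node must be restarted for it to take effect.
thread_pool.force_merge.size: 2
----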

**Cancel stuck tasks**

If you find the active task's hot thread isn't progressing and there's a backlog,
consider canceling the task.
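
A minimal sketch of cancelling a task with the task management API (the task ID
below is illustrative):

[source,console]
----
POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
----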
12 changes: 9 additions & 3 deletions docs/reference/index-modules/blocks.asciidoc
@@ -35,9 +35,15 @@ the index itself - can increase the index size over time. When
not permitted. However, deleting the index itself releases the read-only index
block and makes resources available almost immediately.
+
IMPORTANT: {es} adds the read-only index block automatically when the disk
utilization exceeds the flood stage watermark, controlled by the
<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage>>
and <<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage.max_headroom>>
settings, and removes the block automatically when the disk utilization falls
under the high watermark, controlled by the
<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.high>>
and <<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.high.max_headroom>>
settings.
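+
A minimal sketch (the index name is illustrative) of clearing the block manually
once disk space has been freed:
+
[source,console]
----
PUT /my-index-000001/_settings
{
  "index.blocks.read_only_allow_delete": null
}
----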

`index.blocks.read`::

22 changes: 19 additions & 3 deletions docs/reference/modules/cluster/disk_allocator.asciidoc
@@ -75,13 +75,23 @@ Defaults to `true`. Set to `false` to disable the disk allocation decider. Upon
Controls the low watermark for disk usage. It defaults to `85%`, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can alternatively be set to a ratio value, e.g., `0.85`. It can also be set to an absolute byte value (like `500mb`) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated.
// end::cluster-routing-watermark-low-tag[]

`cluster.routing.allocation.disk.watermark.low.max_headroom`::
(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the low watermark (in case of a percentage/ratio value).
Defaults to 200GB when `cluster.routing.allocation.disk.watermark.low` is not explicitly set.
This caps the amount of free space required.
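For example (an illustrative calculation, not part of the original text): with
the default `85%` low watermark, a node with a 10TB disk would otherwise need
`10TB * 15% = 1.5TB` of free space before it could accept more shards; the
default `200GB` max headroom caps that requirement, so the node stays under the
low watermark as long as at least `200GB` remains free.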

[[cluster-routing-watermark-high]]
// tag::cluster-routing-watermark-high-tag[]
`cluster.routing.allocation.disk.watermark.high` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
Controls the high watermark. It defaults to `90%`, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can alternatively be set to a ratio value, e.g., `0.9`. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
// end::cluster-routing-watermark-high-tag[]

`cluster.routing.allocation.disk.watermark.high.max_headroom`::
(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the high watermark (in case of a percentage/ratio value).
Defaults to 150GB when `cluster.routing.allocation.disk.watermark.high` is not explicitly set.
This caps the amount of free space required.

`cluster.routing.allocation.disk.watermark.enable_for_single_data_node`::
(<<static-cluster-setting,Static>>)
In earlier releases, the default behaviour was to disregard disk watermarks for a single
@@ -97,8 +107,14 @@ is now `true`. The setting will be removed in a future release.
(<<dynamic-cluster-setting,Dynamic>>)
Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (`index.blocks.read_only_allow_delete`) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark. Similarly to the low and high watermark values, it can alternatively be set to a ratio value, e.g., `0.95`, or an absolute byte value.

`cluster.routing.allocation.disk.watermark.flood_stage.max_headroom`::
(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the flood stage watermark (in case of a percentage/ratio value).
Defaults to 100GB when
`cluster.routing.allocation.disk.watermark.flood_stage` is not explicitly set.
This caps the amount of free space required.

NOTE: You cannot mix the usage of percentage/ratio values and byte values within
the watermark settings. Either all values are set to percentage/ratio values, or all are set to byte values. For example, setting the low watermark to `85%` and the high watermark to `10gb` is rejected. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold. A similar check is done for the max headroom values.

An example of resetting the read-only index block on the `my-index-000001` index:

@@ -122,8 +138,8 @@ Controls the flood stage watermark for dedicated frozen nodes, which defaults to

`cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
Controls the max headroom for the flood stage watermark (in case of a
percentage/ratio value) for dedicated frozen nodes. Defaults to 20GB when
`cluster.routing.allocation.disk.watermark.flood_stage.frozen` is not explicitly
set. This caps the amount of free space required on dedicated frozen nodes.

@@ -46,8 +46,13 @@ PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
"cluster.routing.allocation.disk.watermark.low.max_headroom": "100GB",
"cluster.routing.allocation.disk.watermark.high": "95%",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
"cluster.routing.allocation.disk.watermark.high.max_headroom": "20GB",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%",
"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5GB",
"cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5GB"
}
}
@@ -77,8 +82,13 @@ PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": null,
"cluster.routing.allocation.disk.watermark.low.max_headroom": null,
"cluster.routing.allocation.disk.watermark.high": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null
"cluster.routing.allocation.disk.watermark.high.max_headroom": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null,
"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
"cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
}
}
----
@@ -16,8 +16,8 @@ the operation and returns an error.
The most common causes of high CPU usage and their solutions.

<<high-jvm-memory-pressure,High JVM memory pressure>>::
High JVM memory usage can degrade cluster performance and trigger circuit
breaker errors.

<<red-yellow-cluster-status,Red or yellow cluster status>>::
A red or yellow cluster status indicates one or more shards are missing or
@@ -29,8 +29,8 @@ When {es} rejects a request, it stops the operation and returns an error with a
`429` response code.

<<task-queue-backlog,Task queue backlog>>::
A backlogged task queue can prevent tasks from completing and put the cluster
into an unhealthy state.

<<diagnose-unassigned-shards,Diagnose unassigned shards>>::
There are multiple reasons why shards might get unassigned, ranging from
@@ -47,4 +47,4 @@ include::common-issues/high-jvm-memory-pressure.asciidoc[]
include::common-issues/red-yellow-cluster-status.asciidoc[]
include::common-issues/rejected-requests.asciidoc[]
include::common-issues/task-queue-backlog.asciidoc[]
include::common-issues/diagnose-unassigned-shards.asciidoc[]
@@ -34,8 +34,11 @@
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING;
import static org.elasticsearch.cluster.routing.allocation.decider.EnableAllocationDecider.CLUSTER_ROUTING_REBALANCE_ENABLE_SETTING;
Expand Down Expand Up @@ -92,18 +95,18 @@ public void testRerouteOccursOnDiskPassingHighWatermark() throws Exception {
clusterInfoService.setDiskUsageFunctionAndRefresh((discoveryNode, fsInfoPath) -> setDiskUsage(fsInfoPath, 100, between(10, 100)));

final boolean watermarkBytes = randomBoolean(); // we have to consistently use bytes or percentage for the disk watermark settings
Settings.Builder settings = Settings.builder()
    .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
    .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
    .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "0b" : "100%")
    .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms");
// The max headroom settings apply only to percentage/ratio watermarks, so set them only in that case.
if (watermarkBytes == false && randomBoolean()) {
    String headroom = randomIntBetween(10, 100) + "b";
    settings = settings.put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING.getKey(), headroom)
        .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), headroom)
        .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), headroom);
}
assertAcked(client().admin().cluster().prepareUpdateSettings().setPersistentSettings(settings));
// Create an index with 10 shards so we can check allocation for it
assertAcked(prepareCreate("test").setSettings(Settings.builder().put("number_of_shards", 10).put("number_of_replicas", 0)));
ensureGreen("test");
@@ -172,18 +175,17 @@ public void testAutomaticReleaseOfIndexBlock() throws Exception {
clusterInfoService.setDiskUsageFunctionAndRefresh((discoveryNode, fsInfoPath) -> setDiskUsage(fsInfoPath, 100, between(15, 100)));

final boolean watermarkBytes = randomBoolean(); // we have to consistently use bytes or percentage for the disk watermark settings
Settings.Builder builder = Settings.builder()
    .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
    .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
    .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "5b" : "95%")
    .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "150ms");
// With percentage watermarks, also set the corresponding max headroom settings.
if (watermarkBytes == false) {
    builder = builder.put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING.getKey(), "10b")
        .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), "10b")
        .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), "5b");
}
assertAcked(client().admin().cluster().prepareUpdateSettings().setPersistentSettings(builder));

// Create an index with 6 shards so we can check allocation for it
prepareCreate("test").setSettings(Settings.builder().put("number_of_shards", 6).put("number_of_replicas", 0)).get();
