[ML] Add ML memory stats API (#83802)

Adds an API that reports how much memory ML is permitted to use and is currently using on each node, both within the JVM heap and natively, outside the JVM.
elastic · Feb 17, 2022 · bf00ab3
1 parent 48e562a
commit bf00ab3

Showing 29 changed files with 2,043 additions and 61 deletions.
diff --git a/docs/changelog/83802.yaml b/docs/changelog/83802.yaml
@@ -0,0 +1,5 @@
+pr: 83802
+summary: Add ML memory stats API
+area: Machine Learning
+type: enhancement
+issues: []
diff --git a/docs/reference/ml/common/apis/get-ml-memory.asciidoc b/docs/reference/ml/common/apis/get-ml-memory.asciidoc
@@ -0,0 +1,310 @@
+[role="xpack"]
+[[get-ml-memory]]
+= Get machine learning memory stats API
+
+[subs="attributes"]
+++++
+<titleabbrev>Get {ml} memory stats</titleabbrev>
+++++
+
+Returns information on how {ml} is using memory.
+
+[[get-ml-memory-request]]
+== {api-request-title}
+
+`GET _ml/memory/_stats` +
+`GET _ml/memory/<node_id>/_stats`
+
+[[get-ml-memory-prereqs]]
+== {api-prereq-title}
+
+Requires the `monitor_ml` cluster privilege. This privilege is included in the
+`machine_learning_user` built-in role.
+
+[[get-ml-memory-desc]]
+== {api-description-title}
+
+Get information about how {ml} jobs and trained models are using memory on each
+node, both within the JVM heap and natively, outside of the JVM.
+
+[[get-ml-memory-path-params]]
+== {api-path-parms-title}
+
+`<node_id>`::
+    (Optional, string) The names of particular nodes in the cluster to target.
+    For example, `nodeId1,nodeId2` or `ml:true`. For node selection options,
+    see <<cluster-nodes>>.
+
+[[get-ml-memory-query-parms]]
+== {api-query-parms-title}
+
+`human`::
+    (Optional, Boolean) If `true`, the response also includes the human-readable
+    fields with units. By default, only the `_in_bytes` sizes are returned.
+
+include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=timeoutparms]
+
+[role="child_attributes"]
+[[get-ml-memory-response-body]]
+== {api-response-body-title}
+
+`_nodes`::
+(object)
+Contains statistics about the number of nodes selected by the request.
++
+.Properties of `_nodes`
+[%collapsible%open]
+====
+`failed`::
+(integer)
+Number of nodes that rejected the request or failed to respond. If this value
+is not `0`, a reason for the rejection or failure is included in the response.
+
+`successful`::
+(integer)
+Number of nodes that responded successfully to the request.
+
+`total`::
+(integer)
+Total number of nodes selected by the request.
+====
+
+`cluster_name`::
+(string)
+Name of the cluster. Based on the <<cluster-name,cluster.name>> setting.
+
+`nodes`::
+(object)
+Contains statistics for the nodes selected by the request.
++
+.Properties of `nodes`
+[%collapsible%open]
+====
+`<node_id>`::
+(object)
+Contains statistics for the node.
++
+.Properties of `<node_id>`
+[%collapsible%open]
+=====
+`attributes`::
+(object)
+include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-attributes]
+
+`ephemeral_id`::
+(string)
+include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-ephemeral-id]
+
+`jvm`::
+(object)
+Contains Java Virtual Machine (JVM) statistics for the node.
++
+.Properties of `jvm`
+[%collapsible%open]
+======
+`heap_max`::
+(<<byte-units,byte value>>)
+Maximum amount of memory available for use by the heap.
+
+`heap_max_in_bytes`::
+(integer)
+Maximum amount of memory, in bytes, available for use by the heap.
+
+`java_inference`::
+(<<byte-units,byte value>>)
+Amount of Java heap currently being used for caching inference models.
+
+`java_inference_in_bytes`::
+(integer)
+Amount of Java heap, in bytes, currently being used for caching inference models.
+
+`java_inference_max`::
+(<<byte-units,byte value>>)
+Maximum amount of Java heap to be used for caching inference models.
+
+`java_inference_max_in_bytes`::
+(integer)
+Maximum amount of Java heap, in bytes, to be used for caching inference models.
+======
+
+`mem`::
+(object)
+Contains statistics about memory usage for the node.
++
+.Properties of `mem`
+[%collapsible%open]
+======
+`adjusted_total`::
+(<<byte-units,byte value>>)
+If the amount of physical memory has been overridden using the `es.total_memory_bytes`
+system property then this reports the overridden value. Otherwise it reports the same
+value as `total`.
+
+`adjusted_total_in_bytes`::
+(integer)
+If the amount of physical memory has been overridden using the `es.total_memory_bytes`
+system property then this reports the overridden value in bytes. Otherwise it reports
+the same value as `total_in_bytes`.
+
+`ml`::
+(object)
+Contains statistics about {ml} use of native memory on the node.
++
+.Properties of `ml`
+[%collapsible%open]
+=======
+`anomaly_detectors`::
+(<<byte-units,byte value>>)
+Amount of native memory set aside for {anomaly-jobs}.
+
+`anomaly_detectors_in_bytes`::
+(integer)
+Amount of native memory, in bytes, set aside for {anomaly-jobs}.
+
+`data_frame_analytics`::
+(<<byte-units,byte value>>)
+Amount of native memory set aside for {dfanalytics-jobs}.
+
+`data_frame_analytics_in_bytes`::
+(integer)
+Amount of native memory, in bytes, set aside for {dfanalytics-jobs}.
+
+`max`::
+(<<byte-units,byte value>>)
+Maximum amount of native memory (separate from the JVM heap) that may be used by {ml}
+native processes.
+
+`max_in_bytes`::
+(integer)
+Maximum amount of native memory (separate from the JVM heap), in bytes, that may be
+used by {ml} native processes.
+
+`native_code_overhead`::
+(<<byte-units,byte value>>)
+Amount of native memory set aside for loading {ml} native code shared libraries.
+
+`native_code_overhead_in_bytes`::
+(integer)
+Amount of native memory, in bytes, set aside for loading {ml} native code shared libraries.
+
+`native_inference`::
+(<<byte-units,byte value>>)
+Amount of native memory set aside for trained models that have a PyTorch `model_type`.
+
+`native_inference_in_bytes`::
+(integer)
+Amount of native memory, in bytes, set aside for trained models that have a PyTorch `model_type`.
+=======
+
+`total`::
+(<<byte-units,byte value>>)
+Total amount of physical memory.
+
+`total_in_bytes`::
+(integer)
+Total amount of physical memory in bytes.
+
+======
+
+`name`::
+(string)
+Human-readable identifier for the node. Based on the <<node-name>> setting.
+
+`roles`::
+(array of strings)
+Roles assigned to the node. See <<modules-node>>.
+
+`transport_address`::
+(string)
+include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-transport-address]
+
+=====
+====
+
+[[get-ml-memory-example]]
+== {api-examples-title}
+
+[source,console]
+--------------------------------------------------
+GET _ml/memory/_stats?human
+--------------------------------------------------
+// TEST[setup:node]
+
+This is a possible response:
+
+[source,console-result]
+----
+{
+  "_nodes": {
+    "total": 1,
+    "successful": 1,
+    "failed": 0
+  },
+  "cluster_name": "my_cluster",
+  "nodes": {
+    "pQHNt5rXTTWNvUgOrdynKg": {
+      "name": "node-0",
+      "ephemeral_id": "ITZ6WGZnSqqeT_unfit2SQ",
+      "transport_address": "127.0.0.1:9300",
+      "attributes": {
+        "ml.machine_memory": "68719476736",
+        "ml.max_jvm_size": "536870912"
+      },
+      "roles": [
+        "data",
+        "data_cold",
+        "data_content",
+        "data_frozen",
+        "data_hot",
+        "data_warm",
+        "ingest",
+        "master",
+        "ml",
+        "remote_cluster_client",
+        "transform"
+      ],
+      "mem": {
+        "total": "64gb",
+        "total_in_bytes": 68719476736,
+        "adjusted_total": "64gb",
+        "adjusted_total_in_bytes": 68719476736,
+        "ml": {
+          "max": "19.1gb",
+          "max_in_bytes": 20615843020,
+          "native_code_overhead": "0b",
+          "native_code_overhead_in_bytes": 0,
+          "anomaly_detectors": "0b",
+          "anomaly_detectors_in_bytes": 0,
+          "data_frame_analytics": "0b",
+          "data_frame_analytics_in_bytes": 0,
+          "native_inference": "0b",
+          "native_inference_in_bytes": 0
+        }
+      },
+      "jvm": {
+        "heap_max": "512mb",
+        "heap_max_in_bytes": 536870912,
+        "java_inference_max": "204.7mb",
+        "java_inference_max_in_bytes": 214748364,
+        "java_inference": "0b",
+        "java_inference_in_bytes": 0
+      }
+    }
+  }
+}
+----
+// TESTRESPONSE[s/"cluster_name": "my_cluster"/"cluster_name": $body.cluster_name/]
+// TESTRESPONSE[s/"pQHNt5rXTTWNvUgOrdynKg"/\$node_name/]
+// TESTRESPONSE[s/"ephemeral_id": "ITZ6WGZnSqqeT_unfit2SQ"/"ephemeral_id": "$body.$_path"/]
+// TESTRESPONSE[s/"transport_address": "127.0.0.1:9300"/"transport_address": "$body.$_path"/]
+// TESTRESPONSE[s/"attributes": \{[^\}]*\}/"attributes": $body.$_path/]
+// TESTRESPONSE[s/"total": "64gb"/"total": "$body.$_path"/]
+// TESTRESPONSE[s/"total_in_bytes": 68719476736/"total_in_bytes": $body.$_path/]
+// TESTRESPONSE[s/"adjusted_total": "64gb"/"adjusted_total": "$body.$_path"/]
+// TESTRESPONSE[s/"adjusted_total_in_bytes": 68719476736/"adjusted_total_in_bytes": $body.$_path/]
+// TESTRESPONSE[s/"max": "19.1gb"/"max": "$body.$_path"/]
+// TESTRESPONSE[s/"max_in_bytes": 20615843020/"max_in_bytes": $body.$_path/]
+// TESTRESPONSE[s/"heap_max": "512mb"/"heap_max": "$body.$_path"/]
+// TESTRESPONSE[s/"heap_max_in_bytes": 536870912/"heap_max_in_bytes": $body.$_path/]
+// TESTRESPONSE[s/"java_inference_max": "204.7mb"/"java_inference_max": "$body.$_path"/]
+// TESTRESPONSE[s/"java_inference_max_in_bytes": 214748364/"java_inference_max_in_bytes": $body.$_path/]
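As a rough illustration of how a client might consume the response body documented above, the following Python sketch totals the native memory set aside per node and compares it with `max_in_bytes`. The field names come from the documented response; the helper name `summarize_ml_memory`, the idea of summing the four `ml` categories, and the "headroom" figures are assumptions made for this example, not part of the commit.

```python
from typing import Any, Dict

# Field names are taken from the documented `mem.ml` object. Treating their sum
# as the total native memory set aside is an assumption made for illustration.
ML_NATIVE_FIELDS = (
    "native_code_overhead_in_bytes",
    "anomaly_detectors_in_bytes",
    "data_frame_analytics_in_bytes",
    "native_inference_in_bytes",
)


def summarize_ml_memory(response: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Return per-node byte totals keyed by node name (hypothetical helper)."""
    summary: Dict[str, Dict[str, int]] = {}
    for node_id, node in response.get("nodes", {}).items():
        ml = node["mem"]["ml"]
        jvm = node["jvm"]
        set_aside = sum(ml.get(field, 0) for field in ML_NATIVE_FIELDS)
        summary[node.get("name", node_id)] = {
            "native_set_aside": set_aside,
            "native_headroom": ml["max_in_bytes"] - set_aside,
            "java_inference_headroom": (
                jvm["java_inference_max_in_bytes"] - jvm["java_inference_in_bytes"]
            ),
        }
    return summary


# Trimmed copy of the example response shown in the documentation above.
sample = {
    "nodes": {
        "pQHNt5rXTTWNvUgOrdynKg": {
            "name": "node-0",
            "mem": {
                "ml": {
                    "max_in_bytes": 20615843020,
                    "native_code_overhead_in_bytes": 0,
                    "anomaly_detectors_in_bytes": 0,
                    "data_frame_analytics_in_bytes": 0,
                    "native_inference_in_bytes": 0,
                }
            },
            "jvm": {
                "java_inference_max_in_bytes": 214748364,
                "java_inference_in_bytes": 0,
            },
        }
    }
}

if __name__ == "__main__":
    for name, stats in summarize_ml_memory(sample).items():
        print(name, stats)
```

Only the `_in_bytes` fields are read, so the sketch behaves the same whether or not the request included `human`.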
diff --git a/docs/reference/ml/common/apis/index.asciidoc b/docs/reference/ml/common/apis/index.asciidoc
@@ -1,6 +1,7 @@
 include::ml-apis.asciidoc[leveloffset=+1]
 //GET
 include::get-ml-info.asciidoc[leveloffset=+2]
+include::get-ml-memory.asciidoc[leveloffset=+2]
 //SET
 include::set-upgrade-mode.asciidoc[leveloffset=+2]
 
diff --git a/docs/reference/ml/common/apis/ml-apis.asciidoc b/docs/reference/ml/common/apis/ml-apis.asciidoc
@@ -2,18 +2,14 @@
 [[ml-apis]]
 = {ml-cap} APIs
 
-You can use the following APIs to retrieve information related to the {stack-ml-features}.
+You can use the following APIs to retrieve information related to the
+{stack-ml-features}:
 
-See also <<ml-ad-apis>>, <<ml-df-analytics-apis>>, and <<ml-df-trained-models-apis>>.
-
-[discrete]
-[[ml-api-ml-info-endpoint]]
-== Info
-
-* <<get-ml-info,Machine learning info>>
+* <<get-ml-info,Get machine learning info>>
+* <<get-ml-memory,Get machine learning memory stats>>
 
-[discrete]
-[[ml-set-upgrade-mode-endpoint]]
-== Set upgrade mode
+The following API is useful when you upgrade:
 
 * <<ml-set-upgrade-mode,Set upgrade mode>>
+
+See also <<ml-ad-apis>>, <<ml-df-analytics-apis>>, and <<ml-df-trained-models-apis>>.
diff --git a/rest-api-spec/src/main/resources/rest-api-spec/api/ml.get_memory_stats.json b/rest-api-spec/src/main/resources/rest-api-spec/api/ml.get_memory_stats.json
@@ -0,0 +1,45 @@
+{
+  "ml.get_memory_stats":{
+    "documentation":{
+      "url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/get-ml-memory.html",
+      "description":"Returns information on how ML is using memory."
+    },
+    "stability":"stable",
+    "visibility":"public",
+    "headers":{
+      "accept": [ "application/json"]
+    },
+    "url":{
+      "paths":[
+        {
+          "path":"/_ml/memory/_stats",
+          "methods":[
+            "GET"
+          ]
+        },
+        {
+          "path":"/_ml/memory/{node_id}/_stats",
+          "methods":[
+            "GET"
+          ],
+          "parts":{
+            "node_id":{
+              "type":"string",
+              "description":"Specifies the node or nodes to retrieve stats for."
+            }
+          }
+        }
+      ]
+    },
+    "params":{
+      "master_timeout":{
+        "type":"time",
+        "description":"Explicit operation timeout for connection to master node"
+      },
+      "timeout":{
+        "type":"time",
+        "description":"Explicit operation timeout"
+      }
+    }
+  }
+}
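The two paths and the query parameters declared in the spec above can be assembled on the client side roughly as follows. This is a minimal sketch using only the Python standard library; the function name `memory_stats_path` is hypothetical, and `human` is taken from the API documentation in this commit rather than from the spec's `params` block.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def memory_stats_path(
    node_id: Optional[str] = None,
    human: bool = False,
    master_timeout: Optional[str] = None,
    timeout: Optional[str] = None,
) -> str:
    """Build the request path for the ML memory stats API (hypothetical helper)."""
    if node_id is None:
        path = "/_ml/memory/_stats"
    else:
        # Keep ',' and ':' unescaped so node filters such as
        # "nodeId1,nodeId2" or "ml:true" pass through unchanged.
        path = f"/_ml/memory/{quote(node_id, safe=',:')}/_stats"

    params: Dict[str, str] = {}
    if human:
        params["human"] = "true"
    if master_timeout is not None:
        params["master_timeout"] = master_timeout
    if timeout is not None:
        params["timeout"] = timeout

    return f"{path}?{urlencode(params)}" if params else path


if __name__ == "__main__":
    print(memory_stats_path())
    print(memory_stats_path("ml:true", human=True))
    print(memory_stats_path("nodeId1,nodeId2", timeout="30s"))
```

The resulting string would be appended to the cluster's base URL and sent as a `GET` request, the only method the spec lists for either path.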