-
Notifications
You must be signed in to change notification settings - Fork 24.3k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[ML] Add ML memory stats API (#83802)
Adds an API that can be used to find out how much memory ML is permitted to use and is currently using on each node, both within the JVM heap, and natively, outside of the JVM.
- Loading branch information
1 parent
48e562a
commit bf00ab3
Showing
29 changed files
with
2,043 additions
and
61 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
pr: 83802 | ||
summary: Add ML memory stats API | ||
area: Machine Learning | ||
type: enhancement | ||
issues: [] |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,310 @@ | ||
[role="xpack"] | ||
[[get-ml-memory]] | ||
= Get machine learning memory stats API | ||
|
||
[subs="attributes"] | ||
++++ | ||
<titleabbrev>Get {ml} memory stats</titleabbrev> | ||
++++ | ||
|
||
Returns information on how {ml} is using memory. | ||
|
||
[[get-ml-memory-request]] | ||
== {api-request-title} | ||
|
||
`GET _ml/memory/_stats` + | ||
`GET _ml/memory/<node_id>/_stats` | ||
|
||
[[get-ml-memory-prereqs]] | ||
== {api-prereq-title} | ||
|
||
Requires the `monitor_ml` cluster privilege. This privilege is included in the | ||
`machine_learning_user` built-in role. | ||
|
||
[[get-ml-memory-desc]] | ||
== {api-description-title} | ||
|
||
Get information about how {ml} jobs and trained models are using memory, on each | ||
node, both within the JVM heap, and natively, outside of the JVM. | ||
|
||
[[get-ml-memory-path-params]] | ||
== {api-path-parms-title} | ||
|
||
`<node_id>`:: | ||
(Optional, string) The names of particular nodes in the cluster to target. | ||
For example, `nodeId1,nodeId2` or `ml:true`. For node selection options, | ||
see <<cluster-nodes>>. | ||
|
||
[[get-ml-memory-query-parms]] | ||
== {api-query-parms-title} | ||
|
||
`human`:: | ||
Specify this query parameter to include the fields with units in the response. | ||
Otherwise only the `_in_bytes` sizes are returned in the response. | ||
|
||
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=timeoutparms] | ||
|
||
[role="child_attributes"] | ||
[[get-ml-memory-response-body]] | ||
== {api-response-body-title} | ||
|
||
`_nodes`:: | ||
(object) | ||
Contains statistics about the number of nodes selected by the request. | ||
+ | ||
.Properties of `_nodes` | ||
[%collapsible%open] | ||
==== | ||
`failed`:: | ||
(integer) | ||
Number of nodes that rejected the request or failed to respond. If this value | ||
is not `0`, a reason for the rejection or failure is included in the response. | ||
`successful`:: | ||
(integer) | ||
Number of nodes that responded successfully to the request. | ||
`total`:: | ||
(integer) | ||
Total number of nodes selected by the request. | ||
==== | ||
|
||
`cluster_name`:: | ||
(string) | ||
Name of the cluster. Based on the <<cluster-name,cluster.name>> setting. | ||
|
||
`nodes`:: | ||
(object) | ||
Contains statistics for the nodes selected by the request. | ||
+ | ||
.Properties of `nodes` | ||
[%collapsible%open] | ||
==== | ||
`<node_id>`:: | ||
(object) | ||
Contains statistics for the node. | ||
+ | ||
.Properties of `<node_id>` | ||
[%collapsible%open] | ||
===== | ||
`attributes`:: | ||
(object) | ||
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-attributes] | ||
|
||
`ephemeral_id`:: | ||
(string) | ||
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-ephemeral-id] | ||
|
||
`jvm`:: | ||
(object) | ||
Contains Java Virtual Machine (JVM) statistics for the node. | ||
+ | ||
.Properties of `jvm` | ||
[%collapsible%open] | ||
====== | ||
`heap_max`:: | ||
(<<byte-units,byte value>>) | ||
Maximum amount of memory available for use by the heap. | ||
`heap_max_in_bytes`:: | ||
(integer) | ||
Maximum amount of memory, in bytes, available for use by the heap. | ||
`java_inference`:: | ||
(<<byte-units,byte value>>) | ||
Amount of Java heap currently being used for caching inference models. | ||
`java_inference_in_bytes`:: | ||
(integer) | ||
Amount of Java heap, in bytes, currently being used for caching inference models. | ||
`java_inference_max`:: | ||
(<<byte-units,byte value>>) | ||
Maximum amount of Java heap to be used for caching inference models. | ||
`java_inference_max_in_bytes`:: | ||
(integer) | ||
Maximum amount of Java heap, in bytes, to be used for caching inference models. | ||
====== | ||
|
||
`mem`:: | ||
(object) | ||
Contains statistics about memory usage for the node. | ||
+ | ||
.Properties of `mem` | ||
[%collapsible%open] | ||
====== | ||
`adjusted_total`:: | ||
(<<byte-units,byte value>>) | ||
If the amount of physical memory has been overridden using the `es.total_memory_bytes` | ||
system property then this reports the overridden value. Otherwise it reports the same | ||
value as `total`. | ||
`adjusted_total_in_bytes`:: | ||
(integer) | ||
If the amount of physical memory has been overridden using the `es.total_memory_bytes` | ||
system property then this reports the overridden value in bytes. Otherwise it reports | ||
the same value as `total_in_bytes`. | ||
`ml`:: | ||
(object) | ||
Contains statistics about {ml} use of native memory on the node. | ||
+ | ||
.Properties of `ml` | ||
[%collapsible%open] | ||
======= | ||
`anomaly_detectors`:: | ||
(<<byte-units,byte value>>) | ||
Amount of native memory set aside for {anomaly-jobs}. | ||
|
||
`anomaly_detectors_in_bytes`:: | ||
(integer) | ||
Amount of native memory, in bytes, set aside for {anomaly-jobs}. | ||
|
||
`data_frame_analytics`:: | ||
(<<byte-units,byte value>>) | ||
Amount of native memory set aside for {dfanalytics-jobs}. | ||
|
||
`data_frame_analytics_in_bytes`:: | ||
(integer) | ||
Amount of native memory, in bytes, set aside for {dfanalytics-jobs}. | ||
|
||
`max`:: | ||
(<<byte-units,byte value>>) | ||
Maximum amount of native memory (separate to the JVM heap) that may be used by {ml} | ||
native processes. | ||
|
||
`max_in_bytes`:: | ||
(integer) | ||
Maximum amount of native memory (separate to the JVM heap), in bytes, that may be | ||
used by {ml} native processes. | ||
|
||
`native_code_overhead`:: | ||
(<<byte-units,byte value>>) | ||
Amount of native memory set aside for loading {ml} native code shared libraries. | ||
|
||
`native_code_overhead_in_bytes`:: | ||
(integer) | ||
Amount of native memory, in bytes, set aside for loading {ml} native code shared libraries. | ||
|
||
`native_inference`:: | ||
(<<byte-units,byte value>>) | ||
Amount of native memory set aside for trained models that have a PyTorch `model_type`. | ||
|
||
`native_inference_in_bytes`:: | ||
(integer) | ||
Amount of native memory, in bytes, set aside for trained models that have a PyTorch `model_type`. | ||
======= | ||
`total`:: | ||
(<<byte-units,byte value>>) | ||
Total amount of physical memory. | ||
`total_in_bytes`:: | ||
(integer) | ||
Total amount of physical memory in bytes. | ||
====== | ||
|
||
`name`:: | ||
(string) | ||
Human-readable identifier for the node. Based on the <<node-name>> setting. | ||
|
||
`roles`:: | ||
(array of strings) | ||
Roles assigned to the node. See <<modules-node>>. | ||
|
||
`transport_address`:: | ||
(string) | ||
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-transport-address] | ||
|
||
===== | ||
==== | ||
|
||
[[get-ml-memory-example]] | ||
== {api-examples-title} | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
GET _ml/memory/_stats?human | ||
-------------------------------------------------- | ||
// TEST[setup:node] | ||
|
||
This is a possible response: | ||
|
||
[source,console-result] | ||
---- | ||
{ | ||
"_nodes": { | ||
"total": 1, | ||
"successful": 1, | ||
"failed": 0 | ||
}, | ||
"cluster_name": "my_cluster", | ||
"nodes": { | ||
"pQHNt5rXTTWNvUgOrdynKg": { | ||
"name": "node-0", | ||
"ephemeral_id": "ITZ6WGZnSqqeT_unfit2SQ", | ||
"transport_address": "127.0.0.1:9300", | ||
"attributes": { | ||
"ml.machine_memory": "68719476736", | ||
"ml.max_jvm_size": "536870912" | ||
}, | ||
"roles": [ | ||
"data", | ||
"data_cold", | ||
"data_content", | ||
"data_frozen", | ||
"data_hot", | ||
"data_warm", | ||
"ingest", | ||
"master", | ||
"ml", | ||
"remote_cluster_client", | ||
"transform" | ||
], | ||
"mem": { | ||
"total": "64gb", | ||
"total_in_bytes": 68719476736, | ||
"adjusted_total": "64gb", | ||
"adjusted_total_in_bytes": 68719476736, | ||
"ml": { | ||
"max": "19.1gb", | ||
"max_in_bytes": 20615843020, | ||
"native_code_overhead": "0b", | ||
"native_code_overhead_in_bytes": 0, | ||
"anomaly_detectors": "0b", | ||
"anomaly_detectors_in_bytes": 0, | ||
"data_frame_analytics": "0b", | ||
"data_frame_analytics_in_bytes": 0, | ||
"native_inference": "0b", | ||
"native_inference_in_bytes": 0 | ||
} | ||
}, | ||
"jvm": { | ||
"heap_max": "512mb", | ||
"heap_max_in_bytes": 536870912, | ||
"java_inference_max": "204.7mb", | ||
"java_inference_max_in_bytes": 214748364, | ||
"java_inference": "0b", | ||
"java_inference_in_bytes": 0 | ||
} | ||
} | ||
} | ||
} | ||
---- | ||
// TESTRESPONSE[s/"cluster_name": "my_cluster"/"cluster_name": $body.cluster_name/] | ||
// TESTRESPONSE[s/"pQHNt5rXTTWNvUgOrdynKg"/\$node_name/] | ||
// TESTRESPONSE[s/"ephemeral_id": "ITZ6WGZnSqqeT_unfit2SQ"/"ephemeral_id": "$body.$_path"/] | ||
// TESTRESPONSE[s/"transport_address": "127.0.0.1:9300"/"transport_address": "$body.$_path"/] | ||
// TESTRESPONSE[s/"attributes": \{[^\}]*\}/"attributes": $body.$_path/] | ||
// TESTRESPONSE[s/"total": "64gb"/"total": "$body.$_path"/] | ||
// TESTRESPONSE[s/"total_in_bytes": 68719476736/"total_in_bytes": $body.$_path/] | ||
// TESTRESPONSE[s/"adjusted_total": "64gb"/"adjusted_total": "$body.$_path"/] | ||
// TESTRESPONSE[s/"adjusted_total_in_bytes": 68719476736/"adjusted_total_in_bytes": $body.$_path/] | ||
// TESTRESPONSE[s/"max": "19.1gb"/"max": "$body.$_path"/] | ||
// TESTRESPONSE[s/"max_in_bytes": 20615843020/"max_in_bytes": $body.$_path/] | ||
// TESTRESPONSE[s/"heap_max": "512mb"/"heap_max": "$body.$_path"/] | ||
// TESTRESPONSE[s/"heap_max_in_bytes": 536870912/"heap_max_in_bytes": $body.$_path/] | ||
// TESTRESPONSE[s/"java_inference_max": "204.7mb"/"java_inference_max": "$body.$_path"/] | ||
// TESTRESPONSE[s/"java_inference_max_in_bytes": 214748364/"java_inference_max_in_bytes": $body.$_path/] |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
include::ml-apis.asciidoc[leveloffset=+1] | ||
//GET | ||
include::get-ml-info.asciidoc[leveloffset=+2] | ||
include::get-ml-memory.asciidoc[leveloffset=+2] | ||
//SET | ||
include::set-upgrade-mode.asciidoc[leveloffset=+2] | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
45 changes: 45 additions & 0 deletions
45
rest-api-spec/src/main/resources/rest-api-spec/api/ml.get_memory_stats.json
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
{ | ||
"ml.get_memory_stats":{ | ||
"documentation":{ | ||
"url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/get-ml-memory.html", | ||
"description":"Returns information on how ML is using memory." | ||
}, | ||
"stability":"stable", | ||
"visibility":"public", | ||
"headers":{ | ||
"accept": [ "application/json"] | ||
}, | ||
"url":{ | ||
"paths":[ | ||
{ | ||
"path":"/_ml/memory/_stats", | ||
"methods":[ | ||
"GET" | ||
] | ||
}, | ||
{ | ||
"path":"/_ml/memory/{node_id}/_stats", | ||
"methods":[ | ||
"GET" | ||
], | ||
"parts":{ | ||
"node_id":{ | ||
"type":"string", | ||
"description":"Specifies the node or nodes to retrieve stats for." | ||
} | ||
} | ||
} | ||
] | ||
}, | ||
"params":{ | ||
"master_timeout":{ | ||
"type":"time", | ||
"description":"Explicit operation timeout for connection to master node" | ||
}, | ||
"timeout":{ | ||
"type":"time", | ||
"description":"Explicit operation timeout" | ||
} | ||
} | ||
} | ||
} |
Oops, something went wrong.