Skip to content

Commit

Permalink
[ML] Add ML memory stats API (#83802)
Browse files Browse the repository at this point in the history
Adds an API that can be used to find out how much memory ML
is permitted to use and is currently using on each node, both
within the JVM heap, and natively, outside of the JVM.
  • Loading branch information
droberts195 committed Feb 17, 2022
1 parent 48e562a commit bf00ab3
Show file tree
Hide file tree
Showing 29 changed files with 2,043 additions and 61 deletions.
5 changes: 5 additions & 0 deletions docs/changelog/83802.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 83802
summary: Add ML memory stats API
area: Machine Learning
type: enhancement
issues: []
310 changes: 310 additions & 0 deletions docs/reference/ml/common/apis/get-ml-memory.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,310 @@
[role="xpack"]
[[get-ml-memory]]
= Get machine learning memory stats API

[subs="attributes"]
++++
<titleabbrev>Get {ml} memory stats</titleabbrev>
++++

Returns information on how {ml} is using memory.

[[get-ml-memory-request]]
== {api-request-title}

`GET _ml/memory/_stats` +
`GET _ml/memory/<node_id>/_stats`

[[get-ml-memory-prereqs]]
== {api-prereq-title}

Requires the `monitor_ml` cluster privilege. This privilege is included in the
`machine_learning_user` built-in role.

[[get-ml-memory-desc]]
== {api-description-title}

Get information about how {ml} jobs and trained models are using memory, on each
node, both within the JVM heap, and natively, outside of the JVM.

[[get-ml-memory-path-params]]
== {api-path-parms-title}

`<node_id>`::
(Optional, string) The names of particular nodes in the cluster to target.
For example, `nodeId1,nodeId2` or `ml:true`. For node selection options,
see <<cluster-nodes>>.

[[get-ml-memory-query-parms]]
== {api-query-parms-title}

`human`::
Specify this query parameter to include the fields with units in the response.
Otherwise only the `_in_bytes` sizes are returned in the response.

include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=timeoutparms]

[role="child_attributes"]
[[get-ml-memory-response-body]]
== {api-response-body-title}

`_nodes`::
(object)
Contains statistics about the number of nodes selected by the request.
+
.Properties of `_nodes`
[%collapsible%open]
====
`failed`::
(integer)
Number of nodes that rejected the request or failed to respond. If this value
is not `0`, a reason for the rejection or failure is included in the response.
`successful`::
(integer)
Number of nodes that responded successfully to the request.
`total`::
(integer)
Total number of nodes selected by the request.
====

`cluster_name`::
(string)
Name of the cluster. Based on the <<cluster-name,cluster.name>> setting.

`nodes`::
(object)
Contains statistics for the nodes selected by the request.
+
.Properties of `nodes`
[%collapsible%open]
====
`<node_id>`::
(object)
Contains statistics for the node.
+
.Properties of `<node_id>`
[%collapsible%open]
=====
`attributes`::
(object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-attributes]

`ephemeral_id`::
(string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-ephemeral-id]

`jvm`::
(object)
Contains Java Virtual Machine (JVM) statistics for the node.
+
.Properties of `jvm`
[%collapsible%open]
======
`heap_max`::
(<<byte-units,byte value>>)
Maximum amount of memory available for use by the heap.
`heap_max_in_bytes`::
(integer)
Maximum amount of memory, in bytes, available for use by the heap.
`java_inference`::
(<<byte-units,byte value>>)
Amount of Java heap currently being used for caching inference models.
`java_inference_in_bytes`::
(integer)
Amount of Java heap, in bytes, currently being used for caching inference models.
`java_inference_max`::
(<<byte-units,byte value>>)
Maximum amount of Java heap to be used for caching inference models.
`java_inference_max_in_bytes`::
(integer)
Maximum amount of Java heap, in bytes, to be used for caching inference models.
======

`mem`::
(object)
Contains statistics about memory usage for the node.
+
.Properties of `mem`
[%collapsible%open]
======
`adjusted_total`::
(<<byte-units,byte value>>)
If the amount of physical memory has been overridden using the `es.total_memory_bytes`
system property then this reports the overridden value. Otherwise it reports the same
value as `total`.
`adjusted_total_in_bytes`::
(integer)
If the amount of physical memory has been overridden using the `es.total_memory_bytes`
system property then this reports the overridden value in bytes. Otherwise it reports
the same value as `total_in_bytes`.
`ml`::
(object)
Contains statistics about {ml} use of native memory on the node.
+
.Properties of `ml`
[%collapsible%open]
=======
`anomaly_detectors`::
(<<byte-units,byte value>>)
Amount of native memory set aside for {anomaly-jobs}.

`anomaly_detectors_in_bytes`::
(integer)
Amount of native memory, in bytes, set aside for {anomaly-jobs}.

`data_frame_analytics`::
(<<byte-units,byte value>>)
Amount of native memory set aside for {dfanalytics-jobs}.

`data_frame_analytics_in_bytes`::
(integer)
Amount of native memory, in bytes, set aside for {dfanalytics-jobs}.

`max`::
(<<byte-units,byte value>>)
Maximum amount of native memory (separate to the JVM heap) that may be used by {ml}
native processes.

`max_in_bytes`::
(integer)
Maximum amount of native memory (separate to the JVM heap), in bytes, that may be
used by {ml} native processes.

`native_code_overhead`::
(<<byte-units,byte value>>)
Amount of native memory set aside for loading {ml} native code shared libraries.

`native_code_overhead_in_bytes`::
(integer)
Amount of native memory, in bytes, set aside for loading {ml} native code shared libraries.

`native_inference`::
(<<byte-units,byte value>>)
Amount of native memory set aside for trained models that have a PyTorch `model_type`.

`native_inference_in_bytes`::
(integer)
Amount of native memory, in bytes, set aside for trained models that have a PyTorch `model_type`.
=======
`total`::
(<<byte-units,byte value>>)
Total amount of physical memory.
`total_in_bytes`::
(integer)
Total amount of physical memory in bytes.
======

`name`::
(string)
Human-readable identifier for the node. Based on the <<node-name>> setting.

`roles`::
(array of strings)
Roles assigned to the node. See <<modules-node>>.

`transport_address`::
(string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-transport-address]

=====
====

[[get-ml-memory-example]]
== {api-examples-title}

[source,console]
--------------------------------------------------
GET _ml/memory/_stats?human
--------------------------------------------------
// TEST[setup:node]

This is a possible response:

[source,console-result]
----
{
"_nodes": {
"total": 1,
"successful": 1,
"failed": 0
},
"cluster_name": "my_cluster",
"nodes": {
"pQHNt5rXTTWNvUgOrdynKg": {
"name": "node-0",
"ephemeral_id": "ITZ6WGZnSqqeT_unfit2SQ",
"transport_address": "127.0.0.1:9300",
"attributes": {
"ml.machine_memory": "68719476736",
"ml.max_jvm_size": "536870912"
},
"roles": [
"data",
"data_cold",
"data_content",
"data_frozen",
"data_hot",
"data_warm",
"ingest",
"master",
"ml",
"remote_cluster_client",
"transform"
],
"mem": {
"total": "64gb",
"total_in_bytes": 68719476736,
"adjusted_total": "64gb",
"adjusted_total_in_bytes": 68719476736,
"ml": {
"max": "19.1gb",
"max_in_bytes": 20615843020,
"native_code_overhead": "0b",
"native_code_overhead_in_bytes": 0,
"anomaly_detectors": "0b",
"anomaly_detectors_in_bytes": 0,
"data_frame_analytics": "0b",
"data_frame_analytics_in_bytes": 0,
"native_inference": "0b",
"native_inference_in_bytes": 0
}
},
"jvm": {
"heap_max": "512mb",
"heap_max_in_bytes": 536870912,
"java_inference_max": "204.7mb",
"java_inference_max_in_bytes": 214748364,
"java_inference": "0b",
"java_inference_in_bytes": 0
}
}
}
}
----
// TESTRESPONSE[s/"cluster_name": "my_cluster"/"cluster_name": $body.cluster_name/]
// TESTRESPONSE[s/"pQHNt5rXTTWNvUgOrdynKg"/\$node_name/]
// TESTRESPONSE[s/"ephemeral_id": "ITZ6WGZnSqqeT_unfit2SQ"/"ephemeral_id": "$body.$_path"/]
// TESTRESPONSE[s/"transport_address": "127.0.0.1:9300"/"transport_address": "$body.$_path"/]
// TESTRESPONSE[s/"attributes": \{[^\}]*\}/"attributes": $body.$_path/]
// TESTRESPONSE[s/"total": "64gb"/"total": "$body.$_path"/]
// TESTRESPONSE[s/"total_in_bytes": 68719476736/"total_in_bytes": $body.$_path/]
// TESTRESPONSE[s/"adjusted_total": "64gb"/"adjusted_total": "$body.$_path"/]
// TESTRESPONSE[s/"adjusted_total_in_bytes": 68719476736/"adjusted_total_in_bytes": $body.$_path/]
// TESTRESPONSE[s/"max": "19.1gb"/"max": "$body.$_path"/]
// TESTRESPONSE[s/"max_in_bytes": 20615843020/"max_in_bytes": $body.$_path/]
// TESTRESPONSE[s/"heap_max": "512mb"/"heap_max": "$body.$_path"/]
// TESTRESPONSE[s/"heap_max_in_bytes": 536870912/"heap_max_in_bytes": $body.$_path/]
// TESTRESPONSE[s/"java_inference_max": "204.7mb"/"java_inference_max": "$body.$_path"/]
// TESTRESPONSE[s/"java_inference_max_in_bytes": 214748364/"java_inference_max_in_bytes": $body.$_path/]
1 change: 1 addition & 0 deletions docs/reference/ml/common/apis/index.asciidoc
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
include::ml-apis.asciidoc[leveloffset=+1]
//GET
include::get-ml-info.asciidoc[leveloffset=+2]
include::get-ml-memory.asciidoc[leveloffset=+2]
//SET
include::set-upgrade-mode.asciidoc[leveloffset=+2]

18 changes: 7 additions & 11 deletions docs/reference/ml/common/apis/ml-apis.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -2,18 +2,14 @@
[[ml-apis]]
= {ml-cap} APIs

You can use the following APIs to retrieve information related to the {stack-ml-features}.
You can use the following APIs to retrieve information related to the
{stack-ml-features}:

See also <<ml-ad-apis>>, <<ml-df-analytics-apis>>, and <<ml-df-trained-models-apis>>.

[discrete]
[[ml-api-ml-info-endpoint]]
== Info

* <<get-ml-info,Machine learning info>>
* <<get-ml-info,Get machine learning info>>
* <<get-ml-memory,Get machine learning memory stats>>

[discrete]
[[ml-set-upgrade-mode-endpoint]]
== Set upgrade mode
The following API is useful when you upgrade:

* <<ml-set-upgrade-mode,Set upgrade mode>>

See also <<ml-ad-apis>>, <<ml-df-analytics-apis>>, and <<ml-df-trained-models-apis>>.
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
{
"ml.get_memory_stats":{
"documentation":{
"url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/get-ml-memory.html",
"description":"Returns information on how ML is using memory."
},
"stability":"stable",
"visibility":"public",
"headers":{
"accept": [ "application/json"]
},
"url":{
"paths":[
{
"path":"/_ml/memory/_stats",
"methods":[
"GET"
]
},
{
"path":"/_ml/memory/{node_id}/_stats",
"methods":[
"GET"
],
"parts":{
"node_id":{
"type":"string",
"description":"Specifies the node or nodes to retrieve stats for."
}
}
}
]
},
"params":{
"master_timeout":{
"type":"time",
"description":"Explicit operation timeout for connection to master node"
},
"timeout":{
"type":"time",
"description":"Explicit operation timeout"
}
}
}
}

0 comments on commit bf00ab3

Please sign in to comment.