
Node stats as metrics #102248

Merged
47 commits merged from the indices-metrics branch into elastic:main on Dec 5, 2023

Conversation

@piergm (Member) commented Nov 15, 2023

In ES there are node stats that can be retrieved via an API call (GET /_nodes/stats) but are not scraped by Metricbeat.
This PR registers some of those stats as metrics.

The API has the capability to aggregate stats across all the nodes connected to the cluster. We decided instead that each node will report its own stats, in order to avoid going over the wire and causing unwanted latency.

All the metrics are registered as either LongAsyncCounter or LongGauge, both of which have a callback that reports the total value for a metric rather than a delta.
For NodeStats we have a lazy cache that expires after 1 minute, so the stats are not recalculated for every metric callback (see the sketch after the metric list below).

List of metrics that this PR will introduce:

  • es.node.stats.indices.get.total
  • es.node.stats.indices.get.time
  • es.node.stats.indices.search.fetch.total
  • es.node.stats.indices.search.fetch.time
  • es.node.stats.indices.merge.total
  • es.node.stats.indices.merge.time
  • es.node.stats.indices.translog.operations
  • es.node.stats.indices.translog.size
  • es.node.stats.indices.translog.uncommitted_operations
  • es.node.stats.indices.translog.uncommitted_size
  • es.node.stats.indices.translog.earliest_last_modified_age
  • es.node.stats.transport.rx_size
  • es.node.stats.transport.tx_size
  • es.node.stats.jvm.mem.pools.young.used
  • es.node.stats.jvm.mem.pools.survivor.used
  • es.node.stats.jvm.mem.pools.old.used
  • es.node.stats.fs.io_stats.io_time.total
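
As a rough illustration of the description above, here is a minimal Java sketch of a gauge whose callback reads from a lazily refreshed NodeStats value. MetricsRegistry, NodeStatsSource and the registerLongGauge signature are hypothetical stand-ins, not the actual Elasticsearch telemetry API:

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch only: MetricsRegistry and NodeStatsSource are hypothetical
// stand-ins; the real telemetry API and metric wiring differ in detail.
class NodeMetricsSketch {

    interface MetricsRegistry {
        AutoCloseable registerLongGauge(String name, Supplier<Long> observer);
    }

    interface NodeStatsSource {
        long translogSizeInBytes();
    }

    private volatile long cachedTranslogSize;
    private volatile long lastRefreshNanos;

    void register(MetricsRegistry registry, NodeStatsSource stats) {
        // The gauge callback reports the current total; the underlying stats are
        // recomputed lazily, at most once per minute, mirroring the cache described above.
        registry.registerLongGauge("es.node.stats.indices.translog.size", () -> {
            long now = System.nanoTime();
            if (now - lastRefreshNanos > TimeUnit.MINUTES.toNanos(1)) {
                cachedTranslogSize = stats.translogSizeInBytes();
                lastRefreshNanos = now;
            }
            return cachedTranslogSize;
        });
    }
}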

@piergm added the >enhancement, :Search/Search (Search-related issues that do not fall into other categories), Team:Search (Meta label for search team) and v8.12.0 labels on Nov 15, 2023
@piergm self-assigned this on Nov 15, 2023
@elasticsearchmachine (Collaborator) commented:

Hi @piergm, I've created a changelog YAML for you.

@DaveCTurner (Contributor) commented:

All the metrics are registered as LongGauges, since the value provided is not a delta from the last read but the total each time.

Hi @piergm, I think many of these stats are counters (they are cumulative over the lifecycle of the node and only increase) rather than gauges (which represent a point-in-time value and can go up and down).

@JVerwolf (Contributor) left a comment:

The API has the capability to aggregate stats across all the nodes connected to the cluster. We decided instead that each node will report its own stats, in order to avoid going over the wire and causing unwanted latency.

I was thinking about this, and in addition to latency I also think it might be incorrect: if each node reported aggregations for the whole cluster there would be overlap in the counts, causing double counting. All this to say that I agree that doing things at the node level is the right approach.

Nice work!

@piergm (Member Author) commented Nov 16, 2023

Hi @piergm, I think many of these stats are counters (they are cumulative over the lifecycle of the node and only increase) rather than gauges (which represent a point-in-time value and can go up and down).

Hi @DaveCTurner, I do agree that many of these stats should be counters. But that would complicate things a bit: the LongCounter and LongUpAndDownCounter implementations expose only incrementBy and no set method, leaving it to the reporting code to calculate the delta used to increment the metric. That's easy enough, but the issues don't stop there: while gauges read the current value during the reporting period, counters use a push mechanism, where you increment the counter value when a certain event happens. That model doesn't really fit node stats. I had a work-around using SingleObjectCache, making the cache refresh the event that updates the counter values, and even though it was hacky at best, it worked nicely. But with the delta calculation and the period-based reporting I was mimicking what gauges do, so I decided to switch entirely to gauges. Do you think I should go back to counters?
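
For illustration, a minimal Java sketch of the delta bookkeeping a counter-based approach would require, using a hypothetical LongCounterSketch stand-in rather than the real LongCounter API:

import java.util.function.LongSupplier;

// Hypothetical stand-in for a counter that only supports increments, as described above.
interface LongCounterSketch {
    void incrementBy(long delta);
}

class CounterDeltaSketch {
    private long lastReportedTotal = 0;

    // With a counter, the reporting code must remember the last reported total and push
    // the difference on every refresh; a gauge instead simply returns the current total
    // from its callback when the registry polls it.
    void onRefresh(LongCounterSketch counter, LongSupplier currentTotal) {
        long total = currentTotal.getAsLong();
        counter.incrementBy(total - lastReportedTotal);
        lastReportedTotal = total;
    }
}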

@DaveCTurner (Contributor) commented:

I'm not sure, but my concern is that these counters sometimes reset (when a node restarts) and therefore aggregates of these counters will sometimes suddenly decrease significantly. Counter metrics know to treat these large backwards jumps as reset events and ignore them when showing the rate of change of the value over time, whereas I expect gauges will not handle them so gracefully.

@piergm (Member Author) commented Nov 16, 2023

@DaveCTurner I see your concern, thanks for pointing this out! I'll check.

@piergm (Member Author) commented Nov 16, 2023

I was thinking about this, and in addition to latency I also think it might be incorrect: if each node reported aggregations for the whole cluster there would be overlap in the counts, causing double counting. All this to say that I agree that doing things at the node level is the right approach.

@JVerwolf If I had gone for the aggregated stats, I would have reported them only from one node (probably the master node).

@DaveCTurner (Contributor) commented:

reported them only from one node (probably the master node).

Please, wherever possible, avoid doing this kind of thing on the master node. It doesn't need to be on the master, and there are normally many better-resourced nodes to choose from.

}

@Override
protected void doClose() throws IOException {}
@piergm (Member Author) commented:

Maybe here we could call close() on all counters and gauges, wdyt?

A Contributor commented:

I was wondering about that too:
[image]

@piergm (Member Author) commented Dec 1, 2023:

Try-with-resources is not usable here, since the AutoCloseable is the metric itself, so we would close the metric right after it has been registered:

try (LongGauge gauge = registry.registerLongGauge(...)) {
    // do nothing
}
// here close() would already have been called, de-registering the metric

Instead, I keep an ArrayList of all the metrics and call close() on all counters and gauges in the doClose() of the AbstractLifecycleComponent.
Since doClose is called only once, it is safe to de-register the metrics there.
Javadoc for the doClose method:

It is called once in the lifetime of a component. If the component was started then it will be stopped before it is closed, and once it is closed it will not be started or stopped.

@JVerwolf (Contributor) left a comment:

Looks great, my only uncertainty is just around closing the resources for the async counters, otherwise G2G. Nice work!

@piergm requested a review from JVerwolf December 1, 2023 14:36
@JVerwolf (Contributor) left a comment:

Nice work! LGTM. I had a question just double-checking that the lifecycle of the metrics registration is the same as the metric registry. If so, I think this is safe to merge.

try {
    metric.close();
} catch (Exception ignore) {
    // metrics close() method does not throw Exception
A Contributor commented:

My gut feeling is that this could be a logger.warn("metrics close() method should not throw Exception", exception) so that we have visibility into unchecked exceptions should they (unexpectedly) occur. However, I think it's fine to do that as a follow-up, and this should not block merging.
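
A rough sketch of what that follow-up could look like; the class and method names here are placeholders, not the merged code:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Sketch of the suggested follow-up: warn instead of silently swallowing the exception.
class MetricsCloser {
    private static final Logger logger = LogManager.getLogger(MetricsCloser.class);

    static void closeQuietly(AutoCloseable metric) {
        try {
            metric.close();
        } catch (Exception e) {
            logger.warn("metrics close() method should not throw Exception", e);
        }
    }
}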

@@ -1074,6 +1077,7 @@ record PluginServiceInstances(
b.bind(SearchPhaseController.class).toInstance(new SearchPhaseController(searchService::aggReduceContextBuilder));
b.bind(Transport.class).toInstance(transport);
b.bind(TransportService.class).toInstance(transportService);
b.bind(NodeMetrics.class).toInstance(nodeMetrics);
A Contributor commented:

I'm not familiar with how Node startup works. Are we sure that this has the same lifecycle as nodeMetrics, i.e. it starts/loads and stops/unloads at the same time? If it doesn't, we might get exceptions for double registration.

@piergm (Member Author) commented:

This should be safe. I split the class initialisation from the metric registration and put the registration in the overridden doStart method, which "is only called once in the lifetime of a component", therefore avoiding double registration. On top of that, doClose "is called once in the lifetime of a component. If the component was started then it will be stopped before it is closed, and once it is closed it will not be started or stopped." (Both quotes are from the AbstractLifecycleComponent Javadocs.)
So we are safe regarding double registration IMHO.
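
As a rough sketch of the lifecycle described above (assuming the usual AbstractLifecycleComponent base class; the class name, field names and registration call are placeholders, not the merged implementation):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.common.component.AbstractLifecycleComponent;

// Register in doStart (called once), de-register in doClose (also called once).
public class NodeMetricsLifecycleSketch extends AbstractLifecycleComponent {

    private final List<AutoCloseable> metrics = new ArrayList<>();

    @Override
    protected void doStart() {
        // register gauges/counters here, e.g. metrics.add(registry.registerLongGauge(...));
    }

    @Override
    protected void doStop() {}

    @Override
    protected void doClose() throws IOException {
        for (AutoCloseable metric : metrics) {
            try {
                metric.close();
            } catch (Exception e) {
                // metric close() is not expected to throw
            }
        }
    }
}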

@piergm (Member Author) commented Dec 4, 2023

@elasticmachine update branch

@piergm (Member Author) commented Dec 5, 2023

@elasticmachine run elasticsearch-ci/docs

@piergm added the auto-merge label (automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) Dec 5, 2023
@elasticsearchmachine merged commit ccf92e4 into elastic:main Dec 5, 2023
15 checks passed
@piergm deleted the indices-metrics branch December 5, 2023 07:43
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Dec 6, 2023
These new metric names can be simpler. This commit removes the redundant
prefix "node.stats".

relates elastic#102248
elasticsearchmachine pushed a commit that referenced this pull request Dec 22, 2023
These new metric names can be simpler. This commit removes the redundant
prefix "node.stats".

relates #102248
jbaiera pushed a commit to jbaiera/elasticsearch that referenced this pull request Jan 10, 2024
These new metric names can be simpler. This commit removes the redundant
prefix "node.stats".

relates elastic#102248