Node stats as metrics #102248
Conversation
Hi @piergm, I've created a changelog YAML for you.
Hi @piergm, I think many of these stats are counters (they are cumulative over the lifecycle of the node and only increase) rather than gauges (which represent a point-in-time value and can go up and down).
server/src/main/java/org/elasticsearch/monitor/metrics/NodeMetrics.java (outdated review thread, resolved)
The API has the capability to aggregate stats of all the nodes connected to the cluster. We decided instead that each node will report its own stats, in order not to hit the wire and cause unwanted latencies.
I was thinking about this, and in addition to latencies, I also think it might be incorrect. This would otherwise cause double counting if each node reported aggregations for the whole cluster, as there would be overlap in the counts. All this to say that I agree that doing things at the node level is the right approach.
Nice work!
Hi @DaveCTurner, I do agree with you that many of these stats should be counters. But this would complicate things a little bit due to the […]
I'm not sure, but my concern is that these counters sometimes reset (when a node restarts) and therefore aggregates of these counters will sometimes suddenly decrease significantly. Counter metrics know to treat these large backwards jumps as reset events and ignore them when showing the rate of change of the value over time, whereas I expect gauges will not handle them so gracefully.
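The reset behaviour described above can be sketched in plain Java (illustrative only; `CounterRate` is a hypothetical name, not Elasticsearch or any real monitoring-backend code):

```java
// Illustrative sketch: how a time-series backend typically derives the
// increase of a monotonic counter between two samples, treating a
// backwards jump (e.g. after a node restart) as a reset rather than a
// negative rate.
final class CounterRate {
    /** Increase between two samples; a decrease is treated as a counter reset. */
    static long increase(long previous, long current) {
        // After a restart the counter restarts near zero, so on a backwards
        // jump we count the new value from zero instead of going negative.
        return current >= previous ? current - previous : current;
    }
}
```

A gauge carries no such convention, which is why aggregating restarting counters as gauges would show sudden drops.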
@DaveCTurner I see your concern, thanks for pointing this out! I'll check.
@JVerwolf If I had gone for the aggregated stats, then I would have reported them only from one node (probably the master node).
Please wherever possible avoid doing this kind of thing on the master node. It doesn't need to be on the master, and there are normally many better-resourced nodes from which to choose. |
```java
    }

    @Override
    protected void doClose() throws IOException {}
```
Maybe here we could call `close()` on all counters and gauges, wdyt?
the try-with-resources is not usable here, since the AutoCloseable is the metric itself, therefore we would close the metric right after it has been registered:

```java
try (registry.registerLongGauge(...)) {
    // do nothing
}
// here the close() method would have been called, therefore de-registering the metric
```

Instead I have an ArrayList of all the metrics and call `close()` on all counters and gauges in the `doClose()` of the `AbstractLifecycleComponent`. And since `doClose` is called only once, it is safe to de-register the metrics there.
Java doc for the `doClose` method: "It is called once in the lifetime of a component. If the component was started then it will be stopped before it is closed, and once it is closed it will not be started or stopped."
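A minimal sketch of the pattern described above, with a plain `AutoCloseable` standing in for a registered metric (names are illustrative, not the PR's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect every registered metric, then de-register
// them all exactly once in the lifecycle's close hook.
final class NodeMetricsSketch {
    private final List<AutoCloseable> metrics = new ArrayList<>();

    // e.g. the handle returned by registry.registerLongGauge(...)
    void register(AutoCloseable metric) {
        metrics.add(metric);
    }

    /** Called once by the component lifecycle; closes (de-registers) every metric. */
    void doClose() {
        for (AutoCloseable metric : metrics) {
            try {
                metric.close();
            } catch (Exception ignore) {
                // close() is not expected to throw; swallow so the
                // remaining metrics still get de-registered
            }
        }
    }
}
```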
Looks great, my only uncertainty is just around closing the resources for the async counters, otherwise G2G. Nice work!
```java
    }

    @Override
    protected void doClose() throws IOException {}
```
Nice work! LGTM. I had a question just double-checking that the lifecycle of the metrics registration is the same as the metric registry. If so, I think this is safe to merge.
```java
try {
    metric.close();
} catch (Exception ignore) {
    // metrics close() method does not throw Exception
}
```
My gut feeling is that this could be a `logger.warn("metrics close() method should not throw Exception", exception)` so that we have visibility into unchecked exceptions should they (unexpectedly) occur. However, I think it's fine to do that as a follow-up, and this should not block merging.
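The suggestion above, sketched with `java.util.logging` standing in for the real logger (class and method names are illustrative):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: log unexpected exceptions from close() instead of
// silently swallowing them, so they remain visible in the node logs.
final class CloseWithWarn {
    private static final Logger logger = Logger.getLogger(CloseWithWarn.class.getName());

    static void closeQuietly(AutoCloseable metric) {
        try {
            metric.close();
        } catch (Exception e) {
            // close() should never throw; if it does, surface it rather than hide it
            logger.log(Level.WARNING, "metrics close() method should not throw Exception", e);
        }
    }
}
```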
```diff
@@ -1074,6 +1077,7 @@ record PluginServiceInstances(
     b.bind(SearchPhaseController.class).toInstance(new SearchPhaseController(searchService::aggReduceContextBuilder));
     b.bind(Transport.class).toInstance(transport);
     b.bind(TransportService.class).toInstance(transportService);
+    b.bind(NodeMetrics.class).toInstance(nodeMetrics);
```
I'm not familiar with how Node startup works. Are we sure that this has the same lifecycle as `nodeMetrics`, i.e. it starts/loads and stops/unloads at the same time? If it doesn't, we might get exceptions for double registration.
This should be safe: I split the class initialisation from the metric registration and put the registration in the overridden doStart method, which "is only called once in the lifetime of a component", therefore avoiding double registration. On top of that, doClose "is called once in the lifetime of a component. If the component was started then it will be stopped before it is closed, and once it is closed it will not be started or stopped." (Both quotes are from the `AbstractLifecycleComponent` Java docs.)
So we are safe regarding double registration IMHO.
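A minimal sketch (not the real `AbstractLifecycleComponent`) of the lifecycle contract quoted above: the start and close hooks each run at most once, which is what makes registering in `doStart` and de-registering in `doClose` safe:

```java
// Hypothetical lifecycle component: repeated start() or close() calls are
// ignored, so the doStart/doClose-equivalents run exactly once each.
final class OnceLifecycle {
    private boolean started;
    private boolean closed;
    private int startCalls;
    private int closeCalls;

    void start() {
        if (started || closed) return; // subsequent calls are no-ops
        started = true;
        startCalls++;                  // doStart-equivalent: runs once
    }

    void close() {
        if (closed) return;            // subsequent calls are no-ops
        closed = true;
        closeCalls++;                  // doClose-equivalent: runs once
    }

    int startCalls() { return startCalls; }
    int closeCalls() { return closeCalls; }
}
```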
@elasticmachine update branch
@elasticmachine run elasticsearch-ci/docs
These new metric names can be simpler. This commit removes the redundant prefix "node.stats". relates elastic#102248
In ES there are node stats that can be retrieved via an API call (`GET /_nodes/stats`) but are not scraped by Metricbeat. This PR registers some of those stats as metrics.
The API has the capability to aggregate stats of all the nodes connected to the cluster. We decided instead that each node will report its own stats, in order not to hit the wire and cause unwanted latencies.
All the metrics are registered as either `LongAsyncCounter` or `LongGauge`, both of which have a callback reporting the total value for a metric and not the delta. We have in place a lazy cache for `NodeStats` that expires after 1 minute, in order not to recalculate it for every metric callback.
List of metrics that this PR will introduce: