
Node stats as metrics #102248

Merged
47 commits merged from the indices-metrics branch into elastic:main on Dec 5, 2023

Conversation

@piergm (Member) commented Nov 15, 2023

In ES there are node stats that can be retrieved via an API call (GET /_nodes/stats) but are not scraped by Metricbeat.
This PR registers some of those stats as metrics.

The API has the capability to aggregate stats across all the nodes connected to the cluster. We decided instead that each node will report its own stats, in order to avoid going over the wire and causing unwanted latency.

All the metrics are registered as either LongAsyncCounter or LongGauge, both of which have a callback that reports the total value for a metric rather than a delta.
For NodeStats we have a lazy cache that expires after 1 minute, so the stats are not recalculated for every metric callback (see the sketch after the metric list below).

List of metrics that this PR will introduce:

  • es.node.stats.indices.get.total
  • es.node.stats.indices.get.time
  • es.node.stats.indices.search.fetch.total
  • es.node.stats.indices.search.fetch.time
  • es.node.stats.indices.merge.total
  • es.node.stats.indices.merge.time
  • es.node.stats.indices.translog.operations
  • es.node.stats.indices.translog.size
  • es.node.stats.indices.translog.uncommitted_operations
  • es.node.stats.indices.translog.uncommitted_size
  • es.node.stats.indices.translog.earliest_last_modified_age
  • es.node.stats.transport.rx_size
  • es.node.stats.transport.tx_size
  • es.node.stats.jvm.mem.pools.young.used
  • es.node.stats.jvm.mem.pools.survivor.used
  • es.node.stats.jvm.mem.pools.old.used
  • es.node.stats.fs.io_stats.io_time.total
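
As a rough illustration of the description above, here is a minimal Java sketch of a gauge whose callback reads from a lazily refreshed NodeStats value. MetricsRegistry, NodeStatsSource and the registerLongGauge signature are hypothetical stand-ins, not the actual Elasticsearch telemetry API:

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch only: MetricsRegistry and NodeStatsSource are hypothetical
// stand-ins; the real telemetry API and metric wiring differ in detail.
class NodeMetricsSketch {

    interface MetricsRegistry {
        AutoCloseable registerLongGauge(String name, Supplier<Long> observer);
    }

    interface NodeStatsSource {
        long translogSizeInBytes();
    }

    private volatile long cachedTranslogSize;
    private volatile long lastRefreshNanos;

    void register(MetricsRegistry registry, NodeStatsSource stats) {
        // The gauge callback reports the current total; the underlying stats are
        // recomputed lazily, at most once per minute, mirroring the cache described above.
        registry.registerLongGauge("es.node.stats.indices.translog.size", () -> {
            long now = System.nanoTime();
            if (now - lastRefreshNanos > TimeUnit.MINUTES.toNanos(1)) {
                cachedTranslogSize = stats.translogSizeInBytes();
                lastRefreshNanos = now;
            }
            return cachedTranslogSize;
        });
    }
}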

@piergm added the >enhancement, :Search/Search (Search-related issues that do not fall into other categories), Team:Search (Meta label for search team) and v8.12.0 labels on Nov 15, 2023
@piergm self-assigned this on Nov 15, 2023
@elasticsearchmachine (Collaborator) commented:

Hi @piergm, I've created a changelog YAML for you.

@DaveCTurner (Contributor) commented:

All the metrics are registered as LongGauges, since the value provided is not a delta from the last read but the total each time.

Hi @piergm, I think many of these stats are counters (they are cumulative over the lifecycle of the node and only increase) rather than gauges (which represent a point-in-time value and can go up and down).

@JVerwolf (Contributor) left a comment:

The API has the capability to aggregate stats across all the nodes connected to the cluster. We decided instead that each node will report its own stats, in order to avoid going over the wire and causing unwanted latency.

I was thinking about this, and in addition to latency I also think it might be incorrect: if each node reported aggregations for the whole cluster there would be overlap in the counts, causing double counting. All this to say that I agree that doing things at the node level is the right approach.

Nice work!

@piergm (Member Author) commented Nov 16, 2023

Hi @piergm, I think many of these stats are counters (they are cumulative over the lifecycle of the node and only increase) rather than gauges (which represent a point-in-time value and can go up and down).

Hi @DaveCTurner, I do agree that many of these stats should be counters. But that would complicate things a bit: the LongCounter and LongUpAndDownCounter implementations expose only incrementBy and no set method, leaving it to the reporting code to calculate the delta used to increment the metric. That's easy enough, but the issues don't stop there: while gauges read the current value during the reporting period, counters use a push mechanism, where you increment the counter value when a certain event happens. That model doesn't really fit node stats. I had a work-around using SingleObjectCache, making the cache refresh the event that updates the counter values, and even though it was hacky at best, it worked nicely. But with the delta calculation and the period-based reporting I was mimicking what gauges do, so I decided to switch entirely to gauges. Do you think I should go back to counters?
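
For illustration, a minimal Java sketch of the delta bookkeeping a counter-based approach would require, using a hypothetical LongCounterSketch stand-in rather than the real LongCounter API:

import java.util.function.LongSupplier;

// Hypothetical stand-in for a counter that only supports increments, as described above.
interface LongCounterSketch {
    void incrementBy(long delta);
}

class CounterDeltaSketch {
    private long lastReportedTotal = 0;

    // With a counter, the reporting code must remember the last reported total and push
    // the difference on every refresh; a gauge instead simply returns the current total
    // from its callback when the registry polls it.
    void onRefresh(LongCounterSketch counter, LongSupplier currentTotal) {
        long total = currentTotal.getAsLong();
        counter.incrementBy(total - lastReportedTotal);
        lastReportedTotal = total;
    }
}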

@DaveCTurner (Contributor) commented:

I'm not sure, but my concern is that these counters sometimes reset (when a node restarts) and therefore aggregates of these counters will sometimes suddenly decrease significantly. Counter metrics know to treat these large backwards jumps as reset events and ignore them when showing the rate of change of the value over time, whereas I expect gauges will not handle them so gracefully.

@piergm (Member Author) commented Nov 16, 2023

@DaveCTurner I see your concern, thanks for pointing this out! I'll check.

@piergm (Member Author) commented Nov 16, 2023

I was thinking about this, and in addition to latency I also think it might be incorrect: if each node reported aggregations for the whole cluster there would be overlap in the counts, causing double counting. All this to say that I agree that doing things at the node level is the right approach.

@JVerwolf If I had gone for the aggregated stats, I would have reported them only from one node (probably the master node).

@DaveCTurner (Contributor) commented:

reported them only from one node (probably the master node).

Please, wherever possible, avoid doing this kind of thing on the master node. It doesn't need to be on the master, and there are normally many better-resourced nodes to choose from.

}

@Override
protected void doClose() throws IOException {}
@piergm (Member Author) commented:

Maybe here we could call close() on all counters and gauges, wdyt?

A Contributor commented:

I was wondering about that too:
[image]

@piergm (Member Author) commented Dec 1, 2023:

Try-with-resources is not usable here, since the AutoCloseable is the metric itself, so we would close the metric right after it has been registered:

try (LongGauge gauge = registry.registerLongGauge(...)) {
    // do nothing
}
// here close() would already have been called, de-registering the metric

Instead, I keep an ArrayList of all the metrics and call close() on all counters and gauges in the doClose() of the AbstractLifecycleComponent.
Since doClose is called only once, it is safe to de-register the metrics there.
Javadoc for the doClose method:

It is called once in the lifetime of a component. If the component was started then it will be stopped before it is closed, and once it is closed it will not be started or stopped.

@JVerwolf (Contributor) left a comment:

Looks great, my only uncertainty is just around closing the resources for the async counters, otherwise G2G. Nice work!

@piergm requested a review from JVerwolf December 1, 2023 14:36
@JVerwolf (Contributor) left a comment:

Nice work! LGTM. I had a question just double-checking that the lifecycle of the metrics registration is the same as the metric registry. If so, I think this is safe to merge.

try {
    metric.close();
} catch (Exception ignore) {
    // metrics close() method does not throw Exception
A Contributor commented:

My gut feeling is that this could be a logger.warn("metrics close() method should not throw Exception", exception) so that we have visibility into unchecked exceptions should they (unexpectedly) occur. However, I think it's fine to do that as a follow-up, and this should not block merging.
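
A rough sketch of what that follow-up could look like; the class and method names here are placeholders, not the merged code:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Sketch of the suggested follow-up: warn instead of silently swallowing the exception.
class MetricsCloser {
    private static final Logger logger = LogManager.getLogger(MetricsCloser.class);

    static void closeQuietly(AutoCloseable metric) {
        try {
            metric.close();
        } catch (Exception e) {
            logger.warn("metrics close() method should not throw Exception", e);
        }
    }
}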

@@ -1074,6 +1077,7 @@ record PluginServiceInstances(
b.bind(SearchPhaseController.class).toInstance(new SearchPhaseController(searchService::aggReduceContextBuilder));
b.bind(Transport.class).toInstance(transport);
b.bind(TransportService.class).toInstance(transportService);
b.bind(NodeMetrics.class).toInstance(nodeMetrics);
A Contributor commented:

I'm not familiar with how Node startup works. Are we sure that this has the same lifecycle as nodeMetrics, i.e. it starts/loads and stops/unloads at the same time? If it doesn't, we might get exceptions for double registration.

@piergm (Member Author) commented:

This should be safe. I split the class initialisation from the metric registration and put the registration in the overridden doStart method, which "is only called once in the lifetime of a component", therefore avoiding double registration. On top of that, doClose "is called once in the lifetime of a component. If the component was started then it will be stopped before it is closed, and once it is closed it will not be started or stopped." (Both quotes are from the AbstractLifecycleComponent Javadocs.)
So we are safe regarding double registration IMHO.
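
As a rough sketch of the lifecycle described above (assuming the usual AbstractLifecycleComponent base class; the class name, field names and registration call are placeholders, not the merged implementation):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.common.component.AbstractLifecycleComponent;

// Register in doStart (called once), de-register in doClose (also called once).
public class NodeMetricsLifecycleSketch extends AbstractLifecycleComponent {

    private final List<AutoCloseable> metrics = new ArrayList<>();

    @Override
    protected void doStart() {
        // register gauges/counters here, e.g. metrics.add(registry.registerLongGauge(...));
    }

    @Override
    protected void doStop() {}

    @Override
    protected void doClose() throws IOException {
        for (AutoCloseable metric : metrics) {
            try {
                metric.close();
            } catch (Exception e) {
                // metric close() is not expected to throw
            }
        }
    }
}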

@piergm (Member Author) commented Dec 4, 2023

@elasticmachine update branch

@piergm (Member Author) commented Dec 5, 2023

@elasticmachine run elasticsearch-ci/docs

@piergm added the auto-merge label (automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) Dec 5, 2023
@elasticsearchmachine merged commit ccf92e4 into elastic:main Dec 5, 2023
15 checks passed
@piergm deleted the indices-metrics branch December 5, 2023 07:43
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Dec 6, 2023
These new metric names can be simpler. This commit removes the redundant
prefix "node.stats".

relates elastic#102248
elasticsearchmachine pushed a commit that referenced this pull request Dec 22, 2023
These new metric names can be simpler. This commit removes the redundant
prefix "node.stats".

relates #102248
jbaiera pushed a commit to jbaiera/elasticsearch that referenced this pull request Jan 10, 2024
These new metric names can be simpler. This commit removes the redundant
prefix "node.stats".

relates elastic#102248