Last x minutes API output mode for counters #77076

PhaedrusTheGreek · 2021-08-31T15:16:26Z

Most API outputs provide statistical counters of activity since cluster start or, in other cases, node start. For example, the indices stats API gives you the following statistics for all indexing activity:

"indexing" : {
  "index_total" : 424578657,
  "index_time" : "23h",
  "index_time_in_millis" : 82813684,
  "index_current" : 1,
  "index_failed" : 3026,
  "delete_total" : 39497,
  "delete_time" : "6s",
  "delete_time_in_millis" : 6096,
  "delete_current" : 0,
  "noop_update_total" : 0,
  "is_throttled" : false,
  "throttle_time" : "0s",
  "throttle_time_in_millis" : 0
},

Similarly, node stats will report things like bulk rejections per node:

"write" : {
  "threads" : 0,
  "queue" : 0,
  "active" : 0,
  "rejected" : 1024,
  "largest" : 0,
  "completed" : 0
}

Currently there are two ways that we can make measurements from these numbers:

1. We can calculate an average based on the cluster's or node's uptime, but the longer the uptime, the more diluted the statistic. For example, the cluster may have experienced a million indexing rejections a year ago but none since, and the average alone would not show whether any rejections have occurred recently.
2. We can call the API twice: the difference between the two counter values, divided by the time between the calls, is the current rate. In Kibana's Monitoring app, we use a derivative aggregation in time series graphs to display this. While the monitoring data is highly accurate, there are two problems: a) it's incomplete for some purposes, e.g., _stats?level=shards, and b) there's too much data to process offline, e.g., to send to Elastic Support for analysis. One potential solution might be something like a diagnostic mode for the monitoring Metricbeat.
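As a rough illustration of the second approach, here is a minimal Python sketch that turns two samples of the counters shown above into deltas and per-second rates. The function name `counter_deltas`, the second sample's values, and the 15-minute interval are made up for the example; it works on already-parsed JSON and does not call Elasticsearch.

```python
def counter_deltas(first, second, names):
    """Return the increase of each named counter between two samples.

    `first` and `second` are dicts parsed from two API responses;
    `names` lists the keys that are monotonically increasing counters.
    """
    deltas = {}
    for name in names:
        before, after = first[name], second[name]
        # A smaller second value means the counter was reset (for example
        # by a node restart); the count since the reset is the best estimate.
        deltas[name] = after - before if after >= before else after
    return deltas


# First sample uses values from the issue; the second sample is invented.
sample_a = {"index_total": 424578657, "index_failed": 3026}
sample_b = {"index_total": 424668657, "index_failed": 3026}

interval_seconds = 15 * 60  # assumed time between the two API calls
deltas = counter_deltas(sample_a, sample_b, ["index_total", "index_failed"])
rates = {name: delta / interval_seconds for name, delta in deltas.items()}
```

With these numbers, `index_total` grew by 90000 over 15 minutes, or 100 operations per second, while `index_failed` did not move. Dividing the raw `index_failed` total by the uptime instead would report a non-zero failure rate even though nothing failed in the interval.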

A third option, which would be conducive to diagnostic collection, would be for all counters in API output to report a last-x-minutes count (1, 5, 15, like "load", for example). Perhaps the API could be called with a query string parameter that indicates counters should be output as last x minutes instead of since start.
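To make the idea concrete, here is a minimal Python sketch of a counter that keeps per-minute buckets next to its running total, so it can answer "how many in the last 1, 5, or 15 minutes". The class `WindowedCounter` and its methods are hypothetical and are not how Elasticsearch implements its counters; timestamps are passed in explicitly and are assumed to be non-decreasing.

```python
from collections import deque


class WindowedCounter:
    """A since-start total plus per-minute buckets for recent activity."""

    def __init__(self, max_minutes=15):
        self.total = 0
        self.max_minutes = max_minutes
        self._buckets = deque()  # (minute_index, count) pairs, oldest first

    def increment(self, now_seconds, amount=1):
        self.total += amount
        minute = int(now_seconds // 60)
        if self._buckets and self._buckets[-1][0] == minute:
            # Same minute as the newest bucket: add to it.
            self._buckets[-1] = (minute, self._buckets[-1][1] + amount)
        else:
            self._buckets.append((minute, amount))
        # Drop buckets that have fallen out of the largest window.
        while self._buckets and self._buckets[0][0] <= minute - self.max_minutes:
            self._buckets.popleft()

    def last(self, minutes, now_seconds):
        """Count over the last `minutes` minute-aligned buckets, including
        the current (partial) minute. Resolution is one minute."""
        current = int(now_seconds // 60)
        return sum(count for m, count in self._buckets if m > current - minutes)


counter = WindowedCounter()
counter.increment(now_seconds=0, amount=1000)     # an old burst
counter.increment(now_seconds=14 * 60, amount=5)
counter.increment(now_seconds=20 * 60, amount=2)
report = {m: counter.last(m, now_seconds=20 * 60) for m in (1, 5, 15)}
```

The memory cost is bounded by `max_minutes` buckets per counter, which is the kind of trade-off a real implementation would need to weigh across every counter in the API output.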

This type of data will allow us to do complex cluster workload analysis, for example, to detect imbalance due to sharding issues or hardware performance issues. The method Support currently uses for this analysis is to have the customer capture two diagnostics separated by 15 minutes, which end-to-end is rather manual.
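One simple form such an analysis could take is sketched below in Python: given each node's indexing delta over the same interval, compute each node's share and flag nodes that carry far more than an even split. The function names, node names, numbers, and the `factor=2.0` threshold are all assumptions chosen for illustration.

```python
def indexing_share(per_node_deltas):
    """Return each node's fraction of the total indexing delta."""
    total = sum(per_node_deltas.values())
    if total == 0:
        return {node: 0.0 for node in per_node_deltas}
    return {node: delta / total for node, delta in per_node_deltas.items()}


def overloaded_nodes(per_node_deltas, factor=2.0):
    """Return nodes whose share exceeds `factor` times an even split."""
    if not per_node_deltas:
        return []
    shares = indexing_share(per_node_deltas)
    fair_share = 1.0 / len(shares)
    return sorted(n for n, s in shares.items() if s > factor * fair_share)


# Invented per-node index_total deltas over the same 15-minute interval.
node_deltas = {"node-1": 90000, "node-2": 5000, "node-3": 5000, "node-4": 0}
hot = overloaded_nodes(node_deltas)
```

Here `node-1` handles 90% of the indexing against a fair share of 25%, so it is the only node flagged.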

elasticmachine · 2021-09-01T15:44:25Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

williamrandolph · 2021-10-06T15:39:49Z

We discussed this in the core/infra team meeting today. We see how this feature would be useful, but if we are moving from a simple implementation of a counter to something more complex, we'd like to make sure we do it right.

For that reason, we are going to put this issue in our backlog for a little bit in the hope that we can gather more feedback and use cases from the community.

Thanks for raising the issue, and please let us know if it becomes urgent.

PhaedrusTheGreek added >enhancement needs:triage Requires assignment of a team area label labels Aug 31, 2021

not-napoleon added the :Core/Infra/REST API REST infrastructure and utilities label Sep 1, 2021

elasticmachine added the Team:Core/Infra Meta label for core/infra team label Sep 1, 2021

not-napoleon removed the needs:triage Requires assignment of a team area label label Sep 1, 2021
