Last x minutes API output mode for counters #77076

Open
PhaedrusTheGreek opened this issue Aug 31, 2021 · 2 comments
Labels: :Core/Infra/REST API, >enhancement, Team:Core/Infra

Comments

@PhaedrusTheGreek
Contributor

Most API outputs provide statistical counters of activity since cluster start or, in other cases, node start. For example, the indices stats API gives you the following statistics for all indexing activity:

"indexing" : {
  "index_total" : 424578657,
  "index_time" : "23h",
  "index_time_in_millis" : 82813684,
  "index_current" : 1,
  "index_failed" : 3026,
  "delete_total" : 39497,
  "delete_time" : "6s",
  "delete_time_in_millis" : 6096,
  "delete_current" : 0,
  "noop_update_total" : 0,
  "is_throttled" : false,
  "throttle_time" : "0s",
  "throttle_time_in_millis" : 0
},

Similarly, the node stats API reports things like bulk rejections per node:

"write" : {
  "threads" : 0,
  "queue" : 0,
  "active" : 0,
  "rejected" : 1024,
  "largest" : 0,
  "completed" : 0
}

Currently there are two ways that we can make measurements from these numbers:

  1. We can calculate an average over the cluster's or node's uptime, but the longer the uptime, the more diluted the statistic. For example, the cluster may have experienced a million indexing rejections a year ago and none since, but the counter alone makes it impossible to determine whether there have been any issues since.
  2. We can call the API twice: the difference between the two counters, divided by the time between the calls, is the current rate. In Kibana's Monitoring app, we use a derivative aggregation in time series graphs to display this. While the monitoring data is highly accurate, there are two problems: a) it's incomplete for some purposes, e.g., _stats?level=shards, and b) there's too much data to process offline, e.g., to send to Elastic Support for analysis. One potential solution might be something like a diagnostic mode for the monitoring Metricbeat.
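To make the second approach concrete, here is a minimal sketch of turning two counter snapshots into deltas and per-second rates. The function names `counter_deltas` and `per_second` are hypothetical, the first sample reuses values from the "indexing" example above, and the second sample and the 900-second interval are made up for illustration.

```python
def counter_deltas(first, second, keys):
    """Return second[key] - first[key] for each named counter.

    A negative difference is reported as None: the counter probably reset
    (for example after a node restart), so the true delta is unknown.
    """
    deltas = {}
    for key in keys:
        delta = second[key] - first[key]
        deltas[key] = delta if delta >= 0 else None
    return deltas


def per_second(deltas, interval_seconds):
    """Convert deltas into per-second rates, passing None through."""
    return {
        key: None if delta is None else delta / interval_seconds
        for key, delta in deltas.items()
    }


# First sample: values taken from the "indexing" example above.
first = {"index_total": 424578657, "index_failed": 3026}
# Second sample: hypothetical values, assumed to be captured 900 seconds later.
second = {"index_total": 424587657, "index_failed": 3026}

deltas = counter_deltas(first, second, ["index_total", "index_failed"])
rates = per_second(deltas, interval_seconds=900)
print(deltas)
print(rates)
```

Only cumulative counters should be differenced this way. Gauges such as index_current or queue describe the present moment, so subtracting two samples of them is not meaningful.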

A third option, which would be conducive to diagnostic collection, would be for all counters in API output to report a "last x minutes" count (e.g., 1, 5, and 15 minutes, similar to "load"). Perhaps the API could be called with a query string parameter indicating that counters should be output for the last x minutes instead of since startup.
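As a rough illustration of what a "last x minutes" counter could look like, here is a minimal sketch that keeps the lifetime total alongside per-minute buckets. This is not how Elasticsearch implements its counters; the class name `WindowedCounter` and its methods are hypothetical. The sketch assumes increments arrive in non-decreasing time order and that windows are measured in whole-minute buckets, so "last 1 minute" means the current calendar minute.

```python
from collections import deque


class WindowedCounter:
    """Lifetime total plus per-minute buckets for the last `max_minutes`."""

    def __init__(self, max_minutes=15):
        self.max_minutes = max_minutes
        self.total = 0
        self._buckets = deque()  # (minute_index, count) pairs, oldest first

    def increment(self, now_seconds, amount=1):
        minute = int(now_seconds // 60)
        self.total += amount
        if self._buckets and self._buckets[-1][0] == minute:
            # Same minute as the newest bucket: add to it.
            self._buckets[-1] = (minute, self._buckets[-1][1] + amount)
        else:
            self._buckets.append((minute, amount))
        # Drop buckets that can no longer fall inside the largest window.
        while self._buckets[0][0] <= minute - self.max_minutes:
            self._buckets.popleft()

    def last(self, minutes, now_seconds):
        """Count of events in the most recent `minutes` one-minute buckets."""
        if minutes > self.max_minutes:
            raise ValueError("window exceeds retained history")
        minute = int(now_seconds // 60)
        return sum(count for m, count in self._buckets if m > minute - minutes)


# Hypothetical activity: timestamps are seconds since an arbitrary start.
rejected = WindowedCounter(max_minutes=15)
rejected.increment(now_seconds=0, amount=3)
rejected.increment(now_seconds=600, amount=2)
rejected.increment(now_seconds=870, amount=1)

now = 899
print(rejected.total, rejected.last(1, now), rejected.last(5, now), rejected.last(15, now))
```

The trade-off is a small, bounded amount of extra state per counter (at most one bucket per retained minute) in exchange for being able to answer recent-activity questions with a single API call.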

This type of data would allow us to do complex cluster workload analysis, for example, to detect imbalance due to sharding issues or hardware performance issues. The method Support currently uses for this analysis is to have the customer capture two diagnostics separated by 15 minutes, which is rather manual end to end.

PhaedrusTheGreek added the >enhancement and needs:triage labels on Aug 31, 2021
not-napoleon added the :Core/Infra/REST API label on Sep 1, 2021
elasticmachine added the Team:Core/Infra label on Sep 1, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

not-napoleon removed the needs:triage label on Sep 1, 2021
@williamrandolph
Contributor

We discussed this in the core/infra team meeting today. We see how this feature would be useful, but if we are moving from a simple implementation of a counter to something more complex, we'd like to make sure we do it right.

For that reason, we are going to put this issue in our backlog for a little bit in the hope that we can gather more feedback and use cases from the community.

Thanks for raising the issue, and please let us know if it becomes urgent.
