[Stack Monitoring] Create a Stack Monitoring health endpoint #127235

Open
2 of 4 tasks
Tracked by #127224
matschaffer opened this issue Mar 9, 2022 · 14 comments

@matschaffer
Contributor

matschaffer commented Mar 9, 2022

Latest status

Merged the PR and the API will be available starting 8.3.1. The initial implementation only covers a subset of the questions, here are some follow ups:

  • Which indices are present and if they have valid mappings (scout queries) - granted we have permissions to fetch indices and mappings, we need a way to tell whether a mapping is valid. One definition could be that it has the appropriate fields set for the queries to do its aggregation and filtering, and we could potentially build that information with the monitoring.ui.debug_mode setting
  • Some sort of rule execution stats (what stack monitoring rules are configured, last successes, failures, failure info if available, duration of recent executions) - can we get this information from the monitoring-alerts-* indices?
  • [Stack Monitoring] Query metricbeat errors in the Health api #135692

Next steps:


What is it?

An endpoint that runs a handful of predetermined queries to determine the health/status of Stack Monitoring for the configured Kibana (using "health" since Task Manager already does).

Proposed url: /api/monitoring/_health

Exactly what is returned can be evolved over time, but for starters:

  • What cluster uuids (or '' for standalone) are present, what stack document types are available for each cluster uuid?
  • What monitoring modes are being used for each cluster/component/node tuple?
  • When is the last time we saw each given document type for each cluster/component/node/delivery tuple?
  • Which indices are present and if they have valid mappings (scout queries)
  • Any specifics about monitoring configuration: CCS, filebeat or metricbeat configured indices, monitoring-only ES configuration present?
  • Some sort of rule execution stats (what stack monitoring rules are configured, last successes, failures, failure info if available, duration of recent executions)
  • Any error information we can gather (guessing not much unless kibana logs from the monitoring cluster are available in filebeat-*, but maybe we have some in-memory telemetry?)
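To make the first three bullets concrete, here is a minimal sketch of how "last seen" information could be grouped per cluster/component/node/delivery tuple. Every name in it (`MonitoringDoc`, `summarizeLastSeen`, the field names and the mode strings) is hypothetical and chosen for illustration; it is not the shape returned by the actual endpoint.

```typescript
// Hypothetical input shape: one entry per monitoring document found.
interface MonitoringDoc {
  clusterUuid: string; // '' for standalone
  product: string; // e.g. 'elasticsearch', 'kibana', 'beats'
  nodeId: string;
  mode: string; // delivery mode, e.g. 'internal' or 'metricbeat' (assumed labels)
  docType: string; // e.g. 'node_stats'
  timestamp: string; // ISO 8601
}

// Keep only the most recent document for each
// cluster/product/node/mode/docType tuple.
function summarizeLastSeen(docs: MonitoringDoc[]): MonitoringDoc[] {
  const latest: Record<string, MonitoringDoc> = {};
  for (const doc of docs) {
    const key = [doc.clusterUuid, doc.product, doc.nodeId, doc.mode, doc.docType].join('|');
    const seen = latest[key];
    if (!seen || Date.parse(doc.timestamp) > Date.parse(seen.timestamp)) {
      latest[key] = doc;
    }
  }
  return Object.keys(latest).map((key) => latest[key]);
}
```

A flat list of tuples keeps the sketch short; a real response would more likely nest the same data by cluster and product so it is easier to read in a diagnostic bundle.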

Why do we need it?

Today when someone has a problem with stack monitoring, the first step is usually having them run a handful of "well known" queries to gather information like the above.

If we put all that into an endpoint, it helps in a few ways:

  • It's a much simpler ask (curl endpoint, not error-prone copy paste operations)
  • We can add it to our (internal) support diagnostics tool
  • Once it's part of the diag tool it should become available in ESS and ECE via the "Diagnostic bundle" admin UI feature

Similar endpoints

@matschaffer matschaffer added the "Team:Infra Monitoring UI - DEPRECATED" and "Feature:Stack Monitoring" labels Mar 9, 2022
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@matschaffer matschaffer changed the title [Stack Monitoring] Create an /api/monitoring/v1/_health endpoint [Stack Monitoring] Create an /api/monitoring/_health endpoint Mar 9, 2022
@matschaffer matschaffer changed the title [Stack Monitoring] Create an /api/monitoring/_health endpoint [Stack Monitoring] Create a Stack Monitoring heatlh endpoint Mar 9, 2022
@matschaffer matschaffer changed the title [Stack Monitoring] Create a Stack Monitoring heatlh endpoint [Stack Monitoring] Create a Stack Monitoring health endpoint Mar 9, 2022
@jasonrhodes
Member

I think this would be incredibly useful and that we should bump it up to the top of our tech debt plan. Thanks for laying out the baseline requirements, @matschaffer !

@neptunian
Contributor

neptunian commented Apr 21, 2022

I think getCollectionStatus (/api/monitoring/v1/setup/collection/cluster/{clusterUuid?}) answers these questions or could be used:

  • what stack document types are available for each cluster uuid
  • What monitoring modes are being used for each cluster/component/node tuple
  • When is the last time we saw each given document type for each cluster/component/node/delivery tuple

We can use something like getClustersFromRequest (/api/monitoring/v1/clusters), which returns an array of clusters along with some cluster stats, as I'm not sure whether getCollectionStatus keeps track of cluster ids.

config can be accessed from the server

  • Any specifics about monitoring configuration: css, filebeat or metricbeat configured indices, monitoring-only es configuration present?

@neptunian
Contributor

neptunian commented Apr 21, 2022

@miltonhultgren and I discussed focusing first on the items that cover the broadest, most common use cases, which are the first three:

  • What cluster uuids (or '' for standalone) are present, what stack document types are available for each cluster uuid?
  • What monitoring modes are being used for each cluster/component/node tuple?
  • When is the last time we saw each given document type for each cluster/component/node/delivery tuple?

@miltonhultgren
Contributor

We should look into if we can backport this endpoint and how far back. 7.17 would be ideal but can it use the same code for that version?

The endpoint should wrap all requests it does in graceful timeouts, but these should be configurable. Short by default, then we can make a second call with a longer timeout to see if it's broken or just slow.

@matschaffer
Contributor Author

> We should look into if we can backport this endpoint and how far back. 7.17 would be ideal but can it use the same code for that version?

@miltonhultgren I don't suspect this would clear the criteria for a 7.17 back port, since it's essentially a new feature.

> The endpoint should wrap all requests it does in graceful timeouts, but these should be configurable. Short by default, then we can make a second call with a longer timeout to see if it's broken or just slow.

I think if the whole call exceeds 30s we'll probably start bumping into downstream timeouts (http proxies, etc), so we should probably keep retrieval timeouts pretty short (maybe 10s or less?) in our API and respond with errors for whatever answer we weren't able to gather.
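A minimal sketch of that idea, assuming each piece of health data comes from its own async function: every fetcher gets a short, configurable timeout, and a section that fails or times out is reported as an error instead of failing the whole response. `withTimeout`, `gatherHealth` and the `Section` type are hypothetical names, not part of the Kibana codebase.

```typescript
// Each section of the report either succeeded or carries an error message.
type Section<T> = { ok: true; value: T } | { ok: false; error: string };

// Reject if `work` has not settled within `ms` milliseconds.
// Note: this does not cancel the underlying work, it only stops waiting.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Run all fetchers in parallel; a slow or failing fetcher only
// affects its own section of the report.
async function gatherHealth(
  fetchers: Record<string, () => Promise<unknown>>,
  timeoutMs: number = 10000 // assumed default, per the "10s or less" suggestion
): Promise<Record<string, Section<unknown>>> {
  const names = Object.keys(fetchers);
  const results = await Promise.all(
    names.map((name) =>
      withTimeout(fetchers[name](), timeoutMs).then(
        (value): Section<unknown> => ({ ok: true, value }),
        (err): Section<unknown> => ({
          ok: false,
          error: err instanceof Error ? err.message : String(err),
        })
      )
    )
  );
  const report: Record<string, Section<unknown>> = {};
  names.forEach((name, i) => {
    report[name] = results[i];
  });
  return report;
}
```

Because the fetchers run in parallel, the total response time stays close to `timeoutMs` rather than growing with the number of sections, which helps stay under the 30s budget mentioned above. A caller could pass a larger `timeoutMs` on a second request to distinguish "slow" from "broken".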

@miltonhultgren miltonhultgren removed their assignment Apr 28, 2022
@miltonhultgren
Contributor

@matschaffer The 7.17 backport was a question/request from Andres, since we still need to support that version for a while. But I understand the concern that people are raising around that.

@neptunian
Contributor

neptunian commented May 9, 2022

I've started on this a bit during spacetime where I spent most of the week out sick. I'd like to continue with it, but won't be able to get around to it for possibly a few weeks due to training/SDH/PTO so leaving myself unassigned.

@matschaffer
Contributor Author

🤗 for @neptunian

@miltonhultgren if needed, I could see publishing a version of the code to npx for use against 7.x - probably best to focus main for now though.

I'm usually pleasantly surprised at how quickly new releases get adopted in the field. So I think the ROI on this work will be good even before we have a 7.x plan.

@klacabane
Contributor

klacabane commented May 19, 2022

Raw implementation of the health endpoint here #132705. It only exposes the current shape of the monitoring state in a (hopefully) easy to parse format, which helps answer the first 3 bullet points.
I'll throw in some useful configuration settings and leave the other questions as separate tasks.

@klacabane
Contributor

Merged the PR and the API will be available starting 8.3.1. The initial implementation only covers a subset of the questions in the summary, here are some follow ups:

  • Which indices are present and if they have valid mappings (scout queries) - granted we have permissions to fetch indices and mappings, we need a way to tell whether a mapping is valid. One definition could be that it has the appropriate fields set for the queries to do its aggregation and filtering, and we could potentially build that information with the monitoring.ui.debug_mode setting
  • Some sort of rule execution stats (what stack monitoring rules are configured, last successes, failures, failure info if available, duration of recent executions) - can we get this information from the monitoring-alerts-* indices?
  • Any error information we can gather - we can query metricbeat errors from the metricbeat-* indices (relevant comment)
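For the mapping question, one possible reading of "valid" is sketched below: a mapping passes if every field the queries filter or aggregate on exists with the expected type. The `MappingNode` type, the `validateMapping` helper and the list of required fields are assumptions made for illustration; the thread does not specify which fields the real queries need, and multi-fields are not handled here.

```typescript
// Simplified view of an Elasticsearch mapping: nested objects carry
// `properties`, leaf fields carry `type`.
interface MappingNode {
  type?: string;
  properties?: Record<string, MappingNode>;
}

// Walk a dotted path such as 'node_stats.node_id' through the mapping.
function findField(root: MappingNode, path: string): MappingNode | undefined {
  let node: MappingNode | undefined = root;
  for (const part of path.split('.')) {
    node = node?.properties?.[part];
    if (!node) return undefined;
  }
  return node;
}

// Return a list of problems; an empty list means the mapping has every
// required field with the expected type.
function validateMapping(mapping: MappingNode, required: Record<string, string>): string[] {
  const problems: string[] = [];
  for (const path of Object.keys(required)) {
    const field = findField(mapping, path);
    if (!field) {
      problems.push(`${path}: missing`);
    } else if (field.type !== required[path]) {
      problems.push(`${path}: expected ${required[path]}, found ${field.type ?? 'object'}`);
    }
  }
  return problems;
}
```

Returning the list of problems rather than a boolean would let the health response say why a mapping is considered invalid, which is the information a support engineer would otherwise have to dig out by hand.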

Next steps:

  • Onboard the API in the support diagnostics tool. Can we already onboard it by specifying a future version, or should we wait for the 8.3.1 version to exist?

@matschaffer
Contributor Author

@klacabane I think we can add it the same as https://github.com/elastic/support-diagnostics/blob/67661bdce86947c3959cb30737bc931724588b51/src/main/resources/kibana-rest.yml?rgh-link-date=2022-03-09T01%3A03%3A44Z#L59 with >=8.3.1

I'd guess @pickypg can confirm easily enough if that's a safe thing to do (add diag endpoints prior to release).

That should probably be an isolated issue though.

@pickypg
Member

pickypg commented Jun 8, 2022

Yep, that would be safe to do assuming it’s merged into 8.3.1+.

@miltonhultgren
Contributor

@klacabane Can we also update this issue to reflect what work remains? Or make a new issue for that.

@klacabane klacabane removed their assignment Jul 18, 2023
@smith smith added the "Team:Monitoring" label and removed the "Team:Infra Monitoring UI - DEPRECATED" label Nov 13, 2023