[Stack Monitoring] Create a Stack Monitoring health endpoint #127235

matschaffer · 2022-03-09T01:03:44Z

Latest status

The PR is merged and the API will be available starting with 8.3.1. The initial implementation covers only a subset of the questions. Here are some follow-ups:

Which indices are present and if they have valid mappings (scout queries) - assuming we have permission to fetch indices and mappings, we still need a way to tell whether a mapping is valid. One definition could be that the mapping has the fields the queries need for their aggregations and filtering. We could potentially build that information with the monitoring.ui.debug_mode setting.
Some sort of rule execution stats (what stack monitoring rules are configured, last successes, failures, failure info if available, duration of recent executions) - can we get this information from the monitoring-alerts-* indices?
[Stack Monitoring] Query metricbeat errors in the Health api #135692

Next steps:

[Stack Monitoring] Onboard health API in the support diagnostics tools #135686

What is it?

An endpoint that makes a handful of predetermined queries to determine the health/status (using "health" since task manager already does) of stack monitoring for the configured Kibana.

Proposed url: /api/monitoring/_health

Exactly what is returned can be evolved over time, but for starters:

What cluster uuids (or '' for standalone) are present, what stack document types are available for each cluster uuid?
What monitoring modes are being used for each cluster/component/node tuple?
When is the last time we saw each given document type for each cluster/component/node/delivery tuple?
Which indices are present and if they have valid mappings (scout queries)
Any specifics about monitoring configuration: CCS, filebeat or metricbeat configured indices, monitoring-only es configuration present?
Some sort of rule execution stats (what stack monitoring rules are configured, last successes, failures, failure info if available, duration of recent executions)
Any error information we can gather (guessing not much unless kibana logs from the monitoring cluster are available in filebeat-*, but maybe we have some in-memory telemetry?)
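To make the "last time we saw each document type" question concrete, here is a minimal sketch of the reduction it implies. The `MonitoringDoc` shape and the `lastSeenByTuple` helper are hypothetical names chosen for illustration, not taken from the Kibana codebase, and a real implementation would more likely push this work into an Elasticsearch aggregation than do it in memory.

```typescript
// Hypothetical document shape: field names are illustrative only.
interface MonitoringDoc {
  clusterUuid: string; // '' for standalone components
  component: string; // e.g. "elasticsearch" or "kibana"
  nodeId: string;
  delivery: string; // the monitoring mode that shipped the document
  docType: string;
  timestamp: string; // ISO 8601
}

// Keep the most recent timestamp for each
// cluster/component/node/delivery/docType tuple.
function lastSeenByTuple(docs: MonitoringDoc[]): Map<string, string> {
  const latest = new Map<string, string>();
  for (const doc of docs) {
    // JSON-encode the tuple so separators inside values cannot collide.
    const key = JSON.stringify([
      doc.clusterUuid,
      doc.component,
      doc.nodeId,
      doc.delivery,
      doc.docType,
    ]);
    const previous = latest.get(key);
    if (previous === undefined || Date.parse(doc.timestamp) > Date.parse(previous)) {
      latest.set(key, doc.timestamp);
    }
  }
  return latest;
}
```

A tuple whose latest timestamp is much older than its peers is the kind of signal the "well known" queries are usually run by hand to find.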

Why do we need it?

Today when someone has a problem with stack monitoring, the first step is usually having them run a handful of "well known" queries to gather information like the above.

If we put all that into an endpoint it does two things:

It's a much simpler ask (curl endpoint, not error-prone copy paste operations)
We can add it to our (internal) support diagnostics tool
Once it's part of the diag tool it should become available in ESS and ECE via the "Diagnostic bundle" admin UI feature
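As a rough illustration of how simple the ask becomes, the sketch below requests the proposed URL and returns the parsed body. It assumes the path proposed in this issue (the merged implementation may use a different one), and `fetchMonitoringHealth`, `Fetcher`, and `HttpResponse` are hypothetical names. The HTTP client is passed in so that the sketch does not depend on any particular runtime.

```typescript
// Minimal response and client shapes, kept local so the sketch is self-contained.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

type Fetcher = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<HttpResponse>;

// Request the health endpoint proposed in this issue and return the parsed JSON.
async function fetchMonitoringHealth(
  kibanaUrl: string,
  authorization: string,
  fetcher: Fetcher
): Promise<unknown> {
  // Strip trailing slashes so a configured base path is preserved as-is.
  const url = `${kibanaUrl.replace(/\/+$/, "")}/api/monitoring/_health`;
  const response = await fetcher(url, { headers: { Authorization: authorization } });
  if (!response.ok) {
    throw new Error(`health request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

On a runtime that provides a global `fetch`, that function can be passed directly as the `fetcher` argument.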

Similar endpoints

/api/task_manager/_health - https://github.com/elastic/kibana/blob/main/x-pack%2Fplugins%2Ftask_manager%2Fserver%2Froutes%2Fhealth.ts


elasticmachine · 2022-03-09T01:03:46Z

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

jasonrhodes · 2022-04-18T18:31:35Z

I think this would be incredibly useful and that we should bump it up to the top of our tech debt plan. Thanks for laying out the baseline requirements, @matschaffer!

neptunian · 2022-04-21T14:02:02Z

I think getCollectionStatus (/api/monitoring/v1/setup/collection/cluster/{clusterUuid?}) answers these questions or could be used:

what stack document types are available for each cluster uuid
What monitoring modes are being used for each cluster/component/node tuple
When is the last time we saw each given document type for each cluster/component/node/delivery tuple

I'm not sure whether getCollectionStatus keeps track of cluster ids, so we could use something like getClustersFromRequest (/api/monitoring/v1/clusters), which returns an array of clusters along with some cluster stats.

config can be accessed from the server

Any specifics about monitoring configuration: CCS, filebeat or metricbeat configured indices, monitoring-only es configuration present?

neptunian · 2022-04-21T14:42:46Z

@miltonhultgren and I discussed focusing first on the broadest, most common use cases, which are the first three items:

What cluster uuids (or '' for standalone) are present, what stack document types are available for each cluster uuid?
What monitoring modes are being used for each cluster/component/node tuple?
When is the last time we saw each given document type for each cluster/component/node/delivery tuple?

miltonhultgren · 2022-04-27T13:24:18Z

We should look into if we can backport this endpoint and how far back. 7.17 would be ideal but can it use the same code for that version?

The endpoint should wrap all requests it does in graceful timeouts, but these should be configurable. Short by default, then we can make a second call with a longer timeout to see if it's broken or just slow.

matschaffer · 2022-04-28T01:36:14Z

We should look into if we can backport this endpoint and how far back. 7.17 would be ideal but can it use the same code for that version?

@miltonhultgren I don't suspect this would clear the criteria for a 7.17 backport, since it's essentially a new feature.

The endpoint should wrap all requests it does in graceful timeouts, but these should be configurable. Short by default, then we can make a second call with a longer timeout to see if it's broken or just slow.

I think if the whole call exceeds 30s we'll probably start bumping into downstream timeouts (http proxies, etc), so we should probably keep retrieval timeouts pretty short (maybe 10s or less?) in our API and respond with errors for whatever answer we weren't able to gather.
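The timeout behaviour discussed here could look roughly like the sketch below: each query runs under its own short timeout, and anything that fails or runs long is reported as an error next to the answers that did arrive. `withTimeout`, `gatherHealth`, and the `Answer` type are hypothetical names, and the 10s default is just the figure floated above. Note that this only stops waiting for a slow query; it does not cancel the underlying request.

```typescript
// Each question gets either a value or an error message, never both.
type Answer = { ok: true; value: unknown } | { ok: false; error: string };

// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Run all queries concurrently, each under its own timeout, and collect
// per-query results so one slow query cannot fail the whole response.
async function gatherHealth(
  queries: Record<string, () => Promise<unknown>>,
  timeoutMs: number = 10000
): Promise<Record<string, Answer>> {
  const names = Object.keys(queries);
  const answers = await Promise.all(
    names.map(async (name): Promise<Answer> => {
      try {
        return { ok: true, value: await withTimeout(queries[name](), timeoutMs) };
      } catch (err) {
        return { ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    })
  );
  const result: Record<string, Answer> = {};
  names.forEach((name, index) => {
    result[name] = answers[index];
  });
  return result;
}
```

Because the queries run concurrently, the whole call takes about as long as the slowest query or the timeout, whichever is shorter, which keeps it well under the 30s mark mentioned above.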

miltonhultgren · 2022-04-28T07:40:01Z

@matschaffer The 7.17 backport was a question/request from Andres, since we still need to support that version for a while. But I understand the concern that people are raising around that.

neptunian · 2022-05-09T13:15:33Z

I started on this a bit during spacetime, though I spent most of that week out sick. I'd like to continue with it, but won't be able to get around to it for possibly a few weeks due to training/SDH/PTO, so I'm leaving myself unassigned.

matschaffer · 2022-05-10T00:30:17Z

🤗 for @neptunian

@miltonhultgren if needed, I could see publishing a version of the code to npx for use against 7.x - probably best to focus main for now though.

I'm usually pleasantly surprised at how quickly new releases get adopted in the field. So I think the ROI on this work will be good even before we have a 7.x plan.

klacabane · 2022-05-19T10:17:10Z

A raw implementation of the health endpoint is in #132705. It only exposes the current shape of the monitoring state in a (hopefully) easy-to-parse format, which helps answer the first 3 bullet points.
I'll throw in some useful configuration settings and leave the other questions as separate tasks.

klacabane · 2022-06-07T13:03:44Z

The PR is merged and the API will be available starting with 8.3.1. The initial implementation covers only a subset of the questions in the summary. Here are some follow-ups:

Which indices are present and if they have valid mappings (scout queries) - assuming we have permission to fetch indices and mappings, we still need a way to tell whether a mapping is valid. One definition could be that the mapping has the fields the queries need for their aggregations and filtering. We could potentially build that information with the monitoring.ui.debug_mode setting.
Some sort of rule execution stats (what stack monitoring rules are configured, last successes, failures, failure info if available, duration of recent executions) - can we get this information from the monitoring-alerts-* indices?
Any error information we can gather - we can query metricbeat errors from the metricbeat-* indices (relevant comment)
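One way to read the "valid mapping" definition above is as a check that every field path the queries rely on exists in the index mapping. The sketch below assumes the mapping is the `mappings` object Elasticsearch returns for a single index (nested `properties`); `missingFields` and the field paths in the usage note are hypothetical, and the real list of required fields would have to come from the queries themselves.

```typescript
// Minimal view of an Elasticsearch mapping node: a field may have a type,
// nested properties, or both.
interface MappingNode {
  type?: string;
  properties?: Record<string, MappingNode>;
}

// Return the dotted field paths that are absent from the mapping.
function missingFields(mapping: MappingNode, requiredPaths: string[]): string[] {
  return requiredPaths.filter((path) => {
    let node: MappingNode | undefined = mapping;
    for (const segment of path.split(".")) {
      node = node?.properties?.[segment];
      if (node === undefined) {
        return true; // the path stops here, so the field is missing
      }
    }
    return false;
  });
}
```

For example, calling `missingFields(mapping, ["cluster_uuid", "timestamp"])` against an index that only maps `timestamp` would return `["cluster_uuid"]`, which the health response could report per index. This checks presence only; it says nothing about whether a field has the right type.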

Next steps:

Onboard the API in the support diagnostics tools. Can we already onboard it by specifying a future version, or should we wait for the 8.3.1 version to exist?

matschaffer · 2022-06-08T02:40:40Z

@klacabane I think we can add it the same way as https://github.com/elastic/support-diagnostics/blob/67661bdce86947c3959cb30737bc931724588b51/src/main/resources/kibana-rest.yml?rgh-link-date=2022-03-09T01%3A03%3A44Z#L59 with >=8.3.1

I'd guess @pickypg can confirm easily enough if that's a safe thing to do (add diag endpoints prior to release).

That should probably be an isolated issue though.

pickypg · 2022-06-08T04:12:13Z

Yep, that would be safe to do assuming it’s merged into 8.3.1+.

miltonhultgren · 2022-06-08T08:25:23Z

@klacabane Can we also update this issue to reflect what work remains? Or make a new issue for that.

matschaffer added the Team:Infra Monitoring UI - DEPRECATED and Feature:Stack Monitoring labels Mar 9, 2022

matschaffer changed the title ~~[Stack Monitoring] Create an /api/monitoring/v1/_health endpoint~~ [Stack Monitoring] Create an /api/monitoring/_health endpoint Mar 9, 2022

matschaffer changed the title ~~[Stack Monitoring] Create an /api/monitoring/_health endpoint~~ [Stack Monitoring] Create a Stack Monitoring heatlh endpoint Mar 9, 2022

matschaffer changed the title ~~[Stack Monitoring] Create a Stack Monitoring heatlh endpoint~~ [Stack Monitoring] Create a Stack Monitoring health endpoint Mar 9, 2022

matschaffer mentioned this issue Apr 19, 2022

Stack Monitoring Tech Debt Plan #127224

Closed

39 tasks

miltonhultgren self-assigned this Apr 21, 2022

miltonhultgren removed their assignment Apr 28, 2022

klacabane self-assigned this May 11, 2022

klacabane mentioned this issue May 17, 2022

Stack monitoring health API #132354

Closed

klacabane mentioned this issue May 23, 2022

Stack monitoring health API #132705

Merged

klacabane removed their assignment Jul 18, 2023

smith added the Team:Monitoring label and removed the Team:Infra Monitoring UI - DEPRECATED label Nov 13, 2023
