[Stack Monitoring] Create a Stack Monitoring health endpoint #127235
Comments
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI) |
I think this would be incredibly useful and that we should bump it up to the top of our tech debt plan. Thanks for laying out the baseline requirements, @matschaffer ! |
I think getCollectionStatus (
We can use something like getClustersFromRequest (
|
@miltonhultgren and I discussed focusing first on the broader, most common use-case items, which are the first three.
|
We should look into whether we can backport this endpoint, and how far back. 7.17 would be ideal, but can it use the same code for that version? The endpoint should wrap every request it makes in a graceful timeout, and those timeouts should be configurable. Keep them short by default; we can then make a second call with a longer timeout to see whether it's broken or just slow. |
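A minimal sketch of the "short timeout first, longer timeout second" idea, written as a standalone TypeScript helper rather than actual Kibana code. The names `withTimeout` and `probe`, the result labels, and the default durations are all assumptions made for illustration; none of them come from the merged implementation.

```typescript
// Hypothetical helper: reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// "ok": answered within the short timeout.
// "slow": failed or timed out on the first try, but succeeded on the longer retry.
// "broken": failed or timed out on both tries.
type ProbeResult = "ok" | "slow" | "broken";

// Both timeouts are parameters so they can be wired to configuration.
async function probe(
  check: () => Promise<unknown>,
  shortMs: number = 1000,
  longMs: number = 10000
): Promise<ProbeResult> {
  try {
    await withTimeout(check(), shortMs);
    return "ok";
  } catch (firstError) {
    try {
      await withTimeout(check(), longMs);
      return "slow";
    } catch (secondError) {
      return "broken";
    }
  }
}
```

Note that a check which fails fast once and then succeeds is also reported as "slow" here; a real implementation would probably want to distinguish timeouts from other errors.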
@miltonhultgren I doubt this would clear the criteria for a 7.17 backport, since it's essentially a new feature.
I think if the whole call exceeds 30s we'll probably start bumping into downstream timeouts (HTTP proxies, etc.), so we should probably keep retrieval timeouts pretty short (maybe 10s or less?) in our API and respond with errors for whatever answers we weren't able to gather. |
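One way to picture "short retrieval timeouts, with errors for whatever couldn't be gathered" is the sketch below. It assumes each piece of health information is fetched by an independent async function; `gatherHealth`, `withTimeout`, the `Section` type, and the 10s default are hypothetical names and values chosen for illustration, not the shape of the real endpoint.

```typescript
// Hypothetical helper: reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Each section of the report carries either its data or the reason it is missing.
type Section = { data: unknown } | { error: string };

// Run every fetcher in parallel under the same per-fetcher timeout.
// A slow or failing fetcher only affects its own section of the report.
async function gatherHealth(
  fetchers: Record<string, () => Promise<unknown>>,
  timeoutMs: number = 10000
): Promise<Record<string, Section>> {
  const entries = await Promise.all(
    Object.keys(fetchers).map(async (name): Promise<[string, Section]> => {
      try {
        return [name, { data: await withTimeout(fetchers[name](), timeoutMs) }];
      } catch (err) {
        return [name, { error: err instanceof Error ? err.message : String(err) }];
      }
    })
  );
  const report: Record<string, Section> = {};
  for (const [name, section] of entries) {
    report[name] = section;
  }
  return report;
}
```

Because the fetchers run in parallel, the whole call takes roughly one `timeoutMs` in the worst case, which keeps it well under the 30s mark mentioned above.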
@matschaffer The 7.17 backport was a question/request from Andres, since we still need to support that version for a while. But I understand the concern that people are raising around that. |
I've started on this a bit during spacetime where I spent most of the week out sick. I'd like to continue with it, but won't be able to get around to it for possibly a few weeks due to training/SDH/PTO so leaving myself unassigned. |
🤗 for @neptunian @miltonhultgren if needed, I could see publishing a version of the code to npx for use against 7.x - probably best to focus main for now though. I'm usually pleasantly surprised at how quickly new releases get adopted in the field. So I think the ROI on this work will be good even before we have a 7.x plan. |
Raw implementation of the health endpoint here: #132705. It only exposes the current shape of the monitoring state in a (hopefully) easy-to-parse format, which helps answer the first 3 bullet points. |
Merged the PR and the API will be available starting 8.3.1. The initial implementation only covers a subset of the questions in the summary, here are some follow ups:
Next steps:
|
@klacabane I think we can add it the same as https://github.com/elastic/support-diagnostics/blob/67661bdce86947c3959cb30737bc931724588b51/src/main/resources/kibana-rest.yml?rgh-link-date=2022-03-09T01%3A03%3A44Z#L59 with I'd guess @pickypg can confirm easily enough if that's a safe thing to do (add diag endpoints prior to release). That should probably be an isolated issue though. |
Yep, that would be safe to do assuming it’s merged into 8.3.1+. |
@klacabane Can we also update this issue to reflect what work remains? Or make a new issue for that. |
Latest status
Merged the PR and the API will be available starting 8.3.1. The initial implementation only covers a subset of the questions, here are some follow ups:
Next steps:
What is it?
An endpoint that makes a handful of pre-determined queries to determine the health/status (using "health" since task manager already does) of stack monitoring for the configured Kibana.
Proposed url:
/api/monitoring/_health
Exactly what is returned can be evolved over time, but for starters:
'' for standalone) are present, what stack document types are available for each cluster uuid?
Why do we need it?
Today when someone has a problem with stack monitoring, the first step is usually having them run a handful of "well known" queries to gather information like the above.
If we put all that into an endpoint it does two things:
Similar endpoints
/api/task_manager/_health
- https://github.com/elastic/kibana/blob/main/x-pack%2Fplugins%2Ftask_manager%2Fserver%2Froutes%2Fhealth.ts