[metricbeat] Stack monitoring modules may ignore xpack configuration on error #30809

Closed
klacabane opened this issue Mar 14, 2022 · 4 comments
Labels
bug Feature:Stack Monitoring Stalled Team:Infra Monitoring UI - DEPRECATED Infrastructure Monitoring UI team - DEPRECATED - Use Team:Monitoring v7.17.0 v8.3.0

Comments

@klacabane
Contributor

klacabane commented Mar 14, 2022

Summary

Stack monitoring related modules have a specific configuration (xpack.enabled: true) that allows them to write events to .monitoring-{module}-* indices instead of the usual metricbeat-*.

In these modules, failing to generate metricsets in some code paths sends an event to the metricbeat-* index regardless of the xpack.enabled configuration. This is consistent with the way metricbeat reports errors in all its modules, but stack monitoring modules are different beasts since they allow overriding the destination index for regular events. Given that, shipping regular events and error events to different indices can be counterintuitive UX and could hinder discoverability of the errors from a user perspective.
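To make the mismatch concrete, here is a minimal Go sketch of the routing described above. It is not Metricbeat code: `eventKind` and `destinationIndex` are hypothetical names, and the returned strings are the index patterns mentioned in this issue, not actual write indices.

```go
package main

import "fmt"

// eventKind distinguishes regular metricset events from error events.
// Hypothetical type, used only for this illustration.
type eventKind int

const (
	regularEvent eventKind = iota
	errorEvent
)

// destinationIndex returns the index pattern an event ends up in, following
// the behavior described in this issue: only regular events honor
// xpack.enabled, error events always land in metricbeat-*.
func destinationIndex(module string, xpackEnabled bool, kind eventKind) string {
	if xpackEnabled && kind == regularEvent {
		return fmt.Sprintf(".monitoring-%s-*", module)
	}
	return "metricbeat-*"
}
```

With `xpackEnabled` set to true, `destinationIndex("kibana", true, regularEvent)` and `destinationIndex("kibana", true, errorEvent)` return different patterns, which is the discoverability problem raised here.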

7.x versions also handle errors inconsistently: in some cases the errors are only logged and not returned to the metricbeat error reporting (example).

Questions:

  • Is the metricbeat dependency future-proof, or should we consider a dedicated index, or logging only, for error reporting?

Next steps

Ideally we would build a mechanism that ingests all errors generated when xpack.enabled: true. The first step would be to stop reporting them to metricbeat-* and route them to the logger. As a next step we could use this mechanism to standardize and store these errors in a dedicated place (i.e. a new .monitoring-errors-mb data stream?). The idea is to enable easy consumption of these errors:

  • when troubleshooting - a lookup in that place, for example with dataset filters, could provide insightful data for support or SDH
  • in the UI - with standardized errors, Stack Monitoring can consume and surface underlying collection errors so that customers are aware of the issues and, assuming enough context is provided, can solve them
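The proposed mechanism could look roughly like the Go sketch below. Everything in it is an assumption made for illustration: `CollectionError`, `ErrorSink`, `memorySink`, `routeError` and `ErrFetchFailed` are hypothetical names and do not exist in Metricbeat. The point is only that a single routing function decides where an error goes, so the monitoring sink can start as a logger and later be swapped for a dedicated data stream without touching the modules.

```go
package main

import "errors"

// ErrFetchFailed is a sample collection failure used for illustration.
var ErrFetchFailed = errors.New("failed to fetch metricset data")

// CollectionError is a hypothetical standardized error record.
type CollectionError struct {
	Module  string
	Dataset string
	Message string
}

// ErrorSink is a hypothetical destination for collection errors, for
// example a logger, the metricbeat-* events, or a dedicated data stream.
type ErrorSink interface {
	Report(e CollectionError)
}

// memorySink keeps reported errors in memory. It stands in for a real
// logger or index writer in this sketch.
type memorySink struct {
	records []CollectionError
}

func (s *memorySink) Report(e CollectionError) {
	s.records = append(s.records, e)
}

// routeError sends the error to the monitoring sink when xpack.enabled is
// true, and to the regular events sink otherwise.
func routeError(xpackEnabled bool, module, dataset string, err error, events, monitoring ErrorSink) {
	rec := CollectionError{Module: module, Dataset: dataset, Message: err.Error()}
	if xpackEnabled {
		monitoring.Report(rec)
		return
	}
	events.Report(rec)
}
```

Keeping `Dataset` on every record is what would make the dataset filters mentioned above possible, whichever storage is chosen.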
@klacabane klacabane added v8.3.0 v7.17.0 Team:Infra Monitoring UI - DEPRECATED Infrastructure Monitoring UI team - DEPRECATED - Use Team:Monitoring labels Mar 14, 2022
@matschaffer
Contributor

Pretty sure I've seen metricbeat-* docs when errors occur in 8. Should be easy to confirm by monitoring kibana with an incorrect basepath setting I think.

@klacabane
Contributor Author

In 8.x the code path that logs errors when xpack.enabled is set was removed, and all errors are routed to the metricbeat-* index, so we're already able to query that index for relevant data (e.g. error.message : * and event.dataset : "elasticsearch.shard").
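For anyone querying outside Kibana, the KQL filter above can be written as an Elasticsearch Query DSL body. The Go sketch below only builds that JSON; `errorQuery` is a hypothetical helper name, and it assumes event.dataset is mapped as a keyword field so that a term query matches it.

```go
package main

import "encoding/json"

// errorQuery builds a search body equivalent to the KQL filter
// `error.message : * and event.dataset : "<dataset>"`:
// an exists query on error.message plus a term query on event.dataset.
func errorQuery(dataset string) ([]byte, error) {
	body := map[string]any{
		"query": map[string]any{
			"bool": map[string]any{
				"filter": []any{
					map[string]any{"exists": map[string]any{"field": "error.message"}},
					map[string]any{"term": map[string]any{"event.dataset": dataset}},
				},
			},
		},
	}
	return json.Marshal(body)
}
```

The resulting body can be sent to a search request against metricbeat-*.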

This makes me think that the status quo is acceptable for 8.x since all errors are available and queryable, and it is an improvement over the inconsistent 7.x behavior that only logs the error in some cases. The remaining questions are whether a dedicated index like .monitoring-errors would be easier to discover, given that the metricbeat error handling is well documented and known to users, and whether the metricbeat-* dependency is problematic considering it is currently guaranteed to exist.

@jasonrhodes
Member

If we change the location at all, it'd be interesting to consider something like logs-monitoring.errors-default data stream or similar (if these are in fact "log" like?)...

@botelastic

botelastic bot commented Apr 20, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Apr 20, 2023
@botelastic botelastic bot closed this as completed Oct 17, 2023