Description of the problem including expected versus actual behavior:
We have an alert that fires whenever a node goes missing: https://github.com/elastic/kibana/blob/master/x-pack/plugins/monitoring/server/alerts/missing_monitoring_data_rule.ts
But in cloud environments this can be quite common as VMs will fail or need to be removed for maintenance purposes.
This triggers an alert like the one above, which can be confusing.
Describe the feature:
We should look at either disabling this or reducing the priority of the alert in cloud environments to avoid confusion.
Ideally we should be alerting on a discrepancy between the actual and intended cluster topology.
For example, if I expect to have 3 masters and 6 data nodes but only have 2 masters and 5 data nodes beyond an expected recovery window, then I'd like an alert.
But for that we'll need a way for Stack Monitoring to be aware of not only the expected cluster topology, but also the expected recovery time for each class of node. Masters should recover on the order of minutes, probably no more than an hour. Warm data nodes could take on the order of hours in the event a new VM is needed.
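To make the proposal concrete, here is a minimal sketch of what such a topology-discrepancy check could look like. This is purely illustrative: the names (`ExpectedTopology`, `findTopologyAlerts`, the per-class `recoveryWindowMs`) are hypothetical and not part of any existing Kibana or Stack Monitoring API.

```typescript
// Hypothetical sketch: alert only when the observed topology falls short of
// the intended topology for longer than that node class's recovery window.
// None of these types or functions exist in Kibana today.

interface NodeClassExpectation {
  expectedCount: number;
  recoveryWindowMs: number; // how long a shortfall is tolerated before alerting
}

type ExpectedTopology = Record<string, NodeClassExpectation>;

interface ObservedClass {
  count: number;
  shortfallSinceMs?: number; // epoch ms when count first dropped below expected
}

function findTopologyAlerts(
  expected: ExpectedTopology,
  observed: Record<string, ObservedClass>,
  nowMs: number
): string[] {
  const alerts: string[] = [];
  for (const [nodeClass, exp] of Object.entries(expected)) {
    const obs = observed[nodeClass] ?? { count: 0, shortfallSinceMs: nowMs };
    if (obs.count >= exp.expectedCount) continue; // topology matches; no alert
    const since = obs.shortfallSinceMs ?? nowMs;
    if (nowMs - since > exp.recoveryWindowMs) {
      alerts.push(
        `${nodeClass}: expected ${exp.expectedCount}, saw ${obs.count} ` +
          `beyond recovery window`
      );
    }
  }
  return alerts;
}
```

With this shape, a transient VM replacement (e.g. a data node missing for an hour when its class tolerates several hours) produces no alert, while a master missing beyond its shorter window does.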