"Large shard size" Stack Monitoring rule is missing "Look at the average over X minutes" option #111889

Open
2 tasks
ravikesarwani opened this issue Sep 10, 2021 · 4 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Stack Monitoring SM alerting improvements Team:Monitoring Stack Monitoring team

Comments

@ravikesarwani
Contributor

ravikesarwani commented Sep 10, 2021

  • Add a "Look at the average over X minutes" option to the "Large shard size" rule, with a default value of 15 minutes.
  • Change the default threshold for the "Large shard size" rule from 55gb to 75gb.

The stack monitoring documentation for the Large shard size alert states: "The condition is met if an index’s average shard size is 55gb or higher in the last 5 minutes". However, the rule definition has no parameter for specifying that time period.

We don't want a single spike above 55gb in primary shard size to trigger an alert.
Force merges can cause a shard to grow well beyond 50 GB (in some cases it may double) for a short while, which can trigger an alert that would be considered a false positive.
We want the alert to fire only when the average size over the last X minutes (default 15 minutes) exceeds 75gb.
This gives users an additional control point and avoids unnecessary noise.
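
The requested behavior can be sketched roughly as follows. This is only an illustration of averaging over a trailing window, not the actual Kibana rule code: `ShardSizeSample`, `shouldFireLargeShardAlert`, and the constant names are hypothetical, the defaults are the values proposed in this issue, and treating an average exactly at the threshold as a match is an assumption.

```typescript
// Hypothetical sample shape: one primary shard size reading for an index.
interface ShardSizeSample {
  timestampMs: number; // epoch milliseconds
  sizeGb: number; // shard size in GB at that moment
}

// Assumed defaults, taken from the values proposed in this issue.
const DEFAULT_THRESHOLD_GB = 75;
const DEFAULT_WINDOW_MINUTES = 15;

// Returns true when the mean shard size over the trailing window
// is at or above the threshold (assumption: ">=" rather than ">").
function shouldFireLargeShardAlert(
  samples: ShardSizeSample[],
  nowMs: number,
  thresholdGb: number = DEFAULT_THRESHOLD_GB,
  windowMinutes: number = DEFAULT_WINDOW_MINUTES
): boolean {
  const windowStartMs = nowMs - windowMinutes * 60 * 1000;
  // Keep only readings inside the window (windowStartMs, nowMs].
  const inWindow = samples.filter(
    (s) => s.timestampMs > windowStartMs && s.timestampMs <= nowMs
  );
  if (inWindow.length === 0) {
    return false; // no data in the window: do not alert
  }
  const total = inWindow.reduce((sum, s) => sum + s.sizeGb, 0);
  return total / inWindow.length >= thresholdGb;
}
```

With one reading per minute, a single force-merge spike to 110 GB among fourteen readings of 50 GB averages to 54 GB and does not fire, while a sustained 80 GB does.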

This would be similar to the "Disk usage" rule.
[Screenshot: Screen Shot 2021-09-10 at 1 30 20 PM]

ravikesarwani added the Team:Monitoring, Team:Infra Monitoring UI - DEPRECATED, and Feature:Stack Monitoring labels on Sep 10, 2021
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@ravikesarwani
Contributor Author

Force merges can cause a shard to grow well beyond 50 GB for a short while and potentially trigger an alert that would be considered a false positive. The change proposed in this issue (checking the average over the last X minutes) will help with this temporary condition.

jasonrhodes removed this from Features Backlog in [INACTIVE] Metrics / Red Team Backlog on Mar 3, 2022
jasonrhodes added the bug label and removed the Team:Monitoring label on Mar 3, 2022
pmeresanu85 added this to the ER Archive milestone on Aug 31, 2022
pmeresanu85 changed the title from: "Large shard size" rule is missing "Look at the average over X minutes" option, to: [Stack Monitoring - Tech Debt] "Large shard size" rule is missing "Look at the average over X minutes" option, on Aug 31, 2022
smith changed the title from: [Stack Monitoring - Tech Debt] "Large shard size" rule is missing "Look at the average over X minutes" option, to: "Large shard size" Stack Monitoring rule is missing "Look at the average over X minutes" option, on Feb 24, 2023
sophiec20 removed this from the ER Archive milestone on Aug 4, 2023
smith added the Team:Monitoring label and removed the Team:Infra Monitoring UI - DEPRECATED label on Nov 13, 2023
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Monitoring)

Projects
None yet
Development

No branches or pull requests

6 participants