feat(doc): diagnose false alerts #284 (merged)

New file: `docs/how-to/validate-and-troubleshoot/diagnose-false-alerts.md` (106 additions)
---
myst:
  html_meta:
    description: "Diagnose false-positive (noisy) and false-negative (silent) alerts in the Canonical Observability Stack."
---

# How to diagnose false alerts
Use this guide if your alerts aren't behaving as expected, such as alerts that fire too often or never fire when they should. This guide describes common causes and the steps to diagnose issues with your alerts.
False alerts fall into two categories:

- **False positives** (noisy alerts) are alerts that fire when they should not, usually due to overly broad label matching or an inappropriate threshold in the PromQL/LogQL expression.
- **False negatives** are alerts that do *not* fire when they should. This can be due to an unintentional label mismatch in the alert expression, a threshold that is too lenient, or the alert rule being missing entirely, either because it was not forwarded over Juju relation data or was not serialized to disk correctly.
## Diagnose false-positive (noisy) alerts

From the Alertmanager UI, note the rule name, labels, and expression (`expr`). In Grafana, evaluate the expression manually by pasting it into the Grafana query UI.
Inspect which time series match. Look for:

- **Overly broad label selectors**: the expression may match time series from charms or units that were not intended. Compare the label matchers in the expression against the actual labels on the returned time series.
- **Juju topology label mismatches**: charmed alert rules are [automatically injected](/explanation/architecture/juju-topology) with topology matchers. If the topology labels on the time series do not match what was injected, the wrong set of series may be selected. Query `up{}` in Prometheus and compare the topology labels there against your expression.
- **Threshold too sensitive**: the threshold in the expression may be too aggressive for your environment. For example, a CPU usage alert firing at 80% may be normal for your workload. [Charmed alert rule](/explanation/alerting/charmed-rules) thresholds are opinionated and not configurable, so if the threshold is inappropriate you may need to silence the alert or file a bug against the charm.
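To make the label comparison in the second bullet concrete, it can be sketched as a small helper that diffs the matchers in the alert expression against the labels on a returned series. All label names and values below are hypothetical; in practice the series labels come from querying `up{}` in Prometheus.

```python
# Sketch: compare the topology matchers in an alert expression against the
# labels actually present on a time series returned by `up{}`.
# All label names/values here are hypothetical examples.

def mismatched_labels(expr_matchers: dict, series_labels: dict) -> dict:
    """Return matchers whose value differs from (or is absent in) the series."""
    return {
        name: (want, series_labels.get(name))
        for name, want in expr_matchers.items()
        if series_labels.get(name) != want
    }

# Equality matchers injected into the alert expression (e.g. juju topology).
expr_matchers = {"juju_model": "prod", "juju_application": "my-app"}

# Labels observed on a series returned by `up{}` after, say, a model migration.
series_labels = {"juju_model": "staging", "juju_application": "my-app"}

# A non-empty result means the alert expression selects no series at all.
print(mismatched_labels(expr_matchers, series_labels))
```

Here the helper would report that `juju_model` is `staging` on the series but `prod` in the expression, which explains why the wrong set of series (or none) is selected.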
Depending on the root cause:

- **Silence the alert** in the Alertmanager UI, or come up with appropriate inhibit rules.
- [Disable rule forwarding](/how-to/configure-and-tune/disable-charmed-rules) on the aggregator (e.g. `opentelemetry-collector`) if all rules from a particular aggregator are unwanted.
- **File a bug** against the upstream charm if the charmed rule threshold is inappropriate for general use.
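For the inhibit-rule option, a minimal sketch of an Alertmanager `inhibit_rules` entry is shown below. The alert names are hypothetical, and in a charmed deployment the Alertmanager configuration is typically managed through the charm rather than edited by hand:

```yaml
# Hypothetical sketch: suppress a per-unit alert while a broader
# application-level alert is already firing for the same application.
inhibit_rules:
  - source_matchers:
      - alertname="ApplicationDown"   # hypothetical alert name
    target_matchers:
      - alertname="UnitDown"          # hypothetical alert name
    # Only inhibit when source and target agree on these labels:
    equal:
      - juju_model
      - juju_application
```

The `equal` list is what keeps the inhibition scoped: without it, an `ApplicationDown` alert in one model could silence `UnitDown` alerts everywhere.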
## Diagnose false-negative alerts

### Confirm the alert rule exists

First, verify that the alert rule is actually loaded in Prometheus or Loki. If you know the name of the alert (it can be found in the source code of the relevant charm), inspect the relation databag for the relevant unit with `juju show-unit`.
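As an illustration of what to look for in the `juju show-unit` output, the sketch below filters rule names out of an `alert_rules` relation-data blob. The exact relation and key names vary between charm libraries, so treat the structure here as a hypothetical example:

```python
import json

# Hypothetical example of the JSON blob found under an `alert_rules` key in
# `juju show-unit` output; the real structure may differ between charm libs.
relation_data = json.dumps({
    "groups": [
        {
            "name": "my_app_alerts",
            "rules": [
                {"alert": "HighCPU", "expr": "cpu_usage_ratio > 0.8"},
                {"alert": "UnitDown", "expr": "up == 0"},
            ],
        }
    ]
})

def alert_names(blob: str) -> list[str]:
    """List every alert name found in an alert_rules JSON blob."""
    data = json.loads(blob)
    return [
        rule["alert"]
        for group in data.get("groups", [])
        for rule in group.get("rules", [])
        if "alert" in rule
    ]

# If your alert is not in this list, it was never forwarded over relation data.
print(alert_names(relation_data))
```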
If the rule is not present:

- **Check if rule forwarding is disabled**: the `forward_alert_rules` config option on aggregators such as `opentelemetry-collector`, `cos-proxy`, or `prometheus-scrape-config` may be set to `false`. Verify with:
  ```bash
  juju config opentelemetry-collector forward_alert_rules
  ```
- **Check the charm's built-in rules**: charmed alert rules are typically located at `./src/prometheus_alert_rules` and `./src/loki_alert_rules` relative to the charm's source tree. If the rule file is missing from the charm or stored in a non-default location, our charm libraries may not have picked it up.
- For **rules synced from a Git repository** via the [COS Configuration charm](/how-to/configure-and-tune/sync-alert-rules-from-git), confirm that the rule file exists in the repository and that the charm is polling successfully.
### Check the alert expression

If the rule exists in Prometheus or Loki but does not fire when you expect it to, the issue is in the expression itself.

Evaluate the expression manually by pasting the alert expression into the Grafana query UI. If it returns no results or only `0`, the condition for the alert is never met.
Common causes:

- **Label matchers too narrow**: the [juju topology](/explanation/architecture/juju-topology) matchers injected into charmed rules qualify the expression so it applies only to a specific charm. If the topology labels on the actual time series differ from what was injected (for example, after a charm rename or model migration), the expression will not match any series. Query `up{}` and compare the label values against the matchers in the alert expression.
- **Metric does not exist**: the metric name referenced in the expression may not be emitted by your workload, or may have been renamed in a newer version of the workload. Check the metrics endpoint directly, for example:
  ```bash
  juju ssh <unit/0> curl -s localhost:<port>/metrics | grep <metric_name>
  ```
- **Threshold too lenient**: the alert threshold may be higher (or lower) than the values your workload produces, so the condition is never satisfied. Evaluate the expressions in the alert incrementally to see what values they return.
- **Alert uses `absent()`**: the `absent()` function returns `1` when the given selector matches *no* time series at all, and is commonly used to detect missing metrics. However, `absent()` does not support wildcard or regex label matchers: if the selector contains a regex matcher (e.g. `juju_unit=~".+"`), `absent()` will never fire because it cannot determine which label values are "expected". Alert rules that rely on `absent()` must use exact label matchers.
- **Threshold derived from a Grafana dashboard**: if the alert threshold was chosen based on values observed in a Grafana panel, be aware that Grafana can apply post-query axis scaling that changes the displayed values. For example, a panel may show a "percentage" axis (0–100) while the underlying PromQL expression returns a ratio (0–1), or a panel may display bytes while the raw metric is in kilobytes. Always verify the raw query result in the Grafana query inspector or directly in the Prometheus UI before setting a threshold.
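When a threshold is the suspect, it can also help to pin the expected behaviour down offline with Prometheus's `promtool test rules` before changing anything. The sketch below assumes a hypothetical rule file `alert_rules.yaml` containing a `HighCPU` alert on `cpu_usage_ratio > 0.8` with a 10-minute `for` clause:

```yaml
# alerts_test.yaml (run with: promtool test rules alerts_test.yaml)
rule_files:
  - alert_rules.yaml          # hypothetical rule file under test

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 0.9 sustained for 11 samples: above the hypothetical 0.8 threshold
      - series: 'cpu_usage_ratio{juju_model="prod", juju_application="my-app"}'
        values: '0.9x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighCPU    # hypothetical alert name
        exp_alerts:
          - exp_labels:
              juju_model: prod
              juju_application: my-app
```

If the test fails with no alerts at `eval_time`, the threshold (or `for` duration) never matches the synthetic data, which mirrors the false-negative behaviour you see in production.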