vmalert: new metrics for firing and pending alerts count. #573

Closed
Allineer opened this issue Jun 19, 2020 · 8 comments
Labels: enhancement, vmalert

Comments

@Allineer

How about two new vmalert metrics:

  1. for firing alerts (gauge)
count(ALERTS{alertstate="firing"})
  2. for pending alerts (gauge)
count(ALERTS{alertstate="pending"})

Maybe with an alertname label.
This is for those who use the -remoteWrite.url database only for the vmalert instance, with minimal retention.
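
As a rough illustration of what the two proposed gauges would report, here is a minimal Python sketch. The `Alert` record and the `count_by_state` helper are hypothetical names invented for this example and are not part of vmalert; the sketch only mirrors the `count(ALERTS{alertstate=...})` queries above, broken down by `alertname`.

```python
from collections import Counter
from typing import NamedTuple


class Alert(NamedTuple):
    """Hypothetical in-memory alert record (not a vmalert type)."""
    alertname: str
    state: str  # "firing" or "pending"


def count_by_state(alerts, state):
    """Count alerts in the given state, keyed by alertname."""
    return Counter(a.alertname for a in alerts if a.state == state)


alerts = [
    Alert("HighCPU", "firing"),
    Alert("HighCPU", "firing"),
    Alert("DiskFull", "pending"),
]
firing = count_by_state(alerts, "firing")
pending = count_by_state(alerts, "pending")
# Total and per-alertname counts for each state.
print(sum(firing.values()), dict(firing))
print(sum(pending.values()), dict(pending))
```

Summing the per-alertname values gives the single total that the plain `count(...)` query would return.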

valyala added the enhancement label Jun 19, 2020
@chaets

chaets commented Jul 13, 2020

How about ALERTS_FOR_STATE as well?

@Allineer
Author

Allineer commented Jul 14, 2020

@chaets

This is for those who use the -remoteWrite.url database only for the vmalert instance, with minimal retention.

hagen1778 added a commit that referenced this issue Jul 26, 2020
New metrics were added to improve observability:
+ vmalert_alerts_pending{alertname, group} - number of pending alerts per group
per alert;
+ vmalert_alerts_active{alertname, group} - number of active alerts per group
per alert;
+ vmalert_alerts_error{alertname, group} - is 1 if alertname ended up with an error
during the previous execution, is 0 if no errors happened;
+ vmalert_recording_rules_error{recording, group} - is 1 if the recording rule
ended up with an error during the previous execution, is 0 if no errors happened;
* vmalert_iteration_total{group, file} - now contains group and file name labels.
This should improve control over specific groups;
* vmalert_iteration_duration_seconds{group, file} - now contains group and file name labels.
This should improve control over specific groups.

Some collisions for alerts and recording rules are possible, because neither
the group name nor the alert/recording rule name is unique, for compatibility reasons.

The commit contains a list of TODOs for unregistering metrics, since groups and rules
are ephemeral and could be removed without an application restart. In order to
unlock the unregistering feature, a corresponding PR was filed - VictoriaMetrics/metrics#13
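
The collision caveat in this commit message can be illustrated with a small sketch. It assumes, purely for illustration, that a time series is identified by its metric name plus the `alertname` and `group` labels; the `series_name` and `find_collisions` helpers and the sample rule list are hypothetical and do not reflect vmalert internals.

```python
from collections import Counter


def series_name(metric, alertname, group):
    """Render a series identifier in Prometheus text-format style."""
    return f'{metric}{{alertname="{alertname}",group="{group}"}}'


def find_collisions(rules, metric="vmalert_alerts_pending"):
    """Return series names that more than one (group, alertname) rule maps to."""
    names = Counter(series_name(metric, alertname, group) for group, alertname in rules)
    return sorted(name for name, n in names.items() if n > 1)


# Two rules in the same group share a name, so they map to the same series.
rules = [("node", "HighCPU"), ("node", "HighCPU"), ("node", "DiskFull")]
print(find_collisions(rules))
```

With only `alertname` and `group` as labels, the two `HighCPU` rules cannot be told apart, which is the situation the message warns about.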
hagen1778 added a commit that referenced this issue Aug 2, 2020
The changes are as follows:
* add an ID label to rule metrics, since `name` collisions within one group are
a common case - see the k8s example alerts;
* support metrics unregistering on rule updates. Consider the case when one rule
was added to or removed from the group, or the whole group was added or removed.

The change depends on VictoriaMetrics/metrics#16,
where a race condition in the Unregister method was fixed.
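
A minimal sketch of the idea behind this change, under stated assumptions: the `GaugeRegistry` class below is a hypothetical stand-in and not the API of the VictoriaMetrics/metrics package, and the label name `id` and its values are made up for the example. It shows how a per-rule ID keeps same-named rules apart, and how series for removed rules can be dropped on a rule update, with a lock guarding concurrent register and unregister calls.

```python
import threading


class GaugeRegistry:
    """Hypothetical thread-safe registry of gauge series keyed by full name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._gauges = {}

    def register(self, name, value=0):
        with self._lock:
            self._gauges[name] = value

    def unregister(self, name):
        """Remove a series; return True if it was registered."""
        with self._lock:
            return self._gauges.pop(name, None) is not None

    def names(self):
        with self._lock:
            return sorted(self._gauges)


def series(rule_id, alertname, group):
    # The id label (assumed name) distinguishes rules sharing alertname and group.
    return f'vmalert_alerts_pending{{alertname="{alertname}",group="{group}",id="{rule_id}"}}'


def sync_rules(registry, old_rules, new_rules):
    """Unregister series for removed rules and register series for added ones."""
    for rule in old_rules - new_rules:
        registry.unregister(series(*rule))
    for rule in new_rules - old_rules:
        registry.register(series(*rule))


registry = GaugeRegistry()
old = {("1", "HighCPU", "node"), ("2", "HighCPU", "node")}
new = {("2", "HighCPU", "node"), ("3", "DiskFull", "node")}
sync_rules(registry, set(), old)   # initial load
sync_rules(registry, old, new)     # rule 1 removed, rule 3 added
print(registry.names())
```

After the second call only the series for rules 2 and 3 remain, so a removed rule no longer leaves a stale gauge behind.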
valyala pushed a commit that referenced this issue Aug 9, 2020
* app/vmalert: extend metrics set exported by `vmalert` #573

@valyala
Collaborator

valyala commented Aug 9, 2020

@hagen1778, the commit that extends the metrics exported by vmalert has been included in v1.39.4. Should we close this issue as fixed?

@Allineer
Author

Allineer commented Aug 9, 2020

Wow!

Give me a few hours to test it and I'll close this issue.

@Allineer
Author

@hagen1778, @valyala, it's working, thanks!

@Allineer
Author

But...
Why is the ID label used? Why not simply sum() values over the alertname and group labels?
This label doesn't in any way indicate a specific rule among several with the same name, i.e. this label is useless to the user, right?
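
For readers who want the aggregated view asked about here, a hedged sketch of what summing over the ID label amounts to. The sample tuples and the `sum_by_alert` helper are hypothetical; against a real database the same result would come from a query that sums the metric by `alertname` and `group`.

```python
from collections import defaultdict


def sum_by_alert(samples):
    """Sum gauge values per (alertname, group), discarding the id label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[(labels["alertname"], labels["group"])] += value
    return dict(totals)


# Hypothetical samples: (labels, value) pairs for one gauge metric.
samples = [
    ({"alertname": "HighCPU", "group": "node", "id": "1"}, 2),
    ({"alertname": "HighCPU", "group": "node", "id": "2"}, 1),
    ({"alertname": "DiskFull", "group": "node", "id": "3"}, 4),
]
print(sum_by_alert(samples))
```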

@valyala
Collaborator

valyala commented Aug 10, 2020

Why is the ID label used? Why not simply sum() values over the alertname and group labels?

See the answer at #654 (comment).

This label doesn't in any way indicate a specific rule among several with the same name, i.e. this label is useless to the user, right?

The alertID is used in http://<vmalert-addr>/api/v1/<groupName>/<alertID>/status
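
A small sketch of how such a status URL could be assembled from the label values. The `alert_status_url` helper is a hypothetical name, and the address, group name and ID in the example call are placeholders; only the path layout comes from the comment above.

```python
from urllib.parse import quote


def alert_status_url(vmalert_addr, group_name, alert_id):
    """Build the status URL, percent-encoding the path segments."""
    group = quote(group_name, safe="")
    alert = quote(str(alert_id), safe="")
    return f"http://{vmalert_addr}/api/v1/{group}/{alert}/status"


# Placeholder address, group name and alert ID.
print(alert_status_url("localhost:8880", "node rules", 42))
```

Percent-encoding the segments keeps group names with spaces or slashes from breaking the path.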

@Allineer
Author

Understood. Thanks.
