new flag: ruler.outage-tolerance #2783

owen-d · 2020-06-24T17:20:01Z

What

Exposes the underlying Prometheus rule manager option, OutageTolerance. This is the lookback period over which the metric store is queried for past alert evaluations when each rule group is first run. In Cortex, this matters not only during restarts but also when ring membership changes. Without this configuration, previous rule state cannot be restored, so after re-sharding or a restart an alert must wait out its entire ForDuration again before it can be considered firing.
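To make the lookback idea concrete, here is a minimal sketch (not the Prometheus or Cortex implementation) of the check described above: past alert state is only recoverable if the last evaluation falls within the tolerance window ending at the current time. The names `restoreWindow`, `canRestore`, `defaultOutageTolerance`, and `demoNow` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// defaultOutageTolerance mirrors the 1h default quoted in this PR.
const defaultOutageTolerance = time.Hour

// demoNow is an arbitrary fixed timestamp used only for this example.
var demoNow = time.Date(2020, 6, 24, 17, 20, 1, 0, time.UTC)

// restoreWindow returns the lookback range [from, to] that would be
// queried for past alert evaluations (hypothetical helper).
func restoreWindow(now time.Time, tolerance time.Duration) (from, to time.Time) {
	return now.Add(-tolerance), now
}

// canRestore reports whether an evaluation made at lastEval lies inside
// the lookback range, i.e. whether its state could still be found.
func canRestore(lastEval, now time.Time, tolerance time.Duration) bool {
	from, to := restoreWindow(now, tolerance)
	return !lastEval.Before(from) && !lastEval.After(to)
}

func main() {
	// A group last evaluated 30m ago is within the 1h tolerance.
	fmt.Println(canRestore(demoNow.Add(-30*time.Minute), demoNow, defaultOutageTolerance))
	// After a 2h gap the previous state falls outside the window.
	fmt.Println(canRestore(demoNow.Add(-2*time.Hour), demoNow, defaultOutageTolerance))
}
```

In the second case the alert would have to accumulate its full "for" duration again, which is the behaviour this PR aims to avoid after re-sharding.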

edit: defaulted to the Prometheus defaults for these flags and included a total of 3 configs at Jacob's suggestion:

# Max time to tolerate outage for restoring "for" state of alert.
# CLI flag: -ruler.for-outage-tolerance
[for_outage_tolerance: <duration> | default = 1h]

# Minimum duration between alert and restored "for" state. This is maintained
# only for alerts with configured "for" time greater than grace period.
# CLI flag: -ruler.for-grace-period
[for_grace_period: <duration> | default = 10m]

# Minimum amount of time to wait before resending an alert to Alertmanager.
# CLI flag: -ruler.resend-delay
[resend_delay: <duration> | default = 1m]
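For readers who want to see how three duration flags like these could be wired up, the following is a rough sketch using only Go's standard `flag` package. The flag names, help text, and defaults are taken from the block above; the `RulerOptions` struct, its `RegisterFlags` method, and `parseRulerOptions` are assumptions made for this example and are not the actual code in `pkg/ruler/ruler.go`.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// RulerOptions is a hypothetical holder for the three settings.
type RulerOptions struct {
	ForOutageTolerance time.Duration
	ForGracePeriod     time.Duration
	ResendDelay        time.Duration
}

// RegisterFlags binds each option to its CLI flag with the documented default.
func (o *RulerOptions) RegisterFlags(f *flag.FlagSet) {
	f.DurationVar(&o.ForOutageTolerance, "ruler.for-outage-tolerance", time.Hour,
		`Max time to tolerate outage for restoring "for" state of alert.`)
	f.DurationVar(&o.ForGracePeriod, "ruler.for-grace-period", 10*time.Minute,
		`Minimum duration between alert and restored "for" state.`)
	f.DurationVar(&o.ResendDelay, "ruler.resend-delay", time.Minute,
		`Minimum amount of time to wait before resending an alert to Alertmanager.`)
}

// parseRulerOptions parses args into a fresh RulerOptions (illustrative helper).
func parseRulerOptions(args []string) (RulerOptions, error) {
	var opts RulerOptions
	fs := flag.NewFlagSet("ruler", flag.ContinueOnError)
	opts.RegisterFlags(fs)
	err := fs.Parse(args)
	return opts, err
}

func main() {
	// Override one flag; the other two keep their defaults.
	opts, err := parseRulerOptions([]string{"-ruler.for-outage-tolerance=2h"})
	if err != nil {
		panic(err)
	}
	fmt.Println(opts.ForOutageTolerance, opts.ForGracePeriod, opts.ResendDelay)
	// prints "2h0m0s 10m0s 1m0s"
}
```

Flags left unset fall back to the defaults, matching the values listed for the YAML fields.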

Signed-off-by: Owen Diehl <ow.diehl@gmail.com>

jtlisi

LGTM

There are some other options we should also consider exposing here:

https://github.com/prometheus/prometheus/blob/b788986717e1597452ca25e5219510bb787165c7/cmd/prometheus/main.go#L234-L238

pracucci · 2020-06-25T09:57:49Z

CHANGELOG.md

@@ -2,6 +2,9 @@

 ## master / unreleased

+* [FEATURE] Introduced `ruler.for-outage-tolerance`, Max time to tolerate outage for restoring "for" state of alert. #2783
+* [FEATURE] Introduced `ruler.for-grace-period`, Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured "for" time greater than grace period. #2783
+* [FEATURE] Introduced `ruler.for-resend-delay`, Minimum amount of time to wait before resending an alert to Alertmanager. #2783


Oops! The flag is ruler.resend-delay, not ruler.for-resend-delay. @owen-d, would you mind opening a PR to fix the CHANGELOG entry, please?

Also, please put it in the correct place, as we ask in the checklist.

new flag: ruler.outage-tolerance

7bd4818

Signed-off-by: Owen Diehl <ow.diehl@gmail.com>

pull-request-size bot added the size/S label Jun 24, 2020

jtlisi approved these changes Jun 24, 2020

more ruler options

5b4d201

Signed-off-by: Owen Diehl <ow.diehl@gmail.com>

pull-request-size bot added size/M and removed size/S labels Jun 24, 2020

gotjosh approved these changes Jun 24, 2020

gouthamve approved these changes Jun 25, 2020

gouthamve merged commit 9960321 into cortexproject:master Jun 25, 2020

pracucci reviewed Jun 25, 2020

owen-d mentioned this pull request Jun 25, 2020

Fix ruler.outage-tolerance changelog entry and orders correctly #2792

Merged


bboreham Jun 25, 2020
