new flag: ruler.outage-tolerance #2783

Merged

owen-d

@owen-d owen-d commented Jun 24, 2020

What

Exposes the underlying Prometheus manager option, OutageTolerance. This is the lookback period over which the metric store can be queried for past alert evaluations, and it is used when each rule group first runs. In Cortex this matters not only during restarts but also whenever ring membership changes. Without this configuration, previous rule state cannot be restored, so after re-sharding or a restart an alert must wait the entire ForDuration again before it can be considered firing.

Edit: the flags now default to the Prometheus defaults, and the PR includes three configs in total, at Jacob's suggestion:

# Max time to tolerate outage for restoring "for" state of alert.
# CLI flag: -ruler.for-outage-tolerance
[for_outage_tolerance: <duration> | default = 1h]

# Minimum duration between alert and restored "for" state. This is maintained
# only for alerts with configured "for" time greater than grace period.
# CLI flag: -ruler.for-grace-period
[for_grace_period: <duration> | default = 10m]

# Minimum amount of time to wait before resending an alert to Alertmanager.
# CLI flag: -ruler.resend-delay
[resend_delay: <duration> | default = 1m]

Signed-off-by: Owen Diehl <ow.diehl@gmail.com>

@jtlisi jtlisi left a comment

pkg/ruler/ruler.go (outdated, resolved)
CHANGELOG.md (outdated, resolved)
Signed-off-by: Owen Diehl <ow.diehl@gmail.com>
@pull-request-size pull-request-size bot added size/M and removed size/S labels Jun 24, 2020
@gouthamve gouthamve merged commit 9960321 into cortexproject:master Jun 25, 2020
@@ -2,6 +2,9 @@

## master / unreleased

* [FEATURE] Introduced `ruler.for-outage-tolerance`, Max time to tolerate outage for restoring "for" state of alert. #2783
* [FEATURE] Introduced `ruler.for-grace-period`, Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured "for" time greater than grace period. #2783
* [FEATURE] Introduced `ruler.for-resend-delay`, Minimum amount of time to wait before resending an alert to Alertmanager. #2783
Oops! The flag is ruler.resend-delay. @owen-d Would you mind opening a PR to fix the CHANGELOG entry, please?

Also put it in the correct place, like we ask in the checklist.
