new flag: ruler.outage-tolerance #2783
Conversation
Signed-off-by: Owen Diehl <ow.diehl@gmail.com>
LGTM
There are some other options we should also consider exposing here:
Signed-off-by: Owen Diehl <ow.diehl@gmail.com>
```diff
@@ -2,6 +2,9 @@

 ## master / unreleased

+* [FEATURE] Introduced `ruler.for-outage-tolerance`, max time to tolerate outage for restoring "for" state of alert. #2783
+* [FEATURE] Introduced `ruler.for-grace-period`, minimum duration between alert and restored "for" state. This is maintained only for alerts with a configured "for" time greater than the grace period. #2783
+* [FEATURE] Introduced `ruler.for-resend-delay`, minimum amount of time to wait before resending an alert to Alertmanager. #2783
```
Oops! The flag is `ruler.resend-delay`. @owen-d Would you mind opening a PR to fix the CHANGELOG entry, please?
Also put it in the correct place, like we ask in the checklist.
What
Exposes the underlying Prometheus manager option, `OutageTolerance`. This is the lookback period for which the metric store can be queried for past alert evaluations; it's used when each rule group is initially run. In Cortex, this has implications not only during restarts, but also whenever the ring membership changes. Without this configuration, previous rule state cannot be restored, meaning that after re-sharding or restarts we must wait the entire `ForDuration` before an alert may be considered firing.

Edit: defaulted these flags to the Prometheus defaults and included a total of 3 configs at Jacob's suggestion:
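For illustration, a minimal sketch of how the ruler might be started with the three flags this PR introduces. The flag names come from this PR's discussion (note the corrected `ruler.resend-delay`); the values shown are assumptions for illustration, mirroring the Prometheus rule-manager defaults:

```shell
# Sketch: run the Cortex ruler with "for" state restoration tuned.
# Values are illustrative, matching Prometheus's rule-manager defaults.
cortex -target=ruler \
  -ruler.for-outage-tolerance=1h \
  -ruler.for-grace-period=10m \
  -ruler.resend-delay=1m
```

With these settings, after a restart or re-shard the ruler will query back up to one hour of past evaluations to restore each alert's "for" state instead of waiting out the full `ForDuration` again.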