Description
The TestGetRulesFromBackup test in pkg/ruler/ruler_test.go is flaky, failing intermittently in CI. The failure is a timing race between the Prometheus rule manager's asynchronous first evaluation of the live rule groups and the test's assertion that a live group's GroupStateDesc equals the corresponding backup group's.
Observed on PR CI (integration-configs-db (ubuntu-24.04, amd64)): the same commit produced one passing and one failing run, which is the signature of a flaky test rather than a code regression:
26670519832 (same head SHA 84930cb, same trigger time)
The PR under test only touched pkg/querier/blocks_finder_bucket_scan.go — unrelated to pkg/ruler. integration-configs-db is green on recent master.
Fail excerpt
ruler_test.go:1616:
Error Trace: /go/src/github.com/cortexproject/cortex/pkg/ruler/ruler_test.go:1616
/go/src/github.com/cortexproject/cortex/pkg/ruler/ruler_test.go:1654
Error: Not equal:
expected: time.Date(2026, time.May, 30, 1, 29, 46, 642606566, time.Local)
actual : time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC)
Diff:
--- Expected
+++ Actual
@@ -1,2 +1,2 @@
-(time.Time) 2026-05-30 01:29:46.642606566 +0000 UTC m=+29.180124763
+(time.Time) 0001-01-01 00:00:00 +0000 UTC
Test: TestGetRulesFromBackup
--- FAIL: TestGetRulesFromBackup (0.21s)
Expected behavior
TestGetRulesFromBackup passes consistently.
Actual behavior
It fails intermittently at ruler_test.go:1616, inside the requireGroupStateEqual helper:
require.Equal(t, a.EvaluationTimestamp, b.EvaluationTimestamp)
The failing call site is ruler_test.go:1654, which compares the live group l2 against the backup group b2:
requireGroupStateEqual(stateByKey["namespace;l2"], stateByKey["namespace;b2"])
expected (the live group l2) carries a real evaluation timestamp; actual (the backup group b2) is the zero time.Time.
Root cause
requireGroupStateEqual asserts that a live group's GroupStateDesc equals a backup group's, including EvaluationTimestamp (and EvaluationDuration). But those two fields are fundamentally different between the two paths:
- Backup groups are never evaluated, so their timestamps are hardcoded to the zero value:
  - pkg/ruler/ruler.go:1445 (group level) sets EvaluationTimestamp: time.Time{} ("rules backup is not evaluating so there will be no evaluation timestamp")
  - pkg/ruler/ruler.go:1492 / :1502 (rule level) set EvaluationTimestamp: time.Time{}
- Live groups derive their timestamp from the Prometheus rule manager:
  - pkg/ruler/ruler.go:1274 sets EvaluationTimestamp: group.GetLastEvaluation()
The assertion therefore only holds while the live group has not yet been evaluated (both sides zero). The Prometheus manager runs an initial evaluation asynchronously after syncRules, so whether that first evaluation lands before or after the test's GetRules call is a race. When it lands first, the live group's EvaluationTimestamp becomes non-zero while the backup stays zero → Not equal. (l2/b2 use Interval: 0, which makes the race especially easy to lose.)
This is a test-only issue: comparing EvaluationTimestamp/EvaluationDuration between a live group and a never-evaluated backup group is not a meaningful invariant. The surrounding comment (ruler_test.go:1650-1652) states the intent is only to confirm the rulepb.RuleGroupList → GroupStateDesc conversion matches the promRules.Group → GroupStateDesc conversion for the config fields — not the evaluation state.
Possible solution
In requireGroupStateEqual, stop comparing the evaluation-state fields between live and backup groups, since the backup side is defined to be zero:
- Drop (or guard) require.Equal(t, a.EvaluationTimestamp, b.EvaluationTimestamp) and the analogous EvaluationDuration assertions at the group level (ruler_test.go:1616-1617) and the rule level.
- Keep asserting the config-derived fields (Interval, User, Limit, Expr, Labels, Record/Alert, etc.) and the already-relaxed State check.
Alternatively, the test could assert that the backup group's EvaluationTimestamp is exactly the zero value, rather than requiring it to equal the live group's.
Additional context
- Test: pkg/ruler/ruler_test.go → TestGetRulesFromBackup
- Helper: requireGroupStateEqual (assertion at ruler_test.go:1616)
- Backup conversion: pkg/ruler/ruler.go:1419 ruleGroupListToGroupStateDesc
- The EvaluationTimestamp comparison in requireGroupStateEqual was added relatively recently; before that the helper did not compare evaluation-state fields.
I'm happy to send a small PR implementing the proposed fix.