Skip to content

Flaky Test: TestGetRulesFromBackup #7577

@sandy2008

Description

@sandy2008

Description

The TestGetRulesFromBackup test in pkg/ruler/ruler_test.go is flaky, failing intermittently in CI. The failure is a timing race between the Prometheus rule manager's asynchronous first evaluation of the live rule groups and the test's assertion that a live group's GroupStateDesc equals the corresponding backup group's.

Observed on PR CI (integration-configs-db (ubuntu-24.04, amd64)): the same commit produced one passing and one failing run, which is the signature of a flaky test rather than a code regression:

The PR under test only touched pkg/querier/blocks_finder_bucket_scan.go — unrelated to pkg/ruler. integration-configs-db is green on recent master.

Fail excerpt

    ruler_test.go:1616:
        	Error Trace:	/go/src/github.com/cortexproject/cortex/pkg/ruler/ruler_test.go:1616
        	            				/go/src/github.com/cortexproject/cortex/pkg/ruler/ruler_test.go:1654
        	Error:      	Not equal:
        	            	expected: time.Date(2026, time.May, 30, 1, 29, 46, 642606566, time.Local)
        	            	actual  : time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC)

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(time.Time) 2026-05-30 01:29:46.642606566 +0000 UTC m=+29.180124763
        	            	+(time.Time) 0001-01-01 00:00:00 +0000 UTC
        	Test:       	TestGetRulesFromBackup
--- FAIL: TestGetRulesFromBackup (0.21s)

Expected behavior

TestGetRulesFromBackup passes consistently.

Actual behavior

It fails intermittently at ruler_test.go:1616, inside the requireGroupStateEqual helper:

require.Equal(t, a.EvaluationTimestamp, b.EvaluationTimestamp)

The failing call site is ruler_test.go:1654, which compares the live group l2 against the backup group b2:

requireGroupStateEqual(stateByKey["namespace;l2"], stateByKey["namespace;b2"])

expected (the live group l2) carries a real evaluation timestamp; actual (the backup group b2) is the zero time.Time.

Root cause

requireGroupStateEqual asserts that a live group's GroupStateDesc equals a backup group's, including EvaluationTimestamp (and EvaluationDuration). But those two fields are fundamentally different between the two paths:

  • Backup groups are never evaluated, so their timestamps are hardcoded to the zero value:
    • pkg/ruler/ruler.go:1445 (group level) — EvaluationTimestamp: time.Time{} ("rules backup is not evaluating so there will be no evaluation timestamp")
    • pkg/ruler/ruler.go:1492 / :1502 (rule level) — EvaluationTimestamp: time.Time{}
  • Live groups derive their timestamp from the Prometheus rule manager:
    • pkg/ruler/ruler.go:1274EvaluationTimestamp: group.GetLastEvaluation()

The assertion therefore only holds while the live group has not yet been evaluated (both sides zero). The Prometheus manager runs an initial evaluation asynchronously after syncRules, so whether that first evaluation lands before or after the test's GetRules call is a race. When it lands first, the live group's EvaluationTimestamp becomes non-zero while the backup stays zero → Not equal. (l2/b2 use Interval: 0, which makes the race especially easy to lose.)

This is a test-only issue: comparing EvaluationTimestamp/EvaluationDuration between a live group and a never-evaluated backup group is not a meaningful invariant. The surrounding comment (ruler_test.go:1650-1652) states the intent is only to confirm the rulepb.RuleGroupList → GroupStateDesc conversion matches the promRules.Group → GroupStateDesc conversion for the config fields — not the evaluation state.

Possible solution

In requireGroupStateEqual, stop comparing the evaluation-state fields between live and backup groups, since the backup side is defined to be zero:

  • Drop (or guard) require.Equal(t, a.EvaluationTimestamp, b.EvaluationTimestamp) and the analogous EvaluationDuration assertions at the group level (ruler_test.go:1616-1617) and the rule level.
  • Keep asserting the config-derived fields (Interval, User, Limit, Expr, Labels, Record/Alert, etc.) and the already-relaxed State check.

Alternatively, the test could assert that the backup group's EvaluationTimestamp is exactly the zero value, and not require it to equal the live group's.

Additional context

  • Test: pkg/ruler/ruler_test.goTestGetRulesFromBackup
  • Helper: requireGroupStateEqual (assertion at ruler_test.go:1616)
  • Backup conversion: pkg/ruler/ruler.go:1419 ruleGroupListToGroupStateDesc
  • The EvaluationTimestamp comparison in requireGroupStateEqual was added relatively recently; before that the helper did not compare evaluation-state fields.

I'm happy to send a small PR implementing the proposed fix.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions