AI Tool Usage Notice
Drafted with help from Claude Code. All code references, line numbers, failure excerpts, and reproduction steps were reviewed and validated against master before submitting.
Describe the bug
TestCompactor_DeleteLocalSyncFiles flakes on test-no-race (arm64) with:
--- FAIL: TestCompactor_DeleteLocalSyncFiles (2.86s)
compactor_test.go:1855:
Error Trace: /__w/cortex/cortex/pkg/compactor/compactor_test.go:1855
Error: Should not be zero, but was 0
Test: TestCompactor_DeleteLocalSyncFiles
Observed in https://github.com/cortexproject/cortex/actions/runs/26555111751/job/78225222195 (PR #7563, which only modifies MAINTAINERS.md — confirming pre-existing flake). The test-no-race (amd64) job in the same run passed. A very similar TestPartitionCompactor_DeleteLocalSyncFiles failure showed up in run 26491543042 with the identical Should not be zero, but was 0 signature at compactor_paritioning_test.go:1631.
What line 1855 asserts
// pkg/compactor/compactor_test.go:1849-1855
cortex_testutil.Poll(t, 10*time.Second, 1.0, func() any {
return prom_testutil.ToFloat64(c2.CompactionRunsCompleted)
})
// Let's check how many users second compactor has.
c2Users := len(c2.listTenantsWithMetaSyncDirectories())
require.NotZero(t, c2Users)
The test fixture creates 10 fixed tenants (user-1..user-10), starts c1, waits for one compaction run, then starts c2 and asserts c2 owns at least one tenant.
To Reproduce
Timing/scheduling-dependent flake; most reproducible on slower (arm64) runners.
- Run the test repeatedly:
go test -run '^TestCompactor_DeleteLocalSyncFiles$' ./pkg/compactor/... -count=50
- Intermittently it fails at
compactor_test.go:1855 with Should not be zero, but was 0: the second compactor (c2) completed a compaction cycle but owned 0 users that cycle. The sibling TestPartitionCompactor_DeleteLocalSyncFiles fails the same way at compactor_paritioning_test.go:1631.
Expected behavior
TestCompactor_DeleteLocalSyncFiles (and its partitioning sibling) pass consistently; the ownership assertion only runs once c2 has actually observed itself owning at least one user in the ring.
Root cause analysis
Two complementary hypotheses, both plausible:
Hypothesis A: ring-view startup race
The Poll only waits for CompactionRunsCompleted == 1.0, but CompactionRunsCompleted is incremented at the end of compactUsers() (compactor.go:826-840) as long as no errors occurred — even if ownedUsers is empty (non-owned users continue at compactor.go:868-927, which is not an error). So a successful "I owned 0 users" cycle still trips the Poll.
c2's ring view is built from two independent updates that race during startup:
- c2's lifecycler writes
ACTIVE + tokens to the (shared) consul KV (ring/lifecycler.go:957-999).
- c2's
Ring object pulls/watches KV and updates ringTokens (ring/ring.go:288-318).
WaitRingStability (compactor.go:736) polls only once per second with WaitStabilityMinDuration=1s / MaxDuration=5s (set by the test at lines 1811-1812). On a slow arm64 runner, the second poll can capture the ring view at a moment when c2's tokens have just been merged in, leaving the in-memory ringInstanceByToken map momentarily skewed — c.ring.Get then picks c1 for all 10 tenants for that cycle.
Hypothesis B: stochastic token distribution
Lifecyclers use 512 random tokens (pkg/compactor/compactor_ring.go:112-114) seeded with time.Now().UnixNano() (pkg/ring/token_generator.go:43-45). With only 10 fixed tenant IDs and 2 instances × 512 tokens, there is a small but non-zero chance every tenant hash maps to c1. arm64 just happened to be the job where the random draw hit. Polling longer won't help — same ring → same answer.
The two hypotheses are not mutually exclusive: a transient skew (A) can compound a borderline distribution (B).
Proposed fix (Hypothesis A path)
Poll on the actual condition (c2 owns ≥1 user), not the counter that ticks even on empty cycles.
--- a/pkg/compactor/compactor_test.go
+++ b/pkg/compactor/compactor_test.go
@@ -1846,11 +1846,16 @@ func TestCompactor_DeleteLocalSyncFiles(t *testing.T) {
// Now start second compactor, and wait until it runs compaction.
require.NoError(t, services.StartAndAwaitRunning(context.Background(), c2))
- cortex_testutil.Poll(t, 10*time.Second, 1.0, func() any {
- return prom_testutil.ToFloat64(c2.CompactionRunsCompleted)
+ // Wait until c2 has both completed a compaction cycle and observed itself
+ // owning at least one user in the ring. CompactionRunsCompleted increments
+ // even when no users were owned that cycle, so we must poll on the actual
+ // condition we depend on.
+ cortex_testutil.Poll(t, 30*time.Second, true, func() any {
+ return prom_testutil.ToFloat64(c2.CompactionRunsCompleted) >= 1 &&
+ len(c2.listTenantsWithMetaSyncDirectories()) > 0
})
- // Let's check how many users second compactor has.
c2Users := len(c2.listTenantsWithMetaSyncDirectories())
require.NotZero(t, c2Users)
Same change should be applied to the sibling TestPartitionCompactor_DeleteLocalSyncFiles at pkg/compactor/compactor_paritioning_test.go:1624-1631 — it has identical structure and is the same failure pattern.
If Hypothesis B turns out to dominate (Poll times out), the fix is to keep generating tenants until c2 owns one (forcing the path the test cares about, which is c1 cleaning up dirs that c2 has taken over).
Risk
- If the underlying race is actually that c2 never observes itself in the ring, the Poll will time out at 30s and surface a clearer error than the current opaque
Should not be zero, but was 0.
- The Poll now reads two values (counter + filesystem listing).
os.ReadDir ~10×/sec is harmless.
- We don't fix the underlying startup-time ring-view race; we just avoid sampling at a fragile moment. A more thorough fix would be in
compactor.go:730-741 (e.g., require at least one observed change before declaring stability), but that's a much larger behavioural change in production code for a test-only flake.
Environment
- Infrastructure: N/A — test-only flake. Observed on the
test-no-race (arm64) CI job (run 26555111751); the amd64 job in the same run passed. arm64 runners are slower at scheduling/syscalls, widening the race window.
- Deployment tool: N/A
Additional context
test-no-race (arm64) excerpt
--- FAIL: TestCompactor_DeleteLocalSyncFiles (2.86s)
compactor_test.go:1855:
Error Trace: /__w/cortex/cortex/pkg/compactor/compactor_test.go:1855
Error: Should not be zero, but was 0
Test: TestCompactor_DeleteLocalSyncFiles
FAIL
FAIL github.com/cortexproject/cortex/pkg/compactor 94.038s
make: *** [Makefile:220: test-no-race] Error 1
Related: #7564 (sibling TestBlocksCleaner arm64 flake from the same CI run), #6724 (flaky-tests tracker).
AI Tool Usage Notice
Drafted with help from Claude Code. All code references, line numbers, failure excerpts, and reproduction steps were reviewed and validated against
masterbefore submitting.Describe the bug
TestCompactor_DeleteLocalSyncFilesflakes ontest-no-race (arm64)with:Observed in https://github.com/cortexproject/cortex/actions/runs/26555111751/job/78225222195 (PR #7563, which only modifies
MAINTAINERS.md— confirming pre-existing flake). Thetest-no-race (amd64)job in the same run passed. A very similarTestPartitionCompactor_DeleteLocalSyncFilesfailure showed up in run 26491543042 with the identicalShould not be zero, but was 0signature atcompactor_paritioning_test.go:1631.What line 1855 asserts
The test fixture creates 10 fixed tenants (
user-1..user-10), startsc1, waits for one compaction run, then startsc2and assertsc2owns at least one tenant.To Reproduce
Timing/scheduling-dependent flake; most reproducible on slower (arm64) runners.
compactor_test.go:1855withShould not be zero, but was 0: the second compactor (c2) completed a compaction cycle but owned 0 users that cycle. The siblingTestPartitionCompactor_DeleteLocalSyncFilesfails the same way atcompactor_paritioning_test.go:1631.Expected behavior
TestCompactor_DeleteLocalSyncFiles(and its partitioning sibling) pass consistently; the ownership assertion only runs oncec2has actually observed itself owning at least one user in the ring.Root cause analysis
Two complementary hypotheses, both plausible:
Hypothesis A: ring-view startup race
The Poll only waits for
CompactionRunsCompleted == 1.0, butCompactionRunsCompletedis incremented at the end ofcompactUsers()(compactor.go:826-840) as long as no errors occurred — even ifownedUsersis empty (non-owned userscontinueatcompactor.go:868-927, which is not an error). So a successful "I owned 0 users" cycle still trips the Poll.c2's ring view is built from two independent updates that race during startup:
ACTIVE+ tokens to the (shared) consul KV (ring/lifecycler.go:957-999).Ringobject pulls/watches KV and updatesringTokens(ring/ring.go:288-318).WaitRingStability(compactor.go:736) polls only once per second withWaitStabilityMinDuration=1s/MaxDuration=5s(set by the test at lines 1811-1812). On a slow arm64 runner, the second poll can capture the ring view at a moment when c2's tokens have just been merged in, leaving the in-memoryringInstanceByTokenmap momentarily skewed —c.ring.Getthen picks c1 for all 10 tenants for that cycle.Hypothesis B: stochastic token distribution
Lifecyclers use 512 random tokens (
pkg/compactor/compactor_ring.go:112-114) seeded withtime.Now().UnixNano()(pkg/ring/token_generator.go:43-45). With only 10 fixed tenant IDs and 2 instances × 512 tokens, there is a small but non-zero chance every tenant hash maps to c1. arm64 just happened to be the job where the random draw hit. Polling longer won't help — same ring → same answer.The two hypotheses are not mutually exclusive: a transient skew (A) can compound a borderline distribution (B).
Proposed fix (Hypothesis A path)
Poll on the actual condition (c2 owns ≥1 user), not the counter that ticks even on empty cycles.
Same change should be applied to the sibling
TestPartitionCompactor_DeleteLocalSyncFilesatpkg/compactor/compactor_paritioning_test.go:1624-1631— it has identical structure and is the same failure pattern.If Hypothesis B turns out to dominate (Poll times out), the fix is to keep generating tenants until c2 owns one (forcing the path the test cares about, which is
c1cleaning up dirs thatc2has taken over).Risk
Should not be zero, but was 0.os.ReadDir~10×/sec is harmless.compactor.go:730-741(e.g., require at least one observed change before declaring stability), but that's a much larger behavioural change in production code for a test-only flake.Environment
test-no-race (arm64)CI job (run 26555111751); theamd64job in the same run passed. arm64 runners are slower at scheduling/syscalls, widening the race window.Additional context
test-no-race (arm64) excerpt
Related: #7564 (sibling TestBlocksCleaner arm64 flake from the same CI run), #6724 (flaky-tests tracker).