Skip to content

Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — Should not be zero, but was 0 #7565

@sandy2008

Description

@sandy2008

AI Tool Usage Notice

Drafted with help from Claude Code. All code references, line numbers, failure excerpts, and reproduction steps were reviewed and validated against master before submitting.

Describe the bug

TestCompactor_DeleteLocalSyncFiles flakes on test-no-race (arm64) with:

--- FAIL: TestCompactor_DeleteLocalSyncFiles (2.86s)
    compactor_test.go:1855: 
        Error Trace: /__w/cortex/cortex/pkg/compactor/compactor_test.go:1855
        Error:      Should not be zero, but was 0
        Test:       TestCompactor_DeleteLocalSyncFiles

Observed in https://github.com/cortexproject/cortex/actions/runs/26555111751/job/78225222195 (PR #7563, which only modifies MAINTAINERS.md — confirming pre-existing flake). The test-no-race (amd64) job in the same run passed. A very similar TestPartitionCompactor_DeleteLocalSyncFiles failure showed up in run 26491543042 with the identical Should not be zero, but was 0 signature at compactor_paritioning_test.go:1631.

What line 1855 asserts

// pkg/compactor/compactor_test.go:1849-1855
cortex_testutil.Poll(t, 10*time.Second, 1.0, func() any {
    return prom_testutil.ToFloat64(c2.CompactionRunsCompleted)
})

// Let's check how many users second compactor has.
c2Users := len(c2.listTenantsWithMetaSyncDirectories())
require.NotZero(t, c2Users)

The test fixture creates 10 fixed tenants (user-1..user-10), starts c1, waits for one compaction run, then starts c2 and asserts c2 owns at least one tenant.

To Reproduce

Timing/scheduling-dependent flake; most reproducible on slower (arm64) runners.

  1. Run the test repeatedly:
    go test -run '^TestCompactor_DeleteLocalSyncFiles$' ./pkg/compactor/... -count=50
    
  2. Intermittently it fails at compactor_test.go:1855 with Should not be zero, but was 0: the second compactor (c2) completed a compaction cycle but owned 0 users that cycle. The sibling TestPartitionCompactor_DeleteLocalSyncFiles fails the same way at compactor_paritioning_test.go:1631.

Expected behavior

TestCompactor_DeleteLocalSyncFiles (and its partitioning sibling) pass consistently; the ownership assertion only runs once c2 has actually observed itself owning at least one user in the ring.

Root cause analysis

Two complementary hypotheses, both plausible:

Hypothesis A: ring-view startup race

The Poll only waits for CompactionRunsCompleted == 1.0, but CompactionRunsCompleted is incremented at the end of compactUsers() (compactor.go:826-840) as long as no errors occurred — even if ownedUsers is empty (non-owned users continue at compactor.go:868-927, which is not an error). So a successful "I owned 0 users" cycle still trips the Poll.

c2's ring view is built from two independent updates that race during startup:

  1. c2's lifecycler writes ACTIVE + tokens to the (shared) consul KV (ring/lifecycler.go:957-999).
  2. c2's Ring object pulls/watches KV and updates ringTokens (ring/ring.go:288-318).

WaitRingStability (compactor.go:736) polls only once per second with WaitStabilityMinDuration=1s / MaxDuration=5s (set by the test at lines 1811-1812). On a slow arm64 runner, the second poll can capture the ring view at a moment when c2's tokens have just been merged in, leaving the in-memory ringInstanceByToken map momentarily skewed — c.ring.Get then picks c1 for all 10 tenants for that cycle.

Hypothesis B: stochastic token distribution

Lifecyclers use 512 random tokens (pkg/compactor/compactor_ring.go:112-114) seeded with time.Now().UnixNano() (pkg/ring/token_generator.go:43-45). With only 10 fixed tenant IDs and 2 instances × 512 tokens, there is a small but non-zero chance every tenant hash maps to c1. arm64 just happened to be the job where the random draw hit. Polling longer won't help — same ring → same answer.

The two hypotheses are not mutually exclusive: a transient skew (A) can compound a borderline distribution (B).

Proposed fix (Hypothesis A path)

Poll on the actual condition (c2 owns ≥1 user), not the counter that ticks even on empty cycles.

--- a/pkg/compactor/compactor_test.go
+++ b/pkg/compactor/compactor_test.go
@@ -1846,11 +1846,16 @@ func TestCompactor_DeleteLocalSyncFiles(t *testing.T) {
 
 	// Now start second compactor, and wait until it runs compaction.
 	require.NoError(t, services.StartAndAwaitRunning(context.Background(), c2))
-	cortex_testutil.Poll(t, 10*time.Second, 1.0, func() any {
-		return prom_testutil.ToFloat64(c2.CompactionRunsCompleted)
+	// Wait until c2 has both completed a compaction cycle and observed itself
+	// owning at least one user in the ring. CompactionRunsCompleted increments
+	// even when no users were owned that cycle, so we must poll on the actual
+	// condition we depend on.
+	cortex_testutil.Poll(t, 30*time.Second, true, func() any {
+		return prom_testutil.ToFloat64(c2.CompactionRunsCompleted) >= 1 &&
+			len(c2.listTenantsWithMetaSyncDirectories()) > 0
 	})
 
-	// Let's check how many users second compactor has.
 	c2Users := len(c2.listTenantsWithMetaSyncDirectories())
 	require.NotZero(t, c2Users)

Same change should be applied to the sibling TestPartitionCompactor_DeleteLocalSyncFiles at pkg/compactor/compactor_paritioning_test.go:1624-1631 — it has identical structure and is the same failure pattern.

If Hypothesis B turns out to dominate (Poll times out), the fix is to keep generating tenants until c2 owns one (forcing the path the test cares about, which is c1 cleaning up dirs that c2 has taken over).

Risk

  • If the underlying race is actually that c2 never observes itself in the ring, the Poll will time out at 30s and surface a clearer error than the current opaque Should not be zero, but was 0.
  • The Poll now reads two values (counter + filesystem listing). os.ReadDir ~10×/sec is harmless.
  • We don't fix the underlying startup-time ring-view race; we just avoid sampling at a fragile moment. A more thorough fix would be in compactor.go:730-741 (e.g., require at least one observed change before declaring stability), but that's a much larger behavioural change in production code for a test-only flake.

Environment

  • Infrastructure: N/A — test-only flake. Observed on the test-no-race (arm64) CI job (run 26555111751); the amd64 job in the same run passed. arm64 runners are slower at scheduling/syscalls, widening the race window.
  • Deployment tool: N/A

Additional context

test-no-race (arm64) excerpt
--- FAIL: TestCompactor_DeleteLocalSyncFiles (2.86s)
    compactor_test.go:1855: 
        Error Trace: /__w/cortex/cortex/pkg/compactor/compactor_test.go:1855
        Error:      Should not be zero, but was 0
        Test:       TestCompactor_DeleteLocalSyncFiles
FAIL
FAIL	github.com/cortexproject/cortex/pkg/compactor	94.038s
make: *** [Makefile:220: test-no-race] Error 1

Related: #7564 (sibling TestBlocksCleaner arm64 flake from the same CI run), #6724 (flaky-tests tracker).

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions