Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — `Should not be zero, but was 0`

**AI Tool Usage Notice**

Drafted with help from Claude Code. All code references, line numbers, failure excerpts, and reproduction steps were reviewed and validated against `master` before submitting.

## Describe the bug

`TestCompactor_DeleteLocalSyncFiles` flakes on `test-no-race (arm64)` with:

```
--- FAIL: TestCompactor_DeleteLocalSyncFiles (2.86s)
    compactor_test.go:1855: 
        Error Trace: /__w/cortex/cortex/pkg/compactor/compactor_test.go:1855
        Error:      Should not be zero, but was 0
        Test:       TestCompactor_DeleteLocalSyncFiles
```

Observed in https://github.com/cortexproject/cortex/actions/runs/26555111751/job/78225222195 (PR #7563, which only modifies `MAINTAINERS.md` — confirming pre-existing flake). The `test-no-race (amd64)` job in the same run passed. A very similar `TestPartitionCompactor_DeleteLocalSyncFiles` failure showed up in run 26491543042 with the identical `Should not be zero, but was 0` signature at `compactor_paritioning_test.go:1631`.

## What line 1855 asserts

```go
// pkg/compactor/compactor_test.go:1849-1855
cortex_testutil.Poll(t, 10*time.Second, 1.0, func() any {
    return prom_testutil.ToFloat64(c2.CompactionRunsCompleted)
})

// Let's check how many users second compactor has.
c2Users := len(c2.listTenantsWithMetaSyncDirectories())
require.NotZero(t, c2Users)
```

The test fixture creates 10 fixed tenants (`user-1`..`user-10`), starts `c1`, waits for one compaction run, then starts `c2` and asserts `c2` owns at least one tenant.

## To Reproduce

Timing/scheduling-dependent flake; most reproducible on slower (arm64) runners.

1. Run the test repeatedly:
   ```
   go test -run '^TestCompactor_DeleteLocalSyncFiles$' ./pkg/compactor/... -count=50
   ```
2. Intermittently it fails at `compactor_test.go:1855` with `Should not be zero, but was 0`: the second compactor (`c2`) completed a compaction cycle but owned 0 users that cycle. The sibling `TestPartitionCompactor_DeleteLocalSyncFiles` fails the same way at `compactor_paritioning_test.go:1631`.

## Expected behavior

`TestCompactor_DeleteLocalSyncFiles` (and its partitioning sibling) pass consistently; the ownership assertion only runs once `c2` has actually observed itself owning at least one user in the ring.

## Root cause analysis

Two complementary hypotheses, both plausible:

### Hypothesis A: ring-view startup race

The Poll only waits for `CompactionRunsCompleted == 1.0`, but `CompactionRunsCompleted` is incremented at the end of `compactUsers()` (`compactor.go:826-840`) as long as no errors occurred — even if `ownedUsers` is empty (non-owned users `continue` at `compactor.go:868-927`, which is not an error). So a successful "I owned 0 users" cycle still trips the Poll.

c2's ring view is built from two independent updates that race during startup:
1. c2's lifecycler writes `ACTIVE` + tokens to the (shared) consul KV (`ring/lifecycler.go:957-999`).
2. c2's `Ring` object pulls/watches KV and updates `ringTokens` (`ring/ring.go:288-318`).

`WaitRingStability` (`compactor.go:736`) polls only once per second with `WaitStabilityMinDuration=1s` / `MaxDuration=5s` (set by the test at lines 1811-1812). On a slow arm64 runner, the second poll can capture the ring view at a moment when c2's tokens have just been merged in, leaving the in-memory `ringInstanceByToken` map momentarily skewed — `c.ring.Get` then picks c1 for all 10 tenants for that cycle.

### Hypothesis B: stochastic token distribution

Lifecyclers use 512 random tokens (`pkg/compactor/compactor_ring.go:112-114`) seeded with `time.Now().UnixNano()` (`pkg/ring/token_generator.go:43-45`). With only 10 fixed tenant IDs and 2 instances × 512 tokens, there is a small but non-zero chance every tenant hash maps to c1. arm64 just happened to be the job where the random draw hit. Polling longer won't help — same ring → same answer.

The two hypotheses are not mutually exclusive: a transient skew (A) can compound a borderline distribution (B).

## Proposed fix (Hypothesis A path)

Poll on the actual condition (c2 owns ≥1 user), not the counter that ticks even on empty cycles.

```diff
--- a/pkg/compactor/compactor_test.go
+++ b/pkg/compactor/compactor_test.go
@@ -1846,11 +1846,16 @@ func TestCompactor_DeleteLocalSyncFiles(t *testing.T) {
 
 	// Now start second compactor, and wait until it runs compaction.
 	require.NoError(t, services.StartAndAwaitRunning(context.Background(), c2))
-	cortex_testutil.Poll(t, 10*time.Second, 1.0, func() any {
-		return prom_testutil.ToFloat64(c2.CompactionRunsCompleted)
+	// Wait until c2 has both completed a compaction cycle and observed itself
+	// owning at least one user in the ring. CompactionRunsCompleted increments
+	// even when no users were owned that cycle, so we must poll on the actual
+	// condition we depend on.
+	cortex_testutil.Poll(t, 30*time.Second, true, func() any {
+		return prom_testutil.ToFloat64(c2.CompactionRunsCompleted) >= 1 &&
+			len(c2.listTenantsWithMetaSyncDirectories()) > 0
 	})
 
-	// Let's check how many users second compactor has.
 	c2Users := len(c2.listTenantsWithMetaSyncDirectories())
 	require.NotZero(t, c2Users)
```

Same change should be applied to the sibling `TestPartitionCompactor_DeleteLocalSyncFiles` at `pkg/compactor/compactor_paritioning_test.go:1624-1631` — it has identical structure and is the same failure pattern.

If Hypothesis B turns out to dominate (Poll times out), the fix is to keep generating tenants until c2 owns one (forcing the path the test cares about, which is `c1` cleaning up dirs that `c2` has taken over).

## Risk

- If the underlying race is actually that c2 *never* observes itself in the ring, the Poll will time out at 30s and surface a clearer error than the current opaque `Should not be zero, but was 0`.
- The Poll now reads two values (counter + filesystem listing). `os.ReadDir` ~10×/sec is harmless.
- We don't fix the underlying startup-time ring-view race; we just avoid sampling at a fragile moment. A more thorough fix would be in `compactor.go:730-741` (e.g., require at least one observed change before declaring stability), but that's a much larger behavioural change in production code for a test-only flake.

## Environment

- Infrastructure: N/A — test-only flake. Observed on the `test-no-race (arm64)` CI job (run 26555111751); the `amd64` job in the same run passed. arm64 runners are slower at scheduling/syscalls, widening the race window.
- Deployment tool: N/A

## Additional context

<details>
<summary>test-no-race (arm64) excerpt</summary>

```
--- FAIL: TestCompactor_DeleteLocalSyncFiles (2.86s)
    compactor_test.go:1855: 
        Error Trace: /__w/cortex/cortex/pkg/compactor/compactor_test.go:1855
        Error:      Should not be zero, but was 0
        Test:       TestCompactor_DeleteLocalSyncFiles
FAIL
FAIL	github.com/cortexproject/cortex/pkg/compactor	94.038s
make: *** [Makefile:220: test-no-race] Error 1
```
</details>

Related: #7564 (sibling TestBlocksCleaner arm64 flake from the same CI run), #6724 (flaky-tests tracker).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — `Should not be zero, but was 0` #7565

Describe the bug

What line 1855 asserts

To Reproduce

Expected behavior

Root cause analysis

Hypothesis A: ring-view startup race

Hypothesis B: stochastic token distribution

Proposed fix (Hypothesis A path)

Risk

Environment

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — Should not be zero, but was 0 #7565

Description

Describe the bug

What line 1855 asserts

To Reproduce

Expected behavior

Root cause analysis

Hypothesis A: ring-view startup race

Hypothesis B: stochastic token distribution

Proposed fix (Hypothesis A path)

Risk

Environment

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — `Should not be zero, but was 0` #7565