AI Tool Usage Notice
Drafted with help from Claude Code. All code references, line numbers, failure excerpts, and reproduction steps were reviewed and validated against master before submitting.
Describe the bug
TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s flakes on test-no-race (arm64) with:
--- FAIL: TestBlocksCleaner (0.00s)
--- FAIL: TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s (3.92s)
objstore.go:26:
Error Trace: pkg/util/testutil/objstore.go:26
/usr/local/go/src/testing/testing.go:1317 (t.Cleanup)
/usr/local/go/src/testing/testing.go:1667 (tRunner)
/usr/local/go/src/testing/testing.go:2030
Error: Received unexpected error:
unlinkat /tmp/bucket3527476943: directory not empty
Observed in https://github.com/cortexproject/cortex/actions/runs/26555111751/job/78225222195 (PR #7563, which only modifies MAINTAINERS.md — confirming pre-existing flake). The test-no-race (amd64) job in the same run passed.
To Reproduce
Timing-dependent flake; most reproducible on slower (arm64) runners.
- Run the affected test repeatedly:
go test -run '^TestBlocksCleaner$' ./pkg/compactor/... -count=50
- Intermittently, the concurrency=2, markers_migration_enabled=false, tenant_deletion_delay=0s subtest fails in t.Cleanup with unlinkat /tmp/bucketXXXXXX: directory not empty, because a HeartBeat tail Upload/Delete (running on context.Background()) races os.RemoveAll of the test bucket directory.
Expected behavior
TestBlocksCleaner passes consistently; the per-user HeartBeat goroutine fully exits — including its tail bucket I/O — before the test cleanup removes the storage directory.
Root cause analysis
BlocksCleaner.cleanUpActiveUsers (pkg/compactor/blocks_cleaner.go:359-363) and cleanDeletedUsers (pkg/compactor/blocks_cleaner.go:394-398) launch a VisitMarkerManager.HeartBeat goroutine per user but only signal completion via defer func() { errChan <- nil }(). They never wait for the goroutine itself to finish.
After receiving the signal, HeartBeat still performs two bucket I/O calls on the way out (pkg/compactor/visit_marker.go:81, 96-99):
- MarkWithStatus(ctx, Completed): an Upload that does MkdirAll + os.Create
- DeleteVisitMarker(context.Background()): runs because deleteOnExit=true
Once the test body returns, services.StopAndAwaitTerminated only cancels the loop context; the heartbeat tail uses context.Background(), so it keeps writing into the bucket directory. t.Cleanup's os.RemoveAll(storageDir) in pkg/util/testutil/objstore.go:25-27 then races with the Upload, producing unlinkat ... directory not empty whenever a MkdirAll/Create recreates a subtree between RemoveAll's directory scan and its final unlink.
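The gap between "signalled" and "exited" can be illustrated with a minimal, self-contained sketch. It is not Cortex code: runSignalOnly, release, and tailDone are hypothetical stand-ins, and the release channel simply models bucket I/O that is still pending when the callback returns.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// runSignalOnly mimics the current callback shape: it signals the heartbeat
// goroutine to stop but never waits for it to exit. All names are
// hypothetical stand-ins, not Cortex identifiers.
// It returns a flag that becomes true once the tail work has run, plus a
// function that lets the tail work proceed and blocks until it has finished.
func runSignalOnly() (tailDone *atomic.Bool, finish func()) {
	tailDone = new(atomic.Bool)
	errChan := make(chan error, 1)
	release := make(chan struct{}) // models slow bucket I/O
	exited := make(chan struct{})

	go func() {
		defer close(exited)
		<-errChan            // stop signal sent by the callback
		<-release            // tail I/O is still pending here
		tailDone.Store(true) // models the final Upload/Delete
	}()

	// The per-user callback: signal on the way out, no join.
	func() {
		defer func() { errChan <- nil }()
		// ... per-user cleanup work would run here ...
	}()

	finish = func() {
		close(release)
		<-exited
	}
	return tailDone, finish
}

func main() {
	tailDone, finish := runSignalOnly()
	fmt.Println("tail finished when callback returned:", tailDone.Load())
	finish()
	fmt.Println("tail finished after explicit wait:", tailDone.Load())
}
```

In this sketch the callback always returns while the tail work is outstanding, which is the window in which the test's os.RemoveAll can run. In the real code the window is only open some of the time, hence the flake.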
Why arm64 hits it more
The GitHub Actions arm64 runners are noticeably slower at filesystem syscalls and goroutine scheduling than amd64. With concurrency=2, two heartbeats can be in flight when the user callback returns, which increases the chance that a tail Upload/Delete is still pending when t.Cleanup fires. The same race exists on amd64; it just rarely manifests there.
Proposed fix
Wait for HeartBeat to fully exit before letting the per-user callback return.
--- a/pkg/compactor/blocks_cleaner.go
+++ b/pkg/compactor/blocks_cleaner.go
@@ pkg/compactor/blocks_cleaner.go ~349-365
return concurrency.ForEachUser(ctx, users, c.cfg.CleanupConcurrency, func(ctx context.Context, userID string) error {
// ...
errChan := make(chan error, 1)
- go visitMarkerManager.HeartBeat(ctx, errChan, c.cleanerVisitMarkerFileUpdateInterval, true)
- defer func() {
- errChan <- nil
- }()
+ var hbWG sync.WaitGroup
+ hbWG.Add(1)
+ go func() {
+ defer hbWG.Done()
+ visitMarkerManager.HeartBeat(ctx, errChan, c.cleanerVisitMarkerFileUpdateInterval, true)
+ }()
+ defer func() {
+ errChan <- nil
+ hbWG.Wait()
+ }()
return errors.Wrapf(c.cleanUser(ctx, userLogger, userBucket, userID, firstRun), "failed to delete blocks for user: %s", userID)
})
The same change applies to cleanDeletedUsers (blocks_cleaner.go:394-399). sync is already imported.
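The signal-then-join shape of the fix can be checked in isolation with a small sketch. The names heartBeat, cleanUserWithJoin, demo, and the visit-marker.json file are hypothetical stand-ins rather than the Cortex implementation, and the 20 ms sleep is an assumed model of slow tail I/O.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// heartBeat is a hypothetical stand-in for the real heartbeat loop: it waits
// for the stop signal, then performs slow tail I/O inside dir.
func heartBeat(dir string, errChan <-chan error) {
	<-errChan
	time.Sleep(20 * time.Millisecond) // assumed model of slow tail I/O
	_ = os.WriteFile(filepath.Join(dir, "visit-marker.json"), []byte("{}"), 0o644)
}

// cleanUserWithJoin signals the heartbeat and then waits for the goroutine
// to exit, so no tail I/O can outlive the callback.
func cleanUserWithJoin(dir string) error {
	errChan := make(chan error, 1)
	var hbWG sync.WaitGroup
	hbWG.Add(1)
	go func() {
		defer hbWG.Done()
		heartBeat(dir, errChan)
	}()
	defer func() {
		errChan <- nil // signal first ...
		hbWG.Wait()    // ... then join
	}()
	return nil // per-user cleanup work would run here
}

// demo runs the joined callback against a fresh temp directory and reports
// whether the tail write had landed on return and whether removal succeeded.
func demo() (markerWritten bool, removeErr error) {
	dir, err := os.MkdirTemp("", "bucket")
	if err != nil {
		return false, err
	}
	if err := cleanUserWithJoin(dir); err != nil {
		_ = os.RemoveAll(dir)
		return false, err
	}
	_, statErr := os.Stat(filepath.Join(dir, "visit-marker.json"))
	return statErr == nil, os.RemoveAll(dir)
}

func main() {
	written, err := demo()
	fmt.Println("marker written before return:", written, "RemoveAll error:", err)
}
```

Because the deferred function joins after signalling, the tail write has always landed by the time the callback returns, so nothing is left to recreate files underneath a later os.RemoveAll. The errChan buffer of 1 matters: the send cannot block even if the goroutine has already exited for another reason.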
Environment
- Infrastructure: N/A (test-only flake). Observed on the test-no-race (arm64) CI job (run 26555111751); the amd64 job in the same run passed.
- Deployment tool: N/A
Additional context
#7386 (OPEN, stalled) — "Fix flaky TestBlocksCleaner by awaiting HeartBeat goroutine completion" — proposes essentially this fix. CHANGES_REQUESTED by @friedrichg, last update 2026-05-14. Resolving feedback there would close this issue. Also see #6894 (different subtest, fixed by #7486 via in-memory bucket).
Full failure log
test-no-race (arm64) excerpt
--- FAIL: TestBlocksCleaner (0.00s)
--- FAIL: TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s (3.92s)
objstore.go:26:
Error Trace: /__w/cortex/cortex/pkg/util/testutil/objstore.go:26
/usr/local/go/src/testing/testing.go:1317
/usr/local/go/src/testing/testing.go:1667
/usr/local/go/src/testing/testing.go:2030
Error: Received unexpected error:
unlinkat /tmp/bucket3527476943: directory not empty
Test: TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s
FAIL
FAIL github.com/cortexproject/cortex/pkg/compactor 94.038s