Flaky test: TestBlocksCleaner (arm64) — unlinkat: directory not empty #7564

@sandy2008

AI Tool Usage Notice

Drafted with help from Claude Code. All code references, line numbers, failure excerpts, and reproduction steps were reviewed and validated against master before submitting.

Describe the bug

TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s flakes on test-no-race (arm64) with:

--- FAIL: TestBlocksCleaner (0.00s)
    --- FAIL: TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s (3.92s)
        objstore.go:26: 
            Error Trace: pkg/util/testutil/objstore.go:26
                         /usr/local/go/src/testing/testing.go:1317 (t.Cleanup)
                         /usr/local/go/src/testing/testing.go:1667 (tRunner)
                         /usr/local/go/src/testing/testing.go:2030
            Error:      Received unexpected error:
                        unlinkat /tmp/bucket3527476943: directory not empty

Observed in https://github.com/cortexproject/cortex/actions/runs/26555111751/job/78225222195. The triggering PR, #7563, only modifies MAINTAINERS.md, which confirms the flake is pre-existing. The test-no-race (amd64) job in the same run passed.

To Reproduce

Timing-dependent flake; most reproducible on slower (arm64) runners.

  1. Run the test repeatedly (the pattern below runs every TestBlocksCleaner subtest, including the affected one):
    go test -run '^TestBlocksCleaner$' ./pkg/compactor/... -count=50
    
  2. Intermittently the concurrency=2, markers_migration_enabled=false, tenant_deletion_delay=0s subtest fails in t.Cleanup with unlinkat /tmp/bucketXXXXXX: directory not empty, because a HeartBeat tail Upload/Delete (running on context.Background()) races os.RemoveAll of the test bucket directory.

Expected behavior

TestBlocksCleaner passes consistently; the per-user HeartBeat goroutine fully exits — including its tail bucket I/O — before the test cleanup removes the storage directory.

Root cause analysis

BlocksCleaner.cleanUpActiveUsers (pkg/compactor/blocks_cleaner.go:359-363) and cleanDeletedUsers (pkg/compactor/blocks_cleaner.go:394-398) launch a VisitMarkerManager.HeartBeat goroutine per user but only signal completion via defer func() { errChan <- nil }(). They never wait for the goroutine itself to finish.

After receiving the signal, HeartBeat still performs two bucket I/O calls on the way out (pkg/compactor/visit_marker.go:81, 96-99):

  1. MarkWithStatus(ctx, Completed) — an Upload that does MkdirAll+os.Create
  2. DeleteVisitMarker(context.Background()) — because deleteOnExit=true

Once the test body returns, services.StopAndAwaitTerminated only cancels the loop context; the heartbeat tail uses context.Background(), so it keeps writing into the bucket directory. t.Cleanup's os.RemoveAll(storageDir) in pkg/util/testutil/objstore.go:25-27 then races with the Upload, producing unlinkat ... directory not empty whenever a MkdirAll/Create recreates a subtree between RemoveAll's directory scan and its final unlink.
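
The gap between "signal sent" and "goroutine finished" can be shown in isolation. The following is a minimal sketch, not Cortex code: heartbeat, release, and signalWithoutWaiting are hypothetical names, and the release channel merely stands in for slow tail I/O. Because the signal channel is buffered, the send returns while the tail work is still pending.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// heartbeat is a hypothetical stand-in for a HeartBeat-style loop: it blocks
// until it receives the completion signal, then performs "tail" work before
// returning.
func heartbeat(stop <-chan error, release <-chan struct{}, tailDone *atomic.Bool, exited chan<- struct{}) {
	<-stop    // completion signal from the caller
	<-release // stands in for slow tail I/O (upload, then delete)
	tailDone.Store(true)
	close(exited)
}

// signalWithoutWaiting sends the completion signal and reports whether the
// tail work had finished by the time the send returned.
func signalWithoutWaiting() bool {
	stop := make(chan error, 1) // buffered, so the send never blocks
	release := make(chan struct{})
	exited := make(chan struct{})
	var tailDone atomic.Bool

	go heartbeat(stop, release, &tailDone, exited)

	stop <- nil
	// The tail cannot have run yet: it is still blocked on release.
	finished := tailDone.Load()

	// Let the tail finish and wait for it so the goroutine does not leak.
	close(release)
	<-exited
	return finished
}

func main() {
	fmt.Println("tail finished when signal returned:", signalWithoutWaiting())
	// prints "tail finished when signal returned: false"
}
```

In the real test the window between the send and the tail I/O is whatever the scheduler and filesystem make it, which is why the failure is intermittent rather than constant.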

Why arm64 hits it more

The GitHub Actions arm64 runners are noticeably slower at filesystem syscalls and goroutine scheduling than amd64. With concurrency=2, two heartbeats can be in flight when the user callback returns, doubling the chance that a tail Upload/Delete is still pending when t.Cleanup fires. The same race exists on amd64; it is just rarely lost there.

Proposed fix

Wait for HeartBeat to fully exit before letting the per-user callback return.

--- a/pkg/compactor/blocks_cleaner.go
+++ b/pkg/compactor/blocks_cleaner.go
@@ pkg/compactor/blocks_cleaner.go ~349-365
 	return concurrency.ForEachUser(ctx, users, c.cfg.CleanupConcurrency, func(ctx context.Context, userID string) error {
 		// ...
 		errChan := make(chan error, 1)
-		go visitMarkerManager.HeartBeat(ctx, errChan, c.cleanerVisitMarkerFileUpdateInterval, true)
-		defer func() {
-			errChan <- nil
-		}()
+		var hbWG sync.WaitGroup
+		hbWG.Add(1)
+		go func() {
+			defer hbWG.Done()
+			visitMarkerManager.HeartBeat(ctx, errChan, c.cleanerVisitMarkerFileUpdateInterval, true)
+		}()
+		defer func() {
+			errChan <- nil
+			hbWG.Wait()
+		}()
 		return errors.Wrapf(c.cleanUser(ctx, userLogger, userBucket, userID, firstRun), "failed to delete blocks for user: %s", userID)
 	})

The same change applies to cleanDeletedUsers (blocks_cleaner.go:394-399). sync is already imported.
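
The signal-then-wait pattern from the diff can also be exercised outside Cortex. This is a self-contained sketch under stated assumptions: tailIO and cleanAfterHeartbeat are hypothetical helpers, the marker path is made up, and a temporary directory stands in for the filesystem bucket. The point it illustrates is that os.RemoveAll only runs after the WaitGroup confirms the goroutine's tail I/O is done.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// tailIO is a hypothetical stand-in for the heartbeat's exit path: it waits
// for the completion signal, writes a marker file, then deletes it.
func tailIO(dir string, stop <-chan error) error {
	<-stop
	marker := filepath.Join(dir, "user-1", "visit-marker.json") // made-up path
	if err := os.MkdirAll(filepath.Dir(marker), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(marker, []byte(`{"status":"completed"}`), 0o644); err != nil {
		return err
	}
	return os.Remove(marker)
}

// cleanAfterHeartbeat signals the goroutine, waits for it to exit, and only
// then removes the directory. It reports whether the directory is gone.
func cleanAfterHeartbeat() (removed bool, err error) {
	dir, err := os.MkdirTemp("", "bucket")
	if err != nil {
		return false, err
	}

	stop := make(chan error, 1)
	var wg sync.WaitGroup
	var tailErr error
	wg.Add(1)
	go func() {
		defer wg.Done()
		tailErr = tailIO(dir, stop)
	}()

	stop <- nil
	wg.Wait() // tail I/O has finished; reading tailErr is now safe

	if tailErr != nil {
		os.RemoveAll(dir)
		return false, tailErr
	}
	if err := os.RemoveAll(dir); err != nil {
		return false, err
	}
	_, statErr := os.Stat(dir)
	return os.IsNotExist(statErr), nil
}

func main() {
	removed, err := cleanAfterHeartbeat()
	fmt.Println("removed:", removed, "err:", err)
}
```

Dropping the wg.Wait() call from this sketch would reintroduce the same ordering problem described in the root cause analysis, since nothing would then order the marker write before RemoveAll.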

Environment

  • Infrastructure: N/A — test-only flake. Observed on the test-no-race (arm64) CI job (run 26555111751); the amd64 job in the same run passed.
  • Deployment tool: N/A

Additional context

#7386 (OPEN, stalled) — "Fix flaky TestBlocksCleaner by awaiting HeartBeat goroutine completion" — proposes essentially this fix. CHANGES_REQUESTED by @friedrichg, last update 2026-05-14. Resolving feedback there would close this issue. Also see #6894 (different subtest, fixed by #7486 via in-memory bucket).

Full failure log

test-no-race (arm64) excerpt
--- FAIL: TestBlocksCleaner (0.00s)
    --- FAIL: TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s (3.92s)
        objstore.go:26: 
            Error Trace: /__w/cortex/cortex/pkg/util/testutil/objstore.go:26
                         /usr/local/go/src/testing/testing.go:1317
                         /usr/local/go/src/testing/testing.go:1667
                         /usr/local/go/src/testing/testing.go:2030
            Error:      Received unexpected error:
                        unlinkat /tmp/bucket3527476943: directory not empty
            Test:       TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s
FAIL
FAIL	github.com/cortexproject/cortex/pkg/compactor	94.038s
