Fix data race on ChunkingStats in IndexFromFile#341
Merged
Conversation
IndexFromFile shares &stats with every parallel chunk worker, which update it via atomic.AddUint64. Workers were only signalled to stop (close(done)) and never joined: the consumer loop breaks early when a worker reports EOF, leaving later workers' goroutines running. The function then returned `stats` by value while those goroutines were still atomically incrementing it -- a data race (flagged by -race in TestParallelChunking) that also made ChunksProduced non-deterministic. Track the worker goroutines with a sync.WaitGroup (using the Go 1.25 (*sync.WaitGroup).Go helper) and, on the post-spawn exit path, cancel the context, stop every worker and wait for all goroutines to exit before copying stats into the return value. The wait runs inline before the return (deferred cleanup runs after the return value is already copied, so it cannot fix this), and before the deferred per-worker file Close calls, which also closes a latent read-from-closed-file hazard for abandoned stragglers. defer cancel()/defer w.stop() are kept as an idempotent panic safety net. No signature or behavior change for callers. Add TestIndexFromFileStats: exercises the early-EOF/straggler path with null-heavy inputs across multiple worker counts and asserts the now race-free, deterministic invariants (ChunksAccepted == len(index.Chunks), ChunksProduced >= ChunksAccepted > 0). Verified: go test -race -count=5 of TestParallelChunking and the new test (no DATA RACE; previously reproduced within a few iterations), TestIndexFromFileStats -count=20 without -race, full go test -race . and go test ./cmd/desync all green.
Convert the new test's setup checks and stat invariants from manual t.Fatal/t.Fatalf to github.com/stretchr/testify/require (already a dependency, used elsewhere in the package). TestParallelChunking is left as-is.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
go test -raceonTestParallelChunkingreliably reports a data race inIndexFromFile(make.go). It is a pre-existing latent bug in production code — the recently merged deprecation PR #340 only shifted goroutine timing enough to make it observable.stats := ChunkingStats{}is a local in the main goroutine; its address is shared with every worker viapChunker.stats = &stats.c.stats.incProduced()→atomic.AddUint64(&s.ChunksProduced, 1).stop()is signal-only (once.Do(func(){ close(done) })); there is no join/WaitGroupanywhere. The consumer loopbreaks early when a worker reportseof, leaving later workers' goroutines still running.statsby value — a non-atomic full-struct read concurrent with the still-running workers' atomic increments. That is the race the detector flags.Beyond the race, the by-value snapshot was also non-deterministic: abandoned workers keep incrementing after the function returns, so
ChunksProducedwas whatever happened to be counted at that instant.Fix
make.goonly:sync.WaitGroup, spawned via the Go 1.25(*sync.WaitGroup).Gohelper (doesAdd(1)/Done()automatically).cancel()s the context,stop()s every worker, andwg.Wait()s for all goroutines to fully exit beforestatsis copied into the return.return— a deferred cleanup runs after the return value is already copied, so it cannot fix this.f.Close()calls, it also closes a latent read-from-closed-file hazard for abandoned stragglers.defer cancel()/defer w.stop()are kept as an idempotent panic safety net.cmd/desync/make.go(the only consumer that reads the returned stats) and all_-discarding callers are unaffected.Scope: only the stats race (the one
-raceflags). Thec.err/c.eofcross-goroutine reads are already safe via theclose(c.results)happens-before edge; thec.nextlist-collapse mutation is a separate theoretical concern intentionally left out to keep this minimal.Test
Added
TestIndexFromFileStats: drives the early-EOF/straggler path with null-heavy inputs across worker counts 2–8 and asserts the now race-free, deterministic invariants —ChunksAccepted == len(index.Chunks),ChunksProduced >= ChunksAccepted,ChunksProduced > 0.TestParallelChunkingis kept as-is as the chunk-equivalence guard.Verification
go test -race -count=5 -run 'TestParallelChunking|TestIndexFromFileStats' .— noWARNING: DATA RACE(162s; previously reproduced within a few iterations).go test -run TestIndexFromFileStats -count=20 .— deterministic, green.go test -race .andgo test ./cmd/desync— green.go vet ./...clean,go build ./cmd/desyncclean.