
Fix data race on ChunkingStats in IndexFromFile #341

Merged
folbricht merged 2 commits into master from fix/make-indexfromfile-stats-race
May 17, 2026

@folbricht
Owner

Problem

go test -race on TestParallelChunking reliably reports a data race in IndexFromFile (make.go). It is a pre-existing latent bug in production code — the recently merged deprecation PR #340 only shifted goroutine timing enough to make it observable.

  • stats := ChunkingStats{} is a local in the main goroutine; its address is shared with every worker via pChunker.stats = &stats.
  • Worker goroutines call c.stats.incProduced() → atomic.AddUint64(&s.ChunksProduced, 1).
  • stop() is signal-only (once.Do(func(){ close(done) })); there is no join/WaitGroup anywhere. The consumer loop breaks early when a worker reports eof, leaving later workers' goroutines still running.
  • The main goroutine then returns stats by value — a non-atomic full-struct read concurrent with the still-running workers' atomic increments. That is the race the detector flags.

Beyond the race, the by-value snapshot was also non-deterministic: abandoned workers keep incrementing after the function returns, so ChunksProduced was whatever happened to be counted at that instant.

Fix

make.go only:

  • Track the worker goroutines with a sync.WaitGroup, spawned via the Go 1.25 (*sync.WaitGroup).Go helper (does Add(1)/Done() automatically).
  • On the post-spawn exit path, route both the mid-loop error case and the normal case through a single tail that cancel()s the context, stop()s every worker, and wg.Wait()s for all goroutines to fully exit before stats is copied into the return.
    • The wait must be inline before return — a deferred cleanup runs after the return value is already copied, so it cannot fix this.
    • Because the wait runs before the deferred per-worker f.Close() calls, it also closes a latent read-from-closed-file hazard for abandoned stragglers.
    • defer cancel() / defer w.stop() are kept as an idempotent panic safety net.
  • Pre-spawn early returns are unchanged (no goroutines exist yet). No signature change; cmd/desync/make.go (the only consumer that reads the returned stats) and all _-discarding callers are unaffected.
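The shape of the fix can be sketched in isolation. This is a minimal illustration, not the code in make.go: chunkingStats, countChunks and deferredTooLate are hypothetical names, and the worker body is a placeholder loop. It uses wg.Add(1)/Done() so it builds on older toolchains; on Go 1.25 the same spawn can be written with wg.Go.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// chunkingStats is a hypothetical stand-in for ChunkingStats.
type chunkingStats struct {
	ChunksProduced uint64
}

// countChunks starts n workers that bump the shared counter until signalled,
// then joins them before the struct is copied into the return value.
func countChunks(n int) chunkingStats {
	var (
		stats chunkingStats
		wg    sync.WaitGroup
		once  sync.Once
	)
	done := make(chan struct{})
	stop := func() { once.Do(func() { close(done) }) }
	defer stop() // idempotent safety net if a panic unwinds past this frame

	for i := 0; i < n; i++ {
		wg.Add(1) // Go 1.25+: wg.Go(func() { ... }) wraps Add/Done
		go func() {
			defer wg.Done()
			for {
				atomic.AddUint64(&stats.ChunksProduced, 1)
				select {
				case <-done:
					return
				default:
				}
			}
		}()
	}

	stop()       // signal only; workers may still be running here
	wg.Wait()    // join: no goroutine touches stats after this point
	return stats // plain struct copy is now safe and stable
}

// deferredTooLate shows why the wait cannot live in a defer: with an unnamed
// result, the return value is copied before deferred functions run.
func deferredTooLate() int {
	v := 1
	defer func() { v = 2 }()
	return v
}

func main() {
	fmt.Println(countChunks(4).ChunksProduced >= 4)
	fmt.Println(deferredTooLate())
}
```

Moving wg.Wait() into a defer would reintroduce the problem, because the copy made by return stats would again happen while workers are still running.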

Scope: only the stats race (the one -race flags). The c.err/c.eof cross-goroutine reads are already safe via the close(c.results) happens-before edge; the c.next list-collapse mutation is a separate theoretical concern intentionally left out to keep this minimal.
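The happens-before edge mentioned for c.err/c.eof can be shown with a small sketch. The worker type and its fields here are hypothetical and only mirror the names used above. The Go memory model guarantees that closing a channel is synchronized before a receive that returns because the channel is closed, so fields written before close are safe to read once the range loop ends.

```go
package main

import (
	"errors"
	"fmt"
)

// worker is a hypothetical stand-in for one parallel chunker.
type worker struct {
	results chan int
	err     error
	eof     bool
}

// run sends all items, records the final state in plain fields and only
// then closes the channel.
func (w *worker) run(items []int, fail bool) {
	for _, it := range items {
		w.results <- it
	}
	if fail {
		w.err = errors.New("read failed")
	} else {
		w.eof = true
	}
	close(w.results) // publishes the writes to err and eof
}

// drain reads err and eof only after the channel is observed closed, so no
// extra locking or atomics are needed for those fields.
func drain(w *worker) (sum int, eof bool, err error) {
	for v := range w.results {
		sum += v
	}
	return sum, w.eof, w.err
}

func main() {
	w := &worker{results: make(chan int)}
	go w.run([]int{1, 2, 3}, false)
	fmt.Println(drain(w))
}
```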

Test

Added TestIndexFromFileStats: drives the early-EOF/straggler path with null-heavy inputs across worker counts 2–8 and asserts the now race-free, deterministic invariants — ChunksAccepted == len(index.Chunks), ChunksProduced >= ChunksAccepted, ChunksProduced > 0. TestParallelChunking is kept as-is as the chunk-equivalence guard.
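The three invariants can be written as a standalone check. This is only a sketch of the assertions, assuming a hypothetical stats type with the two counters named above; the actual test calls IndexFromFile and uses testify/require.

```go
package main

import "fmt"

// stats mirrors only the two counters named in the description (hypothetical type).
type stats struct {
	ChunksProduced uint64
	ChunksAccepted uint64
}

// checkInvariants returns an error describing the first violated invariant,
// or nil if all three hold.
func checkInvariants(s stats, indexChunks int) error {
	switch {
	case s.ChunksAccepted != uint64(indexChunks):
		return fmt.Errorf("accepted %d chunks, index has %d", s.ChunksAccepted, indexChunks)
	case s.ChunksProduced < s.ChunksAccepted:
		return fmt.Errorf("produced %d < accepted %d", s.ChunksProduced, s.ChunksAccepted)
	case s.ChunksProduced == 0:
		return fmt.Errorf("no chunks produced")
	}
	return nil
}

func main() {
	fmt.Println(checkInvariants(stats{ChunksProduced: 5, ChunksAccepted: 3}, 3))
}
```

ChunksProduced may exceed ChunksAccepted because stragglers produce chunks that the consumer never takes once an earlier worker has reported EOF.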

Verification

  • go test -race -count=5 -run 'TestParallelChunking|TestIndexFromFileStats' . — no WARNING: DATA RACE (162s; previously reproduced within a few iterations).
  • go test -run TestIndexFromFileStats -count=20 . — deterministic, green.
  • Full go test -race . and go test ./cmd/desync — green.
  • go vet ./... clean, go build ./cmd/desync clean.

folbricht added 2 commits May 17, 2026 10:23
IndexFromFile shares &stats with every parallel chunk worker, which
updates it via atomic.AddUint64. Workers were only signalled to stop
(close(done)) and never joined: the consumer loop breaks early when a
worker reports EOF, leaving later workers' goroutines running. The
function then returned `stats` by value while those goroutines were
still atomically incrementing it -- a data race (flagged by -race in
TestParallelChunking) that also made ChunksProduced non-deterministic.

Track the worker goroutines with a sync.WaitGroup (using the Go 1.25
(*sync.WaitGroup).Go helper) and, on the post-spawn exit path, cancel
the context, stop every worker and wait for all goroutines to exit
before copying stats into the return value. The wait runs inline before
the return (deferred cleanup runs after the return value is already
copied, so it cannot fix this), and before the deferred per-worker
file Close calls, which also closes a latent read-from-closed-file
hazard for abandoned stragglers. defer cancel()/defer w.stop() are kept
as an idempotent panic safety net. No signature or behavior change for
callers.

Add TestIndexFromFileStats: exercises the early-EOF/straggler path with
null-heavy inputs across multiple worker counts and asserts the now
race-free, deterministic invariants (ChunksAccepted == len(index.Chunks),
ChunksProduced >= ChunksAccepted > 0).

Verified: go test -race -count=5 of TestParallelChunking and the new
test (no DATA RACE; previously reproduced within a few iterations),
TestIndexFromFileStats -count=20 without -race, full go test -race .
and go test ./cmd/desync all green.

Convert the new test's setup checks and stat invariants from manual
t.Fatal/t.Fatalf to github.com/stretchr/testify/require (already a
dependency, used elsewhere in the package). TestParallelChunking is
left as-is.
@folbricht folbricht merged commit 6479455 into master May 17, 2026
6 checks passed
@folbricht folbricht deleted the fix/make-indexfromfile-stats-race branch May 17, 2026 08:40