Skip to content

Fix 14 bugs: panics, data races, incorrect output, and dead code cleanup#150

Merged
cep21 merged 32 commits intomasterfrom
bug-fixes
Feb 19, 2026
Merged

Fix 14 bugs: panics, data races, incorrect output, and dead code cleanup#150
cep21 merged 32 commits intomasterfrom
bug-fixes

Conversation

@cep21
Copy link
Copy Markdown
Owner

@cep21 cep21 commented Feb 19, 2026

Summary

This PR fixes 14 bugs found through deep analysis of the circuit breaker library, ranging from runtime panics and data races to incorrect metric output and unbounded recursion. All fixes are backwards-compatible and include regression tests. Benchmarks show no performance regression.

19 files changed, 332 insertions, 56 deletions

Bug Fixes

Panics & Crashes

  1. close(nil) panic in waitForErrors (gowrapper.go)
    When skipCatchPanics is true, panicResults is nil. The deferred close(panicResults) panicked. Added a nil guard.

  2. SimpleBadRequest.Error() panics if Err is nil (errors.go)
    Calling .Error() on SimpleBadRequest{Err: nil} caused a nil pointer dereference. Now returns "bad request".

  3. MetricEventStream.Close() panics on double-close (metriceventstream/metriceventstream.go)
    Closing the event stream twice panicked on close(m.closeChan). Guarded with sync.Once.

  4. RollingPercentile.AddDuration panics on pre-start times (faststats/rolling_percentile.go)
    Adding a duration with a timestamp before StartTime produced a negative index, causing an out-of-bounds panic. Added idx < 0 guard.

Correctness

  1. SortedDurations.Var() wrong percentile scale (faststats/rolling_percentile.go)
    Var() called Percentile(.25) etc, but Percentile() expects [0-100], not [0-1]. This made p25/p50/p90/p99 in expvar output nearly identical to the minimum value. Fixed callers to use 25, 50, 90, 99.

  2. RollingBuckets.Advance backward-time guard was dead code (faststats/rolling_bucket.go)
    The condition indexDiff >= r.NumBuckets was always false when indexDiff < 0. Fixed to -indexDiff >= r.NumBuckets.

  3. OpenCircuit used time.Now() instead of c.now() (circuit.go)
    OpenCircuit ignored the configured TimeKeeper.Now, breaking mock-time testing. Changed to c.now().

  4. Double now() call at timeout boundary (circuit.go)
    Two separate now() calls (endTime and runFuncDoneTime) could straddle a timeout boundary, causing a spurious timeout even though the function completed in time. Collapsed to a single call.

  5. Double read of MaxConcurrentRequests in throttle check (circuit.go)
    MaxConcurrentRequests.Get() was called twice in the same condition, allowing a TOCTOU race where the value could change between reads. Read once into a local variable.

  6. MockClock.AfterFunc returned nil (internal/clock/clock.go)
    Callers calling .Stop() on the returned *time.Timer would panic. Now returns a stopped real timer.

Concurrency

  1. Factory closures mutated shared captured config (closers/hystrix/closer.go, closers/hystrix/opener.go)
    CloserFactory and OpenerFactory called config.Merge() on the captured config value, mutating it across concurrent invocations. Fixed by copying config before merging.

  2. RollingBuckets.Advance unbounded recursion under contention (faststats/rolling_bucket.go)
    Two recursive return r.Advance(now, clearBucket) calls could stack overflow under high concurrency. Converted to an iterative loop.

Code Quality (from review)

  1. MockClock.AfterFunc deadlock when d==0 — called f() while holding m.mu. Released lock first.
  2. Dead cancelled field in timedCallbacks — never set to true. Removed along with the check.
  3. Dead test scaffolding — removed unused type declarations and variable assignments in TestOpenCircuitUsesMockTime.
  4. Various minor fixes — import ordering, extra blank lines, comment prefixes, pointer receivers, documentation.

What is NOT in this PR

  • Prevent() metric gap: When ClosedToOpen.Prevent() blocks a request, no metric is emitted. An earlier version of this PR emitted ErrShortCircuit, but this is semantically wrong — Prevent operates on a closed circuit, while ErrShortCircuit implies the circuit is open. Properly fixing this requires adding ErrPrevented to the RunMetrics interface, which is a breaking change for a future major version.

Testing

  • All existing tests pass
  • Each bug fix includes a new regression test that fails without the fix
  • go test ./... -race passes with no races detected
  • Benchmarks show no significant regression (zero-allocation profile unchanged)

cep21 added 30 commits February 19, 2026 16:56
Prevent() blocks requests while the circuit is closed, so emitting
ErrShortCircuit is semantically wrong — it implies the circuit is open.
Properly addressing this metric gap requires a new ErrPrevented method
on the RunMetrics interface, which is a breaking change for a future major version.
@cep21 cep21 merged commit addec5c into master Feb 19, 2026
6 checks passed
@cep21 cep21 deleted the bug-fixes branch February 19, 2026 19:48
cep21 added a commit that referenced this pull request Feb 26, 2026
- fix: use CAS in openCircuit/close to emit Opened()/Closed() exactly once
- fix: copy config in ConsecutiveErrOpenerFactory closure (race hygiene, matches PR #150)
- fix: read Fallback.MaxConcurrentRequests once in throttle check (matches PR #150)
- fix: use nil-safe props.now() in Opener.SetConfigNotThreadSafe
- fix: guard NumBuckets==0 in RollingCounter.GetBuckets (fixes String() panic)
- fix: nil-check pointer fields in RollingCounter.UnmarshalJSON
- fix: MockClock.AfterFunc fires immediately for d<=0 (matches real time.AfterFunc)

All fixes are backwards-compatible (panic->works, race->clean, or strictly
additive guards). Regression tests added for each.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cep21 added a commit that referenced this pull request Feb 27, 2026
* Fix 7 bugs: TOCTOU races, panics, and missed cases from PR #150

- fix: use CAS in openCircuit/close to emit Opened()/Closed() exactly once
- fix: copy config in ConsecutiveErrOpenerFactory closure (race hygiene, matches PR #150)
- fix: read Fallback.MaxConcurrentRequests once in throttle check (matches PR #150)
- fix: use nil-safe props.now() in Opener.SetConfigNotThreadSafe
- fix: guard NumBuckets==0 in RollingCounter.GetBuckets (fixes String() panic)
- fix: nil-check pointer fields in RollingCounter.UnmarshalJSON
- fix: MockClock.AfterFunc fires immediately for d<=0 (matches real time.AfterFunc)

All fixes are backwards-compatible (panic->works, race->clean, or strictly
additive guards). Regression tests added for each.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: UnmarshalJSON returns error on incomplete JSON instead of leaving inconsistent state

The previous fix (nil-check each pointer field and skip) made zero-value
receivers safe but created a new deferred-panic vector: on a pre-initialized
counter, UnmarshalJSON({}) would set buckets=nil while leaving NumBuckets
unchanged, causing GetBuckets to panic later with index out of range.

UnmarshalJSON is a round-trip API — MarshalJSON always emits all fields.
Incomplete JSON is corrupt/truncated input. Return an error (receiver
unmodified) rather than silently partial-apply.

Backwards-compatible vs released master: panic -> error (strictly better).

Tests added:
- IncompleteInput: 5 partial-JSON variants x 2 receiver states (all must error, receiver preserved)
- RoundTrip: 3 counter states marshal+unmarshal cleanly (the BC guarantee)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* tests: fix 6 timing-based flakes

Systematic audit for timing-sensitive tests that can flake on loaded CI runners.

- TestCircuitAttemptsToReopenOnlyOnce: convert to MockClock for deterministic
  timing. The previous real-clock version with SleepWindow: 1ms flaked when CI
  scheduling delays exceeded 1ms (same pattern PR #141 fixed in the sister test
  TestCircuitAttemptsToReopen). This is the flake that caused the CI failure.

- TestThrottled, TestFallbackCircuitConcurrency: replace time.Sleep-based
  concurrency with a barrier pattern. Wait until ALL goroutines have either
  entered the throttled section or been rejected before releasing — guarantees
  the over-limit goroutine attempts while slots are occupied, regardless of
  scheduling. Also fixed: errCount++ was not atomic.

- TestRollingCounter_Inc: use StringAt(now)/RollingSumAt(now) instead of
  String()/RollingSum(). The latter use real time.Now() which can advance past
  the 10ms rolling window on a slow runner, rolling out buckets before later
  Inc(now) calls.

- TestTickUntil: register AfterFunc callbacks BEFORE starting the TickUntil
  goroutine. If TickUntil advances mock time first, callbacks get scheduled
  relative to the advanced time, breaking the final m.Now() == now+3h assertion.

- TestManagerConcurrentFactoryConfiguration: drop require.NoError on Execute.
  With 300 concurrent executions under -race and only a 2x sleep/timeout margin,
  CI jitter can cause spurious timeouts. The test is about concurrent factory
  creation, not execution timing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
cep21 added a commit that referenced this pull request Feb 28, 2026
TestRollingCounterBucketRolloverRace had a rollingSum < 0 check inside
the hot concurrent loop since PR #139/#150. This check was always wrong
for the lock-free RollingCounter design (documented at rolling_bucket.go):

  Thread A (Inc):      buckets[idx].Add(1)
  Thread B (rollover): Swap(0) returns the incremented value,
                       rollingSum.Add(-toDec) — briefly over-decrements
  Thread A (Inc cont): rollingSum.Add(1) — corrects

A reader between B and A's rollingSum write observes a transient negative.
The value is correct once writers quiesce.

Moved the < 0 check to after wg.Wait(). The rollingSum <= TotalSum check
stays in the hot loop (safe: reads are correctly ordered).
cep21 added a commit that referenced this pull request Feb 28, 2026
…efile (#152) (#152)

* Add property/fuzz/invariant tests; fix 2 fuzz-found bugs; restore Makefile

## Test infrastructure

- circuit_property_test.go: invariant-asserting stress tests + deterministic
  state-machine property test (testing/quick + MockClock + reference model).
  Verified catches the TOCTOU bugs from PR #151 by temporarily reverting CAS.
    - TestCircuit_RunMetricsExactlyOnce_Concurrent: verifies the exactly-once
      RunMetrics contract under 100K concurrent Execute calls
    - TestCircuit_TransitionAlternation_Concurrent: verifies Opened/Closed
      alternate correctly (closed ≤ opened ≤ closed+1) under contention
    - TestCircuit_FallbackMetricsExactlyOnce_Concurrent: same for fallback path
    - TestCircuit_StateMachineProperty: 500 random op sequences replayed
      against real circuit + trivial reference model, 6 invariants checked
      per step, fully deterministic via MockClock

- faststats/fuzz_test.go: 5 Go-native fuzz targets + naiveRollingCounter
  differential oracle. Seeds run as unit tests in ~5ms.
    - FuzzRollingCounterOps: Inc/RollingSumAt/GetBuckets vs naive oracle
    - FuzzRollingCounterJSON: round-trip + hostile-input safety
    - FuzzSortedDurationsPercentile: bounds + monotonicity
    - FuzzRollingBucketAdvance: index bounds + LastAbsIndex monotonicity
    - FuzzTimedCheckJSON: round-trip + hostile-input safety

- Tightened existing stress tests (were assertion-free, just t.Logf):
    - circuit_stress_test.go: TestCircuitStateTransitionRace now verifies
      Opened/Closed alternation + ConcurrentCommands balance
    - faststats/rolling_stress_test.go: deleted 2 permanent-skip tests,
      added real TestTimedCheckConcurrency, added RollingSum≤TotalSum check

## Bug fixes (found by fuzzers in ~10s)

- faststats/rolling_percentile.go: Percentile(NaN) panicked — NaN falls
  through both p<=0 and p>=100 guards, int(math.Floor(NaN)) is INT64_MIN
  on amd64, index-out-of-range. Now returns -1.

- faststats/rolling_counter.go: UnmarshalJSON accepted hostile JSON with
  NumBuckets > len(Buckets), then GetBuckets panicked. Now rejected at
  unmarshal time with a descriptive error.

## Build tooling

- Makefile: restored after Jan 2024 deletion. make (fast dev loop),
  make ci (pre-PR gate mirroring GitHub Actions), make fuzz, make bench,
  make cover, make fix, make help.

- .github/workflows/build.yml: Build/Test steps now call make build /
  make test-race (same commands, now DRY).

- .github/workflows/fuzz.yml: daily scheduled fuzzing, 5min per target,
  uploads crash inputs as artifacts on failure. Manual dispatch enabled.

- README.md: Development section points to make targets. Fixed dead
  Makefile link.

- .gitignore: added /coverage.html (from make cover).

* Fix pre-existing flake: RollingSumAt < 0 check must be post-quiescence

TestRollingCounterBucketRolloverRace had a rollingSum < 0 check inside
the hot concurrent loop since PR #139/#150. This check was always wrong
for the lock-free RollingCounter design (documented at rolling_bucket.go):

  Thread A (Inc):      buckets[idx].Add(1)
  Thread B (rollover): Swap(0) returns the incremented value,
                       rollingSum.Add(-toDec) — briefly over-decrements
  Thread A (Inc cont): rollingSum.Add(1) — corrects

A reader between B and A's rollingSum write observes a transient negative.
The value is correct once writers quiesce.

Moved the < 0 check to after wg.Wait(). The rollingSum <= TotalSum check
stays in the hot loop (safe: reads are correctly ordered).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant