Skip to content

fix(connection): response-scoped notification barrier (fixes WaitGroup reuse panic)#30

Merged
ThomasK33 merged 2 commits into
mainfrom
connection-9n9w
Apr 20, 2026
Merged

fix(connection): response-scoped notification barrier (fixes WaitGroup reuse panic)#30
ThomasK33 merged 2 commits into
mainfrom
connection-9n9w

Conversation

@ThomasK33
Copy link
Copy Markdown
Member

Summary

Fixes a sync: WaitGroup is reused before previous Wait has returned panic in
Connection under concurrent requests that overlap with inbound notifications
(for example, ClientSideConnection.LoadSession replaying session/update
notifications during a resumed session).

The root cause is that SendRequest[T] / SendRequestNoResult were calling
c.notificationWg.Wait() on a shared sync.WaitGroup after every response.
With overlapping requests, two waiters could race the counter's reuse and
trigger the runtime panic. The shared counter also could not express the
intended "wait only for notifications that were enqueued before this response
was observed" semantics.

What changed

Runtime (connection.go)

  • Removed the shared notificationWg barrier from the request / shutdown path.
  • Added a response-scoped, monotonic notification-sequence barrier:
    • notifyMu, notifyCond
    • lastEnqueuedNotificationSeq (advanced under notifyMu when a
      notification is accepted into the queue, rolled back on queue overflow)
    • completedNotificationSeq (advanced under notifyMu after each handler
      returns; broadcast on notifyCond)
    • invariant: completedNotificationSeq <= lastEnqueuedNotificationSeq
      (runtime-checked with a panic on impossible states)
  • Replaced chan *anyMessage with chan queuedNotification so the single
    consumer knows which sequence it is completing.
  • Added a responseEnvelope so handleResponse snapshots the notification
    watermark in the receive goroutine before waking the waiter, giving each
    response a response-scoped barrier target.
  • SendRequest[T] / SendRequestNoResult now wait via
    waitNotificationsUpTo(ctx, target) using notifyCond in a re-checking
    loop, honoring ctx.Done() and the connection context. If the connection
    is canceled while a request is waiting for its pre-response barrier, it now
    returns a connection error instead of blocking indefinitely (intentional
    behavior change — previously the shared WaitGroup could deadlock here).
  • shutdownReceive drains to the final enqueued sequence using the same cond,
    keeping the existing 5s timeout.

Notification order is still preserved by the single-consumer
processNotifications goroutine. Public API is unchanged. No generated files
were touched.

Tests

  • New file connection_notification_barrier_test.go adds regression coverage:
    • TestSendRequest_WaitsForPreResponseNotificationLoadSession blocks
      until the pre-response session/update handler finishes.
    • TestSendRequest_DoesNotWaitForPostResponseNotificationLoadSession
      returns promptly even while a post-response notification handler is still
      blocked.
    • TestSendRequest_ConcurrentRequestsDoNotPanic — overlapping requests +
      notifications; this is the previously-panicking scenario.
    • TestLoadSession_NotificationReplayOrdering — resumed-session replay
      ordering across multiple session/update notifications.
    • TestShutdownDrainsNotifications_WithBarrier — shutdown drains through
      the final queued notification.
  • Incidental updates:
    • acp_test.go queue-overflow test migrated to the new barrier primitive.
    • connection_cancel_test.go adjusted for the new responseEnvelope
      channel type.

Validation

Run locally on this branch before opening the PR:

Command Result
go test ./... ok github.com/coder/acp-go-sdk 1.314s
go test ./... -race -count=20 -run 'TestSendRequest_ConcurrentRequestsDoNotPanic|TestLoadSession_NotificationReplayOrdering|TestSendRequest_WaitsForPreResponseNotification|TestSendRequest_DoesNotWaitForPostResponseNotification|TestShutdownDrainsNotifications_WithBarrier' ok 8.116s (100 runs of the 5 regressions under -race)
make test (library tests + example builds) PASS
grep -rn notificationWg --include='*.go' . 0 hits

Diff stat

 acp_test.go                             |  15 +-
 connection.go                           | 246 +++++++++++++++----
 connection_cancel_test.go               |   2 +-
 connection_notification_barrier_test.go | 409 ++++++++++++++++++++++++++++++++
 4 files changed, 608 insertions(+), 64 deletions(-)

📋 Implementation Plan

Plan: Fix notification wait race in acp-go-sdk

Goal

Fix the upstream SDK bug in connection.go that can panic with sync: WaitGroup is reused before previous Wait has returned, while preserving the intended behavior that SendRequest only waits for notifications that were received before the matching response was observed.

Verified repository context

  • Connection currently uses a shared notificationWg in connection.go across the inbound notification path.
  • receive() increments notification tracking when it accepts a notification, processNotifications() decrements it after serial handler execution, and both SendRequest[T]() / SendRequestNoResult() wait after waitForResponse() returns.
  • shutdownReceive() also drains pending notifications before canceling inbound work.
  • Existing tests cover concurrent requests, notifications, shutdown draining, and queue overflow, but they do not currently lock in the response-vs-notification interleaving that mirrors ClientSideConnection.LoadSession replay on resumed sessions.
  • The generated wrappers in client_gen.go / agent_gen.go appear to route through the shared connection helpers, so the fix should stay centered in connection.go plus tests.
Design constraint that should drive the fix

A plain shared counter/cond replacement is not enough by itself. The fix needs a response-scoped barrier:

  • notification order must still be preserved by the existing queue,
  • SendRequest must wait for notifications enqueued before the response boundary,
  • and it must not be forced to wait for notifications that arrive after that boundary.

That means the notification watermark has to be captured in the receive path when the response is observed, then waited on later by request/shutdown code.

Phase 1 — Build a repo-local reproducible setup

  1. Add an SDK-only regression harness using the repo’s existing connection test style (pipes/fake peer helpers), so validation does not depend on the unavailable reviewfixer source tree.
  2. Cover two levels of reproduction:
    • a raw Connection / SendRequest interleaving test that can precisely gate notifications and the response,
    • and a higher-level ClientSideConnection.LoadSession replay scenario that mimics resumed-session session/update notifications around session/load.
  3. Use channels or other deterministic gates to force the important boundaries:
    • a notification that is enqueued before the response but finishes later,
    • and a notification that arrives after the response boundary.
  4. Add explicit regression assertions for both semantics:
    • pre-response notifications block the request from returning,
    • post-response notifications do not block the request from returning.
  5. Add a repeatable stress target for the private-consumer-shaped scenario (for example, a focused go test ... -count=100 run) so the old panic shape can be exercised without external repositories.

Quality gate: before the refactor is considered done, the repo contains a self-contained test setup that reproduces the session/load replay conditions inside acp-go-sdk itself.

Phase 2 — Replace the shared WaitGroup with an ordered notification barrier

  1. Refactor connection.go to remove notificationWg from the response/shutdown synchronization path.
  2. Introduce response-scoped barrier state on Connection, implemented with a mutex/cond plus monotonic sequence counters, e.g.:
    • lastEnqueuedNotificationSeq
    • completedNotificationSeq
    • a queued notification wrapper that carries both the message and its sequence number.
  3. In receive():
    • assign a monotonically increasing sequence number when a notification is successfully accepted into the notification queue,
    • snapshot lastEnqueuedNotificationSeq when a response is observed,
    • and attach that snapshot to the response handed to the waiting request path.
  4. In processNotifications():
    • keep the existing serial queue semantics,
    • advance completedNotificationSeq only after the handler finishes,
    • and broadcast the cond so any waiting request/shutdown path can re-check its target.
  5. In SendRequest[T]() and SendRequestNoResult():
    • wait until completedNotificationSeq reaches the response’s captured target sequence,
    • then decode/return the response,
    • without waiting on notifications that arrived after the response boundary.
  6. In shutdownReceive():
    • reuse the same barrier instead of a second drain primitive,
    • snapshot the final queued notification sequence only after no more notifications can be enqueued,
    • and keep the existing timeout-based drain behavior.
  7. Add defensive internal invariants/comments around the new barrier state so the monotonic sequence assumptions stay obvious during future changes (for example: completed seq never exceeds last enqueued seq, and waiters always re-check in a loop).

Quality gate: the old shared WaitGroup no longer participates in request/shutdown notification draining, and the new barrier preserves the intended “received before response” ordering contract.

Phase 3 — Update and extend tests around the new barrier

  1. Add focused regression tests alongside the existing connection tests (either in acp_test.go or a new narrow *_test.go file) for:
    • raw interleaving of notifications and responses,
    • ClientSideConnection.LoadSession replay behavior,
    • and repeated concurrent runs that previously risked the panic.
  2. Update any existing tests that reach into notificationWg directly so they assert observable behavior instead of the removed primitive.
  3. Keep generated wrappers unchanged unless a compile-time type update is unavoidable; the behavior change should remain centralized in connection.go.
  4. Document in test names/comments why the previous WaitGroup design was invalid: an open-ended receive loop cannot safely use a shared WaitGroup to model a response-specific notification boundary.

Quality gate: new regression tests are stable under repeated runs, and existing shutdown/overflow/concurrency tests still validate the same external behavior.

Validation

  • Run focused regression coverage while iterating, especially the new connection/load-session tests.
  • Run repeated targeted stress passes for the interleaving scenario (for example go test ./... -run '<new regression names>' -count=100).
  • Run the full suite with go test ./....
  • Run make test so example binaries continue to compile alongside the library tests.
  • If practical in the environment, run go test ./... -race for the touched package or full repo as an additional concurrency check.

Dogfooding / self-verification

  1. Treat the new SDK-only LoadSession replay test as the permanent reproducible setup for this issue; do not rely on the private consumer repository.
  2. Record terminal evidence for reviewers while validating the fix:
    • one recording of the focused replay/interleaving regression run,
    • one recording of the repeated stress run,
    • and one recording or transcript of the final suite command(s).
  3. Capture screenshots of:
    • the targeted replay test passing without panic,
    • and the full validation run succeeding.
  4. If a temporary standalone harness is useful during implementation, keep it test-local or remove it before finishing so the committed reproducible setup is the regression test suite itself.

Acceptance criteria

  • The fix lives in acp-go-sdk, not in a downstream consumer workaround.
  • SendRequest[T]() and SendRequestNoResult() wait only for notifications queued before the matching response was observed.
  • Notification handling remains serialized in queue order.
  • shutdownReceive() drains through the final queued notification sequence and keeps timeout behavior.
  • The SDK has a repo-local regression setup covering both raw connection interleaving and a ClientSideConnection.LoadSession replay scenario.
  • go test ./... and make test pass, with targeted repeated runs showing the panic no longer occurs in the reproducible setup.

Generated with mux • Model: anthropic:claude-opus-4-7 • Thinking: max

@ThomasK33 ThomasK33 merged commit 3a19dcb into main Apr 20, 2026
1 check passed
@ThomasK33 ThomasK33 deleted the connection-9n9w branch April 20, 2026 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant