fix(connection): response-scoped notification barrier (fixes WaitGroup reuse panic) #30
Merged
Summary
Fixes a `sync: WaitGroup is reused before previous Wait has returned` panic in `Connection` under concurrent requests that overlap with inbound notifications (for example, `ClientSideConnection.LoadSession` replaying `session/update` notifications during a resumed session).

The root cause is that `SendRequest[T]` / `SendRequestNoResult` were calling `c.notificationWg.Wait()` on a shared `sync.WaitGroup` after every response. With overlapping requests, two waiters could race the counter's reuse and trigger the runtime panic. The shared counter also could not express the intended "wait only for notifications that were enqueued before this response was observed" semantics.
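The second limitation can be shown in isolation. The sketch below is not taken from the SDK: it uses a bare `sync.WaitGroup` and a hypothetical helper, `sharedCounterOverWaits`, to illustrate that a waiter on a shared counter stays blocked on a notification that arrived after the response, even though every earlier notification has already finished.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedCounterOverWaits reports whether a Wait on a shared WaitGroup is still
// blocked after the only pre-response notification has finished.
// The function name and the 50ms probe interval are illustrative assumptions.
func sharedCounterOverWaits() bool {
	var wg sync.WaitGroup
	wg.Add(1) // notification received before the response
	wg.Add(1) // notification received after the response
	wg.Done() // the pre-response handler finishes

	returned := make(chan struct{})
	go func() {
		wg.Wait() // stands in for the request's post-response wait
		close(returned)
	}()

	blocked := false
	select {
	case <-returned:
	case <-time.After(50 * time.Millisecond):
		blocked = true // still waiting on the post-response notification
	}

	wg.Done() // the post-response handler finishes
	<-returned
	return blocked
}

func main() {
	fmt.Println(sharedCounterOverWaits()) // prints: true
}
```

A single counter has no notion of "before" or "after" a given response, which is why the waiter cannot stop at the response boundary.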
What changed
Runtime (`connection.go`)
- Removed the `notificationWg` barrier from the request / shutdown path.
- Added `notifyMu`, `notifyCond`, and the sequence counters:
  - `lastEnqueuedNotificationSeq` (advanced under `notifyMu` when a notification is accepted into the queue, rolled back on queue overflow)
  - `completedNotificationSeq` (advanced under `notifyMu` after each handler returns; broadcast on `notifyCond`)
- Invariant: `completedNotificationSeq <= lastEnqueuedNotificationSeq` (runtime-checked with a `panic` on impossible states).
- Replaced `chan *anyMessage` with `chan queuedNotification` so the single consumer knows which sequence it is completing.
- Added `responseEnvelope` so `handleResponse` snapshots the notification watermark in the receive goroutine before waking the waiter, giving each response a response-scoped barrier target.
- `SendRequest[T]` / `SendRequestNoResult` now wait via `waitNotificationsUpTo(ctx, target)` using `notifyCond` in a re-checking loop, honoring `ctx.Done()` and the connection context. If the connection is canceled while a request is waiting for its pre-response barrier, it now returns a connection error instead of blocking indefinitely (intentional behavior change: previously the shared `WaitGroup` could deadlock here).
- `shutdownReceive` drains to the final enqueued sequence using the same cond, keeping the existing 5s timeout.

Notification order is still preserved by the single-consumer `processNotifications` goroutine. Public API is unchanged. No generated files were touched.
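The enqueue-with-rollback step can be sketched as follows. This is a minimal illustration, not the code in `connection.go`: the `notificationQueue` type, its field names, the `method` field on `queuedNotification`, and the non-blocking `select` used to detect overflow are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// queuedNotification carries the sequence number alongside the payload so the
// consumer knows which sequence it completes. The method field is a
// hypothetical placeholder for the real message.
type queuedNotification struct {
	seq    uint64
	method string
}

// notificationQueue is a hypothetical stand-in for the queue state on Connection.
type notificationQueue struct {
	mu           sync.Mutex
	lastEnqueued uint64 // plays the role of lastEnqueuedNotificationSeq
	ch           chan queuedNotification
}

func newNotificationQueue(capacity int) *notificationQueue {
	return &notificationQueue{ch: make(chan queuedNotification, capacity)}
}

// tryEnqueue advances the sequence under the mutex and rolls it back if the
// buffered channel is full, so dropped notifications never become wait targets.
func (q *notificationQueue) tryEnqueue(method string) (uint64, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.lastEnqueued++
	select {
	case q.ch <- queuedNotification{seq: q.lastEnqueued, method: method}:
		return q.lastEnqueued, true
	default:
		q.lastEnqueued-- // overflow: undo the reservation
		return 0, false
	}
}

// demoOverflow fills a queue of capacity 2, overflows once, frees a slot, and
// enqueues again.
func demoOverflow() (seqs []uint64, dropped int) {
	q := newNotificationQueue(2)
	for i := 0; i < 3; i++ {
		if seq, ok := q.tryEnqueue("session/update"); ok {
			seqs = append(seqs, seq)
		} else {
			dropped++
		}
	}
	<-q.ch // the consumer takes seq 1, freeing a slot
	if seq, ok := q.tryEnqueue("session/update"); ok {
		seqs = append(seqs, seq)
	}
	return seqs, dropped
}

func main() {
	fmt.Println(demoOverflow()) // prints: [1 2 3] 1
}
```

Because the rollback happens under the same mutex as the increment, accepted notifications keep contiguous sequence numbers even when one is dropped in between.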
Tests
`connection_notification_barrier_test.go` adds regression coverage:
- `TestSendRequest_WaitsForPreResponseNotification`: `LoadSession` blocks until the pre-response `session/update` handler finishes.
- `TestSendRequest_DoesNotWaitForPostResponseNotification`: `LoadSession` returns promptly even while a post-response notification handler is still blocked.
- `TestSendRequest_ConcurrentRequestsDoNotPanic`: overlapping requests + notifications; this is the previously-panicking scenario.
- `TestLoadSession_NotificationReplayOrdering`: resumed-session replay ordering across multiple `session/update` notifications.
- `TestShutdownDrainsNotifications_WithBarrier`: shutdown drains through the final queued notification.

`acp_test.go` queue-overflow test migrated to the new barrier primitive.
`connection_cancel_test.go` adjusted for the new `responseEnvelope` channel type.
Validation
Run locally on this branch before opening the PR:
- `go test ./...`: `ok github.com/coder/acp-go-sdk 1.314s`
- `go test ./... -race -count=20 -run 'TestSendRequest_ConcurrentRequestsDoNotPanic|TestLoadSession_NotificationReplayOrdering|TestSendRequest_WaitsForPreResponseNotification|TestSendRequest_DoesNotWaitForPostResponseNotification|TestShutdownDrainsNotifications_WithBarrier'`: `ok 8.116s` (100 runs of the 5 regressions under `-race`)
- `make test` (library tests + example builds)
- `grep -rn notificationWg --include='*.go' .`
Diff stat
📋 Implementation Plan
Plan: Fix notification wait race in `acp-go-sdk`
Goal
Fix the upstream SDK bug in `connection.go` that can panic with `sync: WaitGroup is reused before previous Wait has returned`, while preserving the intended behavior that `SendRequest` only waits for notifications that were received before the matching response was observed.
Verified repository context
- `Connection` currently uses a shared `notificationWg` in `connection.go` across the inbound notification path.
- `receive()` increments notification tracking when it accepts a notification, `processNotifications()` decrements it after serial handler execution, and both `SendRequest[T]()` / `SendRequestNoResult()` wait after `waitForResponse()` returns.
- `shutdownReceive()` also drains pending notifications before canceling inbound work.
- `ClientSideConnection.LoadSession` replay on resumed sessions.
- `client_gen.go` / `agent_gen.go` appear to route through the shared connection helpers, so the fix should stay centered in `connection.go` plus tests.
Design constraint that should drive the fix
A plain shared counter/cond replacement is not enough by itself. The fix needs a response-scoped barrier:
- `SendRequest` must wait for notifications enqueued before the response boundary.

That means the notification watermark has to be captured in the receive path when the response is observed, then waited on later by request/shutdown code.
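To make the watermark capture concrete, the following sketch replays an ordered stream of inbound messages and records, for each response, how many notifications had been enqueued when it was observed. It is a simplified model under stated assumptions: the `inbound` type, the `responseEnvelope` field names, and the `snapshotTargets` function are hypothetical, and the real receive path works on a live connection, not a slice.

```go
package main

import "fmt"

// inbound is a hypothetical, simplified message: either a notification or a
// response identified by its request ID.
type inbound struct {
	isResponse bool
	requestID  int
}

// responseEnvelope pairs a response with the notification watermark captured
// when the receive loop observed it. The field names are assumptions.
type responseEnvelope struct {
	requestID     int
	barrierTarget uint64
}

// snapshotTargets walks messages in arrival order, counting notifications and
// stamping each response with the count seen so far.
func snapshotTargets(msgs []inbound) []responseEnvelope {
	var lastEnqueued uint64
	var out []responseEnvelope
	for _, m := range msgs {
		if !m.isResponse {
			lastEnqueued++ // a notification was accepted into the queue
			continue
		}
		out = append(out, responseEnvelope{
			requestID:     m.requestID,
			barrierTarget: lastEnqueued, // snapshot taken in the receive path
		})
	}
	return out
}

func main() {
	msgs := []inbound{
		{}, {}, // two notifications arrive first
		{isResponse: true, requestID: 1},
		{}, // this notification arrives after response 1
		{isResponse: true, requestID: 2},
	}
	for _, env := range snapshotTargets(msgs) {
		fmt.Printf("request %d waits up to seq %d\n", env.requestID, env.barrierTarget)
	}
}
```

In this model request 1 gets target 2 and request 2 gets target 3, so the third notification never delays the first request.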
Phase 1: Build a repo-local reproducible setup
- `reviewfixer` source tree.
- `Connection` / `SendRequest` interleaving test that can precisely gate notifications and the response,
- `ClientSideConnection.LoadSession` replay scenario that mimics resumed-session `session/update` notifications around `session/load`.
- `go test ... -count=100` run) so the old panic shape can be exercised without external repositories.

Quality gate: before the refactor is considered done, the repo contains a self-contained test setup that reproduces the session/load replay conditions inside `acp-go-sdk` itself.
Phase 2: Replace the shared WaitGroup with an ordered notification barrier
- `connection.go` to remove `notificationWg` from the response/shutdown synchronization path.
- `Connection`, implemented with a mutex/cond plus monotonic sequence counters, e.g.:
  - `lastEnqueuedNotificationSeq`
  - `completedNotificationSeq`
- `receive()`:
  - `lastEnqueuedNotificationSeq` when a response is observed,
- `processNotifications()`:
  - `completedNotificationSeq` only after the handler finishes,
- `SendRequest[T]()` and `SendRequestNoResult()`:
  - `completedNotificationSeq` reaches the response’s captured target sequence,
- `shutdownReceive()`:

Quality gate: the old shared WaitGroup no longer participates in request/shutdown notification draining, and the new barrier preserves the intended “received before response” ordering contract.
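A mutex/cond barrier over two monotonic counters, as Phase 2 describes, might look like the sketch below. It is a minimal model and not the SDK's implementation: the `notifyBarrier` type and its method names are hypothetical, and waking waiters on cancellation through `context.AfterFunc` (Go 1.21+) is one possible way to honor `ctx.Done()` with a `sync.Cond`, chosen here for brevity.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// notifyBarrier is a hypothetical stand-in for the barrier state on Connection.
type notifyBarrier struct {
	mu        sync.Mutex
	cond      *sync.Cond
	enqueued  uint64 // plays the role of lastEnqueuedNotificationSeq
	completed uint64 // plays the role of completedNotificationSeq
}

func newNotifyBarrier() *notifyBarrier {
	b := &notifyBarrier{}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// enqueue reserves the next sequence number for an accepted notification.
func (b *notifyBarrier) enqueue() uint64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.enqueued++
	return b.enqueued
}

// snapshot returns the watermark a response observed right now should wait for.
func (b *notifyBarrier) snapshot() uint64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.enqueued
}

// complete records that the handler for seq returned and wakes all waiters.
func (b *notifyBarrier) complete(seq uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if seq > b.enqueued {
		panic("completed sequence is ahead of the enqueued sequence")
	}
	b.completed = seq
	b.cond.Broadcast()
}

// waitUpTo blocks until completed >= target or ctx is done.
func (b *notifyBarrier) waitUpTo(ctx context.Context, target uint64) error {
	// sync.Cond cannot select on a channel, so broadcast when ctx is canceled.
	stop := context.AfterFunc(ctx, func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		b.cond.Broadcast()
	})
	defer stop()

	b.mu.Lock()
	defer b.mu.Unlock()
	for b.completed < target { // re-check after every wakeup
		if err := ctx.Err(); err != nil {
			return err
		}
		b.cond.Wait()
	}
	return nil
}

// demo enqueues one notification before the response and one after it, then
// waits only for the first.
func demo() (target, later uint64, err error) {
	b := newNotifyBarrier()
	first := b.enqueue()
	target = b.snapshot() // the response is observed here
	later = b.enqueue()   // arrives after the response and is never completed
	go b.complete(first)
	err = b.waitUpTo(context.Background(), target)
	return target, later, err
}

// demoCanceled waits on a notification that never completes, using a context
// that is already canceled.
func demoCanceled() error {
	b := newNotifyBarrier()
	target := b.enqueue()
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return b.waitUpTo(ctx, target)
}

func main() {
	fmt.Println(demo())         // prints: 1 2 <nil>
	fmt.Println(demoCanceled()) // prints: context canceled
}
```

Each waiter compares against its own target, so overlapping requests never share or reset a counter, and a canceled context returns an error instead of leaving the waiter parked.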
Phase 3: Update and extend tests around the new barrier
- `acp_test.go` or a new narrow `*_test.go` file) for:
  - `ClientSideConnection.LoadSession` replay behavior,
- `notificationWg` directly so they assert observable behavior instead of the removed primitive.
- `connection.go`.

Quality gate: new regression tests are stable under repeated runs, and existing shutdown/overflow/concurrency tests still validate the same external behavior.
Validation
- `go test ./... -run '<new regression names>' -count=100`).
- `go test ./...`.
- `make test` so example binaries continue to compile alongside the library tests.
- `go test ./... -race` for the touched package or full repo as an additional concurrency check.

Dogfooding / self-verification
- `LoadSession` replay test as the permanent reproducible setup for this issue; do not rely on the private consumer repository.

Acceptance criteria
- `acp-go-sdk`, not in a downstream consumer workaround.
- `SendRequest[T]()` and `SendRequestNoResult()` wait only for notifications queued before the matching response was observed.
- `shutdownReceive()` drains through the final queued notification sequence and keeps timeout behavior.
- `ClientSideConnection.LoadSession` replay scenario.
- `go test ./...` and `make test` pass, with targeted repeated runs showing the panic no longer occurs in the reproducible setup.

Generated with `mux` • Model: `anthropic:claude-opus-4-7` • Thinking: `max`