fix(anthropic): handle SSE in-band errors with correct HTTP status codes#2880
Conversation
Anthropic streams reply with HTTP 200 even when errors occur mid-stream (delivered as SSE 'event: error'). The SDK builds an *anthropic.Error with StatusCode == 200, which prevented proper error classification. Synthesize the correct HTTP status from the error type (api_error→500, overloaded_error→529, rate_limit_error→429, etc.) so the retry and format pipeline behaves correctly. Fixes docker#2870
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟢 APPROVE
The fix correctly synthesizes HTTP status codes from Anthropic SSE in-band error types, integrating cleanly with the existing modelerrors.WrapHTTPError and ClassifyModelError pipeline. The error type → status code mapping matches the PR description and Anthropic's API semantics. No high or medium severity bugs were found in the added code.
Observations (low severity, no action required):
-
statusCode < 400guard is slightly broader than needed (wrap.go:35): The condition catches 1xx/2xx/3xx errors and remaps them. The only documented SSE in-band case is HTTP 200, sostatusCode == http.StatusOKwould be a tighter guard — though 1xx/3xx Anthropic errors don't occur in practice. -
SSE
rate_limit_errorsynthesizes 429 withoutRetry-After(wrap.go:36): The original HTTP 200 response carries noRetry-Afterheader, so the synthesizedStatusErrorwill always haveRetryAfter == 0. Retries still kick in correctly via the exponential back-off strategy; the provider-hint path is simply not available for in-band errors. Known limitation, not a bug. -
Test helper fragility (
wrap_test.go:56):makeTestSSEAnthropicErrorsetsStatusCode: http.StatusOKbefore callingUnmarshalJSON. If a future SDK version overwritesStatusCodeduring unmarshaling, the fixture would silently stop validating the production path. Minor, SDK-version-dependent concern. -
Missing test cases for several error types:
permission_error,not_found_error,billing_error,timeout_error, andinvalid_request_errorhave no dedicated test assertions. The existing coverage forapi_error,overloaded_error,rate_limit_error,authentication_error, and unknown type provides reasonable confidence, but adding the remaining cases would complete the coverage matrix.
|
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs. |
Anthropic streams reply with HTTP 200 even when an error occurs mid-stream. Instead of a typical HTTP error status, the error is delivered as an SSE
event: error. This caused the agent to stop mid-session with cryptic error messages likemodel failed: error receiving from stream: POST "...": 200 {"type":"error",...}instead of retrying transient errors.The root cause was that
modelerrors.WrapHTTPErroronly wraps errors withstatusCode >= 400. When the SDK built an*anthropic.ErrorwithStatusCode == 200, it bypassed the wrapping logic. Without a proper*StatusErrorin the error chain,ClassifyModelErrorfell back to substring matching, and the user-facing message was never cleaned up.This fix synthesizes the correct HTTP status code from the Anthropic error type before passing it to the existing error handling pipeline:
api_error→ 500,overloaded_error→ 529,rate_limit_error→ 429,timeout_error→ 504,authentication_error→ 401,permission_error→ 403,not_found_error→ 404,billing_error→ 402,invalid_request_error→ 400, and unknown → 500. With the right status code in place, retries kick in for transient errors, and if retries are exhausted the message becomes friendly:HTTP 500: api_error: Internal server error (Request-ID: req_...)instead of the raw SSE blob.Tests cover
api_error,overloaded_error,rate_limit_error,authentication_error, and an unknown error type, asserting both the synthesized status code and the downstreamClassifyModelErrordecision.Fixes #2870