fix(anthropic): handle SSE in-band errors with correct HTTP status codes #2880

Merged
dgageot merged 1 commit into
docker:mainfrom
dgageot:board/870a547870fff724
May 22, 2026
@dgageot dgageot (Member) commented May 22, 2026

Anthropic streams reply with HTTP 200 even when an error occurs mid-stream. Instead of a typical HTTP error status, the error is delivered in-band as an SSE 'event: error'. This caused the agent to stop mid-session with cryptic error messages such as 'model failed: error receiving from stream: POST "...": 200 {"type":"error",...}' instead of retrying transient errors.
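To make the in-band delivery concrete, here is a minimal sketch of spotting an 'event: error' frame in an SSE body. It assumes the payload shape shown in the error message above ({"type":"error","error":{...}}); the names sseError and findSSEError are hypothetical and are not taken from the PR or the SDK.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// sseError is a hypothetical struct for the in-band error payload.
type sseError struct {
	Type  string `json:"type"`
	Error struct {
		Type    string `json:"type"`
		Message string `json:"message"`
	} `json:"error"`
}

// findSSEError scans an SSE body and returns the first "event: error"
// payload it can decode. The HTTP status of the response is never consulted,
// which is why a 200 response can still carry an error.
func findSSEError(body string) (*sseError, bool) {
	scanner := bufio.NewScanner(strings.NewReader(body))
	event := ""
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, "event:"):
			event = strings.TrimSpace(strings.TrimPrefix(line, "event:"))
		case strings.HasPrefix(line, "data:") && event == "error":
			data := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
			var e sseError
			if err := json.Unmarshal([]byte(data), &e); err != nil {
				return nil, false
			}
			return &e, true
		case line == "":
			event = "" // a blank line terminates the current event
		}
	}
	return nil, false
}

func main() {
	body := "event: content_block_delta\ndata: {\"type\":\"content_block_delta\"}\n\n" +
		"event: error\ndata: {\"type\":\"error\",\"error\":{\"type\":\"overloaded_error\",\"message\":\"Overloaded\"}}\n\n"
	if e, ok := findSSEError(body); ok {
		fmt.Println(e.Error.Type, e.Error.Message)
	}
}
```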

The root cause was that modelerrors.WrapHTTPError only wraps errors with statusCode >= 400. When the SDK built an *anthropic.Error with StatusCode == 200, it bypassed the wrapping logic. Without a proper *StatusError in the error chain, ClassifyModelError fell back to substring matching, and the user-facing message was never cleaned up.

This fix synthesizes the correct HTTP status code from the Anthropic error type before passing it to the existing error handling pipeline: api_error → 500, overloaded_error → 529, rate_limit_error → 429, timeout_error → 504, authentication_error → 401, permission_error → 403, not_found_error → 404, billing_error → 402, invalid_request_error → 400, and unknown → 500. With the right status code in place, retries kick in for transient errors, and if retries are exhausted the message becomes friendly: HTTP 500: api_error: Internal server error (Request-ID: req_...) instead of the raw SSE blob.

Tests cover api_error, overloaded_error, rate_limit_error, authentication_error, and an unknown error type, asserting both the synthesized status code and the downstream ClassifyModelError decision.

Fixes #2870

Anthropic streams reply with HTTP 200 even when errors occur mid-stream
(delivered as SSE 'event: error'). The SDK builds an *anthropic.Error with
StatusCode == 200, which prevented proper error classification. Synthesize
the correct HTTP status from the error type (api_error→500, overloaded_error→529,
rate_limit_error→429, etc.) so the retry and format pipeline behaves correctly.

Fixes docker#2870
@dgageot dgageot requested a review from a team as a code owner May 22, 2026 15:39
@docker-agent docker-agent left a comment

Assessment: 🟢 APPROVE

The fix correctly synthesizes HTTP status codes from Anthropic SSE in-band error types, integrating cleanly with the existing modelerrors.WrapHTTPError and ClassifyModelError pipeline. The error type → status code mapping matches the PR description and Anthropic's API semantics. No high or medium severity bugs were found in the added code.

Observations (low severity, no action required):

  1. statusCode < 400 guard is slightly broader than needed (wrap.go:35): The condition catches 1xx/2xx/3xx errors and remaps them. The only documented SSE in-band case is HTTP 200, so statusCode == http.StatusOK would be a tighter guard — though 1xx/3xx Anthropic errors don't occur in practice.

  2. SSE rate_limit_error synthesizes 429 without Retry-After (wrap.go:36): The original HTTP 200 response carries no Retry-After header, so the synthesized StatusError will always have RetryAfter == 0. Retries still kick in correctly via the exponential back-off strategy; the provider-hint path is simply not available for in-band errors. Known limitation, not a bug.

  3. Test helper fragility (wrap_test.go:56): makeTestSSEAnthropicError sets StatusCode: http.StatusOK before calling UnmarshalJSON. If a future SDK version overwrites StatusCode during unmarshaling, the fixture would silently stop validating the production path. Minor, SDK-version-dependent concern.

  4. Missing test cases for several error types: permission_error, not_found_error, billing_error, timeout_error, and invalid_request_error have no dedicated test assertions. The existing coverage for api_error, overloaded_error, rate_limit_error, authentication_error, and unknown type provides reasonable confidence, but adding the remaining cases would complete the coverage matrix.

@aheritier aheritier added the labels area/providers/anthropic (for features/issues/fixes related to the usage of Anthropic models), area/providers (for features/issues/fixes related to LLM providers: Bedrock, LiteLLM, Qwen, custom, etc.), and kind/fix (PR fixes a bug; maps to the fix: commit prefix) May 22, 2026
@docker-agent

PR Review Failed — The review agent encountered an error and could not complete the review. View logs.

@dgageot dgageot merged commit 3d552c4 into docker:main May 22, 2026
8 checks passed
Development

Successfully merging this pull request may close these issues.

Agent stops mid-session with cryptic, unactionable error message

4 participants