
fix(responses): native Responses-API passthrough for gpt-5.x + Codex CLI#427

Merged
lalalune merged 5 commits into dev from fix/responses-native-passthrough on Apr 8, 2026
Conversation

@HaruHunab1320
Contributor

Summary

Follow-up to #425. The tool format normalization in #425 handled flat function-tools but can't express the full set of Responses-API-only features that gpt-5.x clients (Codex CLI, AI SDK v5 Responses transport) send. This PR adds a native passthrough path that proxies the raw body to Vercel AI Gateway's /v1/responses endpoint unchanged.

What #425 couldn't handle

  • Top-level instructions and input fields
  • type: "custom" tools (Codex's apply_patch with Lark grammar)
  • type: "web_search", type: "local_shell" built-in tools
  • include: ["reasoning.encrypted_content"]
  • prompt_cache_key, parallel_tool_calls, reasoning summaries

Collapsing these into Chat Completions format drops information and produces the same class of "tools.0.function undefined" error that #425 was meant to fix.

What this PR adds

  1. AIProvider.responses(body, options) — optional method on the provider interface for native Responses-API passthrough.
  2. vercel-gateway.ts — implements responses() against Vercel AI Gateway's https://ai-gateway.vercel.sh/v1/responses passthrough. No body transformation.
  3. /v1/responses route — new early branch that detects native Responses-API payloads and routes them through the passthrough. Detection triggers on:
    • Presence of top-level instructions
    • Any tool with type !== "function" (custom, web_search, local_shell, image_generation, …)
    • Model id starting with gpt-5
  4. Model prefix rewrite — Vercel AI Gateway requires provider/model format. Codex sends bare gpt-5.4, so normalizeGatewayModelId() prefixes openai/ for bare OpenAI ids, anthropic/ for claude-*, google/ for gemini-*. Caller body is cloned, not mutated.
  5. Type unification — ChatCompletionsTool and ChatCompletionsToolChoice are now exported from providers/types.ts and shared with OpenAIChatRequest, eliminating structural drift.
  6. Credit handling — reserves credits via estimateRequestCost (50% safety buffer applies) and settles on completion. Accurate reconciliation from response.completed SSE event is a TODO.
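The detection and model-rewrite behavior described in items 3 and 4 can be sketched roughly as follows. This is illustrative only: the helper names come from this PR and its test section, but the bodies are assumptions, not the actual implementation.

```typescript
interface ResponsesBody {
  model: string;
  instructions?: string;
  tools?: Array<{ type: string }>;
  [key: string]: unknown;
}

// Detection: any one signal routes the request to the passthrough.
function isNativeResponsesPayload(body: ResponsesBody): boolean {
  if (typeof body.instructions === "string") return true;
  if (body.tools?.some((t) => t.type !== "function")) return true;
  // Matches bare and provider-prefixed gpt-5 ids, including gpt-5-mini / gpt-5.4.
  return /^(openai\/)?gpt-5(\b|[-.])/.test(body.model);
}

// Model prefix rewrite: pure function of the id; the route clones the body
// rather than mutating the caller's object.
function normalizeGatewayModelId(model: string): string {
  if (model.includes("/")) return model; // already provider-prefixed
  if (model.startsWith("claude-")) return `anthropic/${model}`;
  if (model.startsWith("gemini-")) return `google/${model}`;
  return `openai/${model}`; // assumed default for bare OpenAI ids
}
```

Under this sketch, `gpt-5.4` becomes `openai/gpt-5.4` while `openai/gpt-5.4` passes through untouched, matching the test cases listed below.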

Verified against production upstream

Before opening this PR we ran a live HTTP proxy (scripts/codex-sniff.ts in the milady repo) between Codex CLI and https://ai-gateway.vercel.sh/v1/responses with a real gateway key. Captured and confirmed:

  • POST /v1/responses with model openai/gpt-5.4, instructions, input, 12 tools (10 function + 1 custom apply_patch with Lark grammar + 1 web_search), include: ["reasoning.encrypted_content"], reasoning: {effort: "medium"}, prompt_cache_key, stream: true → 200 OK SSE stream
  • Custom tools with Lark grammar forwarded verbatim, accepted by gateway
  • web_search tool forwarded verbatim, accepted by gateway
  • Model prefix rewrite accepted

Cross-referenced Vercel AI Gateway Responses API docs and Codex source (codex-rs/core/src/client_common.rs ToolSpec enum + codex-rs/core/src/client.rs:build_responses_request) to confirm payload shape is stable.

Test coverage

28 tests in packages/tests/unit/api/v1-responses-route.test.ts (15 from #425 still passing + 13 new):

Routing

  • gpt-5.4 + instructions + flat/custom/web_search tools → passthrough
  • Custom tool on any model → passthrough
  • gpt-5-mini (no dot suffix) detected as native Responses
  • Already-prefixed openai/gpt-5.4 → passthrough
  • Normal gpt-4o nested tool → transform path (not passthrough)

Model prefix rewrite

  • Bare gpt-5.4 → openai/gpt-5.4
  • claude-sonnet-4.6 → anthropic/claude-sonnet-4.6
  • gemini-2.5-pro → google/gemini-2.5-pro
  • Already-prefixed ids preserved
  • Caller body not mutated

Payment path

  • Upstream 4xx forwarded verbatim with credit refund (actualCost: 0)
  • Insufficient credits → 402, upstream not touched
  • Provider without .responses() → 400 unsupported_provider
  • Anonymous users skip credit reservation
  • Reconcile called with reserved amount on success

Passthrough body integrity

  • Custom / web_search / local_shell tools forwarded verbatim
  • Flat function tools forwarded verbatim (not normalized)
  • instructions field forwarded

Follow-ups (not blocking)

  1. 429 retry-with-backoff on vercel-gateway.responses(). Hit Vercel's free-tier throttle during testing and Codex's own retry budget exhausted. Exponential backoff inside the gateway wrapper would make this invisible to clients.
  2. Accurate credit reconciliation from response.completed SSE event (currently settles to reserved estimate).
  3. Per-route rate limit preset. Native passthrough currently inherits RELAXED (200/min). Agentic sessions can burn that in a burst — consider an AGENTIC preset or raising to 600/min on this route.
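For follow-up item 1, the retry-with-backoff wrapper could take roughly this shape. This is a hypothetical sketch, not part of this PR: the attempt count, base delay, and Retry-After handling are all assumptions.

```typescript
// Retry 429s with exponential backoff so gateway throttling is invisible
// to clients. doFetch re-issues the upstream request each attempt.
async function fetchWithBackoff(
  doFetch: () => Promise<Response>,
  maxAttempts = 4,
  baseDelayMs = 250,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch();
    if (res.status !== 429 || attempt === maxAttempts - 1) return res;
    // Honor Retry-After when the gateway sends it; otherwise back off exponentially.
    const retryAfter = Number(res.headers.get("retry-after"));
    const delayMs =
      Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter * 1000
        : baseDelayMs * 2 ** attempt;
    await new Promise((r) => setTimeout(r, delayMs));
  }
}
```

Keeping the retry inside the gateway wrapper (rather than the route) means Codex's own retry budget is never consumed by free-tier throttling.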

Test plan

  • bun test packages/tests/unit/api/v1-responses-route.test.ts — 28/28 passing
  • Typecheck clean on changed files
  • Lint clean (biome auto-fix applied)
  • End-to-end verified against production Vercel AI Gateway via local sniff proxy with real gateway key
  • Deploy to dev and test with Codex CLI via milady's PARALLAX_LLM_PROVIDER=cloud flow

🤖 Generated with Claude Code

HaruHunab1320 and others added 2 commits April 7, 2026 10:08
Follow-up to #425. The tool format normalization in #425 handles flat
function-tools but cannot express Responses-API-only features:

  - top-level `instructions` and `input` fields
  - custom tools `{type: "custom", name, format}` (apply_patch)
  - built-in tools `{type: "web_search"}`, `image_generation`, etc.
  - reasoning.effort, prompt_cache_key, encrypted reasoning content

Codex CLI and the AI SDK v5 Responses transport rely on these. Trying
to collapse them into Chat Completions format drops information and
produces the same class of "tools.0.function undefined" error #425 was
meant to fix.

This PR adds a native passthrough path:

- AIProvider gains an optional `responses(body, options)` method that
  forwards an opaque body to the upstream `/responses` endpoint.
- vercel-gateway implements it against Vercel AI Gateway's
  `/responses` passthrough — no body transformation.
- /v1/responses detects native Responses-API payloads (gpt-5.x model,
  presence of `instructions`, or any non-function tool) and routes
  them through the passthrough, streaming the upstream response back
  verbatim. Older clients still go through the transform + normalize
  path unchanged.
- Credits are reserved via estimateRequestCost (50% safety buffer
  already applies) and settled to the reserved amount on completion.
  Accurate reconciliation from the `response.completed` SSE event is
  a TODO — filed for follow-up.

Tests cover the three routing branches (gpt-5.x + instructions →
passthrough, custom tools on any model → passthrough, vanilla nested
tools on gpt-4o → transform path) plus the existing 15-test
normalization suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds on the initial native Responses-API passthrough with the
follow-ups required to make it actually work against Vercel AI
Gateway's /v1/responses endpoint and to cover the payment paths
we missed in the first cut.

- Model prefix rewrite: Vercel AI Gateway requires `provider/model`
  format (e.g. `openai/gpt-5.4`), but Codex CLI and the AI SDK v5
  Responses transport send bare ids like `gpt-5.4`. Added
  `normalizeGatewayModelId()` which prefixes `openai/` for bare OpenAI
  ids, `anthropic/` for `claude-*`, `google/` for `gemini-*`, and
  leaves already-prefixed ids untouched. The rewrite builds a shallow
  clone of the caller body so the original is not mutated.

- Loosened the "is native Responses payload" detector to match
  `gpt-5` and `gpt-5-mini` (previously it only matched `gpt-5.x`).

- Custom/web_search/local_shell tool pass-through explicitly tested —
  verified from Codex source that these are the only four tool
  variants it emits, and Vercel AI Gateway accepts them all verbatim
  on /v1/responses per live test against the production endpoint.

Tests added (13 new, bringing the file to 28 total):
- Model prefix rewrite for bare / prefixed / claude / gemini ids
- Caller body not mutated by rewrite
- Bare `gpt-5` / `gpt-5-mini` detected as native Responses
- Upstream 4xx forwarded verbatim with credit refund
- Insufficient credits → 402, upstream not touched
- Provider without .responses() method → 400 unsupported_provider
- Anonymous users skip credit reservation in passthrough
- Reconcile called with reserved amount on success
- Custom / web_search / local_shell tools forwarded verbatim in body

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel

vercel bot commented Apr 7, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: eliza-cloud-v2. Deployment: Ready (Preview, Comment), updated Apr 8, 2026 3:15am (UTC).

@coderabbitai

coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9523e133-1dad-401b-bdd9-7d09c8994e6f


@claude

claude bot commented Apr 7, 2026

Code Review

Clean implementation overall — the detection + passthrough approach is the right call for Responses-API-only features. A few findings below.


Bugs / Issues

1. apiKey parameter is unused — audit trail gap

handleNativeResponsesPassthrough accepts apiKey: { id: string } | null but never references it. The chat completions path uses this for attribution/logging. Passthrough requests made via API key will have no key-level audit trail in logs.

// app/api/v1/responses/route.ts
async function handleNativeResponsesPassthrough(
  ...
  apiKey: { id: string } | null,   // ← never used
  ...

At minimum, include apiKey?.id in the logger.info call at the end of the function.


2. Credit reconciliation fires before stream is consumed

The void (async () => { await settle(reserved) })() fires immediately after returning the Response — not after the client finishes reading the stream body. For large streamed responses, this means the reservation settles before we know the request actually completed. If the upstream drops mid-stream, we still charge the full estimate.

This is noted as a TODO but I'd flag it explicitly: the 50% safety buffer helps, but a cancelled mid-stream request (e.g. Codex CLI ctrl-C) will still be charged the full estimate. Consider settling on stream close or cancel instead.
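One possible shape of the settle-on-stream-close approach suggested here, as a sketch only: the PR itself settles eagerly, and cancel handling plus usage reconciliation are elided.

```typescript
// Defer settlement until the upstream stream has been fully drained by the
// client, by piping through a transform whose flush() fires at end-of-stream.
function settleOnStreamEnd(
  upstream: ReadableStream<Uint8Array>,
  settle: (completed: boolean) => Promise<void>,
): ReadableStream<Uint8Array> {
  let settled = false;
  const settleOnce = (completed: boolean) => {
    if (settled) return;
    settled = true;
    void settle(completed).catch(() => {}); // settlement failure must not break the proxy
  };
  return upstream.pipeThrough(
    new TransformStream<Uint8Array, Uint8Array>({
      transform(chunk, controller) {
        controller.enqueue(chunk); // bytes forwarded verbatim
      },
      flush() {
        settleOnce(true); // upstream ended normally
      },
    }),
  );
}
```

A mid-stream cancel (Codex Ctrl-C) would additionally need a cancel hook so the reservation settles to a partial amount rather than the full estimate.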


Code Quality

3. _startTime is dead code in the passthrough path

It's passed through but never used — no timing metrics for passthrough requests. The regular path tracks elapsed time for logging. Passthrough latency is invisible.

4. AIProvider.responses?() JSDoc says "should throw" but route checks for method existence

// types.ts — "Providers that cannot proxy Responses API should throw."
// route.ts — if (!providerInstance.responses) { return 400 }

The contract is actually "if missing, route returns 400 before calling". The JSDoc is misleading — it implies a throwing implementation is expected, but the check prevents the call entirely.

5. Upstream response headers forwarded verbatim (beyond hop-by-hop)

x-ratelimit-*, cf-ray, x-request-id, and other gateway-internal headers will be forwarded to the client. This may be intentional for transparency (clients can see rate limit state), but worth confirming it doesn't leak internal Vercel AI Gateway infrastructure details.
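A sketch of header filtering along the lines this finding suggests. Which headers count as "gateway-internal" is an assumption here, and the helper name borrows from the extraction suggested later in this review.

```typescript
// Strip hop-by-hop and gateway-internal headers before forwarding the
// upstream response; keep x-ratelimit-* so clients can see throttle state.
const STRIP_HEADERS = new Set([
  // hop-by-hop
  "connection", "keep-alive", "transfer-encoding", "upgrade",
  // gateway-internal / infra (assumed list)
  "cf-ray", "cf-cache-status", "server", "via", "set-cookie",
]);

function buildPassthroughResponseHeaders(upstream: Headers): Headers {
  const out = new Headers();
  upstream.forEach((value, name) => {
    const key = name.toLowerCase();
    if (STRIP_HEADERS.has(key)) return;
    if (key.startsWith("x-vercel-")) return; // prefix-matched infra headers
    out.set(name, value);
  });
  return out;
}
```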


Minor

6. estimateRequestCost fallback of 0.01 may be too small

For large Codex sessions (long apply_patch exchanges), $0.01 could be well below actual cost. The 50% buffer in estimateRequestCost itself doesn't apply here since we're bypassing it. Consider a higher floor (e.g. 0.10) or document why 0.01 is safe.


Tests

Coverage is solid for a new path (13 tests covering the main branches). Two notes:

  • The setTimeout(r, 0) flush in "reconciles passthrough reservation" is fine for now but fragile if the background settle ever becomes truly async (e.g. a DB call). Consider using vi.waitFor or an explicit flush helper.
  • No test for the _startTime / timing path — but given the parameter is unused, this is consistent.

Summary

  • Detection logic (isNativeResponsesPayload): ✅ Correct, well-tested
  • Model ID normalization: ✅ Clean, non-mutating
  • Credit reservation & refund: ✅ Correct path logic
  • Settlement timing: ⚠️ Settles before stream consumed (known TODO)
  • apiKey attribution: ❌ Missing — should be logged
  • Test coverage: ✅ Good breadth

The apiKey attribution gap is the only thing I'd fix before merging — the rest are acceptable tradeoffs or TODOs.

Addresses the Claude bot review on #427:

1. apiKey attribution gap (blocker) — `handleNativeResponsesPassthrough`
   accepted an `apiKey` parameter but never used it, creating an
   audit-trail hole. Now threads `apiKey?.id` into both the
   `reserveAndDeductCredits` and `reconcile` metadata, and includes it
   in the final `logger.info` so credits can be traced back to the
   paying key.

2. Dead code: `_startTime` → `startTime` is now used to compute and
   log `durationMs` on every passthrough response. The regular path
   had this; the passthrough was invisible.

3. Settlement-before-stream trade-off — explicit comment now calls out
   the Codex CLI Ctrl-C / mid-stream-cancel case that the previous
   "TODO" glossed over. Keeps the behavior (50% safety buffer is an
   acceptable overcharge upper bound for now) but documents the gap.

4. `AIProvider.responses?()` JSDoc said "should throw" but the route
   actually checks for method existence and returns 400. Fixed the
   contract description so the next reader doesn't look for a
   throwing implementation that never existed.

5. Header leak: previously we stripped only the four hop-by-hop
   headers, passing `x-vercel-*`, `cf-ray`, `server`, `via` etc.
   through to clients unchanged. Now strips all gateway-internal
   infra headers while explicitly keeping `x-ratelimit-*` for client
   transparency.

6. Fallback credit floor: when `estimateRequestCost` throws we now
   reserve $0.10 instead of $0.01. The old floor was too small to
   protect against a runaway session when estimation silently fails
   — the Chat Completions path gets a 50% safety buffer from
   `estimateRequestCost` that the passthrough was bypassing.

New tests (3, bringing total to 31):
- api_key_id flows into reserve + reconcile metadata
- gateway-internal headers stripped, x-ratelimit-* forwarded
- fallback floor is 0.10 when estimateRequestCost throws

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HaruHunab1320
Contributor Author

Thanks for the review! All six findings addressed in 6f5224e:

  1. apiKey parameter unused (blocker) → Threaded apiKey?.id into reserveAndDeductCredits metadata, reconcile metadata, and the final logger.info payload
  2. Credit reconciliation fires before stream consumed → Kept the behavior (still a TODO — proper fix requires teeing the stream) but rewrote the inline comment to explicitly call out the Codex Ctrl-C case so the trade-off is visible
  3. _startTime dead code → Now startTime, used to compute and log durationMs on every passthrough response
  4. AIProvider.responses?() JSDoc misleading → Rewrote to describe the actual contract ("optional; caller must check presence and return unsupported_provider")
  5. Gateway-internal headers leaked to clients → Now strips x-vercel-*, cf-ray, cf-cache-status, server, via, keep-alive in addition to hop-by-hop. Explicitly keeps x-ratelimit-* for client transparency
  6. estimateRequestCost fallback of $0.01 too small → Raised floor to $0.10 with an inline comment explaining why the Chat Completions path's 50% safety buffer doesn't apply here

3 new tests (bringing total to 31):

  • api_key_id flows into reserve + reconcile metadata
  • Gateway-internal headers stripped, x-ratelimit-* forwarded verbatim
  • Fallback floor is 0.10 when estimateRequestCost throws

The test-side setTimeout(r, 0) flush and the missing _startTime timing test are both resolved as a side effect (the latter is now meaningfully testable since the value is actually used; happy to add an explicit duration assertion if you'd like). The proper stream-teed reconciliation stays on the follow-up list since it's a bigger change that deserves its own PR.

@claude

claude bot commented Apr 7, 2026

Code Review — #427

Overall this is solid work. The design rationale is well-documented, the moderation/credit paths mirror the Chat Completions route correctly, and the 31-test suite covers the important branches. A few things worth addressing before merge:


Bugs / Correctness

1. Provider capability check happens after credit reservation

handleNativeResponsesPassthrough reserves credits on lines 202-231, then checks providerInstance.responses on line 256. If the provider lacks the method, credits are correctly refunded via settleReservation?.(0), but we've done a round-trip to reserveAndDeductCredits for nothing. Move the provider existence check (and the unsupported_provider return) to before the credit reservation block. This also prevents any scenario where settleReservation is null but we still enter the provider-check branch.

2. set-cookie not in STRIP_HEADERS

The upstream gateway could theoretically return set-cookie headers. Propagating those to end clients through a proxy is almost always wrong (cookie domain/path semantics break, and it leaks session state). Add "set-cookie" to the strip list.

3. pseudoMessages as never type escape hatch

estimatedCost = await estimateRequestCost(
  model,
  pseudoMessages as never,   // <-- bypasses type safety
  ...
);

as never silences a real type mismatch between the synthesized pseudo-messages and what estimateRequestCost expects. Either fix the upstream function signature to accept the looser type, or cast through unknown with a comment explaining why it is safe. as never should not appear in production code.


Security

4. No body size guard

The raw request body is forwarded to the upstream gateway without an explicit size check. Chat Completions presumably has a size limit applied somewhere in the middleware stack — verify that limit also covers the /v1/responses branch. If it doesn't, a large payload (e.g. a multi-MB grammar definition in a custom tool) bypasses it here.


Minor / Code quality

5. isNativeResponsesPayload: missing test for bare "gpt-5" (no suffix)

The test suite covers "gpt-5-mini" and "gpt-5.4" but not a plain "gpt-5" string. The regex /^gpt-5(\b|[-.])/.test("gpt-5") should match (word boundary at end of string), but it's worth a dedicated test case given this was an explicit fixup from the original regex.
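Spelled out, the regex cases in question look like this (the gpt-50 negative case is an extra illustration, not mentioned in the review):

```typescript
// \b matches at end-of-input, so bare "gpt-5" is covered by the boundary branch.
const NATIVE_GPT5 = /^gpt-5(\b|[-.])/;

console.assert(NATIVE_GPT5.test("gpt-5"));      // word boundary at string end
console.assert(NATIVE_GPT5.test("gpt-5.4"));    // dot suffix
console.assert(NATIVE_GPT5.test("gpt-5-mini")); // dash suffix
console.assert(!NATIVE_GPT5.test("gpt-50"));    // digit run: no boundary, no [-.]
console.assert(!NATIVE_GPT5.test("gpt-4o"));
```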

6. Redundant condition on lines 108-115

const bodyForUpstream =
  gatewayModel === model ? body : { ...body, model: gatewayModel };
if (gatewayModel !== model) {
  logger.debug(...)
}

The second if repeats the same comparison. An else branch on the ternary or a local needsRewrite boolean would be cleaner.

7. handleNativeResponsesPassthrough function length

At ~300 lines this function is doing a lot: validation, moderation, cost estimation, credit reservation, provider lookup, upstream fetch, error handling, header filtering, logging, and settlement. Consider extracting the credit-reservation block into a reserveCreditsForPassthrough() helper and the header-stripping into buildPassthroughResponseHeaders(). The existing Chat Completions handler has this same length problem, so not a blocker here, but worth a follow-up.


Acknowledged trade-offs (no action required)


Verdict: two correctness items (#1, #2) and the type-safety issue (#3) should be fixed before merge. The rest are minor polish. The test coverage and credit-audit improvements from the previous review round are appreciated.

HaruHunab1320 added a commit that referenced this pull request Apr 7, 2026
…ough

Builds on #427. The native passthrough previously settled the credit
reservation to the reserved estimate on every request regardless of
actual usage — so a Codex turn that emitted 200 output tokens was
charged at the same rate as one that emitted 4000, as long as both
fit under the reserved upper bound.

This PR wraps the upstream ReadableStream with a pass-through reader
that also extracts `response.usage` from the terminal `response.completed`
SSE event. When the stream ends, we compute the real cost via
`calculateCost(model, provider, inputTokens, outputTokens)` and
reconcile the reservation down to actual (capped at the reservation
as an upper bound — over-runs would need a separate post-hoc ledger
entry which is out of scope).

### Design

- Zero behavioral impact on the client: bytes flow in the exact order
  and size the upstream produced them. We do not batch, rewrite, or
  buffer beyond a single SSE frame for parser bookkeeping.
- SSE events are parsed out-of-band on the side. Parse errors are
  swallowed — a malformed frame must never break the forward path.
- Exactly one terminal callback per stream lifecycle: `end`, `cancel`,
  or `error`. Client cancel (Codex CLI Ctrl-C, tab close) fires the
  callback with whatever usage we had seen before the cancel, so a
  turn that completed-then-cancelled still reconciles to actual.
- Pull-based ReadableStream so the upstream is only drained when the
  client reads, matching the semantics of a direct proxy.
- Cost is clamped at the reservation: `actualCost = min(computed, reserved)`.
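A minimal sketch of what a wrapper like the one described above could look like, under the stated design constraints. Field names follow the commit message; the error and cancel paths, and the frame-splitting edge cases the real tests cover, are elided.

```typescript
// Forward bytes verbatim while scanning SSE frames out-of-band for the
// terminal `response.completed` usage payload. Pull-based: the upstream is
// only drained as the client reads.
function wrapWithUsageExtraction(
  upstream: ReadableStream<Uint8Array>,
  onEnd: (usage: { input_tokens: number; output_tokens: number } | null) => void,
): ReadableStream<Uint8Array> {
  const decoder = new TextDecoder();
  let buffer = "";
  let usage: { input_tokens: number; output_tokens: number } | null = null;
  const reader = upstream.getReader();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { done, value } = await reader.read();
      if (done) {
        onEnd(usage);
        controller.close();
        return;
      }
      controller.enqueue(value); // client sees bytes exactly as produced
      buffer += decoder.decode(value, { stream: true });
      const frames = buffer.split("\n\n");
      buffer = frames.pop() ?? ""; // keep the trailing partial frame
      for (const frame of frames) {
        const data = frame.split("\n").find((l) => l.startsWith("data: "))?.slice(6);
        if (!data || data === "[DONE]") continue;
        try {
          const evt = JSON.parse(data);
          if (evt.type === "response.completed" && evt.response?.usage) {
            usage = evt.response.usage;
          }
        } catch {
          // a malformed frame must never break the forward path
        }
      }
    },
  });
}
```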

### Trade-offs

- If the client aborts mid-stream before `response.completed` arrives,
  we settle to the reserved estimate (same as pre-stream-wrap
  behavior). The 50% safety buffer in `estimateRequestCost` stays as
  the upper bound for that case.
- If `calculateCost` throws, we fall back to settling at the reserved
  amount rather than crashing the reconciliation.
- We cap at reserved rather than allowing over-collection. Anything
  beyond the reservation would need a separate post-hoc charge which
  this PR doesn't implement.

### Tests

14 unit tests for `wrapWithUsageExtraction` covering:
- Passthrough fidelity (byte-exact, chunk-split frames, [DONE] sentinel,
  malformed JSON recovery)
- Usage extraction (headline + cached + reasoning tokens, missing
  fields default to 0, null when no completed event)
- Termination paths (end, cancel before completed, cancel after
  completed, error, throwing callback swallowed)

4 new integration tests in the route suite:
- Reconciles to actual cost when response.completed reports usage
- Caps actual cost at reserved when model over-runs the estimate
- Falls back to reserved when no response.completed arrives
- Existing reconcile/api_key_id tests updated to actually drain the
  stream (pull-based wrapper gates the callback on reader progress)

Total: 48 passing tests across the two affected suites (14 + 34).

### Stacked on #427

This branch is stacked on `fix/responses-native-passthrough`. Merge
#427 first, then rebase this onto `dev`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses the Claude bot's second review on #427. All three correctness
items (#1-#3) fixed; two of the minor items (#5, #6) fixed; the body
size guard (#4) was flagged as a concern worth resolving so I added
an explicit one.

1. Provider check now happens BEFORE credit reservation. Previously
   an unsupported provider caused a reserve → refund round-trip on
   every request that touched the credits ledger for no reason. Now
   the check returns 400 before any ledger activity. Also eliminates
   the failure mode where a reservation could leak if the refund
   path threw.

2. `set-cookie` added to STRIP_HEADERS. Forwarding upstream cookies
   through the proxy breaks cookie domain/path semantics and leaks
   upstream session state — must be stripped regardless of what the
   gateway returns.

3. `pseudoMessages as never` removed. The previous cast was
   silencing a non-existent type mismatch: `{role, content: string}[]`
   is a valid subtype of `{role, content: string | object}[]` which
   is what `estimateRequestCost` accepts. Plain assignment works.
   Added an explicit type annotation on `pseudoMessages` so the
   subtyping relationship is visible at the call site.

4. Body size guard (4 MiB cap via `Content-Length` header) added at
   the top of `handlePOST` before any auth/parse/reserve work. The
   passthrough forwards bodies verbatim upstream, so an unbounded
   payload could bypass whatever limits the middleware stack has
   elsewhere. Chat Completions is still unguarded — that's a
   separate gap, but the passthrough is where we're most exposed.
   Returns 413 with `code: "request_too_large"`.

5. Added an explicit test for bare `model: "gpt-5"` (no suffix) to
   lock in the `/^gpt-5(\b|[-.])/` regex behavior.

6. Removed the redundant `gatewayModel !== model` check — now uses
   a single `needsModelRewrite` local and a cleaner ternary. Fixes
   the double-compare noise the review pointed out.

Type-system housekeeping: captured a local `providerResponses`
reference after the early check so the non-null narrowing survives
across the intervening credit reservation block (previously used a
`!` non-null assertion which biome flagged). Added an unreachable
safety fallback that returns 500 if provider.responses somehow
vanishes between check and call — pure type-level belt and braces.

New tests (5, bringing total to 36):
- Bare `gpt-5` → native Responses detection
- Provider check happens before credit reservation (ledger untouched)
- set-cookie header stripped from passthrough response
- Request body over 4 MiB cap rejected with 413
- Request body well under cap accepted normally

Deferred: function length (#7) — acknowledged as worth extracting
into helpers but the reviewer flagged as not a blocker; the Chat
Completions route has the same shape, so doing one without the
other is more churn than signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HaruHunab1320
Contributor Author

Thanks for the second pass. All three correctness items addressed in 9dbeb4f, plus both minor items, plus an explicit body-size guard for finding #4. Details:

  1. Provider capability check after credit reservation (blocker) → Moved the providerInstance.responses check to the top of handleNativeResponsesPassthrough, before any credit work. Eliminates the reserve → refund round-trip on unsupported-provider requests and removes the failure mode where a refund path throwing could leak a reservation.
  2. set-cookie not stripped (blocker) → Added to STRIP_HEADERS. Upstream cookies forwarded through a proxy always leak session state — must be stripped regardless of what the gateway sends.
  3. as never type escape hatch (blocker) → Removed. The cast was silencing a non-existent mismatch: {role, content: string}[] is a valid subtype of {role, content: string | object}[] which is what estimateRequestCost takes. Added an explicit type annotation on pseudoMessages so the subtyping is visible at the call site.
  4. No body size guard (security) → Added an explicit 4 MiB cap at the top of handlePOST, checked against Content-Length before any auth/parse/reserve work. Returns 413 with code: "request_too_large". Chat Completions is still unguarded — flagged as a separate gap since doing both is more churn than signal, but the passthrough is where we're most exposed so I added it here.
  5. Missing test for bare "gpt-5" → Added explicit test case. The /^gpt-5(\b|[-.])/ regex does match via the end-of-input word boundary but it's now locked in.
  6. Redundant gatewayModel !== model double-check → Replaced with a single needsModelRewrite local and a clean ternary.

Type-system housekeeping: captured a local `providerResponses` reference after the early check so the non-null narrowing survives across the intervening credit-reservation block (previously this relied on `!`, which Biome flagged as a style warning). Added an unreachable safety fallback that returns 500 if `provider.responses` somehow vanishes between check and call; pure type-level belt and braces.

Deferred (#7, function length): acknowledged as worth extracting into helpers, but the Chat Completions route has the same shape, so doing one without the other is more noise than value. Worth a dedicated cleanup PR after the passthrough stabilizes.

Acknowledged trade-offs update: the settlement-before-stream-consumed trade-off you listed as "documented and acceptable" is actually resolved in the follow-up PR #428 (stacked on this branch). That PR wraps the upstream ReadableStream so bytes flow unchanged to the client while the wrapper extracts `response.usage` from the terminal `response.completed` SSE event and reconciles to real cost on stream close. If you have bandwidth, reviewing the two in order (#427 → #428) makes the most sense: the stacking is clean and #428 rebased cleanly onto the updated #427.

Test totals (this PR): 36/36 passing (was 31 + 5 new).

@claude

claude bot commented Apr 7, 2026

Code Review

Overall this is a well-structured PR with thorough test coverage and clear intent. The architecture of the passthrough path is sound — detect early, bail before touching the credits ledger if unsupported, forward verbatim. A few findings below.


Security

Body size guard can be bypassed (medium)

The Content-Length check at the top of handlePOST is a hint, not enforcement. Clients using chunked transfer encoding (Transfer-Encoding: chunked) omit Content-Length entirely, so the guard is silently skipped. More critically, req.json() is called unconditionally with no actual read limit — a client can stream an unbounded body past the guard.

For a passthrough route that forwards bodies verbatim upstream this is the highest-exposure spot in the file. Recommend enforcing a hard read limit:

// Before req.json(), enforce the actual read limit
const bodyText = await req.text();
if (bodyText.length > MAX_RESPONSES_BODY_BYTES) {
  return Response.json({ error: { message: `...`, code: "request_too_large" } }, { status: 413 });
}
const rawBody = JSON.parse(bodyText) as Record<string, unknown>;

The existing Content-Length check is still worth keeping as an early fast path, but it should not be the only guard.
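Put together, the fast path and the real enforcement might look like this. The constant and helper names are assumptions for illustration, not the PR's actual identifiers:

```typescript
// Sketch of a two-layer guard: the Content-Length header is a cheap
// early rejection, but the post-read length check is the enforcement
// that chunked or lying clients cannot bypass.
const MAX_RESPONSES_BODY_BYTES = 4 * 1024 * 1024; // 4 MiB

async function readBoundedJson(req: Request): Promise<unknown | Response> {
  const tooLarge = () =>
    Response.json(
      { error: { message: "Request body too large", code: "request_too_large" } },
      { status: 413 },
    );
  // Fast path: reject on the declared size when present.
  const declared = Number(req.headers.get("content-length") ?? "0");
  if (declared > MAX_RESPONSES_BODY_BYTES) return tooLarge();
  // Enforcement: buffer the body (chunked bodies included) and re-check.
  const bodyText = await req.text();
  if (bodyText.length > MAX_RESPONSES_BODY_BYTES) return tooLarge();
  return JSON.parse(bodyText);
}
```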

Upstream error object forwarded verbatim (low)

In handleNativeResponsesPassthrough (the catch block around providerResponses):

const gwErr = err as { status: number; error: unknown };
return Response.json({ error: gwErr.error }, { status: gwErr.status });

gwErr.error is typed as unknown and forwarded to the client without inspection. If the gateway ever includes internal stack traces or infrastructure details in error objects, those leak directly. Consider wrapping in a sanitized structure or at minimum asserting it's a plain object before forwarding.
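One possible shape for such a sanitizer. The field whitelist and length cap here are illustrative, not the PR's implementation:

```typescript
// Whitelist-based error sanitizer: copy only the OpenAI-compatible
// fields, cap their length, and drop everything else (stack traces,
// infrastructure details, nested objects).
const MAX_FIELD_LEN = 512; // illustrative cap

function sanitizeGatewayError(err: unknown): {
  message: string; type?: string; code?: string; param?: string;
} {
  const src = (err && typeof err === "object" ? err : {}) as Record<string, unknown>;
  const pick = (k: string): string | undefined =>
    typeof src[k] === "string" ? (src[k] as string).slice(0, MAX_FIELD_LEN) : undefined;
  return {
    message: pick("message") ?? "Upstream gateway error",
    type: pick("type"),
    code: pick("code"),
    param: pick("param"),
  };
}
```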


Correctness

Background settlement race in serverless environments (medium)

void (async () => {
  try {
    await settle(reserved);
  } catch (err) { ... }
})();

In Next.js on Vercel (serverless), the function invocation can be frozen or terminated once the Response is returned. The background settle() may never complete, leaving the reservation without reconciliation. The PR comment acknowledges this as a known trade-off for streams, but for non-streaming (synchronous) responses it's an unnecessary risk — the body is already fully materialized at that point.

Recommendation: if upstreamResponse.headers.get("content-type") does not include text/event-stream, await settle(reserved) synchronously before returning.
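The dispatch could be as small as this. Names are assumed; `settle` stands in for the reservation-settlement call:

```typescript
// Sketch of the recommended dispatch: for SSE the settle stays
// fire-and-forget (first byte must not block); for everything else
// the body is already fully materialized upstream, so settling
// before returning costs nothing and survives a serverless freeze.
async function returnWithSettlement(
  upstream: Response,
  settle: () => Promise<void>,
): Promise<Response> {
  const contentType = upstream.headers.get("content-type") ?? "";
  if (contentType.includes("text/event-stream")) {
    void settle().catch(() => {}); // reconciled later by the stream path
  } else {
    await settle(); // synchronous settlement for non-streaming responses
  }
  return upstream;
}
```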

isNativeResponsesPayload doesn't handle non-string instructions (low)

if (typeof body.instructions === "string") return true;

If a malformed payload sends instructions: 42 or instructions: {}, it won't trigger passthrough. The request will then fall through to the Chat Completions transform path, which will choke on an unexpected instructions field. This is an edge case but worth a comment explaining the deliberate typeof === "string" choice, or widening to body.instructions != null.
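A sketch of the predicate with the widened check applied. The function exists in the PR; this body is a reconstruction from the description, not the actual source:

```typescript
// Reconstructed sketch of isNativeResponsesPayload: any present
// `instructions` value triggers passthrough (the widened check),
// as does any non-function tool type or a gpt-5-family model id.
function isNativeResponsesPayload(body: Record<string, unknown>): boolean {
  if (body.instructions != null) return true; // widened from typeof === "string"
  const tools = Array.isArray(body.tools) ? body.tools : [];
  if (tools.some((t) => (t as { type?: string } | null)?.type !== "function")) return true;
  return typeof body.model === "string" && /^gpt-5(\b|[-.])/.test(body.model);
}
```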

Double providerInstance.responses check (nit)

The !providerInstance.responses check at line ~126 is followed by a second guard at line ~292 with a comment "Unreachable at runtime — load-bearing only for TypeScript." You can eliminate the second check entirely by capturing the narrowed reference immediately after the first:

const providerResponses = providerInstance.responses;
if (!providerResponses) {
  return Response.json({ error: { code: "unsupported_provider" } }, { status: 400 });
}
// providerResponses is now definitively non-null — no second guard needed

Test Coverage

Coverage is solid for the happy path and most error branches. A few gaps:

  1. No test for chunked/missing Content-Length with a large body — the current 413 test only exercises the Content-Length hint path. If the fix above is applied, a test that sends no Content-Length but a large body would be a useful regression guard.

  2. No test for non-streaming passthrough — all passthrough tests use text/event-stream. A test verifying a synchronous (non-streaming) JSON response passes through correctly would improve confidence.

  3. No test for malformed JSON body: `req.json()` is called without a try/catch in the passthrough path's outer `handlePOST`. A request with a non-JSON body currently falls through to the general catch at the bottom of `handlePOST`, but that catch is designed for the Chat Completions path and may return an unexpected error shape.

  4. The `reconciles passthrough reservation to reserved amount on success` test uses `setTimeout(0)` to flush the background settle. This works today but is fragile under test parallelism. Consider exposing a `__testFlushSettlement` helper or making the background settlement synchronous in test mode.


Minor

  • const aiSdkRequest = rawBody as unknown as AISdkRequest (line ~520) is a double-cast that bypasses type safety for the existing path. The existing path hasn't changed, but this weakens the type contract — worth narrowing with a runtime check or at least a comment.
  • The MAX_RESPONSES_BODY_BYTES constant is defined inside route.ts right above handlePOST, far from the top of the file. Moving it to the module-level constants area (or alongside other config) would aid discoverability.

Summary

| Severity | Finding |
|----------|---------|
| Medium | Body size guard bypassable via chunked encoding / missing Content-Length |
| Medium | Background settlement may not complete in serverless for synchronous responses |
| Low | Upstream error object forwarded verbatim without sanitization |
| Low | `instructions` non-string values fall through to the wrong code path |
| Nit | Redundant second `providerInstance.responses` null guard |

The passthrough architecture itself is the right call — trying to collapse Responses-API-only features into Chat Completions would lose information. The test suite is well-structured and the credit/reservation logic is carefully handled (particularly the provider check before reservation). The items above are improvements rather than blockers, with the body size enforcement being the most actionable security fix.

Addresses the Claude bot's third review on #427. All 5 substantive
findings fixed plus the 4 test gaps the reviewer flagged.

1. Body size guard bypassable via chunked encoding (medium): the
   Content-Length header check is now an early fast path only. The
   real enforcement is a post-read length check on the buffered
   request text. Clients using `Transfer-Encoding: chunked` (which
   omits Content-Length) or lying about Content-Length will be
   caught by the post-read check before we touch req.json(). The
   header check stays for the cheap-fast-path benefit.

2. Background settle in serverless (medium): for non-streaming
   responses (content-type != "text/event-stream") we now `await`
   the settle synchronously before returning. The body is fully
   materialized at that point so deferring serves no purpose, and
   on Vercel the function can be frozen once the Response is
   returned, leaving a background promise to never complete and
   stranding the reservation. Streaming responses still
   fire-and-forget — that path is the one PR #428 addresses
   properly via stream-teed reconciliation.

3. Upstream error object forwarded verbatim (low): added
   `sanitizeGatewayError()` which extracts only the well-known
   OpenAI-compatible fields (message, type, code, optional param)
   from an unknown gateway error envelope. Stack traces,
   infrastructure host names, and arbitrary nested objects are
   stripped. Whitelist + length cap means even a hostile gateway
   payload can't leak internals through this path.

4. `isNativeResponsesPayload` non-string instructions (low): now
   triggers passthrough on ANY presence of `instructions`, not
   just string. A malformed `instructions: 42` payload routes
   through the passthrough so the upstream returns a coherent
   validation error instead of falling through to Chat Completions
   which would choke on the unexpected field.

5. Double `providerResponses` null guard (nit): captured the
   narrowed local at the early-bail check site so the second
   guard before the forward call is gone. Removed both the redundant
   if-block AND the unreachable 500 fallback that was load-bearing
   only for the type system.

New tests (6, bringing total to 42):
- 413 with no Content-Length (chunked encoding bypass test)
- 400 invalid_json on malformed JSON body
- Synchronous settle for non-streaming passthrough (no flush needed)
- Non-streaming JSON passthrough body forwarded unchanged
- Sanitized gateway error envelope (hostile fields stripped)
- Non-string `instructions: 42` routes to passthrough

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude

claude bot commented Apr 8, 2026

Code Review — Round 4

Great work getting through three rounds of fixes. The code is in good shape. Two remaining issues I can see in the current state:


Bugs

1. Gateway-internal headers leak through non-ok upstream responses (low)

The success path carefully builds outHeaders, strips STRIP_HEADERS, and sets x-eliza-responses-passthrough. But the non-ok early-return path forwards the upstream response unchanged:

if (!upstreamResponse.ok) {
  await settleReservation?.(0);
  return new Response(upstreamResponse.body, {
    status: upstreamResponse.status,
    statusText: upstreamResponse.statusText,
    headers: upstreamResponse.headers,   // ← raw, unfiltered
  });
}

A gateway 429 or 503 response could still carry cf-ray, x-vercel-*, server, set-cookie, etc. The same outHeaders construction (strip + add x-eliza-responses-passthrough) should apply here. Suggest extracting the header-building block into a small helper and calling it from both paths.
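The suggested helper might look like this. The exact `STRIP_HEADERS` contents are assumed; only `set-cookie` is confirmed by the PR discussion:

```typescript
// Sketch of a shared header builder for both the success and
// non-ok paths. STRIP_HEADERS membership here is illustrative.
const STRIP_HEADERS = new Set([
  "set-cookie", // confirmed stripped in the PR
  "cf-ray",
  "server",
  "content-length", // body may be re-framed by the runtime
]);

function buildOutHeaders(upstream: Headers): Headers {
  const out = new Headers();
  upstream.forEach((value, key) => {
    if (!STRIP_HEADERS.has(key.toLowerCase())) out.set(key, value);
  });
  out.set("x-eliza-responses-passthrough", "1");
  return out;
}
```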

2. JSON.parse(bodyText) can return a non-object; type assertion masks the crash (low)

JSON.parse("null") returns null — valid JSON, no exception thrown, so parseErr is never triggered. rawBody then holds null at runtime despite being typed as Record<string, unknown>. Two lines later:

if (isNativeResponsesPayload(rawBody)) {

isNativeResponsesPayload accesses body.instructions on line 1 — that throws TypeError: Cannot read properties of null. The outer handlePOST catch catches it, but the client gets a generic 500 instead of a descriptive 400. A simple guard after the JSON.parse call would fix both the crash and the error shape:

rawBody = JSON.parse(bodyText);
if (!rawBody || typeof rawBody !== "object" || Array.isArray(rawBody)) {
  return Response.json(
    { error: { message: "Request body must be a JSON object", type: "invalid_request_error", code: "invalid_json" } },
    { status: 400 },
  );
}

Minor

  • bodyText.length counts UTF-16 code units, not bytes. For the overwhelmingly ASCII content of JSON payloads this is functionally equivalent to MAX_RESPONSES_BODY_BYTES, but the comparison is semantically inconsistent with the Content-Length byte check above it. A comment to that effect (// length in JS chars ≈ bytes for ASCII JSON) would prevent future confusion.
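The distinction is easy to demonstrate; `TextEncoder` gives the actual UTF-8 byte count:

```typescript
// string.length counts UTF-16 code units; a byte cap needs a real
// encode. For ASCII-only JSON the two agree, which is why the
// comparison in the PR is functionally fine but worth a comment.
const byteLength = (s: string): number => new TextEncoder().encode(s).length;

const asciiJson = '{"ok":true}';
const emojiJson = '{"ok":"🚀"}';
console.log(asciiJson.length === byteLength(asciiJson)); // true
console.log(emojiJson.length, byteLength(emojiJson));    // code units < bytes
```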

Summary

| Severity | Finding | Status |
|----------|---------|--------|
| Low | Header stripping skipped on non-ok upstream responses | New |
| Low | `JSON.parse(null)` → type-masked crash → 500 instead of 400 | New |
| Nit | `bodyText.length` vs byte comparison inconsistency | New |

The three-round cleanup has been thorough — the credit-audit trail, error sanitization, body-size enforcement, serverless settlement fix, and header stripping on the success path are all solid. These two low items are the only things I can find still open.

HaruHunab1320 added a commit that referenced this pull request Apr 8, 2026

Builds on #427. The native passthrough previously settled the credit
reservation to the reserved estimate on every request regardless of
actual usage — so a Codex turn that emitted 200 output tokens was
charged at the same rate as one that emitted 4000, as long as both
fit under the reserved upper bound.

This PR wraps the upstream ReadableStream with a pass-through reader
that also extracts `response.usage` from the terminal `response.completed`
SSE event. When the stream ends, we compute the real cost via
`calculateCost(model, provider, inputTokens, outputTokens)` and
reconcile the reservation down to actual (capped at the reservation
as an upper bound — over-runs would need a separate post-hoc ledger
entry which is out of scope).

- Zero behavioral impact on the client: bytes flow in the exact order
  and size the upstream produced them. We do not batch, rewrite, or
  buffer beyond a single SSE frame for parser bookkeeping.
- SSE events are parsed out-of-band on the side. Parse errors are
  swallowed — a malformed frame must never break the forward path.
- Exactly one terminal callback per stream lifecycle: `end`, `cancel`,
  or `error`. Client cancel (Codex CLI Ctrl-C, tab close) fires the
  callback with whatever usage we had seen before the cancel, so a
  turn that completed-then-cancelled still reconciles to actual.
- Pull-based ReadableStream so the upstream is only drained when the
  client reads, matching the semantics of a direct proxy.
- Cost is clamped at the reservation: `actualCost = min(computed, reserved)`.

- If the client aborts mid-stream before `response.completed` arrives,
  we settle to the reserved estimate (same as pre-stream-wrap
  behavior). The 50% safety buffer in `estimateRequestCost` stays as
  the upper bound for that case.
- If `calculateCost` throws, we fall back to settling at the reserved
  amount rather than crashing the reconciliation.
- We cap at reserved rather than allowing over-collection. Anything
  beyond the reservation would need a separate post-hoc charge which
  this PR doesn't implement.
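The out-of-band usage extraction the commit describes could be sketched like this. The event shape follows the Responses-API SSE format quoted above; the function name is illustrative:

```typescript
// Scan a completed SSE frame for the terminal response.completed
// event and pull out its usage block. Parse errors are swallowed:
// a malformed frame must never break the forward path.
type Usage = { input_tokens?: number; output_tokens?: number };

function extractUsage(sseFrame: string): Usage | null {
  for (const line of sseFrame.split("\n")) {
    if (!line.startsWith("data:")) continue;
    try {
      const evt = JSON.parse(line.slice(5).trim());
      if (evt?.type === "response.completed") return evt.response?.usage ?? null;
    } catch {
      // Malformed JSON: ignore and keep scanning.
    }
  }
  return null;
}
```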

14 unit tests for `wrapWithUsageExtraction` covering:
- Passthrough fidelity (byte-exact, chunk-split frames, [DONE] sentinel,
  malformed JSON recovery)
- Usage extraction (headline + cached + reasoning tokens, missing
  fields default to 0, null when no completed event)
- Termination paths (end, cancel before completed, cancel after
  completed, error, throwing callback swallowed)

4 new integration tests in the route suite:
- Reconciles to actual cost when response.completed reports usage
- Caps actual cost at reserved when model over-runs the estimate
- Falls back to reserved when no response.completed arrives
- Existing reconcile/api_key_id tests updated to actually drain the
  stream (pull-based wrapper gates the callback on reader progress)

Total: 48 passing tests across the two affected suites (14 + 34).

This branch is stacked on `fix/responses-native-passthrough`. Merge
#427 first, then rebase this onto `dev`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HaruHunab1320
Contributor Author

Third review round addressed in c91a33d. All 5 substantive findings + the 4 test gaps fixed.

| Severity | Finding | Fix |
|----------|---------|-----|
| Medium | Body size guard bypassable via chunked encoding | Added a post-read text-length check. The Content-Length header check stays as an early fast path, but the real enforcement is now `await req.text()` followed by a length check before `JSON.parse`. Clients using `Transfer-Encoding: chunked` (which omits Content-Length) or lying about Content-Length are caught by the post-read check. |
| Medium | Background settle in serverless for non-streaming responses | Now dispatches on content-type. For `text/event-stream` we keep the fire-and-forget behavior (PR #428 wraps the stream so settlement is gated on consumption). For everything else (JSON responses, headers-only edge case) we `await` the settle synchronously before returning the Response, so a Vercel function freeze can't strand the reservation. |
| Low | Upstream error object forwarded verbatim | Added `sanitizeGatewayError()`, which extracts only the well-known OpenAI-compatible fields (`message`, `type`, `code`, optional `param`) from an unknown gateway error envelope, applies length caps, and discards everything else. Even a hostile gateway payload with stack, infrastructure, or arbitrary nested objects can't leak through. |
| Low | `isNativeResponsesPayload` non-string `instructions` | Widened from `typeof === "string"` to `!= null`. A malformed `instructions: 42` payload now routes through the passthrough so the upstream returns a coherent validation error instead of falling through to Chat Completions, which would choke on the unexpected field. |
| Nit | Double `providerResponses` null guard | Captured the narrowed local at the early-bail check site, so the second guard before the forward call is gone. Removed both the redundant if-block and the unreachable 500 fallback that was load-bearing only for the type system. |

New tests (6, bringing total to 42):

  1. 413 with no Content-Length (chunked encoding bypass test) — confirms the post-read length check is the real enforcement
  2. 400 invalid_json on malformed JSON body
  3. Synchronous settle for non-streaming passthrough — mockReconcileCredits is called by the time responsesPost returns, no flush helper needed
  4. Non-streaming JSON passthrough body forwarded unchanged
  5. Sanitized gateway error envelope — hostile fields (stack, infrastructure, nested) stripped, whitelisted fields preserved
  6. Non-string instructions: 42 routes to passthrough

setTimeout(0) flush note: for the streaming path the test file still uses setTimeout(0) to flush the background callback. That path is restructured by #428 to use the stream wrapper, where settlement is gated on the wrapped reader's terminal callback rather than a microtask hop. After #428 lands those tests can use await response.text() instead of the timeout flush, which is what the new non-streaming tests already do. Cleaning that up in #428's rebase makes more sense than churning #427.

#428 has been rebased onto this branch — clean conflict resolution combining the streaming wrapper (from #428) with the streaming/non-streaming dispatch (from this round). 59/59 passing on the rebased #428.

Test totals (this PR): 42/42 passing (was 36 + 6 new).

@lalalune lalalune merged commit c833ae9 into dev Apr 8, 2026
10 of 12 checks passed
HaruHunab1320 added a commit to milady-ai/milady that referenced this pull request Apr 8, 2026
- CodingAgentSettingsSection: move updateConfig out of the setPrefs
  state updater into a debounced useEffect (400ms). React state
  updaters must be pure; the old path fired twice per keystroke in
  Strict Mode and persisted 40+ partial-key snapshots while a user
  typed an API key. The debounce coalesces rapid edits into a single
  POST so a mid-flight failure can no longer leave milady.json with
  a partial credential.
- coding-agent-bridge: drop the unused duplicate of the
  POST /api/coding-agents/auth/:agent handler. The canonical copy
  lives in server.ts's private handleCodingAgentsFallback — the
  bridge export was never wired in, so keeping both would silently
  drift.
- agent-orchestrator-compat: stop treating Codex as auth-ready under
  cloudReady. elizaOS/cloud#427 (responses-stream reconciliation)
  has not deployed, so Codex-through-cloud would fail at runtime
  with no explanation. Restore once cloud#427/#428 ship.
- credentials: call readClaudeCodeOAuthBlob() once per provider row
  and destructure, instead of hitting it twice (the second call
  shelled out to `security` on macOS on every status poll).

deploy/cloud-agent-template/package.json pin is left in place: the
release contract check requires exact versions there (the template
is published as a standalone deploy artifact, not a workspace
member, so `workspace:*` won't resolve against any registry).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>