
fix(responses): native Responses-API passthrough for gpt-5.x + Codex CLI#427

Merged
lalalune merged 5 commits into dev from fix/responses-native-passthrough on Apr 8, 2026
Conversation

@HaruHunab1320
Contributor

Summary

Follow-up to #425. The tool format normalization in #425 handled flat function-tools but can't express the full set of Responses-API-only features that gpt-5.x clients (Codex CLI, AI SDK v5 Responses transport) send. This PR adds a native passthrough path that proxies the raw body to Vercel AI Gateway's /v1/responses endpoint unchanged.

What #425 couldn't handle

  • Top-level instructions and input fields
  • type: "custom" tools (Codex's apply_patch with Lark grammar)
  • type: "web_search", type: "local_shell" built-in tools
  • include: ["reasoning.encrypted_content"]
  • prompt_cache_key, parallel_tool_calls, reasoning summaries

Collapsing these into Chat Completions format drops information and produces the same class of "tools.0.function undefined" error that #425 was meant to fix.

What this PR adds

  1. AIProvider.responses(body, options) — optional method on the provider interface for native Responses-API passthrough.
  2. vercel-gateway.ts — implements responses() against Vercel AI Gateway's https://ai-gateway.vercel.sh/v1/responses passthrough. No body transformation.
  3. /v1/responses route — new early branch that detects native Responses-API payloads and routes them through the passthrough. Detection triggers on:
    • Presence of top-level instructions
    • Any tool with type !== "function" (custom, web_search, local_shell, image_generation, …)
    • Model id starting with gpt-5
  4. Model prefix rewrite — Vercel AI Gateway requires provider/model format. Codex sends bare gpt-5.4, so normalizeGatewayModelId() prefixes openai/ for bare OpenAI ids, anthropic/ for claude-*, google/ for gemini-*. Caller body is cloned, not mutated.
  5. Type unification — ChatCompletionsTool and ChatCompletionsToolChoice are now exported from providers/types.ts and shared with OpenAIChatRequest, eliminating structural drift.
  6. Credit handling — reserves credits via estimateRequestCost (50% safety buffer applies) and settles on completion. Accurate reconciliation from response.completed SSE event is a TODO.
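The detection and model-rewrite behavior described in items 3 and 4 can be sketched roughly as follows. This is illustrative only: the helper names come from this PR and its test section, but the bodies are assumptions, not the actual implementation.

```typescript
interface ResponsesBody {
  model: string;
  instructions?: string;
  tools?: Array<{ type: string }>;
  [key: string]: unknown;
}

// Detection: any one signal routes the request to the passthrough.
function isNativeResponsesPayload(body: ResponsesBody): boolean {
  if (typeof body.instructions === "string") return true;
  if (body.tools?.some((t) => t.type !== "function")) return true;
  // Matches bare and provider-prefixed gpt-5 ids, including gpt-5-mini / gpt-5.4.
  return /^(openai\/)?gpt-5(\b|[-.])/.test(body.model);
}

// Model prefix rewrite: pure function of the id; the route clones the body
// rather than mutating the caller's object.
function normalizeGatewayModelId(model: string): string {
  if (model.includes("/")) return model; // already provider-prefixed
  if (model.startsWith("claude-")) return `anthropic/${model}`;
  if (model.startsWith("gemini-")) return `google/${model}`;
  return `openai/${model}`; // assumed default for bare OpenAI ids
}
```

Under this sketch, `gpt-5.4` becomes `openai/gpt-5.4` while `openai/gpt-5.4` passes through untouched, matching the test cases listed below.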

Verified against production upstream

Before opening this PR we ran a live HTTP proxy (scripts/codex-sniff.ts in the milady repo) between Codex CLI and https://ai-gateway.vercel.sh/v1/responses with a real gateway key. Captured and confirmed:

  • POST /v1/responses with model openai/gpt-5.4, instructions, input, 12 tools (10 function + 1 custom apply_patch with Lark grammar + 1 web_search), include: ["reasoning.encrypted_content"], reasoning: {effort: "medium"}, prompt_cache_key, stream: true → 200 OK SSE stream
  • Custom tools with Lark grammar forwarded verbatim, accepted by gateway
  • web_search tool forwarded verbatim, accepted by gateway
  • Model prefix rewrite accepted

Cross-referenced Vercel AI Gateway Responses API docs and Codex source (codex-rs/core/src/client_common.rs ToolSpec enum + codex-rs/core/src/client.rs:build_responses_request) to confirm payload shape is stable.

Test coverage

28 tests in packages/tests/unit/api/v1-responses-route.test.ts (15 from #425 still passing + 13 new):

Routing

  • gpt-5.4 + instructions + flat/custom/web_search tools → passthrough
  • Custom tool on any model → passthrough
  • gpt-5-mini (no dot suffix) detected as native Responses
  • Already-prefixed openai/gpt-5.4 → passthrough
  • Normal gpt-4o nested tool → transform path (not passthrough)

Model prefix rewrite

  • Bare gpt-5.4 → openai/gpt-5.4
  • claude-sonnet-4.6 → anthropic/claude-sonnet-4.6
  • gemini-2.5-pro → google/gemini-2.5-pro
  • Already-prefixed ids preserved
  • Caller body not mutated

Payment path

  • Upstream 4xx forwarded verbatim with credit refund (actualCost: 0)
  • Insufficient credits → 402, upstream not touched
  • Provider without .responses() → 400 unsupported_provider
  • Anonymous users skip credit reservation
  • Reconcile called with reserved amount on success

Passthrough body integrity

  • Custom / web_search / local_shell tools forwarded verbatim
  • Flat function tools forwarded verbatim (not normalized)
  • instructions field forwarded

Follow-ups (not blocking)

  1. 429 retry-with-backoff on vercel-gateway.responses(). Hit Vercel's free-tier throttle during testing and Codex's own retry budget exhausted. Exponential backoff inside the gateway wrapper would make this invisible to clients.
  2. Accurate credit reconciliation from response.completed SSE event (currently settles to reserved estimate).
  3. Per-route rate limit preset. Native passthrough currently inherits RELAXED (200/min). Agentic sessions can burn that in a burst — consider an AGENTIC preset or raising to 600/min on this route.
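For follow-up item 1, the retry-with-backoff wrapper could take roughly this shape. This is a hypothetical sketch, not part of this PR: the attempt count, base delay, and Retry-After handling are all assumptions.

```typescript
// Retry 429s with exponential backoff so gateway throttling is invisible
// to clients. doFetch re-issues the upstream request each attempt.
async function fetchWithBackoff(
  doFetch: () => Promise<Response>,
  maxAttempts = 4,
  baseDelayMs = 250,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch();
    if (res.status !== 429 || attempt === maxAttempts - 1) return res;
    // Honor Retry-After when the gateway sends it; otherwise back off exponentially.
    const retryAfter = Number(res.headers.get("retry-after"));
    const delayMs =
      Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter * 1000
        : baseDelayMs * 2 ** attempt;
    await new Promise((r) => setTimeout(r, delayMs));
  }
}
```

Keeping the retry inside the gateway wrapper (rather than the route) means Codex's own retry budget is never consumed by free-tier throttling.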

Test plan

  • bun test packages/tests/unit/api/v1-responses-route.test.ts — 28/28 passing
  • Typecheck clean on changed files
  • Lint clean (biome auto-fix applied)
  • End-to-end verified against production Vercel AI Gateway via local sniff proxy with real gateway key
  • Deploy to dev and test with Codex CLI via milady's PARALLAX_LLM_PROVIDER=cloud flow

🤖 Generated with Claude Code

HaruHunab1320 and others added 2 commits April 7, 2026 10:08
Follow-up to #425. The tool format normalization in #425 handles flat
function-tools but cannot express Responses-API-only features:

  - top-level `instructions` and `input` fields
  - custom tools `{type: "custom", name, format}` (apply_patch)
  - built-in tools `{type: "web_search"}`, `image_generation`, etc.
  - reasoning.effort, prompt_cache_key, encrypted reasoning content

Codex CLI and the AI SDK v5 Responses transport rely on these. Trying
to collapse them into Chat Completions format drops information and
produces the same class of "tools.0.function undefined" error #425 was
meant to fix.

This PR adds a native passthrough path:

- AIProvider gains an optional `responses(body, options)` method that
  forwards an opaque body to the upstream `/responses` endpoint.
- vercel-gateway implements it against Vercel AI Gateway's
  `/responses` passthrough — no body transformation.
- /v1/responses detects native Responses-API payloads (gpt-5.x model,
  presence of `instructions`, or any non-function tool) and routes
  them through the passthrough, streaming the upstream response back
  verbatim. Older clients still go through the transform + normalize
  path unchanged.
- Credits are reserved via estimateRequestCost (50% safety buffer
  already applies) and settled to the reserved amount on completion.
  Accurate reconciliation from the `response.completed` SSE event is
  a TODO — filed for follow-up.

Tests cover the three routing branches (gpt-5.x + instructions →
passthrough, custom tools on any model → passthrough, vanilla nested
tools on gpt-4o → transform path) plus the existing 15-test
normalization suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds on the initial native Responses-API passthrough with the
follow-ups required to make it actually work against Vercel AI
Gateway's /v1/responses endpoint and to cover the payment paths
we missed in the first cut.

- Model prefix rewrite: Vercel AI Gateway requires `provider/model`
  format (e.g. `openai/gpt-5.4`), but Codex CLI and the AI SDK v5
  Responses transport send bare ids like `gpt-5.4`. Added
  `normalizeGatewayModelId()` which prefixes `openai/` for bare OpenAI
  ids, `anthropic/` for `claude-*`, `google/` for `gemini-*`, and
  leaves already-prefixed ids untouched. The rewrite builds a shallow
  clone of the caller body so the original is not mutated.

- Loosened the "is native Responses payload" detector to match
  `gpt-5` and `gpt-5-mini` (previously it only matched `gpt-5.x`).

- Custom/web_search/local_shell tool pass-through explicitly tested —
  verified from Codex source that these are the only four tool
  variants it emits, and Vercel AI Gateway accepts them all verbatim
  on /v1/responses per live test against the production endpoint.

Tests added (13 new, bringing the file to 28 total):
- Model prefix rewrite for bare / prefixed / claude / gemini ids
- Caller body not mutated by rewrite
- Bare `gpt-5` / `gpt-5-mini` detected as native Responses
- Upstream 4xx forwarded verbatim with credit refund
- Insufficient credits → 402, upstream not touched
- Provider without .responses() method → 400 unsupported_provider
- Anonymous users skip credit reservation in passthrough
- Reconcile called with reserved amount on success
- Custom / web_search / local_shell tools forwarded verbatim in body

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel

vercel bot commented Apr 7, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: eliza-cloud-v2. Deployment: Ready (Preview, Comment), updated Apr 8, 2026 3:15am (UTC).

@coderabbitai

coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9523e133-1dad-401b-bdd9-7d09c8994e6f


@claude

claude bot commented Apr 7, 2026

Code Review

Clean implementation overall — the detection + passthrough approach is the right call for Responses-API-only features. A few findings below.


Bugs / Issues

1. apiKey parameter is unused — audit trail gap

handleNativeResponsesPassthrough accepts apiKey: { id: string } | null but never references it. The chat completions path uses this for attribution/logging. Passthrough requests made via API key will have no key-level audit trail in logs.

// app/api/v1/responses/route.ts
async function handleNativeResponsesPassthrough(
  ...
  apiKey: { id: string } | null,   // ← never used
  ...

At minimum, include apiKey?.id in the logger.info call at the end of the function.


2. Credit reconciliation fires before stream is consumed

The void (async () => { await settle(reserved) })() fires immediately after returning the Response — not after the client finishes reading the stream body. For large streamed responses, this means the reservation settles before we know the request actually completed. If the upstream drops mid-stream, we still charge the full estimate.

This is noted as a TODO but I'd flag it explicitly: the 50% safety buffer helps, but a cancelled mid-stream request (e.g. Codex CLI ctrl-C) will still be charged the full estimate. Consider settling on stream close or cancel instead.
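One possible shape of the settle-on-stream-close approach suggested here, as a sketch only: the PR itself settles eagerly, and cancel handling plus usage reconciliation are elided.

```typescript
// Defer settlement until the upstream stream has been fully drained by the
// client, by piping through a transform whose flush() fires at end-of-stream.
function settleOnStreamEnd(
  upstream: ReadableStream<Uint8Array>,
  settle: (completed: boolean) => Promise<void>,
): ReadableStream<Uint8Array> {
  let settled = false;
  const settleOnce = (completed: boolean) => {
    if (settled) return;
    settled = true;
    void settle(completed).catch(() => {}); // settlement failure must not break the proxy
  };
  return upstream.pipeThrough(
    new TransformStream<Uint8Array, Uint8Array>({
      transform(chunk, controller) {
        controller.enqueue(chunk); // bytes forwarded verbatim
      },
      flush() {
        settleOnce(true); // upstream ended normally
      },
    }),
  );
}
```

A mid-stream cancel (Codex Ctrl-C) would additionally need a cancel hook so the reservation settles to a partial amount rather than the full estimate.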


Code Quality

3. _startTime is dead code in the passthrough path

It's passed through but never used — no timing metrics for passthrough requests. The regular path tracks elapsed time for logging. Passthrough latency is invisible.

4. AIProvider.responses?() JSDoc says "should throw" but route checks for method existence

// types.ts — "Providers that cannot proxy Responses API should throw."
// route.ts — if (!providerInstance.responses) { return 400 }

The contract is actually "if missing, route returns 400 before calling". The JSDoc is misleading — it implies a throwing implementation is expected, but the check prevents the call entirely.

5. Upstream response headers forwarded verbatim (beyond hop-by-hop)

x-ratelimit-*, cf-ray, x-request-id, and other gateway-internal headers will be forwarded to the client. This may be intentional for transparency (clients can see rate limit state), but worth confirming it doesn't leak internal Vercel AI Gateway infrastructure details.
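A sketch of header filtering along the lines this finding suggests. Which headers count as "gateway-internal" is an assumption here, and the helper name borrows from the extraction suggested later in this review.

```typescript
// Strip hop-by-hop and gateway-internal headers before forwarding the
// upstream response; keep x-ratelimit-* so clients can see throttle state.
const STRIP_HEADERS = new Set([
  // hop-by-hop
  "connection", "keep-alive", "transfer-encoding", "upgrade",
  // gateway-internal / infra (assumed list)
  "cf-ray", "cf-cache-status", "server", "via", "set-cookie",
]);

function buildPassthroughResponseHeaders(upstream: Headers): Headers {
  const out = new Headers();
  upstream.forEach((value, name) => {
    const key = name.toLowerCase();
    if (STRIP_HEADERS.has(key)) return;
    if (key.startsWith("x-vercel-")) return; // prefix-matched infra headers
    out.set(name, value);
  });
  return out;
}
```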


Minor

6. estimateRequestCost fallback of 0.01 may be too small

For large Codex sessions (long apply_patch exchanges), $0.01 could be well below actual cost. The 50% buffer in estimateRequestCost itself doesn't apply here since we're bypassing it. Consider a higher floor (e.g. 0.10) or document why 0.01 is safe.


Tests

Coverage is solid for a new path (13 tests covering the main branches). Two notes:

  • The setTimeout(r, 0) flush in "reconciles passthrough reservation" is fine for now but fragile if the background settle ever becomes truly async (e.g. a DB call). Consider using vi.waitFor or an explicit flush helper.
  • No test for the _startTime / timing path — but given the parameter is unused, this is consistent.

Summary

  • Detection logic (isNativeResponsesPayload): ✅ Correct, well-tested
  • Model ID normalization: ✅ Clean, non-mutating
  • Credit reservation & refund: ✅ Correct path logic
  • Settlement timing: ⚠️ Settles before stream consumed (known TODO)
  • apiKey attribution: ❌ Missing — should be logged
  • Test coverage: ✅ Good breadth

The apiKey attribution gap is the only thing I'd fix before merging — the rest are acceptable tradeoffs or TODOs.

Addresses the Claude bot review on #427:

1. apiKey attribution gap (blocker) — `handleNativeResponsesPassthrough`
   accepted an `apiKey` parameter but never used it, creating an
   audit-trail hole. Now threads `apiKey?.id` into both the
   `reserveAndDeductCredits` and `reconcile` metadata, and includes it
   in the final `logger.info` so credits can be traced back to the
   paying key.

2. Dead code: `_startTime` → `startTime` is now used to compute and
   log `durationMs` on every passthrough response. The regular path
   had this; the passthrough was invisible.

3. Settlement-before-stream trade-off — explicit comment now calls out
   the Codex CLI Ctrl-C / mid-stream-cancel case that the previous
   "TODO" glossed over. Keeps the behavior (50% safety buffer is an
   acceptable overcharge upper bound for now) but documents the gap.

4. `AIProvider.responses?()` JSDoc said "should throw" but the route
   actually checks for method existence and returns 400. Fixed the
   contract description so the next reader doesn't look for a
   throwing implementation that never existed.

5. Header leak: previously we stripped only the four hop-by-hop
   headers, passing `x-vercel-*`, `cf-ray`, `server`, `via` etc.
   through to clients unchanged. Now strips all gateway-internal
   infra headers while explicitly keeping `x-ratelimit-*` for client
   transparency.

6. Fallback credit floor: when `estimateRequestCost` throws we now
   reserve $0.10 instead of $0.01. The old floor was too small to
   protect against a runaway session when estimation silently fails
   — the Chat Completions path gets a 50% safety buffer from
   `estimateRequestCost` that the passthrough was bypassing.

New tests (3, bringing total to 31):
- api_key_id flows into reserve + reconcile metadata
- gateway-internal headers stripped, x-ratelimit-* forwarded
- fallback floor is 0.10 when estimateRequestCost throws

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HaruHunab1320
Contributor Author

Thanks for the review! All six findings addressed in 6f5224e:

  1. apiKey parameter unused (blocker) → Threaded apiKey?.id into reserveAndDeductCredits metadata, reconcile metadata, and the final logger.info payload
  2. Credit reconciliation fires before stream consumed → Kept the behavior (still a TODO — proper fix requires teeing the stream) but rewrote the inline comment to explicitly call out the Codex Ctrl-C case so the trade-off is visible
  3. _startTime dead code → Now startTime, used to compute and log durationMs on every passthrough response
  4. AIProvider.responses?() JSDoc misleading → Rewrote to describe the actual contract ("optional; caller must check presence and return unsupported_provider")
  5. Gateway-internal headers leaked to clients → Now strips x-vercel-*, cf-ray, cf-cache-status, server, via, keep-alive in addition to hop-by-hop. Explicitly keeps x-ratelimit-* for client transparency
  6. estimateRequestCost fallback of $0.01 too small → Raised floor to $0.10 with an inline comment explaining why the Chat Completions path's 50% safety buffer doesn't apply here

3 new tests (bringing total to 31):

  • api_key_id flows into reserve + reconcile metadata
  • Gateway-internal headers stripped, x-ratelimit-* forwarded verbatim
  • Fallback floor is 0.10 when estimateRequestCost throws

The test-side setTimeout(r, 0) flush and the missing _startTime timing test are both resolved as a side effect (the latter is now meaningfully testable since the value is actually used; happy to add an explicit duration assertion if you'd like). The proper stream-teed reconciliation stays on the follow-up list since it's a bigger change that deserves its own PR.

@claude

claude bot commented Apr 7, 2026

Code Review — #427

Overall this is solid work. The design rationale is well-documented, the moderation/credit paths mirror the Chat Completions route correctly, and the 31-test suite covers the important branches. A few things worth addressing before merge:


Bugs / Correctness

1. Provider capability check happens after credit reservation

handleNativeResponsesPassthrough reserves credits on lines 202-231, then checks providerInstance.responses on line 256. If the provider lacks the method, credits are correctly refunded via settleReservation?.(0), but we've done a round-trip to reserveAndDeductCredits for nothing. Move the provider existence check (and the unsupported_provider return) to before the credit reservation block. This also prevents any scenario where settleReservation is null but we still enter the provider-check branch.

2. set-cookie not in STRIP_HEADERS

The upstream gateway could theoretically return set-cookie headers. Propagating those to end clients through a proxy is almost always wrong (cookie domain/path semantics break, and it leaks session state). Add "set-cookie" to the strip list.

3. pseudoMessages as never type escape hatch

estimatedCost = await estimateRequestCost(
  model,
  pseudoMessages as never,   // <-- bypasses type safety
  ...
);

as never silences a real type mismatch between the synthesized pseudo-messages and what estimateRequestCost expects. Either fix the upstream function signature to accept the looser type, or cast through unknown with a comment explaining why it is safe. as never should not appear in production code.


Security

4. No body size guard

The raw request body is forwarded to the upstream gateway without an explicit size check. Chat Completions presumably has a size limit applied somewhere in the middleware stack — verify that limit also covers the /v1/responses branch. If it doesn't, a large payload (e.g. a multi-MB grammar definition in a custom tool) bypasses it here.


Minor / Code quality

5. isNativeResponsesPayload: missing test for bare "gpt-5" (no suffix)

The test suite covers "gpt-5-mini" and "gpt-5.4" but not a plain "gpt-5" string. The regex /^gpt-5(\b|[-.])/.test("gpt-5") should match (word boundary at end of string), but it's worth a dedicated test case given this was an explicit fixup from the original regex.
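Spelled out, the regex cases in question look like this (the gpt-50 negative case is an extra illustration, not mentioned in the review):

```typescript
// \b matches at end-of-input, so bare "gpt-5" is covered by the boundary branch.
const NATIVE_GPT5 = /^gpt-5(\b|[-.])/;

console.assert(NATIVE_GPT5.test("gpt-5"));      // word boundary at string end
console.assert(NATIVE_GPT5.test("gpt-5.4"));    // dot suffix
console.assert(NATIVE_GPT5.test("gpt-5-mini")); // dash suffix
console.assert(!NATIVE_GPT5.test("gpt-50"));    // digit run: no boundary, no [-.]
console.assert(!NATIVE_GPT5.test("gpt-4o"));
```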

6. Redundant condition on lines 108-115

const bodyForUpstream =
  gatewayModel === model ? body : { ...body, model: gatewayModel };
if (gatewayModel !== model) {
  logger.debug(...)
}

The second if repeats the same comparison. An else branch on the ternary or a local needsRewrite boolean would be cleaner.

7. handleNativeResponsesPassthrough function length

At ~300 lines this function is doing a lot: validation, moderation, cost estimation, credit reservation, provider lookup, upstream fetch, error handling, header filtering, logging, and settlement. Consider extracting the credit-reservation block into a reserveCreditsForPassthrough() helper and the header-stripping into buildPassthroughResponseHeaders(). The existing Chat Completions handler has this same length problem, so not a blocker here, but worth a follow-up.


Acknowledged trade-offs (no action required)


Verdict: two correctness items (#1, #2) and the type-safety issue (#3) should be fixed before merge. The rest are minor polish. The test coverage and credit-audit improvements from the previous review round are appreciated.

HaruHunab1320 added a commit that referenced this pull request Apr 7, 2026
…ough

Builds on #427. The native passthrough previously settled the credit
reservation to the reserved estimate on every request regardless of
actual usage — so a Codex turn that emitted 200 output tokens was
charged at the same rate as one that emitted 4000, as long as both
fit under the reserved upper bound.

This PR wraps the upstream ReadableStream with a pass-through reader
that also extracts `response.usage` from the terminal `response.completed`
SSE event. When the stream ends, we compute the real cost via
`calculateCost(model, provider, inputTokens, outputTokens)` and
reconcile the reservation down to actual (capped at the reservation
as an upper bound — over-runs would need a separate post-hoc ledger
entry which is out of scope).

### Design

- Zero behavioral impact on the client: bytes flow in the exact order
  and size the upstream produced them. We do not batch, rewrite, or
  buffer beyond a single SSE frame for parser bookkeeping.
- SSE events are parsed out-of-band on the side. Parse errors are
  swallowed — a malformed frame must never break the forward path.
- Exactly one terminal callback per stream lifecycle: `end`, `cancel`,
  or `error`. Client cancel (Codex CLI Ctrl-C, tab close) fires the
  callback with whatever usage we had seen before the cancel, so a
  turn that completed-then-cancelled still reconciles to actual.
- Pull-based ReadableStream so the upstream is only drained when the
  client reads, matching the semantics of a direct proxy.
- Cost is clamped at the reservation: `actualCost = min(computed, reserved)`.
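A minimal sketch of what a wrapper like the one described above could look like, under the stated design constraints. Field names follow the commit message; the error and cancel paths, and the frame-splitting edge cases the real tests cover, are elided.

```typescript
// Forward bytes verbatim while scanning SSE frames out-of-band for the
// terminal `response.completed` usage payload. Pull-based: the upstream is
// only drained as the client reads.
function wrapWithUsageExtraction(
  upstream: ReadableStream<Uint8Array>,
  onEnd: (usage: { input_tokens: number; output_tokens: number } | null) => void,
): ReadableStream<Uint8Array> {
  const decoder = new TextDecoder();
  let buffer = "";
  let usage: { input_tokens: number; output_tokens: number } | null = null;
  const reader = upstream.getReader();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { done, value } = await reader.read();
      if (done) {
        onEnd(usage);
        controller.close();
        return;
      }
      controller.enqueue(value); // client sees bytes exactly as produced
      buffer += decoder.decode(value, { stream: true });
      const frames = buffer.split("\n\n");
      buffer = frames.pop() ?? ""; // keep the trailing partial frame
      for (const frame of frames) {
        const data = frame.split("\n").find((l) => l.startsWith("data: "))?.slice(6);
        if (!data || data === "[DONE]") continue;
        try {
          const evt = JSON.parse(data);
          if (evt.type === "response.completed" && evt.response?.usage) {
            usage = evt.response.usage;
          }
        } catch {
          // a malformed frame must never break the forward path
        }
      }
    },
  });
}
```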

### Trade-offs

- If the client aborts mid-stream before `response.completed` arrives,
  we settle to the reserved estimate (same as pre-stream-wrap
  behavior). The 50% safety buffer in `estimateRequestCost` stays as
  the upper bound for that case.
- If `calculateCost` throws, we fall back to settling at the reserved
  amount rather than crashing the reconciliation.
- We cap at reserved rather than allowing over-collection. Anything
  beyond the reservation would need a separate post-hoc charge which
  this PR doesn't implement.

### Tests

14 unit tests for `wrapWithUsageExtraction` covering:
- Passthrough fidelity (byte-exact, chunk-split frames, [DONE] sentinel,
  malformed JSON recovery)
- Usage extraction (headline + cached + reasoning tokens, missing
  fields default to 0, null when no completed event)
- Termination paths (end, cancel before completed, cancel after
  completed, error, throwing callback swallowed)

4 new integration tests in the route suite:
- Reconciles to actual cost when response.completed reports usage
- Caps actual cost at reserved when model over-runs the estimate
- Falls back to reserved when no response.completed arrives
- Existing reconcile/api_key_id tests updated to actually drain the
  stream (pull-based wrapper gates the callback on reader progress)

Total: 48 passing tests across the two affected suites (14 + 34).

### Stacked on #427

This branch is stacked on `fix/responses-native-passthrough`. Merge
#427 first, then rebase this onto `dev`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses the Claude bot's second review on #427. All three correctness
items (#1-#3) fixed; two of the minor items (#5, #6) fixed; the body
size guard (#4) was flagged as a concern worth resolving so I added
an explicit one.

1. Provider check now happens BEFORE credit reservation. Previously
   an unsupported provider caused a reserve → refund round-trip on
   every request that touched the credits ledger for no reason. Now
   the check returns 400 before any ledger activity. Also eliminates
   the failure mode where a reservation could leak if the refund
   path threw.

2. `set-cookie` added to STRIP_HEADERS. Forwarding upstream cookies
   through the proxy breaks cookie domain/path semantics and leaks
   upstream session state — must be stripped regardless of what the
   gateway returns.

3. `pseudoMessages as never` removed. The previous cast was
   silencing a non-existent type mismatch: `{role, content: string}[]`
   is a valid subtype of `{role, content: string | object}[]` which
   is what `estimateRequestCost` accepts. Plain assignment works.
   Added an explicit type annotation on `pseudoMessages` so the
   subtyping relationship is visible at the call site.

4. Body size guard (4 MiB cap via `Content-Length` header) added at
   the top of `handlePOST` before any auth/parse/reserve work. The
   passthrough forwards bodies verbatim upstream, so an unbounded
   payload could bypass whatever limits the middleware stack has
   elsewhere. Chat Completions is still unguarded — that's a
   separate gap, but the passthrough is where we're most exposed.
   Returns 413 with `code: "request_too_large"`.

5. Added an explicit test for bare `model: "gpt-5"` (no suffix) to
   lock in the `/^gpt-5(\b|[-.])/` regex behavior.

6. Removed the redundant `gatewayModel !== model` check — now uses
   a single `needsModelRewrite` local and a cleaner ternary. Fixes
   the double-compare noise the review pointed out.

Type-system housekeeping: captured a local `providerResponses`
reference after the early check so the non-null narrowing survives
across the intervening credit reservation block (previously used a
`!` non-null assertion which biome flagged). Added an unreachable
safety fallback that returns 500 if provider.responses somehow
vanishes between check and call — pure type-level belt and braces.

New tests (5, bringing total to 36):
- Bare `gpt-5` → native Responses detection
- Provider check happens before credit reservation (ledger untouched)
- set-cookie header stripped from passthrough response
- Request body over 4 MiB cap rejected with 413
- Request body well under cap accepted normally

Deferred: function length (#7) — acknowledged as worth extracting
into helpers but the reviewer flagged as not a blocker; the Chat
Completions route has the same shape, so doing one without the
other is more churn than signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HaruHunab1320
Contributor Author

Thanks for the second pass. All three correctness items addressed in 9dbeb4f, plus both minor items, plus an explicit body-size guard for finding #4. Details:

  1. Provider capability check after credit reservation (blocker) → Moved the providerInstance.responses check to the top of handleNativeResponsesPassthrough, before any credit work. Eliminates the reserve → refund round-trip on unsupported-provider requests and removes the failure mode where a refund path throwing could leak a reservation.
  2. set-cookie not stripped (blocker) → Added to STRIP_HEADERS. Upstream cookies forwarded through a proxy always leak session state — must be stripped regardless of what the gateway sends.
  3. as never type escape hatch (blocker) → Removed. The cast was silencing a non-existent mismatch: {role, content: string}[] is a valid subtype of {role, content: string | object}[] which is what estimateRequestCost takes. Added an explicit type annotation on pseudoMessages so the subtyping is visible at the call site.
  4. No body size guard (security) → Added an explicit 4 MiB cap at the top of handlePOST, checked against Content-Length before any auth/parse/reserve work. Returns 413 with code: "request_too_large". Chat Completions is still unguarded — flagged as a separate gap since doing both is more churn than signal, but the passthrough is where we're most exposed so I added it here.
  5. Missing test for bare "gpt-5" → Added explicit test case. The /^gpt-5(\b|[-.])/ regex does match via the end-of-input word boundary but it's now locked in.
  6. Redundant gatewayModel !== model double-check → Replaced with a single needsModelRewrite local and a clean ternary.

Type-system housekeeping: captured a local `providerResponses` reference after the early check so the non-null narrowing survives across the intervening credit-reservation block (previously this relied on `!`, which Biome flagged as a style warning). Added an unreachable safety fallback that returns 500 if `provider.responses` somehow vanishes between check and call; pure type-level belt and braces.

Deferred (#7, function length): acknowledged as worth extracting into helpers, but the Chat Completions route has the same shape, so doing one without the other is more noise than value. Worth a dedicated cleanup PR after the passthrough stabilizes.

Acknowledged trade-offs update: the settlement-before-stream-consumed trade-off you listed as "documented and acceptable" is actually resolved in the follow-up PR #428 (stacked on this branch). That PR wraps the upstream ReadableStream so bytes flow unchanged to the client while the wrapper extracts `response.usage` from the terminal `response.completed` SSE event and reconciles to real cost on stream close. If you have bandwidth, reviewing the two in order (#427 → #428) makes the most sense: the stacking is clean and #428 rebased cleanly onto the updated #427.

Test totals (this PR): 36/36 passing (was 31 + 5 new).

@claude

claude bot commented Apr 7, 2026

Code Review

Overall this is a well-structured PR with thorough test coverage and clear intent. The architecture of the passthrough path is sound — detect early, bail before touching the credits ledger if unsupported, forward verbatim. A few findings below.


Security

Body size guard can be bypassed (medium)

The Content-Length check at the top of handlePOST is a hint, not enforcement. Clients using chunked transfer encoding (Transfer-Encoding: chunked) omit Content-Length entirely, so the guard is silently skipped. More critically, req.json() is called unconditionally with no actual read limit — a client can stream an unbounded body past the guard.

For a passthrough route that forwards bodies verbatim upstream this is the highest-exposure spot in the file. Recommend enforcing a hard read limit:

// Before req.json(), enforce the actual read limit
const bodyText = await req.text();
if (bodyText.length > MAX_RESPONSES_BODY_BYTES) {
  return Response.json({ error: { message: `...`, code: "request_too_large" } }, { status: 413 });
}
const rawBody = JSON.parse(bodyText) as Record<string, unknown>;

The existing Content-Length check is still worth keeping as an early fast path, but it should not be the only guard.
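Put together, the fast path and the real enforcement might look like this. The constant and helper names are assumptions for illustration, not the PR's actual identifiers:

```typescript
// Sketch of a two-layer guard: the Content-Length header is a cheap
// early rejection, but the post-read length check is the enforcement
// that chunked or lying clients cannot bypass.
const MAX_RESPONSES_BODY_BYTES = 4 * 1024 * 1024; // 4 MiB

async function readBoundedJson(req: Request): Promise<unknown | Response> {
  const tooLarge = () =>
    Response.json(
      { error: { message: "Request body too large", code: "request_too_large" } },
      { status: 413 },
    );
  // Fast path: reject on the declared size when present.
  const declared = Number(req.headers.get("content-length") ?? "0");
  if (declared > MAX_RESPONSES_BODY_BYTES) return tooLarge();
  // Enforcement: buffer the body (chunked bodies included) and re-check.
  const bodyText = await req.text();
  if (bodyText.length > MAX_RESPONSES_BODY_BYTES) return tooLarge();
  return JSON.parse(bodyText);
}
```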

Upstream error object forwarded verbatim (low)

In handleNativeResponsesPassthrough (the catch block around providerResponses):

const gwErr = err as { status: number; error: unknown };
return Response.json({ error: gwErr.error }, { status: gwErr.status });

gwErr.error is typed as unknown and forwarded to the client without inspection. If the gateway ever includes internal stack traces or infrastructure details in error objects, those leak directly. Consider wrapping in a sanitized structure or at minimum asserting it's a plain object before forwarding.
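One possible shape for such a sanitizer. The field whitelist and length cap here are illustrative, not the PR's implementation:

```typescript
// Whitelist-based error sanitizer: copy only the OpenAI-compatible
// fields, cap their length, and drop everything else (stack traces,
// infrastructure details, nested objects).
const MAX_FIELD_LEN = 512; // illustrative cap

function sanitizeGatewayError(err: unknown): {
  message: string; type?: string; code?: string; param?: string;
} {
  const src = (err && typeof err === "object" ? err : {}) as Record<string, unknown>;
  const pick = (k: string): string | undefined =>
    typeof src[k] === "string" ? (src[k] as string).slice(0, MAX_FIELD_LEN) : undefined;
  return {
    message: pick("message") ?? "Upstream gateway error",
    type: pick("type"),
    code: pick("code"),
    param: pick("param"),
  };
}
```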


Correctness

Background settlement race in serverless environments (medium)

void (async () => {
  try {
    await settle(reserved);
  } catch (err) { ... }
})();

In Next.js on Vercel (serverless), the function invocation can be frozen or terminated once the Response is returned. The background settle() may never complete, leaving the reservation without reconciliation. The PR comment acknowledges this as a known trade-off for streams, but for non-streaming (synchronous) responses it's an unnecessary risk — the body is already fully materialized at that point.

Recommendation: if upstreamResponse.headers.get("content-type") does not include text/event-stream, await settle(reserved) synchronously before returning.
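The dispatch could be as small as this. Names are assumed; `settle` stands in for the reservation-settlement call:

```typescript
// Sketch of the recommended dispatch: for SSE the settle stays
// fire-and-forget (first byte must not block); for everything else
// the body is already fully materialized upstream, so settling
// before returning costs nothing and survives a serverless freeze.
async function returnWithSettlement(
  upstream: Response,
  settle: () => Promise<void>,
): Promise<Response> {
  const contentType = upstream.headers.get("content-type") ?? "";
  if (contentType.includes("text/event-stream")) {
    void settle().catch(() => {}); // reconciled later by the stream path
  } else {
    await settle(); // synchronous settlement for non-streaming responses
  }
  return upstream;
}
```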

isNativeResponsesPayload doesn't handle non-string instructions (low)

if (typeof body.instructions === "string") return true;

If a malformed payload sends instructions: 42 or instructions: {}, it won't trigger passthrough. The request will then fall through to the Chat Completions transform path, which will choke on an unexpected instructions field. This is an edge case but worth a comment explaining the deliberate typeof === "string" choice, or widening to body.instructions != null.
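A sketch of the predicate with the widened check applied. The function exists in the PR; this body is a reconstruction from the description, not the actual source:

```typescript
// Reconstructed sketch of isNativeResponsesPayload: any present
// `instructions` value triggers passthrough (the widened check),
// as does any non-function tool type or a gpt-5-family model id.
function isNativeResponsesPayload(body: Record<string, unknown>): boolean {
  if (body.instructions != null) return true; // widened from typeof === "string"
  const tools = Array.isArray(body.tools) ? body.tools : [];
  if (tools.some((t) => (t as { type?: string } | null)?.type !== "function")) return true;
  return typeof body.model === "string" && /^gpt-5(\b|[-.])/.test(body.model);
}
```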

Double providerInstance.responses check (nit)

The !providerInstance.responses check at line ~126 is followed by a second guard at line ~292 with a comment "Unreachable at runtime — load-bearing only for TypeScript." You can eliminate the second check entirely by capturing the narrowed reference immediately after the first:

const providerResponses = providerInstance.responses;
if (!providerResponses) {
  return Response.json({ error: { code: "unsupported_provider" } }, { status: 400 });
}
// providerResponses is now definitively non-null — no second guard needed

Test Coverage

Coverage is solid for the happy path and most error branches. A few gaps:

  1. No test for chunked/missing Content-Length with a large body — the current 413 test only exercises the Content-Length hint path. If the fix above is applied, a test that sends no Content-Length but a large body would be a useful regression guard.

  2. No test for non-streaming passthrough — all passthrough tests use text/event-stream. A test verifying a synchronous (non-streaming) JSON response passes through correctly would improve confidence.

  3. No test for malformed JSON body: `req.json()` is called without a try/catch in the passthrough path's outer `handlePOST`. A request with a non-JSON body currently falls through to the general catch at the bottom of `handlePOST`, but that catch is designed for the Chat Completions path and may return an unexpected error shape.

  4. The `reconciles passthrough reservation to reserved amount on success` test uses `setTimeout(0)` to flush the background settle. This works today but is fragile under test parallelism. Consider exposing a `__testFlushSettlement` helper or making the background settlement synchronous in test mode.


Minor

  • const aiSdkRequest = rawBody as unknown as AISdkRequest (line ~520) is a double-cast that bypasses type safety for the existing path. The existing path hasn't changed, but this weakens the type contract — worth narrowing with a runtime check or at least a comment.
  • The MAX_RESPONSES_BODY_BYTES constant is defined inside route.ts right above handlePOST, far from the top of the file. Moving it to the module-level constants area (or alongside other config) would aid discoverability.

Summary

| Severity | Finding |
|----------|---------|
| Medium | Body size guard bypassable via chunked encoding / missing Content-Length |
| Medium | Background settlement may not complete in serverless for synchronous responses |
| Low | Upstream error object forwarded verbatim without sanitization |
| Low | `instructions` non-string values fall through to the wrong code path |
| Nit | Redundant second `providerInstance.responses` null guard |

The passthrough architecture itself is the right call — trying to collapse Responses-API-only features into Chat Completions would lose information. The test suite is well-structured and the credit/reservation logic is carefully handled (particularly the provider check before reservation). The items above are improvements rather than blockers, with the body size enforcement being the most actionable security fix.

Addresses the Claude bot's third review on #427. All 5 substantive
findings fixed plus the 4 test gaps the reviewer flagged.

1. Body size guard bypassable via chunked encoding (medium): the
   Content-Length header check is now an early fast path only. The
   real enforcement is a post-read length check on the buffered
   request text. Clients using `Transfer-Encoding: chunked` (which
   omits Content-Length) or lying about Content-Length will be
   caught by the post-read check before we touch req.json(). The
   header check stays for the cheap-fast-path benefit.

2. Background settle in serverless (medium): for non-streaming
   responses (content-type != "text/event-stream") we now `await`
   the settle synchronously before returning. The body is fully
   materialized at that point so deferring serves no purpose, and
   on Vercel the function can be frozen once the Response is
   returned, leaving a background promise to never complete and
   stranding the reservation. Streaming responses still
   fire-and-forget — that path is the one PR #428 addresses
   properly via stream-teed reconciliation.

3. Upstream error object forwarded verbatim (low): added
   `sanitizeGatewayError()` which extracts only the well-known
   OpenAI-compatible fields (message, type, code, optional param)
   from an unknown gateway error envelope. Stack traces,
   infrastructure host names, and arbitrary nested objects are
   stripped. Whitelist + length cap means even a hostile gateway
   payload can't leak internals through this path.

4. `isNativeResponsesPayload` non-string instructions (low): now
   triggers passthrough on ANY presence of `instructions`, not
   just string. A malformed `instructions: 42` payload routes
   through the passthrough so the upstream returns a coherent
   validation error instead of falling through to Chat Completions
   which would choke on the unexpected field.

5. Double `providerResponses` null guard (nit): captured the
   narrowed local at the early-bail check site so the second
   guard before the forward call is gone. Removed both the redundant
   if-block AND the unreachable 500 fallback that was load-bearing
   only for the type system.

New tests (6, bringing total to 42):
- 413 with no Content-Length (chunked encoding bypass test)
- 400 invalid_json on malformed JSON body
- Synchronous settle for non-streaming passthrough (no flush needed)
- Non-streaming JSON passthrough body forwarded unchanged
- Sanitized gateway error envelope (hostile fields stripped)
- Non-string `instructions: 42` routes to passthrough

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude

claude bot commented Apr 8, 2026

Code Review — Round 4

Great work getting through three rounds of fixes. The code is in good shape. Two remaining issues I can see in the current state:


Bugs

1. Gateway-internal headers leak through non-ok upstream responses (low)

The success path carefully builds outHeaders, strips STRIP_HEADERS, and sets x-eliza-responses-passthrough. But the non-ok early-return path forwards the upstream response unchanged:

if (!upstreamResponse.ok) {
  await settleReservation?.(0);
  return new Response(upstreamResponse.body, {
    status: upstreamResponse.status,
    statusText: upstreamResponse.statusText,
    headers: upstreamResponse.headers,   // ← raw, unfiltered
  });
}

A gateway 429 or 503 response could still carry cf-ray, x-vercel-*, server, set-cookie, etc. The same outHeaders construction (strip + add x-eliza-responses-passthrough) should apply here. Suggest extracting the header-building block into a small helper and calling it from both paths.
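The suggested helper might look like this. The exact `STRIP_HEADERS` contents are assumed; only `set-cookie` is confirmed by the PR discussion:

```typescript
// Sketch of a shared header builder for both the success and
// non-ok paths. STRIP_HEADERS membership here is illustrative.
const STRIP_HEADERS = new Set([
  "set-cookie", // confirmed stripped in the PR
  "cf-ray",
  "server",
  "content-length", // body may be re-framed by the runtime
]);

function buildOutHeaders(upstream: Headers): Headers {
  const out = new Headers();
  upstream.forEach((value, key) => {
    if (!STRIP_HEADERS.has(key.toLowerCase())) out.set(key, value);
  });
  out.set("x-eliza-responses-passthrough", "1");
  return out;
}
```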

2. JSON.parse(bodyText) can return a non-object; type assertion masks the crash (low)

JSON.parse("null") returns null — valid JSON, no exception thrown, so parseErr is never triggered. rawBody then holds null at runtime despite being typed as Record<string, unknown>. Two lines later:

if (isNativeResponsesPayload(rawBody)) {

isNativeResponsesPayload accesses body.instructions on line 1 — that throws TypeError: Cannot read properties of null. The outer handlePOST catch catches it, but the client gets a generic 500 instead of a descriptive 400. A simple guard after the JSON.parse call would fix both the crash and the error shape:

rawBody = JSON.parse(bodyText);
if (!rawBody || typeof rawBody !== "object" || Array.isArray(rawBody)) {
  return Response.json(
    { error: { message: "Request body must be a JSON object", type: "invalid_request_error", code: "invalid_json" } },
    { status: 400 },
  );
}

Minor

  • bodyText.length counts UTF-16 code units, not bytes. For the overwhelmingly ASCII content of JSON payloads this is functionally equivalent to MAX_RESPONSES_BODY_BYTES, but the comparison is semantically inconsistent with the Content-Length byte check above it. A comment to that effect (// length in JS chars ≈ bytes for ASCII JSON) would prevent future confusion.
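The distinction is easy to demonstrate; `TextEncoder` gives the actual UTF-8 byte count:

```typescript
// string.length counts UTF-16 code units; a byte cap needs a real
// encode. For ASCII-only JSON the two agree, which is why the
// comparison in the PR is functionally fine but worth a comment.
const byteLength = (s: string): number => new TextEncoder().encode(s).length;

const asciiJson = '{"ok":true}';
const emojiJson = '{"ok":"🚀"}';
console.log(asciiJson.length === byteLength(asciiJson)); // true
console.log(emojiJson.length, byteLength(emojiJson));    // code units < bytes
```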

Summary

| Severity | Finding | Status |
|----------|---------|--------|
| Low | Header stripping skipped on non-ok upstream responses | New |
| Low | `JSON.parse(null)` → type-masked crash → 500 instead of 400 | New |
| Nit | `bodyText.length` vs byte comparison inconsistency | New |

The three-round cleanup has been thorough — the credit-audit trail, error sanitization, body-size enforcement, serverless settlement fix, and header stripping on the success path are all solid. These two low items are the only things I can find still open.

HaruHunab1320 added a commit that referenced this pull request Apr 8, 2026

Builds on #427. The native passthrough previously settled the credit
reservation to the reserved estimate on every request regardless of
actual usage — so a Codex turn that emitted 200 output tokens was
charged at the same rate as one that emitted 4000, as long as both
fit under the reserved upper bound.

This PR wraps the upstream ReadableStream with a pass-through reader
that also extracts `response.usage` from the terminal `response.completed`
SSE event. When the stream ends, we compute the real cost via
`calculateCost(model, provider, inputTokens, outputTokens)` and
reconcile the reservation down to actual (capped at the reservation
as an upper bound — over-runs would need a separate post-hoc ledger
entry which is out of scope).

- Zero behavioral impact on the client: bytes flow in the exact order
  and size the upstream produced them. We do not batch, rewrite, or
  buffer beyond a single SSE frame for parser bookkeeping.
- SSE events are parsed out-of-band on the side. Parse errors are
  swallowed — a malformed frame must never break the forward path.
- Exactly one terminal callback per stream lifecycle: `end`, `cancel`,
  or `error`. Client cancel (Codex CLI Ctrl-C, tab close) fires the
  callback with whatever usage we had seen before the cancel, so a
  turn that completed-then-cancelled still reconciles to actual.
- Pull-based ReadableStream so the upstream is only drained when the
  client reads, matching the semantics of a direct proxy.
- Cost is clamped at the reservation: `actualCost = min(computed, reserved)`.

- If the client aborts mid-stream before `response.completed` arrives,
  we settle to the reserved estimate (same as pre-stream-wrap
  behavior). The 50% safety buffer in `estimateRequestCost` stays as
  the upper bound for that case.
- If `calculateCost` throws, we fall back to settling at the reserved
  amount rather than crashing the reconciliation.
- We cap at reserved rather than allowing over-collection. Anything
  beyond the reservation would need a separate post-hoc charge which
  this PR doesn't implement.
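The out-of-band usage extraction the commit describes could be sketched like this. The event shape follows the Responses-API SSE format quoted above; the function name is illustrative:

```typescript
// Scan a completed SSE frame for the terminal response.completed
// event and pull out its usage block. Parse errors are swallowed:
// a malformed frame must never break the forward path.
type Usage = { input_tokens?: number; output_tokens?: number };

function extractUsage(sseFrame: string): Usage | null {
  for (const line of sseFrame.split("\n")) {
    if (!line.startsWith("data:")) continue;
    try {
      const evt = JSON.parse(line.slice(5).trim());
      if (evt?.type === "response.completed") return evt.response?.usage ?? null;
    } catch {
      // Malformed JSON: ignore and keep scanning.
    }
  }
  return null;
}
```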

14 unit tests for `wrapWithUsageExtraction` covering:
- Passthrough fidelity (byte-exact, chunk-split frames, [DONE] sentinel,
  malformed JSON recovery)
- Usage extraction (headline + cached + reasoning tokens, missing
  fields default to 0, null when no completed event)
- Termination paths (end, cancel before completed, cancel after
  completed, error, throwing callback swallowed)

4 new integration tests in the route suite:
- Reconciles to actual cost when response.completed reports usage
- Caps actual cost at reserved when model over-runs the estimate
- Falls back to reserved when no response.completed arrives
- Existing reconcile/api_key_id tests updated to actually drain the
  stream (pull-based wrapper gates the callback on reader progress)

Total: 48 passing tests across the two affected suites (14 + 34).

This branch is stacked on `fix/responses-native-passthrough`. Merge
#427 first, then rebase this onto `dev`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HaruHunab1320
Contributor Author

Third review round addressed in c91a33d. All 5 substantive findings + the 4 test gaps fixed.

| Severity | Finding | Fix |
|----------|---------|-----|
| Medium | Body size guard bypassable via chunked encoding | Added a post-read text-length check. The Content-Length header check stays as an early fast path, but the real enforcement is now `await req.text()` followed by a length check before `JSON.parse`. Clients using `Transfer-Encoding: chunked` (which omits Content-Length) or lying about Content-Length are caught by the post-read check. |
| Medium | Background settle in serverless for non-streaming responses | Now dispatches on content-type. For `text/event-stream` we keep the fire-and-forget behavior (PR #428 wraps the stream so settlement is gated on consumption). For everything else (JSON responses, headers-only edge case) we `await` the settle synchronously before returning the Response, so a Vercel function freeze can't strand the reservation. |
| Low | Upstream error object forwarded verbatim | Added `sanitizeGatewayError()`, which extracts only the well-known OpenAI-compatible fields (`message`, `type`, `code`, optional `param`) from an unknown gateway error envelope, applies length caps, and discards everything else. Even a hostile gateway payload with stack, infrastructure, or arbitrary nested objects can't leak through. |
| Low | `isNativeResponsesPayload` non-string `instructions` | Widened from `typeof === "string"` to `!= null`. A malformed `instructions: 42` payload now routes through the passthrough so the upstream returns a coherent validation error instead of falling through to Chat Completions, which would choke on the unexpected field. |
| Nit | Double `providerResponses` null guard | Captured the narrowed local at the early-bail check site, so the second guard before the forward call is gone. Removed both the redundant if-block and the unreachable 500 fallback that was load-bearing only for the type system. |

New tests (6, bringing total to 42):

  1. 413 with no Content-Length (chunked encoding bypass test) — confirms the post-read length check is the real enforcement
  2. 400 invalid_json on malformed JSON body
  3. Synchronous settle for non-streaming passthrough — mockReconcileCredits is called by the time responsesPost returns, no flush helper needed
  4. Non-streaming JSON passthrough body forwarded unchanged
  5. Sanitized gateway error envelope — hostile fields (stack, infrastructure, nested) stripped, whitelisted fields preserved
  6. Non-string instructions: 42 routes to passthrough

setTimeout(0) flush note: for the streaming path the test file still uses setTimeout(0) to flush the background callback. That path is restructured by #428 to use the stream wrapper, where settlement is gated on the wrapped reader's terminal callback rather than a microtask hop. After #428 lands those tests can use await response.text() instead of the timeout flush, which is what the new non-streaming tests already do. Cleaning that up in #428's rebase makes more sense than churning #427.

#428 has been rebased onto this branch — clean conflict resolution combining the streaming wrapper (from #428) with the streaming/non-streaming dispatch (from this round). 59/59 passing on the rebased #428.

Test totals (this PR): 42/42 passing (was 36 + 6 new).

@lalalune lalalune merged commit c833ae9 into dev Apr 8, 2026
10 of 12 checks passed
HaruHunab1320 added a commit to milady-ai/milady that referenced this pull request Apr 8, 2026
- CodingAgentSettingsSection: move updateConfig out of the setPrefs
  state updater into a debounced useEffect (400ms). React state
  updaters must be pure; the old path fired twice per keystroke in
  Strict Mode and persisted 40+ partial-key snapshots while a user
  typed an API key. The debounce coalesces rapid edits into a single
  POST so a mid-flight failure can no longer leave milady.json with
  a partial credential.
- coding-agent-bridge: drop the unused duplicate of the
  POST /api/coding-agents/auth/:agent handler. The canonical copy
  lives in server.ts's private handleCodingAgentsFallback — the
  bridge export was never wired in, so keeping both would silently
  drift.
- agent-orchestrator-compat: stop treating Codex as auth-ready under
  cloudReady. elizaOS/cloud#427 (responses-stream reconciliation)
  has not deployed, so Codex-through-cloud would fail at runtime
  with no explanation. Restore once cloud#427/#428 ship.
- credentials: call readClaudeCodeOAuthBlob() once per provider row
  and destructure, instead of hitting it twice (the second call
  shelled out to `security` on macOS on every status poll).

deploy/cloud-agent-template/package.json pin is left in place: the
release contract check requires exact versions there (the template
is published as a standalone deploy artifact, not a workspace
member, so `workspace:*` won't resolve against any registry).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>