feat(responses): stream-teed credit reconciliation for native passthrough #428
Addresses the Claude bot's third review on #427. All 5 substantive findings fixed, plus the 4 test gaps the reviewer flagged.

1. Body size guard bypassable via chunked encoding (medium): the Content-Length header check is now an early fast path only. The real enforcement is a post-read length check on the buffered request text. Clients using `Transfer-Encoding: chunked` (which omits Content-Length) or lying about Content-Length are caught by the post-read check before we touch `req.json()`. The header check stays for the cheap fast-path benefit.
2. Background settle in serverless (medium): for non-streaming responses (content-type != "text/event-stream") we now `await` the settle synchronously before returning. The body is fully materialized at that point so deferring serves no purpose, and on Vercel the function can be frozen once the Response is returned, leaving a background promise that never completes and stranding the reservation. Streaming responses still fire-and-forget — that path is the one PR #428 addresses properly via stream-teed reconciliation.
3. Upstream error object forwarded verbatim (low): added `sanitizeGatewayError()`, which extracts only the well-known OpenAI-compatible fields (message, type, code, optional param) from an unknown gateway error envelope. Stack traces, infrastructure host names, and arbitrary nested objects are stripped. The whitelist plus a length cap means even a hostile gateway payload can't leak internals through this path.
4. `isNativeResponsesPayload` non-string instructions (low): now triggers passthrough on ANY presence of `instructions`, not just a string. A malformed `instructions: 42` payload routes through the passthrough so the upstream returns a coherent validation error instead of falling through to Chat Completions, which would choke on the unexpected field.
5. Double `providerResponses` null guard (nit): captured the narrowed local at the early-bail check site so the second guard before the forward call is gone. Removed both the redundant if-block and the unreachable 500 fallback that was load-bearing only for the type system.

New tests (6, bringing the total to 42):
- 413 with no Content-Length (chunked-encoding bypass test)
- 400 `invalid_json` on malformed JSON body
- Synchronous settle for non-streaming passthrough (no flush needed)
- Non-streaming JSON passthrough body forwarded unchanged
- Sanitized gateway error envelope (hostile fields stripped)
- Non-string `instructions: 42` routes to passthrough

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
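The whitelist-plus-length-cap approach from fix 3 can be sketched like this. This is an illustrative sketch, not the PR's actual code: the field names follow the OpenAI-compatible error envelope the description mentions, but the constant, type, and fallback strings are assumptions.

```typescript
// Sketch of the whitelist approach described in fix 3 (names are illustrative).
// Only well-known OpenAI-compatible error fields survive; everything else —
// stack traces, host names, nested objects — is dropped.
const MAX_FIELD_LEN = 256; // length cap so a hostile payload can't smuggle data

type SanitizedError = {
  message: string;
  type: string;
  code: string | null;
  param?: string;
};

function sanitizeGatewayError(raw: unknown): SanitizedError {
  const src =
    typeof raw === "object" && raw !== null
      ? (raw as Record<string, unknown>)
      : {};
  // Gateways often nest the envelope under an `error` key.
  const err =
    typeof src.error === "object" && src.error !== null
      ? (src.error as Record<string, unknown>)
      : src;

  // Whitelist: copy a field only if it is a string, and cap its length.
  const pick = (k: string): string | null =>
    typeof err[k] === "string" ? (err[k] as string).slice(0, MAX_FIELD_LEN) : null;

  const out: SanitizedError = {
    message: pick("message") ?? "Upstream provider error",
    type: pick("type") ?? "upstream_error",
    code: pick("code"),
  };
  const param = pick("param");
  if (param !== null) out.param = param;
  return out;
}
```

Because fields are copied onto a fresh object rather than deleted from the original, anything not explicitly picked can never reach the client.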
feat(responses): stream-teed credit reconciliation for native passthrough

Builds on #427. The native passthrough previously settled the credit reservation to the reserved estimate on every request regardless of actual usage — so a Codex turn that emitted 200 output tokens was charged at the same rate as one that emitted 4000, as long as both fit under the reserved upper bound.

This PR wraps the upstream ReadableStream with a pass-through reader that also extracts `response.usage` from the terminal `response.completed` SSE event. When the stream ends, we compute the real cost via `calculateCost(model, provider, inputTokens, outputTokens)` and reconcile the reservation down to actual (capped at the reservation as an upper bound — over-runs would need a separate post-hoc ledger entry, which is out of scope).

- Zero behavioral impact on the client: bytes flow in the exact order and size the upstream produced them. We do not batch, rewrite, or buffer beyond a single SSE frame for parser bookkeeping.
- SSE events are parsed out-of-band on the side. Parse errors are swallowed — a malformed frame must never break the forward path.
- Exactly one terminal callback per stream lifecycle: `end`, `cancel`, or `error`. Client cancel (Codex CLI Ctrl-C, tab close) fires the callback with whatever usage we had seen before the cancel, so a turn that completed-then-cancelled still reconciles to actual.
- Pull-based ReadableStream, so the upstream is only drained when the client reads, matching the semantics of a direct proxy.
- Cost is clamped at the reservation: `actualCost = min(computed, reserved)`.
- If the client aborts mid-stream before `response.completed` arrives, we settle to the reserved estimate (same as pre-stream-wrap behavior). The 50% safety buffer in `estimateRequestCost` stays as the upper bound for that case.
- If `calculateCost` throws, we fall back to settling at the reserved amount rather than crashing the reconciliation.
- We cap at reserved rather than allowing over-collection. Anything beyond the reservation would need a separate post-hoc charge, which this PR doesn't implement.

14 unit tests for `wrapWithUsageExtraction` covering:
- Passthrough fidelity (byte-exact, chunk-split frames, `[DONE]` sentinel, malformed JSON recovery)
- Usage extraction (headline + cached + reasoning tokens, missing fields default to 0, null when no completed event)
- Termination paths (end, cancel before completed, cancel after completed, error, throwing callback swallowed)

4 new integration tests in the route suite:
- Reconciles to actual cost when `response.completed` reports usage
- Caps actual cost at reserved when the model over-runs the estimate
- Falls back to reserved when no `response.completed` arrives
- Existing reconcile/api_key_id tests updated to actually drain the stream (pull-based wrapper gates the callback on reader progress)

Total: 48 passing tests across the two affected suites (14 + 34).

This branch is stacked on `fix/responses-native-passthrough`. Merge #427 first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
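The stream-tee idea above can be sketched as follows. This is a minimal illustration of the technique, not the PR's implementation: the real helper is `wrapWithUsageExtraction` in `packages/lib/utils/responses-stream-reconcile.ts`, and the `Usage`/`Outcome` types and callback shape here are assumptions. It handles only `\n\n`-delimited frames for brevity.

```typescript
// Pull-based pass-through wrapper: bytes are forwarded unchanged while a
// side-channel parser scans SSE frames for the terminal response.completed
// event. Exactly one terminal callback fires per stream lifecycle.
type Usage = { inputTokens: number; outputTokens: number } | null;
type Outcome = "end" | "cancel" | "error";

function wrapWithUsageExtraction(
  upstream: ReadableStream<Uint8Array>,
  onComplete: (outcome: Outcome, usage: Usage) => void,
): ReadableStream<Uint8Array> {
  const reader = upstream.getReader();
  const decoder = new TextDecoder();
  let buffer = ""; // partial SSE frame carried across chunk boundaries
  let usage: Usage = null;
  let settled = false;

  const settle = (outcome: Outcome) => {
    if (settled) return; // exactly one terminal callback
    settled = true;
    try {
      onComplete(outcome, usage);
    } catch {
      // a buggy callback must never break the forward path
    }
  };

  const scan = (chunk: Uint8Array) => {
    try {
      buffer += decoder.decode(chunk, { stream: true });
      let idx: number;
      while ((idx = buffer.indexOf("\n\n")) !== -1) { // frames end on a blank line
        const frame = buffer.slice(0, idx);
        buffer = buffer.slice(idx + 2);
        for (const line of frame.split("\n")) {
          if (!line.startsWith("data:")) continue;
          const payload = line.slice(5).trim();
          if (payload === "[DONE]") continue; // sentinel, not JSON
          const evt = JSON.parse(payload);
          if (evt?.type === "response.completed" && evt.response?.usage) {
            usage = {
              inputTokens: evt.response.usage.input_tokens ?? 0,
              outputTokens: evt.response.usage.output_tokens ?? 0,
            };
          }
        }
      }
    } catch {
      // parse errors are swallowed — never break the forward path
    }
  };

  return new ReadableStream<Uint8Array>({
    // pull-based: upstream is only drained when the client reads
    async pull(controller) {
      try {
        const res = await reader.read();
        if (res.done) {
          settle("end");
          controller.close();
          return;
        }
        scan(res.value);
        controller.enqueue(res.value); // bytes forwarded unchanged
      } catch (err) {
        settle("error");
        controller.error(err);
      }
    },
    cancel(reason) {
      settle("cancel"); // fires with whatever usage was seen so far
      return reader.cancel(reason);
    },
  });
}
```

The single `settled` flag is what guarantees the end/cancel/error callback fires exactly once even when, say, a cancel races the final read.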
Closing — the code already landed; GitHub kept this PR in OPEN state because of how the merge was done. Shaw also added a follow-up hardening commit. Closing to keep the open-PR list clean. No further action needed on this branch.
Summary
Follow-up to #427, addressing the "settlement fires before stream consumed" finding from that PR's review. Stacked on top of `fix/responses-native-passthrough` — merge #427 first, then this rebases onto `dev` cleanly.

#427 settled the credit reservation to the reserved estimate on every request regardless of actual usage. A Codex turn that emitted 200 output tokens was charged at the same rate as one that emitted 4000, as long as both fit under the reserved upper bound. This PR fixes that by extracting real usage from the SSE stream and reconciling on stream close.
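The reconcile-on-close decision can be sketched like this. Names here are illustrative — `settleAmount` is a hypothetical helper, and the real code threads `(model, provider)` through `calculateCost` rather than taking a closed-over cost function:

```typescript
// Sketch of the settle decision, under assumed names and types.
type Usage = { inputTokens: number; outputTokens: number } | null;

function settleAmount(
  usage: Usage,
  reserved: number,
  calculateCost: (inputTokens: number, outputTokens: number) => number,
): number {
  if (usage === null) return reserved; // no response.completed seen → reserved estimate
  try {
    const computed = calculateCost(usage.inputTokens, usage.outputTokens);
    return Math.min(computed, reserved); // clamp: never over-collect past the reservation
  } catch {
    return reserved; // cost calculation failure → fall back rather than crash
  }
}
```

Every branch resolves to a number, so the reservation is always settled and never stranded.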
What changed
New: `packages/lib/utils/responses-stream-reconcile.ts`

Pure helper that wraps a `ReadableStream<Uint8Array>` so bytes flow to the client unchanged while the wrapper scans for the terminal `response.completed` SSE event and extracts `response.usage`. Exactly one terminal callback per stream lifecycle (end | cancel | error), regardless of how the stream ends.

Design constraints (all enforced by tests):
- Exactly one terminal callback: `end`, `cancel`, or `error`. A buggy callback that throws does not break the stream.
- Parse errors swallowed; `[DONE]` sentinel ignored.

`app/api/v1/responses/route.ts` — wired into the passthrough

Replaces the previous fire-and-forget background `settle(reservedAmount)` with `runReconciliation()`:
- No usage (no `response.completed` seen) → settle to reserved (same as pre-stream-wrap behavior)
- Usage extracted → compute `calculateCost(model, provider, inputTokens, outputTokens)`, then settle to `min(computed, reserved)`
- `calculateCost` throws → fall back to settling at reserved

Trade-offs (documented in inline comments)
- If the client aborts before `response.completed` arrives, settle to reserved. That's the same behavior as before this PR for the cancel case — the 50% safety buffer in `estimateRequestCost` is still our protection there.
- `upstream.body === null`: we fall back to an immediate synchronous settle-to-reserved so we don't strand the reservation.

Tests
14 unit tests — `packages/tests/unit/utils/responses-stream-reconcile.test.ts`

Passthrough fidelity:
- Byte-exact forwarding, including frames split across chunks
- Handles the `[DONE]` sentinel without crashing
- A malformed `data:` line does not break the forward path

Usage extraction:
- Extracts `input_tokens`/`output_tokens` from `response.completed`
- Captures `cached_tokens` and `reasoning_tokens` when present
- Usage is null when no `response.completed` event arrives

Termination:
- Fires `onComplete` exactly once on normal close
- Fires with `'cancel'` when the reader is cancelled mid-stream, before or after `response.completed`
- Fires with `'error'` when the upstream throws
- A throwing `onComplete` callback does not break the stream

4 new integration tests — `packages/tests/unit/api/v1-responses-route.test.ts`
- Reconciles to actual cost when `response.completed` reports usage
- Caps actual cost at reserved when the model over-runs the estimate
- Falls back to reserved when no `response.completed` arrives
- Existing `mockReconcileCredits` assertion tests updated to drain the wrapped body (pull-based wrapper gates the callback on reader progress)

Test totals: 48 passing across the two affected suites (14 unit + 34 route, was 31).
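The drain requirement can be illustrated with a small helper — hypothetical, not from the PR — showing why the route tests must consume the body before asserting on reconciliation: the pull-based wrapper only makes progress (and eventually fires its terminal callback) as the reader pulls.

```typescript
// Hypothetical test helper: read a wrapped body to completion so the
// pull-based wrapper reaches its terminal callback. Returns total bytes read.
async function drainStream(stream: ReadableStream<Uint8Array>): Promise<number> {
  const reader = stream.getReader();
  let bytes = 0;
  for (;;) {
    const res = await reader.read();
    if (res.done) return bytes;
    bytes += res.value.byteLength;
  }
}
```

A test that asserts on the settle mock without first awaiting a drain like this would race the callback and flake.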
Test plan

- `bun test packages/tests/unit/utils/responses-stream-reconcile.test.ts` — 14/14
- `bun test packages/tests/unit/api/v1-responses-route.test.ts` — 34/34
- `bunx tsc --noEmit` clean on all changed files (only pre-existing bs58 errors)
- `bun run lint` clean (biome auto-fix applied)
- Manual run with `PARALLAX_LLM_PROVIDER=cloudflow`: confirm small turns get reconciled down from the estimate and large turns stay capped at reserved

Follow-ups (not in this PR)
- We extract `cached_tokens` and `reasoning_tokens` from the usage object but don't yet use them in `calculateCost` (which is input/output only). When cost calculation grows to support cached/reasoning tiers, the plumbing is already in place.

🤖 Generated with Claude Code