feat(responses): stream-teed credit reconciliation for native passthrough #428
Addresses the Claude bot's third review on #427. All 5 substantive findings fixed, plus the 4 test gaps the reviewer flagged.

1. Body size guard bypassable via chunked encoding (medium): the Content-Length header check is now an early fast path only. The real enforcement is a post-read length check on the buffered request text. Clients using `Transfer-Encoding: chunked` (which omits Content-Length) or lying about Content-Length are caught by the post-read check before we touch `req.json()`. The header check stays for the cheap fast-path benefit.
2. Background settle in serverless (medium): for non-streaming responses (content-type != "text/event-stream") we now `await` the settle synchronously before returning. The body is fully materialized at that point so deferring serves no purpose, and on Vercel the function can be frozen once the Response is returned, leaving a background promise that never completes and stranding the reservation. Streaming responses still fire-and-forget — that path is the one PR #428 addresses properly via stream-teed reconciliation.
3. Upstream error object forwarded verbatim (low): added `sanitizeGatewayError()`, which extracts only the well-known OpenAI-compatible fields (message, type, code, optional param) from an unknown gateway error envelope. Stack traces, infrastructure host names, and arbitrary nested objects are stripped. The whitelist plus a length cap means even a hostile gateway payload can't leak internals through this path.
4. `isNativeResponsesPayload` non-string instructions (low): now triggers passthrough on ANY presence of `instructions`, not just a string. A malformed `instructions: 42` payload routes through the passthrough so the upstream returns a coherent validation error instead of falling through to Chat Completions, which would choke on the unexpected field.
5. Double `providerResponses` null guard (nit): captured the narrowed local at the early-bail check site so the second guard before the forward call is gone. Removed both the redundant if-block and the unreachable 500 fallback that was load-bearing only for the type system.

New tests (6, bringing the total to 42):
- 413 with no Content-Length (chunked-encoding bypass test)
- 400 `invalid_json` on malformed JSON body
- Synchronous settle for non-streaming passthrough (no flush needed)
- Non-streaming JSON passthrough body forwarded unchanged
- Sanitized gateway error envelope (hostile fields stripped)
- Non-string `instructions: 42` routes to passthrough

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
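The whitelist-plus-length-cap approach from fix 3 can be sketched like this. This is an illustrative sketch, not the PR's actual code: the field names follow the OpenAI-compatible error envelope the description mentions, but the constant, type, and fallback strings are assumptions.

```typescript
// Sketch of the whitelist approach described in fix 3 (names are illustrative).
// Only well-known OpenAI-compatible error fields survive; everything else —
// stack traces, host names, nested objects — is dropped.
const MAX_FIELD_LEN = 256; // length cap so a hostile payload can't smuggle data

type SanitizedError = {
  message: string;
  type: string;
  code: string | null;
  param?: string;
};

function sanitizeGatewayError(raw: unknown): SanitizedError {
  const src =
    typeof raw === "object" && raw !== null
      ? (raw as Record<string, unknown>)
      : {};
  // Gateways often nest the envelope under an `error` key.
  const err =
    typeof src.error === "object" && src.error !== null
      ? (src.error as Record<string, unknown>)
      : src;

  // Whitelist: copy a field only if it is a string, and cap its length.
  const pick = (k: string): string | null =>
    typeof err[k] === "string" ? (err[k] as string).slice(0, MAX_FIELD_LEN) : null;

  const out: SanitizedError = {
    message: pick("message") ?? "Upstream provider error",
    type: pick("type") ?? "upstream_error",
    code: pick("code"),
  };
  const param = pick("param");
  if (param !== null) out.param = param;
  return out;
}
```

Because fields are copied onto a fresh object rather than deleted from the original, anything not explicitly picked can never reach the client.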
feat(responses): stream-teed credit reconciliation for native passthrough

Builds on #427. The native passthrough previously settled the credit reservation to the reserved estimate on every request regardless of actual usage — so a Codex turn that emitted 200 output tokens was charged at the same rate as one that emitted 4000, as long as both fit under the reserved upper bound.

This PR wraps the upstream ReadableStream with a pass-through reader that also extracts `response.usage` from the terminal `response.completed` SSE event. When the stream ends, we compute the real cost via `calculateCost(model, provider, inputTokens, outputTokens)` and reconcile the reservation down to actual (capped at the reservation as an upper bound — over-runs would need a separate post-hoc ledger entry, which is out of scope).

- Zero behavioral impact on the client: bytes flow in the exact order and size the upstream produced them. We do not batch, rewrite, or buffer beyond a single SSE frame for parser bookkeeping.
- SSE events are parsed out-of-band on the side. Parse errors are swallowed — a malformed frame must never break the forward path.
- Exactly one terminal callback per stream lifecycle: `end`, `cancel`, or `error`. Client cancel (Codex CLI Ctrl-C, tab close) fires the callback with whatever usage we had seen before the cancel, so a turn that completed-then-cancelled still reconciles to actual.
- Pull-based ReadableStream, so the upstream is only drained when the client reads, matching the semantics of a direct proxy.
- Cost is clamped at the reservation: `actualCost = min(computed, reserved)`.
- If the client aborts mid-stream before `response.completed` arrives, we settle to the reserved estimate (same as pre-stream-wrap behavior). The 50% safety buffer in `estimateRequestCost` stays as the upper bound for that case.
- If `calculateCost` throws, we fall back to settling at the reserved amount rather than crashing the reconciliation.
- We cap at reserved rather than allowing over-collection. Anything beyond the reservation would need a separate post-hoc charge, which this PR doesn't implement.

14 unit tests for `wrapWithUsageExtraction` covering:
- Passthrough fidelity (byte-exact, chunk-split frames, `[DONE]` sentinel, malformed JSON recovery)
- Usage extraction (headline + cached + reasoning tokens, missing fields default to 0, null when no completed event)
- Termination paths (end, cancel before completed, cancel after completed, error, throwing callback swallowed)

4 new integration tests in the route suite:
- Reconciles to actual cost when `response.completed` reports usage
- Caps actual cost at reserved when the model over-runs the estimate
- Falls back to reserved when no `response.completed` arrives
- Existing reconcile/api_key_id tests updated to actually drain the stream (pull-based wrapper gates the callback on reader progress)

Total: 48 passing tests across the two affected suites (14 + 34).

This branch is stacked on `fix/responses-native-passthrough`. Merge #427 first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
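The stream-tee idea above can be sketched as follows. This is a minimal illustration of the technique, not the PR's implementation: the real helper is `wrapWithUsageExtraction` in `packages/lib/utils/responses-stream-reconcile.ts`, and the `Usage`/`Outcome` types and callback shape here are assumptions. It handles only `\n\n`-delimited frames for brevity.

```typescript
// Pull-based pass-through wrapper: bytes are forwarded unchanged while a
// side-channel parser scans SSE frames for the terminal response.completed
// event. Exactly one terminal callback fires per stream lifecycle.
type Usage = { inputTokens: number; outputTokens: number } | null;
type Outcome = "end" | "cancel" | "error";

function wrapWithUsageExtraction(
  upstream: ReadableStream<Uint8Array>,
  onComplete: (outcome: Outcome, usage: Usage) => void,
): ReadableStream<Uint8Array> {
  const reader = upstream.getReader();
  const decoder = new TextDecoder();
  let buffer = ""; // partial SSE frame carried across chunk boundaries
  let usage: Usage = null;
  let settled = false;

  const settle = (outcome: Outcome) => {
    if (settled) return; // exactly one terminal callback
    settled = true;
    try {
      onComplete(outcome, usage);
    } catch {
      // a buggy callback must never break the forward path
    }
  };

  const scan = (chunk: Uint8Array) => {
    try {
      buffer += decoder.decode(chunk, { stream: true });
      let idx: number;
      while ((idx = buffer.indexOf("\n\n")) !== -1) { // frames end on a blank line
        const frame = buffer.slice(0, idx);
        buffer = buffer.slice(idx + 2);
        for (const line of frame.split("\n")) {
          if (!line.startsWith("data:")) continue;
          const payload = line.slice(5).trim();
          if (payload === "[DONE]") continue; // sentinel, not JSON
          const evt = JSON.parse(payload);
          if (evt?.type === "response.completed" && evt.response?.usage) {
            usage = {
              inputTokens: evt.response.usage.input_tokens ?? 0,
              outputTokens: evt.response.usage.output_tokens ?? 0,
            };
          }
        }
      }
    } catch {
      // parse errors are swallowed — never break the forward path
    }
  };

  return new ReadableStream<Uint8Array>({
    // pull-based: upstream is only drained when the client reads
    async pull(controller) {
      try {
        const res = await reader.read();
        if (res.done) {
          settle("end");
          controller.close();
          return;
        }
        scan(res.value);
        controller.enqueue(res.value); // bytes forwarded unchanged
      } catch (err) {
        settle("error");
        controller.error(err);
      }
    },
    cancel(reason) {
      settle("cancel"); // fires with whatever usage was seen so far
      return reader.cancel(reason);
    },
  });
}
```

The single `settled` flag is what guarantees the end/cancel/error callback fires exactly once even when, say, a cancel races the final read.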
Closing — the code already landed; GitHub kept this PR in OPEN state because of how the merge was done. Shaw also added a follow-up hardening commit. Closing to keep the open-PR list clean. No further action needed on this branch.
Summary
Follow-up to #427, addressing the "settlement fires before stream consumed" finding from that PR's review. Stacked on top of `fix/responses-native-passthrough` — merge #427 first, then this rebases onto `dev` cleanly.

#427 settled the credit reservation to the reserved estimate on every request regardless of actual usage. A Codex turn that emitted 200 output tokens was charged at the same rate as one that emitted 4000, as long as both fit under the reserved upper bound. This PR fixes that by extracting real usage from the SSE stream and reconciling on stream close.
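The reconcile-on-close decision can be sketched like this. Names here are illustrative — `settleAmount` is a hypothetical helper, and the real code threads `(model, provider)` through `calculateCost` rather than taking a closed-over cost function:

```typescript
// Sketch of the settle decision, under assumed names and types.
type Usage = { inputTokens: number; outputTokens: number } | null;

function settleAmount(
  usage: Usage,
  reserved: number,
  calculateCost: (inputTokens: number, outputTokens: number) => number,
): number {
  if (usage === null) return reserved; // no response.completed seen → reserved estimate
  try {
    const computed = calculateCost(usage.inputTokens, usage.outputTokens);
    return Math.min(computed, reserved); // clamp: never over-collect past the reservation
  } catch {
    return reserved; // cost calculation failure → fall back rather than crash
  }
}
```

Every branch resolves to a number, so the reservation is always settled and never stranded.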
What changed
New: `packages/lib/utils/responses-stream-reconcile.ts`

Pure helper that wraps a `ReadableStream<Uint8Array>` so bytes flow to the client unchanged while the wrapper scans for the terminal `response.completed` SSE event and extracts `response.usage`. Exactly one terminal callback per stream lifecycle (end | cancel | error), regardless of how the stream ends.

Design constraints (all enforced by tests):
- Exactly one terminal callback: `end`, `cancel`, or `error`. A buggy callback that throws does not break the stream.
- Parse errors swallowed; `[DONE]` sentinel ignored.

`app/api/v1/responses/route.ts` — wired into the passthrough

Replaces the previous fire-and-forget background `settle(reservedAmount)` with `runReconciliation()`:
- No usage (no `response.completed` seen) → settle to reserved (same as pre-stream-wrap behavior)
- Usage extracted → compute `calculateCost(model, provider, inputTokens, outputTokens)`, then settle to `min(computed, reserved)`
- `calculateCost` throws → fall back to settling at reserved

Trade-offs (documented in inline comments)
- If the client aborts before `response.completed` arrives, settle to reserved. That's the same behavior as before this PR for the cancel case — the 50% safety buffer in `estimateRequestCost` is still our protection there.
- `upstream.body === null`: we fall back to an immediate synchronous settle-to-reserved so we don't strand the reservation.

Tests
14 unit tests — `packages/tests/unit/utils/responses-stream-reconcile.test.ts`

Passthrough fidelity:
- Byte-exact forwarding, including frames split across chunks
- Handles the `[DONE]` sentinel without crashing
- A malformed `data:` line does not break the forward path

Usage extraction:
- Extracts `input_tokens`/`output_tokens` from `response.completed`
- Captures `cached_tokens` and `reasoning_tokens` when present
- Usage is null when no `response.completed` event arrives

Termination:
- Fires `onComplete` exactly once on normal close
- Fires with `'cancel'` when the reader is cancelled mid-stream, before or after `response.completed`
- Fires with `'error'` when the upstream throws
- A throwing `onComplete` callback does not break the stream

4 new integration tests — `packages/tests/unit/api/v1-responses-route.test.ts`
- Reconciles to actual cost when `response.completed` reports usage
- Caps actual cost at reserved when the model over-runs the estimate
- Falls back to reserved when no `response.completed` arrives
- Existing `mockReconcileCredits` assertion tests updated to drain the wrapped body (pull-based wrapper gates the callback on reader progress)

Test totals: 48 passing across the two affected suites (14 unit + 34 route, was 31).
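The drain requirement can be illustrated with a small helper — hypothetical, not from the PR — showing why the route tests must consume the body before asserting on reconciliation: the pull-based wrapper only makes progress (and eventually fires its terminal callback) as the reader pulls.

```typescript
// Hypothetical test helper: read a wrapped body to completion so the
// pull-based wrapper reaches its terminal callback. Returns total bytes read.
async function drainStream(stream: ReadableStream<Uint8Array>): Promise<number> {
  const reader = stream.getReader();
  let bytes = 0;
  for (;;) {
    const res = await reader.read();
    if (res.done) return bytes;
    bytes += res.value.byteLength;
  }
}
```

A test that asserts on the settle mock without first awaiting a drain like this would race the callback and flake.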
Test plan

- `bun test packages/tests/unit/utils/responses-stream-reconcile.test.ts` — 14/14
- `bun test packages/tests/unit/api/v1-responses-route.test.ts` — 34/34
- `bunx tsc --noEmit` clean on all changed files (only pre-existing bs58 errors)
- `bun run lint` clean (biome auto-fix applied)
- Manual run with `PARALLAX_LLM_PROVIDER=cloudflow`: confirm small turns get reconciled down from the estimate and large turns stay capped at reserved

Follow-ups (not in this PR)
- We extract `cached_tokens` and `reasoning_tokens` from the usage object but don't yet use them in `calculateCost` (which is input/output only). When cost calculation grows to support cached/reasoning tiers, the plumbing is already in place.

🤖 Generated with Claude Code