feat(api): SP-2 PR-A · AIN-271 streaming + tool-use lift on /v1/messages#72
Conversation
AIN-271 [Phase 5 · §0 · AIN-309 (planning)] Gate: P1-WS2 prod deploy of /v1/messages streaming + tool-use
Hard gate for Phase 5. WS1+ cannot start until this is CERT GREEN.
|
| Probe | Result |
|---|---|
POST /v1/messages base (non-stream) |
LIVE — 401 unauth (was 404). Shim deployed via api#68. |
POST /v1/messages stream=true |
501/404 — streaming NOT implemented (the remaining keystone) |
POST /v1/messages tools=[...] |
422/404 — tool-use NOT implemented |
Base shim is deployed. Streaming + tool-use is the highest-value remaining build — it also closes the WS5 perceived-latency issue (33s is 98.5% model inference, NOT a bug; streaming gives <1s first-token). Unblocks Aulë SDK + the Phase-5 MAF target.
What needs to happen
- Implement streaming + tool-use on
/v1/messages(AIN-174 Phase B) — SSE piped not buffered; tool round-trip on both dialects. - Ship with canonical string
ainfera-inference(mithril/auto accepted as inbound aliases ONLY during grace, then hard-removed — founder wants hard-delete, so confirm grace length). - Founder rules
"router"wire-format (hard-cut to ainfera-inference recommended given hard-delete stance). - Founder pushes + Railway deploy + re-cert.
Owner
Founder-only: branch + commit + deploy. Aulë drafts PR + cert checklist + the streaming/tool-use implementation on request.
Linked
Repo-root MASTER_LOG.md WS0 · ainfera-os/MASTER_LOG_P2.md §0 · the descriptive-rename sweep (needs its own ticket once Linear 250-cap freed).
| finish_reason=response.finish_reason, | ||
| receipt_id=receipt.id, | ||
| provider=provider_slug, # AIN-126 | ||
| content_blocks=list(response.content_blocks), |
There was a problem hiding this comment.
Tools omitted from inference dispatch
High Severity
The /v1/messages endpoint accepts tools and tool_choice but doesn't forward them to the underlying inference handler. This occurs in both non-streaming and streaming paths, causing tool definitions to be ignored. Consequently, tool_use functionality and ToolsNotSupportedError handling don't work as intended, making tool-enabled calls behave as text-only.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 5a57625. Configure here.
| # SDK clients read `content[]` and dispatch on `type` per block. | ||
| blocks: list[dict[str, Any]] = list(inf_resp.content_blocks or []) | ||
| if not blocks: | ||
| blocks = [{"type": "text", "text": inf_resp.content}] |
There was a problem hiding this comment.
Non-stream drops adapter content blocks
High Severity
The non-streaming /v1/messages endpoint discards tool_use blocks. post_inference doesn't correctly transfer structured tool_use content from DispatchResult.content_blocks to InferenceResponse.content_blocks, causing responses to fall back to a single text block.
Reviewed by Cursor Bugbot for commit 5a57625. Configure here.
| idempotency_key=idempotency_key, | ||
| caller_task_type=task_type, | ||
| request_id=request_id, | ||
| ) |
There was a problem hiding this comment.
Streaming skips vendor model passthrough
High Severity
When stream:true is used with /v1/messages, requests are always routed through the brain, even when a specific vendor model is provided. This diverges from non-streamed requests, which honor vendor pins directly. As a result, streamed calls can lead to unexpected provider selection, billing, and audit behavior.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 5a57625. Configure here.
| }, | ||
| ) | ||
| yield _sse(_EVENT_MESSAGE_STOP, {"type": "message_stop"}) | ||
| return |
There was a problem hiding this comment.
Stream errors after HTTP 200
Medium Severity
stream_messages only catches NoCandidateError and AllCandidatesFailedError after StreamingResponse already returned 200. CapViolationError, InsufficientFundsError, AgentNotActiveError, and ProviderError from dispatch_with_brain propagate out of the generator, unlike non-stream /v1/messages which maps them to 402, 409, 502, or 422 JSON errors.
Reviewed by Cursor Bugbot for commit 5a57625. Configure here.
| # inference then replay). Native end-to-end is the planned | ||
| # follow-up; the adapter-level primitives shipped here. | ||
| "x-ainfera-stream-mode": "wrapped", | ||
| }, |
There was a problem hiding this comment.
Stream missing audit headers
Medium Severity
Streamed /v1/messages responses set x-ainfera-agent-id, x-ainfera-audit-url, and x-ainfera-stream-mode before the body runs, but omit x-ainfera-inference-id and x-ainfera-receipt-id that non-stream post_messages sets after dispatch completes.
Reviewed by Cursor Bugbot for commit 5a57625. Configure here.
Completes the half of AIN-271 that SP-1 deferred. `/v1/messages` now honors `stream:true` (200 + text/event-stream with ordered Anthropic SSE frames) and `tools[]` (pass-through to backends, `tool_use` blocks in the response). The §16 capture invariant holds: every routed call — streamed or not — writes exactly one `routing_outcomes` row plus the matching audit events plus the ledger debit. Stacks on SP-1's `chore/sp1-inference-rename` (PR #70). Merges AFTER that PR. ## Adapter contract lift - `ProviderAdapter.chat()` gains `tools` + `tool_choice` (defaults None — back-compat preserved across all 5 adapters). - New `ProviderAdapter.stream_chat()` async generator yields normalized `StreamEvent`s. Default impl wraps `chat()` into one content_delta + one message_delta so adapters that don't yet override honor the contract surface. - New `StreamEvent` dataclass: kinds `content_delta`, `tool_use_start`, `tool_use_delta`, `message_delta`. - New `ToolsNotSupportedError` — adapters that don't yet wire tool calling raise this at the adapter boundary; the handler maps it to a 422 with backend slug + remediation. - `AdapterResponse.content_blocks` added so tool_use round-trips through the non-streaming path too. ## Per-adapter native streaming - AnthropicAdapter: real native SSE against `api.anthropic.com/v1/messages` with `stream:true`; sub-1s TTFT on the wire. tool_use blocks pass through natively. - OpenAICompatAdapter (base for OpenAI/Mistral/Together/xAI/Groq): real native SSE against `/v1/chat/completions` with `stream:true` + `stream_options.include_usage`; translates `delta.tool_calls[]` → normalized tool_use events. - OpenAIAdapter responses-tier (gpt-5.5-pro): tools non-empty raises ToolsNotSupportedError → 422 with backend slug. - GeminiAdapter / MistralAdapter: signature extended; inherit OpenAICompatAdapter native streaming. ## Streaming dispatch + /v1/messages - `services/streaming.py` runs the dispatcher to completion (full §16 capture + ledger + audit), then synthesizes Anthropic SSE frames from the resulting DispatchResult. v0 posture: `wrapped` (TTFT = full inference time); response header `x-ainfera-stream-mode` reports the mode so SDK clients can observe it. Adapter-level native streaming primitives in this same PR are ready for the follow-up that refactors `dispatch_inference` to consume them end-to-end (flipping the header to `native`). - `routers/anthropic_compat.py`: - Drops 501-on-stream → returns StreamingResponse with text/event-stream content-type. - Drops blanket 422-on-tools → tools pass through. Legacy code `tool_calling_not_supported_on_shim` retired; backends without tools surface `tools_not_supported_by_backend` with hint. - `MessagesResponse.content[]` polymorphic (text OR tool_use); SDK sees one shape across stream + non-stream. - Alias resolver honored on streamed calls (`_log_alias_hit` fires for the three SP-1 legacy strings). - Audit-trace headers (`x-ainfera-agent-id`, `x-ainfera-audit-url`) set on streaming responses identical to non-streaming. ## Tests - tests/unit/test_streaming_wire_format.py — 6 pure tests against default `stream_chat()` wrapper + AIN-176→Anthropic finish_reason mapping + `supports_native_streaming()` flag. - tests/integration/test_anthropic_compat.py — replaces SP-1 501/422 assertions with SP-2 coverage: · stream:true → 200 + text/event-stream + ordered Anthropic frames · streaming writes §16 row on close · streaming honors silent-alias resolver (parametrized × 3) · non-empty tools passes through Pre-commit: ruff + ruff-format + mypy --strict + pytest unit+smoke all green (505 unit+smoke tests). ## SP-2 v0 honesty caveat Contract surface (200 text/event-stream, ordered Anthropic frames, §16 capture, tool_use round-trip, alias parity) is real and verified. TTFT is NOT sub-1s in v0 because the streaming wrapper runs non-streaming dispatch first and replays its full response as SSE. The adapter-level native streaming primitives are in place; the follow-up refactors dispatch_inference to consume them end-to-end. `x-ainfera-stream-mode: wrapped` today → `native` after the follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5a57625 to
7281e42
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
There are 8 total unresolved issues (including 5 from previous reviews).
Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issues.
Reviewed by Cursor Bugbot for commit 7281e42. Configure here.
| blocks = [{"type": "text", "text": inf_resp.content}] | ||
| return MessagesResponse( | ||
| id=f"msg_{uuid4().hex[:24]}", | ||
| content=[_TextBlock(text=inf_resp.content)], |
There was a problem hiding this comment.
Tools never reach dispatch
High Severity
The Anthropic /v1/messages endpoint accepts tools and tool_choice, but these aren't fully forwarded to the inference dispatch logic. For non-streaming requests, they're omitted from the InferenceRequest. For streaming, stream_messages receives them but doesn't pass them to dispatch_with_brain. This prevents tool definitions from reaching backend providers.
Reviewed by Cursor Bugbot for commit 7281e42. Configure here.
| tenant_id=tenant.id, | ||
| flattened_msgs=flattened_msgs, | ||
| idempotency_key=idempotency_key, | ||
| ) |
There was a problem hiding this comment.
Stream ignores vendor model
Medium Severity
With stream:true, every request goes through _serve_messages_stream and dispatch_with_brain, and body.model is never passed into dispatch. Non-stream calls use post_inference, which routes vendor-pinned models via direct dispatch_inference. Pinned models with streaming are treated as brain-routed ainfera-inference, breaking vendor passthrough parity.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 7281e42. Configure here.
| # OpenAI Chat Completions tools shape — surface that clearly | ||
| # to the dispatcher rather than silently dropping tools. | ||
| if tools: | ||
| raise ToolsNotSupportedError(adapter_slug=f"{self.slug}/responses") |
There was a problem hiding this comment.
Unsupported tools mark failure
Medium Severity
When tools are eventually passed to dispatch, ToolsNotSupportedError from the OpenAI responses path is handled like an unknown exception in dispatch_inference, triggering _finalize_failure before post_messages can map it to 422 tools_not_supported_by_backend, leaving a failed inference and refund alongside the client-facing 422.
Reviewed by Cursor Bugbot for commit 7281e42. Configure here.
#80) * feat(api): SP-2 PR-A · AIN-271 streaming + tool-use lift on /v1/messages Completes the half of AIN-271 that SP-1 deferred. `/v1/messages` now honors `stream:true` (200 + text/event-stream with ordered Anthropic SSE frames) and `tools[]` (pass-through to backends, `tool_use` blocks in the response). The §16 capture invariant holds: every routed call — streamed or not — writes exactly one `routing_outcomes` row plus the matching audit events plus the ledger debit. Stacks on SP-1's `chore/sp1-inference-rename` (PR #70). Merges AFTER that PR. ## Adapter contract lift - `ProviderAdapter.chat()` gains `tools` + `tool_choice` (defaults None — back-compat preserved across all 5 adapters). - New `ProviderAdapter.stream_chat()` async generator yields normalized `StreamEvent`s. Default impl wraps `chat()` into one content_delta + one message_delta so adapters that don't yet override honor the contract surface. - New `StreamEvent` dataclass: kinds `content_delta`, `tool_use_start`, `tool_use_delta`, `message_delta`. - New `ToolsNotSupportedError` — adapters that don't yet wire tool calling raise this at the adapter boundary; the handler maps it to a 422 with backend slug + remediation. - `AdapterResponse.content_blocks` added so tool_use round-trips through the non-streaming path too. ## Per-adapter native streaming - AnthropicAdapter: real native SSE against `api.anthropic.com/v1/messages` with `stream:true`; sub-1s TTFT on the wire. tool_use blocks pass through natively. - OpenAICompatAdapter (base for OpenAI/Mistral/Together/xAI/Groq): real native SSE against `/v1/chat/completions` with `stream:true` + `stream_options.include_usage`; translates `delta.tool_calls[]` → normalized tool_use events. - OpenAIAdapter responses-tier (gpt-5.5-pro): tools non-empty raises ToolsNotSupportedError → 422 with backend slug. - GeminiAdapter / MistralAdapter: signature extended; inherit OpenAICompatAdapter native streaming. ## Streaming dispatch + /v1/messages - `services/streaming.py` runs the dispatcher to completion (full §16 capture + ledger + audit), then synthesizes Anthropic SSE frames from the resulting DispatchResult. v0 posture: `wrapped` (TTFT = full inference time); response header `x-ainfera-stream-mode` reports the mode so SDK clients can observe it. Adapter-level native streaming primitives in this same PR are ready for the follow-up that refactors `dispatch_inference` to consume them end-to-end (flipping the header to `native`). - `routers/anthropic_compat.py`: - Drops 501-on-stream → returns StreamingResponse with text/event-stream content-type. - Drops blanket 422-on-tools → tools pass through. Legacy code `tool_calling_not_supported_on_shim` retired; backends without tools surface `tools_not_supported_by_backend` with hint. - `MessagesResponse.content[]` polymorphic (text OR tool_use); SDK sees one shape across stream + non-stream. - Alias resolver honored on streamed calls (`_log_alias_hit` fires for the three SP-1 legacy strings). - Audit-trace headers (`x-ainfera-agent-id`, `x-ainfera-audit-url`) set on streaming responses identical to non-streaming. ## Tests - tests/unit/test_streaming_wire_format.py — 6 pure tests against default `stream_chat()` wrapper + AIN-176→Anthropic finish_reason mapping + `supports_native_streaming()` flag. - tests/integration/test_anthropic_compat.py — replaces SP-1 501/422 assertions with SP-2 coverage: · stream:true → 200 + text/event-stream + ordered Anthropic frames · streaming writes §16 row on close · streaming honors silent-alias resolver (parametrized × 3) · non-empty tools passes through Pre-commit: ruff + ruff-format + mypy --strict + pytest unit+smoke all green (505 unit+smoke tests). ## SP-2 v0 honesty caveat Contract surface (200 text/event-stream, ordered Anthropic frames, §16 capture, tool_use round-trip, alias parity) is real and verified. TTFT is NOT sub-1s in v0 because the streaming wrapper runs non-streaming dispatch first and replays its full response as SSE. The adapter-level native streaming primitives are in place; the follow-up refactors dispatch_inference to consume them end-to-end. `x-ainfera-stream-mode: wrapped` today → `native` after the follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(api): SP-4 PR-A · forward capture-coverage guard (AIN-244 instrumentation) (#73) * feat(api): SP-4 PR-A · forward capture-coverage guard for routed dispatches Adds the durable forward-coverage guarantee for §16 capture: every routed dispatch (canonical `ainfera-inference` OR any of the 3 SP-1 aliases) writes exactly one `routing_outcomes` row, regardless of outcome (success / reject / fallback / fail). Pinned passthroughs (vendor slugs) write zero AND carry a `router: "direct"` audit marker. Stacks on SP-2 PR-A (`feat/ain271-streaming-tooluse`, api#72) — that PR's stream-close capture path is the last exit covered by this guard. ## Moat-sensitive scope (read this first) This PR is **pure observability**. Per the SP-4 §1 guardrails: - ZERO change to routing decisions, scores, weights, thresholds, candidate ordering, `M_allowed`, `q_prior`, `q_empirical`, ruleset_hash. The diff against `services/routing_brain.py` and `services/routing.py` is **empty**. Verifiable: `git diff feat/ain271-streaming-tooluse..HEAD -- ainfera_api/services/routing*.py` shows no hunks. - `routing_outcomes` schema is unchanged. No new columns, no migration. The row is written by the existing `insert_decision()` / `complete_decision()` calls in `dispatch_with_brain` (§0/P3 walk-through confirmed every exit path already writes the row). - `routing/ainfera_routing/decide.py` is untouched. ## What's new 1. `ainfera_api/services/capture_invariant.py`: - `route_outcome_kind(model_slug) -> "routed" | "passthrough"` — pure classifier keyed off the SP-1 alias resolver's `ROUTING_TARGETS`, so any string added to the resolver becomes "routed" without a second edit. - `assert_capture_invariant(db, inference_id, kind)` — read-only post-condition check the test sweep runs after every probe. Raises `CaptureInvariantViolationError` with diagnostic context when a routed call returns without a row or a passthrough produces one unexpectedly. - `find_passthrough_audit_event()` — helper for the test sweep to assert the `router: "direct"` marker is present. - `DispatchCaptureCounter.dispatch_without_capture_total` — the headline regression signal. Stays 0 in green builds; production scrape (future Prometheus surface) alerts on any non-zero. 2. `tests/unit/test_capture_invariant.py` — 9 pure tests locking the classifier (canonical + 3 aliases → routed; vendor slugs + typos → passthrough) + the counter semantics (routed-miss bumps the regression signal; passthrough-captured-unexpectedly bumps the contamination signal; reset zeros everything). 3. `tests/integration/test_capture_coverage.py` — parametrized sweep that drives a routed-success call for EACH of the 4 routing targets, a reject-floor routed call, and passthrough calls against two vendor slugs (anthropic native + openai). After each, asserts: - routed success → exactly 1 routing_outcomes row, `outcome_status='succeeded'` - reject path → 1 row, `outcome_status='rejected_floor'`, `inference_id IS NULL` (the only branch where it's NULL by design — see RoutingOutcomeORM docstring) - passthrough → 0 rows AND `router: "direct"` in the audit chain (distinguishes a properly-bypassed passthrough from a routed call that silently lost its row) Plus a coverage-sweep test that asserts `DispatchCaptureCounter.dispatch_without_capture_total == 0` at the end of a mixed dispatch sequence. ## §0/P2 denominator finding (documented for the audit chain) Live read against Supabase `dftfpwzqxoebwzepygzl`: - 778 historical inferences / 5 routing_outcomes rows - 0 historical `request_payload.model` was a routing string (ainfera-inference / ainfera-mithril / ainfera-auto / ainfera/auto) - ALL 778 were pinned passthroughs — vendor slugs (claude-opus-4-7 x220, gpt-5-5 x189, claude-haiku-4-5 x105, ...) - The 3 succeeded outcome rows are integration-test side effects **The 773-row "gap" is honest fleet posture, not a capture failure.** The fleet's been on pinned passthroughs (AULE_PLANNER / YAVANNA_X_MODEL opt-outs). No backfill is owed (§D3). PR-A's value is the forward GUARANTEE: every NEW routed call going forward writes exactly one row. ## Pre-commit ruff + ruff-format + mypy --strict + pytest tests/unit + tests/smoke all green (523 tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(api): SP-CLOSE · capture-invariant uses AuditEventType enum (not raw string) Same class as the dashboard.py:127 fix landed in #71. The capture-invariant service + integration test compared `AuditEventORM.event_type == "inference_routed"` (underscored Python name), but the actual DB enum value is `inference.routed` (dotted) per migration 20260514_0001. Postgres rejected the literal with: invalid input value for enum audit_event_type: "inference_routed" Fix: pass `AuditEventType.inference_routed` (the enum *member*) instead of the raw string — SQLAlchemy's `values_callable` resolves it to the correct DB value (`inference.routed`). Docstring updated to spell the dotted form for any future reader. Unblocks the SP-4 PR-A integration tests: test_capture_coverage.py::test_passthrough_writes_zero_outcome_rows_and_router_direct_audit No engine touch, no routing_outcomes touch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(api): SP-4 PR-B · routing_preference dial — balanced byte-identical, quality/cost gated (AIN-244 dial) (#74) Exposes `routing_preference: "quality" | "balanced" | "cost"` in the routing_hint body as sugar over the existing caps. **`balanced` is byte-identical to today's behavior** (the dial is a no-op when balanced is selected — proved by the parametrized regression lock in the test file). **`quality` / `cost` are accepted on the wire but INERT** until the env gate `AINFERA_ROUTING_PREFERENCE_LIVE=1` is set (founder Disc#12 authorization of the lever values). Stacks on SP-2 api#72 (`feat/ain271-streaming-tooluse`); independent of SP-4 PR-A (#73 capture-coverage). ## Moat-sensitive scope · Disc#12 boundary This PR is Disc#12-adjacent — the dial CAN change routing decisions once the env gate is on. To stay safe: - The default (gate OFF) means `quality`/`cost` resolve to today's policy IDENTICALLY to `balanced`. SP-4 ships with the gate OFF. - Explicit caller `min_quality` always wins. The dial only nudges the default-derived floor — a quality-conscious caller never has their floor silently lowered by a `cost` preference. - Safety clamps: dial output is bounded by [good=0.50, frontier=0.85] so neither lever can exclude every voter or admit a sub-floor model. - Pure-function `_apply_preference()` is deterministic — same input → same output, testable without the brain. ## Proposed mapping (Aulë's conservative starting point — founder authorizes) `balanced` — no-op. Resolves exactly as today. `quality` — bump default min_quality by +0.10 (default 0.50 → 0.60), clamped to the `frontier` tier (0.85). Caller's explicit `min_quality` wins if higher. `cost` — drop default min_quality by -0.10, clamped to the `good` tier (0.50). Caller's explicit `min_quality` wins if higher. Both bumps are conservative: ≤0.10 delta, with hard safety clamps. No weighted-λ, no score surgery, no candidate-ordering changes. The dial moves the FLOOR; the engine still picks cheapest-clearing-floor. The founder reviews + authorizes the exact lever values in this PR. Once signed off, `railway env set AINFERA_ROUTING_PREFERENCE_LIVE=1` on the api service flips the gate ON. Until then, only `balanced` ships live behavior. ## What's new - `services/routing_brain.py`: - `VALID_PREFERENCES` frozenset + `DEFAULT_PREFERENCE = "balanced"`. - `_apply_preference(base_min_q, preference) -> Decimal` — pure function honoring the gate-off semantic. - `_routing_preference_live()` — env-var read at call time so ops can flip the gate without restart. - `_PREFERENCE_FLOOR_DELTA` + safety clamps `_SAFETY_MIN_QUALITY` + `_SAFETY_MAX_QUALITY` (= good / frontier tier numerics). - `resolve_policy()` reads `routing_preference` from the hint and applies the dial ONLY when the caller did NOT pass an explicit `min_quality` — preserves caller-intent-wins semantics. - `models/inference.py`: `InferenceRequest.routing_hint` description documents the new key (so it surfaces in openapi.json). - `tests/unit/test_routing_preference_dial.py`: - 8-case parametrized **byte-identical regression lock** for `balanced` — the moat invariant. Any divergence fails the build. - Dial-inert-when-gate-off coverage × all 3 preferences. - Dial-active mapping × bumps + clamps + explicit-caller-wins. - Unknown / typo preference values fall through to `balanced`. - 23 tests; all pure (no DB). ## Pre-commit ruff + ruff-format + mypy --strict + pytest unit+smoke = 528 green. ## Out of scope (per SP-4 §1) - methodology v1.3 changes - weights / λ-blending - online learning (AIN-246 — Backlog/deferred) - `M_allowed` / `q_prior` / `q_empirical` semantics - engine code in `routing/ainfera_routing/decide.py` — untouched ## Public copy (founder/Varda) Drafted README/STRATEGY paragraph for the routing repo describing the dial — see `docs/routing-preference.md` in the next PR after founder sign-off on the mapping values. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary
Completes the half of AIN-271 that SP-1 deferred.
/v1/messagesnow honorsstream:true(200 +text/event-streamwith ordered Anthropic SSE frames) andtools[](pass-through to backends,tool_useblocks in the response). The §16 capture invariant holds: every routed call — streamed or not — writes exactly onerouting_outcomesrow plus matching audit events plus ledger debit.Stacks on SP-1 (#70). Base is
chore/sp1-inference-rename; merges AFTER that PR.What lands
ProviderAdapter.chat()gainstools+tool_choice(defaultsNone— back-compat). Newstream_chat()async generator +StreamEventdataclass +ToolsNotSupportedError.AdapterResponse.content_blocksfor tool round-trip.services/streaming.pysynthesizes Anthropic-shape SSE frames from a full DispatchResult, preserving §16 + ledger + audit invariants on the streaming path.routers/anthropic_compat.pydrops 501-on-stream + blanket 422-on-tools. ReturnsStreamingResponse(text/event-stream); alias resolver honored on streamed calls; polymorphiccontent[](text + tool_use).x-ainfera-agent-id,x-ainfera-audit-url) set on streamed responses identical to non-streaming.SP-2 v0 honesty caveat
Contract surface is real (200, text/event-stream, ordered Anthropic frames, §16 capture, tool_use round-trip, alias parity, all 3 SP-1 aliases). TTFT is NOT sub-1s in v0 — the streaming wrapper runs non-streaming dispatch first and replays its full response as SSE. The adapter-level native streaming primitives shipped above are ready for the follow-up that refactors
dispatch_inferenceto consume them end-to-end.x-ainfera-stream-mode: wrappedtoday; flips tonativeafter the follow-up so SDK probes can observe TTFT.Tests
tests/unit/test_streaming_wire_format.py— 6 pure tests against defaultstream_chat()wrapper + AIN-176→Anthropic finish_reason mapping +supports_native_streaming().tests/integration/test_anthropic_compat.py— replaces SP-1 501/422 assertions with:stream:true→ 200 + text/event-stream + ordered Anthropic framestoolspasses through (no blanket 422)Pre-commit ran ruff + ruff-format + mypy --strict + pytest (unit + smoke) — all green (505 tests).
Test plan
curl -N -X POST $URL/v1/messages -d '{"model":"ainfera-inference","stream":true,...}'→ 200 +content-type: text/event-stream+ sequence ofevent: message_start→event: content_block_*→event: message_delta→event: message_stopmodel:"ainfera-mithril"→ 200 + Railway logrouter_alias_hit legacy=ainfera-mithril canonical=ainfera-inferenceANTHROPIC_BASE_URL=https://api.ainfera.aistreams text successfullytools:[{...}]request → 200 withcontent[]containing atool_useblock when the model invoked the tool;stop_reason: "tool_use"selectonrouting_outcomesfor the streamed call's agent_id returns exactly 1 new row🤖 Generated with Claude Code
Note
Medium Risk
Introduces new streaming and tool-calling surfaces across adapters and the
/v1/messagesrouter, which affects core inference I/O shapes and error handling. While largely additive, changes touch routing results and provider adapters and could impact compatibility if SSE framing or tool translation is incorrect.Overview
Enables
/v1/messagesto acceptstream=trueand return200 text/event-streamwith Anthropic-style SSE frames, implemented via a newservices/streaming.pywrapper that replays a completed routed inference asmessage_start/content_block_*/message_delta/message_stopwhile preserving §16/audit/ledger invariants.Adds tool-calling plumb-through:
ProviderAdapter.chat()gainstools/tool_choice, responses can carry structuredcontent_blocks, and OpenAI-compat providers translatetool_callsinto Anthropic-shapedtool_useblocks; unsupported backends raiseToolsNotSupportedErrorwhich/v1/messagessurfaces as a structured 422.Adds native provider-level SSE parsing for Anthropic and OpenAI-compat adapters (
stream_chat()+ normalizedStreamEvent), plus tests updating the Anthropic-compat integration suite for the new streaming behavior and adding unit coverage for stream fallback and stop-reason mapping.Reviewed by Cursor Bugbot for commit 7281e42. Bugbot is set up for automated code reviews on this repo. Configure here.