Skip to content

feat(api): SP-2 PR-A · AIN-271 streaming + tool-use lift on /v1/messages#72

Merged
hizrianraz merged 1 commit into
chore/sp1-inference-renamefrom
feat/ain271-streaming-tooluse
May 24, 2026
Merged

feat(api): SP-2 PR-A · AIN-271 streaming + tool-use lift on /v1/messages#72
hizrianraz merged 1 commit into
chore/sp1-inference-renamefrom
feat/ain271-streaming-tooluse

Conversation

@hizrianraz
Copy link
Copy Markdown
Contributor

@hizrianraz hizrianraz commented May 23, 2026

Summary

Completes the half of AIN-271 that SP-1 deferred. /v1/messages now honors stream:true (200 + text/event-stream with ordered Anthropic SSE frames) and tools[] (pass-through to backends, tool_use blocks in the response). The §16 capture invariant holds: every routed call — streamed or not — writes exactly one routing_outcomes row plus matching audit events plus ledger debit.

Stacks on SP-1 (#70). Base is chore/sp1-inference-rename; merges AFTER that PR.

What lands

  • ProviderAdapter.chat() gains tools + tool_choice (defaults None — back-compat). New stream_chat() async generator + StreamEvent dataclass + ToolsNotSupportedError. AdapterResponse.content_blocks for tool round-trip.
  • AnthropicAdapter + OpenAICompatAdapter ship native SSE stream_chat overrides (sub-1s TTFT at the adapter layer). Mistral/Together/Gemini inherit native streaming via OpenAICompatAdapter.
  • services/streaming.py synthesizes Anthropic-shape SSE frames from a full DispatchResult, preserving §16 + ledger + audit invariants on the streaming path.
  • routers/anthropic_compat.py drops 501-on-stream + blanket 422-on-tools. Returns StreamingResponse(text/event-stream); alias resolver honored on streamed calls; polymorphic content[] (text + tool_use).
  • Audit headers (x-ainfera-agent-id, x-ainfera-audit-url) set on streamed responses identical to non-streaming.

SP-2 v0 honesty caveat

Contract surface is real (200, text/event-stream, ordered Anthropic frames, §16 capture, tool_use round-trip, alias parity, all 3 SP-1 aliases). TTFT is NOT sub-1s in v0 — the streaming wrapper runs non-streaming dispatch first and replays its full response as SSE. The adapter-level native streaming primitives shipped above are ready for the follow-up that refactors dispatch_inference to consume them end-to-end. x-ainfera-stream-mode: wrapped today; flips to native after the follow-up so SDK probes can observe TTFT.

Tests

  • tests/unit/test_streaming_wire_format.py — 6 pure tests against default stream_chat() wrapper + AIN-176→Anthropic finish_reason mapping + supports_native_streaming().
  • tests/integration/test_anthropic_compat.py — replaces SP-1 501/422 assertions with:
    • stream:true → 200 + text/event-stream + ordered Anthropic frames
    • streaming writes §16 row on close
    • streaming honors silent-alias resolver (parametrized × 3)
    • non-empty tools passes through (no blanket 422)

Pre-commit ran ruff + ruff-format + mypy --strict + pytest (unit + smoke) — all green (505 tests).

Test plan

  • CI green
  • Branch preview: curl -N -X POST $URL/v1/messages -d '{"model":"ainfera-inference","stream":true,...}' → 200 + content-type: text/event-stream + sequence of event: message_startevent: content_block_*event: message_deltaevent: message_stop
  • Same with model:"ainfera-mithril" → 200 + Railway log router_alias_hit legacy=ainfera-mithril canonical=ainfera-inference
  • Anthropic SDK with ANTHROPIC_BASE_URL=https://api.ainfera.ai streams text successfully
  • tools:[{...}] request → 200 with content[] containing a tool_use block when the model invoked the tool; stop_reason: "tool_use"
  • Read-only select on routing_outcomes for the streamed call's agent_id returns exactly 1 new row

🤖 Generated with Claude Code


Note

Medium Risk
Introduces new streaming and tool-calling surfaces across adapters and the /v1/messages router, which affects core inference I/O shapes and error handling. While largely additive, changes touch routing results and provider adapters and could impact compatibility if SSE framing or tool translation is incorrect.

Overview
Enables /v1/messages to accept stream=true and return 200 text/event-stream with Anthropic-style SSE frames, implemented via a new services/streaming.py wrapper that replays a completed routed inference as message_start/content_block_*/message_delta/message_stop while preserving §16/audit/ledger invariants.

Adds tool-calling plumb-through: ProviderAdapter.chat() gains tools/tool_choice, responses can carry structured content_blocks, and OpenAI-compat providers translate tool_calls into Anthropic-shaped tool_use blocks; unsupported backends raise ToolsNotSupportedError which /v1/messages surfaces as a structured 422.

Adds native provider-level SSE parsing for Anthropic and OpenAI-compat adapters (stream_chat() + normalized StreamEvent), plus tests updating the Anthropic-compat integration suite for the new streaming behavior and adding unit coverage for stream fallback and stop-reason mapping.

Reviewed by Cursor Bugbot for commit 7281e42. Bugbot is set up for automated code reviews on this repo. Configure here.

@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 23, 2026

AIN-271 [Phase 5 · §0 · AIN-309 (planning)] Gate: P1-WS2 prod deploy of /v1/messages streaming + tool-use

Hard gate for Phase 5. WS1+ cannot start until this is CERT GREEN.

⚠️ NAMING MANDATE SUPERSEDES (LOCKED 2026-05-23 PM, Disc#12)

Founder locked ainfera-inference as the flagship wire string and HARD-DELETE of ainfera-mithril + ainfera-auto + all Tolkien PRODUCT names (Mithril/Valinor/Galvorn). Products are now DESCRIPTIVE: Ainfera Inference (flagship), Ainfera OS, Ainfera Robotics. Agent BEINGS keep Tolkien (Varda etc).
Therefore this deploy MUST ship ainfera-inference as canonical — NOT ainfera-mithril. All 4 CC sessions (2026-05-23 AM) stamped ainfera-mithril; that is now the dead name. The deploy + the rename-recode sweep are coupled: do not deploy the dead string to prod.

Reconciled state 2026-05-23 PM (post 4-session consolidation)

Probe Result
POST /v1/messages base (non-stream) LIVE — 401 unauth (was 404). Shim deployed via api#68.
POST /v1/messages stream=true 501/404 — streaming NOT implemented (the remaining keystone)
POST /v1/messages tools=[...] 422/404 — tool-use NOT implemented

Base shim is deployed. Streaming + tool-use is the highest-value remaining build — it also closes the WS5 perceived-latency issue (33s is 98.5% model inference, NOT a bug; streaming gives <1s first-token). Unblocks Aulë SDK + the Phase-5 MAF target.

What needs to happen

  • Implement streaming + tool-use on /v1/messages (AIN-174 Phase B) — SSE piped not buffered; tool round-trip on both dialects.
  • Ship with canonical string ainfera-inference (mithril/auto accepted as inbound aliases ONLY during grace, then hard-removed — founder wants hard-delete, so confirm grace length).
  • Founder rules "router" wire-format (hard-cut to ainfera-inference recommended given hard-delete stance).
  • Founder pushes + Railway deploy + re-cert.

Owner

Founder-only: branch + commit + deploy. Aulë drafts PR + cert checklist + the streaming/tool-use implementation on request.

Linked

Repo-root MASTER_LOG.md WS0 · ainfera-os/MASTER_LOG_P2.md §0 · the descriptive-rename sweep (needs its own ticket once Linear 250-cap freed).

Review in Linear

finish_reason=response.finish_reason,
receipt_id=receipt.id,
provider=provider_slug, # AIN-126
content_blocks=list(response.content_blocks),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tools omitted from inference dispatch

High Severity

The /v1/messages endpoint accepts tools and tool_choice but doesn't forward them to the underlying inference handler. This occurs in both non-streaming and streaming paths, causing tool definitions to be ignored. Consequently, tool_use functionality and ToolsNotSupportedError handling don't work as intended, making tool-enabled calls behave as text-only.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5a57625. Configure here.

# SDK clients read `content[]` and dispatch on `type` per block.
blocks: list[dict[str, Any]] = list(inf_resp.content_blocks or [])
if not blocks:
blocks = [{"type": "text", "text": inf_resp.content}]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-stream drops adapter content blocks

High Severity

The non-streaming /v1/messages endpoint discards tool_use blocks. post_inference doesn't correctly transfer structured tool_use content from DispatchResult.content_blocks to InferenceResponse.content_blocks, causing responses to fall back to a single text block.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5a57625. Configure here.

idempotency_key=idempotency_key,
caller_task_type=task_type,
request_id=request_id,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Streaming skips vendor model passthrough

High Severity

When stream:true is used with /v1/messages, requests are always routed through the brain, even when a specific vendor model is provided. This diverges from non-streamed requests, which honor vendor pins directly. As a result, streamed calls can lead to unexpected provider selection, billing, and audit behavior.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5a57625. Configure here.

},
)
yield _sse(_EVENT_MESSAGE_STOP, {"type": "message_stop"})
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream errors after HTTP 200

Medium Severity

stream_messages only catches NoCandidateError and AllCandidatesFailedError after StreamingResponse already returned 200. CapViolationError, InsufficientFundsError, AgentNotActiveError, and ProviderError from dispatch_with_brain propagate out of the generator, unlike non-stream /v1/messages which maps them to 402, 409, 502, or 422 JSON errors.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5a57625. Configure here.

# inference then replay). Native end-to-end is the planned
# follow-up; the adapter-level primitives shipped here.
"x-ainfera-stream-mode": "wrapped",
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream missing audit headers

Medium Severity

Streamed /v1/messages responses set x-ainfera-agent-id, x-ainfera-audit-url, and x-ainfera-stream-mode before the body runs, but omit x-ainfera-inference-id and x-ainfera-receipt-id that non-stream post_messages sets after dispatch completes.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5a57625. Configure here.

Completes the half of AIN-271 that SP-1 deferred. `/v1/messages` now
honors `stream:true` (200 + text/event-stream with ordered Anthropic
SSE frames) and `tools[]` (pass-through to backends, `tool_use` blocks
in the response). The §16 capture invariant holds: every routed call —
streamed or not — writes exactly one `routing_outcomes` row plus the
matching audit events plus the ledger debit.

Stacks on SP-1's `chore/sp1-inference-rename` (PR #70). Merges AFTER
that PR.

## Adapter contract lift

- `ProviderAdapter.chat()` gains `tools` + `tool_choice` (defaults
  None — back-compat preserved across all 5 adapters).
- New `ProviderAdapter.stream_chat()` async generator yields normalized
  `StreamEvent`s. Default impl wraps `chat()` into one content_delta +
  one message_delta so adapters that don't yet override honor the
  contract surface.
- New `StreamEvent` dataclass: kinds `content_delta`, `tool_use_start`,
  `tool_use_delta`, `message_delta`.
- New `ToolsNotSupportedError` — adapters that don't yet wire tool
  calling raise this at the adapter boundary; the handler maps it to
  a 422 with backend slug + remediation.
- `AdapterResponse.content_blocks` added so tool_use round-trips
  through the non-streaming path too.

## Per-adapter native streaming

- AnthropicAdapter: real native SSE against `api.anthropic.com/v1/messages`
  with `stream:true`; sub-1s TTFT on the wire. tool_use blocks pass
  through natively.
- OpenAICompatAdapter (base for OpenAI/Mistral/Together/xAI/Groq): real
  native SSE against `/v1/chat/completions` with `stream:true` +
  `stream_options.include_usage`; translates `delta.tool_calls[]` →
  normalized tool_use events.
- OpenAIAdapter responses-tier (gpt-5.5-pro): tools non-empty raises
  ToolsNotSupportedError → 422 with backend slug.
- GeminiAdapter / MistralAdapter: signature extended; inherit
  OpenAICompatAdapter native streaming.

## Streaming dispatch + /v1/messages

- `services/streaming.py` runs the dispatcher to completion (full §16
  capture + ledger + audit), then synthesizes Anthropic SSE frames
  from the resulting DispatchResult. v0 posture: `wrapped` (TTFT =
  full inference time); response header `x-ainfera-stream-mode`
  reports the mode so SDK clients can observe it. Adapter-level
  native streaming primitives in this same PR are ready for the
  follow-up that refactors `dispatch_inference` to consume them
  end-to-end (flipping the header to `native`).
- `routers/anthropic_compat.py`:
  - Drops 501-on-stream → returns StreamingResponse with
    text/event-stream content-type.
  - Drops blanket 422-on-tools → tools pass through. Legacy code
    `tool_calling_not_supported_on_shim` retired; backends without
    tools surface `tools_not_supported_by_backend` with hint.
  - `MessagesResponse.content[]` polymorphic (text OR tool_use);
    SDK sees one shape across stream + non-stream.
  - Alias resolver honored on streamed calls (`_log_alias_hit` fires
    for the three SP-1 legacy strings).
- Audit-trace headers (`x-ainfera-agent-id`, `x-ainfera-audit-url`)
  set on streaming responses identical to non-streaming.

## Tests

- tests/unit/test_streaming_wire_format.py — 6 pure tests against
  default `stream_chat()` wrapper + AIN-176→Anthropic finish_reason
  mapping + `supports_native_streaming()` flag.
- tests/integration/test_anthropic_compat.py — replaces SP-1 501/422
  assertions with SP-2 coverage:
    · stream:true → 200 + text/event-stream + ordered Anthropic frames
    · streaming writes §16 row on close
    · streaming honors silent-alias resolver (parametrized × 3)
    · non-empty tools passes through

Pre-commit: ruff + ruff-format + mypy --strict + pytest unit+smoke
all green (505 unit+smoke tests).

## SP-2 v0 honesty caveat

Contract surface (200 text/event-stream, ordered Anthropic frames,
§16 capture, tool_use round-trip, alias parity) is real and verified.
TTFT is NOT sub-1s in v0 because the streaming wrapper runs
non-streaming dispatch first and replays its full response as SSE.
The adapter-level native streaming primitives are in place; the
follow-up refactors dispatch_inference to consume them end-to-end.
`x-ainfera-stream-mode: wrapped` today → `native` after the follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hizrianraz hizrianraz force-pushed the feat/ain271-streaming-tooluse branch from 5a57625 to 7281e42 Compare May 23, 2026 23:00
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 8 total unresolved issues (including 5 from previous reviews).

Fix All in Cursor

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issues.

Reviewed by Cursor Bugbot for commit 7281e42. Configure here.

blocks = [{"type": "text", "text": inf_resp.content}]
return MessagesResponse(
id=f"msg_{uuid4().hex[:24]}",
content=[_TextBlock(text=inf_resp.content)],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tools never reach dispatch

High Severity

The Anthropic /v1/messages endpoint accepts tools and tool_choice, but these aren't fully forwarded to the inference dispatch logic. For non-streaming requests, they're omitted from the InferenceRequest. For streaming, stream_messages receives them but doesn't pass them to dispatch_with_brain. This prevents tool definitions from reaching backend providers.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7281e42. Configure here.

tenant_id=tenant.id,
flattened_msgs=flattened_msgs,
idempotency_key=idempotency_key,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream ignores vendor model

Medium Severity

With stream:true, every request goes through _serve_messages_stream and dispatch_with_brain, and body.model is never passed into dispatch. Non-stream calls use post_inference, which routes vendor-pinned models via direct dispatch_inference. Pinned models with streaming are treated as brain-routed ainfera-inference, breaking vendor passthrough parity.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7281e42. Configure here.

# OpenAI Chat Completions tools shape — surface that clearly
# to the dispatcher rather than silently dropping tools.
if tools:
raise ToolsNotSupportedError(adapter_slug=f"{self.slug}/responses")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsupported tools mark failure

Medium Severity

When tools are eventually passed to dispatch, ToolsNotSupportedError from the OpenAI responses path is handled like an unknown exception in dispatch_inference, triggering _finalize_failure before post_messages can map it to 422 tools_not_supported_by_backend, leaving a failed inference and refund alongside the client-facing 422.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7281e42. Configure here.

@hizrianraz hizrianraz merged commit 10e1cd8 into chore/sp1-inference-rename May 24, 2026
4 checks passed
hizrianraz added a commit that referenced this pull request May 24, 2026
#80)

* feat(api): SP-2 PR-A · AIN-271 streaming + tool-use lift on /v1/messages

Completes the half of AIN-271 that SP-1 deferred. `/v1/messages` now
honors `stream:true` (200 + text/event-stream with ordered Anthropic
SSE frames) and `tools[]` (pass-through to backends, `tool_use` blocks
in the response). The §16 capture invariant holds: every routed call —
streamed or not — writes exactly one `routing_outcomes` row plus the
matching audit events plus the ledger debit.

Stacks on SP-1's `chore/sp1-inference-rename` (PR #70). Merges AFTER
that PR.

## Adapter contract lift

- `ProviderAdapter.chat()` gains `tools` + `tool_choice` (defaults
  None — back-compat preserved across all 5 adapters).
- New `ProviderAdapter.stream_chat()` async generator yields normalized
  `StreamEvent`s. Default impl wraps `chat()` into one content_delta +
  one message_delta so adapters that don't yet override honor the
  contract surface.
- New `StreamEvent` dataclass: kinds `content_delta`, `tool_use_start`,
  `tool_use_delta`, `message_delta`.
- New `ToolsNotSupportedError` — adapters that don't yet wire tool
  calling raise this at the adapter boundary; the handler maps it to
  a 422 with backend slug + remediation.
- `AdapterResponse.content_blocks` added so tool_use round-trips
  through the non-streaming path too.

## Per-adapter native streaming

- AnthropicAdapter: real native SSE against `api.anthropic.com/v1/messages`
  with `stream:true`; sub-1s TTFT on the wire. tool_use blocks pass
  through natively.
- OpenAICompatAdapter (base for OpenAI/Mistral/Together/xAI/Groq): real
  native SSE against `/v1/chat/completions` with `stream:true` +
  `stream_options.include_usage`; translates `delta.tool_calls[]` →
  normalized tool_use events.
- OpenAIAdapter responses-tier (gpt-5.5-pro): tools non-empty raises
  ToolsNotSupportedError → 422 with backend slug.
- GeminiAdapter / MistralAdapter: signature extended; inherit
  OpenAICompatAdapter native streaming.

## Streaming dispatch + /v1/messages

- `services/streaming.py` runs the dispatcher to completion (full §16
  capture + ledger + audit), then synthesizes Anthropic SSE frames
  from the resulting DispatchResult. v0 posture: `wrapped` (TTFT =
  full inference time); response header `x-ainfera-stream-mode`
  reports the mode so SDK clients can observe it. Adapter-level
  native streaming primitives in this same PR are ready for the
  follow-up that refactors `dispatch_inference` to consume them
  end-to-end (flipping the header to `native`).
- `routers/anthropic_compat.py`:
  - Drops 501-on-stream → returns StreamingResponse with
    text/event-stream content-type.
  - Drops blanket 422-on-tools → tools pass through. Legacy code
    `tool_calling_not_supported_on_shim` retired; backends without
    tools surface `tools_not_supported_by_backend` with hint.
  - `MessagesResponse.content[]` polymorphic (text OR tool_use);
    SDK sees one shape across stream + non-stream.
  - Alias resolver honored on streamed calls (`_log_alias_hit` fires
    for the three SP-1 legacy strings).
- Audit-trace headers (`x-ainfera-agent-id`, `x-ainfera-audit-url`)
  set on streaming responses identical to non-streaming.

## Tests

- tests/unit/test_streaming_wire_format.py — 6 pure tests against
  default `stream_chat()` wrapper + AIN-176→Anthropic finish_reason
  mapping + `supports_native_streaming()` flag.
- tests/integration/test_anthropic_compat.py — replaces SP-1 501/422
  assertions with SP-2 coverage:
    · stream:true → 200 + text/event-stream + ordered Anthropic frames
    · streaming writes §16 row on close
    · streaming honors silent-alias resolver (parametrized × 3)
    · non-empty tools passes through

Pre-commit: ruff + ruff-format + mypy --strict + pytest unit+smoke
all green (505 unit+smoke tests).

## SP-2 v0 honesty caveat

Contract surface (200 text/event-stream, ordered Anthropic frames,
§16 capture, tool_use round-trip, alias parity) is real and verified.
TTFT is NOT sub-1s in v0 because the streaming wrapper runs
non-streaming dispatch first and replays its full response as SSE.
The adapter-level native streaming primitives are in place; the
follow-up refactors dispatch_inference to consume them end-to-end.
`x-ainfera-stream-mode: wrapped` today → `native` after the follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api): SP-4 PR-A · forward capture-coverage guard (AIN-244 instrumentation) (#73)

* feat(api): SP-4 PR-A · forward capture-coverage guard for routed dispatches

Adds the durable forward-coverage guarantee for §16 capture: every
routed dispatch (canonical `ainfera-inference` OR any of the 3 SP-1
aliases) writes exactly one `routing_outcomes` row, regardless of
outcome (success / reject / fallback / fail). Pinned passthroughs
(vendor slugs) write zero AND carry a `router: "direct"` audit marker.

Stacks on SP-2 PR-A (`feat/ain271-streaming-tooluse`, api#72) — that
PR's stream-close capture path is the last exit covered by this guard.

## Moat-sensitive scope (read this first)

This PR is **pure observability**. Per the SP-4 §1 guardrails:

- ZERO change to routing decisions, scores, weights, thresholds,
  candidate ordering, `M_allowed`, `q_prior`, `q_empirical`,
  ruleset_hash. The diff against `services/routing_brain.py` and
  `services/routing.py` is **empty**. Verifiable: `git diff
  feat/ain271-streaming-tooluse..HEAD -- ainfera_api/services/routing*.py`
  shows no hunks.
- `routing_outcomes` schema is unchanged. No new columns, no
  migration. The row is written by the existing `insert_decision()`
  / `complete_decision()` calls in `dispatch_with_brain` (§0/P3
  walk-through confirmed every exit path already writes the row).
- `routing/ainfera_routing/decide.py` is untouched.

## What's new

1. `ainfera_api/services/capture_invariant.py`:
   - `route_outcome_kind(model_slug) -> "routed" | "passthrough"` —
     pure classifier keyed off the SP-1 alias resolver's
     `ROUTING_TARGETS`, so any string added to the resolver becomes
     "routed" without a second edit.
   - `assert_capture_invariant(db, inference_id, kind)` — read-only
     post-condition check the test sweep runs after every probe.
     Raises `CaptureInvariantViolationError` with diagnostic context
     when a routed call returns without a row or a passthrough
     produces one unexpectedly.
   - `find_passthrough_audit_event()` — helper for the test sweep
     to assert the `router: "direct"` marker is present.
   - `DispatchCaptureCounter.dispatch_without_capture_total` — the
     headline regression signal. Stays 0 in green builds; production
     scrape (future Prometheus surface) alerts on any non-zero.

2. `tests/unit/test_capture_invariant.py` — 9 pure tests locking
   the classifier (canonical + 3 aliases → routed; vendor slugs +
   typos → passthrough) + the counter semantics (routed-miss bumps
   the regression signal; passthrough-captured-unexpectedly bumps
   the contamination signal; reset zeros everything).

3. `tests/integration/test_capture_coverage.py` — parametrized
   sweep that drives a routed-success call for EACH of the 4 routing
   targets, a reject-floor routed call, and passthrough calls
   against two vendor slugs (anthropic native + openai). After each,
   asserts:
     - routed success → exactly 1 routing_outcomes row,
       `outcome_status='succeeded'`
     - reject path  → 1 row, `outcome_status='rejected_floor'`,
       `inference_id IS NULL` (the only branch where it's NULL by
       design — see RoutingOutcomeORM docstring)
     - passthrough → 0 rows AND `router: "direct"` in the audit
       chain (distinguishes a properly-bypassed passthrough from a
       routed call that silently lost its row)
   Plus a coverage-sweep test that asserts
   `DispatchCaptureCounter.dispatch_without_capture_total == 0` at
   the end of a mixed dispatch sequence.

## §0/P2 denominator finding (documented for the audit chain)

Live read against Supabase `dftfpwzqxoebwzepygzl`:
  - 778 historical inferences / 5 routing_outcomes rows
  - 0 historical `request_payload.model` was a routing string
    (ainfera-inference / ainfera-mithril / ainfera-auto / ainfera/auto)
  - ALL 778 were pinned passthroughs — vendor slugs (claude-opus-4-7
    x220, gpt-5-5 x189, claude-haiku-4-5 x105, ...)
  - The 3 succeeded outcome rows are integration-test side effects

**The 773-row "gap" is honest fleet posture, not a capture failure.**
The fleet's been on pinned passthroughs (AULE_PLANNER /
YAVANNA_X_MODEL opt-outs). No backfill is owed (§D3). PR-A's value
is the forward GUARANTEE: every NEW routed call going forward writes
exactly one row.

## Pre-commit

ruff + ruff-format + mypy --strict + pytest tests/unit + tests/smoke
all green (523 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(api): SP-CLOSE · capture-invariant uses AuditEventType enum (not raw string)

Same class as the dashboard.py:127 fix landed in #71. The
capture-invariant service + integration test compared
`AuditEventORM.event_type == "inference_routed"` (underscored Python
name), but the actual DB enum value is `inference.routed` (dotted)
per migration 20260514_0001.

Postgres rejected the literal with:
  invalid input value for enum audit_event_type: "inference_routed"

Fix: pass `AuditEventType.inference_routed` (the enum *member*)
instead of the raw string — SQLAlchemy's `values_callable` resolves
it to the correct DB value (`inference.routed`). Docstring updated
to spell the dotted form for any future reader.

Unblocks the SP-4 PR-A integration tests:
  test_capture_coverage.py::test_passthrough_writes_zero_outcome_rows_and_router_direct_audit

No engine touch, no routing_outcomes touch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api): SP-4 PR-B · routing_preference dial — balanced byte-identical, quality/cost gated (AIN-244 dial) (#74)

Exposes `routing_preference: "quality" | "balanced" | "cost"` in the
routing_hint body as sugar over the existing caps. **`balanced` is
byte-identical to today's behavior** (the dial is a no-op when
balanced is selected — proved by the parametrized regression lock in
the test file). **`quality` / `cost` are accepted on the wire but
INERT** until the env gate `AINFERA_ROUTING_PREFERENCE_LIVE=1` is set
(founder Disc#12 authorization of the lever values).

Stacks on SP-2 api#72 (`feat/ain271-streaming-tooluse`); independent
of SP-4 PR-A (#73 capture-coverage).

## Moat-sensitive scope · Disc#12 boundary

This PR is Disc#12-adjacent — the dial CAN change routing decisions
once the env gate is on. To stay safe:

- The default (gate OFF) means `quality`/`cost` resolve to today's
  policy IDENTICALLY to `balanced`. SP-4 ships with the gate OFF.
- Explicit caller `min_quality` always wins. The dial only nudges the
  default-derived floor — a quality-conscious caller never has their
  floor silently lowered by a `cost` preference.
- Safety clamps: dial output is bounded by [good=0.50, frontier=0.85]
  so neither lever can exclude every voter or admit a sub-floor model.
- Pure-function `_apply_preference()` is deterministic — same input →
  same output, testable without the brain.

## Proposed mapping (Aulë's conservative starting point — founder authorizes)

  `balanced` — no-op. Resolves exactly as today.
  `quality`  — bump default min_quality by +0.10 (default 0.50 → 0.60),
               clamped to the `frontier` tier (0.85). Caller's explicit
               `min_quality` wins if higher.
  `cost`     — drop default min_quality by -0.10, clamped to the `good`
               tier (0.50). Caller's explicit `min_quality` wins if higher.

Both bumps are conservative: ≤0.10 delta, with hard safety clamps.
No weighted-λ, no score surgery, no candidate-ordering changes. The
dial moves the FLOOR; the engine still picks cheapest-clearing-floor.

The founder reviews + authorizes the exact lever values in this PR.
Once signed off, `railway env set AINFERA_ROUTING_PREFERENCE_LIVE=1`
on the api service flips the gate ON. Until then, only `balanced`
ships live behavior.

## What's new

- `services/routing_brain.py`:
  - `VALID_PREFERENCES` frozenset + `DEFAULT_PREFERENCE = "balanced"`.
  - `_apply_preference(base_min_q, preference) -> Decimal` — pure
    function honoring the gate-off semantic.
  - `_routing_preference_live()` — env-var read at call time so ops
    can flip the gate without restart.
  - `_PREFERENCE_FLOOR_DELTA` + safety clamps `_SAFETY_MIN_QUALITY`
    + `_SAFETY_MAX_QUALITY` (= good / frontier tier numerics).
  - `resolve_policy()` reads `routing_preference` from the hint and
    applies the dial ONLY when the caller did NOT pass an explicit
    `min_quality` — preserves caller-intent-wins semantics.
- `models/inference.py`: `InferenceRequest.routing_hint` description
  documents the new key (so it surfaces in openapi.json).
- `tests/unit/test_routing_preference_dial.py`:
  - 8-case parametrized **byte-identical regression lock** for
    `balanced` — the moat invariant. Any divergence fails the build.
  - Dial-inert-when-gate-off coverage × all 3 preferences.
  - Dial-active mapping × bumps + clamps + explicit-caller-wins.
  - Unknown / typo preference values fall through to `balanced`.
  - 23 tests; all pure (no DB).

## Pre-commit

ruff + ruff-format + mypy --strict + pytest unit+smoke = 528 green.

## Out of scope (per SP-4 §1)

- methodology v1.3 changes
- weights / λ-blending
- online learning (AIN-246 — Backlog/deferred)
- `M_allowed` / `q_prior` / `q_empirical` semantics
- engine code in `routing/ainfera_routing/decide.py` — untouched

## Public copy (founder/Varda)

Drafted README/STRATEGY paragraph for the routing repo describing the
dial — see `docs/routing-preference.md` in the next PR after founder
sign-off on the mapping values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant