fix(api): AIN-174 stop-gap · stream=true returns 501 (not silent JSON billing burn)#43

Merged: hizrianraz merged 1 commit into main from fix/ain-174-stream-explicit-501-not-silent-jsoon on May 18, 2026

@hizrianraz
Contributor

Summary

Eliminates the silent-failure billing burn pattern from INC-2026-05-18-004 Bug 5.

Pattern: SSE-expecting client sends stream=true → Ainfera silently dropped the field (Pydantic v2 default extra='ignore') → returned single-shot JSON → client treated JSON as empty → retried 3x → each retry billed → ~$0.32+/day silent burn per agent.

Stop-gap fix:

  • Adds stream: bool = False to InferenceRequest (was silently dropped before)
  • Returns 501 Not Implemented with code=streaming_not_supported when stream=true, including remediation hint to set stream=false
  • Clients fail fast and can switch behavior instead of silently retrying
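
The stop-gap above can be sketched without any framework. This is a minimal illustration, not the merged code: the dataclass stands in for the real Pydantic `InferenceRequest`, and `handle_inference` plus the error body layout (`error.code`, `error.remediation`) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class InferenceRequest:
    """Hypothetical stand-in for the real request model."""

    model: str
    messages: list[dict[str, Any]]
    stream: bool = False  # declared explicitly so the field is no longer dropped


def handle_inference(req: InferenceRequest) -> tuple[int, dict[str, Any]]:
    """Return (status_code, json_body); streaming requests are rejected up front."""
    if req.stream:
        # Assumed error shape: only the code and the hint come from the PR text.
        return 501, {
            "error": {
                "code": "streaming_not_supported",
                "message": "SSE streaming is not implemented yet.",
                "remediation": "Set stream=false and retry.",
            }
        }
    # Placeholder for the existing single-shot path.
    return 200, {"id": "example", "content": "hello"}
```

Because the rejection happens before any upstream call, a `stream=true` request in this sketch never reaches a billable code path.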

Per v6.2 Discipline #15 (surface normalization Ainfera-side, never customer-side) + When-stuck #21 (silent-failure billing path requires explicit handling).

Full SSE streaming = AIN-174 Phase B (multi-day, separate sprint). This PR is the stop-gap that closes the billing-burn vector without waiting for full SSE.

Test plan

  • ruff + ruff-format + mypy --strict + pytest (pre-commit): all green
  • CI integration test: test_stream_true_returns_501_not_silent_json verifies 501 + remediation hint
  • After deploy: curl -d '{"stream":true,...}' /v1/inference returns 501 with code=streaming_not_supported

🤖 Generated with Claude Code

Eliminates the silent-failure billing burn pattern from
INC-2026-05-18-004 Bug 5.

Stop-gap:
- Adds `stream: bool = False` to InferenceRequest (was silently dropped)
- Returns 501 with code=streaming_not_supported when stream=true
- Eliminates the silent retry pattern: clients fail fast

Per v6.2 Discipline #15 + When-stuck #21. Full SSE = AIN-174 Phase B.

Co-Authored-By: Claude <noreply@anthropic.com>
@linear-code

linear-code Bot commented May 18, 2026

AIN-174 🔴 BUG 5: /v1/inference no SSE streaming — silent retries burn $0.32+/day per agent (hermes/Claude SDK/LangGraph/Letta all default stream=true)

Severity: URGENT 🔴 (billing burn + silent failures)

Filed from Manwe (hermes-agent v0.14.0) production dogfood 2026-05-18. Consumer sees zero output despite paying for 3 silent retries.

Symptom

Hermes-agent (and most modern OpenAI-compat clients) send stream=true in the request body and parse SSE chunks (data: {...}\n\ndata: [DONE]\n\n). Ainfera /v1/inference returns single-shot JSON regardless. From hermes's perspective:

  • Zero delta.content chunks → assistant message is empty
  • Framework treats empty response as transient failure → 3× retry
  • Each retry billed normally
  • After 3 retries, framework aborts turn with (empty response)
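
To see why the client ends up with an empty message, here is a simplified SSE reader. It is a sketch of how an OpenAI-compatible client might collect `delta.content`, not hermes's actual parser; `collect_sse_content` is a hypothetical name.

```python
import json


def collect_sse_content(body: str) -> str:
    """Concatenate delta.content from an OpenAI-style SSE body."""
    parts = []
    for line in body.splitlines():
        if not line.startswith("data:"):
            continue  # a plain JSON body contains no "data:" lines at all
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


sse_body = 'data: {"choices":[{"delta":{"content":"hello"}}]}\n\ndata: [DONE]\n\n'
json_body = '{"id":"abc","content":"hello","usage":{}}'

print(repr(collect_sse_content(sse_body)))   # 'hello'
print(repr(collect_sse_content(json_body)))  # ''
```

The single-shot JSON body carries the answer, but a reader that only looks at `data:` lines never sees it, which is what the framework then treats as a transient failure.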

Cost impact during the dogfood loop today: ~$0.054 burned per user-message (3 × $0.018 silent retries) before the proxy added SSE wrapping.
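
The per-message figure can be checked with a few lines of arithmetic. The last line is only a rough sanity check: it assumes every silent retry costs the same $0.018 and that the whole reported $0.32 came from this pattern.

```python
from decimal import Decimal

cost_per_retry = Decimal("0.018")  # from the ticket
retries_per_message = 3            # from the ticket

burn_per_message = cost_per_retry * retries_per_message
print(burn_per_message)  # 0.054

reported_burn = Decimal("0.32")    # Manwe figure reported in the ticket
messages_needed = reported_burn / burn_per_message
print(round(messages_needed, 1))   # roughly six user messages
```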

Reproducer

curl -X POST https://api.ainfera.ai/v1/inference \
  -H "Authorization: Bearer $KEY" \
  -d '{
    "model": "claude-opus-4-7",
    "messages": [{"role": "user", "content": "say hello"}],
    "stream": true
  }'

# Current behavior:
# HTTP/2 200
# content-type: application/json
# 
# {"id":"...","content":"hello","usage":{...}}
#
# Expected (OpenAI-compat clients):
# HTTP/2 200
# content-type: text/event-stream
#
# data: {"id":"...","choices":[{"delta":{"content":"hello"}}]}
# 
# data: [DONE]

The stream=true request body parameter is currently silently ignored. The endpoint should either:

  • Honor it (return SSE)
  • OR explicitly reject with 400 streaming_not_supported so client doesn't silent-retry
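
With an explicit rejection, a client can downgrade once instead of retrying blindly. The sketch below assumes the error code arrives at `error.code` in the JSON body and accepts either 400 or 501 as the status; `infer_with_fallback` and `fake_send` are hypothetical names, and the transport is injected so the example runs without a network.

```python
from typing import Any, Callable

Response = tuple[int, dict[str, Any]]
Send = Callable[[dict[str, Any]], Response]


def infer_with_fallback(send: Send, payload: dict[str, Any]) -> Response:
    """Send once; on an explicit streaming rejection, resend with stream=false."""
    status, body = send(payload)
    code = body.get("error", {}).get("code")
    if status in (400, 501) and code == "streaming_not_supported":
        # Non-retryable as-is: change the request rather than repeat it.
        return send({**payload, "stream": False})
    return status, body


def fake_send(payload: dict[str, Any]) -> Response:
    """Stand-in for the HTTP call, mimicking the stop-gap server behavior."""
    if payload.get("stream"):
        return 501, {"error": {"code": "streaming_not_supported"}}
    return 200, {"content": "hello"}


print(infer_with_fallback(fake_send, {"model": "m", "stream": True}))
# (200, {'content': 'hello'})
```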

Cross-framework impact

| Agent | Hit probability | Reason |
| --- | --- | --- |
| Manwe (hermes) | 🔴 CRITICAL | Confirmed: $0.32+ burned today, framework defaults stream=true |
| Varda (Claude SDK + OpenClaw) | 🔴 HITS | Claude Agent SDK uses streaming by default |
| Aule (Claude SDK + Opus) | 🔴 HITS | Same: Claude Agent SDK streaming default |
| Yavanna (LangGraph) | 🔴 HITS | LangGraph nodes use streaming for token-level visibility |
| Namo (Letta + Gemini) | 🔴 HITS | Letta streams for memory consolidation feedback |
| Tulkas (Garak + Mistral) | 🟢 SAFE | Single-shot adversarial probes; doesn't stream |

Net: 5 of 6 fleet agents at risk. ALL of Aratar except Tulkas. Confirmed burn on Manwe; pending validation on others (Aule and Varda may have been hitting this all along but absorbing it as "framework noise").

Fix recommendation

Recommended: Native SSE on /v1/inference

  1. Read stream field from request body. Default false (backward compat).
  2. If stream=true:
    • Set Content-Type: text/event-stream
    • For upstream providers that support streaming (Anthropic, OpenAI, Gemini, xAI, Mistral): forward upstream SSE chunks, transforming to OpenAI-compat shape ({"choices":[{"delta":{"content":"..."}}]})
    • For upstream providers that DON'T stream: synthesize a single-chunk SSE with the complete response
    • Always end with data: [DONE]\n\n
    • Apply same audit chain semantics (event recorded once at stream completion, not per chunk)
    • Apply same spend cap semantics (reserve at stream start, debit actual at stream complete)
  3. If stream=false (or omitted): existing single-shot JSON behavior unchanged
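
For providers that do not stream, the single-chunk synthesis in step 2 might look like the sketch below. It assumes the single-shot response has the `id` and `content` keys shown in the reproducer; `synthesize_sse` is a hypothetical helper, and putting `finish_reason` on the same chunk as the content is a choice made for brevity.

```python
import json
from typing import Any, Iterator


def synthesize_sse(response: dict[str, Any]) -> Iterator[str]:
    """Wrap a complete single-shot response as a one-chunk SSE stream."""
    chunk = {
        "id": response["id"],
        "choices": [
            {
                "index": 0,
                "delta": {"content": response["content"]},
                "finish_reason": "stop",
            }
        ],
    }
    yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"  # terminator expected by OpenAI-compat clients


for frame in synthesize_sse({"id": "abc", "content": "hello"}):
    print(frame, end="")
```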

Implementation pattern

# api/routers/inference.py
import json

from fastapi.responses import StreamingResponse


@router.post("/v1/inference")
async def inference(request: InferenceRequest, ...):
    # Branch on the client's stream flag; the single-shot path is unchanged.
    if request.stream:
        return StreamingResponse(
            _stream_inference(request, ...),
            media_type="text/event-stream",
        )
    return await _single_shot_inference(request, ...)


async def _stream_inference(request, ...):
    """Yield SSE chunks. Audit + spend cap applied at completion."""
    chunks_collected = []
    async for chunk in upstream_provider.stream(request):
        # Translate each provider chunk to the OpenAI-compat delta shape.
        openai_chunk = _to_openai_chunk(chunk)
        chunks_collected.append(chunk)
        yield f"data: {json.dumps(openai_chunk)}\n\n"

    # Stream finished: record one audit event and debit the actual cost.
    full_response = _assemble_chunks(chunks_collected)
    await _record_audit_event(full_response, ...)
    await _debit_spend_cap(full_response.cost_usd, ...)

    yield "data: [DONE]\n\n"

Audit + caps semantics with streaming

  • Audit: Event recorded ONCE at stream completion (not per chunk). Single audit URL per inference, just like single-shot.
  • Spend cap reservation: Reserve at stream START using cost estimate. On completion, reconcile to actual cost.
  • Drain-proof guarantee: If cap exceeded mid-stream, send final chunk with finish_reason: "spend_cap_exceeded" instead of "stop". Don't abort silently mid-stream.
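
The three rules above can be exercised with a small self-contained simulation. Everything here is an assumption for illustration: `SpendCap` is an in-memory stand-in, the upstream yields `(text, cost)` pairs, and per-chunk cost accounting is a simplification of whatever the real billing does. The `finally` block is one way to guarantee a single audit record and a reconcile even if the consumer stops reading early.

```python
import asyncio
import json
from decimal import Decimal
from typing import AsyncIterator


class SpendCap:
    """Hypothetical in-memory cap: reserve an estimate, reconcile to actual."""

    def __init__(self, limit_usd: Decimal) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = Decimal("0")
        self.reserved_usd = Decimal("0")

    def reserve(self, estimate_usd: Decimal) -> None:
        self.reserved_usd += estimate_usd

    def reconcile(self, estimate_usd: Decimal, actual_usd: Decimal) -> None:
        # Release the estimate and debit what the stream actually cost.
        self.reserved_usd -= estimate_usd
        self.spent_usd += actual_usd


def _sse(payload: dict) -> str:
    return f"data: {json.dumps(payload)}\n\n"


async def stream_with_cap(upstream, cap, estimate_usd, audit_log) -> AsyncIterator[str]:
    cap.reserve(estimate_usd)  # reserve at stream START
    budget = cap.limit_usd - cap.spent_usd
    cost = Decimal("0")
    finish_reason = "stop"
    try:
        async for text, chunk_cost in upstream:
            if cost + chunk_cost > budget:
                finish_reason = "spend_cap_exceeded"  # signal it, don't go silent
                break
            cost += chunk_cost
            yield _sse({"choices": [{"index": 0, "delta": {"content": text}}]})
        yield _sse({"choices": [{"index": 0, "delta": {}, "finish_reason": finish_reason}]})
        yield "data: [DONE]\n\n"
    finally:
        # Runs once per stream: reconcile to actual cost and record ONE audit event.
        cap.reconcile(estimate_usd, cost)
        audit_log.append({"cost_usd": str(cost), "finish_reason": finish_reason})


async def _demo() -> None:
    async def fake_upstream():
        for text in ("he", "llo", "!"):
            yield text, Decimal("0.01")

    cap, audit_log = SpendCap(Decimal("0.02")), []
    async for frame in stream_with_cap(fake_upstream(), cap, Decimal("0.05"), audit_log):
        print(frame, end="")
    print(audit_log)


asyncio.run(_demo())
```

With a $0.02 cap and three $0.01 chunks, the demo emits two content chunks, then a final chunk carrying `spend_cap_exceeded`, then `[DONE]`, and the audit log holds a single entry.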

Acceptance gates

  • /v1/inference honors stream=true
  • Content-Type: text/event-stream set when streaming
  • Final chunk is data: [DONE]\n\n
  • OpenAI-compat chunk shape: {"id":..., "choices":[{"index":0, "delta":{"content":"..."}}]}
  • Audit event recorded exactly ONCE per stream (not per chunk)
  • Spend cap reservation + reconcile pattern works under streaming
  • Test: stream=true, all 5 AAMC voters, end-to-end SSE → reaches [DONE]
  • Test: spend cap mid-stream → finish_reason: "spend_cap_exceeded" final chunk
  • Tulkas probe (AIN-178): hermes-style stream=true requests post-fix → 0 silent failures
  • Per When-stuck #19: STAGING canary before prod
  • PR opens off feat/ain-174-native-sse-streaming branch (Aule author override)

Connection to existing tickets

  • AIN-154 (router hardening) — streaming is part of L2 routing surface; parents here
  • AIN-160 (fleet agent tools wiring) — Yavanna/Aule/Namo legs likely silently hitting this; revalidate after fix
  • AIN-171 (auto-route exception coverage) — verify Choice B exception path handles streaming errors

Workaround (Manwe proxy PR #26)

Manwe's proxy wraps Ainfera's single-shot JSON as a 2-chunk SSE stream when client sends stream=true. Works but every framework will reinvent this. Fix at Ainfera side eliminates the class of bugs.

Estimated burn (today only)

  • Manwe pre-fix dogfood: ~$0.32 silent retries
  • Varda/Aule/Yavanna/Namo: UNKNOWN — could be substantial historical burn, since they've been running pre-AIN-178 Tulkas probes. Need audit.

Aule must audit inferences table for tenant_id=Manwe/Varda/Aule/Yavanna/Namo where response.content was empty AND cost_usd > 0 in last 7 days. That's the burn estimate.
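
The audit described above could be expressed as a single aggregate query. The sketch uses SQLite purely so it runs standalone; the production database, the column names (`response_content`, `created_at`) and the use of agent names as `tenant_id` values are all assumptions, and the date arithmetic would need adapting for another SQL dialect.

```python
import sqlite3

# Hypothetical schema: empty (or NULL) content with a non-zero charge, last 7 days.
BURN_QUERY = """
SELECT tenant_id, COUNT(*) AS silent_calls, SUM(cost_usd) AS burned_usd
FROM inferences
WHERE tenant_id IN ('Manwe', 'Varda', 'Aule', 'Yavanna', 'Namo')
  AND COALESCE(response_content, '') = ''
  AND cost_usd > 0
  AND created_at >= datetime('now', '-7 days')
GROUP BY tenant_id
ORDER BY burned_usd DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inferences ("
    "tenant_id TEXT, response_content TEXT, cost_usd REAL, created_at TEXT)"
)
# Made-up rows for the demo only; they are not real billing data.
conn.executemany(
    "INSERT INTO inferences VALUES (?, ?, ?, datetime('now', ?))",
    [
        ("Manwe", "", 0.018, "-1 days"),
        ("Manwe", "", 0.018, "-2 days"),
        ("Manwe", "hello", 0.018, "-1 days"),  # non-empty: not counted
        ("Varda", None, 0.02, "-10 days"),     # outside the 7-day window
        ("Tulkas", "", 0.01, "-1 days"),       # not in the tenant list
    ],
)
rows = conn.execute(BURN_QUERY).fetchall()
for tenant_id, silent_calls, burned_usd in rows:
    print(tenant_id, silent_calls, round(burned_usd, 3))
```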

Founder authorization

Per "Fix this error for and check from all frameworks. Tulkas need to start working now." (2026-05-18 PM)



@hizrianraz hizrianraz merged commit f14b838 into main May 18, 2026
3 checks passed
@hizrianraz hizrianraz deleted the fix/ain-174-stream-explicit-501-not-silent-jsoon branch May 18, 2026 15:32