fix(api): AIN-174 stop-gap · stream=true returns 501 (not silent JSON billing burn) by hizrianraz · Pull Request #43 · ainfera-ai/api

hizrianraz · 2026-05-18T15:31:18Z

Summary

Eliminates the silent-failure billing burn pattern from INC-2026-05-18-004 Bug 5.

Pattern: SSE-expecting client sends stream=true → Ainfera silently dropped the field (Pydantic v2 default extra='ignore') → returned single-shot JSON → client treated JSON as empty → retried 3x → each retry billed → ~$0.32+/day silent burn per agent.

Stop-gap fix:

Adds stream: bool = False to InferenceRequest (was silently dropped before)
Returns 501 Not Implemented with code=streaming_not_supported when stream=true, including remediation hint to set stream=false
Clients fail fast and can switch behavior instead of silently retrying
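
A minimal sketch of the stop-gap behavior listed above, written against the standard library only so it runs without the service's dependencies. The real change lives in the Pydantic `InferenceRequest` model; `parse_request`, `handle_inference`, the `hint` key, and the hint wording are hypothetical names chosen for illustration, and only `code=streaming_not_supported` and the 501 status come from this PR.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class InferenceRequest:
    """Stand-in for the real Pydantic model (hypothetical field set)."""
    model: str
    messages: list
    stream: bool = False  # previously absent, so stream=true was dropped


def parse_request(body: dict[str, Any]) -> InferenceRequest:
    # Mimic extra='ignore': unknown keys are discarded without an error.
    known = {f.name for f in fields(InferenceRequest)}
    return InferenceRequest(**{k: v for k, v in body.items() if k in known})


def handle_inference(body: dict[str, Any]) -> tuple[int, dict[str, Any]]:
    """Return (status_code, json_body). The body layout is an assumption."""
    request = parse_request(body)
    if request.stream:
        # Fail fast instead of answering with single-shot JSON.
        return 501, {
            "code": "streaming_not_supported",
            "hint": "Streaming is not available yet; retry with stream=false.",
        }
    # Placeholder for the existing single-shot path.
    return 200, {"content": "hello"}


status, payload = handle_inference(
    {"model": "claude-opus-4-7", "messages": [], "stream": True}
)
print(status, payload["code"])
```

Declaring the field is what makes the check possible: with the field missing and extras ignored, the handler never sees that the client asked for a stream.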

Per v6.2 Discipline #15 (surface normalization Ainfera-side, never customer-side) + When-stuck #21 (silent-failure billing path requires explicit handling).

Full SSE streaming = AIN-174 Phase B (multi-day, separate sprint). This PR is the stop-gap that closes the billing-burn vector without waiting for full SSE.

Test plan

ruff + ruff-format + mypy --strict + pytest (pre-commit): all green
CI integration test: test_stream_true_returns_501_not_silent_json verifies 501 + remediation hint
After deploy: curl -d '{"stream":true,...}' /v1/inference returns 501 with code=streaming_not_supported

🤖 Generated with Claude Code

Eliminates the silent-failure billing burn pattern from INC-2026-05-18-004 Bug 5.

Stop-gap:
- Adds `stream: bool = False` to InferenceRequest (was silently dropped)
- Returns 501 with code=streaming_not_supported when stream=true
- Eliminates the silent retry pattern: clients fail fast

Per v6.2 Discipline #15 + When-stuck #21. Full SSE = AIN-174 Phase B.

Co-Authored-By: Claude <noreply@anthropic.com>

linear-code · 2026-05-18T15:31:22Z

AIN-174 🔴 BUG 5: /v1/inference no SSE streaming — silent retries burn $0.32+/day per agent (hermes/Claude SDK/LangGraph/Letta all default stream=true)

Severity: URGENT 🔴 (billing burn + silent failures)

Filed from Manwe (hermes-agent v0.14.0) production dogfood 2026-05-18. Consumer sees zero output despite paying for 3 silent retries.

Symptom

Hermes-agent (and most modern OpenAI-compat clients) send stream=true in the request body and parse SSE chunks (data: {...}\n\ndata: [DONE]\n\n). Ainfera /v1/inference returns single-shot JSON regardless. From hermes's perspective:

Zero delta.content chunks → assistant message is empty
Framework treats empty response as transient failure → 3× retry
Each retry billed normally
After 3 retries, framework aborts turn with (empty response)
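
To make the first step concrete, here is a simplified SSE reader of the kind a streaming client might use. It is a sketch, not hermes-agent's actual parser; `extract_deltas` is a hypothetical name. Fed the single-shot JSON body, it finds no `data:` lines and therefore no content.

```python
import json


def extract_deltas(body: str) -> list[str]:
    """Collect delta.content strings from an OpenAI-style SSE body."""
    deltas = []
    for line in body.splitlines():
        if not line.startswith("data:"):
            continue  # a plain JSON body has no data: lines at all
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                deltas.append(content)
    return deltas


sse_body = 'data: {"choices":[{"delta":{"content":"hello"}}]}\n\ndata: [DONE]\n\n'
json_body = '{"id":"abc","content":"hello","usage":{}}'

print(extract_deltas(sse_body))   # ['hello']
print(extract_deltas(json_body))  # []
```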

Cost impact during the dogfood loop today: ~$0.054 burned per user-message (3 × $0.018 silent retries) before the proxy added SSE wrapping.
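
The arithmetic can be reproduced with a small retry-loop simulation. This is a sketch under the assumption that each user message results in three billed calls, which is what the 3 × $0.018 figure implies; `run_turn` and `MAX_ATTEMPTS` are illustrative names, not part of any framework.

```python
from decimal import Decimal

COST_PER_CALL = Decimal("0.018")  # per-call cost quoted in the ticket
MAX_ATTEMPTS = 3                  # assumption: three billed calls per turn


def run_turn(call_api) -> tuple[str, Decimal]:
    """Retry while the response looks empty; return (text, total billed)."""
    billed = Decimal("0")
    for _ in range(MAX_ATTEMPTS):
        text = call_api()
        billed += COST_PER_CALL  # every attempt is billed, even if unusable
        if text:
            return text, billed
    return "(empty response)", billed


# A streaming client that finds no SSE chunks sees every reply as empty.
text, billed = run_turn(lambda: "")
print(text, billed)  # (empty response) 0.054
```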

Reproducer

curl -X POST https://api.ainfera.ai/v1/inference \
  -H "Authorization: Bearer $KEY" \
  -d '{
    "model": "claude-opus-4-7",
    "messages": [{"role": "user", "content": "say hello"}],
    "stream": true
  }'

# Current behavior:
# HTTP/2 200
# content-type: application/json
# 
# {"id":"...","content":"hello","usage":{...}}
#
# Expected (OpenAI-compat clients):
# HTTP/2 200
# content-type: text/event-stream
#
# data: {"id":"...","choices":[{"delta":{"content":"hello"}}]}
# 
# data: [DONE]

The stream=true request body parameter is currently silently ignored. Should either:

Honor it (return SSE)
OR explicitly reject with 400 streaming_not_supported so client doesn't silent-retry

Cross-framework impact

| Agent | Hit probability | Reason |
| --- | --- | --- |
| Manwe (hermes) | 🔴 CRITICAL | Confirmed: $0.32+ burned today, framework defaults `stream=true` |
| Varda (Claude SDK + OpenClaw) | 🔴 HITS | Claude Agent SDK uses streaming by default |
| Aule (Claude SDK + Opus) | 🔴 HITS | Same: Claude Agent SDK streaming default |
| Yavanna (LangGraph) | 🔴 HITS | LangGraph nodes use streaming for token-level visibility |
| Namo (Letta + Gemini) | 🔴 HITS | Letta streams for memory consolidation feedback |
| Tulkas (Garak + Mistral) | 🟢 SAFE | Single-shot adversarial probes; doesn't stream |

Net: 5 of 6 fleet agents at risk. ALL of Aratar except Tulkas. Confirmed burn on Manwe; pending validation on others (Aule and Varda may have been hitting this all along but absorbing it as "framework noise").

Fix recommendation

Recommended: Native SSE on `/v1/inference`

Read stream field from request body. Default false (backward compat).
If stream=true:
- Set Content-Type: text/event-stream
- For upstream providers that support streaming (Anthropic, OpenAI, Gemini, xAI, Mistral): forward upstream SSE chunks, transforming to OpenAI-compat shape ({"choices":[{"delta":{"content":"..."}}]})
- For upstream providers that DON'T stream: synthesize a single-chunk SSE with the complete response
- Always end with data: [DONE]\n\n
- Apply same audit chain semantics (event recorded once at stream completion, not per chunk)
- Apply same spend cap semantics (reserve at stream start, debit actual at stream complete)
If stream=false (or omitted): existing single-shot JSON behavior unchanged
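
For the non-streaming-provider case, a single-chunk synthesis could look like the sketch below. The delta shape and the `data: [DONE]` terminator come from the ticket; the `finish_reason` key on the chunk and the helper names `to_openai_chunk` and `synthesize_sse` are assumptions made for this example.

```python
import json
from typing import Iterator, Optional


def to_openai_chunk(content: str, finish_reason: Optional[str] = None) -> dict:
    """Wrap text in the OpenAI-compatible delta shape named in the ticket."""
    return {
        "choices": [
            {"delta": {"content": content}, "finish_reason": finish_reason}
        ]
    }


def synthesize_sse(full_content: str) -> Iterator[str]:
    """Emit a complete response as one SSE chunk followed by [DONE]."""
    yield f"data: {json.dumps(to_openai_chunk(full_content, 'stop'))}\n\n"
    yield "data: [DONE]\n\n"


for frame in synthesize_sse("hello"):
    print(repr(frame))
```

A streaming client then receives exactly one content delta and a clean terminator, so it no longer mistakes the reply for an empty response.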

Implementation pattern

# api/routers/inference.py
import json

from fastapi.responses import StreamingResponse  # assumes a FastAPI router

@router.post("/v1/inference")
async def inference(request: InferenceRequest, ...):
    if request.stream:
        # Hand the async generator to the framework; it writes each yielded frame.
        return StreamingResponse(
            _stream_inference(request, ...),
            media_type="text/event-stream",
        )
    return await _single_shot_inference(request, ...)

async def _stream_inference(request, ...):
    """Yield SSE chunks. Audit + spend cap applied at completion."""
    chunks_collected = []
    async for chunk in upstream_provider.stream(request):
        openai_chunk = _to_openai_chunk(chunk)
        chunks_collected.append(chunk)  # keep raw chunks for the final audit record
        yield f"data: {json.dumps(openai_chunk)}\n\n"

    # Runs only when the upstream stream finishes normally.
    full_response = _assemble_chunks(chunks_collected)
    await _record_audit_event(full_response, ...)
    await _debit_spend_cap(full_response.cost_usd, ...)

    yield "data: [DONE]\n\n"

Audit + caps semantics with streaming

Audit: Event recorded ONCE at stream completion (not per chunk). Single audit URL per inference, just like single-shot.
Spend cap reservation: Reserve at stream START using cost estimate. On completion, reconcile to actual cost.
Drain-proof guarantee: If cap exceeded mid-stream, send final chunk with finish_reason: "spend_cap_exceeded" instead of "stop". Don't abort silently mid-stream.
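
The three rules above can be sketched together as follows. This is a toy, synchronous, in-memory model, not the service's implementation: `SpendCap`, `stream_with_cap`, the per-chunk pricing, and the audit record fields are all hypothetical, and a real cap would need persistence and concurrency control.

```python
import json
from decimal import Decimal
from typing import Iterable, Iterator


class SpendCap:
    """Toy in-memory cap tracking spent and reserved amounts."""

    def __init__(self, limit: Decimal) -> None:
        self.limit = limit
        self.spent = Decimal("0")
        self.reserved = Decimal("0")

    def reserve(self, estimate: Decimal) -> None:
        self.reserved += estimate

    def reconcile(self, estimate: Decimal, actual: Decimal) -> None:
        # Release the reservation and record what the stream really cost.
        self.reserved -= estimate
        self.spent += actual


def sse(payload: dict) -> str:
    return f"data: {json.dumps(payload)}\n\n"


def stream_with_cap(chunks: Iterable[str], cap: SpendCap, estimate: Decimal,
                    cost_per_chunk: Decimal, audit_log: list) -> Iterator[str]:
    cap.reserve(estimate)  # reserve at stream start
    actual = Decimal("0")
    finish_reason = "stop"
    for text in chunks:
        if cap.spent + actual + cost_per_chunk > cap.limit:
            finish_reason = "spend_cap_exceeded"  # signal it, do not just hang up
            break
        actual += cost_per_chunk
        yield sse({"choices": [{"delta": {"content": text}}]})
    # Final chunk always carries an explicit finish_reason.
    yield sse({"choices": [{"delta": {}, "finish_reason": finish_reason}]})
    cap.reconcile(estimate, actual)  # swap the estimate for the actual cost
    # One audit event per stream, never one per chunk.
    audit_log.append({"cost_usd": str(actual), "finish_reason": finish_reason})
    yield "data: [DONE]\n\n"


cap = SpendCap(limit=Decimal("0.03"))
audit_log: list = []
frames = list(stream_with_cap(["a", "b", "c", "d"], cap,
                              Decimal("0.02"), Decimal("0.01"), audit_log))
print(len(frames), audit_log)
```

The sketch reconciles only when the generator is consumed to the end. Handling a client that disconnects mid-stream (so the reservation is still released and the audit event still written) is left out here and would need explicit cleanup.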

Acceptance gates

Connection to existing tickets

AIN-154 (router hardening) — streaming is part of L2 routing surface; parents here
AIN-160 (fleet agent tools wiring) — Yavanna/Aule/Namo legs likely silently hitting this; revalidate after fix
AIN-171 (auto-route exception coverage) — verify Choice B exception path handles streaming errors

Workaround (Manwe proxy PR #26)

Manwe's proxy wraps Ainfera's single-shot JSON as a 2-chunk SSE stream when client sends stream=true. Works but every framework will reinvent this. Fix at Ainfera side eliminates the class of bugs.

Estimated burn (today only)

Manwe pre-fix dogfood: ~$0.32 silent retries
Varda/Aule/Yavanna/Namo: UNKNOWN — could be substantial historical burn, since they've been running pre-AIN-178 Tulkas probes. Need audit.

Aule must audit inferences table for tenant_id=Manwe/Varda/Aule/Yavanna/Namo where response.content was empty AND cost_usd > 0 in last 7 days. That's the burn estimate.
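
One way to phrase that audit is shown below, run against an in-memory SQLite table so the query can be tried in isolation. The column names (`response_content`, `created_at`), the sample rows, and the use of SQLite are assumptions; the production `inferences` schema and database may differ, so treat the query as a starting shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE inferences (
           tenant_id TEXT, response_content TEXT,
           cost_usd REAL, created_at TEXT)"""
)
conn.executemany(
    "INSERT INTO inferences VALUES (?, ?, ?, datetime('now', ?))",
    [
        ("Manwe", "", 0.018, "-1 day"),       # billed, empty: counts as burn
        ("Manwe", "hello", 0.018, "-1 day"),  # billed, non-empty: fine
        ("Varda", None, 0.020, "-2 days"),    # billed, NULL content: burn
        ("Varda", "", 0.020, "-30 days"),     # outside the 7-day window
    ],
)

BURN_QUERY = """
SELECT tenant_id,
       COUNT(*) AS empty_billed_calls,
       ROUND(SUM(cost_usd), 6) AS burn_usd
FROM inferences
WHERE tenant_id IN ('Manwe', 'Varda', 'Aule', 'Yavanna', 'Namo')
  AND COALESCE(response_content, '') = ''
  AND cost_usd > 0
  AND created_at >= datetime('now', '-7 days')
GROUP BY tenant_id
ORDER BY tenant_id
"""

rows = conn.execute(BURN_QUERY).fetchall()
for row in rows:
    print(row)
```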

Founder authorization

Per "Fix this error for and check from all frameworks. Tulkas need to start working now." (2026-05-18 PM)



hizrianraz merged commit f14b838 into main May 18, 2026
3 checks passed

hizrianraz deleted the fix/ain-174-stream-explicit-501-not-silent-jsoon branch May 18, 2026 15:32
