fix(api): AIN-171 add 4 missing exception handlers to /v1/inference auto branch + JSON 500 middleware #25
…uto branch
The model=ainfera-auto path in routers/inference.py caught only ProviderError.
When dispatch_inference raised any of ModelUnavailableError, AgentNotActiveError,
CapViolationError, or InsufficientFundsError from the auto candidate loop, the
exception propagated uncaught → FastAPI default handler → plain-text 500
"Internal Server Error" body.
Symptom (INC-2026-05-18-003): deterministic 500 with plain-text body for any
tenant request with `model=ainfera-auto`. Triggered at ~19:10 UTC, after an api
service restart re-read the post-AIN-141-migration catalog state.
Fix:
- ModelUnavailableError → log + audit fallback + try next candidate (catalog
state can drift between auto_route selection and dispatch; one bad candidate
shouldn't blow the whole request)
- AgentNotActiveError → terminal HTTP 409 (state doesn't change between candidates)
- CapViolationError → terminal HTTP 402 with drain-proof refusal audit (matches
non-auto branch pattern)
- InsufficientFundsError → terminal HTTP 402 with drain-proof refusal audit
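The handler mapping above can be sketched roughly as follows. This is a minimal, framework-free illustration, not the code in routers/inference.py: the exception classes are local stand-ins for the real ones, `HTTPError` stands in for FastAPI's `HTTPException`, and `dispatch_auto`, the audit tuples, the error code strings, the fallback treatment of ProviderError, and the 503 returned when every candidate fails are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("inference.auto")


# Local stand-ins for the exception types named in this PR (hypothetical bodies).
class ProviderError(Exception): ...
class ModelUnavailableError(Exception): ...
class AgentNotActiveError(Exception): ...
class CapViolationError(Exception): ...
class InsufficientFundsError(Exception): ...


class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException so the sketch runs without FastAPI."""

    def __init__(self, status_code: int, code: str):
        super().__init__(code)
        self.status_code = status_code
        self.code = code


def dispatch_auto(candidates, dispatch, audit):
    """Try candidates in order; fall back only on per-candidate failures."""
    for model in candidates:
        try:
            return dispatch(model)
        except (ProviderError, ModelUnavailableError) as exc:
            # Per-candidate failure: record a fallback event, try the next one.
            # (Treating ProviderError as a fallback is an assumption here.)
            logger.warning("candidate %s failed: %r", model, exc)
            audit.append(("fallback", model, type(exc).__name__))
        except AgentNotActiveError as exc:
            # Agent state is the same for every candidate, so stop immediately.
            raise HTTPError(409, "agent_not_active") from exc
        except (CapViolationError, InsufficientFundsError) as exc:
            # Terminal refusal: audit it, then surface 402 to the client.
            audit.append(("refusal", model, type(exc).__name__))
            raise HTTPError(402, "payment_required") from exc
    # Every candidate failed; the status code used here is an assumption.
    raise HTTPError(503, "no_candidate_available")
```

Calling `dispatch_auto(["a", "b"], dispatch, audit)` with a `dispatch` that raises ModelUnavailableError for `"a"` returns the result for `"b"` and leaves one fallback entry in `audit`.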
Plus main.py: register Exception fallback handler returning structured JSON
500 envelope (no more plain-text leaks regardless of which exception escapes).
Stack trace logged server-side; client gets {"code": "internal_server_error", ...}.
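A minimal sketch of the fallback envelope, assuming a helper named `internal_error_envelope` (hypothetical, not the function in main.py). Only the `code` value comes from this commit message; the `message` field and logger name are assumptions. The FastAPI wiring is shown as a comment so the block runs without the framework installed.

```python
import json
import logging

logger = logging.getLogger("api.errors")


def internal_error_envelope(exc: BaseException):
    """Log the traceback server-side; return (status, body) with no exception detail."""
    # exc_info=exc makes the logging module attach the exception's traceback.
    logger.error("unhandled exception in request", exc_info=exc)
    body = {
        "code": "internal_server_error",
        "message": "Internal server error",  # assumed field, generic on purpose
    }
    return 500, body


# Possible wiring in a FastAPI app (not executed here):
#
#   @app.exception_handler(Exception)
#   async def unhandled(request, exc):
#       status, body = internal_error_envelope(exc)
#       return JSONResponse(status_code=status, content=body)

if __name__ == "__main__":
    try:
        raise ValueError("db password in message")
    except ValueError as err:
        status, body = internal_error_envelope(err)
    print(status, json.dumps(body))
```

The response body is built from constants only, so nothing from the exception message or traceback can reach the client.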
Test results: 387/387 unit tests pass; 471 tests collected.
Acceptance gates (AIN-171):
- [X] Root cause identified
- [X] Fix shipped
- [X] Structured JSON 500 path added
- [ ] Curl proof: post-deploy, founder side
- [ ] Manwe workaround removal: post-deploy, founder side
- [ ] Incident Log + v6.1 corollary: founder side / next session
Co-Authored-By: Claude <noreply@anthropic.com>
AIN-171 🔴 P1 INCIDENT — api.ainfera.ai /v1/inference returning deterministic 500 since ~19:10 UTC (2026-05-18)
INC-2026-05-18-003 (third incident this session, post INC-001 transcript leak + INC-002 re-rotation)
Symptom (Manwe-side observation)
Manwe proxy →
Verified working baseline (19:09–19:10 UTC)
Verified failing state (19:20 UTC onward)
Hypothesis (high confidence)
Something changed in the auto-router path (
Aule investigation protocol (P1, execute now)
Per Discipline #14: reference logs by handler path + time window + metadata, never echo auth tokens or full payload hashes into transcript.
# 0. Verify role + tenant scope
pwd
git status # confirm on appropriate branch / repo
# 1. Pull /v1/inference handler logs for the failing window
# Railway / wherever api.ainfera.ai logs land
# Window: 19:10–19:25 UTC 2026-05-18
# Filter: status=500 OR exception present
# REDACT: full payload hashes, auth bearer prefixes beyond first 8 chars
# 2. Check auto-router service health
curl -sS -o /dev/null -w "%{http_code}\n" https://api.ainfera.ai/v1/router/health
# If endpoint doesn't exist yet (AIN-154 in flight), check service process directly
# 3. Verify agent_id auto-inference for Manwe key
# Query Supabase agents table:
# WHERE tenant_id = '6220449f...' (Manwe tenant per memory)
# AND created_at < '2026-05-18 19:00'
# Confirm agent record + capabilities intact
# 4. Look for stack traces in last 30 min
# Search for: TypeError, AttributeError, KeyError, asyncio.TimeoutError
# Around payload_hash starting with "e41656eb"
# 5. Check recent deploys / config pushes
git log --since="2026-05-18 18:30 UTC" --oneline main # api repo + ainfera-os
# Doppler audit log: any prd config changes since 18:00 UTC?
Manwe-side workaround (APPLIED 2026-05-18 ~19:35 UTC)
Workaround verifies Manwe continues operating while Ainfera-OS team investigates root cause.
Acceptance gates for closure
v6.1 corollary candidate (founder review)
Manwe + all 5 fleet agents dogfood Ainfera. When Ainfera breaks, EVERY agent breaks. This is both feature (dogfood catches bugs) and risk (single point of failure for entire Aratar fleet operations). Mitigations to consider for v6.1 patch:
Cross-references
Closes root cause of AIN-171 (INC-2026-05-18-003). Auto-route branch was catching only ProviderError; non-auto branch catches 5 exception types. ModelUnavailableError leaking uncaught from auto loop → FastAPI default plain-text 500. Fix adds the 4 missing handlers (B-mode: ModelUnavailableError tries next candidate, others terminal HTTPException) plus a residual Exception handler in main.py for structured JSON 500 envelope. Defensive change — only adds handlers, no happy-path changes. 387/387 unit tests pass. See commit message for full ticket acceptance status. Founder authorization: 'auth fix B' 2026-05-18 PM per v6.1 Discipline #6 corollary (AIN-169).
Note
Medium Risk
Changes error-handling behavior on the core `/v1/inference` path and adds a global exception handler, which can affect client-visible responses and retry/fallback behavior, but is limited to failure paths.
Overview
Fixes AIN-171 by closing gaps in failure-path handling for `POST /v1/inference` when `model=ainfera-auto`.
The auto-routing dispatch loop now catches `ModelUnavailableError` (recording an `inference_routed` fallback event and trying the next candidate) and treats `AgentNotActiveError`, `CapViolationError`, and `InsufficientFundsError` as terminal with the same HTTP semantics/auditing used in the non-auto path.
Adds a global `Exception` handler in `main.py` to ensure any uncaught errors return a structured JSON `500` envelope and logs the full exception server-side.
Reviewed by Cursor Bugbot for commit 0f32b7c. Bugbot is set up for automated code reviews on this repo.