fix(api): AIN-171 add 4 missing exception handlers to /v1/inference auto branch + JSON 500 middleware by hizrianraz · Pull Request #25 · ainfera-ai/api

hizrianraz · 2026-05-18T12:33:06Z

Closes root cause of AIN-171 (INC-2026-05-18-003). Auto-route branch was catching only ProviderError; non-auto branch catches 5 exception types. ModelUnavailableError leaking uncaught from auto loop → FastAPI default plain-text 500. Fix adds the 4 missing handlers (B-mode: ModelUnavailableError tries next candidate, others terminal HTTPException) plus a residual Exception handler in main.py for structured JSON 500 envelope. Defensive change — only adds handlers, no happy-path changes. 387/387 unit tests pass. See commit message for full ticket acceptance status. Founder authorization: 'auth fix B' 2026-05-18 PM per v6.1 Discipline #6 corollary (AIN-169).

Note

Medium Risk
Changes error-handling behavior on the core /v1/inference path and adds a global exception handler, which can affect client-visible responses and retry/fallback behavior, but is limited to failure paths.

Overview
Fixes AIN-171 by closing gaps in failure-path handling for POST /v1/inference when model=ainfera-auto.

The auto-routing dispatch loop now catches ModelUnavailableError (recording an inference_routed fallback event and trying the next candidate) and treats AgentNotActiveError, CapViolationError, and InsufficientFundsError as terminal with the same HTTP semantics/auditing used in the non-auto path.

Adds a global Exception handler in main.py to ensure any uncaught errors return a structured JSON 500 envelope and logs the full exception server-side.
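The catch-all behavior described above can be sketched as follows. This is not the code from main.py: it is a pure-ASGI middleware that uses only the standard library, and the class name `JSONInternalErrorMiddleware`, the logger name, and the `message` field are assumptions made for illustration. Only the `"code": "internal_server_error"` value comes from the commit message.

```python
import asyncio
import json
import logging

logger = logging.getLogger("api.errors")  # hypothetical logger name


class JSONInternalErrorMiddleware:
    """Hypothetical ASGI middleware: uncaught exceptions become a JSON 500."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Only HTTP requests get the JSON envelope treatment.
            await self.app(scope, receive, send)
            return

        response_started = False

        async def tracking_send(message):
            # Remember whether the inner app already began its response.
            nonlocal response_started
            if message["type"] == "http.response.start":
                response_started = True
            await send(message)

        try:
            await self.app(scope, receive, tracking_send)
        except Exception:
            # Full stack trace stays server-side.
            logger.exception("unhandled exception on %s", scope.get("path"))
            if response_started:
                # Headers are already on the wire; a new status cannot be sent.
                raise
            body = json.dumps(
                {
                    "code": "internal_server_error",
                    # The "message" field is an assumed part of the envelope.
                    "message": "Internal server error",
                }
            ).encode("utf-8")
            await send(
                {
                    "type": "http.response.start",
                    "status": 500,
                    "headers": [
                        (b"content-type", b"application/json"),
                        (b"content-length", str(len(body)).encode("ascii")),
                    ],
                }
            )
            await send({"type": "http.response.body", "body": body})
```

The client only ever sees the generic envelope, while the exception details go to the server log.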

Reviewed by Cursor Bugbot for commit 0f32b7c. Bugbot is set up for automated code reviews on this repo.

…uto branch

The model=ainfera-auto path in routers/inference.py caught only ProviderError. When dispatch_inference raised any of ModelUnavailableError, AgentNotActiveError, CapViolationError, or InsufficientFundsError from the auto candidate loop, the exception propagated uncaught → FastAPI default handler → plain-text 500 "Internal Server Error" body.

Symptom (INC-2026-05-18-003): deterministic 500 with plain-text body for any tenant request shape `model=ainfera-auto`. Triggered ~19:10 UTC after api service restart re-read post-AIN-141-migration catalog state.

Fix:
- ModelUnavailableError → log + audit fallback + try next candidate (catalog state can drift between auto_route selection and dispatch; one bad candidate shouldn't blow the whole request)
- AgentNotActiveError → terminal HTTP 409 (state doesn't change between candidates)
- CapViolationError → terminal HTTP 402 with drain-proof refusal audit (matches non-auto branch pattern)
- InsufficientFundsError → terminal HTTP 402 with drain-proof refusal audit

Plus main.py: register Exception fallback handler returning structured JSON 500 envelope (no more plain-text leaks regardless of which exception escapes). Stack trace logged server-side; client gets {"code": "internal_server_error", ...}.

Test results: 387/387 unit tests pass; 471 tests collected.

Acceptance gates (AIN-171):
- [X] Root cause identified
- [X] Fix shipped
- [X] Structured JSON 500 path added
- [ ] Curl proof: post-deploy, founder side
- [ ] Manwe workaround removal: post-deploy, founder side
- [ ] Incident Log + v6.1 corollary: founder side / next session

Co-Authored-By: Claude <noreply@anthropic.com>
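The handler mapping in the commit message can be sketched roughly as below. This is an illustration under stated assumptions, not the code in routers/inference.py: the exception classes are local stand-ins, `HTTPError` stands in for FastAPI's HTTPException, and `dispatch_auto`, the `audit` callback, the `inference_refused` event name, and the 503 returned when every candidate is unavailable are all hypothetical. ProviderError handling is left out because the source does not describe it.

```python
# Local stand-ins for the exception types named in the commit message.
class ModelUnavailableError(Exception):
    pass


class AgentNotActiveError(Exception):
    pass


class CapViolationError(Exception):
    pass


class InsufficientFundsError(Exception):
    pass


class HTTPError(Exception):
    """Hypothetical stand-in for FastAPI's HTTPException."""

    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def dispatch_auto(candidates, dispatch, audit):
    """Try each candidate model in order; return the first successful result.

    `dispatch(model)` and `audit(event, data)` are hypothetical callables.
    """
    last_error = None
    for model in candidates:
        try:
            return dispatch(model)
        except ModelUnavailableError as exc:
            # Recoverable: record the fallback and move on to the next candidate.
            audit("inference_routed", {"fallback_from": model, "reason": str(exc)})
            last_error = exc
        except AgentNotActiveError as exc:
            # Terminal: agent state does not change between candidates.
            raise HTTPError(409, "agent not active") from exc
        except (CapViolationError, InsufficientFundsError) as exc:
            # Terminal: audit the refusal, then stop.
            # The event name "inference_refused" is an assumption.
            audit("inference_refused", {"model": model, "reason": str(exc)})
            raise HTTPError(402, "payment required") from exc
    # Assumption: the source does not say what happens when no candidate is left.
    raise HTTPError(503, "no candidate model available") from last_error
```

The design point is that only ModelUnavailableError is worth retrying, since the other three conditions hold for the whole request rather than for one candidate.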

linear-code · 2026-05-18T12:33:10Z

AIN-171 🔴 P1 INCIDENT — api.ainfera.ai /v1/inference returning deterministic 500 since ~19:10 UTC (2026-05-18)

INC-2026-05-18-003 (third incident this session, post INC-001 transcript leak + INC-002 re-rotation)

Symptom (Manwe-side observation)

Manwe proxy → api.ainfera.ai/v1/inference returning HTTP 500 + plain-text body "Internal Server Error". Deterministic: body_hash identical across multiple calls.

Verified working baseline (19:09–19:10 UTC)

Same request shape: model_in=t1 → model_out=ainfera-auto, message_count=2
Returned 200 OK with real claude-opus-4-7 content
Auth path: AINFERA_MANWE_KEY (Manwe tenant)

Verified failing state (19:20 UTC onward)

Identical request shape
HTTP 500, plain-text "Internal Server Error" (note: NOT JSON — handler is returning before serializer)
Auth passing (response hash bucket differs from known invalid-key responses)
payload_hash=e41656eb... deterministic across multiple retries
Window between last 200 and first 500: ~10 minutes (19:10–19:20)
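The determinism check above (identical hash across retries) could be reproduced along these lines. The source does not say how body_hash or payload_hash is computed, so SHA-256 is an assumption, and the function names are hypothetical. The 8-character prefix follows the redaction guidance in the investigation protocol.

```python
import hashlib


def body_hash(body: bytes) -> str:
    """SHA-256 hex digest of a response body (hash choice is an assumption)."""
    return hashlib.sha256(body).hexdigest()


def short_hash(body: bytes) -> str:
    """First 8 hex characters only, so full hashes never reach a transcript."""
    return body_hash(body)[:8]


def is_deterministic(bodies) -> bool:
    """True when there is at least one body and all bodies hash identically."""
    return len({body_hash(b) for b in bodies}) == 1
```

For example, `is_deterministic([b"Internal Server Error"] * 3)` is true, while two differing bodies make it false.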

Hypothesis (high confidence)

Something changed in the auto-router path (model_out=ainfera-auto) between 19:10 and 19:20 that breaks 500-error-handling and emits plain-text instead of structured error. Either:

Auto-router service crashed (unhealthy state)
agent_id auto-inference broke for Manwe tenant key
Provider downstream raised an exception that error middleware doesn't catch → naked 500
Recent deploy / config push that didn't roll back

Aule investigation protocol (P1, execute now)

Per Discipline #14 — reference logs by handler path + time window + metadata, never echo auth tokens or full payload hashes into transcript.

# 0. Verify role + tenant scope
pwd
git status   # confirm on appropriate branch / repo

# 1. Pull /v1/inference handler logs for the failing window
#    Railway / wherever api.ainfera.ai logs land
#    Window: 19:10–19:25 UTC 2026-05-18
#    Filter: status=500 OR exception present
#    REDACT: full payload hashes, auth bearer prefixes beyond first 8 chars

# 2. Check auto-router service health
curl -sS -o /dev/null -w "%{http_code}\n" https://api.ainfera.ai/v1/router/health
# If endpoint doesn't exist yet (AIN-154 in flight), check service process directly

# 3. Verify agent_id auto-inference for Manwe key
#    Query Supabase agents table:
#      WHERE tenant_id = '6220449f...' (Manwe tenant per memory)
#      AND created_at < '2026-05-18 19:00'
#    Confirm agent record + capabilities intact

# 4. Look for stack traces in last 30 min
#    Search for: TypeError, AttributeError, KeyError, asyncio.TimeoutError
#    Around payload_hash starting with "e41656eb"

# 5. Check recent deploys / config pushes
git log --since="2026-05-18 18:30 UTC" --oneline main   # api repo + ainfera-os
#    Doppler audit log: any prd config changes since 18:00 UTC?

Manwe-side workaround (APPLIED 2026-05-18 ~19:35 UTC)

MANWE_AINFERA_DEFAULT_MODEL=gpt-5.5-pro set in backend/.env. Bypasses the auto-router path (ainfera-auto) entirely. Uses model verified working in prior session. No code change — proxy already supports this env var.

The workaround keeps Manwe operating while the Ainfera-OS team investigates the root cause.

Acceptance gates for closure

Root cause identified (which of 1/2/3/4 above, or new hypothesis)
Fix shipped with curl proof: identical request shape returns 200 OK from model_out=ainfera-auto
Structured JSON error response added to 500 path (no more plain-text leaks — observability hygiene)
Manwe-side workaround removed: MANWE_AINFERA_DEFAULT_MODEL unset, restart, retest with ainfera-auto
Incident Log entry filed in Notion (INC-2026-05-18-003)
v6.1 corollary surfaced: single-point-of-failure observation (see below)

v6.1 corollary candidate (founder review)

Manwe + all 5 fleet agents dogfood Ainfera. When Ainfera breaks, EVERY agent breaks. This is both feature (dogfood catches bugs) and risk (single point of failure for entire Aratar fleet operations).

Mitigations to consider for v6.1 patch:

Per-agent fallback model config — already supported via MANWE_AINFERA_DEFAULT_MODEL env pattern. Generalize to all fleet agents.
Auto-router circuit breaker — related to AIN-154 router hardening. Add fallback chain so ainfera-auto failure doesn't 500, falls through to first-registered model with audit log of degradation.
Structured 500 responses — every API path returns JSON error envelope, never plain-text. Observability hygiene.
Synthetic monitoring on /v1/inference — production probe every 60s, alert on 500.
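The last two mitigations suggest what a synthetic probe might check on each response. A minimal sketch, assuming the probe has already made the request and only needs to classify the result; `probe_alerts` and its alert strings are hypothetical, and the HTTP call and 60-second scheduling are left out.

```python
import json
from typing import List


def probe_alerts(status: int, content_type: str, body: bytes) -> List[str]:
    """Return alert reasons for one probe response; empty list means healthy."""
    alerts = []
    if status >= 500:
        alerts.append(f"server error: HTTP {status}")
    if status >= 400:
        # Error responses should carry a JSON envelope, never plain text.
        media_type = content_type.split(";")[0].strip().lower()
        is_json = media_type == "application/json"
        if is_json:
            try:
                json.loads(body)
            except ValueError:
                # Covers both invalid JSON and undecodable bytes.
                is_json = False
        if not is_json:
            alerts.append("error body is not a JSON envelope")
    return alerts
```

With this check, the incident's plain-text 500 would raise two alerts, while a structured JSON 500 would raise only the server-error alert.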

Cross-references

AIN-154 router hardening (circuit breakers + fallback chains directly address this)
AIN-128 USDC settlement (also depends on /v1/inference reliability)
INC-2026-05-18-001 transcript leak (this is third session incident, pattern emerging)
INC-2026-05-18-002 re-rotation (pending v6.1 docs)
Discipline #14 transcript secret hygiene (applies to log inspection in this investigation)
Memory: AIN-164 SHA 7f924fea clean per last session bracket (rules out token-rotation cascade)


hizrianraz merged commit daa7b32 into main May 18, 2026
4 checks passed

hizrianraz deleted the fix/ain-171-auto-route-exception-coverage branch May 18, 2026 12:43

hizrianraz mentioned this pull request May 18, 2026

feat(aule): AIN-160 (Aule leg) · DM "review PR <ref>" + SDK code-review ainfera-ai/valinor#41

Merged

4 tasks

hizrianraz merged 1 commit into main from fix/ain-171-auto-route-exception-coverage
