fix(api): AIN-171 add 4 missing exception handlers to /v1/inference auto branch + JSON 500 middleware #25

Merged
hizrianraz merged 1 commit into main from fix/ain-171-auto-route-exception-coverage
May 18, 2026
@hizrianraz hizrianraz commented May 18, 2026

Closes the root cause of AIN-171 (INC-2026-05-18-003). The auto-route branch caught only ProviderError, while the non-auto branch catches 5 exception types. ModelUnavailableError leaked uncaught from the auto loop, so FastAPI's default handler returned a plain-text 500. The fix adds the 4 missing handlers (B-mode: ModelUnavailableError tries the next candidate; the others raise a terminal HTTPException) plus a residual Exception handler in main.py that returns a structured JSON 500 envelope. Defensive change: it only adds handlers, with no happy-path changes. 387/387 unit tests pass. See the commit message for full ticket acceptance status. Founder authorization: 'auth fix B' 2026-05-18 PM per v6.1 Discipline #6 corollary (AIN-169).


Note

Medium Risk
Changes error-handling behavior on the core /v1/inference path and adds a global exception handler, which can affect client-visible responses and retry/fallback behavior, but is limited to failure paths.

Overview
Fixes AIN-171 by closing gaps in failure-path handling for POST /v1/inference when model=ainfera-auto.

The auto-routing dispatch loop now catches ModelUnavailableError (recording an inference_routed fallback event and trying the next candidate) and treats AgentNotActiveError, CapViolationError, and InsufficientFundsError as terminal with the same HTTP semantics/auditing used in the non-auto path.

Adds a global Exception handler in main.py to ensure any uncaught errors return a structured JSON 500 envelope and logs the full exception server-side.

Reviewed by Cursor Bugbot for commit 0f32b7c.

…uto branch

The model=ainfera-auto path in routers/inference.py caught only ProviderError.
When dispatch_inference raised any of ModelUnavailableError, AgentNotActiveError,
CapViolationError, or InsufficientFundsError from the auto candidate loop, the
exception propagated uncaught → FastAPI default handler → plain-text 500
"Internal Server Error" body.

Symptom (INC-2026-05-18-003): deterministic 500 with plain-text body for any
tenant request shape `model=ainfera-auto`. Triggered ~19:10 UTC after api
service restart re-read post-AIN-141-migration catalog state.

Fix:
- ModelUnavailableError → log + audit fallback + try next candidate (catalog
  state can drift between auto_route selection and dispatch; one bad candidate
  shouldn't blow the whole request)
- AgentNotActiveError → terminal HTTP 409 (state doesn't change between candidates)
- CapViolationError → terminal HTTP 402 with drain-proof refusal audit (matches
  non-auto branch pattern)
- InsufficientFundsError → terminal HTTP 402 with drain-proof refusal audit
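
The handling rules listed above can be sketched as a candidate loop. This is a minimal, framework-free illustration, not the code in routers/inference.py: the exception classes are stand-ins for the ones named in this commit, `TerminalHTTPError` stands in for FastAPI's HTTPException, and the `dispatch`/`audit` callables, the audit field names, and the 503 returned when every candidate is unavailable are all assumptions.

```python
# Hypothetical stand-ins for the exception types named in the commit message.
class ModelUnavailableError(Exception): ...
class AgentNotActiveError(Exception): ...
class CapViolationError(Exception): ...
class InsufficientFundsError(Exception): ...


class TerminalHTTPError(Exception):
    """Stand-in for an HTTPException: carries a status code and a detail string."""

    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def dispatch_auto(candidates, dispatch, audit):
    """Try candidates in order; skip unavailable models, stop on terminal errors."""
    for model in candidates:
        try:
            return dispatch(model)
        except ModelUnavailableError as exc:
            # Catalog state may have drifted since selection: record and move on.
            audit({"event": "inference_routed", "fallback_from": model, "reason": str(exc)})
            continue
        except AgentNotActiveError as exc:
            # Agent state does not change between candidates, so retrying is pointless.
            raise TerminalHTTPError(409, str(exc)) from exc
        except (CapViolationError, InsufficientFundsError) as exc:
            # Audit the refusal before surfacing the payment-related error.
            audit({"event": "refusal", "model": model, "reason": str(exc)})
            raise TerminalHTTPError(402, str(exc)) from exc
    # Assumption: the commit does not say what happens when every candidate fails.
    raise TerminalHTTPError(503, "no available candidate")
```

Only ModelUnavailableError continues the loop; every other listed exception ends the request immediately, which mirrors the "terminal" wording in the list.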

Plus main.py: register Exception fallback handler returning structured JSON
500 envelope (no more plain-text leaks regardless of which exception escapes).
Stack trace logged server-side; client gets {"code": "internal_server_error", ...}.
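
A rough sketch of what such a fallback might compute, kept independent of FastAPI so it runs on its own. The function name and logger name are hypothetical, and only the `code` field is taken from this commit message; the remaining envelope fields are elided in the source and are not reproduced here.

```python
import json
import logging

logger = logging.getLogger("api.errors")  # hypothetical logger name


def internal_error_response(exc: Exception) -> tuple[int, dict, str]:
    """Log the full exception server-side and build a generic JSON 500 response."""
    # exc_info accepts an exception instance, so the traceback lands in the log.
    logger.error("unhandled exception", exc_info=exc)
    # Only "code" is confirmed by the source; exception details are never echoed.
    body = {"code": "internal_server_error"}
    return 500, {"Content-Type": "application/json"}, json.dumps(body)
```

In a FastAPI app, a function like this would typically sit inside a handler registered for `Exception` (for example via `app.add_exception_handler`), so that any exception not handled elsewhere still produces JSON rather than the default plain-text body.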

Test results: 387/387 unit tests pass; 471 tests collected.

Acceptance gates (AIN-171):
- [X] Root cause identified
- [X] Fix shipped
- [X] Structured JSON 500 path added
- [ ] Curl proof: post-deploy, founder side
- [ ] Manwe workaround removal: post-deploy, founder side
- [ ] Incident Log + v6.1 corollary: founder side / next session

Co-Authored-By: Claude <noreply@anthropic.com>

linear-code Bot commented May 18, 2026

AIN-171 🔴 P1 INCIDENT — api.ainfera.ai /v1/inference returning deterministic 500 since ~19:10 UTC (2026-05-18)

INC-2026-05-18-003 (third incident this session, post INC-001 transcript leak + INC-002 re-rotation)

Symptom (Manwe-side observation)

Manwe proxy → api.ainfera.ai/v1/inference returning HTTP 500 + plain-text body "Internal Server Error". Deterministic: body_hash identical across multiple calls.

Verified working baseline (19:09–19:10 UTC)

  • Same request shape: model_in=t1 → model_out=ainfera-auto, message_count=2
  • Returned 200 OK with real claude-opus-4-7 content
  • Auth path: AINFERA_MANWE_KEY (Manwe tenant)

Verified failing state (19:20 UTC onward)

  • Identical request shape
  • HTTP 500, plain-text "Internal Server Error" (note: NOT JSON — handler is returning before serializer)
  • Auth passing (response hash bucket differs from known invalid-key responses)
  • payload_hash=e41656eb... deterministic across multiple retries
  • Window between last 200 and first 500: ~10 minutes (19:10–19:20)
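
The two observations above (identical hash across retries, body not JSON) can be checked mechanically. The sketch below is illustrative only: the ticket does not say which hash function produced body_hash or payload_hash, so SHA-256 is an assumption, and both helper names are hypothetical.

```python
import hashlib
import json


def body_hash(body: bytes) -> str:
    """Hash a response body so repeated calls can be compared without storing it."""
    return hashlib.sha256(body).hexdigest()


def is_json_error(body: bytes) -> bool:
    """Return True if the body parses as a JSON object, False for plain text."""
    try:
        return isinstance(json.loads(body), dict)
    except (ValueError, UnicodeDecodeError):
        return False
```

Equal hashes across retries point to a deterministic failure rather than a flaky one, and `is_json_error` returning False matches the "NOT JSON" note in the list.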

Hypothesis (high confidence)

Something changed in the auto-router path (model_out=ainfera-auto) between 19:10 and 19:20 that breaks error handling on the 500 path and emits plain text instead of a structured error. Either:

  1. Auto-router service crashed (unhealthy state)
  2. agent_id auto-inference broke for Manwe tenant key
  3. Provider downstream raised an exception that error middleware doesn't catch → naked 500
  4. Recent deploy / config push that didn't roll back

Aule investigation protocol (P1, execute now)

Per Discipline #14 — reference logs by handler path + time window + metadata, never echo auth tokens or full payload hashes into transcript.

# 0. Verify role + tenant scope
pwd
git status   # confirm on appropriate branch / repo

# 1. Pull /v1/inference handler logs for the failing window
#    Railway / wherever api.ainfera.ai logs land
#    Window: 19:10–19:25 UTC 2026-05-18
#    Filter: status=500 OR exception present
#    REDACT: full payload hashes, auth bearer prefixes beyond first 8 chars

# 2. Check auto-router service health
curl -sS -o /dev/null -w "%{http_code}\n" https://api.ainfera.ai/v1/router/health
# If endpoint doesn't exist yet (AIN-154 in flight), check service process directly

# 3. Verify agent_id auto-inference for Manwe key
#    Query Supabase agents table:
#      WHERE tenant_id = '6220449f...' (Manwe tenant per memory)
#      AND created_at < '2026-05-18 19:00'
#    Confirm agent record + capabilities intact

# 4. Look for stack traces in last 30 min
#    Search for: TypeError, AttributeError, KeyError, asyncio.TimeoutError
#    Around payload_hash starting with "e41656eb"

# 5. Check recent deploys / config pushes
git log --since="2026-05-18 18:30 UTC" --oneline main   # api repo + ainfera-os
#    Doppler audit log: any prd config changes since 18:00 UTC?
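
The redaction rule in step 1 could be applied to log lines before they are pasted anywhere. This is a sketch under assumptions: it presumes tokens appear as `Bearer <token>` and hashes as `payload_hash=<hex>` or `body_hash=<hex>`, which matches the wording in this ticket but may not match the real log format, and the `redact` helper is hypothetical.

```python
import re

# Assumed log shapes: "Bearer <token>" and "<name>_hash=<hex digits>".
_BEARER = re.compile(r"(Bearer\s+)(\S+)")
_HASH = re.compile(r"\b(payload_hash|body_hash)=([0-9a-fA-F]{8})[0-9a-fA-F]+")


def redact(line: str) -> str:
    """Keep at most the first 8 characters of bearer tokens and hash values."""
    line = _BEARER.sub(lambda m: m.group(1) + m.group(2)[:8] + "...", line)
    return _HASH.sub(lambda m: f"{m.group(1)}={m.group(2)}...", line)
```

For example, `redact("Authorization: Bearer abcdefghijklmnop")` yields `Authorization: Bearer abcdefgh...`.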

Manwe-side workaround (APPLIED 2026-05-18 ~19:35 UTC)

MANWE_AINFERA_DEFAULT_MODEL=gpt-5.5-pro set in backend/.env. Bypasses the auto-router path (ainfera-auto) entirely. Uses model verified working in prior session. No code change — proxy already supports this env var.

The workaround keeps Manwe operating while the Ainfera-OS team investigates the root cause.
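
As a rough picture of how an env-var override like this can work, the sketch below resolves the outbound model from the environment. The proxy's actual code is not shown in this ticket, so the function name and the fallback to ainfera-auto when the variable is unset or empty are assumptions.

```python
import os


def outbound_model(env=None) -> str:
    """Pick the outbound model: env override if set, otherwise the auto-router."""
    env = os.environ if env is None else env
    # Assumption: an unset or empty variable means "use ainfera-auto".
    return env.get("MANWE_AINFERA_DEFAULT_MODEL") or "ainfera-auto"
```

Under this reading, removing the workaround amounts to unsetting the variable and restarting, after which requests go back through ainfera-auto.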

Acceptance gates for closure

  • Root cause identified (which of 1/2/3/4 above, or new hypothesis)
  • Fix shipped with curl proof: identical request shape returns 200 OK from model_out=ainfera-auto
  • Structured JSON error response added to 500 path (no more plain-text leaks — observability hygiene)
  • Manwe-side workaround removed: MANWE_AINFERA_DEFAULT_MODEL unset, restart, retest with ainfera-auto
  • Incident Log entry filed in Notion (INC-2026-05-18-003)
  • v6.1 corollary surfaced: single-point-of-failure observation (see below)

v6.1 corollary candidate (founder review)

Manwe + all 5 fleet agents dogfood Ainfera. When Ainfera breaks, EVERY agent breaks. This is both feature (dogfood catches bugs) and risk (single point of failure for entire Aratar fleet operations).

Mitigations to consider for v6.1 patch:

  • Per-agent fallback model config — already supported via MANWE_AINFERA_DEFAULT_MODEL env pattern. Generalize to all fleet agents.
  • Auto-router circuit breaker — related to AIN-154 router hardening. Add fallback chain so ainfera-auto failure doesn't 500, falls through to first-registered model with audit log of degradation.
  • Structured 500 responses — every API path returns JSON error envelope, never plain-text. Observability hygiene.
  • Synthetic monitoring on /v1/inference — production probe every 60s, alert on 500.
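
To make the circuit-breaker mitigation concrete, here is a minimal consecutive-failure breaker. It is a generic sketch, not the AIN-154 design: the class, its thresholds, and the half-open behaviour after the cooldown are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cooldown."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Return True if a call may go through to the protected path."""
        if self.opened_at is None:
            return True
        # Half-open: once the cooldown has elapsed, let a trial call through.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

A caller would check `allow()` before using the auto-router and, when it returns False, fall through to a fixed model while writing an audit entry for the degradation, as the bullet describes.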

Cross-references

  • AIN-154 router hardening (circuit breakers + fallback chains directly address this)
  • AIN-128 USDC settlement (also depends on /v1/inference reliability)
  • INC-2026-05-18-001 transcript leak (this is third session incident, pattern emerging)
  • INC-2026-05-18-002 re-rotation (pending v6.1 docs)
  • Discipline #14 transcript secret hygiene (applies to log inspection in this investigation)
  • Memory: AIN-164 SHA 7f924fea clean per last session bracket (rules out token-rotation cascade)

@hizrianraz hizrianraz merged commit daa7b32 into main May 18, 2026
4 checks passed
@hizrianraz hizrianraz deleted the fix/ain-171-auto-route-exception-coverage branch May 18, 2026 12:43