[api] AIN-232+AIN-234 W2: gateway resilience probes + cost kill-switch by hizrianraz · Pull Request #89 · ainfera-ai/api

hizrianraz · 2026-05-28T04:42:58Z

Summary (W2/9 — ship-now)

Gateway hardening from the closed SP-5 PR #77, re-baselined onto current main.

New: `/healthz` + `/readyz` (AIN-232)

/healthz: liveness, in-process only, no I/O. Platform HEALTHCHECK target.
/readyz: aggregate readiness (process_drain gate · DB · audit chain), with a killswitch snapshot in the body. Returns 503 if any probe fails so the load balancer stops routing to the instance.
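A minimal sketch of how the two probes could be split, using only the standard library so it runs outside the web framework. The probe callables, the `healthz`/`readyz` function names, the 2-second per-probe timeout and the response body shape are assumptions for illustration, not the code in routers/health.py.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# Hypothetical probe shape: an async callable returning True when healthy.
# Returning False, raising, or exceeding the timeout all count as a failure.
Probe = Callable[[], Awaitable[bool]]


def healthz() -> Tuple[int, dict]:
    """Liveness: answered from process state alone, no I/O."""
    return 200, {"status": "ok"}


async def _run_probe(probe: Probe, timeout_s: float) -> bool:
    try:
        return bool(await asyncio.wait_for(probe(), timeout_s))
    except Exception:
        # Connection errors and asyncio timeouts both mark the probe as failed.
        return False


async def readyz(probes: Dict[str, Probe], timeout_s: float = 2.0) -> Tuple[int, dict]:
    """Readiness: 200 only if every probe passes, else 503 with a per-probe breakdown."""
    names = list(probes)
    results = await asyncio.gather(*(_run_probe(probes[n], timeout_s) for n in names))
    checks = {name: ("ok" if ok else "fail") for name, ok in zip(names, results)}
    ready = all(results)
    return (200 if ready else 503), {"status": "ready" if ready else "not_ready", "checks": checks}
```

Running probes concurrently with a per-probe timeout keeps /readyz fast when one dependency hangs: a stalled DB shows up as a `fail` entry instead of stalling the load balancer's own check.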

New: cost kill-switch (AIN-234)

Guard at the entry of dispatch_with_brain, before the brain runs.
Compares rolling-window spend against AINFERA_SPEND_KILLSWITCH_USD (default $50, configurable, hot-reloaded).
When engaged → 503 cost_killswitch_engaged with spent/threshold/window_start in the body.
Pinned passthroughs bypass the guard by design.
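The guard described above can be sketched as follows. This is a hypothetical stand-in for services/cost_killswitch.py: instead of taking `db`, it takes an async `sum_spend_since` callable (an assumption) so it runs without a database, and it assumes the guard engages when spend is greater than or equal to the threshold. Reading the environment on every call is one simple way to get the hot-reload behavior.

```python
import os
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional


class CostKillswitchEngagedError(Exception):
    """Raised when spend in the current window has reached the threshold."""

    def __init__(self, spent: float, threshold: float, window_start: datetime):
        super().__init__(f"spent {spent:.2f} USD >= threshold {threshold:.2f} USD")
        self.spent = spent
        self.threshold = threshold
        self.window_start = window_start


def current_window_start(now: datetime) -> datetime:
    """Assumed window: from 00:00 UTC of the current day up to now."""
    return now.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)


async def guard_or_raise(
    sum_spend_since: Callable[[datetime], Awaitable[float]],
    now: Optional[datetime] = None,
) -> None:
    """Hypothetical guard: call before the brain; returns silently when under budget."""
    # Env is read on every call, so a new value applies without a restart.
    if os.environ.get("AINFERA_SPEND_KILLSWITCH_ENABLED", "1").strip() == "0":
        return
    threshold = float(os.environ.get("AINFERA_SPEND_KILLSWITCH_USD", "50"))
    start = current_window_start(now or datetime.now(timezone.utc))
    spent = await sum_spend_since(start)  # e.g. SUM(cost_usd) over inferences since start
    if spent >= threshold:
        raise CostKillswitchEngagedError(spent, threshold, start)
```

Pinned passthroughs never call the guard, which is how they bypass it; the brain is only reached when `guard_or_raise` returns.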

Disc #12

This is observability + a guard, never a routing-engine change:

ZERO change to scoring / weights / candidate ordering / M_allowed / q_prior.
The killswitch raises BEFORE the brain is invoked. If it doesn't fire, the engine runs exactly as today.
Threshold is a founder env config; the code only enforces what env says.

Tests

test_health_probes.py (4): liveness, readyz ok, readyz db-fail → 503, drain-gate flip.
test_cost_killswitch.py (20): env reads, spend rollup, snapshot composition, guard raise/no-raise, edge cases, disabled posture.
test_openapi_contract.py (4): /healthz + /readyz documented as non-v1 platform endpoints.
All 28 pass in 0.91s.

Founder action (post-merge)

Set in Railway env on ainfera-api:

AINFERA_SPEND_KILLSWITCH_USD=<real_threshold>   # default $50
AINFERA_SPEND_KILLSWITCH_ENABLED=1              # default
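A sketch of tolerant parsing for the two variables above. The fallback policy (malformed, non-finite or non-positive thresholds revert to the $50 default, and only `0`/`false`/`no`/`off` disable the switch) is an assumption, chosen so that a typo in the Railway env cannot silently turn the guard off. The function names are hypothetical.

```python
import math
import os
from typing import Mapping, Optional

DEFAULT_THRESHOLD_USD = 50.0
_FALSEY = {"0", "false", "no", "off"}


def read_threshold_usd(env: Optional[Mapping[str, str]] = None) -> float:
    """Threshold in USD; anything unusable falls back to the default."""
    env = os.environ if env is None else env
    raw = env.get("AINFERA_SPEND_KILLSWITCH_USD", "").strip()
    try:
        value = float(raw)
    except ValueError:
        # Covers unset (""), "$50", "fifty", etc.
        return DEFAULT_THRESHOLD_USD
    # Reject NaN, infinities and non-positive values rather than disabling the guard.
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_THRESHOLD_USD
    return value


def read_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Enabled unless explicitly switched off; the default is on."""
    env = os.environ if env is None else env
    raw = env.get("AINFERA_SPEND_KILLSWITCH_ENABLED", "1").strip().lower()
    return raw not in _FALSEY
```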

Refs

AIN-232 · AIN-234 · supersedes closed PR #77 · Disc #12 preserved

🤖 Generated with Claude Code

Note

Medium Risk
Changes operational behavior (LB routing via /readyz; possible 503 on all routed inference when the killswitch trips) and adds a daily spend aggregation over all inferences; the routing engine itself is unchanged.

Overview
Adds Kubernetes-style gateway probes and a global spend circuit breaker on routed inference, without changing routing logic.

Health (/healthz, /readyz): A new health router exposes /healthz (liveness, no I/O) and /readyz (readiness: drain gate, DB SELECT 1, audit table read, plus killswitch snapshot in the body). Failed readiness probes return 503 with a per-probe breakdown. Legacy /health stays as health_legacy for existing monitors. A process-local readiness gate can flip to draining (intended for SIGTERM) so load balancers stop new traffic while work drains.
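The process-local drain gate could look like this minimal sketch. `get_readiness_gate()` is named in the commit message; the `ReadinessGate` class and `install_sigterm_drain` helper are hypothetical. The handler chains to any previously installed Python-level SIGTERM handler so the app server's own shutdown still runs.

```python
import signal
import threading
from typing import Optional


class ReadinessGate:
    """Process-local readiness flag: set while serving, cleared once draining."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._ready.set()

    def is_ready(self) -> bool:
        return self._ready.is_set()

    def start_draining(self) -> None:
        # One-way flip: /readyz reports 503 from now on, so the LB stops
        # sending new requests while in-flight work finishes.
        self._ready.clear()


_GATE = ReadinessGate()


def get_readiness_gate() -> ReadinessGate:
    return _GATE


def install_sigterm_drain(gate: Optional[ReadinessGate] = None) -> None:
    """Flip the gate on SIGTERM, then defer to any earlier Python handler.

    Must be called from the main thread (a restriction of signal.signal).
    """
    if gate is None:
        gate = get_readiness_gate()
    previous = signal.getsignal(signal.SIGTERM)

    def _on_sigterm(signum, frame):
        gate.start_draining()
        if callable(previous):
            previous(signum, frame)

    signal.signal(signal.SIGTERM, _on_sigterm)
```

If no Python handler was installed before this one, the process no longer exits on SIGTERM by default; something else (typically the app server's graceful shutdown) has to end it after the drain.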

Cost kill-switch: New cost_killswitch service sums daily inferences.cost_usd against AINFERA_SPEND_KILLSWITCH_USD (default $50, hot-reloaded) and AINFERA_SPEND_KILLSWITCH_ENABLED. guard_or_raise runs at the start of dispatch_with_brain (before the brain). When engaged, routed POST /v1/inference returns 503 with cost_killswitch_engaged and spend/threshold/window metadata; pinned vendor passthroughs are unchanged.
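A sketch of the error-to-response mapping in the routed inference handler. The handler name, the `dispatch` callable and the exact JSON nesting are assumptions; only the `cost_killswitch_engaged` code and the spent/threshold/window_start fields come from this PR's description.

```python
import json
from datetime import datetime
from typing import Awaitable, Callable, Tuple


class CostKillswitchEngagedError(Exception):
    def __init__(self, spent: float, threshold: float, window_start: datetime):
        super().__init__("cost killswitch engaged")
        self.spent = spent
        self.threshold = threshold
        self.window_start = window_start


async def handle_routed_inference(
    dispatch: Callable[[], Awaitable[dict]],
) -> Tuple[int, str]:
    """Hypothetical routed-inference handler returning (status_code, JSON body)."""
    try:
        result = await dispatch()  # stands in for dispatch_with_brain(...)
    except CostKillswitchEngagedError as err:
        body = {
            "error": {
                "code": "cost_killswitch_engaged",
                "spent": err.spent,
                "threshold": err.threshold,
                "window_start": err.window_start.isoformat(),
            }
        }
        return 503, json.dumps(body)
    return 200, json.dumps(result)
```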

Tests: Unit tests for env parsing, engagement math, liveness/legacy health, and the drain gate; an OpenAPI contract test records that the probes are platform-level, not part of the v1 surface.

Reviewed by Cursor Bugbot for commit aa4b0d4. Bugbot is set up for automated code reviews on this repo.

…ll-switch W2/9 SHIP-NOW.

Brings in the gateway hardening from closed PR #77 (re-baselined). Wires through brain + inference router + deploys to Railway on merge.

routers/health.py (NEW):
- GET /healthz: in-process liveness, no I/O (HEALTHCHECK target)
- GET /readyz: aggregate readiness (process_drain + db + audit + ks snapshot); 503 on any probe fail
- get_readiness_gate() flipped FALSE on SIGTERM for drain

services/cost_killswitch.py (NEW):
- guard_or_raise() called at dispatch_with_brain entry
- rolling-window spend (default today UTC) vs AINFERA_SPEND_KILLSWITCH_USD
- Default $50 + enabled; ops env-config without restart
- Pinned passthroughs bypass guard by design (moat-safe)
- Aggregate-only logging (no PII)

routing_brain.py:
- await cost_killswitch.guard_or_raise(db) before brain runs
- Disc #12 preserved: scoring/candidate-set/weights untouched

inference.py:
- Catch CostKillswitchEngagedError → 503 with code + spent/threshold

main.py:
- Register health.router; rename inline /health → health_legacy

Tests:
- test_health_probes.py (4) + test_cost_killswitch.py (20) + openapi contract (4: /healthz, /readyz documented as non-v1)
- All 28 pass

Founder config (set in Railway env on api):
AINFERA_SPEND_KILLSWITCH_USD=<real_threshold>   # default $50
AINFERA_SPEND_KILLSWITCH_ENABLED=1              # default

Refs: AIN-232 · AIN-234 · supersedes closed PR #77

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear-code · 2026-05-28T04:43:01Z

AIN-232 [Foundation] Gateway resilience — kill the fleet SPOF (circuit breakers + gateway-independent fallback)

Fleet pressure-test finding #1 (LOCKED 2026-05-22). The dogfood thesis — every agent routes through Ainfera Routing — is ALSO the fleet's single biggest point of failure. Gateway down → all 7 agents go blind simultaneously. This ticket hardens the gateway so a single failure is contained, not fleet-wide.

Problem

One-router invariant means zero agent diversity at the routing layer. No agent can act if the gateway is unreachable.
The existing SPOF fallback (Ainfera + explicit model) does NOT help if the gateway PROCESS itself is down — the fallback still goes through the gateway.

What it must do

Gateway-independent degraded mode: a documented path where critical agents (Varda, Aulë) can reach a provider WITHOUT the gateway in the hot path, used ONLY when the gateway healthcheck is red. Must be auditable + rate-limited + alarmed (this is the one sanctioned exception to "never direct provider"; it is break-glass, not normal operation).
Circuit breakers per upstream provider: trip on error-rate/latency threshold, fail fast instead of retry-storming. Anomaly-gated.
Retry-storm / thundering-herd guards: capped retries + jittered backoff across all agents; correlated-spike detection.
Healthcheck + auto-failover across provider routes inside the gateway (one provider down ≠ gateway down).
Self-healing: automatic provider failover with explicit fallback chains per model class.
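A minimal per-provider circuit breaker, as a sketch of the second bullet. It trips on consecutive failures rather than the error-rate/latency thresholds the ticket asks for, a deliberate simplification; `CircuitBreaker`, `CircuitOpenError` and the default thresholds are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling the provider while the breaker is open."""


class CircuitBreaker:
    """Consecutive-failure breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"  # cool-down elapsed: allow a trial call
        return "open"

    def call(self, fn: Callable[[], object]) -> object:
        if self.state == "open":
            raise CircuitOpenError("provider circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed half-open trial, or too many consecutive failures, (re)opens it.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

One breaker per upstream provider (for example a dict keyed by provider name) keeps one provider's outage from tripping the others. In this sketch several concurrent requests can pass during half-open; a production version would admit a single trial request.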

Acceptance criteria

Per-provider circuit breakers live; verified to trip + recover under induced upstream failure
Break-glass gateway-independent path for Varda + Aulë; gated on red healthcheck; fully audited + alarmed when used
Retry caps + jittered backoff enforced fleet-wide; no unbounded retry path exists
Gateway healthcheck endpoint + auto-failover across provider routes
Chaos test: kill primary provider → fleet degrades gracefully, no retry storm, audit chain intact
Runbook: "gateway down" + "provider down" degraded-mode procedures documented
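The retry-cap criterion could be met with capped exponential backoff plus "full jitter" (each delay drawn uniformly from zero up to the capped exponential bound), sketched below. The function names, the retryable exception types and the default base/cap values are assumptions.

```python
import random
import time
from typing import Callable, List, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base_s: float = 0.2, cap_s: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter delays: attempt i sleeps uniform(0, min(cap, base * 2**i))."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap_s, base_s * 2 ** i)) for i in range(max_retries)]


def call_with_retries(fn: Callable[[], T], max_retries: int = 3,
                      retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """At most max_retries + 1 attempts in total; there is no unbounded loop."""
    for delay in backoff_delays(max_retries, rng=rng):
        try:
            return fn()
        except retry_on:
            sleep(delay)
    return fn()  # final attempt; its exception propagates to the caller
```

Randomizing the whole delay, rather than adding a small jitter to a fixed delay, spreads retries from many agents across the window, which is what blunts a thundering herd.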

Why M1

Gates the whole one-router invariant's safety. Every agent (Phase 5) inherits this. Pairs with AIN-226 (the gateway itself).

AIN-234 [Foundation] Fleet cost governance + financial kill-switch (per-agent metering + circuit breakers)

Fleet pressure-test finding #5 (LOCKED 2026-05-22). 7 agents, Námo overnight Brain-Factory, Ulmo continuous monitors, Tulkas red-team sweeps — all routing through the gateway. Aggregate inference cost can balloon, and a cascading failure can exhaust the API budget in minutes. The 2026 playbook prescription: treat the agent swarm as governed software with hard financial kill-switches + per-agent cost observability. We already have the immutable audit trail (L4); we're missing the meters and the switches.

What it must do

Per-agent cost + latency + error observability: token consumption metered per agent per run (the gateway is the natural meter — every call already passes through it). Mainstream frameworks only give partial per-agent latency; the rest is on us.
Per-agent + fleet-wide budgets: soft alert + hard cap. Breach → throttle/halt that agent, alarm the founder.
Financial kill-switch: fleet-wide emergency stop the founder can trip; auto-trip on runaway-spend anomaly. Hard switch, not advisory.
Cost attribution to dogfood metric: since every agent's spend IS L2 routing revenue-margin data, wire agent cost into the same ledger (turns the cost risk into a product signal).
Extends existing partials: Ulmo runway-watch + Tulkas drain-proof sacrificial key cover pieces — this unifies them into fleet-wide governance.
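Soft-alert and hard-cap budgets could be evaluated per agent and fleet-wide in one step, as in this sketch. `Budget`, `Verdict` and `evaluate` are hypothetical, as are any example figures; a call is judged by the worse of its agent's verdict and the fleet's.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict


class Verdict(IntEnum):
    OK = 0
    SOFT_ALERT = 1  # alarm the founder, keep serving
    HARD_CAP = 2    # throttle/halt the agent


@dataclass(frozen=True)
class Budget:
    soft_usd: float
    hard_usd: float

    def classify(self, spent_usd: float) -> Verdict:
        if spent_usd >= self.hard_usd:
            return Verdict.HARD_CAP
        if spent_usd >= self.soft_usd:
            return Verdict.SOFT_ALERT
        return Verdict.OK


def evaluate(agent: str, spent_by_agent: Dict[str, float],
             agent_budgets: Dict[str, Budget], fleet_budget: Budget) -> Verdict:
    """Worst of the agent's own verdict and the fleet-wide verdict."""
    agent_verdict = agent_budgets[agent].classify(spent_by_agent.get(agent, 0.0))
    fleet_verdict = fleet_budget.classify(sum(spent_by_agent.values()))
    return max(agent_verdict, fleet_verdict)  # IntEnum ordering: higher is worse
```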

Acceptance criteria

Per-agent cost/latency/error metrics emitted by the gateway; dashboard per agent per run
Soft-alert + hard-cap budgets per agent AND fleet-wide; breach throttles + alarms
Financial kill-switch: manual founder trip + auto-trip on spend anomaly; verified to halt fleet inference
Runaway-spend chaos test: simulate retry storm → kill-switch trips before budget blown
Agent spend reconciled into L4 audit chain / cost ledger
Ulmo consumes these signals for runway watch; Tulkas drain-proof folded in

Why M1

Cost governance + kill-switch are foundational guardrails the whole fleet (Phase 5) runs inside. Pre-revenue, runaway inference is an existential risk for a solo founder.


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.


cursor · 2026-05-28T04:47:27Z

+    # router. This is a guard, NOT a routing change: it fires BEFORE the
+    # brain is invoked; if it doesn't fire, the engine runs exactly as
+    # before. Disc #12 preserved (no scoring, no candidate-set change).
+    await cost_killswitch.guard_or_raise(db)


Streaming path doesn't catch new CostKillswitchEngagedError

High Severity

dispatch_with_brain now raises CostKillswitchEngagedError (via cost_killswitch.guard_or_raise), and the non-streaming /v1/inference handler catches it properly. However, services/streaming.py also calls dispatch_with_brain (for /v1/messages with stream=true) and only catches NoCandidateError and AllCandidatesFailedError. When the killswitch engages during a streaming request, the unhandled exception will propagate through the SSE async generator, producing a broken stream or 500 instead of a graceful termination frame or structured error.
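One way to address this finding is to catch the error inside the SSE generator and emit a terminal error frame. In this sketch the `dispatch` callable, the frame layout (modeled on an Anthropic-style `event: error` frame) and all names are assumptions, not the code in services/streaming.py.

```python
import json
from typing import AsyncIterator, Awaitable, Callable


class CostKillswitchEngagedError(Exception):
    def __init__(self, spent: float, threshold: float):
        super().__init__("cost killswitch engaged")
        self.spent = spent
        self.threshold = threshold


def sse_error_frame(error_type: str, message: str, **extra) -> str:
    """One SSE frame carrying a structured error payload (assumed shape)."""
    payload = {"type": "error", "error": {"type": error_type, "message": message, **extra}}
    return f"event: error\ndata: {json.dumps(payload)}\n\n"


async def stream_messages(
    dispatch: Callable[[], Awaitable[AsyncIterator[str]]],
) -> AsyncIterator[str]:
    """SSE generator: the killswitch error becomes a terminal frame, not a 500."""
    try:
        frames = await dispatch()  # stands in for dispatch_with_brain(...)
    except CostKillswitchEngagedError as err:
        yield sse_error_frame(
            "cost_killswitch_engaged",
            "spend threshold reached",
            spent=err.spent,
            threshold=err.threshold,
        )
        return
    async for frame in frames:
        yield frame
```

Because the guard fires before the brain runs, the streaming route could instead await the guard before opening the stream and return an ordinary 503; once an SSE response has started, the 200 status is already committed and only an in-band error frame is possible.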

Additional Locations (1)

ainfera_api/routers/inference.py#L344-L357


hizrianraz merged commit 60e1fc8 into main May 28, 2026
4 checks passed

hizrianraz merged 1 commit into main from hizrianraz/ain-232-w2-resilience
