[api] AIN-232+AIN-234 W2: gateway resilience probes + cost kill-switch#89

Merged
hizrianraz merged 1 commit into main from hizrianraz/ain-232-w2-resilience
May 28, 2026

@hizrianraz commented May 28, 2026

Summary (W2/9 — ship-now)

Gateway hardening from the closed SP-5 PR #77, re-baselined onto current main.

New: /healthz + /readyz (AIN-232)

  • /healthz — liveness, in-process only, no I/O. Platform HEALTHCHECK target.
  • /readyz — aggregate readiness: process_drain gate · DB · audit chain, plus a killswitch snapshot in the response body. Returns 503 if any probe fails so the LB stops routing.

New: cost kill-switch (AIN-234)

  • Guard at the entry of dispatch_with_brain, before the brain runs.
  • Rolling-window spend (default window: today UTC) vs AINFERA_SPEND_KILLSWITCH_USD (default $50, configurable, hot-reloaded).
  • When engaged, returns 503 cost_killswitch_engaged with spent/threshold/window_start in the body.
  • Pinned passthroughs bypass the guard by design.
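The guard's shape can be sketched as follows. The sketch assumes:

- The spend rollup is injected as an async callable. `spend_since` is a hypothetical stand-in for the inferences.cost_usd query.
- The window starts at midnight UTC.
- The guard engages at or above the threshold. The PR does not say whether the comparison is `>=` or `>`.

The real guard_or_raise takes a DB session, so this signature is illustrative only.

```python
import asyncio
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional

# Hypothetical stand-in for the DB rollup: sum of cost_usd since window_start.
SpendSince = Callable[[datetime], Awaitable[float]]


class CostKillswitchEngagedError(Exception):
    """Raised before the brain runs when window spend reaches the threshold."""

    def __init__(self, spent_usd: float, threshold_usd: float, window_start: datetime):
        super().__init__(f"cost killswitch engaged: ${spent_usd:.2f} >= ${threshold_usd:.2f}")
        self.spent_usd = spent_usd
        self.threshold_usd = threshold_usd
        self.window_start = window_start


def window_start_utc(now: Optional[datetime] = None) -> datetime:
    """Start of the default window: midnight UTC of the current day."""
    if now is None:
        now = datetime.now(timezone.utc)
    now = now.astimezone(timezone.utc)
    return now.replace(hour=0, minute=0, second=0, microsecond=0)


async def guard_or_raise(
    spend_since: SpendSince,
    threshold_usd: float,
    enabled: bool,
    now: Optional[datetime] = None,
) -> None:
    """No-op when disabled or under threshold; raises when engaged."""
    if not enabled:
        return  # disabled posture: never blocks traffic
    start = window_start_utc(now)
    spent = await spend_since(start)
    if spent >= threshold_usd:  # assumption: engage at or above the threshold
        raise CostKillswitchEngagedError(spent, threshold_usd, start)
```

The guard raises before any brain call. A request that is not blocked therefore proceeds exactly as before, which is the Disc #12 constraint.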

Disc #12

This is observability + a guard, never a routing-engine change:

  • ZERO change to scoring / weights / candidate ordering / M_allowed / q_prior.
  • The killswitch raises BEFORE the brain is invoked. If it doesn't fire, the engine runs exactly as today.
  • Threshold is a founder env config; the code only enforces what env says.

Tests

  • test_health_probes.py (4): liveness, readyz ok, readyz db-fail → 503, drain-gate flip.
  • test_cost_killswitch.py (20): env reads, spend rollup, snapshot composition, guard raise/no-raise, edge cases, disabled posture.
  • test_openapi_contract.py (4): /healthz + /readyz documented as non-v1 platform endpoints.
  • All 28 pass in 0.91s.

Founder action (post-merge)

Set in Railway env on ainfera-api:

AINFERA_SPEND_KILLSWITCH_USD=<real_threshold>   # default $50
AINFERA_SPEND_KILLSWITCH_ENABLED=1              # default
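Both variables can be read on every call rather than cached at import, as sketched below. The sketch assumes that a missing or malformed value falls back to the documented defaults ($50, enabled). The accepted falsy spellings and the fallback rules are assumptions, not taken from services/cost_killswitch.py.

Reading per call means in-process changes (for example, a patched environment in tests) take effect immediately. A running process's environment cannot normally be changed from outside, though. A value edited in the platform dashboard therefore reaches the process through a new deployment.

```python
import math
import os
from typing import Mapping, Tuple

DEFAULT_THRESHOLD_USD = 50.0
# Assumption: only these explicit spellings disable the guard (fail-closed).
_FALSY = {"0", "false", "no", "off"}


def read_killswitch_config(env: Mapping[str, str] = os.environ) -> Tuple[float, bool]:
    """Return (threshold_usd, enabled), re-read on every call (no caching)."""
    raw = env.get("AINFERA_SPEND_KILLSWITCH_USD", "").strip()
    try:
        threshold = float(raw)
    except ValueError:
        threshold = DEFAULT_THRESHOLD_USD  # missing or malformed: use the default
    if not math.isfinite(threshold) or threshold <= 0:
        threshold = DEFAULT_THRESHOLD_USD  # reject nan, inf, and non-positive values
    enabled_raw = env.get("AINFERA_SPEND_KILLSWITCH_ENABLED", "1").strip().lower()
    enabled = enabled_raw not in _FALSY  # anything else keeps the guard on
    return threshold, enabled
```

A typo in the enabled flag leaves the switch armed rather than silently disabling it. That fail-closed behavior is the safer default for a spend guard.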

Refs

AIN-232 · AIN-234 · supersedes closed PR #77 · Disc #12 preserved

🤖 Generated with Claude Code


Note

Medium Risk
Changes operational behavior (LB routing via /readyz; possible 503 on all routed inference when the killswitch trips) and adds spend aggregation across all inferences for the day; the routing engine itself is unchanged.

Overview
Adds Kubernetes-style gateway probes and a global spend circuit breaker on routed inference, without changing routing logic.

Health (/healthz, /readyz): A new health router exposes /healthz (liveness, no I/O) and /readyz (readiness: drain gate, DB SELECT 1, audit table read, plus killswitch snapshot in the body). Failed readiness probes return 503 with a per-probe breakdown. Legacy /health stays as health_legacy for existing monitors. A process-local readiness gate can flip to draining (intended for SIGTERM) so load balancers stop sending new traffic while existing work drains.

Cost kill-switch: New cost_killswitch service sums daily inferences.cost_usd against AINFERA_SPEND_KILLSWITCH_USD (default $50, hot-reloaded) and AINFERA_SPEND_KILLSWITCH_ENABLED. guard_or_raise runs at the start of dispatch_with_brain (before the brain). When engaged, routed POST /v1/inference returns 503 with cost_killswitch_engaged and spend/threshold/window metadata; pinned vendor passthroughs are unchanged.

Tests: Unit tests for env parsing, engagement math, liveness/legacy health, and drain gate; OpenAPI contract notes probes are platform-level, not part of the v1 surface.

Reviewed by Cursor Bugbot for commit aa4b0d4. Bugbot is set up for automated code reviews on this repo.

…ll-switch

W2/9 SHIP-NOW. Brings in the gateway hardening from closed PR #77
(re-baselined). Wires through brain + inference router + deploys to
Railway on merge.

routers/health.py (NEW):
- GET /healthz: in-process liveness, no I/O (HEALTHCHECK target)
- GET /readyz: aggregate readiness (process_drain + db + audit + ks
  snapshot); 503 on any probe fail
- get_readiness_gate() flipped FALSE on SIGTERM for drain
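A process-local readiness gate flipped by SIGTERM might look like the sketch below, assuming a simple one-way flag that a process_drain probe reads. `ReadinessGate` and `install_sigterm_drain` are hypothetical names; the PR only names get_readiness_gate().

The handler chains to any previously installed Python-level handler so existing shutdown logic is not silently dropped. If the previous disposition was the default (SIG_DFL), the process no longer exits on SIGTERM by itself. In that case something else must stop it after the drain window.

```python
import signal
import threading
from types import FrameType
from typing import Optional


class ReadinessGate:
    """Process-local readiness flag: ready until draining starts, then never again."""

    def __init__(self) -> None:
        self._draining = threading.Event()

    @property
    def ready(self) -> bool:
        return not self._draining.is_set()

    def start_draining(self) -> None:
        # One-way transition: a process_drain probe reading `ready` fails from here on.
        self._draining.set()


def install_sigterm_drain(gate: ReadinessGate) -> None:
    """Flip `gate` on SIGTERM, then defer to any previously installed Python handler.

    Must be called from the main thread (a requirement of signal.signal).
    """
    previous = signal.getsignal(signal.SIGTERM)

    def _on_sigterm(signum: int, frame: Optional[FrameType]) -> None:
        gate.start_draining()
        if callable(previous):
            previous(signum, frame)  # keep the existing shutdown behaviour

    signal.signal(signal.SIGTERM, _on_sigterm)
```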

services/cost_killswitch.py (NEW):
- guard_or_raise() called at dispatch_with_brain entry
- rolling-window spend (default today UTC) vs AINFERA_SPEND_KILLSWITCH_USD
- Default $50 + enabled; ops env-config without restart
- Pinned passthroughs bypass guard by design (moat-safe)
- Aggregate-only logging (no PII)

routing_brain.py:
- await cost_killswitch.guard_or_raise(db) before brain runs
- Disc #12 preserved: scoring/candidate-set/weights untouched

inference.py:
- Catch CostKillswitchEngagedError → 503 with code + spent/threshold

main.py:
- Register health.router; rename inline /health → health_legacy

Tests:
- test_health_probes.py (4) + test_cost_killswitch.py (20) + openapi
  contract (4 — /healthz, /readyz documented as non-v1)
- All 28 pass

Founder config (set in Railway env on api):
  AINFERA_SPEND_KILLSWITCH_USD=<real_threshold>  # default $50
  AINFERA_SPEND_KILLSWITCH_ENABLED=1             # default

Refs: AIN-232 · AIN-234 · supersedes closed PR #77

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear-code Bot commented May 28, 2026

AIN-232 [Foundation] Gateway resilience — kill the fleet SPOF (circuit breakers + gateway-independent fallback)

Fleet pressure-test finding #1 (LOCKED 2026-05-22). The dogfood thesis — every agent routes through Ainfera Routing — is ALSO the fleet's single biggest point of failure. Gateway down → all 7 agents go blind simultaneously. This ticket hardens the gateway so a single failure is contained, not fleet-wide.

Problem

  • One-router invariant means zero agent diversity at the routing layer. No agent can act if the gateway is unreachable.
  • The existing SPOF fallback (Ainfera + explicit model) does NOT help if the gateway PROCESS itself is down — the fallback still goes through the gateway.

What it must do

  • Gateway-independent degraded mode: a documented path where critical agents (Varda, Aulë) can reach a provider WITHOUT the gateway in the hot path, used ONLY when the gateway healthcheck is red. Must be auditable + rate-limited + alarmed (this is the one sanctioned exception to "never direct provider"; it is break-glass, not normal operation).
  • Circuit breakers per upstream provider: trip on error-rate/latency threshold, fail fast instead of retry-storming. Anomaly-gated.
  • Retry-storm / thundering-herd guards: capped retries + jittered backoff across all agents; correlated-spike detection.
  • Healthcheck + auto-failover across provider routes inside the gateway (one provider down ≠ gateway down).
  • Self-healing: automatic provider failover with explicit fallback chains per model class.
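The per-provider circuit-breaker bullet describes the standard closed/open/half-open pattern. The sketch below is a minimal version of it, with these assumptions:

- The breaker trips on a consecutive-failure count and uses a fixed cooldown. The ticket also mentions error-rate and latency thresholds, which this omits.
- Half-open admits exactly one trial call.
- `CircuitBreaker` and its parameters are hypothetical.
- The clock is injectable, so the trip/recover cycle can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._trial_in_flight = False

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        """Fail fast while open instead of sending traffic to a failing provider."""
        state = self.state
        if state == "closed":
            return True
        if state == "half_open" and not self._trial_in_flight:
            self._trial_in_flight = True  # admit exactly one trial call
            return True
        return False  # open, or a trial call is already in flight

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # success (including the trial call) closes the breaker
        self._trial_in_flight = False

    def record_failure(self) -> None:
        if self._opened_at is not None:
            # Failure while open/half-open (e.g. the trial call): restart the cooldown.
            self._opened_at = self._clock()
            self._trial_in_flight = False
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Keeping one breaker instance per upstream provider means one provider tripping does not block routes to the others. That matches "one provider down ≠ gateway down".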

Acceptance criteria

  • Per-provider circuit breakers live; verified to trip + recover under induced upstream failure
  • Break-glass gateway-independent path for Varda + Aulë; gated on red healthcheck; fully audited + alarmed when used
  • Retry caps + jittered backoff enforced fleet-wide; no unbounded retry path exists
  • Gateway healthcheck endpoint + auto-failover across provider routes
  • Chaos test: kill primary provider → fleet degrades gracefully, no retry storm, audit chain intact
  • Runbook: "gateway down" + "provider down" degraded-mode procedures documented
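One common way to meet the "retry caps + jittered backoff" criterion is capped retries with "full jitter" exponential backoff. In that scheme, each delay is drawn uniformly from [0, min(cap, base * 2**attempt)]. `retry_with_backoff`, `backoff_delay`, and their defaults are hypothetical. The sleep function and RNG are injectable for testing.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float, cap_s: float, rng: random.Random) -> float:
    """Full jitter: a uniform draw from [0, min(cap_s, base_s * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_s: float = 0.2,
    cap_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `call` at most `max_attempts` times, sleeping a jittered delay between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # sketch only: real code should retry transient errors only
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error, no unbounded loop
            sleep(backoff_delay(attempt, base_s, cap_s, rng))
    raise AssertionError("unreachable: the loop always returns or raises")
```

Full jitter spreads retries from many agents across the whole interval. That spread is what breaks up the synchronized retry waves behind a thundering herd.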

Why M1

Gates the whole one-router invariant's safety. Every agent (Phase 5) inherits this. Pairs with AIN-226 (the gateway itself).

AIN-234 [Foundation] Fleet cost governance + financial kill-switch (per-agent metering + circuit breakers)

Fleet pressure-test finding #5 (LOCKED 2026-05-22). 7 agents, Námo overnight Brain-Factory, Ulmo continuous monitors, Tulkas red-team sweeps — all routing through the gateway. Aggregate inference cost can balloon, and a cascading failure can exhaust the API budget in minutes. The 2026 playbook prescription: treat the agent swarm as governed software with hard financial kill-switches + per-agent cost observability. We already have the immutable audit trail (L4); we're missing the meters and the switches.

What it must do

  • Per-agent cost + latency + error observability: token consumption metered per agent per run (the gateway is the natural meter — every call already passes through it). Mainstream frameworks only give partial per-agent latency; the rest is on us.
  • Per-agent + fleet-wide budgets: soft alert + hard cap. Breach → throttle/halt that agent, alarm the founder.
  • Financial kill-switch: fleet-wide emergency stop the founder can trip; auto-trip on runaway-spend anomaly. Hard switch, not advisory.
  • Cost attribution to dogfood metric: since every agent's spend IS L2 routing revenue-margin data, wire agent cost into the same ledger (turns the cost risk into a product signal).
  • Extends existing partials: Ulmo runway-watch + Tulkas drain-proof sacrificial key cover pieces — this unifies them into fleet-wide governance.
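The soft-alert/hard-cap check for one agent plus the fleet total can be sketched as follows. The sketch assumes spend is tracked in USD per window and that the stricter of the two levels wins. A fleet-wide hard-cap breach therefore halts an agent even when that agent is under its own cap. `Budget`, `check_budget`, and the decision strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Budget:
    soft_usd: float  # crossing this alerts the founder
    hard_usd: float  # crossing this throttles/halts


def _level(spent: float, budget: Budget) -> int:
    """0 = within budget, 1 = soft alert, 2 = hard cap."""
    if spent >= budget.hard_usd:
        return 2
    if spent >= budget.soft_usd:
        return 1
    return 0


def check_budget(agent: str, spend_by_agent: Dict[str, float],
                 agent_budgets: Dict[str, Budget], fleet_budget: Budget) -> str:
    """Return 'ok', 'alert', or 'halt' for the next call by `agent`."""
    agent_level = _level(spend_by_agent.get(agent, 0.0), agent_budgets[agent])
    fleet_level = _level(sum(spend_by_agent.values()), fleet_budget)
    return ("ok", "alert", "halt")[max(agent_level, fleet_level)]
```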

Acceptance criteria

  • Per-agent cost/latency/error metrics emitted by the gateway; dashboard per agent per run
  • Soft-alert + hard-cap budgets per agent AND fleet-wide; breach throttles + alarms
  • Financial kill-switch: manual founder trip + auto-trip on spend anomaly; verified to halt fleet inference
  • Runaway-spend chaos test: simulate retry storm → kill-switch trips before budget blown
  • Agent spend reconciled into L4 audit chain / cost ledger
  • Ulmo consumes these signals for runway watch; Tulkas drain-proof folded in
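The "auto-trip on runaway-spend anomaly" criterion could be sketched as a sliding-window spend-rate check. The switch trips when spend recorded in the last `window_s` seconds exceeds a limit, which a retry storm would hit well before a daily cap. `SpendRateTrip` and its parameters are hypothetical. A fixed rate limit is also a simplification of real anomaly detection. The switch latches until an explicit reset, matching "hard switch, not advisory".

```python
from collections import deque
from typing import Deque, Tuple


class SpendRateTrip:
    """Latching trip: once window spend exceeds the limit, it stays tripped until reset."""

    def __init__(self, window_s: float, max_usd_per_window: float) -> None:
        self.window_s = window_s
        self.max_usd = max_usd_per_window
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp_s, cost_usd)
        self._window_total = 0.0
        self.tripped = False

    def record(self, now: float, cost_usd: float) -> bool:
        """Record one inference's cost at time `now`; return True if tripped."""
        self._events.append((now, cost_usd))
        self._window_total += cost_usd
        while self._events and self._events[0][0] <= now - self.window_s:
            _, old_cost = self._events.popleft()
            self._window_total -= old_cost  # drop spend that fell out of the window
        if self._window_total > self.max_usd:
            self.tripped = True  # latch: only reset() clears it
        return self.tripped

    def reset(self) -> None:
        """Manual reset, e.g. by the founder after investigating the spike."""
        self.tripped = False
```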

Why M1

Cost governance + kill-switch are foundational guardrails the whole fleet (Phase 5) runs inside. Pre-revenue, runaway inference is an existential risk for a solo founder.

@hizrianraz merged commit 60e1fc8 into main on May 28, 2026
4 checks passed

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.

# router. This is a guard, NOT a routing change: it fires BEFORE the
# brain is invoked; if it doesn't fire, the engine runs exactly as
# before. Disc #12 preserved (no scoring, no candidate-set change).
await cost_killswitch.guard_or_raise(db)

Streaming path doesn't catch new CostKillswitchEngagedError

High Severity

dispatch_with_brain now raises CostKillswitchEngagedError (via cost_killswitch.guard_or_raise), and the non-streaming /v1/inference handler catches it properly. However, services/streaming.py also calls dispatch_with_brain (for /v1/messages with stream=true) and only catches NoCandidateError and AllCandidatesFailedError. When the killswitch engages during a streaming request, the unhandled exception propagates through the SSE async generator, producing a broken stream or a 500 instead of a graceful termination frame or structured error.
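One way to apply the fix Bugbot describes is to catch CostKillswitchEngagedError inside the SSE generator, alongside the existing NoCandidateError and AllCandidatesFailedError handling, and emit a terminal error frame. The sketch below assumes the following:

- `stream_messages` and the `dispatch` parameter are hypothetical, and the exception class is a stand-in.
- The `event`/`data` frame shape is an assumption; the real services/streaming.py may shape its frames differently.

Because the guard fires before the brain, an alternative is to run it before the response starts, so the client gets a plain 503 instead of a stream.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


class CostKillswitchEngagedError(Exception):
    """Stand-in for the PR's exception, carrying the same spend metadata."""

    def __init__(self, spent_usd: float, threshold_usd: float, window_start: str):
        super().__init__("cost killswitch engaged")
        self.spent_usd = spent_usd
        self.threshold_usd = threshold_usd
        self.window_start = window_start


def sse_frame(event: str, data: dict) -> str:
    """Encode one server-sent event; json.dumps emits no newlines, so one data line suffices."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_messages(
    dispatch: Callable[[], Awaitable[AsyncIterator[str]]],
) -> AsyncIterator[str]:
    """Yield SSE frames; a tripped killswitch becomes one structured error frame."""
    try:
        chunks = await dispatch()  # the guard may raise here, before any chunk is sent
        async for chunk in chunks:
            yield sse_frame("delta", {"text": chunk})
    except CostKillswitchEngagedError as exc:
        yield sse_frame("error", {
            "code": "cost_killswitch_engaged",
            "spent": exc.spent_usd,
            "threshold": exc.threshold_usd,
            "window_start": exc.window_start,
        })
```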
