feat(api): AIN-154 Phase A · router hardening tables (policy + health + breakers) #27
… + breakers)

Lands the three persistent stores for L2 router production hardening per AIN-154 Phase 0 prep:

## Tables

### tenant_routing_policy

- One row per tenant (PK = tenant_id, CASCADE on tenant delete)
- 5 preset `policy_name` values locked via CHECK
- Default weights Q 0.40 / L 0.30 / C 0.30 sum=1.00 per D-154-1
- weights_sum_lock CHECK enforces invariant at DB layer
- compliance_veto_threshold default 0.50 per D-154-2
- fallback_cost_penalty_pct default 5% per D-154-3
- Bounds CHECKs on all numeric fields

### provider_health_checks

- Synthetic probe results, one row per probe
- (provider_slug, model_slug) NOT FK: supports soft-deleted catalog rows per AIN-141 archival semantics
- outcome CHECK locked to 5 values matching InvocationResult.status from AIN-154 architecture
- Composite DESC index on (provider, model, probed_at) serves both dashboard latency rollup + routing decision read path

### circuit_breakers

- One row per (provider, model): composite PK, no surrogate
- 3-state machine (CLOSED / OPEN / HALF_OPEN) CHECK-locked
- Tracks opened_at + half_open_at + closed_at timestamps for the 60s cool-off window logic
- consecutive_failures + trip_count + last_failure_at + last_success_at for the 5-fail-in-60s trip rule quick-path
- State index for "show me all open breakers" dashboard query

## Phase 0 decisions all locked in DB

D-154-1, D-154-2, D-154-3 encoded as DEFAULT + CHECK so the DB enforces the policy invariants. The app layer can override per tenant, but defaults match the founder-recommended values from Phase 0 prep.

## What this migration does NOT include (Phase B+ scope)

- Per-provider rate-limit tracking: lives in Redis (ephemeral, TTL), not Postgres; durable trace surfaces via audit chain
- ATS scores table: separate sprint deliverable (Phase E of AIN-154)
- routing.decided audit event integration: separate PR; will reference workflow_id from AIN-153 migration 0012

## Stack note

Migration 0013 chains off 0012 (AIN-153 Phase A). This PR's base is the AIN-153 PR branch so the migration chain stays linear. When AIN-153 merges to main, AIN-154 PR auto-rebases.

## Cross-refs

- AIN-154 (parent epic, Sprint v1.8 router hardening)
- AIN-154 Phase 0 prep comment (2026-05-18 PM)
- AIN-153 Phase A (PR api#26, chained parent migration)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
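For readers who want to see how the `circuit_breakers` columns could drive the 3-state machine, here is a minimal in-memory sketch. It is not the router's implementation: the class name `Breaker`, the method names, and the use of float seconds for timestamps are all hypothetical. The 5-failure threshold and 60s cool-off come from the commit message; the 60s failure window is approximated here by resetting the streak whenever two failures are more than 60s apart, which is an assumption rather than the documented rule.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 5    # from the "5-fail-in-60s" trip rule
WINDOW_SECONDS = 60.0    # assumed: streak resets if failures are further apart
COOL_OFF_SECONDS = 60.0  # from the "60s cool-off window"


@dataclass
class Breaker:
    """Hypothetical in-memory mirror of one circuit_breakers row."""

    state: str = "CLOSED"  # CLOSED / OPEN / HALF_OPEN
    consecutive_failures: int = 0
    trip_count: int = 0
    opened_at: Optional[float] = None
    half_open_at: Optional[float] = None
    closed_at: Optional[float] = None
    last_failure_at: Optional[float] = None
    last_success_at: Optional[float] = None

    def record_failure(self, now: float) -> None:
        # Restart the streak when the previous failure is outside the window.
        stale = (
            self.last_failure_at is not None
            and now - self.last_failure_at > WINDOW_SECONDS
        )
        if stale:
            self.consecutive_failures = 0
        self.consecutive_failures += 1
        self.last_failure_at = now
        # A failed trial call in HALF_OPEN reopens immediately; in CLOSED the
        # breaker trips once the streak reaches the threshold.
        should_trip = self.state == "HALF_OPEN" or (
            self.state == "CLOSED"
            and self.consecutive_failures >= FAILURE_THRESHOLD
        )
        if should_trip:
            self.state = "OPEN"
            self.opened_at = now
            self.trip_count += 1

    def record_success(self, now: float) -> None:
        self.consecutive_failures = 0
        self.last_success_at = now
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"
            self.closed_at = now

    def allow_request(self, now: float) -> bool:
        # After the cool-off, let one trial call through in HALF_OPEN.
        if (
            self.state == "OPEN"
            and self.opened_at is not None
            and now - self.opened_at >= COOL_OFF_SECONDS
        ):
            self.state = "HALF_OPEN"
            self.half_open_at = now
        return self.state != "OPEN"
```

Keeping the transition logic as pure functions of `now` makes the trip and cool-off rules easy to unit test without touching Postgres; a real implementation would also need to persist each transition atomically.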
AIN-154 [L2] Auto-route hardening: circuit breakers, fallback chains, ATS-weighted routing, observability
Parent epic for Sprint v1.8 Launch Prerequisites. Founder locked 2026-05-18 PM.

Scope

Today's L2 Routing does provider-neutral dispatch across 10 providers + AAMC voter selection. It works in happy path but lacks production-grade hardening: no circuit breakers, no automatic fallback chains, no ATS-weighted routing, no per-provider rate-limit tracking, no observability of routing decisions. Founder ask: harden auto-route to production-grade before launch.

Hardening dimensions

1. Resilience (circuit breakers + fallback chains)
2. Latency-aware routing
3. Cost-aware routing
4. Quality-aware routing (ATS-weighted)
5. Rate-limit awareness
6. Reasoning-token floor enforcement (already locked)
7. Provider health monitor (background task)
8. Observability: routing decision in audit chain
9. Per-tenant routing policy

        CREATE TABLE tenant_routing_policy (
            tenant_id uuid PRIMARY KEY REFERENCES tenants(id),
            policy_name text NOT NULL DEFAULT 'balanced', -- balanced / cost_first / quality_first / latency_first / strict_no_fallback
            cost_weight numeric(3,2) NOT NULL DEFAULT 0.30,
            quality_weight numeric(3,2) NOT NULL DEFAULT 0.40,
            latency_weight numeric(3,2) NOT NULL DEFAULT 0.30,
            fallback_enabled boolean NOT NULL DEFAULT true,
            fallback_cost_penalty_pct numeric(4,2) NOT NULL DEFAULT 5.00,
            compliance_veto_threshold numeric(3,2) NOT NULL DEFAULT 0.50,
            CONSTRAINT weights_sum CHECK (cost_weight + quality_weight + latency_weight = 1.00)
        );

10. Graceful degradation
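One way to read the per-tenant weights in item 9 is as coefficients of a blended candidate score. The sketch below is an illustration under assumptions, not the router's scoring code: `RoutingPolicy`, `blended_score`, and the convention that each sub-score is normalized to [0, 1] with higher meaning better are all hypothetical. Only the default weights and the sum-to-1.00 invariant come from the DDL above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RoutingPolicy:
    """Hypothetical mirror of the weight columns in tenant_routing_policy."""

    cost_weight: Decimal = Decimal("0.30")
    quality_weight: Decimal = Decimal("0.40")
    latency_weight: Decimal = Decimal("0.30")

    def __post_init__(self) -> None:
        # Same invariant as the weights_sum CHECK. Decimal keeps the
        # comparison exact, as numeric(3,2) does in Postgres.
        total = self.cost_weight + self.quality_weight + self.latency_weight
        if total != Decimal("1.00"):
            raise ValueError(f"weights must sum to 1.00, got {total}")


def blended_score(
    policy: RoutingPolicy,
    cost: Decimal,
    quality: Decimal,
    latency: Decimal,
) -> Decimal:
    """Weighted sum of sub-scores (assumed normalized to [0, 1], higher is better)."""
    return (
        policy.cost_weight * cost
        + policy.quality_weight * quality
        + policy.latency_weight * latency
    )


if __name__ == "__main__":
    policy = RoutingPolicy()  # the 'balanced' defaults
    print(blended_score(policy, Decimal("1"), Decimal("0.5"), Decimal("0")))
```

Validating the sum in application code as well as in the CHECK lets a bad override fail with a readable error before the write reaches the database.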
Architecture

New code paths in Provider adapters get standardized failure interface:

    class ProviderAdapter:
        async def invoke(self, ...) -> InvocationResult:
            # InvocationResult has:
            # - status: ok | retriable_error | terminal_error | rate_limited
            # - latency_ms
            # - rate_limit_headers (if exposed)
            # - error_class (for circuit breaker decision)

Phases (sub-tickets after design lock)
Acceptance criteria
Out of scope (this epic)
Sprint targeting

Sprint v1.8: Launch Window (after Sprint v1.7 closes ~June 30, ships D30–D45 ~mid-July). ROUTER HARDENING is the production-stability prerequisite for launch. Without it, the first burst of preview-user inferences would expose unhardened paths.

Cross-references
Ontology vocabulary additions (need locking)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.
Reviewed by Cursor Bugbot for commit 825a66e. Configure here.
    ),
    sa.Column(
        "fallback_cost_penalty_pct",
        sa.Numeric(4, 2),
Numeric(4,2) column cannot store upper CHECK bound 100
Medium Severity
fallback_cost_penalty_pct is typed Numeric(4, 2), which in PostgreSQL supports values up to 99.99 (4 total digits, 2 after decimal → 2 before decimal). The CHECK constraint claims <= 100 is valid, and the PR description specifies bounds [0, 100], but inserting 100 would cause a "numeric field overflow" error before the CHECK is even evaluated. The column type needs to be Numeric(5, 2) to accommodate the documented upper bound.
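The arithmetic behind this finding can be checked without a database. The sketch below uses Python's `decimal` module to compute the largest value a `NUMERIC(precision, scale)` column can hold; the helper names `numeric_max` and `fits` are hypothetical, and the round-half-up step is a simplified model of how PostgreSQL rounds to the column scale before checking for overflow.

```python
from decimal import ROUND_HALF_UP, Decimal
from typing import Union


def numeric_max(precision: int, scale: int) -> Decimal:
    """Largest value representable in NUMERIC(precision, scale)."""
    # precision digits of 9, shifted right by `scale` decimal places.
    return (Decimal(10) ** precision - 1).scaleb(-scale)


def fits(value: Union[str, int, Decimal], precision: int, scale: int) -> bool:
    """True if value, rounded to `scale` places, fits without overflow."""
    step = Decimal(1).scaleb(-scale)  # e.g. 0.01 for scale=2
    rounded = Decimal(value).quantize(step, rounding=ROUND_HALF_UP)
    return abs(rounded) <= numeric_max(precision, scale)


if __name__ == "__main__":
    print(numeric_max(4, 2))   # 99.99
    print(fits("100", 4, 2))   # False: the documented upper bound overflows
    print(fits("100", 5, 2))   # True: Numeric(5, 2) holds up to 999.99
```

With `Numeric(5, 2)` the column can store 100.00, so the `<= 100` CHECK becomes the effective upper bound instead of the type's overflow error.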


Summary
Tables
tenant_routing_policy
provider_health_checks
circuit_breakers
Phase 0 decisions all locked in DB
Stack note
feat/ain-153-phase-a-workflows-tasks-schema (PR api#26)
Pre-commit hooks
ruff + ruff format + mypy --strict + pytest -x — all green.
Test plan
alembic upgrade head after merge (no backfill required)
Refs
🤖 Generated with Claude Code
Note
Medium Risk
Adds new Postgres tables and constraints used by routing decisions; while isolated from existing data, schema/constraint mistakes could block writes or degrade query performance once the router starts using them.
Overview
Adds Alembic migration 20260518_0013_router_hardening_tables.py to introduce three new persistent stores for router hardening: tenant_routing_policy, provider_health_checks, and circuit_breakers.

The migration encodes Phase 0 routing defaults and guardrails at the DB layer via server defaults + CHECK constraints (policy name vocabulary, weights sum/bounds, fallback penalty/veto bounds), plus adds indexes for common health-check recency queries and breaker state lookup; downgrade drops indexes/tables in reverse order.
Reviewed by Cursor Bugbot for commit 825a66e. Bugbot is set up for automated code reviews on this repo. Configure here.