feat(server): add TenantRegistry with per-tenant health tracking #628
Adopter feedback (filed #619, most-advanced Python multi-tenant adopter)

Read the full diff. The health-state machine, validator hook, and runtime mutation surface match what we asked for and look great. One critical design issue blocks adoption for us as written: the registration model is eager, and we need lazy.

Eager-vs-lazy mismatch

The PR's docstring example:

```python
```

We currently use `LazyPlatformRouter` precisely because we don't want to pay per-tenant platform-build cost at boot. `build_platform_for_tenant` for a GAM tenant means a network handshake to the GAM auth server, a credential fetch from KMS, and inventory-manager construction. Multiplied across N tenants, boot time scales linearly with tenant count. We do this lazily on first request per tenant (core/main.py:191). If we adopt `TenantRegistry` as written, we'd either:
Proposed fix: lazy-factory variant

Add a `register_lazy()` (or `factory=` kwarg on `register`) that takes a builder callable instead of an already-built platform:

```python
```

Internal: the registry holds the factory + a `platform: DecisioningPlatform | None` slot. `resolve_by_host` checks the slot, awaits the factory if empty, fills the slot. The validator + health-state machine work unchanged on the resolved platform. This composes cleanly with the existing eager API: adopters with few tenants pre-build, adopters with many lazy-build, same registry primitive.

Health states + lazy: behavioral note

With lazy registration, `pending` would mean "registered, factory not yet invoked." The first `resolve_by_host` triggers factory + validator. Health transitions `pending → healthy` on success, `pending → disabled` on factory failure. Matches the eager semantics.

Other observations (non-blocking)
Bottom line

If the lazy-factory variant lands, we adopt this PR and delete our hand-rolled tenant lifecycle. As written, we'd have to skip it or wrap it. The lazy mode is the difference between "useful primitive for single-host SaaS" and "useful primitive for multi-tenant SaaS at any scale." JS `createTenantRegistry` has the same eager-only limitation, so Python could ship the better surface here.
Force-pushed from 3a0f3ff to 787749d.
Adds lazy platform construction support requested by @bokelley (#628):

- `register_lazy(factory=...)`: defers per-tenant DecisioningPlatform construction to the first `resolve()` call; avoids paying KMS/GAM auth costs for all N tenants at boot
- `async resolve(host)`: handles both eager and lazy tenants; invokes the factory on first hit, caches the result, and serializes concurrent first-hit resolves with the per-tenant lock (single factory invocation per tenant)
- `PlatformFactory` type alias exported from `adcp.server`
- `register()` clears any lazy factory on eager re-registration; `register_lazy()` clears any cached platform on lazy re-registration; `unregister()` clears both
- Docstring fixes: `_normalize_host` load-balancer port note, `serve_options` multi-tenant clarification, lock lifecycle docs
- 14 new tests (39 total, all passing): lazy lifecycle, concurrent first-hit, factory/validator failures, unregister-during-resolve zombie guard, eager↔lazy re-registration

https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd
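The first-hit behavior described in this commit (one factory invocation per tenant, even under concurrent resolves) can be sketched roughly as below. This is a minimal illustration, not the PR's implementation: `LazySlotRegistry`, its attributes, and the demo factory are hypothetical names, and the validator and health tracking are omitted.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple


class LazySlotRegistry:
    """Hypothetical stand-in for the lazy path; not the real TenantRegistry."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Awaitable[object]]] = {}
        self._platforms: Dict[str, object] = {}
        self._locks: Dict[str, asyncio.Lock] = {}

    def register_lazy(self, host: str, factory: Callable[[], Awaitable[object]]) -> None:
        # Store only the builder; drop any previously cached platform.
        self._factories[host] = factory
        self._platforms.pop(host, None)

    async def resolve(self, host: str) -> Optional[object]:
        if host in self._platforms:  # fast path: already built
            return self._platforms[host]
        if host not in self._factories:  # unknown host
            return None
        # Create the lock inside the running loop, one per host.
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:
            # Re-check under the lock: another task may have built it already.
            if host not in self._platforms:
                self._platforms[host] = await self._factories[host]()
        return self._platforms[host]


async def demo() -> Tuple[int, List[object]]:
    calls = 0

    async def build() -> object:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stands in for KMS/GAM handshakes
        return {"tenant": "acme"}

    reg = LazySlotRegistry()
    reg.register_lazy("acme.example.com", build)
    results = await asyncio.gather(
        *(reg.resolve("acme.example.com") for _ in range(5))
    )
    return calls, list(results)


if __name__ == "__main__":
    print(asyncio.run(demo())[0])
```

The double check (once before taking the lock, once after) keeps the already-built path lock-free while still guaranteeing a single build on a cold tenant.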
Follow-up: lazy-factory variant landed

Three commits on this branch address your feedback:
Health semantics — Two correctness issues found + fixed in pre-PR review:
Your non-blocking observations all addressed:

One nit to flag:

41 tests pass.

Triaged by Claude Code. Session: https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd

Generated by Claude Code
Force-pushed from 84cad78 to 743029a.
Closes #619.

Adds TenantRegistry, a higher-level multi-tenant primitive that provides JS createTenantRegistry parity for Python deployments. Adopters pre-build per-tenant DecisioningPlatform instances and register them; the registry tracks health state (pending/healthy/unverified/disabled) and surfaces a synchronous resolve_by_host for the hot request path.

Key design choices over the JS surface:

- resolve_by_host is sync (in-memory dict) rather than async: Python owns the mapping directly via register()/unregister(), no external resolver.
- auto_validate dropped; validator presence is the opt-in (None = principal-token mode, always healthy).
- Per-tenant asyncio.Lock prevents TOCTOU on concurrent recheck() calls.
- unregister()-during-recheck race is guarded: post-validator writes check whether the tenant was removed while the validator was awaited.

Exports added to adcp.server: TenantRegistry, TenantResolution, TenantHealthState, TenantValidator.

Tested with pytest: 25 tests covering lifecycle, state machine, host normalization, sync/async validators, and concurrency correctness.

https://claude.ai/code/session_019ucMgF6YS9X5bygYUADztx
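The unregister()-during-recheck guard mentioned above might look something like the following. It is a sketch under assumptions: `MiniRegistry` is a hypothetical reduction of the registry to its `_health` map, the per-tenant lock is left out, and the result mapping is simplified to healthy/disabled.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple


class MiniRegistry:
    """Hypothetical reduction of the registry to its health map."""

    def __init__(self) -> None:
        self._health: Dict[str, str] = {}

    def register(self, tenant_id: str) -> None:
        self._health[tenant_id] = "pending"

    def unregister(self, tenant_id: str) -> None:
        self._health.pop(tenant_id, None)

    async def recheck(
        self, tenant_id: str, validator: Callable[[str], Awaitable[bool]]
    ) -> Optional[str]:
        ok = await validator(tenant_id)
        # Guard: the tenant may have been removed while the validator ran.
        # Writing health now would resurrect a zombie entry.
        if tenant_id not in self._health:
            return None
        self._health[tenant_id] = "healthy" if ok else "disabled"
        return self._health[tenant_id]


async def demo() -> Tuple[Optional[str], bool]:
    reg = MiniRegistry()
    reg.register("acme")

    async def slow_validator(tenant_id: str) -> bool:
        await asyncio.sleep(0.01)
        return True

    task = asyncio.create_task(reg.recheck("acme", slow_validator))
    await asyncio.sleep(0)  # let recheck start and suspend in the validator
    reg.unregister("acme")  # races with the in-flight recheck
    result = await task
    return result, "acme" in reg._health


if __name__ == "__main__":
    print(asyncio.run(demo()))
```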
When resolve() fails (factory raises, validator raises, or validator returns False), the _factories entry was left intact, causing every subsequent resolve() to re-enter the lazy path and invoke the dead factory again. A disabled tenant requires operator recheck() to recover; repeated silent retries are incorrect.

Fix: pop _factories alongside the _health="disabled" write in all three failure paths. Also pop on success (the factory is no longer needed once the platform is cached in _platforms).

Adds test: test_resolve_factory_failure_does_not_retry_on_subsequent_calls

https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd
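A rough sketch of the no-retry behavior this commit describes, covering only the factory-raises path. `NoRetryRegistry` is a hypothetical stand-in that reuses the `_factories`, `_platforms`, and `_health` names from the commit message; the validator and the per-tenant lock are omitted for brevity.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple


class NoRetryRegistry:
    """Hypothetical sketch: a failed factory is dropped, not retried."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Awaitable[object]]] = {}
        self._platforms: Dict[str, object] = {}
        self._health: Dict[str, str] = {}

    def register_lazy(self, host: str, factory: Callable[[], Awaitable[object]]) -> None:
        self._factories[host] = factory
        self._health[host] = "pending"

    async def resolve(self, host: str) -> Optional[object]:
        if host in self._platforms:
            return self._platforms[host]
        factory = self._factories.get(host)
        if factory is None:
            return None  # unknown host, or factory already discarded
        try:
            platform = await factory()
        except Exception:
            # Failure path: disable AND forget the factory so later
            # resolve() calls do not silently invoke it again.
            self._factories.pop(host, None)
            self._health[host] = "disabled"
            return None
        # Success path: the factory is no longer needed once cached.
        self._factories.pop(host, None)
        self._platforms[host] = platform
        self._health[host] = "healthy"
        return platform


async def demo() -> Tuple[int, str]:
    calls = 0

    async def broken_factory() -> object:
        nonlocal calls
        calls += 1
        raise RuntimeError("auth handshake failed")

    reg = NoRetryRegistry()
    reg.register_lazy("acme.example.com", broken_factory)
    await reg.resolve("acme.example.com")
    await reg.resolve("acme.example.com")  # must not call the factory again
    return calls, reg._health["acme.example.com"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```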
Two correctness issues found in pre-PR review:

1. register_lazy(await_first_validation=True) + validator returns False: the platform was being written to _platforms even though the tenant is disabled, and the factory was left in _factories. The fix mirrors the resolve() cold-path: on validator failure, discard the platform and clear the factory so the disabled tenant behaves consistently regardless of how it reached that state.
2. resolve() docstring claimed it returns None when the validator returns False, which is only true for the lazy cold-path. The fast-path (eager or previously-resolved lazy) returns TenantResolution(health="disabled"). Rewritten to make the contract unambiguous and to warn against gating solely on result is None.

Also adds:

- Class docstring: do-not-use-as-SubdomainTenantRouter warning (same resolve(host) signature, incompatible return type)
- recheck() docstring: lazy-pending and lazy-disabled caveats (recheck on pending-lazy advances health without building the platform; recheck on factory-disabled is insufficient, re-register is required)
- Test: register_lazy + await_first_validation + validator=False must not cache the platform and must not retry the factory on subsequent resolve()

https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd
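Because resolve() can return either None or a resolution whose health is "disabled", a caller-side gate probably needs to check both. The helper below is a hypothetical illustration: `Resolution` stands in for `TenantResolution` (whose real fields may differ), and treating "unverified" as still servable is an assumption based on the "graceful-degrade" wording elsewhere in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolution:
    """Hypothetical stand-in for TenantResolution; real fields may differ."""

    tenant_id: str
    health: str  # "pending" | "healthy" | "unverified" | "disabled"


def should_serve(result: Optional[Resolution]) -> bool:
    """Gate on health as well as on None, per the rewritten docstring."""
    if result is None:
        return False  # unknown host, or lazy cold-path failure
    # Assumption: "unverified" is graceful degradation and still serves;
    # "pending" and "disabled" do not.
    return result.health in ("healthy", "unverified")


if __name__ == "__main__":
    print(should_serve(Resolution("acme", "disabled")))
```

Gating only on `result is None` would let a disabled tenant through on the fast path, which is the mistake the docstring warns about.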
Force-pushed from 743029a to b4d18db.
Closes #619
Adds `TenantRegistry`, a higher-level multi-tenant primitive that closes the JS↔Python parity gap on `createTenantRegistry`. Adopters pre-build per-tenant `DecisioningPlatform` instances and register them; the registry tracks health state and surfaces a synchronous `resolve_by_host` for the hot request path.

Summary

- `src/adcp/server/tenant_registry.py` exporting `TenantRegistry`, `TenantResolution`, `TenantHealthState`, `TenantValidator`
- `TenantRegistry` manages four health states per tenant: `pending → healthy / disabled`, `healthy → unverified` (graceful-degrade on failed recheck), `unverified / disabled → healthy` (successful recheck)
- Per-tenant `asyncio.Lock` prevents TOCTOU races on concurrent `recheck()` calls; post-validator guard prevents zombie `_health` entries when `unregister()` races with an in-flight `recheck()`
- `resolve_by_host` is synchronous (in-memory dict updated eagerly by `register`/`unregister`). This is an intentional departure from the JS async variant; the Python registry owns its mapping directly
- `validator=None` is principal-token mode (always healthy); JWKS or custom health-check adopters pass a `(tenant_id, agent_url) -> bool` callable (sync or async)
- `auto_validate` kwarg from the issue proposal dropped per DX review; validator presence is the opt-in

Nits noted (not fixed in this PR):

- `_normalize_host` does not handle bare bracketless IPv6 hosts (edge case, DNS names only in practice)
- `health()` returning `None` for unknown tenants vs `platform_for_tenant` raising `KeyError`: intentional for a probe/observation method; documented in docstring

What-tested

- `ruff check`: clean
- `mypy src/adcp/server/tenant_registry.py --ignore-missing-imports`: no errors in new file
- `pytest tests/test_tenant_registry.py -v`: 25/25 passed

Pre-PR review

- `_platforms` type tightened to `dict[str, DecisioningPlatform]`; `serve_options` returns copy (fixed)
- `serve_options` public property added; docstring usage snippet added

Session: https://claude.ai/code/session_019ucMgF6YS9X5bygYUADztx
Generated by Claude Code
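The health transitions listed in the summary can be written down as a small pure function. This is only a sketch: `next_health` is a hypothetical helper, and the behavior for a failed recheck from `unverified` or `disabled` (staying in place) is an assumption, since the description does not spell it out.

```python
from typing import Literal

HealthState = Literal["pending", "healthy", "unverified", "disabled"]


def next_health(current: HealthState, validator_ok: bool) -> HealthState:
    """Return the state after one validation or recheck attempt."""
    if validator_ok:
        # pending, unverified, and disabled all recover to healthy;
        # healthy stays healthy.
        return "healthy"
    if current == "pending":
        return "disabled"  # first validation failed
    if current == "healthy":
        return "unverified"  # graceful degrade on a failed recheck
    # Assumption: unverified and disabled stay where they are on failure.
    return current


if __name__ == "__main__":
    state: HealthState = "pending"
    for ok in (True, False, True):
        state = next_health(state, ok)
        print(state)
```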