Skip to content

feat(server): add TenantRegistry with per-tenant health tracking#628

Merged
bokelley merged 4 commits intomainfrom
claude/issue-619-tenant-registry
May 10, 2026
Merged

feat(server): add TenantRegistry with per-tenant health tracking#628
bokelley merged 4 commits intomainfrom
claude/issue-619-tenant-registry

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

Closes #619

Adds TenantRegistry — a higher-level multi-tenant primitive that closes the JS↔Python parity gap on createTenantRegistry. Adopters pre-build per-tenant DecisioningPlatform instances and register them; the registry tracks health state and surfaces a synchronous resolve_by_host for the hot request path.

Summary

  • New src/adcp/server/tenant_registry.py exporting TenantRegistry, TenantResolution, TenantHealthState, TenantValidator
  • TenantRegistry manages four health states per tenant: pending → healthy / disabled, healthy → unverified (graceful-degrade on failed recheck), unverified / disabled → healthy (successful recheck)
  • Per-tenant asyncio.Lock prevents TOCTOU races on concurrent recheck() calls; post-validator guard prevents zombie _health entries when unregister() races with an in-flight recheck()
  • resolve_by_host is synchronous (in-memory dict updated eagerly by register/unregister) — intentional departure from the JS async variant; the Python registry owns its mapping directly
  • validator=None is principal-token mode (always healthy); JWKS or custom health-check adopters pass a (tenant_id, agent_url) -> bool callable (sync or async)
  • auto_validate kwarg from the issue proposal dropped per DX review; validator presence is the opt-in

Nits noted (not fixed in this PR):

  • _normalize_host does not handle bare bracketless IPv6 hosts (edge case, DNS names only in practice)
  • health() returning None for unknown tenants vs platform_for_tenant raising KeyError — intentional for a probe/observation method; documented in docstring

What-tested

  • ruff check — clean
  • mypy src/adcp/server/tenant_registry.py --ignore-missing-imports — no errors in new file
  • pytest tests/test_tenant_registry.py -v25/25 passed
    • Lifecycle: register, unregister, resolve_by_host
    • Health state machine: all 4 states, all transition arms
    • Host normalization: port stripping, case folding, URL change
    • Validators: sync, async, raising
    • Concurrency: concurrent rechecks (lock serializes), unregister-during-recheck (no zombie entry)

Pre-PR review

  • code-reviewer: approved — ruff F401/I001 nits (both fixed before commit); zombie health entry blocker (fixed); _platforms type tightened to dict[str, DecisioningPlatform]; serve_options returns copy (fixed)
  • dx-expert: approved (1 nit noted) — serve_options public property added; docstring usage snippet added

Triage-managed PR. This bot does not currently iterate on
review comments or PR conversation threads (only on the source
issue). To unblock:

  • Push fixup commits directly: gh pr checkout <num>
    fix → push.
  • Or re-trigger: comment /triage execute on the source
    issue.

See adcp#3121
for context.

Session: https://claude.ai/code/session_019ucMgF6YS9X5bygYUADztx


Generated by Claude Code

@bokelley
Copy link
Copy Markdown
Contributor Author

Adopter feedback (filed #619 — most-advanced Python multi-tenant adopter)

Read the full diff. The health-state machine, validator hook, and runtime mutation surface match what we asked for and look great. One critical design issue blocks adoption for us as written: the registration model is eager, and we need lazy.

Eager-vs-lazy mismatch

The PR's docstring example:

```python
for tenant in load_tenants_from_db():
await registry.register(
tenant.id,
agent_url=tenant.agent_url,
platform=build_platform_for(tenant), # ← built at boot
)
```

We currently use `LazyPlatformRouter` precisely because we don't want to pay per-tenant platform-build cost at boot. `build_platform_for_tenant` for a GAM tenant means a network handshake to the GAM auth server, credential fetch from KMS, and inventory-manager construction. Multiplied across N tenants, boot time scales linearly with tenant count. We do this lazily on first request per tenant (core/main.py:191).

If we adopt `TenantRegistry` as written, we'd either:

  1. Pay the boot-time cost (regression), or
  2. Skip `register()` and lazily call it from a `resolve_by_host` shim — defeating the point of the registry, since unregistered tenants 503

Proposed fix: lazy-factory variant

Add a `register_lazy()` (or `factory=` kwarg on `register`) that takes a builder callable instead of an already-built platform:

```python
async def register_lazy(
self,
tenant_id: str,
*,
agent_url: str,
factory: Callable[[str], Awaitable[DecisioningPlatform]],
await_first_validation: bool = False,
) -> None:
"""Register a tenant with a lazy platform factory.

The platform is built on first resolve_by_host hit, then cached.
Subsequent resolves return the cached instance. Suitable for
deployments with many tenants where eager construction is too
expensive.
"""

```

Internal: the registry holds the factory + a `platform: DecisioningPlatform | None` slot. `resolve_by_host` checks the slot, awaits the factory if empty, fills the slot. The validator + health-state machine work unchanged on the resolved platform.

This composes cleanly with the existing eager API — adopters with few tenants pre-build, adopters with many lazy-build, same registry primitive.

Health states + lazy: behavioral note

With lazy registration, `pending` would mean "registered, factory not yet invoked." The first `resolve_by_host` triggers factory + validator. Health transitions `pending → healthy` on success, `pending → disabled` on factory failure. Matches the eager semantics.

Other observations (non-blocking)

  • `_normalize_host` strips port — correct for our setup, but note that some load balancers re-emit non-default ports in `X-Forwarded-Host`. Worth a docstring note.
  • The `default_serve_options` plumbing is nice but the docstring example `serve(platform, **registry.serve_options)` only works for single-tenant deployments. Multi-tenant deployments don't pass a single platform to `serve()` — they pass a router. Worth clarifying or removing.
  • TOCTOU lock per-tenant is correct. Document that locks aren't garbage-collected on `unregister` (the current code preserves them, which is right — race-resistant — but adopters might wonder).

Bottom line

If the lazy-factory variant lands, we adopt this PR and delete our hand-rolled tenant lifecycle. As written, we'd have to skip it or wrap it. The lazy mode is the difference between "useful primitive for single-host SaaS" and "useful primitive for multi-tenant SaaS at any scale." JS `createTenantRegistry` has the same eager-only limitation — Python could ship the better surface here.

@bokelley bokelley force-pushed the claude/issue-619-tenant-registry branch from 3a0f3ff to 787749d Compare May 10, 2026 13:14
@bokelley bokelley marked this pull request as ready for review May 10, 2026 13:14
bokelley pushed a commit that referenced this pull request May 10, 2026
Adds lazy platform construction support requested by @bokelley (#628):

- `register_lazy(factory=...)` — defers per-tenant DecisioningPlatform
  construction to first `resolve()` call; avoids paying KMS/GAM auth
  costs for all N tenants at boot
- `async resolve(host)` — handles both eager and lazy tenants; invokes
  the factory on first hit, caches the result, serializes concurrent
  first-hit resolves with the per-tenant lock (single factory invocation
  per tenant)
- `PlatformFactory` type alias exported from `adcp.server`
- `register()` clears any lazy factory on eager re-registration;
  `register_lazy()` clears any cached platform on lazy re-registration;
  `unregister()` clears both
- Docstring fixes: `_normalize_host` load-balancer port note,
  `serve_options` multi-tenant clarification, lock lifecycle docs
- 14 new tests (39 total, all passing): lazy lifecycle, concurrent
  first-hit, factory/validator failures, unregister-during-resolve
  zombie guard, eager↔lazy re-registration

https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd
@bokelley
Copy link
Copy Markdown
Contributor Author

Follow-up: lazy-factory variant landed

Three commits on this branch address your feedback:

register_lazy() + async resolve()register_lazy(tenant_id, *, agent_url, factory, await_first_validation=False) takes a PlatformFactory = Callable[[str], Awaitable[DecisioningPlatform]]; the platform is built on the first await registry.resolve(host) hit and cached. Concurrent first-hits serialize on the per-tenant lock — one factory call per tenant. Existing resolve_by_host (sync) is unchanged and returns None until the platform is built.

Health semanticspending = registered, factory not yet invoked. First resolve(): pending → healthy on success, pending → disabled on factory/validator failure. Factory is cleared on disabled so subsequent resolve() calls return None without retrying; recovery requires register_lazy() + recheck().

Two correctness issues found + fixed in pre-PR review:

  • register_lazy(await_first_validation=True) + validator returns False: platform was being cached in _platforms and the factory was leaking into _factories. Fixed to match resolve() cold-path: discard the rejected platform, clear the factory.
  • resolve() docstring implied it always returns None when the validator rejects — only true on the cold-path. Fast-path (eager or previously-resolved lazy) returns TenantResolution(health="disabled"). Docstring and inline warning updated: always check result.health, never gate solely on result is None.

Your non-blocking observations all addressed: _normalize_host load-balancer port caveat added, serve_options single-vs-multi-tenant clarification updated, lock lifecycle documented.

One nit to flag: async def resolve(host) shares its name and arity with SubdomainTenantRouter.resolve — a @runtime_checkable Protocol in this package. Mypy catches the type mismatch; duck-typing doesn't. Added a class-docstring warning. Happy to rename to resolve_platform if you'd prefer to close the pitfall before this merges.

41 tests pass.


Triaged by Claude Code. Session: https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd


Generated by Claude Code

bokelley pushed a commit that referenced this pull request May 10, 2026
Adds lazy platform construction support requested by @bokelley (#628):

- `register_lazy(factory=...)` — defers per-tenant DecisioningPlatform
  construction to first `resolve()` call; avoids paying KMS/GAM auth
  costs for all N tenants at boot
- `async resolve(host)` — handles both eager and lazy tenants; invokes
  the factory on first hit, caches the result, serializes concurrent
  first-hit resolves with the per-tenant lock (single factory invocation
  per tenant)
- `PlatformFactory` type alias exported from `adcp.server`
- `register()` clears any lazy factory on eager re-registration;
  `register_lazy()` clears any cached platform on lazy re-registration;
  `unregister()` clears both
- Docstring fixes: `_normalize_host` load-balancer port note,
  `serve_options` multi-tenant clarification, lock lifecycle docs
- 14 new tests (39 total, all passing): lazy lifecycle, concurrent
  first-hit, factory/validator failures, unregister-during-resolve
  zombie guard, eager↔lazy re-registration

https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd
@bokelley bokelley force-pushed the claude/issue-619-tenant-registry branch from 84cad78 to 743029a Compare May 10, 2026 15:15
claude added 4 commits May 10, 2026 11:15
Closes #619. Adds TenantRegistry — a higher-level multi-tenant primitive
that provides JS createTenantRegistry parity for Python deployments.

Adopters pre-build per-tenant DecisioningPlatform instances and register
them; the registry tracks health state (pending/healthy/unverified/disabled)
and surfaces a synchronous resolve_by_host for the hot request path.

Key design choices over the JS surface:
- resolve_by_host is sync (in-memory dict) rather than async — Python owns
  the mapping directly via register()/unregister(), no external resolver.
- auto_validate dropped; validator presence is the opt-in (None = principal-
  token mode, always healthy).
- Per-tenant asyncio.Lock prevents TOCTOU on concurrent recheck() calls.
- unregister()-during-recheck race is guarded: post-validator writes check
  whether the tenant was removed while the validator was awaited.

Exports added to adcp.server: TenantRegistry, TenantResolution,
TenantHealthState, TenantValidator.

Tested with pytest: 25 tests covering lifecycle, state machine, host
normalization, sync/async validators, and concurrency correctness.

https://claude.ai/code/session_019ucMgF6YS9X5bygYUADztx
Adds lazy platform construction support requested by @bokelley (#628):

- `register_lazy(factory=...)` — defers per-tenant DecisioningPlatform
  construction to first `resolve()` call; avoids paying KMS/GAM auth
  costs for all N tenants at boot
- `async resolve(host)` — handles both eager and lazy tenants; invokes
  the factory on first hit, caches the result, serializes concurrent
  first-hit resolves with the per-tenant lock (single factory invocation
  per tenant)
- `PlatformFactory` type alias exported from `adcp.server`
- `register()` clears any lazy factory on eager re-registration;
  `register_lazy()` clears any cached platform on lazy re-registration;
  `unregister()` clears both
- Docstring fixes: `_normalize_host` load-balancer port note,
  `serve_options` multi-tenant clarification, lock lifecycle docs
- 14 new tests (39 total, all passing): lazy lifecycle, concurrent
  first-hit, factory/validator failures, unregister-during-resolve
  zombie guard, eager↔lazy re-registration

https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd
When resolve() fails (factory raises, validator raises, or validator
returns False), the _factories entry was left intact, causing every
subsequent resolve() to re-enter the lazy path and invoke the dead
factory again. A disabled tenant requires operator recheck() to
recover — repeated silent retries are incorrect.

Fix: pop _factories alongside the _health="disabled" write in all
three failure paths. Also pop on success (factory no longer needed
once the platform is cached in _platforms).

Adds test: test_resolve_factory_failure_does_not_retry_on_subsequent_calls

https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd
Two correctness issues found in pre-PR review:

1. register_lazy(await_first_validation=True) + validator returns False:
   the platform was being written to _platforms even though the tenant
   is disabled, and the factory was left in _factories. Mirrors the
   resolve() cold-path: on validator failure, discard the platform and
   clear the factory so the disabled tenant behaves consistently
   regardless of how it reached that state.

2. resolve() docstring claimed it returns None when validator returns
   False — only true for the lazy cold-path. The fast-path (eager or
   previously-resolved lazy) returns TenantResolution(health="disabled").
   Rewritten to make the contract unambiguous and to warn against
   gating solely on result is None.

Also adds:
- Class docstring: do-not-use-as-SubdomainTenantRouter warning (same
  resolve(host) signature, incompatible return type)
- recheck() docstring: lazy-pending and lazy-disabled caveats (recheck
  on pending-lazy advances health without building the platform; recheck
  on factory-disabled is insufficient — re-register is required)
- Test: register_lazy + await_first_validation + validator=False must
  not cache platform and must not retry factory on subsequent resolve()

https://claude.ai/code/session_01DRv6qahN7Jjt3Q4oxGBXkd
@bokelley bokelley force-pushed the claude/issue-619-tenant-registry branch from 743029a to b4d18db Compare May 10, 2026 15:15
@bokelley bokelley merged commit ae687b6 into main May 10, 2026
1 of 2 checks passed
@bokelley bokelley deleted the claude/issue-619-tenant-registry branch May 10, 2026 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(server): TenantRegistry parity with JS createTenantRegistry

2 participants