Skip to content

feat(server): CallableSubdomainTenantRouter for DB-backed tenant lookups#544

Merged
bokelley merged 2 commits intomainfrom
claude/db-backed-subdomain-router
May 4, 2026
Merged

feat(server): CallableSubdomainTenantRouter for DB-backed tenant lookups#544
bokelley merged 2 commits intomainfrom
claude/db-backed-subdomain-router

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

@bokelley bokelley commented May 4, 2026

Summary

Adds an adopter-callable SubdomainTenantRouter that takes a single sync-or-async callable mapping a normalized host to a Tenant. Framework owns host normalization (lower-case + port-strip) and optionally provides a bounded TTL cache; adopters write ~5 LOC of glue against their tenant table instead of ~25 LOC of hand-rolled routing.

Closes the salesagent SDK_FEEDBACK round-2 #20. Reference impl in salesagent's core/main.py::_load_tenant_subdomain_map() is now a ~5-line CallableSubdomainTenantRouter instantiation.

Surface

from adcp.server import CallableSubdomainTenantRouter, Tenant

async def lookup(host: str) -> Tenant | None:
    subdomain = host.split(".", 1)[0]
    async with my_db.session() as s:
        row = await s.scalar(
            select(TenantRow).filter_by(subdomain=subdomain, is_active=True)
        )
    return Tenant(id=row.tenant_id, display_name=row.name) if row else None

# No caching (default — every request hits the DB)
router = CallableSubdomainTenantRouter(lookup)

# With opt-in bounded TTL cache
router = CallableSubdomainTenantRouter(
    lookup, cache_size=1024, cache_ttl_seconds=60.0,
)

# Adopter-driven eviction (e.g. on tenant deactivation)
router.invalidate("acme.example.com")  # one entry
router.invalidate()                     # whole cache

Design notes

Caching is opt-in. cache_size=0 (default) skips the cache entirely; the resolver is awaited on every request. Adopters opt in with explicit bounds.

Explicit TTL required when caching is enabled. No "cache forever" mode — the constructor raises ValueError when cache_size > 0 and cache_ttl_seconds <= 0. Tenants come and go (suspension, deactivation); long-lived caches without TTL are a stale-tenant footgun.

Negative results cached too. None returns are stored alongside positive hits so DOS-style probing for unknown hosts doesn't bypass the cache.

Bounded LRU via OrderedDict. No third-party dependency. Eviction on overflow is popitem(last=False) — oldest entry first.

Sync or async resolvers. Mirrors MediaBuyStore's pattern; inspect.isawaitable(result) decides whether to await.

Memory profile

This was designed with the salesagent slow-leak investigation in mind:

  • Without caching: zero state held by the router. Each resolve() calls the adopter directly.
  • With caching: bounded by cache_size entries × (host string + frozen Tenant + 16-byte expiry float). For a typical 1024-entry cache that's well under 1 MB. No unbounded growth paths.

Tests

14 new tests in tests/test_subdomain_tenant_router.py:

  • Normalization passes lower-cased + port-stripped host to the resolver
  • Sync and async callables both work
  • Default no-caching behavior
  • Cache dedupes within TTL (positive)
  • Cache dedupes within TTL (negative — None cached)
  • Cache evicts after TTL expires (uses monkeypatched time.monotonic)
  • Cache size is a hard ceiling — eldest entries evicted on overflow
  • invalidate(host) drops a specific entry; invalidate() clears all
  • invalidate is a no-op when caching is disabled
  • Constructor rejects cache_size > 0 without cache_ttl_seconds
  • Constructor rejects negative cache_size
  • End-to-end through the existing SubdomainTenantMiddleware

30 total tests pass in test_subdomain_tenant_router.py; 3801 framework tests pass with no regressions.

Test plan

  • New tests cover happy paths + every documented edge
  • Existing InMemorySubdomainTenantRouter tests still pass
  • Full framework test suite passes (3801 + 30)
  • Mypy clean on new code (the one removed unused # type: ignore in this PR)

🤖 Generated with Claude Code

Adds an adopter-callable :class:`SubdomainTenantRouter` that takes a single
sync-or-async callable mapping a normalized host to a Tenant. Framework
owns host normalization (lower-case + port-strip) and optionally provides
a bounded TTL cache; adopters write ~5 LOC of glue against their tenant
table instead of ~25 LOC of hand-rolled routing.

Closes salesagent SDK_FEEDBACK round 2 #20. Reference impl in
salesagent's core/main.py::_load_tenant_subdomain_map() collapses to a
~5-line CallableSubdomainTenantRouter instantiation.

Surface:
- CallableSubdomainTenantRouter(resolver, *, cache_size=0, cache_ttl_seconds=0.0)
- TenantResolver callable type alias

Caching is opt-in (cache_size > 0). Explicit TTL required when caching is
enabled — no "cache forever" mode (production safety against stale-tenant
footguns). Bounded LRU via OrderedDict (no third-party dependency).
Negative results cached too (DOS-style probing can't bypass).
invalidate(host=None) for adopter-driven eviction.

Memory profile: zero state without caching; with caching, bounded by
cache_size entries — typically <1MB for a 1024-entry cache. Designed
specifically with salesagent's slow-leak-investigation lens.

Tests: 14 new tests covering normalization, sync+async resolvers,
cache TTL/bound/invalidation, validation errors, end-to-end through the
existing middleware. 30 total pass in test_subdomain_tenant_router.py;
3801 framework tests pass with no regressions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@bokelley
Copy link
Copy Markdown
Contributor Author

bokelley commented May 4, 2026

Strong PR — design instincts are right (opt-in caching, explicit TTL required, negative caching, bounded LRU, no deps). A few thoughts:

Worth a docstring callout (not a blocker):

  1. Negative cache + tenant onboarding race. If a probe for acme.example.com hits at 11:59:55 (host doesn't exist yet) and the tenant is provisioned at 12:00:00, the cached None causes 404s for up to cache_ttl_seconds afterward. Adopters need to call invalidate(host) on tenant creation, not just deactivation. Worth one line in the class docstring next to the existing eviction guidance.

  2. Memory-profile claim is adopter-dependent. "Well under 1 MB at 1024 entries" is true if Tenant is small, but Tenant is an adopter-defined dataclass — they could stuff arbitrary metadata in it. More honest framing: bounded by cache_size × sizeof(your Tenant). Doesn't change the design, just the marketing.

Minor — not asking for a change:

  1. Thundering herd on cold cache. Two concurrent requests for the same uncached host both await the resolver. asyncio cooperative scheduling means no corruption (last-write wins on identical values), but the DB takes 2x calls instead of 1x for that brief window. Singleflight is overkill for v1; flag if salesagent sees DB-pressure spikes on cache restart.

  2. Acme.localhost:3001 vs acme.localhost. Cache key is the normalized host (lowercased + port-stripped per _normalize_host), so case/port variants converge to one entry. Worth pinning that with one test (assert resolver_calls == 1 after both casings hit).

Nothing here blocks merge — the design is sound and the test coverage is good.

…le in CallableSubdomainTenantRouter

Per @bokelley review on #544:
- Add explicit docstring callout that adopters must call `invalidate(host)` on
  tenant *creation* too (not only deactivation), because cached None results
  cause 404s until TTL expires when a new tenant is provisioned mid-cache.
- Update `invalidate()` param docs to mention creation alongside deactivation.
- Reframe memory-profile claim: sizeof(Tenant) is adopter-controlled (ext field),
  so "well under 1 MB" is replaced with the honest `cache_size × sizeof(your_Tenant)`.
- Add test pinning that case/port variants (Acme.localhost:3001 + acme.localhost)
  normalize to one cache key and hit the resolver exactly once.

https://claude.ai/code/${CLAUDE_CODE_REMOTE_SESSION_ID}
@bokelley
Copy link
Copy Markdown
Contributor Author

bokelley commented May 4, 2026

Thanks for the thorough review — pushed a follow-up commit (9fb4587) addressing the actionable items:

  1. Negative-cache + onboarding race — added an explicit callout in the class docstring, with an inline code snippet showing router.invalidate("acme.example.com") in the creation path. Also updated the invalidate() param docs to list creation alongside deactivation.

  2. Memory-profile framing — replaced "well under 1 MB" with the adopter-honest cache_size × sizeof(your_Tenant) framing, noting that ext is adopter-controlled.

  3. Case/port normalization test — added test_callable_router_case_and_port_variants_share_cache_entry which asserts call_count == 1 after both Acme.localhost:3001 and acme.localhost hit the router. All 31 tests pass.

Items 3 (thundering herd) and 4 (the test suggestion already addressed above) are noted — the thundering-herd callout is a good future issue if DB-pressure spikes show up in salesagent telemetry.


Generated by Claude Code

@bokelley bokelley merged commit 80876c1 into main May 4, 2026
15 checks passed
@bokelley
Copy link
Copy Markdown
Contributor Author

bokelley commented May 4, 2026

Thanks for the follow-up — the three addressed items (negative-cache onboarding callout, adopter-honest memory framing, and the case/port normalization test) look like solid resolutions. Deferring the thundering-herd callout to a follow-up issue is reasonable given current scope.


Triaged by Claude Code.


Generated by Claude Code

bokelley added a commit that referenced this pull request May 4, 2026
…struction (#552)

Drop-in alternative to PlatformRouter that defers per-tenant
DecisioningPlatform construction to first request, with a bounded LRU
+ TTL cache. Closes the eager-router pain point for adopters with
N tenants × per-tenant SDK auth handshake (Google Ad Manager
service-account, Kevel API key) where boot scales O(N).

* Async or sync factory; awaited via inspect.isawaitable, matching
  CallableSubdomainTenantRouter's convention (PR #544).
* Bounded cache: cache_size > 0 mandatory (default 256);
  cache_ttl_seconds >= 0 (default 3600.0; 0 = size-only eviction).
  Distinct from CallableSubdomainTenantRouter which rejects ttl=0 —
  there tenants go stale, here platform adapters don't (unless the
  factory reads mutable config — docstring calls that out).
* invalidate(tenant_id=None) for hot-reload. Per-tenant + global
  generation counter snapshots prevent the invalidate-during-build
  race from resurrecting an evicted slot.
* Drop-in: isinstance(router, DecisioningPlatform) is true,
  serve() accepts it identically, ACCOUNT_NOT_FOUND /
  UNSUPPORTED_FEATURE projection matches PlatformRouter.
* No singleflight in v1 — concurrent cold requests each build (locked
  by test_concurrent_cold_requests_each_build_v1_contract).
* proposal_managers stays an eager dict (cheap to hold).
* platform_for_tenant() async sibling-API parity for
  admin/health endpoints.
* Extracted _select_proposal_method module-level so PlatformRouter
  and LazyPlatformRouter share the same routing logic without drift.

Closes #547.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants