[KPIP] LDAP Authentication — UnboundID-backed pool with failover, request coalescing, and observability #7497

moelhoussein · 2026-06-04T13:55:41Z

Abstract

This proposal replaces Kyuubi's JNDI-based LDAP authentication path with an UnboundID LDAP SDK stack that adds (1) ordered server failover that covers both bind and search operations, (2) a persistent pre-bound connection pool, (3) result-code-driven failure classification, (4) request coalescing with a short-TTL cache, and (5) a first-class metrics surface for both auth outcomes and pool health. Existing config keys, providers, and fallback to ephemeral connections are preserved.

Q1. What are you trying to do?

Make Kyuubi's LDAP login path:

Resilient — if one of the configured LDAP servers goes offline, users do not see auth failures. This must hold even when a server fails during an in-progress request, not only at the moment a new connection is opened.
Fast under load — keep a small set of authenticated connections open and reuse them, instead of opening a brand-new network connection for every single login.
Observable — when something goes wrong, an operator can tell within seconds whether it is "users typed the wrong password" or "the directory is sick," without grepping logs.
Honest — different failure causes produce different, machine-readable signals, instead of every problem looking like the same generic "access denied."

Q2. How is it done today, and what are the limits of current practice?

Kyuubi's LdapAuthenticationProviderImpl uses the JDK's JNDI LDAP provider (com.sun.jndi.ldap.*) and opens a new InitialDirContext for every request.

Limitation	Impact
JNDI's URL-list failover applies only at initial context creation (bind/connect). If a SEARCH on an established context fails mid-flight, JNDI does not retry against the next URL — the operation surfaces as `CommunicationException` and the user sees a 403.	Server failures that occur after the initial bind (mid-search server restart, dropped socket, idle-connection reset by an intermediary) are not recovered, even though another configured server is available.
No connection reuse — fresh TCP+TLS+BIND+SEARCH per request	Bursts hit directory connection-rate limits → cascading auth failures
Error opacity — every directory condition surfaces as `NamingException` with free-text	Wrong-password, refused-connection, TLS-handshake, timeout all indistinguishable
No metrics — zero counters or gauges for LDAP outcomes or pool health	Operators learn of outages from user tickets
No coalescing — N identical concurrent logins → N backend binds	Login-storm amplification against the directory
No success caching — every JDBC reconnect re-binds	Steady-state directory QPS = session-open QPS

The first row is the most important: deployments that put a load balancer or hostname-fronted cluster in front of multiple LDAP servers expect "configure two URLs, get full failover." With JNDI today, that expectation only holds for the bind step. A SEARCH that fails mid-operation is not retried elsewhere.

Q3. What is new in your approach, and why do you think it will be successful?

Core innovation: UnboundID SDK pool with classifier-aware coalescing cache

┌──────────────────────────────────────────────────────────┐
│  AuthenticationProviderFactory  (recorder + lifecycle)    │
└────────────────────────────┬─────────────────────────────┘
                             ▼
┌──────────────────────────────────────────────────────────┐
│  CachingLdapAuthenticationProvider                        │
│   Guava Cache<HMAC-SHA256(user+pw), Boolean>              │
│   wasLeader flag → correct metric attribution for waiters │
└────────────────────────────┬─────────────────────────────┘
                             ▼
┌──────────────────────────────────────────────────────────┐
│  LdapAuthenticationProviderImpl / UnboundIdDirSearchFactory│
│   LdapAuthFailureClassifier → {invalid_credentials,        │
│                                invalid_input,             │
│                                infrastructure}            │
└────────────────────────────┬─────────────────────────────┘
                             ▼
┌──────────────────────────────────────────────────────────┐
│  UnboundIdConnectionPool                                  │
│   FailoverServerSet([url1, url2, …])  (ordered failover)  │
│   LDAPConnectionPool(min, max)  (pre-bound, autoReconnect)│
│   LoggingHealthCheck  (WARN on connection discard)        │
└──────────────────────────────────────────────────────────┘

Key design decisions

1. FailoverServerSet over JNDI URL-list — failover covers bind and subsequent operations. The pool obtains a connection through FailoverServerSet, which advances to the next URL on any connection-level failure. Because all auth work (bind + search) runs against a pooled connection that is health-checked and replaced on failure, a server that goes down mid-search results in the next checkout (or the same caller's retry through the SDK's autoReconnect) landing on a different server. This closes the JNDI gap where only the initial bind benefited from failover.

2. Pre-bound pool with background health check. All connections are bound as the service account at creation, so per-request work is a single bindRequest against a warm socket. A background thread validates idle connections against the root DSE. autoReconnect=true ensures that a server bounce does not drain the pool.

3. Classifier-driven failure metrics. LDAPException.getResultCode() is mapped to exactly one of three stable buckets:

Bucket	Triggered by	Who pages
`invalid_credentials`	`ResultCode.INVALID_CREDENTIALS`	Nobody — user typed wrong password
`invalid_input`	Null/empty user or password, malformed DN	App team — caller-side bug
`infrastructure`	Connect refused, TLS, timeout, server unavailable	LDAP / directory on-call

4. Request-coalescing TTL cache. Guava Cache.get(key, Callable) collapses N concurrent identical authentications into one backend bind. Leader executes; waiters block on the LoadingValueReference. A wasLeader flag ensures each coalesced waiter increments the correct success/failure counter — the easy bug to ship without it would be a silent metric undercount on the waiter path.

5. Lifecycle ordering invariant. MetricsSystem is registered before frontends in KyuubiServer.initialize. CompositeService.stop tears down in reverse order, so frontends stop accepting auth requests before MetricsSystem (and the pool) close. No race where the auth path increments a closed counter.

6. Fallback parity. If bind.user / bind.password are not configured, the pool is not instantiated; the code falls back to ephemeral per-request connections — behavioral parity with the JNDI path for the no-bind-user case.
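
To make design decision 3 concrete, here is a minimal sketch of a three-bucket classifier. It is not the proposal's LdapAuthFailureClassifier: the class name, the method signature, and the choice to treat every unlisted result code as `infrastructure` are assumptions made for illustration. The numeric codes 49 (invalidCredentials) and 34 (invalidDNSyntax) come from RFC 4511, and plain integers are used so the example runs without the UnboundID SDK on the classpath.

```java
/** Illustrative classifier; names and the default branch are assumptions, not the KPIP's code. */
final class LdapFailureClassifierSketch {

    enum Bucket {
        INVALID_CREDENTIALS("invalid_credentials"),
        INVALID_INPUT("invalid_input"),
        INFRASTRUCTURE("infrastructure");

        final String label; // suffix used in the failure counter name

        Bucket(String label) {
            this.label = label;
        }
    }

    // Numeric LDAP result codes as defined in RFC 4511.
    static final int CODE_INVALID_DN_SYNTAX = 34;
    static final int CODE_INVALID_CREDENTIALS = 49;

    /** Maps one failed authentication attempt to exactly one bucket. */
    static Bucket classify(String user, String password, int resultCode) {
        // Caller-side problems are detected before the result code is considered.
        if (user == null || user.isEmpty() || password == null || password.isEmpty()) {
            return Bucket.INVALID_INPUT;
        }
        switch (resultCode) {
            case CODE_INVALID_CREDENTIALS:
                return Bucket.INVALID_CREDENTIALS;
            case CODE_INVALID_DN_SYNTAX:
                return Bucket.INVALID_INPUT;
            default:
                // Assumption: everything else (connect refused, TLS, timeout,
                // server unavailable) is a directory-side problem.
                return Bucket.INFRASTRUCTURE;
        }
    }
}
```

Because the mapping is a pure function of its inputs, the "parameterized unit test per ResultCode value" described in Q5 reduces to a table of inputs and expected buckets.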

Why it will be successful

Drop-in replacement. Same config namespace (kyuubi.authentication.ldap.*), same provider hook; existing deployments work without new config.
Battle-tested SDK. UnboundID LDAP SDK is Apache-2.0 and widely deployed; pool, failover, and health-check semantics are not reimplemented in tree.
Failure path is instrumented, not handwaved. The classifier, the per-bucket counters, and the connection-discard log line are specified before implementation — the gap that motivates the proposal is the gap the design measurably closes.

Q4. Who cares? If you are successful, what difference will it make?

Stakeholder	Before	After
End user	Auth failure on any LDAP hiccup; opaque error	Survives single-server outages transparently, including failures mid-request
Kyuubi operator	No LDAP signal in metrics; relies on log grep	Three-bucket failure counter + pool gauges; on-call can triage from a dashboard in seconds instead of grepping for stack traces
Directory / IAM team	Blamed for every auth failure, often wrongly	The `infrastructure` counter is the only signal that should ever page them
Directory itself	N binds per user burst, full reconnect each	One bind per coalesced burst; warm connections only
Postmortem authors	Outage scope inferable only from ticket volume	Pre / during / after windows visible in Prometheus, with clear attribution

Headline outcomes: ≥ 70% reduction in backend bind volume at the directory under steady-state load, zero user-visible auth failures from any single LDAP server outage (including mid-request server loss), and incident-response time bounded by dashboard read latency rather than log-grep cycle time.

Q5. How will you measure success?

These are the implementation-level bars the code must clear before the feature ships. Targets are verifiable from tests or short-run benchmarks — not from long-term ops outcomes (which appear in Q4).

Measurement	Target	How to verify
Failure classification correctness	100% of injected `LDAPException` result codes land in the correct bucket (`invalid_credentials` / `invalid_input` / `infrastructure`)	Parameterized unit test per `ResultCode` value
Coalescing correctness	N concurrent identical authentications → exactly 1 backend bind, and exactly N correctly-attributed metric increments (no waiter under-count on success or failure)	Concurrency test using a barrier-synchronized executor
Failover correctness — initial connect	With the first configured server unreachable, the pool establishes connections against the next server with no caller-visible failure	Integration test against an embedded directory pair
Failover correctness — mid-request	Connection drop during an authenticated SEARCH does not surface as a user-facing failure; the pool retries against another server	Integration test that closes the backing socket mid-operation
Pool reuse under load	After warmup, a 1000-request burst opens ≤ `max.connections` sockets to the directory (not 1000)	Counter on the test directory's connect handler
Lifecycle safety	No metric increment after `MetricsSystem.close()`; no auth attempt after `pool.close()`	Lifecycle test asserting ordering invariant
Behavioral parity in the no-bind-user fallback	When `bind.user` is unset, the existing `LdapAuthenticationProviderImpl` test suite passes against the new ephemeral-connection code path (same accept/reject decisions as JNDI)	Re-run the existing suite, no test changes allowed
Non-LDAP auth regression	NONE / KERBEROS / CUSTOM / JDBC providers: 0 test failures	Existing auth provider suites
Cache hit ratio under realistic burst	≥ 50% on the documented burst-load benchmark	Benchmark suite added with the patch
Backend bind reduction under realistic burst	≥ 70% vs. JNDI baseline on the same benchmark	Same benchmark, comparing branch vs. baseline

Q6. What are the mid-term and final "exams"?

Phase-gate milestones. Each phase ships independently and is gated on a single concrete demonstration.

Phase 1: Pool + failover + classifier (mid-term)

UnboundIdConnectionPool with FailoverServerSet instantiated on first auth
LdapAuthFailureClassifier + three failure-bucket counters
Pool gauges (numAvailableConnections, numFailedCheckouts, numSuccessfulCheckouts, numConnectionsClosedDefunct, numConnectionsClosedExpired)
Fallback to ephemeral connections when no bind user is configured
LoggingHealthCheck WARN on connection discard

Exam: with two LDAP servers configured and the first one killed mid-request, no user sees an auth failure, and the infrastructure counter increments for the failed connection.

Phase 2: Coalescing cache + benchmark (final)

CachingLdapAuthenticationProvider with HMAC-keyed Guava cache
Correct metric attribution for coalesced waiters (leader / waiter concurrency test)
Burst-load benchmark added to the test suite
User-facing documentation for all new config keys
Operator runbook for interpreting the new metrics

Exam: the burst-load benchmark shows ≥ 70% reduction in backend bind QPS vs. the JNDI baseline at the same user-session QPS, and the coalescing concurrency test passes 100 consecutive runs without flaking.
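
The coalescing behavior that the concurrency test must verify can be sketched with JDK classes only. This is a stand-in for the Guava-based CachingLdapAuthenticationProvider, not its implementation: `CoalescingAuthSketch`, `Outcome`, and the public counters are hypothetical names, and a `ConcurrentHashMap` of futures replaces Guava's `Cache.get(key, Callable)`. The first caller for a key becomes the leader and performs the backend bind. Every other caller within the TTL waits on the same future, and each caller records its own outcome.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** JDK-only illustration of request coalescing with a TTL; all names are hypothetical. */
final class CoalescingAuthSketch {

    static final class Outcome {
        final boolean authenticated;
        final boolean wasLeader; // true only for the caller that ran the backend bind

        Outcome(boolean authenticated, boolean wasLeader) {
            this.authenticated = authenticated;
            this.wasLeader = wasLeader;
        }
    }

    private static final class Entry {
        final CompletableFuture<Boolean> result = new CompletableFuture<>();
        final long expiresAtNanos;

        Entry(long expiresAtNanos) {
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlNanos;

    final AtomicInteger backendBinds = new AtomicInteger();
    final AtomicInteger successes = new AtomicInteger();
    final AtomicInteger failures = new AtomicInteger();

    CoalescingAuthSketch(Duration ttl) {
        this.ttlNanos = ttl.toNanos();
    }

    Outcome authenticate(String key, Supplier<Boolean> backendBind) {
        long now = System.nanoTime();
        Entry fresh = new Entry(now + ttlNanos);
        // Atomically install a new entry if none exists or the old one has expired.
        Entry current = entries.compute(key,
                (k, old) -> (old == null || now - old.expiresAtNanos >= 0) ? fresh : old);
        boolean wasLeader = current == fresh;
        if (wasLeader) {
            backendBinds.incrementAndGet();
            try {
                current.result.complete(backendBind.get());
            } catch (RuntimeException e) {
                entries.remove(key, current); // do not keep an errored entry around
                current.result.completeExceptionally(e);
            }
        }
        boolean ok = current.result.join(); // waiters block here until the leader finishes
        (ok ? successes : failures).incrementAndGet(); // one increment per caller
        return new Outcome(ok, wasLeader);
    }
}
```

In this sketch a backend exception removes the entry, so an infrastructure error is never served from the cache. Whether the real provider behaves the same way is not stated in the proposal.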

Q7. What are the risks?

Risk	Likelihood	Impact	Mitigation
Cache TTL extends window for revoked credentials	Medium	Medium	Short default TTL; documented; configurable to 0
Pool starvation under sustained burst	Low	Medium	`kyuubi.authentication.ldap.pool.checkout.timeout` configurable; `numFailedCheckouts` alertable
Behavior delta vs. JNDI on edge-case directories	Medium	Low	Ephemeral fallback path preserves legacy parity
Failed / unreachable server in the failover list is operationally silent — no log or metric attributes the failure to a specific URL	Medium	Low	Acknowledged gap — `DisconnectHandler` is follow-up (see Rejected Alternatives)
New UnboundID SDK dependency surface	Low	Low	Apache-2.0; well-isolated; no new transitive deps beyond the SDK jar
Lifecycle regression (closing pool with traffic in flight)	Low	High	Ordering invariant documented + asserted by integration test
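
The ordering invariant behind the lifecycle row can be illustrated with a tiny model of a composite service that stops its children in reverse start order. `CompositeLifecycleSketch` is a hypothetical class written for this example and does not reproduce Kyuubi's CompositeService. It only shows why registering MetricsSystem before the frontends means the frontends stop first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of start-in-order, stop-in-reverse; not Kyuubi's CompositeService. */
final class CompositeLifecycleSketch {

    private final Deque<String> started = new ArrayDeque<>();
    final List<String> events = new ArrayList<>();

    /** Starts a service and remembers it on top of the stack. */
    void start(String serviceName) {
        started.push(serviceName);
        events.add("start:" + serviceName);
    }

    /** Stops services in reverse start order (last started, first stopped). */
    void stopAll() {
        while (!started.isEmpty()) {
            events.add("stop:" + started.pop());
        }
    }
}
```

An integration test for the invariant could assert on an event log like this one: the frontend's stop event must precede the stop event of the metrics system and the pool.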

Implementation Roadmap

Phase	Scope	Estimated effort
Phase 0	Community discussion + design review	This KPIP
Phase 1	UnboundID pool + `FailoverServerSet` + ephemeral fallback
Phase 2	Failure classifier + auth-outcome counters + pool gauges
Phase 3	Coalescing cache + correct waiter attribution + concurrency tests
Phase 4	Docs, operator runbook, dashboard
Phase 5	Staging soak

Configuration surface

All new keys default to safe values; existing deployments work without changes.

Key	Type	Default	Purpose
`kyuubi.authentication.ldap.pool.min.connections`	int	`1`	Warm-pool lower bound
`kyuubi.authentication.ldap.pool.max.connections`	int	`10`	Upper bound; checkout fails if exhausted
`kyuubi.authentication.ldap.pool.health.check.interval`	duration	`60s`	Root-DSE validation cadence
`kyuubi.authentication.ldap.pool.checkout.timeout`	duration	`5s`	Max wait for a connection
`kyuubi.authentication.ldap.cache.enabled`	bool	`true`	Cache master switch
`kyuubi.authentication.ldap.cache.ttl`	duration	`5m`	Reuse window
`kyuubi.authentication.ldap.cache.max.size`	int	`10000`	Entry cap (LRU eviction)
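
How the three pool keys interact (minimum, maximum, checkout timeout) can be shown with a JDK-only sketch. This is not the UnboundID LDAPConnectionPool: `BoundedPoolSketch` and its members are hypothetical, the health check is omitted, and "opening a connection" is whatever the supplied factory does. The point is only that a bounded pool never opens more than the maximum number of connections, and that a caller waits at most the checkout timeout before failing.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Bounded pool illustration; assumes min <= max. Not the SDK's pool implementation. */
final class BoundedPoolSketch<C> {

    private final ConcurrentLinkedQueue<C> idle = new ConcurrentLinkedQueue<>();
    private final Semaphore permits; // one permit per connection that may be checked out
    private final Supplier<C> openAndBind;
    private final long checkoutTimeoutMillis;

    final AtomicInteger opened = new AtomicInteger();

    BoundedPoolSketch(int min, int max, long checkoutTimeoutMillis, Supplier<C> openAndBind) {
        this.permits = new Semaphore(max);
        this.checkoutTimeoutMillis = checkoutTimeoutMillis;
        this.openAndBind = openAndBind;
        for (int i = 0; i < min; i++) {
            idle.add(open()); // warm the pool up front
        }
    }

    private C open() {
        C conn = openAndBind.get();
        opened.incrementAndGet();
        return conn;
    }

    /** Returns an idle connection, opens a new one if allowed, or fails after the timeout. */
    C checkout() {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(checkoutTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) {
            throw new IllegalStateException("pool exhausted: checkout timed out");
        }
        C conn = idle.poll();
        if (conn != null) {
            return conn;
        }
        try {
            return open();
        } catch (RuntimeException e) {
            permits.release(); // do not leak the permit when the open fails
            throw e;
        }
    }

    /** Hands a connection back and frees its permit. */
    void release(C conn) {
        idle.add(conn);
        permits.release();
    }
}
```

With a minimum of 1 and a maximum of 2, a sequential burst of any size reuses the same two connections, which mirrors the "pool reuse under load" bar in Q5.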

Metrics surface

Counters (Prometheus _total via DropwizardExports):

authentication.success.ldap
authentication.failure.ldap.{invalid_credentials, invalid_input, infrastructure}
authentication.cache.hit.ldap / authentication.cache.miss.ldap

Gauges (live from LDAPConnectionPoolStatistics):

numAvailableConnections, numFailedCheckouts, numSuccessfulCheckouts
numConnectionsClosedDefunct, numConnectionsClosedExpired
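
As an illustration of the counter names above, the following sketch keeps one `LongAdder` per metric name. It deliberately avoids the Dropwizard and Prometheus libraries so that it stays self-contained. `LdapAuthCountersSketch` and its methods are hypothetical, and only the metric name strings are taken from this proposal.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** JDK-only counter holder; the real design registers counters with the metrics system. */
final class LdapAuthCountersSketch {

    private static final Set<String> BUCKETS =
            Set.of("invalid_credentials", "invalid_input", "infrastructure");

    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    void recordSuccess() {
        increment("authentication.success.ldap");
    }

    /** Rejects unknown buckets so the set of failure counters stays fixed at three. */
    void recordFailure(String bucket) {
        if (!BUCKETS.contains(bucket)) {
            throw new IllegalArgumentException("unknown failure bucket: " + bucket);
        }
        increment("authentication.failure.ldap." + bucket);
    }

    void recordCache(boolean hit) {
        increment(hit ? "authentication.cache.hit.ldap" : "authentication.cache.miss.ldap");
    }

    /** Returns the current value, or 0 if the counter was never incremented. */
    long value(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0L : adder.sum();
    }

    private void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }
}
```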

Rejected Alternatives

Alternative 1: Stay on JNDI, add a retry decorator

Does not address the core gap — JNDI fails over only on the initial bind, not on operations against an already-established context. Retrying at the call-site level either reopens the bind (full TCP+TLS+BIND cost on the failure path) or re-uses the same broken context. Retries on INVALID_CREDENTIALS are actively harmful (account lockout). Does not enable pooling without re-implementing SDK pool semantics in tree.
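
The lockout concern can be made concrete with a small sketch of what a safe retry has to look like: it moves to the next server only on a connection-level failure and never re-sends a request that received a definitive answer. This is an illustration, not the SDK's FailoverServerSet. `OrderedFailoverSketch` and `ConnectionFailure` are hypothetical names, and the use of `SecurityException` for a rejected credential in the usage example is likewise just a placeholder.

```java
import java.util.List;
import java.util.function.Function;

/** Ordered failover illustration; names are hypothetical and no LDAP library is used. */
final class OrderedFailoverSketch {

    /** Marks failures where the server could not be reached or the socket dropped. */
    static final class ConnectionFailure extends RuntimeException {
        ConnectionFailure(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation against each URL in configured order and returns the first
     * result. Only ConnectionFailure advances to the next URL. Any other exception,
     * such as a rejected credential, propagates immediately and is not retried.
     */
    static <T> T execute(List<String> urls, Function<String, T> op) {
        ConnectionFailure last = null;
        for (String url : urls) {
            try {
                return op.apply(url);
            } catch (ConnectionFailure e) {
                last = e; // remember the failure and try the next server
            }
        }
        throw last != null ? last : new ConnectionFailure("no LDAP servers configured");
    }
}
```

A call such as `OrderedFailoverSketch.execute(List.of("ldap://a:389", "ldap://b:389"), url -> bind(url))` (with `bind` standing in for the real operation) would try the second server only if the first one raised `ConnectionFailure`.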

Alternative 2: Apache Directory `ldap-client-api`

Smaller deployed footprint, less battle-tested pool implementation, no equivalent of FailoverServerSet's ordered-failover semantic at the ServerSet layer.

Alternative 3: Per-server failure attribution via wrapping `FailoverServerSet`

Deferred, not rejected. The SDK does not expose per-server failure events on the public ServerSet API; getting them requires either a DisconnectHandler (works only for already-established connections) or wrapping every getConnection call. Tracked as follow-up.

Alternative 4: Cache successful auths only, not failures

Defeats the coalescing benefit on the case that hurts most — credential stuffing or thundering-herd of wrong passwords. Short TTL bounds the cost.

Alternative 5: Cache keyed by user only with a separate "verified" map of passwords

Would require holding password material in memory beyond the bind boundary. The HMAC-keyed boolean cache exposes a strictly smaller attack surface.
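
For illustration, an HMAC-SHA256 cache key can be derived with the JDK's `javax.crypto.Mac`. The proposal only states that the key is HMAC-SHA256 over user and password. The per-process random secret, the length prefix that separates the two fields, and the name `CacheKeySketch` are assumptions of this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Derives an opaque cache key so no password material is stored in the cache. */
final class CacheKeySketch {

    // Assumption: a random secret generated per process, never persisted.
    private final byte[] secret = new byte[32];

    CacheKeySketch() {
        new SecureRandom().nextBytes(secret);
    }

    String keyFor(String user, String password) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] userBytes = user.getBytes(StandardCharsets.UTF_8);
            // Length prefix keeps ("ab", "c") and ("a", "bc") from producing the same input.
            mac.update(ByteBuffer.allocate(4).putInt(userBytes.length).array());
            mac.update(userBytes);
            mac.update(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : mac.doFinal()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

Because the secret lives only in memory, keys from one process are useless in another, and a heap dump of the cache exposes digests rather than passwords.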

EDIT: Reference Implementation in this PR: #7506

References

UnboundID LDAP SDK — Apache-2.0

pan3793 (Collaborator) · 2026-06-09T03:06:21Z

+1, I heard that the UnboundID LDAP SDK is a better alternative to the JDK's built-in JNDI API. The proposal looks great. One small nit about the config names: use camelCase for words and dots for namespaces.

- kyuubi.authentication.ldap.pool.health.check.interval
+ kyuubi.authentication.ldap.pool.healthCheck.interval

- kyuubi.authentication.ldap.cache.max.size
+ kyuubi.authentication.ldap.cache.maxSize
