[KPIP] LDAP Authentication — UnboundID-backed pool with failover, request coalescing, and observability #7497
moelhoussein started this conversation in Ideas
Replies: 1 comment
+1, I heard that UnboundID LDAP SDK is a better alternative to the JDK's built-in JNDI API. The proposal looks great. One small nit about the config names: use camelCase for words and dots for namespaces.
- kyuubi.authentication.ldap.pool.health.check.interval
+ kyuubi.authentication.ldap.pool.healthCheck.interval
- kyuubi.authentication.ldap.cache.max.size
+ kyuubi.authentication.ldap.cache.maxSize
Abstract
This proposal replaces Kyuubi's JNDI-based LDAP authentication path with an UnboundID LDAP SDK stack that adds (1) ordered server failover that covers both bind and search operations, (2) a persistent pre-bound connection pool, (3) result-code-driven failure classification, (4) request coalescing with a short-TTL cache, and (5) a first-class metrics surface for both auth outcomes and pool health. Existing config keys, providers, and fallback to ephemeral connections are preserved.
Q1. What are you trying to do?
Make Kyuubi's LDAP login path:
Q2. How is it done today, and what are the limits of current practice?
Kyuubi's LdapAuthenticationProviderImpl uses the JDK's JNDI (com.sun.jndi.ldap.*) opened via InitialDirContext per request.
CommunicationException and the user sees a 403.
NamingException with free-text
The first row is the most important: deployments that put a load balancer or hostname-fronted cluster in front of multiple LDAP servers expect "configure two URLs, get full failover." With JNDI today, that expectation only holds for the bind step. A SEARCH that fails mid-operation is not retried elsewhere.
Q3. What is new in your approach, and why do you think it will be successful?
Core innovation: UnboundID SDK pool with classifier-aware coalescing cache
Key design decisions
1. FailoverServerSet over JNDI URL-list: failover covers bind and subsequent operations. The pool obtains a connection through FailoverServerSet, which advances to the next URL on any connection-level failure. Because all auth work (bind + search) runs against a pooled connection that is health-checked and replaced on failure, a server that goes down mid-search results in the next checkout (or the same caller's retry through the SDK's autoReconnect) landing on a different server. This closes the JNDI gap where only the initial bind benefited from failover.
2. Pre-bound pool with background health check. All connections are bound as the service account at creation; per-request work is a single bindRequest against a warm socket. A background thread validates idle connections against the root DSE. autoReconnect=true so a server bounce does not drain the pool.
3. Classifier-driven failure metrics. LDAPException.getResultCode() is mapped to exactly one of three stable buckets:
- invalid_credentials: ResultCode.INVALID_CREDENTIALS
- invalid_input
- infrastructure
4. Request-coalescing TTL cache. Guava Cache.get(key, Callable) collapses N concurrent identical authentications into one backend bind. Leader executes; waiters block on the LoadingValueReference. A wasLeader flag ensures each coalesced waiter increments the correct success/failure counter; the easy bug to ship without it would be a silent metric undercount on the waiter path.
5. Lifecycle ordering invariant. MetricsSystem is registered before frontends in KyuubiServer.initialize. CompositeService.stop tears down in reverse order, so frontends stop accepting auth requests before MetricsSystem (and the pool) close. No race where the auth path increments a closed counter.
6. Fallback parity. If bind.user/bind.password are not configured, the pool is not instantiated; the code falls back to ephemeral per-request connections, giving behavioral parity with the JNDI path for the no-bind-user case.
Why it will be successful
kyuubi.authentication.ldap.*), same provider hook; existing deployments work without new config.
Q4. Who cares? If you are successful, what difference will it make?
infrastructure counter is the only signal that should ever page them
Headline outcomes: ≥ 70% reduction in backend bind volume at the directory under steady-state load, zero user-visible auth failures from any single LDAP server outage (including mid-request server loss), and incident-response time bounded by dashboard read latency rather than log-grep cycle time.
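The three-bucket classification behind the infrastructure counter (key design decision 3) could look roughly like the sketch below. It is stdlib-only Java that works on numeric RFC 4511 result codes rather than the SDK's ResultCode type. The class name LdapFailureBuckets is hypothetical, only the INVALID_CREDENTIALS mapping comes from the proposal, and the set of codes treated as invalid_input (21 and 34) is an assumption made for illustration.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: maps a numeric LDAP result code to exactly one failure bucket. */
final class LdapFailureBuckets {

    enum Bucket {
        INVALID_CREDENTIALS, INVALID_INPUT, INFRASTRUCTURE;

        /** Metric-name suffix, e.g. "invalid_credentials". */
        String metricSuffix() {
            return name().toLowerCase(Locale.ROOT);
        }
    }

    // RFC 4511: result code 49 is invalidCredentials.
    private static final int INVALID_CREDENTIALS_CODE = 49;

    // ASSUMPTION (not from the proposal): codes counted as caller error.
    // RFC 4511: 21 = invalidAttributeSyntax, 34 = invalidDNSyntax.
    private static final Set<Integer> INVALID_INPUT_CODES = Set.of(21, 34);

    private LdapFailureBuckets() {}

    static Bucket classify(int resultCode) {
        if (resultCode == INVALID_CREDENTIALS_CODE) {
            return Bucket.INVALID_CREDENTIALS;
        }
        if (INVALID_INPUT_CODES.contains(resultCode)) {
            return Bucket.INVALID_INPUT;
        }
        // Everything else is treated as an infrastructure failure in this sketch.
        return Bucket.INFRASTRUCTURE;
    }
}
```

Defaulting unrecognised codes to infrastructure is a choice of this sketch: an unexpected code then shows up on the counter that operators watch instead of being silently attributed to users.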
Q5. How will you measure success?
These are the implementation-level bars the code must clear before the feature ships. Targets are verifiable from tests or short-run benchmarks — not from long-term ops outcomes (which appear in Q4).
LDAPException result codes land in the correct bucket (invalid_credentials / invalid_input / infrastructure)
ResultCode value
max.connections sockets to the directory (not 1000)
MetricsSystem.close(); no auth attempt after pool.close()
bind.user is unset, the existing LdapAuthenticationProviderImpl test suite passes against the new ephemeral-connection code path (same accept/reject decisions as JNDI)
Q6. What are the mid-term and final "exams"?
Phase-gate milestones. Each phase ships independently and is gated on a single concrete demonstration.
Phase 1: Pool + failover + classifier (mid-term)
- UnboundIdConnectionPool with FailoverServerSet instantiated on first auth
- LdapAuthFailureClassifier + three failure-bucket counters
- numAvailable, numFailed/SuccessfulCheckouts, numConnectionsClosed{Defunct,Expired})
- LoggingHealthCheck WARN on connection discard
Exam: with two LDAP servers configured and the first one killed mid-request, no user sees an auth failure, and the infrastructure counter increments for the failed connection.
Phase 2: Coalescing cache + benchmark (final)
- CachingLdapAuthenticationProvider with HMAC-keyed Guava cache
Exam: the burst-load benchmark shows ≥ 70% reduction in backend bind QPS vs. the JNDI baseline at the same user-session QPS, and the coalescing concurrency test passes 100 consecutive runs without flaking.
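The coalescing behaviour that the Phase 2 concurrency test exercises can be illustrated without Guava. The sketch below is not the proposal's CachingLdapAuthenticationProvider: it swaps Guava's Cache.get(key, Callable) for a ConcurrentHashMap of CompletableFutures, covers only in-flight coalescing (no TTL), and uses hypothetical names (CoalescingAuthenticator, Result, backendBinds).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch: concurrent identical authentications share one backend bind. */
final class CoalescingAuthenticator {

    /** Outcome plus whether this caller executed the backend bind itself. */
    static final class Result {
        final boolean authenticated;
        final boolean wasLeader;

        Result(boolean authenticated, boolean wasLeader) {
            this.authenticated = authenticated;
            this.wasLeader = wasLeader;
        }
    }

    private final ConcurrentHashMap<String, CompletableFuture<Boolean>> inFlight =
            new ConcurrentHashMap<>();

    /** Counts backend binds actually executed; exposed for tests and metrics. */
    final AtomicInteger backendBinds = new AtomicInteger();

    Result authenticate(String key, Supplier<Boolean> backendBind) {
        CompletableFuture<Boolean> mine = new CompletableFuture<>();
        CompletableFuture<Boolean> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Waiter path: block until the leader publishes its outcome.
            return new Result(existing.join(), false);
        }
        try {
            // Leader path: the only caller that touches the directory for this key.
            backendBinds.incrementAndGet();
            boolean ok = backendBind.get();
            mine.complete(ok);
            return new Result(ok, true);
        } catch (RuntimeException | Error e) {
            // Propagate the failure to waiters so they do not block forever.
            mine.completeExceptionally(e);
            throw e;
        } finally {
            // Later calls for the same key start a fresh bind.
            inFlight.remove(key, mine);
        }
    }
}
```

Returning wasLeader with the outcome lets every caller, leader or waiter, increment its own success or failure counter, which is the undercount the proposal's wasLeader flag is meant to prevent.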
Q7. What are the risks?
checkoutTimeoutMs configurable; numFailedCheckouts alertable
DisconnectHandler is follow-up (see Rejected Alternatives)
Implementation Roadmap
FailoverServerSet + ephemeral fallback
Configuration surface
All new keys default to safe values; existing deployments work without changes.
| Key | Default |
|---|---|
| kyuubi.authentication.ldap.pool.min.connections | 1 |
| kyuubi.authentication.ldap.pool.max.connections | 10 |
| kyuubi.authentication.ldap.pool.health.check.interval | 60s |
| kyuubi.authentication.ldap.pool.checkout.timeout | 5s |
| kyuubi.authentication.ldap.cache.enabled | true |
| kyuubi.authentication.ldap.cache.ttl | 5m |
| kyuubi.authentication.ldap.cache.max.size | 10000 |

Metrics surface
Counters (Prometheus _total via DropwizardExports):
- authentication.success.ldap
- authentication.failure.ldap.{invalid_credentials, invalid_input, infrastructure}
- authentication.cache.hit.ldap / authentication.cache.miss.ldap
Gauges (live from LDAPConnectionPoolStatistics):
- numAvailableConnections, numFailedCheckouts, numSuccessfulCheckouts
- numConnectionsClosedDefunct, numConnectionsClosedExpired
Rejected Alternatives
Alternative 1: Stay on JNDI, add a retry decorator
Does not address the core gap: JNDI fails over only on the initial bind, not on operations against an already-established context. Retrying at the call-site level either reopens the bind (full TCP+TLS+BIND cost on the failure path) or re-uses the same broken context. Retries on INVALID_CREDENTIALS are actively harmful (account lockout). Does not enable pooling without re-implementing SDK pool semantics in tree.
Alternative 2: Apache Directory ldap-client-api
Smaller deployed footprint, less battle-tested pool implementation, no equivalent of FailoverServerSet's ordered-failover semantic at the ServerSet layer.
Alternative 3: Per-server failure attribution via wrapping FailoverServerSet
Deferred, not rejected. The SDK does not expose per-server failure events on the public ServerSet API; getting them requires either a DisconnectHandler (works only for already-established connections) or wrapping every getConnection call. Tracked as follow-up.
Alternative 4: Cache successful auths only, not failures
Defeats the coalescing benefit on the case that hurts most — credential stuffing or thundering-herd of wrong passwords. Short TTL bounds the cost.
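To make the trade-off concrete, here is a minimal stdlib-only sketch of a cache that stores both accepted and rejected outcomes under one TTL. It is not the Guava cache the proposal uses. TtlOutcomeCache and its injectable clock are hypothetical, and keys are assumed to be opaque strings that never contain the password.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch: caches success and failure outcomes alike for a fixed TTL. */
final class TtlOutcomeCache {

    private static final class Entry {
        final boolean authenticated;
        final long expiresAtMillis;

        Entry(boolean authenticated, long expiresAtMillis) {
            this.authenticated = authenticated;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clockMillis;

    TtlOutcomeCache(long ttlMillis, LongSupplier clockMillis) {
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    /** Records an outcome; a rejected bind is cached the same way as an accepted one. */
    void put(String key, boolean authenticated) {
        entries.put(key, new Entry(authenticated, clockMillis.getAsLong() + ttlMillis));
    }

    /** Returns the cached outcome, or null when the key is absent or expired. */
    Boolean get(String key) {
        Entry entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (clockMillis.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(key, entry); // drop only the entry we inspected
            return null;
        }
        return entry.authenticated;
    }
}
```

With failures cached, a burst of identical wrong-password attempts reaches the directory at most once per TTL window, and a corrected password takes effect as soon as the old entry expires.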
Alternative 5: Cache keyed by user only with a separate "verified" map of passwords
Would require holding password material in memory beyond the bind boundary. An HMAC-keyed boolean cache has a strictly smaller attack surface.
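One way an HMAC-derived cache key could be built is sketched below, using only the JDK's javax.crypto. The proposal does not spell out the key derivation, so the class name HmacCacheKey, the HmacSHA256 algorithm, the per-process random secret, and the length-prefixed input layout are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch: derives an opaque cache key so the cache never stores a password. */
final class HmacCacheKey {

    private static final String ALGORITHM = "HmacSHA256";

    private final SecretKeySpec secret;

    HmacCacheKey(byte[] secretBytes) {
        this.secret = new SecretKeySpec(secretBytes, ALGORITHM);
    }

    /** ASSUMPTION: a fresh random secret per process, so keys are useless after a restart. */
    static HmacCacheKey withRandomSecret() {
        byte[] bytes = new byte[32];
        new SecureRandom().nextBytes(bytes);
        return new HmacCacheKey(bytes);
    }

    String derive(String user, String password) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(secret);
            byte[] userBytes = user.getBytes(StandardCharsets.UTF_8);
            // Length-prefix the user so ("ab", "c") and ("a", "bc") yield different keys.
            mac.update(ByteBuffer.allocate(4).putInt(userBytes.length).array());
            mac.update(userBytes);
            mac.update(password.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(mac.doFinal());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(ALGORITHM + " is unavailable", e);
        }
    }
}
```

The cache then maps this derived string to a boolean outcome, so a heap dump of the cache exposes neither the password nor a plain unsalted hash of it.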
EDIT: Reference Implementation in this PR: #7506