Migrate to proper circuit breakers. #4432

Open
markrmiller wants to merge 1 commit into apache:main from markrmiller:proper-circuit-breakers

@markrmiller markrmiller commented May 16, 2026

Experimental branch. Not for contribution review.

Move circuit breaker enforcement into an async SolrQoSFilter

Motivation

The previous design enforced circuit breakers synchronously inside SearchHandler.checkCircuitBreakers and ContentStreamHandlerBase.checkCircuitBreakers. When a breaker tripped, requests were rejected immediately with a 429. During transient stress events (a 1-2 second GC pause, a burst of expensive faceted queries, brief CPU saturation), every request that arrived got an error even though the underlying condition would clear in moments. Operators saw error spikes out of proportion to actual cluster trouble.

This PR replaces that with a Jetty-QoSFilter-style admission control layer: when a breaker is tripped, requests are suspended via AsyncContext and held in a small priority queue until the breaker clears or the suspension times out. Once the breaker clears, they are dispatched as normal. Transient pressure then shows up as added latency instead of a wall of 429s.

Architectural shift

| Aspect | Before | After |
|---|---|---|
| Where breakers are checked | Inside SearchHandler / ContentStreamHandlerBase, after dispatch | In SolrQoSFilter, before dispatch |
| What a tripped breaker does | Synchronous 429 | Asynchronous suspend, dispatch when clear |
| Priority awareness | None (first-come, first-served rejection) | Three-lane priority queue: admin/probe > query > update |
| Sub-shard deadlock risk | Possible: a coordinator could be 429'd waiting on its own sub-shards | Mitigated: internal traffic bypasses suspension |
| Cost of breaker checks under load | Each request re-polled OS/JVM metrics | Cached scan shared across requests within an evaluation window |

What's in this PR

SolrQoSFilter (new)

  • Synchronously checks CircuitBreakerRegistry.checkTrippedAcrossCores(cc, type) on each request.
  • If tripped and queueing is enabled (the new default), suspends via req.startAsync() and parks the AsyncContext in a priority+type queue.
  • If tripped and queueing is disabled, returns a synchronous 429 — preserves the legacy handler behavior for operators who deliberately want fail-fast.
  • A scheduled drainAll (default every 100 ms) re-checks the breaker and dispatches queued requests when it clears, walking lanes in priority order (HIGH → MEDIUM → LOW) and types within each lane.
  • Suspended requests get an AsyncListener so timeouts and client disconnects are accounted correctly (counters decremented, response written, slot freed).
  • Internal intra-cluster traffic — detected via Solr-Request-Context: SERVER header or the isShard=true / distrib.from= query params — bypasses both the breaker check and the suspension queue to avoid the obvious distributed-deadlock failure mode (parent request waits for sub-shard, sub-shard is queued waiting for parent's breaker to clear).
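
The admission decision described in the bullets above can be sketched as a pure function. This is a minimal model, not the PR's code: AdmissionSketch, Outcome, and the laneHasRoom flag are hypothetical names, exact-match comparison of the header and parameter values is an assumption, and the real filter works with servlet requests and AsyncContext rather than plain maps.

```java
import java.util.Map;

/** Hypothetical model of the filter's admission decision; names are illustrative only. */
final class AdmissionSketch {
  enum Outcome { PASS, SUSPEND, REJECT }

  /**
   * Internal traffic as described above: the Solr-Request-Context: SERVER header,
   * or the isShard=true / distrib.from query params. Assumes map keys are already
   * normalized to these exact spellings.
   */
  static boolean isInternal(Map<String, String> headers, Map<String, String> params) {
    return "SERVER".equals(headers.get("Solr-Request-Context"))
        || "true".equals(params.get("isShard"))
        || params.containsKey("distrib.from");
  }

  static Outcome decide(
      boolean internal, boolean tripped, boolean queueingEnabled, boolean laneHasRoom) {
    // Internal traffic skips the breaker check entirely to avoid distributed deadlock.
    if (internal || !tripped) return Outcome.PASS;
    // Queueing disabled: legacy fail-fast behavior (a synchronous 429).
    if (!queueingEnabled) return Outcome.REJECT;
    // Queueing enabled: park the request if its lane still has capacity.
    return laneHasRoom ? Outcome.SUSPEND : Outcome.REJECT;
  }
}
```

Keeping the decision free of servlet types makes the four branches easy to unit test in isolation from the async machinery.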

Priority lanes

Requests are sorted into three lanes:

| Lane | Default population |
|---|---|
| HIGH | /admin/ping, /admin/info |
| MEDIUM | User queries |
| LOW | Updates |

Clients can override via the new Solr-Request-Priority request header (HIGH / MEDIUM / LOW, case-insensitive). Unknown values fall through to the path/type heuristic. SolrJ already populates Solr-Request-Type; the priority hint is opt-in for code paths that want explicit control without parsing the request body.
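
A rough sketch of that resolution order: header first, then the path/type heuristic. PrioritySketch and its resolve signature are hypothetical; trimming the header value and matching paths by substring are assumptions rather than confirmed details of the PR.

```java
import java.util.Locale;

/** Hypothetical lane resolution: explicit header wins, unknown values fall through. */
final class PrioritySketch {
  enum Priority { HIGH, MEDIUM, LOW }

  static Priority resolve(String headerValue, String path, boolean isUpdate) {
    if (headerValue != null) {
      try {
        // Case-insensitive match against HIGH / MEDIUM / LOW.
        return Priority.valueOf(headerValue.trim().toUpperCase(Locale.ROOT));
      } catch (IllegalArgumentException unknown) {
        // Unknown value: ignore the header and use the heuristic below.
      }
    }
    if (path != null && (path.contains("/admin/ping") || path.contains("/admin/info"))) {
      return Priority.HIGH;
    }
    return isUpdate ? Priority.LOW : Priority.MEDIUM;
  }
}
```

For example, `resolve("high", "/update", true)` lands in HIGH because the header overrides the default LOW lane for updates, while `resolve("urgent", "/select", false)` falls back to MEDIUM.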

Per-priority admission caps

The lanes exist to prevent one failure mode in particular: a flood of LOW-priority bulk updates filling the suspension queue and locking out HIGH-priority admin/health probes. A single global counter would have left that hole open. Each lane therefore has its own admission cap, partitioned out of maxSuspendedRequests via configurable shares:

solr.circuitbreaker.qos.priority.high.share    (default 0.10)
solr.circuitbreaker.qos.priority.medium.share  (default 0.60)
# LOW takes the remainder

At the default maxSuspendedRequests=1024: HIGH=102, MEDIUM=614, LOW=308. A saturated LOW lane refuses further LOW requests but leaves HIGH and MEDIUM headroom intact.
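
The arithmetic behind those numbers, as a small sketch. LaneCapsSketch is a hypothetical name, and truncating the fractional part is an assumption that happens to match the figures quoted above.

```java
/** Hypothetical partition of maxSuspendedRequests into per-lane caps. */
final class LaneCapsSketch {
  /** Returns {high, medium, low}; LOW takes whatever HIGH and MEDIUM leave over. */
  static int[] caps(int maxSuspended, double highShare, double mediumShare) {
    int high = (int) (maxSuspended * highShare);     // truncates toward zero (assumption)
    int medium = (int) (maxSuspended * mediumShare);
    int low = maxSuspended - high - medium;
    return new int[] {high, medium, low};
  }

  public static void main(String[] args) {
    int[] c = caps(1024, 0.10, 0.60);
    System.out.println("HIGH=" + c[0] + " MEDIUM=" + c[1] + " LOW=" + c[2]);
  }
}
```

Giving LOW the remainder guarantees the three caps always sum to maxSuspendedRequests, regardless of rounding.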

Priority-scaled drain budget

drainBudget (the number of requests dispatched per lane per drain tick) is scaled by priority:

HIGH:   drainBudget × 4.0   (configurable: qos.drainBudget.highMultiplier)
MEDIUM: drainBudget × 1.0
LOW:    drainBudget × 0.5   (configurable: qos.drainBudget.lowMultiplier)

So HIGH clears in essentially one tick, MEDIUM clears at the baseline rate, and LOW yields capacity back to MEDIUM after a breaker recovery.
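
To make the "one tick" claim concrete, here is a sketch that computes how many drain ticks a full lane needs at the default caps, assuming a base drainBudget of 200. LaneDrainSketch and its methods are hypothetical, and rounding the scaled budget to the nearest integer is an assumption.

```java
/** Hypothetical per-lane drain arithmetic; not the PR's implementation. */
final class LaneDrainSketch {
  static int laneBudget(int baseBudget, double multiplier) {
    return (int) Math.round(baseBudget * multiplier);
  }

  /** Ticks needed to empty a lane holding `queued` requests (ceiling division). */
  static int ticksToDrain(int queued, int budgetPerTick) {
    return (queued + budgetPerTick - 1) / budgetPerTick;
  }

  public static void main(String[] args) {
    int base = 200; // assumed base drainBudget
    System.out.println("HIGH ticks:   " + ticksToDrain(102, laneBudget(base, 4.0)));
    System.out.println("MEDIUM ticks: " + ticksToDrain(614, laneBudget(base, 1.0)));
    System.out.println("LOW ticks:    " + ticksToDrain(308, laneBudget(base, 0.5)));
  }
}
```

With these inputs a full HIGH lane (102 requests, budget 800) empties in a single tick, while full MEDIUM and LOW lanes each take four.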

Auto-scaled drainBudget

The base drainBudget defaults to max(200, 2 × maxSuspendedRequests × checkIntervalMs / suspendTimeoutMs). The formula ensures a fully-saturated queue can drain twice within the suspension timeout window, so requests don't expire before they're resumed. Floors at the static default so small/default deployments are unchanged; large deployments (maxSuspendedRequests=100000) get the budget they actually need without manual tuning. Explicit qos.drainBudget overrides win.
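
The formula can be checked with a few lines of arithmetic. AutoDrainBudgetSketch is a hypothetical name, and integer division is an assumption; the PR may round differently.

```java
/** Hypothetical rendering of the auto-scaling formula quoted above. */
final class AutoDrainBudgetSketch {
  static final long STATIC_FLOOR = 200;

  static long autoBudget(long maxSuspended, long checkIntervalMs, long suspendTimeoutMs) {
    // suspendTimeoutMs / checkIntervalMs drain ticks fit in one timeout window, so
    // this budget lets 2 x maxSuspended requests drain before the first one expires.
    long scaled = 2 * maxSuspended * checkIntervalMs / suspendTimeoutMs;
    return Math.max(STATIC_FLOOR, scaled);
  }

  public static void main(String[] args) {
    System.out.println("defaults:     " + autoBudget(1024, 100, 5000));
    System.out.println("large deploy: " + autoBudget(100_000, 100, 5000));
  }
}
```

At the defaults the scaled value is 40, so the floor of 200 applies. At maxSuspendedRequests=100000 the budget becomes 4000, and 4000 requests per tick over 50 ticks is 200000, twice the queue size.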

Cached breaker scan

SolrQoSFilter caches CircuitBreakerRegistry.checkTrippedAcrossCores per request type for evaluationIntervalMs (default 200ms). Concurrent admission-control callers share one underlying breaker scan rather than each re-walking the registry and re-polling OS/JVM metrics.

MemoryCircuitBreaker rewrite

The old signal was a 30-second moving average of MemoryMXBean.getHeapMemoryUsage(). Under a generational collector, raw heap usage climbs steadily toward max between collections — that's the normal shape, not a problem. The moving average inherited that climb and tripped on healthy heaps.

The new signal reads MemoryPoolMXBean.getCollectionUsage() on the old/tenured pool, which reports the bytes resident immediately after the most recent collection that affected that pool. That's the only point at which "how full is the heap really?" has a defined answer. For non-generational collectors (non-generational ZGC, Shenandoah) the breaker sums getCollectionUsage() across every HEAP-typed pool. Threshold semantics are unchanged (percentage of max heap); the underlying signal is now meaningful.
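
A minimal sketch of reading that signal through the standard java.lang.management API. PostGcHeapSketch is a hypothetical name, and detecting the old/tenured pool by substring match on the pool name is an assumption made for illustration; the PR's actual detection logic is not shown in this description.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical post-GC live-bytes reader; not the PR's MemoryCircuitBreaker. */
final class PostGcHeapSketch {
  /** Assumption: old-generation pools have "Old" or "Tenured" in their name. */
  static boolean looksLikeOldGen(String poolName) {
    return poolName.contains("Old") || poolName.contains("Tenured");
  }

  static long postGcLiveBytes() {
    long oldGen = 0;
    long allHeap = 0;
    boolean sawOldGen = false;
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      if (pool.getType() != MemoryType.HEAP) continue;
      // Usage right after the most recent collection of this pool; null if unsupported.
      MemoryUsage afterGc = pool.getCollectionUsage();
      long used = afterGc == null ? 0 : afterGc.getUsed();
      allHeap += used;
      if (looksLikeOldGen(pool.getName())) {
        sawOldGen = true;
        oldGen += used;
      }
    }
    // Generational collectors: old/tenured only. Otherwise: sum over every heap pool.
    return sawOldGen ? oldGen : allHeap;
  }

  static double postGcPercentOfMax() {
    return 100.0 * postGcLiveBytes() / Runtime.getRuntime().maxMemory();
  }

  public static void main(String[] args) {
    System.out.println("post-GC live bytes: " + postGcLiveBytes());
    System.out.println("percent of max heap: " + postGcPercentOfMax());
  }
}
```

Before the first collection of a pool the reported value is typically zero, so a breaker built on this signal would not trip during early startup.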

GcOverheadCircuitBreaker (new)

Trips when the JVM is spending more than a configured percentage of wall-clock time in garbage collection over a sliding window. Complementary to MemoryCircuitBreaker:

  • MemoryCircuitBreaker fires when post-GC live data is exhausting the heap.
  • GcOverheadCircuitBreaker fires when GC is keeping up (live data may be small) but consuming so much CPU that the application is starving.

Both conditions usually precede an OOM, but each catches the other's blind spot. Configurable via solr.circuitbreaker.{update,query}.gcoverhead=<percent> and solr.circuitbreaker.gcoverhead.windowSeconds (default 30).
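
One way to compute such a ratio is to keep timestamped samples of cumulative GC time and compare the newest against the oldest still inside the window. The sketch below does that; GcOverheadWindow is a hypothetical class, and the sampling and eviction strategy is an assumption, not the PR's implementation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayDeque;

/** Hypothetical sliding-window GC overhead calculator. */
final class GcOverheadWindow {
  private final long windowMs;
  // Each sample is {wallClockMs, cumulativeGcMs}.
  private final ArrayDeque<long[]> samples = new ArrayDeque<>();

  GcOverheadWindow(long windowMs) {
    this.windowMs = windowMs;
  }

  /** Sum of approximate accumulated collection time across all collectors. */
  static long totalGcMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime(); // -1 when the collector does not report it
      if (t > 0) total += t;
    }
    return total;
  }

  /** Records a sample and returns the percent of wall-clock time spent in GC. */
  synchronized double record(long wallMs, long cumulativeGcMs) {
    samples.addLast(new long[] {wallMs, cumulativeGcMs});
    // Drop samples that have aged out of the window, always keeping the newest.
    while (samples.size() > 1 && wallMs - samples.peekFirst()[0] > windowMs) {
      samples.removeFirst();
    }
    long[] oldest = samples.peekFirst();
    long wallDelta = wallMs - oldest[0];
    if (wallDelta <= 0) return 0.0;
    return 100.0 * (cumulativeGcMs - oldest[1]) / wallDelta;
  }

  /** Convenience: sample the live JVM. */
  double sampleNow() {
    return record(System.nanoTime() / 1_000_000L, totalGcMillis());
  }
}
```

A breaker would compare the returned percentage against the configured gcoverhead threshold. Passing the clock and GC totals in as arguments keeps the window logic testable without a real JVM under GC pressure.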

TtlSampledMetric (new)

Tiny utility wrapping AtomicReference<Sample> for time-bounded caching of expensive metric reads. Used by CPUCircuitBreaker, LoadAverageCircuitBreaker, MemoryCircuitBreaker (post-GC live-bytes lookup), and GcOverheadCircuitBreaker (ratio computation). Configurable globally via solr.circuitbreaker.sampleTtlMs (default 1000ms). Stops high-QPS admission control from hammering OperatingSystemMXBean, Prometheus metric scans, and MemoryPoolMXBean walks on every request.
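
The idea can be sketched in a few lines. This is an illustrative stand-in, not the PR's TtlSampledMetric: the class name, constructor, and injectable clock are hypothetical, and it tolerates a benign race in which several threads refresh at once when the sample expires.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache around an expensive metric read. */
final class TtlSampledMetricSketch {
  private static final class Sample {
    final double value;
    final long takenAtNanos;

    Sample(double value, long takenAtNanos) {
      this.value = value;
      this.takenAtNanos = takenAtNanos;
    }
  }

  private final AtomicReference<Sample> cached = new AtomicReference<>();
  private final DoubleSupplier source;
  private final long ttlNanos;
  private final LongSupplier nanoClock;

  TtlSampledMetricSketch(DoubleSupplier source, long ttlMs, LongSupplier nanoClock) {
    this.source = source;
    this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMs);
    this.nanoClock = nanoClock;
  }

  double get() {
    long now = nanoClock.getAsLong();
    Sample s = cached.get();
    if (s != null && now - s.takenAtNanos < ttlNanos) {
      return s.value; // still fresh: skip the expensive read
    }
    Sample fresh = new Sample(source.getAsDouble(), now);
    cached.set(fresh); // concurrent refreshers may overwrite each other; both are recent
    return fresh.value;
  }
}
```

In production code the clock would be `System::nanoTime`; injecting it here lets a test advance time deterministically.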

CircuitBreakerRegistry additions

  • checkTrippedGlobal(SolrRequestType) — static; consults only the process-wide global map. Used by filter-tier callers that have no per-core context. Warn-only breakers excluded.
  • checkTrippedLocal(SolrRequestType) — per-instance; consults only this registry's per-core breakers. Warn-only excluded.
  • checkTrippedAcrossCores(CoreContainer, SolrRequestType): combines global + every per-core registry. Used by SolrQoSFilter. Iterates all cores in the container; if any core's breaker for the type is tripped, the request is treated as tripped for the whole node. This is intentionally conservative, because the filter doesn't yet know which core a request will resolve to.
  • Recognizes a new gcoverhead breaker type in parseCircuitBreakersFromProperties.
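
The combining rule for checkTrippedAcrossCores can be modeled as follows. The Breaker interface and AcrossCoresSketch class are hypothetical stand-ins for the real registry types, and excluding warn-only breakers here is an assumption carried over from the global and local variants.

```java
import java.util.List;

/** Hypothetical model of "global + every per-core registry" evaluation. */
final class AcrossCoresSketch {
  /** Stand-in for a registered circuit breaker. */
  interface Breaker {
    boolean isTripped();

    boolean isWarnOnly();
  }

  static Breaker breaker(boolean tripped, boolean warnOnly) {
    return new Breaker() {
      @Override public boolean isTripped() { return tripped; }
      @Override public boolean isWarnOnly() { return warnOnly; }
    };
  }

  static boolean anyEnforcedTripped(List<Breaker> breakers) {
    for (Breaker b : breakers) {
      if (!b.isWarnOnly() && b.isTripped()) return true;
    }
    return false;
  }

  /** Tripped if any global breaker or any single core's breaker is tripped. */
  static boolean checkTrippedAcrossCores(List<Breaker> global, List<List<Breaker>> perCore) {
    if (anyEnforcedTripped(global)) return true;
    for (List<Breaker> core : perCore) {
      if (anyEnforcedTripped(core)) return true;
    }
    return false;
  }
}
```

The conservative part is visible in the loop: one tripped core is enough to hold requests destined for every other core on the same node.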

Handler changes

  • SearchHandler.checkCircuitBreakers removed.
  • ContentStreamHandlerBase.checkCircuitBreakers removed.
  • Tests that previously asserted these handlers threw SolrException on tripped breakers now assert against the registry directly.

web.xml

  • New SolrQoSFilter mapping.
  • <async-supported>true</async-supported> propagated to all filters in the chain ahead of SolrServlet (RequiredSolrRequestFilter, TracingFilter, AuthenticationFilter, RateLimitFilter) — required by the Servlet spec for startAsync() to work.

Configuration reference

All new system properties:

# QoS filter
solr.circuitbreaker.qos.enabled                       (default true)
solr.circuitbreaker.qos.maxSuspendedRequests          (default 1024)
solr.circuitbreaker.qos.suspendTimeoutMs              (default 5000)
solr.circuitbreaker.qos.checkIntervalMs               (default 100)
solr.circuitbreaker.qos.evaluationIntervalMs          (default 200)
solr.circuitbreaker.qos.drainBudget                   (auto-scaled, floor 200)
solr.circuitbreaker.qos.priority.high.share           (default 0.10)
solr.circuitbreaker.qos.priority.medium.share         (default 0.60)
solr.circuitbreaker.qos.drainBudget.highMultiplier    (default 4.0)
solr.circuitbreaker.qos.drainBudget.lowMultiplier     (default 0.5)

# Breakers
solr.circuitbreaker.sampleTtlMs                       (default 1000)
solr.circuitbreaker.{update,query}.gcoverhead         (no default — register to enable)
solr.circuitbreaker.gcoverhead.windowSeconds          (default 30)

New request header:

Solr-Request-Priority: HIGH | MEDIUM | LOW
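
From a plain Java client, opting into a lane only requires setting that header. The sketch below uses the JDK's java.net.http API; PriorityHeaderSketch is a hypothetical helper, and the URL in the usage note is a placeholder for a local node.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that tags a GET request with an explicit priority lane. */
final class PriorityHeaderSketch {
  static HttpRequest withPriority(URI uri, String priority) {
    return HttpRequest.newBuilder(uri)
        .header("Solr-Request-Priority", priority) // HIGH | MEDIUM | LOW
        .GET()
        .build();
  }
}
```

A caller would build the request with something like `withPriority(URI.create("http://localhost:8983/solr/admin/info/system"), "HIGH")` and send it with an HttpClient as usual.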

New metrics (counter unless noted):

qos.suspended.total
qos.suspended.current   (observable gauge)
qos.resumed.total
qos.expired.total
qos.rejected.total

Behavior changes

  1. qos.enabled defaults to true. New deployments will suspend requests asynchronously when a breaker trips instead of failing fast. Operators who want the legacy synchronous 429 can set solr.circuitbreaker.qos.enabled=false.
  2. MemoryCircuitBreaker semantics changed. The threshold is now compared against post-GC live data rather than a moving average of raw heap usage. The breaker should fire less often on healthy generational-GC clusters; thresholds previously tuned around the noisy signal may now sit higher than the new signal warrants and can be lowered. Tuning review recommended at upgrade.
  3. Async dispatch in the filter chain. All filters ahead of SolrServlet are now declared async-supported. Downstream Solr filters/servlets are unchanged, but custom third-party filters injected into the chain must also declare async-supported or startAsync() will throw.

Move circuit-breaker admission control out of SearchHandler and
ContentStreamHandlerBase and into a new SolrQoSFilter that suspends
requests asynchronously when a breaker is tripped, dispatching them
once the breaker clears (modeled after Jetty's QoSFilter/QoSHandler).

Async queueing is enabled by default. Suspended requests are sorted
into three priority lanes (HIGH for admin/probe, MEDIUM for queries,
LOW for updates) and each lane has its own admission cap, so a flood
of LOW-priority work cannot reject HIGH-priority probes. Clients can
opt into a specific lane with the Solr-Request-Priority header. The
drain budget per lane is scaled by priority and auto-scales with
maxSuspendedRequests so a saturated queue drains within the suspension
timeout.

Internal intra-cluster shard requests bypass suspension to avoid
distributed deadlock. When QoS queueing is disabled, a tripped breaker
still fails fast synchronously, preserving the prior handler-enforced
behavior.

Add GcOverheadCircuitBreaker. Rewrite MemoryCircuitBreaker to read
post-GC live bytes from the old/tenured pool instead of a moving
average of raw heap usage. TTL-cache CPU and load-average samples so
high-QPS admission control does not repoll expensive OS/JVM signals
per request.
@markrmiller markrmiller force-pushed the proper-circuit-breakers branch from 792b19b to b0928d7 Compare May 20, 2026 18:27