Add circuit breaking mechanisms for offset auto-reset backfill#18501
Add circuit breaking mechanisms for offset auto-reset backfill#18501rseetham wants to merge 4 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18501 +/- ##
============================================
+ Coverage 63.68% 64.32% +0.63%
+ Complexity 1684 1126 -558
============================================
Files 3266 3311 +45
Lines 199836 203940 +4104
Branches 31023 31740 +717
============================================
+ Hits 127272 131186 +3914
+ Misses 62424 62228 -196
- Partials 10140 10526 +386
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Introduces four independent circuit breakers to prevent unbounded backfill triggering when a cluster is overwhelmed or restarts after prolonged downtime: 1. Pause flag per topic (`realtime.segment.offsetAutoReset.pause`): operator-set boolean in stream config; checked in computeStartOffset() before any backfill decision is made. 2. Max segments guard (`realtime.segment.offsetAutoReset.maxSegmentsBeforeBackfillSkip`): skips backfill trigger if table's segment count >= configured limit, preventing znode exhaustion when ingestion is permanently elevated. 3. Max concurrent backfills per controller (`controller.realtime.offsetAutoReset.maxConcurrentBackfillsPerController`): caps the number of tables that can simultaneously backfill on a single controller instance, guarding against cluster-restart storms. 4. Per-partition in-flight collision threshold (`controller.realtime.offsetAutoReset.maxBackfillCollisionsBeforeAutoPause`, default 3): tracks consecutive backfill-trigger attempts on a partition that already has an active backfill. Below the threshold the new trigger is allowed; at or above the threshold the topic's pause flag is set automatically and a metric is emitted requiring operator intervention. New ControllerMeter entries are added for each skipped-backfill scenario to enable alerting on all circuit breaker activations. Fixes: apache#18314
…uto-reset backfill Adds three new metrics to make the backfill circuit breaking feature fully observable without log parsing: - ControllerGauge.BACKFILL_TOPICS_IN_PROGRESS (per-table): snapshot of how many backfill Kafka topics are actively running for a table. - ControllerMeter.OFFSET_AUTO_RESET_HANDLER_INIT_FAILURE (per-table): fires when handler construction fails silently — previously unalertable. - ControllerMeter.OFFSET_AUTO_RESET_AUTO_PAUSE_FAILURE (per-table): fires when the ZK write to set the pause flag fails, so the circuit breaker appears to activate but the table keeps running. - ControllerMeter.OFFSET_AUTO_RESET_BACKFILL_CLEANUP_COMPLETED (per-table): fires when backfill topics finish cleanup; absence over time signals stuck backfills. Also fixes a bug in setPauseFlag() where the topic name was looked up using the bare key "topic.name" (StreamConfigProperties.STREAM_TOPIC_NAME) instead of the prefixed key "stream.<type>.topic.name" that stream config maps actually use. This caused auto-pause to silently skip writing the pause flag.
70f9fca to
f0742d1
Compare
xiangfu0
left a comment
There was a problem hiding this comment.
Found one high-signal issue; see inline comment.
| // Track collisions per (table, topic, partition); auto-pause if collisions exceed the threshold. | ||
| String partitionKey = tableNameWithType + ":" + topicName + ":" + partitionStr; | ||
| Set<String> activeBackfillTopics = _tableBackfillTopics.get(tableNameWithType); | ||
| boolean anyBackfillInFlight = activeBackfillTopics != null && !activeBackfillTopics.isEmpty(); |
There was a problem hiding this comment.
This breaker is described as per-partition, but the guard is actually table-wide: anyBackfillInFlight becomes true whenever any backfill topic exists for the table. After that, every unrelated (topic, partition) trigger increments its own collision counter and can auto-pause its source topic even though there was no collision on that partition. The check needs to prove that the active backfill matches the same source topic/partition, not just that the table has some backfill running.
There was a problem hiding this comment.
I made this more complicated than it needs to be. I only need to check collisions per partition, so I removed that check altogether now. Thanks for pointing it out.
…rity - Rename _tableTopicsUnderBackfill -> _sourceTopicsByTable - Rename _tableBackfillTopics -> _backfillTopicsByTable - Rename _partitionInFlightCollisionCount -> _partitionsInFlightCount - Fix collision logic: set counter to 1 on successful trigger; only increment on subsequent triggers for the same (table,topic,partition). Previously the guard was table-wide (any backfill in flight would trigger collision counting for unrelated partitions). - Add topic.partition key to BACKFILL_SKIPPED_IN_FLIGHT, BACKFILL_SKIPPED_PAUSED, and BACKFILL_OFFSETS metrics
Introduces four independent circuit breakers to prevent unbounded backfill triggering when a cluster is overwhelmed or restarts after prolonged downtime:
Pause flag per topic (
realtime.segment.offsetAutoReset.pause): operator-set boolean in stream config; checked in computeStartOffset() before any backfill decision is made.Max segments guard (
realtime.segment.offsetAutoReset.maxSegmentsBeforeBackfillSkip): skips backfill trigger if table's segment count >= configured limit, preventing znode exhaustion when ingestion is permanently elevated.Max concurrent backfills per controller (
controller.realtime.offsetAutoReset.maxConcurrentBackfillsPerController): caps the number of tables that can simultaneously backfill on a single controller instance, guarding against cluster-restart storms.Per-partition in-flight collision threshold (
controller.realtime.offsetAutoReset.maxBackfillCollisionsBeforeAutoPause, default 3): tracks consecutive backfill-trigger attempts on a partition that already has an active backfill. Below the threshold the new trigger is allowed; at or above the threshold the topic's pause flag is set automatically and a metric is emitted requiring operator intervention.New ControllerMeter entries are added for each skipped-backfill scenario to enable alerting on all circuit breaker activations.
Fixes: #18314
Deployed and tested.

bugfix