Fix funnel aggregation worker threads not responding to query timeout / cancellation#17692
Merged
yashmayya merged 1 commit intoapache:masterfrom Feb 13, 2026
Merged
Conversation
❌ 3 Tests Failed:
View the full list of 3 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
Jackie-Jiang
approved these changes
Feb 13, 2026
023a495 to
83d33be
Compare
Contributor
Author
|
Weird, tests are failing with errors like: |
Contributor
Author
|
I'm seeing this on other PRs too, and it's clearly unrelated to this one. None of the test failures look related, so I'm merging this. The test failures can be investigated independently. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When a query using window funnel aggregation functions (
funnelMaxStep,funnelCompleteCount,funnelMatchStep) times out or is cancelled, the worker threads spawned byIndexedTable.finish()for the multi-threaded final reduce phase continue running indefinitely. Althoughfuture.cancel(true)is correctly called on the futures, the underlyingextractFinalResult()computations are tight CPU-bound loops that never check the thread interrupt flag, so the cancellation has no effect. Repeated timed-out queries compound this, eventually saturating the thread pool and pegging server CPU until a restart.Why this was missed
Most aggregation functions have trivial
extractFinalResult()implementations —SUMreturns the accumulated value,COUNTreturns a long,AVGdoes a division, and sketch-based functions like HLL just callestimate(). Thesecomplete in O(1) and are never a cancellation concern.
The funnel window functions are unique: they defer the actual computation to
extractFinalResult(), which performs asliding-window pattern match over the full
PriorityQueueof raw events accumulated during the aggregate/merge phases.This is the same reason they're the only functions listed in
IndexedTable.containsExpensiveAggregationFunctions()totrigger multi-threaded execution — but the corresponding cancellation-awareness was never added to the computation
itself.
Fix
Add
QueryThreadContext.checkTerminationAndSampleUsagePeriodically()calls inside every hot loop in the funnel window functions:FunnelBaseAggregationFunction.fillWindow()— the step-0 seeking loop (can drain millions of non-matching events)and the window-filling loop (can move millions of events into the sliding window when the window size is large)
FunnelMaxStepAggregationFunction.processWindow()— iterates the entire sliding windowFunnelMatchStepAggregationFunction.processWindow()— sameFunnelCompleteCountAggregationFunction.extractFinalResult()— inline sliding window iterationThese checks detect three cancellation signals: the
TerminationExceptionset byQueryExecutionContext.terminate(),the thread interrupt flag set by
future.cancel(true), and deadline expiration. Any of these causes the worker threadto throw and unwind immediately.
Overhead
checkTerminationAndSampleUsagePeriodicallyuses a bitmask (& 0x1FFF) so the actual check only fires every 8,192 loopiterations. The remaining iterations reduce to a single integer AND + branch prediction hit. When the check does fire,
it reads one volatile field, calls
Thread.interrupted(), and comparesSystem.currentTimeMillis()against thedeadline — all sub-microsecond operations, negligible relative to the per-event
PriorityQueue.poll()(O(log N)) andsliding window processing work done in each iteration.