Fix funnel aggregation worker threads not responding to query timeout / cancellation #17692

Merged
yashmayya merged 1 commit into apache:master from yashmayya:funnel-aggregation-cancellation
Feb 13, 2026

@yashmayya
Contributor

Summary

When a query using window funnel aggregation functions (funnelMaxStep, funnelCompleteCount, funnelMatchStep) times out or is cancelled, the worker threads spawned by IndexedTable.finish() for the multi-threaded final reduce phase continue running indefinitely. Although future.cancel(true) is correctly called on the futures, the underlying extractFinalResult() computations are tight CPU-bound loops that never check the thread interrupt flag, so the cancellation has no effect. Repeated timed-out queries compound this, eventually saturating the thread pool and pegging server CPU until a restart.
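The failure mode described above can be reproduced outside Pinot. The following is a minimal sketch, not Pinot code: the class name CancelDemo, the method stopsAfterCancel, and the forceStop safety valve are all hypothetical and exist only so the demo terminates. It shows that future.cancel(true) merely sets the worker's interrupt flag, so a CPU-bound loop keeps running unless it checks that flag itself.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class CancelDemo {
  // Runs a CPU-bound loop on a worker thread, cancels it, and reports whether
  // the worker left the loop within 500 ms of future.cancel(true).
  static boolean stopsAfterCancel(boolean checkInterrupt) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch exited = new CountDownLatch(1);
    // Hypothetical safety valve so the demo itself always terminates.
    AtomicBoolean forceStop = new AtomicBoolean(false);
    try {
      Future<?> future = pool.submit(() -> {
        started.countDown();
        long acc = 1;
        while (!forceStop.get()) {
          acc = acc * 31 + 7; // stand-in for per-event work
          if (checkInterrupt && Thread.currentThread().isInterrupted()) {
            break; // cooperative exit once the interrupt flag is seen
          }
        }
        exited.countDown();
        return acc;
      });
      started.await();
      future.cancel(true); // sets the worker's interrupt flag, nothing more
      return exited.await(500, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      forceStop.set(true);
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("stops without check: " + stopsAfterCancel(false));
    System.out.println("stops with check:    " + stopsAfterCancel(true));
  }
}
```

In a real server there is no forceStop flag, which is why the un-checked loop keeps occupying its thread until the computation finishes on its own.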

Why this was missed

Most aggregation functions have trivial extractFinalResult() implementations: SUM returns the accumulated value, COUNT returns a long, AVG does a division, and sketch-based functions like HLL just call estimate(). These complete in time that does not grow with the number of aggregated rows and are never a cancellation concern.

The funnel window functions are unique: they defer the actual computation to extractFinalResult(), which performs a
sliding-window pattern match over the full PriorityQueue of raw events accumulated during the aggregate/merge phases.
This is the same reason they're the only functions listed in IndexedTable.containsExpensiveAggregationFunctions() to
trigger multi-threaded execution — but the corresponding cancellation-awareness was never added to the computation
itself.
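To make the cost concrete, here is a deliberately simplified sketch of a deferred sliding-window funnel match. It is an illustration under assumptions, not Pinot's implementation: the class FunnelSketch, the method maxStep, and the {timestamp, stepIndex} event encoding are hypothetical. The point is that the final-result phase has seek, fill, and process loops whose work grows with the number of raw events.

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

class FunnelSketch {
  // Each event is {timestamp, stepIndex}; the queue must order by timestamp.
  // Returns the largest number of funnel steps completed in order inside any
  // window of length windowSize that starts at a step-0 event.
  static int maxStep(PriorityQueue<long[]> events, long windowSize, int numSteps) {
    int best = 0;
    ArrayDeque<long[]> window = new ArrayDeque<>();
    while (!events.isEmpty() || !window.isEmpty()) {
      // Seek: the window must start at a step-0 event.
      while (!window.isEmpty() && window.peekFirst()[1] != 0) {
        window.pollFirst();
      }
      if (window.isEmpty()) {
        while (!events.isEmpty() && events.peek()[1] != 0) {
          events.poll();
        }
        if (events.isEmpty()) {
          break;
        }
        window.addLast(events.poll());
      }
      // Fill: pull in every event that falls inside the current window.
      long windowEnd = window.peekFirst()[0] + windowSize;
      while (!events.isEmpty() && events.peek()[0] < windowEnd) {
        window.addLast(events.poll());
      }
      // Process: walk the window, advancing when the next expected step appears.
      int reached = 0;
      for (long[] event : window) {
        if (event[1] == reached) {
          reached++;
        }
        if (reached == numSteps) {
          break;
        }
      }
      best = Math.max(best, reached);
      if (best == numSteps) {
        break;
      }
      window.pollFirst(); // slide past the current window start
    }
    return best;
  }

  public static void main(String[] args) {
    PriorityQueue<long[]> events = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
    events.add(new long[] {1, 0});
    events.add(new long[] {2, 1});
    events.add(new long[] {20, 2});
    System.out.println(maxStep(events, 10, 3));
  }
}
```

None of the three inner loops in this sketch consults any cancellation state, which mirrors the gap the PR describes.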

Fix

Add QueryThreadContext.checkTerminationAndSampleUsagePeriodically() calls inside every hot loop in the funnel window functions:

  • FunnelBaseAggregationFunction.fillWindow() — the step-0 seeking loop (can drain millions of non-matching events)
    and the window-filling loop (can move millions of events into the sliding window when the window size is large)
  • FunnelMaxStepAggregationFunction.processWindow() — iterates the entire sliding window
  • FunnelMatchStepAggregationFunction.processWindow() — same
  • FunnelCompleteCountAggregationFunction.extractFinalResult() — inline sliding window iteration

These checks detect three cancellation signals: the TerminationException set by QueryExecutionContext.terminate(),
the thread interrupt flag set by future.cancel(true), and deadline expiration. Any of these causes the worker thread
to throw and unwind immediately.
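The shape of the fix can be sketched as follows. This is a hedged illustration rather than the code in the PR: QueryState, its fields, and seekFirstStep are hypothetical stand-ins for Pinot's query context and the step-0 seeking loop, and the exception types are plain RuntimeExceptions chosen for brevity. The check inspects the three signals named above (an explicit termination error, the thread interrupt flag, and the deadline) and throws on the first one it finds.

```java
import java.util.PriorityQueue;

class CooperativeDrain {
  // Hypothetical stand-in for per-query termination state; not Pinot's class.
  static final class QueryState {
    volatile RuntimeException terminationError; // set by an explicit terminate call
    final long deadlineMs;

    QueryState(long deadlineMs) {
      this.deadlineMs = deadlineMs;
    }

    // Throws if any of the three cancellation signals is present.
    void check() {
      RuntimeException error = terminationError;
      if (error != null) {
        throw error;
      }
      if (Thread.interrupted()) { // also clears the interrupt flag
        throw new RuntimeException("interrupted");
      }
      if (System.currentTimeMillis() > deadlineMs) {
        throw new RuntimeException("deadline exceeded");
      }
    }
  }

  // Drops events ({timestamp, stepIndex}) until the head is a step-0 event,
  // checking for termination on every pass. Returns the number dropped.
  static int seekFirstStep(PriorityQueue<long[]> events, QueryState state) {
    int skipped = 0;
    while (!events.isEmpty() && events.peek()[1] != 0) {
      state.check();
      events.poll();
      skipped++;
    }
    return skipped;
  }

  public static void main(String[] args) {
    PriorityQueue<long[]> events = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
    events.add(new long[] {1, 2});
    events.add(new long[] {2, 1});
    events.add(new long[] {3, 0});
    QueryState state = new QueryState(System.currentTimeMillis() + 60_000);
    System.out.println("skipped " + seekFirstStep(events, state));
  }
}
```

This sketch checks on every iteration for clarity; sampling the check is a separate concern covered under Overhead.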

Overhead

checkTerminationAndSampleUsagePeriodically uses a bitmask (& 0x1FFF) so the actual check only fires once every 8,192 loop
iterations. The remaining iterations reduce to a single integer AND and a well-predicted branch. When the check does fire,
it reads one volatile field, calls Thread.interrupted(), and compares System.currentTimeMillis() against the
deadline. All of these are sub-microsecond operations, negligible relative to the per-event PriorityQueue.poll() (O(log N)) and
sliding window processing work done in each iteration.
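The sampling arithmetic can be checked with a small sketch. The class PeriodicCheck and the method countChecks are hypothetical, and firing when the masked counter equals zero is an assumption about where in the 8,192-iteration cycle the check lands; only the mask value 0x1FFF comes from the description above.

```java
class PeriodicCheck {
  static final int MASK = 0x1FFF; // low 13 bits set, i.e. 8,191

  // Counts how often the expensive check would fire over `iterations` passes.
  static int countChecks(int iterations) {
    int fired = 0;
    for (int i = 0; i < iterations; i++) {
      // Cheap path: one AND and one compare on almost every iteration.
      if ((i & MASK) == 0) {
        fired++; // a real loop would run the termination check here
      }
    }
    return fired;
  }

  public static void main(String[] args) {
    System.out.println(countChecks(1_000_000)); // prints 123
  }
}
```

Over one million loop iterations the full check runs only 123 times, which is why the added cost is dominated by the single AND and branch per iteration.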

@codecov-commenter

codecov-commenter commented Feb 12, 2026

❌ 3 Tests Failed:

Tests completed | Failed | Passed | Skipped
9379            | 3      | 9376   | 53
Flaky tests (3):

  • org.apache.pinot.integration.tests.KafkaConfluentSchemaRegistryAvroMessageDecoderRealtimeClusterIntegrationTest::setUp
    Flake rate in main: 100.00% (Passed 0 times, Failed 64 times)
    Stack trace (13.2s run time): Could not find a valid Docker environment. Please see logs and check configuration
  • org.apache.pinot.plugin.inputformat.json.confluent.JsonConfluentSchemaTest::@BeforeClass setup
    Flake rate in main: 100.00% (Passed 0 times, Failed 35 times)
    Stack trace (0.526s run time): Could not find a valid Docker environment. Please see logs and check configuration
  • org.apache.pinot.plugin.inputformat.json.confluent.JsonConfluentSchemaTest::setup
    Flake rate in main: 100.00% (Passed 0 times, Failed 70 times)
    Stack trace (1.34s run time): Could not find a valid Docker environment. Please see logs and check configuration


@yashmayya force-pushed the funnel-aggregation-cancellation branch from 023a495 to 83d33be on February 13, 2026 01:39
@yashmayya
Contributor Author

Weird, tests are failing with errors like: IllegalState Could not find a valid Docker environment. Please see logs and check configuration

@yashmayya
Contributor Author

I'm seeing this on other PRs too, and it's clearly unrelated to this one. None of the test failures look related, so I'm merging this. The test failures can be investigated independently.

@yashmayya merged commit 76cd3e6 into apache:master on Feb 13, 2026
12 of 16 checks passed
3 participants