Handle Query Timeouts During Collector Initialization in QueryPhase #138084

drempapis · 2025-11-14T09:12:17Z

What is the problem?

The FilterByFilterAggregator performs part of its work during collector preparation, before the actual search execution begins. Unlike most other aggregators, this ahead-of-time work is not protected by the standard timeout-catching mechanisms within ContextIndexSearcher. When a search timeout occurs during this phase, a TimeExceededException escapes before the query has fully started, causing the entire QueryPhase to fail with a shard-level QueryPhaseExecutionException.

How this PR solves the problem

We fix this by catching timeout exceptions in QueryPhase, which is a central place to handle timeouts from aggregators like FilterByFilterAggregator that do extra work before the search begins.

The timeoutRunnable is registered before building the collector manager, and both the collector construction and the subsequent searcher.search() invocation are wrapped in a single try/catch block. This ensures that any timeout triggered during early aggregator initialization is handled consistently with timeouts that occur during the main search phase.
Timeouts are converted into meaningful partial results. When a TimeExceededException is caught, the code calls finalizeAsTimedOutResult(), marking the query as timed out and returning a well-formed empty result (top docs + aggregations) rather than propagating an exception.
Since searcher.search() may return null, a defensive null check has been added to treat this case as a timeout as well, ensuring that no shard-level failure occurs and the response remains consistent.
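The approach described above can be sketched in a few lines of plain Java. This is a minimal model, not the actual Elasticsearch code: ShardResult, QueryPhaseSketch, execute and finalizeAsTimedOut are hypothetical names, and the local TimeExceededException only stands in for ContextIndexSearcher.TimeExceededException. It shows collector construction and the search sharing one try/catch, so a timeout in either step produces an empty, timed-out result instead of an escaping exception.

```java
import java.util.function.Supplier;
import java.util.function.ToIntFunction;

/** Hypothetical stand-in for ContextIndexSearcher.TimeExceededException. */
final class TimeExceededException extends RuntimeException {}

/** Hypothetical, heavily simplified stand-in for QuerySearchResult. */
final class ShardResult {
    boolean timedOut;
    int topDocsCount = -1; // -1 means "never set"
    boolean aggsSet;
}

final class QueryPhaseSketch {
    /**
     * Builds the collector and runs the search inside a single try/catch, so a
     * timeout raised in either step is converted into an empty partial result.
     */
    static ShardResult execute(Supplier<Object> newCollector, ToIntFunction<Object> search) {
        ShardResult result = new ShardResult();
        try {
            Object collector = newCollector.get();              // early aggregator work may time out here
            result.topDocsCount = search.applyAsInt(collector); // the main search may time out here
            result.aggsSet = true;
        } catch (TimeExceededException e) {
            finalizeAsTimedOut(result);
        }
        return result;
    }

    /** Marks the result as timed out and fills in empty top docs and aggregations. */
    static void finalizeAsTimedOut(ShardResult result) {
        result.timedOut = true;
        result.topDocsCount = 0;
        result.aggsSet = true;
    }

    public static void main(String[] args) {
        // Simulate a timeout during collector construction.
        ShardResult r = execute(() -> { throw new TimeExceededException(); }, c -> 10);
        System.out.println(r.timedOut + " " + r.topDocsCount); // prints "true 0"
    }
}
```

The point of the single try/catch is that the caller does not need to know which step timed out; both paths end in the same well-formed result.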

elasticsearchmachine · 2025-11-14T09:12:51Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine · 2025-11-14T09:12:51Z

Hi @drempapis, I've created a changelog YAML for you.

benchaplin

I can see why this solves the issue. My biggest question for you: is it necessary to do this up in QueryPhase rather than in the ContextIndexSearcher? Couldn't we move the call to newCollector in ContextIndexSearcher#search into the try block below?

elasticsearch/server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java

Lines 326 to 339 in 279c349

    
    public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException {
        final C firstCollector = collectorManager.newCollector();
        // Take advantage of the few extra rewrite rules of ConstantScoreQuery when score are not needed.
        query = firstCollector.scoreMode().needsScores() ? rewrite(query) : rewrite(new ConstantScoreQuery(query));
        final Weight weight;
        try {
            weight = createWeight(query, firstCollector.scoreMode(), 1);
        } catch (@SuppressWarnings("unused") TimeExceededException e) {
            timeExceeded = true;
            doAggregationPostCollection(firstCollector);
            return collectorManager.reduce(Collections.singletonList(firstCollector));
        }
        return search(weight, collectorManager, firstCollector);
    }

benchaplin · 2025-11-19T20:25:07Z

server/src/main/java/org/elasticsearch/search/query/QueryPhase.java

+        QuerySearchResult queryResult = searchContext.queryResult();
+        SearchTimeoutException.handleTimeout(searchContext.request().allowPartialSearchResults(), searchContext.shardTarget(), queryResult);
+
+        queryResult.topDocs(new TopDocsAndMaxScore(Lucene.EMPTY_TOP_DOCS, Float.NaN), new DocValueFormat[0]);


Is it necessary to set topDocs and aggs here?

Yes, we need to set them. In the early-timeout path the QuerySearchResult may not have been initialized at all, and leaving topDocs/aggs unset leads to stale values during merge. Setting them explicitly ensures the shard returns a well-formed (but empty) result.

e.g.

Caused by: java.lang.IllegalStateException: topDocs already consumed
    at org.elasticsearch.search.query.QuerySearchResult.topDocs(QuerySearchResult.java:188)
    at org.elasticsearch.action.search.QueryPhaseResultConsumer.reduce(QueryPhaseResultConsumer.java:246)

drempapis · 2025-11-20T12:11:16Z

I can see why this solves the issue. My biggest question for you: is it necessary to do this up in QueryPhase rather than in the ContextIndexSearcher? Couldn't we move the call to newCollector in ContextIndexSearcher#search into the try block below?

Thank you @benchaplin for the review. That's a good point.

In QueryPhase we register the timeoutRunnable on the searcher so that a timeout can also fire during collector construction. If we moved collectorManager.newCollector() into the try block in ContextIndexSearcher#search, a timeout that happens inside newCollector() would be caught there, but firstCollector would still be null. The catch block currently assumes a valid collector (it calls doAggregationPostCollection(firstCollector) and collectorManager.reduce(singletonList(firstCollector))), so we’d risk calling those on null.

Handling the timeout in QueryPhase doesn’t depend on having a collector at all and can still produce a consistent partial result.
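To make the null-collector concern concrete, here is a small sketch of the variant discussed in this thread, with newCollector() moved inside the try block. Everything in it is hypothetical and simplified: CollectorTimeout stands in for TimeExceededException, a StringBuilder stands in for the collector, and reduce only mimics the fact that the real catch block dereferences the collector.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for the timeout exception. */
final class CollectorTimeout extends RuntimeException {}

final class MovedNewCollectorSketch {
    /** Hypothetical reduce step that, like the existing catch block, needs a non-null collector. */
    static int reduce(List<StringBuilder> collectors) {
        return collectors.get(0).length(); // throws NullPointerException if the element is null
    }

    /** The variant under discussion: collector construction moved inside the try block. */
    static int search(Supplier<StringBuilder> newCollector) {
        StringBuilder firstCollector = null;
        try {
            firstCollector = newCollector.get(); // a timeout here leaves firstCollector == null
            return reduce(Collections.singletonList(firstCollector));
        } catch (CollectorTimeout e) {
            // The catch block still assumes a usable collector at this point.
            return reduce(Collections.singletonList(firstCollector));
        }
    }
}
```

Calling search with a supplier that throws CollectorTimeout ends in a NullPointerException from the catch block, which is the failure mode that handling the timeout in QueryPhase avoids.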

benchaplin

(Sorry for the two day review... wasn't able to check out the tests yesterday)

These are great tests! Just left a few questions.

benchaplin · 2025-11-20T14:47:34Z

server/src/main/java/org/elasticsearch/search/query/QueryPhase.java

+                }
+            } catch (ContextIndexSearcher.TimeExceededException tee) {
+                finalizeAsTimedOutResult(searchContext);
+                return;


This return is unnecessary, right?

That is true, removed!

benchaplin · 2025-11-20T14:57:51Z

server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

+     * Test aggregation builder that simulates a timeout during collector setup
+     * to verify QueryPhase timeout handling behavior.
+     */
+    private static final class ForceTimeoutAggregationBuilder extends AggregationBuilder {


Perhaps extending AbstractAggregationBuilder would reduce the need for many of these overrides?

Good point, I switched to AbstractAggregationBuilder. I think I intended to use this class initially but got confused by the abstract AggregationBuilder.

benchaplin · 2025-11-20T14:58:37Z

server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

+    }
+
+    /**
+     * Simulates the search layer returning null from ContextIndexSearcher.search()


Have you observed ContextIndexSearcher.search() returning null?

I went through the code again and couldn’t find any path where ContextIndexSearcher.search(query, collectorManager) could return null. Looks like my initial assumption was wrong.

I removed the test and the corresponding null check in the QueryPhase class:

if (queryPhaseResult == null) {
    finalizeAsTimedOutResult(searchContext);
    return;
}

benchaplin · 2025-11-20T15:08:14Z

...nternalClusterTest/java/org/elasticsearch/search/aggregations/QueryPhaseForcedTimeoutIT.java

+            resp = client().prepareSearch(INDEX)
+                .setQuery(QueryBuilders.matchAllQuery())
+                .setSize(10)
+                .setAllowPartialSearchResults(true)


Can we randomize this and look for SearchTimeoutException in the case that allowPartialSearchResults=false?

Good point! I added a second test because the code behaves differently depending on the value passed to setAllowPartialSearchResults():

true: shard timeouts are treated as partial failures, so the search returns a normal SearchResponse with timed_out = true.

false: shard timeouts are not allowed, so the coordinating node fails the entire search with a SearchPhaseExecutionException.

Since these are two distinct code paths with different expected outcomes, each needs its own test.
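The two paths can be modeled with a short sketch. It is an assumption-laden simplification rather than the real implementation: ShardTimeoutException and PartialResultsSketch are hypothetical names standing in for SearchTimeoutException and QuerySearchResult, the shard target argument is omitted, and the coordinating node wrapping the shard failure into a SearchPhaseExecutionException is not modeled.

```java
/** Hypothetical stand-in for the shard-level timeout exception. */
final class ShardTimeoutException extends RuntimeException {
    ShardTimeoutException(String message) {
        super(message);
    }
}

/** Hypothetical, simplified stand-in for the shard's query result. */
final class PartialResultsSketch {
    boolean timedOut;

    /**
     * Models the branch on allowPartialSearchResults:
     * false -> the timeout becomes a failure; true -> the result is flagged as timed out.
     */
    static void handleTimeout(boolean allowPartialSearchResults, PartialResultsSketch result) {
        if (allowPartialSearchResults == false) {
            throw new ShardTimeoutException("Time exceeded");
        }
        result.timedOut = true;
    }
}
```

With true, the caller continues with a result whose timedOut flag is set; with false, the exception propagates, so a test has to assert on a failure rather than on a response.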

elasticsearchmachine · 2025-11-24T08:46:41Z

💔 Backport failed

Status	Branch	Result
❌	8.19	Commit could not be cherrypicked due to conflicts
❌	9.1	Commit could not be cherrypicked due to conflicts
❌	9.2	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 138084

…lastic#138084) (cherry picked from commit c699e67) # Conflicts: # server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

drempapis · 2025-11-24T08:55:15Z

💔 Some backports could not be created

Status	Branch	Result
✅	9.2
✅	9.1
❌	8.19	An unhandled error occurred. Please see the logs for details

Manual backport

To create the backport manually run:

backport --pr 138084

Questions ?

Please refer to the Backport tool documentation

…hase (#138084) (#138474) * Handle Query Timeouts During Collector Initialization in QueryPhase (#138084) (cherry picked from commit c699e67) # Conflicts: # server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java * add missing improts

…hase (#138084) (#138473) * Handle Query Timeouts During Collector Initialization in QueryPhase (#138084) (cherry picked from commit c699e67) # Conflicts: # server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java * Refactor QueryPhaseTimeoutTests for clarity * [CI] Auto commit changes from spotless * update imports --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>

drempapis added 6 commits November 11, 2025 17:19

Add code and test class

e2bd8a8

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

81efb4c

update

1421c60

update

19e2121

update code

6b15d74

update

bfc3f02

drempapis added the >bug, Team:Search Foundations, :Search Foundations/Search, and v9.3.0 labels Nov 14, 2025

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

8f6674c

drempapis added 5 commits November 14, 2025 11:12

Update docs/changelog/138084.yaml

cb0e1b2

update after review

3d81fce

Merge branch 'handle_ContextIndexSearcher_TimeExceededException' of github.com:drempapis/elasticsearch into handle_ContextIndexSearcher_TimeExceededException

09487e7

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

0710406

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

a453b4e

benchaplin reviewed Nov 19, 2025

benchaplin reviewed Nov 20, 2025

drempapis and others added 9 commits November 20, 2025 17:36

update after review - remove redundant code

83379a5

update after review

a98b855

[CI] Auto commit changes from spotless

4bc1f41

update after review

cabadeb

Merge branch 'handle_ContextIndexSearcher_TimeExceededException' of github.com:drempapis/elasticsearch into handle_ContextIndexSearcher_TimeExceededException

ae30a7b

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

7a6cbe8

[CI] Auto commit changes from spotless

6cf501c

update after review

ba10880

Merge branch 'handle_ContextIndexSearcher_TimeExceededException' of github.com:drempapis/elasticsearch into handle_ContextIndexSearcher_TimeExceededException

1ee843c

drempapis and others added 4 commits November 20, 2025 20:17

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

d24498f

[CI] Auto commit changes from spotless

71ca632

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

c0aacb5

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

01a2d62

benchaplin approved these changes Nov 21, 2025

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

f4367c3

drempapis added the v8.19.0, v9.1.0, v9.2.0, and auto-backport labels Nov 24, 2025

Merge branch 'main' into handle_ContextIndexSearcher_TimeExceededException

7ba2335

drempapis merged commit c699e67 into elastic:main Nov 24, 2025
34 checks passed

elasticsearchmachine added the backport pending label Nov 24, 2025

drempapis mentioned this pull request Nov 24, 2025

[9.2] Handle Query Timeouts During Collector Initialization in QueryPhase (#138084) #138473

Merged

drempapis mentioned this pull request Nov 24, 2025

[9.1] Handle Query Timeouts During Collector Initialization in QueryPhase (#138084) #138474

Merged

drempapis removed the backport pending label Nov 24, 2025

drempapis removed the v8.19.0 label Nov 24, 2025

ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025

Handle Query Timeouts During Collector Initialization in QueryPhase (elastic#138084)

2e0ea9b
