@drempapis
Contributor

What is the problem?

The FilterByFilterAggregator performs part of its work during collector preparation, before the actual search execution begins. Unlike most other aggregators, this ahead-of-time work is not protected by the standard timeout-catching mechanisms within ContextIndexSearcher. When a search timeout occurs during this phase, a TimeExceededException escapes before the query has fully started, causing the entire QueryPhase to fail with a shard-level QueryPhaseExecutionException.

How this PR solves the problem

We fix this by catching timeout exceptions in QueryPhase, which is a central place to handle timeouts from aggregators like FilterByFilterAggregator that do extra work before the search begins.

  • The timeoutRunnable is registered before building the collector manager, and both the collector construction and the subsequent searcher.search() invocation are wrapped in a single try/catch block. This ensures that any timeout triggered during early aggregator initialization is handled consistently with timeouts that occur during the main search phase.
  • Timeouts are converted into meaningful partial results. When a TimeExceededException is caught, the code calls finalizeAsTimedOutResult(), marking the query as timed out and returning a well-formed empty result (top docs + aggregations) rather than propagating an exception.
  • Since searcher.search() may return null, a defensive null check was added to treat this case as a timeout as well, ensuring that no shard-level failure occurs and the response remains consistent.
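
The approach in the bullets above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual QueryPhase code: EarlyTimeoutSketch, ShardResult, CollectorFactory, Collector and execute are hypothetical stand-ins, and the nested TimeExceededException only imitates ContextIndexSearcher.TimeExceededException. The point it shows is that one try/catch covers both collector construction and the search itself.

```java
// Hypothetical stand-ins only; none of these types are the real Elasticsearch classes.
final class EarlyTimeoutSketch {

    /** Imitates ContextIndexSearcher.TimeExceededException (assumption: unchecked). */
    static final class TimeExceededException extends RuntimeException {}

    /** Minimal shard-level result: a timed-out flag plus a hit count. */
    static final class ShardResult {
        final boolean timedOut;
        final int hits;

        ShardResult(boolean timedOut, int hits) {
            this.timedOut = timedOut;
            this.hits = hits;
        }
    }

    /** Imitates a collector whose collection phase may time out. */
    interface Collector {
        int collect();
    }

    /** Imitates a collector manager whose newCollector() may do work up front. */
    interface CollectorFactory {
        Collector newCollector();
    }

    static ShardResult execute(CollectorFactory factory) {
        try {
            // A timeout can fire here (early aggregator work) ...
            Collector collector = factory.newCollector();
            // ... or here (the main search).
            int hits = collector.collect();
            return new ShardResult(false, hits);
        } catch (TimeExceededException e) {
            // Both cases end up as an empty but well-formed timed-out result.
            return new ShardResult(true, 0);
        }
    }

    public static void main(String[] args) {
        ShardResult ok = execute(() -> () -> 3);
        ShardResult early = execute(() -> { throw new TimeExceededException(); });
        System.out.println(ok.timedOut + " " + ok.hits);       // prints "false 3"
        System.out.println(early.timedOut + " " + early.hits); // prints "true 0"
    }
}
```

In this sketch a timeout thrown from newCollector() and one thrown from collect() are indistinguishable to the caller, which is the consistency the description asks for.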

@drempapis drempapis added >bug Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v9.3.0 labels Nov 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Collaborator

Hi @drempapis, I've created a changelog YAML for you.

Contributor

@benchaplin benchaplin left a comment

I can see why this solves the issue. My biggest question for you: is it necessary to do this up in QueryPhase rather than in the ContextIndexSearcher? Couldn't we move the call to newCollector in ContextIndexSearcher#search into the try block below?

public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException {
    final C firstCollector = collectorManager.newCollector();
    // Take advantage of the few extra rewrite rules of ConstantScoreQuery when score are not needed.
    query = firstCollector.scoreMode().needsScores() ? rewrite(query) : rewrite(new ConstantScoreQuery(query));
    final Weight weight;
    try {
        weight = createWeight(query, firstCollector.scoreMode(), 1);
    } catch (@SuppressWarnings("unused") TimeExceededException e) {
        timeExceeded = true;
        doAggregationPostCollection(firstCollector);
        return collectorManager.reduce(Collections.singletonList(firstCollector));
    }
    return search(weight, collectorManager, firstCollector);
}

QuerySearchResult queryResult = searchContext.queryResult();
SearchTimeoutException.handleTimeout(searchContext.request().allowPartialSearchResults(), searchContext.shardTarget(), queryResult);

queryResult.topDocs(new TopDocsAndMaxScore(Lucene.EMPTY_TOP_DOCS, Float.NaN), new DocValueFormat[0]);
Contributor

Is it necessary to set topDocs and aggs here?

Contributor Author

Yes, we need to set them. In the early-timeout path the QuerySearchResult may not have been initialized at all, and leaving topDocs/aggs unset leads to stale values during merge. Setting them explicitly ensures the shard returns a well-formed (but empty) result.

e.g.

Caused by: java.lang.IllegalStateException: topDocs already consumed
	at org.elasticsearch.search.query.QuerySearchResult.topDocs(QuerySearchResult.java:188)
	at org.elasticsearch.action.search.QueryPhaseResultConsumer.reduce(QueryPhaseResultConsumer.java:246)

@drempapis
Contributor Author

> I can see why this solves the issue. My biggest question for you: is it necessary to do this up in QueryPhase rather than in the ContextIndexSearcher? Couldn't we move the call to newCollector in ContextIndexSearcher#search into the try block below?

Thank you @benchaplin for the review. That's a good point.

In QueryPhase we register the timeoutRunnable on the searcher, so a timeout can also fire during collector construction. If we moved collectorManager.newCollector() into the try block in ContextIndexSearcher#search, a timeout that happens inside newCollector() would be caught there, but firstCollector would still be null. The catch block currently assumes a valid collector (it calls doAggregationPostCollection(firstCollector) and collectorManager.reduce(singletonList(firstCollector))), so we’d risk calling those on null.

Handling the timeout in QueryPhase doesn’t depend on having a collector at all and can still produce a consistent partial result.
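
To make the null-collector concern concrete, here is a small sketch of the alternative discussed above, with newCollector() moved inside the try block. It is an illustration under assumptions rather than the real ContextIndexSearcher#search: NullCollectorSketch, Collector, search and the createWeight Runnable are all hypothetical names, and the nested TimeExceededException only imitates the real one.

```java
import java.util.function.Supplier;

// Hypothetical model; not the real ContextIndexSearcher#search.
final class NullCollectorSketch {

    /** Imitates ContextIndexSearcher.TimeExceededException (assumption: unchecked). */
    static final class TimeExceededException extends RuntimeException {}

    /** Imitates a collector that can be reduced into a result. */
    interface Collector {
        String reduce();
    }

    static String search(Supplier<Collector> newCollector, Runnable createWeight) {
        Collector firstCollector = null;
        try {
            firstCollector = newCollector.get(); // a timeout may fire here ...
            createWeight.run();                  // ... or here
        } catch (TimeExceededException e) {
            if (firstCollector == null) {
                // The timeout hit before a collector existed, so there is
                // nothing to post-collect or reduce at this level.
                return "no collector: cannot reduce";
            }
            return "partial: " + firstCollector.reduce();
        }
        return "complete: " + firstCollector.reduce();
    }

    public static void main(String[] args) {
        System.out.println(search(() -> () -> "aggs", () -> {}));
        // prints "complete: aggs"
        System.out.println(search(() -> () -> "aggs", () -> { throw new TimeExceededException(); }));
        // prints "partial: aggs"
        System.out.println(search(() -> { throw new TimeExceededException(); }, () -> {}));
        // prints "no collector: cannot reduce"
    }
}
```

The third call is the case that motivates handling the timeout one level up: without the explicit null branch, the catch block would dereference a collector that was never created.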

Contributor

@benchaplin benchaplin left a comment

(Sorry for the two-day review... I wasn't able to check out the tests yesterday)

These are great tests! Just left a few questions.

}
} catch (ContextIndexSearcher.TimeExceededException tee) {
finalizeAsTimedOutResult(searchContext);
return;
Contributor

This return is unnecessary, right?

Contributor Author

That is true, removed!

* Test aggregation builder that simulates a timeout during collector setup
* to verify QueryPhase timeout handling behavior.
*/
private static final class ForceTimeoutAggregationBuilder extends AggregationBuilder {
Contributor

Perhaps extending AbstractAggregationBuilder would reduce the need for many of these overrides?

Contributor Author

Good point, I switched to AbstractAggregationBuilder. I think I intended to use this class initially but got confused by the abstract AggregationBuilder.

}

/**
* Simulates the search layer returning null from ContextIndexSearcher.search()
Contributor

Have you observed ContextIndexSearcher.search() returning null?

Contributor Author

I went through the code again and couldn’t find any path where ContextIndexSearcher.search(query, collectorManager) could return null. Looks like my initial assumption was wrong.

I removed the test and the following null check in the QueryPhase class:

if (queryPhaseResult == null) {
    finalizeAsTimedOutResult(searchContext);
    return;
}

resp = client().prepareSearch(INDEX)
.setQuery(QueryBuilders.matchAllQuery())
.setSize(10)
.setAllowPartialSearchResults(true)
Contributor

Can we randomize this and look for SearchTimeoutException in the case that allowPartialSearchResults=false?

Contributor Author

Good point! I added a second test because the code behaves differently depending on the value passed to setAllowPartialSearchResults():

  • true: shard timeouts are treated as partial failures, so the search returns a normal SearchResponse with timed_out = true.
  • false: shard timeouts are not allowed, so the coordinating node fails the entire search with a SearchPhaseExecutionException.

Since these are two distinct code paths with different expected outcomes, each needs its own test.
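
The two outcomes described above can be modelled with a short sketch. This is a simplified shard-level model under assumptions, not the real SearchTimeoutException.handleTimeout: PartialResultsSketch, ShardTimeoutException and ShardResult are hypothetical names, and the coordinating-node step that surfaces a SearchPhaseExecutionException is not modelled here.

```java
// Hypothetical model of the two timeout outcomes; not Elasticsearch code.
final class PartialResultsSketch {

    /** Stand-in for the exception raised when partial results are not allowed. */
    static final class ShardTimeoutException extends RuntimeException {
        ShardTimeoutException(String message) {
            super(message);
        }
    }

    /** Stand-in for the shard's query result; only the timed-out flag is modelled. */
    static final class ShardResult {
        boolean timedOut;
    }

    static void handleTimeout(boolean allowPartialSearchResults, ShardResult result) {
        if (allowPartialSearchResults == false) {
            // Partial results are not acceptable: fail instead of returning them.
            throw new ShardTimeoutException("Time exceeded");
        }
        // Partial results are acceptable: flag the result and carry on.
        result.timedOut = true;
    }

    public static void main(String[] args) {
        ShardResult partial = new ShardResult();
        handleTimeout(true, partial);
        System.out.println("timed_out=" + partial.timedOut); // prints "timed_out=true"

        try {
            handleTimeout(false, new ShardResult());
        } catch (ShardTimeoutException e) {
            System.out.println("failed: " + e.getMessage()); // prints "failed: Time exceeded"
        }
    }
}
```

Because one path returns normally and the other throws, a single randomized test would need to branch on the flag anyway, which is consistent with keeping two separate tests.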

@drempapis drempapis added v8.19.0 v9.1.0 v9.2.0 auto-backport Automatically create backport pull requests when merged labels Nov 24, 2025
@drempapis drempapis merged commit c699e67 into elastic:main Nov 24, 2025
34 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts
9.1 Commit could not be cherrypicked due to conflicts
9.2 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 138084

drempapis added a commit to drempapis/elasticsearch that referenced this pull request Nov 24, 2025
…lastic#138084)

(cherry picked from commit c699e67)

# Conflicts:
#	server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java
@drempapis
Contributor Author

💔 Some backports could not be created

Status Branch Result
9.2
9.1
8.19 An unhandled error occurred. Please see the logs for details

Manual backport

To create the backport manually run:

backport --pr 138084

Questions ?

Please refer to the Backport tool documentation

drempapis added a commit to drempapis/elasticsearch that referenced this pull request Nov 24, 2025
…lastic#138084)

(cherry picked from commit c699e67)

# Conflicts:
#	server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java
elasticsearchmachine pushed a commit that referenced this pull request Nov 24, 2025
…hase (#138084) (#138474)

* Handle Query Timeouts During Collector Initialization in QueryPhase (#138084)

(cherry picked from commit c699e67)

# Conflicts:
#	server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

* add missing improts
elasticsearchmachine pushed a commit that referenced this pull request Nov 24, 2025
…hase (#138084) (#138473)

* Handle Query Timeouts During Collector Initialization in QueryPhase (#138084)

(cherry picked from commit c699e67)

# Conflicts:
#	server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

* Refactor QueryPhaseTimeoutTests for clarity

* [CI] Auto commit changes from spotless

* update imports

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
@drempapis drempapis removed the v8.19.0 label Nov 24, 2025
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025
Labels

auto-backport Automatically create backport pull requests when merged >bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.1.0 v9.2.0 v9.3.0

3 participants