-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Handle Query Timeouts During Collector Initialization in QueryPhase #138084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle Query Timeouts During Collector Initialization in QueryPhase #138084
Conversation
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
|
Hi @drempapis, I've created a changelog YAML for you. |
…ithub.com:drempapis/elasticsearch into handle_ContextIndexSearcher_TimeExceededException
benchaplin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why this solves the issue. My biggest question for you: is it necessary to do this up in QueryPhase rather than in the ContextIndexSearcher? Couldn't we move the call to newCollector in ContextIndexSearcher#search into the try block below?
elasticsearch/server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java
Lines 326 to 339 in 279c349
| public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException { | |
| final C firstCollector = collectorManager.newCollector(); | |
| // Take advantage of the few extra rewrite rules of ConstantScoreQuery when score are not needed. | |
| query = firstCollector.scoreMode().needsScores() ? rewrite(query) : rewrite(new ConstantScoreQuery(query)); | |
| final Weight weight; | |
| try { | |
| weight = createWeight(query, firstCollector.scoreMode(), 1); | |
| } catch (@SuppressWarnings("unused") TimeExceededException e) { | |
| timeExceeded = true; | |
| doAggregationPostCollection(firstCollector); | |
| return collectorManager.reduce(Collections.singletonList(firstCollector)); | |
| } | |
| return search(weight, collectorManager, firstCollector); | |
| } |
| QuerySearchResult queryResult = searchContext.queryResult(); | ||
| SearchTimeoutException.handleTimeout(searchContext.request().allowPartialSearchResults(), searchContext.shardTarget(), queryResult); | ||
|
|
||
| queryResult.topDocs(new TopDocsAndMaxScore(Lucene.EMPTY_TOP_DOCS, Float.NaN), new DocValueFormat[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to set topDocs and aggs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need to set them. In the early-timeout path the QuerySearchResult may not have been initialized at all, and leaving topDocs/aggs unset leads to stale values during merge. Setting them explicitly ensures the shard returns a well-formed (but empty) result.
e.g.
Caused by: java.lang.IllegalStateException: topDocs already consumed
at org.elasticsearch.search.query.QuerySearchResult.topDocs(QuerySearchResult.java:188)
at org.elasticsearch.action.search.QueryPhaseResultConsumer.reduce(QueryPhaseResultConsumer.java:246)
Thank you @benchaplin for the review. That's a good point. In Handling the timeout in QueryPhase doesn’t depend on having a collector at all and can still produce a consistent partial result. |
benchaplin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry for the two day review... wasn't able to check out the tests yesterday)
These are great tests! Just left a few questions.
| } | ||
| } catch (ContextIndexSearcher.TimeExceededException tee) { | ||
| finalizeAsTimedOutResult(searchContext); | ||
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This return is unnecessary, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true, removed!
| * Test aggregation builder that simulates a timeout during collector setup | ||
| * to verify QueryPhase timeout handling behavior. | ||
| */ | ||
| private static final class ForceTimeoutAggregationBuilder extends AggregationBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps extending AbstractAggregationBuilder would reduce the need for many of these overrides?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I used the AbstractAggregationBuilder. I guess I had the intention to use this class initially but got confused by the abstart AggregationBuilder.
| } | ||
|
|
||
| /** | ||
| * Simulates the search layer returning null from ContextIndexSearcher.search() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you observed ContextIndexSearcher.search() returning null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the code again and couldn’t find any path where ContextIndexSearcher.search(query, collectorManager) could return null. Looks like my initial assumption was wrong.
I removed the test and the in QueryPhase class
if (queryPhaseResult == null) {
finalizeAsTimedOutResult(searchContext);
return;
}
| resp = client().prepareSearch(INDEX) | ||
| .setQuery(QueryBuilders.matchAllQuery()) | ||
| .setSize(10) | ||
| .setAllowPartialSearchResults(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we randomize this and look for SearchTimeoutException in the case that allowPartialSearchResults=false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I added a second test because the code behaves differently depending on setAllowPartialSearchResults(xx):
- true: shard timeouts are treated as partial failures, so the search returns a normal SearchResponse with
timed_out = true. - false: shard timeouts are not allowed, so the coordinating node fails the entire search with a
SearchPhaseExecutionException.
Since these are two distinct code paths with different expected outcomes, each needs its own test.
…ithub.com:drempapis/elasticsearch into handle_ContextIndexSearcher_TimeExceededException
…ithub.com:drempapis/elasticsearch into handle_ContextIndexSearcher_TimeExceededException
💔 Backport failed
You can use sqren/backport to manually backport by running |
…lastic#138084) (cherry picked from commit c699e67) # Conflicts: # server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java
💔 Some backports could not be created
Manual backportTo create the backport manually run: Questions ?Please refer to the Backport tool documentation |
…lastic#138084) (cherry picked from commit c699e67) # Conflicts: # server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java
…hase (#138084) (#138473) * Handle Query Timeouts During Collector Initialization in QueryPhase (#138084) (cherry picked from commit c699e67) # Conflicts: # server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java * Refactor QueryPhaseTimeoutTests for clarity * [CI] Auto commit changes from spotless * update imports --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
What the problem is?
The
FilterByFilterAggregatorperforms part of its work during collector preparation, before the actual search execution begins. Unlike most other aggregators, this ahead-of-time work is not protected by the standard timeout-catching mechanisms withinContextIndexSearcher. When a search timeout occurs during this phase, aTimeExceededExceptionescapes before the query has fully started, causing the entireQueryPhaseto fail with a shard-levelQueryPhaseExecutionException.How this PR solves the problem
We fix this by catching timeout exceptions in
QueryPhase, which is a central place to handle timeouts from aggregators likeFilterByFilterAggregatorthat do extra work before the search begins.timeoutRunnableis registered before building the collector manager, and both thecollectorconstruction and the subsequentsearcher.search()invocation are wrapped in a singletry/catchblock. This ensures that any timeout triggered during early aggregator initialization is handled consistently with timeouts that occur during the main search phase.TimeExceededExceptionis caught, the code callsfinalizeAsTimedOutResult(), marking the query as timed out and returning a well-formed empty result (top docs + aggregations) rather than propagating an exception.searcher.search()may returnnull, a defensive null check has been installed to treat this case as a timeout as well, ensuring that no shard-level failure occurs and the response remains consistent.