Prevent Leaking Search Tasks on Exceptions in FetchSearchPhase and DfsQueryPhase #45500
* Logging only at DEBUG here was hiding a tricky bug; increased it to WARN to help with future issues
Pinging @elastic/es-search
This failure is not fatal, so I think it can stay at this level. The bug that you found is that we call this handler when a phase exception happens (finishPhase.run throws an exception during executeFetch), which leaves a zombie task/connection on the node, and the response is never sent back. Changing the FetchSearchPhase with:
diff --git a/server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java b/server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java
index 2115b4fa998..548c21ab1b6 100644
--- a/server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java
+++ b/server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java
@@ -163,7 +163,11 @@ final class FetchSearchPhase extends SearchPhase {
                             new SearchActionListener<FetchSearchResult>(shardTarget, shardIndex) {
                 @Override
                 public void innerOnResponse(FetchSearchResult result) {
-                    counter.onResult(result);
+                    try {
+                        counter.onResult(result);
+                    } catch (Exception e) {
+                        context.onPhaseFailure(FetchSearchPhase.this, "", e);
+                    }
                 }
 
                 @Override
should be enough and in this case the error will be properly propagated to the user.
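A minimal, self-contained sketch of the pattern in the patch above: anything thrown while handling a successful response is caught and reported as a phase failure instead of escaping the callback. `PhaseFailureHandler`, `GuardedResponseSketch`, and `guarded` are hypothetical names used only for illustration; they are not Elasticsearch classes, and the real `onPhaseFailure` takes the phase object rather than a name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the search phase context; not the Elasticsearch API.
interface PhaseFailureHandler {
    void onPhaseFailure(String phaseName, String message, Exception cause);
}

final class GuardedResponseSketch {

    // Wraps a result consumer so that any exception it throws is routed
    // to the phase failure handler instead of propagating to the caller.
    static <T> Consumer<T> guarded(String phaseName, Consumer<T> onResult, PhaseFailureHandler handler) {
        return result -> {
            try {
                onResult.accept(result);
            } catch (Exception e) {
                handler.onPhaseFailure(phaseName, "", e);
            }
        };
    }

    // Runs a consumer that always throws and returns what the handler recorded.
    static List<String> demo() {
        List<String> reported = new ArrayList<>();
        PhaseFailureHandler handler =
            (phase, message, cause) -> reported.add(phase + ": " + cause.getMessage());
        Consumer<String> failing = r -> {
            throw new IllegalStateException("merge failed for " + r);
        };
        guarded("fetch", failing, handler).accept("shard-0");
        return reported;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [fetch: merge failed for shard-0]
    }
}
```

In this sketch the caller of `accept` never sees the exception, so the failure is reported once, through a single path.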
I applied the above patch now (to the fetch phase as well as the DFS query phase, which has the same issue as far as I can see). WDYT?
Jenkins run elasticsearch-ci/2 (watcher failure)
Thanks @original-brownbear. I left one comment regarding the additional assert. Can you also change the title and description to better reflect what this PR is doing? Regarding tests, they are difficult to add since the filling of search hits is not supposed to throw exceptions. IMO it is a bug that we can have an exception at this point, so I see this PR as a protection against it, but we should also have a follow-up to ensure that any discrepancy between shards is caught during the merge and not in the final step. I'll take a look when this PR is merged.
Jenkins run elasticsearch-ci/1
LGTM, great catch @original-brownbear!
Since the fix is simple, can we backport this to 6.x and 7.3.x too?
@jimczi sure, will backport to 6.8 and 7.3 as well :)
…sQueryPhase (elastic#45500) * If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service, eventually hitting the `onFailure` callback again and counting down the `counter` twice). Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
…sQueryPhase (#45500) (#45540) * If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service, eventually hitting the `onFailure` callback again and counting down the `counter` twice). Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
…sQueryPhase (#45500) (#45541) * If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service, eventually hitting the `onFailure` callback again and counting down the `counter` twice). Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
…sQueryPhase (#45500) (#45543) * If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service, eventually hitting the `onFailure` callback again and counting down the `counter` twice). Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
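The double count-down described in the commit message can be illustrated with a small sketch. It assumes a dispatcher that reports anything thrown from `onResponse` through `onFailure` of the same listener; `deliver`, `Listener`, and `countDownsForOneResponse` are hypothetical names, and the real transport service code path is more involved.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class DoubleCountDownSketch {

    // Hypothetical listener shape, loosely modelled on a response/failure callback pair.
    interface Listener {
        void onResponse(String response);
        void onFailure(Exception e);
    }

    // Hypothetical dispatcher: an exception escaping onResponse is
    // reported back to the same listener via onFailure.
    static void deliver(Listener listener, String response) {
        try {
            listener.onResponse(response);
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }

    // Returns how many times a single response counts down the counter.
    // With guard == false the exception escapes onResponse (the bug);
    // with guard == true it is handled inside onResponse (the fix).
    static int countDownsForOneResponse(boolean guard) {
        AtomicInteger countDowns = new AtomicInteger();
        Listener listener = new Listener() {
            @Override
            public void onResponse(String response) {
                try {
                    countDowns.incrementAndGet(); // result recorded, counter counted down
                    throw new IllegalStateException("finishing the phase failed");
                } catch (RuntimeException e) {
                    if (!guard) {
                        throw e; // escapes to the dispatcher
                    }
                    // guarded: this is where a phase failure would be reported
                }
            }

            @Override
            public void onFailure(Exception e) {
                countDowns.incrementAndGet(); // failure path counts down as well
            }
        };
        deliver(listener, "shard-0");
        return countDowns.get();
    }

    public static void main(String[] args) {
        System.out.println(countDownsForOneResponse(false)); // prints 2
        System.out.println(countDownsForOneResponse(true));  // prints 1
    }
}
```

Under these assumptions, the unguarded listener accounts for one shard response twice, which is the inconsistency the try/catch in the patch avoids.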