Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent Leaking Search Tasks on Exceptions in FetchSearchPhase and DfsQueryPhase #45500

Merged

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented Aug 13, 2019

  • If counter.onResult throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the onFailure callback again and couting down the counter twice).

* Only logging at DEBUG here was hiding a tricky bug, increasing it to WARN to help with future issues
@original-brownbear original-brownbear added >non-issue :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.4.0 labels Aug 13, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This failure is not fatal so it can stay at this level I think. The bug that you found is that we call this handler when a phase exception happens (finishPhase.run throw an exception during executeFetch) so this leaves a zombie task/connection in the node and the response is never sent back. Changing the FetchSearchPhase with:

diff --git a/server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java b/server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java
index 2115b4fa998..548c21ab1b6 100644
--- a/server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java
+++ b/server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java
@@ -163,7 +163,11 @@ final class FetchSearchPhase extends SearchPhase {
             new SearchActionListener<FetchSearchResult>(shardTarget, shardIndex) {
                 @Override
                 public void innerOnResponse(FetchSearchResult result) {
-                    counter.onResult(result);
+                    try {
+                        counter.onResult(result);
+                    } catch (Exception e) {
+                        context.onPhaseFailure(FetchSearchPhase.this, "", e);
+                    }
                 }
                 @Override

should be enough and in this case the error will be properly propagated to the user.

@original-brownbear
Copy link
Member Author

@jimczi

I applied the above patch now (to the fetch phase as well as the dfs query phase which has the same issue as far as I can see).
However, I could not for the life of me find a way to trigger this kind of failure in an integration test (and I'm not sure a unit test with some hand crafted failure in org.elasticsearch.action.search.FetchSearchPhase#moveToNextPhase would mean anything).
So I figured I'd assert that we simply don't have any exceptions here (as far as I can tell any exception would be a bug in org.elasticsearch.action.search.SearchPhaseController#merge since org.elasticsearch.action.search.AbstractSearchAsyncAction#executeNextPhase catches all the exceptions and passes them to onPhaseFailure as well).

WDYT?

@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/2

(watcher failure)

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @original-brownbear . I left one comment regarding the additional assert. Can you also change the title and description to better reflect what this pr is doing ? Regarding the addition of tests, it is difficult to add since the filling of search hits is not supposed to throw exceptions. IMO this is a bug that we can have an exception at this point so I see this pr as a protection against it but we should also have a follow up to ensure that any discrepancy between shards is caught during the merge and not in the final step. I'll take a look when this pr is merged.

@original-brownbear original-brownbear changed the title Increase Log Level to WARN for Fetch Failures Prevent Leaking Search Tasks on Exceptions in FetchSearchPhase and DfsQueryPhase Aug 14, 2019
@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/1

@original-brownbear
Copy link
Member Author

Thanks @jimczi , you're right the assert probably doesn't add anything -> removed it in 1547b57 :)

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great catch @original-brownbear !

@jimczi jimczi added >bug and removed >non-issue labels Aug 14, 2019
@jimczi
Copy link
Contributor

jimczi commented Aug 14, 2019

Since the fix is simple can we backport this to 6.x and 7.3.x too ?

@original-brownbear
Copy link
Member Author

@jimczi sure will back port to 6.8 and 7.3 as well :)

@original-brownbear original-brownbear merged commit f5db2b5 into elastic:master Aug 14, 2019
@original-brownbear original-brownbear deleted the fix-leaked-search-task branch August 14, 2019 11:01
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 14, 2019
…sQueryPhase (elastic#45500)

* If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the `onFailure` callback again and couting down the `counter` twice).

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 14, 2019
…sQueryPhase (elastic#45500)

* If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the `onFailure` callback again and couting down the `counter` twice).

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 14, 2019
…sQueryPhase (elastic#45500)

* If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the `onFailure` callback again and couting down the `counter` twice).

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
original-brownbear added a commit that referenced this pull request Aug 14, 2019
…sQueryPhase (#45500) (#45540)

* If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the `onFailure` callback again and couting down the `counter` twice).

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
original-brownbear added a commit that referenced this pull request Aug 14, 2019
…sQueryPhase (#45500) (#45541)

* If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the `onFailure` callback again and couting down the `counter` twice).

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
original-brownbear added a commit that referenced this pull request Aug 14, 2019
…sQueryPhase (#45500) (#45543)

* If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the `onFailure` callback again and couting down the `counter` twice).

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v6.8.3 v7.3.1 v7.4.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants