Die with dignity while merging #27265

Merged
merged 16 commits into from Nov 6, 2017

jasontedor
Member

If an out of memory error is thrown while merging, today we quietly rewrap it into a merge exception and the out of memory error is lost. Instead, we need to rethrow out of memory errors, and in fact any fatal error here, and let those go uncaught so that the node is torn down. This commit causes this to be the case.

Relates #19272
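
For readers following along, here is a minimal, self-contained sketch of the rethrow behavior described above. It is not the code from this pull request: the class name `MaybeDieSketch` and the wrapper type `WrappedMergeFailure` are hypothetical, and the body of `maybeDie` is an assumption based only on how the method is described in this conversation.

```java
import java.io.IOException;

public class MaybeDieSketch {

    /** Hypothetical stand-in for a wrapper exception such as a merge exception. */
    static class WrappedMergeFailure extends RuntimeException {
        WrappedMergeFailure(Throwable cause) {
            super(cause);
        }
    }

    /**
     * Assumed behavior: if the throwable is a fatal Error, log it and rethrow it
     * so that it can propagate to the uncaught exception handler. Otherwise
     * return normally so the caller can handle the failure as before.
     */
    static void maybeDie(String message, Throwable maybeFatal) {
        if (maybeFatal instanceof Error) {
            System.err.println(message + ": " + maybeFatal);
            throw (Error) maybeFatal;
        }
    }

    /** Handles a failure: die on fatal errors, otherwise wrap and return the wrapper. */
    static WrappedMergeFailure handle(Throwable exc) {
        maybeDie("fatal error while merging", exc); // must run before any rewrapping
        return new WrappedMergeFailure(exc);
    }

    public static void main(String[] args) {
        // An ordinary failure is wrapped as before.
        System.out.println("wrapped cause: " + handle(new IOException("disk")).getCause());

        // A fatal error is rethrown instead of being hidden inside the wrapper.
        try {
            handle(new OutOfMemoryError("simulated"));
        } catch (OutOfMemoryError e) {
            // Caught only to keep this demo running; real callers must not catch it.
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```

The ordering is the point of the sketch: the fatal check happens before the throwable is wrapped, so the original error is never lost inside another exception.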

@jasontedor
Member Author

Do not be thrown off by the size of the diff here; most of it is moving code to a new class called EngineTestCase so that it can be reused from EvilInternalEngineTests. We need to be evil here so that we can install an uncaught exception handler.

@@ -1925,14 +1917,34 @@ public void onFailure(Exception e) {

    @Override
    protected void doRun() throws Exception {
        MergePolicy.MergeException e = new MergePolicy.MergeException(exc, dir);
        failEngine("merge failed", e);
        maybeDie("fatal error while merging", exc);
Member

Out of curiosity, why maybeDie in the generic thread instead of calling maybeDie as the first thing in handleMergeException?

Member Author

Good question, I should have left a comment explaining this. The reason is that the call would then be on the Lucene merge thread, where I have no guarantee that the call stack does not contain a catch (Throwable) block. By moving this to another thread that we have complete control over, we know what the stack contains and know that the error will go uncaught.
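
The difference between the two call sites can be illustrated with a small standalone sketch. Everything in it is hypothetical: `runUnderCatchThrowable` merely simulates a foreign call stack that catches `Throwable`, and a plain `Thread` stands in for the generic thread pool used in this change.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class UncaughtOnOwnThreadSketch {

    /** Simulates a call stack we do not control that swallows everything, Errors included. */
    static boolean runUnderCatchThrowable(Runnable callback) {
        try {
            callback.run();
            return false;
        } catch (Throwable t) {
            return true; // the Error stops here and never reaches an uncaught exception handler
        }
    }

    /**
     * Throws the given Error on a thread whose stack we fully control and
     * returns what that thread's uncaught exception handler observed.
     */
    static Throwable throwOnOwnThread(Error fatal) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        Thread thread = new Thread(() -> {
            throw fatal; // nothing on this stack can catch it
        });
        thread.setUncaughtExceptionHandler((t, throwable) -> {
            seen.set(throwable);
            done.countDown();
        });
        thread.start();
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        Error fatal = new OutOfMemoryError("simulated");
        System.out.println("swallowed on foreign stack: "
                + runUnderCatchThrowable(() -> { throw fatal; }));
        System.out.println("handler saw the same error: "
                + (throwOnOwnThread(fatal) == fatal));
    }
}
```

In a real node the handler would be the process-wide uncaught exception handler rather than a per-thread one; the per-thread handler here only keeps the demo observable without terminating the JVM.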

Member Author

@dakrone I pushed fb67083.

Member Author

See also the Javadoc I had put on maybeDie that explains that callers must ensure the call stack does not contain catch statements that would lead to the thrown error being caught and never reaching the uncaught exception handler.

Member

Ahh okay, that makes sense. Thanks for the explanation!

What happens if there isn't enough memory to create the new thread? Would it be good to do it in both places just to make best-effort to be sure?

Member Author

It's unlikely that a new thread will be created here; instead, we will simply be reusing an existing one from the generic thread pool. However, it still raises the question of what happens if some allocation on this path fails, leading to another out of memory error being thrown. In that case, that error would go uncaught and still lead to the node being torn down, which is exactly what we are trying to achieve.

@jasontedor
Member Author

@bleskes Would you please review?

@@ -1916,7 +1908,6 @@ protected void doRun() throws Exception {

    @Override
    protected void handleMergeException(final Directory dir, final Throwable exc) {
        logger.error("failed to merge", exc);
Contributor

nit: shall we keep this here, for fast visibility and also in case it happens during node shutdown?

Contributor

I glanced at the Lucene code leading here and it all seems good. I do wonder whether it would be good to assert that we can't find an Error in the cause chain of exc.

Member Author

Looking at the Lucene code as written right now, that's not possible. I'm really not sure if I want to add this assertion right now (look at how complicated it is in Netty4Utils#maybeError, where Netty can hide things we do not want hidden from us). I understand the point of the assertion is to ensure that what we think we see in the Lucene code always holds, but it's not clear to me that we'd ever encounter a situation where an Error occurs in testing that would trip one of these assertions anyway. So I see a cost to adding the assertion, and I'm not sure the benefit outweighs that cost.

Contributor

I see your point about tests maybe not being enough. I was thinking of something simple like assert ExceptionsHelper.unwrap(exc, Error.class) == null : exc. maybeError also looks at suppressed exceptions, which I'm fine with not doing. Alternatively, we can convert maybeError to a top-level utility.
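
As an illustration of the kind of check suggested here, the following is a rough sketch of a cause-chain lookup. The `unwrap` method below is a hypothetical, self-contained helper written for this example; it is not the Elasticsearch `ExceptionsHelper` implementation, and the depth bound of 10 is an arbitrary assumption. Like the suggestion above, it ignores suppressed exceptions.

```java
import java.io.IOException;

public class UnwrapErrorSketch {

    /**
     * Returns the first throwable in the cause chain that is an instance of
     * the given type, or null if none is found within the depth bound.
     */
    static <T extends Throwable> T unwrap(Throwable t, Class<T> type) {
        int depth = 0;
        while (t != null && depth++ < 10) { // bound guards against cyclic cause chains
            if (type.isInstance(t)) {
                return type.cast(t);
            }
            t = t.getCause();
        }
        return null;
    }

    public static void main(String[] args) {
        Throwable plain = new RuntimeException("merge failed", new IOException("disk full"));
        Throwable hidden = new RuntimeException("merge failed", new OutOfMemoryError("simulated"));

        // The assertion discussed above would pass for the first and trip for the second.
        System.out.println("plain has Error in chain: " + (unwrap(plain, Error.class) != null));
        System.out.println("hidden has Error in chain: " + (unwrap(hidden, Error.class) != null));
    }
}
```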

Member Author

I will consider this in a follow-up.

@jasontedor jasontedor merged commit d5451b2 into elastic:master Nov 6, 2017
jasontedor added a commit that referenced this pull request Nov 6, 2017
If an out of memory error is thrown while merging, today we quietly
rewrap it into a merge exception and the out of memory error is
lost. Instead, we need to rethrow out of memory errors, and in fact any
fatal error here, and let those go uncaught so that the node is torn
down. This commit causes this to be the case.

Relates #27265
@jasontedor jasontedor deleted the merge-oom branch November 7, 2017 13:52
@lcawl lcawl removed the v6.1.0 label Dec 12, 2017
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Engine :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v5.6.5 v6.0.0 v7.0.0-beta1