Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't suppress AlreadyClosedException #19975

Merged
merged 8 commits into from Aug 23, 2016
Merged

Conversation

mikemccand
Copy link
Contributor

Catching and suppressing AlreadyClosedException from Lucene is dangerous because it can mean there is a bug in ES since ES should normally guard against invoking Lucene classes after they were closed.

I reviewed the cases where we catch AlreadyClosedException from Lucene and removed the ones that I believe are not needed, or improved comments explaining why ACE is OK in that case.

I think (@s1monw can you confirm?) that holding the engine's readLock means IW will not be closed, except if disaster strikes (failEngine) at which point I think it's fine to see the original ACE in the logs?

Closes #19861

/* This can happen in a race condition: since we don't hold the engine's readLock (preventing it from closing) while
* holding this searcher, while closing just now, it's possible that we concurrently close the engine's IndexWriter and ES's
* store, and since closing Lucene's DirectoryReader tries to delete pending files it had held open, we can hit
* AlreadyClosedException from Lucene's Directory */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I buy the above explanation ;) Lucene's DirectoryReader itself already ignores ACE when trying to reclaim the pending deleted files. So I'm tempted to remove this catch clause ...

@mikemccand
Copy link
Contributor Author

@jasontedor thank you for the feedback; I pushed new changes.

@@ -914,6 +916,7 @@ protected final void writerSegmentStats(SegmentsStats stats) {

@Override
public long getIndexBufferRAMBytesUsed() {
// We don't guard w/ readLock here, so we could throw AlreadyClosedException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to caught and wrapped then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, where IndexShard calls this method, it expects/handles the AlreadyClosedException (handleRefreshException in IndexShard.java), and an ACE thrown from here can happen under normal usage (does not necessarily mean there's a bug).

super's javadocs also state that it can throw AlreadyClosedException.

We could alternatively acquire the readLock, in this method (and re-throw to AssertionError), but that's somewhat scary since writeLock can be held for quite some time, blocking IndexingMemoryController from polling the shards when one engine is flushing.

@s1monw
Copy link
Contributor

s1monw commented Aug 16, 2016

thanks mike for cleaning this up. We need to be very careful since EngineClosedException has a special meaning in replication that we use to retry documents so we should ensure that we suppress ACE when the engine is closed or also add ACE to the list of exceptions that trigger a retry?

I also wonder about the SearcherManager contract a bit. It seems like we are closing things in the right order SM first then rollback the writer. so from my perspective hitting ACE on SM is always safe? if not that's a problem in SM or RM? I am not sure if IW protects the caller from this or if there are any concurrency issues but SM should be safe to be called at any time?

@mikemccand mikemccand removed their assignment Aug 16, 2016
@mikemccand
Copy link
Contributor Author

I unassigned myself here and on #19861: I don't think I'm qualified to improve the exception handling here.

I had a chat with @s1monw about all the scary complexities; I think someone who better understands the locking / concurrency in the engine, and what the different exceptions mean to the distributed layer, etc., needs to tackle this.

I also wonder about the SearcherManager contract a bit.

I'll look into this.

@s1monw
Copy link
Contributor

s1monw commented Aug 16, 2016

I unassigned myself here and on #19861: I don't think I'm qualified to improve the exception handling here.

I don't think so. It's something we have to solve, I can only encourage you to go and fix it. As long as we do the right thing in the indexing path I think we are fine?

} catch (AlreadyClosedException e) {
// ignore
}
indexWriter.rollback();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we hit a merge exception in multiple threads we can call this multiple times? Why is it not ok to swallow this? do we need more assertions here or commments?

@s1monw
Copy link
Contributor

s1monw commented Aug 22, 2016

@mikemccand I pushed new changes

@mikemccand
Copy link
Contributor Author

LGTM, thanks @s1monw

@s1monw
Copy link
Contributor

s1monw commented Aug 23, 2016

@mikemccand I had to add another special case for forceMerge can you take another look

@mikemccand
Copy link
Contributor Author

LGTM, thanks @s1monw

@s1monw s1monw merged commit 668dac7 into elastic:master Aug 23, 2016
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Sep 19, 2016
…ngine

Since elastic#19975 we are aggressively failing with AssertionError when we catch an ACE
inside the InternalEngine. We treat everything that is neither a tragic even on
the IndexWriter or the Translog as a bug and throw an AssertionError. Yet, if the
engine hits an IOException on refresh of some sort and the IW doesn't realize it since
it's not fully under it's control we fail he engine but neither IW nor Translog are marked
as failed by tragic event while they are already closed.
This change takes the `failedEngine` exception into account and if it's set we know
that the engine failed by some other even than a tragic one and can continue.

This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather
than it's concrete implementation.

Relates to elastic#19975
s1monw added a commit that referenced this pull request Sep 19, 2016
…ngine (#20546)

Since #19975 we are aggressively failing with AssertionError when we catch an ACE
inside the InternalEngine. We treat everything that is neither a tragic even on
the IndexWriter or the Translog as a bug and throw an AssertionError. Yet, if the
engine hits an IOException on refresh of some sort and the IW doesn't realize it since
it's not fully under it's control we fail he engine but neither IW nor Translog are marked
as failed by tragic event while they are already closed.
This change takes the `failedEngine` exception into account and if it's set we know
that the engine failed by some other even than a tragic one and can continue.

This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather
than it's concrete implementation.

Relates to #19975
s1monw added a commit that referenced this pull request Sep 19, 2016
…ngine (#20546)

Since #19975 we are aggressively failing with AssertionError when we catch an ACE
inside the InternalEngine. We treat everything that is neither a tragic even on
the IndexWriter or the Translog as a bug and throw an AssertionError. Yet, if the
engine hits an IOException on refresh of some sort and the IW doesn't realize it since
it's not fully under it's control we fail he engine but neither IW nor Translog are marked
as failed by tragic event while they are already closed.
This change takes the `failedEngine` exception into account and if it's set we know
that the engine failed by some other even than a tragic one and can continue.

This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather
than it's concrete implementation.

Relates to #19975
s1monw added a commit that referenced this pull request Sep 19, 2016
…ngine (#20546)

Since #19975 we are aggressively failing with AssertionError when we catch an ACE
inside the InternalEngine. We treat everything that is neither a tragic even on
the IndexWriter or the Translog as a bug and throw an AssertionError. Yet, if the
engine hits an IOException on refresh of some sort and the IW doesn't realize it since
it's not fully under it's control we fail he engine but neither IW nor Translog are marked
as failed by tragic event while they are already closed.
This change takes the `failedEngine` exception into account and if it's set we know
that the engine failed by some other even than a tragic one and can continue.

This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather
than it's concrete implementation.

Relates to #19975
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Engine :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v5.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

InternalEngine should not catch AlreadyClosedException
6 participants