Don't suppress AlreadyClosedException #19975

mikemccand · 2016-08-12T14:19:15Z

Catching and suppressing AlreadyClosedException (ACE) from Lucene is dangerous: ES should normally guard against invoking Lucene classes after they are closed, so an ACE can mean there is a bug in ES.

I reviewed the cases where we catch AlreadyClosedException from Lucene and removed the ones that I believe are not needed, or improved comments explaining why ACE is OK in that case.

I think (@s1monw can you confirm?) that holding the engine's readLock means IW will not be closed, except if disaster strikes (failEngine) at which point I think it's fine to see the original ACE in the logs?

Closes #19861
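To illustrate the readLock argument above, here is a minimal sketch, not the actual InternalEngine code. It assumes a close path that takes the write lock, so a reader holding the read lock should never see the writer closed underneath it. `ReadLockGuardSketch`, `FakeWriter`, `Engine`, and `numDocs()` are hypothetical names, and both exception classes are local stand-ins so the sketch compiles without Lucene or ES on the classpath.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch only: every class here is a hypothetical stand-in, not the ES or Lucene type. */
class ReadLockGuardSketch {

    /** Stand-in for Lucene's AlreadyClosedException so the sketch compiles without Lucene. */
    static class AlreadyClosedException extends IllegalStateException {
        AlreadyClosedException(String message) { super(message); }
    }

    /** Stand-in for the exception the engine itself throws once it is closed. */
    static class EngineClosedException extends RuntimeException {
        EngineClosedException(String message) { super(message); }
    }

    /** Hypothetical writer that throws ACE when used after close. */
    static class FakeWriter {
        private volatile boolean closed;

        int numDocs() {
            if (closed) {
                throw new AlreadyClosedException("this writer is closed");
            }
            return 3;
        }

        void close() { closed = true; }
    }

    static class Engine {
        private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
        private final FakeWriter writer = new FakeWriter();
        private boolean closed; // only written while holding the write lock

        int numDocs() {
            rwl.readLock().lock();
            try {
                if (closed) {
                    throw new EngineClosedException("engine is closed");
                }
                // close() needs the write lock, so the writer cannot be closed here.
                return writer.numDocs();
            } catch (AlreadyClosedException e) {
                // Should be unreachable under the read lock; surface it instead of swallowing it.
                throw new AssertionError("unexpected ACE while holding readLock", e);
            } finally {
                rwl.readLock().unlock();
            }
        }

        void close() {
            rwl.writeLock().lock();
            try {
                closed = true;
                writer.close();
            } finally {
                rwl.writeLock().unlock();
            }
        }
    }
}
```

Wrapping the ACE in an AssertionError means an ordinary `catch (Exception e)` further up the stack cannot silently suppress it, which is the point of this change.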

mikemccand · 2016-08-12T14:21:16Z

core/src/main/java/org/elasticsearch/index/engine/EngineSearcher.java

+            /* This can happen in a race condition: since we don't hold the engine's readLock (preventing it from closing) while
+             * holding this searcher, while closing just now, it's possible that we concurrently close the engine's IndexWriter and ES's
+             * store, and since closing Lucene's DirectoryReader tries to delete pending files it had held open, we can hit
+             * AlreadyClosedException from Lucene's Directory */


I'm not sure I buy the above explanation ;) Lucene's DirectoryReader itself already ignores ACE when trying to reclaim the pending deleted files. So I'm tempted to remove this catch clause ...

mikemccand · 2016-08-12T15:17:20Z

@jasontedor thank you for the feedback; I pushed new changes.

jasontedor · 2016-08-12T15:45:26Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

@@ -914,6 +916,7 @@ protected final void writerSegmentStats(SegmentsStats stats) {

    @Override
    public long getIndexBufferRAMBytesUsed() {
+        // We don't guard w/ readLock here, so we could throw AlreadyClosedException


Does this need to be caught and wrapped then?

Well, where IndexShard calls this method, it expects/handles the AlreadyClosedException (handleRefreshException in IndexShard.java), and an ACE thrown from here can happen under normal usage (does not necessarily mean there's a bug).

super's javadocs also state that it can throw AlreadyClosedException.

We could alternatively acquire the readLock, in this method (and re-throw to AssertionError), but that's somewhat scary since writeLock can be held for quite some time, blocking IndexingMemoryController from polling the shards when one engine is flushing.

s1monw · 2016-08-16T19:36:13Z

Thanks Mike for cleaning this up. We need to be very careful, since EngineClosedException has a special meaning in replication: we use it to retry documents. So should we ensure that we suppress ACE when the engine is closed, or also add ACE to the list of exceptions that trigger a retry?

I also wonder about the SearcherManager contract a bit. It seems like we are closing things in the right order: SM first, then roll back the writer. So from my perspective, hitting ACE on SM is always safe? If not, that's a problem in SM or RM? I am not sure if IW protects the caller from this or if there are any concurrency issues, but SM should be safe to call at any time?

mikemccand · 2016-08-16T19:54:04Z

I unassigned myself here and on #19861: I don't think I'm qualified to improve the exception handling here.

I had a chat with @s1monw about all the scary complexities; I think someone who better understands the locking / concurrency in the engine, and what the different exceptions mean to the distributed layer, etc., needs to tackle this.

> I also wonder about the SearcherManager contract a bit.

I'll look into this.

s1monw · 2016-08-16T19:58:53Z

> I unassigned myself here and on #19861: I don't think I'm qualified to improve the exception handling here.

I don't think so. It's something we have to solve, I can only encourage you to go and fix it. As long as we do the right thing in the indexing path I think we are fine?

s1monw · 2016-08-16T20:01:39Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

-                } catch (AlreadyClosedException e) {
-                    // ignore
-                }
+                indexWriter.rollback();


If we hit a merge exception in multiple threads, we can call this multiple times? Why is it not OK to swallow this? Do we need more assertions here, or comments?
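One way to address the multiple-callers concern without swallowing ACE is to make sure only the first caller reaches the rollback at all. The sketch below shows that idea under stated assumptions; it is not the code that was merged. `RollbackOnceSketch`, `FakeWriter`, `Engine`, and `failEngine()` are hypothetical stand-ins, and the fake writer is deliberately strict (it throws on a second rollback), which is an assumption of this sketch rather than a statement about Lucene's IndexWriter.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: hypothetical stand-ins, not the ES or Lucene types. */
class RollbackOnceSketch {

    /** Hypothetical writer that refuses a second rollback. */
    static class FakeWriter {
        final AtomicInteger rollbacks = new AtomicInteger();
        private final AtomicBoolean closed = new AtomicBoolean();

        void rollback() {
            if (!closed.compareAndSet(false, true)) {
                throw new IllegalStateException("this writer is closed");
            }
            rollbacks.incrementAndGet();
        }
    }

    static class Engine {
        final FakeWriter writer = new FakeWriter();
        private final AtomicBoolean isClosed = new AtomicBoolean();

        /** May be called from several threads, e.g. when several merges fail at once. */
        void failEngine() {
            // Only the caller that flips the flag performs the rollback;
            // later callers return without touching the writer.
            if (isClosed.compareAndSet(false, true)) {
                writer.rollback();
            }
        }
    }
}
```

With this guard there is no catch clause around the rollback, so an exception thrown from it is a genuine surprise rather than an expected race. Note that a caller losing the race returns immediately and may do so before the winning rollback has finished.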

s1monw · 2016-08-22T20:39:16Z

@mikemccand I pushed new changes

mikemccand · 2016-08-22T20:45:01Z

LGTM, thanks @s1monw

s1monw · 2016-08-23T07:27:54Z

@mikemccand I had to add another special case for forceMerge, can you take another look?

mikemccand · 2016-08-23T09:55:39Z

LGTM, thanks @s1monw

…ngine (#20546)

Since #19975 we are aggressively failing with AssertionError when we catch an ACE inside the InternalEngine. We treat everything that is neither a tragic event on the IndexWriter nor on the Translog as a bug and throw an AssertionError. Yet if the engine hits an IOException of some sort on refresh, and the IW doesn't realize it since it's not fully under its control, we fail the engine, but neither IW nor Translog is marked as failed by a tragic event even though they are already closed. This change takes the `failedEngine` exception into account: if it's set, we know that the engine failed by some event other than a tragic one and can continue. This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather than its concrete implementation.

Relates to #19975
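The decision rule described in that commit message can be sketched as follows. This is a simplified illustration, not the InternalEngine code: `AceTriageSketch`, `EngineState`, `isExpected`, and `handle` are hypothetical, the field names `writerTragedy` and `translogTragedy` are made up for the sketch (only `failedEngine` comes from the message above), and the ACE class is a local stand-in for Lucene's.

```java
/** Sketch only: hypothetical stand-ins, not the ES or Lucene types. */
class AceTriageSketch {

    /** Stand-in for Lucene's AlreadyClosedException so the sketch compiles without Lucene. */
    static class AlreadyClosedException extends IllegalStateException {
        AlreadyClosedException(String message) { super(message); }
    }

    /** Hypothetical snapshot of what the engine knows when it catches an ACE. */
    static class EngineState {
        Throwable writerTragedy;   // assumed: tragic event recorded by the IndexWriter, if any
        Throwable translogTragedy; // assumed: tragic event recorded by the Translog, if any
        Throwable failedEngine;    // set when the engine was failed for any other reason
    }

    /** An ACE is expected only if something already explains why things are closed. */
    static boolean isExpected(EngineState state) {
        return state.writerTragedy != null
            || state.translogTragedy != null
            || state.failedEngine != null;
    }

    /** Returns true for an explained ACE; treats an unexplained one as a bug. */
    static boolean handle(AlreadyClosedException ace, EngineState state) {
        if (isExpected(state)) {
            return true;
        }
        throw new AssertionError("Unexpected AlreadyClosedException", ace);
    }
}
```

For example, an engine failed by an IOException during refresh would have `failedEngine` set, so `handle` returns normally, while the same ACE on an engine with no recorded failure raises an AssertionError.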

don't suppress AlreadyClosedException: it means there's a bug somewhere

8699d2d

mikemccand added >bug :Engine v5.0.0-beta1 labels Aug 12, 2016

mikemccand self-assigned this Aug 12, 2016

mikemccand reviewed Aug 12, 2016

mikemccand added 2 commits August 12, 2016 11:13

Wrap unexpected AlreadyCloseException under AssertionError to prevent catch clauses from suppressing it

eb24f12

Add comment explaining why we special case AssertionError(AlreadyClosedException))

161822a

jasontedor reviewed Aug 12, 2016

halt the JVM if an unexpected AlreadyClosedException strikes

3b604e5

mikemccand removed their assignment Aug 16, 2016

s1monw reviewed Aug 16, 2016

s1monw added 2 commits August 22, 2016 22:23

Merge branch 'master' into dont_catch_ace

da058cc

enforce that ACE is only handled in a tragic event

ad97309

special case forceMerge - here we can catch ACE safely if we are already closed

ac73e64

add more comments to clarify what to do with ACE

520ce30

s1monw merged commit 668dac7 into elastic:master Aug 23, 2016

s1monw mentioned this pull request Sep 19, 2016

Take refresh IOExceptions into account when catching ACE in InternalEngine #20546

Merged
