-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't suppress AlreadyClosedException #19975
Conversation
/* This can happen in a race condition: since we don't hold the engine's readLock (preventing it from closing) while | ||
* holding this searcher, while closing just now, it's possible that we concurrently close the engine's IndexWriter and ES's | ||
* store, and since closing Lucene's DirectoryReader tries to delete pending files it had held open, we can hit | ||
* AlreadyClosedException from Lucene's Directory */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I buy the above explanation ;) Lucene's DirectoryReader
itself already ignores ACE
when trying to reclaim the pending deleted files. So I'm tempted to remove this catch clause ...
… catch clauses from suppressing it
@jasontedor thank you for the feedback; I pushed new changes. |
@@ -914,6 +916,7 @@ protected final void writerSegmentStats(SegmentsStats stats) { | |||
|
|||
@Override | |||
public long getIndexBufferRAMBytesUsed() { | |||
// We don't guard w/ readLock here, so we could throw AlreadyClosedException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to caught and wrapped then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, where IndexShard
calls this method, it expects/handles the AlreadyClosedException
(handleRefreshException
in IndexShard.java
), and an ACE thrown from here can happen under normal usage (does not necessarily mean there's a bug).
super
's javadocs also state that it can throw AlreadyClosedException
.
We could alternatively acquire the readLock
, in this method (and re-throw to AssertionError
), but that's somewhat scary since writeLock
can be held for quite some time, blocking IndexingMemoryController
from polling the shards when one engine is flushing.
thanks mike for cleaning this up. We need to be very careful since EngineClosedException has a special meaning in replication that we use to retry documents so we should ensure that we suppress ACE when the engine is closed or also add ACE to the list of exceptions that trigger a retry? I also wonder about the SearcherManager contract a bit. It seems like we are closing things in the right order SM first then rollback the writer. so from my perspective hitting ACE on SM is always safe? if not that's a problem in SM or RM? I am not sure if IW protects the caller from this or if there are any concurrency issues but SM should be safe to be called at any time? |
I unassigned myself here and on #19861: I don't think I'm qualified to improve the exception handling here. I had a chat with @s1monw about all the scary complexities; I think someone who better understands the locking / concurrency in the engine, and what the different exceptions mean to the distributed layer, etc., needs to tackle this.
I'll look into this. |
I don't think so. It's something we have to solve, I can only encourage you to go and fix it. As long as we do the right thing in the indexing path I think we are fine? |
} catch (AlreadyClosedException e) { | ||
// ignore | ||
} | ||
indexWriter.rollback(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we hit a merge exception in multiple threads we can call this multiple times? Why is it not ok to swallow this? do we need more assertions here or commments?
@mikemccand I pushed new changes |
LGTM, thanks @s1monw |
@mikemccand I had to add another special case for forceMerge can you take another look |
LGTM, thanks @s1monw |
…ngine Since elastic#19975 we are aggressively failing with AssertionError when we catch an ACE inside the InternalEngine. We treat everything that is neither a tragic even on the IndexWriter or the Translog as a bug and throw an AssertionError. Yet, if the engine hits an IOException on refresh of some sort and the IW doesn't realize it since it's not fully under it's control we fail he engine but neither IW nor Translog are marked as failed by tragic event while they are already closed. This change takes the `failedEngine` exception into account and if it's set we know that the engine failed by some other even than a tragic one and can continue. This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather than it's concrete implementation. Relates to elastic#19975
…ngine (#20546) Since #19975 we are aggressively failing with AssertionError when we catch an ACE inside the InternalEngine. We treat everything that is neither a tragic even on the IndexWriter or the Translog as a bug and throw an AssertionError. Yet, if the engine hits an IOException on refresh of some sort and the IW doesn't realize it since it's not fully under it's control we fail he engine but neither IW nor Translog are marked as failed by tragic event while they are already closed. This change takes the `failedEngine` exception into account and if it's set we know that the engine failed by some other even than a tragic one and can continue. This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather than it's concrete implementation. Relates to #19975
…ngine (#20546) Since #19975 we are aggressively failing with AssertionError when we catch an ACE inside the InternalEngine. We treat everything that is neither a tragic even on the IndexWriter or the Translog as a bug and throw an AssertionError. Yet, if the engine hits an IOException on refresh of some sort and the IW doesn't realize it since it's not fully under it's control we fail he engine but neither IW nor Translog are marked as failed by tragic event while they are already closed. This change takes the `failedEngine` exception into account and if it's set we know that the engine failed by some other even than a tragic one and can continue. This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather than it's concrete implementation. Relates to #19975
…ngine (#20546) Since #19975 we are aggressively failing with AssertionError when we catch an ACE inside the InternalEngine. We treat everything that is neither a tragic even on the IndexWriter or the Translog as a bug and throw an AssertionError. Yet, if the engine hits an IOException on refresh of some sort and the IW doesn't realize it since it's not fully under it's control we fail he engine but neither IW nor Translog are marked as failed by tragic event while they are already closed. This change takes the `failedEngine` exception into account and if it's set we know that the engine failed by some other even than a tragic one and can continue. This change also uses the `ReferenceManager#RefreshListener` interface in the engine rather than it's concrete implementation. Relates to #19975
Catching and suppressing
AlreadyClosedException
from Lucene is dangerous because it can mean there is a bug in ES since ES should normally guard against invoking Lucene classes after they were closed.I reviewed the cases where we catch
AlreadyClosedException
from Lucene and removed the ones that I believe are not needed, or improved comments explaining why ACE is OK in that case.I think (@s1monw can you confirm?) that holding the engine's
readLock
means IW will not be closed, except if disaster strikes (failEngine
) at which point I think it's fine to see the original ACE in the logs?Closes #19861