Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9164: Ignore ACE on tragic event if IW is closed #1215

Closed
wants to merge 1 commit into from

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Jan 27, 2020

If an IndexWriter was closed, then AlreadyClosedException should not be considered a tragic event.

@dnhatn
Copy link
Member Author

dnhatn commented Jan 27, 2020

@mikemccand @jpountz Would you mind taking a look? Thank you.

@s1monw
Copy link
Member

s1monw commented Jan 29, 2020

I don't think we should change the behavior here. If we move away from treating a ACE as fatal here we are missing important verification. ACE should not be thrown anywhere where we treat exceptions as fatal for instance during a merge, in a refresh / flush, inside DWPT etc. If IW is closed due to any kind of event we make sure that we prevent threads from entering DWPT, issue new flush / refreshes and stop all merges. A tragic event does the same thing such that if we hit ACE in a critical place it's highly likely that's a bug. So I think we should keep the logic as it is?

@ywelsch
Copy link
Contributor

ywelsch commented Jan 30, 2020

The problem is that IndexWriter.publishFrozenUpdates does not properly account for the IndexWriter being concurrently closed, and just catches any Throwable from packet.tryApply(w), turning it into a tragic event. In particular, FrozenBufferedUpdates.forceApply does not account for the IndexWriter being concurrently closed between start and finish of applying.

The lifecycle around these components seems a bit ill-defined, I therefore wonder if something more contained, like the following, would be an option:

diff --git a/lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java b/lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java
index e58114d3b0f..cf8559ff703 100644
--- a/lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java
+++ b/lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java
@@ -377,9 +377,15 @@ final class FrozenBufferedUpdates {
       try {
         result = closeSegmentStates(writer, segStates, success);
       } finally {
-        // Matches the incRef we did above, but we must do the decRef after closing segment states else
-        // IFD can't delete still-open files
-        writer.deleter.decRef(delFiles);
+        if (writer.isClosed()) {
+          // writer could have been concurrently closed between start and end of applying, just return here
+          // as deleter will otherwise just start barfing
+          return;
+        } else {
+          // Matches the incRef we did above, but we must do the decRef after closing segment states else
+          // IFD can't delete still-open files
+          writer.deleter.decRef(delFiles);
+        }
       }
 
       if (result.anyDeletes) {

@s1monw
Copy link
Member

s1monw commented Jan 30, 2020

I think you are confusing something here. publishFrozenUpdates and it's packet.tryApply(w) is well defined. It's a closure that's been added to a queue that is only processed in a single thread and will account for concurrent closing. If we indeed processed any item on the queue after we are closed it's a major bug. I assume you have some background info that I am missing here. Is there a failure I can look at or some issue that has caused this change proposal?

@ywelsch
Copy link
Contributor

ywelsch commented Jan 30, 2020

The simplest way to reproduce the issue is to run testRefreshAndRollbackConcurrently in a loop with the assertion that Nhat added here: https://github.com/apache/lucene-solr/pull/1215/files#diff-748b10756598c10ffe466342934deb35R3776

I also added a bit more log output to show the stack trace:

diff --git a/lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java b/lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java
index dc70a63c497..65f8685be9c 100644
--- a/lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java
+++ b/lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java
@@ -3768,6 +3768,10 @@ public class TestIndexWriter extends LuceneTestCase {
       stopped.set(true);
       indexer.join();
       refresher.join();
+      if (w.getTragicException() != null) {
+        w.getTragicException().printStackTrace();
+      }
+      assertNull("should not consider ACE a tragedy on a closed IW", w.getTragicException());
       IOUtils.close(sm, dir);
     }
   }

This results in the following exception:

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681)
	at org.apache.lucene.index.IndexFileDeleter.ensureOpen(IndexFileDeleter.java:346)
	at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:669)
	at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:589)
	at org.apache.lucene.index.FrozenBufferedUpdates.finishApply(FrozenBufferedUpdates.java:382)
	at org.apache.lucene.index.FrozenBufferedUpdates.lambda$forceApply$0(FrozenBufferedUpdates.java:245)
	at org.apache.lucene.index.FrozenBufferedUpdates.forceApply(FrozenBufferedUpdates.java:250)
	at org.apache.lucene.index.FrozenBufferedUpdates.tryApply(FrozenBufferedUpdates.java:158)
	at org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$3(IndexWriter.java:2575)
	at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5099)
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:507)
	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:165)
	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
	at org.apache.lucene.index.TestIndexWriter.lambda$testRefreshAndRollbackConcurrently$38(TestIndexWriter.java:3753)
	at java.base/java.lang.Thread.run(Thread.java:830)
NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriter -Dtests.method=testRefreshAndRollbackConcurrently -Dtests.seed=A130740B315C225B -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fa-AF -Dtests.timezone=America/St_Kitts -Dtests.asserts=true -Dtests.file.encoding=UTF-8

java.lang.AssertionError: should not consider ACE a tragedy on a closed IW expected null, but was:<org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed>

	at __randomizedtesting.SeedInfo.seed([A130740B315C225B:4BF371737FEBAAB9]:0)
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotNull(Assert.java:755)
	at org.junit.Assert.assertNull(Assert.java:737)
	at org.apache.lucene.index.TestIndexWriter.testRefreshAndRollbackConcurrently(TestIndexWriter.java:3774)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
...

showing how the IndexWriter is closed while packet.tryApply(w) is running, in particular when FrozenBufferedUpdates is applying segment states.

@s1monw
Copy link
Member

s1monw commented Jan 31, 2020

Thanks @ywelsch I think I understand the issue here. From my perspective there are some issues in how we ensure that there are no other threads modifying IW while we are closing / aborting. In oder to ensure that we clean up all resources we opened ie. open files etc. we need to harden this more that it is today. I think during rollback we must ensure that no other thread is modifying the writer anymore. We already to that in several places but I wonder if we need to streamline this more with some thing like a semaphore that has Int.MAX leases and that we fully acquire before we rollback. The methods in question are updateDocument and friends, commit & getReader. That way we ensure no other threads are modifying anymore and we can safely rollback.

@s1monw
Copy link
Member

s1monw commented Feb 10, 2020

I will start working on some refactorings to streamline this.

@dnhatn
Copy link
Member Author

dnhatn commented Mar 6, 2020

Closes in favor of #1319.

@dnhatn dnhatn closed this Mar 6, 2020
@dnhatn dnhatn deleted the tragic-event branch March 6, 2020 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants