Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestIndexWriterOnVMError.testUnknownError times out (potential IndexWriter deadlock with tragic exceptions) #12654

Closed
benwtrent opened this issue Oct 11, 2023 · 8 comments · Fixed by #12751

Comments

@benwtrent
Copy link
Member

benwtrent commented Oct 11, 2023

Description

CI indicated the test suite timed out. So, I ran the reproduction line locally and had to kill the test running after 5 minutes.

I seriously doubt this test should take longer than 5 minutes. I ran without the seed and it finished in seconds.

Gradle command to reproduce

./gradlew test --tests TestIndexWriterOnVMError.testUnknownError -Dtests.seed=4A059D04FCC8873 -Dtests.nightly=true -Dtests.locale=rw-RW -Dtests.timezone=Europe/Luxembourg -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.multiplier=1
@benwtrent
Copy link
Member Author

Added some println statements to

public void eval(MockDirectoryWrapper dir) throws IOException {
            if (r.nextInt(3000) == 0) {
              if (callStackContains(IndexWriter.class)) {
                throw new UnknownError("Fake UnknownError");
              }
            }
          }

I verified that throw new UnknownError("Fake UnknownError"); path is executed and the exception is thrown.

Note, both the seed and -Dtests.nightly=true must be set for the extreme runtime.

@benwtrent
Copy link
Member Author

    final int numIterations = TEST_NIGHTLY ? atLeast(100) : atLeast(5);

Is one such place where nightly matters from what I can tell.

I verified the 139 is the iterations, which isn't a ton. I have other seeds passing with much more than that.

@dweiss
Copy link
Contributor

dweiss commented Nov 2, 2023

Well, this test is almost never "fast" for me... the conditions passed in Failure.eval are frequently called, but rarely hit the right call stack - this is particularly problematic with testCheckpoint - if I count the number of times the eval is called (for a particular random seed), it's 1201263, then nextInt(4) == 0 drops the call stack check to ~300k BUT the call stack check is successful only 50 times out of 299622 (and call stack collection is quite expensive overall).

Anyway, for that particular seed you identified, @benwtrent , the index writer is simply hanging in shouldClose and never returns:

  private synchronized boolean shouldClose(boolean waitForClose) {
    while (true) {
      if (closed == false) {
        if (closing == false) {
          // We get to close
          closing = true;
          return true;
        } else if (waitForClose == false) {
          return false;
        } else {
          // Another thread is presently trying to close;
          // wait until it finishes one way (closes
          // successfully) or another (fails to close)
          doWait();
        }
      } else {
        return false;
      }
    }
  }

Nothing is happening in the test - it just idly waits until it times out.

"TEST-TestIndexWriterOnVMError.testUnknownError-seed#[4A059D04FCC8873]" #18 prio=5 os_prio=0 cpu=1453.12ms elapsed=211.66s tid=0x000001a54330c3d0 nid=0x3ce8 in Object.wait()  [0x00000011de3fd000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.3/Native Method)
	- waiting on <0x00000000f67432c0> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:5419)
	- locked <0x00000000f67432c0> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.shouldClose(IndexWriter.java:1386)
	- locked <0x00000000f67432c0> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2442)
	at org.apache.lucene.index.TestIndexWriterOnVMError.getTragedy(TestIndexWriterOnVMError.java:250)
	at org.apache.lucene.index.TestIndexWriterOnVMError.doTest(TestIndexWriterOnVMError.java:207)
	at org.apache.lucene.index.TestIndexWriterOnVMError.testUnknownError(TestIndexWriterOnVMError.java:278) 
...

The test has this code:

      // TODO: remove rollback here, and add this assert to ensure "full OOM protection" anywhere IW
      // does writes
      // assertTrue("hit OOM but writer is still open, WTF: ", writer.isClosed());
      try {
        writer.rollback();
      } catch (Throwable t) {
        t.printStackTrace(log);
      }

And clearly that assertion would have fired, if enabled. I don't know how to fix this either though.

@dweiss
Copy link
Contributor

dweiss commented Nov 2, 2023

You can reproduce this problem from the IDE as well:

-ea -Dtests.seed=4A059D04FCC8873 -Dtests.nightly=true -Dtests.multiplier=1 -Dtests.verbose=true

The last message is:

IW 114 [2023-11-02T20:32:41.152762600Z; main]: hit tragic UnknownError inside rollbackInternal

After that - nothing.

@dweiss
Copy link
Contributor

dweiss commented Nov 2, 2023

This may be a legitimate bug somewhere. Maybe @mikemccand or @s1monw will know what the expected state here should be.

@dweiss
Copy link
Contributor

dweiss commented Nov 2, 2023

In fact, I think it's this block in IW:

          // close all the closeables we can (but important is readerPool and writeLock to prevent
          // leaks)
          IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock);
          writeLock = null;
          closed = true;
          closing = false;

the problem here is - if there's another tragic exception from within IOUtils.close* call, the assignments below are never executed.

@s1monw
Copy link
Member

s1monw commented Nov 3, 2023

@dweiss I agree this is the problem. We should execute that IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock); in a try / finally block. I can open a PR for that unless you wanna do that?

@dweiss
Copy link
Contributor

dweiss commented Nov 3, 2023

Thanks, Simon. I'll open up a PR.

@dweiss dweiss changed the title TestIndexWriterOnVMError.testUnknownError timesout TestIndexWriterOnVMError.testUnknownError times out (potential IndexWriter deadlock with tragic exceptions) Nov 3, 2023
dweiss added a commit to dweiss/lucene that referenced this issue Nov 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants