Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-8962: Merge segments on getReader #1623

Merged
merged 43 commits into from Aug 24, 2020
Merged

LUCENE-8962: Merge segments on getReader #1623

merged 43 commits into from Aug 24, 2020

Conversation

s1monw
Copy link
Member

@s1monw s1monw commented Jun 27, 2020

Add IndexWriter merge-on-refresh feature to selectively merge small segments on getReader, subject to a configurable timeout, to improve search performance by reducing the number of small segments for searching.

@s1monw
Copy link
Member Author

s1monw commented Jun 27, 2020

@mikemccand @msokolov @msfroh I tried to build the minimal necessary code to enable this feature. I do actually think that the refresh / getReader change is much simpler than the commit part since we don't need to keep a cloned SegmentInfos in sync. I opened this really as a question if it can be that simple?

@mikemccand
Copy link
Member

Wow! This is incredibly simple! Thanks to the clean approach @s1monw worked out for commit-on-merge, awesome! I think tests would exercise this due to MockRandomMergePolicy? I'll try beasting :)

@msokolov
Copy link
Contributor

So this would merge small commits on refresh? That's cool. I wonder if it would be more obvious to users if we call the MergeTrigger REFRESH? I see that the trigger method is IndexWriter.getReader, but it seems like ultimately the higher level event that is more familiar is refresh.

@mikemccand
Copy link
Member

So this would merge small commits on refresh?

Small segments, yes.

I wonder if it would be more obvious to users if we call the MergeTrigger REFRESH?

+1, refresh is more recognized in the outside world :)

I have been beasting this and uncovering small test failures, in tests that are confused that they do not have the expected number of segments. I'll push some fixes for those ...

@mikemccand
Copy link
Member

Here's a fun tragic failure test that repros:

   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterOnVMError -Dtests.method=testUnknownError -Dtests.seed=BBEAF0EBC40AB8F7 -Dtests.slow=true -Dtests.badapples=true \
-Dtests.locale=brx -Dtests.timezone=Indian/Chagos -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.30s | TestIndexWriterOnVMError.testUnknownError <<<
   [junit4]    > Throwable #1: java.lang.AssertionError
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([BBEAF0EBC40AB8F7:50BC2F6BC68D2EC7]:0)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:5026)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:5019)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4214)
   [junit4]    >        at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5735)
   [junit4]    >        at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:40)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:581)
   [junit4]    >        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:103)
   [junit4]    >        at org.apache.lucene.index.TestIndexWriterOnVMError.doTest(TestIndexWriterOnVMError.java:175)
   [junit4]    >        at org.apache.lucene.index.TestIndexWriterOnVMError.testUnknownError(TestIndexWriterOnVMError.java:251)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene86): {text_payloads=PostingsFormat(name=Direct), text_vectors=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Luc\
ene84)), text1=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene84)), id=PostingsFormat(name=Direct)}, docValues:{dv3=DocValuesFormat(name=Asserting), dv2=DocValuesFormat\
(name=Lucene80), dv5=DocValuesFormat(name=Asserting), dv=DocValuesFormat(name=Lucene80), dv4=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=1228, maxMBSortInHeap=6.874571632539512, s\
im=Asserting(RandomSimilarity(queryNorm=true): {text_payloads=IB SPL-L1, text_vectors=BM25(k1=1.2,b=0.75), text1=DFI(Saturated)}), locale=brx, timezone=Indian/Chagos
   [junit4]   2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 (64-bit)/cpus=128,threads=1,free=522930136,total=536870912
   [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriterOnVMError]

@mikemccand
Copy link
Member

A different, reproducing, test failure, likely from the same cause:

   >     java.lang.AssertionError
   >         at __randomizedtesting.SeedInfo.seed([C64EB0BA0CE0061F:FB961E96340E586F]:0)
   >         at org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:5026)
   >         at org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:5019)
   >         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4214)
   >         at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5735)
   >         at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:40)
   >         at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:581)
   >         at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:103)
   >         at org.apache.lucene.index.TestIndexWriterExceptions2.testBasics(TestIndexWriterExceptions2.java:205)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
   >         at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
   >         at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
   >         at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
   >         at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
   >         at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
   >         at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
   >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
   >         at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
   >         at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
   >         at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
   >         at java.base/java.lang.Thread.run(Thread.java:834)
  2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterExceptions2 -Dtests.method=testBasics -Dtests.seed=C64EB0BA0CE0061F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=d\
e-BE -Dtests.timezone=Atlantic/Azores -Dtests.asserts=true -Dtests.file.encoding=UTF-8
  2> NOTE: test params are: codec=Asserting(Lucene86): {}, docValues:{}, maxPointsInLeafNode=1532, maxMBSortInHeap=7.7106929429614635, sim=Asserting(RandomSimilarity(queryNorm=false): {tex\
t_payloads=LM Dirichlet(2000.000000), text_vectors=IB SPL-D3(800.0), text1=IB SPL-DZ(0.3)}), locale=de-BE, timezone=Atlantic/Azores
  2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 (64-bit)/cpus=128,threads=1,free=245301416,total=268435456
  2> NOTE: All tests run in this JVM: [TestRollingUpdates, Test2BPostingsBytes, TestIndexWriterExceptions2]

@s1monw
Copy link
Member Author

s1monw commented Jul 2, 2020

this assertion checks that we don't hold the full flush lock. I am trying to remember why I added this assertion. I understand the assertion above which also has a comment but I am unsure about the full flush lock it should not cause any deadlocks or any issues. I will keep thinking about it

@s1monw
Copy link
Member Author

s1monw commented Aug 10, 2020

@mikemccand I do understand the issue now why holding the flushLock is illegal here. The problem is again the lock ordering in combination with the commitLock. One option that we have here is to remove the flushLock altogether and replace it's usage with the commitLock. I guess we need to find a better or new name for it but I don't see where having two different locks buys us much since they are both really just used to sync on administration of the IW. I personally also don't see why it would buys us anything in terms of concurrency. WDYT

@s1monw s1monw requested a review from dnhatn August 12, 2020 20:14
@s1monw
Copy link
Member Author

s1monw commented Aug 12, 2020

@mikemccand @msokolov @msfroh I pushed a new and slightly more complex but afaik correct approach to do the merge during getReader. Would be great to get some feedback. I think it's still pretty contained.

@mikemccand
Copy link
Member

Awesome, thanks @s1monw! I will try to have a look soon. I kicked off beasting of all Lucene (core + modules) tests with this change ... no failures yet after 31 iterations.

@mikemccand
Copy link
Member

I kicked off beasting of all Lucene (core + modules) tests with this change ... no failures yet after 31 iterations.

OK, it has run 1061 iterations now of all Lucene (core + modules) tests, and some interesting (~15) failures:

ant test  -Dtestcase=TestIndexWriter -Dtests.method=testRandomOperations -Dtests.seed=432EB011B0898067 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=pt-ST -Dtests.timezone=US/Mountain -Dtests.asserts=true -Dtest\
s.file.encoding=UTF-8

   [junit4]   2> ago 13, 2020 3:37:55 DA TARDE com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-0,5,TGRP-TestIndexWriter]
   [junit4]   2> java.lang.AssertionError: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot commit
   [junit4]   2>        at __randomizedtesting.SeedInfo.seed([432EB011B0898067]:0)
   [junit4]   2>        at org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3886)
   [junit4]   2>        at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> Caused by: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot commit
   [junit4]   2>        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4930)
   [junit4]   2>        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3365)
   [junit4]   2>        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3664)
   [junit4]   2>        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3622)
   [junit4]   2>        at org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3879)
   [junit4]   2>        ... 1 more
   [junit4]   2>        Suppressed: org.apache.lucene.store.AlreadyClosedException: refusing to delete any files: this IndexWriter hit an unrecoverable exception
   [junit4]   2>                at org.apache.lucene.index.IndexFileDeleter.ensureOpen(IndexFileDeleter.java:349)
   [junit4]   2>                at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:669)
   [junit4]   2>                at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:589)
   [junit4]   2>                at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3375)
   [junit4]   2>                ... 4 more
   [junit4]   2>        Caused by: org.apache.lucene.index.CorruptIndexException: Problem reading index from MockDirectoryWrapper(ByteBuffersDirectory@ebddcb6 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@2bd239b2) (resource=MockDir\
ectoryWrapper(ByteBuffersDirectory@ebddcb6 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@2bd239b2))
   [junit4]   2>                at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:142)
   [junit4]   2>                at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:83)
   [junit4]   2>                at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:171)
   [junit4]   2>                at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:213)
   [junit4]   2>                at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:568)
   [junit4]   2>                at org.apache.lucene.index.IndexWriter.lambda$getReader$1(IndexWriter.java:614)
   [junit4]   2>                at org.apache.lucene.index.IndexWriter$2.onMergeComplete(IndexWriter.java:3461)
   [junit4]   2>                at org.apache.lucene.index.IndexWriter.commitMerge(IndexWriter.java:4078)
   [junit4]   2>                at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4697)
   [junit4]   2>                at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4257)
   [junit4]   2>                at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5808)
   [junit4]   2>                at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
   [junit4]   2>                at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
   [junit4]   2>        Caused by: java.io.FileNotFoundException: _1m.fnm in dir=ByteBuffersDirectory@ebddcb6 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@2bd239b2
   [junit4]   2>                at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:748)
   [junit4]   2>                at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
   [junit4]   2>                at org.apache.lucene.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1044)
   [junit4]   2>                at org.apache.lucene.codecs.lucene60.Lucene60FieldInfosFormat.read(Lucene60FieldInfosFormat.java:113)
   [junit4]   2>                at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:109)
   [junit4]   2>                ... 12 more

This one does reproduce with this PR (but not on mainline).

BTW, we need to fix mainline's Reproduce with: ... line to show gradle not ant?

This one also repros, only with this PR, but is maybe a test issue? Not sure:

   [junit4] Started J0 PID(1331595@localhost).
   [junit4] Suite: org.apache.lucene.index.TestIndexWriter
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriter -Dtests.method=testSoftAndHardLiveDocs -Dtests.seed=8ECCF6879557EA9E -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th-TH -Dtests.timezone=America/Shiprock -Dtests\
.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.30s | TestIndexWriter.testSoftAndHardLiveDocs <<<
   [junit4]    > Throwable #1: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([8ECCF6879557EA9E:5BCA4864AD808EEB]:0)
   [junit4]    >        at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:257)
   [junit4]    >        at org.apache.lucene.index.IndexReader.incRef(IndexReader.java:184)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.lambda$getReader$2(IndexWriter.java:653)
   [junit4]    >        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:105)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:642)
   [junit4]    >        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:103)
   [junit4]    >        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:79)
   [junit4]    >        at org.apache.lucene.index.TestIndexWriter.assertHardLiveDocs(TestIndexWriter.java:3638)
   [junit4]    >        at org.apache.lucene.index.TestIndexWriter.testSoftAndHardLiveDocs(TestIndexWriter.java:3624)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene86): {id=PostingsFormat(name=LuceneFixedGap)}, docValues:{soft_delete=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=362, maxMBSortInHeap=5.242399601460019, sim=Asserting(RandomSi\
milarity(queryNorm=true): {}), locale=th-TH, timezone=America/Shiprock
   [junit4]   2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 (64-bit)/cpus=128,threads=1,free=455736688,total=536870912
   [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriter]
   [junit4] Completed [1/1 (1!)] in 0.49s, 1 test, 1 error <<< FAILURES!
   [junit4]
   [junit4]
   [junit4] Tests with failures [seed: 8ECCF6879557EA9E]:
   [junit4]   - org.apache.lucene.index.TestIndexWriter.testSoftAndHardLiveDocs

@mikemccand
Copy link
Member

Ugh, also these test failures that I put onto the wrong PR: #1743 (comment)

@msfroh
Copy link
Contributor

msfroh commented Aug 13, 2020

I went through the code yesterday to try to understand what's going on.

I'm not familiar with the pooled reader stuff, but it seems reasonable to me. I trust the tests (especially w/ added test coverage) far more than I trust my knowledge of that part of the code.

@s1monw
Copy link
Member Author

s1monw commented Aug 14, 2020

I pushed a new commit and fixed ant test -Dtestcase=TestIndexWriter -Dtests.method=testRandomOperations -Dtests.seed=432EB011B0898067 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=pt-ST -Dtests.timezone=US/Mountain -Dtests.asserts=true -Dtest\ s.file.encoding=UTF-8

yet I can't reproduce this one: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtests.method=testSoftAndHardLiveDocs -Dtests.seed=8ECCF6879557EA9E -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th-TH -Dtests.timezone=America/Shiprock -Dtests\ .asserts=true -Dtests.file.encoding=UTF-8 does it reproduce for you on this branch with the latest commit?

@s1monw
Copy link
Member Author

s1monw commented Aug 14, 2020

@mikemccand nevermind I think I found the issue. I think it's ready now.

@mikemccand
Copy link
Member

Another failure that reproduces only on the PR:

[junit4:pickseed] Seed property 'tests.seed' already defined: CADC6D7945159855
   [junit4] <JUnit4> says jolly good day! Master seed: CADC6D7945159855
   [junit4] Executing 1 suite with 1 JVM.
   [junit4]
   [junit4] Started J0 PID(1047056@localhost).
   [junit4] Suite: org.apache.lucene.spatial.prefix.NumberRangeFacetsTest
   [junit4] OK      0.37s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:428852A3EBE9F5AD]}
   [junit4] OK      0.07s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:F6DEEE5FDF2B3E81]}
   [junit4] OK      0.02s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:783778838EEF764A]}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=NumberRangeFacetsTest -Dtests.method=test -Dtests.seed=CADC6D7945159855 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=lu-CD -Dtests.\
timezone=America/Iqaluit -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.10s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:49D9D366E2112D63]} <<<
   [junit4]    > Throwable #1: java.nio.file.NoSuchFileException: _3.fdx
   [junit4]    >        at org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148)
   [junit4]    >        at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607)
   [junit4]    >        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
   [junit4]    >        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696)
   [junit4]    >        at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690)
   [junit4]    >        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:589)
   [junit4]    >        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:620)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.decRefDeleter(IndexWriter.java:5354)
   [junit4]    >        at org.apache.lucene.index.StandardDirectoryReader.lambda$doClose$1(StandardDirectoryReader.java:370)
   [junit4]    >        at org.apache.lucene.index.StandardDirectoryReader.doClose(StandardDirectoryReader.java:384)
   [junit4]    >        at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:244)
   [junit4]    >        at org.apache.lucene.index.IndexReader.close(IndexReader.java:385)
   [junit4]    >        at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
   [junit4]    >        at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
   [junit4]    >        at org.apache.lucene.spatial.SpatialTestCase.commit(SpatialTestCase.java:97)
   [junit4]    >        at org.apache.lucene.spatial.prefix.NumberRangeFacetsTest.test(NumberRangeFacetsTest.java:92)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)Throwable #2: java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 2 open files: {_5.cfs=1, _4.cfs=1}
   [junit4]    >        at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:812)
   [junit4]    >        at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
   [junit4]    >        at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
   [junit4]    >        at org.apache.lucene.spatial.SpatialTestCase.tearDown(SpatialTestCase.java:72)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]    > Caused by: java.lang.RuntimeException: unclosed IndexInput: _4.cfs
   [junit4]    >        at org.apache.lucene.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:730)
   [junit4]    >        at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:773)
   [junit4]    >        at org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.<init>(Lucene50CompoundReader.java:77)
   [junit4]    >        at org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.getCompoundReader(Lucene50CompoundFormat.java:71)
   [junit4]    >        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:101)
   [junit4]    >        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:83)
   [junit4]    >        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:171)
   [junit4]    >        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:213)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:569)
   [junit4]    >        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:105)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:625)
   [junit4]    >        at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:290)
   [junit4]    >        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:265)
   [junit4]    >        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:255)
   [junit4]    >        at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
   [junit4]    >        at org.apache.lucene.spatial.SpatialTestCase.commit(SpatialTestCase.java:95)
   [junit4]    >        at org.apache.lucene.spatial.prefix.NumberRangeFacetsTest.test(NumberRangeFacetsTest.java:92)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        ... 28 more
   [junit4] OK      0.05s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:51856CB4319EDD8]}
   [junit4] OK      0.04s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:94E285943EA5A0D8]}
   [junit4] OK      0.04s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:AA3CE1122C95B0DE]}
   [junit4] OK      0.06s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:368F7B680F812470]}
   [junit4] OK      0.07s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:4239E06784A332A]}
   [junit4] OK      0.03s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:D3A8C8BC1C1D0478]}
   [junit4] OK      0.06s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:26E920E7CEA126A0]}
   [junit4] OK      0.07s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:AD4EEC224A3C8769]}
   [junit4] OK      0.04s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:CA7D799BAF414598]}
   [junit4] OK      0.05s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:A5365E8433D4C4E8]}
   [junit4] OK      0.05s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:F8A8511CE1A582B1]}
   [junit4] OK      0.01s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:9B1ABC18F3271780]}
   [junit4] OK      0.03s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:1DE11BD1382F7CE9]}
   [junit4] OK      0.01s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:A9AEC9CAF41A0E9B]}
   [junit4] OK      0.07s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:B6DAB6C4888FE199]}
   [junit4] OK      0.05s | NumberRangeFacetsTest.test {seed=[CADC6D7945159855:BC4AC5282267A734]}
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene86): {dateRange=PostingsFormat(name=MockRandom), id=Lucene84}, docValues:{}, maxPointsInLeafNode=1082, maxMBSortInHeap=5.908257958947349, si\
m=Asserting(RandomSimilarity(queryNorm=true): {id=ClassicSimilarity}), locale=lu-CD, timezone=America/Iqaluit
   [junit4]   2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 (64-bit)/cpus=128,threads=1,free=464866264,total=536870912
   [junit4]   2> NOTE: All tests run in this JVM: [NumberRangeFacetsTest]
   [junit4] Completed [1/1 (1!)] in 1.48s, 20 tests, 1 error <<< FAILURES!
   [junit4]
   [junit4]
   [junit4] Tests with failures [seed: CADC6D7945159855]:
   [junit4]   - org.apache.lucene.spatial.prefix.NumberRangeFacetsTest.test {seed=[CADC6D7945159855:49D9D366E2112D63]}
   [junit4]
   [junit4]
   [junit4] JVM J0:     0.34 ..     2.23 =     1.89s
   [junit4] Execution time total: 2.23 sec.
   [junit4] Tests summary: 1 suite, 20 tests, 1 error

@mikemccand
Copy link
Member

Those were the only two failures found after 1033 test iterations!

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks great to me! But looks like there are a couple very rare test failures...

@@ -404,7 +404,7 @@ private PendingDeletes newPendingDeletes(SegmentReader reader, SegmentCommitInfo
private boolean noDups() {
Set<String> seen = new HashSet<>();
for(SegmentCommitInfo info : readerMap.keySet()) {
assert !seen.contains(info.info.name);
assert !seen.contains(info.info.name) : "seen twice: " + info.info.name ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love seeing diffs like this one, adding a String message to an otherwise cryptic assert! It makes me realize you must have had a hellacious debugging session!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

many fun issues in this PR to be honest. IW is tricky as hell in some places like we are incRefing files in StandardDirectoryReader but not in IW for NRT readers is mindblowing :D

@s1monw
Copy link
Member Author

s1monw commented Aug 18, 2020

@mikemccand I pushed a fix for the failures. I will add a dedicated test to make sure it's covered too

…ed away before we can reopen the reader on it
Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks Simon. I left two small comments.

@mikemccand
Copy link
Member

I will restart beasting on the latest PR now. Thanks @s1monw!

@s1monw
Copy link
Member Author

s1monw commented Aug 21, 2020

@mikemccand I think we are ready here WDYT?

@mikemccand
Copy link
Member

@mikemccand I think we are ready here WDYT?

+1! Thanks @s1monw!

I will continue beasting. The failures are very rare now ... I ran 196 iters of slow + nightly Lucene core + modules tests, and hit only ~4 interesting failures.

E.g. this failing seed repros on the PR but not on mainline:

org.apache.lucene.index.TestIndexWriterOnVMError > testUnknownError FAILED
    org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MockIndexInputWrapper((clone of) ByteBuffersIndexInput (file=pending_segments_2, buffers\
=258 bytes, block size: 1, blocks: 1, position: 0))))
        at __randomizedtesting.SeedInfo.seed([587A104EFE0C57E1:B32CCFCEFC8BC1D1]:0)
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:300)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:521)
        at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301)
        at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:836)
        at org.apache.lucene.index.TestIndexWriterOnVMError.doTest(TestIndexWriterOnVMError.java:89)
        at org.apache.lucene.index.TestIndexWriterOnVMError.testUnknownError(TestIndexWriterOnVMError.java:251)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
        at java.base/java.lang.Thread.run(Thread.java:834)

        Caused by:
        java.io.FileNotFoundException: _0.si in dir=ByteBuffersDirectory@1bae3fe1 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@38275f41
            at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:748)
            at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
            at org.apache.lucene.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1044)
            at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:91)
            at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:364)
            at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:298)
            ... 41 more
        ....

  2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterOnVMError -Dtests.method=testUnknownError -Dtests.seed=587A104EFE0C57E1 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.linedocsfile=/l/sim\
on/lucene/test-framework/src/resources/org/apache/lucene/util/2000mb.txt.gz -Dtests.locale=zh-CN -Dtests.timezone=SystemV/MST7MDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8
  2> NOTE: leaving temporary files on disk at: /l/simon/lucene/core/build/tmp/tests-tmp/lucene.index.TestIndexWriterOnVMError_587A104EFE0C57E1-003
  2> NOTE: test params are: codec=Asserting(Lucene86): {text_payloads=BlockTreeOrds(blocksize=128), text_vectors=PostingsFormat(name=Asserting), text1=PostingsFormat(name=Asserting), id=BlockTreeOrds(blocksize=128)}, docValu\
es:{dv3=DocValuesFormat(name=Lucene80), dv2=DocValuesFormat(name=Asserting), dv5=DocValuesFormat(name=Lucene80), dv=DocValuesFormat(name=Asserting), dv4=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=696, maxMBSortInH\
eap=6.040673619645681, sim=Asserting(RandomSimilarity(queryNorm=false): {text_payloads=IB SPL-DZ(0.3), text_vectors=DFR I(ne)L3(800.0), text1=org.apache.lucene.search.similarities.BooleanSimilarity@6f4329a1}), locale=zh-CN, \
timezone=SystemV/MST7MDT
  2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 (64-bit)/cpus=128,threads=1,free=241525696,total=268435456
  2> NOTE: All tests run in this JVM: [TestIndexWriterOnVMError]

@mikemccand
Copy link
Member

Hmm, not that I am using the larger (2000mb) line docs file, and that might be needed to provoke failure.

@s1monw
Copy link
Member Author

s1monw commented Aug 24, 2020

@mikemccand this test reproduces on master too I opened https://issues.apache.org/jira/browse/LUCENE-9477

@mikemccand
Copy link
Member

@mikemccand this test reproduces on master too I opened https://issues.apache.org/jira/browse/LUCENE-9477

Great, thanks @s1monw! Given that beasting is now uncovering pre-existing issues I think we should push this PR! Thank you!

@s1monw s1monw merged commit 8294e1a into master Aug 24, 2020
s1monw added a commit that referenced this pull request Aug 24, 2020
Add IndexWriter merge-on-refresh feature to selectively merge 
small segments on getReader, subject to a configurable timeout, 
to improve search performance by reducing the number of small 
segments for searching.

Co-authored-by: Mike McCandless <mikemccand@apache.org>
gus-asf pushed a commit to gus-asf/lucene-solr that referenced this pull request Sep 4, 2020
Add IndexWriter merge-on-refresh feature to selectively merge 
small segments on getReader, subject to a configurable timeout, 
to improve search performance by reducing the number of small 
segments for searching.

Co-authored-by: Mike McCandless <mikemccand@apache.org>
@mocobeta mocobeta deleted the jira/lucene-8962 branch October 20, 2020 01:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants