Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducible test failure with Terms#intersect on the default codec #12957

Closed
jpountz opened this issue Dec 20, 2023 · 8 comments
Closed

Reproducible test failure with Terms#intersect on the default codec #12957

jpountz opened this issue Dec 20, 2023 · 8 comments
Labels

Comments

@jpountz
Copy link
Contributor

jpountz commented Dec 20, 2023

Description

The new CheckIndex checks are causing some test failures with the default codec, which are reproducible and look like real bugs? I started looking but I'm not familiar enough with BlockTree to understand what it's doing wrong.

https://jenkins.thetaphi.de/job/Lucene-main-Linux/45856/consoleFull

org.apache.lucene.index.TestTerms > testTermMinMaxRandom FAILED
    java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([CBF65306049672F4:8785DC72680AA991]:0)
        at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.getState(IntersectTermsEnum.java:245)
        at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.seekToStartTerm(IntersectTermsEnum.java:288)
        at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.<init>(IntersectTermsEnum.java:126)
        at org.apache.lucene.codecs.lucene90.blocktree.FieldReader.intersect(FieldReader.java:223)
        at org.apache.lucene.index.CheckIndex.checkTermsIntersect(CheckIndex.java:2374)
        at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:2327)
        at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:2529)
        at org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1067)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:783)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:550)
        at org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:340)
        at org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
        at org.apache.lucene.index.TestTerms.testTermMinMaxRandom(TestTerms.java:85)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:578)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
        at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
        at java.base/java.lang.Thread.run(Thread.java:1589)

[...]


Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.index.TestTerms.testTermMinMaxRandom" -Ptests.jvms=6 "-Ptests.jvmargs=-XX:+UseCompressedOops -XX:+UseSerialGC" -Ptests.seed=CBF65306049672F4 -Ptests.multiplier=3 -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1 -Ptests.vectorsize=256

Version and environment details

No response

@mikemccand
Copy link
Member

OK I think the issue here may be that Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.

And the codecs (default and Direct) clearly don't do a good job throwing a clear exception when that is violated :)

In addition to the default Codec, DirectPostingsFormat is also angry, using this repro:

./gradlew :lucene:core:test --tests "org.apache.lucene.index.TestTerms.testTermMinMaxRandom" -Ptests.jvms=4 -Ptests.jvmargs= -Ptests.seed=C8D1EBB5035DA9F -Ptests.multiplier=2 -Ptests.badapples=false -Ptests.gui=true -Ptests.file.encoding=US-ASCII -Ptests.vectorsize=128

@mikemccand
Copy link
Member

I'll try to fix CheckIndex so that it only uses startTerm that is accepted by the automaton.

@jpountz
Copy link
Contributor Author

jpountz commented Dec 20, 2023

Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.

I wondered about that, but the automaton is Automata.makeAnyBinary(), shouldn't it accept any term?

@jpountz
Copy link
Contributor Author

jpountz commented Dec 20, 2023

Oh I see, I created binary automata, but the API implicitly treats automata as UTF32 automata, so you need to tell it explicitly that it's a binary automaton. And something like that should fix the problem?

diff --git a/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java b/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
index a555ce40001..f899b331b92 100644
--- a/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
+++ b/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
@@ -2318,7 +2318,7 @@ public final class CheckIndex implements Closeable {
         startTerm = new BytesRef();
         checkTermsIntersect(terms, automaton, startTerm);
 
-        automaton = Automata.makeAnyBinary();
+        automaton = Automata.makeNonEmptyBinary();
         startTerm = new BytesRef(new byte[] {'l'});
         checkTermsIntersect(terms, automaton, startTerm);
 
@@ -2369,8 +2369,8 @@ public final class CheckIndex implements Closeable {
       throws IOException {
     TermsEnum allTerms = terms.iterator();
     automaton = Operations.determinize(automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
-    CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton);
-    ByteRunAutomaton runAutomaton = new ByteRunAutomaton(automaton);
+    CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton, false, true, true);
+    ByteRunAutomaton runAutomaton = new ByteRunAutomaton(automaton, true);
     TermsEnum filteredTerms = terms.intersect(compiledAutomaton, startTerm);
     BytesRef term;
     if (startTerm != null) {

(I had to change the automaton so that it's still considered of type "normal" and not "all")

@mikemccand
Copy link
Member

Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.

I wondered about that, but the automaton is Automata.makeAnyBinary(), shouldn't it accept any term?

Oh, you're right! I missed that Automata.makeAnyBinary() there!

Oh I see, I created binary automata, but the API implicitly treats automata as UTF32 automata, so you need to tell it explicitly that it's a binary automaton. And something like that should fix the problem?

Oh, you are also right! Specifically CompiledAutomaton assumes it's UTF32 and needs conversion to UTF8, unless you pass isBinar=true. OK I like your fix! I'll confirm it fixes the DirectPostingsFormat failure too.

@mikemccand
Copy link
Member

OK the DirectPostingsFormat failure is also happy with this fix. +1 to merge. Thanks @jpountz!

@jpountz jpountz closed this as completed Dec 20, 2023
@jpountz
Copy link
Contributor Author

jpountz commented Dec 20, 2023

I just pushed the change, thanks @mikemccand for putting me on the right track.

@mikemccand
Copy link
Member

Not sure I did so much "putting on the right path" :) More like "getting randomly confused around the right area" thus inspiring @jpountz to look more closely :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants