Reproducible test failure with Terms#intersect on the default codec #12957
Comments
OK I think the issue here may be that And the codecs (default and Direct) clearly don't do a good job throwing a clear exception when that is violated :) In addition to the default Codec,
I'll try to fix |
I wondered about that, but the automaton is |
Oh I see, I created binary automata, but the API implicitly treats automata as UTF32 automata, so you need to tell it explicitly that it's a binary automaton. And something like that should fix the problem?
diff --git a/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java b/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
index a555ce40001..f899b331b92 100644
--- a/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
+++ b/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
@@ -2318,7 +2318,7 @@ public final class CheckIndex implements Closeable {
startTerm = new BytesRef();
checkTermsIntersect(terms, automaton, startTerm);
- automaton = Automata.makeAnyBinary();
+ automaton = Automata.makeNonEmptyBinary();
startTerm = new BytesRef(new byte[] {'l'});
checkTermsIntersect(terms, automaton, startTerm);
@@ -2369,8 +2369,8 @@ public final class CheckIndex implements Closeable {
throws IOException {
TermsEnum allTerms = terms.iterator();
automaton = Operations.determinize(automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
- CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton);
- ByteRunAutomaton runAutomaton = new ByteRunAutomaton(automaton);
+ CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton, false, true, true);
+ ByteRunAutomaton runAutomaton = new ByteRunAutomaton(automaton, true);
TermsEnum filteredTerms = terms.intersect(compiledAutomaton, startTerm);
BytesRef term;
if (startTerm != null) {
(I had to change the automaton so that it's still considered of type "normal" and not "all".)
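To make the UTF32-versus-binary distinction above concrete, here is a minimal, stdlib-only sketch. It is not Lucene code, and the class and method names (`LabelInterpretation`, `asUtf32Label`, `asBinaryLabel`) are hypothetical. It only shows that the same transition label in the range 0..255 corresponds to different term bytes depending on whether it is read as a Unicode code point (and then encoded as UTF-8) or as a raw byte, which is presumably why the constructors need to be told which interpretation applies.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; none of these names come from Lucene. */
public class LabelInterpretation {

  /** Bytes a term would contain if the label is read as a Unicode code point and encoded as UTF-8. */
  static byte[] asUtf32Label(int label) {
    checkByteRange(label);
    return new String(Character.toChars(label)).getBytes(StandardCharsets.UTF_8);
  }

  /** Bytes a term would contain if the label is read as a raw byte value. */
  static byte[] asBinaryLabel(int label) {
    checkByteRange(label);
    return new byte[] {(byte) label};
  }

  /** This sketch only considers labels that fit in one byte. */
  private static void checkByteRange(int label) {
    if (label < 0 || label > 0xFF) {
      throw new IllegalArgumentException("label out of byte range: " + label);
    }
  }

  /** Formats bytes as space-separated two-digit hex values. */
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02x", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // 'l' is ASCII, so both readings agree; 0xE9 and 0xFF are where they diverge.
    for (int label : new int[] {'l', 0xE9, 0xFF}) {
      System.out.println(
          "label 0x" + Integer.toHexString(label)
              + " as UTF32: [" + hex(asUtf32Label(label)) + "]"
              + " as binary: [" + hex(asBinaryLabel(label)) + "]");
    }
  }
}
```

For ASCII labels such as `'l'` the two readings coincide, so a mismatch like this would only show up for terms containing bytes at or above 0x80.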
Oh, you're right! I missed that
Oh, you are also right! Specifically |
OK the |
I just pushed the change, thanks @mikemccand for putting me on the right track. |
Not sure I did so much "putting on the right path" :) More like "getting randomly confused around the right area" thus inspiring @jpountz to look more closely :) |
Description
The new CheckIndex checks are causing some test failures with the default codec, which are reproducible and look like real bugs? I started looking but I'm not familiar enough with BlockTree to understand what it's doing wrong.
https://jenkins.thetaphi.de/job/Lucene-main-Linux/45856/consoleFull
Version and environment details
No response