StringsToAutomaton#build to take List as parameter instead of Collection #12427

shubhamvishu · 2023-07-09T20:26:28Z

Description

This addresses issue DaciukMihovAutomatonBuilder#build should probably take a List instead of a Collection #12319
StringsToAutomaton previously called DaciukMihovAutomatonBuilder(renamed in this PR) should take List as input instead of Collection as it expects sorted input.

gsmiller · 2023-07-10T21:30:13Z

Thanks @shubhamvishu for working on this! As I look at the PR, I'm wondering if accepting a List is really the proper thing to do here. If users already have a sorted Collection (like SortedSet), are we forcing them to make unnecessary copies of their data to adhere to our API?

I wonder if the ctor for TermInSetQuery might serve as a reasonable example of handling sortedness? TermInSetQuery accepts a Collection but tries to see if it's already sorted (and de-duped). If it is, it uses it as-is. If not, it sorts and de-dupes on behalf of the user. The automaton building logic right now only has a simple assert down in the internals, which could be disabled depending on the runtime environment, resulting in tricky debugging if the user didn't handle sorting their data properly.

Since StringsToAutomaton is an internal class, maybe one approach would be to leave it untouched, letting it expect pre-sorted data and having a simple assert statement as it does today to check that (for easier debugging), but add sorting logic to Automata#makeStringUnion? What if that method accepted a Collection as it does today but had similar logic as the TermInSetQuery ctor related to sorting?

Apologies for leading us in the wrong direction in the issue I opened. I think a List is a bit more semantically appropriate than a Collection, but I'm not sure it's best from a pragmatic standpoint? What do you think?

shubhamvishu · 2023-07-11T06:45:30Z

@gsmiller I totally agree its not helpful if the input/data Collection is sorted since its a unnecessary overhead to convert to a List here. As per TermInSetQuery ctor its checking if the collection is of SortedSet type and uses natural ordering. But what if the user is explicitly passing a List (or any Collection for that matter) which is already sorted but is not SortedSet?
Could we expose another #build function as build(Collection<BytesRef> input, boolean asBinary, boolean isSorted) or for better maybe makeStringUnion(Collection<BytesRef> utf8Strings, boolean isSorted) (similar to the one in tests in this PR) to completely bypass the sorting/checks if incase isSorted == true else we could have the same check as in TermInSetquery ctor and sort the terms incase they are not before moving on in Automata#makeStringUnion as you mentioned? Let me know your thoughts?

gsmiller · 2023-07-11T21:43:43Z

Yeah, good points/questions! I'd be curious how much overhead it would actually add to sort the input when it's already sorted?

But to take a step back for a moment, we also have automata building methods that accept a BytesRefIterator as well, which also must be sorted. This is a situation where we really cannot sort on behalf of the caller, so it might be a bit confusing/trappy to sort some flavors of this method but not others? Maybe it's best to leave these methods as they are?

If we want to make these functions a bit more user-friendly, we could look at changing the assert on line 276 of StringsToAutomaton to throw an explicit IllegalArgumentException so that we don't silently built a corrupt automaton on unordered input (with asserts disabled). There would add overhead since we have to now keep track of the previous term all the time, but maybe it's worth benchmarking and considering this change? I do think it's best to explicitly let the user know they passed invalid input in a case like this, so it would be a nice change if it didn't introduce a significant performance drag.

gautamworah96 · 2023-07-11T23:23:35Z

I was looking at this problem too and the change in test classes to meld arbitrary collections to Lists irked me as well.

+1 to throw an IllegalArgumentException in #add instead of having the user debug the error later on.

On a related note, since we only traverse the items in their given order, maybe we can relax StringsToAutomaton#build(Collection<BytesRef>, boolean) to StringsToAutomaton#build(Iterable<BytesRef> input, boolean). This change will also make it more consistent with the build(BytesRefIterator input, boolean asBinary) method.

shubhamvishu · 2023-07-13T03:23:21Z

This is a situation where we really cannot sort on behalf of the caller, so it might be a bit confusing/trappy to sort some flavors of this method but not others? Maybe it's best to leave these methods as they are?

Agreed @gsmiller! Yes it does seem maybe its better leave this as is.

we could look at changing the assert on line 276 of StringsToAutomaton to throw an explicit IllegalArgumentException so that we don't silently built a corrupt automaton on unordered input (with asserts disabled). There would add overhead since we have to now keep track of the previous term all the time, but maybe it's worth benchmarking and considering this change?

I like the idea to throw IAE if wrong input is provided. I think this would only affect the cases with assert disabled? otherwise with asserts enabled we anyways always keep track of previous term.

maybe we can relax StringsToAutomaton#build(Collection, boolean) to StringsToAutomaton#build(Iterable input, boolean). This change will also make it more consistent with the build(BytesRefIterator input, boolean asBinary) method.

Good point @gautamworah96. Iterable seems like a better way instead of Collection and would indeed be consistent with other method.

shubhamvishu · 2023-07-13T03:46:21Z

On the same note, since both the methods expects Iterable or Iterators why do we even need 2 separate methods here which are doing exactly the same thing i.e. iterating over the ByteRef's and adding to automaton.

It would have been much better if BytesRefIterator implemented the Iterable, Iterator interfaces in which case we could just have one method StringsToAutomaton#build(Iterable) which takes an Iterable. I don't see why do we even have a separate BytesRefIterator interface and not just using Iterator instead(maybe because its legacy code?) or is it that I'm missing something important here?

gsmiller · 2023-07-20T21:02:50Z

I like the idea to throw IAE if wrong input is provided. I think this would only affect the cases with assert disabled? otherwise with asserts enabled we anyways always keep track of previous term.

That's right. I imagine a lot of users disable asserts when running in a production environment for performance reasons, so the only concern here is that we're adding some new work for them to do this validation. I still think the explicit validation is better in this situation, but the counter-argument would be a user that knows for sure they're always properly providing sorted input and don't want the overhead of the validation.

As for Iterable vs. Collection, I agree that Iterable is more appropriate for the API. Whether-or-not it's worth making that change though in a back-compat way is maybe a different question?

I don't see why do we even have a separate BytesRefIterator interface and not just using Iterator instead(maybe because its legacy code?) or is it that I'm missing something important here?

I think it's so that next can throw IOException? You can't do that with Iterator<BytesRef>.

shubhamvishu · 2023-08-10T15:56:56Z

I still think the explicit validation is better in this situation, but the counter-argument would be a user that knows for sure they're always properly providing sorted input and don't want the overhead of the validation.

Agreed! As we discussed above one way could be to sort based on user input that if the collection is sorted or not but since that will be for only one flavour it might be confusing. I'll try to benchmark this change to see if its causing any regression or not.

I think it's so that next can throw IOException? You can't do that with Iterator.

Ahh I see. That makes sense. Thanks @gsmiller !

shubhamvishu · 2023-10-27T14:05:47Z

@gsmiller I finally got a chance to run the benchmarks for this change. Below are the results(looks all good to me). Let me know what do you think? Thanks!

                  TaskQPS baseline  StdDevQPS my_modified_version      StdDev      Pct diff             p-value
                      AndHighMed       96.52      (6.5%)       95.49      (6.3%)   -1.1% ( -12% -   12%) 0.597
                     LowSpanNear      133.61      (4.7%)      132.45      (4.0%)   -0.9% (  -9% -    8%) 0.528
                       OrHighMed       67.95      (3.2%)       67.37      (4.6%)   -0.9% (  -8% -    7%) 0.492
                     MedSpanNear        5.79      (5.2%)        5.74      (4.2%)   -0.8% (  -9% -    8%) 0.573
                    OrNotHighLow      322.30      (2.3%)      319.70      (2.0%)   -0.8% (  -4% -    3%) 0.233
            BrowseDateSSDVFacets        0.90      (8.2%)        0.89      (9.4%)   -0.8% ( -17% -   18%) 0.773
                      AndHighLow      355.80      (4.2%)      352.99      (4.1%)   -0.8% (  -8% -    7%) 0.550
            MedTermDayTaxoFacets        5.85      (6.2%)        5.81      (5.7%)   -0.7% ( -11% -   11%) 0.713
                    OrNotHighMed      222.70      (3.6%)      221.51      (4.0%)   -0.5% (  -7% -    7%) 0.657
                    OrHighNotMed      194.41      (3.6%)      193.40      (3.4%)   -0.5% (  -7% -    6%) 0.641
                       MedPhrase       44.71      (3.1%)       44.50      (3.4%)   -0.5% (  -6% -    6%) 0.639
                   OrNotHighHigh      179.60      (3.7%)      178.85      (3.8%)   -0.4% (  -7% -    7%) 0.726
                    HighSpanNear        8.33      (4.6%)        8.30      (4.1%)   -0.4% (  -8% -    8%) 0.765
                    OrHighNotLow      331.53      (3.8%)      330.15      (3.7%)   -0.4% (  -7% -    7%) 0.728
                   OrHighNotHigh      248.18      (3.5%)      247.20      (3.4%)   -0.4% (  -7% -    6%) 0.718
             MedIntervalsOrdered        4.66      (4.6%)        4.64      (4.3%)   -0.4% (  -8% -    8%) 0.780
          OrHighMedDayTaxoFacets        4.39      (3.9%)        4.38      (4.2%)   -0.3% (  -8% -    8%) 0.820
                       OrHighLow      569.67      (2.6%)      568.04      (2.6%)   -0.3% (  -5% -    5%) 0.733
             LowIntervalsOrdered        2.64      (4.6%)        2.64      (4.6%)   -0.2% (  -8% -    9%) 0.899
            HighTermTitleBDVSort        5.20      (4.0%)        5.19      (4.1%)   -0.2% (  -7% -    8%) 0.895
                      OrHighHigh       35.47      (4.4%)       35.42      (5.8%)   -0.1% (  -9% -   10%) 0.929
       BrowseDayOfYearTaxoFacets        3.80     (15.9%)        3.80     (16.0%)   -0.1% ( -27% -   37%) 0.984
            BrowseDateTaxoFacets        3.78     (15.5%)        3.78     (15.5%)   -0.0% ( -26% -   36%) 0.995
                     AndHighHigh       16.35      (3.6%)       16.35      (7.5%)    0.0% ( -10% -   11%) 0.996
         AndHighMedDayTaxoFacets       49.40      (1.6%)       49.41      (1.7%)    0.0% (  -3% -    3%) 0.975
                       LowPhrase      254.05      (3.6%)      254.16      (4.0%)    0.0% (  -7% -    7%) 0.972
                      HighPhrase       40.53      (3.1%)       40.56      (3.6%)    0.1% (  -6% -    6%) 0.939
                        PKLookup      131.32      (2.3%)      131.44      (2.0%)    0.1% (  -4% -    4%) 0.891
     BrowseRandomLabelTaxoFacets        3.28     (15.0%)        3.28     (14.8%)    0.1% ( -25% -   35%) 0.983
           BrowseMonthTaxoFacets        4.13     (34.7%)        4.13     (34.7%)    0.1% ( -51% -  106%) 0.991
                          Fuzzy1       65.72      (1.1%)       65.81      (1.2%)    0.1% (  -2% -    2%) 0.714
                         LowTerm      513.23      (2.9%)      513.99      (2.9%)    0.1% (  -5% -    6%) 0.870
            HighIntervalsOrdered        1.62      (5.3%)        1.62      (5.0%)    0.2% (  -9% -   11%) 0.909
       BrowseDayOfYearSSDVFacets        3.75     (12.4%)        3.76      (8.6%)    0.2% ( -18% -   24%) 0.956
                         MedTerm      382.67      (3.7%)      383.49      (3.9%)    0.2% (  -7% -    8%) 0.857
                 LowSloppyPhrase       12.31      (2.4%)       12.34      (2.1%)    0.2% (  -4% -    4%) 0.754
                         Respell       36.92      (1.6%)       37.02      (1.6%)    0.3% (  -2% -    3%) 0.617
                 MedSloppyPhrase       22.87      (3.2%)       22.94      (3.0%)    0.3% (  -5% -    6%) 0.751
                        HighTerm      251.80      (4.3%)      252.92      (4.6%)    0.4% (  -8% -    9%) 0.752
        AndHighHighDayTaxoFacets        7.39      (2.7%)        7.43      (2.6%)    0.5% (  -4% -    5%) 0.571
                          Fuzzy2       39.80      (1.4%)       40.02      (1.8%)    0.5% (  -2% -    3%) 0.288
                HighSloppyPhrase        1.88      (4.3%)        1.89      (4.2%)    0.6% (  -7% -    9%) 0.657
                        Wildcard       39.56      (3.9%)       39.82      (2.7%)    0.7% (  -5% -    7%) 0.537
                         Prefix3       77.73      (6.4%)       78.39      (5.8%)    0.8% ( -10% -   13%) 0.663
               HighTermMonthSort     2196.21      (3.5%)     2217.20      (3.5%)    1.0% (  -5% -    8%) 0.389
           HighTermDayOfYearSort      200.38      (3.6%)      203.21      (3.2%)    1.4% (  -5% -    8%) 0.186
           BrowseMonthSSDVFacets        4.46     (13.1%)        4.53      (9.4%)    1.5% ( -18% -   27%) 0.687
     BrowseRandomLabelSSDVFacets        2.68      (7.7%)        2.73      (4.5%)    1.6% (  -9% -   14%) 0.421
                          IntNRQ       17.08      (4.8%)       17.41      (7.4%)    1.9% (  -9% -   14%) 0.335
                      TermDTSort      102.14      (2.9%)      104.31      (5.5%)    2.1% (  -6% -   10%) 0.127
               HighTermTitleSort       38.07     (11.1%)       39.53      (9.0%)    3.8% ( -14% -   26%) 0.228

mikemccand · 2023-10-29T16:14:50Z

Below are the results(looks all good to me). Let me know what do you think? Thanks!

+1 -- looks like just noise to me.

gsmiller · 2023-10-30T19:45:09Z

+1 looks good to me as well. I like that this small change, 1) makes the API a little more general, allowing users to provide any Iterable instead of Collection, and 2) adds an explicit check for expected ordering of the iterable using IAE instead of relying on asserts being enabled. Thanks for following up!

…rted (#12427) Also update the API to accept more general Iterable<BytesRef> instead of Collection<BytesRef>

benwtrent · 2023-11-01T16:00:17Z

Something weird is happening. This commit is causing the following failure:

./gradlew test --tests TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource -Dtests.seed=9E4091CE5D400BA3

I am starting to investigate, but folks involved with this change might find the issue quicker.

  >     java.lang.NoSuchMethodError: 'org.apache.lucene.util.automaton.Automaton org.apache.lucene.util.automaton.Automata.makeStringUnion(java.util.Collection)'
   >         at org.apache.lucene.search.uhighlight.CharArrayMatcher.fromTerms(CharArrayMatcher.java:42)
   >         at org.apache.lucene.search.uhighlight.MemoryIndexOffsetStrategy.buildCombinedAutomaton(MemoryIndexOffsetStrategy.java:71)
   >         at org.apache.lucene.search.uhighlight.MemoryIndexOffsetStrategy.<init>(MemoryIndexOffsetStrategy.java:53)
   >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.getOffsetStrategy(UnifiedHighlighter.java:1238)
   >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.getFieldHighlighter(UnifiedHighlighter.java:1072)
   >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:880)
   >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:814)
   >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:792)
   >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlight(UnifiedHighlighter.java:725)
   >         at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource(TestUnifiedHighlighterTermIntervals.java:607)
   >         at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:580)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
   >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
   >         at java.base/java.lang.Thread.run(Thread.java:1583)
   >
   >     java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 7 open files: {_0_Lucene90_0.pos=1, _0_Lucene90_0.doc=1, _0_Lucene90_0.tim=1, _0.nvd=1, _0.fdx=1, _0.fdt=1, _0_Lucene90_0.tip=1}
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:876)
   >         at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.doAfter(TestUnifiedHighlighterTermIntervals.java:84)
   >         at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:580)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
   >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
   >         at java.base/java.lang.Thread.run(Thread.java:1583)
   >
   >         Caused by:
   >         java.lang.RuntimeException: unclosed IndexInput: _0.nvd
   >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783)
   >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:835)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.codecs.lucene90.Lucene90NormsProducer.<init>(Lucene90NormsProducer.java:84)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.codecs.lucene90.Lucene90NormsFormat.normsProducer(Lucene90NormsFormat.java:96)
   >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.codecs.asserting.AssertingNormsFormat.normsProducer(AssertingNormsFormat.java:46)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:108)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:180)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:222)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:539)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:601)
   >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter$4.getReader(IndexWriter.java:6510)
   >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.RandomIndexWriter.getReader(RandomIndexWriter.java:508)
   >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.RandomIndexWriter.getReader(RandomIndexWriter.java:426)
   >             at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource(TestUnifiedHighlighterTermIntervals.java:581)
   >             at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
   >             at java.base/java.lang.reflect.Method.invoke(Method.java:580)
   >             at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >             at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   >             at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   >             at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)

gsmiller · 2023-11-01T16:27:11Z

@benwtrent is this an issue of the code only being partially rebuilt? The signature for Automata#makeStringUnion was changed to accept a more general Iterable as opposed to a Collection in this change. Let me see if I can reproduce it locally as well.

UPDATE: It didn't repro for me locally. Have you tried a ./gradlew clean and then rerunning the test?

benwtrent · 2023-11-01T16:28:51Z

@gsmiller you are 100% correct 🤦

gsmiller · 2023-11-01T17:18:38Z

@benwtrent glad it was a simple fix. Sorry it created churn!

dweiss · 2023-11-01T19:19:14Z

See #12742 - it's a problem with our javac task settings. Running a clean is a workaround. I will provide a fix for this when I get a chance, it's not a critical issue (but it is a bug in how we currently set up task dependencies).

shubhamvishu · 2023-11-02T06:12:02Z

Nice!....I like it how we discovered a build issue bug(which was hidden for sometime now I think) and fixed it. Thanks for solving @dweiss!

…rted (apache#12427) Also update the API to accept more general Iterable<BytesRef> instead of Collection<BytesRef>

shubhamvishu mentioned this pull request Jul 9, 2023

DaciukMihovAutomatonBuilder#build should probably take a List instead of a Collection #12319

Closed

shubhamvishu force-pushed the daciukmihov-automaton branch from d820f8c to 27b08c4 Compare July 11, 2023 17:45

Replace assert with IAE in StringsToAutomaton#add if data is not sorted

a27a4d4

shubhamvishu force-pushed the daciukmihov-automaton branch from 27b08c4 to a27a4d4 Compare July 13, 2023 03:24

gsmiller approved these changes Oct 30, 2023

View reviewed changes

gsmiller merged commit 9cffd4b into apache:main Oct 30, 2023
4 checks passed

gsmiller pushed a commit that referenced this pull request Oct 30, 2023

Replace assert with IAE in StringsToAutomaton#build if data is not so…

5b26498

…rted (#12427) Also update the API to accept more general Iterable<BytesRef> instead of Collection<BytesRef>

gsmiller added this to the 9.9.0 milestone Oct 31, 2023

rmuir pushed a commit to rmuir/lucene that referenced this pull request Nov 2, 2023

Replace assert with IAE in StringsToAutomaton#build if data is not so…

cd5f8c7

…rted (apache#12427) Also update the API to accept more general Iterable<BytesRef> instead of Collection<BytesRef>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StringsToAutomaton#build to take List as parameter instead of Collection #12427

StringsToAutomaton#build to take List as parameter instead of Collection #12427

shubhamvishu commented Jul 9, 2023 •

edited

Loading

gsmiller commented Jul 10, 2023

shubhamvishu commented Jul 11, 2023

gsmiller commented Jul 11, 2023

gautamworah96 commented Jul 11, 2023

shubhamvishu commented Jul 13, 2023

shubhamvishu commented Jul 13, 2023

gsmiller commented Jul 20, 2023

shubhamvishu commented Aug 10, 2023

shubhamvishu commented Oct 27, 2023

mikemccand commented Oct 29, 2023

gsmiller commented Oct 30, 2023

benwtrent commented Nov 1, 2023

gsmiller commented Nov 1, 2023 •

edited

Loading

benwtrent commented Nov 1, 2023

gsmiller commented Nov 1, 2023

dweiss commented Nov 1, 2023 •

edited

Loading

shubhamvishu commented Nov 2, 2023

StringsToAutomaton#build to take List as parameter instead of Collection #12427

StringsToAutomaton#build to take List as parameter instead of Collection #12427

Conversation

shubhamvishu commented Jul 9, 2023 • edited Loading

Description

gsmiller commented Jul 10, 2023

shubhamvishu commented Jul 11, 2023

gsmiller commented Jul 11, 2023

gautamworah96 commented Jul 11, 2023

shubhamvishu commented Jul 13, 2023

shubhamvishu commented Jul 13, 2023

gsmiller commented Jul 20, 2023

shubhamvishu commented Aug 10, 2023

shubhamvishu commented Oct 27, 2023

mikemccand commented Oct 29, 2023

gsmiller commented Oct 30, 2023

benwtrent commented Nov 1, 2023

gsmiller commented Nov 1, 2023 • edited Loading

benwtrent commented Nov 1, 2023

gsmiller commented Nov 1, 2023

dweiss commented Nov 1, 2023 • edited Loading

shubhamvishu commented Nov 2, 2023

shubhamvishu commented Jul 9, 2023 •

edited

Loading

gsmiller commented Nov 1, 2023 •

edited

Loading

dweiss commented Nov 1, 2023 •

edited

Loading