Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interesting TestStringsToAutomaton failure #12451

Closed
jpountz opened this issue Jul 20, 2023 · 9 comments
Closed

Interesting TestStringsToAutomaton failure #12451

jpountz opened this issue Jul 20, 2023 · 9 comments
Milestone

Comments

@jpountz
Copy link
Contributor

jpountz commented Jul 20, 2023

Description

This test fails because the final automaton reports that it contains { 0xf0, 0x80, 0x80, 0x80 } even though it has not been added to the automaton builder.

01:41:57 org.apache.lucene.util.automaton.TestStringsToAutomaton > test suite's output saved to /var/lib/jenkins/workspace/apache+lucene+main/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.util.automaton.TestStringsToAutomaton.txt, copied below:
01:41:57    >     java.lang.AssertionError
01:41:57    >         at __randomizedtesting.SeedInfo.seed([1D5E685C39EEEAE6:E06CE40C08BB0DEB]:0)
01:41:57    >         at org.junit.Assert.fail(Assert.java:87)
01:41:57    >         at org.junit.Assert.assertTrue(Assert.java:42)
01:41:57    >         at org.junit.Assert.assertTrue(Assert.java:53)
01:41:57    >         at org.apache.lucene.util.automaton.TestStringsToAutomaton.checkAutomaton(TestStringsToAutomaton.java:148)
01:41:57    >         at org.apache.lucene.util.automaton.TestStringsToAutomaton.testRandom(TestStringsToAutomaton.java:128)
01:41:57    >         at org.apache.lucene.util.automaton.TestStringsToAutomaton.testRandomUnicodeOnly(TestStringsToAutomaton.java:87)
01:41:57    >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
01:41:57    >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
01:41:57    >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
01:41:57    >         at java.base/java.lang.reflect.Method.invoke(Method.java:568)
01:41:57    >         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
01:41:57    >         at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
01:41:57    >         at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
01:41:57    >         at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
01:41:57    >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
01:41:57    >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
01:41:57    >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
01:41:57    >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
01:41:57    >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
01:41:57    >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
01:41:57    >         at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
01:41:57    >         at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
01:41:57    >         at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
01:41:57    >         at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
01:41:57    >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
01:41:57    >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
01:41:57    >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
01:41:57    >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
01:41:57    >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
01:41:57    >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
01:41:57    >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
01:41:57    >         at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
01:41:57    >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
01:41:57    >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
01:41:57    >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
01:41:57    >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
01:41:57    >         at java.base/java.lang.Thread.run(Thread.java:833)

Gradle command to reproduce

01:41:57   2> NOTE: reproduce with: gradlew test --tests TestStringsToAutomaton.testRandomUnicodeOnly -Dtests.seed=1D5E685C39EEEAE6 -Dtests.locale=ro-RO -Dtests.timezone=America/Kentucky/Louisville -Dtests.asserts=true -Dtests.file.encoding=UTF-8
@gsmiller
Copy link
Contributor

I've reproduced this and narrowed it down to the following: The built automaton is correct up to the point when the (randomly generated) term 0xf0 0x90 0x80 0x80 gets added. Once that term is added, the automaton will incorrectly produce 0xf0 0x80 0x80 0x80 as a valid term. Digging into why that is.

@gsmiller
Copy link
Contributor

I've tracked it down to this exact sequence being added:
U+65535 (0xef 0xbf 0xbf) followed by U+65536 (0xf0 0x90 0x80 0x80). Note that these code points are immediately next to each other. If you change either of them by one value down/up, the bug no longer exists. I have to step away now but will try to spend more time on it soon.

You can reproduce this by adding the following code immediately after line 127 in TestStringsToAutomaton:

      sorted = new ArrayList<BytesRef>();
      sorted.add(new BytesRef(new byte[]{(byte) 0xef, (byte) 0xbf, (byte) 0xbf}));
      sorted.add(new BytesRef(new byte[]{(byte) 0xf0, (byte) 0x90, (byte) 0x80, (byte) 0x80}));

@gsmiller
Copy link
Contributor

I'll also mention that:

  1. The RunAutomaton also incorrectly accepts 0xf0 0x80 0x80 0x80
  2. It's also worth noting that the incorrectly accepted/produced value, 0xf0 0x80 0x80 0x80, is code point 0 represented as 4 bytes.

@gsmiller
Copy link
Contributor

Hmm... I'm not sure yet if this is a new bug or if the tests added by the PR this references just uncovered it. But if I add U+65535 and U+65536 as two single-character terms in an automaton, here's what gets built:
out
It's representing the two terms as a min/max range on a single transition.

I wonder if the bug is in the automaton representation itself, where our min in this case is the largest value that can be represented in utf8 in 3 bytes, and our max is the smallest value that can be represented in 4 utf8 bytes. In this case:

  • min: U+65535; 11111111 11111111 in binary; 11101111 10111111 10111111 in utf8
  • max: U+65536; 00000001 00000000 00000000 in binary; 11110000 10010000 10000000 10000000 in utf8

The value that is getting erroneously accepted by the automaton is the 4 byte representation of all 0s in utf8. I wonder if the range is incorrectly including that value in the edge case where the min/max transition range spans a transition in utf8 byte counts? This is just a guess based on the values that reproduce the bug. I've run out of time for now to dig into it, but will peek again this weekend if I have time or early next week if not.

@gsmiller
Copy link
Contributor

gsmiller commented Jul 24, 2023

Also, here's the compiled automaton resulting from the code-point automaton above:
out

This shows 0xf0 0x80 0x80 0x80 being erroneously accepted.

@gsmiller
Copy link
Contributor

This looks like a bug in the UTF32toUTF8 conversion logic to me. It seems the resulting UTF8 automaton in this case is producing a large range of invalid UTF8. This appears to all be preexisting behavior that has been exposed with new testing. I detailed the bug in #12458.

@jpountz
Copy link
Contributor Author

jpountz commented Jul 25, 2023

Thanks a lot for the detailed investigation @gsmiller.

@gsmiller
Copy link
Contributor

@jpountz of course. It's an interesting issue. I'll post a small PR shortly to work around it in this test case to avoid the periodic failures shortly.

@gsmiller
Copy link
Contributor

PR that works around the underlying bug, making sure we don't hit this sporadic test case failures until the bug eventually is fixed: #12461

@gsmiller gsmiller added this to the 9.8.0 milestone Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants