Fix UTF32toUTF8 will produce invalid transition #12472

tang-hi · 2023-07-29T12:55:23Z

Fixes issue #12458.

mikemccand

Thanks @tang-hi -- what a sneaky bug! It's crazy it took randomized testing so long to eke out this failure, and also wonderful that it did.

How do we characterize the end user impact of this bug? We need a well-written CHANGES entry. The impact should be small, since it's accepting illegal UTF-8 sequences? It'd only affect users operating entirely in binary space (not Unicode encoded UTF-8 terms, which is the vast majority of users / default case).

mikemccand · 2023-07-30T10:24:10Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
        // doesn't accept certain byte sequences) -- there
        // are other cases we could optimize too:
        startCode = 194;


Let's maybe fix this one to hex as well (0xC2)?

mikemccand · 2023-07-30T10:24:48Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
        // doesn't accept certain byte sequences) -- there
        // are other cases we could optimize too:


Is this comment ("there are other cases we could optimize too") still true? :) Or are these two new ifs covering them AND fixing this sneaky bug?

mikemccand · 2023-07-30T10:25:28Z

lucene/core/src/test/org/apache/lucene/util/TestUnicodeUtil.java

@@ -188,6 +191,30 @@ public void testUTF8CodePointAt() {
    }
  }

+  public void testUTF8TwoToThreeBytes() throws Exception {


Maybe also add the three-to-four and one-to-two cases?

…that a codepoint would start at 0x80 when the number of bits is 6. This has now been corrected. The correct behavior is that when the length of the codepoint is 3 and the first byte is 0xE0, it will start from 0xA0. Similarly, when the length of the codepoint is 4 and the first byte is 0xF0, it will start from 0x90.
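The two lower bounds stated above can be cross-checked against the JDK's own UTF-8 encoder. The following is a minimal sketch, not code from the PR: the class and method names are hypothetical, and it simply scans every encodable code point and records the smallest second byte observed for a given lead byte.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of Lucene: brute-force scan over all code points.
class Utf8SecondByteBounds {

  // Returns the smallest second byte that follows leadByte in any valid
  // multi-byte UTF-8 encoding, or -1 if leadByte never occurs as a lead byte.
  static int minSecondByte(int leadByte) {
    int min = -1;
    for (int cp = 0x80; cp <= Character.MAX_CODE_POINT; cp++) {
      if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
        continue; // lone surrogates have no UTF-8 encoding
      }
      byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
      if ((utf8[0] & 0xFF) == leadByte) {
        int second = utf8[1] & 0xFF;
        if (min == -1 || second < min) {
          min = second;
        }
      }
    }
    return min;
  }

  public static void main(String[] args) {
    System.out.printf("E0 -> %02X%n", minSecondByte(0xE0));
    System.out.printf("F0 -> %02X%n", minSecondByte(0xF0));
    System.out.printf("E1 -> %02X%n", minSecondByte(0xE1));
  }
}
```

Only the lead bytes 0xE0 and 0xF0 have a raised lower bound; other lead bytes such as 0xE1 still allow 0x80 as the second byte, which is why the fix special-cases exactly those two.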

tang-hi · 2023-07-30T11:56:42Z

@mikemccand Thanks for the review. I have already made the updates to the pull request based on your review comments. 😄

mikemccand · 2023-08-01T07:16:40Z

> @mikemccand Thanks for the review. I have already made the updates to the pull request based on your review comments. 😄

Wow that was fast, thank you! I'll try to review soon -- I gotta stare at this long enough to understand the bug / fix / UTF-8 details. Thank you for digging on this tricky case so quickly.

tang-hi · 2023-08-01T07:30:48Z

[Image: 2 byte -> 3 byte]

[Image: 3 byte -> 4 byte]

Hope these pictures will be helpful.

mikemccand

Phew, I finally understand this bug!

And I realize how I messed this up long ago after studying the UTF-8 encoding table ... from the line for the first 3-byte UTF-8 sequence:

U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx

one cannot assume that the xxxx of the not-first UTF-8 byte will be 0s! For the 2->3 and 3->4 case they are not :)
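To make the point about the continuation bytes concrete, here is a small sketch (hypothetical class and method names, not Lucene code) that hand-packs a code point into the three-byte layout from the table row above. For U+0800 the payload bits of the second byte are 100000, not all zeros.

```java
// Illustrative only: packs a code point in [U+0800, U+FFFF] into the
// 1110xxxx 10xxxxxx 10xxxxxx layout. Surrogates are not filtered out here.
class ThreeByteLayout {

  static int[] encode3(int codePoint) {
    if (codePoint < 0x800 || codePoint > 0xFFFF) {
      throw new IllegalArgumentException("not a 3-byte code point");
    }
    int b0 = 0xE0 | (codePoint >> 12);         // top 4 payload bits
    int b1 = 0x80 | ((codePoint >> 6) & 0x3F); // middle 6 payload bits
    int b2 = 0x80 | (codePoint & 0x3F);        // low 6 payload bits
    return new int[] {b0, b1, b2};
  }

  public static void main(String[] args) {
    int[] first = encode3(0x800);
    // prints E0 A0 80
    System.out.printf("%02X %02X %02X%n", first[0], first[1], first[2]);
  }
}
```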

Thank you @tang-hi for bearing with me as I remember this tricky stuff. I left a bunch of small comments to help readability maybe. I think the fix is fundamentally correct!

mikemccand · 2023-08-01T10:29:57Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {
+        startCode = 0x90;
      } else {
        startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]);


I wonder why MASKS is size 32? We only access it with numBits -- shouldn't it only be length 7? Separately we should make it 1-based so access doesn't have to keep subtracting 1 ... might eke out a bit of CPU. But let's fix that later/separately...
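As a rough illustration of why seven entries suffice, the sketch below rebuilds a 0-based mask table and applies the fallback expression from the quoted diff. It is written under assumptions: the class name and the `clearPayload` helper are hypothetical, and the doubling step `v *= 2` is inferred, since the diff in this thread does not show the full loop body. A UTF-8 byte carries at most 7 payload bits, so `numBits - 1` never exceeds 6.

```java
// Sketch of the masking fallback, not the actual Lucene source.
class MaskSketch {
  static final int[] MASKS = new int[7];

  static {
    int v = 2;
    for (int i = 0; i < MASKS.length; i++) {
      MASKS[i] = v - 1; // i + 1 low bits set: 0x01, 0x03, ..., 0x7F
      v *= 2;           // assumed step; not visible in the quoted diff
    }
  }

  // Clears the payload bits of a UTF-8 byte, leaving only its fixed prefix.
  static int clearPayload(int utf8Byte, int numBits) {
    return utf8Byte & ~MASKS[numBits - 1];
  }

  public static void main(String[] args) {
    System.out.printf("%02X%n", clearPayload(0xBF, 6)); // continuation byte
    System.out.printf("%02X%n", clearPayload(0xDF, 5)); // 2-byte lead byte
  }
}
```

Note that the plain mask yields 0xC0 for a two-byte lead byte, which is below the smallest valid lead byte 0xC2; that gap is what the special cases in this PR cover.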

mikemccand · 2023-08-01T10:46:11Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
          endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
          endUTF8.byteAt(upto));
    } else {
+      // There are three special case
+      // 1. if codepoint's numBits is 5, byte start from 0xC2


Instead of saying if a codepoint's numBits is 5, could we say if a codepoint encodes to 2 UTF-8 bytes? And same for bullets 2 & 3 (3 UTF-8 bytes, 4 UTF-8 bytes)?

mikemccand · 2023-08-01T10:48:32Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

+      // 1. if codepoint's numBits is 5, byte start from 0xC2
+      // 2. if codepoint's len is 3  and  first byte is 0xE0. the second byte start from 0xA0
+      // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte start from 0x90
+      // you could found the reference in https://www.utf8-chartable.de/unicode-utf8-table.pl
      final int startCode;
      if (endUTF8.numBits(upto) == 5) {


For consistency, could we change this to:

    if (endUTF8.len == 2) {
      assert upto == 0; // the upto==1 case will be handled by the first if above
      startCode = 0xC2;
    }

?

mikemccand · 2023-08-01T10:50:21Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {
+        startCode = 0x90;


Hmm should we also set 0x80 in the upto == 2 case and endUTF8.len == 4 when the leading bytes are 0xF0 0x90? But how come no test is failing, if that's right?

Edit: OK, I see, the existing code will in fact set startCode = 0x80 already in that case, so, no bug, phew. And this same logic happens to work for 2nd byte of 1 -> 2 UTF8 length transition as well (also 0x80, which the first if handles correctly today). Tricky!
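The case analysis in this comment can be traced with a standalone paraphrase of the branch under review. This is a sketch only: the method name and flat parameter list are hypothetical stand-ins for the `endUTF8` accessors, and the mask is computed inline rather than read from `MASKS`.

```java
// Standalone paraphrase of the startCode selection; not Lucene's actual code.
class StartCodeSketch {

  static int startCode(int len, int upto, int firstByte, int byteAtUpto, int numBits) {
    if (len == 2 && upto == 0) {
      return 0xC2; // C0 and C1 never start a valid 2-byte sequence
    } else if (len == 3 && upto == 1 && firstByte == 0xE0) {
      return 0xA0; // first 3-byte sequence is E0 A0 80
    } else if (len == 4 && upto == 1 && firstByte == 0xF0) {
      return 0x90; // first 4-byte sequence is F0 90 80 80
    } else {
      int payloadMask = (1 << numBits) - 1;
      return byteAtUpto & ~payloadMask; // clear payload bits, keep the prefix
    }
  }

  public static void main(String[] args) {
    // Third byte of a 4-byte sequence led by F0: the generic branch gives 80.
    System.out.printf("%02X%n", startCode(4, 2, 0xF0, 0xBF, 6));
    // Second byte after E0: the special case gives A0.
    System.out.printf("%02X%n", startCode(3, 1, 0xE0, 0xBF, 6));
  }
}
```

In this sketch the `upto == 2` case falls through to the masking branch and returns 0x80 without any extra special case, matching the conclusion in the edit above.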

mikemccand · 2023-08-01T11:10:38Z

lucene/CHANGES.txt

@@ -182,6 +182,10 @@ Bug Fixes
 * GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458
  (Greg Miller).

+* GITHUB#2472: Fix the incorrect conversion from a Unicode (UTF-32) automaton to its UTF-8 representation.


Could we reword maybe to:

UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences. This should not have any user impact except in the very unusual possible use-case of searching a non-UTF-8 (fully binary terms) inverted field using Unicode search terms.

?

Net/net I don't think this will impact any users ... users who index binary content would search it with binary terms (not via conversion from Unicode, where this bug lurks) unless they had a truly exotic use case.
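For readers unsure what an "invalid UTF-8 binary sequence" looks like in practice, the sketch below uses the JDK's strict decoder to classify a few byte sequences. The class and method names are hypothetical. The overlong forms such as E0 80 80 are chosen as examples of the kind of sequence discussed in this PR; whether the buggy automaton accepted these exact bytes is an assumption, not something the thread states.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative check using only the JDK; unrelated to Lucene's automaton code.
class StrictUtf8Check {

  // Returns true only if the bytes decode as well-formed UTF-8.
  static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8
          .newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] overlong = {(byte) 0xE0, (byte) 0x80, (byte) 0x80};
    byte[] firstThreeByte = {(byte) 0xE0, (byte) 0xA0, (byte) 0x80};
    System.out.println(isValidUtf8(overlong));       // second byte below A0
    System.out.println(isValidUtf8(firstThreeByte)); // encoding of U+0800
  }
}
```

Terms produced by encoding real Unicode strings never contain such bytes, which is consistent with the view that only fully binary terms searched through Unicode-derived automata could be affected.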

mikemccand · 2023-08-01T12:26:06Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
          endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
          endUTF8.byteAt(upto));
    } else {
+      // There are three special case
+      // 1. if codepoint's numBits is 5, byte start from 0xC2
+      // 2. if codepoint's len is 3  and  first byte is 0xE0. the second byte start from 0xA0


Extra space before "and" and "first"?

mikemccand · 2023-08-01T12:49:15Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

-        startCode = 194;
+        startCode = 0xC2;
+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        startCode = 0xA0;


Can you move the above three bullets into each of these internal ifs to make the special casing clearer?

E.g. in this if maybe say:

// the first length=3 UTF8 Unicode character is E0 A0 80 so we must special case A0 as the 2nd byte when E0 was the first byte of endUTF8 in this case

mikemccand · 2023-08-01T12:53:37Z

lucene/core/src/test/org/apache/lucene/util/automaton/TestStringsToAutomaton.java

@@ -142,22 +141,11 @@ private void checkAutomaton(List<BytesRef> expected, Automaton a, boolean isBina
    }

    // Make sure every term produced by the automaton is expected
-    FiniteStringsIterator it = new FiniteStringsIterator(a);


Hmm why did you remove the isBinary case for this test?

Because it was added by @gsmiller in #12461 for the purpose of passing the CI test, it should be restored since we have already fixed the bug.

mikemccand · 2023-08-01T12:56:11Z

lucene/core/src/test/org/apache/lucene/util/TestUnicodeUtil.java

+    b.addTransition(s1, s2, 0x800);
+    b.addTransition(s1, s2, 0xFFFF);
+    // utf8 codepoint length is 4
+    b.addTransition(s1, s2, 0x10000);


I'm pretty sure the Automaton builder collapses adjacent transitions like this, but for paranoia, could you add the range explicitly? E.g.:

b.addTransition(s1, s2, 0x7f, 0x80);

(Instead of two separate addTransition calls). And same for the other two (four) transitions?

tang-hi · 2023-08-01T17:20:00Z

I have already updated the PR. You may review it at your convenience.

gsmiller · 2023-08-03T00:44:14Z

Thanks @tang-hi for this change! I'll also try to have a look at this tomorrow or Friday since I already spent some time untangling the code in here :)

tang-hi · 2023-08-10T15:56:26Z

@mikemccand Just a gentle reminder, perhaps you can review this code and then move this PR forward.

gsmiller

Pretty elegant fix for a tricky bug. Thank you!

gsmiller · 2023-08-14T20:50:37Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
      // start.addTransition(new Transition(endUTF8.byteAt(upto) &
      // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end));   // type=end


very minor, but could we either remove this commented-out code or also update the index used against MASKS to reflect the offset change you're making here? (I'd probably vote for removing the commented out code completely)

gsmiller · 2023-08-15T01:22:24Z

lucene/CHANGES.txt

@@ -182,6 +182,10 @@ Bug Fixes
 * GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458
  (Greg Miller).

+* GITHUB#2472: UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences.  This should not have any


Unless a user was directly using UTF32ToUTF8#convert, right? Maybe you could mention that somehow?

gsmiller · 2023-08-15T01:22:39Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java


  static {
    int v = 2;
-    for (int i = 0; i < 32; i++) {
-      MASKS[i] = v - 1;
+    for (int i = 0; i < 7; i++) {


Nice tidy-up!

gsmiller · 2023-08-15T01:23:22Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+      if (endUTF8.len == 2) {


Maybe we should reference the issue number in a comment here that tracked the bug for future readers? It's a pretty tricky bug, so having the issue number in a comment might be useful?

gsmiller · 2023-08-15T01:24:20Z

lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java

+        // the first length=3 UTF8 Unicode character is E0 A0 80,
+        // so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8.
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {


I'm sort of contorting my brain to try to figure out if this can ever actually happen since the max UTF8 length is 4 bytes. I'm not opposed to keeping it, but for my own sanity, did you actually trip this problem with a test case or see something I'm overlooking?

If you comment it out, the test testUTF8SpanMultipleBytes in TestUnicodeUtil.java will fail. This is because when there is a transition span from 0xFFFF (3 bytes) to 0x10000 (4 bytes), it will produce an incorrect result.

Right, of course. That makes sense. My brain was a bit tired at the end of the day yesterday when I was looking through this, and I had an "off by one" bug.
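The length boundaries mentioned in this exchange can be listed with the JDK encoder. This is a small sketch with hypothetical names, showing the last code point of each UTF-8 length next to the first code point of the next length, including the 0xFFFF to 0x10000 span that triggers the 0xF0 special case.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: prints UTF-8 bytes on both sides of each length boundary.
class BoundaryEncodings {

  // Encodes one code point with the JDK and formats the bytes as spaced hex.
  static String utf8Hex(int codePoint) {
    byte[] utf8 = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder();
    for (byte b : utf8) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    int[] boundaries = {0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000};
    for (int cp : boundaries) {
      System.out.printf("U+%04X -> %s%n", cp, utf8Hex(cp));
    }
  }
}
```

The first four-byte encoding is F0 90 80 80, so an automaton that let the second byte start at 0x80 after 0xF0 would also match byte sequences that no code point encodes to.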

tang-hi · 2023-08-15T15:44:55Z

Hi, @gsmiller! I have already updated the PR. You may review it at your convenience.

gsmiller · 2023-08-15T16:32:51Z

Thanks @tang-hi. This LGTM now (I have two outstanding, small bits of feedback/questions above if you have a chance, but they're not blocking). I'll leave this open for another day in case Mike has any additional feedback, but will otherwise get it merged in a day or so. Thanks again!

gsmiller

LGTM! Will merge tomorrow unless Mike has additional feedback.

gsmiller · 2023-08-16T22:32:21Z

Merged and back-ported. Thanks again @tang-hi! Nice fix for a reasonably tricky issue!

tang-hi added 2 commits July 29, 2023 18:46

fix error convert from utf32 to utf8

75de476

fix utf8 error

f538154

tang-hi mentioned this pull request Jul 29, 2023

UTF32toUTF8 can create automata that produce/accept invalid unicode #12458

Closed

fix typo

4591ff1

mikemccand reviewed Jul 30, 2023

fix typo

6c30f69

tang-hi requested a review from mikemccand July 31, 2023 11:56

mikemccand reviewed Aug 1, 2023

tang-hi added 2 commits August 2, 2023 01:10

update comment and improve readability based one review comments

667e56a

fix typo

3ac53ed

tang-hi requested a review from mikemccand August 1, 2023 17:20

gsmiller reviewed Aug 15, 2023

upadte code based on code reivew

318d077

tang-hi requested a review from gsmiller August 15, 2023 15:44

gsmiller approved these changes Aug 15, 2023

gsmiller merged commit ec13678 into apache:main Aug 16, 2023
4 checks passed

gsmiller pushed a commit that referenced this pull request Aug 16, 2023

Fix UTF32toUTF8 will produce invalid transition (#12472)

20748b7

gsmiller added this to the 9.8.0 milestone Aug 16, 2023
