Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UTF32toUTF8 will produce invalid transition #12472

Merged
merged 8 commits into from
Aug 16, 2023

Conversation

tang-hi
Copy link
Contributor

@tang-hi tang-hi commented Jul 29, 2023

FIX ISSUE #12458

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tang-hi -- what a sneaky bug! It's crazy it took randomized testing so long to eek out this failure, and also wonderful that it did.

How do we characterize the end user impact of this bug? We need a well written CHANGES entry. The impact should be small, since it's accepting illegal UTF-8 sequences? It'd only affect users operating entirely in binary space (not Unicode encoded UTF-8 terms, which is the vast majority of users / default case).

@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
// doesn't accept certain byte sequences) -- there
// are other cases we could optimize too:
startCode = 194;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's maybe fix this one to hex as well (0xC2)?

@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
// doesn't accept certain byte sequences) -- there
// are other cases we could optimize too:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment (there are other cases we could optimize too) still true :) Or are these two new ifs covering them AND fixing this sneaky bug?

@@ -188,6 +191,30 @@ public void testUTF8CodePointAt() {
}
}

public void testUTF8TwoToThreeBytes() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also add the three-to-four and one-to-two cases?

…that a codepoint would start at 0x80 when the number of bits is 6.

This has now been corrected. The correct behavior is that when the length of the codepoint is 3 and the first byte is 0xE0, it will start from 0xA0.
Similarly, when the length of the codepoint is 4 and the first byte is 0xF0, it will start from 0x90.
@tang-hi
Copy link
Contributor Author

tang-hi commented Jul 30, 2023

@mikemccand Thanks for the review. I have already made the updates to the pull request based on your review comments. 😄

@tang-hi tang-hi requested a review from mikemccand July 31, 2023 11:56
@mikemccand
Copy link
Member

@mikemccand Thanks for the review. I have already made the updates to the pull request based on your review comments. 😄

Wow that was fast, thank you! I'll try to review soon -- I gotta stare at this long enough to understand the bug / fix / UTF-8 details. Thank you for digging on this tricky case so quickly.

@tang-hi
Copy link
Contributor Author

tang-hi commented Aug 1, 2023

2 byte -> 3 byte
image

3 byte -> 4 byte
image

hope these pictures will be helpful

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, I finally understand this bug!

And I realize how I messed this up long ago after studying the UTF-8 encoding table ... from the line for the first 3-byte UTF-8 sequence:

U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx

one cannot assume that the xxxx of the not-first UTF-8 byte will be 0s! For the 2->3 and 3->4 case they are not :)

Thank you @tang-hi for bearing with me as I remember this tricky stuff. I left a bunch of small comments to help readability maybe. I think the fix is fundamentally correct!

} else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
startCode = 0xA0;
} else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {
startCode = 0x90;
} else {
startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why MASKS is size 32? We only access it with numBits -- shouldn't it only be length 7? Separately we should make it 1-based so access doesn't have to keep subtracting 1 ... might eek out a bit of CPU. But let's fix that later/separately...

@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
endUTF8.byteAt(upto));
} else {
// There are three special case
// 1. if codepoint's numBits is 5, byte start from 0xC2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of saying if a codepoint's numBits is 5, could we say if a codepoint encodes to 2 UTF-8 bytes? And same for bullets 2 & 3 (3 UTF-8 bytes, 4 UTF-8 bytes)?

// 1. if codepoint's numBits is 5, byte start from 0xC2
// 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0
// 3. if codepoint's len is 4 and first byte is 0xF0, the second byte start from 0x90
// you could found the reference in https://www.utf8-chartable.de/unicode-utf8-table.pl
final int startCode;
if (endUTF8.numBits(upto) == 5) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, could we change this to:

    if (endUTF8.len == 2) {
        assert upto == 0;   // the upto==1 case will be handled by the first if above
        startCode = 0xC2;
    }

?

} else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
startCode = 0xA0;
} else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {
startCode = 0x90;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm should we also set 0x80 in the upto == 2 case and endUTF8.len == 4 when the leading bytes are 0xF0 0x90? But how come no test is failing, if that's right?

Edit: OK, I see, the existing code will in fact set startCode = 0x80 already in that case, so, no bug, phew. And this same logic happens to work for 2nd byte of 1 -> 2 UTF8 length transition as well (also 0x80, which the first if handles correctly today). Tricky!

@@ -182,6 +182,10 @@ Bug Fixes
* GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458
(Greg Miller).

* GITHUB#2472: Fix the incorrect conversion from a Unicode (UTF-32) automaton to its UTF-8 representation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we reword maybe to:

UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences.  This should not have any user impact except in the very unusual possible use-case of searching a non-UTF-8 (fully binary terms) inverted field using Unicode search terms.

?

Net/net I don't think this will impact any users ... users who index binary content would search it with binary terms (not via conversion from Unicode, where this bug lurks) unless they had a truly exotic use case.

@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
endUTF8.byteAt(upto));
} else {
// There are three special case
// 1. if codepoint's numBits is 5, byte start from 0xC2
// 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra space before and and first?

startCode = 194;
startCode = 0xC2;
} else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
startCode = 0xA0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the above three bullets into each of these internal ifs to make the special casing clearer?

E.g. in this if maybe say:

// the first length=3 UTF8 Unicode character is E0 A0 80 so we must special case A0 as the 2nd byte when E0 was the first byte of endUTF8 in this case

@@ -142,22 +141,11 @@ private void checkAutomaton(List<BytesRef> expected, Automaton a, boolean isBina
}

// Make sure every term produced by the automaton is expected
FiniteStringsIterator it = new FiniteStringsIterator(a);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why did you remove the isBinary case for this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it was added by @gsmiller in #12461 for the purpose of passing the CI test, it should be restored since we have already fixed the bug.

b.addTransition(s1, s2, 0x800);
b.addTransition(s1, s2, 0xFFFF);
// utf8 codepoint length is 4
b.addTransition(s1, s2, 0x10000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure the Automaton builder collapses adjacent transitions like this, but for paranoia, could you add the range explicitly? E.g.:

b.addTransition(s1, s2, 0x7f, 0x80);

(Instead of two separate addTransition calls). And same for the other two (four) transitions?

@tang-hi
Copy link
Contributor Author

tang-hi commented Aug 1, 2023

I have already updated the PR. You may review it at your convenience.

@tang-hi tang-hi requested a review from mikemccand August 1, 2023 17:20
@gsmiller
Copy link
Contributor

gsmiller commented Aug 3, 2023

Thanks @tang-hi for this change! I'll also try to have a look at this tomorrow or Friday since I already spent some time untangling the code in here :)

@tang-hi
Copy link
Contributor Author

tang-hi commented Aug 10, 2023

@mikemccand Just a gentle reminder, perhaps you can review this code and then move this PR forward.

Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty elegant fix for a tricky bug. Thank you!

@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
// start.addTransition(new Transition(endUTF8.byteAt(upto) &
// (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end)); // type=end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor, but could we either remove this commented-out code or also update the index used against MASKS to reflect the offset change you're making here? (I'd probably vote for removing the commented out code completely)

@@ -182,6 +182,10 @@ Bug Fixes
* GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458
(Greg Miller).

* GITHUB#2472: UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences. This should not have any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless a user was directly using UTF32ToUTF8#convert right? Maybe you could mention that somehow?


static {
int v = 2;
for (int i = 0; i < 32; i++) {
MASKS[i] = v - 1;
for (int i = 0; i < 7; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tidy-up!

// doesn't accept certain byte sequences) -- there
// are other cases we could optimize too:
startCode = 194;
if (endUTF8.len == 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should reference the issue number in a comment here that tracked the bug for future readers? It's a pretty tricky bug, so having the issue number in a comment might be useful?

// the first length=3 UTF8 Unicode character is E0 A0 80,
// so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8.
startCode = 0xA0;
} else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sort of contorting my brain to try to figure out if this can ever actually happen since the max UTF8 length is 4 bytes. I'm not opposed to keeping it, but for my own sanity, did you actually trip this problem with a test case or see something I'm overlooking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you comment it out, the test testUTF8SpanMultipleBytes in TestUnicodeUtil.java will fail. This is because when there is a transition span from 0xFFFF (3 bytes) to 0x10000 (4 bytes), it will produce an incorrect result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, of course. That makes sense. My brain was a bit tired at the end of the day yesterday when I was looking through this, and I had an "off by one" bug.

@tang-hi tang-hi requested a review from gsmiller August 15, 2023 15:44
@tang-hi
Copy link
Contributor Author

tang-hi commented Aug 15, 2023

Hi, @gsmiller! I have already updated the PR. You may review it at your convenience.

@gsmiller
Copy link
Contributor

Thanks @tang-hi. This LGTM now (I have two outstanding, small bits of feedback/questions above if you have a chance, but they're not blocking). I'll leave this open for another day incase Mike has any additional feedback, but will otherwise get it merged in a day or so. Thanks again!

Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Will merge tomorrow unless Mike has additional feedback.

// the first length=3 UTF8 Unicode character is E0 A0 80,
// so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8.
startCode = 0xA0;
} else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, of course. That makes sense. My brain was a bit tired at the end of the day yesterday when I was looking through this, and I had an "off by one" bug.

@gsmiller gsmiller merged commit ec13678 into apache:main Aug 16, 2023
4 checks passed
@gsmiller
Copy link
Contributor

Merged and back-ported. Thanks again @tang-hi! Nice fix for a reasonably tricky issue!

@gsmiller gsmiller added this to the 9.8.0 milestone Aug 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants