-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix UTF32toUTF8 will produce invalid transition #12472
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tang-hi -- what a sneaky bug! It's crazy it took randomized testing so long to eek out this failure, and also wonderful that it did.
How do we characterize the end user impact of this bug? We need a well written CHANGES entry. The impact should be small, since it's accepting illegal UTF-8 sequences? It'd only affect users operating entirely in binary space (not Unicode encoded UTF-8 terms, which is the vast majority of users / default case).
@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA | |||
// doesn't accept certain byte sequences) -- there | |||
// are other cases we could optimize too: | |||
startCode = 194; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's maybe fix this one to hex as well (0xC2)?
@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA | |||
// doesn't accept certain byte sequences) -- there | |||
// are other cases we could optimize too: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment (there are other cases we could optimize too
) still true :) Or are these two new ifs covering them AND fixing this sneaky bug?
@@ -188,6 +191,30 @@ public void testUTF8CodePointAt() { | |||
} | |||
} | |||
|
|||
public void testUTF8TwoToThreeBytes() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add the three-to-four and one-to-two cases?
…that a codepoint would start at 0x80 when the number of bits is 6. This has now been corrected. The correct behavior is that when the length of the codepoint is 3 and the first byte is 0xE0, it will start from 0xA0. Similarly, when the length of the codepoint is 4 and the first byte is 0xF0, it will start from 0x90.
@mikemccand Thanks for the review. I have already made the updates to the pull request based on your review comments. 😄 |
Wow that was fast, thank you! I'll try to review soon -- I gotta stare at this long enough to understand the bug / fix / UTF-8 details. Thank you for digging on this tricky case so quickly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phew, I finally understand this bug!
And I realize how I messed this up long ago after studying the UTF-8 encoding table ... from the line for the first 3-byte UTF-8 sequence:
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx
one cannot assume that the xxxx
of the not-first UTF-8 byte will be 0s! For the 2->3 and 3->4 case they are not :)
Thank you @tang-hi for bearing with me as I remember this tricky stuff. I left a bunch of small comments to help readability maybe. I think the fix is fundamentally correct!
} else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) { | ||
startCode = 0xA0; | ||
} else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) { | ||
startCode = 0x90; | ||
} else { | ||
startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why MASKS
is size 32? We only access it with numBits
-- shouldn't it only be length 7? Separately we should make it 1-based so access doesn't have to keep subtracting 1 ... might eek out a bit of CPU. But let's fix that later/separately...
@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA | |||
endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), | |||
endUTF8.byteAt(upto)); | |||
} else { | |||
// There are three special case | |||
// 1. if codepoint's numBits is 5, byte start from 0xC2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of saying if a codepoint's numBits is 5
, could we say if a codepoint encodes to 2 UTF-8 bytes
? And same for bullets 2 & 3 (3 UTF-8 bytes, 4 UTF-8 bytes)?
// 1. if codepoint's numBits is 5, byte start from 0xC2 | ||
// 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0 | ||
// 3. if codepoint's len is 4 and first byte is 0xF0, the second byte start from 0x90 | ||
// you could found the reference in https://www.utf8-chartable.de/unicode-utf8-table.pl | ||
final int startCode; | ||
if (endUTF8.numBits(upto) == 5) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, could we change this to:
if (endUTF8.len == 2) {
assert upto == 0; // the upto==1 case will be handled by the first if above
startCode = 0xC2;
}
?
} else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) { | ||
startCode = 0xA0; | ||
} else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) { | ||
startCode = 0x90; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm should we also set 0x80
in the upto == 2
case and endUTF8.len == 4
when the leading bytes are 0xF0 0x90
? But how come no test is failing, if that's right?
Edit: OK, I see, the existing code will in fact set startCode = 0x80
already in that case, so, no bug, phew. And this same logic happens to work for 2nd byte of 1 -> 2 UTF8 length transition as well (also 0x80, which the first if
handles correctly today). Tricky!
lucene/CHANGES.txt
Outdated
@@ -182,6 +182,10 @@ Bug Fixes | |||
* GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458 | |||
(Greg Miller). | |||
|
|||
* GITHUB#2472: Fix the incorrect conversion from a Unicode (UTF-32) automaton to its UTF-8 representation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we reword maybe to:
UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences. This should not have any user impact except in the very unusual possible use-case of searching a non-UTF-8 (fully binary terms) inverted field using Unicode search terms.
?
Net/net I don't think this will impact any users ... users who index binary content would search it with binary terms (not via conversion from Unicode, where this bug lurks) unless they had a truly exotic use case.
@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA | |||
endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), | |||
endUTF8.byteAt(upto)); | |||
} else { | |||
// There are three special case | |||
// 1. if codepoint's numBits is 5, byte start from 0xC2 | |||
// 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra space before and
and first
?
startCode = 194; | ||
startCode = 0xC2; | ||
} else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) { | ||
startCode = 0xA0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move the above three bullets into each of these internal if
s to make the special casing clearer?
E.g. in this if
maybe say:
// the first length=3 UTF8 Unicode character is E0 A0 80 so we must special case A0 as the 2nd byte when E0 was the first byte of endUTF8 in this case
@@ -142,22 +141,11 @@ private void checkAutomaton(List<BytesRef> expected, Automaton a, boolean isBina | |||
} | |||
|
|||
// Make sure every term produced by the automaton is expected | |||
FiniteStringsIterator it = new FiniteStringsIterator(a); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why did you remove the isBinary
case for this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b.addTransition(s1, s2, 0x800); | ||
b.addTransition(s1, s2, 0xFFFF); | ||
// utf8 codepoint length is 4 | ||
b.addTransition(s1, s2, 0x10000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure the Automaton builder collapses adjacent transitions like this, but for paranoia, could you add the range explicitly? E.g.:
b.addTransition(s1, s2, 0x7f, 0x80);
(Instead of two separate addTransition
calls). And same for the other two (four) transitions?
I have already updated the PR. You may review it at your convenience. |
Thanks @tang-hi for this change! I'll also try to have a look at this tomorrow or Friday since I already spent some time untangling the code in here :) |
@mikemccand Just a gentle reminder, perhaps you can review this code and then move this PR forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty elegant fix for a tricky bug. Thank you!
@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA | |||
// start.addTransition(new Transition(endUTF8.byteAt(upto) & | |||
// (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end)); // type=end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very minor, but could we either remove this commented-out code or also update the index used against MASKS to reflect the offset change you're making here? (I'd probably vote for removing the commented out code completely)
@@ -182,6 +182,10 @@ Bug Fixes | |||
* GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458 | |||
(Greg Miller). | |||
|
|||
* GITHUB#2472: UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences. This should not have any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless a user was directly using UTF32ToUTF8#convert
right? Maybe you could mention that somehow?
|
||
static { | ||
int v = 2; | ||
for (int i = 0; i < 32; i++) { | ||
MASKS[i] = v - 1; | ||
for (int i = 0; i < 7; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tidy-up!
// doesn't accept certain byte sequences) -- there | ||
// are other cases we could optimize too: | ||
startCode = 194; | ||
if (endUTF8.len == 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should reference the issue number in a comment here that tracked the bug for future readers? It's a pretty tricky bug, so having the issue number in a comment might be useful?
// the first length=3 UTF8 Unicode character is E0 A0 80, | ||
// so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8. | ||
startCode = 0xA0; | ||
} else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sort of contorting my brain to try to figure out if this can ever actually happen since the max UTF8 length is 4 bytes. I'm not opposed to keeping it, but for my own sanity, did you actually trip this problem with a test case or see something I'm overlooking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you comment it out, the test testUTF8SpanMultipleBytes
in TestUnicodeUtil.java
will fail. This is because when there is a transition span from 0xFFFF (3 bytes) to 0x10000 (4 bytes), it will produce an incorrect result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, of course. That makes sense. My brain was a bit tired at the end of the day yesterday when I was looking through this, and I had an "off by one" bug.
Hi, @gsmiller! I have already updated the PR. You may review it at your convenience. |
Thanks @tang-hi. This LGTM now (I have two outstanding, small bits of feedback/questions above if you have a chance, but they're not blocking). I'll leave this open for another day incase Mike has any additional feedback, but will otherwise get it merged in a day or so. Thanks again! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Will merge tomorrow unless Mike has additional feedback.
// the first length=3 UTF8 Unicode character is E0 A0 80, | ||
// so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8. | ||
startCode = 0xA0; | ||
} else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, of course. That makes sense. My brain was a bit tired at the end of the day yesterday when I was looking through this, and I had an "off by one" bug.
Merged and back-ported. Thanks again @tang-hi! Nice fix for a reasonably tricky issue! |
FIX ISSUE #12458