Fix from_utf8 handling of invalid UTF-8 codepoint #7442

kagamiori · 2023-11-07T00:00:58Z

Summary:
For an invalid codepoint that is multiple bytes long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognize it as multiple invalid codepoints of one byte each.
Presto, on the other hand, recognizes it as one codepoint. This makes the result
of from_utf8() differ between Velox and Presto, because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto's behavior.

Differential Revision: D51050645

netlify · 2023-11-07T00:01:03Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`90f7c48`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/655199de1a64e20008aa1e1c

facebook-github-bot · 2023-11-07T00:01:06Z

This pull request was exported from Phabricator. Differential Revision: D51050645

bikramSingh91

Looks like this is resulting in a regression, as conbench is failing. It's probably due to the extra checks/branches added here, but it might be worth looking into to be sure it's expected.

bikramSingh91 · 2023-11-07T01:30:21Z

velox/functions/prestosql/Utf8Utils.cpp

+int firstByteCharLength(const char* u_input) {
+  auto u = (const unsigned char*)u_input;
+  unsigned char u0 = u[0];
+  if (u0 < 0x80) {


nit: can you use decimals like in utf8proc_char_length instead of hexadecimal notation, since it's much easier to read?

I actually thought decimals were harder to read, so I used hex representations. What if I use binary representations instead, e.g., if (u0 < 0b10000000)?

This is great and actually much better than decimal, thanks!
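
For readers following along, here is a minimal sketch of what a lead-byte classifier written with binary literals could look like. It is not the code from this PR (only the first lines of firstByteCharLength appear in the thread); the name leadByteLength and the exact branch layout are assumptions made for illustration. Binary literals require C++14.

```cpp
// Hypothetical helper, not the PR's firstByteCharLength: classify a UTF-8
// lead byte by its high bits and return the sequence length it announces,
// or -1 if the byte cannot start a sequence.
inline int leadByteLength(unsigned char b) {
  if (b < 0b10000000) return 1;  // 0xxxxxxx: ASCII
  if (b < 0b11000000) return -1; // 10xxxxxx: continuation byte, not a lead
  if (b < 0b11100000) return 2;  // 110xxxxx
  if (b < 0b11110000) return 3;  // 1110xxxx
  if (b < 0b11111000) return 4;  // 11110xxx
  if (b < 0b11111100) return 5;  // 111110xx: legacy form, illegal per RFC 3629
  if (b < 0b11111110) return 6;  // 1111110x: legacy form, illegal per RFC 3629
  return -1;                     // 0xFE and 0xFF never appear in UTF-8
}
```

With this sketch, leadByteLength(0xFB) is 5, which is why the example sequence in the summary is treated as one five-byte unit. The sketch looks only at the lead byte and does not reject overlong lead bytes such as 0xC0.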

bikramSingh91 · 2023-11-07T17:46:28Z

velox/functions/prestosql/Utf8Utils.cpp

+    return -4;
+  }
+
+  auto filefthByte = input[4];


nit: did you mean fifthByte ?

Good catch! Thanks.

bikramSingh91 · 2023-11-07T17:47:14Z

velox/functions/prestosql/Utf8Utils.cpp

@@ -18,12 +18,53 @@
 #include "velox/external/utf8proc/utf8procImpl.h"

 namespace facebook::velox::functions {
+namespace {
+
+// Returns the length of a UTF-8 character indicated by the first byte.


nit: add to the comment that it returns -1 for an invalid length

bikramSingh91 · 2023-11-07T18:46:20Z

velox/functions/prestosql/Utf8Utils.cpp

+  }
+
+  auto filefthByte = input[4];
+  if (!utf_cont(filefthByte)) {


qq, what does utf_cont return?

UTF-8 continuation bytes start with 0b10 (i.e., 0b10xxxxxx). utf_cont simply checks whether the two highest bits are 10.
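
The check described above can be sketched as a one-line mask-and-compare. The definition of utf_cont itself is not shown in this thread, so the function below is a hypothetical stand-in named isContinuationByte, not the utf8proc code.

```cpp
// Hypothetical stand-in for utf_cont (the real definition is not shown in
// this thread). Keeps the two highest bits of the byte and checks that they
// are exactly 10, i.e. that the byte has the form 10xxxxxx.
inline bool isContinuationByte(char c) {
  return (static_cast<unsigned char>(c) & 0b11000000) == 0b10000000;
}
```

The cast to unsigned char matters because plain char may be signed, in which case bytes at or above 0x80 would be negative before masking.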

bikramSingh91 · 2023-11-07T18:47:52Z

velox/functions/prestosql/Utf8Utils.cpp

+  }
+
+  if (charLength == 5) {
+    // Per RFC3629, UTF-8 is limited to 4 bytes, so more bytes are illegal.


If anything above 4 is illegal, do we need any branches after 4? Can we just return -1 when size or charLength > 4?
If not, do we need a utf_cont check for those extra 5th and 6th bytes?

Byte sequences of length 5 and 6 were legal in the past but became illegal with RFC 3629. However, the data we read can still contain byte sequences of length 5 and 6 (e.g., outdated data), and we need to treat such a sequence as one codepoint. This happens in Presto batch queries, and Presto-java considers a sequence of 5 bytes as one codepoint.

Thanks, can you clarify with a short comment what the output of this function signifies? I wasn't entirely sure why we needed multiple negative values.

Hi @bikramSingh91, There is actually a comment in the header file that mentions the meaning of negative values. A negative return value means the byte sequence is invalid UTF-8 and the absolute value is the length of the invalid sequence. Do you think this is enough to explain the output of this function?

velox/velox/functions/prestosql/Utf8Utils.h

Lines 22 to 36 in f52eaa3

/// This function is not part of the original utf8proc.
/// Tries to get the length of UTF-8 encoded code point. A
/// positive return value means the UTF-8 sequence is valid, and
/// the result is the length of the code point. A negative return value means
/// the UTF-8 sequence at the position is invalid, and the length of the invalid
/// sequence is the absolute value of the result.
///
/// @param input Pointer to the first byte of the code point. Must not be null.
/// @param size Number of available bytes. Must be greater than zero.
/// @return the length of the code point or negative the number of bytes in the
/// invalid UTF-8 sequence.
///
/// Adapted from tryGetCodePointAt in
/// https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/SliceUtf8.java
int32_t tryGetCharLength(const char* input, int64_t size);

This works, thanks for letting me know. I had not checked the header file earlier.
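
To illustrate why the sign-and-magnitude contract is useful to a caller, here is a rough sketch of a replacement loop built on it. Everything in the sketch namespace is hypothetical: charLengthSketch is a simplified stand-in for tryGetCharLength that only looks at the lead byte and continuation bytes (it skips the overlong and surrogate checks a real implementation needs), and replaceInvalid is an illustrative caller, not from_utf8().

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch {

// Length announced by the lead byte, or -1 if it cannot start a sequence.
inline int leadLength(unsigned char b) {
  if (b < 0x80) return 1;
  if (b < 0xC0) return -1; // stray continuation byte
  if (b < 0xE0) return 2;
  if (b < 0xF0) return 3;
  if (b < 0xF8) return 4;
  if (b < 0xFC) return 5; // legacy 5-byte form
  if (b < 0xFE) return 6; // legacy 6-byte form
  return -1;
}

// Simplified stand-in for tryGetCharLength (assumption: no overlong or
// surrogate checks). Positive: valid length. Negative: bytes to skip.
inline int32_t charLengthSketch(const char* input, int64_t size) {
  const int length = leadLength(static_cast<unsigned char>(input[0]));
  if (length < 0) {
    return -1;
  }
  for (int i = 1; i < length; ++i) {
    if (i >= size || (static_cast<unsigned char>(input[i]) & 0xC0) != 0x80) {
      return -i; // i bytes were consumed before truncation or a bad byte
    }
  }
  // Well-formed 5- and 6-byte sequences are still invalid, but as one unit.
  return length <= 4 ? length : -length;
}

// Copies valid code points and emits one U+FFFD (0xEF 0xBF 0xBD in UTF-8)
// per invalid sequence, skipping the whole sequence at once.
inline std::string replaceInvalid(const std::string& in) {
  static const std::string kReplacement = "\xEF\xBF\xBD";
  std::string out;
  std::size_t pos = 0;
  while (pos < in.size()) {
    const int32_t n = charLengthSketch(
        in.data() + pos, static_cast<int64_t>(in.size() - pos));
    if (n > 0) {
      out.append(in, pos, static_cast<std::size_t>(n));
      pos += static_cast<std::size_t>(n);
    } else {
      out += kReplacement;
      pos += static_cast<std::size_t>(-n);
    }
  }
  return out;
}

} // namespace sketch
```

Under these assumptions, the five bytes 0xFB 0xB7 0x8E 0xB6 0xBE produce a return value of -5, so the loop emits a single replacement character and advances past all five bytes. A length function that returned -1 for the same input would make the loop emit five replacements, which is the old behavior described in the summary.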

kagamiori · 2023-11-08T02:32:18Z

Looks like this is resulting in a regression, as conbench is failing. It's probably due to the extra checks/branches added here, but it might be worth looking into to be sure it's expected.

This is interesting. I don't think the checks I added should be significantly slower than the original code. In fact, I just checked the failed conbench benchmarks, and all the regressions were in the eqToConstant benchmark. This benchmark evaluates an expression eq(a, constant) that should not involve from_utf8.

velox/velox/benchmarks/basic/ComparisonConjunct.cpp

Lines 145 to 147 in fb34b0f

    
           BENCHMARK(eqToConstant) { 
        
             benchmark->run("eq(a, constant)"); 
        
           }

bikramSingh91 · 2023-11-09T00:27:32Z

Thanks for checking. It's probably noise that we have seen previously with microbenchmarks. Also, it seems to pass with your latest updates, so it definitely was noise.

facebook-github-bot · 2023-11-13T18:12:43Z

This pull request has been merged in b45ebb9.

conbench-facebook · 2023-11-13T18:33:22Z

Conbench analyzed the 1 benchmark run on commit b45ebb9e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

facebook-github-bot added the CLA Signed label Nov 7, 2023

facebook-github-bot added the fb-exported label Nov 7, 2023

bikramSingh91 reviewed Nov 7, 2023

kagamiori force-pushed the export-D51050645 branch from 41cdf4e to 8e9d623 Compare November 8, 2023 02:34

kagamiori force-pushed the export-D51050645 branch from 8e9d623 to 3684469 Compare November 8, 2023 02:34

kagamiori requested a review from bikramSingh91 November 8, 2023 02:35

bikramSingh91 approved these changes Nov 9, 2023

kagamiori force-pushed the export-D51050645 branch from 3684469 to 37b1e47 Compare November 9, 2023 19:33

kagamiori force-pushed the export-D51050645 branch from 37b1e47 to fe32158 Compare November 9, 2023 20:56

kagamiori force-pushed the export-D51050645 branch from fe32158 to ab87621 Compare November 9, 2023 22:14

kagamiori force-pushed the export-D51050645 branch from ab87621 to 64eef8b Compare November 9, 2023 22:19

kagamiori force-pushed the export-D51050645 branch from 64eef8b to 460a7b9 Compare November 9, 2023 22:19

kagamiori force-pushed the export-D51050645 branch from 460a7b9 to 87a6ead Compare November 9, 2023 22:27

kagamiori force-pushed the export-D51050645 branch from 87a6ead to 0865cef Compare November 9, 2023 22:28

kagamiori force-pushed the export-D51050645 branch from 0865cef to bbfbf0b Compare November 10, 2023 00:11

kagamiori force-pushed the export-D51050645 branch from bbfbf0b to b767c7d Compare November 10, 2023 00:12

kagamiori force-pushed the export-D51050645 branch from b767c7d to 1379fa5 Compare November 13, 2023 03:31

kagamiori force-pushed the export-D51050645 branch from 1379fa5 to 90f7c48 Compare November 13, 2023 03:37

facebook-github-bot closed this in b45ebb9 Nov 13, 2023

facebook-github-bot added the Merged label Nov 13, 2023
