Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix from_utf8 handling of invalid UTF-8 codepoint #7442

Closed
wants to merge 1 commit into from

Conversation

kagamiori
Copy link
Contributor

Summary:
For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Differential Revision: D51050645

Copy link

netlify bot commented Nov 7, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 90f7c48
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/655199de1a64e20008aa1e1c

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 7, 2023
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

Copy link
Contributor

@bikramSingh91 bikramSingh91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this is resulting in a regression as conbench is failing, its probably due to the extra checks/branches added here but might be worth looking into to be sure its expected

int firstByteCharLength(const char* u_input) {
auto u = (const unsigned char*)u_input;
unsigned char u0 = u[0];
if (u0 < 0x80) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you use decimals like in utf8proc_char_length instead of hexadecimal notation since its much easier to read

Copy link
Contributor Author

@kagamiori kagamiori Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually thought decimals are harder to read, so I used hex representations. What about I use binary representations instead, e.g., if (u0 < 0b10000000)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great and actually much better than decimal, thanks!

return -4;
}

auto filefthByte = input[4];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: did you mean fifthByte ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thank.

@@ -18,12 +18,53 @@
#include "velox/external/utf8proc/utf8procImpl.h"

namespace facebook::velox::functions {
namespace {

// Returns the length of a UTF-8 character indicated by the first byte.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add to comment that it returns -1 for invalid length

}

auto filefthByte = input[4];
if (!utf_cont(filefthByte)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq, what does utf_cont return?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF-8 continuation bytes starts with 0b10 (i.e., 0b10xxxxxx). utf_cont simply check whether the highest bits are 10.

}

if (charLength == 5) {
// Per RFC3629, UTF-8 is limited to 4 bytes, so more bytes are illegal.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if anything above 4 is illegal do we need any branches after 4? can we just return -1 after size or charLength > 4 ?
if not, do we need a utf_cont for those extra 5th and 6th bytes?

Copy link
Contributor Author

@kagamiori kagamiori Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Byte sequences of length 5 and 6 were legal in the past, but became illegal since RFC3629. However, the data we read can still have byte sequences of length 5 and 6 (e.g., outdated data) and we need to consider the sequence as one codepoint. This happens in the Presto batch query and Presto-java consider a sequence of 5 bytes as one codepoint .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, can you clarify which a short comment as to what the output of this function signifies? I wasent entirely sure why we needed multiple negative values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bikramSingh91, There is actually a comment in the header file that mentions the meaning of negative values. A negative return value means the byte sequence is invalid UTF-8 and the absolute value is the length of the invalid sequence. Do you think this is enough to explain the output of this function?

/// This function is not part of the original utf8proc.
/// Tries to get the length of UTF-8 encoded code point. A
/// positive return value means the UTF-8 sequence is valid, and
/// the result is the length of the code point. A negative return value means
/// the UTF-8 sequence at the position is invalid, and the length of the invalid
/// sequence is the absolute value of the result.
///
/// @param input Pointer to the first byte of the code point. Must not be null.
/// @param size Number of available bytes. Must be greater than zero.
/// @return the length of the code point or negative the number of bytes in the
/// invalid UTF-8 sequence.
///
/// Adapted from tryGetCodePointAt in
/// https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/SliceUtf8.java
int32_t tryGetCharLength(const char* input, int64_t size);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, thanks for letting me know, i had not checked the header file earlier.

@kagamiori
Copy link
Contributor Author

looks like this is resulting in a regression as conbench is failing, its probably due to the extra checks/branches added here but might be worth looking into to be sure its expected

This is interesting. The checks I added should not be significantly slower than the original code I think. In fact, I just checked the failed conbench benchmarks, and I saw all the regressions were with the eqToConstant benchmark. This benchmark evaluates and expression eq(a, constant) that should not involve from_utf8.

BENCHMARK(eqToConstant) {
benchmark->run("eq(a, constant)");
}

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 8, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Differential Revision: D51050645
kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 8, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

@bikramSingh91
Copy link
Contributor

looks like this is resulting in a regression as conbench is failing, its probably due to the extra checks/branches added here but might be worth looking into to be sure its expected

This is interesting. The checks I added should not be significantly slower than the original code I think. In fact, I just checked the failed conbench benchmarks, and I saw all the regressions were with the eqToConstant benchmark. This benchmark evaluates and expression eq(a, constant) that should not involve from_utf8.

BENCHMARK(eqToConstant) {
benchmark->run("eq(a, constant)");
}

Thanks for checking, its probably noise that we have seen previously with microbenchmarks. Also it seems to pass with your latest updates so it definitely was noise.

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 9, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 9, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 9, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 9, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 9, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 9, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 9, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 10, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 10, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

kagamiori added a commit to kagamiori/velox that referenced this pull request Nov 13, 2023
…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

…7442)

Summary:

For in invalid codepoint of multiple-byte long, e.g., 0xFB 0xB7 0x8E 0xB6 0xBE,
from_utf8() used to recognizie it as multiple invalid codepoint of one byte each.
On the other hand, Presto recognize it as one codepoint. This makes the result
of from_utf8() different between Velox and Presto because Presto replaces the
invalid sequence with one 0xFFFD while Velox replaces it with five. This diff
makes from_utf8() follow Presto behavior.

Reviewed By: bikramSingh91

Differential Revision: D51050645
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D51050645

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in b45ebb9.

Copy link

Conbench analyzed the 1 benchmark run on commit b45ebb9e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants