New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15699: [C++][Gandiva] Fix implementation of left and right func… #12440
Conversation
|
@@ -2207,6 +2213,12 @@ const char* right_utf8_int32(gdv_int64 context, const char* text, gdv_int32 text | |||
return ""; | |||
} | |||
|
|||
//case where right('abcdef', -6) -> "" and right('abcdef', -7) -> "" | |||
if(number < 0 && abs(number) >= text_len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not handling mutibyte unicode characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it for left('abc¥', -4) and it is returning correct result, Would it better if I put this condition in the for loop below
if (number < 0 && char_count <= abs(number)) {
*out_len = 0;
return "";
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text_len for strings with multibyte characters will be number of bytes rather than number of characters.. while number
refers the character count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I assumed text_len is no of characters.
5803604
to
61eac98
Compare
int32_t start_char_pos; // the char result start position (inclusive) | ||
int32_t end_char_len; // the char result end position (inclusive) | ||
int32_t end_pos; // the char result end position (inclusive) | ||
|
||
if (number > 0) { | ||
// case where right('abc', 5) ==> 'abc' start_char_pos=1. | ||
start_char_pos = (char_count > number) ? char_count - number : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this logic is not correct for multibyte unicode characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvm. looks like it is being handled in following lines
aa360b5
to
5ee9753
Compare
…tions to handle more cases Proper handling of multibyte characters Added conditions to handle below cases: case where left('abcdef', -6) -> "" and left('abcdef', -7) -> "" case where right('abcdef', -6) -> "" and right('abcdef', -7) -> ""
5ee9753
to
244a5d9
Compare
Benchmark runs are scheduled for baseline = 91f6585 and contender = 7e70c42. 7e70c42 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…tions to handle more cases
Added conditions to handle below cases:
case where left('abcdef', -6) -> "" and left('abcdef', -7) -> ""
case where right('abcdef', -6) -> "" and right('abcdef', -7) -> ""