New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12657: [C++] Adding String hex to numeric conversion #11064
Conversation
|
Having a more general hex string to number parser is good, and I would recommend for it to substitute the current simple parser. Replacing the hex parser will involve changing files not necessarily associated directly with this PR so you may want to consider to open a JIRA for the replacement. |
cpp/src/arrow/util/value_parsing.h
Outdated
@@ -273,6 +273,96 @@ inline bool ParseUnsigned(const char* s, size_t length, uint64_t* out) { | |||
#undef PARSE_UNSIGNED_ITERATION | |||
#undef PARSE_UNSIGNED_ITERATION_LAST | |||
|
|||
#define PARSE_HEX_ITERATION(C_TYPE) \ | |||
if (length > 0) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to make this a function. The use of macros is undesired in the general case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this recommendation. I was following the pattern of how ParseUnsigned
and PARSE_UNSIGNED_ITERATION
is done in the same file. Since I am new to the codebase I decided to follow the patterns I was seeing.
cpp/src/arrow/util/value_parsing.h
Outdated
|
||
|
||
inline bool ParseHex(const char* s, size_t length, uint8_t* out) { | ||
uint8_t result = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having multiple functions for each type, I suggest to use templates, and, given that each function is conceptually doing the same operation but different number of times based on the output data type width, use a for-loop instead.
template <typename T>
inline bool ParseHex(...) {
...
for (int i = 0; i < sizeof(T) * 2; ++i) {
ParseHexIteration<T>(result);
}
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you prefer, but I would be fine with keeping this a macro in this case. Note there is a control flow inside the macro so you cannot turn it into such a trivial loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I turned it into a loop, it works fine
cpp/src/arrow/util/value_parsing.h
Outdated
@@ -281,6 +371,19 @@ struct StringToUnsignedIntConverterMixin { | |||
if (ARROW_PREDICT_FALSE(length == 0)) { | |||
return false; | |||
} | |||
// If its starts with 0x then its hex | |||
if (*s == '0' && *(s + 1) == 'x'){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add support for case-insensitive 0x
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
return false; | ||
} | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify these checks and boolean result as
return ARROW_PREDICT_TRUE((sizeof(value_type) * 2 >= length) && ParseHex(s, length, out));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// If its starts with 0x then its hex | ||
if (*s == '0' && *(s + 1) == 'x'){ | ||
length -= 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a negative sign, then it is not a well-formed hex string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed this
cpp/src/arrow/util/value_parsing.h
Outdated
if (*s == '0' && *(s + 1) == 'x'){ | ||
length -= 2; | ||
s += 2; | ||
// lets make sure that the length of the string is not too big |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: You probably not need to decrease length
and consider the offset of s
as the difference to consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted not to do this, since length is used both at StringToSignedIntConverterMixin and at ParseHex. Getting rid of length would then involve keeping track of the original pointer of s
and the new pointer and comparing that to the original length. It seems more convoluted
…gned integer vectors When adding a byte `0xff` to a UInt1Vector, the toString method produces `[-1]`. Since the vector contains unsinged integers, the correct result should be `[255]`. Closes apache#11029 from liyafan82/fly_0830_uin Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
…hanges to Vectors) See https://issues.apache.org/jira/browse/ARROW-13544 According to the discussion in apache#10864 (comment), we want to split the task into multiple parts. This PR is for the changes related to vectors Closes apache#10910 from liyafan82/fly_0810_depv Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Exclude .factorypath files generated by Eclipse IDE for configuring annotation processing from Git commit and RAT plugin scans. Closes apache#11042 from laurentgo/laurentgo/exclude-factorypath Authored-by: Laurent Goujon <laurent@apache.org> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
…hanges to JDBC) See https://issues.apache.org/jira/browse/ARROW-13544 According to the discussion in apache#10864 (comment), we want to split the task into multiple parts. This PR is for the changes related to the JDBC adapter Closes apache#10912 from liyafan82/fly_0811_depj Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Essentially, this failure boils down to: when generating the array of uniques for booleans, we pack 8 bytes at a time into one byte. The bytes are packed from what turns out to be a scratch array allocated by TempVectorStack, which does not initialize its memory. So when we have a non-multiple-of-8 number of bytes, we may end up packing initialized bytes and uninitialized bytes together into a single garbage byte, resulting in Valgrind complaining. Closes apache#11041 from lidavidm/arrow-13812 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Closes apache#11045 from cyb70289/13067-int2dec Authored-by: Yibo Cai <yibo.cai@arm.com> Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Should fix the following issues found by OSS-Fuzz: * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37927 * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37915 * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37888 Also add the IPC integration reference files to the fuzzing corpus, this may help find more issues. Closes apache#11059 from pitrou/ARROW-13846-ipc-fuzz-crashes Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Add validation to detect invalid DELTA_BINARY_PACKED data. This should fix the following issues: * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37431 * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37432 * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37421 Closes apache#11060 from pitrou/ARROW-13850-parquet-fuzz Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Closes apache#10730 from romainfrancois/ARROW-13164_altrep_with_nulls Lead-authored-by: Romain Francois <romain@rstudio.com> Co-authored-by: Romain François <romain@rstudio.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com> Closes apache#11056 from zhjwpku/docs/missing_param_for_setcolumn Authored-by: Junwang Zhao <zhjwpku@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Closes apache#11055 from kou/ruby-table-save-dataset Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
I templated from ARROW-11735. Let's see how all the tests go! Closes apache#11046 from karldw/arrow-12981 Authored-by: karldw <karldw@users.noreply.github.com> Signed-off-by: Ian Cook <ianmcook@gmail.com>
Closes apache#11061 from lidavidm/arrow-13782 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…ension types Closes apache#11071 from pitrou/ARROW-13855-export-extension Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: David Li <li.davidm96@gmail.com>
- [x] collect() uses ExecPlan - [x] arrange() uses an OrderBySink - [x] .data inside of arrow_dplyr_query can itself be arrow_dplyr_query - [x] can build more query after calling summarize() - [x] handle non-deterministic dataset collect() tests - [x] fix group_by-expression behavior - [x] make official collapse() method with more testing of faithful behavior after collapsing - [x] make sort after summarize be configurable by option (default FALSE, though local_options TRUE in the tests) - [x] add print method for collapsed query - [x] Skip 32-bit rtools35 dataset tests/examples ~~- [ ] should queries on in-memory data evaluate eagerly (like dplyr)?~~ Followups: * ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions * ARROW-13778: [R] Handle complex summarize expressions * ARROW-13779: [R] Disallow expressions that depend on order after arrange() * ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan * ARROW-13854: [R] More accurately determine output type of an aggregation expression * ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries Closes apache#10992 from nealrichardson/subquery Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Co-authored-by: Jonathan Keane <jkeane@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Closes apache#11074 from thisisnic/ARROW-13874_trimpotions Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
…functions Closes apache#11078 from nealrichardson/summarize-0 Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Closes apache#11083 from kou/ruby-slicer-expression Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Closes apache#11092 from thisisnic/ARROW-13904_modeoptions Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
Closes apache#11093 from thisisnic/ARROW-13905_replacesliceoptions Lead-authored-by: Nic Crane <thisisnic@gmail.com> Co-authored-by: Nic <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
Closes apache#11094 from thisisnic/ARROW-13905_partitionnthoptions Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
…s kernels Closes apache#11102 from thisisnic/ARROW-13869_misc_compute Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
Closes apache#11098 from thisisnic/ARROW-13908_extract_regex_options Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
… wmalpica/ARROW-12657
Something strange happened when I rebased. I will close this PR and create a new one. |
This PR adds the ability for the StringConverter to parse hex strings of the form: 0x123abc to an integer. The strings must start with "0x" case insensitive. The values ABCDEF are case insensitive. Note that there was already a Hex Parsing function here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76 Which was not really flexible enough for what was needed. Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions This work was originally done here #11064 but had to create a new PR due to a rebase issue Closes #11161 from wmalpica/ARROW-12657_3 Lead-authored-by: William Malpica <16705032+wmalpica@users.noreply.github.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
This PR adds the ability for the StringConverter to parse hex strings of the form: 0x123abc to an integer. The strings must start with "0x" case insensitive. The values ABCDEF are case insensitive. Note that there was already a Hex Parsing function here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76 Which was not really flexible enough for what was needed. Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions This work was originally done here apache#11064 but had to create a new PR due to a rebase issue Closes apache#11161 from wmalpica/ARROW-12657_3 Lead-authored-by: William Malpica <16705032+wmalpica@users.noreply.github.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
This PR adds the ability for the StringConverter to parse hex strings of the form:
0x123abc to an integer.
The strings must start with "0x" case sensitive.
The values ABCDEF are case insensitive.
Note that there was already a Hex Parsing function here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76
Which was not really flexible enough for what was needed.
Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions