Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12657: [C++] Adding String hex to numeric conversion #11064

Closed
wants to merge 94 commits into from

Conversation

wmalpica
Copy link
Contributor

@wmalpica wmalpica commented Sep 2, 2021

This PR adds the ability for the StringConverter to parse hex strings of the form:
0x123abc to an integer.
The strings must start with "0x" case sensitive.
The values ABCDEF are case insensitive.

Note that there was already a Hex Parsing function here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76
Which was not really flexible enough for what was needed.
Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions

@github-actions
Copy link

github-actions bot commented Sep 2, 2021

@github-actions
Copy link

github-actions bot commented Sep 2, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@edponce
Copy link
Contributor

edponce commented Sep 2, 2021

Having a more general hex string to number parser is good, and I would recommend for it to substitute the current simple parser. Replacing the hex parser will involve changing files not necessarily associated directly with this PR so you may want to consider to open a JIRA for the replacement.

@@ -273,6 +273,96 @@ inline bool ParseUnsigned(const char* s, size_t length, uint64_t* out) {
#undef PARSE_UNSIGNED_ITERATION
#undef PARSE_UNSIGNED_ITERATION_LAST

#define PARSE_HEX_ITERATION(C_TYPE) \
if (length > 0) { \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to make this a function. The use of macros is undesired in the general case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this recommendation. I was following the pattern of how ParseUnsigned and PARSE_UNSIGNED_ITERATION is done in the same file. Since I am new to the codebase I decided to follow the patterns I was seeing.



inline bool ParseHex(const char* s, size_t length, uint8_t* out) {
uint8_t result = 0;
Copy link
Contributor

@edponce edponce Sep 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having multiple functions for each type, I suggest to use templates, and, given that each function is conceptually doing the same operation but different number of times based on the output data type width, use a for-loop instead.

template <typename T>
inline bool ParseHex(...) {
   ...
   for (int i = 0; i < sizeof(T) * 2; ++i) {
      ParseHexIteration<T>(result);
   }
   ...
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you prefer, but I would be fine with keeping this a macro in this case. Note there is a control flow inside the macro so you cannot turn it into such a trivial loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I turned it into a loop, it works fine

@@ -281,6 +371,19 @@ struct StringToUnsignedIntConverterMixin {
if (ARROW_PREDICT_FALSE(length == 0)) {
return false;
}
// If its starts with 0x then its hex
if (*s == '0' && *(s + 1) == 'x'){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add support for case-insensitive 0x.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return false;
}
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify these checks and boolean result as

return ARROW_PREDICT_TRUE((sizeof(value_type) * 2 >= length) && ParseHex(s, length, out));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// If its starts with 0x then its hex
if (*s == '0' && *(s + 1) == 'x'){
length -= 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a negative sign, then it is not a well-formed hex string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed this

if (*s == '0' && *(s + 1) == 'x'){
length -= 2;
s += 2;
// lets make sure that the length of the string is not too big
Copy link
Contributor

@edponce edponce Sep 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: You probably not need to decrease length and consider the offset of s as the difference to consider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted not to do this, since length is used both at StringToSignedIntConverterMixin and at ParseHex. Getting rid of length would then involve keeping track of the original pointer of s and the new pointer and comparing that to the original length. It seems more convoluted

liyafan82 and others added 19 commits September 1, 2021 21:24
…gned integer vectors

When adding a byte `0xff` to a UInt1Vector, the toString method produces `[-1]`. Since the vector contains unsinged integers, the correct result should be `[255]`.

Closes apache#11029 from liyafan82/fly_0830_uin

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
…hanges to Vectors)

See https://issues.apache.org/jira/browse/ARROW-13544

According to the discussion in apache#10864 (comment), we want to split the task into multiple parts.

This PR is for the changes related to vectors

Closes apache#10910 from liyafan82/fly_0810_depv

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Exclude .factorypath files generated by Eclipse IDE for configuring
annotation processing from Git commit and RAT plugin scans.

Closes apache#11042 from laurentgo/laurentgo/exclude-factorypath

Authored-by: Laurent Goujon <laurent@apache.org>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
…hanges to JDBC)

See https://issues.apache.org/jira/browse/ARROW-13544

According to the discussion in apache#10864 (comment), we want to split the task into multiple parts.

This PR is for the changes related to the JDBC adapter

Closes apache#10912 from liyafan82/fly_0811_depj

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Essentially, this failure boils down to: when generating the array of uniques for booleans, we pack 8 bytes at a time into one byte. The bytes are packed from what turns out to be a scratch array allocated by TempVectorStack, which does not initialize its memory. So when we have a non-multiple-of-8 number of bytes, we may end up packing initialized bytes and uninitialized bytes together into a single garbage byte, resulting in Valgrind complaining.

Closes apache#11041 from lidavidm/arrow-13812

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Closes apache#11045 from cyb70289/13067-int2dec

Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Should fix the following issues found by OSS-Fuzz:
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37927
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37915
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37888

Also add the IPC integration reference files to the fuzzing corpus, this may help find more issues.

Closes apache#11059 from pitrou/ARROW-13846-ipc-fuzz-crashes

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Add validation to detect invalid DELTA_BINARY_PACKED data.

This should fix the following issues:
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37431
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37432
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37421

Closes apache#11060 from pitrou/ARROW-13850-parquet-fuzz

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Closes apache#10730 from romainfrancois/ARROW-13164_altrep_with_nulls

Lead-authored-by: Romain Francois <romain@rstudio.com>
Co-authored-by: Romain François <romain@rstudio.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>

Closes apache#11056 from zhjwpku/docs/missing_param_for_setcolumn

Authored-by: Junwang Zhao <zhjwpku@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Closes apache#11055 from kou/ruby-table-save-dataset

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
I templated from ARROW-11735. Let's see how all the tests go!

Closes apache#11046 from karldw/arrow-12981

Authored-by: karldw <karldw@users.noreply.github.com>
Signed-off-by: Ian Cook <ianmcook@gmail.com>
Closes apache#11061 from lidavidm/arrow-13782

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
…ension types

Closes apache#11071 from pitrou/ARROW-13855-export-extension

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: David Li <li.davidm96@gmail.com>
- [x] collect() uses ExecPlan
- [x] arrange() uses an OrderBySink
- [x] .data inside of arrow_dplyr_query can itself be arrow_dplyr_query
- [x] can build more query after calling summarize()
- [x] handle non-deterministic dataset collect() tests
- [x] fix group_by-expression behavior
- [x] make official collapse() method with more testing of faithful behavior after collapsing
- [x] make sort after summarize be configurable by option (default FALSE, though local_options TRUE in the tests)
- [x] add print method for collapsed query
- [x] Skip 32-bit rtools35 dataset tests/examples
~~- [ ] should queries on in-memory data evaluate eagerly (like dplyr)?~~

Followups:

* ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions
* ARROW-13778: [R] Handle complex summarize expressions
* ARROW-13779: [R] Disallow expressions that depend on order after arrange()
* ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan
* ARROW-13854: [R] More accurately determine output type of an aggregation expression
* ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries

Closes apache#10992 from nealrichardson/subquery

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Closes apache#11074 from thisisnic/ARROW-13874_trimpotions

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
…functions

Closes apache#11078 from nealrichardson/summarize-0

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Closes apache#11083 from kou/ruby-slicer-expression

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
thisisnic and others added 14 commits September 13, 2021 22:39
Closes apache#11092 from thisisnic/ARROW-13904_modeoptions

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Closes apache#11093 from thisisnic/ARROW-13905_replacesliceoptions

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Nic <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Closes apache#11094 from thisisnic/ARROW-13905_partitionnthoptions

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
…s kernels

Closes apache#11102 from thisisnic/ARROW-13869_misc_compute

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Closes apache#11098 from thisisnic/ARROW-13908_extract_regex_options

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@wmalpica
Copy link
Contributor Author

Something strange happened when I rebased. I will close this PR and create a new one.

@wmalpica wmalpica closed this Sep 14, 2021
wmalpica added a commit to wmalpica/arrow that referenced this pull request Sep 15, 2021
pitrou pushed a commit to wmalpica/arrow that referenced this pull request Sep 16, 2021
pitrou added a commit that referenced this pull request Sep 16, 2021
This PR adds the ability for the StringConverter to parse hex strings of the form:
0x123abc to an integer.
The strings must start with "0x" case insensitive.
The values ABCDEF are case insensitive.

Note that there was already a Hex Parsing function here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76
Which was not really flexible enough for what was needed.
Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions

This work was originally done here #11064 but had to create a new PR due to a rebase issue

Closes #11161 from wmalpica/ARROW-12657_3

Lead-authored-by: William Malpica <16705032+wmalpica@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This PR adds the ability for the StringConverter to parse hex strings of the form:
0x123abc to an integer.
The strings must start with "0x" case insensitive.
The values ABCDEF are case insensitive.

Note that there was already a Hex Parsing function here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76
Which was not really flexible enough for what was needed.
Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions

This work was originally done here apache#11064 but had to create a new PR due to a rebase issue

Closes apache#11161 from wmalpica/ARROW-12657_3

Lead-authored-by: William Malpica <16705032+wmalpica@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet