ARROW-12657: [C++] Adding String hex to numeric conversion #11064

Closed

wmalpica wants to merge 94 commits into apache:master from wmalpica:wmalpica/ARROW-12657

Contributor

wmalpica commented Sep 2, 2021 (edited)

This PR adds the ability for the StringConverter to parse hex strings of the form:
0x123abc to an integer.
The strings must start with "0x" case sensitive.
The values ABCDEF are case insensitive.

Note that there was already a Hex Parsing function here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76
Which was not really flexible enough for what was needed.
Not sure if it would be best to replace it with the new functionality added by this PR. I am open to suggestions.

wmalpica added 2 commits

September 1, 2021 08:48


          working implementation but lacks case-insensitivity and more unit tests

59dcbde


          different algorithm. Added more tests and benchmarks

4cb862b

github-actions bot commented Sep 2, 2021

https://issues.apache.org/jira/browse/ARROW-12657

github-actions bot commented Sep 2, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions bot added the Component: C++ label


          uncommented tests

d03b7eb

Contributor

edponce commented Sep 2, 2021

Having a more general hex-string-to-number parser is good, and I would recommend that it replace the current simple parser. Replacing the hex parser will involve changing files not directly associated with this PR, so you may want to open a separate JIRA for the replacement.

edponce suggested changes

cpp/src/arrow/util/value_parsing.h Outdated

               #undef PARSE_UNSIGNED_ITERATION
               #undef PARSE_UNSIGNED_ITERATION_LAST
+              #define PARSE_HEX_ITERATION(C_TYPE)                                     \
+                if (length > 0) {                                                     \

Contributor

edponce Sep 2, 2021

I suggest making this a function. Macros are generally undesirable.

Contributor Author

wmalpica Sep 2, 2021

I agree with this recommendation. I was following the pattern of how ParseUnsigned and PARSE_UNSIGNED_ITERATION are done in the same file. Since I am new to the codebase, I decided to follow the patterns I was seeing.

edponce suggested changes

cpp/src/arrow/util/value_parsing.h Outdated



		inline bool ParseHex(const char* s, size_t length, uint8_t* out) {
		uint8_t result = 0;

Contributor

edponce Sep 2, 2021 (edited)

Instead of having a separate function for each type, I suggest using a template. Each function conceptually performs the same operation, just a different number of times depending on the width of the output data type, so a for-loop can replace the repeated code.

template <typename T>
inline bool ParseHex(...) {
   ...
   for (int i = 0; i < sizeof(T) * 2; ++i) {
      ParseHexIteration<T>(result);
   }
   ...
}

Member

pitrou Sep 6, 2021

As you prefer, but I would be fine with keeping this a macro in this case. Note there is a control flow inside the macro so you cannot turn it into such a trivial loop.

Contributor Author

wmalpica Sep 8, 2021

I turned it into a loop, it works fine
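
For readers following the thread, here is a minimal sketch of what a single templated, loop-based parser could look like. It is not the code from this PR: `ParseHexSketch` and `HexDigitValue` are hypothetical names, and the sketch assumes the "0x" prefix has already been stripped and that only unsigned output types are needed. The early `return false` inside the loop is the kind of control flow pitrou mentions above.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not taken from the PR): decode one hex digit.
// Returns false if `c` is not in [0-9a-fA-F].
inline bool HexDigitValue(char c, uint8_t* out) {
  if (c >= '0' && c <= '9') { *out = static_cast<uint8_t>(c - '0'); return true; }
  if (c >= 'a' && c <= 'f') { *out = static_cast<uint8_t>(c - 'a' + 10); return true; }
  if (c >= 'A' && c <= 'F') { *out = static_cast<uint8_t>(c - 'A' + 10); return true; }
  return false;
}

// Hypothetical sketch of one templated parser replacing per-type overloads.
// `s` points just past the "0x" prefix; `length` counts the remaining digits.
template <typename T>
inline bool ParseHexSketch(const char* s, size_t length, T* out) {
  static_assert(std::is_unsigned<T>::value, "sketch handles unsigned types only");
  // Reject an empty digit string and one with more digits than T can hold.
  if (length == 0 || length > sizeof(T) * 2) return false;
  T result = 0;
  for (size_t i = 0; i < length; ++i) {
    uint8_t digit;
    // Bail out on the first character that is not a hex digit.
    if (!HexDigitValue(s[i], &digit)) return false;
    // Shift the accumulated value by one hex digit (4 bits) and append.
    result = static_cast<T>((result << 4) | digit);
  }
  *out = result;
  return true;
}
```

With this shape, `ParseHexSketch("123abc", 6, &value)` on a `uint32_t` yields `0x123abc`, while a three-digit input is rejected for a `uint8_t` output.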

edponce suggested changes

cpp/src/arrow/util/value_parsing.h Outdated

@@ -281,6 +371,19 @@ struct StringToUnsignedIntConverterMixin {
                   if (ARROW_PREDICT_FALSE(length == 0)) {
                     return false;
                   }
+                  // If its starts with 0x then its hex
+                  if (*s == '0' && *(s + 1) == 'x'){

Contributor

edponce Sep 2, 2021

Add support for case-insensitive 0x.

Member

pitrou Sep 6, 2021

+1

Contributor Author

wmalpica Sep 8, 2021

Done
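
A case-insensitive prefix test could be sketched as below. `HasHexPrefix` is a hypothetical helper, not the function that ended up in the PR. It also checks that at least two characters are available before reading the second one, which is an extra precaution assumed here rather than something discussed in the review.

```cpp
#include <cstddef>

// Hypothetical helper (not Arrow API): true if the buffer begins with
// "0x" or "0X". The length is checked first so s[1] is never read from a
// one-character input.
inline bool HasHexPrefix(const char* s, size_t length) {
  return length >= 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'X');
}
```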

edponce suggested changes

cpp/src/arrow/util/value_parsing.h

+                      return false;
+                    }
+                    return true;
+                  }

Contributor

edponce Sep 2, 2021

You can simplify these checks and boolean result as

return ARROW_PREDICT_TRUE((sizeof(value_type) * 2 >= length) && ParseHex(s, length, out));

Contributor Author

wmalpica Sep 8, 2021

Done
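
The `sizeof(value_type) * 2 >= length` condition works because each hex digit encodes 4 bits, so a type of N bytes can hold at most 2N digits. A small sketch of that arithmetic follows; `MaxHexDigits` and `FitsInType` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the most hex digits a value of type T can hold.
// One byte is two hex digits (4 bits each).
template <typename T>
constexpr size_t MaxHexDigits() {
  return sizeof(T) * 2;
}

// Hypothetical helper: mirrors the `sizeof(value_type) * 2 >= length` test.
template <typename T>
constexpr bool FitsInType(size_t num_digits) {
  return num_digits <= MaxHexDigits<T>();
}

static_assert(MaxHexDigits<uint8_t>() == 2, "one byte is two hex digits");
static_assert(MaxHexDigits<uint32_t>() == 8, "four bytes are eight hex digits");
static_assert(MaxHexDigits<uint64_t>() == 16, "eight bytes are sixteen hex digits");
```

Note that a purely length-based check like this one also rejects inputs with extra leading zeros (for example, three digits for an 8-bit type) even when the numeric value would fit.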

cpp/src/arrow/util/value_parsing.h

+                  // If its starts with 0x then its hex
+                  if (*s == '0' && *(s + 1) == 'x'){
+                    length -= 2;

Contributor

edponce Sep 2, 2021

If there is a negative sign, then it is not a well-formed hex string.

Contributor Author

wmalpica Sep 8, 2021

Good catch. Fixed this
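
One way to picture the rule for signed inputs is a classifier that decides how a string should be parsed before any digits are read. This is only a sketch under assumptions: `LiteralKind` and `ClassifySignedLiteral` are hypothetical names, the function does not validate the digits themselves, and rejecting a bare "0x" reflects the later "checked for empty hex values" commit rather than anything stated in this thread.

```cpp
#include <cstddef>

// Hypothetical classification of an integer string (not Arrow API).
enum class LiteralKind { kDecimal, kHex, kInvalid };

// Decides how a signed-integer string would be parsed. Digits are not
// validated here; only the sign and the prefix are inspected.
inline LiteralKind ClassifySignedLiteral(const char* s, size_t length) {
  if (length == 0) return LiteralKind::kInvalid;
  const bool negative = (s[0] == '-');
  const char* body = negative ? s + 1 : s;
  const size_t body_length = negative ? length - 1 : length;
  const bool hex = body_length >= 2 && body[0] == '0' &&
                   (body[1] == 'x' || body[1] == 'X');
  if (hex) {
    // A sign combined with a hex prefix ("-0x1A") is not well formed,
    // and a prefix with no digits after it ("0x") is rejected as well.
    if (negative || body_length == 2) return LiteralKind::kInvalid;
    return LiteralKind::kHex;
  }
  // A lone "-" has no digits and is invalid; everything else is decimal.
  return body_length > 0 ? LiteralKind::kDecimal : LiteralKind::kInvalid;
}
```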

cpp/src/arrow/util/value_parsing.h Outdated

+                  if (*s == '0' && *(s + 1) == 'x'){
+                    length -= 2;
+                    s += 2;
+                    // lets make sure that the length of the string is not too big

Contributor

edponce Sep 2, 2021 (edited)

Nit: You probably do not need to decrease length; you could instead use the offset of s from its original position as the difference.

Contributor Author

wmalpica Sep 8, 2021

I opted not to do this, since length is used both at StringToSignedIntConverterMixin and at ParseHex. Getting rid of length would then involve keeping track of the original pointer of s and the new pointer and comparing that to the original length. It seems more convoluted

liyafan82 and others added 19 commits

September 1, 2021 21:24


          ARROW-13792 [Java]: The toString representation is incorrect for unsi…

69972dd

…gned integer vectors

When adding a byte `0xff` to a UInt1Vector, the toString method produces `[-1]`. Since the vector contains unsigned integers, the correct result should be `[255]`.

Closes apache#11029 from liyafan82/fly_0830_uin

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
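
The `[-1]` versus `[255]` difference in the commit above comes down to whether the same 8 bits are read as a signed or an unsigned value. The sketch below illustrates that in C++ rather than Java; `ByteToString` is a hypothetical function and has nothing to do with the actual Java fix.

```cpp
#include <cstdint>
#include <string>

// Hypothetical illustration: renders one raw byte as text, interpreting
// it either as unsigned (0..255) or as two's-complement signed (-128..127).
inline std::string ByteToString(uint8_t raw, bool is_unsigned) {
  if (is_unsigned) {
    return std::to_string(static_cast<unsigned>(raw));
  }
  // Reinterpreting 0xff as int8_t gives -1 on two's-complement platforms
  // (guaranteed since C++20, implementation-defined before that).
  return std::to_string(static_cast<int>(static_cast<int8_t>(raw)));
}
```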


          ARROW-13544 [Java]: Remove APIs that have been deprecated for long (C…

b76caf4

…hanges to Vectors)

See https://issues.apache.org/jira/browse/ARROW-13544

According to the discussion in apache#10864 (comment), we want to split the task into multiple parts.

This PR is for the changes related to vectors

Closes apache#10910 from liyafan82/fly_0810_depv

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>


          ARROW-13823 [Java]: Exclude .factorypath

111f0c7

Exclude .factorypath files generated by Eclipse IDE for configuring
annotation processing from Git commit and RAT plugin scans.

Closes apache#11042 from laurentgo/laurentgo/exclude-factorypath

Authored-by: Laurent Goujon <laurent@apache.org>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>


          ARROW-13544 [Java]: Remove APIs that have been deprecated for long (C…

09497a9

…hanges to JDBC)

See https://issues.apache.org/jira/browse/ARROW-13544

According to the discussion in apache#10864 (comment), we want to split the task into multiple parts.

This PR is for the changes related to the JDBC adapter

Closes apache#10912 from liyafan82/fly_0811_depj

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>


          ARROW-13812: [C++] Fix Valgrind error in Grouper.BooleanKey test

e380c1a

Essentially, this failure boils down to: when generating the array of uniques for booleans, we pack 8 bytes at a time into one byte. The bytes are packed from what turns out to be a scratch array allocated by TempVectorStack, which does not initialize its memory. So when we have a non-multiple-of-8 number of bytes, we may end up packing initialized bytes and uninitialized bytes together into a single garbage byte, resulting in Valgrind complaining.

Closes apache#11041 from lidavidm/arrow-13812

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
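
To make the failure mode above concrete, here is a sketch of packing one-byte booleans into a bitmap so that only the first `n` input bytes are ever read and the unused trailing bits stay zero. `PackBools` is a hypothetical function, the least-significant-bit-first order is an assumption of the sketch, and this is not the fix applied in the commit.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: packs `n` one-byte booleans into a bitmap,
// least-significant bit first. The output is zero-initialized and only
// bytes[0..n) are read, so no uninitialized input byte can leak into the
// last, partially filled output byte.
inline std::vector<uint8_t> PackBools(const uint8_t* bytes, size_t n) {
  std::vector<uint8_t> bits((n + 7) / 8, 0);
  for (size_t i = 0; i < n; ++i) {
    if (bytes[i]) {
      bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bits;
}
```

A version that always consumed 8 input bytes per output byte would instead need the scratch buffer to be zeroed up to the next multiple of 8.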


          ARROW-13067: [C++][Compute] Implement integer to decimal cast

bbecb6a

Closes apache#11045 from cyb70289/13067-int2dec

Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>


          ARROW-13846: [C++] Fix crashes on invalid IPC file

495c734

Should fix the following issues found by OSS-Fuzz:
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37927
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37915
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37888

Also add the IPC integration reference files to the fuzzing corpus, this may help find more issues.

Closes apache#11059 from pitrou/ARROW-13846-ipc-fuzz-crashes

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>


          ARROW-13850: [C++] Fix crashes on invalid Parquet data

425b1cb

Add validation to detect invalid DELTA_BINARY_PACKED data.

This should fix the following issues:
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37431
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37432
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37421

Closes apache#11060 from pitrou/ARROW-13850-parquet-fuzz

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>


          ARROW-13164: [R] altrep vectors from Array with nulls

f0879a5

Closes apache#10730 from romainfrancois/ARROW-13164_altrep_with_nulls

Lead-authored-by: Romain Francois <romain@rstudio.com>
Co-authored-by: Romain François <romain@rstudio.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>


          ARROW-13459: [C++][Docs]Missing param docs for RecordBatch::SetColumn

8c70a5f

Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>

Closes apache#11056 from zhjwpku/docs/missing_param_for_setcolumn

Authored-by: Junwang Zhao <zhjwpku@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>


          ARROW-13831: [GLib][Ruby] Add support for writing by Arrow Dataset

a1d207e

Closes apache#11055 from kou/ruby-table-save-dataset

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>


          ARROW-13768: [R] Allow JSON to be an optional component

1440d5a

I templated from ARROW-11735. Let's see how all the tests go!

Closes apache#11046 from karldw/arrow-12981

Authored-by: karldw <karldw@users.noreply.github.com>
Signed-off-by: Ian Cook <ianmcook@gmail.com>


          ARROW-13782: [C++] Add skip_nulls/min_count to tdigest/mode/quantile

a45fc3f

Closes apache#11061 from lidavidm/arrow-13782

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>


          ARROW-13855: [C++][Python] Implement C data interface support for ext…

5ead375

…ension types

Closes apache#11071 from pitrou/ARROW-13855-export-extension

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: David Li <li.davidm96@gmail.com>


          ARROW-13740: [R] summarize() should not eagerly evaluate

e9251b0

- [x] collect() uses ExecPlan
- [x] arrange() uses an OrderBySink
- [x] .data inside of arrow_dplyr_query can itself be arrow_dplyr_query
- [x] can build more query after calling summarize()
- [x] handle non-deterministic dataset collect() tests
- [x] fix group_by-expression behavior
- [x] make official collapse() method with more testing of faithful behavior after collapsing
- [x] make sort after summarize be configurable by option (default FALSE, though local_options TRUE in the tests)
- [x] add print method for collapsed query
- [x] Skip 32-bit rtools35 dataset tests/examples
~~- [ ] should queries on in-memory data evaluate eagerly (like dplyr)?~~

Followups:

* ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions
* ARROW-13778: [R] Handle complex summarize expressions
* ARROW-13779: [R] Disallow expressions that depend on order after arrange()
* ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan
* ARROW-13854: [R] More accurately determine output type of an aggregation expression
* ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries

Closes apache#10992 from nealrichardson/subquery

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>


          ARROW-13874: [R] Implement TrimOptions

858ac57

Closes apache#11074 from thisisnic/ARROW-13874_trimpotions

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>


          ARROW-13543: [R] Handle summarize() with 0 arguments or no aggregate …

a49048b

…functions

Closes apache#11078 from nealrichardson/summarize-0

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>


          ARROW-13899: [Ruby] Implement slicer by compute kernels

f12c18e

Closes apache#11083 from kou/ruby-slicer-expression

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>


          MINOR: [Doc][Python] Fix a typo (apache#11085)

882e8b4

thisisnic and others added 14 commits

September 13, 2021 22:39


          ARROW-13904: [R] Implement ModeOptions

1cbc4a2

Closes apache#11092 from thisisnic/ARROW-13904_modeoptions

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>


          ARROW-13905: [R] Implement ReplaceSliceOptions

f3d3c68

Closes apache#11093 from thisisnic/ARROW-13905_replacesliceoptions

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Nic <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>


          ARROW-13906: [R] Implement PartitionNthOptions

0b6f531

Closes apache#11094 from thisisnic/ARROW-13905_partitionnthoptions

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>


          ARROW-13869: [R] Implement options for non-bound MatchSubstringOption…

672149b

…s kernels

Closes apache#11102 from thisisnic/ARROW-13869_misc_compute

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>


          ARROW-13908: [R] Implement ExtractRegexOptions

8875d5c

Closes apache#11098 from thisisnic/ARROW-13908_extract_regex_options

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>


          working implementation but lacks case-insensitivity and more unit tests

f1d6811


          different algorithm. Added more tests and benchmarks

925b2a7


          Implemented review feedback and added more unit tests

68ec4db


          checked for empty hex values. added scalar tests

a538072


          fixed style with clang-format

400d886


          implemented some improvements

9cac060


          fixed clang format


          fixed unit test

68a6844


          Merge branch 'wmalpica/ARROW-12657' of github.com:wmalpica/arrow into…

7aee4f0

… wmalpica/ARROW-12657

github-actions bot added Component: GLib Component: Go Component: Java Component: Python Component: R Component: Ruby Component: Parquet labels

Contributor Author

wmalpica commented Sep 14, 2021

Something strange happened when I rebased. I will close this PR and create a new one.

wmalpica closed this

wmalpica mentioned this pull request

ARROW-12657: [C++] Adding String hex to numeric conversion #11160

Closed

wmalpica added a commit to wmalpica/arrow that referenced this pull request


          Re-implementing ARROW-12657 which was originally done in PR apache#11064

2a606a1

wmalpica mentioned this pull request

ARROW-12657: [C++] Adding String hex to numeric conversion #11161

Closed

pitrou pushed a commit to wmalpica/arrow that referenced this pull request


          Re-implementing ARROW-12657 which was originally done in PR apache#11064

597cc9f

pitrou added a commit that referenced this pull request


          ARROW-12657: [C++] Adding String hex to numeric conversion

012248a

This PR adds the ability for the StringConverter to parse hex strings of the form:
0x123abc to an integer.
The strings must start with "0x" case insensitive.
The values ABCDEF are case insensitive.

Note that there was already a Hex Parsing function here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76
Which was not really flexible enough for what was needed.
Not sure if it would be best to replace it with the new functionality added by this PR. I am open to suggestions.

This work was originally done here #11064 but had to create a new PR due to a rebase issue

Closes #11161 from wmalpica/ARROW-12657_3

Lead-authored-by: William Malpica <16705032+wmalpica@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request


          ARROW-12657: [C++] Adding String hex to numeric conversion

94ece6a

This PR adds the ability for the StringConverter to parse hex strings of the form:
0x123abc to an integer.
The strings must start with "0x" case insensitive.
The values ABCDEF are case insensitive.

Note that there was already a Hex Parsing function here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76
Which was not really flexible enough for what was needed.
Not sure if it would be best to replace it with the new functionality added by this PR. I am open to suggestions.

This work was originally done here apache#11064 but had to create a new PR due to a rebase issue

Closes apache#11161 from wmalpica/ARROW-12657_3

Lead-authored-by: William Malpica <16705032+wmalpica@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

asfimport mentioned this pull request

[C++][Python][Compute] String hex to numeric conversion and bit shifting #18653

Closed


24 participants