[C++] "utf8_split_whitespace" kernel returned offset buffers are too small for large string in case of empty array #37437
Comments
`pyarrow.compute.utf8_split_whitespace` returned offset buffers are too small for empty lists
@agoose77 thanks for the report! I can confirm this looks buggy. In the execution path of a scalar kernel, we do have a check for an empty input (arrow/cpp/src/arrow/compute/exec.cc, lines 781 to 793 in f40bf77).
So I suppose something is going wrong there, because once you enter the actual kernel-specific exec code, it creates a LargeStringBuilder, and finishing such a builder produces the correct offsets, even for an empty result.
The bug is indeed in this … This function is exposed in pyarrow through the …
So creating an empty large string array by itself gives the correct offset size, but when it is nested in a list array, the list's values array (the large string array) has the wrong offset size.
Ah, I see. It makes sense that it was a helper function; the other split functions also appear to have this bug.
…tring values type (apache#37467)

### Rationale for this change

`MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.

### Are these changes tested?

Yes

* Closes: apache#37437

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
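The rationale in the commit message can be illustrated with a small toy model. This is only a sketch of the reasoning, not Arrow's C++ implementation: `ToyType` and both function names are hypothetical, and the model assumes that a type with offsets needs `length + 1` offsets of a fixed byte width.

```python
from dataclasses import dataclass, field


@dataclass
class ToyType:
    """Hypothetical stand-in for an Arrow data type (not an Arrow class)."""
    name: str
    offset_width: int = 0  # bytes per offset; 0 if the type has no offsets
    children: list = field(default_factory=list)


def min_buffer_size_ignoring_children(t, length):
    # Models the assumption described in the rationale:
    # only the list's own offsets are counted.
    return t.offset_width * (length + 1)


def min_buffer_size_with_children(t, length):
    # Also accounts for each child. In this model the child of an empty
    # list has length 0 but may still need room for one offset.
    own = t.offset_width * (length + 1)
    return max([own] + [min_buffer_size_with_children(c, 0) for c in t.children])


large_string = ToyType("large_string", offset_width=8)  # int64 offsets
list_of_large_string = ToyType("list", offset_width=4, children=[large_string])

print(min_buffer_size_ignoring_children(list_of_large_string, 0))  # 4
print(min_buffer_size_with_children(list_of_large_string, 0))      # 8
```

In this model, ignoring the child yields 4 bytes (one int32 list offset), while considering the `large_string` child yields 8 bytes (one int64 offset), which matches the sizes discussed in the issue.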
Describe the bug, including details regarding any error messages, version, and platform.
Consider a call to `utf8_split_whitespace` on an empty `large_string` array. The returned type is `list<item: large_string>`, but the returned offset buffer has `size=4`, not `size=8`. I expect to see `size=8`, pertaining to a length-one array of `int64` (a single zero).

For a non-empty array, the size is correct.
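To make the expected sizes concrete, here is a minimal standard-library sketch of the arithmetic, assuming the usual Arrow layout in which an array of `n` variable-length values carries `n + 1` offsets (int32 for `string`, int64 for `large_string`). The helper `offsets_buffer` is hypothetical and does not use pyarrow.

```python
import struct


def offsets_buffer(lengths, large=False):
    """Pack Arrow-style offsets for values with the given byte lengths."""
    fmt = "<q" if large else "<i"  # int64 for large_* types, int32 otherwise
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return b"".join(struct.pack(fmt, o) for o in offsets)


# An empty array still has one offset (a single zero).
print(len(offsets_buffer([], large=False)))  # 4
print(len(offsets_buffer([], large=True)))   # 8
```

Under that assumption, an empty `large_string` values array needs 8 bytes of offsets, which is the `size=8` expected above.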
We've encountered this in Awkward Array (scikit-hep/awkward#2679 (comment)).
Thanks!
Component(s)
Python