Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] "utf8_split_whitespace" kernel returned offset buffers are too small for large string in case of empty array #37437

Closed
agoose77 opened this issue Aug 29, 2023 · 3 comments · Fixed by #37467

Comments

@agoose77
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

Consider the following call to utf8_split_whitespace

import pyarrow as pa
import pyarrow.compute as pc

result = pc.utf8_split_whitespace(
    pa.array([], type=pa.large_string()),
)
print(result.type)
print(result.buffers())

The returned type is list<item: large_string>, but the returned offset buffer has size=4, not size=8. I expect to see size=8, pertaining to a length-one array of int64 (a single zero)

For a non-empty array, the size is correct.

We've encountered this in Awkward Array (scikit-hep/awkward#2679 (comment)).

Thanks!

Component(s)

Python

@jorisvandenbossche jorisvandenbossche changed the title pyarrow.compute.utf8_split_whitespace returned offset buffers are too small for empty lists [C++] "utf8_split_whitespace" kernel returned offset buffers are too small for large string in case of empty array Aug 30, 2023
@jorisvandenbossche
Copy link
Member

@agoose77 thanks for the report! I can confirm this looks buggy.

In the execution path of a scalar kernel, we do have a check for an empty input:

class ScalarExecutor : public KernelExecutorImpl<ScalarKernel> {
public:
Status Execute(const ExecBatch& batch, ExecListener* listener) override {
RETURN_NOT_OK(span_iterator_.Init(batch, exec_context()->exec_chunksize()));
if (batch.length == 0) {
// For zero-length batches, we do nothing except return a zero-length
// array of the correct output type
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Array> result,
MakeArrayOfNull(output_type_.GetSharedPtr(), /*length=*/0,
exec_context()->memory_pool()));
return EmitResult(result->data(), listener);
}

And so I suppose something is going wrong there (because once you enter the actual kernel-specific exec code, it creates a LargeStringBuilder, and finishing such a builder produces the correct offsets, also in the case of an empty result)

@jorisvandenbossche
Copy link
Member

The bug is indeed in this MakeArrayOfNull call used above, when using a length of 0 and having a nested large_string inside a list.

This function is exposed in pyarrow through the nulls function, which can reproduce the same issue:

In [1]: pa.nulls(0, pa.large_string())
Out[1]: 
<pyarrow.lib.LargeStringArray object at 0x7f957decd0c0>
[]

In [2]: pa.nulls(0, pa.large_string()).buffers()
Out[2]: 
[None,
 <pyarrow.Buffer address=0x7f9580a08040 size=8 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f9580a08040 size=8 is_cpu=True is_mutable=True>]  # <-- not fully sure this buffer is actually needed ..

In [3]: pa.nulls(0, pa.list_(pa.large_string())).values.buffers()
Out[3]: 
[None,
 <pyarrow.Buffer address=0x7f9580a08080 size=4 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f9580a08080 size=4 is_cpu=True is_mutable=True>]

So creating empty large string array itself gives the correct offset size, but when nested in a list array, then the list's values array (the large string array) has a wrong offset size.

@agoose77
Copy link
Author

Ah, I see. It makes sense that it was a helper function; the other split functions also appear to have this bug.

@jorisvandenbossche jorisvandenbossche self-assigned this Aug 30, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Aug 30, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Sep 29, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Oct 5, 2023
@pitrou pitrou closed this as completed in 1b262a2 Oct 5, 2023
@pitrou pitrou added this to the 14.0.0 milestone Oct 5, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…tring values type (apache#37467)

### Rationale for this change

`MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.

### Are these changes tested?

Yes

* Closes: apache#37437

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…tring values type (apache#37467)

### Rationale for this change

`MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.

### Are these changes tested?

Yes

* Closes: apache#37437

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…tring values type (apache#37467)

### Rationale for this change

`MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.

### Are these changes tested?

Yes

* Closes: apache#37437

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants