[C++] Comparison kernels crashing for string array with null string scalar #18369

Closed
asfimport opened this issue Nov 13, 2020 · 4 comments

@asfimport

Comparing a string array with a string scalar works:

In [1]: import pyarrow as pa
   ...: import pyarrow.compute as pc

In [2]: pc.equal(pa.array(["a", None, "b"]), pa.scalar("a", type="string"))
Out[2]: 
<pyarrow.lib.BooleanArray object at 0x7f38d56e23a8>
[
  true,
  null,
  false
]

But if the scalar is null (of the proper string type), it crashes:

In [4]: pc.equal(pa.array(["a", None, "b"]), pa.scalar(None, type="string"))
Segmentation fault (core dumped)

(and without any debug messages)
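
For reference, a minimal pure-Python sketch of the result one would presumably expect here instead of a crash, assuming the null scalar follows the same null propagation that the first example shows for the null array element. The function `equal_array_scalar` is a hypothetical model written for illustration; it is not part of pyarrow and does not call it.

```python
from typing import List, Optional


def equal_array_scalar(
    values: List[Optional[str]], scalar: Optional[str]
) -> List[Optional[bool]]:
    """Hypothetical model of element-wise `equal` with null propagation."""
    if scalar is None:
        # Assumption: a null scalar makes every comparison result null.
        return [None] * len(values)
    # A null array element yields a null result; otherwise compare values.
    return [None if v is None else v == scalar for v in values]


result_valid = equal_array_scalar(["a", None, "b"], "a")
result_null = equal_array_scalar(["a", None, "b"], None)
print(result_valid)  # [True, None, False]
print(result_null)   # [None, None, None]
```

Under that assumption, the crashing call would return an all-null boolean array of the same length as the input.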

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Kirill Lykov / @KirillLykov


Note: This issue was originally created as ARROW-10578. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
With gdb I get this:

>>> pc.not_equal(pa.array(["a", None, "b"]), pa.scalar(None, type="string", from_pandas=True))
[Thread 0x7fbdb87fc700 (LWP 11192) exited]
[Thread 0x7fbdb7ffb700 (LWP 11193) exited]
[Thread 0x7fbdbaffd700 (LWP 11191) exited]
[Thread 0x7fbdbf7fe700 (LWP 11190) exited]
[Thread 0x7fbdbffff700 (LWP 11189) exited]
[Thread 0x7fbdc8fbc700 (LWP 11188) exited]
[Thread 0x7fbdc97bd700 (LWP 11187) exited]
[Detaching after fork from child process 11201]
[Detaching after fork from child process 11206]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fbdce798223 in arrow::Buffer::operator nonstd::sv_lite::basic_string_view<char, std::char_traits<char> > (this=0x0) at ../src/arrow/buffer.h:175
175	    return util::string_view(reinterpret_cast<const char*>(data_), size_);
(gdb) bt
#0  0x00007fbdce798223 in arrow::Buffer::operator nonstd::sv_lite::basic_string_view<char, std::char_traits<char> > (this=0x0) at ../src/arrow/buffer.h:175
#1  0x00007fbdcebc9511 in arrow::compute::internal::UnboxScalar<arrow::BinaryType, void>::Unbox (val=...) at ../src/arrow/compute/kernels/codegen_internal.h:275
#2  0x00007fbdcec625e6 in arrow::compute::internal::applicator::ScalarBinary<arrow::BooleanType, arrow::BinaryType, arrow::BinaryType, arrow::compute::internal::(anonymous namespace)::NotEqual>::ArrayScalar (
    ctx=0x7fff74cc7620, arg0=..., arg1=..., out=0x7fff74cc7420) at ../src/arrow/compute/kernels/codegen_internal.h:697
#3  0x00007fbdcec58c5e in arrow::compute::internal::applicator::ScalarBinary<arrow::BooleanType, arrow::BinaryType, arrow::BinaryType, arrow::compute::internal::(anonymous namespace)::NotEqual>::Exec (
    ctx=0x7fff74cc7620, batch=..., out=0x7fff74cc7420) at ../src/arrow/compute/kernels/codegen_internal.h:727
#4  0x00007fbdceb66e31 in std::_Function_handler<void (arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*), void (*)(arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*)>::_M_invoke(std::_Any_data const&, arrow::compute::KernelContext*&&, arrow::compute::ExecBatch const&, arrow::Datum*&&) (__functor=..., __args#0=@0x7fff74cc73a0: 0x7fff74cc7620, __args#1=..., 
    __args#2=@0x7fff74cc7390: 0x7fff74cc7420) at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/std_function.h:316
#5  0x00007fbdceabda82 in std::function<void (arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*)>::operator()(arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*) const (this=0x55dacf413718, __args#0=0x7fff74cc7620, __args#1=..., __args#2=0x7fff74cc7420)
    at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/std_function.h:706
#6  0x00007fbdceab9458 in arrow::compute::detail::(anonymous namespace)::ScalarExecutor::ExecuteBatch (this=0x55dacf712020, batch=..., listener=0x55dacf772a30) at ../src/arrow/compute/exec.cc:578
#7  0x00007fbdceab8afa in arrow::compute::detail::(anonymous namespace)::ScalarExecutor::Execute (this=0x55dacf712020, args=..., listener=0x55dacf772a30) at ../src/arrow/compute/exec.cc:516
#8  0x00007fbdceac8b7f in arrow::compute::Function::Execute (this=0x55dacf411640, args=..., options=0x0, ctx=0x7fff74cc7850) at ../src/arrow/compute/function.cc:146
#9  0x00007fbdb360749c in __pyx_pf_7pyarrow_8_compute_8Function_6call(__pyx_obj_7pyarrow_8_compute_Function*, _object*, __pyx_obj_7pyarrow_8_compute_FunctionOptions*, __pyx_obj_7pyarrow_3lib_MemoryPool*) [clone .isra.501] () from /home/joris/scipy/repos/arrow/python/pyarrow/_compute.cpython-37m-x86_64-linux-gnu.so
#

@asfimport

Kirill Lykov / @KirillLykov:
The problem is still reproducible. It happens only for type==string.

I don't see C++ tests for this use case:

TEST_F(TestStringCompareKernel, RandomCompareArrayScalar) {

Let me know if I am looking in the wrong place.
I will try to add a unit test for this particular case.

I also think it makes sense to add a test in pyarrow, something similar to

@pytest.mark.parametrize("typ", ["array", "chunked_array"])

The problem is that the scalar is invalid (datum->is_valid == false): see https://github.com/apache/arrow/blob/ca685a0c08bb41f43a80e5605e4cc8f9efb77cca/cpp/src/arrow/compute/kernels/codegen_internal.h#L713
But we still dereference val at codegen_internal.h:275 and try to create a string_view from data_, which has address 0x10.

To fix the bug, I guess some additional checks should be added to https://github.com/apache/arrow/blame/ca685a0c08bb41f43a80e5605e4cc8f9efb77cca/cpp/src/arrow/compute/kernels/codegen_internal.h#L273
Something like: if the scalar is invalid, return a default string_view.
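
The idea of that guard can be sketched in pure Python. This is only an illustrative model under stated assumptions, not the actual Arrow code or the eventual fix: `StringScalar`, `unbox`, and `not_equal_array_scalar` are hypothetical names, and an empty `bytes` object stands in for a default-constructed string_view.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StringScalar:
    """Hypothetical stand-in for a string scalar; a null scalar has no buffer."""
    value: Optional[bytes]
    is_valid: bool


def unbox(scalar: StringScalar) -> bytes:
    """Return the scalar's bytes, or an empty default when the scalar is null."""
    if not scalar.is_valid or scalar.value is None:
        # Guard: never touch the missing buffer of an invalid scalar.
        return b""
    return scalar.value


def not_equal_array_scalar(
    values: List[Optional[bytes]], scalar: StringScalar
) -> List[Optional[bool]]:
    """Compare each element with the scalar, propagating nulls."""
    rhs = unbox(scalar)  # safe even for a null scalar
    if not scalar.is_valid:
        # Assumption: the output is all null when the scalar is null.
        return [None] * len(values)
    return [None if v is None else v != rhs for v in values]


null_scalar = StringScalar(value=None, is_valid=False)
print(unbox(null_scalar))                                      # b''
print(not_equal_array_scalar([b"a", None, b"b"], null_scalar))  # [None, None, None]
```

With the guard in place, unboxing a null scalar yields a harmless empty value, and the validity flag alone decides whether the output is null.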

 

@asfimport

Kirill Lykov / @KirillLykov:
I will post a PR soon.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 9079
#9079

@asfimport asfimport added this to the 3.0.0 milestone Jan 11, 2023