SIGSEGV during join on large table #39332

Closed
tmaxwell-anthropic opened this issue Dec 20, 2023 · 6 comments · Fixed by #39383
Labels: Component: C++, Component: Python, Critical Fix (Bugfixes for security vulnerabilities, crashes, or invalid data), Type: bug

Comments

tmaxwell-anthropic commented Dec 20, 2023

This Python script produces a segmentation fault in the join() call:

import pyarrow as pa

eight_mib = "xyzw" * (2048 * 1024)
gib = pa.array((eight_mib for i in range(128)), pa.string())
keys = pa.array(range(128), pa.int64())
left = pa.Table.from_pydict({"keys": keys, "gib": gib})

right_keys = pa.array(list(range(128)) * 4, pa.int64())
right = pa.Table.from_pydict({"keys": right_keys})

print("joining...")
left.join(right, "keys")
print("joined.")

The C++ call stack is:

#0  0x00007ffff43aba6a in arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData> const&, arrow::compute::ResizableArrayData*, int, unsigned short const*, arrow::MemoryPool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#1  0x00007ffff43acfa7 in arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#2  0x00007ffff67cbe44 in arrow::acero::JoinResultMaterialize::Append(arrow::compute::ExecBatch const&, int, unsigned short const*, unsigned int const*, unsigned int const*, int*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#3  0x00007ffff67e0423 in arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> >*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#4  0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#5  0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long, long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#6  0x00007ffff67be225 in arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#7  0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long, int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned long&&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#8  0x00007ffff67baf47 in std::_Function_handler<arrow::Status (), arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned long)>, std::basic_string_view<char, std::char_traits<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#9  0x00007ffff67f9260 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#10 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() && ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#11 0x00007ffff44d5c38 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#12 0x00007ffff543b4a0 in execute_native_thread_routine () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#13 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007ffff78e2660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Software versions: PyArrow 12.0.1, Python 3.11.6, Ubuntu 22.04.3.

If I change the pa.string() to a pa.large_string() then it works fine.
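
Concretely, the only change is the array type; the rest of the repro is identical (reproduced here for convenience):

import pyarrow as pa

eight_mib = "xyzw" * (2048 * 1024)
# large_string() uses 64-bit offsets, so the ~4 GiB of joined string data
# stays within range; string() uses 32-bit offsets and overflows.
gib = pa.array((eight_mib for i in range(128)), pa.large_string())
keys = pa.array(range(128), pa.int64())
left = pa.Table.from_pydict({"keys": keys, "gib": gib})

right_keys = pa.array(list(range(128)) * 4, pa.int64())
right = pa.Table.from_pydict({"keys": right_keys})

left.join(right, "keys")  # completes instead of crashing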

Component(s)

C++, Python

tmaxwell-anthropic (Author) commented Dec 20, 2023

A variation is to define right_keys as follows:

right_keys = pa.array([0, 0], pa.int64())

Instead of a SIGSEGV, this results in an infinite loop with the following stack trace:

   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#1  0x00007ffff43ab901 in arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData> const&, arrow::compute::ResizableArrayData*, int, unsigned short const*, arrow::MemoryPool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#2  0x00007ffff43acfa7 in arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#3  0x00007ffff67d526b in arrow::acero::JoinResultMaterialize::AppendProbeOnly(arrow::compute::ExecBatch const&, int, unsigned short const*, int*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#4  0x00007ffff67e0953 in arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> >*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#5  0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#6  0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long, long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#7  0x00007ffff67be225 in arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#8  0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long, int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned long&&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#9  0x00007ffff67baf47 in std::_Function_handler<arrow::Status (), arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned long)>, std::basic_string_view<char, std::char_traits<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#10 0x00007ffff67f9260 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#11 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() && ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#12 0x00007ffff44d5c38 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#13 0x00007ffff543b4a0 in execute_native_thread_routine () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#14 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007ffff78e2660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

We're looping through the following three instructions:

0x7ffff43a7a90 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+80>  add    %ebx,%ebx
0x7ffff43a7a92 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+82>  cmp    %ebx,%eax
0x7ffff43a7a94 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+84>  jg     0x7ffff43a7a90 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+80>

Here %eax=1082130432 and %ebx=0.

I believe this corresponds to https://github.com/apache/arrow/blob/apache-arrow-12.0.1/cpp/src/arrow/compute/light_array.cc#L329-L331. Suppose min_new_size is slightly larger than 2^30. If new_size is initially some power of two, it keeps doubling until it reaches 2^30; doubling once more overflows to -2147483648; doubling again gives 0, and since 0 is still less than min_new_size, the loop never terminates.
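
To see the wrap-around concretely, here is the doubling simulated with 32-bit signed arithmetic in Python (using the %eax value above as min_new_size; this only models the arithmetic, not the actual C++ code):

def to_int32(x):
    # wrap to 32-bit two's complement, like a C int
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

min_new_size = 1082130432   # %eax above, slightly larger than 2**30
new_size = 1 << 20          # some power of two below min_new_size

for _ in range(15):         # bounded here, unlike the real loop
    if new_size >= min_new_size:
        break
    new_size = to_int32(new_size * 2)   # ..., 2**30, -2147483648, 0, 0, ...

print(new_size)             # 0 -- doubling 0 never reaches min_new_size, hence the hang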

Oddly, the duplicate value in right_keys is essential to reproduce the bug. If I set right_keys = pa.array([0, 1], pa.int64()) then it completes successfully.

Replacing pa.string() with pa.large_string() also makes the bug go away.

amoeba added the "Critical Fix" label (Bugfixes for security vulnerabilities, crashes, or invalid data) on Dec 21, 2023
zanmato1984 (Collaborator) commented:

I took a look at this; here are my findings.

First, some quick answers to the related observations:

If I change the pa.string() to a pa.large_string() then it works fine.

This is because only the string() type goes through the swiss join, which relies on the light-array code that is problematic (detailed later). large_string() takes another code path (the regular hash join implementation), which doesn't have this problem.

Oddly, the duplicate value in right_keys is essential to reproduce the bug. If I set right_keys = pa.array([0, 1], pa.int64()) then it completes successfully.

The essence of this failure is the overflow of the offsets in the resulting var-length buffer (detailed later). So the right table has to contain enough duplicate keys to blow up the join result and trigger the overflow. E.g., the left table has 128 rows of 8 MB strings, so if the right table contains 4 (or more) matches per left row, the result will have 128 * 8 MB * 4 = 4 GB of data, which overflows to zero when stored as a uint32_t offset.
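
Spelling out that arithmetic (sizes taken from the repro script):

MiB = 1 << 20
row_bytes = 8 * MiB            # each left-side string is 8 MiB
rows_left = 128
matches_per_key = 4            # the right table repeats every key 4 times

total = rows_left * row_bytes * matches_per_key
print(total == 2**32)          # True: exactly 4 GiB of string data in the result
print(total % 2**32)           # 0 -- a uint32_t offset wraps to exactly zero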

Now the detailed explanation of the segmentation fault (I didn't look much at the infinite-loop case, but I believe they share essentially the same cause):

The result of the join will contain 128 * 4 = 512 rows. In the left table's string column every row holds an 8 MB string, so the result will have a string column with 8 MB * 512 = 4 GB of data.

When ExecBatchBuilder appends this data, it first calculates the offsets for each row. Specifically, for the last (512th) row, which occupies bytes [4 GB - 8 MB, 4 GB), we'll write its end offset as offsets[512] = 4 GB. Unfortunately, in the following code

offsets[num_rows_before + num_rows_to_append] = sum;

sum will overflow to 0 when it reaches 4 GB, because it is a uint32_t.

And then when allocating the buffer for the string data, the following code

RETURN_NOT_OK(target->ResizeVaryingLengthBuffer());

basically does nothing, because it sees offsets[512] == 0 and concludes that the buffer needs no resizing, leaving the data buffer at its initial size of align(8 B + 64 B, 64 B) = 128 B.

Later, when the string data is appended, a segmentation fault happens as soon as the writes go past the 128-byte buffer.
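
A minimal Python model of that chain (it only mimics the uint32_t accumulation and the size check described above; the real logic lives in light_array.cc):

ROWS = 512                     # 128 left rows x 4 matches each
ROW_BYTES = 8 * (1 << 20)      # 8 MiB per string
MASK = 0xFFFFFFFF              # offsets are accumulated in a uint32_t

offsets = [0] * (ROWS + 1)
acc = 0
for i in range(ROWS):
    offsets[i] = acc
    acc = (acc + ROW_BYTES) & MASK   # wraps modulo 2**32
offsets[ROWS] = acc

print(offsets[ROWS])           # 0 -- the "needed size" the resize check sees
# With offsets[512] == 0, ResizeVaryingLengthBuffer() concludes nothing needs
# to grow, the data buffer stays at ~128 bytes, and the subsequent copy of
# 4 GiB of string data runs off the end of the allocation.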

I'm not sure the API contract can even be upheld for this particular case: for a string array whose elements are 8 MB each, as few as 512 rows already exceed the valid offset range (uint32_t). Maybe large string is a better choice.

But I think we should at least detect this kind of overflow and report an explicit error in this case.

zanmato1984 (Collaborator) commented:

I've drafted PR #39383 to give an explicit error for the cases in this issue. I'd appreciate it if a reviewer could take a look and confirm whether this is the right thing to do. Thanks.

zanmato1984 (Collaborator) commented:

@tmaxwell-anthropic As you mentioned:

I believe this corresponds to apache-arrow-12.0.1/cpp/src/arrow/compute/light_array.cc#L329-L331. Suppose min_new_size is slightly larger than 2^30. If new_size is initially some power of two, it will double until it hits 2^30; and then it will try to double again, but overflow to -2147483648; and then try to double again, and become 0.

This is another problem independent of what I discovered for the segmentation fault. I found this when I was debugging the UT for my fix. Now I've included the fix for the infinite loop in my PR as well.

tmaxwell-anthropic (Author) commented:

Great, thank you for looking into this!

I think switching to large_string is the right solution for my case. But it would be very nice to have a clear error message instead of a segfault/hang when this happens. :)
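
Until a released version carries the explicit error, a caller-side guard is one option: estimate the joined size of the var-length column and cast to large_string when it could exceed the 32-bit offset range. A rough, hypothetical sketch (the widen_if_needed helper and the matches-per-key estimate are illustrative, not part of PyArrow or of the fix):

import pyarrow as pa

INT32_MAX = 2**31 - 1

def widen_if_needed(table, column, max_matches_per_key):
    # Cast a string column to large_string if the join result could overflow
    # 32-bit offsets. nbytes is a rough upper bound that also counts the
    # offsets/validity buffers, which is fine for a conservative check.
    col = table[column]
    if col.nbytes * max_matches_per_key > INT32_MAX:
        idx = table.schema.get_field_index(column)
        table = table.set_column(idx, column, col.cast(pa.large_string()))
    return table

# For the repro above: left = widen_if_needed(left, "gib", max_matches_per_key=4)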

zanmato1984 (Collaborator) commented:

I found this when I was debugging the UT for my fix. Now I've included the fix for the infinite loop in my PR as well.

I've finalized the PR and hope someone can review it soon.

pitrou added a commit that referenced this issue Jan 18, 2024
… length data exceeds offset limit (int32 max) (#39383)

### Rationale for this change

When appending var-length data in `ExecBatchBuilder`, the offset can overflow if the batch contains 4GB of data or more. This may further result in a segmentation fault during the subsequent copying of the data content. For details, please refer to this comment: #39332 (comment).

The solution is to let the user use the "large" counterpart data type to avoid the overflow, but we may need an explicit error message when such an overflow happens.

### What changes are included in this PR?

1. Detect the offset overflow when appending data in `ExecBatchBuilder` and explicitly raise an error.
2. Change the offset type from `uint32_t` to `int32_t` in `ExecBatchBuilder` and respect `BinaryBuilder::memory_limit()`, which is `2GB - 2B`, like the rest of the codebase.

### Are these changes tested?

UT included.

### Are there any user-facing changes?

No.

* Closes: #39332

Lead-authored-by: zanmato <zanmato1984@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Rossi(Ruoxi) Sun <zanmato1984@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou modified the milestones: 16.0.0, 15.0.1 on Jan 18, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
raulcd pushed a commit that referenced this issue Feb 20, 2024
zanmato1984 added a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024