SIGSEGV during join on large table #39332

Closed
tmaxwell-anthropic opened this issue Dec 20, 2023 · 6 comments · Fixed by #39383
Labels: Component: C++, Component: Python, Critical Fix (Bugfixes for security vulnerabilities, crashes, or invalid data), Type: bug

Comments

tmaxwell-anthropic commented Dec 20, 2023

This Python script produces a segmentation fault in the join() call:

import pyarrow as pa

eight_mib = "xyzw" * (2048 * 1024)
gib = pa.array((eight_mib for i in range(128)), pa.string())
keys = pa.array(range(128), pa.int64())
left = pa.Table.from_pydict({"keys": keys, "gib": gib})

right_keys = pa.array(list(range(128)) * 4, pa.int64())
right = pa.Table.from_pydict({"keys": right_keys})

print("joining...")
left.join(right, "keys")
print("joined.")

The C++ call stack is:

#0  0x00007ffff43aba6a in arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData> const&, arrow::compute::ResizableArrayData*, int, unsigned short const*, arrow::MemoryPool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#1  0x00007ffff43acfa7 in arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#2  0x00007ffff67cbe44 in arrow::acero::JoinResultMaterialize::Append(arrow::compute::ExecBatch const&, int, unsigned short const*, unsigned int const*, unsigned int const*, int*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#3  0x00007ffff67e0423 in arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> >*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#4  0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#5  0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long, long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#6  0x00007ffff67be225 in arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#7  0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long, int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned long&&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#8  0x00007ffff67baf47 in std::_Function_handler<arrow::Status (), arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned long)>, std::basic_string_view<char, std::char_traits<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#9  0x00007ffff67f9260 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#10 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() && ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#11 0x00007ffff44d5c38 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#12 0x00007ffff543b4a0 in execute_native_thread_routine () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#13 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007ffff78e2660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Software versions: PyArrow 12.0.1, Python 3.11.6, Ubuntu 22.04.3.

If I change the pa.string() to a pa.large_string() then it works fine.
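
Concretely, the only change is the array type; the rest of the repro is identical (reproduced here for convenience):

import pyarrow as pa

eight_mib = "xyzw" * (2048 * 1024)
# large_string() uses 64-bit offsets, so the ~4 GiB of joined string data
# stays within range; string() uses 32-bit offsets and overflows.
gib = pa.array((eight_mib for i in range(128)), pa.large_string())
keys = pa.array(range(128), pa.int64())
left = pa.Table.from_pydict({"keys": keys, "gib": gib})

right_keys = pa.array(list(range(128)) * 4, pa.int64())
right = pa.Table.from_pydict({"keys": right_keys})

left.join(right, "keys")  # completes instead of crashing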

Component(s)

C++, Python

tmaxwell-anthropic (Author) commented Dec 20, 2023

A variation is to define right_keys as follows:

right_keys = pa.array([0, 0], pa.int64())

Instead of a SIGSEGV, this results in an infinite loop with the following stack trace:

   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#1  0x00007ffff43ab901 in arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData> const&, arrow::compute::ResizableArrayData*, int, unsigned short const*, arrow::MemoryPool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#2  0x00007ffff43acfa7 in arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#3  0x00007ffff67d526b in arrow::acero::JoinResultMaterialize::AppendProbeOnly(arrow::compute::ExecBatch const&, int, unsigned short const*, int*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#4  0x00007ffff67e0953 in arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> >*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#5  0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#6  0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long, long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#7  0x00007ffff67be225 in arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#8  0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long, int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned long&&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#9  0x00007ffff67baf47 in std::_Function_handler<arrow::Status (), arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned long)>, std::basic_string_view<char, std::char_traits<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#10 0x00007ffff67f9260 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#11 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() && ()
   from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#12 0x00007ffff44d5c38 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#13 0x00007ffff543b4a0 in execute_native_thread_routine () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#14 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007ffff78e2660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

We're looping through the following three instructions:

0x7ffff43a7a90 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+80>  add    %ebx,%ebx
0x7ffff43a7a92 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+82>  cmp    %ebx,%eax
0x7ffff43a7a94 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+84>  jg     0x7ffff43a7a90 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+80>

Here %eax=1082130432 and %ebx=0.

I believe this corresponds to https://github.com/apache/arrow/blob/apache-arrow-12.0.1/cpp/src/arrow/compute/light_array.cc#L329-L331. Suppose min_new_size is slightly larger than 2^30. If new_size is initially some power of two, it keeps doubling until it reaches 2^30; doubling once more overflows to -2147483648; doubling again gives 0, and since 0 is still less than min_new_size, the loop never terminates.
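
To see the wrap-around concretely, here is the doubling simulated with 32-bit signed arithmetic in Python (using the %eax value above as min_new_size; this only models the arithmetic, not the actual C++ code):

def to_int32(x):
    # wrap to 32-bit two's complement, like a C int
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

min_new_size = 1082130432   # %eax above, slightly larger than 2**30
new_size = 1 << 20          # some power of two below min_new_size

for _ in range(15):         # bounded here, unlike the real loop
    if new_size >= min_new_size:
        break
    new_size = to_int32(new_size * 2)   # ..., 2**30, -2147483648, 0, 0, ...

print(new_size)             # 0 -- doubling 0 never reaches min_new_size, hence the hang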

Oddly, the duplicate value in right_keys is essential to reproduce the bug. If I set right_keys = pa.array([0, 1], pa.int64()) then it completes successfully.

Replacing pa.string() with pa.large_string() also makes the bug go away.

amoeba added the "Critical Fix" label (Bugfixes for security vulnerabilities, crashes, or invalid data) on Dec 21, 2023
zanmato1984 (Collaborator) commented:

I took a look at this; here are my findings.

First, some quick answers to the related observations:

If I change the pa.string() to a pa.large_string() then it works fine.

This is because only the string() type goes through the swiss join, which relies on the light-array code that is problematic (detailed later). large_string() takes another code path (the regular hash join implementation), which doesn't have this problem.

Oddly, the duplicate value in right_keys is essential to reproduce the bug. If I set right_keys = pa.array([0, 1], pa.int64()) then it completes successfully.

The essence of this failure is the overflow of the offsets in the resulting var-length buffer (detailed later). So the right table has to contain enough duplicate keys to blow up the join result and trigger the overflow. E.g., the left table has 128 rows of 8 MB strings, so if the right table contains 4 (or more) matches per left row, the result will have 128 * 8 MB * 4 = 4 GB of data, which overflows to zero when stored as a uint32_t offset.
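
Spelling out that arithmetic (sizes taken from the repro script):

MiB = 1 << 20
row_bytes = 8 * MiB            # each left-side string is 8 MiB
rows_left = 128
matches_per_key = 4            # the right table repeats every key 4 times

total = rows_left * row_bytes * matches_per_key
print(total == 2**32)          # True: exactly 4 GiB of string data in the result
print(total % 2**32)           # 0 -- a uint32_t offset wraps to exactly zero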

Now the detailed explanation of the segmentation fault (I didn't look much at the infinite-loop case, but I believe they share essentially the same cause):

The result of the join will contain 128 * 4 = 512 rows. In the left table's string column every row holds an 8 MB string, so the result will have a string column with 8 MB * 512 = 4 GB of data.

When ExecBatchBuilder appends this data, it first calculates the offsets for each row. Specifically, for the last (512th) row, which occupies bytes [4 GB - 8 MB, 4 GB), we'll write its end offset as offsets[512] = 4 GB. Unfortunately, in the following code

offsets[num_rows_before + num_rows_to_append] = sum;

sum will overflow to 0 when it reaches 4 GB, because it is a uint32_t.

And then when allocating the buffer for the string data, the following code

RETURN_NOT_OK(target->ResizeVaryingLengthBuffer());

basically does nothing, because it sees offsets[512] == 0 and concludes that the buffer needs no resizing, leaving the data buffer at its initial size of align(8 B + 64 B, 64 B) = 128 B.

Later, when the string data is appended, a segmentation fault happens as soon as the writes go past the 128-byte buffer.
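
A minimal Python model of that chain (it only mimics the uint32_t accumulation and the size check described above; the real logic lives in light_array.cc):

ROWS = 512                     # 128 left rows x 4 matches each
ROW_BYTES = 8 * (1 << 20)      # 8 MiB per string
MASK = 0xFFFFFFFF              # offsets are accumulated in a uint32_t

offsets = [0] * (ROWS + 1)
acc = 0
for i in range(ROWS):
    offsets[i] = acc
    acc = (acc + ROW_BYTES) & MASK   # wraps modulo 2**32
offsets[ROWS] = acc

print(offsets[ROWS])           # 0 -- the "needed size" the resize check sees
# With offsets[512] == 0, ResizeVaryingLengthBuffer() concludes nothing needs
# to grow, the data buffer stays at ~128 bytes, and the subsequent copy of
# 4 GiB of string data runs off the end of the allocation.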

I'm not sure the API contract can even be upheld for this particular case: for a string array whose elements are 8 MB each, as few as 512 rows already exceed the valid offset range (uint32_t). Maybe large string is a better choice.

But I think we should at least detect this kind of overflow and report an explicit error in this case.

zanmato1984 (Collaborator) commented:

I've drafted PR #39383 to give an explicit error for the cases in this issue. I'd appreciate it if a reviewer could take a look and confirm whether this is the right thing to do. Thanks.

zanmato1984 (Collaborator) commented:

@tmaxwell-anthropic As you mentioned:

I believe this corresponds to apache-arrow-12.0.1/cpp/src/arrow/compute/light_array.cc#L329-L331. Suppose min_new_size is slightly larger than 2^30. If new_size is initially some power of two, it will double until it hits 2^30; and then it will try to double again, but overflow to -2147483648; and then try to double again, and become 0.

This is another problem independent of what I discovered for the segmentation fault. I found this when I was debugging the UT for my fix. Now I've included the fix for the infinite loop in my PR as well.

tmaxwell-anthropic (Author) commented:

Great, thank you for looking into this!

I think switching to large_string is the right solution for my case. But it would be very nice to have a clear error message instead of a segfault/hang when this happens. :)
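
Until a released version carries the explicit error, a caller-side guard is one option: estimate the joined size of the var-length column and cast to large_string when it could exceed the 32-bit offset range. A rough, hypothetical sketch (the widen_if_needed helper and the matches-per-key estimate are illustrative, not part of PyArrow or of the fix):

import pyarrow as pa

INT32_MAX = 2**31 - 1

def widen_if_needed(table, column, max_matches_per_key):
    # Cast a string column to large_string if the join result could overflow
    # 32-bit offsets. nbytes is a rough upper bound that also counts the
    # offsets/validity buffers, which is fine for a conservative check.
    col = table[column]
    if col.nbytes * max_matches_per_key > INT32_MAX:
        idx = table.schema.get_field_index(column)
        table = table.set_column(idx, column, col.cast(pa.large_string()))
    return table

# For the repro above: left = widen_if_needed(left, "gib", max_matches_per_key=4)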

zanmato1984 (Collaborator) commented:

I found this when I was debugging the UT for my fix. Now I've included the fix for the infinite loop in my PR as well.

I've finalized the PR and hope someone can review it soon.

pitrou added a commit that referenced this issue Jan 18, 2024
… length data exceeds offset limit (int32 max) (#39383)

### Rationale for this change

When appending var-length data in `ExecBatchBuilder`, the offset can overflow if the batch contains 4GB of data or more. This may further result in a segmentation fault during the subsequent copying of the data content. For details, please refer to this comment: #39332 (comment).

The solution is to let the user use the "large" counterpart data type to avoid the overflow, but we may need an explicit error message when such an overflow happens.

### What changes are included in this PR?

1. Detect the offset overflow when appending data in `ExecBatchBuilder` and explicitly raise an error.
2. Change the offset type from `uint32_t` to `int32_t` in `ExecBatchBuilder` and respect `BinaryBuilder::memory_limit()`, which is `2GB - 2B`, like the rest of the codebase.

### Are these changes tested?

UT included.

### Are there any user-facing changes?

No.

* Closes: #39332

Lead-authored-by: zanmato <zanmato1984@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Rossi(Ruoxi) Sun <zanmato1984@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou modified the milestones: 16.0.0, 15.0.1 on Jan 18, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
raulcd pushed a commit that referenced this issue Feb 20, 2024
zanmato1984 added a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024