Uncaught std::bad_alloc exception when group_by very big large_utf8 columns #39190
Description
Describe the bug, including details regarding any error messages, version, and platform.
When a `pyarrow.Table` contains very large string values, whose sizes are very close to 2**31 - 1 bytes, a segfault or an allocator exception can be raised when performing a `group_by` on very big `large_utf8` columns:
```python
import pyarrow as pa

print(f"Pyarrow version: {pa.__version__}")

# MAX_SIZE is the largest value that can fit in a 32-bit signed integer.
MAX_SIZE = int(2**31) - 1

# Create a string whose length is very close to MAX_SIZE:
BIG_STR_LEN = MAX_SIZE - 1
print(f"{BIG_STR_LEN=} = 2**31 - {2**31 - BIG_STR_LEN}")
BIG_STR = "A" * BIG_STR_LEN

# Create a record batch with two rows, both containing BIG_STR in each column:
record_batch = pa.RecordBatch.from_pydict(
    mapping={
        "id": [BIG_STR, BIG_STR],
        "other": [BIG_STR, BIG_STR],
    },
    schema=pa.schema(
        {
            "id": pa.large_utf8(),
            "other": pa.large_utf8(),
        }
    ),
)

# Create a table containing just the one RecordBatch:
table = pa.Table.from_batches([record_batch])

# Attempt to group by `id`:
ans = table.group_by(["id"]).aggregate([("other", "max")])
print(ans)
```

On my M1 Mac, the output from running this program looks like:
```
Pyarrow version: 14.0.1
BIG_STR_LEN=2147483646 = 2**31 - 2
libc++abi: terminating due to uncaught exception of type std::bad_alloc: std::bad_alloc
zsh: abort      python main.py
```
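For scale, here is my own back-of-the-envelope arithmetic (not PyArrow code): grouping on `id` means encoding both rows' keys into one buffer, so the required buffer length is roughly twice `BIG_STR_LEN`, which already exceeds the `int32_t` range:

```python
# Hypothetical arithmetic illustrating why the encoded key buffer length
# cannot fit in an int32_t once both rows' keys are accounted for.
MAX_INT32 = 2**31 - 1
BIG_STR_LEN = MAX_INT32 - 1

total_key_bytes = 2 * BIG_STR_LEN   # two rows, one large_utf8 key column
print(total_key_bytes)              # 4294967292, i.e. 2**32 - 4
print(total_key_bytes > MAX_INT32)  # True
```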
(In the previous version, pyarrow==10.0.1, this was a segfault instead of just a bad_alloc exception:)

```
BIG_STR_LEN=2147483642 = 2**31 - 2
zsh: segmentation fault  python main.py
```
I need to emphasize that there is more than enough memory available to satisfy this operation. The problem is actually caused by integer overflow, I believe in one or both of the following places:

- In `VarLengthKeyEncoder::AddLength`, there is no check that adding the offset does not cause the length of the buffer to overflow an `int32_t`.
- In `GrouperImpl::Consume`, there is no check that the sums of the `offsets_batch` do not overflow an `int32_t`.
Overflow in signed integer arithmetic is undefined behavior in C++, but in practice it typically wraps around. The result is that we end up with a negative `int32_t` value.
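The wrap-around can be reproduced in Python with a small helper (my own model of typical two's-complement behavior, not PyArrow code):

```python
def wrap_int32(x: int) -> int:
    """Simulate two's-complement wrap-around of a 32-bit signed integer."""
    return ((x + 2**31) % 2**32) - 2**31

# Summing the encoded key lengths for the two ~2**31-byte rows:
total_length = wrap_int32(2 * (2**31 - 2))
print(total_length)  # -4: a negative "length" after the overflow
```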
Then, when we construct `std::vector<uint8_t> key_bytes_batch(total_length);`, the `total_length` is converted from `int32_t` to `uint64_t` (since `std::vector`'s length constructor accepts a `size_t`, which is `uint64_t` on most modern platforms). The conversion goes like this:
`int32_t(-1)` ==> `int64_t(-1)` ==> `uint64_t(2**64 - 1)`
But 2**64 - 1 bytes is obviously far more memory than is available on my machine. The overflow needs to be detected sooner, so that this impossibly large number is never used as an allocation request.
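This conversion chain can also be modelled in Python (the helper below is my own illustration of C++'s integral conversions, not PyArrow code):

```python
def int32_to_uint64(x: int) -> int:
    """Model int32_t -> int64_t (sign extension, value-preserving)
    followed by int64_t -> uint64_t (reinterpretation modulo 2**64)."""
    assert -2**31 <= x <= 2**31 - 1
    return x % 2**64

print(int32_to_uint64(-1))  # 18446744073709551615 == 2**64 - 1
print(int32_to_uint64(-4))  # 18446744073709551612 == 2**64 - 4
```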
Component(s)
Python