
Uncaught std::bad_alloc exception when calling group_by on very big large_utf8 columns #39190

@Nathan-Fenner

Description


Describe the bug, including details regarding any error messages, version, and platform.

When a pyarrow.Table contains very large string values, with sizes very close to 2**31 - 1 bytes, a segfault or allocator exception can be raised when performing a group_by on very big large_utf8 columns:

import pyarrow as pa

# MAX_SIZE is the largest value that can fit in a 32-bit signed integer.
MAX_SIZE = int(2**31) - 1

# Create a string whose length is very close to MAX_SIZE:
BIG_STR_LEN = MAX_SIZE - 1
print(f"{BIG_STR_LEN=} = 2**31 - {2**31 - BIG_STR_LEN}")
BIG_STR = "A" * BIG_STR_LEN

# Create a record batch with two rows, both containing the BIG_STR in each of their columns:
record_batch = pa.RecordBatch.from_pydict(
    mapping={
        "id": [BIG_STR, BIG_STR],
        "other": [BIG_STR, BIG_STR],
    },
    schema=pa.schema(
        {
            "id": pa.large_utf8(),
            "other": pa.large_utf8(),
        }
    ),
)

# Create a table containing just the one RecordBatch:
table = pa.Table.from_batches([record_batch])

# Attempt to group by `id`:
ans = table.group_by(["id"]).aggregate([("other", "max")])
print(ans)

On my M1 mac, the output from running this program looks like:

Pyarrow version: 14.0.1

BIG_STR_LEN=2147483646 = 2**31 - 2
libc++abi: terminating due to uncaught exception of type std::bad_alloc: std::bad_alloc
zsh: abort      python main.py

(In the previous version, pyarrow==10.0.1, this was a segfault instead of just a bad_alloc exception:)

BIG_STR_LEN=2147483646 = 2**31 - 2
zsh: segmentation fault  python main.py
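
A quick back-of-the-envelope check (plain Python arithmetic, not Arrow's actual internals) shows why this repro is sized the way it is: the combined key bytes for the two rows already exceed what a 32-bit signed accumulator can hold.

```python
# Hypothetical sanity check: the two "id" keys alone exceed INT32_MAX bytes,
# so any int32 accumulator summing their lengths must overflow.
INT32_MAX = 2**31 - 1
BIG_STR_LEN = INT32_MAX - 1  # same length as in the repro above

total_key_bytes = 2 * BIG_STR_LEN   # two rows, one key string each
print(total_key_bytes)              # 4294967292, well past INT32_MAX
print(total_key_bytes > INT32_MAX)  # True
```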

I need to emphasize that there is more than enough memory to satisfy this operation. The problem is actually caused by integer overflow, in one or both of the following places:

Overflow in signed integer arithmetic is undefined behavior in C++, but it typically results in wrap-around. The result is that we end up with a negative int32_t value.

Then, when we construct

std::vector<uint8_t> key_bytes_batch(total_length);

the total_length is converted from int32_t to uint64_t (since std::vector's length constructor accepts a size_t, which is a 64-bit unsigned integer on most modern platforms). The conversion goes like this:

int32_t(-1)  ==>  int64_t(-1)  ==>  uint64_t(2**64 - 1)

But 2**64 - 1 bytes is obviously more memory than is available on my computer. The overflow needs to be detected sooner to prevent this excessively large number from being used as an impossible allocation request.
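
The suspected chain can be simulated from Python with ctypes fixed-width integers (a sketch of the arithmetic only, not the actual C++ code path; ctypes performs no overflow checking, so it wraps the same way):

```python
import ctypes

# Sum of the two key lengths from the repro, wrapped into a 32-bit signed int
# (C++ signed overflow is UB, but wrap-around is what is typically observed):
BIG_STR_LEN = 2**31 - 2
total_length = ctypes.c_int32(BIG_STR_LEN + BIG_STR_LEN)
print(total_length.value)  # -4: the sum has wrapped negative

# Passing that negative int32 as a size_t follows int32 -> int64 -> uint64,
# yielding an absurdly large allocation request:
as_size_t = ctypes.c_uint64(ctypes.c_int64(total_length.value).value)
print(as_size_t.value)  # 18446744073709551612, i.e. 2**64 - 4
```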

Component(s)

Python
