Uncaught std::bad_alloc exception when group_by very big large_utf8 columns #39190
Description
Describe the bug, including details regarding any error messages, version, and platform.
When a `pyarrow.Table` contains very large string values, whose sizes are very close to 2**31 - 1 bytes, a segfault or an allocator exception can be raised when performing a `group_by` on very big `large_utf8` columns:
```python
import pyarrow as pa

print(f"Pyarrow version: {pa.__version__}")

# MAX_SIZE is the largest value that can fit in a 32-bit signed integer.
MAX_SIZE = int(2**31) - 1

# Create a string whose length is very close to MAX_SIZE:
BIG_STR_LEN = MAX_SIZE - 1
print(f"{BIG_STR_LEN=} = 2**31 - {2**31 - BIG_STR_LEN}")
BIG_STR = "A" * BIG_STR_LEN

# Create a record batch with two rows, both containing BIG_STR in each column:
record_batch = pa.RecordBatch.from_pydict(
    mapping={
        "id": [BIG_STR, BIG_STR],
        "other": [BIG_STR, BIG_STR],
    },
    schema=pa.schema(
        {
            "id": pa.large_utf8(),
            "other": pa.large_utf8(),
        }
    ),
)

# Create a table containing just the one RecordBatch:
table = pa.Table.from_batches([record_batch])

# Attempt to group by `id`:
ans = table.group_by(["id"]).aggregate([("other", "max")])
print(ans)
```

On my M1 Mac, the output from running this program looks like:
```
Pyarrow version: 14.0.1
BIG_STR_LEN=2147483646 = 2**31 - 2
libc++abi: terminating due to uncaught exception of type std::bad_alloc: std::bad_alloc
zsh: abort      python main.py
```
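For scale, here is my own back-of-the-envelope arithmetic (not PyArrow code): grouping on `id` means encoding both rows' keys into one buffer, so the required buffer length is roughly twice `BIG_STR_LEN`, which already exceeds the `int32_t` range:

```python
# Hypothetical arithmetic illustrating why the encoded key buffer length
# cannot fit in an int32_t once both rows' keys are accounted for.
MAX_INT32 = 2**31 - 1
BIG_STR_LEN = MAX_INT32 - 1

total_key_bytes = 2 * BIG_STR_LEN   # two rows, one large_utf8 key column
print(total_key_bytes)              # 4294967292, i.e. 2**32 - 4
print(total_key_bytes > MAX_INT32)  # True
```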
(In the previous version, pyarrow==10.0.1, this was a segfault instead of just a bad_alloc exception:)

```
BIG_STR_LEN=2147483642 = 2**31 - 2
zsh: segmentation fault  python main.py
```
I need to emphasize that there is more than enough memory available to satisfy this operation. The problem is actually caused by integer overflow, I believe in one or both of the following places:

- In `VarLengthKeyEncoder::AddLength`, there is no check that adding the offset does not cause the length of the buffer to overflow an `int32_t`.
- In `GrouperImpl::Consume`, there is no check that the sums of the `offsets_batch` do not overflow an `int32_t`.
Overflow in signed integer arithmetic is undefined behavior in C++, but in practice it typically wraps around. The result is that we end up with a negative `int32_t` value.
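The wrap-around can be reproduced in Python with a small helper (my own model of typical two's-complement behavior, not PyArrow code):

```python
def wrap_int32(x: int) -> int:
    """Simulate two's-complement wrap-around of a 32-bit signed integer."""
    return ((x + 2**31) % 2**32) - 2**31

# Summing the encoded key lengths for the two ~2**31-byte rows:
total_length = wrap_int32(2 * (2**31 - 2))
print(total_length)  # -4: a negative "length" after the overflow
```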
Then, when we construct `std::vector<uint8_t> key_bytes_batch(total_length);`, the `total_length` is converted from `int32_t` to `uint64_t` (since `std::vector`'s length constructor accepts a `size_t`, which is `uint64_t` on most modern platforms). The conversion goes like this:
`int32_t(-1)` ==> `int64_t(-1)` ==> `uint64_t(2**64 - 1)`
But 2**64 - 1 bytes is obviously far more memory than is available on my machine. The overflow needs to be detected sooner, so that this impossibly large number is never used as an allocation request.
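This conversion chain can also be modelled in Python (the helper below is my own illustration of C++'s integral conversions, not PyArrow code):

```python
def int32_to_uint64(x: int) -> int:
    """Model int32_t -> int64_t (sign extension, value-preserving)
    followed by int64_t -> uint64_t (reinterpretation modulo 2**64)."""
    assert -2**31 <= x <= 2**31 - 1
    return x % 2**64

print(int32_to_uint64(-1))  # 18446744073709551615 == 2**64 - 1
print(int32_to_uint64(-4))  # 18446744073709551612 == 2**64 - 4
```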
Component(s)
Python