[C++][Gandiva] Out-of-bounds read in utf8_length_ignore_invalid on truncated multi-byte input

### Describe the bug, including details regarding any error messages, version, and platform.

`utf8_length_ignore_invalid` in `cpp/src/gandiva/precompiled/string_ops.cc` counts glyphs while tolerating invalid bytes. When a byte inside a multi-byte sequence is not a valid continuation byte, it extends `char_len`:

```cpp
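for (int j = 1; j < char_len; ++j) {
  // data[i + j] is read without checking i + j against data_len
  if ((data[i + j] & 0xC0) != 0x80) {
    char_len += 1;  // a non-continuation byte widens the loop bound
  }
}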
```

The inner loop keeps reading `data[i + j]` as `char_len` grows, but it never rechecks `i + j` against `data_len`. If a value ends in a truncated multi-byte sequence (for example, a `0xF0` lead byte followed by non-continuation bytes), `char_len` grows past the remaining bytes and the loop reads past the end of the buffer.

The function is reached from untrusted string data through `lpad`/`rpad` (`lpad_utf8_int32_utf8`, `rpad_utf8_int32_utf8`), which call it on the input text before computing padding.

Reproduced against a verbatim copy of the function under AddressSanitizer with the 4-byte input `{0xF0, 'a', 'a', 'a'}` in an exactly-sized heap buffer:

```
==ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 1 ... 0 bytes after 4-byte region
    #0 utf8_length_ignore_invalid
```
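The reported offset can be checked by replaying the loop on paper. The sketch below is a model for illustration, not the Gandiva source: `lead_byte_len` and `first_out_of_bounds_read` are hypothetical names, and the lead-byte length rule is an assumption based on standard UTF-8 lead-byte patterns. It runs the same extension logic as the snippet above but tests each index instead of dereferencing it, so it is safe to run.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (assumption): sequence length implied by a UTF-8 lead byte.
static int lead_byte_len(uint8_t c) {
  if ((c & 0x80) == 0x00) return 1;  // 0xxxxxxx
  if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                          // anything else is treated as one byte
}

// Replays the inner loop for the character starting at `i`, checking every
// index before use. Returns the first index i + j that lies outside the
// buffer, or -1 if every read stays in bounds.
static long first_out_of_bounds_read(const std::vector<uint8_t>& data,
                                     std::size_t i) {
  int char_len = lead_byte_len(data[i]);
  for (int j = 1; j < char_len; ++j) {
    std::size_t idx = i + static_cast<std::size_t>(j);
    if (idx >= data.size()) return static_cast<long>(idx);
    if ((data[idx] & 0xC0) != 0x80) {
      char_len += 1;  // same extension rule as the reported loop
    }
  }
  return -1;
}
```

For `{0xF0, 'a', 'a', 'a'}` under these assumptions, `char_len` starts at 4 and each `'a'` bumps it by one (5, 6, 7), so the loop is still running at `j = 4` and the model returns index 4, the byte immediately after the 4-byte region.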

The sibling helpers (`utf8_length`, `reverse_utf8`, `utf8_byte_pos`) already guard the equivalent access with `i + char_len > data_len`; this one does not.
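Because `char_len` changes inside the loop, a single `i + char_len > data_len` check before the loop would not be enough here; the bound has to hold on every iteration. A minimal sketch of one possible shape, assuming the same extension rule as the snippet above: `count_glyphs_bounded` and `seq_len_from_lead` are hypothetical names, and the count returned for malformed input is an arbitrary choice in this sketch, not a statement of what Gandiva should return.

```cpp
#include <cstdint>

// Hypothetical helper (assumption): sequence length implied by a UTF-8 lead byte.
static int seq_len_from_lead(uint8_t c) {
  if ((c & 0x80) == 0x00) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 1;
}

// Sketch only: counts glyphs with the same "extend on invalid byte" rule,
// but never reads at or beyond data_len.
static int count_glyphs_bounded(const char* data, int data_len) {
  int count = 0;
  for (int i = 0; i < data_len;) {
    int char_len = seq_len_from_lead(static_cast<uint8_t>(data[i]));
    // The extra `i + j < data_len` term is rechecked every iteration,
    // so growth of char_len cannot push the read past the buffer.
    for (int j = 1; j < char_len && i + j < data_len; ++j) {
      if ((data[i + j] & 0xC0) != 0x80) {
        char_len += 1;
      }
    }
    // Clamp so the outer index also stops at the buffer end.
    if (char_len > data_len - i) char_len = data_len - i;
    i += char_len;
    ++count;
  }
  return count;
}
```

With this shape, the 4-byte input from the reproduction is consumed as a single glyph and no byte past index 3 is touched.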

### Component(s)

C++, Gandiva


[C++][Gandiva] Out-of-bounds read in utf8_length_ignore_invalid on truncated multi-byte input #50355
