[C++][Gandiva] Out-of-bounds read in utf8_length_ignore_invalid on truncated multi-byte input #50355

Description

@Arawoof06

Describe the bug, including details regarding any error messages, version, and platform.

utf8_length_ignore_invalid in cpp/src/gandiva/precompiled/string_ops.cc counts glyphs while tolerating invalid bytes. When a byte inside a glyph is not a valid continuation byte, the function extends char_len by one:

for (int j = 1; j < char_len; ++j) {
  if ((data[i + j] & 0xC0) != 0x80) {
    char_len += 1;
  }
}

The inner loop keeps reading data[i + j] as char_len grows, but it never rechecks i + j against data_len. If a value ends in a truncated or malformed multi-byte sequence (for example, a 0xF0 lead byte followed by non-continuation bytes), char_len grows past the remaining bytes and the loop reads beyond the end of the buffer.

The function is reached from untrusted string data through lpad/rpad (lpad_utf8_int32_utf8, rpad_utf8_int32_utf8), which call it on the input text before computing padding.

Reproduced against a verbatim copy of the function under AddressSanitizer with the 4-byte input {0xF0, 'a', 'a', 'a'} in an exactly-sized heap buffer:

==ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 1 ... 0 bytes after 4-byte region
    #0 utf8_length_ignore_invalid
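
To make the index growth concrete, here is a minimal sketch that models the inner loop and reports the first index it would read out of bounds, without performing that read. It is not the Gandiva source: assumed_lead_len, first_oob_index, and the shape of the outer loop are hypothetical stand-ins, and only the inner-loop condition is taken from the snippet above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical lead-byte classifier (assumption, not the Gandiva helper):
// returns the expected glyph length, or 0 for a byte that cannot start a glyph.
static int assumed_lead_len(uint8_t b) {
  if (b < 0x80) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 0;
}

// Models the loop from the report. Returns the first index >= data.size()
// that the unguarded inner loop would read, or -1 if it stays in bounds.
static int first_oob_index(const std::vector<uint8_t>& data) {
  const int data_len = static_cast<int>(data.size());
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = assumed_lead_len(data[i]);
    if (char_len == 0) char_len = 1;  // assumed fallback so the loop advances
    for (int j = 1; j < char_len; ++j) {
      if (i + j >= data_len) return i + j;  // the unguarded code reads here
      if ((data[i + j] & 0xC0) != 0x80) {
        char_len += 1;  // same growth as in the reported snippet
      }
    }
  }
  return -1;
}
```

For {0xF0, 'a', 'a', 'a'} the model starts with char_len = 4, grows it to 7 over the three non-continuation bytes, and then reaches j = 4, which is index 4 of a 4-byte buffer. That matches the "0 bytes after 4-byte region" in the AddressSanitizer report.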

The sibling helpers (utf8_length, reverse_utf8, utf8_byte_pos) already guard the equivalent access with i + char_len > data_len; this one does not.
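
One possible way to close the gap is to bound the inner loop by data_len in addition to char_len. The sketch below illustrates that idea under stated assumptions: lead_len_sketch and count_glyphs_guarded are hypothetical names, the outer loop is an assumed shape rather than the actual Gandiva code, and the maintainers may prefer a different fix or different counting semantics for malformed input.

```cpp
// Hypothetical lead-byte classifier (assumption, not the Gandiva helper).
static int lead_len_sketch(unsigned char b) {
  if (b < 0x80) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 0;
}

// Counts glyphs while tolerating invalid bytes, never reading at or past
// data + data_len.
static int count_glyphs_guarded(const char* data, int data_len) {
  int count = 0;
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = lead_len_sketch(static_cast<unsigned char>(data[i]));
    // Same style of guard the report attributes to the sibling helpers:
    // treat an invalid or incomplete glyph as a single byte.
    if (char_len == 0 || i + char_len > data_len) {
      char_len = 1;
    }
    // Extra bound on i + j: char_len may still grow, but reads stay in range.
    for (int j = 1; j < char_len && i + j < data_len; ++j) {
      if ((data[i + j] & 0xC0) != 0x80) {
        char_len += 1;
      }
    }
    ++count;
  }
  return count;
}
```

With this bound, the input from the reproduction is consumed without touching index 4. Whether the resulting count for malformed input is what lpad/rpad expect is a separate question that this sketch does not settle.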

Component(s)

C++, Gandiva
