Describe the bug, including details regarding any error messages, version, and platform.
utf8_length_ignore_invalid in cpp/src/gandiva/precompiled/string_ops.cc counts glyphs while tolerating invalid bytes. When it hits a byte that is not a valid continuation byte it extends char_len:
    for (int j = 1; j < char_len; ++j) {
      if ((data[i + j] & 0xC0) != 0x80) {
        char_len += 1;
      }
    }
The inner loop keeps reading data[i + j] as char_len grows, but it never rechecks i + j against data_len. Each non-continuation byte raises char_len by one, so the loop bound advances as fast as j does. If a value ends in a malformed multi-byte sequence (for example a 0xF0 lead byte followed only by non-continuation bytes), char_len grows past the remaining bytes and the loop reads past data_len.
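To make the index arithmetic concrete, here is a minimal sketch that replays only the inner loop quoted above, with every index checked before use so that it is safe to run. The names lead_byte_length and first_out_of_bounds_index are hypothetical and are not part of Gandiva. Taking the initial char_len directly from the lead byte is an assumption, and anything the real function does before the inner loop is not modelled.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a UTF-8 lead-byte classifier (not Gandiva's helper):
// returns the sequence length implied by the lead byte, or 0 if it is not one.
static int lead_byte_length(uint8_t c) {
  if ((c & 0x80) == 0x00) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

// Replays the inner loop from the report starting at offset i, but checks each
// index before touching it. Returns the first index at or past data.size() that
// the unchecked loop would read, or -1 if the loop stays inside the buffer.
// Assumption: i is a valid offset and char_len starts at the lead-byte length.
static int first_out_of_bounds_index(const std::vector<uint8_t>& data, int i) {
  const int data_len = static_cast<int>(data.size());
  int char_len = lead_byte_length(data[i]);
  for (int j = 1; j < char_len; ++j) {
    if (i + j >= data_len) {
      return i + j;  // the unchecked loop would read data[i + j] here
    }
    if ((data[i + j] & 0xC0) != 0x80) {
      char_len += 1;  // same extension step as the quoted loop
    }
  }
  return -1;
}
```

For the input {0xF0, 'a', 'a', 'a'} with i = 0, char_len starts at 4 and grows to 7 while j walks over the three 'a' bytes, so the next read would be at index 4, one byte past a 4-byte buffer.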
The function is reached from untrusted string data through lpad/rpad (lpad_utf8_int32_utf8, rpad_utf8_int32_utf8), which call it on the input text before computing padding.
Reproduced against a verbatim copy of the function under AddressSanitizer with the 4-byte input {0xF0, 'a', 'a', 'a'} in an exactly-sized heap buffer:
==ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 1 ... 0 bytes after 4-byte region
#0 utf8_length_ignore_invalid
The sibling helpers (utf8_length, reverse_utf8, utf8_byte_pos) already guard the equivalent access with i + char_len > data_len; this one does not.
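One possible shape for a fix is sketched below, as an illustration rather than the patch the project should or did adopt. It adds an i + j < data_len bound to the inner loop so that extending char_len can never drive a read past the buffer. The function names, the outer loop, and the lead-byte helper are assumptions written for this example; only the inner-loop condition and the i + char_len > data_len guard come from the report.

```cpp
#include <cstdint>

// Hypothetical lead-byte classifier for this sketch (not Gandiva's helper):
// returns the sequence length implied by the lead byte, or 0 if it is not one.
static int utf8_lead_len(unsigned char c) {
  if ((c & 0x80) == 0x00) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

// Hypothetical bounded variant of the glyph counter. The outer loop is an
// assumed reconstruction; the inner loop is the one quoted in the report with
// an added bound on i + j.
static int utf8_length_ignore_invalid_bounded(const char* data, int data_len) {
  int count = 0;
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = utf8_lead_len(static_cast<unsigned char>(data[i]));
    if (char_len == 0 || i + char_len > data_len) {
      char_len = 1;  // invalid lead byte or incomplete glyph: consume one byte
    }
    // Stop scanning as soon as the next byte would lie outside the buffer.
    for (int j = 1; j < char_len && i + j < data_len; ++j) {
      if ((static_cast<unsigned char>(data[i + j]) & 0xC0) != 0x80) {
        char_len += 1;
      }
    }
    ++count;
  }
  return count;
}
```

With this bound the reported input {0xF0, 'a', 'a', 'a'} is consumed as a single glyph without reading index 4. Whether that count is the desired result for lpad/rpad is a separate question that the sketch does not settle.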
Component(s)
C++, Gandiva