[C++][Gandiva] Optimize LPAD/RPAD functions: fix memory safety issue and improve performance #49438

@dmitry-chirkov-dremio

Describe the enhancement requested

The lpad_utf8_int32_utf8 and rpad_utf8_int32_utf8 functions have a memory safety issue and two performance inefficiencies.

Memory safety issue:

When the fill string is longer than the padding space needed, the initial memcpy copies the whole fill string and so writes more bytes than were allocated for the output, causing a buffer overflow.
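To make the overflow condition concrete, here is a minimal sketch of a clamped initial copy. It is not the Gandiva source: `copy_initial_fill` is a hypothetical helper, the parameter names are borrowed from the proposed fix below, and the sketch works in bytes, assuming `total_fill_bytes` already falls on a UTF-8 character boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (illustration only, not the Gandiva implementation).
// Copies the first chunk of the fill pattern into dst without ever writing
// more than total_fill_bytes. Returns the number of bytes written.
inline int32_t copy_initial_fill(char* dst, int32_t total_fill_bytes,
                                 const char* fill_text, int32_t fill_text_len) {
  assert(total_fill_bytes >= 0 && fill_text_len >= 0);
  // An unclamped memcpy(dst, fill_text, fill_text_len) would overrun dst
  // whenever fill_text_len > total_fill_bytes, so clamp the length first.
  const int32_t n = std::min(fill_text_len, total_fill_bytes);
  std::memcpy(dst, fill_text, static_cast<size_t>(n));
  return n;
}
```

With a 6-byte fill string and only 2 bytes of padding space, the helper writes 2 bytes and leaves the rest of the buffer untouched.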

Performance issues:

  1. Single-byte fill: Iterates character-by-character even for single-byte fills like space padding, when a single memset call would suffice.

  2. Multi-byte fill: Copies the fill pattern character by character, taking O(n) iterations for n fill characters, instead of using a doubling strategy that needs only O(log n) memcpy calls.

Proposed fixes:

  1. Use std::min(fill_text_len, total_fill_bytes) for the initial copy to prevent overflow
  2. Add single-byte fill fast path using memset
  3. Replace character-by-character loop with doubling strategy for multi-byte fills
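The three fixes could fit together roughly as in the sketch below. This is an illustration under stated assumptions, not the actual patch: `fill_padding` and `lpad_bytes` are hypothetical names, and the code counts bytes, whereas the real functions count UTF-8 characters and write into Gandiva-managed buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: fills dst[0, total_fill_bytes) with repetitions of
// fill_text, truncating the final repetition if it does not fit.
inline void fill_padding(char* dst, int32_t total_fill_bytes,
                         const char* fill_text, int32_t fill_text_len) {
  assert(total_fill_bytes >= 0 && fill_text_len >= 0);
  if (total_fill_bytes == 0 || fill_text_len == 0) return;

  // Fast path: a single-byte fill (e.g. space padding) is one memset.
  if (fill_text_len == 1) {
    std::memset(dst, static_cast<unsigned char>(fill_text[0]),
                static_cast<size_t>(total_fill_bytes));
    return;
  }

  // Clamped initial copy: never write past the padding region.
  int32_t filled = std::min(fill_text_len, total_fill_bytes);
  std::memcpy(dst, fill_text, static_cast<size_t>(filled));

  // Doubling: copy the already-filled prefix onto the region right after it.
  // filled stays a multiple of fill_text_len, so the pattern stays aligned,
  // and chunk <= filled guarantees source and destination do not overlap.
  while (filled < total_fill_bytes) {
    const int32_t chunk = std::min(filled, total_fill_bytes - filled);
    std::memcpy(dst + filled, dst, static_cast<size_t>(chunk));
    filled += chunk;
  }
}

// Byte-oriented usage example (hypothetical): left-pads text to target_len
// bytes, truncating text when it is already longer than target_len.
inline std::string lpad_bytes(const std::string& text, int32_t target_len,
                              const std::string& fill) {
  const int32_t text_len = static_cast<int32_t>(text.size());
  if (target_len <= text_len || fill.empty()) {
    return text.substr(0, static_cast<size_t>(std::max(target_len, 0)));
  }
  const int32_t pad = target_len - text_len;
  std::string out(static_cast<size_t>(target_len), '\0');
  fill_padding(&out[0], pad, fill.data(), static_cast<int32_t>(fill.size()));
  std::memcpy(&out[pad], text.data(), text.size());
  return out;
}
```

Because each pass doubles the filled prefix, the loop issues a number of memcpy calls that grows logarithmically with the number of pattern repetitions, rather than one iteration per fill character.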

Component(s)

C++, Gandiva
