Improve the performance of ltrim/rtrim/btrim #10006

JasonLi-cn · 2024-04-09T08:30:36Z

Which issue does this PR close?

Rationale for this change

When a trim function is called with a second argument, I believe that argument is almost always a Scalar rather than an Array. Expanding it into an Array degrades performance and, more critically, means arg.clone().into_array(expansion_len) is invoked on every computation.
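The cost described in the rationale can be illustrated with a minimal, standalone sketch. It does not use the Arrow or DataFusion types: a slice of `Option<&str>` stands in for a string array, and the function names `ltrim_expanded` and `ltrim_scalar` are hypothetical, chosen only for this illustration. The first function copies the trim characters once per row, roughly as expanding a scalar into an array would; the second reuses the single value.

```rust
/// Remove every leading character of `input` that appears in `characters`.
fn ltrim<'a>(input: &'a str, characters: &str) -> &'a str {
    input.trim_start_matches(|c: char| characters.contains(c))
}

/// Assumed "before" shape: the scalar is expanded to one owned copy per row
/// and then zipped with the input. The per-row copies are the avoidable cost.
fn ltrim_expanded<'a>(strings: &[Option<&'a str>], characters: &str) -> Vec<Option<&'a str>> {
    let expanded: Vec<String> = vec![characters.to_owned(); strings.len()];
    strings
        .iter()
        .copied()
        .zip(expanded.iter())
        .map(|(s, c)| s.map(|s| ltrim(s, c)))
        .collect()
}

/// Assumed "after" shape: the single value is borrowed for every row,
/// so nothing is allocated for the second argument.
fn ltrim_scalar<'a>(strings: &[Option<&'a str>], characters: &str) -> Vec<Option<&'a str>> {
    strings
        .iter()
        .map(|s| s.map(|s| ltrim(s, characters)))
        .collect()
}

fn main() {
    let strings = vec![Some("xxabc"), None, Some("abc")];
    let slow = ltrim_expanded(&strings, "x");
    let fast = ltrim_scalar(&strings, "x");
    // Both paths must agree; only the amount of work differs.
    assert_eq!(slow, fast);
    println!("{:?}", fast);
}
```

Both functions return the same result for any input; the sketch only shows where the extra per-row work comes from, not how much time it costs in DataFusion itself.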

Benchmark

Gnuplot not found, using plotters backend
ltrim ": 1024           time:   [23.495 µs 23.520 µs 23.554 µs]
                        change: [-11.992% -11.922% -11.854%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

ltrim ": 4096           time:   [92.348 µs 92.489 µs 92.669 µs]
                        change: [-10.305% -10.123% -9.9762%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe

ltrim ": 8192           time:   [189.78 µs 190.59 µs 191.57 µs]
                        change: [-6.1871% -5.8626% -5.5516%] (p = 0.00 < 0.05)
                        Performance has improved.

ltrim Header:: 1024     time:   [80.256 µs 80.276 µs 80.300 µs]
                        change: [-7.5562% -7.0325% -6.6364%] (p = 0.00 < 0.05)
                        Performance has improved.

ltrim Header:: 4096     time:   [318.94 µs 319.04 µs 319.15 µs]
                        change: [-5.5723% -5.4562% -5.3322%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

ltrim Header:: 8192     time:   [643.04 µs 643.69 µs 644.34 µs]
                        change: [-4.9289% -4.7291% -4.5327%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Omega359 · 2024-04-09T15:43:17Z

Nice! It would be a nice addition if the benchmark was expanded to cover btrim and rtrim as well

andygrove · 2024-04-09T22:46:47Z

datafusion/functions/src/string/common.rs

@@ -78,6 +80,19 @@ pub(crate) fn general_trim<T: OffsetSizeTrait>(
        2 => {
            let characters_array = as_generic_string_array::<T>(&args[1])?;

+            if characters_array.len() == 1 {
+                if characters_array.is_null(0) {
+                    return Ok(new_null_array(args[0].data_type(), args[0].len()));


This looks like new behavior for null handling? Do we have existing unit tests for this case or can we add a new test as part of this PR?

This is not new behavior. I added the characters_array.is_null(0) check because a test initially failed with the following error:

    ...
    External error: query result mismatch:
    [SQL] SELECT btrim(' xyxtrimyyx ', NULL)
    [Diff] (-expected|+actual)
    - NULL
    + xyxtrimyyx
    at test_files/expr.slt:373
    ...
    note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
    error: test failed, to rerun pass `-p datafusion-sqllogictest --test sqllogictests`

To provide additional context, this logic is consistent with the _ => None here:

    let result = string_array
        .iter()
        .zip(characters_array.iter())
        .map(|(string, characters)| match (string, characters) {
            (Some(string), Some(characters)) => Some(func(string, characters)),
            _ => None, // If characters is null, append None.
        })
        .collect::<GenericStringArray<T>>();

Thanks for the clarification @JasonLi-cn
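The null handling discussed in this review thread can be sketched without Arrow. In the sketch below, slices of `Option<&str>` stand in for the two string arrays, and `general_trim_sketch` is a hypothetical name, not the function in common.rs. It mirrors the idea only: a single-element characters argument is treated as a scalar, and a null scalar yields an all-null result, which matches what the row-by-row `_ => None` arm would produce.

```rust
/// Remove characters found in `characters` from both ends of `input`.
fn btrim<'a>(input: &'a str, characters: &str) -> &'a str {
    input.trim_matches(|c: char| characters.contains(c))
}

/// Simplified stand-in for the two-argument trim path (an assumption,
/// not the DataFusion implementation).
fn general_trim_sketch<'a>(
    strings: &[Option<&'a str>],
    characters: &[Option<&str>],
) -> Vec<Option<&'a str>> {
    // Scalar fast path: one characters value applies to every row.
    if characters.len() == 1 {
        return match characters[0] {
            // A null scalar makes every output row null.
            None => vec![None; strings.len()],
            Some(c) => strings.iter().map(|s| s.map(|s| btrim(s, c))).collect(),
        };
    }
    // General path: pair each string with its own characters value.
    strings
        .iter()
        .zip(characters.iter())
        .map(|(s, c)| match (*s, *c) {
            (Some(s), Some(c)) => Some(btrim(s, c)),
            _ => None, // Null string or null characters gives a null row.
        })
        .collect()
}

fn main() {
    // Same shape as the failing query: btrim(' xyxtrimyyx ', NULL).
    let result = general_trim_sketch(&[Some(" xyxtrimyyx ")], &[None]);
    assert_eq!(result, vec![None::<&str>]);
    println!("{:?}", result);
}
```

Without the null check in the fast path, a null scalar would have to be handled some other way, and the sqllogictest output quoted above shows the input coming back untrimmed instead of NULL.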

JasonLi-cn · 2024-04-10T02:09:41Z

> Nice! It would be a nice addition if the benchmark was expanded to cover btrim and rtrim as well

Thank you @Omega359 for your suggestion. I still need to ask @alamb whether it is necessary to add benchmarks for btrim/rtrim.

alamb · 2024-04-10T16:55:44Z

> > Nice! It would be a nice addition if the benchmark was expanded to cover btrim and rtrim as well
>
> Thank you @Omega359 for your suggestion. I still need to ask @alamb whether it is necessary to add benchmarks for btrim/rtrim.

It is not necessary, though it would be nice, as @Omega359 said. We can also do it as a follow-on PR. Thanks again @JasonLi-cn

alamb · 2024-04-10T16:55:54Z

Thanks @Omega359 and @andygrove for the reviews!

optimize trim function

c523c0c

JasonLi-cn changed the title ~~optimize trim function~~ Improve the performance of ltrim/rtrim/btrim Apr 9, 2024

fix: the second arg is NULL

7c84228

alamb mentioned this pull request Apr 9, 2024 in DataFusion weekly project plan (Andrew Lamb) - April 8, 2024 #10002 (closed, 9 tasks)

andygrove reviewed Apr 9, 2024

JasonLi-cn closed this Apr 10, 2024

JasonLi-cn reopened this Apr 10, 2024

andygrove approved these changes Apr 10, 2024

alamb merged commit fdb2d57 into apache:main Apr 10, 2024
49 checks passed
