Optimize regexp_replace when the input is a sparse array
#3804
Conversation
alamb
left a comment
Code looks good to me -- I suggest one more test but otherwise 👍
    string_array
        .data_ref()
        .null_buffer()
        .map(|b| b.bit_slice(string_array.offset(), string_array.len())),
👍 for handling the offset
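To illustrate why the offset matters here, below is a minimal, std-only sketch (not the arrow-rs API; `bit_slice` here is a hand-rolled stand-in): a sliced array shares its parent's validity bitmap, so the null buffer must be re-sliced starting at the array's offset before it can be reused as a 0-aligned bitmap.

```rust
/// Copy `len` bits starting at bit `offset` into a fresh, 0-aligned bitmap
/// (LSB-first within each byte, matching Arrow's validity-bitmap layout).
fn bit_slice(bits: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = offset + i;
        if bits[src / 8] & (1 << (src % 8)) != 0 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    // Validity bitmap for 8 values: bit i describes value i.
    let bits = [0b1011_0101u8];
    // A slice covering values 2..6 (offset 2, len 4): bits 2,3,4,5 = 1,0,1,1.
    let sliced = bit_slice(&bits, 2, 4);
    assert_eq!(sliced, vec![0b0000_1101]);
}
```

Reusing the parent bitmap without this re-slicing would attribute the wrong validity bits to the wrong values whenever `offset() != 0`.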
Thanks @isidentical

@alamb thanks again for the review, I've added it as a separate test and it seems to work perfectly!
alamb
left a comment
Thanks @isidentical
Benchmark runs are scheduled for baseline = 37fe938 and contender = d5c361b. d5c361b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3803.
Rationale for this change
This is an optimization that arrow-rs already applies heavily, though there it is combined with the unsafe array-building APIs, which this PR deliberately avoids (reasoning is at the end).
It doesn't provide as much speed-up as the other optimizations for regex_replace, but it seems like a relatively straightforward change for a decent gain (see the Benchmarks section). It is also the final optimization candidate (at least localized to regex_replace itself), if I am not missing anything.
What changes are included in this PR?
Instead of rebuilding the string array from the iterator, we now build the underlying array data ourselves, which lets us reuse the input's existing null buffers. In the generic version this couldn't be done without recombining the underlying bitmaps of all the inputs, but since this is the specialized case we know for a fact that the only array input is the strings (the first argument).
Are there any user-facing changes?
This is an optimization.
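The change above can be sketched without any Arrow dependency. A purely illustrative, std-only model (the names and signature are hypothetical, not the DataFusion code): keep the input's validity as-is and run the replacement only over non-null slots, rather than rebuilding nullability from an iterator of `Option`s.

```rust
/// Illustrative stand-in for the specialized path: `validity` models the
/// reused null buffer, and plain `str::replace` stands in for the regex
/// replacement. The returned validity is the input's, untouched.
fn replace_sparse(
    values: &[&str],
    validity: &[bool],
    pattern: &str,
    replacement: &str,
) -> (Vec<String>, Vec<bool>) {
    let out = values
        .iter()
        .zip(validity)
        .map(|(v, &valid)| {
            if valid {
                v.replace(pattern, replacement) // only touch non-null slots
            } else {
                String::new() // null slot: value is never read
            }
        })
        .collect();
    (out, validity.to_vec()) // validity is reused, not recomputed
}

fn main() {
    let (out, validity) =
        replace_sparse(&["foo", "", "foobar"], &[true, false, true], "foo", "X");
    assert_eq!(out, vec!["X", "", "Xbar"]);
    assert_eq!(validity, vec![true, false, true]);
}
```

The sparser the input, the more replacement work this skips, which matches the benchmark behavior described below.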
Benchmarks
As expected, there is no speed-up (nor any slowdown, which is itself worth noting) when the input array is very dense with data (few NULLs). But depending on the input's null density, we see an average speed-up of 20%.
(Built in release mode, for a variation of the query/dataset in my example PR.)
I have also included a column indicating the speed-ups when we use the unsafe version, but I am not sure it makes sense to increase the number of 'unsafe' spots in DataFusion for a modest speed-up (even if, from a soundness perspective, it is safe). I'd be happy to change this PR to use the unsafe version (which looks like this) if that is preferred.
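For context on the trade-off being weighed, here is a std-only illustration of the general shape of such "unsafe" builder paths (this is not the arrow-rs API, just the underlying idea): when the exact output length is known up front, per-element capacity checks can be skipped with raw writes, at the cost of carrying safety invariants by hand.

```rust
/// Safe variant: every push goes through a checked, growable API.
fn build_checked(n: usize) -> Vec<u32> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n as u32 {
        v.push(i); // capacity check on every push
    }
    v
}

/// Unsafe variant: reserve once, write raw, set the length at the end.
fn build_unchecked(n: usize) -> Vec<u32> {
    let mut v: Vec<u32> = Vec::with_capacity(n);
    let ptr = v.as_mut_ptr();
    for i in 0..n {
        // SAFETY: i < n, and capacity for exactly n elements was reserved.
        unsafe { ptr.add(i).write(i as u32) };
    }
    // SAFETY: all n elements were initialized in the loop above.
    unsafe { v.set_len(n) };
    v
}

fn main() {
    assert_eq!(build_checked(5), build_unchecked(5));
}
```

Both produce the same result; the unsafe version trades two manually-upheld invariants for skipping the per-element checks, which is exactly the maintenance cost the paragraph above is weighing against the extra speed-up.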