Optimize regexp_replace when the input is a sparse array
#3804
Conversation
alamb
left a comment
Code looks good to me -- I suggest one more test but otherwise 👍
    string_array
        .data_ref()
        .null_buffer()
        .map(|b| b.bit_slice(string_array.offset(), string_array.len())),
👍 for handling the offset
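To illustrate why the offset matters here, below is a minimal, std-only sketch (not the arrow-rs API; `bit_slice` here is a hand-rolled stand-in): a sliced array shares its parent's validity bitmap, so the null buffer must be re-sliced starting at the array's offset before it can be reused as a 0-aligned bitmap.

```rust
/// Copy `len` bits starting at bit `offset` into a fresh, 0-aligned bitmap
/// (LSB-first within each byte, matching Arrow's validity-bitmap layout).
fn bit_slice(bits: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = offset + i;
        if bits[src / 8] & (1 << (src % 8)) != 0 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    // Validity bitmap for 8 values: bit i describes value i.
    let bits = [0b1011_0101u8];
    // A slice covering values 2..6 (offset 2, len 4): bits 2,3,4,5 = 1,0,1,1.
    let sliced = bit_slice(&bits, 2, 4);
    assert_eq!(sliced, vec![0b0000_1101]);
}
```

Reusing the parent bitmap without this re-slicing would attribute the wrong validity bits to the wrong values whenever `offset() != 0`.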
Thanks @isidentical

@alamb thanks again for the review, I've added it as a separate test and it seems to work perfectly!
alamb
left a comment
Thanks @isidentical
Benchmark runs are scheduled for baseline = 37fe938 and contender = d5c361b. d5c361b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3803.
Rationale for this change
This is an optimization that arrow-rs already applies heavily, though there it is combined with the unsafe array-building APIs, which this PR deliberately avoids (reasoning is at the end).
It doesn't provide as much speed-up as the other optimizations for regex_replace, but it seems like a relatively straightforward change for a decent gain (see the Benchmarks section). It is also the final optimization candidate (at least localized to regex_replace itself), if I am not missing anything.
What changes are included in this PR?
Instead of rebuilding the string array from the iterator, we now build the underlying array data ourselves, which lets us reuse the input's existing null buffers. In the generic version this couldn't be done without recombining the underlying bitmaps of all the inputs, but since this is the specialized case we know for a fact that the only array input is the strings (the first argument).
Are there any user-facing changes?
This is an optimization.
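The change above can be sketched without any Arrow dependency. A purely illustrative, std-only model (the names and signature are hypothetical, not the DataFusion code): keep the input's validity as-is and run the replacement only over non-null slots, rather than rebuilding nullability from an iterator of `Option`s.

```rust
/// Illustrative stand-in for the specialized path: `validity` models the
/// reused null buffer, and plain `str::replace` stands in for the regex
/// replacement. The returned validity is the input's, untouched.
fn replace_sparse(
    values: &[&str],
    validity: &[bool],
    pattern: &str,
    replacement: &str,
) -> (Vec<String>, Vec<bool>) {
    let out = values
        .iter()
        .zip(validity)
        .map(|(v, &valid)| {
            if valid {
                v.replace(pattern, replacement) // only touch non-null slots
            } else {
                String::new() // null slot: value is never read
            }
        })
        .collect();
    (out, validity.to_vec()) // validity is reused, not recomputed
}

fn main() {
    let (out, validity) =
        replace_sparse(&["foo", "", "foobar"], &[true, false, true], "foo", "X");
    assert_eq!(out, vec!["X", "", "Xbar"]);
    assert_eq!(validity, vec![true, false, true]);
}
```

The sparser the input, the more replacement work this skips, which matches the benchmark behavior described below.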
Benchmarks
As expected, there is no speed-up (nor any slowdown, which is itself worth noting) when the input array is very dense with data (few NULLs). But depending on the input's null density, we see an average speed-up of 20%.
(Built in release mode, for a variation of the query/dataset in my example PR.)
I have also included a column indicating the speed-ups when we use the unsafe version, but I am not sure it makes sense to increase the number of 'unsafe' spots in DataFusion for a modest speed-up (even if, from a soundness perspective, it is safe). I'd be happy to change this PR to use the unsafe version (which looks like this) if that is preferred.
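For context on the trade-off being weighed, here is a std-only illustration of the general shape of such "unsafe" builder paths (this is not the arrow-rs API, just the underlying idea): when the exact output length is known up front, per-element capacity checks can be skipped with raw writes, at the cost of carrying safety invariants by hand.

```rust
/// Safe variant: every push goes through a checked, growable API.
fn build_checked(n: usize) -> Vec<u32> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n as u32 {
        v.push(i); // capacity check on every push
    }
    v
}

/// Unsafe variant: reserve once, write raw, set the length at the end.
fn build_unchecked(n: usize) -> Vec<u32> {
    let mut v: Vec<u32> = Vec::with_capacity(n);
    let ptr = v.as_mut_ptr();
    for i in 0..n {
        // SAFETY: i < n, and capacity for exactly n elements was reserved.
        unsafe { ptr.add(i).write(i as u32) };
    }
    // SAFETY: all n elements were initialized in the loop above.
    unsafe { v.set_len(n) };
    v
}

fn main() {
    assert_eq!(build_checked(5), build_unchecked(5));
}
```

Both produce the same result; the unsafe version trades two manually-upheld invariants for skipping the per-element checks, which is exactly the maintenance cost the paragraph above is weighing against the extra speed-up.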