take kernel that works across multiple RecordBatches #1523
Labels
arrow: Changes to the arrow crate
enhancement: Any new improvement worthy of an entry in the changelog
performance
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For several operations in data processing, it is important to be able to select some subset of rows (for example, when sorting or filtering).
For example, the current take kernel works like this:
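As a simplified model (plain Rust slices rather than Arrow arrays, and ignoring nulls), `take` gathers the values found at the requested indices of a single array. The function below is an illustrative sketch under those assumptions, not the arrow crate's actual signature:

```rust
/// Simplified model of a `take` kernel over a single array:
/// returns `values[i]` for each `i` in `indices`, in the order given.
/// Indices may repeat; an out-of-bounds index panics in this sketch.
fn take<T: Clone>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i].clone()).collect()
}
```

With `values = ["a", "b", "c", "d"]` and `indices = [3, 0, 0, 2]`, this returns `["d", "a", "a", "c"]`. The key limitation is that every index refers to the same source array.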
In DataFusion, our operators receive multiple record batches at a time, and we would like to perform operations such as sorting across them without first combining them into a single record batch.
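To make the sorting case concrete, here is a minimal sketch of how a sort order could be computed across several batches without concatenating them. Each batch is modeled as a plain `Vec<i32>` holding the sort key column, and `sorted_row_locations` is a hypothetical helper name, not an existing DataFusion or arrow function:

```rust
/// Computes the global sort order of rows spread over several batches.
/// `batches[b][r]` is the sort key of row `r` in batch `b`.
/// The result lists (batch_index, row_index) pairs in ascending key order.
fn sorted_row_locations(batches: &[Vec<i32>]) -> Vec<(usize, usize)> {
    // Enumerate every row location across all batches.
    let mut locations: Vec<(usize, usize)> = batches
        .iter()
        .enumerate()
        .flat_map(|(b, keys)| (0..keys.len()).map(move |r| (b, r)))
        .collect();
    // Stable sort by key, so equal keys keep their (batch, row) order.
    locations.sort_by_key(|&(b, r)| batches[b][r]);
    locations
}
```

The output is a list of row locations rather than data. Materializing the sorted columns from those locations is the part a single-array `take` cannot do directly.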
Describe the solution you'd like
I would like a function, something like batch_take, that takes a vector of RecordBatches and a list of (record_batch_index, offset_in_the_record_batch) tuples and produces the resulting array.
Over time I would expect these to become optimized in the same way as we have optimized the take kernel.
This will come up in Grouping and Join operators as well.
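A minimal sketch of the proposed semantics, for one column at a time, might look like the following. It assumes each batch's column is a plain `Vec<T>` rather than an Arrow array and ignores nulls. The name `batch_take` comes from the request above, but this signature is only an assumption for illustration:

```rust
/// Hypothetical `batch_take` over one column.
/// `batches[b][r]` is the value at row `r` of batch `b`; each index is a
/// (record_batch_index, offset_in_the_record_batch) tuple.
/// Out-of-bounds indices panic in this sketch.
fn batch_take<T: Clone>(batches: &[Vec<T>], indices: &[(usize, usize)]) -> Vec<T> {
    indices
        .iter()
        .map(|&(batch, row)| batches[batch][row].clone())
        .collect()
}
```

For batches `["a", "b"]` and `["c", "d", "e"]` with indices `[(1, 2), (0, 0), (1, 0), (0, 1)]`, the result is `["e", "a", "c", "b"]`. A real kernel would presumably dispatch on the Arrow data type and build the output buffer directly, as the existing take kernel does.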
Describe alternatives you've considered
There are two more features that @yjshen added in apache/datafusion#2132 that we might contemplate:
(record_batch_index, offset_in_the_record_batch, num_records) to optimize the common case of copying multiple rows from each source batch.
Additional context
This came up while @yjshen was implementing a more memory-efficient sort in DataFusion (apache/datafusion#2132), and was suggested by @Dandandan in apache/datafusion#2132 (comment).
We can probably move a bunch of the implementation from that PR to this one.