-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In an attempt to unlock sort based optimizations in DataFusion (and IOx) often we want to take several input streams and merge them together so that the output is still sorted in the same way.
One major usecase we have in IOx is that we will have data in several streams (e.g. parquet files, or in memory) that are already sorted and we want to merge those streams together into a single stream with the same sort order
As another example, it would be really nice to have a repartitioned sort operation (so that we are able to sort on partitions of an input in parallel) but there is currently no way to combine several sorted streams together.
Also, to implement something like a parallel sort-merge-join, #141, again having the ability to repartition / merge / retain sort is likely important.
Describe the solution you'd like
I would like a SortPreservingMerge operator that takes as input a list of SortExprs and some number of input streams. The input streams are guaranteed to be sorted according to the sort exprs.
When the operator is run, it produces a output stream that is also sorted on SortExprs
Describe alternatives you've considered
TBD
Additional context
I think @tustvold plans to implement an operator similar to this in IOx. Depending on how that goes, we will contemplate putting it into DataFusion