Skip to content

Add an Order Preserving merge operator #362

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In an attempt to unlock sort based optimizations in DataFusion (and IOx) often we want to take several input streams and merge them together so that the output is still sorted in the same way.

One major usecase we have in IOx is that we will have data in several streams (e.g. parquet files, or in memory) that are already sorted and we want to merge those streams together into a single stream with the same sort order

As another example, it would be really nice to have a repartitioned sort operation (so that we are able to sort on partitions of an input in parallel) but there is currently no way to combine several sorted streams together.

Also, to implement something like a parallel sort-merge-join, #141, again having the ability to repartition / merge / retain sort is likely important.

Describe the solution you'd like
I would like a SortPreservingMerge operator that takes as input a list of SortExprs and some number of input streams. The input streams are guaranteed to be sorted according to the sort exprs.

When the operator is run, it produces a output stream that is also sorted on SortExprs

Describe alternatives you've considered
TBD

Additional context
I think @tustvold plans to implement an operator similar to this in IOx. Depending on how that goes, we will contemplate putting it into DataFusion

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions