Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow Row Format in SortExec #5230 #5242

Closed
wants to merge 0 commits into from
Closed

Conversation

jaylmiller
Copy link
Contributor

@jaylmiller jaylmiller commented Feb 10, 2023

Which issue does this PR close?

issue #5230

Rationale for this change

Modifying SortExec to use arrow row format should increase performance when sorting by multiple columns. Row encoding is preserved so that it can be used by parent nodes (right now that is just SortPreservingMergeExec).

What changes are included in this PR?

Both iterations of:

A first iteration could simply modify sort_batch to use the row format for multi-column sorts, as demonstrated here, falling back to sort_to_indices if only a single column.

A second iteration could then look to find a way to convert to the row format once, and preserve this encoding when feeding sorted batches into SortPreservingMerge.

Are these changes tested?

Mostly covered by existing unit tests. Added unit tests for the new publicly exported structs.

Are there any user-facing changes?

Some public exports had to be added to be able to be used in the integration test core/tests/sort_key_cursor.

@github-actions github-actions bot added the core Core datafusion crate label Feb 10, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jaylmiller -- this is looking good. Very nice first contribution

I think this code could almost go in as it -- the only thing that I think we need to see is some sort of benchmark improvement.

I looked around in our existing benchmarks to find a benchmark for our sort operator, but was not able to. Maybe @tustvold or @Dandandan or @andygrove know of a preexisting benchmark

If not, perhaps we can adapt some of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/benches/merge.rs to make a SortExec benchmark

@jaylmiller jaylmiller marked this pull request as draft February 11, 2023 16:32
@jaylmiller jaylmiller changed the title Draft: Arrow Row Format in SortExec #5230 Arrow Row Format in SortExec #5230 Feb 11, 2023
@jaylmiller jaylmiller changed the title Arrow Row Format in SortExec #5230 v1 Arrow Row Format in SortExec #5230 Feb 12, 2023
@jaylmiller jaylmiller marked this pull request as ready for review February 12, 2023 14:27
@jaylmiller
Copy link
Contributor Author

Thanks @jaylmiller -- this is looking good. Very nice first contribution

I think this code could almost go in as it -- the only thing that I think we need to see is some sort of benchmark improvement.

I looked around in our existing benchmarks to find a benchmark for our sort operator, but was not able to. Maybe @tustvold or @Dandandan or @andygrove know of a preexisting benchmark

If not, perhaps we can adapt some of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/benches/merge.rs to make a SortExec benchmark

Thanks! Should I add said benchmark to this PR then?

@jaylmiller
Copy link
Contributor Author

@alamb Does this SortExec benchmark seem good?

@jaylmiller jaylmiller marked this pull request as draft February 14, 2023 19:24
@jaylmiller jaylmiller changed the title v1 Arrow Row Format in SortExec #5230 Arrow Row Format in SortExec #5230 Feb 14, 2023
@alamb
Copy link
Contributor

alamb commented Feb 14, 2023

@alamb Does this SortExec benchmark seem good?

Yes, I think it looks like a good start. Thank you @jaylmiller . Could you please create a new PR with just the sort benchmark (so we can merge that and use it as a way to compare the code on master and this PR)?

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants