add lexicographically partition points and ranges #424

jimexist · 2021-06-08T04:22:16Z

Which issue does this PR close?

Closes #428

Rationale for this change

To support ORDER BY and PARTITION BY within window functions, we need to find the partition points of columns that are already sorted lexicographically. With binary search, each partition boundary can be found in O(log(n)) time. This PR uses the lexicographical comparator extracted in #423 and adds two functions, lexicographical_partition_points and lexicographical_partition_ranges.
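As a rough illustration of the idea (not the code in this PR), the sketch below finds partition points on a plain sorted slice instead of Arrow columns. The names `partition_points` and `partition_ranges` are hypothetical, and `T: Ord` stands in for the lexicographical comparator; only `slice::partition_point` from the standard library is assumed.

```rust
use std::ops::Range;

/// Returns the partition boundaries of an already sorted slice.
/// For k distinct values the result has k + 1 entries, starting at 0 and
/// ending at `sorted.len()`. An empty input yields an empty vector.
fn partition_points<T: Ord>(sorted: &[T]) -> Vec<usize> {
    let mut points = Vec::new();
    if sorted.is_empty() {
        return points;
    }
    let mut start = 0;
    points.push(start);
    while start < sorted.len() {
        let current = &sorted[start];
        // Binary search the remaining suffix for the first element that is
        // greater than `current`: O(log n) per boundary.
        start += sorted[start..].partition_point(|v| v <= current);
        points.push(start);
    }
    points
}

/// Turns consecutive partition points into half-open ranges.
fn partition_ranges<T: Ord>(sorted: &[T]) -> Vec<Range<usize>> {
    partition_points(sorted)
        .windows(2)
        .map(|w| w[0]..w[1])
        .collect()
}
```

For the sorted rows `(1, "a"), (1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "a")`, this sketch gives the points `[0, 2, 3, 6]` and the ranges `0..2`, `2..3`, `3..6`. With k partitions the total cost is O(k log(n)), so it pays off most when cardinality is low.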

What changes are included in this PR?

                        time:   [20.391 us 20.433 us 20.490 us]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

lexicographical_partition_ranges(u8) 2^12
                        time:   [28.129 us 28.181 us 28.254 us]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

lexicographical_partition_ranges(u8) 2^10 with nulls
                        time:   [18.841 us 18.889 us 18.945 us]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

lexicographical_partition_ranges(u8) 2^12 with nulls
                        time:   [29.202 us 29.293 us 29.402 us]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

lexicographical_partition_ranges(f64) 2^10
                        time:   [67.934 us 68.701 us 69.617 us]
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

lexicographical_partition_ranges(low cardinality) 1024
                        time:   [1.2215 us 1.2272 us 1.2336 us]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high severe

Are there any user-facing changes?

arrow/src/compute/kernels/partition.rs

jorgecarleitao

Looks great. I left an optional comment; it can be moved to a separate PR 👍

jorgecarleitao · 2021-06-08T16:54:27Z

arrow/src/compute/kernels/partition.rs

+///
+/// The returned vec would be of size k+1 where k is cardinality of the sorted values; the first and
+/// last value would be 0 and n.
+fn lexicographical_partition_points(columns: &[SortColumn]) -> Result<Vec<usize>> {


It should be possible to write the below as an iterator, thereby avoiding the allocation of the vector.

The general idea: create a struct with a method try_new. That method initializes the struct's content and errors if something is wrong (e.g. empty columns). It also initializes lexicographical_comparator.

Next, implement Iterator<Item=usize> for that struct where Iterator::next yields previous_partition_point and increments it as written in this function.

tracked in #437
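A minimal sketch of the iterator idea from this thread, under stated assumptions: `PartitionPoints` is a hypothetical name, plain `Vec<i64>` columns stand in for `SortColumn`, and a simple row comparison stands in for the lexicographical comparator. It is not the implementation that landed in #438.

```rust
use std::cmp::Ordering;

/// Hypothetical iterator over the partition points of lexicographically
/// sorted columns; `Vec<i64>` columns are a stand-in for Arrow's `SortColumn`.
pub struct PartitionPoints<'a> {
    columns: &'a [Vec<i64>],
    num_rows: usize,
    /// Next point to yield; `None` once the final point (`num_rows`) was yielded.
    next_point: Option<usize>,
}

impl<'a> PartitionPoints<'a> {
    /// Validates the input up front, so iteration itself cannot fail.
    pub fn try_new(columns: &'a [Vec<i64>]) -> Result<Self, String> {
        let first = columns
            .first()
            .ok_or_else(|| "no columns given".to_string())?;
        let num_rows = first.len();
        if columns.iter().any(|c| c.len() != num_rows) {
            return Err("columns have different lengths".to_string());
        }
        // Zero rows means there are no partition points to yield.
        let next_point = if num_rows == 0 { None } else { Some(0) };
        Ok(Self { columns, num_rows, next_point })
    }

    /// Compares rows `i` and `j` column by column.
    fn compare(&self, i: usize, j: usize) -> Ordering {
        self.columns
            .iter()
            .map(|c| c[i].cmp(&c[j]))
            .find(|o| *o != Ordering::Equal)
            .unwrap_or(Ordering::Equal)
    }
}

impl<'a> Iterator for PartitionPoints<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        let current = self.next_point?;
        self.next_point = if current < self.num_rows {
            // Binary search [current, num_rows) for the first row that is
            // greater than row `current`.
            let (mut lo, mut hi) = (current, self.num_rows);
            while lo < hi {
                let mid = lo + (hi - lo) / 2;
                if self.compare(mid, current) != Ordering::Greater {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            Some(lo)
        } else {
            None
        };
        Some(current)
    }
}
```

Callers that still want a vector can `collect()` the iterator, while callers that only walk the points once avoid the allocation.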

alamb · 2021-06-08T21:04:08Z

I filed #428 to track this. Thanks @jimexist

alamb · 2021-06-08T21:08:26Z

BTW I hope to get this into Arrow 4.3 -- I plan to build a release candidate for that on Thursday or Friday this week and release early next week. Once it gets released it will then be available in DataFusion

jimexist mentioned this pull request Jun 8, 2021

refactor lexico sort for future code reuse #423

Merged

jorgecarleitao reviewed Jun 8, 2021

arrow/src/compute/kernels/partition.rs (outdated)

jimexist force-pushed the add-partition branch from c4166d2 to 1e4c160 Compare June 8, 2021 06:21

jimexist changed the title from "WIP add lexicographically partition points" to "add lexicographically partition points and ranges" Jun 8, 2021

jimexist marked this pull request as ready for review June 8, 2021 06:25

jimexist force-pushed the add-partition branch from bea9429 to d52ab16 Compare June 8, 2021 07:38

jimexist requested a review from jorgecarleitao June 8, 2021 09:54

jorgecarleitao approved these changes Jun 8, 2021

refactor lexico sort

14ea178

jimexist force-pushed the add-partition branch from 044d522 to 14ea178 Compare June 9, 2021 00:57

This was referenced Jun 9, 2021

migrate partition kernel to use Iterator trait #437

Closed

use iterator for partition kernel instead of generating vec #438

Merged

alamb merged commit 0c00776 into apache:master Jun 9, 2021

alamb pushed a commit that referenced this pull request Jun 9, 2021

refactor lexico sort (#424)

48bfcdd

alamb added the cherry-picked label Jun 9, 2021

alamb mentioned this pull request Jun 9, 2021

Cherry pick add lexicographically partition points and ranges to active_release #441

Merged

jimexist deleted the add-partition branch June 9, 2021 23:45

alamb pushed a commit that referenced this pull request Jun 10, 2021

refactor lexico sort (#424)

11f2087

alamb added a commit that referenced this pull request Jun 10, 2021

refactor lexico sort (#424) (#441)

a7656a8

Co-authored-by: Jiayu Liu <Jimexist@users.noreply.github.com>

alamb mentioned this pull request Jun 10, 2021

Add changelog and bump version for proposed 4.3.0 release #444

Merged

alamb added the arrow and enhancement labels Jul 14, 2021
