add lexicographically partition points and ranges #424

Merged
merged 1 commit into from
Jun 9, 2021

jimexist
Member

@jimexist jimexist commented Jun 8, 2021

Which issue does this PR close?

Closes #428

Rationale for this change

To support ORDER BY and PARTITION BY within window functions, we need to find the partition points of columns that are already sorted lexicographically. With binary search, each partition point can be located in O(log(n)) time. This PR uses the lexicographical comparator extracted in #423 and adds two functions.
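
To illustrate the idea, here is a minimal sketch of finding partition points with binary search. It assumes a single, already sorted slice and uses the standard library's `slice::partition_point`; the function name `partition_points` is hypothetical, and this is not the multi-column, comparator-based code added by this PR.

```rust
/// Returns the partition points of an already sorted slice.
///
/// Hypothetical single-column stand-in for illustration only; the PR itself
/// works on multiple sort columns through a lexicographical comparator.
fn partition_points<T: Ord>(sorted: &[T]) -> Vec<usize> {
    let n = sorted.len();
    // The first point is always 0.
    let mut points = vec![0];
    let mut start = 0;
    while start < n {
        let first = &sorted[start];
        // In sorted[start..], the values equal to `first` form a prefix,
        // so a binary search can find where that run ends.
        let run_len = sorted[start..].partition_point(|v| v == first);
        start += run_len;
        points.push(start);
    }
    points
}
```

For an input with k distinct values this sketch performs k binary searches, so it does roughly O(k log(n)) comparisons in total. For example, `partition_points(&[1, 1, 2, 3, 3, 3])` yields `[0, 2, 3, 6]`.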

What changes are included in this PR?

                        time:   [20.391 us 20.433 us 20.490 us]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

lexicographical_partition_ranges(u8) 2^12
                        time:   [28.129 us 28.181 us 28.254 us]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

lexicographical_partition_ranges(u8) 2^10 with nulls
                        time:   [18.841 us 18.889 us 18.945 us]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

lexicographical_partition_ranges(u8) 2^12 with nulls
                        time:   [29.202 us 29.293 us 29.402 us]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

lexicographical_partition_ranges(f64) 2^10
                        time:   [67.934 us 68.701 us 69.617 us]
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

lexicographical_partition_ranges(low cardinality) 1024
                        time:   [1.2215 us 1.2272 us 1.2336 us]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high severe

Are there any user-facing changes?

@jimexist jimexist changed the title from "WIP add lexicographically partition points" to "add lexicographically partition points and ranges" Jun 8, 2021
@jimexist jimexist marked this pull request as ready for review June 8, 2021 06:25
Member

@jorgecarleitao jorgecarleitao left a comment

Looks great. I left an optional comment; it can be moved to a separate PR 👍

///
/// The returned vec would be of size k+1 where k is cardinality of the sorted values; the first and
/// last value would be 0 and n.
fn lexicographical_partition_points(columns: &[SortColumn]) -> Result<Vec<usize>> {
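
The shape described in the doc comment (k+1 points, starting at 0 and ending at n) means that each pair of consecutive points delimits one partition. A small sketch of that conversion follows; the helper name `ranges_from_points` is hypothetical and is not taken from this PR.

```rust
use std::ops::Range;

/// Turns k+1 partition points into k half-open ranges.
///
/// Hypothetical helper for illustration; assumes `points` is non-decreasing
/// and shaped as described above (first value 0, last value n).
fn ranges_from_points(points: &[usize]) -> Vec<Range<usize>> {
    // Each adjacent pair (start, end) becomes the range start..end.
    points.windows(2).map(|w| w[0]..w[1]).collect()
}
```

With points `[0, 2, 3, 6]` this gives the ranges `0..2`, `2..3`, and `3..6`.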
Member

It should be possible to write the below as an iterator, thereby avoiding the allocation of the vector.

The general idea: create a struct with a method try_new. That method initializes the struct's content and errors if something is wrong (e.g. empty columns). It also initializes lexicographical_comparator.

Next, implement Iterator<Item=usize> for that struct where Iterator::next yields previous_partition_point and increments it as written in this function.
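
A rough sketch of what this could look like, assuming a single sorted slice in place of `&[SortColumn]` and leaving out the lexicographical comparator entirely. The struct name `PartitionPointIter`, its fields, and the `String` error type are all hypothetical choices made for illustration.

```rust
/// Hypothetical iterator that yields partition points lazily,
/// so no `Vec<usize>` has to be allocated up front.
struct PartitionPointIter<'a, T> {
    values: &'a [T],
    /// Next point to yield; `None` once the final point `n` has been yielded.
    next_point: Option<usize>,
}

impl<'a, T: Ord> PartitionPointIter<'a, T> {
    /// Validates the input once. In this sketch an empty slice stands in
    /// for the "empty columns" error case mentioned above.
    fn try_new(values: &'a [T]) -> Result<Self, String> {
        if values.is_empty() {
            return Err("input must not be empty".to_string());
        }
        Ok(Self { values, next_point: Some(0) })
    }
}

impl<'a, T: Ord> Iterator for PartitionPointIter<'a, T> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        // Yield the stored point and compute the one after it.
        let current = self.next_point?;
        self.next_point = if current < self.values.len() {
            let first = &self.values[current];
            // Binary search for the end of the run of equal values.
            let run_len = self.values[current..].partition_point(|v| v == first);
            Some(current + run_len)
        } else {
            // `current` was the final point (n); nothing left to yield.
            None
        };
        Some(current)
    }
}
```

Collecting the iterator reproduces the vector form, so callers that still want a `Vec<usize>` can call `.collect()` on it.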

Member Author

tracked in #437

@alamb
Contributor

alamb commented Jun 8, 2021

I filed #428 to track this. Thanks @jimexist

@alamb
Contributor

alamb commented Jun 8, 2021

BTW I hope to get this into Arrow 4.3 -- I plan to build a release candidate for that on Thursday or Friday this week and release early next week. Once it gets released it will then be available in DataFusion

@alamb alamb merged commit 0c00776 into apache:master Jun 9, 2021
alamb pushed a commit that referenced this pull request Jun 9, 2021
@alamb alamb added the cherry-picked (PR that was backported to active release; will be included in maintenance release) label Jun 9, 2021
@jimexist jimexist deleted the add-partition branch June 9, 2021 23:45
alamb pushed a commit that referenced this pull request Jun 10, 2021
alamb added a commit that referenced this pull request Jun 10, 2021
Co-authored-by: Jiayu Liu <Jimexist@users.noreply.github.com>
@alamb alamb added the arrow (Changes to the arrow crate) and enhancement (Any new improvement worthy of a entry in the changelog) labels Jul 14, 2021
Labels
arrow: Changes to the arrow crate
cherry-picked: PR that was backported to active release (will be included in maintenance release)
enhancement: Any new improvement worthy of a entry in the changelog
Development

Successfully merging this pull request may close these issues.

Add partitioning kernel for sorted arrays
3 participants