-
Notifications
You must be signed in to change notification settings - Fork 1.8k
perf: Implement boolean group values #17726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @ashdnazg -- thank you
I will try and review it carefully in the next day or two
| } | ||
| DataType::Utf8 => { | ||
| return Ok(Box::new(GroupValuesByes::<i32>::new(OutputType::Utf8))); | ||
| return Ok(Box::new(GroupValuesBytes::<i32>::new(OutputType::Utf8))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GroupValuesByes 🤦
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ashdnazg -- I reviewed the code and it all looks pretty good to me (follows the existing patterns)
However, I tested this locally with datafusion-cli and couldn't get it to show any difference. Do you have any benchmarks / queries that showed this would be faster?
Here is what I tried:
Table setup
> create or replace table foo as select random() as float_val, random() < 0.5 as bool_val, case (random() * 4)::integer when 0 THEN 'Foo' WHEN 1 then 'bar' when 2 then 'baz' else 'grogo' end as str_val from generate_series(1, 1000000000);
0 row(s) fetched.
Elapsed 5.596 seconds.
> select * from foo limit 10;
+----------------------+----------+---------+
| float_val | bool_val | str_val |
+----------------------+----------+---------+
| 0.003644060055853049 | true | grogo |
| 0.5430315943491387 | false | Foo |
| 0.0635246361601266 | false | baz |
| 0.09122350127376644 | true | Foo |
| 0.4325015383008821 | true | baz |
| 0.10141972676501176 | true | Foo |
| 0.6749389965920886 | false | grogo |
| 0.6319317066473308 | false | grogo |
| 0.07946106391385499 | true | bar |
| 0.5026330571571326 | true | Foo |
+----------------------+----------+---------+
10 row(s) fetched.
Elapsed 0.071 seconds.Test query
And then ran
> select avg(float_val), bool_val, str_val from foo GROUP BY bool_val, str_val;results
- main: 1.645 seconds
- This branch: 1.695 seconds
(basically no change)
🤔
| nulls: MaybeNullBufferBuilder, | ||
| } | ||
|
|
||
| impl<const NULLABLE: bool> BooleanGroupValueBuilder<NULLABLE> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is pretty similar to PrimitiveGroupValueBuilder but I don't see any obvious way to improve that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. It's a recurring theme, I feel, that the boolean implementations are very similar to the primitive ones.
|
Thank you for benchmarking! I thought I had some promising results, but maybe I was mistaken. |
kosiew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a question.
| new_builder.finish(); | ||
|
|
||
| Arc::new(BooleanArray::new(new_builder.finish(), first_n_nulls)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why call new_builder.finish() twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that! I'll add a test to verify that take_n works correctly.
| lhs_rows: &[usize], | ||
| array: &ArrayRef, | ||
| rhs_rows: &[usize], | ||
| equal_to_results: &mut [bool], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is a slice and not buffer this limit optimizations in my optimization for creating optimized version for all uniuqe, for example for non nullable checking if 2 arrays are the same is simple NOT XOR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's beyond the scope of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed a new ticket to track these ideas so we don't forget them
| lhs_rows: &[usize], | ||
| array: &ArrayRef, | ||
| rhs_rows: &[usize], | ||
| equal_to_results: &mut [bool], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to change it to MutableBooleanBuffer or something in the future to allow for more optimizations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
|
@alamb And I get the following results locally Without the boolean impl: With the boolean impl In general the multi group by option seems to make certain scenarios worse. I get Perhaps the logic for when to use it should be more careful than just "whenever it's supported". But I suspect this PR is not the right spot for that discussion. |
| ); | ||
|
|
||
| for (&lhs_row, &rhs_row, equal_to_result) in iter { | ||
| // Has found not equal to in previous column, don't need to check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment does not add any more information than the next line.
IMO it could be removed or updated with some information "why there is no need to check"
|
|
||
| let null_count = array.null_count(); | ||
| let num_rows = array.len(); | ||
| let all_null_or_non_null = if null_count == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this would be clearer with an enum: NO_NULLS, ALL_NULLS and SOME_NULLS
Good call -- filed #17850 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ashdnazg
I took another look through this PR and was able to reproduce your numbers and I think it is ready to go
Here is how I benchmarked the results:
./bench.sh data h2o_mediumAnd then ran
datafusion-cli -c "SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count, (v1 % 2)=0 AS b1 FROM '/Users/andrewlamb/Software/datafusion/benchmarks/data/h2o/G1_1e8_1e8_100_0.csv' GROUP BY id1, id2, id3, id4, id5, id6, b1;"main
Elapsed 10.549 seconds.
Elapsed 8.316 seconds.
Elapsed 8.762 seconds.
This PR
Elapsed 4.625 seconds.
Elapsed 4.100 seconds.
Elapsed 3.722 seconds.
🚀
@martin-g I think your review suggestions about comments and enums great. Perhaps you can create a follow on PR to implement them?
| lhs_rows: &[usize], | ||
| array: &ArrayRef, | ||
| rhs_rows: &[usize], | ||
| equal_to_results: &mut [bool], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed a new ticket to track these ideas so we don't forget them
…n array (#18048) * chore: Use an enum to express the different kinds of nullability in an array Follow-up of #17726 (review) * Use the Nulls enum also in .../multi_group_by/primitive.rs * Use the Nulls enum in .../multi_group_by/bytes[_view].rs as well
…n array (apache#18048) * chore: Use an enum to express the different kinds of nullability in an array Follow-up of apache#17726 (review) * Use the Nulls enum also in .../multi_group_by/primitive.rs * Use the Nulls enum in .../multi_group_by/bytes[_view].rs as well
Which issue does this PR close?
N/A
Rationale for this change
Having more data types supported in this code path avoids converting the group by columns to rows during aggregation.
What changes are included in this PR?
Adding boolean support for single and multiple grouping values.
Are these changes tested?
I added unit tests for the multi-value implementation and both single value and multi value implementations are tested in the sql logic tests.
Are there any user-facing changes?
No