-
Notifications
You must be signed in to change notification settings - Fork 1.8k
feat: support bitwise and boolean aggregate functions #6276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support bitwise and boolean aggregate functions #6276
Conversation
|
@alamb, @tustvold, @mingmwang I wonder if you have time to review this PR. |
|
Thanks @izveigor -- I will try and review this PR in the next day or two |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @izveigor -- this is looking good. I have gotten through this PR execept for the content of datafusion/physical-expr/src/aggregate/bit_and_or_xor.rs which hopefully I'll find time to work on tomorrow
| ('2021-01-01T05:11:10.432', 'Row 3'); | ||
|
|
||
|
|
||
| statement ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given postgres also runs these queries what do you think about also adding them into a pg_compat test in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests/test_files/pg_compat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a nice idea for bitwise aggregate functions.
| query III | ||
| SELECT bit_and(c1), bit_and(c2), bit_and(c3) FROM bit_aggregate_functions | ||
| ---- | ||
| 1 8 11 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked this is consistent with postgres 👍
| | approx_median(expr) | Calculates an approximation of the median for `expr`. | | ||
| | approx_percentile_cont(expr, percentile) | Calculates an approximation of the specified `percentile` for `expr`. | | ||
| | approx_percentile_cont_with_weight(expr, weight_expr, percentile) | Calculates an approximation of the specified `percentile` for `expr` and `weight_expr`. | | ||
| | bit_and(expr) | Computes the bitwise AND of all non-null input values for `expr`. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for also updating the documentation
| } | ||
|
|
||
| fn supports_bounded_execution(&self) -> bool { | ||
| true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this aggregator actually supports bounded execution (that are used in sliding window functions) -- to do so I think it would need to remember how many tries it has seen perhaps.
@mustafasrepo or @ozankabak can you double check this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it supports bounded execution either. To support bounded execution accumulator should implement retract_batch method. (Theoretically, I don't think there is a way to implement retract_batch for bitwise and , bitwise or. However, it can be implemented for bit wise xor though see discussion). Hence, I think we should not implement this method for and, or cases (it defaults to false). For xor we can support it, however, we may do these in followup PRs. As far as this Pr is concerned, deleting supports_bounded_execution implementation is enough I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also want to mention that this situation concerns relatively DISTINCT cases (only BIT_XOR(DISTINCT) makes sense, the rest of DISTINCT cases will produce the same results as without them)
|
|
||
| // bool_and/bool_or of two scalar values. | ||
| macro_rules! typed_bool_and_or { | ||
| ($VALUE:expr, $DELTA:expr, $SCALAR:ident, $OP:ident) => {{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you added ScalarValue::bitand etc, maybe you could also add ScalarValue::and ? I think the overall code complexity would be the same but then the logic wouldn't be hidden in macros inside the aggregator
A similar comment applies to ScalarValue:or etc
| use std::ops::BitAnd as BitAndImplementation; | ||
| use std::ops::BitOr as BitOrImplementation; | ||
|
|
||
| fn bool_and(array: &BooleanArray) -> Option<bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 these are good implementations
| use std::ops::BitOr as BitOrImplementation; | ||
| use std::ops::BitXor as BitXorImplementation; | ||
|
|
||
| fn bit_and<T>(array: &PrimitiveArray<T>) -> Option<T::Native> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should strive to move these kernels into arrow-rs eventually.
Can you please file a ticket in arrow-rs to do so (or I can do to so if you prefer) and leave a comment here with a reference to that ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! apache/arrow-rs#4210
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @izveigor -- this looks like a great start to me and is quite close. I think the sliding accumulator stuff needs to be fixed prior to merge (as it would produce wrong results).
I also left some comments on how to make the code easier to maintain
Thanks again for this really nice contribution
| None => { | ||
| let bit_and = data | ||
| .iter() | ||
| .fold(T::Native::ONE.neg_wrapping(), |accumulator, value| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // bit_and/bit_or/bit_xor the array and returns a ScalarValue of its corresponding type. | ||
| macro_rules! bit_and_or_xor_batch { | ||
| ($VALUES:expr, $OP:ident) => {{ | ||
| match $VALUES.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thin you could probably use https://docs.rs/arrow/latest/arrow/macro.downcast_primitive_array.html to avoid some of this replication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the usual way is more preferable (because other aggregate functions use this method). But, I think the idea can be continued in the separate PR for all aggregate functions.
|
@alamb I have updated the PR based on your comments. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- thanks again @izveigor
Which issue does this PR close?
Closes #6275
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
Yes