-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
GenericDistinctBuffer was introduced by #18348 to make it easier to write distinct accumulators. Currently it uses Hashable to represent the items being stored:
datafusion/datafusion/functions-aggregate-common/src/utils.rs
Lines 185 to 188 in e42a0b6
| pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> { | |
| pub values: HashSet<Hashable<T::Native>, RandomState>, | |
| data_type: DataType, | |
| } |
datafusion/datafusion/functions-aggregate-common/src/utils.rs
Lines 73 to 81 in e42a0b6
| /// A wrapper around a type to provide hash for floats | |
| #[derive(Copy, Clone, Debug)] | |
| pub struct Hashable<T>(pub T); | |
| impl<T: ToByteSlice> std::hash::Hash for Hashable<T> { | |
| fn hash<H: std::hash::Hasher>(&self, state: &mut H) { | |
| self.0.to_byte_slice().hash(state) | |
| } | |
| } |
Hashable is mainly required because f32/f64 aren't hashable natively, so this wrapper lets us hash them + all other native types (e.g. i32, u64). However I wonder if there is a way to use Hashable only for floats, whilst using the native Hash implementation for other native types (potential performance benefit?).
Describe the solution you'd like
Be able to use GenericDistinctBuffer generically over both f32 and i32 types without needing to force both through Hashable interface. This could be done by having duplicated versions, one that takes Hashable and one that takes natively hashable types, but ideally want a single solution.
- If we had specialization this would be much easier 😅
Describe alternatives you've considered
No response
Additional context
If we could get this to work, we could fold PrimitiveDistinctCountAccumulator together with FloatDistinctCountAccumulator:
datafusion/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
Lines 44 to 51 in e42a0b6
| pub struct PrimitiveDistinctCountAccumulator<T> | |
| where | |
| T: ArrowPrimitiveType + Send, | |
| T::Native: Eq + Hash, | |
| { | |
| values: HashSet<T::Native, RandomState>, | |
| data_type: DataType, | |
| } |
datafusion/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
Lines 126 to 129 in e42a0b6
| #[derive(Debug)] | |
| pub struct FloatDistinctCountAccumulator<T: ArrowPrimitiveType> { | |
| values: GenericDistinctBuffer<T>, | |
| } |
Also when making these changes we'd need to benchmark as likely there are performance implications; I initially tried forcing PrimitiveDistinctCountAccumulator to use Hashable but I kept hitting regressions on micro benchmark with count(distinct) queries, so need to keep that in mind.