Skip to content

Make GenericDistinctBuffer generic over both Hashable and native types #18670

@Jefffrey

Description

@Jefffrey

Is your feature request related to a problem or challenge?

GenericDistinctBuffer was introduced by #18348 to make it easier to write distinct accumulators. Currently it uses Hashable to represent the items being stored:

pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> {
pub values: HashSet<Hashable<T::Native>, RandomState>,
data_type: DataType,
}

/// A wrapper around a type to provide hash for floats
#[derive(Copy, Clone, Debug)]
pub struct Hashable<T>(pub T);
impl<T: ToByteSlice> std::hash::Hash for Hashable<T> {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.0.to_byte_slice().hash(state)
}
}

Hashable is mainly required because f32/f64 aren't hashable natively, so this wrapper lets us hash them + all other native types (e.g. i32, u64). However I wonder if there is a way to use Hashable only for floats, whilst using the native Hash implementation for other native types (potential performance benefit?).

Describe the solution you'd like

Be able to use GenericDistinctBuffer generically over both f32 and i32 types without needing to force both through Hashable interface. This could be done by having duplicated versions, one that takes Hashable and one that takes natively hashable types, but ideally want a single solution.

  • If we had specialization this would be much easier 😅

Describe alternatives you've considered

No response

Additional context

If we could get this to work, we could fold PrimitiveDistinctCountAccumulator together with FloatDistinctCountAccumulator:

pub struct PrimitiveDistinctCountAccumulator<T>
where
T: ArrowPrimitiveType + Send,
T::Native: Eq + Hash,
{
values: HashSet<T::Native, RandomState>,
data_type: DataType,
}

#[derive(Debug)]
pub struct FloatDistinctCountAccumulator<T: ArrowPrimitiveType> {
values: GenericDistinctBuffer<T>,
}

Also when making these changes we'd need to benchmark as likely there are performance implications; I initially tried forcing PrimitiveDistinctCountAccumulator to use Hashable but I kept hitting regressions on micro benchmark with count(distinct) queries, so need to keep that in mind.

Metadata

Metadata

Labels

enhancementNew feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions