Open
Labels
enhancement (New feature or request), performance (Make DataFusion faster)
Description
Is your feature request related to a problem or challenge?
Hash joins with a large build side and many duplicate hash values are relatively slow in DataFusion.
The cost seems to come largely from traversing the chain of duplicates (chain_traverse) (1), which is known to be very cache-inefficient because the access pattern is mostly random.
Currently, hash joins are partitioned by hash. We could implement a more efficient algorithm (radix hash join) that splits the build data into smaller tables, each of which mostly fits in the CPU caches and allows more efficient access patterns.
[TODO: collect some issues / examples]
(1) #17494
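To make the cost of chain traversal concrete, here is a minimal sketch of a chained build-side index. It is a simplified illustration, not DataFusion's actual code: `ChainedIndex`, `head`, `next`, `build`, and `probe` are hypothetical names, and rows are matched on hash values only.

```rust
use std::collections::HashMap;

/// Simplified chained index over build-side rows (hypothetical, for illustration).
/// `head` maps a hash to the most recently inserted row, stored as index + 1.
/// `next[i]` links row `i` to the previous row with the same hash (index + 1);
/// 0 marks the end of a chain.
struct ChainedIndex {
    head: HashMap<u64, usize>,
    next: Vec<usize>,
}

impl ChainedIndex {
    fn build(hashes: &[u64]) -> Self {
        let mut head = HashMap::with_capacity(hashes.len());
        let mut next = vec![0usize; hashes.len()];
        for (row, &h) in hashes.iter().enumerate() {
            // The new row becomes the chain head and points at the old head.
            if let Some(prev) = head.insert(h, row + 1) {
                next[row] = prev;
            }
        }
        Self { head, next }
    }

    /// Walks the duplicate chain for one probe hash and returns the matching rows.
    fn probe(&self, hash: u64) -> Vec<usize> {
        let mut rows = Vec::new();
        let mut cur = self.head.get(&hash).copied().unwrap_or(0);
        while cur != 0 {
            rows.push(cur - 1);
            // Dependent load: the next address is only known after this read,
            // and duplicates can sit anywhere in a large `next` array.
            cur = self.next[cur - 1];
        }
        rows
    }
}
```

With a large build side, each step of the `while` loop can touch a different cache line, so long duplicate chains turn into a series of likely cache misses.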
Describe the solution you'd like
Implement a version of Radix Hash Joins:
https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf
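As a rough sketch of the idea, assuming a single partitioning pass on the low bits of the hash (the paper also covers multi-pass variants): both sides are split into `2^bits` partitions with a histogram, prefix sum, and scatter, and each partition pair is then joined with a small table. The names `radix_partition` and `radix_join` are hypothetical, and rows are matched on hash values only; a real join would also compare the key columns.

```rust
use std::collections::HashMap;

/// Groups row indices by the low `bits` bits of their hash.
/// Returns `(rows, offsets)`; partition `p` is `rows[offsets[p]..offsets[p + 1]]`.
fn radix_partition(hashes: &[u64], bits: u32) -> (Vec<usize>, Vec<usize>) {
    let fanout = 1usize << bits;
    let mask = (fanout - 1) as u64;

    // Pass 1: histogram of partition sizes, shifted by one slot.
    let mut offsets = vec![0usize; fanout + 1];
    for &h in hashes {
        offsets[(h & mask) as usize + 1] += 1;
    }
    // Prefix sum turns the sizes into start offsets.
    for p in 0..fanout {
        offsets[p + 1] += offsets[p];
    }

    // Pass 2: scatter each row index into its partition's contiguous range.
    let mut cursor = offsets.clone();
    let mut rows = vec![0usize; hashes.len()];
    for (row, &h) in hashes.iter().enumerate() {
        let p = (h & mask) as usize;
        rows[cursor[p]] = row;
        cursor[p] += 1;
    }
    (rows, offsets)
}

/// Joins two hash columns partition by partition and returns
/// `(build_row, probe_row)` pairs with equal hash values.
fn radix_join(build: &[u64], probe: &[u64], bits: u32) -> Vec<(usize, usize)> {
    let (b_rows, b_off) = radix_partition(build, bits);
    let (p_rows, p_off) = radix_partition(probe, bits);
    let mut out = Vec::new();
    for p in 0..(1usize << bits) {
        // Per-partition table: small enough to stay cache-resident if `bits`
        // is chosen well for the build-side size.
        let mut table: HashMap<u64, Vec<usize>> = HashMap::new();
        for &b in &b_rows[b_off[p]..b_off[p + 1]] {
            table.entry(build[b]).or_default().push(b);
        }
        for &r in &p_rows[p_off[p]..p_off[p + 1]] {
            if let Some(matches) = table.get(&probe[r]) {
                out.extend(matches.iter().map(|&b| (b, r)));
            }
        }
    }
    out
}
```

The choice of `bits` is the main tuning knob: more bits give smaller per-partition tables but a wider scatter in the partitioning pass, which is the trade-off that motivates multi-pass partitioning.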
Describe alternatives you've considered
No response
Additional context
No response