Skip to content

Implement Radix Hash Join #18939

@Dandandan

Description

@Dandandan

Is your feature request related to a problem or challenge?

Hash Joins with large build side & lots of hash-duplicates are relatively slow in DataFusion.

The cost seems largely associated with traversing the chain of duplicates (chain_traverse) (1) + which is known to be very cache-inefficient, as the access pattern is mostly random.

Currently, we implement hash joins partitioned by hash, but we can implement a more efficient algorithm (radix hash join) that splits build data into smaller tables that individually mostly fits in CPU caches and allow more efficient access patterns.

[TODO: collect some issues / examples]

(1) #17494

Describe the solution you'd like

Implement a version of Radix Hash Joins:

Image

https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformanceMake DataFusion faster

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions