Skip to content

Optimizations for hash join operator #18942

@Dandandan

Description

@Dandandan

Thanks! Besides looking at optimizing the join order during planning time or dynamic (I think there are a couple of issues covering that), we can look at what makes the operator slow in more challenging scenario's.

Some optimizations for the current operator come to mind that might improve the current hash join operator in certain scenario's, while keeping the same algorithm:

  • Reuse the allocation of Vec indices between calls. This probably helps when the amount of matching indices is low (compared to the batch size).
  • (Related): Keep building matching indices until limit rows have been reached and use interleave to collect the batches. That probably makes the operator more cache efficient as accessing the map / chain is done at the same time, before producing output batches from the input data. This also helps with avoiding the overhead of a later CoalesceBatches, which might help as well.
  • Instead of building indices for the right side, we can build a boolean mask / filter to mark match / no match. This reduces memory usage (somewhat) plus a boolean filter is much faster for low selectivity (i.e. most of the right side matches). We then should use the coalesce kernel to produce the right side arrays.

I opened #18939 for exploring to use a different algorithm (radix hash joins), which additionally should improve the performance of our join operators by making the algorithm more cache efficient.

Originally posted by @Dandandan in #17494

Metadata

Metadata

Assignees

No one assigned

    Labels

    performanceMake DataFusion faster

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions