-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Open
Labels
performanceMake DataFusion fasterMake DataFusion faster
Description
Thanks! Besides looking at optimizing the join order during planning time or dynamic (I think there are a couple of issues covering that), we can look at what makes the operator slow in more challenging scenario's.
Some optimizations for the current operator come to mind that might improve the current hash join operator in certain scenario's, while keeping the same algorithm:
- Reuse the allocation of
Vecindices between calls. This probably helps when the amount of matching indices is low (compared to the batch size).- (Related): Keep building matching indices until
limitrows have been reached and useinterleaveto collect the batches. That probably makes the operator more cache efficient as accessing the map / chain is done at the same time, before producing output batches from the input data. This also helps with avoiding the overhead of a laterCoalesceBatches, which might help as well.- Instead of building indices for the right side, we can build a boolean mask / filter to mark match / no match. This reduces memory usage (somewhat) plus a boolean filter is much faster for low selectivity (i.e. most of the right side matches). We then should use the coalesce kernel to produce the right side arrays.
I opened #18939 for exploring to use a different algorithm (radix hash joins), which additionally should improve the performance of our join operators by making the algorithm more cache efficient.
Originally posted by @Dandandan in #17494
Metadata
Metadata
Assignees
Labels
performanceMake DataFusion fasterMake DataFusion faster