Currently, the order of a hash join (i.e., which relation is hashed and which is probed) is dictated by the size of the relations in bytes. This is controlled by the `should_swap_join_order` function:
```rust
/// 2. Compare the in-memory sizes of both sides, and place the smaller side on
///    the left (build) side.
/// 3. If in-memory byte sizes are unavailable, fall back to row counts.
/// 4. Do not reorder the join if neither statistic is available, or if
///    `datafusion.optimizer.join_reordering` is disabled.
///
/// Used configurations inside arg `config`
/// - `config.optimizer.join_reordering`: allows or forbids statistics-driven join swapping
pub(crate) fn should_swap_join_order(
```
This is a safe heuristic, since we want to increase the probability of the hash table fitting in memory. However, if one of the tables is considerably wider than the other, we can end up choosing a narrower table with many more rows as the build side, which can be slower. I ran some tests joining a narrow table (fixed at 100M rows) with a wide one (from 10M to 100M rows), disabling the automatic join reordering via `datafusion.optimizer.join_reordering` so I could test both narrow+wide and wide+narrow:
```sql
-- narrow (fixed to 100M rows)
create table t1 (k int, v int);
insert into t1 select i as k, i as v from generate_series(1, 100000000) t(i);

-- wide (variable num of rows)
create table t2 (k int, v varchar);
insert into t2 select i as k, i || repeat('0', 54) as v from generate_series(1, ...) t(i);
```
Here are the results (8 vCPUs, 32GB Mem, avg of 10 runs):
The wide table with 10M rows is slightly larger in bytes than the narrow one with 100M, so the current DataFusion implementation always uses the `narrow_join_wide` version. We see that this version is only faster when the wide table has 90M rows, i.e. only 10% fewer rows than the narrow one.
So what I'd like to discuss is whether the hash join order should be decided by a more nuanced heuristic. For example: "if the difference in size between the tables is less than X, go by row count; otherwise go by byte size". It appears that Postgres also does something like this:
```
-- t1 (narrow): 10M rows
-- t2 (wide): 8M rows  <-- hashed
Hash Cond: (t1.k = t2.k)
->  Seq Scan on t1 (... width=8) (actual time=0.016..452.699 rows=10000000.00 loops=1)
->  Hash (...) (actual time=3315.965..3315.966 rows=8000000.00 loops=1)
      Buckets: 8388608  Batches: 1  Memory Usage: 830076kB
      ->  Seq Scan on t2 (... width=65) (actual time=0.027..462.818 rows=8000000.00 loops=1)
```

```
-- t1 (narrow): 10M rows  <-- hashed
-- t2 (wide): 9.5M rows
Hash Cond: (t2.k = t1.k)
->  Seq Scan on t2 (... width=65) (actual time=0.010..437.648 rows=9500000.00 loops=1)
->  Hash (...) (actual time=3013.599..3013.600 rows=10000000.00 loops=1)
      Buckets: 16777216  Batches: 1  Memory Usage: 521697kB
      ->  Seq Scan on t1 (... width=8) (actual time=0.028..472.798 rows=10000000.00 loops=1)
```
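To make the proposal concrete, here is a minimal sketch of such a hybrid heuristic. This is purely illustrative: the function name, signature, and the `ratio_threshold` value are all hypothetical and not part of DataFusion's actual API.

```rust
/// Hypothetical hybrid rule (illustrative only): when the two sides' byte
/// sizes are within `ratio_threshold` of each other, break the tie by row
/// count; otherwise keep the existing smaller-byte-size-builds rule.
/// Returns true if the sides should be swapped so the right side builds.
fn should_swap(
    left_bytes: u64,
    right_bytes: u64,
    left_rows: u64,
    right_rows: u64,
    ratio_threshold: f64, // e.g. 3.0: sizes within 3x count as "comparable"
) -> bool {
    let larger = left_bytes.max(right_bytes) as f64;
    let smaller = left_bytes.min(right_bytes).max(1) as f64;
    if larger / smaller <= ratio_threshold {
        // Byte sizes are comparable: build on the side with fewer rows.
        left_rows > right_rows
    } else {
        // Byte sizes differ a lot: build on the smaller side in bytes.
        left_bytes > right_bytes
    }
}

fn main() {
    // Comparable byte sizes (ratio ~1.1), so row count decides: swap so the
    // 10M-row side builds (byte figures here are made up for illustration).
    assert!(should_swap(6_400_000_000, 7_000_000_000, 100_000_000, 10_000_000, 3.0));
    // One side 9x larger in bytes: fall back to byte size, no swap.
    assert!(!should_swap(6_400_000_000, 57_600_000_000, 100_000_000, 90_000_000, 3.0));
}
```

The open question would then be how to pick the threshold, since (per the results above) row count only stopped mattering once the wide table was many times larger in bytes.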
For reference, here are the sizes of the tables. It was only faster to use the narrow table once the wider one was 9x larger.

cc: @alamb @Dandandan @adriangb since I think you are also interested in hash join performance.