
[Discussion] Should hash join order be based on memory size? #22098


Description

@nuno-faria

Currently, the order of a hash join (i.e., which relation is hashed and which is probed) is dictated by the size of the relations in bytes. This is controlled by the should_swap_join_order function:

/// 2. Compare the in-memory sizes of both sides, and place the smaller side on
/// the left (build) side.
/// 3. If in-memory byte sizes are unavailable, fall back to row counts.
/// 4. Do not reorder the join if neither statistic is available, or if
/// `datafusion.optimizer.join_reordering` is disabled.
///
/// Used configurations inside arg `config`
/// - `config.optimizer.join_reordering`: allows or forbids statistics-driven join swapping
pub(crate) fn should_swap_join_order(
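
For context, here is a minimal standalone sketch of what that rule boils down to. This is only an illustration following the numbering of the doc comment above; the names are made up, and the real function operates on the plans' `Statistics` rather than plain `Option<usize>` values:

```rust
/// Illustrative only: swap the inputs so the smaller side becomes the
/// left (build) side, preferring byte sizes and falling back to row counts.
fn should_swap(
    left_bytes: Option<usize>,
    right_bytes: Option<usize>,
    left_rows: Option<usize>,
    right_rows: Option<usize>,
) -> bool {
    match (left_bytes, right_bytes) {
        // 2. Compare in-memory sizes and place the smaller side on the build side.
        (Some(l), Some(r)) => l > r,
        // 3. Byte sizes unavailable: fall back to row counts.
        _ => match (left_rows, right_rows) {
            (Some(l), Some(r)) => l > r,
            // 4. No usable statistics: keep the original order.
            _ => false,
        },
    }
}
```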

This is a safe heuristic, since we want to increase the probability of the hash table fitting in memory. However, if one of the tables is considerably wider than the other, we can end up picking as the build side a narrower table that has many more rows, which can be slower. I ran some tests that joined a narrow table (fixed at 100M rows) with a wide one (from 10M to 100M rows), disabling automatic join reordering via datafusion.optimizer.join_reordering so that both the narrow+wide and wide+narrow orders could be measured:

-- narrow (fixed at 100M rows)
create table t1 (k int, v int);
insert into t1 select i as k, i as v from generate_series(1, 100000000) t(i);

-- wide (variable number of rows)
create table t2 (k int, v varchar);
insert into t2 select i as k, i || repeat('0', 54) as v from generate_series(1, ...) t(i);
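
For completeness, the comparison can be reproduced roughly as sketched below. This is an assumption about the setup, not the exact script used for the measurements: it assumes the `datafusion` and `tokio` crates, reuses the config key quoted above to keep the written join order, and uses much smaller row counts so it finishes quickly.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Keep the join order exactly as written in the query (setting referenced above).
    let config = SessionConfig::new().set_bool("datafusion.optimizer.join_reordering", false);
    let ctx = SessionContext::new_with_config(config);

    // Same schemas as above, but with fewer rows for a quick local run.
    ctx.sql("create table t1 (k int, v int)").await?.collect().await?;
    ctx.sql("insert into t1 select i as k, i as v from generate_series(1, 1000000) t(i)")
        .await?.collect().await?;
    ctx.sql("create table t2 (k int, v varchar)").await?.collect().await?;
    ctx.sql("insert into t2 select i as k, i || repeat('0', 54) as v from generate_series(1, 100000) t(i)")
        .await?.collect().await?;

    // narrow + wide: with swapping disabled, the left input (t1) should be the build side.
    ctx.sql("select count(*) from t1 join t2 on t1.k = t2.k").await?.show().await?;
    // wide + narrow: t2 is written first, so it should be the build side.
    ctx.sql("select count(*) from t2 join t1 on t2.k = t1.k").await?.show().await?;
    Ok(())
}
```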

Here are the results (8 vCPUs, 32GB Mem, avg of 10 runs):

[Image: benchmark results comparing the narrow_join_wide and wide_join_narrow runtimes as the wide table grows from 10M to 100M rows]

The wide table with 10M rows is already slightly larger in bytes than the narrow one with 100M rows, so the current DataFusion implementation always builds the hash table on the narrow side (the narrow_join_wide version). The results show that this version is only faster when the wide table has 90M rows, i.e. only 10% fewer rows than the narrow one.

So what I'd like to discuss is whether the hash join order should be decided by a more nuanced heuristic, for example: "if the difference in byte size between the tables is less than X, go by row count, otherwise go by byte size" (a rough sketch of such a rule is included after the Postgres plans below). It appears that Postgres also does something like this:

-- t1 (narrow): 10M rows
-- t2 (wide): 8M rows <-- hashed
Hash Cond: (t1.k = t2.k)
->  Seq Scan on t1  (... width=8) (actual time=0.016..452.699 rows=10000000.00 loops=1)
->  Hash  (...) (actual time=3315.965..3315.966 rows=8000000.00 loops=1)
      Buckets: 8388608  Batches: 1  Memory Usage: 830076kB
      ->  Seq Scan on t2  (... width=65) (actual time=0.027..462.818 rows=8000000.00 loops=1)

-- t1 (narrow): 10M rows <-- hashed
-- t2 (wide): 9.5M rows
Hash Cond: (t2.k = t1.k)
->  Seq Scan on t2  (... width=65) (actual time=0.010..437.648 rows=9500000.00 loops=1)
->  Hash  (...) (actual time=3013.599..3013.600 rows=10000000.00 loops=1)
      Buckets: 16777216  Batches: 1  Memory Usage: 521697kB
      ->  Seq Scan on t1  (... width=8) (actual time=0.028..472.798 rows=10000000.00 loops=1)

For reference, here are the sizes of the tables. Postgres only switched to hashing the narrow table once the wide one was about 9x larger in bytes.

[Image: on-disk sizes of t1 and t2 in the Postgres test]
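
To make the discussion concrete, one possible shape for such a hybrid rule is sketched below. The function name, the `BYTE_RATIO_THRESHOLD` constant, and its value are all made up for illustration; the actual threshold (and whether a simple ratio is even the right signal) would need benchmarking:

```rust
/// Hypothetical hybrid rule: when the two sides are of comparable byte size,
/// let row counts pick the build side; only fall back to the pure byte-size
/// rule when one side is dramatically larger than the other.
const BYTE_RATIO_THRESHOLD: f64 = 4.0; // made-up value

fn should_swap_hybrid(
    left_bytes: usize,
    left_rows: usize,
    right_bytes: usize,
    right_rows: usize,
) -> bool {
    let larger = left_bytes.max(right_bytes) as f64;
    let smaller = left_bytes.min(right_bytes).max(1) as f64;
    if larger / smaller <= BYTE_RATIO_THRESHOLD {
        // Comparable byte sizes: build on the side with fewer rows.
        left_rows > right_rows
    } else {
        // One side is much larger in bytes: keep the memory-safe rule.
        left_bytes > right_bytes
    }
}
```

With the tables from the benchmark above, the 10M-row wide table is only marginally larger in bytes than the 100M-row narrow one, so a rule like this would take the row-count branch and hash the wide table, which is what the measurements suggest is faster for most of the tested range.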

cc: @alamb @Dandandan @adriangb since I think you are also interested in hash join performance.
