
[Discussion] Should hash join order be based on memory size? #22098


Description

@nuno-faria

Currently, the order of a hash join (i.e., which relation is hashed and which is probed) is dictated by the size of the relations in bytes. This is controlled by the should_swap_join_order function:

/// 2. Compare the in-memory sizes of both sides, and place the smaller side on
/// the left (build) side.
/// 3. If in-memory byte sizes are unavailable, fall back to row counts.
/// 4. Do not reorder the join if neither statistic is available, or if
/// `datafusion.optimizer.join_reordering` is disabled.
///
/// Used configurations inside arg `config`
/// - `config.optimizer.join_reordering`: allows or forbids statistics-driven join swapping
pub(crate) fn should_swap_join_order(
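
For context, here is a minimal standalone sketch of what that rule boils down to. This is only an illustration following the numbering of the doc comment above; the names are made up, and the real function operates on the plans' `Statistics` rather than plain `Option<usize>` values:

```rust
/// Illustrative only: swap the inputs so the smaller side becomes the
/// left (build) side, preferring byte sizes and falling back to row counts.
fn should_swap(
    left_bytes: Option<usize>,
    right_bytes: Option<usize>,
    left_rows: Option<usize>,
    right_rows: Option<usize>,
) -> bool {
    match (left_bytes, right_bytes) {
        // 2. Compare in-memory sizes and place the smaller side on the build side.
        (Some(l), Some(r)) => l > r,
        // 3. Byte sizes unavailable: fall back to row counts.
        _ => match (left_rows, right_rows) {
            (Some(l), Some(r)) => l > r,
            // 4. No usable statistics: keep the original order.
            _ => false,
        },
    }
}
```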

This is a safe heuristic, since we want to increase the probability of the hash table fitting in memory. However, if one of the tables is considerably wider than the other, we can end up picking as the build side a narrower table that has many more rows, which can be slower. I ran some tests that joined a narrow table (fixed at 100M rows) with a wide one (from 10M to 100M rows), disabling automatic join reordering via datafusion.optimizer.join_reordering so that both the narrow+wide and wide+narrow orders could be measured:

-- narrow (fixed at 100M rows)
create table t1 (k int, v int);
insert into t1 select i as k, i as v from generate_series(1, 100000000) t(i);

-- wide (variable number of rows)
create table t2 (k int, v varchar);
insert into t2 select i as k, i || repeat('0', 54) as v from generate_series(1, ...) t(i);
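
For completeness, the comparison can be reproduced roughly as sketched below. This is an assumption about the setup, not the exact script used for the measurements: it assumes the `datafusion` and `tokio` crates, reuses the config key quoted above to keep the written join order, and uses much smaller row counts so it finishes quickly.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Keep the join order exactly as written in the query (setting referenced above).
    let config = SessionConfig::new().set_bool("datafusion.optimizer.join_reordering", false);
    let ctx = SessionContext::new_with_config(config);

    // Same schemas as above, but with fewer rows for a quick local run.
    ctx.sql("create table t1 (k int, v int)").await?.collect().await?;
    ctx.sql("insert into t1 select i as k, i as v from generate_series(1, 1000000) t(i)")
        .await?.collect().await?;
    ctx.sql("create table t2 (k int, v varchar)").await?.collect().await?;
    ctx.sql("insert into t2 select i as k, i || repeat('0', 54) as v from generate_series(1, 100000) t(i)")
        .await?.collect().await?;

    // narrow + wide: with swapping disabled, the left input (t1) should be the build side.
    ctx.sql("select count(*) from t1 join t2 on t1.k = t2.k").await?.show().await?;
    // wide + narrow: t2 is written first, so it should be the build side.
    ctx.sql("select count(*) from t2 join t1 on t2.k = t1.k").await?.show().await?;
    Ok(())
}
```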

Here are the results (8 vCPUs, 32GB Mem, avg of 10 runs):

[Image: benchmark results comparing the narrow_join_wide and wide_join_narrow runtimes as the wide table grows from 10M to 100M rows]

The wide table with 10M rows is already slightly larger in bytes than the narrow one with 100M rows, so the current DataFusion implementation always builds the hash table on the narrow side (the narrow_join_wide version). The results show that this version is only faster when the wide table has 90M rows, i.e. only 10% fewer rows than the narrow one.

So what I'd like to discuss is whether the hash join order should be decided by a more nuanced heuristic, for example: "if the difference in byte size between the tables is less than X, go by row count, otherwise go by byte size" (a rough sketch of such a rule is included after the Postgres plans below). It appears that Postgres also does something like this:

-- t1 (narrow): 10M rows
-- t2 (wide): 8M rows <-- hashed
Hash Cond: (t1.k = t2.k)
->  Seq Scan on t1  (... width=8) (actual time=0.016..452.699 rows=10000000.00 loops=1)
->  Hash  (...) (actual time=3315.965..3315.966 rows=8000000.00 loops=1)
      Buckets: 8388608  Batches: 1  Memory Usage: 830076kB
      ->  Seq Scan on t2  (... width=65) (actual time=0.027..462.818 rows=8000000.00 loops=1)

-- t1 (narrow): 10M rows <-- hashed
-- t2 (wide): 9.5M rows
Hash Cond: (t2.k = t1.k)
->  Seq Scan on t2  (... width=65) (actual time=0.010..437.648 rows=9500000.00 loops=1)
->  Hash  (...) (actual time=3013.599..3013.600 rows=10000000.00 loops=1)
      Buckets: 16777216  Batches: 1  Memory Usage: 521697kB
      ->  Seq Scan on t1  (... width=8) (actual time=0.028..472.798 rows=10000000.00 loops=1)

For reference, here are the sizes of the tables. Postgres only switched to hashing the narrow table once the wide one was about 9x larger in bytes.

[Image: on-disk sizes of t1 and t2 in the Postgres test]
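
To make the discussion concrete, one possible shape for such a hybrid rule is sketched below. The function name, the `BYTE_RATIO_THRESHOLD` constant, and its value are all made up for illustration; the actual threshold (and whether a simple ratio is even the right signal) would need benchmarking:

```rust
/// Hypothetical hybrid rule: when the two sides are of comparable byte size,
/// let row counts pick the build side; only fall back to the pure byte-size
/// rule when one side is dramatically larger than the other.
const BYTE_RATIO_THRESHOLD: f64 = 4.0; // made-up value

fn should_swap_hybrid(
    left_bytes: usize,
    left_rows: usize,
    right_bytes: usize,
    right_rows: usize,
) -> bool {
    let larger = left_bytes.max(right_bytes) as f64;
    let smaller = left_bytes.min(right_bytes).max(1) as f64;
    if larger / smaller <= BYTE_RATIO_THRESHOLD {
        // Comparable byte sizes: build on the side with fewer rows.
        left_rows > right_rows
    } else {
        // One side is much larger in bytes: keep the memory-safe rule.
        left_bytes > right_bytes
    }
}
```

With the tables from the benchmark above, the 10M-row wide table is only marginally larger in bytes than the 100M-row narrow one, so a rule like this would take the row-count branch and hash the wide table, which is what the measurements suggest is faster for most of the tested range.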

cc: @alamb @Dandandan @adriangb since I think you are also interested in hash join performance.
