[Epic] Remove Sort Merge Join Experimental status #9846

comphead · 2024-03-29T00:58:57Z

Is your feature request related to a problem or challenge?

Hi all

I was going through the SMJ implementation and stumbled on this comment:

// Sort-Merge join support currently is experimental

https://github.com/apache/arrow-datafusion/blob/81c96fc3db0ea35638278f32df066be63b745a51/datafusion/core/src/physical_planner.rs#L1141

I think it would be nice to revisit this and find out whether Sort Merge Join Exec is still experimental.
If so, are there any strategies to make it stable, or benchmarks we could run to show that the join is stable?

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response


comphead · 2024-03-29T01:01:03Z

@alamb @ozankabak @viirya @mustafasrepo @berkaysynnada @metesynnada I would appreciate your input.

alamb · 2024-03-29T10:57:44Z

In my experience, I have never seen SortMergeJoin used in any plan I have looked at in DataFusion, so I think it is still "experimental", or at least "not used by DataFusion by default" (which may be the same thing).

It looks like there was some past interest in SortMergeJoin -- https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+sortmergejoin

Also the people interested in that operator seem to be the people focused on Spark

comphead · 2024-03-29T16:05:12Z

It is used if the following conditions are met: https://github.com/apache/arrow-datafusion/blob/81c96fc3db0ea35638278f32df066be63b745a51/datafusion/core/src/physical_planner.rs#L1136

There is also a small set of tests in sort_merge_join.slt, and the plans there show SMJ.

To force SMJ, you need to set:

set datafusion.optimizer.prefer_hash_join = false;

Perhaps we can revisit the tests and run some benchmarks with SMJ forced, and then make a decision?

metesynnada · 2024-03-30T20:17:18Z

I believe we can add fuzz tests for SMJ to ensure it is robust.
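As an illustration of what such a fuzz test could look like, here is a minimal, self-contained sketch of differential testing: a sort-merge join is run on random inputs and its output is compared with a trivially correct nested-loop join. None of this is DataFusion code. The `Lcg` generator, the `Row` type, and all function names are hypothetical and exist only for this example.

```rust
/// Tiny deterministic pseudo-random generator (an LCG), used only so that
/// this sketch needs no external crates. Hypothetical helper.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// (join key, payload)
type Row = (u64, u64);

/// Generates between 0 and `max_len` rows with keys in `0..key_range`.
fn random_rows(rng: &mut Lcg, max_len: u64, key_range: u64) -> Vec<Row> {
    let len = rng.next_u64() % (max_len + 1);
    (0..len)
        .map(|_| (rng.next_u64() % key_range, rng.next_u64()))
        .collect()
}

/// Reference implementation: compare every pair of rows.
fn nested_loop_join(left: &[Row], right: &[Row]) -> Vec<(u64, u64, u64)> {
    let mut out = Vec::new();
    for l in left {
        for r in right {
            if l.0 == r.0 {
                out.push((l.0, l.1, r.1));
            }
        }
    }
    out
}

/// Implementation under test: sort both sides, then merge them.
fn sort_merge_join(left: &[Row], right: &[Row]) -> Vec<(u64, u64, u64)> {
    let (mut l, mut r) = (left.to_vec(), right.to_vec());
    l.sort();
    r.sort();
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < l.len() && j < r.len() {
        if l[i].0 < r[j].0 {
            i += 1;
        } else if l[i].0 > r[j].0 {
            j += 1;
        } else {
            // Equal keys: pair every left row of this key with every
            // right row of this key, rewinding the right cursor each time.
            let key = l[i].0;
            let j_start = j;
            while i < l.len() && l[i].0 == key {
                j = j_start;
                while j < r.len() && r[j].0 == key {
                    out.push((key, l[i].1, r[j].1));
                    j += 1;
                }
                i += 1;
            }
        }
    }
    out
}

/// Runs `rounds` random cases and returns the number of mismatches.
fn fuzz(seed: u64, rounds: usize) -> usize {
    let mut rng = Lcg(seed);
    let mut mismatches = 0;
    for _ in 0..rounds {
        let left = random_rows(&mut rng, 20, 5);
        let right = random_rows(&mut rng, 20, 5);
        let mut expected = nested_loop_join(&left, &right);
        let mut actual = sort_merge_join(&left, &right);
        // Row order is not part of the contract, so compare sorted outputs.
        expected.sort();
        actual.sort();
        if expected != actual {
            mismatches += 1;
        }
    }
    mismatches
}
```

The key range is deliberately small relative to the input length so that duplicate keys, where merge joins are easiest to get wrong, show up often. A real fuzz test would also need to cover the other join types, join filters, and nulls.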

comphead · 2024-04-02T19:47:48Z

I'm wondering whether adding fuzz tests is enough. We probably also need to run benchmarks with SMJ? As far as I know, the benchmarks currently run with HJ.

metesynnada · 2024-04-03T11:14:52Z

Is there a rule of thumb for choosing SMJ over HJ?

Dandandan · 2024-04-03T11:49:59Z

> Is there a rule of thumb for choosing SMJ over HJ?

I wonder how SMJ in DataFusion compares against HJ at the moment.

Some ideas for when SMJ could be chosen over HJ:

- When the input data is already sorted on the relevant keys, an SMJ is likely faster and requires less memory than an HJ.
- HJ might require more memory than SMJ, so whenever e.g. data skew is expected, one might choose sort merge over hash join.
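To make the first point concrete, here is a minimal sketch of the merge step of an inner join over two inputs that are already sorted by key. It is not the DataFusion implementation. The `merge_join` function and the `(i64, &str)` row shape are hypothetical and chosen only for illustration.

```rust
use std::cmp::Ordering;

/// Inner-joins two slices that are ALREADY sorted by key.
/// Returns (key, left_value, right_value) for every matching pair.
fn merge_join<'a>(
    left: &[(i64, &'a str)],
    right: &[(i64, &'a str)],
) -> Vec<(i64, &'a str, &'a str)> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        match left[i].0.cmp(&right[j].0) {
            // The smaller key cannot match anything later on the other side.
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                let key = left[i].0;
                // Find the end of the run of equal keys on each side.
                let i_end = i + left[i..].iter().take_while(|r| r.0 == key).count();
                let j_end = j + right[j..].iter().take_while(|r| r.0 == key).count();
                // Emit the cross product of the two runs.
                for l in &left[i..i_end] {
                    for r in &right[j..j_end] {
                        out.push((key, l.1, r.1));
                    }
                }
                i = i_end;
                j = j_end;
            }
        }
    }
    out
}
```

Each input is scanned once and only the current run of equal keys has to be revisited, so no hash table over a whole input is built. That is where the memory advantage on pre-sorted data comes from.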

alamb · 2024-04-03T13:13:51Z

> Is there a rule of thumb for choosing SMJ over HJ?

I believe the current state of the art in query processing is:

- If the data is already sorted by join keys, use MergeJoin (as @Dandandan says)
- If the data is not already sorted on join key, use HashJoin
- If HashJoin runs out of memory building the hash table, spill the table to disk (possibly switching to merge join internally)

The only benefit SMJ has over HJ at the moment in DataFusion is that we could plausibly join relations that are larger than memory using SMJ (using the fact that we can spill the inputs) -- this may be what @Dandandan is saying in #9846 (comment)

I think it is close to impossible to make SMJ beat HJ for raw performance when the relations fit in memory
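The rule of thumb above can be written down as a small decision function. This is only a sketch of the heuristic as stated in this comment, not DataFusion's planner logic. The `JoinStrategy` enum, `choose_join`, and both size parameters are hypothetical. In particular, the comment describes spilling as a reaction to actually running out of memory, whereas this sketch assumes an up-front size estimate is available.

```rust
/// Hypothetical strategy labels used only in this sketch.
#[derive(Debug, PartialEq)]
enum JoinStrategy {
    MergeJoin,
    HashJoin,
    HashJoinWithSpill,
}

/// Picks a join strategy following the rule of thumb from the discussion.
///
/// `build_side_bytes` and `memory_budget_bytes` are assumed estimates;
/// a real engine may only discover memory pressure at run time.
fn choose_join(
    sorted_on_join_keys: bool,
    build_side_bytes: u64,
    memory_budget_bytes: u64,
) -> JoinStrategy {
    if sorted_on_join_keys {
        // Inputs already sorted: merging needs no hash table.
        JoinStrategy::MergeJoin
    } else if build_side_bytes <= memory_budget_bytes {
        // Unsorted, but the hash table is expected to fit in memory.
        JoinStrategy::HashJoin
    } else {
        // Unsorted and too large: hash join that spills to disk.
        JoinStrategy::HashJoinWithSpill
    }
}
```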

comphead · 2024-04-03T15:23:55Z

We shouldn't be comparing HJ vs SMJ 1:1, but the performance has to be quite close? What I'm trying to do is find a strategy to remove the experimental flag from SMJ and prove that it is stable.

By the way, I found that fuzz tests are already in place: https://github.com/apache/arrow-datafusion/blob/daf182dc789230dbd9cf21ca2e975789213a5365/datafusion/core/tests/fuzz_cases/join_fuzz.rs#L128

comphead · 2024-04-15T23:51:57Z

I ran the TPCH benchmarks with SMJ and got:

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/joins/sort_merge_join.rs:1357:22:
index out of bounds: the len is 0 but the index is 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/joins/sort_merge_join.rs:1357:22:
index out of bounds: the len is 0 but the index is 1
thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/joins/sort_merge_join.rs:1357:22:
index out of bounds: the len is 0 but the index is 1
thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/joins/sort_merge_join.rs:1357:22:
index out of bounds: the len is 0 but the index is 1
Error: Context("Join Error", External(JoinError::Panic(Id(88693), ...)))
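For readers unfamiliar with this class of failure: the message above is what Rust's standard slice indexing produces when index 1 is requested from an empty slice or `Vec`. The sketch below is not the code at `sort_merge_join.rs:1357`, and it makes no claim about the root cause there. The function names are hypothetical. It only contrasts plain indexing, which panics, with `get`, which lets the caller handle the missing element.

```rust
/// Returns the element at `idx`, or `None` when the slice is too short.
fn value_at(values: &[i64], idx: usize) -> Option<i64> {
    values.get(idx).copied()
}

/// Same lookup with plain indexing; panics when `idx >= values.len()`.
fn value_at_or_panic(values: &[i64], idx: usize) -> i64 {
    values[idx]
}
```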

alamb · 2024-04-16T16:01:14Z

Seems like a good reason to keep it marked as experimental

comphead · 2024-04-16T16:12:01Z

> Seems like a good reason to keep it marked as experimental

I'll create a separate issue for it.
Once TPCH passes, we can come back to the SMJ status.

comphead added the enhancement (New feature or request) label Mar 29, 2024

comphead mentioned this issue Apr 5, 2024

Add spilling in SortMergeJoin #9359

Open

comphead mentioned this issue Apr 15, 2024

Adding TPCH benchmarks for Sort Merge Join #10092

Merged

comphead mentioned this issue Apr 16, 2024

fix Sort Merge Join to pass TPCH tests #10100

Closed

comphead mentioned this issue May 13, 2024

Fix: Sort Merge Join LeftSemi issues when JoinFilter is set #10304

Merged

This was referenced Jun 3, 2024

SMJ: Add more tests and improve comments #10784

Merged

Bench: Add PREFER_HASH_JOIN env variable #10809

Merged

comphead changed the title from "[DISCUSSION] Sort Merge Join Experimental status" to "[Epic] Remove Sort Merge Join Experimental status" Jul 9, 2024
