Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-35352][SQL] Add code-gen for full outer sort merge join
### What changes were proposed in this pull request? This PR is to add code-gen for FULL OUTER sort merge join. The change is in `SortMergeJoinExec.scala:codegenFullOuter()`. Followed the same algorithm in iterator mode - `SortMergeFullOuterJoinScanner`: maintain buffer for join left and right sides, and iterate over matched rows in buffers. Example query: ``` val df1 = spark.range(5).select($"id".as("k1")) val df2 = spark.range(10).select($"id".as("k2")) df1.join(df2.hint(hint), $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2", "full_outer") ``` Example generated code: https://gist.github.com/c21/5cab9751f24ae448d77a259d28cb77d7 In addition, to help review as this PR triggers several TPCDS plan files change. The below files are having the real code change: * `SortMergeJoinExec.scala` * `WholeStageCodegenSuite.scala` All other files are auto-generated golden file plan changes for TPCDS queries. ### Why are the changes needed? Improve the run-time/CPU performance of FULL OUTER sort merge join. Micro benchmark (same query in `JoinBenchmark.scala`): ``` def sortMergeJoin(): Unit = { val N = 2 << 20 codegenBenchmark("sort merge join", N) { val df1 = spark.range(N).selectExpr(s"id * 2 as k1") val df2 = spark.range(N).selectExpr(s"id * 3 as k2") val df = df1.join(df2, col("k1") === col("k2"), "full_outer") assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) df.noop() } } def sortMergeJoinWithDuplicates(): Unit = { val N = 2 << 20 codegenBenchmark("sort merge join with duplicates", N) { val df1 = spark.range(N) .selectExpr(s"(id * 15485863) % ${N*10} as k1") val df2 = spark.range(N) .selectExpr(s"(id * 15485867) % ${N*10} as k2") val df = df1.join(df2, col("k1") === col("k2"), "full_outer") assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) df.noop() } } ``` Seeing 20-30% of run-time improvement: ``` Running benchmark: sort merge join Running case: sort merge join wholestage off Stopped after 2 iterations, 2979 ms Running case: sort merge join wholestage on Stopped after 5 iterations, 5849 ms Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz sort merge join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ sort merge join wholestage off 1453 1490 52 1.4 693.0 1.0X sort merge join wholestage on 1115 1170 43 1.9 531.6 1.3X Running benchmark: sort merge join with duplicates Running case: sort merge join with duplicates wholestage off Stopped after 2 iterations, 3236 ms Running case: sort merge join with duplicates wholestage on Stopped after 5 iterations, 6768 ms Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz sort merge join with duplicates: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------------ sort merge join with duplicates wholestage off 1609 1618 13 1.3 767.2 1.0X sort merge join with duplicates wholestage on 1330 1354 24 1.6 634.4 1.2X ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? * Added unit test in `WholeStageCodegenSuite.scala`. * Existing unit test in `OuterJoinSuite.scala`. Closes #34581 from c21/smj-codegen. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- Loading branch information
Showing
14 changed files
with
327 additions
and
65 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.