[SPARK-35352][SQL] Add code-gen for full outer sort merge join #34581

c21 · 2021-11-13T09:46:31Z

What changes were proposed in this pull request?

This PR adds code-gen for FULL OUTER sort merge join. The change is in SortMergeJoinExec.scala:codegenFullOuter(). It follows the same algorithm as iterator mode (SortMergeFullOuterJoinScanner): maintain a buffer for each of the left and right join sides, and iterate over the matched rows in those buffers.
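For readers unfamiliar with the buffered algorithm, here is a minimal sketch in plain Scala collections. It is not the code generated by this PR and not the code of SortMergeFullOuterJoinScanner; the names `Row`, `fullOuterJoin` and `condition` are hypothetical and exist only for illustration. It assumes both inputs are already sorted by a non-null integer key (rows with null keys, which never match, are left out).

```scala
import scala.collection.mutable.ArrayBuffer

object FullOuterSmjSketch {
  // Hypothetical row type for illustration: a join key plus a payload.
  final case class Row(key: Int, value: String)

  /** Full outer join of two inputs that are already sorted by `key`. */
  def fullOuterJoin(
      left: IndexedSeq[Row],
      right: IndexedSeq[Row],
      condition: (Row, Row) => Boolean): Seq[(Option[Row], Option[Row])] = {
    val out = ArrayBuffer.empty[(Option[Row], Option[Row])]
    var l = 0
    var r = 0
    while (l < left.length && r < right.length) {
      val lk = left(l).key
      val rk = right(r).key
      if (lk < rk) {
        // Left key has no partner on the right: emit it padded with "null".
        out += ((Some(left(l)), None)); l += 1
      } else if (lk > rk) {
        // Right key has no partner on the left.
        out += ((None, Some(right(r)))); r += 1
      } else {
        // Keys match: buffer every row carrying this key on both sides.
        val leftBuf = ArrayBuffer.empty[Row]
        while (l < left.length && left(l).key == lk) { leftBuf += left(l); l += 1 }
        val rightBuf = ArrayBuffer.empty[Row]
        while (r < right.length && right(r).key == lk) { rightBuf += right(r); r += 1 }

        // Iterate over the buffers, remembering which rows found a match.
        val rightMatched = Array.fill(rightBuf.length)(false)
        for (lr <- leftBuf) {
          var leftMatched = false
          for (i <- rightBuf.indices if condition(lr, rightBuf(i))) {
            out += ((Some(lr), Some(rightBuf(i))))
            leftMatched = true
            rightMatched(i) = true
          }
          if (!leftMatched) out += ((Some(lr), None))
        }
        for (i <- rightBuf.indices if !rightMatched(i)) out += ((None, Some(rightBuf(i))))
      }
    }
    // Whatever remains on either side has no partner.
    while (l < left.length) { out += ((Some(left(l)), None)); l += 1 }
    while (r < right.length) { out += ((None, Some(right(r)))); r += 1 }
    out.toSeq
  }
}
```

With left keys 1, 2, 2 and right keys 2, 3 and a condition that is always true, the sketch yields four rows: the left-only row for key 1, two matched rows for key 2, and the right-only row for key 3. The per-side "matched" flags are what let a row that shares a key, but fails the extra join condition, still be emitted once with the other side empty.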

Example query:

val df1 = spark.range(5).select($"id".as("k1"))
val df2 = spark.range(10).select($"id".as("k2"))
df1.join(df2.hint(hint), $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2", "full_outer")

Example generated code: https://gist.github.com/c21/5cab9751f24ae448d77a259d28cb77d7

In addition, to help review: this PR triggers changes to several TPCDS plan files. Only the files below contain real code changes:

SortMergeJoinExec.scala
WholeStageCodegenSuite.scala

All other files are auto-generated golden file plan changes for TPCDS queries.

Why are the changes needed?

Improve the run-time/CPU performance of FULL OUTER sort merge join.

Micro benchmark (same query in JoinBenchmark.scala):

  def sortMergeJoin(): Unit = {
    val N = 2 << 20
    codegenBenchmark("sort merge join", N) {
      val df1 = spark.range(N).selectExpr(s"id * 2 as k1")
      val df2 = spark.range(N).selectExpr(s"id * 3 as k2")
      val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
      df.noop()
    }
  }

  def sortMergeJoinWithDuplicates(): Unit = {
    val N = 2 << 20
    codegenBenchmark("sort merge join with duplicates", N) {
      val df1 = spark.range(N)
        .selectExpr(s"(id * 15485863) % ${N*10} as k1")
      val df2 = spark.range(N)
        .selectExpr(s"(id * 15485867) % ${N*10} as k2")
      val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
      df.noop()
    }
  }

Seeing a 20-30% run-time improvement:

Running benchmark: sort merge join
  Running case: sort merge join wholestage off
  Stopped after 2 iterations, 2979 ms
  Running case: sort merge join wholestage on
  Stopped after 5 iterations, 5849 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
sort merge join:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
sort merge join wholestage off                     1453           1490          52          1.4         693.0       1.0X
sort merge join wholestage on                      1115           1170          43          1.9         531.6       1.3X

Running benchmark: sort merge join with duplicates
  Running case: sort merge join with duplicates wholestage off
  Stopped after 2 iterations, 3236 ms
  Running case: sort merge join with duplicates wholestage on
  Stopped after 5 iterations, 6768 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
sort merge join with duplicates:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------
sort merge join with duplicates wholestage off           1609           1618          13          1.3         767.2       1.0X
sort merge join with duplicates wholestage on            1330           1354          24          1.6         634.4       1.2X
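As a reading aid for the tables above, the derived columns can be reproduced from the best time and the row count. The sketch below assumes (it is an assumption, not taken from this PR) that Rate, Per Row and Relative are computed from the best time and N = 2 << 20; the printed numbers agree with that to within rounding. The object and method names are hypothetical.

```scala
object BenchmarkColumns {
  // N from the micro benchmark above: 2 << 20 rows.
  val numRows: Long = 2L << 20

  // Nanoseconds spent per row, given a best time in milliseconds.
  def perRowNs(bestMs: Double): Double = bestMs * 1e6 / numRows

  // Throughput in millions of rows per second.
  def rateMPerSec(bestMs: Double): Double = numRows / (bestMs * 1e3)

  // Speed-up of a case relative to the baseline (first) case.
  def relative(baselineBestMs: Double, bestMs: Double): Double = baselineBestMs / bestMs
}
```

For example, `BenchmarkColumns.relative(1453, 1115)` is about 1.30 and `BenchmarkColumns.relative(1609, 1330)` is about 1.21, which line up with the 1.3X and 1.2X shown in the Relative column.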

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit test in WholeStageCodegenSuite.scala.
Existing unit test in OuterJoinSuite.scala.

SparkQA · 2021-11-13T11:00:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49655/

SparkQA · 2021-11-13T11:15:09Z

Test build #145186 has finished for PR 34581 at commit f712f74.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-11-13T11:41:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49655/

SparkQA · 2021-11-13T23:56:12Z

Test build #145194 has finished for PR 34581 at commit aa6ebd9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-11-14T00:04:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49663/

SparkQA · 2021-11-14T01:02:13Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49663/

SparkQA · 2021-11-14T02:07:38Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49664/

SparkQA · 2021-11-14T02:50:25Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49664/

SparkQA · 2021-11-14T05:06:13Z

Test build #145195 has finished for PR 34581 at commit e8b2477.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-11-14T10:21:05Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49676/

SparkQA · 2021-11-14T11:06:59Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49676/

SparkQA · 2021-11-14T14:08:34Z

Test build #145207 has finished for PR 34581 at commit c846656.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

c21 · 2021-11-15T01:10:39Z

cc @cloud-fan could you help take a look when you have time? Thanks.

SparkQA · 2021-11-15T08:11:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49695/

cloud-fan · 2021-11-15T08:24:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

+    val rightIndex = ctx.freshName("rightIndex")
+
+    // Generate code for join condition
+    // val leftVars = genOneSideJoinVars(ctx, leftOutputRow, left, setDefaultValue = false)


shall we remove this line?

@cloud-fan - ah sorry, was added during debugging, removed.

cloud-fan · 2021-11-15T08:43:26Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

+    val leftAnyNull = leftKeyVars.map(_.isNull).mkString(" || ")
+    val rightKeyVars = createJoinKey(ctx, rightInputRow, rightKeys, right.output)
+    val rightAnyNull = rightKeyVars.map(_.isNull).mkString(" || ")
+    val matchedKeyVars = copyKeys(ctx, leftKeyVars)


why it's only related to left keys?

oh i see, it's matched, so left and right keys are the same.

@cloud-fan - yeah, it can be copying from either leftKeyVars or rightKeyVars.
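To illustrate the `mkString(" || ")` lines in the snippet under review, here is a small sketch of how a per-key null flag list becomes a single Java boolean expression in the generated code. `KeyVar` is a hypothetical stand-in for the generated key variables, not Spark's actual class, and the variable names are made up.

```scala
object AnyNullSketch {
  // Hypothetical stand-in: names of the generated Java variables that hold
  // a key's null flag and its value.
  final case class KeyVar(isNull: String, value: String)

  // Builds a Java expression that is true when any join key is null.
  def anyNull(keys: Seq[KeyVar]): String = keys.map(_.isNull).mkString(" || ")
}
```

Calling `AnyNullSketch.anyNull` on two keys whose flags are named `leftIsNull_0` and `leftIsNull_1` returns the string `leftIsNull_0 || leftIsNull_1`. Since the matched-key copy is only taken once the left and right keys compare equal, copying from either side gives the same values, as noted in the thread.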

SparkQA · 2021-11-15T08:51:58Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49695/

SparkQA · 2021-11-15T10:15:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49696/

SparkQA · 2021-11-15T10:59:55Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49696/

cloud-fan · 2021-11-15T11:34:48Z

thanks, merging to master!

SparkQA · 2021-11-15T11:53:58Z

Test build #145225 has finished for PR 34581 at commit 23eedba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

c21 · 2021-11-15T12:20:35Z

Thank you @cloud-fan for review!

SparkQA · 2021-11-15T14:02:37Z

Test build #145226 has finished for PR 34581 at commit 34830f2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions bot added the SQL label Nov 13, 2021

c21 added 4 commits November 14, 2021 22:55

Add codegen for full outer SMJ

5dfe420

throw IOException in function definition to fix unit test failures

2f7996d

Do not enable code-gen for existence join to fix unit test failure

2e58ea7

regenerate plan golden files to fix unit test failures

8ee6d2d

c21 force-pushed the smj-codegen branch from c846656 to 8ee6d2d Compare November 15, 2021 06:55

Fix unit test failure after rebasing to master

23eedba

cloud-fan reviewed Nov 15, 2021

cloud-fan approved these changes Nov 15, 2021

c21 added 2 commits November 15, 2021 00:52

Address comment to remove unnecessary comment

3c96a72

Remove unnecessary extra space as well

34830f2

cloud-fan closed this in 2ef60f7 Nov 15, 2021

c21 deleted the smj-codegen branch November 15, 2021 12:20
