[SPARK-46653][SQL] Code-gen for full outer sort merge join output row by row by zml1206 · Pull Request #44660 · apache/spark

zml1206 · 2024-01-10T05:09:34Z

What changes were proposed in this pull request?

To stay consistent with the non-code-gen path (code-gen disabled), update the code-gen for full outer sort merge join so that it outputs rows one at a time.
For example:

val a = Seq((1, 2), (2, 3)).toDF("a", "b")
val b = Seq((2, 5), (3, 4)).toDF("a", "c")
a.join(b, Seq("a"), "fullouter")

Generated code before this PR: https://gist.github.com/zml1206/aff18fc313a7164d6f65096a97d233eb
Generated code after this PR: https://gist.github.com/zml1206/a27350b8849951e6efac0fb6088e527f

Why are the changes needed?

Avoid OOM. When code-gen for full outer sort merge join is enabled and the parent of SortMergeJoin does not support code-gen, the join has to append all output rows for the same key to BufferedRowIterator.currentRows, which is a LinkedList. If a key has a large number of duplicates, this is likely to cause an executor OOM.
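To make the memory argument concrete, here is a minimal Java sketch contrasting the two output strategies for a single key group. It is not Spark code: `BufferPeakSketch`, `eagerPeak`, `rowByRowPeak`, and `repeat` are hypothetical names, and the local `currentRows` list only stands in for `BufferedRowIterator.currentRows`. The sketch assumes the consumer drains one row before the next one is produced in the row-by-row case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration only; not Spark's BufferedRowIterator.
final class BufferPeakSketch {

    // Eager style: every joined row of the key group is appended
    // before the consumer gets a chance to read anything.
    static int eagerPeak(List<String> left, List<String> right) {
        LinkedList<String> currentRows = new LinkedList<>();
        for (String l : left) {
            for (String r : right) {
                currentRows.add(l + "," + r);
            }
        }
        return currentRows.size(); // peak = |left| * |right|
    }

    // Row-by-row style: one joined row is appended, then consumed,
    // before the indexes advance to the next pair.
    static int rowByRowPeak(List<String> left, List<String> right) {
        LinkedList<String> currentRows = new LinkedList<>();
        int peak = 0;
        int leftIndex = 0;
        int rightIndex = 0;
        while (leftIndex < left.size() && !right.isEmpty()) {
            currentRows.add(left.get(leftIndex) + "," + right.get(rightIndex));
            peak = Math.max(peak, currentRows.size());
            currentRows.remove(); // the consumer takes the row
            rightIndex++;
            if (rightIndex == right.size()) { // finished this left row
                rightIndex = 0;
                leftIndex++;
            }
        }
        return peak;
    }

    // Builds n copies of the same value, i.e. n rows sharing one key.
    static List<String> repeat(String value, int n) {
        return new ArrayList<>(Collections.nCopies(n, value));
    }

    public static void main(String[] args) {
        List<String> left = repeat("testvalue1", 100);
        List<String> right = repeat("testvalue2", 100);
        System.out.println(eagerPeak(left, right));    // prints 10000
        System.out.println(rowByRowPeak(left, right)); // prints 1
    }
}
```

With m left rows and n right rows for one key, the eager variant holds m * n rows at once, while the row-by-row variant holds at most one.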

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs and a local test:

val df1 = spark.range(10000).map(_ => ("testkey", "testvalue1")).toDF("key", "value")
val df2 = spark.range(10000).map(_ => ("testkey", "testvalue2")).toDF("key", "value")
df1.join(df2, Seq("key"), "fullouter").show()

Run in local mode with 1 GB of driver memory.
Before this PR, the query fails with an OOM:

java.lang.OutOfMemoryError: Java heap space
	at java.base/java.util.LinkedList.linkLast(LinkedList.java:146)
	at java.base/java.util.LinkedList.add(LinkedList.java:342)
	at org.apache.spark.sql.execution.BufferedRowIterator.append(BufferedRowIterator.java:73)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.smj_consumeFullOuterJoinRow_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)

After this PR, the query completes successfully.

Was this patch authored or co-authored using generative AI tooling?

No.

zml1206 · 2024-01-12T02:13:16Z

@cloud-fan Can you help take a look if you have time? Thanks.

beliefer · 2024-01-12T11:37:43Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

Why do you need to replace for with while?

It makes the loop easier to control: the index is incremented (+1) in different places.
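One plausible reading of this reply, sketched in Java under stated assumptions: when the loop index lives in a field and the method must return after producing a single row, a `while` loop lets the index be advanced at each exit point separately. `ResumableScanSketch` and `nextMatch` are hypothetical names, and the even-number check merely stands in for a join condition; none of this is taken from the PR's generated code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the code generated by SortMergeJoinExec.
final class ResumableScanSketch {
    private final List<Integer> buffer;
    private int index = 0; // kept in a field so a later call can resume

    ResumableScanSketch(List<Integer> buffer) {
        this.buffer = buffer;
    }

    // Returns the next even value, or null when the buffer is exhausted.
    Integer nextMatch() {
        while (index < buffer.size()) {
            int value = buffer.get(index);
            if (value % 2 != 0) {
                index++;   // +1 when the row is skipped
                continue;
            }
            index++;       // +1 before handing the row to the caller
            return value;
        }
        return null;
    }

    public static void main(String[] args) {
        ResumableScanSketch scan = new ResumableScanSketch(Arrays.asList(1, 2, 3, 4, 6));
        System.out.println(scan.nextMatch()); // prints 2
        System.out.println(scan.nextMatch()); // prints 4
        System.out.println(scan.nextMatch()); // prints 6
        System.out.println(scan.nextMatch()); // prints null
    }
}
```

A `for` loop with a single update clause would run that update only at the end of each iteration, so returning from the middle of the body would require extra bookkeeping to avoid re-reading the same row on the next call.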

zml1206 · 2024-01-23T10:14:23Z

cc @cloud-fan @wankunde @ulysses-you do you have any thoughts about this? Thanks.

wankunde · 2024-01-24T07:50:38Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

Could we wrap "$matchRowsInBuffer" as a function? There is too much duplicated code.

Done. The generated code in the PR description has also been updated.

wankunde · 2024-01-25T02:29:13Z

I'm sorry, where is smj_leftIndex_0 reset to 0?

https://gist.github.com/zml1206/a27350b8849951e6efac0fb6088e527f#file-full_outer_sort_merge_join_codegen_after-L280-L305

zml1206 · 2024-01-25T02:34:09Z

I'm sorry, where is smj_leftIndex_0 reset to 0?

https://gist.github.com/zml1206/a27350b8849951e6efac0fb6088e527f#file-full_outer_sort_merge_join_codegen_after-L280-L305

https://gist.github.com/zml1206/a27350b8849951e6efac0fb6088e527f#file-full_outer_sort_merge_join_codegen_after-L126

zml1206 · 2024-01-25T02:45:07Z

Whenever processNext is called, the buffer is consumed first; findNextJoinRows is then called to reset the index and the buffer and to match the new row written into the buffer. @wankunde
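The call order described here can be sketched in Java as follows. This is a simplified model, not the generated code from the gist: only the method names `processNext` and `findNextJoinRows` come from the thread, while the class `ProcessNextSketch`, its fields, and the input shape (each group is a list whose first element is the left row and whose remaining elements are the matching right rows) are assumptions made for illustration. Handling of unmatched rows, which a full outer join also needs, is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; a simplified model of the described flow.
final class ProcessNextSketch {
    private final Iterator<List<String>> groups; // assumed input: one list per key group
    private String leftRow;
    private List<String> buffer = new ArrayList<>();
    private int index = 0;

    ProcessNextSketch(List<List<String>> groups) {
        this.groups = groups.iterator();
    }

    // Loads the next key group and resets both the index and the buffer.
    private boolean findNextJoinRows() {
        if (!groups.hasNext()) {
            return false;
        }
        List<String> group = groups.next();
        leftRow = group.get(0);
        buffer = group.subList(1, group.size());
        index = 0; // this is where the index goes back to 0
        return true;
    }

    // Emits exactly one joined row per call, or null when the input is exhausted.
    String processNext() {
        while (true) {
            if (index < buffer.size()) {        // 1. consume the buffer first
                return leftRow + "," + buffer.get(index++);
            }
            if (!findNextJoinRows()) {          // 2. then look for the next rows
                return null;
            }
        }
    }

    public static void main(String[] args) {
        ProcessNextSketch join = new ProcessNextSketch(Arrays.asList(
            Arrays.asList("l1", "r1", "r2"),
            Arrays.asList("l2", "r3")));
        String row;
        while ((row = join.processNext()) != null) {
            System.out.println(row); // prints l1,r1 then l1,r2 then l2,r3
        }
    }
}
```

Because the index is reset only inside `findNextJoinRows`, a call that returns in the middle of a buffer leaves the index in place for the next call to pick up.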

Fix oom by code-gen for full outer sort merge join

github-actions · 2024-06-28T00:20:12Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label Jan 10, 2024

zml1206 force-pushed the SPARK-46653 branch from 3f95485 to df40d30 Compare January 10, 2024 05:11

beliefer reviewed Jan 12, 2024

zml1206 changed the title ~~[SPARK-46653][SQL] Code-gen for full outer sort merge join output line by line~~ [SPARK-46653][SQL] Code-gen for full outer sort merge join output row by row Jan 16, 2024

wankunde reviewed Jan 24, 2024

zml1206 force-pushed the SPARK-46653 branch from c06dd25 to 605ae61 Compare January 24, 2024 11:34

zml1206 added 2 commits March 19, 2024 17:30

Fix oom by code-gen for full outer sort merge join

8e244d2

Fix oom by code-gen for full outer sort merge join

wrap "$matchRowsInBuffer" as a function

f34e0a5

zml1206 force-pushed the SPARK-46653 branch from 605ae61 to f34e0a5 Compare March 19, 2024 09:31

github-actions bot added the Stale label Jun 28, 2024

github-actions bot closed this Jun 29, 2024

zml1206 wants to merge 2 commits into apache:master from zml1206:SPARK-46653


3 participants
