
[SPARK-43911] [SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array #41419

Closed
wants to merge 5 commits

Conversation

mcdull-zhang
Contributor

@mcdull-zhang mcdull-zhang commented Jun 1, 2023

What changes were proposed in this pull request?

When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for dynamic partition pruning, it collects all the keys into an Array and then calls distinct on that Array to remove duplicates.

In general, a broadcast HashedRelation may contain many rows, and the pruning keys are often highly repetitive. The intermediate Array can therefore occupy a large amount of memory (memory that is not managed by the MemoryManager), which may trigger an OOM.

The approach here is to call toSet on the iterator directly, which deduplicates the keys without materializing a large Array.
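A minimal sketch of the difference, in Python rather than Spark's actual Scala code (the function names here are hypothetical, chosen only to mirror `iterator.toArray.distinct` versus `iterator.toSet`): deduplicating while consuming the iterator keeps peak memory proportional to the number of *distinct* keys, not the total row count.

```python
from typing import Iterable, List, Set

def dedup_via_array(rows: Iterable[int]) -> List[int]:
    # Old approach: materialize every key first, then deduplicate.
    # Peak memory is proportional to the total row count.
    all_keys = list(rows)                  # analogous to iterator.toArray
    return list(dict.fromkeys(all_keys))   # analogous to .distinct

def dedup_via_set(rows: Iterable[int]) -> Set[int]:
    # New approach: stream the iterator straight into a set, so the
    # full key sequence is never held at once.
    return set(rows)                       # analogous to iterator.toSet

# Highly repetitive keys, as described for DPP broadcast relations:
rows = (i % 10 for i in range(1_000_000))
assert dedup_via_set(rows) == set(range(10))
```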

Why are the changes needed?

Avoids OOM exceptions such as the following:

```text
Exception in thread "dynamicpruning-0" java.lang.OutOfMemoryError: Java heap space
	at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
	at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
	at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
	at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:85)
	at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.generic.Growable.$anonfun$$plus$plus$eq$1(Growable.scala:62)
	at scala.collection.generic.Growable$$Lambda$7/1514840818.apply(Unknown Source)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at org.apache.spark.sql.execution.SubqueryBroadcastExec.$anonfun$relationFuture$2(SubqueryBroadcastExec.scala:92)
	at org.apache.spark.sql.execution.SubqueryBroadcastExec$$Lambda$4212/5099232.apply(Unknown Source)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:140)
```

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually verified in a production environment; existing unit tests pass.

@github-actions github-actions bot added the SQL label Jun 1, 2023
@wangyum
Member

wangyum commented Jun 2, 2023

@mcdull-zhang Please fix the PR title and PR description according to the new changes.

@mcdull-zhang mcdull-zhang changed the title [SPARK-43911] [SQL] Directly use Set to consume iterator data to deduplicate, thereby reducing memory usage [SPARK-43911] [SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array Jun 2, 2023
@mcdull-zhang
Contributor Author

@mcdull-zhang Please fix the PR title and PR description according to the new changes.

Modified.

@wangyum wangyum closed this in 595ad30 Jun 4, 2023
wangyum pushed a commit that referenced this pull request Jun 4, 2023
Closes #41419 from mcdull-zhang/reduce_memory_usage.

Authored-by: mcdull-zhang <work4dong@163.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 595ad30)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
@wangyum
Member

wangyum commented Jun 4, 2023

Merged to master and branch-3.4.

@HyukjinKwon
Member

I don't know why, but it seems this PR breaks the build with an OOM. Let me revert it and see how it goes, if you don't mind.

@HyukjinKwon
Member

#41452

@wangyum
Member

wangyum commented Jun 6, 2023

+1 to revert it. This is because a Set can take up much more memory than an Array:

```scala
val plan = sql("select id from range(100000)").queryExecution.executedPlan
println(com.carrotsearch.sizeof.RamUsageEstimator.sizeOf(plan.executeToIterator().toSet))
println(com.carrotsearch.sizeof.RamUsageEstimator.sizeOf(plan.executeToIterator().toArray))
```

Output:

```text
11460896
7600016
```
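The same effect can be shown with plain Python containers (an illustrative analogue, not Spark's data structures, and `sys.getsizeof` only measures the container itself): a hash set keeps a sparse bucket table, so when the keys are mostly unique it ends up larger than a flat array of the same elements.

```python
import sys

n = 100_000
unique_keys = list(range(n))  # every key distinct, as in range(100000)

# Container overhead only (sys.getsizeof does not follow references).
# A list stores a dense pointer array; a set must keep its hash table
# sparse to bound collisions, so it allocates far more slots than items.
array_size = sys.getsizeof(unique_keys)
set_size = sys.getsizeof(set(unique_keys))

print(array_size, set_size)
assert set_size > array_size
```

This is why the `toSet` change only helps when the keys are highly repetitive; with mostly-unique keys it can make memory usage worse, which matches the revert decision.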

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023