
[SPARK-44021][SQL] Add spark.sql.files.maxPartitionNum #41545

Closed · wangyum wants to merge 2 commits into apache:master from wangyum:SPARK-44021

Conversation

wangyum (Member) commented Jun 11, 2023

What changes were proposed in this pull request?

This PR adds a new SQL config: spark.sql.files.maxPartitionNum. Users can set it to avoid generating too many partitions when reading file-based sources. Too many partitions increase various driver overheads and can cause the Shuffle service to OOM.
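As a usage sketch, opting in could look like the following PySpark session (the path and app name are hypothetical; the value 20000 matches the manual test described later in this PR, and the cap only takes effect when the user sets it):

```python
from pyspark.sql import SparkSession

# Hypothetical session; assumes a Spark build that includes this PR (3.5.0+).
spark = SparkSession.builder.appName("cap-read-partitions").getOrCreate()

# Cap the number of partitions generated when reading file-based sources.
spark.conf.set("spark.sql.files.maxPartitionNum", "20000")

df = spark.read.parquet("/warehouse/huge_table")  # hypothetical path
print(df.rdd.getNumPartitions())  # should stay at or below 20000
```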

The following is the GC log of the Shuffle service:

```
2023-06-08T01:41:01.871-0700: 7303.965: [Full GC (Allocation Failure) 2023-06-08T01:41:01.871-0700: 7303.965: [CMS: 4194304K->4194304K(4194304K), 7.4010107 secs]2023-06-08T01:41:09.272-0700: 7311.366: [Class Histogram (after full gc):
 num     #instances         #bytes  class name
----------------------------------------------
   1:       7110660     2334927400  [C
   2:      19465810      467514416  [I
   3:       6754570      270182800  org.apache.spark.network.protocol.ChunkFetchRequest
   4:       6661155      266446200  org.sparkproject.io.netty.channel.DefaultChannelPromise
   5:       6639056      265562240  org.apache.spark.network.buffer.FileSegmentManagedBuffer
   6:       6639055      265562200  org.apache.spark.network.protocol.RequestTraceInfo
   7:       6663764      213240448  org.sparkproject.io.netty.util.Recycler$DefaultHandle
   8:       6659382      213100224  org.sparkproject.io.netty.channel.AbstractChannelHandlerContext$WriteTask
   9:       6659218      213094976  org.apache.spark.network.server.ChunkFetchRequestHandler$$Lambda$156/886274988
  10:       6640444      212494208  java.io.File
...
```

Why are the changes needed?

  1. The PR aims to selectively rescale only large RDDs while keeping the existing behavior for small RDDs. Directly increasing spark.sql.files.maxPartitionBytes is therefore not acceptable:

     1. A single SQL query may read multiple data sources, and setting spark.sql.files.maxPartitionBytes affects all of them.
     2. We don't know in advance how large spark.sql.files.maxPartitionBytes should be; sometimes it may need to be very large (more than 20 GiB).
  2. To avoid generating too many partitions when scanning a very large partitioned and bucketed table, since bucket scan is not always used after SPARK-32859.

     Before SPARK-32859 | After SPARK-32859
     --- | ---
     <img width="400" src="https://github.com/apache/spark/assets/5399861/5e14932b-aa3d-4b14-b80c-e3ff348958c4"> | <img width="400" src="https://github.com/apache/spark/assets/5399861/170311a0-c086-408a-9d95-17031e21e16a">
  3. To avoid generating too many partitions when there are lots of small files.

Does this PR introduce any user-facing change?

No, unless the user sets spark.sql.files.maxPartitionNum.

How was this patch tested?

Unit test and manual testing:

Before this PR | After this PR and `set spark.sql.files.maxPartitionNum=20000`
--- | ---
<img width="400" src="https://github.com/apache/spark/assets/5399861/ffda1850-cd4a-4970-a4e5-e1e43177135a"> | <img width="330" src="https://github.com/apache/spark/assets/5399861/1df7cac7-fe82-4af3-b3ec-91aa23c79a8b">

@github-actions github-actions bot added the SQL label Jun 11, 2023
pan3793 (Member) commented Jun 12, 2023

Thanks @wangyum, this really helps queries that scan huge tables.

```scala
"and ORC.")
.version("3.5.0")
.intConf
.checkValue(threshold => threshold > 0,
```
A Contributor commented:
should we change to check threshold > FILES_MIN_PARTITION_NUM?

Another Contributor replied:
According to the documentation of FILES_MIN_PARTITION_NUM, it seems that FILES_MIN_PARTITION_NUM only suggests (does not guarantee) a minimum number of split file partitions, so maybe we can't rely on it.


@wangyum wangyum changed the title [SPARK-44021][SQL] Add spark.sql.files.maxDesiredPartitionNum [SPARK-44021][SQL] Add spark.sql.files.maxPartitionNum Jun 12, 2023
wangyum (Member, Author) commented Jun 12, 2023

cc @cloud-fan

dongjoon-hyun (Member) left a comment:

Thank you, @wangyum. Could you add a description of why you couldn't simply increase spark.sql.files.maxPartitionBytes instead? Maybe this PR is aiming to selectively tune large RDDs only while keeping the existing behavior in the same way for small RDDs?

wangyum (Member, Author) commented Jun 13, 2023

Yes. This PR aims to selectively tune only large RDDs while keeping the existing behavior for small RDDs.

Increasing spark.sql.files.maxPartitionBytes directly has these issues:

  1. A single SQL query may read multiple data sources, and setting spark.sql.files.maxPartitionBytes affects all of them.
  2. We don't know in advance how large spark.sql.files.maxPartitionBytes should be; sometimes it may need to be very large (maybe more than 20 GiB).

dongjoon-hyun (Member):

Thanks. Please add that into the PR description to make it a commit log.

```scala
maxSplitBytes: Long): Seq[FilePartition] = {
val openCostBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
val maxPartitionNum = sparkSession.sessionState.conf.filesMaxPartitionNum
val partitions = getFilePartitions(partitionedFiles, maxSplitBytes, openCostBytes)
```
A Member commented:
After this line, we ignore spark.sql.files.maxPartitionBytes. Could you add a warning about this side effect? It means each task will take longer, and the job can become slower in some cases.

wangyum (Member, Author) replied:

fixed.

@github-actions github-actions bot added the DOCS label Jun 13, 2023
dongjoon-hyun (Member) left a comment:

+1, LGTM from my side. Thank you for updating, @wangyum .

dongjoon-hyun (Member):

Merged to master. Thank you, @wangyum , @pan3793 , @beliefer , @LuciferYang !

@wangyum wangyum deleted the SPARK-44021 branch June 14, 2023 01:16
czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
Closes apache#41545 from wangyum/SPARK-44021.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
wangyum (Member, Author) commented Nov 17, 2023

Limiting the number of mappers can also reduce RPC calls to improve NameNode performance.

Data size: 3.8 T, File number: 3992

Mapper tasks | 31920 | 15960 | 7980 | 3990 | 3648
--- | --- | --- | --- | ---
NameNode RPC calls | 127671 | 64684 | 31919 | 15953 | 15953
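Reading those numbers: NameNode RPC calls scale roughly linearly with mapper count (about four RPCs per mapper), until the RPC count bottoms out at 15953 for both 3990 and 3648 mappers, presumably because the file count (3992) becomes the limiting factor. A quick check of the ratios from the table above:

```python
mappers = [31920, 15960, 7980, 3990, 3648]
rpc_calls = [127671, 64684, 31919, 15953, 15953]

# RPCs per mapper task, computed from the measurements in the comment above.
ratios = [round(r / m, 2) for r, m in zip(rpc_calls, mappers)]
print(ratios)  # → [4.0, 4.05, 4.0, 4.0, 4.37]
```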

dongjoon-hyun (Member):

Thank you for sharing that result additionally, @wangyum .
