
[SPARK-44021][SQL] Add spark.sql.files.maxPartitionNum #41545

Closed · wangyum wants to merge 2 commits into apache:master from wangyum:SPARK-44021

Conversation

wangyum (Member) commented Jun 11, 2023

What changes were proposed in this pull request?

This PR adds a new SQL config: spark.sql.files.maxPartitionNum. Users can set it to avoid generating too many partitions when reading file-based sources. Too many partitions increase various driver overheads and can cause the Shuffle service to OOM.
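As a usage sketch, opting in could look like the following PySpark session (the path and app name are hypothetical; the value 20000 matches the manual test described later in this PR, and the cap only takes effect when the user sets it):

```python
from pyspark.sql import SparkSession

# Hypothetical session; assumes a Spark build that includes this PR (3.5.0+).
spark = SparkSession.builder.appName("cap-read-partitions").getOrCreate()

# Cap the number of partitions generated when reading file-based sources.
spark.conf.set("spark.sql.files.maxPartitionNum", "20000")

df = spark.read.parquet("/warehouse/huge_table")  # hypothetical path
print(df.rdd.getNumPartitions())  # should stay at or below 20000
```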

The following is the GC log of the Shuffle service:

```
2023-06-08T01:41:01.871-0700: 7303.965: [Full GC (Allocation Failure) 2023-06-08T01:41:01.871-0700: 7303.965: [CMS: 4194304K->4194304K(4194304K), 7.4010107 secs]2023-06-08T01:41:09.272-0700: 7311.366: [Class Histogram (after full gc):
 num     #instances         #bytes  class name
----------------------------------------------
   1:       7110660     2334927400  [C
   2:      19465810      467514416  [I
   3:       6754570      270182800  org.apache.spark.network.protocol.ChunkFetchRequest
   4:       6661155      266446200  org.sparkproject.io.netty.channel.DefaultChannelPromise
   5:       6639056      265562240  org.apache.spark.network.buffer.FileSegmentManagedBuffer
   6:       6639055      265562200  org.apache.spark.network.protocol.RequestTraceInfo
   7:       6663764      213240448  org.sparkproject.io.netty.util.Recycler$DefaultHandle
   8:       6659382      213100224  org.sparkproject.io.netty.channel.AbstractChannelHandlerContext$WriteTask
   9:       6659218      213094976  org.apache.spark.network.server.ChunkFetchRequestHandler$$Lambda$156/886274988
  10:       6640444      212494208  java.io.File
...
```

Why are the changes needed?

  1. The PR aims to selectively rescale only large RDDs while keeping the existing behavior for small RDDs. Directly increasing spark.sql.files.maxPartitionBytes is therefore not acceptable:

     1. A single SQL query may read multiple data sources, and setting spark.sql.files.maxPartitionBytes affects all of them.
     2. We don't know in advance how large spark.sql.files.maxPartitionBytes should be; sometimes it may need to be very large (more than 20 GiB).
  2. To avoid generating too many partitions when scanning a very large partitioned and bucketed table, since bucket scan is not always used after SPARK-32859.

     Before SPARK-32859 | After SPARK-32859
     --- | ---
     <img width="400" src="https://github.com/apache/spark/assets/5399861/5e14932b-aa3d-4b14-b80c-e3ff348958c4"> | <img width="400" src="https://github.com/apache/spark/assets/5399861/170311a0-c086-408a-9d95-17031e21e16a">
  3. To avoid generating too many partitions when there are lots of small files.

Does this PR introduce any user-facing change?

No, unless the user sets spark.sql.files.maxPartitionNum.

How was this patch tested?

Unit test and manual testing:

Before this PR | After this PR and `set spark.sql.files.maxPartitionNum=20000`
--- | ---
<img width="400" src="https://github.com/apache/spark/assets/5399861/ffda1850-cd4a-4970-a4e5-e1e43177135a"> | <img width="330" src="https://github.com/apache/spark/assets/5399861/1df7cac7-fe82-4af3-b3ec-91aa23c79a8b">

@github-actions github-actions bot added the SQL label Jun 11, 2023
pan3793 (Member) commented Jun 12, 2023

Thanks @wangyum, this really helps queries that scan huge tables.

```scala
"and ORC.")
.version("3.5.0")
.intConf
.checkValue(threshold => threshold > 0,
```
A Contributor commented:
should we change to check threshold > FILES_MIN_PARTITION_NUM?

Another Contributor replied:
According to the documentation of FILES_MIN_PARTITION_NUM, it seems that FILES_MIN_PARTITION_NUM only suggests (does not guarantee) a minimum number of split file partitions, so maybe we can't rely on it.


@wangyum wangyum changed the title [SPARK-44021][SQL] Add spark.sql.files.maxDesiredPartitionNum [SPARK-44021][SQL] Add spark.sql.files.maxPartitionNum Jun 12, 2023
wangyum (Member, Author) commented Jun 12, 2023

cc @cloud-fan

dongjoon-hyun (Member) left a comment:

Thank you, @wangyum. Could you add a description of why you couldn't simply increase spark.sql.files.maxPartitionBytes instead? Maybe this PR is aiming to selectively tune large RDDs only while keeping the existing behavior in the same way for small RDDs?

wangyum (Member, Author) commented Jun 13, 2023

Yes. This PR aims to selectively tune only large RDDs while keeping the existing behavior for small RDDs.

Increasing spark.sql.files.maxPartitionBytes directly has these issues:

  1. A single SQL query may read multiple data sources, and setting spark.sql.files.maxPartitionBytes affects all of them.
  2. We don't know in advance how large spark.sql.files.maxPartitionBytes should be; sometimes it may need to be very large (maybe more than 20 GiB).

dongjoon-hyun (Member):

Thanks. Please add that into the PR description to make it a commit log.

```scala
maxSplitBytes: Long): Seq[FilePartition] = {
val openCostBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
val maxPartitionNum = sparkSession.sessionState.conf.filesMaxPartitionNum
val partitions = getFilePartitions(partitionedFiles, maxSplitBytes, openCostBytes)
```
A Member commented:
After this line, we ignore spark.sql.files.maxPartitionBytes. Could you add a warning about this side effect? It means each task will take longer, and the job can become slower in some cases.

wangyum (Member, Author) replied:

fixed.

@github-actions github-actions bot added the DOCS label Jun 13, 2023
dongjoon-hyun (Member) left a comment:

+1, LGTM from my side. Thank you for updating, @wangyum .

dongjoon-hyun (Member):

Merged to master. Thank you, @wangyum , @pan3793 , @beliefer , @LuciferYang !

@wangyum wangyum deleted the SPARK-44021 branch June 14, 2023 01:16
czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
Closes apache#41545 from wangyum/SPARK-44021.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
wangyum (Member, Author) commented Nov 17, 2023

Limiting the number of mappers can also reduce RPC calls to improve NameNode performance.

Data size: 3.8 T, File number: 3992

Mapper tasks | 31920 | 15960 | 7980 | 3990 | 3648
--- | --- | --- | --- | ---
NameNode RPC calls | 127671 | 64684 | 31919 | 15953 | 15953
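Reading those numbers: NameNode RPC calls scale roughly linearly with mapper count (about four RPCs per mapper), until the RPC count bottoms out at 15953 for both 3990 and 3648 mappers, presumably because the file count (3992) becomes the limiting factor. A quick check of the ratios from the table above:

```python
mappers = [31920, 15960, 7980, 3990, 3648]
rpc_calls = [127671, 64684, 31919, 15953, 15953]

# RPCs per mapper task, computed from the measurements in the comment above.
ratios = [round(r / m, 2) for r, m in zip(rpc_calls, mappers)]
print(ratios)  # → [4.0, 4.05, 4.0, 4.0, 4.37]
```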

dongjoon-hyun (Member):

Thank you for sharing that result additionally, @wangyum .
