
[SPARK-32019][SQL] Add spark.sql.files.minPartitionNum config #28853

Closed
wants to merge 10 commits

Conversation

ulysses-you
Contributor

What changes were proposed in this pull request?

Add a new config spark.sql.files.minPartitionNum to control file split partition in local session.

Why are the changes needed?

Aims to control file split partitions at the session level.
For more details, see the discussion in PR #28778.

Does this PR introduce any user-facing change?

Yes, new config.

How was this patch tested?

Add UT.
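For context, the split-size computation this config plugs into can be sketched without Spark (a simplified standalone model, not the exact Spark source; the object and parameter names here are illustrative):

```scala
// Simplified model of how a minimum partition number bounds the split size:
// splitSize = min(maxPartitionBytes, max(openCostInBytes, totalBytes / minPartitionNum))
object SplitSizeSketch {
  def maxSplitBytes(
      maxPartitionBytes: Long, // spark.sql.files.maxPartitionBytes (default 128MB)
      openCostInBytes: Long,   // spark.sql.files.openCostInBytes (default 4MB)
      minPartitionNum: Long,   // spark.sql.files.minPartitionNum, else default parallelism
      totalBytes: Long): Long = {
    val bytesPerCore = totalBytes / minPartitionNum
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // 10 files of 10MB with minPartitionNum = 20: the split size drops below
    // 128MB, so files are split finer to approach the requested minimum.
    println(maxSplitBytes(128 * mb, 4 * mb, 20, 10 * (10 + 4) * mb) / mb) // 7
  }
}
```

A larger `minPartitionNum` shrinks `bytesPerCore` and therefore the split size, which is how the config nudges Spark toward more partitions without guaranteeing an exact count.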

@ulysses-you
Contributor Author

cc @cloud-fan @maropu

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124188 has finished for PR 28853 at commit 8ef9f29.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124191 has finished for PR 28853 at commit 4d776d3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124195 has finished for PR 28853 at commit 09bba5c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 18, 2020

retest this please

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124208 has finished for PR 28853 at commit 09bba5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124211 has finished for PR 28853 at commit 8f9b1c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Contributor Author

@maropu @cloud-fan thanks again for the review.

@@ -528,6 +528,18 @@ class FileSourceStrategySuite extends QueryTest with SharedSparkSession with Pre
}
}

test("Add spark.sql.files.minPartitionNum config") {
Member


Shall we add SPARK-32019: prefix into this test case name?

Contributor Author


Since it's not a bug fix, is the prefix needed?

Contributor


it's better to add it since it's a dedicated test case for this JIRA ticket.

Contributor Author


Add it.

@SparkQA

SparkQA commented Jun 19, 2020

Test build #124241 has finished for PR 28853 at commit 7c667d8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Contributor Author

The error seems unrelated.

@@ -1176,6 +1176,15 @@ object SQLConf {
.longConf
.createWithDefault(4 * 1024 * 1024)

val FILES_MIN_PARTITION_NUM = buildConf("spark.sql.files.minPartitionNum")
.doc("The suggested (not guaranteed) minimum number of file split partitions. If not set, " +
Member


file split -> split file?

@@ -1176,6 +1176,15 @@ object SQLConf {
.longConf
.createWithDefault(4 * 1024 * 1024)

val FILES_MIN_PARTITION_NUM = buildConf("spark.sql.files.minPartitionNum")
.doc("The suggested (not guaranteed) minimum number of file split partitions. If not set, " +
"the default value is the default parallelism of the Spark cluster. This configuration is " +
Member


the default parallelism of the Spark cluster -> spark.default.parallelism?

@SparkQA

SparkQA commented Jun 19, 2020

Test build #124245 has finished for PR 28853 at commit 600b933.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 19, 2020

retest this please

"file3" -> 1
))
assert(table.rdd.partitions.length == 3)
}
Contributor

@cloud-fan cloud-fan Jun 19, 2020


can we add more tests to make sure that this is really for min partition number?

e.g. we can create more partitions than the min number, as we make each partition 128mb.
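The suggested scenario can be modeled without Spark: pack files (each padded by the open cost) greedily into splits of at most `maxSplitBytes`. This is a hedged sketch, not Spark's actual packing code; names are illustrative, and files larger than the split size are assumed to be pre-split:

```scala
// Greedy bin-filling sketch: counts resulting partitions for small files.
object PackSketch {
  def partitionCount(fileSizes: Seq[Long], openCost: Long, maxSplit: Long): Int = {
    var partitions = 0
    var current = 0L
    fileSizes.foreach { size =>
      val cost = size + openCost
      // Close the current partition when the next file would overflow it.
      if (current > 0 && current + cost > maxSplit) {
        partitions += 1
        current = 0L
      }
      current += cost
    }
    if (current > 0) partitions += 1
    partitions
  }
}
```

For example, 800 files of 4MB with a 4MB open cost pack 16 to a 128MB split, giving 50 partitions even though the configured minimum is only 32 — more partitions than the minimum, as requested.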

@SparkQA

SparkQA commented Jun 19, 2020

Test build #124279 has finished for PR 28853 at commit f6d574a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val partitions = (1 to 100).map(i => s"file$i" -> 128*1024*1024)
val table = createTable(files = partitions)
// partition is limit by filesMaxPartitionBytes(128MB)
assert(table.rdd.partitions.length == 100)
Contributor Author


100 * 128 / 16 > 128, so the max partition size is capped at 128MB and the partition number equals the file number (100).
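This arithmetic can be checked standalone (sizes in MB; assumes openCostInBytes = 4MB, maxPartitionBytes = 128MB, and the split-size formula min(max, max(openCost, total/minNum)); the object name is illustrative):

```scala
object Check100Files {
  val totalMb = 100L * (128 + 4)        // 100 files of 128MB plus 4MB open cost each
  val bytesPerCoreMb = totalMb / 16     // 825MB, well above the 128MB cap
  val maxSplitMb = math.min(128L, math.max(4L, bytesPerCoreMb)) // capped at 128MB
  // Each 128MB file fills one 128MB split exactly, so partitions == files == 100.
}
```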

@SparkQA

SparkQA commented Jun 19, 2020

Test build #124269 has finished for PR 28853 at commit 600b933.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2020

Test build #124287 has finished for PR 28853 at commit 8f86b42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

withSQLConf(SQLConf.FILES_MIN_PARTITION_NUM.key -> "32") {
val partitions = (1 to 800).map(i => s"file$i" -> 4*1024*1024)
val table = createTable(files = partitions)
assert(table.rdd.partitions.length == 50)
Contributor Author


800 * (4 + 4) / 32 > 128, so 128MB is used as maxSplitSize; 800 * (4 + 4) / 128 = 50.
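The same formula reproduces the expected count of 50 (a standalone check; sizes in MB, same assumptions about the split-size formula as above, illustrative names):

```scala
object Check800Files {
  val totalMb = 800L * (4 + 4)          // 4MB per file plus 4MB open cost
  val bytesPerCoreMb = totalMb / 32     // 200MB, above the 128MB cap
  val maxSplitMb = math.min(128L, math.max(4L, bytesPerCoreMb)) // capped at 128MB
  val partitions = totalMb / maxSplitMb // 6400 / 128 = 50
}
```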

@SparkQA

SparkQA commented Jun 19, 2020

Test build #124290 has finished for PR 28853 at commit 3a646a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1176,6 +1176,15 @@ object SQLConf {
.longConf
.createWithDefault(4 * 1024 * 1024)

val FILES_MIN_PARTITION_NUM = buildConf("spark.sql.files.minPartitionNum")
.doc("The suggested (not guaranteed) minimum number of splitting file partitions. " +
Member


splitting? split?

}

withSQLConf(SQLConf.FILES_MIN_PARTITION_NUM.key -> "16") {
val partitions = (1 to 100).map(i => s"file$i" -> 128*1024*1024)
Member


nit: 128*1024*1024 -> 128 * 1024 * 1024

withSQLConf(SQLConf.FILES_MIN_PARTITION_NUM.key -> "16") {
val partitions = (1 to 100).map(i => s"file$i" -> 128*1024*1024)
val table = createTable(files = partitions)
// partition is limit by filesMaxPartitionBytes(128MB)
Member


nit: limit -> limited

}

withSQLConf(SQLConf.FILES_MIN_PARTITION_NUM.key -> "32") {
val partitions = (1 to 800).map(i => s"file$i" -> 4*1024*1024)
Member


ditto

@SparkQA

SparkQA commented Jun 20, 2020

Test build #124311 has finished for PR 28853 at commit 1fb9dc6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 20, 2020

retest this please

@SparkQA

SparkQA commented Jun 20, 2020

Test build #124317 has finished for PR 28853 at commit 1fb9dc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you all. Like spark.sql.adaptive.coalescePartitions.minPartitionNum, it seems to have a reasonable use case. Merged to master for Apache Spark 3.1.

@ulysses-you
Contributor Author

@dongjoon-hyun thanks for merging. Thanks @maropu @cloud-fan !

HyukjinKwon pushed a commit that referenced this pull request Sep 29, 2020
### What changes were proposed in this pull request?

The UT for SPARK-32019 (#28853) tries to write about 16GB of data to the disk. We must change the value of `spark.sql.files.maxPartitionBytes` to a smaller value to check the correct behavior with less data. By default it is `128MB`.
The other parameters in this UT are also changed to smaller values to keep the behavior the same.
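The scaling argument can be checked without Spark: when maxPartitionBytes, openCostInBytes, and every file size shrink by the same factor, the split arithmetic yields the same partition count. This is a sketch under the assumed formula min(max, max(openCost, total/minNum)); names are illustrative:

```scala
object ScaleCheck {
  def partitions(fileCount: Long, fileBytes: Long, openCost: Long,
                 maxPart: Long, minNum: Long): Long = {
    val total = fileCount * (fileBytes + openCost)
    val split = math.min(maxPart, math.max(openCost, total / minNum))
    total / split
  }
  val mb = 1024L * 1024L
  val kb = 1024L
  // Original UT scale vs everything divided by 1024: same partition count.
  val big   = partitions(800, 4 * mb, 4 * mb, 128 * mb, 32)
  val small = partitions(800, 4 * kb, 4 * kb, 128 * kb, 32)
}
```

Both calls return 50, which is why the UT can use kilobyte-scale data and keep its assertions unchanged.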

### Why are the changes needed?

The runtime of this one UT can be over 7 minutes on Jenkins. After the change it takes only a few seconds.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #29842 from tanelk/SPARK-32970.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>