[SPARK-32019][SQL] Add spark.sql.files.minPartitionNum config #28853
Conversation
Test build #124188 has finished for PR 28853 at commit
Test build #124191 has finished for PR 28853 at commit
Test build #124195 has finished for PR 28853 at commit
retest this please
Test build #124208 has finished for PR 28853 at commit
Test build #124211 has finished for PR 28853 at commit
@maropu @cloud-fan thanks for the review again.
@@ -528,6 +528,18 @@ class FileSourceStrategySuite extends QueryTest with SharedSparkSession with Pre
    }
  }

  test("Add spark.sql.files.minPartitionNum config") {
Shall we add a `SPARK-32019: ` prefix to this test case name?
Since it's not a bug, is the prefix needed?
it's better to add it since it's a dedicated test case for this JIRA ticket.
Added it.
...core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala
Test build #124241 has finished for PR 28853 at commit
Seems the error is not related.
@@ -1176,6 +1176,15 @@ object SQLConf {
    .longConf
    .createWithDefault(4 * 1024 * 1024)

  val FILES_MIN_PARTITION_NUM = buildConf("spark.sql.files.minPartitionNum")
    .doc("The suggested (not guaranteed) minimum number of file split partitions. If not set, " +
`file split` -> `split file`?
@@ -1176,6 +1176,15 @@ object SQLConf {
    .longConf
    .createWithDefault(4 * 1024 * 1024)

  val FILES_MIN_PARTITION_NUM = buildConf("spark.sql.files.minPartitionNum")
    .doc("The suggested (not guaranteed) minimum number of file split partitions. If not set, " +
      "the default value is the default parallelism of the Spark cluster. This configuration is " +
`the default parallelism of the Spark cluster` -> `spark.default.parallelism`?
Test build #124245 has finished for PR 28853 at commit
retest this please
      "file3" -> 1
    ))
    assert(table.rdd.partitions.length == 3)
  }
can we add more tests to make sure that this is really for min partition number? e.g. we can create more partitions than the min number, by making each partition 128MB.
Test build #124279 has finished for PR 28853 at commit
    val partitions = (1 to 100).map(i => s"file$i" -> 128*1024*1024)
    val table = createTable(files = partitions)
    // partition is limit by filesMaxPartitionBytes(128MB)
    assert(table.rdd.partitions.length == 100)
`100 * 128 / 16 > 128`, so max partition size is 128, partition number == file number.
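The arithmetic in the comment above follows Spark's split-size rule: spread the total bytes over the suggested minimum partition number, but never exceed `spark.sql.files.maxPartitionBytes`. Here is a minimal Python sketch of that rule — a simplified model for illustration only; `max_split_bytes` and its parameters are not Spark's actual API:

```python
def max_split_bytes(total_bytes, open_cost, min_partition_num,
                    max_partition_bytes=128 * 1024 * 1024):
    """Simplified model of the split-size choice: total bytes spread over
    the suggested minimum partition number, floored by the file open cost
    and capped by spark.sql.files.maxPartitionBytes (128MB by default)."""
    bytes_per_core = total_bytes // min_partition_num
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

mb = 1024 * 1024
# 100 files of 128MB with minPartitionNum=16: 100 * 128 / 16 = 800MB per
# core, which is above the 128MB cap, so each 128MB file stays one split
# and the partition number equals the file number.
split = max_split_bytes(100 * 128 * mb, open_cost=4 * mb, min_partition_num=16)
```

With the cap winning, the suggested minimum of 16 is exceeded: 100 partitions result, which is exactly the "suggested (not guaranteed)" behavior the config doc describes.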
Test build #124269 has finished for PR 28853 at commit
Test build #124287 has finished for PR 28853 at commit
    withSQLConf(SQLConf.FILES_MIN_PARTITION_NUM.key -> "32") {
      val partitions = (1 to 800).map(i => s"file$i" -> 4*1024*1024)
      val table = createTable(files = partitions)
      assert(table.rdd.partitions.length == 50)
`800 * (4 + 4) / 32 > 128`, so 128 is used as maxSplitSize, and `800 * (4 + 4) / 128 = 50`.
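The figures in the comment above can be checked directly. A hedged Python sketch (variable names are illustrative; the 4MB open cost is added per file, as the comment does):

```python
mb = 1024 * 1024
total_bytes = 800 * (4 + 4) * mb        # 4MB per file plus 4MB open cost
bytes_per_core = total_bytes // 32      # 200MB, above the 128MB cap
max_split_size = min(128 * mb, max(4 * mb, bytes_per_core))
num_partitions = total_bytes // max_split_size  # 800 * 8 / 128 = 50
```

So even with minPartitionNum=32, the 128MB cap on split size produces 50 partitions — more than the requested minimum, matching the assertion in the test.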
Test build #124290 has finished for PR 28853 at commit
@@ -1176,6 +1176,15 @@ object SQLConf {
    .longConf
    .createWithDefault(4 * 1024 * 1024)

  val FILES_MIN_PARTITION_NUM = buildConf("spark.sql.files.minPartitionNum")
    .doc("The suggested (not guaranteed) minimum number of splitting file partitions. " +
splitting? split?
  }

  withSQLConf(SQLConf.FILES_MIN_PARTITION_NUM.key -> "16") {
    val partitions = (1 to 100).map(i => s"file$i" -> 128*1024*1024)
nit: `128*1024*1024` -> `128 * 1024 * 1024`
  withSQLConf(SQLConf.FILES_MIN_PARTITION_NUM.key -> "16") {
    val partitions = (1 to 100).map(i => s"file$i" -> 128*1024*1024)
    val table = createTable(files = partitions)
    // partition is limit by filesMaxPartitionBytes(128MB)
nit: limit -> limited
  }

  withSQLConf(SQLConf.FILES_MIN_PARTITION_NUM.key -> "32") {
    val partitions = (1 to 800).map(i => s"file$i" -> 4*1024*1024)
ditto
Test build #124311 has finished for PR 28853 at commit
retest this please
Test build #124317 has finished for PR 28853 at commit
+1, LGTM. Thank you all. Like `spark.sql.adaptive.coalescePartitions.minPartitionNum`, it seems to have a reasonable use case. Merged to master for Apache Spark 3.1.
@dongjoon-hyun thanks for merging. Thanks @maropu @cloud-fan!
### What changes were proposed in this pull request?
The UT for SPARK-32019 (#28853) tries to write about 16GB of data to the disk. We must change the value of `spark.sql.files.maxPartitionBytes` to a smaller value to check the correct behavior with less data. By default it is `128MB`. The other parameters in this UT are also changed to smaller values to keep the behavior the same.
### Why are the changes needed?
The runtime of this one UT can be over 7 minutes on Jenkins. After the change it is a few seconds.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT

Closes #29842 from tanelk/SPARK-32970.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Add a new config
spark.sql.files.minPartitionNum
to control file split partitions at the session level.

Why are the changes needed?
Aims to control file split partitions at the session level. For more details, see the discussion in PR-28778.
Does this PR introduce any user-facing change?
Yes, new config.
How was this patch tested?
Add UT.
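The config doc above says that when `spark.sql.files.minPartitionNum` is not set, the default parallelism of the Spark cluster is used instead. A minimal Python model of that fallback — illustrative only, not Spark's API:

```python
def effective_min_partition_num(files_min_partition_num, default_parallelism):
    """Return the suggested minimum partition number: the configured
    spark.sql.files.minPartitionNum when set, otherwise the session's
    default parallelism (model of the documented fallback)."""
    if files_min_partition_num is not None:
        return files_min_partition_num
    return default_parallelism
```

Note that, per the discussion in the review, this is only a suggested minimum: the actual partition count can be higher when the 128MB `spark.sql.files.maxPartitionBytes` cap forces smaller splits.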