
[SPARK-23997][SQL] Configurable maximum number of buckets #21087

Closed

Conversation

ferdonline
Contributor

What changes were proposed in this pull request?

This PR lets the user override the maximum number of buckets allowed when saving to a table.
Currently the limit is hard-coded at 100k, which may be insufficient for large workloads.
A new configuration entry is proposed: spark.sql.bucketing.maxBuckets, which defaults to the previous limit of 100k.
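As a sketch, the new entry could be set like any other SQL conf, for example in spark-defaults.conf (the 150000 value is illustrative; review comments below suggest renaming the key to spark.sql.sources.bucketing.maxBuckets):

```
# Illustrative spark-defaults.conf entry raising the limit beyond the 100k default
# (key name as proposed in this PR)
spark.sql.bucketing.maxBuckets   150000
```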

How was this patch tested?

Added unit tests in the following spark.sql test suites:

  • CreateTableAsSelectSuite
  • BucketedWriteSuite

@ferdonline
Contributor Author

retest this please

@ferdonline
Contributor Author

It would be great if an admin could review this. Please let me know if there is anything to improve; it is a very simple change.

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93913 has finished for PR 21087 at commit a884656.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2018

Test build #93984 has finished for PR 21087 at commit aad1068.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ferdonline
Contributor Author

retest this please

@ferdonline
Contributor Author

It seems the tests timed out. Any chance to re-run them?

@@ -580,6 +580,11 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
    .doc("The maximum number of buckets allowed. Defaults to 100000")
    .longConf
Member

Why is this type long while the type of numBuckets is Int?

Contributor Author

I was following the convention used in config entries, where integral values use longConf, without making further changes. However, I agree we could also update the class type to match. I will submit the patch.

@SparkQA

SparkQA commented Aug 6, 2018

Test build #94274 has finished for PR 21087 at commit 8ddc4eb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ferdonline
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 6, 2018

Test build #94276 has finished for PR 21087 at commit e517f66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val e = intercept[AnalysisException](df.write.bucketBy(numBuckets, "i").saveAsTable("tt"))
assert(
e.getMessage.contains("Number of buckets should be greater than 0 but less than 100000"))
Seq(-1, 0, 100001).foreach(numBuckets => {
Member

nit: For ease of tracking changes, only two parts need to be updated; the other changes look unnecessary.
100000 -> 100001
less than 100000 -> less than

@@ -1490,6 +1495,8 @@ class SQLConf extends Serializable with Logging {

def bucketingEnabled: Boolean = getConf(SQLConf.BUCKETING_ENABLED)

def bucketingMaxBuckets: Long = getConf(SQLConf.BUCKETING_MAX_BUCKETS)
Member

Do we still need Long instead of Int?

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94492 has finished for PR 21087 at commit 628b4e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94507 has finished for PR 21087 at commit 6049059.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -580,6 +580,11 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
Member

Could we make this consistent with spark.sql.sources.bucketing.enabled, i.e. rename it to spark.sql.sources.bucketing.maxBuckets?

Contributor Author

Oh... did it change, or did I overlook 'sources'? Sure, I will change it!

@@ -580,6 +580,11 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
    .doc("The maximum number of buckets allowed. Defaults to 100000")
    .intConf
Member

.checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets must be larger than 0")
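The suggested checkValue guard rejects an invalid setting when the conf is defined, rather than later at write time. A standalone sketch of that fail-fast pattern (an illustrative stand-in, not Spark's actual ConfigBuilder):

```scala
// Minimal stand-in for ConfigBuilder-style validation (illustrative only).
final case class IntConfEntry(key: String, default: Int,
                              check: Int => Boolean, errorMsg: String) {
  // Validate eagerly when a value is supplied, as .checkValue would.
  def withValue(v: Int): Int = {
    require(check(v), errorMsg)
    v
  }
}

val maxBuckets = IntConfEntry(
  "spark.sql.sources.bucketing.maxBuckets", 100000, _ > 0,
  "the value of spark.sql.sources.bucketing.maxBuckets must be larger than 0")
```

Setting a non-positive value then fails immediately with the configured message instead of surfacing only when a bucketed write is attempted.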

@SparkQA

SparkQA commented Aug 10, 2018

Test build #94528 has finished for PR 21087 at commit ebd9265.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (numBuckets <= 0 || numBuckets >= 100000) {
def conf: SQLConf = SQLConf.get

if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
Member

Since the failing condition changed from >= to > (see the diff above), it is now inconsistent with the error message.

When this condition is true, the message should read ... but less than or equal to bucketing.maxBuckets ....
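The point about the boundary can be seen in a standalone sketch of the check (an illustrative re-implementation, not the exact Spark code): with `numBuckets > maxBuckets` as the failing condition, a value of exactly maxBuckets is accepted, so the message needs "less than or equal to".

```scala
// Illustrative re-implementation of the bucket-count check after this PR.
// The boundary value (numBuckets == maxBuckets) is now valid.
def requireValidBucketCount(numBuckets: Int, maxBuckets: Int): Unit = {
  require(numBuckets > 0 && numBuckets <= maxBuckets,
    s"Number of buckets should be greater than 0 but less than or equal to $maxBuckets")
}
```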

Member

Could you submit a follow-up PR to address this message issue?

Member

We can merge this PR first.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Aug 24, 2018

Test build #95194 has finished for PR 21087 at commit ebd9265.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Aug 25, 2018

retest this please

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95238 has finished for PR 21087 at commit ebd9265.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Aug 25, 2018

retest this please

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95244 has finished for PR 21087 at commit ebd9265.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Aug 25, 2018

retest this please

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95249 has finished for PR 21087 at commit ebd9265.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ferdonline
Contributor Author

Any further changes?

@gatorsmile
Member

Thanks! Merged to master.

@gatorsmile
Member

@kiszk Could you submit a follow-up PR to address your comment?

@asfgit closed this in de46df5 on Aug 28, 2018
@kiszk
Member

kiszk commented Aug 28, 2018

@gatorsmile I see. I will open the PR today.

fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Aug 31, 2018
## What changes were proposed in this pull request?

This PR is a follow-up to apache#21087, based on [a discussion thread](apache#21087 (comment)). Since apache#21087 changed the condition of an `if` statement, the exception message is no longer consistent with the actual behavior.
This PR updates the exception message.

## How was this patch tested?

Existing UTs

Closes apache#22269 from kiszk/SPARK-23997-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@ferdonline deleted the enh/configurable_bucket_limit branch on September 2, 2018 at 22:31