
[SPARK-23997][SQL] Configurable maximum number of buckets #21087

Closed

Conversation

ferdonline
Contributor

What changes were proposed in this pull request?

This PR lets the user override the maximum number of buckets allowed when saving to a table.
Currently the limit is hard-coded at 100k, which may be insufficient for large workloads.
A new configuration entry is proposed: spark.sql.bucketing.maxBuckets, which defaults to the previous limit of 100k.
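As a sketch, the new entry could be set like any other SQL conf, for example in spark-defaults.conf (the 150000 value is illustrative; review comments below suggest renaming the key to spark.sql.sources.bucketing.maxBuckets):

```
# Illustrative spark-defaults.conf entry raising the limit beyond the 100k default
# (key name as proposed in this PR)
spark.sql.bucketing.maxBuckets   150000
```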

How was this patch tested?

Added unit tests in the following spark.sql test suites:

  • CreateTableAsSelectSuite
  • BucketedWriteSuite

@ferdonline
Contributor Author

retest this please

@ferdonline
Contributor Author

It would be great if an admin could review this. Please let me know if there is anything to improve; it is a very simple change.

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93913 has finished for PR 21087 at commit a884656.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2018

Test build #93984 has finished for PR 21087 at commit aad1068.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ferdonline
Contributor Author

retest this please

@ferdonline
Contributor Author

It seems the tests timed out. Any chance to re-run them?

@@ -580,6 +580,11 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
    .doc("The maximum number of buckets allowed. Defaults to 100000")
    .longConf
Member

Why is this type long while the type of numBuckets is Int?

Contributor Author

I was following the convention used in config entries, where integral values use longConf, without making further changes. However, I agree we could also update the class type to match. I will submit the patch.

@SparkQA

SparkQA commented Aug 6, 2018

Test build #94274 has finished for PR 21087 at commit 8ddc4eb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ferdonline
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 6, 2018

Test build #94276 has finished for PR 21087 at commit e517f66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val e = intercept[AnalysisException](df.write.bucketBy(numBuckets, "i").saveAsTable("tt"))
assert(
e.getMessage.contains("Number of buckets should be greater than 0 but less than 100000"))
Seq(-1, 0, 100001).foreach(numBuckets => {
Member

nit: For ease of tracking changes, only two parts need to be updated; the other changes look unnecessary.
100000 -> 100001
less than 100000 -> less than

@@ -1490,6 +1495,8 @@ class SQLConf extends Serializable with Logging {

def bucketingEnabled: Boolean = getConf(SQLConf.BUCKETING_ENABLED)

def bucketingMaxBuckets: Long = getConf(SQLConf.BUCKETING_MAX_BUCKETS)
Member

Do we still need Long instead of Int?

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94492 has finished for PR 21087 at commit 628b4e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94507 has finished for PR 21087 at commit 6049059.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -580,6 +580,11 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
Member

Could we make this consistent with spark.sql.sources.bucketing.enabled, i.e. rename it to spark.sql.sources.bucketing.maxBuckets?

Contributor Author

Oh... did it change, or did I overlook 'sources'? Sure, I will change it!

@@ -580,6 +580,11 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
    .doc("The maximum number of buckets allowed. Defaults to 100000")
    .intConf
Member

.checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets must be larger than 0")
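The suggested checkValue guard rejects an invalid setting when the conf is defined, rather than later at write time. A standalone sketch of that fail-fast pattern (an illustrative stand-in, not Spark's actual ConfigBuilder):

```scala
// Minimal stand-in for ConfigBuilder-style validation (illustrative only).
final case class IntConfEntry(key: String, default: Int,
                              check: Int => Boolean, errorMsg: String) {
  // Validate eagerly when a value is supplied, as .checkValue would.
  def withValue(v: Int): Int = {
    require(check(v), errorMsg)
    v
  }
}

val maxBuckets = IntConfEntry(
  "spark.sql.sources.bucketing.maxBuckets", 100000, _ > 0,
  "the value of spark.sql.sources.bucketing.maxBuckets must be larger than 0")
```

Setting a non-positive value then fails immediately with the configured message instead of surfacing only when a bucketed write is attempted.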

@SparkQA

SparkQA commented Aug 10, 2018

Test build #94528 has finished for PR 21087 at commit ebd9265.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (numBuckets <= 0 || numBuckets >= 100000) {
def conf: SQLConf = SQLConf.get

if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
Member

Since the failing condition changed from >= to > (see the diff above), it is now inconsistent with the error message.

When this condition is true, the message should read ... but less than or equal to bucketing.maxBuckets ....
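The point about the boundary can be seen in a standalone sketch of the check (an illustrative re-implementation, not the exact Spark code): with `numBuckets > maxBuckets` as the failing condition, a value of exactly maxBuckets is accepted, so the message needs "less than or equal to".

```scala
// Illustrative re-implementation of the bucket-count check after this PR.
// The boundary value (numBuckets == maxBuckets) is now valid.
def requireValidBucketCount(numBuckets: Int, maxBuckets: Int): Unit = {
  require(numBuckets > 0 && numBuckets <= maxBuckets,
    s"Number of buckets should be greater than 0 but less than or equal to $maxBuckets")
}
```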

Member

Could you submit a follow-up PR to address this message issue?

Member

We can merge this PR first.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Aug 24, 2018

Test build #95194 has finished for PR 21087 at commit ebd9265.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Aug 25, 2018

retest this please

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95238 has finished for PR 21087 at commit ebd9265.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Aug 25, 2018

retest this please

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95244 has finished for PR 21087 at commit ebd9265.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Aug 25, 2018

retest this please

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95249 has finished for PR 21087 at commit ebd9265.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ferdonline
Contributor Author

Any further changes?

@gatorsmile
Member

Thanks! Merged to master.

@gatorsmile
Member

@kiszk Could you submit a follow-up PR to address your comment?

@asfgit closed this in de46df5 on Aug 28, 2018
@kiszk
Member

kiszk commented Aug 28, 2018

@gatorsmile I see. I will open the PR today.

fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Aug 31, 2018
## What changes were proposed in this pull request?

This PR is a follow-up to apache#21087, based on [a discussion thread](apache#21087 (comment)). Since apache#21087 changed the condition of an `if` statement, the exception message is no longer consistent with the actual behavior.
This PR updates the exception message.

## How was this patch tested?

Existing UTs

Closes apache#22269 from kiszk/SPARK-23997-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@ferdonline deleted the enh/configurable_bucket_limit branch on September 2, 2018 at 22:31