[SPARK-23997][SQL] Configurable maximum number of buckets #21087
Conversation
retest this please

It would be great if some admin could review. If there is anything to improve, please tell. It is very simple though.

ok to test

Test build #93913 has finished for PR 21087 at commit

Test build #93984 has finished for PR 21087 at commit

retest this please

It seems the tests timed out. Any chance to re-run them?
```diff
@@ -580,6 +580,11 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+    val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
+      .doc("The maximum number of buckets allowed. Defaults to 100000")
+      .longConf
```
Why is this type `long` while the type of `numBuckets` is `Int`?
I was following the convention used in config entries, where integral values use `longConf`, without making further changes. However, I agree we could update the class type as well to match. I will submit the patch.
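For context, a minimal sketch of the two builder variants under discussion, assuming Spark's internal `ConfigBuilder` API (`buildConf` here stands for `SQLConf.buildConf`); the key, doc string, and default are taken from the diff above, and the two entries are shown side by side for illustration only (registering the same key twice would not be valid in `SQLConf` itself):

```scala
// Long-typed entry (as originally submitted): getConf would return Long.
val maxBucketsAsLong = buildConf("spark.sql.bucketing.maxBuckets")
  .doc("The maximum number of buckets allowed. Defaults to 100000")
  .longConf
  .createWithDefault(100000L)

// Int-typed entry, matching the Int type of numBuckets elsewhere: getConf would return Int.
val maxBucketsAsInt = buildConf("spark.sql.bucketing.maxBuckets")
  .doc("The maximum number of buckets allowed. Defaults to 100000")
  .intConf
  .createWithDefault(100000)
```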
Test build #94274 has finished for PR 21087 at commit

retest this please

Test build #94276 has finished for PR 21087 at commit
```diff
       val e = intercept[AnalysisException](df.write.bucketBy(numBuckets, "i").saveAsTable("tt"))
       assert(
         e.getMessage.contains("Number of buckets should be greater than 0 but less than 100000"))
+      Seq(-1, 0, 100001).foreach(numBuckets => {
```
nit: Only two parts need to be updated, to make the changes easy to track; the other changes look unnecessary:
- `100000` -> `100001`
- `less than 100000` -> `less than`
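A sketch of how the updated test could look with the nit applied, assuming the `df` fixture and ScalaTest's `intercept` from the suite the diff is taken from:

```scala
// Invalid bucket counts: non-positive values and one above the (default) maximum.
Seq(-1, 0, 100001).foreach { numBuckets =>
  val e = intercept[AnalysisException](
    df.write.bucketBy(numBuckets, "i").saveAsTable("tt"))
  // Check only the stable prefix of the message, not the hard-coded limit.
  assert(e.getMessage.contains("Number of buckets should be greater than 0 but less than"))
}
```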
```diff
@@ -1490,6 +1495,8 @@ class SQLConf extends Serializable with Logging {

   def bucketingEnabled: Boolean = getConf(SQLConf.BUCKETING_ENABLED)

+  def bucketingMaxBuckets: Long = getConf(SQLConf.BUCKETING_MAX_BUCKETS)
```
Do we still need `Long` instead of `Int`?
Test build #94492 has finished for PR 21087 at commit

Test build #94507 has finished for PR 21087 at commit
```diff
@@ -580,6 +580,11 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+    val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
```
Make it consistent with `spark.sql.sources.bucketing.enabled`? Rename it to `spark.sql.sources.bucketing.maxBuckets`?
Oh... did it change, or did I overlook 'sources'? Sure, I will change it!
```diff
@@ -580,6 +580,11 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+    val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
+      .doc("The maximum number of buckets allowed. Defaults to 100000")
+      .intConf
```
```scala
.checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets must be larger than 0")
```
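Putting the review suggestions together (renamed key, `intConf`, and the `checkValue` above), the final entry could look roughly like this; a sketch only, with the `createWithDefault(100000)` line assumed from the doc string above:

```scala
val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.sources.bucketing.maxBuckets")
  .doc("The maximum number of buckets allowed. Defaults to 100000")
  .intConf
  // Reject non-positive limits at configuration time.
  .checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets must be larger than 0")
  .createWithDefault(100000)
```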
Test build #94528 has finished for PR 21087 at commit
```diff
-    if (numBuckets <= 0 || numBuckets >= 100000) {
+    def conf: SQLConf = SQLConf.get
+
+    if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
```
Since the condition changed from `>=` to `>`, there is an inconsistency between the condition and the error message. If this condition is true, the message should be like `... but less than or equal to bucketing.maxBuckets ...`.
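A sketch of a validation block whose message matches the new `>` condition; the exact wording was settled in the follow-up PR, so this is only illustrative:

```scala
if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
  // The message now says "less than or equal to", consistent with the > check.
  throw new AnalysisException(
    s"Number of buckets should be greater than 0 but less than or equal to " +
      s"bucketing.maxBuckets (`${conf.bucketingMaxBuckets}`). Got `$numBuckets`")
}
```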
Could you submit a follow-up PR to address this message issue?
We can merge this PR first.
retest this please

Test build #95194 has finished for PR 21087 at commit

retest this please

Test build #95238 has finished for PR 21087 at commit

retest this please

Test build #95244 has finished for PR 21087 at commit

retest this please

Test build #95249 has finished for PR 21087 at commit
Any further changes?

Thanks! Merged to master.

@kiszk Could you submit a follow-up PR to address your comment?

@gatorsmile I see. I will open the PR today.
## What changes were proposed in this pull request?

This PR is a follow-up to apache#21087, based on [a discussion thread](apache#21087 (comment)). Since apache#21087 changed a condition in an `if` statement, the message in an exception is not consistent with the current behavior. This PR updates the exception message.

## How was this patch tested?

Existing UTs

Closes apache#22269 from kiszk/SPARK-23997-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?

This PR gives the user the possibility to override the maximum number of buckets when saving to a table. Currently the limit is hard-coded at 100k, which might be insufficient for large workloads. A new configuration entry is proposed: `spark.sql.bucketing.maxBuckets`, which defaults to the previous 100k.

How was this patch tested?

Added unit tests in the following spark.sql test suites:
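For illustration, a hypothetical end-user sketch of overriding the new limit before writing a heavily bucketed table; the table and column names are made up, and the key shown is the renamed `spark.sql.sources.bucketing.maxBuckets` agreed on during review:

```scala
// Raise the ceiling above the 100k default for this session, then write the table.
spark.conf.set("spark.sql.sources.bucketing.maxBuckets", "150000")
df.write
  .bucketBy(120000, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")
```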