[SPARK-12975] [SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns #10891
Conversation
Does Hive write them out?
Test build #49962 has finished for PR 10891 at commit
Will Hive throw an exception for this?
I am not a Hive expert. I just tried it in Hive 1.2.1:
It sounds like Hive does not allow users to use partitioning columns in the bucketing key, so I think this is not an issue in Hive. However, it is not prohibited in our Spark SQL. @rxin @cloud-fan
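For illustration, a hedged sketch of the kind of DDL Hive rejects, issued through a Spark 1.6-era HiveContext (the exact statement and error from the Hive 1.2.1 try above are not reproduced, and `sc` is an assumed existing SparkContext):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// Hive keeps partition columns separate from data columns, so CLUSTERED BY
// cannot reference the partition column `i`; Hive refuses this DDL.
hiveContext.sql(
  """CREATE TABLE t (j INT)
    |PARTITIONED BY (i INT)
    |CLUSTERED BY (i) INTO 8 BUCKETS""".stripMargin)
```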
I think we should follow Hive in this case, i.e., throw an exception.
OK, we will let users change it themselves. I will make the change. This can also simplify the logic in bucket pruning. Thanks!
Also updated the PR description. The code is ready for review. :) @cloud-fan Thanks!
@@ -240,6 +241,15 @@ final class DataFrameWriter private[sql](df: DataFrame) {
    n <- numBuckets
  } yield {
    require(n > 0 && n < 100000, "Bucket number must be greater than 0 and less than 100000.")

    // partitionBy columns cannot be used in blockedBy
typo: blockedBy.
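For context, a minimal standalone sketch of the check this hunk introduces (the helper name is hypothetical, and IllegalArgumentException stands in for the exception the PR throws inside DataFrameWriter):

```scala
// Reject any bucketBy column that also appears in the partitionBy columns.
def assertBucketingDisjointFromPartitioning(
    partitioningColumns: Option[Seq[String]],
    bucketColumnNames: Option[Seq[String]]): Unit = {
  for (partCols <- partitioningColumns; bucketCols <- bucketColumnNames) {
    val overlapping = bucketCols.filter(partCols.contains)
    if (overlapping.nonEmpty) {
      throw new IllegalArgumentException(
        s"bucketBy columns '${overlapping.mkString(", ")}' should not be part of " +
          s"partitionBy columns '${partCols.mkString(", ")}'")
    }
  }
}
```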
Test build #49982 has finished for PR 10891 at commit
Test build #49991 has finished for PR 10891 at commit
@@ -37,7 +38,7 @@ import org.apache.spark.sql.sources.HadoopFsRelation
  * @since 1.4.0
  */
 @Experimental
-final class DataFrameWriter private[sql](df: DataFrame) {
+final class DataFrameWriter private[sql](df: DataFrame) extends Logging {
no need to do this?
LGTM except some minor comments, thanks for working on it!
Thank you for your review! I am glad I can help with this great feature! Table bucketing is very useful in real-world scenarios.
Test build #50003 has finished for PR 10891 at commit
LGTM
Thanks, merging to master.
When users use `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of the partitioning columns; for example, the same column could appear in both `partitionBy` and `bucketBy` (a sketch follows below). However, in that case, adding column `i` to `bucketBy` is useless: it just wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we can throw an exception and let users make the change.

Also added a test case for checking that the `sortBy` and `bucketBy` column information is correctly saved in the metastore table.

Could you check if my understanding is correct? @cloud-fan @rxin @marmbrus Thanks!
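A minimal sketch of the kind of write described above (the column and table names are assumptions, not the original example from the PR description):

```scala
// Column "i" is both a partition column and a bucketing column. Within any
// single partition the value of "i" is constant, so bucketing by it adds no
// useful distribution; after this PR the write below throws an exception.
df.write
  .format("parquet")
  .partitionBy("i")
  .bucketBy(8, "i", "j")
  .sortBy("k")
  .saveAsTable("bucketed_table")
```

Keeping `i` only in `partitionBy` and bucketing by the remaining columns avoids the wasted work.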