
[SPARK-12975] [SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns #10891

Closed

Conversation

gatorsmile
Member

When users use partitionBy and bucketBy at the same time, some bucketing columns might be part of the partitioning columns. For example,

        df.write
          .format(source)
          .partitionBy("i")
          .bucketBy(8, "i", "k")
          .saveAsTable("bucketed_table")

However, in the above case, adding column i to bucketBy is useless; it just wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we can throw an exception and let users make the change.
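For illustration, here is a minimal sketch of the kind of check being proposed (the helper name and message are placeholders, not the actual patch):

    // Hypothetical helper: reject bucketing columns that overlap with
    // partitioning columns, mirroring Hive's behavior.
    object BucketingValidation {
      def assertDisjoint(partitionCols: Seq[String], bucketCols: Seq[String]): Unit = {
        val overlap = bucketCols.filter(partitionCols.contains)
        require(
          overlap.isEmpty,
          s"bucketBy columns '${overlap.mkString(", ")}' should not be part of " +
            s"partitionBy columns '${partitionCols.mkString(", ")}'")
      }
    }

    // For the example above, assertDisjoint(Seq("i"), Seq("i", "k")) throws an
    // IllegalArgumentException naming column i, so the write fails fast instead
    // of silently bucketing on a column that is constant within each partition.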

Also added a test case to check whether the sortBy and bucketBy column information is correctly saved in the metastore table.

Could you check if my understanding is correct? @cloud-fan @rxin @marmbrus Thanks!

@gatorsmile gatorsmile changed the title [SQL] Eliminate Bucketing Columns that are part of Partitioning Columns [SPARK-12975] [SQL] Eliminate Bucketing Columns that are part of Partitioning Columns Jan 25, 2016
@rxin
Contributor

rxin commented Jan 25, 2016

Does Hive write them out?

@SparkQA

SparkQA commented Jan 25, 2016

Test build #49962 has finished for PR 10891 at commit 8c718b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Will Hive throw an exception for this?

@gatorsmile
Member Author

I am not a Hive expert, but I tried it in Hive 1.2.1:

hive> CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
    > PARTITIONED BY(ds STRING)
    > CLUSTERED BY(ds, user_id) INTO 256 BUCKETS;
FAILED: SemanticException [Error 10002]: Invalid column reference

It sounds like Hive does not allow partitioning columns to be used in the bucketing key, so this is not an issue in Hive. However, it is not prohibited in our Spark SQL. @rxin @cloud-fan

@cloud-fan
Contributor

I think we should follow Hive in this case, i.e., throw an exception.

@gatorsmile
Member Author

OK, let users change it. I will make the change. This can also simplify the logic in bucket pruning. Thanks!

@gatorsmile
Member Author

Also updated the PR description. The code is ready for review. :) @cloud-fan Thanks!

@@ -240,6 +241,15 @@ final class DataFrameWriter private[sql](df: DataFrame) {
n <- numBuckets
} yield {
require(n > 0 && n < 100000, "Bucket number must be greater than 0 and less than 100000.")

// partitionBy columns cannot be used in blockedBy
Review comment (Member):


Typo: blockedBy should be bucketBy.
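For reference, a sketch of how the guard beneath that comment might read once the typo is fixed. The surrounding names (partitioningColumns and bucketColumnNames as Option[Seq[String]]) and the use of AnalysisException are assumptions about the patch, not quoted from it:

    // partitionBy columns cannot be used in bucketBy
    for {
      partCols <- partitioningColumns      // assumed Option[Seq[String]]
      bucketCols <- bucketColumnNames      // assumed Option[Seq[String]]
    } {
      val overlap = bucketCols.filter(partCols.contains)
      if (overlap.nonEmpty) {
        throw new AnalysisException(
          s"bucketBy columns '${overlap.mkString(", ")}' should not be part of " +
            s"partitionBy columns '${partCols.mkString(", ")}'")
      }
    }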

@SparkQA

SparkQA commented Jan 25, 2016

Test build #49982 has finished for PR 10891 at commit d207813.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2016

Test build #49991 has finished for PR 10891 at commit f9a8bdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -37,7 +38,7 @@ import org.apache.spark.sql.sources.HadoopFsRelation
* @since 1.4.0
*/
@Experimental
-final class DataFrameWriter private[sql](df: DataFrame) {
+final class DataFrameWriter private[sql](df: DataFrame) extends Logging {
Review comment (Contributor):


no need to do this?

@cloud-fan
Contributor

LGTM except some minor comments. Thanks for working on it!

@gatorsmile
Member Author

Thank you for your review! Glad I can help with this great feature! Table bucketing is very useful in real-world scenarios.

@gatorsmile gatorsmile changed the title [SPARK-12975] [SQL] Eliminate Bucketing Columns that are part of Partitioning Columns [SPARK-12975] [SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns Jan 25, 2016
@SparkQA

SparkQA commented Jan 25, 2016

Test build #50003 has finished for PR 10891 at commit 2ace09f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@marmbrus
Contributor

Thanks, merging to master.

@asfgit asfgit closed this in 9348431 Jan 25, 2016
@gatorsmile gatorsmile deleted the commonKeysInPartitionByBucketBy branch January 26, 2016 00:16