
[SPARK-12975] [SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns #10891

Closed

Conversation

gatorsmile
Member

When users use partitionBy and bucketBy at the same time, some bucketing columns might be part of the partitioning columns. For example,

        df.write
          .format(source)
          .partitionBy("i")
          .bucketBy(8, "i", "k")
          .saveAsTable("bucketed_table")

However, in the above case, adding column i to bucketBy is useless; it just wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we can throw an exception and let users make the change.
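For illustration, here is a minimal sketch of the kind of check being proposed (the helper name and message are placeholders, not the actual patch):

    // Hypothetical helper: reject bucketing columns that overlap with
    // partitioning columns, mirroring Hive's behavior.
    object BucketingValidation {
      def assertDisjoint(partitionCols: Seq[String], bucketCols: Seq[String]): Unit = {
        val overlap = bucketCols.filter(partitionCols.contains)
        require(
          overlap.isEmpty,
          s"bucketBy columns '${overlap.mkString(", ")}' should not be part of " +
            s"partitionBy columns '${partitionCols.mkString(", ")}'")
      }
    }

    // For the example above, assertDisjoint(Seq("i"), Seq("i", "k")) throws an
    // IllegalArgumentException naming column i, so the write fails fast instead
    // of silently bucketing on a column that is constant within each partition.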

Also added a test case to check whether the sortBy and bucketBy column information is correctly saved in the metastore table.

Could you check if my understanding is correct? @cloud-fan @rxin @marmbrus Thanks!

@gatorsmile gatorsmile changed the title [SQL] Eliminate Bucketing Columns that are part of Partitioning Columns [SPARK-12975] [SQL] Eliminate Bucketing Columns that are part of Partitioning Columns Jan 25, 2016
@rxin
Contributor

rxin commented Jan 25, 2016

Does Hive write them out?

@SparkQA

SparkQA commented Jan 25, 2016

Test build #49962 has finished for PR 10891 at commit 8c718b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Will Hive throw an exception for this?

@gatorsmile
Member Author

I am not a Hive expert, but I tried it in Hive 1.2.1:

hive> CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
    > PARTITIONED BY(ds STRING)
    > CLUSTERED BY(ds, user_id) INTO 256 BUCKETS;
FAILED: SemanticException [Error 10002]: Invalid column reference

It sounds like Hive does not allow partitioning columns to be used in the bucketing key, so this is not an issue in Hive. However, it is not prohibited in our Spark SQL. @rxin @cloud-fan

@cloud-fan
Contributor

I think we should follow Hive in this case, i.e., throw an exception.

@gatorsmile
Member Author

OK, let users change it. I will make the change. This can also simplify the logic in bucket pruning. Thanks!

@gatorsmile
Member Author

Also updated the PR description. The code is ready for review. :) @cloud-fan Thanks!

@@ -240,6 +241,15 @@ final class DataFrameWriter private[sql](df: DataFrame) {
n <- numBuckets
} yield {
require(n > 0 && n < 100000, "Bucket number must be greater than 0 and less than 100000.")

// partitionBy columns cannot be used in blockedBy
Review comment (Member):


Typo: blockedBy should be bucketBy.
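For reference, a sketch of how the guard beneath that comment might read once the typo is fixed. The surrounding names (partitioningColumns and bucketColumnNames as Option[Seq[String]]) and the use of AnalysisException are assumptions about the patch, not quoted from it:

    // partitionBy columns cannot be used in bucketBy
    for {
      partCols <- partitioningColumns      // assumed Option[Seq[String]]
      bucketCols <- bucketColumnNames      // assumed Option[Seq[String]]
    } {
      val overlap = bucketCols.filter(partCols.contains)
      if (overlap.nonEmpty) {
        throw new AnalysisException(
          s"bucketBy columns '${overlap.mkString(", ")}' should not be part of " +
            s"partitionBy columns '${partCols.mkString(", ")}'")
      }
    }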

@SparkQA

SparkQA commented Jan 25, 2016

Test build #49982 has finished for PR 10891 at commit d207813.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2016

Test build #49991 has finished for PR 10891 at commit f9a8bdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -37,7 +38,7 @@ import org.apache.spark.sql.sources.HadoopFsRelation
* @since 1.4.0
*/
@Experimental
-final class DataFrameWriter private[sql](df: DataFrame) {
+final class DataFrameWriter private[sql](df: DataFrame) extends Logging {
Review comment (Contributor):


no need to do this?

@cloud-fan
Contributor

LGTM except some minor comments. Thanks for working on it!

@gatorsmile
Member Author

Thank you for your review! Glad I can help with this great feature! Table bucketing is very useful in real-world scenarios.

@gatorsmile gatorsmile changed the title [SPARK-12975] [SQL] Eliminate Bucketing Columns that are part of Partitioning Columns [SPARK-12975] [SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns Jan 25, 2016
@SparkQA

SparkQA commented Jan 25, 2016

Test build #50003 has finished for PR 10891 at commit 2ace09f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@marmbrus
Contributor

Thanks, merging to master.

@asfgit asfgit closed this in 9348431 Jan 25, 2016
@gatorsmile gatorsmile deleted the commonKeysInPartitionByBucketBy branch January 26, 2016 00:16