[SPARK-32712][SQL] Support writing Hive bucketed table (Hive file formats with Hive hash) #34103

Closed
c21 wants to merge 2 commits into apache:master from c21:hive-bucket
@c21 c21 commented Sep 25, 2021

What changes were proposed in this pull request?

This PR adds support for writing Hive bucketed tables with Hive file formats (the code path for Hive table writes, InsertIntoHiveTable). Rows are assigned to buckets with Hive hash, the same hash function used by Hive, Presto, and Trino.
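
As a rough illustration of what "bucketed with Hive hash" means, the sketch below computes a bucket id for integer keys. It assumes Hive's legacy (version 1) bucketing rules: an INT hashes to itself, a BIGINT folds its upper 32 bits into the lower 32, column hashes are combined with a multiplier of 31, and the bucket id is the non-negative hash modulo the bucket count. The object and method names are hypothetical, and this is not the code added by this PR.

```scala
// Illustrative sketch only: not the Spark or Hive source code.
// Assumes Hive's legacy (version 1) bucketing hash.
object HiveBucketSketch {
  // Hive-style hash of an INT column: the value itself.
  def hashInt(v: Int): Int = v

  // Hive-style hash of a BIGINT column: fold the upper 32 bits into the lower 32.
  def hashLong(v: Long): Int = (v ^ (v >>> 32)).toInt

  // Combine per-column hashes left to right with multiplier 31, starting from 0.
  def combine(columnHashes: Seq[Int]): Int =
    columnHashes.foldLeft(0)((acc, h) => 31 * acc + h)

  // Clear the sign bit, then take the remainder, so the id lies in [0, numBuckets).
  def bucketId(hash: Int, numBuckets: Int): Int =
    (hash & Int.MaxValue) % numBuckets
}
```

Clearing the sign bit before the modulo keeps negative hashes from producing negative bucket ids, which is why a key of -1 still maps to a valid bucket.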

Why are the changes needed?

To make Spark write bucketed tables that are compatible with other SQL engines. Same motivation as #33432.

Does this PR introduce any user-facing change?

Yes. Before this PR, writing to these Hive bucketed tables would throw an exception in Spark if the config "hive.enforce.bucketing" or "hive.enforce.sorting" was set to true. After this PR, writing to these Hive bucketed tables succeeds. The tables can be read back by Presto and Trino as efficiently as other Hive bucketed tables.

How was this patch tested?

Modified the unit tests in BucketedWriteWithHiveSupportSuite.scala to verify, for the Hive write code path as well, that bucket file names are correct and that each row is written to the proper bucket.
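
The kind of check described above can be sketched in plain Scala, without Spark. The sketch assumes integer bucket keys, the same sign-bit-cleared modulo rule as Hive hash for INT, and a Hive-style file name such as `000002_0`. The file name pattern and all names here are hypothetical and are not taken from the test suite.

```scala
// Illustrative sketch only: the names and file name pattern are hypothetical,
// not the API of BucketedWriteWithHiveSupportSuite.
object BucketCheckSketch {
  // Hypothetical extractor: reads the bucket id from names shaped like "000002_0".
  private val hiveStyleName = """^(\d+)_0$""".r
  val hypotheticalExtractor: String => Option[Int] = {
    case hiveStyleName(id) => Some(id.toInt)
    case _ => None
  }

  // Returns the (file name, key) pairs whose key does not belong in the bucket
  // encoded in the file name. An empty result means the layout is consistent.
  def misplacedRows(
      rowsByFile: Map[String, Seq[Int]],
      numBuckets: Int,
      bucketIdFromFileName: String => Option[Int]): Seq[(String, Int)] =
    rowsByFile.toSeq.flatMap { case (file, keys) =>
      val fileBucket = bucketIdFromFileName(file)
      keys.collect {
        case k if !fileBucket.contains((k & Int.MaxValue) % numBuckets) => (file, k)
      }
    }
}
```

A file whose name does not match the pattern yields no bucket id, so every row in it is reported as misplaced.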

@github-actions github-actions bot added the SQL label Sep 25, 2021
df,
bucketIdExpression,
getBucketIdFromFileName)
withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
@c21 c21 changed the title from "[SPARK-32712][SQL] Support to write Hive bucketed table (Hive file formats with Hive hash)" to "[SPARK-32712][SQL] Support writing Hive bucketed table (Hive file formats with Hive hash)" Sep 25, 2021
SparkQA commented Sep 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48129/


SparkQA commented Sep 25, 2021

Test build #143617 has finished for PR 34103 at commit cb6b5b1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48129/


SparkQA commented Sep 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48138/


SparkQA commented Sep 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48138/


SparkQA commented Sep 26, 2021

Test build #143626 has finished for PR 34103 at commit 12a8aca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

c21 commented Sep 27, 2021

@cloud-fan - could you help take a look when you have time? Thanks.

cloud-fan commented

thanks, merging to master!

@cloud-fan cloud-fan closed this in 978a915 Sep 27, 2021
c21 commented Sep 28, 2021

Thank you @cloud-fan for review!

@c21 c21 deleted the hive-bucket branch October 4, 2021 18:25

3 participants