[SPARK-32712][SQL] Support writing Hive bucketed table (Hive file formats with Hive hash) #34103
c21 wants to merge 2 commits into apache:master
Conversation
df,
bucketIdExpression,
getBucketIdFromFileName)
withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
This is added because the Hive write code path enforces it - https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L161 .
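For illustration, a minimal sketch of what this looks like in a test. `withSQLConf` is Spark's test helper from `SQLTestUtils`; the table and column names below are hypothetical:

```scala
// Minimal sketch (hypothetical table/column names): wrap the dynamic
// partition insert in the conf, since InsertIntoHiveTable rejects dynamic
// partition writes while hive.exec.dynamic.partition.mode is "strict".
withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
  sql(
    """
      |INSERT OVERWRITE TABLE bucketed_table PARTITION (ds)
      |SELECT key, value, ds FROM source_table
      |""".stripMargin)
}
```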
@cloud-fan - could you help take a look when you have time? Thanks.

thanks, merging to master!

Thank you @cloud-fan for the review!
What changes were proposed in this pull request?
This is to support writing Hive bucketed tables with Hive file formats (the code path for Hive table writes, InsertIntoHiveTable). Rows are assigned to buckets with Hive's hash function, the same scheme used by Hive, Presto, and Trino.

Why are the changes needed?
To make Spark write bucketed tables that are compatible with other SQL engines. Same motivation as #33432.
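For context, a hedged sketch of the kind of table this enables Spark to populate (table name, schema, and file format are illustrative; a Hive-enabled SparkSession is assumed):

```scala
// Hypothetical example: a Hive-format bucketed table that, with this PR,
// Spark can write so that Hive/Presto/Trino read the buckets correctly.
spark.sql(
  """
    |CREATE TABLE hive_bucketed_tbl (key INT, value STRING)
    |CLUSTERED BY (key) SORTED BY (key) INTO 8 BUCKETS
    |STORED AS ORC
    |""".stripMargin)

// Each row lands in the bucket given by Hive's hash of `key` modulo the
// bucket count, matching Hive's own bucketing scheme.
spark.sql("INSERT INTO hive_bucketed_tbl SELECT id, CAST(id AS STRING) FROM range(100)")
```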
Does this PR introduce any user-facing change?
Yes. Before this PR, writing to these Hive bucketed tables would throw an exception in Spark if the config "hive.enforce.bucketing" or "hive.enforce.sorting" is set to true. After this PR, writing to these Hive bucketed tables succeeds, and the table can be read back efficiently by Presto and Trino like any other Hive bucketed table.
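A hedged before/after sketch of that behavior change, using the `withSQLConf` test helper and the hypothetical table from above:

```scala
// With enforcement enabled, an insert into a Hive-bucketed table used to
// fail rather than silently produce non-bucketed output.
withSQLConf(
    "hive.enforce.bucketing" -> "true",
    "hive.enforce.sorting" -> "true") {
  // Before this PR: the write throws an exception for the bucketed table.
  // After this PR: the write succeeds and produces Hive-hash-bucketed files.
  spark.sql("INSERT OVERWRITE TABLE hive_bucketed_tbl SELECT key, value FROM src")
}
```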
How was this patch tested?
Modified the unit test in
BucketedWriteWithHiveSupportSuite.scala to verify that bucket file names and the rows in each bucket are written properly for the Hive write code path as well.
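As a rough sketch of the shape of that check (the file-name regex and helper below are illustrative assumptions, not the exact code in the suite):

```scala
// Hypothetical sketch: Hive bucket files are conventionally named with the
// bucket id as a zero-padded prefix (e.g. "000002_0"). Parse that id, then
// verify every row in the file hashes to the same bucket.
private val hiveBucketedFileName = """^(\d+)_0.*$""".r

private def getBucketIdFromFileName(fileName: String): Int = fileName match {
  case hiveBucketedFileName(bucketId) => bucketId.toInt
  case other => sys.error(s"Cannot parse bucket id from file name: $other")
}
```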