[SPARK-32709][SQL] Support write Hive ORC/Parquet bucketed table (for Hive 1,2) #30003
Conversation
cc @cloud-fan, @maropu, @viirya, @sameeragarwal and @CodingCat if you guys have time to take a look, thanks.
Thanks for the work, @c21! btw, do we need to care about Hive v1? Recently, we've removed the hive-1.2 related code though.
@maropu - this should work with Hive 1.x.y and 2.x.y versions, and we still claim to support Hive from 0.12.0 to 2.3.7. Btw, Presto (prestodb and prestosql) supports hivehash-bucketed tables, so I think it is worth supporting here in Spark. In practice, hivehash bucketed tables are still used in most companies that deploy Presto.
btw, #29961 seems to remove some legacy workaround code related to hive-1.2 in the Spark repo, but that does not mean we no longer support hive-1.2, right? I haven't looked closely into that PR, just sharing my first impression here.
Yea, if we could support both without workaround code, I think it's okay. cc: @dongjoon-hyun @HyukjinKwon @wangyum
Yeah, I think we still support it.
@@ -230,9 +236,9 @@ class DynamicPartitionDataWriter(
        description.customPartitionLocations.get(PartitioningUtils.parsePathFragment(dir))
      }
      val currentPath = if (customPath.isDefined) {
-       committer.newTaskTempFileAbsPath(taskAttemptContext, customPath.get, ext)
+       committer.newTaskTempFileAbsPath(taskAttemptContext, customPath.get, prefix, ext)
We need the `prefix` parameter because we want to put the bucket id at the beginning of the file name?
@cloud-fan - yes, this is the cleanest way I can think of to achieve that (compatible file naming), let me know if you have any other idea, thanks.
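For readers skimming the thread, here is a rough sketch of the idea, using a hypothetical helper: the exact prefix format and the PR's real wiring are not shown in this excerpt, so treat the padding and separators as illustrative only.

```scala
// Hypothetical helper, not the PR's actual code: build a task output file name that
// starts with a Hive/Presto-style bucket id, followed by Spark's usual parts.
def bucketPrefixedFileName(bucketId: Int, taskId: Int, ext: String): String = {
  val prefix = f"$bucketId%05d_0_"   // the bucket id must lead the file name
  f"${prefix}part-$taskId%05d$ext"   // the remainder follows Spark's normal scheme
}

// Example: bucketPrefixedFileName(3, 12, ".c000.snappy.parquet")
//   => "00003_0_part-00012.c000.snappy.parquet"
```

With a `prefix` argument on `newTaskTempFileAbsPath`, the commit protocol can produce such names without changing how the extension is handled.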
@cloud-fan - wondering what you think of the current approach? Do you have any other comments? Thanks.
@cloud-fan - here I changed the approach to create one
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
…xible file naming

### What changes were proposed in this pull request?

This PR is to introduce a new set of APIs `newTaskTempFile` and `newTaskTempFileAbsPath` inside `FileCommitProtocol`, to allow more flexible file naming of Spark output. The major change is to pass a `FileNameSpec` into `FileCommitProtocol`, instead of the original `ext` (currently having `prefix` and `ext`), to allow individual `FileCommitProtocol` implementations to come up with more flexible file names (e.g. with a custom `prefix`) for Hive/Presto bucketing - #30003. A default implementation of the added APIs is provided, so no existing implementation of `FileCommitProtocol` is broken.

### Why are the changes needed?

To make the commit protocol more flexible in terms of Spark output file names. Pre-requisite of #30003.

### Does this PR introduce _any_ user-facing change?

Yes, for developers who implement/run a custom implementation of `FileCommitProtocol`. They can choose to implement the newly added API.

### How was this patch tested?

Existing unit tests, as this is just adding an API.

Closes #33012 from c21/commit-protocol-api.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
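As a rough illustration of the API shape described in this commit message (field and method names follow the message, but the class below is a sketch, not Spark's real definition, which may differ in detail):

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Sketch only: a spec bundling the pieces of an output file name.
case class FileNameSpec(prefix: String, ext: String)

abstract class FileCommitProtocolSketch {
  // Existing-style API: implementations only control the extension/suffix.
  def newTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String], ext: String): String

  // New-style API: a FileNameSpec lets an implementation also honor a prefix
  // (e.g. a Hive/Presto bucket id at the start of the file name). The default
  // implementation delegates to the old method, so existing subclasses keep working.
  def newTaskTempFile(
      taskContext: TaskAttemptContext,
      dir: Option[String],
      spec: FileNameSpec): String = {
    require(spec.prefix.isEmpty, "this commit protocol does not support file name prefixes")
    newTaskTempFile(taskContext, dir, spec.ext)
  }
}
```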
…ormat with Hive hash)

### What changes were proposed in this pull request?

This is a re-work of #30003; here we add support for writing Hive bucketed tables with the Parquet/ORC file format (data source v1 write path and Hive hash as the hash function). Support for Hive's other file formats will be added in a follow-up PR.

The changes are mostly on:

* `HiveMetastoreCatalog.scala`: When converting a Hive table relation to a data source relation, pass bucket info (`BucketSpec`) and other Hive-related info as options into `HadoopFsRelation` and `LogicalRelation`, which can later be accessed by `FileFormatWriter` to customize bucket id and file name.
* `FileFormatWriter.scala`: Use `HiveHash` for `bucketIdExpression` if it's writing to a Hive bucketed table. In addition, the Spark output file name should follow the Hive/Presto/Trino bucketed file naming convention. Introduce another parameter `bucketFileNamePrefix`, which introduces a subsequent change in `FileFormatDataWriter`.
* `HadoopMapReduceCommitProtocol`: Implement the new file name APIs introduced in #33012, and change its sub-class `PathOutputCommitProtocol`, to make Hive bucketed table writing work with all commit protocols (including the S3A commit protocol).

### Why are the changes needed?

To make Spark write bucketed tables that are compatible with other SQL engines. Currently, Spark bucketed tables cannot be leveraged by other SQL engines like Hive and Presto, because Spark uses a different hash function (Spark murmur3hash) and a different file name scheme. With this PR, a Spark-written Hive bucketed table can be efficiently read by Presto and Hive to do bucket filter pruning, join, group-by, etc. This was and is blocking several companies (confirmed from Facebook, Lyft, etc.) from migrating bucketing workloads from Hive to Spark.

### Does this PR introduce _any_ user-facing change?

Yes, any Hive bucketed table (with Parquet/ORC format) written by Spark is properly bucketed and can be efficiently processed by Hive and Presto/Trino.

### How was this patch tested?

* Added unit test in BucketedWriteWithHiveSupportSuite.scala, to verify bucket file names and that each row in each bucket is written properly.
* Tested by the Lyft Spark team (Shashank Pedamallu) to read a Spark-written bucketed table from Trino, Spark and Hive.

Closes #33432 from c21/hive-bucket-v1.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Can the Spark-written Hive bucketed table be read by spark-sql to do bucket filter pruning, join, and group-by? In my test, the bucket information is not used for group-by or join.

spark-sql> explain select i,count(*) from tmp.hive_bucketed_table3 group by 1;
spark-sql> explain select * from tmp.hive_bucketed_table3 a join tmp.hive_bucketed_table3 b on a.i=b.i limit 1000;
What changes were proposed in this pull request?
Hive ORC/Parquet write code path is the same as the data source v1 code path (`FileFormatWriter`). This PR adds support to write Hive ORC/Parquet bucketed tables with hivehash. The change is to customize `bucketIdExpression` to use `HiveHash` when the table is a Hive bucketed table and the Hive version is 1.x.y or 2.x.y. Support for Hive 3 will be added later in another PR, after Hive murmur3hash is added in Spark.

The changes are mostly on:

* `HiveMetastoreCatalog.scala`: When converting a Hive table relation to a data source relation, pass bucket info (`BucketSpec`) and other Hive-related info as `options` into `HadoopFsRelation` and `LogicalRelation`, which can later be accessed by `InsertIntoHadoopFsRelationCommand` and `FileFormatWriter`.
* `FileFormatWriter.scala`: Use `HiveHash` for `bucketIdExpression` if it's writing to a Hive bucketed table (see the sketch after this list). In addition, the Spark output file name should follow the Hive (and Presto) bucketed file naming convention. Introduce another parameter `bucketFileNamePrefix`, which introduces a subsequent change in `FileCommitProtocol` and `HadoopMapReduceCommitProtocol`.
* `DataSourceScanExec.scala`: Add an extra check for `bucketedScan` that makes sure bucketing is not enabled when reading a Hive bucketed table, since we propagate the bucket spec from every Hive relation (read and write) in `HiveMetastoreCatalog.scala`.
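As referenced in the `FileFormatWriter.scala` item above, here is a hedged sketch of the bucket-id expression choice. The helper shape and names are illustrative rather than the PR's exact diff; `HashPartitioning.partitionIdExpression` is Spark's existing murmur3-based expression.

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, HiveHash, Literal, Pmod}
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning

// Illustrative helper: pick the expression that computes the bucket id for a row.
def bucketIdExpression(
    bucketColumns: Seq[Attribute],
    numBuckets: Int,
    isHiveBucketedTable: Boolean): Expression = {
  if (isHiveBucketedTable) {
    // Hive 1.x/2.x place a row in bucket `hivehash(bucket columns) pmod numBuckets`,
    // so writing with HiveHash keeps the layout readable by Hive and Presto.
    Pmod(HiveHash(bucketColumns), Literal(numBuckets))
  } else {
    // Spark-native bucketing keeps its existing murmur3-based expression.
    HashPartitioning(bucketColumns, numBuckets).partitionIdExpression
  }
}
```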
.Why are the changes needed?
To make Spark write bucketed tables that other SQL engines can read. Currently, Spark bucketed tables cannot be leveraged by other SQL engines like Hive and Presto, because Spark uses a different hash function (Spark murmur3hash). With this PR, a Spark-written Hive bucketed table can be efficiently read by Presto and Hive to do bucket filter pruning, join, group-by, etc. This was and is blocking several companies (confirmed from Facebook, Uber, etc.) from migrating bucketing workloads from Hive to Spark.
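To make the incompatibility concrete, a small hedged demonstration using Catalyst expressions; these are internal Spark APIs whose packages can move between versions, and the column value and bucket count below are made up.

```scala
import org.apache.spark.sql.catalyst.expressions.{HiveHash, Literal, Murmur3Hash, Pmod}

val numBuckets = 8
val key = Literal("some_bucket_key")

// Bucket id as Spark's native bucketing would compute it (murmur3 with seed 42)...
val sparkBucketId = Pmod(Murmur3Hash(Seq(key), 42), Literal(numBuckets)).eval()
// ...and as Hive 1.x/2.x would compute it (hivehash).
val hiveBucketId = Pmod(HiveHash(Seq(key)), Literal(numBuckets)).eval()

// The two ids generally differ, so a table bucketed with one hash cannot be
// bucket-pruned or bucket-joined correctly by an engine expecting the other.
println(s"spark murmur3 bucket = $sparkBucketId, hive hash bucket = $hiveBucketId")
```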
Does this PR introduce any user-facing change?
Yes, any Hive bucketed table written by Spark with Hive 1/2 is properly bucketed and can be efficiently processed by Presto and Hive.
How was this patch tested?
Added unit test in `BucketedWriteWithHiveSupportSuite.scala`, to verify bucket file names and that each row in each bucket is written properly.

Cross-engine test (take prestosql as an example here): set up a Presto server and Hive metastore locally on a laptop, and run the Presto and Spark queries locally.
Created a Hive bucketed table using Presto:
Write the Hive bucketed table (part='part0') using Spark, and read the table using Presto. Verify that Presto bucket pruning works:
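The actual statements are not reproduced in this excerpt; a hypothetical Spark-side write of roughly this shape (table and column names made up, Hive support assumed, and the table created beforehand; Hive-style DDL is shown in the comment for illustration, while Presto would use table properties instead):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("write-hive-bucketed-partition")
  .enableHiveSupport()
  .getOrCreate()

// Assumes a Hive bucketed table created beforehand, roughly:
//   CREATE TABLE test_bucketed (key INT, value STRING)
//   PARTITIONED BY (part STRING)
//   CLUSTERED BY (key) INTO 8 BUCKETS
//   STORED AS PARQUET
spark.sql(
  """
    |INSERT OVERWRITE TABLE test_bucketed PARTITION (part = 'part0')
    |SELECT id AS key, CAST(id AS STRING) AS value
    |FROM range(1000)
  """.stripMargin)
```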
Underlying files in the partition directory:
Write the Hive bucketed table using Presto (NOTE: after this, one partition contains both Presto- and Spark-written files):
Use Presto to read the partition again, and verify that bucket pruning works on the mixed data written by Spark and Presto:
Underlying files in the partition directory:
In addition, verify that joining the Spark-written bucketed table with the Presto-written bucketed table returns correct results:
Aggregate on the Spark-written bucketed table and verify the result looks correct: