
[SPARK-19256][SQL] Hive bucketing support #19001

Closed
wants to merge 6 commits

Conversation

tejasapatil
Contributor

What changes were proposed in this pull request?

This PR implements both the read-side and write-side changes for supporting Hive bucketing in Spark. I had initially created a PR for just the write-side changes (#18954) for simplicity. If reviewers want to review the reader- and writer-side changes separately, I am happy to wait for the writer-side PR to get merged and then send a new PR for the reader-side changes.

Semantics for read:

  • outputPartitioning while scanning a Hive table is the set of bucketing columns (whether or not the table is partitioned, and whether a single Hive partition or multiple Hive partitions are being read).
  • outputOrdering is the sort columns (more precisely, the prefix of the table's sort columns that is actually being read).
  • When reading multiple Hive partitions of the table, there are multiple files per bucket, so there is no global sorting across buckets and the sort information has to be ignored.
  • See the documentation in HiveTableScanExec, where outputPartitioning and outputOrdering are populated, for the nitty-gritty details. A rough sketch of these rules follows below.
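
To make the read-side semantics concrete, here is a minimal, illustrative sketch of how a scan node could derive its output partitioning and ordering from the table's bucket spec. It is not the code in HiveTableScanExec; the object name, helper names, and the `singleFilePerBucket` flag are assumptions made for illustration.

```scala
import org.apache.spark.sql.catalyst.catalog.BucketSpec
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, SortOrder}
import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, UnknownPartitioning}

// Illustrative sketch only; not the PR's code.
object HiveBucketedReadSemantics {

  // Bucketing columns determine the partitioning no matter how many Hive
  // partitions of the table are being read.
  def scanOutputPartitioning(bucketSpec: Option[BucketSpec], output: Seq[Attribute]): Partitioning =
    bucketSpec match {
      case Some(spec) =>
        val bucketCols = spec.bucketColumnNames.flatMap(name => output.find(_.name == name))
        HashPartitioning(bucketCols, spec.numBuckets) // the PR would use Hive's hash function here
      case None =>
        UnknownPartitioning(0)
    }

  // Sort order survives only when each bucket maps to a single file; with
  // multiple Hive partitions there are several files per bucket, so the sort
  // information is dropped.
  def scanOutputOrdering(
      bucketSpec: Option[BucketSpec],
      output: Seq[Attribute],
      singleFilePerBucket: Boolean): Seq[SortOrder] =
    bucketSpec match {
      case Some(spec) if singleFilePerBucket =>
        spec.sortColumnNames
          .map(name => output.find(_.name == name))
          .takeWhile(_.isDefined) // only the prefix of sort columns actually being read
          .flatten
          .map(SortOrder(_, Ascending))
      case _ => Nil
    }
}
```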

Semantics for write:

  • If the Hive table is bucketed, the INSERT node expects its child's distribution to be based on the hash of the bucket columns; otherwise the required distribution is empty. (For comparison, Spark native bucketing does not enforce a required distribution whether or not the table is bucketed, which saves the shuffle relative to Hive.)
  • Sort ordering for an INSERT node over a Hive table is determined as follows:
Table type             | Normal table      | Bucketed table
non-partitioned insert | Nil               | sort columns
static partition       | Nil               | sort columns
dynamic partitions     | partition columns | (partition columns + bucketId + sort columns)

Just to compare how sort ordering is expressed for Spark native bucketing:

Table type    | Normal table      | Bucketed table
sort ordering | partition columns | (partition columns + bucketId + sort columns)

Why is there a difference? With Hive, since bucketed insertions need a shuffle anyway, the sort ordering can be relaxed for both the non-partitioned and static-partition cases: every RDD partition gets rows corresponding to a single bucket, so those can be written to the corresponding output file after a sort. In the case of dynamic partitions, the rows need to be routed to the appropriate partition, which makes the constraints similar to Spark's. The sketch below illustrates the rule.
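
Purely as an illustration of the rule in the tables above (not code from this PR; the function and parameter names are assumptions), the required ordering for an INSERT into a Hive table could be computed like this:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative only: required sort ordering for INSERT into a Hive table,
// following the tables above. Names here are assumptions.
def hiveInsertRequiredOrdering(
    dynamicPartitionColumns: Seq[Expression],
    bucketIdExpression: Option[Expression],
    sortColumns: Seq[Expression],
    tableIsBucketed: Boolean): Seq[Expression] = {
  if (dynamicPartitionColumns.isEmpty) {
    // Non-partitioned insert or static partition: the shuffle already sends each
    // bucket to a single RDD partition, so only the sort columns are needed.
    if (tableIsBucketed) sortColumns else Nil
  } else {
    // Dynamic partitions: rows must be grouped per output file, just like
    // Spark native bucketing.
    if (tableIsBucketed) dynamicPartitionColumns ++ bucketIdExpression ++ sortColumns
    else dynamicPartitionColumns
  }
}
```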

  • Only Overwrite mode is allowed for Hive bucketed tables, as any other mode would break the bucketing guarantees of the table. This is a difference wrt how Spark bucketing works.
  • With this PR, no files are created for empty buckets and a query over such a table will fail. Creation of empty files will be supported in a coming iteration. This is also a difference wrt Spark bucketing, which does NOT need files for empty buckets.

Summary of changes done:

  • ClusteredDistribution and HashPartitioning are modified to store the hashing function used.
  • RunnableCommands can now express their required distribution and ordering. This is used by ExecutedCommandExec, which runs these commands (see the sketch after this list).
    • A nice side effect is that the logic for enforcing sort ordering inside FileFormatWriter, which felt out of place, could be removed. Ideally this kind of insertion of physical nodes should be done within the planner, which is what happens with this PR.
  • InsertIntoHiveTable enforces both the distribution and the sort ordering.
  • InsertIntoHadoopFsRelationCommand enforces the sort ordering ONLY (not the distribution).
  • Fixed a bug due to which any ALTER command on a bucketed table (e.g. updating stats) would wipe out the bucketing spec from the metastore. This made insertion into a bucketed table a non-idempotent operation.
  • HiveTableScanExec populates outputPartitioning and outputOrdering based on the table's metadata, configs, and the query.
  • HadoopTableReader uses BucketizedSparkInputFormat for bucketed reads.
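
To show the shape of the API change described above, here is a minimal sketch, assuming method names like requiredDistribution and requiredOrdering; the exact trait and signatures in the PR may differ.

```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.physical.{Distribution, UnspecifiedDistribution}

// Sketch of the idea only: a command declares what it needs, and the planner
// (EnsureRequirements) inserts the exchange/sort nodes, instead of
// FileFormatWriter sorting the data itself.
trait DeclaresWriteRequirements {
  // e.g. a clustered distribution over the bucket columns for a Hive bucketed
  // insert, unspecified otherwise.
  def requiredDistribution: Seq[Distribution] = Seq(UnspecifiedDistribution)

  // One ordering per child, e.g. partition columns ++ bucketId ++ sort columns
  // for a dynamic-partition insert into a bucketed table.
  def requiredOrdering: Seq[Seq[SortOrder]] = Seq(Nil)
}
```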

How was this patch tested?

  • Added new unit tests

@tejasapatil
Contributor Author


package org.apache.hadoop.hive.ql.io;

import org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil;
Member


Hi, @tejasapatil
Is this the only actual Hive dependency? Without this, it seems that BucketizedSparkInputFormat and BucketizedSparkRecordReader can be promoted to sql/core.

Contributor Author


What do we gain from moving it to sql/core, given that these classes are quite specific to Hive? I don't see any use cases besides Hive benefiting from them, so I decided to keep them in sql/hive and keep sql/core cleaner.

Member


I see. Thanks.

@SparkQA

SparkQA commented Aug 20, 2017

Test build #80879 has finished for PR 19001 at commit 02d8711.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new IOException("Cannot find class " + inputFormatClassName, e);
  • throw new IOException("Unable to find the InputFormat class " + inputFormatClassName, e);

@tejasapatil
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Aug 20, 2017

Test build #80885 has finished for PR 19001 at commit 02d8711.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new IOException("Cannot find class " + inputFormatClassName, e);
  • throw new IOException("Unable to find the InputFormat class " + inputFormatClassName, e);

@SparkQA

SparkQA commented Aug 20, 2017

Test build #80900 has finished for PR 19001 at commit 02d8711.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new IOException("Cannot find class " + inputFormatClassName, e);
  • throw new IOException("Unable to find the InputFormat class " + inputFormatClassName, e);

@dongjoon-hyun
Member

The R failure looks unrelated to this PR.

1. Error: spark.logit (@test_mllib_classification.R#288) -----------------------
java.lang.IllegalArgumentException: requirement failed: The input column stridx_c3082b343085 should have at least two distinct values.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Aug 20, 2017

Test build #80908 has finished for PR 19001 at commit 02d8711.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new IOException("Cannot find class " + inputFormatClassName, e);
  • throw new IOException("Unable to find the InputFormat class " + inputFormatClassName, e);

@SparkQA

SparkQA commented Aug 22, 2017

Test build #81005 has finished for PR 19001 at commit a30b6ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new IOException("Cannot find class " + inputFormatClassName, e);
  • throw new IOException("Unable to find the InputFormat class " + inputFormatClassName, e);

@tejasapatil
Contributor Author

ping @cloud-fan @gatorsmile

@tejasapatil
Contributor Author

#19080 is improving the distribution semantics in the planner. Will wait for that to get in.

@cloud-fan
Contributor

With the simplified distribution semantics, I think it's much easier to support Hive bucketing. We only need to create a HiveHashPartitioning, implement it similarly to HashPartitioning but without satisfying HashPartitionedDistribution, and then we can avoid shuffles for bucketed Hive tables in many cases like aggregate, repartitionBy, broadcast join, etc.

For non-broadcast joins, we have the potential to support them too, after we make the hash function configurable for HashPartitionedDistribution.
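
A rough sketch of what such a HiveHashPartitioning could look like (an assumption made for illustration, not code from this PR or from Spark):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, Distribution, Partitioning}

// Rough sketch: a partitioning that satisfies ClusteredDistribution (so aggregates
// and repartitionBy over the bucket columns need no shuffle) but never the
// hash-based distribution used for joins, because Hive's hash function differs
// from Spark's Murmur3-based one.
case class HiveHashPartitioning(expressions: Seq[Expression], numPartitions: Int)
  extends Partitioning {

  override def satisfies(required: Distribution): Boolean = required match {
    case c: ClusteredDistribution =>
      expressions.forall(e => c.clustering.exists(_.semanticEquals(e)))
    case _ => super.satisfies(required) // UnspecifiedDistribution, AllTuples, etc.
  }
}
```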

@tejasapatil
Contributor Author

Now that #19080 has been merged to trunk, I am rebasing this PR. A small part of this PR has been put into #20206 and is ready for review.

@tejasapatil
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86013 has finished for PR 19001 at commit 7b8a072.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new IOException("Cannot find class " + inputFormatClassName, e);
  • throw new IOException("Unable to find the InputFormat class " + inputFormatClassName, e);

@SparkQA

SparkQA commented Jan 13, 2018

Test build #86074 has finished for PR 19001 at commit 3c367a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new IOException("Cannot find class " + inputFormatClassName, e);
  • throw new IOException("Unable to find the InputFormat class " + inputFormatClassName, e);

@tejasapatil
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jan 13, 2018

Test build #86085 has finished for PR 19001 at commit d37eb8b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jan 13, 2018

Test build #86097 has finished for PR 19001 at commit d37eb8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

cc @cloud-fan @gatorsmile @sameeragarwal for review

val orderingExpr = requiredOrdering
  .map(SortOrder(_, Ascending))
  .map(BindReferences.bindReference(_, outputSpec.outputColumns))
SortExec(
Contributor


Removing SortExec here and adding it in the EnsureRequirements strategy will have an impact on many other DataWritingCommands that depend on FileFormatWriter, like CreateDataSourceTableAsSelectCommand. To fix this, code changes are needed in such DataWritingCommand implementations so that they export requiredDistribution and requiredOrdering.
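
For illustration only (not code from this PR), the kind of override each affected command would need, so that EnsureRequirements adds the SortExec for it, might look like this:

```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, Expression, SortOrder}

// Hypothetical shape of the change the comment describes: each DataWritingCommand
// that previously relied on FileFormatWriter's internal sort would declare the
// same ordering itself (partition columns, then bucket id, then sort columns).
trait ExportsWriteOrdering {
  def partitionColumns: Seq[Attribute]
  def bucketIdExpression: Option[Expression]
  def sortColumns: Seq[Attribute]

  def requiredOrdering: Seq[SortOrder] =
    (partitionColumns ++ bucketIdExpression ++ sortColumns).map(SortOrder(_, Ascending))
}
```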

}

/**
* How is `requiredOrdering` determined ?
Contributor


Why does the definition of requiredOrdering here differ from that in InsertIntoHiveTable?

newJob.setInputFormat(inputFormat.getClass());

for (int i = 0; i < numBuckets; i++) {
  final FileStatus fileStatus = listStatus[i];
Contributor


This logic depends on the files being listed in the right order; otherwise the RDD partitions to be joined cannot be zipped correctly. The logic here should be fixed to reorder the listed files.
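
One possible way to address this, as a sketch (assuming the usual Hive bucket-file naming convention such as "000000_0"; not code from this PR):

```scala
import org.apache.hadoop.fs.FileStatus

// Illustrative sketch: sort the listed files by the bucket id encoded in the
// Hive bucket file name (e.g. "000001_0" -> bucket 1), instead of relying on
// the order returned by FileSystem.listStatus.
def sortByBucketId(listStatus: Array[FileStatus]): Array[FileStatus] = {
  def bucketId(status: FileStatus): Int =
    status.getPath.getName.takeWhile(_ != '_').toInt // assumes "<bucketId>_<attempt>" names
  listStatus.sortBy(bucketId)
}
```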

Contributor

@cloud-fan left a comment


overall looks good, but we should separate this PR into smaller ones.

right: SparkPlan) extends BinaryExecNode with CodegenSupport {
right: SparkPlan,
requiredNumPartitions: Option[Int] = None,
hashingFunctionClass: Class[_ <: HashExpression[Int]] = classOf[Murmur3Hash])
Contributor


I think this can be done in a followup. For the first version we can just add a HiveHashPartitioning, which can satisfy ClusteredDistribution (saves the shuffle for aggregates) but not HashClusteredDistribution (can't save the shuffle for joins).

@@ -43,7 +44,13 @@ trait RunnableCommand extends Command {
// `ExecutedCommand` during query planning.
lazy val metrics: Map[String, SQLMetric] = Map.empty

def run(sparkSession: SparkSession): Seq[Row]
def run(sparkSession: SparkSession, children: Seq[SparkPlan]): Seq[Row] = {
Contributor


ExecutedCommandExec doesn't call it.

@@ -156,40 +144,14 @@ object FileFormatWriter extends Logging {
statsTrackers = statsTrackers
)

// We should first sort by partition columns, then bucket id, and finally sorting columns.
val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
Contributor


Can we send an individual PR to do this? i.e. do the sorting via requiredOrdering instead of doing it manually.

@HyukjinKwon
Member

Hi all, any updates on this PR?

@tejasapatil
Contributor Author

I will close this for now

@cozos
Contributor

cozos commented Jul 4, 2019

Will work on this continue in the future?
