[SPARK-19563][SQL] avoid unnecessary sort in FileFormatWriter #16898
Conversation
// We should first sort by partition columns, then bucket id, and finally sorting columns.
val requiredOrdering = (partitionColumns ++ bucketIdExpression ++ sortColumns)
  .map(SortOrder(_, Ascending))
val rdd = if (requiredOrdering == queryExecution.executedPlan.outputOrdering) {
If the data's outputOrdering is [partCol1, partCol2, dataCol1] and here the requiredOrdering is [partCol1, partCol2], you will miss this optimization.
oh, I should check the subset, good catch!
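The subset check being discussed can be sketched outside of Spark roughly like this (the `SortOrder` case class below is a stand-in for Catalyst's class, and all names are illustrative):

```scala
// Stand-in for Catalyst's SortOrder, just for illustration.
case class SortOrder(child: String, ascending: Boolean = true)

// The sort can be skipped when the required ordering is a prefix of the
// plan's actual output ordering.
def canSkipSort(required: Seq[SortOrder], actual: Seq[SortOrder]): Boolean =
  required == actual.take(required.length)

// [partCol1, partCol2] is a prefix of [partCol1, partCol2, dataCol1],
// so the data is already in an acceptable order.
val required = Seq(SortOrder("partCol1"), SortOrder("partCol2"))
val actual = Seq(SortOrder("partCol1"), SortOrder("partCol2"), SortOrder("dataCol1"))
assert(canSkipSort(required, actual))
assert(!canSkipSort(actual, required))
```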
val rdd = if (requiredOrdering == queryExecution.executedPlan.outputOrdering) {
  queryExecution.toRdd
} else {
  SortExec(requiredOrdering, global = false, queryExecution.executedPlan).execute()
Using `SortExec` here is clever.
val actualOrdering = queryExecution.executedPlan.outputOrdering
// We can still avoid the sort if the required ordering is [partCol] and the actual ordering
// is [partCol, anotherCol].
val rdd = if (requiredOrdering == actualOrdering.take(requiredOrdering.length)) {
We only care if partition columns are the same between requiredOrdering and actualOrdering. The sort direction doesn't matter.
@@ -120,9 +127,10 @@ object FileFormatWriter extends Logging {
serializableHadoopConf = new SerializableConfiguration(job.getConfiguration),
outputWriterFactory = outputWriterFactory,
allColumns = queryExecution.logical.output,
Directly use `allColumns` created above.
currentKey = nextKey.copy()
logDebug(s"Writing partition: $currentKey")
for (row <- iter) {
  val nextPartColsAndBucketId = getPartitionColsAndBucketId(row)
`getPartitionColsAndBucketId` is an unsafe projection, so `nextPartColsAndBucketId` is a new unsafe row. Do we still need a `copy` when assigning it to `currentPartColsAndBucketId`?
Previously we needed a copy because `getBucketingKey` can be an identity function, so the `nextKey` can be the same unsafe row.
If you take a look at `GenerateUnsafeProjection`, it actually reuses the same row instance, so we need to copy.
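A minimal, Spark-free model of that reuse behavior (the class below is hypothetical and only mimics how a generated projection hands back the same mutable buffer on every call):

```scala
// Hypothetical stand-in: like a generated unsafe projection, it writes each
// result into one reused buffer and returns that same instance every time.
class ReusingProjection {
  private val buffer = Array(0)
  def apply(value: Int): Array[Int] = { buffer(0) = value; buffer }
}

val proj = new ReusingProjection
val first = proj(1)
val second = proj(2)
// Both references point at the same buffer, which now holds the latest value,
// so a caller that wants to keep the old key must copy it first.
assert(first eq second)
assert(first(0) == 2)
```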
A few comments. Others LGTM.
Test build #72764 has finished for PR 16898 at commit
Test build #72767 has finished for PR 16898 at commit
val rdd = if (requiredOrdering == actualOrdering.take(requiredOrdering.length)) {
  queryExecution.toRdd
} else {
  SortExec(requiredOrdering, global = false, queryExecution.executedPlan).execute()
Oh, I met this case before. IIRC, this complains in Scala 2.10. I guess it should be
`SortExec(requiredOrdering, global = false, child = queryExecution.executedPlan).execute()`
because it seems the compiler gets confused by the mixed positional/named arguments.
I am running a build with 2.10 to help verify.
Yea, it seems it complains.
[error] .../spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala:160: not enough arguments for method apply: (sortOrder: Seq[org.apache.spark.sql.catalyst.expressions.SortOrder], global: Boolean, child: org.apache.spark.sql.execution.SparkPlan, testSpillFrequency: Int)org.apache.spark.sql.execution.SortExec in object SortExec.
[error] Unspecified value parameter child.
[error] SortExec(requiredOrdering, global = false, queryExecution.executedPlan).execute()
[error]
Good catch!
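For reference, the workaround amounts to naming the trailing argument as well. A minimal, Spark-free sketch of the pattern (the case class below only imitates `SortExec`'s parameter shape, including the defaulted `testSpillFrequency`; it is not the real class):

```scala
// Imitation of SortExec's parameter list, with a trailing default parameter.
case class SortExecLike(
    sortOrder: Seq[String],
    global: Boolean,
    child: String,
    testSpillFrequency: Int = 0)

// Scala 2.10 rejected the positional argument after a named one here;
// naming `child` explicitly avoids the ambiguity on both 2.10 and 2.11.
val sort = SortExecLike(Seq("partCol"), global = false, child = "plan")
assert(sort.child == "plan")
assert(sort.testSpillFrequency == 0)
```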
val partitionSet = AttributeSet(partitionColumns)
val dataColumns = queryExecution.logical.output.filterNot(partitionSet.contains)
val bucketColumns = bucketSpec.toSeq.flatMap {
  spec => spec.bucketColumnNames.map(c => allColumns.find(_.name == c).get)
nit: `allColumns` -> `dataColumns`?
No need to look at all columns since Spark doesn't allow bucketing over partition columns.
scala> df1.write.format("orc").partitionBy("i").bucketBy(8, "i").sortBy("k").saveAsTable("table70")
org.apache.spark.sql.AnalysisException: bucketBy columns 'i' should not be part of partitionBy columns 'i';
HashPartitioning(bucketColumns, spec.numBuckets).partitionIdExpression
}
// We should first sort by partition columns, then bucket id, and finally sorting columns.
val requiredOrdering = (partitionColumns ++ bucketIdExpression ++ sortColumns)
Possible over-optimization: Spark allows sorting over partition columns, so `requiredOrdering` can be changed to `partitionColumns` ++ `bucketIdExpression` ++ (the `sortColumns` which are not in `partitionColumns`), so that any extra column(s) in the sort expression can be deduped.
scala> df1.write.format("orc").partitionBy("i").bucketBy(8, "i").sortBy("k").saveAsTable("table70")
org.apache.spark.sql.AnalysisException: bucketBy columns 'i' should not be part of partitionBy columns 'i';
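The dedup being proposed could look roughly like this (plain `Seq[String]`s stand in for the actual Catalyst expressions; all names are illustrative):

```scala
val partitionColumns = Seq("i")
val bucketIdExpression = Seq("bucketId")
val sortColumns = Seq("k", "i") // "i" duplicates a partition column

// Keep only the sort columns that are not already partition columns.
val requiredOrdering =
  partitionColumns ++ bucketIdExpression ++
    sortColumns.filterNot(partitionColumns.contains)

assert(requiredOrdering == Seq("i", "bucketId", "k"))
```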
does it make sense to sort over partition columns in a bucket? I'm surprised if we support this...
It does not make sense (I thought it was intentional). This should definitely be fixed. I dug through the commit logs and saw that this was fixed for bucketing columns in #10891, but there was no discussion around sort columns. Will log a JIRA for this.
val requiredOrdering = (partitionColumns ++ bucketIdExpression ++ sortColumns)
  .map(SortOrder(_, Ascending))
val actualOrdering = queryExecution.executedPlan.outputOrdering
// We can still avoid the sort if the required ordering is [partCol] and the actual ordering
The comment makes it feel like it's specific to partition columns, but the code below does not have anything specific to partition columns.
val actualOrdering = queryExecution.executedPlan.outputOrdering
// We can still avoid the sort if the required ordering is [partCol] and the actual ordering
// is [partCol, anotherCol].
val rdd = if (requiredOrdering == actualOrdering.take(requiredOrdering.length)) {
You could do semantic equals and not object equals. I recall that using object equals in `EnsureRequirements` was adding an unnecessary SORT in some cases: https://github.com/apache/spark/pull/14841/files#diff-cdb577e36041e4a27a605b6b3063fd54
@@ -189,7 +215,7 @@ object FileFormatWriter extends Logging {
committer.setupTask(taskAttemptContext)

val writeTask =
-  if (description.partitionColumns.isEmpty && description.bucketSpec.isEmpty) {
+  if (description.partitionColumns.isEmpty && description.numBuckets == 0) {
For someone reading the code, it might be non-intuitive that this is checking whether there is no bucketing. `0` has been used in many places in this PR to check whether the table has bucketing. Maybe orthogonal to this PR, but in general we could have a util method for this. I can send a tiny PR for it if you agree that it's a good thing to do.
PS: Having 0 buckets is a thing in Hive, but logically it makes no sense and is confusing. Under the hood, Hive treats that as a table with a single bucket. It's good that Spark does not allow this.
# hive-1.2.1
hive> CREATE TABLE tejasp_temp_can_be_deleted (key string, value string) CLUSTERED BY (key) INTO 0 BUCKETS;
Time taken: 1.144 seconds
hive> desc formatted tejasp_temp_can_be_deleted;
# Storage Information
...
Num Buckets: 0
Bucket Columns: [key]
Sort Columns: []
hive>INSERT OVERWRITE TABLE tejasp_temp_can_be_deleted SELECT * FROM ....;
# doing `ls` on the output directory shows a single file
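The utility method suggested above might look like this (the case class is a hypothetical stand-in for the PR's `WriteJobDescription`, not the real one):

```scala
// Hypothetical stand-in for the PR's WriteJobDescription.
case class WriteJobDescriptionLike(numBuckets: Int)

// One named helper instead of scattered `numBuckets == 0` checks.
def isBucketed(desc: WriteJobDescriptionLike): Boolean = desc.numBuckets > 0

assert(!isBucketed(WriteJobDescriptionLike(0)))
assert(isBucketed(WriteJobDescriptionLike(8)))
```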
@@ -329,31 +349,41 @@ object FileFormatWriter extends Logging {
 * If bucket id is specified, we will append it to the end of the file name, but before the
nit for the previous line: "Open and returns a ..." — this method does not return anything.
nit: typo in PR title
Test build #72942 has finished for PR 16898 at commit
Test build #72945 has finished for PR 16898 at commit
retest this please
Test build #72955 has finished for PR 16898 at commit
retest this please
Test build #73059 has finished for PR 16898 at commit
Test build #73061 has finished for PR 16898 at commit
cc @tejasapatil how is the updated version?
@cloud-fan : LGTM
Sorry, I am late. Will review it tonight. Thanks!
@@ -108,9 +107,21 @@ object FileFormatWriter extends Logging {
job.setOutputValueClass(classOf[InternalRow])
FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))

val allColumns = queryExecution.logical.output
val partitionSet = AttributeSet(partitionColumns)
val dataColumns = queryExecution.logical.output.filterNot(partitionSet.contains)
If we rewrite it to `val dataColumns = allColumns.filterNot(partitionColumns.contains)`, we do not need `partitionSet`.
it's so minor, I'll fix it in my next PR
@@ -287,31 +320,16 @@ object FileFormatWriter extends Logging {
 * multiple directories (partitions) or files (bucketing).
 */
private class DynamicPartitionWriteTask(
-    description: WriteJobDescription,
+    desc: WriteJobDescription,
`SingleDirectoryWriteTask` is still using `description`. Change both or keep it unchanged?
I'd like to change both to make it consistent.
} else {
  requiredOrdering.zip(actualOrdering).forall {
    case (requiredOrder, childOutputOrder) =>
      requiredOrder.semanticEquals(childOutputOrder)
Because `bucketIdExpression` is `HashPartitioning`, this will never match, right?
It's `HashPartitioning(...).partitionIdExpression`, which returns `Pmod(new Murmur3Hash(expressions), Literal(numPartitions))`, so it may match.
thanks for the review, merging to master!
## What changes were proposed in this pull request?

In `FileFormatWriter`, we will sort the input rows by partition columns, bucket id, and sort columns if we want to write data out partitioned or bucketed. However, if the data is already sorted, we will sort it again, which is unnecessary. This PR removes the sorting logic in `FileFormatWriter` and uses `SortExec` instead. We will not add `SortExec` if the data is already sorted.

## How was this patch tested?

I did a micro benchmark manually

```
val df = spark.range(10000000).select($"id", $"id" % 10 as "part").sort("part")
spark.time(df.write.partitionBy("part").parquet("/tmp/test"))
```

The result was about 6.4 seconds before this PR, and is 5.7 seconds afterwards.

close apache#16724

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16898 from cloud-fan/writer.
// We should first sort by partition columns, then bucket id, and finally sorting columns.
val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
// the sort order doesn't matter
val actualOrdering = queryExecution.executedPlan.outputOrdering.map(_.child)
@cloud-fan would it be possible to use the logical plan rather than the executedPlan? If the optimizer decides the data is already sorted according to the logical plan, the executedPlan won't include the fields.
That would be great, but may need some refactoring.
## What changes were proposed in this pull request?

In `FileFormatWriter`, we will sort the input rows by partition columns, bucket id, and sort columns if we want to write data out partitioned or bucketed. However, if the data is already sorted, we will sort it again, which is unnecessary.

This PR removes the sorting logic in `FileFormatWriter` and uses `SortExec` instead. We will not add `SortExec` if the data is already sorted.

## How was this patch tested?

I did a micro benchmark manually

```
val df = spark.range(10000000).select($"id", $"id" % 10 as "part").sort("part")
spark.time(df.write.partitionBy("part").parquet("/tmp/test"))
```

The result was about 6.4 seconds before this PR, and is 5.7 seconds afterwards.

close #16724