
[SPARK-19256][SQL] Remove ordering enforcement from FileFormatWriter and let planner do that #20206

Closed
wants to merge 3 commits

Conversation

@tejasapatil (Contributor) commented Jan 9, 2018

What changes were proposed in this pull request?

This is as per the discussion in #19483 (comment). Enforcing a Sort at the right places in the tree is something that EnsureRequirements should take care of. This PR removes the Sort node insertion done inside FileFormatWriter.
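To make the division of responsibility concrete, here is a minimal sketch of the idea using toy case classes rather than Spark's real API: each node declares the ordering it requires of its child, and an EnsureRequirements-style rule inserts a Sort only where the child's output ordering does not already satisfy that requirement. The class names and shapes below are illustrative assumptions, not Spark's actual planner types.

```scala
// Toy plan model: each node exposes the ordering its output satisfies.
sealed trait Plan {
  def outputOrdering: Seq[String] // columns the node's output is sorted by
}
case class Scan(cols: Seq[String]) extends Plan {
  def outputOrdering: Seq[String] = Nil // a plain scan guarantees no ordering
}
case class Sort(order: Seq[String], child: Plan) extends Plan {
  def outputOrdering: Seq[String] = order
}
case class Write(requiredOrdering: Seq[String], child: Plan) extends Plan {
  def outputOrdering: Seq[String] = child.outputOrdering
}

// EnsureRequirements-style rule: add a Sort only when the requirement is unmet.
def ensureRequirements(plan: Plan): Plan = plan match {
  case w @ Write(required, child)
      if required.nonEmpty && child.outputOrdering != required =>
    w.copy(child = Sort(required, child))
  case other => other
}

val planned = ensureRequirements(Write(Seq("j", "k"), Scan(Seq("i", "j", "k"))))
println(planned) // Write(List(j, k),Sort(List(j, k),Scan(List(i, j, k))))
```

If the child already produces the required ordering (for example, it is itself a matching Sort), the rule leaves the plan untouched, which is exactly the redundancy the PR avoids by not hard-coding Sort insertion into the writer.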

How was this patch tested?

  • Existing unit tests
  • Inspected the query plan for a bucketed insert and confirmed that a Sort was added to the plan by the planner.
scala> hc.sql(" desc formatted test1  ").collect.foreach(println)
.....
[Num Buckets,8,]
[Bucket Columns,[`j`, `k`],]
[Sort Columns,[`j`, `k`],]


scala> hc.sql(" EXPLAIN INSERT OVERWRITE TABLE test1 SELECT * FROM test2 ").collect.foreach(println)
[== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand file:/warehouse/test1, false, 8 buckets, bucket columns: [j, k], sort columns: [j, k], ...
+- *Sort [pmod(hash(j#56, k#57, 42), 8) ASC NULLS FIRST, j#56 ASC NULLS FIRST, k#57 ASC NULLS FIRST], false, 0
   +- *FileScan orc default.test2[i#55,j#56,k#57] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/warehouse/test2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int,j:int,k:string>]
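The first sort key in the plan above, pmod(hash(j#56, k#57, 42), 8), is the row's bucket id: Spark hashes the bucket columns and wraps the result in pmod so the bucket id is non-negative even when the hash is negative, which Scala's plain % would not guarantee. The sketch below shows only the pmod step; it is not Spark's actual hash implementation.

```scala
// pmod: positive modulus. Unlike %, the result is always in [0, n) for n > 0,
// so a negative hash still maps to a valid bucket id.
def pmod(a: Int, n: Int): Int = {
  val r = a % n
  if (r < 0) r + n else r
}

println(pmod(-7, 8)) // 1, whereas -7 % 8 == -7 in Scala
println(pmod(13, 8)) // 5
```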

@SparkQA commented Jan 9, 2018

Test build #85859 has finished for PR 20206 at commit 652dca2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 10, 2018

Test build #85891 has finished for PR 20206 at commit 1008b2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 11, 2018

Test build #85936 has finished for PR 20206 at commit 8c91ff9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil (Contributor, Author) commented:

Jenkins retest this please

@SparkQA commented Jan 11, 2018

Test build #85946 has finished for PR 20206 at commit 8c91ff9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil (Contributor, Author) commented:

cc @cloud-fan @gengliangwang for review

} else {
// SPARK-21165: the `requiredOrdering` is based on the attributes from analyzed plan, and
// the physical plan may have different attribute ids due to optimizer removing some
// aliases. Here we bind the expression ahead to avoid potential attribute ids mismatch.
Contributor:
This concern is still valid, the DataWritingCommand.requiredChildOrdering is based on logical plan's output attribute ids, how can we safely apply it in DataWritingCommandExec?
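The concern can be illustrated with a toy model (these are not Spark's classes): attributes carry an exprId, and the optimizer may produce a physical plan whose attributes have different exprIds than the analyzed plan. Binding resolves an attribute to its ordinal in the child's output up front, so later evaluation no longer depends on the exprIds still matching. All names below are hypothetical.

```scala
// Toy attribute model: a column reference identified by name plus exprId.
final case class Attribute(name: String, exprId: Long)
// A bound reference points at an ordinal position, not an exprId.
final case class BoundReference(ordinal: Int)

// Bind by exprId: succeeds only if the child's output still carries the id.
def bind(attr: Attribute, childOutput: Seq[Attribute]): Option[BoundReference] = {
  val ordinal = childOutput.indexWhere(_.exprId == attr.exprId)
  if (ordinal >= 0) Some(BoundReference(ordinal)) else None
}

val required        = Attribute("j", 56)                  // from the analyzed plan
val analyzedOutput  = Seq(Attribute("j", 56), Attribute("k", 57))
val optimizedOutput = Seq(Attribute("j", 99), Attribute("k", 57)) // alias removed, new id

println(bind(required, analyzedOutput))  // Some(BoundReference(0))
println(bind(required, optimizedOutput)) // None: the id no longer matches
```

Binding against the analyzed plan's output before optimization sidesteps the mismatch, which is what the SPARK-21165 comment in the removed code was guarding against.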

*
* table type | requiredOrdering
* -----------------+-------------------------------------------------
* normal table | partition columns
Contributor:
nit: non-bucketed table, a partitioned table is not a normal table...

@@ -150,6 +152,10 @@ case class InsertIntoHadoopFsRelationCommand(
}
}

val partitionSet = AttributeSet(partitionColumns)
val dataColumns = query.output.filterNot(partitionSet.contains)
Contributor:
We should use outputColumns instead of query.output, cc @gengliangwang

Member:
+1, it should be outputColumns here, which is the output columns of analyzed plan. See #20020 for details.

@HyukjinKwon (Member) commented:

Hi all, any update here?

@github-actions:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 14, 2020
@github-actions github-actions bot closed this Jan 15, 2020