
[SPARK-37287][SQL] Pull out dynamic partition and bucket sort from FileFormatWriter #37099

Closed

Conversation

@allisonwang-db (Contributor) commented on Jul 6, 2022

What changes were proposed in this pull request?

`FileFormatWriter.write` is used by all V1 write commands, including data source and Hive tables. Depending on the dynamic partition, bucket, and sort columns of the V1 write command, `FileFormatWriter` can add a physical sort on top of the query plan that is not directly visible in the plan.

This PR (based on #34568) intends to pull out the physical sort added by FileFormatWriter into logical planning. It adds a new logical rule V1Writes to add logical Sort operators based on the required ordering of a V1 write command. This behavior can be controlled by the new config spark.sql.optimizer.plannedWrite.enabled (default: true).
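To illustrate the idea, here is a simplified, self-contained sketch (not the actual Spark API; `LogicalPlan`, `Relation`, `Sort`, `WriteCommand`, and `V1Writes` below are illustrative stand-ins) of how a `V1Writes`-style rule can pull the sort into logical planning:

```scala
// Simplified model of a logical plan with an output ordering.
sealed trait LogicalPlan { def outputOrdering: Seq[String] }

case class Relation(name: String) extends LogicalPlan {
  def outputOrdering: Seq[String] = Nil
}

case class Sort(order: Seq[String], child: LogicalPlan) extends LogicalPlan {
  def outputOrdering: Seq[String] = order
}

// A V1 write command with a required ordering over its query.
case class WriteCommand(requiredOrdering: Seq[String], query: LogicalPlan)

object V1Writes {
  // Wrap the query in a logical Sort only when the required ordering is
  // not already satisfied, so the file writer never needs to insert a
  // physical sort behind the optimizer's back.
  def apply(cmd: WriteCommand): WriteCommand =
    if (cmd.requiredOrdering.isEmpty ||
        cmd.query.outputOrdering.startsWith(cmd.requiredOrdering)) cmd
    else cmd.copy(query = Sort(cmd.requiredOrdering, cmd.query))
}
```

With this rule, a write over an unsorted relation gets an explicit logical `Sort`, while an already-sorted query is left untouched.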

Why are the changes needed?

Improve observability of V1 write, and unify the logic of V1 and V2 write commands.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests.

@allisonwang-db allisonwang-db marked this pull request as draft July 6, 2022 06:47
@github-actions github-actions bot added the SQL label Jul 6, 2022
@c21 (Contributor) left a comment:

Thank you @allisonwang-db for picking this up! I know the PR is still a draft, but I'm leaving some early questions/comments. cc @ulysses-you and @cloud-fan as well.

```scala
    .internal()
    .doc("When set to true, Spark adds logical sorts to V1 write commands if needed so that " +
      "`FileFormatWriter` does not need to insert physical sorts.")
    .version("3.2.0")
```
nit: 3.4.0?

```diff
@@ -3781,6 +3781,14 @@ object SQLConf {
    .intConf
    .createWithDefault(0)

  val PLANNED_WRITE_ENABLED = buildConf("spark.sql.plannedWrite.enabled")
```
nit: I feel the config name is a bit obscure. Could it be `spark.sql.requireOrderingForV1Writers` or something similar?

@allisonwang-db (Author) replied:

Indeed, the name is not very descriptive. "Planned write" here means we want to explicitly plan file writes instead of adding various operations when executing the write. It could include things other than the required ordering in the future. I am happy to brainstorm more here.

```diff
@@ -3781,6 +3781,14 @@ object SQLConf {
    .intConf
    .createWithDefault(0)

  val PLANNED_WRITE_ENABLED = buildConf("spark.sql.plannedWrite.enabled")
    .internal()
    .doc("When set to true, Spark adds logical sorts to V1 write commands if needed so that " +
```
`Spark` -> `Spark optimizer` could make it clearer that the sort is added during query planning.

```scala
      outputColumns,
      query,
      SparkSession.active.sessionState.conf.resolver)
    // We do not need the path option from the table location to get writer bucket spec.
```
Sorry, why do we need this comment here?

@allisonwang-db (Author) replied:
I've removed the confusing comment. It meant that we don't need other option values, like the table path, when getting the writer bucket spec.

```scala
    val tableLocation = if (table.tableType == CatalogTableType.MANAGED) {
      Some(sessionState.catalog.defaultTablePath(table.identifier))
    } else {
      table.storage.locationUri
    }
```

```diff
@@ -211,6 +180,7 @@ object FileFormatWriter extends Logging {

    try {
      val (rdd, concurrentOutputWriterSpec) = if (orderingMatched) {
        logInfo(s"Output ordering is matched for write job ${description.uuid}")
```
Is it just for debugging?

@allisonwang-db (Author) replied:
This is used in unit tests to check that when V1 writes are enabled, we have added a logical sort and thus do not need to add a physical sort (the ordering should match).

```scala
trait V1WriteCommand extends DataWritingCommand {
  // Specify the required ordering for the V1 write command. `FileFormatWriter` will
  // add SortExec if necessary when the requiredOrdering is empty.
  def requiredOrdering: Seq[SortOrder]
```
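For intuition, the required ordering of a V1 write is typically built from the write spec: dynamic partition columns first, then a bucket-id expression, then the bucket sort columns. A simplified sketch (illustrative names, not Spark's actual API; columns are modeled as plain strings):

```scala
// Simplified model of a V1 write's partitioning/bucketing/sorting spec.
case class WriteSpec(
    partitionColumns: Seq[String],
    bucketColumns: Seq[String],
    sortColumns: Seq[String])

// Derive the ordering rows must satisfy before being handed to the writer:
// partition columns, then a bucket-id expression, then the sort columns.
def requiredOrdering(spec: WriteSpec): Seq[String] = {
  val bucketId =
    if (spec.bucketColumns.isEmpty) Nil
    else Seq(s"bucketId(${spec.bucketColumns.mkString(", ")})")
  spec.partitionColumns ++ bucketId ++ spec.sortColumns
}
```

A non-partitioned, non-bucketed write yields an empty required ordering, so no sort is added at all.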
Just brainstorming here: if we plan to add a requirement for partitioning, e.g. to support a shuffle before writing a bucketed table, do we want to add something similar to V2's `RequiresDistributionAndOrdering` or not?

Reply from another contributor:
I think we can add one more method: `requiredPartitioning`.
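A sketch of the brainstormed extension: a V1 write command exposing a required partitioning alongside the required ordering, mirroring V2's `RequiresDistributionAndOrdering`. The types and the example command below are simplified placeholders, not the actual Spark interfaces:

```scala
// Minimal stand-in for Spark's SortOrder.
case class SortOrder(column: String, ascending: Boolean = true)

trait V1WriteCommand {
  def requiredOrdering: Seq[SortOrder]
  // Brainstormed addition: a distribution requirement, e.g. hash-partition
  // by the bucket columns before writing a bucketed Hive table.
  def requiredPartitioning: Seq[String]
}

// Hypothetical command writing a table bucketed by some columns and
// sorted within each bucket by others.
case class InsertIntoBucketedTable(
    bucketColumns: Seq[String],
    sortColumns: Seq[String]) extends V1WriteCommand {
  def requiredOrdering: Seq[SortOrder] = sortColumns.map(SortOrder(_))
  def requiredPartitioning: Seq[String] = bucketColumns
}
```

A planner rule could then insert both a shuffle (for `requiredPartitioning`) and a sort (for `requiredOrdering`) during logical planning, just as the V2 write path does.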

@allisonwang-db allisonwang-db marked this pull request as ready for review July 19, 2022 05:27
@cloud-fan cloud-fan closed this in 2562274 Jul 19, 2022
@c21 (Contributor) commented on Jul 19, 2022

Thanks @cloud-fan and @allisonwang-db for pushing on this. This is great!

I will work on supporting required partitioning for V1 this week (https://issues.apache.org/jira/browse/SPARK-37287). The motivation is to support shuffling on the bucket columns when writing Hive bucketed tables. cc @cloud-fan, @allisonwang-db and @ulysses-you FYI.

@gengliangwang (Member) commented:

There was an optimization in #32198 that can avoid the local sort if there is only a small set of partition/bucket values. Is that optimization gone after the changes in this PR?

@cloud-fan (Contributor) replied:

That optimization is off by default. When it's turned on, we skip planned write.

@dongjoon-hyun (Member) left a comment:

Hi, @allisonwang-db and @cloud-fan

There is a correctness issue report for this configuration, SPARK-44512, for Apache Spark 3.4.0+. Could you take a look at that?

@dongjoon-hyun (Member) commented:

Sorry all. After checking the reported use case once more, I found that it's a false alarm. I closed the issue as Not A Problem.

dongjoon-hyun pushed a commit that referenced this pull request Nov 13, 2023
…rom `DataSource`

### What changes were proposed in this pull request?
`resolvePartitionColumns` was introduced by SPARK-37287 (#37099) and became unused after SPARK-41713 (#39220), so this PR removes it from `DataSource`.

### Why are the changes needed?
Clean up unused code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43779 from LuciferYang/SPARK-45902.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>