
[SPARK-26164][SQL] Allow concurrent writers for writing dynamic partitions and bucket table #32198

Closed
wants to merge 10 commits

Conversation

c21
Contributor

@c21 c21 commented Apr 16, 2021

What changes were proposed in this pull request?

This is a re-proposal of #23163. Currently Spark always requires a local sort before writing to an output table with dynamic partition/bucket columns. The sort is unnecessary if the cardinality of the partition/bucket values is small, and can be avoided by keeping multiple output writers open concurrently.

This PR introduces a config, spark.sql.maxConcurrentOutputFileWriters (which disables the feature by default), that lets users tune the maximum number of concurrent writers. The config is needed because we cannot keep an arbitrary number of writers in task memory, which can cause OOM (especially for the Parquet/ORC vectorized writers).

The feature first uses concurrent writers to write rows. If the number of writers exceeds the limit specified by the config above, it sorts the rest of the rows and writes them one by one (see DynamicPartitionDataConcurrentWriter.writeWithIterator()).

In addition, the interface WriteTaskStatsTracker and its implementation BasicWriteTaskStatsTracker are changed, because they previously relied on the assumption that only one writer is active when writing dynamic partitions and bucketed tables.
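The two-phase behavior described above can be sketched as follows. This is an illustrative, self-contained model only: `ConcurrentWriterSketch`, `maxWriters`, `writeOne`, and the in-memory buffers are hypothetical stand-ins, not the PR's actual classes.

```scala
import scala.collection.mutable

case class Row(partition: String, value: Int)

class ConcurrentWriterSketch(maxWriters: Int) {
  private val writers = mutable.Map.empty[String, mutable.Buffer[Int]]
  var fellBackToSort = false

  def writeAll(rows: Iterator[Row]): Unit = {
    val it = rows.buffered
    // Phase 1: keep one writer open per partition, up to the limit.
    while (it.hasNext && !fellBackToSort) {
      if (!writers.contains(it.head.partition) && writers.size == maxWriters) {
        fellBackToSort = true // limit reached: switch to sort-based writing
      } else {
        writeOne(it.next())
      }
    }
    // Phase 2: sort the remaining rows by partition so each remaining
    // writer can be filled one partition at a time (and, in the real
    // implementation, closed eagerly when its partition is done).
    it.toSeq.sortBy(_.partition).foreach(writeOne)
  }

  private def writeOne(row: Row): Unit =
    writers.getOrElseUpdate(row.partition, mutable.Buffer.empty[Int]) += row.value

  def partitionsWritten: Set[String] = writers.keySet.toSet
}
```

Once the fallback triggers, the sorted phase handles the rest one partition at a time, so each writer can be closed before the next opens.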

Why are the changes needed?

Avoid the sort before writing output for dynamically partitioned queries and bucketed tables.
This helps improve CPU and IO performance for these queries.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit test in DataFrameReaderWriterSuite.scala.

@github-actions github-actions bot added the SQL label Apr 16, 2021
@c21
Contributor Author

c21 commented Apr 16, 2021

cc @cloud-fan and @maropu could you help take a look when you have time? Thanks.

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42026/

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42026/

@HyukjinKwon
Member

@c21 would you mind rebasing w/ the latest master branch? Seems like your branch is based on the old master branch.

@github-actions

Test build #754194290 for PR 32198 at commit 4b2801b.

@c21
Contributor Author

c21 commented Apr 16, 2021

@HyukjinKwon - thanks for the heads up, updated.

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42033/

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42033/

@SparkQA

SparkQA commented Apr 16, 2021

Test build #137451 has finished for PR 32198 at commit 18f2851.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ConcurrentOutputWriterSpec(

@SparkQA

SparkQA commented Apr 16, 2021

Test build #137459 has finished for PR 32198 at commit 4b2801b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ConcurrentOutputWriterSpec(

}
}

sealed abstract class WriterMode
Contributor

@cloud-fan cloud-fan Apr 20, 2021

This abstraction is a bit confusing. Single writer vs. concurrent writers is a mode that is decided statically, while before-sort and after-sort are more like runtime states than modes.

I'd expect different FileFormatDataWriter implementations for single and concurrent writers, and the concurrent writers implementation has a boolean state to indicate before and after sort.
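The restructuring suggested here could look roughly like the following sketch (hypothetical stand-in types; the real `FileFormatDataWriter` hierarchy carries much more state):

```scala
import scala.collection.mutable

trait FileFormatDataWriterSketch {
  def write(partition: String): Unit
}

// Single-writer mode: chosen statically, no extra runtime state.
class SingleWriterSketch extends FileFormatDataWriterSketch {
  val written = mutable.Buffer.empty[String]
  def write(partition: String): Unit = written += partition
}

// Concurrent-writer mode: `sorted` is a runtime state that flips once the
// writer falls back to sort-based writing; it is not a separate mode.
class ConcurrentWritersSketch(limit: Int) extends FileFormatDataWriterSketch {
  private var sorted = false
  private val open = mutable.Set.empty[String]
  def write(partition: String): Unit = {
    if (!sorted && !open.contains(partition) && open.size == limit) {
      sorted = true // would trigger sorting the remaining rows
    }
    open += partition
  }
  def isSorted: Boolean = sorted
}
```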

Contributor Author

@cloud-fan - sounds good, I agree. Will restructure the code.

Btw, what do you think of the changes in WriteTaskStatsTracker and BasicWriteTaskStatsTracker? Do you have any concerns with those interface changes?

* Keep all writers open and write rows one by one.
* - Step 2: If number of concurrent writers exceeds limit, sort rest of rows. Write rows
* one by one, and eagerly close the writer when finishing each partition and/or
* bucket.
Contributor

does it mean we can have limit + 1 writers at most?

Contributor Author

Yes.

var outputWriter: OutputWriter,
var recordsInFile: Long,
var fileCounter: Int,
var filePath: String)
Contributor

does it mean the latest file path?

Contributor Author

Yes, because we may create a new file when the limit on the number of records per file is exceeded.
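In other words, `filePath` tracks the most recently opened file for that writer, because a new file is rolled when the per-file record limit is hit. A minimal, hypothetical sketch of that rollover (names are illustrative, not the PR's internals):

```scala
class WriterStatusSketch(maxRecordsPerFile: Long) {
  var recordsInFile: Long = 0L
  var fileCounter: Int = 0
  var filePath: String = newPath()

  private def newPath(): String = s"part-$fileCounter.parquet"

  def writeRow(): Unit = {
    if (recordsInFile >= maxRecordsPerFile) {
      fileCounter += 1     // roll over to a fresh file
      filePath = newPath() // `filePath` now points at the latest file
      recordsInFile = 0
    }
    recordsInFile += 1
  }
}
```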

@c21
Contributor Author

c21 commented Apr 21, 2021

@cloud-fan - updated the PR to keep the single and concurrent writer implementations separate. The PR is ready for review again, thanks.

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42242/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42242/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42244/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42244/

@SparkQA

SparkQA commented Apr 21, 2021

Test build #137714 has finished for PR 32198 at commit 0442f05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BaseDynamicPartitionDataWriter(
  • class DynamicPartitionDataSingleWriter(
  • class DynamicPartitionDataConcurrentWriter(

@@ -3150,6 +3150,14 @@ object SQLConf {
.booleanConf
.createWithDefault(false)

val MAX_CONCURRENT_OUTPUT_WRITERS = buildConf("spark.sql.maxConcurrentOutputWriters")
Contributor

maxConcurrentOutputFileWriters? To indicate it's for file source only.

Contributor Author

@cloud-fan - updated.

numFiles += 1
}
curFile = None
private def getFileStats(filePath: String): Unit = {
Contributor

seems it's not getFileStats, but updateFileStats

Contributor Author

@cloud-fan - updated.

@@ -47,6 +48,7 @@ abstract class FileFormatDataWriter(
protected val MAX_FILE_COUNTER: Int = 1000 * 1000
protected val updatedPartitions: mutable.Set[String] = mutable.Set[String]()
protected var currentWriter: OutputWriter = _
Contributor

It seems all OutputWriter implementations have a path string. Shall we simply add a def path: String in OutputWriter? Then we don't need the currentPath

Contributor Author

@cloud-fan - makes sense. At first I was hesitant to make a broader change to the OutputWriter interface, but it's updated now.
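The resulting interface shape might look like this sketch (simplified stand-ins, not the actual Spark `OutputWriter` trait):

```scala
import scala.collection.mutable

abstract class OutputWriterSketch {
  def path: String // every implementation already knows its file path
  def write(row: String): Unit
  def close(): Unit
}

class TextOutputWriterSketch(val path: String) extends OutputWriterSketch {
  private val rows = mutable.Buffer.empty[String]
  def write(row: String): Unit = rows += row
  def close(): Unit = () // a real writer would flush and close the stream
  def rowCount: Int = rows.size
}
```

With `path` on the writer itself, callers no longer need to carry a separate `currentPath` variable alongside `currentWriter`.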

var bucketId: Option[Int])

/** Wrapper class for status of a unique concurrent output writer. */
private case class WriterStatus(
Contributor

Its fields are all var, so we can make it a plain class instead of a case class.

Contributor Author

@cloud-fan - updated.
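The resulting shape is roughly the following (the `OutputWriter` field is replaced by a `String` stand-in to keep the sketch self-contained):

```scala
// A plain class: the equals/hashCode/copy that a case class generates
// would be misleading for fully mutable state.
class WriterStatus(
    var outputWriter: String,
    var recordsInFile: Long,
    var fileCounter: Int,
    var filePath: String)
```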

dataWriter.write(iterator.next())
dataWriter match {
case w: DynamicPartitionDataConcurrentWriter =>
w.writeWithIterator(iterator)
Contributor

We can make it an API in the base class, which by default just do

while (iterator.hasNext) {
  write(iterator.next())
}

Contributor Author

@c21 c21 Apr 21, 2021

@cloud-fan - I was wondering what the benefit of that is, but on second thought I get it: you want to avoid the pattern matching here. Updated.
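The suggested base-class API might be sketched like this (hypothetical stand-in classes; in the PR the method lives on `FileFormatDataWriter` and the override on `DynamicPartitionDataConcurrentWriter`):

```scala
abstract class DataWriterSketch {
  def write(record: Int): Unit

  // Default: drain the iterator through the single-record path, so callers
  // never need to pattern-match on the concrete writer type.
  def writeWithIterator(iterator: Iterator[Int]): Unit =
    while (iterator.hasNext) write(iterator.next())
}

class SingleWriterImpl extends DataWriterSketch {
  var count = 0
  def write(record: Int): Unit = count += 1
}

class ConcurrentWriterImpl extends DataWriterSketch {
  var concurrentPhase = 0
  var sortedPhase = 0
  def write(record: Int): Unit = concurrentPhase += 1

  // Override: after the (pretend) writer limit of 2 is hit, the remaining
  // records go through a separate sorted phase.
  override def writeWithIterator(iterator: Iterator[Int]): Unit = {
    var n = 0
    while (iterator.hasNext && n < 2) { write(iterator.next()); n += 1 }
    sortedPhase = iterator.length
  }
}
```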

test("SPARK-26164: Allow concurrent writers for multiple partitions and buckets") {
withTable("t1", "t2") {
val df = spark.range(200).map(_ => {
val n = scala.util.Random.nextInt
Contributor

can we use a fixed seed in the test? Otherwise there is a small possibility that there are fewer than 3 distinct values and the fallback test doesn't trigger.

Contributor Author

@cloud-fan - good call. Updated.
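A minimal sketch of the fix (hypothetical helper; the actual test builds a DataFrame with `spark.range`):

```scala
import scala.util.Random

// With a fixed seed the generated partition values are deterministic, so a
// run with enough distinct values always exercises the fallback path.
def partitionValues(seed: Long, n: Int): Seq[Int] = {
  val rng = new Random(seed)
  Seq.fill(n)(rng.nextInt(10))
}
```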

Contributor

@cloud-fan cloud-fan left a comment

LGTM except some minor comments

Contributor Author

@c21 c21 left a comment

Addressed all comments except the ones I replied to with questions. cc @cloud-fan, thanks.

statsTrackers.foreach(_.newPartition(currentWriterId.partitionValues.get))
}
}
retrieveWriterInMap()
Contributor Author

@cloud-fan - updated.

s" which is beyond max value ${concurrentOutputWriterSpec.maxWriters + 1}")
}
concurrentWriters.put(
WriterIndex(currentWriterId.partitionValues, currentWriterId.bucketId),
Contributor Author

@cloud-fan - updated.


@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42457/

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42457/

*/
private def clearCurrentWriterStatus(): Unit = {
if (currentWriterId.partitionValues.isDefined || currentWriterId.bucketId.isDefined) {
updateCurrentWriterStatusInMap()
Contributor

shall we call it right after when sorted becomes true?

Contributor Author

I wish I could, to tie the logic together more closely, but unfortunately no. We need to write a record (writeRecord) between (1) setting sorted to true (setupCurrentWriterUsingMap) and (2) cleaning up the current writer status (clearCurrentWriterStatus).

writeRecord changes the writer status by increasing recordsInFile by 1.

@SparkQA

SparkQA commented Apr 26, 2021

Test build #137935 has finished for PR 32198 at commit 2895837.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42490/

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42490/

@SparkQA

SparkQA commented Apr 27, 2021

Test build #137970 has finished for PR 32198 at commit efe026c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 7f51106 Apr 27, 2021
@c21
Contributor Author

c21 commented Apr 27, 2021

Thank you @cloud-fan for all the dedicated help and careful review!
Thank you @imback82 and @ulysses-you for commenting and review too!

@c21 c21 deleted the writer branch April 27, 2021 06:39
Contributor

@imback82 imback82 left a comment

Late +1, thanks @c21!

@hvanhovell
Contributor

@c21 this doesn't do any sort of memory tracking right? How do you avoid OOMs?

@hvanhovell
Contributor

One more thing, how much does this improve the write? Local sorts before the write are typically not too bad if you look at the cycles spent during the write. A much bigger target here would be to properly interleave I/O and CPU operations. You sort of achieve that by having multiple writers, but IMO it feels like quite a big hammer.

@c21
Contributor Author

c21 commented Apr 28, 2021

this doesn't do any sort of memory tracking right?

Yes. It seems to me there's no way to track the memory usage accurately, because the writers use on-heap memory. We would also need memory usage information to be retrievable from each individual writer implementation (Parquet, ORC, Avro, etc.), which is not the case right now.

One rough idea, though, is to look at the executor JVM heap memory usage (which I think is already captured).

@c21
Contributor Author

c21 commented Apr 28, 2021

How do you avoid OOMs?

Note that the feature is designed to be disabled by default and enabled case by case for now. The fallback logic here is intended to avoid OOM from opening too many writers.

@c21
Contributor Author

c21 commented Apr 28, 2021

One more thing, how much does this improve the write? Local sorts before the write are typically not too bad if you look at the cycles spent during the write. A much bigger target here would be to properly interleave I/O and CPU operations. You sort of achieve that by having multiple writers, but IMO it feels like quite a big hammer.

I will add a benchmark for this as a followup.

IMHO how much this improves things really depends on the query shape (the cardinality of dynamic partitions and buckets). In an environment where most queries write a small number of partitions and users configure relatively few buckets, this feature helps more; in an environment where queries write many partitions and users configure many buckets, it helps less. We do see benefit from it for internal queries, and people have raised the request on the Spark dev list as well.

cloud-fan added a commit that referenced this pull request May 7, 2021
…file the row is written to

### What changes were proposed in this pull request?

This is a follow-up of #32198

Before #32198, in `WriteTaskStatsTracker.newRow`, we know that the row is written to the current file. After #32198 , we no longer know this connection.

This PR adds the file path parameter in `WriteTaskStatsTracker.newRow` to bring back the connection.
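The changed callback can be sketched as follows (a simplified stand-in for Spark's `WriteTaskStatsTracker`; the real trait has more callbacks and parameters):

```scala
import scala.collection.mutable

trait WriteTaskStatsTrackerSketch {
  def newFile(filePath: String): Unit
  // The file path parameter restores the row-to-file connection even when
  // multiple writers are open concurrently.
  def newRow(filePath: String, row: String): Unit
}

class BasicTrackerSketch extends WriteTaskStatsTrackerSketch {
  val rowsPerFile = mutable.Map.empty[String, Long].withDefaultValue(0L)
  def newFile(filePath: String): Unit = rowsPerFile(filePath) = 0L
  def newRow(filePath: String, row: String): Unit = rowsPerFile(filePath) += 1
}
```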

### Why are the changes needed?

To not break some custom `WriteTaskStatsTracker` implementations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32459 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparksFyz

SparksFyz commented Mar 9, 2022

One more thing, how much does this improve the write? Local sorts before the write are typically not too bad if you look at the cycles spent during the write. A much bigger target here would be to properly interleave I/O and CPU operations. You sort of achieve that by having multiple writers, but IMO it feels like quite a big hammer.

I will add a benchmark for this as a followup.

IMHO how much this improves things really depends on the query shape (the cardinality of dynamic partitions and buckets). In an environment where most queries write a small number of partitions and users configure relatively few buckets, this feature helps more; in an environment where queries write many partitions and users configure many buckets, it helps less. We do see benefit from it for internal queries, and people have raised the request on the Spark dev list as well.

@c21 Hi, is there a link to the benchmark? Thanks.
I am interested in how much this improves things when setting the number to 2 for a static partition write (single partition).
