
[Spark] Implement optimized write. #2145

Closed
wants to merge 7 commits

Conversation

weiluo-db
Contributor

@weiluo-db commented Oct 6, 2023

[Spark] Implement optimized write.

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Optimized write is an optimization that repartitions and rebalances data before writing them out to a Delta table. Optimized writes improve file size as data is written and benefit subsequent reads on the table.

This PR introduces a new DeltaOptimizedWriterExec exec node. It's responsible for executing the shuffle (HashPartitioning based on the table's partition columns) and rebalancing afterwards. More specifically, the number of shuffle partitions is controlled by two new knobs:

  • spark.databricks.delta.optimizeWrite.numShuffleBlocks (default=50,000,000), which controls "maximum number of shuffle blocks to target";
  • spark.databricks.delta.optimizeWrite.maxShufflePartitions (default=2,000), which controls "max number of output buckets (reducers) that can be used by optimized writes".

After repartitioning, the blocks are then sorted in ascending order by size and bin-packed into appropriately-sized bins for output tasks. The bin size is controlled by the following new knob:

  • spark.databricks.delta.optimizeWrite.binSize (default=512MiB).

Note that this knob is based on the in-memory size of row-based shuffle blocks. So the final output Parquet size is usually smaller than the bin size due to column-based encoding and compression.
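
To make those two steps concrete, here is a minimal Scala sketch of how the reducer count and the bin packing could work. It is an illustration only: the helper names (targetNumReducers, packIntoBins) and the exact rounding/capping behavior are assumptions, not the PR's actual logic, which lives in DeltaOptimizedWriterExec and BinPackingUtils.

```scala
object OptimizedWriteSketch {
  // Defaults as described above.
  val numShuffleBlocks: Long = 50000000L   // optimizeWrite.numShuffleBlocks
  val maxShufflePartitions: Int = 2000     // optimizeWrite.maxShufflePartitions
  val binSize: Long = 512L * 1024 * 1024   // optimizeWrite.binSize (512 MiB, expressed in bytes here)

  // Step 1 (assumed): pick the shuffle partition (reducer) count so that
  // mappers * reducers stays under numShuffleBlocks, capped at maxShufflePartitions.
  def targetNumReducers(numMappers: Int): Int =
    math.min(math.max(1L, numShuffleBlocks / numMappers), maxShufflePartitions.toLong).toInt

  // Step 2 (assumed): sort shuffle block sizes ascending and greedily pack them into
  // bins of at most binSize bytes; each bin then becomes one output (write) task.
  def packIntoBins(blockSizes: Seq[Long]): Seq[Seq[Long]] = {
    val bins = Seq.newBuilder[Seq[Long]]
    var current = Vector.empty[Long]
    var currentBytes = 0L
    blockSizes.sorted.foreach { size =>
      if (current.nonEmpty && currentBytes + size > binSize) {
        bins += current          // close the current bin and start a new one
        current = Vector.empty[Long]
        currentBytes = 0L
      }
      current :+= size
      currentBytes += size
    }
    if (current.nonEmpty) bins += current
    bins.result()
  }
}
```

For instance, with ten 100 MiB blocks and the 512 MiB default, packIntoBins would produce two bins of five blocks each, i.e. two write tasks with roughly 500 MiB of shuffle input apiece.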

The whole optimized write feature can be controlled in the following ways, in precedence order from high to low (i.e. each option takes precedence over any successive ones):

  1. The optimizeWrite Delta option in DataFrameWriter (default=None), e.g. spark.range(0, 100).toDF().write.format("delta").option("optimizeWrite", "true").save(...);
  2. The spark.databricks.delta.optimizeWrite.enabled Spark session setting (default=None);
  3. The delta.autoOptimize.optimizeWrite table property (default=None).

Optimized write is DISABLED by default.
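
For reference, the three mechanisms might be exercised roughly as follows. This is a sketch that assumes a SparkSession named spark (as in the example above) and a hypothetical table path; the option, conf, and property names are the ones listed above.

```scala
// 1. Per-write DataFrameWriter option (highest precedence).
spark.range(0, 100).toDF()
  .write.format("delta")
  .option("optimizeWrite", "true")
  .save("/tmp/delta/events")   // hypothetical path

// 2. Spark session setting (applies to all Delta writes in the session).
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

// 3. Table property (lowest precedence; travels with the table).
spark.sql(
  "ALTER TABLE delta.`/tmp/delta/events` " +
  "SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')")
```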

Fixes #1158

How was this patch tested?

Unit tests: OptimizedWritesSuite and BinPackingUtilsSuite.

Does this PR introduce any user-facing changes?

Yes. Please see the Description for details.

@felipepessoto
Contributor

@weiluo-db there is a PR open waiting for review: #1198

@weiluo-db
Contributor Author

@weiluo-db there is a PR open waiting for review: #1198

@felipepessoto This PR's approach to optimized write is better tested in production environments.

@@ -228,7 +228,9 @@ abstract class MergeIntoCommandBase extends LeafRunnableCommand
txn: OptimisticTransaction,
outputDF: DataFrame): Seq[FileAction] = {
val partitionColumns = txn.metadata.partitionColumns
if (partitionColumns.nonEmpty && spark.conf.get(DeltaSQLConf.MERGE_REPARTITION_BEFORE_WRITE)) {
// If the write will be optimized write, which shuffles the data anyway, then don't repartition.
Collaborator

Can you mention in the comment that optimized write handles both cases: splitting a task when it's very large and combining tasks when they are very small?

Contributor Author

Updated the comment.

@@ -348,6 +354,7 @@ trait TransactionalWrite extends DeltaLogging { self: OptimisticTransactionImpl
def writeFiles(
inputData: Dataset[_],
writeOptions: Option[DeltaOptions],
isOptimize: Boolean,
Collaborator

Can you clarify why we need this new isOptimize flag now?

Contributor Author

Added some comments.

@@ -449,4 +462,27 @@ trait TransactionalWrite extends DeltaLogging { self: OptimisticTransactionImpl

resultFiles.toSeq ++ committer.changeFiles
}

/**
* Optimized writes can be enabled/disabled through the following order:
Collaborator

This makes sense. I guess the intention is to have an explicit way to turn it on/off per write/table/session, in that order.

Collaborator

Thanks for changing the precedence.

* @param partitionColumns The partition columns of the table. Used for hash partitioning the write
* @param deltaLog The DeltaLog for the table. Used for logging only
*/
case class DeltaOptimizedWriterExec(
Collaborator

When I inspect the plan, is this what will show up in the write node?

.doc("Maximum number of shuffle blocks to target for the adaptive shuffle " +
"in optimized writes.")
.intConf
.createWithDefault(50000000)
Collaborator

Is there a reason for choosing these defaults?

Contributor Author

As explained in the config doc, optimizeWrite.maxShufflePartitions must not be larger than spark.shuffle.minNumPartitionsToHighlyCompress, which is 2000 by default. optimizeWrite.numShuffleBlocks is then set high enough to produce a sufficient number of partitions (while still being limited by optimizeWrite.maxShufflePartitions).
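
For readers hitting the 2,000-reducer ceiling, here is a hedged sketch of what raising the cap could look like. The interplay is inferred from the explanation above and from the discussion later in this thread; the values are illustrative, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Core Spark shuffle setting: raise the HighlyCompressedMapStatus threshold first so
  // per-block size stats stay accurate at higher partition counts (set before the session
  // starts, e.g. here or via --conf at submit time). Illustrative value.
  .config("spark.shuffle.minNumPartitionsToHighlyCompress", "100000")
  // Delta setting: the optimized-write reducer cap, kept at or below the threshold above.
  .config("spark.databricks.delta.optimizeWrite.maxShufflePartitions", "20000")
  .getOrCreate()
```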

Collaborator

got it

Collaborator

@rahulsmahadev left a comment

Thanks for this contribution, left a few questions

@weiluo-db
Contributor Author

Thanks for this contribution, left a few questions

Thanks for the review. PTAL.

@weiluo-db changed the title from "[SPARK] Implement optimized write." to "[Spark] Implement optimized write." Oct 10, 2023
@rasidhan

Why is this effort being duplicated and released in such a hurry when there is an existing PR (#1198) that has been pending review for 16 months and has been on the roadmap (#1307) too?

@rasidhan mentioned this pull request Oct 10, 2023
// We want table properties to take precedence over the session/default conf.
DeltaConfigs.OPTIMIZE_WRITE
.fromMetaData(metadata)
.orElse(sessionConf.getConf(DeltaSQLConf.DELTA_OPTIMIZE_WRITE_ENABLED))
Contributor

When session config is set, shouldn't it have precedence over table properties?

Contributor

I didn't check Databricks; do you know what the behavior is?

Collaborator

makes sense

@tdas
Contributor

tdas commented Oct 16, 2023

Hello @rasidhan, apologies for the delay in responding to you. First and foremost, I would like to sincerely thank @sezruby for their contributions and dedication to the project. I absolutely recognize the time and energy that goes into making such contributions.

In evaluating #1198, we did review the logic and we did consider various technical factors to contrast it against our implementation. We have been evaluating our implementation of optimized write in various production workloads for many months. So while the proposed implementation in this PR may seem sudden, this implementation has been battle-tested under a wide variety of data scales and cluster configurations.

I can assure you this does not diminish the value of the work you've done. The Delta project is genuinely grateful for your understanding and we hope to better our communication and collaboration processes in the future to prevent such instances.

assert(BinPackingUtils.binPackBySize(input, (x: Int) => x, (x: Int) => x, binSize) == expect)
}
}

Collaborator

nit: remove line

writeOptions.flatMap(_.optimizeWrite)
.getOrElse(TransactionalWrite.shouldOptimizeWrite(metadata, sessionConf))
}

Collaborator

nit: remove line

private def getShuffleRDD: ShuffledRowRDD = {
if (cachedShuffleRDD == null) {
val resolver = org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution
val saltedPartitioning = HashPartitioning(
Collaborator

Is there a reason for choosing HashPartitioning over other schemes?

Contributor Author

This matches the table's partitioning. For unpartitioned tables, it shouldn't really matter that much, but HashPartitioning should be more robust to certain failure cases, e.g. https://issues.apache.org/jira/browse/SPARK-38388.
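
For intuition, the effect is roughly what Dataset.repartition with column arguments does at the DataFrame level. This is an analogy only, not the PR's code path, which builds the HashPartitioning expression directly (and, judging by the saltedPartitioning name above, adds a salt on top); the column name and data below are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("hash-partitioning-analogy")
  .master("local[*]")   // local master just for the illustration
  .getOrCreate()

// Hypothetical table partitioned by "date": rows with the same date value hash to the
// same shuffle partition, so each reducer writes whole partition values together.
val df = spark.range(0, 1000).withColumn("date", (col("id") % 7).cast("string"))
val shuffled = df.repartition(col("date"))
```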

Collaborator

@rahulsmahadev left a comment

Left a few more comments.

@rahulsmahadev
Collaborator

Can you retrigger the CI? Looks like a flaky test runner.

@felipepessoto
Contributor

@rahulsmahadev do you know how to retrigger CI? I'm asking because I have another PR open with the same test issues. The errors don't seem flaky though; they seem to be failing consistently.

@rahulsmahadev
Collaborator

@felipepessoto I usually just create an empty commit: git commit --allow-empty. cc: @scottsand-db if you are aware of a better way to retrigger CI.

Collaborator

@rahulsmahadev left a comment

LGTM!

@tdas added this to the 3.1.0 milestone Nov 1, 2023
@tdas closed this in fcfd440 Nov 14, 2023
@Kimahriman
Contributor

How is this supposed to work for large writes (large in this case being several TiB or more)? I'm seeing very skewed partitions when writing data to test this out. First, my reducer tasks seem to be capped at maxShufflePartitions, even though it looked like that config is just used to determine how many reducers to pretend will exist during the map phase, with computeBins then combining individual map outputs from there (which could end up being more or fewer than the initial pretend reducer count). Is that not how this works?

When I increase the maxShufflePartitions (and highly compressed shuffle config) large enough, it seems to combine some map outputs, but I still get incredibly skewed reducing tasks, reading anywhere from < 100 MiB of data up to several GiB of the shuffle data.

When I do the exact same job with #1198 (which I've been using in production for over a year now), I get much more evenly distributed reducer tasks (reading ~750 MiB of shuffle data; 750 because of the parquet compression ratio it assumes).

@weiluo-db
Contributor Author

@Kimahriman The number of reducers is controlled by both numShuffleBlocks and maxShufflePartitions, as you can see here: https://github.com/delta-io/delta/pull/2145/files#diff-90671ab8774fd49ade989674814a12d06081192e33e559c955f4c26e6733a221R73-R78.

How many mappers do you have in your test? When you run the same job with #1198, what bin size do you use, and how many reducers do you get?

It'll be super helpful if you could also paste the full metrics from the DeltaOptimizedWriterExec node here. Thanks!

@Kimahriman
Contributor

The number of a reducers is controlled by both numShuffleBlocks and maxShufflePartitions

My understanding is that the numShuffleBlocks is like the total mappers * total reducers. So if you have say 1k mappers, you'll get min(numShuffleBlocks / 1k, maxShufflePartitions) reducers, which with default settings will also use maxShufflePartitions until you have more than 25k mappers. And the total mappers * total reducers is effectively the chunks that are then combined into "bins", so if you have 1k mappers and 2k reducers, you have 2 million shuffle blocks to work with to combine into similarly sized bins. Is that all correct?

How many mappers do you have in your test? When you run the same job with #1198, what bin size do you use, and how many reducers do you get?

There were ~4,500 mappers writing roughly 2.3 TiB of shuffle data. With the default settings in this PR, I ended up with exactly 2,000 reducing tasks. I tried increasing maxShufflePartitions to 10k and ended up with ~9,870 reducers, some of which were reading 3+ GiB of shuffle data. The resulting data is pretty skewed in terms of output partitions.
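
For what it's worth, plugging those numbers into the min(numShuffleBlocks / mappers, maxShufflePartitions) formula from the previous comment reproduces the observed counts (a back-of-the-envelope check against the thread's own numbers, not a statement about the actual implementation):

```scala
// Defaults: numShuffleBlocks = 50,000,000; maxShufflePartitions = 2,000; ~4,500 mappers.
val withDefaults = math.min(50000000L / 4500, 2000L)    // = 2000, matching the 2,000 tasks observed
// With maxShufflePartitions raised to 10,000:
val withRaisedCap = math.min(50000000L / 4500, 10000L)  // = 10000, close to the ~9,870 reducers seen
```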

For the other PR, we also have binSize set to 512 MiB. The same job resulted in ~20k reducing tasks. The one difference here is that we actually have the spark.sql.shuffle.partitions set to 20k, which is what is used as the equivalent maxShufflePartitions. I didn't try setting maxShufflePartitions to 20k to match to see how they compared. But with 4500 mappers I still would assume it could be split pretty evenly (4500 * 2000 = 9 million shuffle blocks to work with right?). We did end up with a lot of small partitions, but the largest reducers didn't read more than 1 GiB of shuffle data.

It'll be super helpful if you could also paste the full metrics from the DeltaOptimizedWriterExec node here. Thanks!

Don't have those right now unfortunately (and probably can't get them until next week). Was going to try adding some logging to understand what was going on too.

@weiluo-db
Contributor Author

With the default settings in this PR, I ended up with exactly 2000 reducing tasks.

BTW, did you ever try running with the default 2k reducers? If so, how skewed was it compared to the other runs? I can only speculate that we somehow didn't get accurate stats about the shuffle blocks (e.g. HighlyCompressedMapStatus). Perhaps try raising the reducers to 20k, which could allow a more leveled comparison.

Any additional metrics and logs would definitely help. Thanks!

@Kimahriman
Contributor

BTW, did you ever try running with the default 2k reducers? If so, how skewed was it compared to the other runs? I can only speculate that we somehow didn't get accurate stats about the shuffle blocks (e.g. HighlyCompressedMapStatus). Perhaps try raising the reducers to 20k, which could allow a more leveled comparison.

Yeah the first attempt was with the default 2k reducers and I got 2k reducing tasks exactly, which is odd because I expected the partitions to get split due to the large size, but that didn't seem to happen at all. I have the HighlyCompressedMapStatus config set to 100k to make sure it doesn't become an issue (realized that when initially testing out the other PR way back when).

Any additional metrics and logs would definitely help. Thanks!

I should get a chance to do some more testing tomorrow (and added a few more log statements as well).

Kimahriman pushed a commit to Kimahriman/delta that referenced this pull request Nov 21, 2023
@Kimahriman
Contributor

Figured out what the issue was. The original PR treats binSize as a byte value, likely based on the Databricks docs that treat "targetFileSize" as a byte value: https://docs.databricks.com/en/delta/tune-file-size.html#set-a-target-file-size. So I had my bin size set to 512 * 1024 * 1024, which is being interpreted as that many megabytes, not bytes. I think most Spark settings are interpreted as bytes by default; only a few things aren't. Removing that setting fixed things for me.
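
Based on this, a quick illustration of the difference, assuming (per the discussion above) that this PR reads the binSize value as a number of megabytes rather than bytes:

```scala
// Mistake: 536870912 is read as that many megabytes (roughly 512 TiB per bin), not bytes.
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", (512L * 1024 * 1024).toString)

// Intended: the value is already in megabytes, so 512 corresponds to the 512 MiB default.
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "512")
```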

@felipepessoto
Contributor

felipepessoto commented Nov 21, 2023

Should we change the config to use bytes as the unit instead of MB?

@weiluo-db mentioned this pull request Jan 9, 2024
vkorukanti added a commit that referenced this pull request Jan 16, 2024
Optimized write feature was added by #2145. This PR adds the corresponding documentation for the feature.

Co-authored-by: Venki Korukanti <venki.korukanti@gmail.com>
vkorukanti added a commit to vkorukanti/delta that referenced this pull request Jan 18, 2024
(Cherry-pick of 494f2b2 to branch-3.1)

Optimized write feature was added by delta-io#2145. This PR adds the corresponding documentation for the feature.

Co-authored-by: Venki Korukanti <venki.korukanti@gmail.com>
vkorukanti added a commit that referenced this pull request Jan 18, 2024
(Cherry-pick of 494f2b2 to branch-3.1)

Optimized write feature was added by #2145. This PR adds the corresponding documentation for the feature.

Co-authored-by: Venki Korukanti <venki.korukanti@gmail.com>