
Spark 3.4: Multiple shuffle partitions per file in compaction #7897

Merged
merged 4 commits into apache:master on Jun 27, 2023

Conversation

aokolnychyi
Contributor

This PR adds a new compaction option called shuffle-partitions-per-file for shuffle-based file rewriters.

By default, our shuffling file rewriters assume each shuffle partition will become a separate output file. Attempting to generate large output files of 512 MB or larger may strain the memory resources of the cluster, as such rewrites require lots of Spark memory. This parameter can be used to further subdivide the data that ends up in a single file. For example, if the target file size is 2 GB but the cluster can only handle shuffles of 512 MB, this parameter could be set to 4. Iceberg will use a custom coalesce operation to stitch these sorted partitions back together into a single sorted file.
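
For illustration, here is a minimal sketch of how this option might be passed through the rewrite_data_files procedure, mirroring the 2 GB target / 512 MB shuffle example above. The catalog, table, and sort column names are placeholders, and the option requires the Iceberg Spark session extensions to be enabled.

// Hypothetical usage sketch: target 2 GB output files while shuffling in
// ~512 MB partitions, stitching 4 sorted shuffle partitions into each file.
spark.sql(
  """CALL my_catalog.system.rewrite_data_files(
    |  table => 'db.events',
    |  strategy => 'sort',
    |  sort_order => 'event_ts ASC',
    |  options => map(
    |    'target-file-size-bytes', '2147483648',
    |    'shuffle-partitions-per-file', '4'))""".stripMargin)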

@github-actions github-actions bot added the spark label Jun 24, 2023
@@ -59,7 +61,24 @@ abstract class SparkShufflingDataRewriter extends SparkSizeBasedDataRewriter {

public static final double COMPRESSION_FACTOR_DEFAULT = 1.0;

/**
Contributor Author

I tested the current implementation on a table with 1 TB of data and a cluster of 16 GB executors with 7 cores each. The target file size is 1 GB (zstd Parquet data). Without this option, the sort-based rewrite spilled and failed; I lost all executors one by one. With 8 shuffle partitions per file, the operation succeeded without any failures and produced properly sized files.

import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
import org.apache.spark.sql.catalyst.plans.physical.UnknownPartitioning

case class OrderAwareCoalesceExec(
Contributor Author

Inspired by CoalesceExec in Spark.
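
For context, a rough sketch of the coalescing idea (not necessarily the exact code in this PR): Spark's public PartitionCoalescer API can bin adjacent shuffle partitions into groups, so the coalesced RDD reads them back in their original, sorted order. The class name below is made up for illustration.

import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// Illustrative order-aware coalescer: every `groupSize` consecutive parent
// partitions are binned into one group, so each coalesced partition (and thus
// each output file) reads its inputs back in order and stays sorted.
class AdjacentPartitionCoalescer(groupSize: Int) extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    parent.partitions.grouped(groupSize).map { bin =>
      val group = new PartitionGroup()
      group.partitions ++= bin
      group
    }.toArray
  }
}

Presumably the exec node then invokes RDD.coalesce(numPartitions, shuffle = false, Some(coalescer)) with such a coalescer, since the default coalescer groups partitions by locality and would not guarantee that adjacent sorted partitions end up together.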

@aokolnychyi
Contributor Author

cc @szehon-ho @flyrain @RussellSpitzer @singhpk234 @amogh-jahagirdar @rdblue

@singhpk234 (Contributor) left a comment

LGTM, thanks @aokolnychyi. This is a really nice addition to compaction!

Wondering if you are planning to update the docs with this new procedure param in a separate PR: https://iceberg.apache.org/docs/latest/spark-procedures/#usage-7


List<Object[]> output =
sql(
"CALL %s.system.rewrite_data_files("
Contributor

Should we also assert that OrderAwareCoalesceExec is inserted by inspecting the plan?

Contributor Author

It is a bit tricky in this case, as the resulting plan would be CallExec; we don't have an easy way to inspect the plan triggered from the procedure. I did check manually, though.

@aokolnychyi
Contributor Author

@singhpk234, I was originally planning to update the doc in a separate PR.

@szehon-ho (Collaborator) left a comment

Looks good to me. This works for sorted data, because we always use range partitioning for sort, right?

@@ -225,6 +249,43 @@ public void testRewriteDataFilesWithZOrder() {
assertEquals("Should have expected rows", expectedRows, sql("SELECT * FROM %s", tableName));
}

@Test
@szehon-ho (Collaborator) Jun 26, 2023

This is nice, but did we also add a test that asserts the sort order is preserved within a partition? (e.g., a small partition, and just assert that the file is in order)

Contributor Author

There is a check below for the order of records. I just added a similar one for the regular sort, so we verify the order of records is correct both in regular sorts and in z-ordering.

/**
* The number of shuffle partitions to use for each output file. By default, this file rewriter
* assumes each shuffle partition would become a separate output file. Attempting to generate
* large output files of 512 MB and more may strain the memory resources of the cluster as such
Collaborator

and more => or higher

Contributor Author

Fixed.

*
* <p>Note using this parameter requires enabling Iceberg Spark session extensions.
*/
public static final String SHUFFLE_PARTITIONS_PER_FILE = "shuffle-partitions-per-file";
Collaborator

Not to block this change, but did we consider having a shuffle-threshold? I.e., if some partitions have 2 GB but others are well below 512 MB, there is no need to shuffle the smaller ones?

Contributor Author

You mean like switching to a local sort if the size of the data to compact is small?

@szehon-ho (Collaborator) Jun 26, 2023

I was just wondering about the use case where we set shuffle-partitions-per-file to 4 because we want 2 GB files but can only shuffle 512 MB. However, consider an Iceberg partition (rewrite group) that has only 512 MB of files during this rewrite. Will we still shuffle to four partitions in this case and coalesce at the end, unnecessarily? I may be missing something.

Contributor Author

It should still be fine to apply this optimization, as there is no extra cost. I achieved the best results with 128 MB shuffle blocks, so it should be fairly safe to assume the operation will complete fine.

Collaborator

I see, but would there be issues in contending for pods? Also, wouldn't it make more sense to have 128 MB as a conf (shuffle-threshold)? Otherwise it's always a bit dynamic, depending on the max partition size. Not sure if there are other issues with this approach.

Contributor Author

To be honest, I have never seen issues with this approach in any of our prod jobs in the last few years. Not applying this split if the size of the job is less than 128 MB could be a valid step, but it would require quite a few changes to pass more info around. I'd probably skip it for now until we experience any issues.

Collaborator

Sure, we can do it later then if there's a need.

@aokolnychyi
Contributor Author

I just realized we don't provide a comprehensive list of supported options in the docs. I have been meaning to improve our docs for a while, so I'll add this config then.

@aokolnychyi aokolnychyi merged commit d98e7a1 into apache:master Jun 27, 2023
31 checks passed
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024