
Spark 3.5: Increase default advisory partition size for writes #8660

Merged

Conversation

@aokolnychyi aokolnychyi commented Sep 26, 2023

This PR increases the default advisory partition size for writes and allows users to control it explicitly.

By default, we will try to write 128 MB data files and 32 MB position delete files. In the future, we may make this configurable but it should be a reasonable start, especially given that users can set this manually now.
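For illustration, here is a minimal sketch of overriding that default from a Spark session, assuming the SQL config key introduced later in this PR (spark.sql.iceberg.advisory-partition-size); the app name and the 256 MB value are only examples:

import org.apache.spark.sql.SparkSession;

public class AdvisorySizeOverrideExample {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().appName("advisory-size-example").master("local[*]").getOrCreate();

    // Ask Iceberg writes to target ~256 MB shuffle partitions instead of the 128 MB default.
    spark.conf().set("spark.sql.iceberg.advisory-partition-size", String.valueOf(256L * 1024 * 1024));

    spark.stop();
  }
}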

withSQLConf(
    ImmutableMap.of(
        SQLConf.SHUFFLE_PARTITIONS().key(), "200",
Contributor Author:

This isn't me; it's Spotless. I only changed the last line.

    return value != null ? value.toLowerCase(Locale.ROOT) : null;
  }

  private static Map<Pair<String, String>, Double> initColumnarCompressions() {
Contributor Author:

We should not expect these values to be precise, but they should be reasonable. I tested some of them on the cluster and some locally. It boils down to what kind of encoding we can apply to the incoming data, which we can't predict, unfortunately. We should be able to learn that adaptively in future PRs.

@@ -106,7 +106,7 @@ public int parse() {
  }

  public Integer parseOptional() {
-    return parse(Integer::parseInt, null);
+    return parse(Integer::parseInt, defaultValue);
Contributor Author:

If a default value was provided explicitly, it should be used.

Contributor:

Why not throw an exception? Seems like these cases are mutually exclusive.

Contributor:

Oh, I see. You can have a null default for objects like String. I think that's fine for String, but when the defaultValue is set as a primitive long or int, I don't think we need to support parseOptional and should require parse instead.

Contributor Author:

Yeah, we can probably do this for String only. There are table properties with explicit NULL default values.

Contributor Author:

Well, parseOptional does produce Integer, so if we have a default value that is null, it may still apply.

public static final String PARQUET_COMPRESSION_LEVEL_DEFAULT = null;

We can hit such cases for ints as well, I assume.
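A generic sketch of the behaviour under discussion (illustrative code, not the actual Iceberg ConfParser API): because the result is boxed, an explicitly configured null default can flow through parseOptional instead of being replaced by a hard-coded null.

import java.util.function.Function;

class OptionalParseSketch {

  // Mirrors the idea above: honour an explicitly provided default instead of always using null.
  static <T> T parseOptional(String value, Function<String, T> conversion, T defaultValue) {
    return value != null ? conversion.apply(value) : defaultValue;
  }

  public static void main(String[] args) {
    Integer explicitDefault = parseOptional(null, Integer::parseInt, 5);  // -> 5
    Integer nullDefault = parseOptional(null, Integer::parseInt, null);   // -> null, still valid
    System.out.println(explicitDefault + " " + nullDefault);
  }
}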

@@ -64,4 +64,7 @@ private SparkSQLProperties() {}

  // Overrides the delete planning mode
  public static final String DELETE_PLANNING_MODE = "spark.sql.iceberg.delete-planning-mode";

  // Overrides the advisory partition size
  public static final String ADVISORY_PARTITION_SIZE = "spark.sql.iceberg.advisory-partition-size";
Contributor Author:

Question: do we need the write prefix, or should we make it part of the name? The config only affects the final write.

Contributor:

Shouldn't this be a write.spark.advisory-partition-size table property? I wouldn't want to set this in the Spark context.

Contributor:

Nevermind, I see this is a Spark default that is overridden by the table property.

Contributor Author:

Correct, there is a table property, a SQL config and a write option.
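To make the per-write path concrete, a hedged sketch using the DataFrameWriterV2 API; the option name "advisory-partition-size" and the table name are assumptions for illustration, since this thread does not spell out the exact write option or table property keys:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AdvisorySizeWriteOptionExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark =
        SparkSession.builder().appName("advisory-size-write-option").master("local[*]").getOrCreate();

    Dataset<Row> df = spark.range(1000).toDF("id");

    // Per-write override; the table-level and session-level settings would apply when this is absent.
    df.writeTo("db.table")
        .option("advisory-partition-size", String.valueOf(64L * 1024 * 1024))
        .append();

    spark.stop();
  }
}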

@@ -75,6 +76,10 @@ public class SparkWriteConf {

  private static final Logger LOG = LoggerFactory.getLogger(SparkWriteConf.class);

  private static final long DATA_FILE_SIZE = 128 * 1024 * 1024; // 128 MB
Contributor Author:

These are some default values we aim for. They should be safe without affecting the parallelism too much. We can make this configurable in the future.

@@ -484,42 +525,24 @@ private Map<String, String> deleteWriteProperties() {

    switch (deleteFormat) {
      case PARQUET:
        setWritePropertyWithFallback(
Contributor Author:

Instead of using setWritePropertyWithFallback, I added defaults to our delete configs. That made my life easier: I needed to know the delete output codec, and I no longer have to check nulls and overrides explicitly.

There are tests in TestCompressionSettings and TestSparkWriteConf to verify the new logic.

@rdblue (Contributor) commented Sep 27, 2023:

This seems fine to me, but was it related to the advisory partition size changes? It seems like it doesn't overlap. Maybe it's just because we're predicting the compression ratio based on the codec?

Contributor Author:

Correct, it was needed to compute the codec.

@@ -47,4 +51,8 @@ public SortOrder[] ordering() {
  public boolean hasOrdering() {
    return ordering.length != 0;
  }

  public long advisoryPartitionSize() {
    return distribution instanceof UnspecifiedDistribution ? 0 : advisoryPartitionSize;
Contributor Author:

Spark will complain if we request a value > 0 with UnspecifiedDistribution.

Contributor:

This would be a good thing to capture in a comment.

Contributor Author:

Will add.

@@ -132,7 +132,9 @@ class SparkPositionDeltaWrite implements DeltaWrite, RequiresDistributionAndOrde

  @Override
  public Distribution requiredDistribution() {
-    return writeRequirements.distribution();
+    Distribution distribution = writeRequirements.distribution();
Contributor Author:

Both the Distribution and SortOrder implementations in Spark provide reasonable toString() output.

23/09/26 17:00:55 INFO SparkWrite: Requesting 402653184 bytes advisory partition size for table testhive.default.table
23/09/26 17:00:55 INFO SparkWrite: Requesting ClusteredDistribution(bucket(8, c3)) as write distribution for table testhive.default.table
23/09/26 17:00:55 INFO SparkWrite: Requesting [bucket(8, c3) ASC NULLS FIRST, id ASC NULLS FIRST] as write ordering for table testhive.default.table


class SparkCompressionUtil {

  private static final String SHUFFLE_COMPRESSION_ENABLED = "spark.shuffle.compress";
Contributor Author:

These properties are internal to Spark. It requires some ugly code to get them:

org.apache.spark.internal.config.package$.MODULE$.SHUFFLE_COMPRESS().defaultValueString()

Contributor Author:

It is doable, but I am not sure it is worth the effort.

Contributor:

I'm fine with duplication. Let's just make sure that we have a comment for the block of settings that states that they come from Spark.

Contributor Author:

I'll add a comment. There is also a test that verifies the Spark default values are as we expect.
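Something along these lines, as a minimal sketch of the duplicated block with the requested comment (constant names are illustrative; the keys and defaults mirror Spark's spark.shuffle.compress and spark.io.compression.codec settings):

class SparkShuffleDefaults {
  // NOTE: these keys and defaults are duplicated from Spark's internal config definitions
  // (org.apache.spark.internal.config), which are not public API. A test should verify
  // that Spark's actual defaults still match these values.
  static final String SHUFFLE_COMPRESSION_ENABLED = "spark.shuffle.compress";
  static final boolean SHUFFLE_COMPRESSION_ENABLED_DEFAULT = true;

  static final String SHUFFLE_COMPRESSION_CODEC = "spark.io.compression.codec";
  static final String SHUFFLE_COMPRESSION_CODEC_DEFAULT = "lz4";
}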

  private static Map<Pair<String, String>, Double> initColumnarCompressions() {
    Map<Pair<String, String>, Double> compressions = Maps.newHashMap();

    compressions.put(Pair.of("none", "zstd"), 4.0);
Contributor Author:

These values (none, zstd, lz4, etc.) should probably become constants.

Contributor Author:

Another way to implement this is to define some mappings for codecs + some ratio for the format.
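For example, a rough sketch of that alternative (hypothetical names; the ratios are the ones quoted in this thread and the 1.0 fallback is an assumption): look up a ratio for the (shuffle codec, file codec) pair and scale the desired on-disk file size up to an expected shuffle size.

import java.util.HashMap;
import java.util.Map;

class CompressionRatioSketch {

  // Estimated shrinkage of shuffle data once written as a compressed columnar file,
  // keyed by "<shuffleCodec>:<fileCodec>".
  private static final Map<String, Double> COLUMNAR_RATIOS = new HashMap<>();

  static {
    COLUMNAR_RATIOS.put("none:zstd", 4.0);
    COLUMNAR_RATIOS.put("none:gzip", 2.0);
    COLUMNAR_RATIOS.put("lz4:zstd", 1.5);
    COLUMNAR_RATIOS.put("lz4:gzip", 1.5);
  }

  // Scale the desired on-disk file size by the estimated ratio so that a shuffle partition
  // of this advisory size compresses down to roughly one file of the requested size.
  static long advisoryPartitionSize(long expectedFileSize, String shuffleCodec, String fileCodec) {
    double ratio = COLUMNAR_RATIOS.getOrDefault(shuffleCodec + ":" + fileCodec, 1.0);
    return (long) (expectedFileSize * ratio);
  }

  public static void main(String[] args) {
    // Aiming for 128 MB zstd files with uncompressed shuffle data -> ~512 MB advisory size.
    System.out.println(advisoryPartitionSize(128L * 1024 * 1024, "none", "zstd"));
  }
}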

  }

  private long advisoryPartitionSize(long defaultValue) {
    return confParser
Contributor:

I think this is okay, although since the default differs between data and deletes, it seems strange not to allow setting this specifically for deletes. The write option makes sense, but the table and session properties would likely differ.

Contributor Author:

Yeah, I was thinking about this too. The problem there is that copy-on-write DELETE still uses the data config. If we were to offer such a property, would it mean it is supported for copy-on-write as well? Any thoughts on the potential name?

Contributor:

I think we typically use a variation that adds merge, update, or delete somewhere. I don't feel strongly so let's go without this for now. We can add more settings later.

  }

  private long advisoryPartitionSize(
      long targetFileSize, FileFormat outputFileFormat, String outputCodec) {
@rdblue (Contributor) commented Sep 27, 2023:

It's a little odd to call this targetFileSize when it isn't the table's target file size. I don't have a much better name though. Maybe a comment here to clarify?

Contributor Author:

We can call it expectedFileSize or similar?

    compressions.put(Pair.of("none", "gzip"), 2.0);

    compressions.put(Pair.of("lz4", "zstd"), 1.5);
    compressions.put(Pair.of("lz4", "gzip"), 1.5);
Contributor:

Why no zstd or snappy shuffle compression options? Just assume that this will use the default?

Contributor Author:

I should probably add these too, I missed them when I added values for ORC and Parquet.

import org.apache.spark.sql.connector.expressions.SortOrder;

/** A set of requirements such as distribution and ordering reported to Spark during writes. */
public class SparkWriteRequirements {

  public static final SparkWriteRequirements EMPTY =
-      new SparkWriteRequirements(Distributions.unspecified(), new SortOrder[0]);
+      new SparkWriteRequirements(Distributions.unspecified(), new SortOrder[0], 0);
Contributor:

Is 0 a special signal to Spark that this is not requesting an advisory size?

Contributor Author:

Yes, it matches the default value in RequiresDistributionAndOrdering and means no preference.

/**
 * Returns the advisory (not guaranteed) shuffle partition size in bytes for this write.
 * <p>
 * Implementations may override this to indicate the preferable partition size in shuffles
 * performed to satisfy the requested distribution. Note that Spark doesn't support setting
 * the advisory partition size for {@link UnspecifiedDistribution}, the query will fail if
 * the advisory partition size is set but the distribution is unspecified. Data sources may
 * either request a particular number of partitions via {@link #requiredNumPartitions()} or
 * a preferred partition size, not both.
 * <p>
 * Data sources should be careful with large advisory sizes as it will impact the writing
 * parallelism and may degrade the overall job performance.
 * <p>
 * Note this value only acts like a guidance and Spark does not guarantee the actual and advisory
 * shuffle partition sizes will match. Ignored if the adaptive execution is disabled.
 *
 * @return the advisory partition size, any value less than 1 means no preference.
 */
default long advisoryPartitionSizeInBytes() { return 0; }
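For reference, a minimal sketch (not Iceberg's SparkWrite) of a Write implementing this interface, consistent with the guard discussed above: zero means no preference, and a positive value may only be combined with a concrete distribution.

import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.SortOrder;
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

class ExampleWrite implements RequiresDistributionAndOrdering {

  @Override
  public Distribution requiredDistribution() {
    // With an unspecified distribution, the advisory size below must stay 0,
    // otherwise Spark fails the query.
    return Distributions.unspecified();
  }

  @Override
  public SortOrder[] requiredOrdering() {
    return new SortOrder[0];
  }

  @Override
  public long advisoryPartitionSizeInBytes() {
    return 0; // any value less than 1 means no preference
  }
}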

@rdblue (Contributor) left a comment:

+1 overall, with a few minor comments.

@danielcweeks (Contributor) left a comment:

I'm +1 on this and spoke to Ryan about it. We need to make sure to note that you can disable this by setting the advisory size to zero, as a fallback for anyone who experiences undesirable sizes/performance.
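As a follow-up to that note, a short sketch of the fallback, assuming an active SparkSession and the SQL config key from this PR; zero maps to the "no preference" value described above:

import org.apache.spark.sql.SparkSession;

class DisableAdvisorySizeExample {
  static void disableForSession() {
    // Revert to the previous behaviour: no advisory partition size is requested for Iceberg writes.
    SparkSession.active().conf().set("spark.sql.iceberg.advisory-partition-size", "0");
  }
}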

@aokolnychyi aokolnychyi merged commit 0b1b624 into apache:master Sep 27, 2023
47 checks passed
@aokolnychyi
Thank you, @rdblue @danielcweeks!
