Spark 3.3: Change default distribution modes #6828

aokolnychyi · 2023-02-13T20:00:32Z

This PR changes the default distribution modes in Spark 3.3.

- Default distribution mode for partitioned but unsorted tables in INSERT is HASH (instead of NONE).
- Default distribution mode for partitioned but unsorted tables in copy-on-write (CoW) MERGE is HASH (instead of NONE).
- Default distribution mode for partitioned and sorted tables in CoW MERGE is HASH (instead of RANGE).
- Default distribution mode for merge-on-read (MoR) MERGE is always HASH (instead of relying on the write distribution mode).
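
A rough sketch of the defaults listed above, in plain Java. The class, enum, and method names are hypothetical and this is not the code in SparkWriteConf. Two branches are assumptions that this description does not state: RANGE for sorted tables on INSERT, and the fallback to the INSERT default for unpartitioned tables in CoW MERGE.

```java
public class DefaultDistributionModes {
  enum Mode { NONE, HASH, RANGE }

  // INSERT default. The sorted -> RANGE branch is an assumption;
  // the PR description only states the partitioned-but-unsorted case.
  static Mode insertDefault(boolean partitioned, boolean sorted) {
    if (sorted) {
      return Mode.RANGE;
    } else if (partitioned) {
      return Mode.HASH;
    } else {
      return Mode.NONE;
    }
  }

  // CoW MERGE default: HASH for any partitioned table, sorted or not.
  // Falling back to the INSERT default for unpartitioned tables is an assumption.
  static Mode copyOnWriteMergeDefault(boolean partitioned, boolean sorted) {
    if (partitioned) {
      return Mode.HASH;
    }
    return insertDefault(false, sorted);
  }

  // MoR MERGE default: always HASH, regardless of the table layout.
  static Mode mergeOnReadMergeDefault() {
    return Mode.HASH;
  }

  public static void main(String[] args) {
    System.out.println(insertDefault(true, false));          // partitioned, unsorted
    System.out.println(copyOnWriteMergeDefault(true, true)); // partitioned, sorted
    System.out.println(mergeOnReadMergeDefault());
  }
}
```

These are only the defaults; the sketch does not model any explicit table property or write option that overrides them.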

aokolnychyi · 2023-02-13T20:37:25Z

cc @RussellSpitzer @rdblue @jackye1995 @dramaticlly

RussellSpitzer · 2023-02-13T20:45:47Z

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java

+    } else if (table.spec().isPartitioned()) {
+      return HASH;
+    } else {
+      return NONE;


If we keep this as hash, we will avoid small files even with unpartitioned tables by forcing the rebalance, right?

We have to request a valid Distribution in order for AQE to do its job. If the table is unpartitioned, I don't think we have anything to cluster by.

This is purely for inserts, though. I should probably call it defaultInsertDistributionMode() or something.

I would think we could just cluster by all columns

I am not sure how that would perform, to be honest. What about exploring this separately? Maybe we can also handle this natively in Spark.

> What about exploring this separately?

+1, the benefit of clustering by all columns is not so straightforward. We can first have this generic default and then make more changes for the unpartitioned case.
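
The small-files concern raised in this thread can be illustrated with a toy model in plain Java. It assumes that each write task produces one file per partition value it receives, and it compares round-robin assignment of rows to tasks (standing in for NONE) with assignment by a hash of the partition key (standing in for HASH). The class and method names are hypothetical, and nothing here uses Spark or Iceberg APIs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SmallFilesSketch {
  // Counts output files under the simplifying assumption that every task
  // writes exactly one file per distinct partition key it receives.
  static int countFiles(List<String> partitionKeys, int numTasks, boolean hashDistribute) {
    Set<String> files = new HashSet<>();
    for (int row = 0; row < partitionKeys.size(); row++) {
      String key = partitionKeys.get(row);
      // HASH: all rows with the same key go to the same task.
      // Otherwise: rows are spread round-robin, ignoring the key.
      int task = hashDistribute
          ? Math.floorMod(key.hashCode(), numTasks)
          : row % numTasks;
      files.add(task + "/" + key);
    }
    return files.size();
  }

  public static void main(String[] args) {
    List<String> keys =
        List.of("a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c");
    System.out.println(countFiles(keys, 4, false)); // 12: every task sees every key
    System.out.println(countFiles(keys, 4, true));  // 3: one file per key
  }
}
```

With an unpartitioned table there is no partition key to hash on, which is the gap the thread leaves for separate follow-up.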

aokolnychyi · 2023-02-14T05:59:23Z

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestPartitionValues.java

@@ -491,6 +491,7 @@ public void testReadPartitionColumn() throws Exception {
            .option(SparkReadOptions.VECTORIZATION_ENABLED, String.valueOf(vectorized))
            .load(baseLocation)
            .select("struct.innerName")
+            .orderBy("struct.innerName")


Required by the check below as the default distribution changes the order of elements.
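
The reason for the added orderBy can be shown in isolation: once a shuffle changes the row order, a positional comparison of expected and actual values fails even when the contents match. The plain-Java sketch below uses hypothetical names and is not the test code from TestPartitionValues.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OrderInsensitiveCheck {
  // Returns a sorted copy so that two result sets can be compared
  // without depending on the order in which rows were produced.
  static <T extends Comparable<T>> List<T> sorted(List<T> values) {
    List<T> copy = new ArrayList<>(values);
    Collections.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    List<String> expected = List.of("a", "b", "c");
    List<String> actual = List.of("c", "a", "b"); // same rows, different order

    System.out.println(expected.equals(actual));                 // false
    System.out.println(sorted(expected).equals(sorted(actual))); // true
  }
}
```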

jackye1995

Looks good to me!

aokolnychyi · 2023-02-16T05:10:30Z

Thanks for reviewing, @RussellSpitzer @jackye1995 @dramaticlly!
Let me cherry-pick this to 3.2.

github-actions bot added the spark label Feb 13, 2023

RussellSpitzer reviewed Feb 13, 2023

jackye1995 added this to In progress in [Release] Iceberg 1.2 via automation Feb 13, 2023

aokolnychyi commented Feb 14, 2023

jackye1995 approved these changes Feb 14, 2023

[Release] Iceberg 1.2 automation moved this from In progress to Reviewer approved Feb 14, 2023

dramaticlly approved these changes Feb 14, 2023

Spark 3.3: Change default distribution modes

fa7965c

aokolnychyi force-pushed the change-default-distribution-modes branch from e3f9394 to fa7965c Compare February 16, 2023 03:06

aokolnychyi mentioned this pull request Feb 16, 2023

Spark 3.3: Add a new Spark SQLConf to influence the write distribution mode #6838

Merged

aokolnychyi merged commit a6ad1d1 into apache:master Feb 16, 2023

[Release] Iceberg 1.2 automation moved this from Reviewer approved to Done Feb 16, 2023

aokolnychyi mentioned this pull request Feb 17, 2023

Spark 3.2: Change default distribution modes #6877

Merged

jackye1995 added this to the Iceberg 1.2.0 milestone Feb 21, 2023

hililiwei mentioned this pull request Mar 11, 2023

Flink 1.16: Change distribution modes #7077

Open

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023

Spark 3.3: Change default distribution modes (apache#6828)

f3b7101
