
Spark 3.4: Remove no longer needed write extensions #7443

Merged

Conversation

aokolnychyi (Contributor)

This PR removes write extensions that are no longer needed. Notable changes:

  • Switches to the function catalog instead of custom Catalyst expressions for distribution and ordering.
  • Extensions are no longer required to request a proper distribution and ordering.
  • Adds support for coalescing too-small files with AQE (skew handling is coming in a separate PR).
  • Adds a new function catalog that can be used without a metastore.
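As a hedged illustration of the function-catalog approach (the catalog and table names here are hypothetical, and the exact invocation may differ from what this PR wires up), transform-based expressions resolve through catalog functions in Spark SQL rather than custom Catalyst expressions:

```sql
-- `my_catalog` is an assumed Iceberg catalog name; `system` is the
-- function namespace Iceberg exposes for its transform functions
SELECT my_catalog.system.bucket(16, id) AS id_bucket,
       my_catalog.system.truncate(4, data) AS data_trunc
FROM my_catalog.db.tbl;
```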

@github-actions github-actions bot added the spark label Apr 27, 2023
}
val newQuery = ExtendedDistributionAndOrderingUtils.prepareQuery(write, query, conf)
o.copy(write = Some(write), query = newQuery)

case rd @ ReplaceIcebergData(r: DataSourceV2Relation, query, _, None) =>
Contributor Author

We have to keep custom plans for row-level operations for now, as Spark's plans don't support runtime filtering for UPDATE and MERGE. That support will be part of Spark 3.5.

@BeforeClass
public static void setupSpark() {
  // Disable AQE as these tests assume writes generate a particular number of files
  spark.conf().set(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), "false");
}
Contributor Author

After this PR, AQE coalesces small tasks. Hence, I had to disable it.
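For context, these are the standard Spark AQE settings involved (Spark's own configuration keys, not anything added by this PR):

```properties
# Disable AQE entirely, as the test setup above does
spark.sql.adaptive.enabled=false

# Alternatively, keep AQE but turn off only the coalescing of small partitions
spark.sql.adaptive.coalescePartitions.enabled=false
```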

@@ -53,7 +53,7 @@ public SortOrder truncate(
     String sourceName, int id, int width, SortDirection direction, NullOrder nullOrder) {
   return Expressions.sort(
       Expressions.apply(
-          "truncate", Expressions.column(quotedName(id)), Expressions.literal(width)),
+          "truncate", Expressions.literal(width), Expressions.column(quotedName(id))),
Contributor Author

I had to switch the argument order so that truncate is resolvable; Spark did not support transforms with multiple arguments before.
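For reference, here is a small Spark-free sketch of what Iceberg's truncate transform computes (semantics per the Iceberg spec as I understand them; this is illustrative code, not the PR's):

```java
public class TruncateDemo {

  // Integers truncate down to the nearest lower multiple of the width
  // (floor semantics, so negative values round toward negative infinity).
  static int truncateInt(int value, int width) {
    return value - (((value % width) + width) % width);
  }

  // Strings are cut to at most `width` characters.
  static String truncateString(String value, int width) {
    return value.length() > width ? value.substring(0, width) : value;
  }

  public static void main(String[] args) {
    System.out.println(truncateInt(17, 5));            // 15
    System.out.println(truncateInt(-1, 5));            // -5
    System.out.println(truncateString("iceberg", 4));  // iceb
  }
}
```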

/**
* A function catalog that can be used to resolve Iceberg functions without a metastore connection.
*/
public class SparkFunctionCatalog implements SupportsFunctions {
Contributor Author

This class is used directly in the compaction code but can also be configured as a proper catalog.
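If configured as a catalog, the registration would presumably follow the usual Spark catalog configuration pattern (the catalog name below is hypothetical, and the fully qualified class name is assumed from the package shown elsewhere in this PR; verify before use):

```properties
spark.sql.catalog.iceberg_system_functions=org.apache.iceberg.spark.SparkFunctionCatalog
```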

return namespace.length == 0;
}

default Identifier[] listFunctions(String[] namespace) throws NoSuchNamespaceException {
Contributor Author

Copied from BaseCatalog to reuse in different places.


@Override
public int requiredNumPartitions() {
  return numShufflePartitions;
}
Contributor Author

This is used to request a particular number of partitions instead of setting a SQL conf.
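As a sketch of the mechanism: this relies on Spark's DSv2 RequiresDistributionAndOrdering hook, mirrored locally below rather than imported from Spark, so the interface and class here are illustrative stand-ins, not Iceberg's actual code:

```java
// Local stand-in for Spark's RequiresDistributionAndOrdering contract,
// where requiredNumPartitions() lets a write request an exact shuffle
// partition count instead of relying on spark.sql.shuffle.partitions.
interface RequiresNumPartitions {
  // In Spark's contract, 0 means "no specific requirement".
  default int requiredNumPartitions() {
    return 0;
  }
}

public class SketchWrite implements RequiresNumPartitions {
  private final int numShufflePartitions;

  public SketchWrite(int numShufflePartitions) {
    this.numShufflePartitions = numShufflePartitions;
  }

  @Override
  public int requiredNumPartitions() {
    return numShufflePartitions;
  }

  public static void main(String[] args) {
    System.out.println(new SketchWrite(200).requiredNumPartitions()); // 200
  }
}
```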

@aokolnychyi (Contributor Author)

@aokolnychyi aokolnychyi force-pushed the simplify-spark-write-extensions branch from 920c364 to 72181b8 Compare April 27, 2023 17:18
@aokolnychyi aokolnychyi force-pushed the simplify-spark-write-extensions branch from 72181b8 to 395cd06 Compare April 27, 2023 17:48
@@ -16,23 +16,30 @@
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.spark;

Member

I wonder a little about the naming here. I know I'm always confused about SparkCatalog (our class) versus Spark's Catalog class. IcebergFunctionCatalog? I don't feel strongly about this, though.

Contributor Author (@aokolnychyi, Apr 27, 2023)

I also thought a bit about it, but then looked at all our other classes like SparkCatalog, SparkWrite, etc. I guess it makes sense to follow this pattern because catalogs are referred to using the qualified name, which includes the package.

Member (@RussellSpitzer, Apr 27, 2023)

Yeah, the main issue is whenever we discuss it in docs or with users. Then I'm constantly trying to explain the difference between Spark's Catalog, SparkCatalog, and of course Hive's Catalog and HiveCatalog :)

Contributor Author (@aokolnychyi, Apr 27, 2023)

Yeah, I agree, it was a questionable decision on our end to go with this naming in the first place.

df.select("c1", "c2", "c3")
.write()
.format("iceberg")
.option(SparkWriteOptions.USE_TABLE_DISTRIBUTION_AND_ORDERING, "false")
Member

Why do we need to turn this off for the test?

Contributor Author

The check below compares rows, and it started to fail. Before this PR, the incoming data was not ordered because the table had a truncate transform (unsupported without extensions). After this PR, there is a sort, which breaks the test.

Contributor Author

We could compare counts too. No preference on my side, so I went with the smallest change.

Member

I would probably just compare counts; otherwise it makes it seem like there is some relationship between distribution and ordering and this test.

Contributor Author

Sounds good, I'll switch.

Contributor Author

Switched.

@aokolnychyi aokolnychyi merged commit 91327e7 into apache:master Apr 27, 2023
@aokolnychyi (Contributor Author)

Thanks, @RussellSpitzer!
