Spark: Implement merge-on-read DELETE #3763
Conversation
override protected lazy val stringArgs: Iterator[Any] = Iterator(table, query, write)

// TODO: validate the row ID and metadata schema
This seems minor to me. Since we are on a tight schedule, I'd skip it for now.
I'm debating this. It seems like it will always be correct, but resolution is a nice way to sanity check and fail rather than doing the wrong thing at runtime.
I'd say that this is probably worth doing and isn't going to be a lot of work compared with the rest of this PR.
Added some validation.
// a trait similar to V2ExistingTableWriteExec but supports custom write tasks
trait ExtendedV2ExistingTableWriteExec extends V2ExistingTableWriteExec {
Mostly copied from Spark.
trait WritingSparkTask extends Logging with Serializable {
Same here. Mostly from Spark.
case class DeltaWithMetadataWritingSparkTask(
This is custom and needs review.
private final Integer splitLookback;
private final Long splitOpenFileCost;
private final TableScan scan;
private final Context ctx;
I am not a big fan of this class here, but it is needed for equals and hashCode. Another option I considered was to implement equals and hashCode in all TableScan implementations. Unfortunately, we have a lot of such classes, and "equal" scans in Spark are a slightly weaker concept (i.e. not every detail must be the same to consider two scans identical).
Alternatives are welcome.
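For illustration, a minimal Scala sketch of the idea behind such a context object (the field names here are hypothetical, not the actual SparkBatchQueryScan fields): a small value class captures only the scan-defining inputs and derives equality from them, instead of adding equals and hashCode to every TableScan implementation.

```scala
// Hypothetical sketch: a value object holding only the inputs that define a scan.
// A case class derives structural equals/hashCode, so two Spark scans built from the
// same inputs compare equal even though the wrapped TableScan defines no equality.
case class ScanContext(
    snapshotId: Option[Long],
    filterExpressions: Seq[String],  // simplified stand-in for Iceberg filter expressions
    selectedColumns: Seq[String],
    caseSensitive: Boolean)

object ScanContextExample extends App {
  val a = ScanContext(Some(42L), Seq("id > 10"), Seq("id", "data"), caseSensitive = false)
  val b = ScanContext(Some(42L), Seq("id > 10"), Seq("id", "data"), caseSensitive = false)
  assert(a == b)                    // structural equality
  assert(a.hashCode == b.hashCode)  // consistent hashCode, usable for scan equality checks
}
```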
super(spark, table, readConf, expectedSchema, filters);

this.snapshotId = readConf.snapshotId();
Moved to SparkScanBuilder.
@Override
public void commit(WriterCommitMessage[] messages) {
Requires extra attention!
private static class Context implements Serializable {
A helper class to avoid passing a huge list of params to methods.
rdblue left a comment:
I have no major issues with this, although I made a few comments about minor things.
private static void validateSchema(Schema expectedSchema, Schema actualSchema, Boolean checkNullability,
I did this refactoring to slightly reduce code duplication. Now that I look at it, I am not sure it was worth it.
Yeah, seems like just adding the context part is all you need to do.
override protected lazy val stringArgs: Iterator[Any] = Iterator(table, query, write)

private def operationResolved: Boolean = {
@rdblue, what do you think of the validation here?
I added some comments in the implementation.
This needs another round or two. I am switching to copy-on-write MERGE for now.
private def isCompatible(projectionField: StructField, outAttr: NamedExpression): Boolean = {
Looks good.
private def rowIdAttrsResolved: Boolean = {
  projections.rowIdProjection.schema.forall { field =>
    originalTable.resolve(Seq(field.name), conf.resolver) match {
Why does this use originalTable? I thought these fields should be coming from query?
Well, it is a little bit tricky. The actual type is defined by the projection. For example, consider MERGE operations. The incoming plan will have the wrong nullability for metadata and row ID columns (they will always be nullable because those columns are null for records to insert). However, we never pass row ID or metadata columns with inserts. We only pass them with updates and deletes, where those columns have correct values. In other words, the projection has more precise types. The existing logic validates that whatever the projections produce satisfies the target output attributes.
That being said, you are also right that we probably need some validation that we can actually project those columns from query...
What do you think, @rdblue?
The incoming fields are probably fine because they're coming from query via rowIdProjection. For the output fields, I think it makes sense to go back to what the table requested. Since the output relation, table, is probably a V2Relation that wraps the RowLevelOperationTable, we should actually be able to recover the requested fields without using originalTable.
I think that makes the most sense: we want to validate that the incoming fields (query or rowIdProjection) satisfy the requirements from the operation. The original table doesn't really need to be used.
Yeah, I agree. I’ll try to implement.
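To make the compatibility check concrete, here is a minimal, self-contained Scala sketch with simplified stand-in types (not the actual Catalyst classes or the final form of the rule): each field produced by the row ID projection is matched against the requested attributes by name, type, and nullability.

```scala
// Hypothetical, simplified stand-in for Catalyst's StructField / attribute types.
case class Field(name: String, dataType: String, nullable: Boolean)

object RowIdResolutionSketch extends App {
  // A projected field satisfies a requested attribute if the names resolve
  // (case-insensitively here), the types match, and the projection is not
  // more nullable than the attribute it must satisfy.
  def isCompatible(projected: Field, requested: Field): Boolean =
    projected.name.equalsIgnoreCase(requested.name) &&
      projected.dataType == requested.dataType &&
      (!projected.nullable || requested.nullable)

  def rowIdAttrsResolved(rowIdProjection: Seq[Field], rowIdAttrs: Seq[Field]): Boolean =
    rowIdProjection.forall(field => rowIdAttrs.exists(attr => isCompatible(field, attr)))

  // For position deletes, the row ID projection typically carries _file and _pos.
  val projection = Seq(Field("_file", "string", nullable = false), Field("_pos", "long", nullable = false))
  val requested  = Seq(Field("_file", "string", nullable = false), Field("_pos", "long", nullable = false))
  assert(rowIdAttrsResolved(projection, requested))
}
```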
projections.metadataProjection match {
  case Some(projection) =>
    projection.schema.forall { field =>
      originalTable.metadataOutput.exists(metadataAttr => isCompatible(field, metadataAttr))
Same here. Shouldn't these be looked up in query since that's what produces the row that the metadata projection wraps?
private def rowIdAttrsResolved: Boolean = {
@rdblue, I changed the validation a bit. As discussed before, the intention is to validate that whatever comes out of the projection satisfies the reported row ID attributes. I couldn't avoid using originalTable, as the operation only gives me attribute names and I have to resolve them against something.
That being said, it may not be the final iteration. Feedback would be appreciated.
Why not resolve the attrs against the child query: Projection? That's where the data is coming from. So you'd be finding the row ID fields that are coming from the incoming data that will be extracted by projections.rowIdProjection.
Yeah, I don't quite get why you can't use query instead of originalTable to look up the row ID attrs and then you'd no longer need originalTable. Same with metadata attrs.
I think we cannot use query for MERGE commands. The actual nullability is defined by the projection and may differ from the nullability of the attributes in query. Consider a MERGE plan with records to update and insert. The metadata and row ID columns will always be nullable because those columns are null for records to insert. However, we never pass row ID or metadata columns with inserts. We only pass them with updates and deletes, where those columns have correct values. In other words, the projection has more precise types. The existing logic checks that whatever the projection produces satisfies the original row ID and metadata attrs.
Apart from that, we still need originalTable to refresh the cache later.
That makes sense, but it sounds to me like we could use query and ignore nullability in some cases. I'm more concerned about type widening that is unexpected because we're validating based on what the table produced and not what the query produced.
For cache refreshing, shouldn't we use a callback that captures the table in a closure like we do for other plans?
private def metadataAttrsResolved: Boolean = {
Same here, @rdblue.
private static final Logger LOG = LoggerFactory.getLogger(SparkBatchQueryScan.class);

private final TableScan scan;
private final Long snapshotId;
@rdblue, I reverted the earlier change that used Context here.
@Override
public Scan build() {
@rdblue, I am not entirely happy with this place but it is probably better than using a context.
public static void validateSchema(String context, Schema expectedSchema, Schema providedSchema,
                                  boolean checkNullability, boolean checkOrdering) {
  String errMsg = String.format("Provided %s schema is incompatible with expected %s schema:", context, context);
Do we need context twice? I think "expected schema" is nearly equivalent to "expected row ID schema" if "row ID" was already used the first time, as in "Provided row ID schema ...".
Nah, just one is probably enough. Updated.
.append("\n")
.append(providedSchema)
.append("\n")
.append("problems:");
Nit: I prefer capitalizing these like they were before. It looks weird to not use sentence case.
We capitalized only one before, so it looked inconsistent. Both now start with capital letters.
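For reference, a hedged Scala sketch of the message layout under discussion (the exact labels are hypothetical, not the actual implementation): the context qualifier appears once, in the leading sentence, and the section labels use sentence case.

```scala
// Hypothetical sketch of the error message layout: the context qualifier ("row ID",
// "metadata", ...) is used once, and the section labels are capitalized consistently.
object SchemaErrorMessageSketch extends App {
  def schemaErrorMessage(context: String, expected: String, provided: String, problems: Seq[String]): String =
    s"""Provided $context schema is incompatible with expected schema:
       |Expected schema:
       |$expected
       |Provided schema:
       |$provided
       |Problems:
       |${problems.mkString("\n")}""".stripMargin

  println(schemaErrorMessage(
    "row ID",
    "struct<_file: string, _pos: long>",
    "struct<_file: string>",
    Seq("Cannot find field '_pos' in the provided schema")))
}
```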
private lazy val rowProjection = projs.rowProjection.orNull
private lazy val rowIdProjection = projs.rowIdProjection
private lazy val metadataProjection = projs.metadataProjection.orNull
Looks like this may throw an NPE when a projection is null but the operation causes it to be accessed? Is there a better way to fail? Maybe check which ones are null and add cases like `case UPDATE_OPERATION if !hasUpdateProjections => throw ...`
This may not be a good idea if we think we can guarantee that the required projections will be there.
Maybe all we need instead is to catch NPE and wrap it with the projection context and operation.
I thought about this too, but this is such a sensitive area that gets invoked for every row, so I've tried to avoid any extra work. While try/catch does not cost much unless an exception is thrown, the JVM may not rewrite and apply advanced optimizations to the code inside the block. Having an extra if would potentially be even worse.
I checked the code that produces these projections and it seems unlikely we can get an NPE given our tests.
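For context, a rough, self-contained Scala sketch of the per-row dispatch being discussed (simplified types and hypothetical names, not the actual DeltaWithMetadataWritingSparkTask): the projections are prepared once per task, and the hot per-row path applies them directly, so a missing projection would surface as an NPE rather than paying for an extra branch or try/catch on every row.

```scala
// Hypothetical, heavily simplified sketch. "Projection" stands in for Catalyst's
// projecting rows; operations are encoded as ints the way a delta writer might
// read them from the operation column.
object DeltaDispatchSketch extends App {
  type Row = Map[String, Any]
  type Projection = Row => Row

  val DeleteOp = 1
  val InsertOp = 2

  def writeRow(
      operation: Int,
      row: Row,
      rowProjection: Projection,      // may be null for plans that never insert or update
      rowIdProjection: Projection,
      metadataProjection: Projection,
      delete: (Row, Row) => Unit,
      insert: Row => Unit): Unit =
    operation match {
      case DeleteOp => delete(metadataProjection(row), rowIdProjection(row))
      case InsertOp => insert(rowProjection(row))
      case other    => throw new IllegalArgumentException(s"Unexpected operation: $other")
    }

  // A DELETE row only needs the row ID and metadata projections; rowProjection stays null.
  val row: Row = Map("_file" -> "data/file-1.parquet", "_pos" -> 7L)
  writeRow(DeleteOp, row,
    rowProjection = null,
    rowIdProjection = r => Map("_file" -> r("_file"), "_pos" -> r("_pos")),
    metadataProjection = _ => Map.empty,
    delete = (meta, rowId) => println(s"delete $rowId (metadata: $meta)"),
    insert = _ => ())
}
```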
public DistributionMode positionDeleteDistributionMode() {
Seems like some reasonable defaults to me.
Still looks good to me. I had a few minor comments, but overall +1.

Thanks for reviewing, @rdblue! I've merged this one as the remaining open points are relatively minor and can be further discussed separately.
This PR implements merge-on-read DELETE in Spark.
Resolves #3629.