[SPARK-56510][SQL] Fix ReplaceData DML without metadata attributes not projecting out the operation column by ZiyaZa · Pull Request #55372 · apache/spark

ZiyaZa · 2026-04-16T16:39:28Z

What changes were proposed in this pull request?

Previously, all DSv2 tests used an in-memory table that exposed metadata attributes, so the code path for tables without metadata attributes was never exercised. This PR introduces a new table property, no-metadata, for testing with an InMemoryTable that has no metadata attributes.

The previous implementation had a bug in ReplaceData plans: it used DataWritingSparkTask without a projection, so the connector received one extra column (the __row_operation column) in addition to the row data to write. This PR fixes that by adding a new writing task, DataWithProjectionWritingSparkTask, which projects out only the row data.
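To make the bug concrete, here is a minimal standalone model of the difference between the two writing tasks. It is only a sketch and does not use Spark classes: the `Row` alias, the operation codes, the assumption that the operation column sits at index 0, and all helper names are hypothetical.

```scala
object OperationColumnProjection {
  // Hypothetical model: a row is a flat sequence of values. We assume the
  // operation code comes first, followed by the row data.
  type Row = Seq[Any]

  // Hypothetical operation codes; the real constants in Spark may differ.
  val CopyOperation = 1
  val UpdateOperation = 2

  // Keeps only the values at the given ordinals, in the given order.
  def project(row: Row, ordinals: Seq[Int]): Row = ordinals.map(row)

  // Ordinals of the row data columns: everything except index 0.
  def rowDataOrdinals(numColumns: Int): Seq[Int] = 1 until numColumns

  // Models a writing task without projection: the connector sees every column.
  def writeWithoutProjection(rows: Seq[Row]): Seq[Row] = rows

  // Models a writing task with projection: the connector sees only row data.
  def writeWithProjection(rows: Seq[Row]): Seq[Row] =
    rows.map(row => project(row, rowDataOrdinals(row.length)))
}
```

In this model, a connector fed by `writeWithoutProjection` receives three values per row for a two-column table, which mirrors the extra-column symptom described above.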

Additionally, the following changes clean up the code:

Removed WRITE_WITH_METADATA_OPERATION and WRITE_OPERATION, and instead created COPY_OPERATION to be used along with the other existing operations.
Created RowLevelWriteExec as a parent of ReplaceDataExec / WriteDeltaExec, which now holds a helper, getMetricValue, for metric computation.

Why are the changes needed?

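To fix the bug described above: ReplaceData writes to tables without metadata attributes passed the __row_operation column to the connector.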

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.6

szehon-ho

Good bug fix for the no-metadata-attributes code path in ReplaceData. The core change (introducing DataWithProjectionWritingSparkTask) is correct and well-targeted — it ensures the __row_operation column is projected out even when there are no metadata attributes. A few suggestions below.

szehon-ho · 2026-04-16T21:57:27Z

+package org.apache.spark.sql.connector
+
+class GroupBasedNoMetadataDeleteFromTableSuite extends DeleteFromTableSuiteBase {
+


suggestion: Six new test files, each ~10 lines of actual code, is significant boilerplate. Consider adding a noMetadata flag to the existing suites and running them with both configurations (e.g., via a shared trait or parameterization). This would avoid class proliferation and keep the test matrix more maintainable.

This follows the existing structure for DML suites, which has one suite per file. I don't think this layout is ideal, but I can't change it without a large diff.

Note that we still need to have different classes for each configuration, but we could place multiple suites in a single file.
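As a rough illustration of the "shared trait plus one class per configuration" idea discussed here, the following standalone sketch keeps both configurations in a single file. It does not use the real suite classes: `DeleteSuiteModel` and both model classes are hypothetical, and the property value `"true"` is an assumption (only the property name no-metadata comes from the PR description).

```scala
object NoMetadataSuites {
  // Hypothetical stand-in for a shared DML suite base; not a Spark class.
  trait DeleteSuiteModel {
    // Each configuration overrides this flag.
    protected def noMetadata: Boolean = false

    // Table properties the suite would pass when creating its test table.
    // The "true" value is an assumption made for this sketch.
    def extraTableProps: Map[String, String] =
      if (noMetadata) Map("no-metadata" -> "true") else Map.empty
  }

  // One small class per configuration, both living in the same file.
  class GroupBasedDeleteModel extends DeleteSuiteModel

  class GroupBasedNoMetadataDeleteModel extends DeleteSuiteModel {
    override protected def noMetadata: Boolean = true
  }
}
```

Each configuration still gets its own class, so a test runner can discover them separately, but the shared logic lives in one trait.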

szehon-ho · 2026-04-16T21:57:27Z

-        override def build(): Write = new Write with RequiresDistributionAndOrdering {
-          override def requiredDistribution: Distribution = {
-            Distributions.clustered(Array(PARTITION_COLUMN_REF))
+        override def build(): Write = if (noMetadata) {


question: Is it intentional that the noMetadata path bypasses RequiresDistributionAndOrdering? This exercises a different physical plan (no shuffle/sort). If the goal is just to test the no-metadata code path, consider keeping the distribution/ordering requirements so these tests cover the same physical plan shape as the existing suites.

It's intentional. Without metadata, we don't have PARTITION_COLUMN_REF. Without it, we can't guarantee partitioning and can't know which column to use for distribution / ordering. This still exercises the code paths for different writing tasks, so it should be enough.
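The shape of that branch can be modeled without Spark as follows. This is only a sketch: `Write` and `RequiresDistribution` here are simplified hypothetical traits, not the connector interfaces, and the partition column name used in the usage note is made up.

```scala
object WriteRequirements {
  // Hypothetical, simplified stand-ins for the connector write interfaces.
  trait Write
  trait RequiresDistribution extends Write {
    def clusterBy: Seq[String]
  }

  // Without metadata there is no partition column to cluster by, so the
  // no-metadata path returns a plain Write with no distribution requirement.
  def build(noMetadata: Boolean, partitionColumn: String): Write =
    if (noMetadata) {
      new Write {}
    } else {
      new Write with RequiresDistribution {
        override def clusterBy: Seq[String] = Seq(partitionColumn)
      }
    }
}
```

For example, `WriteRequirements.build(noMetadata = true, "_partition")` yields a write that does not implement `RequiresDistribution`, so in this model nothing would ask for a shuffle or sort.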

szehon-ho · 2026-04-16T21:57:27Z

    val pk = id.getInt(0)
    buffer.deletes += pk
-    val logEntry = new GenericInternalRow(Array[Any](DELETE, pk, meta.copy(), null))
+    val metaCopy = if (meta != null) meta.copy() else null


This null guard is needed because DeltaWritingSparkTask passes null for metadata when requiredMetadataAttributes() is empty. However, the DeltaWriter API methods (delete(meta, id), update(meta, id, row), reinsert(meta, row)) don't document that meta can be null. Third-party connectors could hit the same NPE. Consider adding Javadoc on those API methods to clarify the contract.

I believe this falls outside the scope of this PR.
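For connector authors who want to guard against the same situation, a null-safe copy can be sketched as below. This is a standalone illustration: `MetaRow` is a hypothetical stand-in for a metadata row, not Spark's row type, and the `Option`-based variant is just one possible style.

```scala
object NullSafeMetadata {
  // Hypothetical stand-in for a metadata row; not a Spark class.
  final class MetaRow(val values: Vector[Any]) {
    def copy(): MetaRow = new MetaRow(values)
  }

  // Mirrors the guard in the diff: copy only when metadata is present.
  def copyOrNull(meta: MetaRow): MetaRow =
    if (meta != null) meta.copy() else null

  // Alternative that surfaces the "may be absent" contract in the type.
  def copyOpt(meta: MetaRow): Option[MetaRow] =
    Option(meta).map(_.copy())
}
```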

aokolnychyi · 2026-04-16T22:21:05Z

Let me take a look.

aokolnychyi · 2026-04-16T23:35:36Z

I would explore the possibility of fixing this by introducing correct row operation types for ReplaceData.

szehon-ho · 2026-04-17T00:05:32Z

That is a nice idea (fixing both this and #55141).

aokolnychyi

Looks great to me, two minor questions and good to go. Thanks for the patience!

aokolnychyi · 2026-04-18T01:46:38Z

+    addOperationColumn(Literal(operation, IntegerType), plan)
+  }
+
+  protected def addOperationColumn(operation: Expression, plan: LogicalPlan): LogicalPlan = {


Is this being used? Is it needed?

Good catch. It must have been left over from my previous attempts. It's not used anymore, so I removed it.

aokolnychyi · 2026-04-18T02:00:49Z

@@ -72,12 +72,13 @@ object RewriteUpdateTable extends RewriteRowLevelCommand {
    val readRelation = buildRelationWithAttrs(relation, operationTable, metadataAttrs)

    // build a plan with updated and copied over records
-    val updatedAndRemainingRowsPlan = buildReplaceDataUpdateProjection(
-      readRelation, assignments, cond)
+    // the conditional operation column needs to be added in the same Projection as cond is


Why can't we have this inside buildReplaceDataUpdateProjection and rely on the optimizer to fold this?

If(TrueLiteral, UPDATE_OPERATION, COPY_OPERATION) -> UPDATE_OPERATION

We already call buildReplaceDataUpdateProjection WITHOUT the condition in the union path.

    // buildReplaceDataUpdateProjection
    val operation = If(cond, Literal(UPDATE_OPERATION), Literal(COPY_OPERATION))
    Project(Alias(operation, OPERATION_COLUMN)() +: updatedValues, plan)

We can. I've moved it into buildReplaceDataUpdateProjection.
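The folding the reviewer relies on can be illustrated with a tiny standalone expression model. This is a sketch, not Catalyst: the `Expr` types, the `fold` function, and the operation codes are all hypothetical and only mimic the shape of `If(cond, UPDATE_OPERATION, COPY_OPERATION)`.

```scala
object IfFolding {
  // Hypothetical miniature expression tree; not Spark's Catalyst classes.
  sealed trait Expr
  final case class Lit(value: Any) extends Expr
  final case class Attr(name: String) extends Expr
  final case class If(cond: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr

  val TrueLiteral: Expr = Lit(true)

  // Hypothetical operation codes; the real constants may differ.
  val UpdateOperation: Expr = Lit(2)
  val CopyOperation: Expr = Lit(1)

  // Always builds the conditional operation column, whatever cond is.
  def operationColumn(cond: Expr): Expr = If(cond, UpdateOperation, CopyOperation)

  // Folds If nodes whose condition is a constant boolean.
  def fold(e: Expr): Expr = e match {
    case If(c, t, f) =>
      fold(c) match {
        case Lit(true)  => fold(t)
        case Lit(false) => fold(f)
        case other      => If(other, fold(t), fold(f))
      }
    case other => other
  }
}
```

With a trivially true condition, `fold(operationColumn(TrueLiteral))` reduces to `UpdateOperation`, while a non-constant condition such as `Attr("cond")` leaves the `If` in place. That is why building the conditional column unconditionally inside the helper costs nothing in the union path.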

ZiyaZa added 3 commits April 16, 2026 15:25

Fix ReplaceData DML without metadata attributes not projecting out th…

c70e1da

…e operation column

Rename WRITE_OPERATION and WRITE_WITH_METADATA_OPERATION

8b37604

Tests without metadata attributes

f69f4e6

ZiyaZa changed the title ~~[SQL] Fix ReplaceData DML without metadata attributes not projecting out the operation column~~ [SPARK-56510][SQL] Fix ReplaceData DML without metadata attributes not projecting out the operation column Apr 16, 2026

juliuszsompolski approved these changes Apr 16, 2026

Fix comment

f70cc22

szehon-ho reviewed Apr 16, 2026

Remove WRITE_OPERATION and instead use fine-grained operations

cae5e1e

aokolnychyi approved these changes Apr 18, 2026

Address comments

69b2c42

aokolnychyi closed this in d6c1f9a Apr 20, 2026

4 participants
