[SPARK-56551][SQL] Add operation metrics for DELETE queries in DSv2#55428
ZiyaZa wants to merge 24 commits into apache:master
Conversation
…e operation column
# Conflicts:
#   sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteRowLevelCommand.scala
#   sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteUpdateTable.scala
#   sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RowDeltaUtils.scala
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
# Conflicts:
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
#   sql/core/src/test/scala/org/apache/spark/sql/connector/DeltaBasedNoMetadataUpdateTableSuite.scala
      case b: BatchScanExec if b.table.isInstanceOf[RowLevelOperationTable] =>
        getMetricValue(b.metrics, "numOutputRows")
    }
    val numCopiedRows = getMetricValue(metrics, "numCopiedRows")
This seems like a reasonable approach to handle DELETE. Is there a metric for the overall number of output rows? Can we add some sanity checks that we only had copied rows and that their count matches the number of output rows?
We could do this using totalNumRowsAccumulator.value and verify that the numbers in WriteSummary match what we expect. Since this can be done for other commands too, let's do this in a follow-up PR for all commands.
+1 to the idea (check whether it matches numCopiedRows in the ReplaceData case, and numDeletedRows in the WriteDelta case)
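The cross-check proposed in this thread could be sketched roughly as below. This is a hypothetical, self-contained sketch, not actual Spark code: the function name and the plain `Long` parameters (standing in for `totalNumRowsAccumulator.value` and the real metric values) are illustrative assumptions.

```scala
// Hypothetical sketch of the proposed follow-up sanity check: verify that
// per-command metrics are consistent with the total number of written rows.
// All names here are illustrative, not actual Spark internals.
def validateRowLevelMetrics(
    operation: String,      // "ReplaceData" or "WriteDelta"
    totalNumRows: Long,     // e.g. totalNumRowsAccumulator.value
    numCopiedRows: Long,
    numDeletedRows: Long): Unit = operation match {
  case "ReplaceData" =>
    // A DELETE via ReplaceData writes back only the surviving (copied) rows.
    assert(totalNumRows == numCopiedRows,
      s"expected $totalNumRows written rows to match $numCopiedRows copied rows")
  case "WriteDelta" =>
    // A DELETE via WriteDelta writes one delta record per deleted row.
    assert(totalNumRows == numDeletedRows,
      s"expected $totalNumRows written rows to match $numDeletedRows deleted rows")
  case other =>
    throw new IllegalArgumentException(s"unknown operation: $other")
}
```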
# Conflicts:
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
andreaschat-db left a comment
Looks nice overall. Left two comments.
    }
    val numCopiedRows = getMetricValue(metrics, "numCopiedRows")
    val numDeletedRows = if (numScannedRows.exists(_ >= 0) && numCopiedRows >= 0) {
      numScannedRows.get - numCopiedRows
Can numScannedRows be less than numCopiedRows? It looks like this needs a sanity check.
It would mean the plan is creating new rows, and it shouldn't do that. We can add a sanity check, together with others, in a follow-up PR. See #55428 (comment). This also has a dependency on #55371.
Metric values can get overcounted on retries. The scan and the write can be executed in different stages, so they can have different retries, so technically you can get an overcounted numCopiedRows that turns this negative. Using the metrics infrastructure from #55371 that I want to get in would fix that.
    // DELETE ReplaceData plans filter out the deleted rows early in the plan, and they don't
    // reach this node. We need to calculate this value as numScannedRows - numCopiedRows.
    val numScannedRows = collectFirst(query) {
      case b: BatchScanExec if b.table.isInstanceOf[RowLevelOperationTable] =>
Is this the only scan type we have today?
DataSourceV2ScanExecBase has 4 child scan types; seemingly only this one is used for DELETEs. The other scans seem to be used only for streaming reads.
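Put together, the derivation discussed in this thread amounts to something like the sketch below. It is a simplified, self-contained illustration using plain values in place of the real metric objects; the function name is hypothetical, and -1 is used as the "metric unavailable" sentinel discussed in this review.

```scala
// Simplified sketch of the DELETE metric derivation for ReplaceData plans.
// numScannedRows is None when no qualifying BatchScanExec node was found
// (e.g. the scan was optimized away).
def deriveNumDeletedRows(numScannedRows: Option[Long], numCopiedRows: Long): Long = {
  if (numScannedRows.exists(_ >= 0) && numCopiedRows >= 0) {
    // Deleted rows never reach the write node, so they are the difference
    // between what the row-level scan produced and what was copied back.
    numScannedRows.get - numCopiedRows
  } else {
    -1L // sentinel: one of the inputs was missing or invalid
  }
}
```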
szehon-ho left a comment
Looks good to me. +1 to putting assertions in the follow-up PRs for the expected presence or values of various metrics.
    // DELETE ReplaceData plans filter out the deleted rows early in the plan, and they don't
    // reach this node. We need to calculate this value as numScannedRows - numCopiedRows.
    val numScannedRows = collectFirst(query) {
      case b: BatchScanExec if b.table.isInstanceOf[RowLevelOperationTable] =>
We don't expect this to happen, right? Should we log a warning if we cannot find a BatchScanExec, or combine it with one of the assertions?
Currently, if we can't find such a scan node, we set the metric to -1. Is this good enough, what do you think? It fits the theme we have been going with so far: if there is some problem, we use -1 as the metric value. If any metric value is -1, it means we have a problem somewhere (excluding insert-only merges, where this is intentional at the moment, but we'll fix that). When we discussed this previously, we decided against throwing an error and letting connectors handle it, but we didn't discuss logging.
A case to consider is when the optimizer squashes and replaces the scan with an empty relation. This can happen in multiple cases. What is our behavior in that case?
What changes were proposed in this pull request?
Added numDeletedRows and numCopiedRows metrics for DELETE operations in DSv2. These metrics are calculated in the WritingSparkTask. Metadata-only DELETEs are excluded from this PR and will be tackled in a future PR.
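As a rough illustration of the kind of per-row accounting a writing task can do, consider the sketch below. It is hypothetical: the operation codes, type, and function names are illustrative placeholders, not the actual constants from RowDeltaUtils or the real WritingSparkTask logic.

```scala
// Hypothetical sketch of per-row metric accounting inside a writing task.
// Operation codes are illustrative; Spark defines the real ones elsewhere.
final case class DeleteMetrics(var numDeletedRows: Long = 0L, var numCopiedRows: Long = 0L)

val DELETE = 1 // illustrative operation code for a deleted row
val COPY   = 4 // illustrative operation code for an unmodified, copied-back row

// Bump the matching counter based on the row's operation column value.
def recordRow(metrics: DeleteMetrics, operation: Int): Unit = operation match {
  case DELETE => metrics.numDeletedRows += 1
  case COPY   => metrics.numCopiedRows += 1
  case _      => // other operations (update, insert) are out of scope here
}
```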
Why are the changes needed?
For better visibility into what happened as a result of a DELETE query.
Does this PR introduce any user-facing change?
Yes. DELETE queries now report numDeletedRows and numCopiedRows operation metrics.
How was this patch tested?
Added metric value validation to most DELETE unit tests.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7