
DELETE should return the number of deleted rows #1240

Closed
wants to merge 6 commits

Conversation

edmondop
Contributor

Description

Resolves #1222

How was this patch tested?

The SQL test suite was extended.

Does this PR introduce any user-facing changes?

The DataFrame returned from a delete will no longer be empty; it will contain a single row.
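For illustration, a hypothetical session showing the new behavior (the table, predicate, and count are made up; the result column name is settled later in this thread):

// Assumes a SparkSession `spark` with Delta Lake configured and an
// existing Delta table `tab`; names and values are illustrative only.
val result = spark.sql("DELETE FROM tab WHERE key = 1")
result.show()
// +--------------+
// |numDeletedRows|
// +--------------+
// |             2|
// +--------------+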

@@ -93,9 +93,10 @@ case class DeleteCommand(
// Re-cache all cached plans(including this relation itself, if it's cached) that refer to
// this data source relation.
sparkSession.sharedState.cacheManager.recacheByPlan(sparkSession, target)
Seq(
Row(this.metrics.getOrElse("numDeletedRows", 0))
Contributor

nit: Seq(Row(...)) (or better: Row(...) :: Nil)

Contributor Author

@jaceklaskowski Can you clarify your view on how the factory method defined on the Seq companion object compares with the concrete list constructor? I would say the factory method is preferable from a general programming point of view, since it can return a specific implementation of the trait depending on properties of the input data.
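For reference, a minimal self-contained sketch of the two constructions under discussion; both yield an equivalent single-element sequence:

import org.apache.spark.sql.Row

// Factory method on the Seq companion object; it is free to return a
// specialized implementation of the Seq trait for the given input.
val viaFactory: Seq[Row] = Seq(Row(0L))

// Cons onto Nil; this always builds a scala.collection.immutable.List.
val viaCons: Seq[Row] = Row(0L) :: Nil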

I'll fix the formatting

Collaborator

You can just use metrics("numDeletedRows").value. It is initialized with a default value of 0 anyway.
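For context, a rough sketch of how such a metrics map is typically declared in Spark commands (the thread below notes that both keys are defined in createMetrics; the description strings here are guesses):

import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Sketch: a fresh SQLMetric starts at 0, so metrics("numDeletedRows").value
// is always safe to read for keys declared in this map.
def createMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
  "numDeletedRows" -> SQLMetrics.createMetric(sc, "number of rows deleted"),
  "numRemovedFiles" -> SQLMetrics.createMetric(sc, "number of files removed")
)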

Collaborator

You should also cover the case of deletes at partition boundaries. A delete at a partition boundary is a metadata-only operation, so we don't actually have any information about how many rows were deleted.

So if metrics("numRemovedFiles") is greater than 0 and metrics("numDeletedRows") is equal to 0, we have hit this metadata-only delete case. In that case, let's just return -1, since we don't actually know how many rows were deleted.

Contributor Author

@scottsand-db

Using the get method on the map, we get an Option and are forced to deal with the potential absence of the value, which is more robust. Don't you think that accessing the map directly, relying on the fact that the same values are always present, depends on implicit knowledge about the content of the map that is not encoded in the type system and that can change, leading to failures?

     Seq(Row(this.metrics.get("numDeletedRows").map(_.value).getOrElse(0L)))

Collaborator

Sure, that SGTM. Make sure to cover the partition-boundary case, too.

@jaceklaskowski
Contributor

Can you fix the title to be "DELETE should return the number of deleted rows"?

@scottsand-db scottsand-db self-requested a review June 30, 2022 17:36
@edmondop edmondop changed the title Issue 1222 DELETE should return the number of deleted rows Jun 30, 2022

}

Seq.empty[Row]
Collaborator

Move the Seq(Row(...)) line above to here.

Collaborator

Let's see if this fixes your tests.

Contributor Author

@edmondop Jun 30, 2022

It doesn't. I did some investigation and found that returning that Row makes the optimizer fail because it cannot infer the schema.

I took a screenshot (I tried adding a second field to the row to see what the cause was). This loop throws an array-index-out-of-bounds exception because the same index is used to address both the row and the converters array. As you can see:

  • the converters array has size 0
  • the row has size 2

It could at least check that the two have the same size, or use a .zip with a foreach instead of a while loop. I don't know what the intended semantics are, or why a while loop was preferred for indexing two collections that can have different sizes.

[Screenshot, Jun 30 2022: debugger view showing the converters array with size 0 and the row with size 2]

Collaborator

We can fix this error by defining the output schema for DeleteCommand; adding the following right before innerChildren will work.

 override val output: Seq[Attribute] = Seq(AttributeReference("numDeletedRows", LongType)())
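For context, a minimal sketch (not the actual DeleteCommand) of how a command's declared output attributes must line up column-for-column with the rows it returns:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.types.LongType

// Sketch: Catalyst sizes its converters array from `output`, so each Row
// the command produces must have exactly as many fields; a mismatch leads
// to the out-of-bounds error seen above.
trait ResultSchemaSketch {
  val output: Seq[Attribute] =
    Seq(AttributeReference("numDeletedRows", LongType)())

  def resultRows(deletedRows: Long): Seq[Row] = Seq(Row(deletedRows))
}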

Contributor Author

Thank you, this is very helpful. Is there any documentation on the relationship between the output attributes and the rows returned by the command? Maybe we could introduce some validation in the parent class to avoid errors in the Catalyst optimizer.

Collaborator

I'm not sure if there's any documentation for this, but since this is just a Spark class we're extending, I don't think we should introduce any validation (it would live outside the Delta codebase). We can check for the column "numDeletedRows" in the resulting DataFrame in the test suite.

@scottsand-db scottsand-db self-requested a review June 30, 2022 17:59
@tdas tdas requested a review from allisonport-db July 7, 2022 18:09
Collaborator

@allisonport-db left a comment

We should also add a few additional test cases for (a) when numDeletedRows = 0, and (b) a partition delete

Comment on lines +119 to +122
val knownDeletedRows = deletionResultFromMetrics match {
  case RowInFilesDeletion(deletedRows) => deletedRows
  case FilePruningAtPartitionBoundary => -1
}
Collaborator

I think using these traits and case classes is more complex than necessary (we should be able to do this in a few lines of code).

I don't see why we can't just check if numDeletedRows = 0 and numRemovedFiles > 0, and then update our result to be -1 if so. A comment here explaining that this happens because it's a delete over partitions, where we can't know the number of deleted rows, would suffice.

Collaborator

Hey, I saw your Slack comment. I personally feel something like this is much easier to read, especially compared with the nested if statements you have above. I agree with the principles you have in mind regarding self-documenting code, but I think this specific scenario is so simple logic-wise that the abstraction is unnecessary and actually more complex for the reader.

var deletedRows = metrics("numDeletedRows").value
if (metrics("numRemovedFiles").value > 0 && deletedRows == 0) {
  // This is a delete over a partition boundary, which is a metadata-only
  // operation, therefore we can't know how many rows were deleted.
  deletedRows = -1
}

Collaborator

If you feel really strongly about this, we can look into making your current implementation more readable (it personally took me a few seconds to work out what was going on).

Contributor Author

You are certainly more familiar with the codebase and the context than I am, so if you think the scenario is simple enough that separating the interpretation of the metrics from the result is not required, I will just remove it.

The reason it was surprising to me is that the -1 is sort of a magic value, and I would have expected some metadata to be available telling how many rows were deleted in each file.

if (deletedRows > 0) {
  RowInFilesDeletion(deletedRows)
} else {
  val deletedFiles = this.metrics.get("numRemovedFiles").map(_.value).getOrElse(0L)
Collaborator

We can also get the metric value using just metrics("numRemovedFiles").value and metrics("numDeletedRows").value. This is how we do it throughout the codebase, and we can see above that both keys are defined in createMetrics. And, as @scottsand-db mentioned, the default value is 0 anyway.

Collaborator

Don't you think that accessing the map directly, relying on the fact that the same values are always present, depends on implicit knowledge about the content of the map that is not encoded in the type system and that can change, leading to failures?

I think we would want to know if the content of our metrics map changed to no longer contain the values that we explicitly define.

Contributor Author

@allisonport-db are there unit tests that verify that certain metrics are set to 0 by default?

Collaborator

I'm not sure if there's a test for this, as SQLMetric is also a Spark class. We can see in the class header that the default value is 0.

We can test this to some degree, however, by adding a test for when numDeletedRows = 0, as I've mentioned.
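A rough sketch of such a test, assuming the DeleteSQLSuite harness (withTable, and toDF via spark.implicits) seen elsewhere in this thread:

test("DELETE matching no rows returns numDeletedRows = 0") {
  withTable("tab") {
    Seq((1, 1), (0, 3), (1, 5)).toDF("key", "value")
      .write.format("delta").saveAsTable("tab")
    // No row has key = 42 and no file is removed, so a single row
    // reporting 0 deleted rows is expected.
    val result = spark.sql("DELETE FROM tab WHERE key = 42")
    assert(result.columns.toSeq === Seq("numDeletedRows"))
    assert(result.collect().toSeq === Seq(Row(0L)))
  }
}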

@allisonport-db
Collaborator

Hey @edmondo1984, we'd love to resolve #1222. Are you still interested in working on this PR?

@@ -56,7 +56,7 @@ class DeleteSQLSuite extends DeleteSuiteBase with DeltaSQLCommandTest {
withTempView("v") {
  Seq((1, 1), (0, 3), (1, 5)).toDF("key", "value").write.format("delta").saveAsTable("tab")
Collaborator

@rahulsmahadev Aug 4, 2022

Can you make sure to add tests for the following scenarios? (A rough sketch follows after the list.)

  1. DELETE on partitioned tables
  2. full table DELETE etc.
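A hedged sketch of those tests, in the same DeleteSQLSuite style; the expected -1 values assume the metadata-only rule discussed above, since both deletes drop whole files without scanning rows:

test("DELETE on a partitioned table at a partition boundary") {
  withTable("tab") {
    Seq((1, 1), (0, 3), (1, 5)).toDF("key", "value")
      .write.format("delta").partitionBy("key").saveAsTable("tab")
    // Metadata-only: whole partition directories are dropped, so the
    // number of deleted rows is unknown and reported as -1.
    val result = spark.sql("DELETE FROM tab WHERE key = 1")
    assert(result.collect().toSeq === Seq(Row(-1L)))
  }
}

test("full table DELETE") {
  withTable("tab") {
    Seq((1, 1), (0, 3), (1, 5)).toDF("key", "value")
      .write.format("delta").saveAsTable("tab")
    // A full-table DELETE also just removes files, so the same rule
    // applies and -1 is expected here as well.
    val result = spark.sql("DELETE FROM tab")
    assert(result.collect().toSeq === Seq(Row(-1L)))
  }
}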

@scottsand-db
Collaborator

I've submitted a PR here: #1328

@edmondo1984 want to take a look?

Successfully merging this pull request may close these issues.

[Feature Request] Make DELETE operations return the number of deleted rows