
Spark 3.3: Uniqueness validation when computing updates of changelogs #7388

Merged · 11 commits · May 9, 2023

Conversation

@flyrain (Contributor) commented Apr 20, 2023

Iceberg doesn't enforce row uniqueness for a given value of the identifier fields (a.k.a. the primary key in other systems). That means there can be duplicate rows with the same identifier field values.

We can handle duplicate rows while removing carryover rows, but it is impossible to compute updates in that case. This PR improves the logic to handle duplicated rows when removing carryovers, and throws an exception when computing updates.
cc @RussellSpitzer @szehon-ho @aokolnychyi @rdblue
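
To illustrate with hypothetical rows: a DELETE/INSERT pair with identical values is a carryover (the row was merely rewritten, e.g. by a copy-on-write operation), while duplicate DELETE rows with the same identifier value make it ambiguous which INSERT each one pairs with, so there is no correct way to compute updates:

   {1, "a", "data", DELETE}      -- carryover pair: identical values deleted and re-inserted
   {1, "a", "data", INSERT}
   {2, "b", "data", DELETE}      -- duplicate deletes: ambiguous pairing with the INSERT below
   {2, "b", "data", DELETE}
   {2, "b", "new_data", INSERT}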

Comment on lines 28 to 29
private Row deletedRow = null;
private long deletedRowCount = 0;
Contributor Author:

We can use a stack/list for the same purpose, but this solution has less memory cost.
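
A minimal sketch of the idea, assuming identical delete rows arrive consecutively; DeleteBuffer and cacheDelete are hypothetical names, not code from the PR:

  import org.apache.spark.sql.Row;

  // Sketch: one representative Row plus a counter replaces a stack/list of
  // identical DELETE rows, so memory stays O(1) however many duplicates appear.
  class DeleteBuffer {
    private Row deletedRow = null;
    private long deletedRowCount = 0;

    void cacheDelete(Row row) {
      if (deletedRow == null) {
        deletedRow = row; // first identical DELETE: keep one reference
      }
      deletedRowCount++; // each duplicate only bumps the count
    }
  }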

Contributor:

Are there any potential issues with reusing the same object given that we mutate that object in place in pre/post image computation? Or is it safe because there will be an exception if multiple DELETE rows get there? Seems like it is going to work, just checking.

Contributor Author:

We don't mutate the object while removing carry-over rows. To provide more context, this PR splits the ChangelogIterator into two iterators: the first one (CarryoverRemoveIterator) only handles removing carry-over rows; after that, we apply the second one (the modified ChangelogIterator), which computes updates only.
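
Schematically, with hypothetical rows:

   {1, "a", "data", DELETE}       -- carryover pair: dropped by CarryoverRemoveIterator
   {1, "a", "data", INSERT}
   {2, "b", "data", DELETE}       -- update pair: rewritten by the second iterator into
   {2, "b", "new_data", INSERT}   --   UPDATE_BEFORE / UPDATE_AFTER rows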

Contributor:

Correct, we don't mutate them in CarryoverRemoveIterator but in ChangelogIterator when computing update images. However, that seems to happen only for a pair of delete and insert, which should work, I guess.

Contributor:

That seems to work correctly.

@flyrain (Contributor Author) commented May 5, 2023:

That's a good point; I overlooked it at the beginning. Thinking about it a bit more, it wouldn't be an issue in the current use case, as you mentioned. The shared object only happens when there are multiple identical delete rows. For example, assume the following rows go through the RemoveCarryoverIterator:

   {0, "a", "data", DELETE}
   {0, "a", "data", DELETE}
   {3, "a", "new_data", INSERT}

The iterator output will be identical to the input, with one subtle difference: the first two rows share the same object. That is fine in the current use case, since computeUpdates will throw an exception when there are duplicated delete rows.

   {0, "a", "data", DELETE}
   {0, "a", "data", DELETE}
   {3, "a", "new_data", INSERT}

There is a minor risk: if a future iterator is chained after the RemoveCarryoverIterator and tries to mutate one of the delete rows, that would be a surprise, since the mutation would affect multiple rows. Overall, we are good here.

Comment on lines 81 to 83
private boolean popupDeleteRow() {
return (!rowIterator.hasNext() || nextCachedRow != null) && hasDeleteRow();
}
Contributor Author:

We need to return cached delete rows when the iterator hits a boundary.
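
For example, with hypothetical rows:

   {0, "a", "data", DELETE}       <- cached (deletedRowCount = 1)
   {0, "a", "data", DELETE}       <- cached (deletedRowCount = 2)
   {1, "b", "data", INSERT}       <- boundary: stored in nextCachedRow; cached deletes are emitted

The two deletes are flushed as soon as the non-matching INSERT is read, which proves they were real deletes rather than carryovers.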


@flyrain (Contributor Author) commented Apr 28, 2023

Thanks @aokolnychyi for the review. Ready for another look.

@flyrain (Contributor Author) commented Apr 29, 2023

These failures are not related.

TestStructuredStreamingRead3 > [catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}] > testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfRows_1[catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}] FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.apache.iceberg.spark.source.TestStructuredStreamingRead3.testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfRows_1(TestStructuredStreamingRead3.java:176)

TestStructuredStreamingRead3 > [catalogName = testhadoop, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hadoop, cache-enabled=false}] > testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfRows_1[catalogName = testhadoop, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hadoop, cache-enabled=false}] FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.apache.iceberg.spark.source.TestStructuredStreamingRead3.testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfRows_1(TestStructuredStreamingRead3.java:176)

TestStructuredStreamingRead3 > [catalogName = spark_catalog, implementation = org.apache.iceberg.spark.SparkSessionCatalog, config = {type=hive, default-namespace=default, parquet-enabled=true, cache-enabled=false}] > testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfRows_1[catalogName = spark_catalog, implementation = org.apache.iceberg.spark.SparkSessionCatalog, config = {type=hive, default-namespace=default, parquet-enabled=true, cache-enabled=false}] FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.apache.iceberg.spark.source.TestStructuredStreamingRead3.testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfRows_1(TestStructuredStreamingRead3.java:176)

@flyrain (Contributor Author) commented May 1, 2023

retest this please

}

/**
* Pop up the delete rows if there are delete rows cached and the next row is not the same record
Member:

I think this needs a better description; I'm not sure what "pop up" means in this context.

Contributor:

I agree. Would it be correct to say that this method returns true if a previously buffered delete row must be returned? If so, can we adjust the comment and the method name?
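
Something along these lines, reusing the body quoted above (the name is only a suggestion, though the revision quoted further down does use returnCachedDeleteRow):

  // Returns true if a previously buffered delete row must be returned now:
  // either the input is exhausted, or a non-matching row has been cached.
  private boolean returnCachedDeleteRow() {
    return (!rowIterator.hasNext() || nextCachedRow != null) && hasDeleteRow();
  }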

* <li>(id=1, data='b', op='UPDATE_AFTER')
* </ul>
*/
/** An iterator that transforms rows from changelog tables within a single Spark task. */
Member:

Need to describe how this iterator transforms rows

Contributor:

+1. I also think we need some details about the actual algorithm in the doc. They should be added in the implementations, I guess?

Contributor Author:

The detailed algorithms are documented in each subclass, ComputeUpdateIterator and RemoveCarryoverIterator. I also made this class an abstract class. The two public static methods are documented as well. Would that be good?

public static Iterator<Row> computeUpdates()
public static Iterator<Row> removeCarryovers()
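
A hypothetical usage sketch of the two entry points; the parameter lists here are assumptions based on this discussion, not necessarily the exact signatures:

  import java.util.Iterator;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.types.StructType;

  // Hypothetical composition: strip carryovers first, then compute updates.
  static Iterator<Row> toChangelogView(
      Iterator<Row> sortedRows, StructType rowType, String[] identifierFields) {
    Iterator<Row> noCarryovers = ChangelogIterator.removeCarryovers(sortedRows, rowType);
    return ChangelogIterator.computeUpdates(noCarryovers, rowType, identifierFields);
  }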

Member:

As an abstract class I think it's ok to keep this javadoc as is

* <li>(id=2, data='b', op='DELETE')
* </ul>
*/
class CarryoverRemoveIterator extends ChangelogIterator {
Contributor:

Would RemoveCarryoversIterator be a bit more natural and match the procedure param name?

*/
class CarryoverRemoveIterator extends ChangelogIterator {
private final Iterator<Row> rowIterator;
private final int[] indicesForIdentifySameRow;
Contributor:

minor: indicesForIdentifySameRow -> indicesToIdentifySameRow?

* </ul>
*/
class CarryoverRemoveIterator extends ChangelogIterator {
private final Iterator<Row> rowIterator;
Contributor:

minor: Shall we use rowIterator() from the parent instead of having the same var here too?

return sameLogicalRow(currentRow, nextRow)
&& currentRow.getString(changeTypeIndex).equals(DELETE)
&& nextRow.getString(changeTypeIndex).equals(INSERT);
protected boolean isColumnSame(Row currentRow, Row nextRow, int idx) {
Contributor:

optional: It seems the result of this method is always negated. I am wondering whether we want to switch the logic around.

protected boolean isDifferentValue(Row currentRow, Row nextRow, int idx) {
  return !Objects.equals(nextRow.get(idx), currentRow.get(idx));
}

@aokolnychyi (Contributor) commented:

I did a more detailed round. I think the algorithm is correct. I left some minor comments but it seems close.

@github-actions bot added the core label May 5, 2023
@flyrain (Contributor Author) commented May 5, 2023

Thanks @aokolnychyi and @RussellSpitzer for the reviews. Resolved all comments and ready for another look.

}

/**
* Creates an iterator for records of a changelog table.
* Creates an iterator combine with {@link RemoveCarryoverIterator} and {@link
Member:

Creates an iterator composing @link ... and @link

}
}
return true;
@Override
Member:

The class is abstract now; we can drop these.

}

private boolean cachedUpdateRecord() {
return cachedRow != null
Member:

can't this only be UPDATE_AFTER?

Contributor Author:

Yes, it is only UPDATE_AFTER. Would it be clearer to do it this way?

  private boolean cachedUpdateRecord() {
    return cachedRow != null
        && cachedRow.getString(changeTypeIndex()).equals(UPDATE_AFTER);
  }


@Override
public Row next() {
// if there is an updated cached row, return it directly
Member:

I wonder if this comment could note:
// An UPDATE_BEFORE record was returned on the last invocation; this time return the UPDATE_AFTER.

Just to be a little more clear about why this branch exists.

Contributor Author:

I agree. This only returns UPDATE_AFTER.

return row;
}

Row currentRow = currentRow();
Member:

// Either a cached record which is not an UPDATE or the next record in the iterator.

Not sure if this comment is needed, but I need this to remind me what is going on.

Contributor Author:

Nice to have one.


Preconditions.checkState(
nextRowChangeType.equals(INSERT),
"The next row should be an INSERT row, but it is %s. That means there are multiple"
Member:

Cannot X because Y.

"Cannot compute updates because there are multiple rows inserted with the same identifier fields. ...." ?


@Override
public Row next() {
if (returnCachedDeleteRow()) {
Member:

// Non-carryover delete rows found: one or more identical delete rows were seen, followed by a non-identical insert. This means none of the delete rows were carryover rows. Emit one delete row and decrease the count of delete rows seen.

// cache the delete row if there is 0 delete row cached
if (!hasCachedDeleteRow()) {
deletedRow = currentRow;
deletedRowCount++;
Member:

shouldn't this always be 1?
deletedRowCount = 1

Contributor Author:

Good point.


Row nextRow = rowIterator().next();

if (isSameRecord(currentRow, nextRow)) {
Member:

Shouldn't we be comparing deleteRow here?

Contributor Author:

The current row is the delete row if there are cached delete rows.


Row currentRow = currentRow();

if (currentRow.getString(changeTypeIndex()).equals(DELETE) && rowIterator().hasNext()) {
Member:

I think this would be easier to understand as

if (deletedRow != null && rowIterator().hasNext())

And nulling out deletedRow when deletedRowCount is decremented to 0

or you could use

if (hasCachedDeleteRow() && rowIterator().hasNext())

Contributor Author:

The current row could also be the cachedNextRecord, which can be a delete or insert row.


if (currentRow.getString(changeTypeIndex()).equals(DELETE) && rowIterator().hasNext()) {
// cache the delete row if there is 0 delete row cached
if (!hasCachedDeleteRow()) {
Member:

I would pull this out of this if branch since it's separate handling.

Contributor Author:

Not sure I understand you correctly, but the current row could be an insert. We need to check its change type anyway.

@aokolnychyi (Contributor) left a comment:

No other comments apart from what Russell noted.

}
} else {
// mark the boundary since the next row is not the same record as the current row
cachedNextRecord = nextRow;
Member:

I'm not sure this behavior is correct; my gut says there needs to be a while loop in here.

Contributor Author:

Rewrote it with a while loop in the new commit.
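
A minimal sketch of the while-loop shape (hypothetical and simplified; the real iterator also cancels carryover INSERTs against cached DELETEs):

  // Hypothetical shape: keep consuming rows while they match the current
  // record, instead of handling a single lookahead per next() call.
  while (rowIterator().hasNext()) {
    Row nextRow = rowIterator().next();
    if (!isSameRecord(currentRow, nextRow)) {
      cachedNextRecord = nextRow; // boundary: remember the first non-matching row
      break;
    }
    // matching DELETE rows increment the count; a matching INSERT
    // cancels one cached DELETE (a carryover pair)
  }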

@flyrain (Contributor Author) commented May 8, 2023

Thanks @aokolnychyi and @RussellSpitzer for the review. Resolved all comments, ready for another look.

@RussellSpitzer (Member) left a comment:

Thanks for the updates, it seems much clearer to me now. Hopefully I will feel the same when I look at it in the future :)

@flyrain merged commit 08dace7 into apache:master May 9, 2023
31 checks passed
@flyrain (Contributor Author) commented May 9, 2023

Thanks a lot for the reviews, @aokolnychyi @RussellSpitzer !
