
Change Data Capture (CDC) [Draft] #4539

Status: Open. Wants to merge 14 commits into base: main.

Conversation

@flyrain (Contributor) commented Apr 11, 2022:

This is the draft PR for change data capture. It largely aligns with the MVP we discussed in the design doc and on the mailing list.

  1. Emits delete and insert CDC records only.
  2. Creates a Spark action for CDC generation.
  3. Leverages the _deleted metadata column for both pos deletes and eq deletes; deleted rows from either path share the same format.
  4. For row-level deletes, supports both non-vectorized reads and Parquet vectorized reads.

This is still a draft PR, so there are limitations:

  1. Multiple optimizations are possible, for example, pushdown of the _deleted metadata column.
  2. The interface needs to be expanded to support queries by timestamp in addition to snapshot IDs.
  3. More test cases are needed.

Happy to take feedback and file the formal PRs.
cc @aokolnychyi @RussellSpitzer @szehon-ho @jackye1995 @kbendick @karuppayya @chenjunjiedada @stevenzwu @rdblue
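For orientation, a minimal usage sketch of the draft action as it appears in this PR (entry point generateCdcRecords, bound ofSnapshot, result accessor cdcRecords); the names were later debated and renamed in the discussion below, and the SparkActions wiring is assumed:

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Cdc;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

class CdcActionExample {
  // Build the change set for a single snapshot and hand it back as a Spark DataFrame.
  @SuppressWarnings("unchecked")
  static Dataset<Row> changesOf(Table table, long snapshotId) {
    Cdc.Result result = SparkActions.get()
        .generateCdcRecords(table)   // draft entry point added by this PR
        .ofSnapshot(snapshotId)      // emit changes produced by this snapshot only
        .execute();
    // the draft result exposes the records as a plain Object; Spark callers cast to Dataset<Row>
    return (Dataset<Row>) result.cdcRecords();
  }
}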

/**
* Instantiates an action to generate CDC records.
*/
default Cdc generateCdcRecords(Table table) {
Contributor:

Nit: Consider using CDC as all caps instead, like it is in the javadoc comments. For me, it looks a lot cleaner.

Contributor Author:

I'm also open to other names, e.g., ChangeDataSet, ChangeDataCapture

Contributor:

In Flink it's referred to as Changelog.

Contributor:

I feel we should not use acronyms.

Contributor:

Yeah I agree that using a full word would be better.

In comments and even method names it's fine in my opinion but as the main class name it probably would be best to use the full name.

Contributor:

+1 for Changelog. Here, it means generateChangelog

Contributor:

I am OK with generateChangelog.

Contributor (@stevenzwu, Apr 22, 2022):

Or generateChangeSet, since an action is a batch execution. If it were a long-running streaming execution, changelog would be more accurate, as it implies a stream.

Contributor:

I liked generateChangelog but if it can confuse people, generateChangeSet sounds good too.

Contributor Author:

Combining the feedback, I changed it to GetChangeSet. The name GenerateChangeSet is good, but it is way too long; think about the class name BaseGenerateChangeSetSparkActionResult. I admit the verb get is plain compared to generate, but I think it is fine: a plain name is suitable for a tool.

/**
* Instantiates an action to generate CDC records.
*/
default Cdc generateCdcRecords(Table table) {
Contributor:

I feel we should not use acronyms.

Comment on lines 191 to 194
ManifestGroup manifestGroup = new ManifestGroup(table.io(), snapshot.dataManifests(), snapshot.deleteManifests())
.filterData(filter)
.ignoreAdded()
.specsById(((HasTableOperations) table).operations().current().specsById());
Contributor:

[question] Should we also set caseSensitive? It's used in ResidualEvaluator when we call planFiles.

Contributor Author (@flyrain, Apr 21, 2022):

caseSensitive is not necessary here.

return null;
}

String groupID = UUID.randomUUID().toString();
Contributor:

Should we also give it a prefix relevant to the data scanned, something like readRowLevelDeletes-{UUID}?

Contributor Author (@flyrain, Apr 19, 2022):

I assume this is for easier debugging. We probably don't need that, since this UUID is used purely internally; it won't show up in any log or UI.

core/src/main/java/org/apache/iceberg/ManifestGroup.java (outdated, resolved)
Comment on lines 84 to 86
for (int i = 0; i < snapshotIds.size(); i++) {
generateCdcRecordsPerSnapshot(snapshotIds.get(i), i);
}
Contributor:

[question] How about doing this in separate threads? (Maybe in a later iteration of the PR.)

Contributor (@singhpk234, Apr 14, 2022):

DataFrames would be computed lazily, though multithreading could bring some value in constructing the DataFrames (as we would be calling planFiles for each of them). Your thoughts?

Contributor Author:

Yes, we can plan them in parallel if planning performance is an issue. It'd be especially useful for a big table and/or a large number of snapshots.
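For illustration, a rough sketch of how per-snapshot planning could run on separate threads while preserving snapshot order; planSnapshot is a hypothetical stand-in for the per-snapshot planning done in generateCdcRecordsPerSnapshot, and this is not part of the PR:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

class ParallelSnapshotPlanner {
  // Plans each snapshot on its own thread but keeps the resulting DataFrames in snapshot order.
  static List<Dataset<Row>> planInParallel(List<Long> snapshotIds,
                                           Function<Long, Dataset<Row>> planSnapshot,
                                           int parallelism) throws ExecutionException, InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Future<Dataset<Row>>> futures = new ArrayList<>();
      for (Long snapshotId : snapshotIds) {
        futures.add(pool.submit(() -> planSnapshot.apply(snapshotId)));
      }
      List<Dataset<Row>> dfs = new ArrayList<>();
      for (Future<Dataset<Row>> future : futures) {
        Dataset<Row> df = future.get();
        if (df != null) {
          dfs.add(df);  // futures are consumed in submission order, so snapshot order is preserved
        }
      }
      return dfs;
    } finally {
      pool.shutdown();
    }
  }
}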

}

for (int i = 0; i < snapshotIds.size(); i++) {
generateCdcRecordsPerSnapshot(snapshotIds.get(i), i);
Contributor:

I think we can derive the commit order from the file sequence number.

Contributor Author:

Yep, it is part of the design. I simplified it in the draft and will add it later. To be clear, v1 doesn't have sequence numbers, so we will assign a zero-based order.

Comment on lines 105 to 122
// new data file as the insert
Dataset<Row> df = readAppendDataFiles(snapshotId, commitOrder);
if (df != null) {
  dfs.add(df);
}

// metadata deleted data files
Dataset<Row> deletedDf = readMetadataDeletedFiles(snapshotId, commitOrder);
if (deletedDf != null) {
  dfs.add(deletedDf);
}

// pos and eq deletes
Dataset<Row> rowLevelDeleteDf = readRowLevelDeletes(snapshotId, commitOrder);
if (rowLevelDeleteDf != null) {
  dfs.add(rowLevelDeleteDf);
}
}
Contributor:

I think dfs should add deleteDf first and then appendDataDf, because the deletes apply to old snapshots. For example:
table xxx (id, data) with primary key id.
snap 1 -> insert (1, aa)
snap 2 -> delete (1, aa), insert (1, bb)
If, for snap 2, we add insert (1, bb) first and delete (1, aa) second in dfs (dfs is a LinkedList), the final result downstream is a delete of primary key id=1.

Contributor Author (@flyrain, Apr 21, 2022):

I can definitely make the change, although I don't think it makes any difference. Basically, these CDC records (e.g., (1, aa, D) and (1, bb, I)) within a snapshot don't have any order; please check the discussion in the design doc. We should treat them the same in terms of order while ingesting them downstream. Here, we cannot delete the row (1, aa) by just checking its id; instead, we may delete it by checking both values.

Contributor (@stevenzwu, Apr 21, 2022):

I agree with @hameizi on emitting eq deletes before the appends, as eq deletes are applied to the previous snapshots. We should apply/emit eq deletes before the appended rows of the current snapshot.

I also left a comment regarding pos deletes in issue #3941. I am not sure we need to emit pos deletes, as they are already applied as a delete filter for the rows inserted in the current snapshot. It is like they are squashed away already.

Contributor Author:

I'm OK with that, will make the change.

For pos deletes, there are two scenarios:

  1. They delete rows from previous snapshots. We need to emit them like other deleted rows.
  2. They delete rows within the same snapshot. We can squash them by default; I use a flag ignoreRowsDeletedWithinSnapshot to control that in case people don't want to squash them.

Contributor:

My understanding is that pos deletes are only applicable to files appended in the same snapshot and eq deletes are only applied to the previous snapshots.

Contributor Author:

Spark does, and it doesn't use eq deletes at all. For merge, delete, and update, Spark uses pos deletes to remove rows from previous snapshots.


package org.apache.iceberg.actions;

public interface Cdc extends Action<Cdc, Cdc.Result> {
Contributor:

Is this an action? Actions typically modify the table, like rewrite or expire snapshots.

Contributor Author:

It is. An action usually modifies the table, but it is not necessarily limited to that. This is the first PR; we will also explore a way to use a scan for CDC.

Contributor:

All the existing actions in Iceberg seem to modify the table. Since this is a public API, I'd like to double-check. Can we keep it in SparkActions only for now while we are in the experimental phase?

Conceptually, this is a scan/read (not a maintenance action).

Contributor Author (@flyrain, Apr 21, 2022):

Agreed. This is the starting point. As I said in another comment, we need a well-designed scan interface first.

Contributor (@aokolnychyi, Apr 22, 2022):

I wouldn't mind having an action like this. We have RemoveOrphanFiles that does not modify the table state, for instance.

I would match the naming style we have in other actions, though. I think it can be GenerateChangelog as all other actions start with a verb.

Contributor (@stevenzwu, Apr 22, 2022):

RemoveOrphanFiles is also a maintenance action on the table. I agree that technically it doesn't modify the table state, but it is a garbage collection action for tables. Those orphaned files may have been part of the table before but got unreferenced due to compaction or retention purges, or they were intended to be added to the table but got aborted in the middle. To me, it is a modification of the broader env/infra of the table.

Contributor:

The primary goal of adding the Action API was to provide scalable and ready-to-use recipes for common scenarios. I am not sure we should strictly restrict the usage to maintenance.

That being said, I'd be happy to discuss alternative ways to expose this.

@@ -41,7 +41,7 @@
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.ParallelIterable;

class ManifestGroup {
public class ManifestGroup {
Contributor:

If we use scan API, we may not need to expose ManifestGroup as public

Contributor Author:

It'd be awesome. We have to design the scan API nicely though.

Contributor:

I am not sure about extending TableScan (can be convinced), but we can definitely create a utility class in core that will internally use ManifestGroup. I don't think we should expose this class. Maybe it is safer to start with a utility and then see if it is something that can be part of TableScan.

Contributor (@stevenzwu, Apr 22, 2022):

@aokolnychyi we are not talking about extending TableScan. We are discussing introducing new scan interfaces to support incremental scan (appends only and CDC). Please see PR 4580 : #4580 (comment)

}

PositionSetDeleteMarker<T> deleteMarker = new PositionSetDeleteMarker<>(rowToPosition, deleteSet, markDeleted);
return deleteMarker.filter(rows);
Contributor:

Since PositionSetDeleteMarker always returns true, we are not actually filtering; it seems we are mainly leveraging the filter to traverse the iterable and get the side effect of calling markDeleted on matched rows. The semantics are a little odd to me. Maybe we don't need to introduce the PositionSetDeleteMarker filter?

Contributor:

Maybe just use this method from CloseableIterable to traverse? It returns the same object after applying the transform function.

  static <I, O> CloseableIterable<O> transform(CloseableIterable<I> iterable, Function<I, O> transform) {
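A minimal sketch of that suggestion, assuming a generic position-lookup predicate in place of the PR's actual delete-index type:

import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

import org.apache.iceberg.io.CloseableIterable;

class DeleteMarking {
  // Traverse with CloseableIterable.transform and mark matching rows as deleted,
  // instead of a Filter whose shouldKeep() always returns true.
  static <T> CloseableIterable<T> markDeletedRows(CloseableIterable<T> rows,
                                                  Function<T, Long> rowToPosition,
                                                  Predicate<Long> isDeletedPosition,
                                                  Consumer<T> markRowDeleted) {
    return CloseableIterable.transform(rows, row -> {
      if (isDeletedPosition.test(rowToPosition.apply(row))) {
        markRowDeleted.accept(row);  // side effect: set the row's _deleted metadata flag
      }
      return row;                    // rows are never dropped here, only marked
    });
  }
}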

Collaborator (@chenjunjiedada, Apr 21, 2022):

I think it traverses the rows and sets the is_deleted value on the row. +1 to using transform from CloseableIterable. The Filter interface makes it hard to understand.

Contributor:

Or maybe we can add a new interface called Marker to mark these rows. I think it would be easier to understand than using CloseableIterable.

};
}

protected void markRowDeleted(T item) {
Contributor:

Should this just be an abstract method, forcing implementing classes to provide it?

Contributor Author:

Probably; it needs changes for all subclasses of DeleteFilter. This PR is already big, so maybe we should split it. For example, we can have a separate PR to read deleted rows in the case of row-level changes. cc @chenjunjiedada

Collaborator:

The previous PR that reads the deleted rows has such an implementation; +1 to using another PR.

Contributor:

+1 for using an abstract method and forcing implementations to provide it. I also think the changes to Deletes and DeleteFilter should be implemented in a separate PR, because implementing CDC reads for Flink will need these changes too.
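A sketch of what the abstract-method variant could look like on a heavily simplified DeleteFilter (the real class has many more members):

public abstract class DeleteFilter<T> {

  // ... existing filtering logic ...

  // Each reader (Spark RowDataReader, Flink readers, ...) knows how to set the _deleted
  // metadata column on its own row representation, so it must implement this.
  protected abstract void markRowDeleted(T item);
}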

}

@Override
public Object cdcRecords() {
Contributor:

maybe make Cdc.Result a generic class to get type info?

Contributor Author:

Good idea, will make the change.
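One possible shape for the parameterized result, as a sketch rather than the final API:

import org.apache.iceberg.actions.Action;

public interface Cdc<T> extends Action<Cdc<T>, Cdc.Result<T>> {

  interface Result<T> {
    // Spark would bind T to Dataset<Row>; other engines can bind their own representation.
    T cdcRecords();
  }
}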

"The fromSnapshot(%s) is not an ancestor of the toSnapshot(%s)", fromSnapshotId, toSnapshotId);

snapshotIds.clear();
// include the fromSnapshotId
Contributor (@stevenzwu, Apr 21, 2022):

Exclusive behavior for fromSnapshotId might be easier to work with, assuming the last enumerated position (toSnapshotId) needs to be saved somewhere. The next run could then just set fromSnapshotId to the saved position.

I believe that is the intention behind TableScan#appendsBetween having the (fromSnapshotId, toSnapshotId] semantics.

Contributor Author:

Makes sense. One concern is that it cannot get the first snapshot of a table. Two solutions:

  1. Add a new interface like toSnapshot(), which computes all snapshots up to the toSnapshot.
  2. Add an overload between(long fromSnapshotId, long toSnapshotId, boolean includeFromSnapshotId).

What do you think?

Contributor (@stevenzwu, Apr 22, 2022):

What do you mean by "get the first snapshot of a table"?

Add a new interface like toSnapshot(), which computes all snapshots up to the toSnapshot.

In PR #4580, this is covered when fromSnapshotId is not set, which means null is used. Then IncrementalScan would process all snapshots up to the toSnapshotId: as we trace the ancestor chain of toSnapshotId, we eventually reach a null snapshotId and stop.

Add an overload between(long fromSnapshotId, long toSnapshotId, boolean includeFromSnapshotId).

This API is not necessary; we can just set the fromSnapshotId to the parent snapshot id if we want the inclusive behavior for the very first scan.

Contributor Author:

null can solve the issue. However, I was trying to avoid it since it doesn't have a clear meaning to users:

  1. Users don't know they can pass null.
  2. Users don't know what will happen when they pass null.

So users have to rely on the docs to figure that out. I'd suggest providing a new interface, beforeSnapshot() or toSnapshot(). It does the same thing under the hood, but is more meaningful to users. Another benefit is that it'd be logically complete, since @aokolnychyi also proposed the method afterSnapshot() here: #4539 (comment).

Contributor:

Yeah, @rdblue also made a similar suggestion: #4580 (comment). We will go with fromSnapshotInclusive and fromSnapshotExclusive in the new incremental scan interface.
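A sketch of how those bounds read with the incremental append scan API that PR #4580 was heading toward (the exact method names are assumed from that discussion):

import org.apache.iceberg.IncrementalAppendScan;
import org.apache.iceberg.Table;

class IncrementalBounds {
  // Resume from a saved position: exclusive lower bound, inclusive upper bound.
  static IncrementalAppendScan resumeScan(Table table, long lastProcessedSnapshotId, long currentSnapshotId) {
    return table.newIncrementalAppendScan()
        .fromSnapshotExclusive(lastProcessedSnapshotId)  // (lastProcessed, current]
        .toSnapshot(currentSnapshotId);
  }

  // Very first scan of a table: include the starting snapshot itself.
  static IncrementalAppendScan firstScan(Table table, long firstSnapshotId, long currentSnapshotId) {
    return table.newIncrementalAppendScan()
        .fromSnapshotInclusive(firstSnapshotId)
        .toSnapshot(currentSnapshotId);
  }
}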

return new BaseCdcSparkActionResult(outputDf);
}

private void generateCdcRecordsPerSnapshot(long snapshotId, int commitOrder) {
Contributor:

nit: Maybe return the Datasets from this method (instead of updating a class variable); then the caller can do the union, e.g., using the reduce method from the Java Stream API.
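A minimal sketch of that suggestion: each per-snapshot call returns its own Dataset and the caller folds them together:

import java.util.List;
import java.util.Optional;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

class ChangeSetUnion {
  // Union the per-snapshot change sets; empty input yields Optional.empty().
  static Optional<Dataset<Row>> unionAll(List<Dataset<Row>> perSnapshotChanges) {
    return perSnapshotChanges.stream().reduce(Dataset::unionByName);
  }
}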

generateCdcRecordsPerSnapshot(snapshotIds.get(i), i);
}

Dataset<Row> outputDf = null;
Contributor:

can we use unionAll?

Contributor Author:

We expect the schemas of these datasets to be the same, so there is no difference between unionAll and unionByName. BTW, unionAll only unions one other dataset at a time, not a list of datasets.

}

for (int i = 0; i < snapshotIds.size(); i++) {
generateCdcRecordsPerSnapshot(snapshotIds.get(i), i);
Contributor:

We pass the index of the snapshot list as commitOrder. What is the intended usage of commitOrder?

Contributor Author:


import static org.apache.spark.sql.functions.lit;

public class BaseCdcSparkAction extends BaseSparkAction<Cdc, Cdc.Result> implements Cdc {
Contributor:

A high-level question: since this algorithm doesn't guarantee ordering within a snapshot (for a good reason), does that mean we can only have parallel executors if it scans a single snapshot? If it scans multiple snapshots, parallel executors could mess up the ordering.

Contributor Author:

We have the commit order metadata column to indicate the order of CDC records from multiple snapshots, so the order won't be messed up.

Contributor:

does that mean we can only have parallel executors if it only scans one snapshot? If this scans multiple snapshots, parallel executors could mess up the ordering.

Contributor Author:

No, they won't. The order across multiple snapshots is kept by the metadata column _commit_order, even with parallel executors. Each snapshot gets a unique _commit_order for its change set, no matter how parallel the read is.

return withCdcColumns(scanDF, snapshotId, "I", commitOrder);
}

private List<FileScanTask> planAppendedFiles(long snapshotId) {
Contributor (@stevenzwu, Apr 21, 2022):

Currently, this is how this implementation discovers the files appended in the specified snapshot:

  • discover all files reachable from the snapshotId via time travel
  • find all files added in this snapshot
  • filter the files discovered in the first step against the appended data files discovered in the second step

This method can probably be simplified. We can use TableScan#appendsBetween(parentSnapshotId, snapshotId)#planFiles to get the FileScanTask collection, as in the sketch below.

Then we can apply the ignoreRowsDeletedWithinSnapshot transformation if needed.
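A sketch of the simplified planning, assuming the snapshot has a parent (the first snapshot of a table would need special handling):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

class AppendedFilePlanner {
  // Only the files appended between the parent snapshot and this snapshot are planned,
  // instead of time-traveling to the snapshot and filtering a full table scan afterwards.
  static List<FileScanTask> planAppendedFiles(Table table, long snapshotId) {
    Snapshot snapshot = table.snapshot(snapshotId);
    long parentId = snapshot.parentId();  // assumes a parent exists
    try (CloseableIterable<FileScanTask> tasks =
        table.newScan().appendsBetween(parentId, snapshotId).planFiles()) {
      return Lists.newArrayList(tasks);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to close file scan tasks", e);
    }
  }
}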

Contributor Author:

sure, let me make the change


private Dataset<Row> withCdcColumns(Dataset<Row> df, long snapshotId, String cdcType, int commitOrder) {
return df.withColumn(RECORD_TYPE, lit(cdcType))
.withColumn(COMMIT_SNAPSHOT_ID, lit(snapshotId))
Contributor:

What is the intended usage of those three commit-related metadata columns?

Contributor Author:

RECORD_TYPE indicates the change type (I, D, -U, +U).
COMMIT_SNAPSHOT_ID indicates which snapshot produced the change record.
COMMIT_TIMESTAMP indicates when the change happened.
COMMIT_ORDER indicates the change order in the case of multiple snapshots.
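For example, a downstream consumer could order the change set across snapshots with these columns (a sketch using the _commit_order and _commit_timestamp constants defined later in this PR):

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

class ChangeSetOrdering {
  // Sort the emitted change records by commit order (and timestamp) before applying them downstream.
  static Dataset<Row> inCommitOrder(Dataset<Row> changeSet) {
    return changeSet.orderBy(col("_commit_order"), col("_commit_timestamp"));
  }
}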

}

@Test
public void testMergeWithOnlyUpdateClause() throws ParseException, NoSuchTableException {
Contributor Author (@flyrain, Apr 21, 2022):

@stevenzwu, here is an example that Spark uses pos deletes in the merge command.

Contributor:

Are you saying Spark MERGE INTO writes pos deletes? If so, it doesn't answer my question of why we need to emit pos deletes, as pos deletes are already applied as delete filters when emitting insert records for the same snapshot.

Contributor Author:

As I mentioned here, #4539 (comment), we only apply the pos deletes for data files generated in the same snapshot. We still emit pos deletes when they delete rows from previous snapshots.

* @param snapshotId id of the snapshot to generate changed data
* @return this for method chaining
*/
Cdc ofSnapshot(long snapshotId);
Contributor:

If we decide to include a verb in the action name, I'd also consider renaming these methods like this:

// changes for a specific snapshot
GenerateChangelog forSnapshot(long snapshotId);

// changes starting from a particular snapshot (exclusive)
GenerateChangelog afterSnapshot(long fromSnapshotId);

// changes from start snapshot (exclusive) to end snapshot (inclusive)
GenerateChangelog betweenSnapshots(long fromSnapshotId, long toSnapshotId);

*
* @return this for method chaining
*/
Cdc ofCurrentSnapshot();
Contributor (@aokolnychyi, Apr 22, 2022):

In which case will this method be useful? I guess consumers will probably always want to consume from a particular point in time so I am not sure CDC records for a current snapshot would be very helpful.

Contributor:

+1. For incremental reads, we need to bookkeep the position. An implicit current snapshot would make that bookkeeping impossible and hence is probably not a valid use case.

* A Dummy Vector Reader which doesn't actually read files, instead it returns a dummy
* VectorHolder which indicates whether the row is deleted.
*/
public static class DeletedVectorReader extends VectorizedArrowReader {
Collaborator:

A basic question, do we need to enable vectorization to read CDC? What about the non-vectorized reader?

Contributor Author:

Vectorized reads are significantly faster than non-vectorized reads; check the benchmark in #3287. We should have it. Non-vectorized reads are handled by the changes I made in RowDataReader, DeleteFilter, and Deletes.

@@ -226,10 +266,14 @@ private CloseableIterable<T> applyPosDeletes(CloseableIterable<T> records) {

// if there are fewer deletes than a reasonable number to keep in memory, use a set
if (posDeletes.stream().mapToLong(DeleteFile::recordCount).sum() < setFilterThreshold) {
return Deletes.filter(records, this::pos, Deletes.toPositionIndex(filePath, deletes));
return hasMetadataColumnIsDeleted ?
Deletes.marker(records, this::pos, Deletes.toPositionIndex(filePath, deletes), this::markRowDeleted) :
Collaborator:

I feel we should abstract these later.

if (needWrapDeleteColumn) {
    Deletes.map(records, this::pos, Deletes.toPositionIndex(filePath, deletes), this::markRowDeleted);
} else {
    Deletes.filter(..)
}

What about adding a class DeleteStream that supports filter and map?

Collaborator:

The DeleteStream would contain the filtering logic that is in this DeleteFilter and also the transforming logic.

Contributor Author:

We can definitely refactor here; I've got some ideas in mind. Trying to understand your point: do you mean a class DeleteStream that combines the functionality of both PositionSetDeleteFilter and PositionSetDeleteMarker? I will test these ideas anyway. Thanks for the suggestion.

Collaborator:

Yes, like the Java Stream API.

@@ -70,8 +70,7 @@ public class TestSparkParquetReadMetadataColumns {
private static final Schema PROJECTION_SCHEMA = new Schema(
required(100, "id", Types.LongType.get()),
required(101, "data", Types.StringType.get()),
MetadataColumns.ROW_POSITION,
MetadataColumns.IS_DELETED
Collaborator:

Why do we change these unit tests?

Contributor Author (@flyrain, Apr 22, 2022):

Some tests (e.g., testReadRowNumbersWithDelete) failed because they use PROJECTION_SCHEMA, which contains the column IS_DELETED. The row count was wrong since the tests still assume only undeleted rows are returned, while with the new logic both deleted and undeleted rows are returned.
To fix it, we can either remove the column IS_DELETED so that deleted rows won't be in the results, or change the test cases to filter out the deleted rows from the results. I chose the former so that it fixes multiple failures at once.

Contributor (@aokolnychyi) left a review comment:

I did a review of the action. I think it shows that the algorithm we discussed in the issue can be implemented fairly easily. I'd say supporting is_deleted is actually more challenging.

@flyrain, what about creating a separate PR for is_deleted? I think these two are independent. We can approach the vectorized and non-vectorized readers separately.

/**
* Instantiates an action to generate a change data set.
*/
default GetChangeSet getChangeSet(Table table) {
Contributor:

For other reviewers: we had a discussion on the name in this thread.

I understand the concern about long class names but getXXX usually indicates something already exists and we simply return it. In this case, though, we actually perform quite some computation to build/generate a set of changes. Using generate or build would consume just a few extra chars but will be more descriptive, in my view. All public methods and classes will be short enough.

Contributor:

Anton made a good argument here. generate is not much longer

*
* @return this for method chaining
*/
GetChangeSet forCurrentSnapshot();
Contributor:

@stevenzwu and I had a short discussion on whether this method is useful. @flyrain, do you have a valid use case for this method in mind?

/**
* Returns the change set.
*/
Object changeSet();
Contributor:

I think we may want to parameterize this and use Dataset<Row> in Spark instead of plain Object.

Contributor Author:

Made it return a generic type.

public static final String COMMIT_TIMESTAMP = "_commit_timestamp";
public static final String COMMIT_ORDER = "_commit_order";

private final List<Long> snapshotIds = Lists.newLinkedList();
Contributor:

While using a single snapshot ID list is convenient, the downside is that we can't validate whether there are conflicting calls to betweenSnapshots, forSnapshot, etc. In SparkScanBuilder and BaseTableScan, we took a different approach.

Contributor Author (@flyrain, Apr 22, 2022):

We should consider them incompatible, right? For example, if you do

getChangeSet(table).betweenSnapshots(s1, s2).forCurrentSnapshot(s3).execute()

it should only honor the later call, which is forCurrentSnapshot(s3). If you flip the order like this

getChangeSet(table).forCurrentSnapshot(s3).betweenSnapshots(s1, s2).execute()

it should only pick up snapshots between s1 and s2.
The current logic supports this already; basically, it clears the list every time before adding any snapshot IDs to it.
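For reference, a sketch of the stricter SparkScanBuilder/BaseTableScan-style alternative that fails fast on conflicting calls; the field and method names here are illustrative, not from the PR:

import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class ChangeSetBounds {
  private Long forSnapshotId = null;
  private Long fromSnapshotId = null;
  private Long toSnapshotId = null;

  ChangeSetBounds forSnapshot(long snapshotId) {
    Preconditions.checkArgument(fromSnapshotId == null && toSnapshotId == null,
        "Cannot combine forSnapshot with betweenSnapshots");
    this.forSnapshotId = snapshotId;
    return this;
  }

  ChangeSetBounds betweenSnapshots(long fromId, long toId) {
    Preconditions.checkArgument(forSnapshotId == null,
        "Cannot combine betweenSnapshots with forSnapshot");
    this.fromSnapshotId = fromId;
    this.toSnapshotId = toId;
    return this;
  }
}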


String groupID = UUID.randomUUID().toString();
FileScanTaskSetManager manager = FileScanTaskSetManager.get();
manager.stageTasks(table, groupID, fileScanTasks);
Contributor:

We stage tasks but we never invalidate them, which will cause a memory leak. It will be tricky to invalidate them, as we don't know when it is safe to do so. We could instruct the manager to remove tasks after the first access or make the action result closable, as sketched below. Let's think about what will be best.
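A sketch of the closable-result option, assuming the manager exposes a removeTasks(table, setID) cleanup method (as the rewrite flow uses); the exact hook is still open:

import org.apache.iceberg.Table;
import org.apache.iceberg.spark.FileScanTaskSetManager;

class StagedTaskCleanup implements AutoCloseable {
  private final Table table;
  private final String groupID;

  StagedTaskCleanup(Table table, String groupID) {
    this.table = table;
    this.groupID = groupID;
  }

  @Override
  public void close() {
    // free the staged tasks once the caller is done with the change set
    FileScanTaskSetManager.get().removeTasks(table, groupID);
  }
}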

}

private List<FileScanTask> planAppendedFiles(long snapshotId) {
CloseableIterable<FileScanTask> fileScanTasks = table.newScan()
Contributor:

This is a full table scan. If you use a utility based on ManifestGroup, you would get much better performance.

Contributor:

yeah, I made a similar comment here: #4539 (comment)

fileScanTasks.forEach(fileScanTask -> {
if (fileScanTask.file().content().equals(FileContent.DATA) && dataFiles.contains(fileScanTask.file().path())) {
FileScanTask newFileScanTask = fileScanTask;
if (!ignoreRowsDeletedWithinSnapshot) {
Contributor:

I am not sure this is correct. Records added and removed in the same snapshot must have a lesser commit order compared to all other records added in that snapshot. I'd not add this functionality to start with.

Contributor Author:

Yes, order is tricky when we output both deletes and inserts. I will remove the option to do that. It always applies the deletes within the same snapshot.


private List<FileScanTask> planMetadataDeletedFiles(long snapshotId) {
Snapshot snapshot = table.snapshot(snapshotId);
ManifestGroup manifestGroup = new ManifestGroup(table.io(), snapshot.dataManifests(), snapshot.deleteManifests())
Contributor:

Do we have to access all data manifests? Aren't we interested only in data manifests added in this snapshot? Do we also need delete manifests for this?

ManifestGroup manifestGroup = new ManifestGroup(table.io(), snapshot.dataManifests(), manifestFiles)
.filterData(filter)
.onlyWithRowLevelDeletes()
.specsById(((HasTableOperations) table).operations().current().specsById());
Contributor:

nit: why not use table.specs()?

Contributor Author:

Sure, will make the change. Double checked, they point to the same object.

@@ -185,14 +204,35 @@ private CloseableIterable<T> applyEqDeletes(CloseableIterable<T> records) {
.reduce(Predicate::and)
.orElse(t -> true);

Filter<T> remainingRowsFilter = new Filter<T>() {
Filter<T> remainingRowsFilter = hasMetadataColumnIsDeleted ? getMarker(remainingRows) : getFilter(remainingRows);
Contributor:

Maybe we shouldn't use this implicit way of deciding whether to keep deleted records. We can add the _deleted column by default, use different methods to decide whether to keep the deleted records, and then use different methods to get the required data in different situations.

Contributor Author:

Yeah, I think the delete logic is cleaner in that way. We are more or less in that direction.

@flyrain (Contributor Author) commented Jan 12, 2023:

There will be no update on this PR. Please check the latest progress here, https://github.com/apache/iceberg/projects/26.

GenerateChangeSet forCurrentSnapshot();

/**
* Emit changed data from a particular snapshot(exclusive).
Member:

I think "from" here is ambiguous, probably "since" is a bit closer?
