Incremental processing implementation #315
Force-pushed from 730b85f to dcc4020
One blocking comment: can we rename startSnapshotId to exclusiveStartSnapshotId to be explicit?
 * @return a table scan which can read incremental data from {@param fromSnapshotId}
 * exclusive and up to {@toSnapshotId} inclusive
 */
TableScan newIncrementalScan(long fromSnapshotId, long toSnapshotId);
TODO: What about when the user runs an incremental scan for the first time (bootstrap)? What should fromSnapshotId, or even toSnapshotId, be?
For new tables, from will be the first snapshot ID. For existing tables, where the oldest snapshot has already expired, users will need to choose a starting snapshot. Usually, that would be after running a batch process to handle existing table data before the incremental process starts. So it would be the snapshot on which the batch process ran.
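To make that bootstrap choice concrete, here is a minimal sketch in plain Java. This is a toy model, not the Iceberg API; the class and method names are made up for illustration, and the snapshot history is just an oldest-first list of ids:

```java
import java.util.List;

public class BootstrapSnapshot {
    // For a new table, the incremental read can start from the first snapshot,
    // so every append is picked up. For an existing table whose oldest
    // snapshots have expired, the caller supplies the snapshot that a batch
    // backfill ran on, and the incremental process resumes after it.
    static long startingSnapshotId(List<Long> history, Long batchSnapshotId) {
        if (batchSnapshotId != null) {
            return batchSnapshotId;  // existing table: resume after the backfill
        }
        return history.get(0);       // new table: first snapshot ever committed
    }
}
```

For example, a brand-new table with history `[1, 2, 3]` would start from snapshot 1, while a pre-existing table whose backfill ran on snapshot 2 would start from 2.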
I think it is essential that we can start streaming from a large table and have multiple batches for this. @rdblue, can you elaborate on the batch process you mentioned?
I've also summarized my thoughts on requirements for Structured Streaming sources in this comment. Let me know if that makes sense to you, @rdsr @rdblue.
Splitting a snapshot into multiple batches will require some thinking. Maybe this can be done by streaming sources. For example, our implementation of Offset in Spark can store additional information. Storing consumed files in offset JSON files isn't an option, but maybe we can store a pointer to a list of consumed manifests or something (which can be an Avro or a Parquet file, for example).
That sounds reasonable to me, but I think it should also be possible to start consuming from a specific snapshot, since processing the rest of the table seems needlessly expensive in a lot of cases. We move jobs from batch to streaming and don't want to rewrite the history.
Absolutely agree, consuming from a given snapshot covers most use cases. At the same time, there are scenarios where Structured Streaming in Spark is used beyond pure streaming, for example. I just want to capture those use cases as well.
When using incremental scan for the first time (bootstrap), I think fromSnapshotId should be null, since the contract is that an incremental scan gives all the appends between fromSnapshotId exclusive and toSnapshotId inclusive. When fromSnapshotId is null, we will return all the appends up to toSnapshotId. In this way a user will not miss any data.
Thoughts?
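To illustrate that contract, here is a self-contained sketch in plain Java. It models the snapshot lineage as a toy parent-pointer map (child id to parent id) rather than the Iceberg API, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IncrementalRange {
    // Snapshot ids covered by an incremental scan: (fromSnapshotId, toSnapshotId].
    // A null fromSnapshotId models the bootstrap case: every snapshot up to and
    // including toSnapshotId is returned, so the first read misses no appends.
    static List<Long> snapshotsToRead(Map<Long, Long> parentOf, Long from, long to) {
        List<Long> ids = new ArrayList<>();
        Long current = to;
        while (current != null && !current.equals(from)) {
            ids.add(0, current);              // prepend, so the result is oldest-first
            current = parentOf.get(current);  // walk the parent pointers backwards
        }
        return ids;
    }
}
```

With lineage 1 ← 2 ← 3, `snapshotsToRead(parents, null, 3L)` yields `[1, 2, 3]` (bootstrap), while `snapshotsToRead(parents, 1L, 3L)` yields `[2, 3]`, i.e. (1, 3].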
How about using Table#newScan for the first time, and then calling TableScan#snapshot to find out up to which snapshot the first read covered?
}

protected CloseableIterable<FileScanTask> planFiles(TableOperations ops, Snapshot snapshot,
                                                    Expression rowFilter, boolean caseSensitive, boolean colStats,
nit: formatting of the params
Will it be possible to use this API for #179?
Hi @aokolnychyi, I haven't looked too closely into #179. I'll read that PR and get back to you.
@rdblue, any comments on this updated PR?
@rdsr, sorry for the delay! I didn't notice that this was updated or see your request. Sorry!
@rdsr, thanks for working on this! I really like how clean this version is. Definitely improving quickly!
This has been lagging for a while since I didn't get time to address the comments. I plan to take this up again in the coming weeks.
Thanks for working on this @rdsr. We have a similar use case, so LMK if you need help testing this.
Nice work @rdsr, are you also thinking of an API that's exposed under
core/src/main/java/org/apache/iceberg/IncrementalDataTableScan.java
    table().snapshot(newFromSnapshotId) != null, "fromSnapshotId: %s does not exist", newFromSnapshotId);
Preconditions.checkArgument(
    table().snapshot(newToSnapshotId) != null, "toSnapshotId: %s does not exist", newToSnapshotId);
return new IncrementalDataTableScan(
Should this ensure that newFromSnapshot is an ancestor of newToSnapshot?
This probably doesn't need to be done in this commit, but it would be a good follow-up to ensure that the range exists.
Since this is a refinement, it may also be a good idea to make this a subset of the existing selected range. That is, both newFromSnapshotId and newToSnapshotId must be in the existing range of fromSnapshotId to toSnapshotId.
Putting it another way, when I create a scan using appendsBetween(A, C).appendsBetween(B, D), what should the behavior be? I'd say that is concerning because D is outside the original range. Probably a good idea to fail instead.
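One way to express that fail-fast refinement rule is sketched below. This is a toy model, not the actual IncrementalDataTableScan code: the ancestry is an oldest-first list of snapshot ids, and the class and method names are made up:

```java
import java.util.List;

public class ScanRefinement {
    // Hypothetical validation for refining an incremental scan: the new
    // (newFrom, newTo] range must be a subset of the already selected
    // (oldFrom, oldTo] range, otherwise the refinement fails loudly.
    static void checkRefinement(List<Long> ancestry, long oldFrom, long oldTo,
                                long newFrom, long newTo) {
        int lo = ancestry.indexOf(oldFrom);   // -1 if the snapshot is unknown
        int hi = ancestry.indexOf(oldTo);
        int nlo = ancestry.indexOf(newFrom);
        int nhi = ancestry.indexOf(newTo);
        if (nlo < lo || nhi > hi || nlo > nhi) {
            throw new IllegalArgumentException(String.format(
                "(%s, %s] is not within the selected range (%s, %s]",
                newFrom, newTo, oldFrom, oldTo));
        }
    }
}
```

Under this check, `appendsBetween(A, C).appendsBetween(B, D)` would throw because D lies outside the original (A, C] range, while `appendsBetween(A, D).appendsBetween(B, C)` would be accepted as a genuine narrowing.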
Thank you, I'll file a ticket for the follow-up!
    manifestEntry.status() == ManifestEntry.Status.ADDED;

return planFiles(
    tableOps(), snapshot, filter(), isCaseSensitive(), colStats(), matchingManifests, matchingManifestEntries);
I think this is correct.
As a follow-up, we can change the logic slightly to only require one ManifestGroup. As long as we read each manifest that was added in a selected snapshot only once, and only select ADDED files with the right snapshot ID, we can do the planning in a single run.
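The single-pass selection rule can be sketched as follows. This is a simplified stand-in, not Iceberg's ManifestGroup: manifest entries are modeled as a small record with a file path, the committing snapshot id, and a status:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SinglePassPlanning {
    enum Status { ADDED, EXISTING, DELETED }

    // Toy model of a manifest entry: the file it tracks, the snapshot that
    // committed it, and its status within the manifest.
    record Entry(String file, long snapshotId, Status status) {}

    // One pass over all entries: keep only files that were ADDED by one of
    // the snapshots selected for the incremental scan. Each entry is read
    // exactly once, so no second planning run is needed.
    static List<String> addedFiles(List<Entry> entries, Set<Long> selectedSnapshots) {
        return entries.stream()
            .filter(e -> e.status() == Status.ADDED)
            .filter(e -> selectedSnapshots.contains(e.snapshotId()))
            .map(Entry::file)
            .collect(Collectors.toList());
    }
}
```

The two filters mirror the comment above: status must be ADDED, and the snapshot id must be in the selected range, so EXISTING entries carried forward from older snapshots are never double counted.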
I think this makes sense. I'll definitely tackle this in a follow-up. Thanks!
@rdsr, only minor things left! Thanks for all your work on this, it's really coming together nicely.
One of the commits in the PR was not applying cleanly. I had to squash all commits into a single commit and fix diff issues, so the PR is now a single commit. Sorry about that; it may make the new changes harder to review.
Force-pushed from 1352de4 to f8cf81c
String specString = PartitionSpecParser.toJson(spec);
ResidualEvaluator residuals = ResidualEvaluator.of(spec, dataFilter, caseSensitive);
return CloseableIterable.transform(entries, e -> new BaseFileScanTask(
    e.copy().file(), schemaString, specString, residuals));
Later, we will probably want to detect whether any stats column was projected and use file.copy() or file.copyWithoutStats() depending on the projection.
@@ -42,6 +42,16 @@ private SnapshotUtil() {
    return ancestorIds(table.currentSnapshot(), table::snapshot);
  }

  /**
   * @return List of snapshot ids in the range - (fromSnapshotId, toSnapshotId]
   * This method assumes that fromSnapshotId is an ancestor of toSnapshotId
   */
I don't see any place that checks whether the from snapshot is an ancestor of the to snapshot. That seems like a requirement for this to work correctly to me.
Yes. It will be in the follow-up
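The missing check can be sketched like this. It is a toy model over a parent-pointer map (child snapshot id to parent snapshot id), not the actual SnapshotUtil code, and the names are hypothetical:

```java
import java.util.Map;

public class AncestorCheck {
    // The validation promised for the follow-up: walk back from toId along
    // parent pointers until we either reach fromId (it is an ancestor) or
    // run out of history (it is not, e.g. it sits on a different branch).
    static boolean isAncestorOf(Map<Long, Long> parentOf, long fromId, long toId) {
        Long current = toId;
        while (current != null) {
            if (current == fromId) {
                return true;
            }
            current = parentOf.get(current);  // null once the root is passed
        }
        return false;
    }
}
```

With a lineage 1 ← 2 ← 3 and a side branch 1 ← 5, snapshot 1 is an ancestor of 3, but 2 is not an ancestor of 5, which is exactly the case where a naive range walk would silently fall back to the whole ancestry of the "to" snapshot.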
@rdsr, I'm merging this. Thanks for all your work on it! I think we do need to follow up pretty quickly with better validations for the snapshot IDs passed to appendsBetween and appendsAfter. Right now, I don't think there is anything that validates that there is a range of snapshots from the start to the end -- it looks like this would use the whole ancestry of the "to" snapshot if "from" isn't an ancestor. Let's clean that up with validations, but for now I've committed this since it is quite a large patch.
Fixes apache#765: This addresses the following follow-ups from apache#315:
1. Have validations on the snapshot id range
2. Improve tests for the same
WIP branch to get feedback on the overall approach.
The basic idea: if the user asks for an incremental scan between s1 and s2, we only scan manifests that belong in the range (s1, s2] (excluding s1 and including s2), and within those manifests we only look at manifest entries that belong in the range (s1, s2].
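That manifest-level pruning can be sketched with a toy model (plain Java, not Iceberg's manifest classes; the record and method names are made up). Each manifest records the snapshot that added it, so only manifests added in (s1, s2] need to be opened at all:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ManifestSelection {
    // Toy model of a manifest: its file path and the snapshot that added it.
    record Manifest(String path, long addedSnapshotId) {}

    // Prune to manifests added by a snapshot in the selected (s1, s2] range;
    // everything else can be skipped without being read.
    static List<Manifest> manifestsToScan(List<Manifest> all, Set<Long> rangeIds) {
        return all.stream()
            .filter(m -> rangeIds.contains(m.addedSnapshotId()))
            .collect(Collectors.toList());
    }
}
```

Entry-level filtering inside the surviving manifests then applies the same (s1, s2] test to each manifest entry's snapshot id, as described above.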