Change Data Capture (CDC) [Draft] #4539

Closed
Wants to merge 14 commits.
@@ -74,4 +74,11 @@ default ExpireSnapshots expireSnapshots(Table table) {
default DeleteReachableFiles deleteReachableFiles(String metadataLocation) {
throw new UnsupportedOperationException(this.getClass().getName() + " does not implement deleteReachableFiles");
}

/**
* Instantiates an action to generate CDC records.
*/
default Cdc generateCdcRecords(Table table) {
Contributor: Nit: Consider using CDC in all caps instead, as it is in the javadoc comments. To me, it looks a lot cleaner.

Contributor Author: I'm also open to other names, e.g., ChangeDataSet or ChangeDataCapture.

Contributor: In Flink it's referred to as Changelog.

Contributor: I feel we should not use acronyms.

Contributor: Yeah, I agree that using a full word would be better.

In comments and even method names an acronym is fine in my opinion, but as the main class name it would probably be best to use the full name.

Contributor: +1 for Changelog. Here, it means generateChangelog.

Contributor: I am OK with generateChangelog.

Contributor (@stevenzwu, Apr 22, 2022): Or generateChangeSet, since an action is a batch execution. If it were a long-running streaming execution, changelog would be more accurate, as it implies a stream.

Contributor: I liked generateChangelog, but if it can confuse people, generateChangeSet sounds good too.

Contributor Author: Combining the feedback, I changed it to GetChangeSet. The name GenerateChangeSet is good, but it is way too long; think about the class name BaseGenerateChangeSetSparkActionResult. I admit the verb get is plain compared to generate, but I think that is fine: a plain name suits a tool.

throw new UnsupportedOperationException(this.getClass().getName() + " does not implement generateCdcRecords");
}
}
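For readers skimming the diff, a minimal sketch of how an engine-specific ActionsProvider could override this default instead of throwing (the implementation class name below is hypothetical, not taken from this PR):

  @Override
  public Cdc generateCdcRecords(Table table) {
    // Return an engine-specific action; the default above only throws UnsupportedOperationException.
    return new BaseGenerateCdcRecordsSparkAction(spark, table);  // hypothetical Spark implementation
  }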
49 changes: 49 additions & 0 deletions api/src/main/java/org/apache/iceberg/actions/Cdc.java
@@ -0,0 +1,49 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.iceberg.actions;

public interface Cdc extends Action<Cdc, Cdc.Result> {
Contributor: Is this an action? Actions typically modify the table, like rewrite or expire snapshots.

Contributor Author: It is. An action usually modifies the table, but it is not necessarily limited to that. This is the first PR; we will also explore a way to use a scan for CDC.

Contributor: All the existing actions in Iceberg seem to modify the table. Since this is a public API, I'd like to double-check. Can we keep it in SparkActions only for now while we are in the experimental phase?

Conceptually, this is a scan/read (not a maintenance action).

Contributor Author (@flyrain, Apr 21, 2022): Agreed. This is the starting point. As I said in another comment, we need a well-designed scan interface first.

Contributor (@aokolnychyi, Apr 22, 2022): I wouldn't mind having an action like this. We have RemoveOrphanFiles that does not modify the table state, for instance.

I would match the naming style we have in other actions, though. I think it can be GenerateChangelog, as all other actions start with a verb.

Contributor (@stevenzwu, Apr 22, 2022): RemoveOrphanFiles is also a maintenance action on the table. I agree that technically it doesn't modify the table state, but it is a garbage-collection action for the table. Those orphaned files may have been part of the table before but became unreferenced due to compaction or a retention purge, or they were intended to be added to the table but the operation was aborted partway through. To me, it is a modification of the broader environment/infrastructure around the table.

Contributor: The primary goal of adding the Action API was to provide scalable and ready-to-use recipes for common scenarios. I am not sure we should strictly restrict the usage to maintenance.

That being said, I'd be happy to discuss alternative ways to expose this.

/**
* Emit the changed data set produced by a single snapshot.
*
* @param snapshotId id of the snapshot to generate changed data for
* @return this for method chaining
*/
Cdc useSnapshot(long snapshotId);

/**
* Emit the changed data set produced by a range of snapshots.
*
* @param fromSnapshotId id of the first snapshot
* @param toSnapshotId id of the last snapshot
* @return this for method chaining
*/
Cdc between(long fromSnapshotId, long toSnapshotId);

/**
* The action result that contains a dataset of changed rows.
*/
interface Result {
/**
* Returns CDC records.
*/
Object cdcRecords();
}
}
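Taken together, the proposed API would be driven roughly as follows (a sketch only; the action name, entry point, and result type are still under discussion above, and the actions handle is assumed to come from an ActionsProvider such as SparkActions):

  // Emit the rows changed by a single snapshot, or by an inclusive range of snapshots.
  Cdc.Result result = actions.generateCdcRecords(table)
      .useSnapshot(snapshotId)              // or .between(fromSnapshotId, toSnapshotId)
      .execute();

  // The result exposes the changed rows as an engine-specific dataset, hence the Object return type.
  Object changedRows = result.cdcRecords();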
@@ -104,6 +104,10 @@ public static <T> VectorHolder constantHolder(int numRows, T constantValue) {
return new ConstantVectorHolder(numRows, constantValue);
}

public static <T> VectorHolder deleteMetaColumnHolder(int numRows) {
return new DeletedVectorHolder(numRows);
}

public static VectorHolder dummyHolder(int numRows) {
return new ConstantVectorHolder(numRows);
}
@@ -146,4 +150,17 @@ public PositionVectorHolder(FieldVector vector, Type type, NullabilityHolder nul
}
}

public static class DeletedVectorHolder extends VectorHolder {
private final int numRows;

public DeletedVectorHolder(int numRows) {
this.numRows = numRows;
}

@Override
public int numValues() {
return numRows;
}
}

}
@@ -517,5 +517,32 @@ public void setBatchSize(int batchSize) {
}
}

/**
* A dummy vector reader that doesn't actually read files; instead, it returns a dummy
* VectorHolder that indicates whether the row is deleted.
*/
public static class DeletedVectorReader extends VectorizedArrowReader {
Collaborator: A basic question: do we need to enable vectorization to read CDC? What about the non-vectorized reader?

Contributor Author: Vectorized reads are significantly faster than non-vectorized reads; see the benchmark in #3287. We should have it. Non-vectorized reads are handled by the changes I made in RowDataReader, DeleteFilter, and Deletes.
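To illustrate the non-vectorized path the author describes, a rough sketch of how a row reader could mark position-deleted rows instead of dropping them (the pos accessor and markRowDeleted callback are placeholder names, not code from this PR):

  // Pass a callback that flags position-deleted rows rather than filtering them out.
  CloseableIterable<T> markedRows = Deletes.streamingMarker(
      rows,                                                 // rows read from the data file
      this::pos,                                            // maps a row to its position in the data file
      Deletes.deletePositions(dataFilePath, posDeleteRows), // positions from the position delete files
      row -> markRowDeleted(row));                          // sets the _deleted metadata column to true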

public DeletedVectorReader() {
}

@Override
public VectorHolder read(VectorHolder reuse, int numValsToRead) {
return VectorHolder.deleteMetaColumnHolder(numValsToRead);
}

@Override
public void setRowGroupInfo(PageReadStore source, Map<ColumnPath, ColumnChunkMetaData> metadata, long rowPosition) {
}

@Override
public String toString() {
return "DeletedVectorReader";
}

@Override
public void setBatchSize(int batchSize) {
}
}

}

@@ -91,7 +91,7 @@ public VectorizedReader<?> message(
reorderedFields.add(VectorizedArrowReader.positions());
}
} else if (id == MetadataColumns.IS_DELETED.fieldId()) {
reorderedFields.add(new VectorizedArrowReader.ConstantVectorReader<>(false));
reorderedFields.add(new VectorizedArrowReader.DeletedVectorReader());
} else if (reader != null) {
reorderedFields.add(reader);
} else {
4 changes: 4 additions & 0 deletions core/src/main/java/org/apache/iceberg/BaseFileScanTask.java
@@ -47,6 +47,10 @@ public BaseFileScanTask(DataFile file, DeleteFile[] deletes, String schemaString
this.residuals = residuals;
}

public BaseFileScanTask cloneWithoutDeletes() {
return new BaseFileScanTask(file, new DeleteFile[0], schemaString, specString, residuals);
}

@Override
public DataFile file() {
return file;
44 changes: 37 additions & 7 deletions core/src/main/java/org/apache/iceberg/ManifestGroup.java
@@ -41,7 +41,7 @@
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.ParallelIterable;

class ManifestGroup {
public class ManifestGroup {
Contributor: If we use the scan API, we may not need to expose ManifestGroup as public.

Contributor Author: It'd be awesome. We have to design the scan API nicely, though.

Contributor: I am not sure about extending TableScan (I can be convinced), but we can definitely create a utility class in core that internally uses ManifestGroup. I don't think we should expose this class. Maybe it is safer to start with a utility and then see whether it is something that can become part of TableScan.

Contributor (@stevenzwu, Apr 22, 2022): @aokolnychyi, we are not talking about extending TableScan. We are discussing introducing new scan interfaces to support incremental scans (appends only and CDC). Please see PR #4580 (comment).
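For context on what a public ManifestGroup enables here, a rough sketch of how a changelog planner might drive it, using the ignoreAdded()/onlyWithDeletes() options added later in this diff (the snapshot accessors and overall wiring are illustrative, not code from this PR):

  // Plan tasks only for data files that already existed and now carry new delete files,
  // i.e. the files whose rows turned into deletes in this snapshot.
  CloseableIterable<FileScanTask> deletedRowTasks =
      new ManifestGroup(table.io(), snapshot.dataManifests(), snapshot.deleteManifests())
          .specsById(table.specs())
          .filterData(rowFilter)
          .ignoreAdded()         // skip entries newly added in this snapshot
          .onlyWithDeletes()     // keep only entries that have matching delete files
          .planFiles();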

private static final Types.StructType EMPTY_STRUCT = Types.StructType.of();

private final FileIO io;
@@ -55,6 +55,8 @@ class ManifestGroup {
private Expression partitionFilter;
private boolean ignoreDeleted;
private boolean ignoreExisting;
private boolean ignoreAdded;
private boolean onlyWithDeletes;
private boolean ignoreResiduals;
private List<String> columns;
private boolean caseSensitive;
@@ -66,7 +68,7 @@ class ManifestGroup {
Iterables.filter(manifests, manifest -> manifest.content() == ManifestContent.DELETES));
}

ManifestGroup(FileIO io, Iterable<ManifestFile> dataManifests, Iterable<ManifestFile> deleteManifests) {
public ManifestGroup(FileIO io, Iterable<ManifestFile> dataManifests, Iterable<ManifestFile> deleteManifests) {
this.io = io;
this.dataManifests = Sets.newHashSet(dataManifests);
this.deleteIndexBuilder = DeleteFileIndex.builderFor(io, deleteManifests);
@@ -75,20 +77,22 @@ class ManifestGroup {
this.partitionFilter = Expressions.alwaysTrue();
this.ignoreDeleted = false;
this.ignoreExisting = false;
this.ignoreAdded = false;
this.onlyWithDeletes = false;
this.ignoreResiduals = false;
this.columns = ManifestReader.ALL_COLUMNS;
this.caseSensitive = true;
this.manifestPredicate = m -> true;
this.manifestEntryPredicate = e -> true;
}

ManifestGroup specsById(Map<Integer, PartitionSpec> newSpecsById) {
public ManifestGroup specsById(Map<Integer, PartitionSpec> newSpecsById) {
this.specsById = newSpecsById;
deleteIndexBuilder.specsById(newSpecsById);
return this;
}

ManifestGroup filterData(Expression newDataFilter) {
public ManifestGroup filterData(Expression newDataFilter) {
this.dataFilter = Expressions.and(dataFilter, newDataFilter);
deleteIndexBuilder.filterData(newDataFilter);
return this;
@@ -125,6 +129,16 @@ ManifestGroup ignoreExisting() {
return this;
}

public ManifestGroup ignoreAdded() {
this.ignoreAdded = true;
return this;
}

public ManifestGroup onlyWithDeletes() {
this.onlyWithDeletes = true;
return this;
}

ManifestGroup ignoreResiduals() {
this.ignoreResiduals = true;
return this;
@@ -180,7 +194,7 @@ public CloseableIterable<FileScanTask> planFiles() {
return CloseableIterable.transform(entries, e -> new BaseFileScanTask(
e.file().copy(), deleteFiles.forEntry(e), schemaString, specString, residuals));
}
});
}, deleteFiles);

if (executorService != null) {
return new ParallelIterable<>(tasks, executorService);
@@ -198,11 +212,13 @@ public CloseableIterable<FileScanTask> planFiles() {
* @return a CloseableIterable of manifest entries.
*/
public CloseableIterable<ManifestEntry<DataFile>> entries() {
return CloseableIterable.concat(entries((manifest, entries) -> entries));
return CloseableIterable.concat(entries((manifest, entries) -> entries, null));
}

@SuppressWarnings({"unchecked", "checkstyle:CyclomaticComplexity"})
private <T> Iterable<CloseableIterable<T>> entries(
BiFunction<ManifestFile, CloseableIterable<ManifestEntry<DataFile>>, CloseableIterable<T>> entryFn) {
BiFunction<ManifestFile, CloseableIterable<ManifestEntry<DataFile>>, CloseableIterable<T>> entryFn,
DeleteFileIndex deleteFiles) {
LoadingCache<Integer, ManifestEvaluator> evalCache = specsById == null ?
null : Caffeine.newBuilder().build(specId -> {
PartitionSpec spec = specsById.get(specId);
@@ -237,6 +253,12 @@ private <T> Iterable<CloseableIterable<T>> entries(
manifest -> manifest.hasAddedFiles() || manifest.hasDeletedFiles());
}

if (ignoreAdded) {
// only scan manifests that have entries other than added
matchingManifests = Iterables.filter(matchingManifests,
manifest -> manifest.hasExistingFiles() || manifest.hasDeletedFiles());
}

matchingManifests = Iterables.filter(matchingManifests, manifestPredicate::test);

return Iterables.transform(
@@ -258,6 +280,14 @@ private <T> Iterable<CloseableIterable<T>> entries(
entry -> entry.status() != ManifestEntry.Status.EXISTING);
}

if (ignoreAdded) {
entries = CloseableIterable.filter(entries, entry -> entry.status() != ManifestEntry.Status.ADDED);
}

if (onlyWithDeletes && deleteFiles != null) {
entries = CloseableIterable.filter(entries, entry -> deleteFiles.forEntry(entry).length > 0);
}

if (evaluator != null) {
entries = CloseableIterable.filter(entries,
entry -> evaluator.eval((GenericDataFile) entry.file()));
82 changes: 80 additions & 2 deletions core/src/main/java/org/apache/iceberg/deletes/Deletes.java
@@ -22,6 +22,7 @@
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import org.apache.iceberg.Accessor;
import org.apache.iceberg.MetadataColumns;
@@ -73,6 +74,16 @@ public static <T> CloseableIterable<T> filter(CloseableIterable<T> rows, Functio
return filter.filter(rows);
}

public static <T> CloseableIterable<T> marker(CloseableIterable<T> rows, Function<T, Long> rowToPosition,
PositionDeleteIndex deleteSet, Consumer<T> markDeleted) {
if (deleteSet.isEmpty()) {
return rows;
}

PositionSetDeleteMarker<T> deleteMarker = new PositionSetDeleteMarker<>(rowToPosition, deleteSet, markDeleted);
return deleteMarker.filter(rows);
Contributor: Since PositionSetDeleteMarker always returns true, we are not actually filtering. It seems we are mainly leveraging the filter to traverse the iterable and get the side effect of calling markDeleted on matched rows. The semantics are a little odd to me. Maybe we don't need to introduce the PositionSetDeleteMarker filter?

Contributor: Maybe just use this method from CloseableIterable to traverse? It returns the same object after applying the Consumer function:

  static <I, O> CloseableIterable<O> transform(CloseableIterable<I> iterable, Function<I, O> transform) {
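To make that suggestion concrete, here is a sketch of how the marker method could be rewritten on top of transform (shape only; whether to keep the deleteSet.isEmpty() short-circuit and the exact parameter names are up to the PR):

  public static <T> CloseableIterable<T> marker(CloseableIterable<T> rows, Function<T, Long> rowToPosition,
                                                PositionDeleteIndex deleteSet, Consumer<T> markDeleted) {
    if (deleteSet.isEmpty()) {
      return rows;
    }

    // Traverse the rows and mutate matching ones in place; transform hands back the same row object.
    return CloseableIterable.transform(rows, row -> {
      if (deleteSet.isDeleted(rowToPosition.apply(row))) {
        markDeleted.accept(row);
      }
      return row;
    });
  }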

Collaborator (@chenjunjiedada, Apr 21, 2022): I think it traverses the rows and adds the is_deleted value to the row. +1 to using transform from CloseableIterable. The Filter interface makes it hard to understand.

Contributor: Or maybe we can add a new interface called Marker to mark these rows. I think it would be easier to understand than using CloseableIterable.

}

public static StructLikeSet toEqualitySet(CloseableIterable<StructLike> eqDeletes, Types.StructType eqType) {
try (CloseableIterable<StructLike> deletes = eqDeletes) {
StructLikeSet deleteSet = StructLikeSet.create(eqType);
@@ -107,6 +118,14 @@ public static <T> CloseableIterable<T> streamingFilter(CloseableIterable<T> rows
return new PositionStreamDeleteFilter<>(rows, rowToPosition, posDeletes);
}


public static <T> CloseableIterable<T> streamingMarker(CloseableIterable<T> rows,
Function<T, Long> rowToPosition,
CloseableIterable<Long> posDeletes,
Consumer<T> markDeleted) {
return new PositionStreamDeleteMarker<>(rows, rowToPosition, posDeletes, markDeleted);
}

public static CloseableIterable<Long> deletePositions(CharSequence dataLocation,
CloseableIterable<StructLike> deleteFile) {
return deletePositions(dataLocation, ImmutableList.of(deleteFile));
@@ -152,6 +171,29 @@ protected boolean shouldKeep(T row) {
}
}

private static class PositionSetDeleteMarker<T> extends Filter<T> {
private final Function<T, Long> rowToPosition;
private final PositionDeleteIndex deleteSet;
private final Consumer<T> markDeleted;

private PositionSetDeleteMarker(Function<T, Long> rowToPosition, PositionDeleteIndex deleteSet,
Consumer<T> markDeleted) {
this.rowToPosition = rowToPosition;
this.deleteSet = deleteSet;
this.markDeleted = markDeleted;
}

@Override
protected boolean shouldKeep(T row) {
if (deleteSet.isDeleted(rowToPosition.apply(row))) {
markDeleted.accept(row);
}

// always return true, since we don't want to remove the row
return true;
}
}

private static class PositionStreamDeleteFilter<T> extends CloseableGroup implements CloseableIterable<T> {
private final CloseableIterable<T> rows;
private final Function<T, Long> extractPos;
@@ -170,7 +212,7 @@ public CloseableIterator<T> iterator() {

CloseableIterator<T> iter;
if (deletePosIterator.hasNext()) {
iter = new PositionFilterIterator(rows.iterator(), deletePosIterator);
iter = createPosDeleteIterator(rows.iterator(), deletePosIterator);
} else {
iter = rows.iterator();
try {
@@ -185,7 +227,12 @@ public CloseableIterator<T> iterator() {
return iter;
}

private class PositionFilterIterator extends FilterIterator<T> {
protected PositionFilterIterator createPosDeleteIterator(CloseableIterator<T> items,
CloseableIterator<Long> deletePosIterator) {
return new PositionFilterIterator(items, deletePosIterator);
}

protected class PositionFilterIterator extends FilterIterator<T> {
private final CloseableIterator<Long> deletePosIterator;
private long nextDeletePos;

@@ -227,6 +274,37 @@ public void close() {
}
}

private static class PositionStreamDeleteMarker<T> extends PositionStreamDeleteFilter<T> {
private final Consumer<T> markDeleted;

private PositionStreamDeleteMarker(CloseableIterable<T> rows, Function<T, Long> extractPos,
CloseableIterable<Long> deletePositions, Consumer<T> markDeleted) {
super(rows, extractPos, deletePositions);
this.markDeleted = markDeleted;
}

@Override
protected PositionFilterIterator createPosDeleteIterator(CloseableIterator<T> items,
CloseableIterator<Long> deletePosIterator) {
return new PositionDeleteMarkerIterator(items, deletePosIterator);
}

private class PositionDeleteMarkerIterator extends PositionFilterIterator {
private PositionDeleteMarkerIterator(CloseableIterator<T> items, CloseableIterator<Long> deletePositions) {
super(items, deletePositions);
}

@Override
protected boolean shouldKeep(T row) {
boolean isDeleted = !super.shouldKeep(row);
if (isDeleted) {
markDeleted.accept(row);
}
return true;
}
}
}

private static class DataFileFilter<T extends StructLike> extends Filter<T> {
private final CharSequence dataLocation;
