Add position delete filter and utils #1301

Conversation
```java
import org.apache.iceberg.util.StructLikeWrapper;

public class Deletes {
  private static final Schema POSITION_DELETE_SCHEMA = new Schema(
```
We can move this to a better place later, but it isn't available elsewhere so it makes sense to add it here.
rymurr left a comment
LGTM, I enjoyed the streaming sort algorithm :-)
Something for later: a benchmark to understand how large delete files can be before the filter cost explodes.
Agreed. We might want to read all of the delete files at once and keep the records in memory if the set is small enough. We will definitely want to look into that for the vectorized case.
Let me go through this now.
```java
public class Deletes {
  private static final Schema POSITION_DELETE_SCHEMA = new Schema(
      Types.NestedField.required(1, "filename", Types.StringType.get(), "Data file location of the deleted row"),
```
Why filename and not file_path like we use for DataFile?
Didn't think about it. file_path sounds good to me.
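For reference, a sketch of how the schema might read after the rename. The second field and its ID are an assumption based on the Long-typed POSITION_ACCESSOR used later in the diff, not a quote of the final code:

```java
// Sketch only: position delete schema after renaming "filename" to "file_path".
// The "pos" field and its ID are assumed, inferred from POSITION_ACCESSOR below.
private static final Schema POSITION_DELETE_SCHEMA = new Schema(
    Types.NestedField.required(1, "file_path", Types.StringType.get(),
        "Data file location of the deleted row"),
    Types.NestedField.required(2, "pos", Types.LongType.get(),
        "Ordinal position of the deleted row in the data file"));
```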
```java
  return new SortedMerge<>(Long::compare, positions);
}

private static class PositionDeleteFilter<T> extends CloseableGroup implements CloseableIterable<T> {
```
Does this mean we will filter out positions after we read data and project meta columns? Do we plan to push this down to readers in the future? How will it work with vectorized execution?
This filter doesn't require us to answer most of those questions right now. All it requires is some object and a way to get the position of that object. That should be flexible enough that we can handle position as a column or using a class that directly supports it.
For the initial integration with Spark, I was thinking that we would add the _pos metadata column at the end of the requested projection. That way, we can return the rows without copying the data. At most, we would need to tell the row to report that it is one column shorter.
The main idea here is to make it easy for readers to create filters for tasks and apply them. The reader just needs to open the delete files, then pass them to these methods to merge deletes together and use them as a row filter.
For vectorization, we will probably want a different implementation, but we could reuse some of these classes (like the merging iterator).
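To make that contract concrete, here is a minimal, self-contained sketch of the streaming idea: rows plus a position-extraction function, filtered against an ascending stream of delete positions. Names and shape are illustrative, not the PR's exact classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class StreamingPositionFilterSketch {
  // Keep rows whose position does not appear in the delete stream.
  // Assumes both row positions and delete positions are ascending.
  static <T> List<T> filterDeleted(Iterator<T> rows, Function<T, Long> extractPos, Iterator<Long> deletes) {
    List<T> kept = new ArrayList<>();
    long nextDelete = deletes.hasNext() ? deletes.next() : Long.MAX_VALUE;
    while (rows.hasNext()) {
      T row = rows.next();
      long pos = extractPos.apply(row);
      // advance the delete cursor until it is at or past the current row
      while (nextDelete < pos && deletes.hasNext()) {
        nextDelete = deletes.next();
      }
      if (nextDelete < pos) {
        nextDelete = Long.MAX_VALUE; // delete stream exhausted
      }
      if (pos != nextDelete) {
        kept.add(row);
      }
    }
    return kept;
  }
}
```

Because both streams are consumed in order, each delete position is visited at most once, so the cost is linear in the number of rows plus deletes.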
```java
@Override
protected boolean shouldKeep(T row) {
  long currentPos = extractPos.apply(row);
```
Does this assume rows are ordered by position?
Yes, it assumes that positions are ascending, like the delete positions.
```java
List<CloseableIterable<Long>> positions = Lists.transform(deleteFiles, deletes ->
    CloseableIterable.transform(locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));

return new SortedMerge<>(Long::compare, positions);
```
One of the assumptions was that positional deletes are lightweight and we can build an in-memory map from file_path to a set of deleted positions for a given task. If I understand correctly, the current logic will scan deleteFiles for every data file.
Do we consider merge sort as a primary way of applying positional deletes or as a fallback when the number of delete files is too large?
That's right. This implementation is the streaming one. It should be simpler to build the version that caches deletes as a set.
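A minimal sketch of that cached variant, assuming the deletes for one data file fit in memory (names are illustrative, not the PR's API):

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

public class CachedPositionFilterSketch {
  // Materialize all delete positions for one data file into a set,
  // then filter rows by membership. No ordering requirement, but the
  // whole delete set must fit in memory.
  static <T> Predicate<T> keepRow(Iterator<Long> deletePositions, Function<T, Long> extractPos) {
    Set<Long> deleted = new HashSet<>();
    deletePositions.forEachRemaining(deleted::add);
    return row -> !deleted.contains(extractPos.apply(row));
  }
}
```

The trade-off against the streaming merge is memory for the set versus a per-data-file scan of the delete files.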
This looks good to me too. The build failed, though.
aokolnychyi left a comment
+1 when the tests pass
Thanks for reviewing @rymurr, @aokolnychyi!
This adds a position delete row filter and helpers for constructing a row filter that restricts deletes to a specific data file and merges deletes from multiple delete files. These utilities are written using CloseableIterable and handle file closing.
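The merge step relies on a k-way sorted merge (SortedMerge in the diff). Here is a self-contained sketch of that pattern, with illustrative names and without the CloseableIterable plumbing:

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

public class SortedMergeSketch {
  // K-way merge of already-sorted iterators using a min-heap keyed on
  // each source's current head, analogous to SortedMerge over per-file
  // delete position streams.
  static <T> Iterator<T> merge(Comparator<T> comparator, List<Iterator<T>> inputs) {
    PriorityQueue<PeekingSource<T>> heap =
        new PriorityQueue<>(Math.max(1, inputs.size()), (a, b) -> comparator.compare(a.head, b.head));
    for (Iterator<T> input : inputs) {
      if (input.hasNext()) {
        heap.add(new PeekingSource<>(input));
      }
    }
    return new Iterator<T>() {
      @Override
      public boolean hasNext() {
        return !heap.isEmpty();
      }

      @Override
      public T next() {
        if (heap.isEmpty()) {
          throw new NoSuchElementException();
        }
        PeekingSource<T> source = heap.poll();
        T value = source.head;
        if (source.advance()) {
          heap.add(source); // re-insert with its new head
        }
        return value;
      }
    };
  }

  private static class PeekingSource<T> {
    private final Iterator<T> iter;
    private T head;

    PeekingSource(Iterator<T> iter) {
      this.iter = iter;
      this.head = iter.next(); // only constructed for non-empty iterators
    }

    boolean advance() {
      if (iter.hasNext()) {
        this.head = iter.next();
        return true;
      }
      return false;
    }
  }
}
```

A caller would combine per-file position streams with Long::compare, mirroring new SortedMerge<>(Long::compare, positions) in the diff.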