Core: Look up targeted position deletes by path #9251
Conversation
force-pushed from 386fa98 to 8373e48
@szehon-ho @RussellSpitzer @flyrain @amogh-jahagirdar @nastra @Fokko, could you check this one?
Thanks @aokolnychyi, it looks neat. I will need to catch up on the overall plan for introducing a 1-to-1 data-to-delete file mapping, if you have some links? I put some comments.
}

private static DeleteFile[] indexFiles(List<DeleteFile> list) {
  DeleteFile[] array = list.toArray(NO_DELETES);
I personally feel it's clearer to use DeleteFile::new, as it looks strange to use NO_DELETES.
I wish we could use that but it is not allowed in Java 8. I renamed the var to EMPTY_DELETES. Let me know what you think. I don't want to allocate a new array on each call.
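For context, the trade-off here is that `Collection.toArray(IntFunction)` (which enables `toArray(DeleteFile[]::new)`) only arrived in Java 11, so on Java 8 the usual substitute is a shared zero-length array constant. A minimal sketch, using `String` in place of `DeleteFile` (class and constant names here are illustrative, not the PR's):

```java
import java.util.Arrays;
import java.util.List;

public class EmptyArrayIdiom {
  // Shared zero-length array: toArray allocates a correctly sized array when
  // the list is non-empty, and returns this shared instance for an empty list,
  // so there is no per-call allocation of a throwaway array.
  private static final String[] EMPTY_DELETES = new String[0];

  static String[] indexFiles(List<String> list) {
    return list.toArray(EMPTY_DELETES);
  }

  public static void main(String[] args) {
    System.out.println(indexFiles(Arrays.asList("a.parquet", "b.parquet")).length);
  }
}
```

Passing `new String[0]` inline on every call would also work but allocates each time, which is what the comment above is trying to avoid.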
EqualityDeletes globalDeletes = new EqualityDeletes();
PartitionMap<EqualityDeletes> eqDeletesByPartition = PartitionMap.create(specsById);
PartitionMap<PositionDeletes> posDeletesByPartition = PartitionMap.create(specsById);
CharSequenceMap<PositionDeletes> posDeletesByPath = CharSequenceMap.create();
It is possible we'll see memory growth from using a path map; for example, with a million data files that could be ~100 MB, if I calculate correctly. (We use DeleteFileIndex centrally, if I understand correctly.) I wonder if we have considered doing some common-prefix encoding, or other encoding, of paths anywhere. Not sure if it's worth it, given the typical memory size of the planner.
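One way the common-prefix idea could look, as a rough sketch and not anything from this PR: front-code a sorted list of paths so that each entry stores only the length of the prefix shared with the previous entry plus the distinct suffix. The class name and encoding format below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Front coding over a sorted list of paths: long, similar paths (common in
// object-store layouts) compress to a shared-prefix length plus a short suffix.
class FrontCodedPaths {
  static List<String> encode(List<String> sortedPaths) {
    List<String> encoded = new ArrayList<>();
    String prev = "";
    for (String path : sortedPaths) {
      int shared = 0;
      int max = Math.min(prev.length(), path.length());
      while (shared < max && prev.charAt(shared) == path.charAt(shared)) {
        shared++;
      }
      // Store "<shared prefix length>|<remaining suffix>".
      encoded.add(shared + "|" + path.substring(shared));
      prev = path;
    }
    return encoded;
  }
}
```

Decoding walks the list in order, reusing the prefix of the previously decoded path; the savings depend entirely on how much adjacent paths actually share.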
It would require a bit of extra space, but otherwise it would not be possible to plan jobs at all if we have a 1-1 mapping of data files and deletes. I am going to add a benchmark. Without this PR, planning did not complete in 5 minutes. With this PR, it took around 5 seconds.
There are also some ideas about storing extra metadata, but that would require changing the spec. It can be done later. The current solution should be good enough.
PartitionSpec spec = specsById.get(specId);
StructLike partition = file.partition();
if (file.content() == FileContent.POSITION_DELETES) {
  PositionDeletes deletes = fetchPosDeletes(posDeletesByPartition, specId, partition);
fetch sounds like it will load from disk. I think removing the helper method in favor of the underlying code is clearer, or renaming the method.
I restructured this part quite a bit. Let me know if it is any better.
private long[] seqs = null;
private DeleteFile[] files = null;
private volatile List<DeleteFile> buffer = Lists.newArrayList();
This should be initialized to null to allow double-checked locking to work, right?
Not really. We start by adding elements to the buffer list, and then the group is indexed when needed.
I see it now.
One issue: it seems that if someone calls add() after some getter method, an NPE will be thrown?
Overall, I was trying to see if this class can be simplified by eliminating so many nullable states (buffer, seqs, files), i.e. if we only call one kind of getter method, then we can just sort on demand. Or is there some case where we do need the sorted array cached?
Yep, adding more elements after indexing must throw an exception. I can add a precondition with a proper message, but I was not sure it was worth it given that it will be called for each delete file and we may have millions of them. This is very internal code, not accessible from outside. I can add a precondition and benchmark it if you think that will be easier to follow.
I call indexIfNeeded() in all methods except add(), meaning all other methods trigger indexing. Let me know if you have other ideas, I would love to explore other options.
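A minimal sketch of the buffer-then-index shape being discussed (class and method names here are hypothetical, not the PR's actual code): add() appends to a mutable buffer, every read path calls indexIfNeeded(), which sorts once and drops the buffer so a later add() fails fast with a clear message instead of a raw NPE.

```java
import java.util.ArrayList;
import java.util.List;

class LazyIndexedGroup {
  private List<Long> buffer = new ArrayList<>();
  private long[] seqs = null; // built and sorted lazily on first read

  void add(long seq) {
    // Cheap precondition: a descriptive error instead of an NPE if add()
    // is called after a getter has already triggered indexing.
    if (buffer == null) {
      throw new IllegalStateException("Can't add files after indexing");
    }
    buffer.add(seq);
  }

  private void indexIfNeeded() {
    if (seqs == null) {
      seqs = buffer.stream().mapToLong(Long::longValue).sorted().toArray();
      buffer = null; // release the buffer; the group is now effectively immutable
    }
  }

  long smallestSeq() {
    indexIfNeeded();
    return seqs.length == 0 ? -1 : seqs[0];
  }
}
```

The laziness pays off when only some groups are ever read: partitions never touched by the query are buffered but never sorted.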
Yeah, I was wondering if we could eliminate those member variables somehow. I feel like we always need the sorted version in the end, so is there any need to be lazy?
For example, always keep everything in a sorted TreeSet. Or just have the DeleteFileIndex::build() method populate intermediate maps of Lists, and then have a final pass that makes EqualityDeletes/PositionDeletes with sorted arrays?
If it is not possible to eliminate the member variables, I guess we can just try to add the precondition and benchmark it; hopefully it may get optimized out by branch prediction.
We actually started with two passes, but I migrated to one pass and indexing on demand for these reasons. Let me think a bit more about what can be done; I hear you.
I do like using System.arraycopy and binary search on arrays for lookups. I don't think we will beat that in terms of performance. The extra pass would also be expensive for path-based deletes, and it would mean indexing all deletes rather than only what's covered by the query.
Let me start with a few comments and a precondition. I am going to think more about this.
force-pushed from 7464e61 to 5cdb6b6
 * @param seq the sequence number to search for
 * @return the index of the first occurrence or the insertion point
 */
private static int findStartIndex(long[] seqs, long seq) {
The content of this method is just copied as it was.
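For reference, the behavior that Javadoc describes is a lower-bound binary search; a self-contained sketch of that contract (the method body in the PR may differ in details):

```java
public class SeqSearch {
  // Lower-bound binary search: returns the index of the first element equal
  // to seq, or the insertion point that keeps the array sorted if seq is absent.
  static int findStartIndex(long[] seqs, long seq) {
    int lo = 0;
    int hi = seqs.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1; // unsigned shift avoids int overflow
      if (seqs[mid] < seq) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }
}
```

A plain java.util.Arrays.binarySearch would not be enough here: when the array contains duplicates, it may return any matching index, while this use case needs the first occurrence so a scan over delete files can start at the right sequence number.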
force-pushed from 5cdb6b6 to 6d806be
@szehon-ho, could you take another look? I did some updates and added a benchmark. The performance is OK for now, but we can definitely optimize it further in follow-up PRs.
// also, separate out equality deletes in an unpartitioned spec that should be applied
// globally
DeleteFileGroup globalDeletes = null;
for (Pair<Integer, StructLikeWrapper> partition : deleteFilesByPartition.keySet()) {
Instead of doing two passes and always indexing/sorting everything, I switched to one pass and indexing on demand. We will load all files but index only those partitions that are affected by the query. This has proven to be highly beneficial, especially for path-based deletes. Computing equals and hashCode on paths is not cheap.
Mostly looks good. A few more comments; also, I am not sure if we can do something to simplify the EqualityDeletes/PositionDeletes classes.
case EQUALITY_DELETES:
  PartitionSpec spec = specsById.get(file.specId());
  EqualityDeleteFile eqFile = new EqualityDeleteFile(spec, file);
  EqualityDeletes eqGroup = findGroup(globalDeletes, eqDeletesByPartition, eqFile);
Why not just call the method add and do it in one line? add(globalDeletes, eqDeletesByPartition, eqFile)
I actually had that but then switched. I'll switch back.
force-pushed from fb68908 to 0042f2e
force-pushed from 0042f2e to 923e719
Thanks, looks good to me!
I ran the benchmark one more time and did not see any difference with the precondition, so I'll keep it for clarity. I also played around with the on-demand indexing, but the current solution seems the most promising. Thanks a lot for reviewing, @szehon-ho!
This PR optimizes our DeleteFileIndex to look up deletes by path if they target a single data file. This change is one step in a set of changes to properly support an optional 1-1 mapping between data files and position deletes. This PR relies on stored bounds: if the lower and upper bounds of the referenced path are the same, it means the full path was persisted. In the future, we may optimize this part by persisting extra metadata.

This PR adds a few new tests and relies on other existing tests. It also comes with a benchmark that did not complete in 5 minutes before.
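The core lookup idea in the description can be sketched roughly like this (a hypothetical simplification with made-up names; the real DeleteFileIndex also handles sequence numbers, partition scoping, and equality deletes): deletes whose lower and upper path bounds are equal target exactly one data file and go into a by-path map, while everything else stays in a coarser group.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PathDeleteIndex {
  private final Map<String, List<String>> deletesByPath = new HashMap<>();
  private final List<String> otherDeletes = new ArrayList<>();

  void add(String deleteFile, String lowerPathBound, String upperPathBound) {
    if (lowerPathBound != null && lowerPathBound.equals(upperPathBound)) {
      // equal bounds => the delete file references a single data file path
      deletesByPath.computeIfAbsent(lowerPathBound, k -> new ArrayList<>()).add(deleteFile);
    } else {
      // may reference many data files; must be checked for every file
      otherDeletes.add(deleteFile);
    }
  }

  List<String> forDataFile(String path) {
    // Targeted deletes are found by a hash lookup instead of scanning
    // every delete file in the partition.
    List<String> result = new ArrayList<>(otherDeletes);
    result.addAll(deletesByPath.getOrDefault(path, new ArrayList<>()));
    return result;
  }
}
```

With a 1-1 data-to-delete mapping, the per-file work drops from scanning all deletes in a partition to a single map lookup, which matches the reported jump from "did not complete in 5 minutes" to about 5 seconds.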