Combine tasks to scan up to target split size using parquet row group information #204

samarthjain · 2019-06-03T22:41:46Z

No description provided.


rdblue · 2019-06-03T23:00:15Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

+      this.targetSplitSize = targetSplitSize;
+      this.splitSizes = new ArrayList<>(offsetList.size());
+      int idx = 0;
+      while (idx < offsets.size()) {


Could you use a for loop instead of a while?

rdblue · 2019-06-03T23:06:08Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

+      return combinedTask;
+    }
+
+    private long getSplitSize(int idx) {


If this were left in the for loop, you could handle the last offset outside the loop instead of using the check here:

int lastIndex = offsets.size() - 1;
for (int index = 0; index < lastIndex; index += 1) {
  splitSizes.add(offsets.get(index + 1) - offsets.get(index));
}
splitSizes.add(parentScanTask.length() - offsets.get(lastIndex));
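As a self-contained illustration of the suggestion above, here is a minimal sketch that computes split sizes from row group offsets. The class name `SplitSizes`, the method `compute`, and the `fileLength` parameter (standing in for `parentScanTask.length()`) are hypothetical names chosen for this example, and the empty-list guard is an added assumption, not something discussed in the review.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Iceberg: derives the size of each split
// from sorted start offsets and the total file length.
final class SplitSizes {
  private SplitSizes() {
  }

  static List<Long> compute(List<Long> offsets, long fileLength) {
    List<Long> splitSizes = new ArrayList<>(offsets.size());
    if (offsets.isEmpty()) {
      // Assumption for this sketch: no offsets means no splits.
      return splitSizes;
    }

    // Every split except the last ends where the next one starts.
    int lastIndex = offsets.size() - 1;
    for (int index = 0; index < lastIndex; index += 1) {
      splitSizes.add(offsets.get(index + 1) - offsets.get(index));
    }

    // The last split runs to the end of the file.
    splitSizes.add(fileLength - offsets.get(lastIndex));
    return splitSizes;
  }
}
```

For offsets 4, 104, 254 in a 300-byte file, this yields sizes 100, 150, and 46. Handling the last offset after the loop removes the need for a per-index bounds check.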

rdblue · 2019-06-03T23:07:33Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

+      long currentSize = splitSizes.get(sizeIdx);
+      FileScanTask combinedTask;
+      sizeIdx++;
+      while (hasNext()) {


I think it would be better for maintaining this over time if this didn't make assumptions about how hasNext is implemented. You should probably copy that check here.

rdblue · 2019-06-03T23:15:32Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

+          return combinedTask;
+        }
+      }
+      combinedTask = new SplitScanTask(offsets.get(offsetIdx), currentSize, parentScanTask);


This doesn't update offsetIdx and it isn't obvious at first why that is okay (because splitSizes is finished). I think it would be better to simplify this logic a little bit to have only one return statement. Like this:

long currentSize = splitSizes.get(sizeIdx);
sizeIdx += 1; // always consume at least one file split
while (sizeIdx < splitSizes.size() && currentSize + splitSizes.get(sizeIdx) <= targetSplitSize) {
  currentSize += splitSizes.get(sizeIdx);
  sizeIdx += 1;
}
combinedTask = new SplitScanTask(offsets.get(offsetIdx), currentSize, parentScanTask);
offsetIdx = sizeIdx;
return combinedTask;

That way, the behavior is always the same for all splits.
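The single-return structure suggested above can be exercised in isolation. The following is a rough sketch, not the code that was merged: `CombiningSplitIterator` is a hypothetical class, it returns `{start, length}` pairs as `long[]` in place of `SplitScanTask`, and it takes a plain `fileLength` where the real iterator uses `parentScanTask.length()`. It also assumes a non-empty, sorted offset list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the iterator under review. Each element is a
// {startOffset, length} pair covering one or more adjacent row groups.
final class CombiningSplitIterator implements Iterator<long[]> {
  private final List<Long> offsets;
  private final List<Long> splitSizes;
  private final long targetSplitSize;
  private int sizeIdx = 0;

  // Assumes offsets is non-empty and sorted ascending.
  CombiningSplitIterator(List<Long> offsetList, long fileLength, long targetSplitSize) {
    this.offsets = new ArrayList<>(offsetList);
    this.targetSplitSize = targetSplitSize;
    this.splitSizes = new ArrayList<>(offsets.size());
    int lastIndex = offsets.size() - 1;
    for (int index = 0; index < lastIndex; index += 1) {
      splitSizes.add(offsets.get(index + 1) - offsets.get(index));
    }
    splitSizes.add(fileLength - offsets.get(lastIndex));
  }

  @Override
  public boolean hasNext() {
    return sizeIdx < splitSizes.size();
  }

  @Override
  public long[] next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }

    int startIdx = sizeIdx; // index of the first row group in this combined split
    long currentSize = splitSizes.get(sizeIdx);
    sizeIdx += 1; // always consume at least one file split

    // Keep absorbing the next row group while the total stays within the target.
    while (sizeIdx < splitSizes.size() && currentSize + splitSizes.get(sizeIdx) <= targetSplitSize) {
      currentSize += splitSizes.get(sizeIdx);
      sizeIdx += 1;
    }

    return new long[] {offsets.get(startIdx), currentSize};
  }
}
```

With offsets 0, 100, 250, 400 in a 500-byte file and a target of 260, the sketch produces two combined splits: start 0 with length 250, then start 250 with length 250. A row group larger than the target is still returned on its own, because the first size is consumed before the loop's check.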

rdblue · 2019-06-03T23:15:52Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

-      long end = hasNext() ? splitOffsets.get(idx) : parentScanTask.length();
-      return new SplitScanTask(start, end - start, parentScanTask);
+      if (!hasNext()) {
+        throw new NoSuchElementException();


+1 for correctly implementing the contract!

rdblue · 2019-06-03T23:16:48Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

+    private int offsetIdx = 0;
+    private int sizeIdx = 0;
+
+    OffsetsAwareTargetSplitSizeScanTaskIterator(


Nit: style should be:

OffsetsAwareTargetSplitSizeScanTaskIterator(
    List<Long> offsetList, FileScanTask parentScanTask, long targetSplitSize) {
  ...
}

rdblue · 2019-06-03T23:17:21Z

core/src/test/java/org/apache/iceberg/TestOffsetsBasedSplitScanTaskIterator.java

+                             long targetSplitSize, List<List<Long>> offsetLenPairs) {
    List<FileScanTask> tasks = Lists.newArrayList(
-            new BaseFileScanTask.OffsetsBasedSplitScanTaskIterator(offsetRanges, new MockFileScanTask(fileLen)));
+        new BaseFileScanTask.OffsetsAwareTargetSplitSizeScanTaskIterator(offsetRanges,


Nit: style should be to wrap at the function call and place all arguments on the next line.

rdblue · 2019-06-03T23:20:28Z

core/src/test/java/org/apache/iceberg/TestOffsetsBasedSplitScanTaskIterator.java

 import org.junit.Test;

 public class TestOffsetsBasedSplitScanTaskIterator {
+


Nit: unless this was required by lint checks, could we avoid adding newlines? They can cause conflicts.

rdblue · 2019-06-03T23:32:45Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

+
+    OffsetsAwareTargetSplitSizeScanTaskIterator(
+        List<Long> offsetList, FileScanTask parentScanTask, long targetSplitSize) {
+      this.offsets = ImmutableList.copyOf(offsetList);


Missed this the first time: why copy the offset list? It shouldn't change.

Just being a bit defensive. Future implementations of the DataFile interface may not create an immutable copy before passing it downstream.

Seems fine since this should be an ImmutableList in the current read path and won't actually copy.

rdblue · 2019-06-03T23:34:44Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

+        throw new NoSuchElementException();
+      }
+      long currentSize = splitSizes.get(sizeIdx);
+      sizeIdx += 1; // always consume at least one file split


Is offsetIdx needed? It is always set to sizeIdx before the end of next. I think it could be a local variable instead and you could remove size from sizeIdx.

rdblue · 2019-06-03T23:34:53Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

+      offsetIdx = sizeIdx;
+      return combinedTask;
    }
+


Nit: another blank line

rdblue · 2019-06-03T23:35:14Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

        .toString();
  }

+


Nit: extra blank line

rdblue · 2019-06-04T18:38:33Z

Merged. Thanks for fixing this, @samarthjain!


samarthjain added 2 commits June 3, 2019 15:37

Combine tasks to scan up to target split size when parquet row group …

201dad6

…information is available

Cleanup

4173d53


Address code review comments

457e2fb


offsetIdx doesn't need to be an instance variable

70ee933

samarthjain force-pushed the combinefilesplits branch from cb3dd11 to 70ee933 Compare June 4, 2019 05:24

rdblue merged commit 7cde609 into apache:master Jun 4, 2019

aokolnychyi mentioned this pull request Aug 16, 2023

Core: Optimize offset splits handling #8336

Merged
