@samarthjain
Collaborator

No description provided.

this.targetSplitSize = targetSplitSize;
this.splitSizes = new ArrayList<>(offsetList.size());
int idx = 0;
while (idx < offsets.size()) {
Contributor

Could you use a for loop instead of a while?

return combinedTask;
}

private long getSplitSize(int idx) {
Contributor

If this were left in the for loop, you could handle the last offset outside the loop instead of using the check here:

int lastIndex = offsets.size() - 1;
for (int index = 0; index < lastIndex; index += 1) {
  splitSizes.add(offsets.get(index + 1) - offsets.get(index));
}
splitSizes.add(parentScanTask.length() - offsets.get(lastIndex));
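A worked instance of the suggestion above may help. This is only a sketch: `SplitSizesSketch` and its `splitSizes` helper are hypothetical names, and a plain `fileLength` argument stands in for `parentScanTask.length()`. It assumes the offsets are ascending split start positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitSizesSketch {
  // Hypothetical helper: derives each split's size from consecutive offsets.
  // The last split has no following offset, so it runs to the end of the file.
  static List<Long> splitSizes(List<Long> offsets, long fileLength) {
    List<Long> sizes = new ArrayList<>(offsets.size());
    if (offsets.isEmpty()) {
      return sizes;
    }
    int lastIndex = offsets.size() - 1;
    for (int index = 0; index < lastIndex; index += 1) {
      sizes.add(offsets.get(index + 1) - offsets.get(index));
    }
    // Handled outside the loop, so the loop body needs no bounds check.
    sizes.add(fileLength - offsets.get(lastIndex));
    return sizes;
  }

  public static void main(String[] args) {
    // Offsets 4, 14, 24 in a 30-byte file give sizes 10, 10 and 6.
    System.out.println(splitSizes(Arrays.asList(4L, 14L, 24L), 30L)); // prints [10, 10, 6]
  }
}
```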

long currentSize = splitSizes.get(sizeIdx);
FileScanTask combinedTask;
sizeIdx++;
while (hasNext()) {
Contributor

For long-term maintainability, I think this shouldn't make assumptions about how hasNext is implemented. You should probably copy that check here.

return combinedTask;
}
}
combinedTask = new SplitScanTask(offsets.get(offsetIdx), currentSize, parentScanTask);
Contributor

This doesn't update offsetIdx and it isn't obvious at first why that is okay (because splitSizes is finished). I think it would be better to simplify this logic a little bit to have only one return statement. Like this:

long currentSize = splitSizes.get(sizeIdx);
sizeIdx += 1; // always consume at least one file split

while (sizeIdx < splitSizes.size() && currentSize + splitSizes.get(sizeIdx) <= targetSplitSize) {
  currentSize += splitSizes.get(sizeIdx);
  sizeIdx += 1;
}

combinedTask = new SplitScanTask(offsets.get(offsetIdx), currentSize, parentScanTask);
offsetIdx = sizeIdx;

return combinedTask;

That way, the behavior is always the same for all splits.
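To see the single-return version end to end, here is a self-contained sketch. It is not the merged code: `CombiningSplitIteratorSketch` is a hypothetical class, each result is a `long[] {start, length}` pair standing in for `SplitScanTask`, and a `fileLength` argument stands in for `parentScanTask.length()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class CombiningSplitIteratorSketch implements Iterator<long[]> {
  private final List<Long> offsets;
  private final List<Long> splitSizes;
  private final long targetSplitSize;
  private int offsetIdx = 0;
  private int sizeIdx = 0;

  CombiningSplitIteratorSketch(List<Long> offsetList, long fileLength, long targetSplitSize) {
    this.offsets = new ArrayList<>(offsetList);
    this.targetSplitSize = targetSplitSize;
    this.splitSizes = new ArrayList<>(offsets.size());
    int lastIndex = offsets.size() - 1;
    for (int index = 0; index < lastIndex; index += 1) {
      splitSizes.add(offsets.get(index + 1) - offsets.get(index));
    }
    if (lastIndex >= 0) {
      splitSizes.add(fileLength - offsets.get(lastIndex));
    }
  }

  @Override
  public boolean hasNext() {
    return sizeIdx < splitSizes.size();
  }

  @Override
  public long[] next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    long currentSize = splitSizes.get(sizeIdx);
    sizeIdx += 1; // always consume at least one file split

    // Keep absorbing following splits while the total stays within the target.
    while (sizeIdx < splitSizes.size() && currentSize + splitSizes.get(sizeIdx) <= targetSplitSize) {
      currentSize += splitSizes.get(sizeIdx);
      sizeIdx += 1;
    }

    long[] combined = new long[] {offsets.get(offsetIdx), currentSize};
    offsetIdx = sizeIdx;
    return combined;
  }

  public static void main(String[] args) {
    // Split sizes are 10, 10, 10, 6; with a target of 20 they combine into two tasks.
    Iterator<long[]> it =
        new CombiningSplitIteratorSketch(Arrays.asList(4L, 14L, 24L, 34L), 40L, 20L);
    while (it.hasNext()) {
      System.out.println(Arrays.toString(it.next())); // prints [4, 20] then [24, 16]
    }
  }
}
```

Because the first split is consumed unconditionally, a single split larger than the target is still returned on its own rather than causing an endless loop.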

long end = hasNext() ? splitOffsets.get(idx) : parentScanTask.length();
return new SplitScanTask(start, end - start, parentScanTask);
if (!hasNext()) {
throw new NoSuchElementException();
Contributor

+1 for correctly implementing the contract!
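For context, the contract in question is that `java.util.Iterator.next()` throws `NoSuchElementException` once the iteration is exhausted. A minimal sketch of that pattern, using a hypothetical `CountdownIteratorSketch` unrelated to the scan-task classes in this PR:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class CountdownIteratorSketch implements Iterator<Integer> {
  private int remaining;

  CountdownIteratorSketch(int start) {
    this.remaining = start;
  }

  @Override
  public boolean hasNext() {
    return remaining > 0;
  }

  @Override
  public Integer next() {
    // Fail fast instead of returning a bogus element when exhausted.
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    int value = remaining;
    remaining -= 1;
    return value;
  }

  public static void main(String[] args) {
    Iterator<Integer> it = new CountdownIteratorSketch(2);
    System.out.println(it.next()); // prints 2
    System.out.println(it.next()); // prints 1
    try {
      it.next();
    } catch (NoSuchElementException e) {
      System.out.println("exhausted"); // prints exhausted
    }
  }
}
```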

private int offsetIdx = 0;
private int sizeIdx = 0;

OffsetsAwareTargetSplitSizeScanTaskIterator(
Contributor

Nit: style should be:

    OffsetsAwareTargetSplitSizeScanTaskIterator(
        List<Long> offsetList, FileScanTask parentScanTask, long targetSplitSize) {
      ...
    }

long targetSplitSize, List<List<Long>> offsetLenPairs) {
List<FileScanTask> tasks = Lists.newArrayList(
new BaseFileScanTask.OffsetsBasedSplitScanTaskIterator(offsetRanges, new MockFileScanTask(fileLen)));
new BaseFileScanTask.OffsetsAwareTargetSplitSizeScanTaskIterator(offsetRanges,
Contributor

Nit: style should be to wrap at the function call and place all arguments on the next line.

import org.junit.Test;

public class TestOffsetsBasedSplitScanTaskIterator {

Contributor

Nit: unless this was required by lint checks, could we avoid adding newlines? They can cause conflicts.


OffsetsAwareTargetSplitSizeScanTaskIterator(
List<Long> offsetList, FileScanTask parentScanTask, long targetSplitSize) {
this.offsets = ImmutableList.copyOf(offsetList);
Contributor

Missed this the first time: why copy the offset list? It shouldn't change.

Collaborator Author

Just being a bit defensive. Future implementations of the DataFile interface may not create an immutable copy before passing it downstream.

Contributor

Seems fine since this should be an ImmutableList in the current read path and won't actually copy.
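The defensive-copy concern can be illustrated with the JDK alone. This sketch is a stand-in, not the PR's code: `DefensiveCopySketch` is hypothetical, and it uses `Collections.unmodifiableList(new ArrayList<>(...))`, which always copies, in place of Guava's `ImmutableList.copyOf`, which can skip the copy when the input is already an `ImmutableList`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DefensiveCopySketch {
  private final List<Long> offsets;

  DefensiveCopySketch(List<Long> offsetList) {
    // Snapshot the caller's list so later mutations cannot change our view of it.
    this.offsets = Collections.unmodifiableList(new ArrayList<>(offsetList));
  }

  List<Long> offsets() {
    return offsets;
  }

  public static void main(String[] args) {
    List<Long> callerList = new ArrayList<>(Arrays.asList(4L, 14L));
    DefensiveCopySketch holder = new DefensiveCopySketch(callerList);
    callerList.add(24L); // the caller mutates its list after handing it over
    System.out.println(holder.offsets()); // prints [4, 14]
  }
}
```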

throw new NoSuchElementException();
}
long currentSize = splitSizes.get(sizeIdx);
sizeIdx += 1; // always consume at least one file split
Contributor

@rdblue rdblue Jun 3, 2019

Is offsetIdx needed? It is always set to sizeIdx before next returns. I think it could be a local variable instead, and you could then drop the "size" prefix from sizeIdx.

offsetIdx = sizeIdx;
return combinedTask;
}

Contributor

Nit: another blank line

.toString();
}


Contributor

Nit: extra blank line

@rdblue rdblue merged commit 7cde609 into apache:master Jun 4, 2019
@rdblue
Contributor

rdblue commented Jun 4, 2019

Merged. Thanks for fixing this, @samarthjain!

ismailsimsek pushed a commit to ismailsimsek/iceberg that referenced this pull request Jan 15, 2025
* matf-non-flattening-mongodb-debezium-smt

- adds debezium mongo SMT for converting BSON before/after into typed Struct before/after

(cherry picked from commit 21d741e53ce77547edbb5838f1b2b49db619be0c)
ismailsimsek pushed a commit to ismailsimsek/iceberg that referenced this pull request Feb 18, 2025
* matf-non-flattening-mongodb-debezium-smt

- adds debezium mongo SMT for converting BSON before/after into typed Struct before/after

(cherry picked from commit 21d741e53ce77547edbb5838f1b2b49db619be0c)

2 participants