
Core: fix reading of split offsets in manifests #8834

Merged: 7 commits into apache:main on Oct 16, 2023

Conversation

@bryanck (Contributor) commented Oct 14, 2023

This PR fixes a critical bug in reading split offsets from manifests. A change in #8336 added caching of the offsets collection in BaseFile to avoid reallocation. However, the Avro reader reuses the same BaseFile object while reading, so the offsets list is only allocated for the first entry, and that cached list is then returned for every subsequent entry.

cc @aokolnychyi @RussellSpitzer @rdblue @danielcweeks This can result in corrupted metadata in cases where the invalid offsets are read in and then written back, e.g. when rewriting manifests.

@github-actions bot added the core label on Oct 14, 2023
private static final RecursiveComparisonConfiguration FILE_COMPARISON_CONFIG =
    RecursiveComparisonConfiguration.builder()
        .withIgnoredFields(
            "dataSequenceNumber", "fileOrdinal", "fileSequenceNumber", "fromProjectionPos")
        .build();
Member
So this will compare split offsets now?

Contributor Author (@bryanck)

Yes, it will compare all fields from the input data files, except for the excluded fields, which aren't expected to match.

@@ -127,6 +129,7 @@ public class TableTestBase {
.withFileSizeInBytes(10)
.withPartitionPath("data_bucket=3") // easy way to set partition data for now
.withRecordCount(1)
.withSplitOffsets(ImmutableList.of(4L, 10_000_000L, 20_000_000L))
Member
Minor comment, but can we make the splits unique? Just want to make sure things will break in an obvious way if they don't match.

Contributor Author (@bryanck)

Sure, I made this change

@advancedxy (Contributor)

Rather than reverting the optimization in #8336, would it be better to invalidate the cached splitOffsetList? The proposed change is in the org.apache.iceberg.BaseFile#put method:

...
      case 12:
        this.upperBounds = SerializableByteBufferMap.wrap((Map<Integer, ByteBuffer>) value);
        return;
      case 13:
        this.keyMetadata = ByteBuffers.toByteArray((ByteBuffer) value);
        return;
      case 14:
        this.splitOffsets = ArrayUtil.toLongArray((List<Long>) value);
        this.splitOffsetList = null; // invalidate the cache
        return;
      case 15:
        this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
....

@nastra added this to the Iceberg 1.4.1 milestone on Oct 16, 2023
@bryanck (Contributor Author) commented Oct 16, 2023

> Rather than reverting the optimization in #8336, would it be better to invalidate the cached splitOffsetList? The proposed change is in the org.apache.iceberg.BaseFile#put method:

The goal of this PR is to revert the code to a known working state. There are a couple of optimizations that could be made (e.g. to equality IDs as well), but I felt that belongs in a separate PR.

"Should read the expected files",
Lists.newArrayList(FILE_A.path(), FILE_B.path(), FILE_C.path()),
files);
assertThat(files)
Contributor

I believe this can be simplified to

    assertThat(files)
          .usingRecursiveComparison()
          .ignoringFields(
              "dataSequenceNumber", "fileOrdinal", "fileSequenceNumber", "fromProjectionPos")
          .isEqualTo(Lists.newArrayList(FILE_A, FILE_B, FILE_C));

That way you don't need the static final field for the comparison config.

Contributor Author (@bryanck)

My thought was that the config variable could be reused in other tests.

@advancedxy (Contributor)

> Rather than reverting the optimization in #8336, would it be better to invalidate the cached splitOffsetList? The proposed change is in the org.apache.iceberg.BaseFile#put method:

> The goal of this PR is to revert the code to a known working state. There are a couple of optimizations that could be made (e.g. to equality IDs as well), but I felt that belongs in a separate PR.

If the current PR is intended to be included in the hotfix version 1.4.1, then I think reverting the code is the safe way to go.

    }

-    return splitOffsetList;
+    return ArrayUtil.toUnmodifiableLongList(splitOffsets);
Contributor
🤦 Good catch.

@rdblue merged commit 46cad6d into apache:main on Oct 16, 2023
45 checks passed
nastra pushed a commit to nastra/iceberg that referenced this pull request Oct 16, 2023
The list was not correctly invalidated when reusing the file.
rdblue pushed a commit that referenced this pull request Oct 17, 2023
The list was not correctly invalidated when reusing the file.

Co-authored-by: Bryan Keller <bryanck@gmail.com>