Spark: Fix first row ID carry over for manifest rewrite by amogh-jahagirdar · Pull Request #16699 · apache/iceberg

amogh-jahagirdar · 2026-06-06T17:52:52Z

During a manifest rewrite via Spark actions, we do not correctly carry over the first row ID for the entries being moved. The records that Spark reads from the entries metadata table are adapted into a SparkContentFile structure, which did not override firstRowId, so the adaptation nulled out the carried-over first row ID. This can lead to new row IDs being assigned via inheritance.
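To illustrate the failure mode described above, here is a minimal, self-contained sketch. It does not use the real Iceberg or Spark classes: `ContentFileLike`, `Row`, `AdapterWithoutOverride`, and `AdapterWithOverride` are hypothetical stand-ins, and the assumption is only that the interface supplies a default `firstRowId()` returning null, so an adapter that forgets the override silently drops the value held in the row.

```java
public class FirstRowIdCarryOver {
  // Hypothetical stand-in for a content-file interface with a null-returning default.
  interface ContentFileLike {
    String location();

    default Long firstRowId() {
      return null;
    }
  }

  // Hypothetical positional row, standing in for a record read from a metadata table.
  static final class Row {
    private final Object[] values;

    Row(Object... values) {
      this.values = values;
    }

    Object get(int pos) {
      return values[pos];
    }

    boolean isNullAt(int pos) {
      return values[pos] == null;
    }
  }

  // Adapter that does not override firstRowId(): the interface default wins.
  static class AdapterWithoutOverride implements ContentFileLike {
    protected final Row row;

    AdapterWithoutOverride(Row row) {
      this.row = row;
    }

    @Override
    public String location() {
      return (String) row.get(0);
    }
  }

  // Adapter that reads firstRowId from a resolved position in the row.
  static final class AdapterWithOverride extends AdapterWithoutOverride {
    private final int firstRowIdPosition;

    AdapterWithOverride(Row row, int firstRowIdPosition) {
      super(row);
      this.firstRowIdPosition = firstRowIdPosition;
    }

    @Override
    public Long firstRowId() {
      return row.isNullAt(firstRowIdPosition) ? null : (Long) row.get(firstRowIdPosition);
    }
  }

  public static void main(String[] args) {
    Row row = new Row("file-a.parquet", 1000L);
    System.out.println("without override: " + new AdapterWithoutOverride(row).firstRowId());
    System.out.println("with override:    " + new AdapterWithOverride(row, 1).firstRowId());
  }
}
```

In this sketch the row carries a first row ID in both cases; only the adapter with the explicit override surfaces it to a downstream writer.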

amogh-jahagirdar · 2026-06-06T17:53:22Z

I think the fix is small enough to backport directly to the Spark versions in this PR, but let's first agree on the fix before we do that work.

stevenzwu

LGTM — fix follows the existing referencedDataFile/contentOffset/contentSizeInBytes override pattern.

A few minor nits inline.

stevenzwu · 2026-06-06T18:59:08Z

  private final int referencedDataFilePosition;
  private final int contentOffsetPosition;
  private final int contentSizePosition;
+  private final int firstRowIdPosition;


Nit: the FIRST_ROW_ID schema field (id 142) sits before REFERENCED_DATA_FILE/CONTENT_OFFSET/CONTENT_SIZE (143–145), and the rest of the constructor mirrors that order. Suggest moving this declaration above referencedDataFilePosition (and the assignment on line 111) for consistency.

stevenzwu · 2026-06-06T18:59:08Z

+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
+    options.put(TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED, snapshotIdInheritanceEnabled);


Nit: SNAPSHOT_ID_INHERITANCE_ENABLED is a v1-only knob, but these new tests assert formatVersion >= 3, so the option is a no-op here. Same on lines 251 and 296 — can be dropped from all three new tests.

stevenzwu · 2026-06-06T18:59:08Z

+  }
+
+  @TestTemplate
+  public void testRewriteV3PartitionedManifestsPreservesFirstRowId() {


Optional: this test differs from testRewriteV3ManifestsPreservesFirstRowId only by PartitionSpec. Could fold into a single helper that takes the spec. Existing tests in this file already split partitioned/unpartitioned variants, so leaving as-is is also fine.

stevenzwu · 2026-06-06T18:59:08Z

+            .orderBy("_row_id")
+            .collectAsList();
+
+    assertThat(rowsAfter).containsExactlyElementsOf(rowsBefore);


containsExactlyElementsOf is null-safe — if _row_id resolves to null for both rowsBefore and rowsAfter (e.g., a future regression in metadata-column resolution), the equality would silently pass and hide the regression. testRewriteManifestsAfterV2ToV3Upgrade already guards against this with .extracting(_row_id).doesNotContainNull(). Suggest adding the same check on rowsBefore here and at line 286 so the assertion is self-defending.

+1 to adding the null guard.
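The concern above can be reproduced without AssertJ or Spark. The sketch below uses plain `java.util` lists as a stand-in for the collected `_row_id` values; `sameRowIds` and `sameNonNullRowIds` are hypothetical helper names, not methods from the PR. It shows that a bare equality check passes when both sides are all null, while a check that first rejects nulls does not.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class NullGuardSketch {
  // Bare equality: passes even when every row ID on both sides is null.
  static boolean sameRowIds(List<Long> before, List<Long> after) {
    return before.equals(after);
  }

  // Guarded equality: requires the "before" values to be non-null first.
  static boolean sameNonNullRowIds(List<Long> before, List<Long> after) {
    return before.stream().noneMatch(Objects::isNull) && before.equals(after);
  }

  public static void main(String[] args) {
    // Simulates a regression where the row ID column resolves to null everywhere.
    List<Long> before = Arrays.asList((Long) null, (Long) null);
    List<Long> after = Arrays.asList((Long) null, (Long) null);

    System.out.println("bare equality passes:    " + sameRowIds(before, after));
    System.out.println("guarded equality passes: " + sameNonNullRowIds(before, after));
  }
}
```

In the actual tests, the equivalent guard would be the `doesNotContainNull()` check that the review comment says testRewriteManifestsAfterV2ToV3Upgrade already uses.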

huaxingao

LGTM

Spark: Fix first row ID carry over for manifest rewrite

9ed3e1d

github-actions Bot added the spark label Jun 6, 2026

amogh-jahagirdar requested review from RussellSpitzer, kevinjqliu, nastra, rdblue, singhpk234 and stevenzwu June 6, 2026 17:54

stevenzwu approved these changes Jun 6, 2026

huaxingao approved these changes Jun 7, 2026

amogh-jahagirdar wants to merge 1 commit into apache:main from amogh-jahagirdar:rewrite-manifests-fix

3 participants
