Skip to content

Spark: Fix first row ID carry over for manifest rewrite#16699

Open
amogh-jahagirdar wants to merge 1 commit into
apache:mainfrom
amogh-jahagirdar:rewrite-manifests-fix
Open

Spark: Fix first row ID carry over for manifest rewrite#16699
amogh-jahagirdar wants to merge 1 commit into
apache:mainfrom
amogh-jahagirdar:rewrite-manifests-fix

Conversation

@amogh-jahagirdar
Copy link
Copy Markdown
Contributor

During a manifest rewrite via Spark actions, we do not correctly carry over the first row ID for the entries being moved. This is because the records from the entries metadata table that Spark reads are adapted into a SparkContentFile structure. This SparkContentFile structure did not override firstRowId and as a result the adaptation would null out the carried over first row ID. This can lead to new row IDs being assigned via inheritance.

@github-actions github-actions Bot added the spark label Jun 6, 2026
@amogh-jahagirdar
Copy link
Copy Markdown
Contributor Author

I think the fix is small enough to just backport directly to spark versions in this PR but let's first agree on the fix before we do that work.

Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — fix follows the existing referencedDataFile/contentOffset/contentSizeInBytes override pattern.

A few minor nits inline.

private final int referencedDataFilePosition;
private final int contentOffsetPosition;
private final int contentSizePosition;
private final int firstRowIdPosition;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the FIRST_ROW_ID schema field (id 142) sits before REFERENCED_DATA_FILE/CONTENT_OFFSET/CONTENT_SIZE (143–145), and the rest of the constructor mirrors that order. Suggest moving this declaration above referencedDataFilePosition (and the assignment on line 111) for consistency.

PartitionSpec spec = PartitionSpec.unpartitioned();
Map<String, String> options = Maps.newHashMap();
options.put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
options.put(TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED, snapshotIdInheritanceEnabled);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: SNAPSHOT_ID_INHERITANCE_ENABLED is a v1-only knob, but these new tests assert formatVersion >= 3, so the option is a no-op here. Same on lines 251 and 296 — can be dropped from all three new tests.

}

@TestTemplate
public void testRewriteV3PartitionedManifestsPreservesFirstRowId() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: this test differs from testRewriteV3ManifestsPreservesFirstRowId only by PartitionSpec. Could fold into a single helper that takes the spec. Existing tests in this file already split partitioned/unpartitioned variants, so leaving as-is is also fine.

.orderBy("_row_id")
.collectAsList();

assertThat(rowsAfter).containsExactlyElementsOf(rowsBefore);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

containsExactlyElementsOf is null-safe — if _row_id resolves to null for both rowsBefore and rowsAfter (e.g., a future regression in metadata-column resolution), the equality would silently pass and hide the regression. testRewriteManifestsAfterV2ToV3Upgrade already guards against this with .extracting(_row_id).doesNotContainNull(). Suggest adding the same check on rowsBefore here and at line 286 so the assertion is self-defending.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to adding the null guard.

Copy link
Copy Markdown
Contributor

@huaxingao huaxingao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants