Spark: Fix first row ID carry over for manifest rewrite#16699
Spark: Fix first row ID carry over for manifest rewrite#16699amogh-jahagirdar wants to merge 1 commit into
Conversation
|
I think the fix is small enough to just backport directly to spark versions in this PR but let's first agree on the fix before we do that work. |
stevenzwu
left a comment
There was a problem hiding this comment.
LGTM — fix follows the existing referencedDataFile/contentOffset/contentSizeInBytes override pattern.
A few minor nits inline.
| private final int referencedDataFilePosition; | ||
| private final int contentOffsetPosition; | ||
| private final int contentSizePosition; | ||
| private final int firstRowIdPosition; |
There was a problem hiding this comment.
Nit: the FIRST_ROW_ID schema field (id 142) sits before REFERENCED_DATA_FILE/CONTENT_OFFSET/CONTENT_SIZE (143–145), and the rest of the constructor mirrors that order. Suggest moving this declaration above referencedDataFilePosition (and the assignment on line 111) for consistency.
| PartitionSpec spec = PartitionSpec.unpartitioned(); | ||
| Map<String, String> options = Maps.newHashMap(); | ||
| options.put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion)); | ||
| options.put(TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED, snapshotIdInheritanceEnabled); |
There was a problem hiding this comment.
Nit: SNAPSHOT_ID_INHERITANCE_ENABLED is a v1-only knob, but these new tests assert formatVersion >= 3, so the option is a no-op here. Same on lines 251 and 296 — can be dropped from all three new tests.
| } | ||
|
|
||
| @TestTemplate | ||
| public void testRewriteV3PartitionedManifestsPreservesFirstRowId() { |
There was a problem hiding this comment.
Optional: this test differs from testRewriteV3ManifestsPreservesFirstRowId only by PartitionSpec. Could fold into a single helper that takes the spec. Existing tests in this file already split partitioned/unpartitioned variants, so leaving as-is is also fine.
| .orderBy("_row_id") | ||
| .collectAsList(); | ||
|
|
||
| assertThat(rowsAfter).containsExactlyElementsOf(rowsBefore); |
There was a problem hiding this comment.
containsExactlyElementsOf is null-safe — if _row_id resolves to null for both rowsBefore and rowsAfter (e.g., a future regression in metadata-column resolution), the equality would silently pass and hide the regression. testRewriteManifestsAfterV2ToV3Upgrade already guards against this with .extracting(_row_id).doesNotContainNull(). Suggest adding the same check on rowsBefore here and at line 286 so the assertion is self-defending.
There was a problem hiding this comment.
+1 to adding the null guard.
During a manifest rewrite via Spark actions, we do not correctly carry over the first row ID for the entries being moved. This is because the records from the
entriesmetadata table that Spark reads are adapted into a SparkContentFile structure. This SparkContentFile structure did not override firstRowId and as a result the adaptation would null out the carried over first row ID. This can lead to new row IDs being assigned via inheritance.