fix(scan): use current_schema for default-snapshot column validation#2566
Open
nazq wants to merge 1 commit into
Open
fix(scan): use current_schema for default-snapshot column validation#2566nazq wants to merge 1 commit into
nazq wants to merge 1 commit into
Conversation
A default table scan (no explicit snapshot_id) currently validates
caller-supplied column names against the snapshot's schema_id, not the
table's current schema. After an UpdateSchemaAction commit changes the
current schema (rename/add/delete column), pre-existing snapshots still
point at the old schema_id, so the validation loop in
TableScanBuilder::build rejects names that are valid against the
post-evolution schema with:
DataInvalid => Column <new_name> not found in table. Schema: <old>
The downstream Parquet projection
(arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids)
already maps field IDs to on-disk column names via PARQUET:field_id
metadata, so resolving names against the current schema is safe
end-to-end — field IDs are stable across schema versions, and the
file's original column names live in the parquet metadata until the
file is rewritten.
Fix: branch on whether the caller asked for a specific snapshot.
Explicit snapshot_id (time-travel) keeps the snapshot-time vocabulary;
default scan uses the table's current schema.
Tests: three regression tests on a fixture with current-schema-id=1
(id, value) and a sole snapshot at schema-id=0 (id, tmp):
* test_default_scan_uses_current_schema_after_evolution — select(['id','value'])
succeeds in the default scan
* test_default_scan_rejects_old_name_after_rename — select(['id','tmp'])
fails with DataInvalid in the default scan
* test_snapshot_id_scan_uses_snapshot_schema — snapshot_id(1).select(['id','tmp'])
succeeds (time-travel), and snapshot_id(1).select(['id','value']) fails
All 1299 iceberg lib tests pass (37 in scan::tests = 34 existing + 3
new). Clippy + rustfmt clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
UpdateSchemaActioncommit #2565.What changes are included in this PR?
TableScanBuilder::build()currently validates caller-supplied column names againstsnapshot.schema(metadata), which returns the schema the snapshot was written under. For default scans (no explicitsnapshot_id) on a table whose current schema differs from the snapshot's schema — i.e. any table that's been through anUpdateSchemaActioncommit — this rejects columns that are valid against the post-evolution schema:Fix
crates/iceberg/src/scan/mod.rs:221:snapshot_id(time-travel): keep the snapshot-time vocabulary. A caller asking for "the table as it existed at snapshot N" should see schema N's columns.snapshot_id): use the table's current schema. Field IDs are stable across schemas, so the downstream Parquet projection (get_arrow_projection_mask_with_field_ids, which readsPARQUET:field_idmetadata embedded by the writer) still finds the right on-disk columns even when the file's column names differ from the current schema's column names.This matches PyIceberg's approach in
pyiceberg/io/pyarrow.py::_task_to_record_batches— project by field ID, rename the arrow batch on the way out.The column-name validation loop (lines 224–237) and the subsequent
field_id_by_namelookup (lines 256–261) share the sameschemavariable, so the fix is one assignment with a 12-line comment explaining the invariant.Why this wasn't caught earlier
UpdateSchemaAction(#2120) shipped with metadata-only tests incrates/catalog/loader/tests/schema_update_suite.rs— none of them exercisetable.scan().select_columns(...)after the schema commit. The pre-existingcrates/integration_tests/tests/read_evolved_schema.rsonly usestable.scan().build()with noselect_columns, which bypasses the validation loop entirely. So the regression has been latent in the add/delete path since #2120 merged; it's just easier to trip with rename.Are these changes tested?
Yes — three regression tests in
scan::tests, all on the same minimal fixture (make_schema_evolved_table) withcurrent-schema-id=1(id, value) and a sole snapshot atschema-id=0(id, tmp):test_default_scan_uses_current_schema_after_evolutionselect(["id", "value"])succeeds in the default scan — the post-evolution column name resolvestest_default_scan_rejects_old_name_after_renameselect(["id", "tmp"])fails withDataInvalid— the pre-evolution name is no longer part of the public vocabulary in a default scantest_snapshot_id_scan_uses_snapshot_schemasnapshot_id(1).select(["id", "tmp"])succeeds (time-travel keeps the old vocabulary);snapshot_id(1).select(["id", "value"])failsAll 34 existing
scan::testsstill pass. Full iceberg lib suite: 1299/1299. Clippy + rustfmt clean.