feat: Don't drop additional statistics #1849

Fokko · 2025-11-12T18:29:13Z

This is a behavioral change.

In Iceberg-Rust we require the field IDs of upper/lower bounds to be part of the schema. In some cases they aren't, most obviously after schema evolution.

In PyIceberg we expect these values in some tests:

FAILED tests/integration/test_inspect_table.py::test_inspect_files[2] - AssertionError: Difference in column lower_bounds: {} != {2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c18-079b-4217-afd8-559ce216e875.parquet', 2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00'}
assert {} == {2147483545: ...e875.parquet'}
  Right contains 2 more items:
  {2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00',
   2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c1'
               b'8-079b-4217-afd8-559ce216e875.parquet'}
  Full diff:
    {
  +  ,
  -  2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00',
  -  2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c1'
  -              b'8-079b-4217-afd8-559ce216e875.parquet',
    }
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
==== 1 failed, 238 passed, 32 skipped, 3123 deselected in 61.56s (0:01:01) =====

This is a positional delete where the field-IDs are constant, but never part of a schema (they are reserved).
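
To make the failing assertion easier to read, here is a minimal sketch of how the two bounds in the test output decode. It assumes the reserved IDs 2147483546 and 2147483545 are the `file_path` and `pos` columns of a position delete file, and that bounds use Iceberg's single-value binary encoding (UTF-8 for strings, 8-byte little-endian for longs). `DecodedBound` and `decode_reserved_bound` are hypothetical names for illustration, not part of iceberg-rust.

```rust
use std::collections::HashMap;

// Reserved field IDs seen in the test output above (assumed to be the
// position delete columns `file_path` and `pos`).
const DELETE_FILE_PATH_ID: i32 = 2147483546; // i32::MAX - 101
const DELETE_FILE_POS_ID: i32 = 2147483545; // i32::MAX - 102

// Hypothetical result type for this sketch only.
#[derive(Debug, PartialEq)]
enum DecodedBound {
    FilePath(String),
    Pos(i64),
    Unknown(Vec<u8>),
}

// Decode a raw bound for one of the reserved IDs; anything else, or any
// malformed value, is kept as raw bytes instead of being dropped.
fn decode_reserved_bound(field_id: i32, raw: &[u8]) -> DecodedBound {
    match field_id {
        DELETE_FILE_PATH_ID => match std::str::from_utf8(raw) {
            Ok(path) => DecodedBound::FilePath(path.to_string()),
            Err(_) => DecodedBound::Unknown(raw.to_vec()),
        },
        DELETE_FILE_POS_ID if raw.len() == 8 => {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(raw);
            DecodedBound::Pos(i64::from_le_bytes(buf))
        }
        _ => DecodedBound::Unknown(raw.to_vec()),
    }
}

fn main() {
    // The lower_bounds map that PyIceberg expects in the failing test.
    let mut lower_bounds: HashMap<i32, Vec<u8>> = HashMap::new();
    lower_bounds.insert(DELETE_FILE_POS_ID, vec![0u8; 8]);
    lower_bounds.insert(
        DELETE_FILE_PATH_ID,
        b"s3://warehouse/default/table_metadata_files/data/00000-0-8d621c18-079b-4217-afd8-559ce216e875.parquet".to_vec(),
    );

    for (id, raw) in &lower_bounds {
        println!("{id}: {:?}", decode_reserved_bound(*id, raw));
    }
}
```

Neither ID appears in any table schema, which is why a schema-driven parser that skips unknown IDs returns `{}` here.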

Which issue does this PR close?

Closes #.

What changes are included in this PR?

Are these changes tested?

kevinjqliu

LGTM! This aligns with both PyIceberg and Spark behavior.

kevinjqliu · 2025-11-12T19:32:47Z

manually retriggering ci runs, due to #1838 😞

liurenjie1024 · 2025-11-13T09:43:45Z

crates/iceberg/src/spec/manifest/_serde.rs

+        } else {
+            // Field is not in current schema (e.g., dropped field due to schema evolution).
+            // Store the statistic as binary data to preserve it even though we don't know its type.
+            m.insert(entry.key, Datum::binary(entry.value.to_vec()));


This fix may lead to silent errors in user applications: users assume that all statistics returned in a Manifest are parsed, so they will see a type mismatch. I think there are two ways to do this fix:

1. We add a new enum:

   enum Statistic { Parsed(Datum), Raw(Vec<u8>) }

   And change DataFile's lower/upper bounds to HashMap<i32, Statistic>. This is a breaking API change, but users will be aware of it and will not see their application break silently.

2. We pass TableMetadata into this parsing function and search for the missing field ID in all schemas in TableMetadata. This approach may slow down deserialization a little when it encounters field IDs dropped by schema evolution, but it will not lead to any API change.

Personally I prefer approach 2, WDYT?
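
For comparison, the two approaches could be sketched roughly as follows. This is only an illustration under stated assumptions: `PrimitiveType`, `Datum`, `Schema`, `parse_datum`, `bounds_with_raw`, and `bounds_from_all_schemas` are simplified stand-ins invented here, not the actual iceberg-rust types or functions, and `schemas` stands in for the list of schemas held by TableMetadata.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration; not the iceberg-rust definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PrimitiveType {
    Long,
    String,
}

#[derive(Debug, Clone, PartialEq)]
enum Datum {
    Long(i64),
    String(String),
}

// Approach 1: the value type records whether parsing was possible.
#[derive(Debug, Clone, PartialEq)]
enum Statistic {
    Parsed(Datum),
    Raw(Vec<u8>),
}

struct Schema {
    fields: HashMap<i32, PrimitiveType>,
}

// Parse one bound according to a known type; None if the bytes do not fit.
fn parse_datum(ty: PrimitiveType, raw: &[u8]) -> Option<Datum> {
    match ty {
        PrimitiveType::Long => {
            if raw.len() != 8 {
                return None;
            }
            let mut buf = [0u8; 8];
            buf.copy_from_slice(raw);
            Some(Datum::Long(i64::from_le_bytes(buf)))
        }
        PrimitiveType::String => std::str::from_utf8(raw)
            .ok()
            .map(|s| Datum::String(s.to_string())),
    }
}

// Approach 1: IDs missing from the current schema are kept as Raw bytes.
fn bounds_with_raw(current: &Schema, raw: &HashMap<i32, Vec<u8>>) -> HashMap<i32, Statistic> {
    raw.iter()
        .map(|(id, bytes)| {
            let parsed = current.fields.get(id).and_then(|ty| parse_datum(*ty, bytes));
            let stat = match parsed {
                Some(datum) => Statistic::Parsed(datum),
                None => Statistic::Raw(bytes.clone()),
            };
            (*id, stat)
        })
        .collect()
}

// Approach 2: look the ID up in every known schema, in order.
// In this sketch, IDs found in no schema are dropped.
fn bounds_from_all_schemas(schemas: &[Schema], raw: &HashMap<i32, Vec<u8>>) -> HashMap<i32, Datum> {
    raw.iter()
        .filter_map(|(id, bytes)| {
            let ty = schemas.iter().find_map(|s| s.fields.get(id))?;
            parse_datum(*ty, bytes).map(|datum| (*id, datum))
        })
        .collect()
}

fn main() {
    // Field 2 existed in an older schema and was dropped from the current one.
    let current = Schema { fields: HashMap::from([(1, PrimitiveType::Long)]) };
    let old = Schema {
        fields: HashMap::from([(1, PrimitiveType::Long), (2, PrimitiveType::String)]),
    };
    let raw = HashMap::from([(1, 7i64.to_le_bytes().to_vec()), (2, b"abc".to_vec())]);

    println!("approach 1: {:?}", bounds_with_raw(&current, &raw));
    println!("approach 2: {:?}", bounds_from_all_schemas(&[current, old], &raw));
}
```

With approach 1 the caller has to match on `Statistic` and so cannot mistake raw bytes for a typed value. With approach 2 the map keeps its existing value type, at the cost of a lookup across schemas for each ID that the current schema does not contain.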

Fokko mentioned this pull request Nov 12, 2025

[epic] address manifest reader feature gaps between rust and python implementations #1714

Open

10 tasks

kevinjqliu approved these changes Nov 12, 2025

kevinjqliu mentioned this pull request Nov 12, 2025

Tracking issues of Iceberg Rust 0.8 Release #1850

Open

11 tasks

liurenjie1024 reviewed Nov 13, 2025
