Conversation

@Fokko Fokko commented Nov 12, 2025


Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

Are these changes tested?

This is a behavioral change.

In Iceberg-Rust we require upper/lower bounds to be part of the schema. But in some cases this isn't true, most obviously after schema evolution.

In PyIceberg we expect these values in some tests:

```
FAILED tests/integration/test_inspect_table.py::test_inspect_files[2] - AssertionError: Difference in column lower_bounds: {} != {2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c18-079b-4217-afd8-559ce216e875.parquet', 2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00'}
assert {} == {2147483545: ...e875.parquet'}
  Right contains 2 more items:
  {2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00',
   2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c1'
               b'8-079b-4217-afd8-559ce216e875.parquet'}
  Full diff:
    {
  +  ,
  -  2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00',
  -  2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c1'
  -              b'8-079b-4217-afd8-559ce216e875.parquet',
    }
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
==== 1 failed, 238 passed, 32 skipped, 3123 deselected in 61.56s (0:01:01) =====
```

This is a positional delete where the field IDs are constant, but never part of a schema (they are reserved).
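
For context, the two reserved IDs in the test output above come straight from the Iceberg spec: 2147483546 is `file_path` (a string) and 2147483545 is `pos` (a long) in positional delete files, and bounds use Iceberg's single-value binary serialization. A minimal sketch (not part of this PR; `describe_bound` is an illustrative helper) of how a consumer could decode such raw bounds:

```rust
// Reserved field IDs for positional delete files, per the Iceberg spec.
const FIELD_ID_DELETE_FILE_PATH: i32 = 2147483546; // `file_path`, a UTF-8 string
const FIELD_ID_DELETE_POS: i32 = 2147483545; // `pos`, an 8-byte little-endian long

fn describe_bound(field_id: i32, raw: &[u8]) -> String {
    match field_id {
        // String bounds are serialized as plain UTF-8 bytes.
        FIELD_ID_DELETE_FILE_PATH => String::from_utf8_lossy(raw).into_owned(),
        // Long bounds are serialized as 8 little-endian bytes.
        FIELD_ID_DELETE_POS => {
            let bytes: [u8; 8] = raw.try_into().expect("pos bound must be 8 bytes");
            i64::from_le_bytes(bytes).to_string()
        }
        _ => format!("<{} raw bytes>", raw.len()),
    }
}

fn main() {
    // The `pos` lower bound from the failing test above: eight zero bytes decode to 0.
    assert_eq!(describe_bound(FIELD_ID_DELETE_POS, &[0u8; 8]), "0");
}
```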

@kevinjqliu kevinjqliu left a comment


LGTM! This aligns with both pyiceberg and spark behavior

@kevinjqliu

Manually retriggering CI runs due to #1838 😞

```rust
} else {
    // Field is not in current schema (e.g., dropped field due to schema evolution).
    // Store the statistic as binary data to preserve it even though we don't know its type.
    m.insert(entry.key, Datum::binary(entry.value.to_vec()));
```

@liurenjie1024 liurenjie1024 Nov 13, 2025


This fix may lead to silent errors in user applications, because users assume that the statistics returned in the Manifest are all parsed, and they will see a type mismatch. I think there are two ways to do this fix:

  1. We add a new enum:

```rust
enum Statistic {
    Parsed(Datum),
    Raw(Vec<u8>),
}
```

And change DataFile's lower/upper bounds to HashMap<i32, Statistic>. This will lead to a breaking API change, but users will be aware of it and will not see silent breakage of their applications.

  2. We pass TableMetadata into this parsing function and search for the missing field ID in all schemas in TableMetadata. This approach may slow down deserialization a little when it encounters field IDs dropped by schema evolution, but it will not lead to any API change (see the sketch below).

Personally I prefer approach 2, WDYT?
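
To make approach 2 concrete, here is a minimal, self-contained sketch using stand-in types (`Schema` and `TableMetadata` below are simplified stubs, not the real iceberg-rust structs): the parser consults every historical schema before falling back to raw bytes.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real iceberg-rust types.
struct Schema {
    fields: HashMap<i32, String>, // field ID -> type name
}

struct TableMetadata {
    schemas: Vec<Schema>, // every schema the table has ever had
}

// Search all historical schemas, newest first, so a field dropped by
// schema evolution can still be resolved to a concrete type.
fn resolve_field_type(metadata: &TableMetadata, field_id: i32) -> Option<&str> {
    metadata
        .schemas
        .iter()
        .rev()
        .find_map(|schema| schema.fields.get(&field_id).map(String::as_str))
}
```

The extra cost is bounded by the number of schemas in the metadata, and it is only paid for field IDs that miss in the current schema.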

