-
Notifications
You must be signed in to change notification settings - Fork 347
feat: Don't drop additional statistics #1849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Don't drop additional statistics #1849
Conversation
This is a behavioral change.
In Iceberg-Rust we require upper/lower bounds to be part
of the schema. But in some cases, this in't the case, most
obvious schema evolution.
In PyIceberg we expect these values in some tests:
```
FAILED tests/integration/test_inspect_table.py::test_inspect_files[2] - AssertionError: Difference in column lower_bounds: {} != {2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c18-079b-4217-afd8-559ce216e875.parquet', 2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00'}
assert {} == {2147483545: ...e875.parquet'}
Right contains 2 more items:
{2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00',
2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c1'
b'8-079b-4217-afd8-559ce216e875.parquet'}
Full diff:
{
+ ,
- 2147483545: b'\x00\x00\x00\x00\x00\x00\x00\x00',
- 2147483546: b's3://warehouse/default/table_metadata_files/data/00000-0-8d621c1'
- b'8-079b-4217-afd8-559ce216e875.parquet',
}
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
==== 1 failed, 238 passed, 32 skipped, 3123 deselected in 61.56s (0:01:01) =====
```
This is a positional delete where the field-IDs are constant,
but never part of a schema (they are reserved).
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! This aligns with both pyiceberg and spark behavior
|
manually retriggering ci runs, due to #1838 😞 |
| } else { | ||
| // Field is not in current schema (e.g., dropped field due to schema evolution). | ||
| // Store the statistic as binary data to preserve it even though we don't know its type. | ||
| m.insert(entry.key, Datum::binary(entry.value.to_vec())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix may lead to slient error in user application, because user assumes that the returned statistics in Manifest are all parsed, and they will see a type mismatch. I think there are two ways to do this fix:
- We add a new enum:
enum Statistic {
Parsed(Datum),
Raw(Vec<u8)
}And change DataFile's lower/upper bound to HashMap<i32, Statistic>. This will lead to breaking api change, but will users will be aware of this, and will not see slient breaking of their application.
- We pass
TableMetadatainto this parsing function, and search for missing field id in all schemas inTableMetadata. This approach may slow down the deserialization a little when seeing field ids due to schema evolution, but it will not lead to any api change.
Personally I perfer to approach 2, WDYT?
This is a behavioral change.
In Iceberg-Rust we require upper/lower bounds to be part of the schema. But in some cases, this in't the case, most obvious schema evolution.
In PyIceberg we expect these values in some tests:
This is a positional delete where the field-IDs are constant, but never part of a schema (they are reserved).
Which issue does this PR close?
What changes are included in this PR?
Are these changes tested?