Open
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
As a next step in integrating PyIceberg and Iceberg-Rust, it would be great to push down the Parquet reading (including all the schema evolution) to Iceberg-Rust. Today, PyIceberg iterates over every record batch in Python, which puts a lot of pressure on the GIL. All of this logic (schema evolution, projecting missing columns, renames, re-ordering, etc.) should happen in the Parquet reader, but PyArrow doesn't give us the flexibility to project by field ID, so this is what we ended up with.
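To make "projecting on ID" concrete, here is a minimal sketch in plain Python of what the reader has to do per file. It is not PyIceberg or Iceberg-Rust code: `Field`, `project_by_id`, and the column-dict layout are hypothetical names used only for illustration. The point is that columns are matched on field ID rather than name, so renames, re-ordering, and columns missing from older files all fall out of one lookup.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a schema field; not a PyIceberg class.
    field_id: int
    name: str
    default: Optional[Any] = None


def project_by_id(file_fields, table_fields, columns):
    """Re-shape columns read from a data file into the table schema by field ID."""
    num_rows = len(next(iter(columns.values()))) if columns else 0
    # Map each field ID to the name that column had when the file was written.
    name_in_file = {f.field_id: f.name for f in file_fields}
    projected = {}
    for field in table_fields:  # iterating the table schema fixes the output order
        old_name = name_in_file.get(field.field_id)
        if old_name is None:
            # Column was added after the file was written: fill with the default.
            projected[field.name] = [field.default] * num_rows
        else:
            # Same ID, possibly a different name: this handles renames.
            projected[field.name] = columns[old_name]
    return projected


# File written with (id, name); the table has since renamed, re-ordered, and added a column.
file_fields = [Field(1, "id"), Field(2, "name")]
table_fields = [Field(2, "full_name"), Field(1, "id"), Field(3, "country")]
columns = {"id": [1, 2], "name": ["a", "b"]}

result = project_by_id(file_fields, table_fields, columns)
```

Doing this per record batch in Python is the part that holds the GIL; the same lookup inside the Rust Parquet reader would not.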
The most logical separation would be to pass the FileScanTask into Iceberg-Rust.
We can break it down into building blocks:
- Ability to leverage the Iceberg-Rust FileIO in PyIceberg to open up streams.
- Ability to pass down a PyIceberg schema into Iceberg-Rust.
  - Can we serialize it into JSON? That seems costly, though. Ideally, we want to reuse objects rather than copy them from one side to the other.
- Pass down expressions.
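For the JSON question, a rough sketch of what the round trip involves, using only the standard library. The dict below is hand-written in the layout the Iceberg spec uses for schema JSON; how PyIceberg would produce it and how Iceberg-Rust would consume it are left out, since that is the open question here.

```python
import json

# Hand-written schema in the Iceberg spec's JSON layout (illustrative values).
schema = {
    "type": "struct",
    "schema-id": 0,
    "fields": [
        {"id": 1, "name": "id", "required": True, "type": "long"},
        {"id": 2, "name": "name", "required": False, "type": "string"},
    ],
}

# Python side: encode the schema into a string to hand across the boundary.
payload = json.dumps(schema)

# Receiving side: parse it back. This builds a brand new object graph,
# so nothing is shared with the original and every call pays for the copy.
received = json.loads(payload)
field_ids = [f["id"] for f in received["fields"]]
```

The encode and parse steps are cheap for one schema, but they would repeat for every task unless the parsed schema is cached on the Rust side.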
From the callgraph: [image]
Describe the solution you'd like
No response
Willingness to contribute
None