PyIceberg-Core: Push down Parquet reading to Iceberg-Rust #1144

@Fokko

Description

Is your feature request related to a problem or challenge?

As a next step in integrating PyIceberg and Iceberg-Rust, it would be great to push down the Parquet reading (including all the schema evolution) to Iceberg-Rust. Today, PyIceberg iterates over every record batch in Python, which puts a lot of pressure on the GIL. All of this logic (schema evolution, projecting missing columns, renames, re-ordering, etc.) belongs in the Parquet reader, but PyArrow does not give us the flexibility to project by field ID, so this is what we ended up with.
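To make the per-batch work concrete, here is a minimal pure-Python sketch of projecting by field ID. It is not PyIceberg's implementation: `Field` and `project_by_id` are hypothetical names, columns are plain lists instead of Arrow arrays, and missing columns are simply filled with nulls.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field: a stable ID plus a name."""
    field_id: int
    name: str


def project_by_id(
    file_fields: List[Field],
    file_columns: Dict[str, List[Any]],
    table_fields: List[Field],
    num_rows: int,
) -> Dict[str, List[Any]]:
    """Reshape one batch from the file's schema to the table's current schema."""
    # Index the file's columns by field ID, not by name, so renames are harmless.
    by_id = {f.field_id: file_columns[f.name] for f in file_fields}
    projected: Dict[str, List[Any]] = {}
    # Iterating over the table schema yields the requested column order.
    for field in table_fields:
        if field.field_id in by_id:
            # Same ID, possibly a new name: reuse the data under the new name.
            projected[field.name] = by_id[field.field_id]
        else:
            # Column added after the file was written: fill with nulls
            # (a simplification for this sketch).
            projected[field.name] = [None] * num_rows
    return projected


# The file was written with columns (id, name); the table has since renamed
# "name" to "full_name", added "country", and re-ordered the columns.
file_fields = [Field(1, "id"), Field(2, "name")]
file_columns = {"id": [1, 2], "name": ["a", "b"]}
table_fields = [Field(3, "country"), Field(2, "full_name"), Field(1, "id")]

batch = project_by_id(file_fields, file_columns, table_fields, num_rows=2)
```

Running something like this in Python for every record batch is the kind of work that would move into the Rust reader.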

The most logical separation would be to pass the FileScanTask into Iceberg-Rust.

We can break it down into building blocks:

  • Ability to leverage the Iceberg-Rust FileIO in PyIceberg to open up streams.
  • Ability to pass down a PyIceberg schema into Iceberg-Rust.
    • Can we serialize it into JSON? That seems costly, though. Ideally, we want to reuse objects rather than copy them from one side to the other.
  • Pass down expressions.
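As a rough illustration of the JSON option, the sketch below encodes a schema once and reuses the bytes for every task, so the serialization cost is paid per scan rather than per file. The schema dict follows the JSON layout from the Iceberg table spec; `encode_schema`, the task dicts, and the file paths are hypothetical and only stand in for whatever would actually cross the Python/Rust boundary.

```python
import json
from typing import Any, Dict, List

# A schema in the JSON layout defined by the Iceberg table spec.
schema: Dict[str, Any] = {
    "type": "struct",
    "schema-id": 0,
    "fields": [
        {"id": 1, "name": "id", "required": True, "type": "long"},
        {"id": 2, "name": "full_name", "required": False, "type": "string"},
    ],
}


def encode_schema(schema: Dict[str, Any]) -> bytes:
    """Serialize the schema to compact UTF-8 JSON (hypothetical helper)."""
    return json.dumps(schema, separators=(",", ":")).encode("utf-8")


# Encode once per scan, not once per task.
schema_bytes = encode_schema(schema)

# Hypothetical tasks: each one references the same pre-encoded schema bytes.
paths: List[str] = ["s3://bucket/a.parquet", "s3://bucket/b.parquet"]
tasks = [{"data_file_path": p, "schema": schema_bytes} for p in paths]

# The receiving side would parse the bytes back into its own schema type.
decoded = json.loads(schema_bytes)
```

This still copies the schema across the boundary once; sharing the object itself, as the bullet above prefers, would avoid even that.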

From the callgraph:

[Two call-graph images; not preserved in this copy]

Describe the solution you'd like

No response

Willingness to contribute

None

Labels: enhancement
