
Python write support #6564

Closed
2 of 4 tasks
Fokko opened this issue Jan 11, 2023 · 10 comments
Comments

@Fokko
Contributor

Fokko commented Jan 11, 2023

Feature Request / Improvement

This is a placeholder ticket for implementing write support for PyIceberg.

Since we don't want PyIceberg to write the actual data, but only handle the metadata side of the Iceberg table format, we need an overview of the frameworks we most likely want to integrate with (PyArrow, Dask (fastparquet?), etc.).

I would suggest the following first steps to keep it simple: Write using PyArrow (since that's the most commonly used FileIO) and start with unpartitioned tables.

What we need:

Query engine: None

@Fokko Fokko added the python label Jan 11, 2023
@nazq

nazq commented Jan 14, 2023

Keen to see this one progress, especially with PyArrow. Cc @asheeshgarg. Happy to help out.

@Hayder-Aziz-cardano

It would be really nice to get this working with Polars (which utilises PyArrow / ConnectorX), since it makes it possible to run execution services without large monolithic services like Hive or Spark.

@marsupialtail

A great intermediate step would be to simply allow an engine like Polars/Pandas/Quokka to write out the actual Parquet files in a given location, with provided names that don't conflict with existing files, and then "register" those Parquet files in the Iceberg metadata.
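
A minimal sketch of the kind of non-conflicting naming this could use, assuming a UUID-based file name similar to what other Iceberg writers produce. The helper name and directory layout below are hypothetical, not PyIceberg's API:

```python
import uuid


def new_data_file_path(table_location: str, partition_dir: str = "") -> str:
    """Build a data file path that cannot clash with existing files.

    Hypothetical helper: embeds a UUID in the file name so that concurrent
    writers never overwrite each other's files.
    """
    file_name = f"{uuid.uuid4()}-0.parquet"
    parts = [table_location.rstrip("/"), "data"]
    if partition_dir:
        parts.append(partition_dir.strip("/"))
    parts.append(file_name)
    return "/".join(parts)


# Example:
# new_data_file_path("s3://bucket/warehouse/db/table")
# -> "s3://bucket/warehouse/db/table/data/<uuid>-0.parquet"
```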

@asheeshgarg

One option is to directly leverage the Java APIs by using Py4J or JPype to call the underlying implementation. This would reduce maintenance in two places.

@corleyma

> One option is to directly leverage the Java APIs by using Py4J or JPype to call the underlying implementation. This would reduce maintenance in two places.

From my perspective at least, any solution that makes the JVM a dependency for users of the Python API is a poor one.

@Matthieusalor

Matthieusalor commented Mar 31, 2023

Regarding statistics of written files, the write_to_dataset and write_dataset functions of PyArrow provide a file_visitor argument that allows retrieving the path and metadata of each written file:
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html

The metadata object is https://arrow.apache.org/docs/python/generated/pyarrow.parquet.FileMetaData.html
It allows retrieving the metadata of each row group: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.RowGroupMetaData.html#pyarrow.parquet.RowGroupMetaData

which in turn allows retrieving the metadata (including statistics) of each column chunk: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ColumnChunkMetaData.html#pyarrow.parquet.ColumnChunkMetaData
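
For illustration, a small sketch of how the file_visitor callback could collect per-file statistics during a write with a recent PyArrow. The table and output path are made up for the example; how the statistics would map onto Iceberg manifest fields is only hinted at in the comments:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Example table and output path, purely for illustration.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

collected = []  # (path, FileMetaData) for every file PyArrow writes


def file_visitor(written_file):
    # written_file.path is where the file landed; written_file.metadata is a
    # pyarrow.parquet.FileMetaData with per-row-group, per-column statistics.
    collected.append((written_file.path, written_file.metadata))


pq.write_to_dataset(table, "/tmp/pyiceberg_demo", file_visitor=file_visitor)

for path, file_metadata in collected:
    for i in range(file_metadata.num_row_groups):
        row_group = file_metadata.row_group(i)    # RowGroupMetaData
        for j in range(row_group.num_columns):
            column = row_group.column(j)          # ColumnChunkMetaData
            stats = column.statistics             # min/max/null_count, may be None
            if stats is not None:
                # These values are what an Iceberg manifest entry would need
                # for lower/upper bounds and null counts.
                print(path, column.path_in_schema,
                      stats.min, stats.max, stats.null_count)
```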

@Fokko
Contributor Author

Fokko commented Mar 31, 2023

@Matthieusalor Thanks for the suggestion. I was looking into the metadata_collector: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_metadata.html
It looks like we can leverage this to collect the metadata during the write.
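
For reference, a sketch of that pattern following the PyArrow docs (the table and paths are illustrative). The collected FileMetaData objects carry the same row-group statistics as in the file_visitor example above, and can optionally be combined into a dataset-level _metadata file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3]})
root_path = "/tmp/pyiceberg_demo"  # illustrative output location

metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# metadata_collector now holds one pyarrow.parquet.FileMetaData per written
# file, which is what we would need to build Iceberg manifest entries.
for file_metadata in metadata_collector:
    print(file_metadata.num_rows, file_metadata.num_row_groups)

# Optionally combine them into a dataset-level _metadata file:
pq.write_metadata(table.schema, f"{root_path}/_metadata",
                  metadata_collector=metadata_collector)
```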

@luancaarvalho

Hello everyone!
I was wondering if there have been any updates recently. Additionally, is there a roadmap available that shows if any contributors are currently working on the feature I am interested in?
Thank you!

@corleyma

corleyma commented Aug 15, 2023

I dunno about roadmap, but there are some outstanding PRs in different stages of development that are directly relevant to this feature and are linked/referenced by the task list in the issue body.

Specifically, this PR seems worth following up on if you're looking to contribute:
#7831

Also, this repo organizes issues into milestones; there you can see milestones for the next two planned releases of pyiceberg and the issues that are currently assigned to each.

@Fokko
Contributor Author

Fokko commented Oct 2, 2023

Migrated to the new repository: apache/iceberg-python#23

@Fokko Fokko closed this as completed Oct 2, 2023