
Python write support #6564

Closed
2 of 4 tasks
Fokko opened this issue Jan 11, 2023 · 10 comments
Comments

@Fokko
Contributor

Fokko commented Jan 11, 2023

Feature Request / Improvement

This is a placeholder ticket for implementing write support for PyIceberg.

Since we don't want PyIceberg to write the actual data, but only handle the metadata side of the Iceberg table format, we need an overview of the frameworks we most likely want to integrate with (PyArrow, Dask (fastparquet?), etc.).

I would suggest the following first steps to keep it simple: Write using PyArrow (since that's the most commonly used FileIO) and start with unpartitioned tables.

What we need:

Query engine: None

@Fokko Fokko added the python label Jan 11, 2023
@nazq

nazq commented Jan 14, 2023

Keen to see this one progress, especially with PyArrow. Cc @asheeshgarg. Happy to help out.

@Hayder-Aziz-cardano

It would be really nice to get this working with Polars (which utilises PyArrow / ConnectorX), since it makes it possible to run execution services without large monolithic services like Hive or Spark.

@marsupialtail

A great intermediate step would be to simply allow an engine like Polars/Pandas/Quokka to write out the actual Parquet files in a given location, with provided names that don't conflict with existing files, and then "register" those Parquet files in the Iceberg metadata.
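
A minimal sketch of the kind of non-conflicting naming this could use, assuming a UUID-based file name similar to what other Iceberg writers produce. The helper name and directory layout below are hypothetical, not PyIceberg's API:

```python
import uuid


def new_data_file_path(table_location: str, partition_dir: str = "") -> str:
    """Build a data file path that cannot clash with existing files.

    Hypothetical helper: embeds a UUID in the file name so that concurrent
    writers never overwrite each other's files.
    """
    file_name = f"{uuid.uuid4()}-0.parquet"
    parts = [table_location.rstrip("/"), "data"]
    if partition_dir:
        parts.append(partition_dir.strip("/"))
    parts.append(file_name)
    return "/".join(parts)


# Example:
# new_data_file_path("s3://bucket/warehouse/db/table")
# -> "s3://bucket/warehouse/db/table/data/<uuid>-0.parquet"
```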

@asheeshgarg

One option is to directly leverage the Java APIs by using Py4J or JPype to call the underlying implementation. This would reduce maintenance in two places.

@corleyma

> One option is to directly leverage the Java APIs by using Py4J or JPype to call the underlying implementation. This would reduce maintenance in two places.

From my perspective at least, any solution that makes the JVM a dependency for users of the Python API is a poor one.

@Matthieusalor

Matthieusalor commented Mar 31, 2023

Regarding statistics of written files, the write_to_dataset and write_dataset functions of PyArrow provide a file_visitor argument that allows retrieving the path and metadata of each written file:
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html

The metadata object is https://arrow.apache.org/docs/python/generated/pyarrow.parquet.FileMetaData.html
It allows retrieving the metadata of each row group: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.RowGroupMetaData.html#pyarrow.parquet.RowGroupMetaData

which in turn allows retrieving the metadata (including statistics) of each column chunk: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ColumnChunkMetaData.html#pyarrow.parquet.ColumnChunkMetaData
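
For illustration, a small sketch of how the file_visitor callback could collect per-file statistics during a write with a recent PyArrow. The table and output path are made up for the example; how the statistics would map onto Iceberg manifest fields is only hinted at in the comments:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Example table and output path, purely for illustration.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

collected = []  # (path, FileMetaData) for every file PyArrow writes


def file_visitor(written_file):
    # written_file.path is where the file landed; written_file.metadata is a
    # pyarrow.parquet.FileMetaData with per-row-group, per-column statistics.
    collected.append((written_file.path, written_file.metadata))


pq.write_to_dataset(table, "/tmp/pyiceberg_demo", file_visitor=file_visitor)

for path, file_metadata in collected:
    for i in range(file_metadata.num_row_groups):
        row_group = file_metadata.row_group(i)    # RowGroupMetaData
        for j in range(row_group.num_columns):
            column = row_group.column(j)          # ColumnChunkMetaData
            stats = column.statistics             # min/max/null_count, may be None
            if stats is not None:
                # These values are what an Iceberg manifest entry would need
                # for lower/upper bounds and null counts.
                print(path, column.path_in_schema,
                      stats.min, stats.max, stats.null_count)
```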

@Fokko
Contributor Author

Fokko commented Mar 31, 2023

@Matthieusalor Thanks for the suggestion. I was looking into the metadata_collector: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_metadata.html
It looks like we can leverage this to collect the metadata during the write.
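
For reference, a sketch of that pattern following the PyArrow docs (the table and paths are illustrative). The collected FileMetaData objects carry the same row-group statistics as in the file_visitor example above, and can optionally be combined into a dataset-level _metadata file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3]})
root_path = "/tmp/pyiceberg_demo"  # illustrative output location

metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# metadata_collector now holds one pyarrow.parquet.FileMetaData per written
# file, which is what we would need to build Iceberg manifest entries.
for file_metadata in metadata_collector:
    print(file_metadata.num_rows, file_metadata.num_row_groups)

# Optionally combine them into a dataset-level _metadata file:
pq.write_metadata(table.schema, f"{root_path}/_metadata",
                  metadata_collector=metadata_collector)
```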

@luancaarvalho

Hello everyone!
I was wondering if there have been any updates recently. Additionally, is there a roadmap available that shows if any contributors are currently working on the feature I am interested in?
Thank you!

@corleyma

corleyma commented Aug 15, 2023

I dunno about roadmap, but there are some outstanding PRs in different stages of development that are directly relevant to this feature and are linked/referenced by the task list in the issue body.

Specifically, this PR seems worth following up on if you're looking to contribute:
#7831

Also, this repo organizes issues into milestones; there you can see milestones for the next two planned releases of pyiceberg and the issues that are currently assigned to each.

@Fokko
Contributor Author

Fokko commented Oct 2, 2023

Migrated to the new repository: apache/iceberg-python#23

@Fokko Fokko closed this as completed Oct 2, 2023