Python write support #6564
Comments
Keen to see this one progress, especially with pyarrow. Cc @asheeshgarg. Happy to help out.
It would be really nice to get it working with polars (which utilises pyarrow / connectorx), as it provides the possibility of running execution services without large monolithic services like Hive or Spark.
A great intermediate step would be to simply allow an engine like Polars/Pandas/Quokka to write out the actual parquet files in a given location with provided names that don't conflict with existing files, and then "register" those Parquet files in the iceberg metadata file.
One thing we could do is directly leverage the Java APIs using Py4J or JPype to call the underlying implementation. This would reduce maintenance in two places.
From my perspective at least, any solution that makes the JVM a dependency for users of the Python API is a poor solution. |
Regarding statistics of written files: the write_to_dataset and write_dataset functions of pyarrow provide a file_visitor argument that allows retrieving the path and metadata of each written file. The metadata object is a pyarrow.parquet.FileMetaData (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.FileMetaData.html), which gives access to each column chunk's metadata, including statistics (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ColumnChunkMetaData.html#pyarrow.parquet.ColumnChunkMetaData).
@Matthieusalor Thanks for the suggestion. I was looking into the
Hello everyone! |
I don't know about a roadmap, but there are some outstanding PRs in different stages of development that are directly relevant to this feature and are linked/referenced by the task list in the issue body. Specifically, these PRs seem worth following up on if you're looking to contribute. Also, this repo organizes issues into milestones; there you can see milestones for the next two planned releases of pyiceberg and the issues that are currently assigned to each.
Migrated to the new repository: apache/iceberg-python#23 |
Feature Request / Improvement
This is a placeholder ticket for implementing write support for PyIceberg.
Since we don't want PyIceberg to write the actual data, and only handle the metadata part of the Iceberg table format, we need to get an overview of the frameworks we most likely want to integrate with (PyArrow, Dask (fastparquet?), etc).
I would suggest the following first steps to keep it simple: Write using PyArrow (since that's the most commonly used FileIO) and start with unpartitioned tables.
What we need:
Query engine
None