Create Iceberg table from existing Parquet files with slightly different schemas (schema merge is possible). #601
Comments
There's a You can also read all the Parquet files into memory with PyArrow, merge the different schemas (using pyarrow.unify_schemas), and then write the Arrow table as an Iceberg table. Maybe DuckDB works well here since
Thank you @kevinjqliu! I successfully merged the schemas:
but the next lines give an error:
Regarding DuckDB: unfortunately, union by name does not work for nested Parquet files whose schemas differ at any level of the nested structures. BTW, it does work for JSON in DuckDB. Here is my question in the DuckDB discussions:
Looks like your schema is nested, which makes things more complicated. Merging nested schemas is pretty difficult, and I'm not sure there's an out-of-the-box solution for it. That said, most of the difficulties here are not related to Iceberg. One thing I wonder is whether PyIceberg can handle schema evolution of nested structs.
@kevinjqliu
Looks like it can.
BTW: I found an explanation of why merging Arrow tables with different schemas is not possible: It is probably possible to implement table merging in PyArrow after checking that there are no duplicate column names in each struct or at the root level.
Wow, I learned something today. I hope nobody uses that in real life.
Nested structs, and structs inside maps and lists, are all supported :) @sergun In PyIceberg we also have a
Is there a Java solution for importing Parquet files? @Fokko
Question
Hi!
What is the right way to create an Iceberg table from existing Parquet files with slightly different schemas, where merging their schemas is possible?
I would like to create the Iceberg table with the iceberg-python library (without Spark).