Support Parquet v2 Spark vectorized read #7162
Comments
Will start to work on this, assigning to myself.
Doesn't this require full Parquet v2 support, @jackye1995? If so, I think we should rename the issue.
Yes, that's right.
Is there any workaround for this issue on Spark?
@wanglinsong No, since Iceberg uses custom readers.
Perhaps this configuration will help you:
spark.read \
    .format("iceberg") \
    .option("vectorization-enabled", "false") \
    .load(f"`{catalog}`.`{database}`.`{table}`")
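The per-read option above could also be applied once per table or per session. This is only a sketch: it assumes the Iceberg table property `read.parquet.vectorization.enabled` and the Spark session setting `spark.sql.iceberg.vectorization.enabled` are available in your Iceberg version. The helper names and the table identifier in the usage note are hypothetical.

```python
def disable_vectorization_sql(table: str) -> str:
    """Build an ALTER TABLE statement that turns off vectorized Parquet reads.

    Assumes the Iceberg table property `read.parquet.vectorization.enabled`;
    check the documentation for your Iceberg version.
    """
    return (
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')"
    )


def disable_vectorization_for_session(spark) -> None:
    """Turn off Iceberg vectorized reads for every read in this Spark session.

    `spark` is an existing SparkSession; the setting name is an assumption
    to verify against your Iceberg version.
    """
    spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")
```

With an active session this might be used as `spark.sql(disable_vectorization_sql("my_catalog.db.events"))`. The table property affects every reader of the table, while the session setting only affects the current Spark application.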
@jackye1995 would love an update if possible. This issue is almost a year old now.
@jackye1995 I can work on this if you see fit.
Just want to chime in that I am also seeing this error on tables written by Spark 3.4 with recent Iceberg/Parquet, then reading those same tables again with Spark 3.4/Iceberg. Testing @quatpv's suggested workaround from above...
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' |
Feature Request / Improvement
As it stands today, if you want to use both Spark and AWS Athena with your Iceberg tables in v1.1.0, you must disable the vectorized reader. Athena writes fields using delta encoding, which the vectorized reader does not support.
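To show why a reader needs dedicated decoding logic for such columns, here is a deliberately simplified sketch of delta encoding. It is not the Parquet `DELTA_BINARY_PACKED` layout, which additionally groups values into blocks and miniblocks and bit-packs the deltas. The function names are hypothetical and chosen only for illustration.

```python
from itertools import accumulate
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Store the first value followed by the difference between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded: List[int]) -> List[int]:
    """Rebuild the original values with a running sum over the deltas."""
    return list(accumulate(encoded))
```

A reader that only understands plain or dictionary-encoded pages would misinterpret the stored deltas as values, so it has to reject the column or implement the running-sum step.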
If you have ever hit the following error for a primitive type (complex types can be solved by #521), you have probably been impacted by this issue:
Spark implemented this support in 2022 (apache/spark#35262). However, Iceberg uses its own vectorized reader.
Is it possible to implement support for these encodings? It would solve a significant interoperability problem between Athena, Spark, and possibly other query engines using them.
Query engine
Athena + Spark