
Support Parquet v2 Spark vectorized read #7162

Closed
anthonysgro opened this issue Mar 21, 2023 · 11 comments
@anthonysgro

anthonysgro commented Mar 21, 2023

Feature Request / Improvement

As it stands today, if you want to use both Spark and AWS Athena with your Iceberg tables in v1.1.0, you must disable the vectorized reader. This is because Athena writes fields with delta encodings, which the vectorized reader does not support.

If you have ever hit the following error for a primitive type (complex types can be solved by #521), you have probably been impacted by this issue:

java.lang.UnsupportedOperationException: Cannot support vectorized reads for column [email] optional binary email (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:96)

Spark implemented this support in 2022 (apache/spark#35262). However, Iceberg uses its own vectorized reader, so it does not benefit from that change.

Is it possible to implement support for these encodings? It would solve a significant interoperability problem between Athena, Spark, and possibly other query engines that use them.

Query engine

Athena + Spark
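
For anyone unsure whether a given file is affected, the encodings recorded in the Parquet footer can be inspected directly. The following is a minimal sketch using pyarrow (not part of the original report; the file path is a placeholder):

# Sketch: print the encodings used by each column chunk so you can confirm
# whether DELTA_BYTE_ARRAY / DELTA_LENGTH_BYTE_ARRAY are present.
import pyarrow.parquet as pq

metadata = pq.ParquetFile("athena_written_file.parquet").metadata  # placeholder path
for rg in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        print(chunk.path_in_schema, chunk.encodings)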

@jackye1995
Contributor

Will start to work on this, assigning to myself.

@jackye1995 jackye1995 self-assigned this Apr 20, 2023
@RussellSpitzer
Member

Doesn't this require full Parquet V2 support, @jackye1995? If so, I think we should rename the issue.

@jackye1995
Contributor

Yes, that's right.

@jackye1995 jackye1995 changed the title Missing vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support Support Parquet v2 Spark vectorized read Apr 20, 2023
@wanglinsong

Is there any workaround for this issue on Spark?

@RussellSpitzer
Member

@wanglinsong No, since Iceberg uses custom readers.

@quatpv

quatpv commented Aug 1, 2023

Is there any workaround for this issue on Spark?

Perhaps this configuration will help you.

# spark is an existing SparkSession; catalog, database, and table hold your identifiers
df = spark.read \
    .format("iceberg") \
    .option("vectorization-enabled", "false") \
    .load(f"`{catalog}`.`{database}`.`{table}`")
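
If passing the read option in every job is inconvenient, a table-level setting should have the same effect. A sketch, assuming the Iceberg table property read.parquet.vectorization.enabled is available in your version (the table identifier is a placeholder):

# Sketch: disable vectorized Parquet reads for all Spark readers of this table
# by setting an Iceberg table property; "my_catalog.my_db.my_table" is a placeholder.
spark.sql("""
    ALTER TABLE my_catalog.my_db.my_table
    SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')
""")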

@anthonysgro
Author

@jackye1995, I would love an update if possible. This issue is almost a year old now.

@wgtmac
Member

wgtmac commented Feb 7, 2024

@jackye1995 I can work on this if you see fit.

@atronchi

atronchi commented Mar 27, 2024

Just want to chime in that I am also seeing this error on tables written by Spark 3.4 with recent Iceberg/Parquet, and then reading those same tables back with Spark 3.4/Iceberg. Testing @quatpv's suggested workaround from above...


github-actions bot commented Oct 4, 2024

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Oct 4, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

@github-actions github-actions bot closed this as not planned Oct 19, 2024