
Support Parquet v2 Spark vectorized read #7162

Closed
anthonysgro opened this issue Mar 21, 2023 · 11 comments
@anthonysgro

anthonysgro commented Mar 21, 2023

Feature Request / Improvement

As it stands today, if you want to use both Spark and AWS Athena with your Iceberg tables in v1.1.0, you must disable the vectorized reader. This is because Athena writes fields with delta encodings, which the vectorized reader does not support.

If you have ever hit the following error for a primitive type (complex types can be solved by #521), you have probably been impacted by this issue:

java.lang.UnsupportedOperationException: Cannot support vectorized reads for column [email] optional binary email (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:96)

Spark implemented this support in 2022 (apache/spark#35262). However, Iceberg uses its own vectorized reader, so it does not benefit from that change.

Is it possible to implement support for these encodings? It would solve a significant interoperability problem between Athena, Spark, and possibly other query engines that use them.

Query engine

Athena + Spark
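
For anyone unsure whether a given file is affected, the encodings recorded in the Parquet footer can be inspected directly. The following is a minimal sketch using pyarrow (not part of the original report; the file path is a placeholder):

# Sketch: print the encodings used by each column chunk so you can confirm
# whether DELTA_BYTE_ARRAY / DELTA_LENGTH_BYTE_ARRAY are present.
import pyarrow.parquet as pq

metadata = pq.ParquetFile("athena_written_file.parquet").metadata  # placeholder path
for rg in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        print(chunk.path_in_schema, chunk.encodings)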

@jackye1995
Contributor

Will start to work on this, assigning to myself.

@jackye1995 jackye1995 self-assigned this Apr 20, 2023
@RussellSpitzer
Member

Doesn't this require full Parquet V2 support, @jackye1995? If so, I think we should rename the issue.

@jackye1995
Contributor

Yes, that's right.

@jackye1995 jackye1995 changed the title Missing vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support Support Parquet v2 Spark vectorized read Apr 20, 2023
@wanglinsong

Is there any workaround for this issue on Spark?

@RussellSpitzer
Member

@wanglinsong No, since Iceberg uses custom readers.

@quatpv

quatpv commented Aug 1, 2023

Is there any workaround for this issue on Spark?

Perhaps this configuration will help you.

# spark is an existing SparkSession; catalog, database, and table hold your identifiers
df = spark.read \
    .format("iceberg") \
    .option("vectorization-enabled", "false") \
    .load(f"`{catalog}`.`{database}`.`{table}`")
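
If passing the read option in every job is inconvenient, a table-level setting should have the same effect. A sketch, assuming the Iceberg table property read.parquet.vectorization.enabled is available in your version (the table identifier is a placeholder):

# Sketch: disable vectorized Parquet reads for all Spark readers of this table
# by setting an Iceberg table property; "my_catalog.my_db.my_table" is a placeholder.
spark.sql("""
    ALTER TABLE my_catalog.my_db.my_table
    SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')
""")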

@anthonysgro
Author

@jackye1995, I would love an update if possible. This issue is almost a year old now.

@wgtmac
Member

wgtmac commented Feb 7, 2024

@jackye1995 I can work on this if you see fit.

@atronchi

atronchi commented Mar 27, 2024

Just want to chime in that I am also seeing this error on tables written by Spark 3.4 with recent Iceberg/Parquet, and then reading those same tables back with Spark 3.4/Iceberg. Testing @quatpv's suggested workaround from above...


github-actions bot commented Oct 4, 2024

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Oct 4, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

@github-actions github-actions bot closed this as not planned Oct 19, 2024