
[Parquet] Throw Better Exception with Vectorized Parquet V2 Format #2740

Closed

RussellSpitzer wants to merge 1 commit into apache:master from RussellSpitzer:ErrorVectorizedReadParquet

Conversation

@RussellSpitzer
Member

Previously, when vectorized reading was enabled and a Parquet V2 file
was being read, our reader would throw a NullPointerException because
the input stream would not contain the expected bytes. To temporarily give
a better error, we rethrow any NPE raised while attempting to read the
encoding as a ParquetDecodingException with details noting that vectorized
reading is only compatible with V1 encodings.

@github-actions bot added the arrow label Jun 25, 2021
@RussellSpitzer
Member Author

Fixes #2692

@RussellSpitzer
Member Author

@kbendick + @samarthjain Could you take a look at this?

@samarthjain
Collaborator

@RussellSpitzer - I am hoping we can find a better solution here. I am generally not a fan of catching NPEs :)

There are a few other approaches possible here:

  1. Parquet V2 actually isn't that well tested. Later versions of Trino, though, have started writing Parquet files in the V2 format. We encountered this issue in Iceberg vectorized reads when we upgraded our Presto clusters to the Trino 350 release. We worked around it by reintroducing the older Parquet write path in Trino that writes Parquet V1 files.

  2. To fix this in Iceberg, we should either:

  • look into supporting vectorized reads for V2, or
  • disable vectorized reads when/if we can detect that the Parquet files are in V2 format.

I can take up looking into 2).
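
For reference, a rough illustration of the detection idea in option 2 above. This is
an assumption about how V2 files might be recognized (the V2 writer typically emits
the delta encodings), not Iceberg's actual logic:

```java
import java.util.Set;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

// Sketch: treat a column chunk as "V2-style" when its metadata lists encodings that
// the V2 writer typically emits, and use that signal to fall back to non-vectorized reads.
class ParquetV2Detection {
  static boolean usesV2OnlyEncodings(ColumnChunkMetaData chunk) {
    Set<Encoding> encodings = chunk.getEncodings();
    return encodings.contains(Encoding.DELTA_BINARY_PACKED)
        || encodings.contains(Encoding.DELTA_BYTE_ARRAY)
        || encodings.contains(Encoding.DELTA_LENGTH_BYTE_ARRAY);
  }
}
```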

@RussellSpitzer
Member Author

@samarthjain I'm fine with just disabling if we detect that files are in V2 format. I just wanted to get this fix in for now, since the current behavior is to throw an NPE that doesn't contain any information. I'd be fine removing it once we have a better fix, or if you know of a fast way we can detect whether a file is V2.

@kbendick
Contributor

Will current Iceberg not work for V2 files at all, for example if the columns all wind up with plain encoding (a trivial example)?

I agree that catching the NPE isn't the most elegant solution, but if there's a possibility that V2 pages can be read when they only use the existing encodings, that would be nice.

Otherwise, explicitly failing upon encountering a V2 page would be the best way to start.

Please let me know if I can help somehow @samarthjain

@samarthjain
Collaborator

For now, a better solution would be to throw a similar exception, like UnsupportedOperationException("Vectorized reads not supported for Parquet V2").

The place to make this change would be:
https://github.com/apache/iceberg/blob/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedPageIterator.java#L535
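
One way the suggested fail-fast could look, sketched with Parquet's page visitor; the
class name and the exact placement inside VectorizedPageIterator are assumptions:

```java
import org.apache.parquet.column.page.DataPage;
import org.apache.parquet.column.page.DataPageV1;
import org.apache.parquet.column.page.DataPageV2;

// Sketch only: reject V2 data pages up front instead of letting the reader NPE later.
class PageVersionGuard {
  static void rejectV2(DataPage page) {
    page.accept(new DataPage.Visitor<Void>() {
      @Override
      public Void visit(DataPageV1 dataPageV1) {
        return null; // continue with the existing V1 decoding path
      }

      @Override
      public Void visit(DataPageV2 dataPageV2) {
        throw new UnsupportedOperationException("Vectorized reads not supported for Parquet V2");
      }
    });
  }
}
```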

@samarthjain
Collaborator

I submitted a PR that adds limited support for Parquet V2. Like Spark vectorized reads, we fail if the encoding of non-dictionary encoded data is not plain. @RussellSpitzer, @kbendick - would be good to get your eyes on it. Thanks!
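
The check described here presumably looks something like the following (an assumed
shape, not the code from that PR):

```java
import org.apache.parquet.column.Encoding;

// Assumed shape of the limited-V2 check: non-dictionary data must be PLAIN-encoded.
class PlainOnlySketch {
  static void validateDataEncoding(Encoding dataEncoding, boolean dictionaryEncoded) {
    if (!dictionaryEncoded && dataEncoding != Encoding.PLAIN) {
      throw new UnsupportedOperationException(
          "Cannot vectorize reads for non-dictionary data with encoding " + dataEncoding
              + "; only PLAIN is supported");
    }
  }
}
```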
