
[Parquet] Throw Better Exception with Vectorized Parquet V2 Format #2740

Closed

RussellSpitzer wants to merge 1 commit into apache:master from RussellSpitzer:ErrorVectorizedReadParquet

Conversation

@RussellSpitzer
Member

Previously, when vectorized reading was enabled and a Parquet V2 file
was being read, our reader would throw a NullPointerException because
the input stream would not contain the expected bytes. To temporarily give
a better error, we rethrow any NPE raised while attempting to read the
encoding as a ParquetDecodingException with details noting that vectorized
reading is only compatible with V1 encodings.

@github-actions bot added the arrow label Jun 25, 2021
@RussellSpitzer
Member Author

Fixes #2692

@RussellSpitzer
Member Author

@kbendick + @samarthjain Could you take a look at this?

@samarthjain
Collaborator

@RussellSpitzer - I am hoping we can find a better solution here. I am generally not a fan of catching NPEs :)

There are a few other approaches possible here:

  1. Parquet V2 actually isn't that well tested. Later versions of Trino, though, have started writing Parquet files in the V2 format. We encountered this issue in Iceberg vectorized reads when we upgraded our Presto clusters to the Trino 350 release. We worked around it by reintroducing the older Parquet write path in Trino that writes Parquet V1 files.

  2. To fix this in Iceberg, we should either:

  • look into supporting vectorized reads for V2, or
  • disable vectorized reads when/if we can detect that the Parquet files are in V2 format.

I can take up looking into 2).
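
For reference, a rough illustration of the detection idea in option 2 above. This is
an assumption about how V2 files might be recognized (the V2 writer typically emits
the delta encodings), not Iceberg's actual logic:

```java
import java.util.Set;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

// Sketch: treat a column chunk as "V2-style" when its metadata lists encodings that
// the V2 writer typically emits, and use that signal to fall back to non-vectorized reads.
class ParquetV2Detection {
  static boolean usesV2OnlyEncodings(ColumnChunkMetaData chunk) {
    Set<Encoding> encodings = chunk.getEncodings();
    return encodings.contains(Encoding.DELTA_BINARY_PACKED)
        || encodings.contains(Encoding.DELTA_BYTE_ARRAY)
        || encodings.contains(Encoding.DELTA_LENGTH_BYTE_ARRAY);
  }
}
```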

@RussellSpitzer
Member Author

@samarthjain I'm fine with just disabling if we detect that files are in V2 format. I just wanted to get this fix in for now, since the current behavior is to throw an NPE that doesn't contain any information. I'd be fine removing it once we have a better fix, or if you know of a fast way we can detect whether a file is V2.

@kbendick
Contributor

Will current Iceberg not work for V2 files at all, for example if the columns all wind up with plain encoding (a trivial example)?

I agree that catching the NPE isn't the most elegant solution, but if there's a possibility that V2 pages can be read when they only use the existing encodings, that would be nice.

Otherwise, explicitly failing upon encountering a V2 page would be the best way to start.

Please let me know if I can help somehow @samarthjain

@samarthjain
Collaborator

For now, a better solution would be to throw a similar exception, like UnsupportedOperationException("Vectorized reads not supported for Parquet V2").

The place to make this change would be:
https://github.com/apache/iceberg/blob/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedPageIterator.java#L535
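
One way the suggested fail-fast could look, sketched with Parquet's page visitor; the
class name and the exact placement inside VectorizedPageIterator are assumptions:

```java
import org.apache.parquet.column.page.DataPage;
import org.apache.parquet.column.page.DataPageV1;
import org.apache.parquet.column.page.DataPageV2;

// Sketch only: reject V2 data pages up front instead of letting the reader NPE later.
class PageVersionGuard {
  static void rejectV2(DataPage page) {
    page.accept(new DataPage.Visitor<Void>() {
      @Override
      public Void visit(DataPageV1 dataPageV1) {
        return null; // continue with the existing V1 decoding path
      }

      @Override
      public Void visit(DataPageV2 dataPageV2) {
        throw new UnsupportedOperationException("Vectorized reads not supported for Parquet V2");
      }
    });
  }
}
```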

@samarthjain
Collaborator

I submitted a PR that adds limited support for Parquet V2. Like Spark vectorized reads, we fail if the encoding of non-dictionary encoded data is not plain. @RussellSpitzer, @kbendick - would be good to get your eyes on it. Thanks!
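
The check described here presumably looks something like the following (an assumed
shape, not the code from that PR):

```java
import org.apache.parquet.column.Encoding;

// Assumed shape of the limited-V2 check: non-dictionary data must be PLAIN-encoded.
class PlainOnlySketch {
  static void validateDataEncoding(Encoding dataEncoding, boolean dictionaryEncoded) {
    if (!dictionaryEncoded && dataEncoding != Encoding.PLAIN) {
      throw new UnsupportedOperationException(
          "Cannot vectorize reads for non-dictionary data with encoding " + dataEncoding
              + "; only PLAIN is supported");
    }
  }
}
```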
