PARQUET-2219: ParquetFileReader skips empty row group #1018
Conversation
The Parquet spec does not forbid empty row groups, and some implementations can generate files that contain them. This commit makes ParquetFileReader robust by skipping empty row groups while reading.
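As a rough illustration of the skipping behavior described here, the following self-contained sketch walks a list of row groups and returns only those with at least one row. The class `EmptyRowGroupSkipSketch`, its nested `Block` type, and the `readAllRowCounts` helper are hypothetical stand-ins invented for this example; they are not the parquet-mr API and do not reflect how the PR itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a row-group reader; NOT the parquet-mr API.
class EmptyRowGroupSkipSketch {

  // Hypothetical row-group metadata: only the row count matters here.
  static final class Block {
    final long rowCount;

    Block(long rowCount) {
      this.rowCount = rowCount;
    }
  }

  private final List<Block> blocks;
  private int currentBlock = 0;

  EmptyRowGroupSkipSketch(List<Block> blocks) {
    this.blocks = blocks;
  }

  // Returns the next non-empty block, or null once every block has been visited.
  Block readNextRowGroup() {
    while (currentBlock < blocks.size()) {
      Block block = blocks.get(currentBlock);
      currentBlock++;
      if (block.rowCount > 0) {
        return block;
      }
      // Empty row group: skip it and keep looking, even if several are adjacent.
    }
    return null;
  }

  // Convenience helper (assumption): collects the row counts of all blocks that were read.
  static List<Long> readAllRowCounts(List<Block> blocks) {
    EmptyRowGroupSkipSketch reader = new EmptyRowGroupSkipSketch(blocks);
    List<Long> counts = new ArrayList<>();
    for (Block b = reader.readNextRowGroup(); b != null; b = reader.readNextRowGroup()) {
      counts.add(b.rowCount);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Empty blocks at the start and next to each other are both skipped.
    List<Block> blocks = Arrays.asList(
        new Block(0), new Block(3), new Block(0), new Block(0), new Block(2));
    System.out.println(readAllRowCounts(blocks)); // prints [3, 2]
  }
}
```

The loop advances past any number of consecutive empty row groups, which is the case the "empty blocks next to each other" test file is meant to cover.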
@gszadovszky @ggershinsky @shangxinli @sunchao Could you please take a look when you have time? cc @emkornfield
Thank you for fixing this. I've added some comments.
Also, could you add a similar test for the filtered row groups?
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReaderEmptyBlock.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
@gszadovszky Nice to see you are back!
- add test file for empty blocks next to each other
Thanks for your review, @gszadovszky! I have addressed all of your comments. Please take another look.
try {
  rowGroup = internalReadRowGroup(currentBlock);
} catch (ParquetEmptyBlockException e) {
  LOG.warn("Read empty block at index {}", currentBlock);
Is there any way to add the file path?
Added the file path to the log. Please take a look again. Thanks!
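For readers following the thread, here is a minimal sketch of what a warning that includes both the block index and the file path might look like. Everything in it is an assumption made for illustration: the class `EmptyBlockLoggingSketch`, the `EmptyBlockException` type, the `filePath` field, the use of `java.util.logging` in place of the project's logger, and the exact message wording are all hypothetical and are not taken from the merged change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch; NOT the actual ParquetFileReader code.
class EmptyBlockLoggingSketch {
  private static final Logger LOG =
      Logger.getLogger(EmptyBlockLoggingSketch.class.getName());

  // Hypothetical exception standing in for the one named in the diff excerpt.
  static final class EmptyBlockException extends RuntimeException {
    EmptyBlockException(String message) {
      super(message);
    }
  }

  private final String filePath;  // assumed to be known to the reader
  private final long[] rowCounts; // one row count per block
  private int currentBlock = 0;

  // Kept only so the example can be checked without capturing log output.
  final List<String> warnings = new ArrayList<>();

  EmptyBlockLoggingSketch(String filePath, long[] rowCounts) {
    this.filePath = filePath;
    this.rowCounts = rowCounts;
  }

  // Throws when the block has no rows; otherwise returns its row count.
  private long internalReadRowGroup(int blockIndex) {
    if (rowCounts[blockIndex] == 0) {
      throw new EmptyBlockException("Row group has 0 rows");
    }
    return rowCounts[blockIndex];
  }

  // Returns the row count of the next non-empty block, or -1 when none remain.
  long readNextRowGroup() {
    while (currentBlock < rowCounts.length) {
      int index = currentBlock++;
      try {
        return internalReadRowGroup(index);
      } catch (EmptyBlockException e) {
        // Include the file path so the warning identifies the offending file.
        String message =
            String.format("Read empty block at index %d from %s", index, filePath);
        warnings.add(message);
        LOG.warning(message);
      }
    }
    return -1;
  }
}
```

Naming the file in the message matters most when many files are scanned in one job, since a bare block index does not identify which file contained the empty row group.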
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
@shangxinli, I wouldn't say I'm back, unfortunately. I'm a bit closer to Parquet at Dremio but actually not working on it. We'll see if I will have some spare time for it. :)
It looks good from my side.
Jira
My PR addresses PARQUET-2219.
Tests
My PR adds the following unit test to read a Parquet file with an empty row group:
The two test Parquet files were created with the C++ Parquet writer from Apache Arrow.
Commits
The Parquet spec does not forbid empty row groups, and some implementations can generate files that contain them. This commit makes ParquetFileReader robust by skipping empty row groups while reading.