Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2219: ParquetFileReader skips empty row group #1018

Merged
merged 3 commits into from Jan 15, 2023

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Jan 8, 2023

Jira

My PR addresses the PARQUET-2219.

Tests

My PR adds the following unit test to read parquet file with empty row group:

  • parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReaderEmptyBlock.java

Two test parquet files are created by C++ parquet writer from Apache Arrow.

  • Single empty row group: parquet-hadoop/src/test/resources/test-empty-row-group_1.parquet
  • Three row groups with empty one in the middle: parquet-hadoop/src/test/resources/test-empty-row-group_2.parquet
  • For row groups with two empty ones in the middle: parquet-hadoop/src/test/resources/test-empty-row-group_3.parquet

Commits

The parquet specs does not forbid empty row group and some implementations are able to generate files with empty row group. The commit aims to make ParquetFileReader robust by skipping empty row group while reading.

The parquet specs does not forbid empty row group and some
implementations are able to generate files with empty row group.
The commit aims to make ParquetFileReader robust by skipping
empty row group while reading.
@wgtmac
Copy link
Member Author

wgtmac commented Jan 8, 2023

@gszadovszky @ggershinsky @shangxinli @sunchao Could you please take a look when you have time?

cc @emkornfield

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you for fixing this. I've added some comments.
Also, could you add a similar test for the filtered row groups?

@shangxinli
Copy link
Contributor

@gszadovszky Nice to see you are back!

- add test file for empty blocks next to each other
@wgtmac
Copy link
Member Author

wgtmac commented Jan 10, 2023

Thanks you for fixing this. I've added some comments. Also, could you add a similar test for the filtered row groups?

Thanks for your review @gszadovszky !

I have addressed all of your comments. Please take a look again.

try {
rowGroup = internalReadRowGroup(currentBlock);
} catch (ParquetEmptyBlockException e) {
LOG.warn("Read empty block at index {}", currentBlock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way to add file path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the file path to the log. Please take a look again. Thanks!

@gszadovszky
Copy link
Contributor

@gszadovszky Nice to see you are back!

@shangxinli, I wouldn't say I'm back, unfortunately. I'm a bit closer to Parquet at Dremio but actually not working on it. We'll see if I will have some spare time for it. :)

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is good from my side

@shangxinli shangxinli merged commit 2fa8f94 into apache:master Jan 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants