Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable decoding of 'mixed' page types in a single column row group #38

Closed
wants to merge 17 commits into from

Conversation

nregenauer
Copy link

I recently faced an issue where I needed to decode a Parquet file that had been encoded for reads with Amazon Athena.

This file had a mixture of 'PLAIN' and 'PLAIN_DICTIONARY' pages in it; when decoding, I noticed an issue where all records from the 'PLAIN' groups were being set to unknown.

The commit below is a quick fix for this issue; possibly the correct long term fix is to enable this library to support different Parquet versions?

@nregenauer
Copy link
Author

Just checking in to see if you've gotten a chance to take a look at this change

@Acen
Copy link

Acen commented Apr 20, 2020

Nice work @nregenauer
@ZJONSSON could you give the PR a quick once-over?

@ZJONSSON
Copy link
Owner

Thanks @nregenauer - do you have a test case / test file with mixed pages that would fail before this fi?

@nregenauer
Copy link
Author

Sorry about the delayed response - I've done some more thorough testing of this change and I'm not confident that it's non-breaking (it seems to be a step in the right direction, but not a perfect fix).

Previously, on my local test cases I was seeing issues where ~half the records in large files (on the order of 27-30 MB) were blank. After merging this change, records were no longer blank; however, a random percentage of records in the file had one or more fields "swapped" (RecordX contained value Y for attribute A, the value that should have been assigned to RecordZ). This issue appeared to affect ~10-20% of the total records in any given file.

Since this PR doesn't fully resolve this issue, I'm going to close for now - if I get the bandwidth I'll revisit and add test cases.

@nregenauer nregenauer closed this May 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants