-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable decoding of 'mixed' page types in a single column row group #38
Conversation
Just checking in to see if you've gotten a chance to take a look at this change |
Nice work @nregenauer |
Thanks @nregenauer - do you have a test case / test file with mixed pages that would fail before this fi? |
Sorry about the delayed response - I've done some more thorough testing of this change and I'm not confident that it's non-breaking (it seems to be a step in the right direction, but not a perfect fix). Previously, on my local test cases I was seeing issues where ~half the records in large files (on the order of 27-30 MB) were blank. After merging this change, records were no longer blank; however, a random percentage of records in the file had one or more fields "swapped" (RecordX contained value Y for attribute A, the value that should have been assigned to RecordZ). This issue appeared to affect ~10-20% of the total records in any given file. Since this PR doesn't fully resolve this issue, I'm going to close for now - if I get the bandwidth I'll revisit and add test cases. |
I recently faced an issue where I needed to decode a Parquet file that had been encoded for reads with Amazon Athena.
This file had a mixture of 'PLAIN' and 'PLAIN_DICTIONARY' pages in it; when decoding, I noticed an issue where all records from the 'PLAIN' groups were being set to unknown.
The commit below is a quick fix for this issue; possibly the correct long term fix is to enable this library to support different Parquet versions?