Enable decoding of 'mixed' page types in a single column row group #38

nregenauer · 2020-02-13T23:02:21Z

I recently faced an issue where I needed to decode a Parquet file that had been encoded for reads with Amazon Athena.

This file had a mixture of 'PLAIN' and 'PLAIN_DICTIONARY' pages in it; when decoding, I noticed an issue where all records from the 'PLAIN' groups were being set to unknown.

The commit below is a quick fix for this issue; possibly the correct long-term fix is to enable this library to support different Parquet versions?

…DICTIONARY pages
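For readers following along, here is a minimal sketch of the behaviour described above; it is not the code from this PR. It assumes a column chunk has already been split into data pages, that each page records its own encoding, and that the reader picks the decoding path per page rather than once per column. The page shape (`encoding`, `values`) and the function name `resolveColumnChunk` are hypothetical and are not taken from the library.

```javascript
// Hypothetical page shape (an assumption, not the library's internal structure):
//   { encoding: 'PLAIN' | 'PLAIN_DICTIONARY', values: [...] }
// For PLAIN pages, `values` holds the decoded values themselves.
// For PLAIN_DICTIONARY pages, `values` holds indices into `dictionary`.
function resolveColumnChunk(dictionary, dataPages) {
  const out = [];
  for (const page of dataPages) {
    switch (page.encoding) {
      case 'PLAIN':
        // Plain pages carry real values; pass them through unchanged.
        for (const value of page.values) {
          out.push(value);
        }
        break;
      case 'PLAIN_DICTIONARY':
        if (!Array.isArray(dictionary)) {
          throw new Error('dictionary-encoded page found without a dictionary');
        }
        // Dictionary pages carry indices; look each one up.
        for (const index of page.values) {
          if (!Number.isInteger(index) || index < 0 || index >= dictionary.length) {
            throw new RangeError(`dictionary index out of range: ${index}`);
          }
          out.push(dictionary[index]);
        }
        break;
      default:
        throw new Error(`unsupported page encoding: ${page.encoding}`);
    }
  }
  return out;
}

// Usage: one dictionary-encoded page followed by one plain page.
const dictionary = ['red', 'green'];
const pages = [
  { encoding: 'PLAIN_DICTIONARY', values: [0, 1, 0] },
  { encoding: 'PLAIN', values: ['blue', 'teal'] },
];
console.log(resolveColumnChunk(dictionary, pages));
```

Failing loudly on an out-of-range index or an unknown encoding is a deliberate choice in this sketch, so that a decoding problem surfaces as an error instead of as blank values.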

nregenauer · 2020-02-26T14:51:35Z

Just checking in to see if you've gotten a chance to take a look at this change.

Acen · 2020-04-20T07:54:13Z

Nice work @nregenauer
@ZJONSSON could you give the PR a quick once-over?

ZJONSSON · 2020-04-20T12:02:25Z

Thanks @nregenauer - do you have a test case / test file with mixed pages that would fail before this fix?

nregenauer · 2020-05-12T01:40:21Z

Sorry about the delayed response - I've done some more thorough testing of this change and I'm not confident that it's non-breaking (it seems to be a step in the right direction, but not a perfect fix).

Previously, on my local test cases I was seeing issues where ~half the records in large files (on the order of 27-30 MB) were blank. After merging this change, records were no longer blank; however, a random percentage of records in the file had one or more fields "swapped" (RecordX contained value Y for attribute A, the value that should have been assigned to RecordZ). This issue appeared to affect ~10-20% of the total records in any given file.

Since this PR doesn't fully resolve the issue, I'm going to close it for now - if I get the bandwidth I'll revisit and add test cases.

Nicole Regenauer added 8 commits February 13, 2020 16:00

Enable reading of files that contain a mixture of DICTIONARY and non-DICTIONARY pages

001d96a

Upgrade thrift, update package.json

4b836fb

Roll back a thrift version

5938548

Use APPROVED version of thrift

deb1e97

Get tests working again, bump package version, lazy-install lzo

a71be79

Add missing dependencies

29e4c0d

Remove unneeded dependencies, directly include all needed dependencies

6f5c670

Bump package version

761294b

nregenauer requested a review from ZJONSSON February 17, 2020 20:40

Nicole Regenauer added 8 commits February 17, 2020 17:34

Don't check in package lock

9fe4dfd

Bump package version

6ca2656

Don't include test files

8a28326

Don't publish modules with package

89e39f7

Disable LZO fully

886cbed

Disable LZO fully

93083ac

Add BROTLI back in

95f4750

Change BSON version

3b16f99

Update README.md

be2eaa3

nregenauer closed this May 12, 2020
