
PARQUET-1622: Add BYTE_STREAM_SPLIT encoding #144

Merged (1 commit, Dec 3, 2019)

Conversation

martinradev (Contributor)

As discussed in the Jira issue, adding an encoding for floating-point data would be a great benefit for both compression ratio and speed.

The patch extends the format to add the BYTE_STREAM_SPLIT
encoding and adds documentation for it.
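For readers unfamiliar with the layout, here is a minimal Python sketch (illustrative only, not the code added by this patch; the function names are hypothetical) of how BYTE_STREAM_SPLIT rearranges 32-bit floats: the k-th byte of every value is gathered into the k-th stream, and the streams are concatenated.

```python
# Minimal sketch of the BYTE_STREAM_SPLIT transform for 32-bit floats.
# Illustrative only; not the code added by this patch.
import struct

def byte_stream_split_encode(values, width=4):
    """Scatter the k-th byte of each value into the k-th stream."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    streams = [bytearray() for _ in range(width)]
    for i in range(0, len(raw), width):
        for k in range(width):
            streams[k].append(raw[i + k])
    # The encoded buffer is simply the concatenation of the streams.
    return b"".join(streams)

def byte_stream_split_decode(data, width=4):
    """Inverse transform: gather one byte from each stream per value."""
    n = len(data) // width
    raw = bytearray(len(data))
    for k in range(width):
        for i in range(n):
            raw[i * width + k] = data[k * n + i]
    return [struct.unpack_from("<f", raw, i * width)[0] for i in range(n)]

values = [1.5, 2.5, 3.5, 4.5]
assert byte_stream_split_decode(byte_stream_split_encode(values)) == values
```

The transform itself does not shrink the data; the gain comes from grouping bytes with similar behavior (for example the sign/exponent bytes), which a downstream compressor such as gzip or zstd can then exploit.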
@wesm (Member) commented Jul 12, 2019

I think we need to vote on this.

@gszadovszky (Contributor) left a comment


The vote has still not been started. I think it would be much harder to start a vote covering both the stream-split encoding and ZFP, so I would handle them separately.
While I find the stream-split encoding simple and ready for implementation (after a successful vote), ZFP would require additional discussion. Parquet has not had any lossy compression before, and even if the community would like to accept it, it is still an open question whether we would handle it like the other compression codecs.

@martinradev (Contributor, Author)

I have not started a vote yet. I am currently considering a possible change to the encoding to make it adaptive, because it performs worse than plain encoding when the data is very repetitive. The initial results were somewhat of an outlier because the dataset ( https://userweb.cs.txstate.edu/~burtscher/research/datasets/ ) I used is not repetitive and is difficult to compress with general-purpose compressors like gzip, zstd, etc. However, real-world data would likely be a mixture of repetitive and non-repetitive time-series data.

I'll do a bit more investigation and initiate a vote soon.

I apologize for the inconvenience and overhead this has caused.
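To picture the adaptive idea mentioned above, here is a purely hypothetical heuristic sketch (not part of the proposal or of what was eventually merged): estimate how repetitive a page is and fall back to plain encoding when repetition dominates.

```python
# Hypothetical heuristic only; the merged format does not include adaptivity.
def choose_encoding(values, repeat_threshold=0.5):
    """Pick PLAIN for highly repetitive pages, BYTE_STREAM_SPLIT otherwise."""
    if not values:
        return "PLAIN"
    distinct_ratio = len(set(values)) / len(values)
    # Few distinct values means the page is repetitive, and a general-purpose
    # compressor already handles plain-encoded repetitive data well.
    return "PLAIN" if distinct_ratio < repeat_threshold else "BYTE_STREAM_SPLIT"
```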

@martinradev (Contributor, Author)

As for the ZFP changes, that was my mistake; I will clearly have to work on them on a separate branch.
I am more familiar with the "Gerrit style" of contributing code; the "GitHub style" is new to me.
I will clean up the pull request soon.

@martinradev (Contributor, Author)

We could maybe have a vote on this. I will open a VOTE thread.

@martinradev (Contributor, Author)

@zivanfi @gszadovszky would you be able to review the proposed encoding and the achieved results? The vote has been started on the Parquet mailing list and there are two +1s.

Results from the encoding: https://github.com/martinradev/arrow-fp-compression-bench
The results are based on an implementation in Arrow, but I already have unpolished patches for parquet-mr as well.
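For readers arriving at this thread later: the encoding can be tried out from Python through recent pyarrow releases. A small usage sketch, assuming a pyarrow version that exposes the use_byte_stream_split writer option (check the documentation of your installed version):

```python
# Assumes a recent pyarrow with BYTE_STREAM_SPLIT support in the Parquet writer.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": np.random.rand(1_000_000).astype(np.float32)})
pq.write_table(
    table,
    "data.parquet",
    use_dictionary=False,         # the encoding replaces PLAIN, not dictionary encoding
    use_byte_stream_split=["x"],  # or True to apply it to all supported columns
    compression="zstd",           # the split streams compress better downstream
)
```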

@gszadovszky merged commit ee02ef8 into apache:master on Dec 3, 2019
@pitrou (Member) commented Mar 13, 2024

For posterity, here is the archive of the VOTE thread for this feature: https://lists.apache.org/thread/xs5qt2odm299pxgqb22mty2csc1so5yr
