
PARQUET-1622: Add BYTE_STREAM_SPLIT encoding #144

Merged (1 commit, Dec 3, 2019)

Conversation

martinradev (Contributor)

As discussed in the Jira issue, adding an encoding for floating-point data would be a great benefit for both compression ratio and speed.

The patch extends the format to add the BYTE_STREAM_SPLIT
encoding and adds documentation for it.
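For readers unfamiliar with the layout, here is a minimal Python sketch (illustrative only, not the code added by this patch; the function names are hypothetical) of how BYTE_STREAM_SPLIT rearranges 32-bit floats: the k-th byte of every value is gathered into the k-th stream, and the streams are concatenated.

```python
# Minimal sketch of the BYTE_STREAM_SPLIT transform for 32-bit floats.
# Illustrative only; not the code added by this patch.
import struct

def byte_stream_split_encode(values, width=4):
    """Scatter the k-th byte of each value into the k-th stream."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    streams = [bytearray() for _ in range(width)]
    for i in range(0, len(raw), width):
        for k in range(width):
            streams[k].append(raw[i + k])
    # The encoded buffer is simply the concatenation of the streams.
    return b"".join(streams)

def byte_stream_split_decode(data, width=4):
    """Inverse transform: gather one byte from each stream per value."""
    n = len(data) // width
    raw = bytearray(len(data))
    for k in range(width):
        for i in range(n):
            raw[i * width + k] = data[k * n + i]
    return [struct.unpack_from("<f", raw, i * width)[0] for i in range(n)]

values = [1.5, 2.5, 3.5, 4.5]
assert byte_stream_split_decode(byte_stream_split_encode(values)) == values
```

The transform itself does not shrink the data; the gain comes from grouping bytes with similar behavior (for example the sign/exponent bytes), which a downstream compressor such as gzip or zstd can then exploit.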
@wesm (Member) commented Jul 12, 2019

I think we need to vote on this.

@gszadovszky (Contributor) left a comment


The vote has still not been started. I think it would be much harder to start a vote covering both the stream-split encoding and ZFP, so I would handle them separately.
While I find the stream-split encoding simple and ready for implementation (after a successful vote), ZFP would require additional discussion. Parquet has not had any lossy compression before, and even if the community would like to accept it, it is still an open question whether we would handle it like the other compression codecs.

@martinradev (Contributor, Author)

I have not started a vote yet. I am currently considering a possible change to the encoding to make it adaptive, because it performs worse than plain encoding when the data is very repetitive. The initial results were somewhat of an outlier because the dataset ( https://userweb.cs.txstate.edu/~burtscher/research/datasets/ ) I used is not repetitive and is difficult to compress with general-purpose compressors like gzip, zstd, etc. However, real-world data would likely be a mixture of repetitive and non-repetitive time-series data.

I'll do a bit more investigation and initiate a vote soon.

I apologize for the inconvenience and overhead this has caused.
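To picture the adaptive idea mentioned above, here is a purely hypothetical heuristic sketch (not part of the proposal or of what was eventually merged): estimate how repetitive a page is and fall back to plain encoding when repetition dominates.

```python
# Hypothetical heuristic only; the merged format does not include adaptivity.
def choose_encoding(values, repeat_threshold=0.5):
    """Pick PLAIN for highly repetitive pages, BYTE_STREAM_SPLIT otherwise."""
    if not values:
        return "PLAIN"
    distinct_ratio = len(set(values)) / len(values)
    # Few distinct values means the page is repetitive, and a general-purpose
    # compressor already handles plain-encoded repetitive data well.
    return "PLAIN" if distinct_ratio < repeat_threshold else "BYTE_STREAM_SPLIT"
```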

@martinradev (Contributor, Author)

As for the ZFP changes, that was my mistake; I will clearly have to work on them on a separate branch.
I am more familiar with the "Gerrit style" of contributing code; the "GitHub style" is new to me.
I will clean up the pull request soon.

@martinradev (Contributor, Author)

We could maybe have a vote on this. I will open a VOTE thread.

@martinradev (Contributor, Author)

@zivanfi @gszadovszky would you be able to review the proposed encoding and the achieved results? The vote has been started on the Parquet mailing list and there are two +1s.

Results from the encoding: https://github.com/martinradev/arrow-fp-compression-bench
The results are based on an implementation in Arrow, but I already have unpolished patches for parquet-mr as well.
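For readers arriving at this thread later: the encoding can be tried out from Python through recent pyarrow releases. A small usage sketch, assuming a pyarrow version that exposes the use_byte_stream_split writer option (check the documentation of your installed version):

```python
# Assumes a recent pyarrow with BYTE_STREAM_SPLIT support in the Parquet writer.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": np.random.rand(1_000_000).astype(np.float32)})
pq.write_table(
    table,
    "data.parquet",
    use_dictionary=False,         # the encoding replaces PLAIN, not dictionary encoding
    use_byte_stream_split=["x"],  # or True to apply it to all supported columns
    compression="zstd",           # the split streams compress better downstream
)
```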

@gszadovszky merged commit ee02ef8 into apache:master on Dec 3, 2019
@pitrou (Member) commented Mar 13, 2024

For posterity, here is the archive of the VOTE thread for this feature: https://lists.apache.org/thread/xs5qt2odm299pxgqb22mty2csc1so5yr
