PARQUET-1622: Add BYTE_STREAM_SPLIT encoding #144
Conversation
The patch extends the format to add the BYTE_STREAM_SPLIT encoding and adds documentation for it.
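For readers unfamiliar with the proposed encoding: BYTE_STREAM_SPLIT scatters the bytes of each fixed-width value into separate streams (byte 0 of every value into stream 0, byte 1 into stream 1, and so on) and concatenates the streams. It does not compress by itself, but makes the data more compressible for a downstream codec. A minimal sketch for 32-bit floats (function names are illustrative, not from the patch):

```python
import struct

WIDTH = 4  # bytes per float32 value

def byte_stream_split_encode(values):
    """Scatter byte i of each float32 value into stream i, then concatenate streams."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    streams = [raw[i::WIDTH] for i in range(WIDTH)]
    return b"".join(streams)

def byte_stream_split_decode(data):
    """Inverse transform: re-interleave the streams and unpack float32 values."""
    n = len(data) // WIDTH
    streams = [data[i * n:(i + 1) * n] for i in range(WIDTH)]
    # Value j is rebuilt from byte j of each stream, in stream order.
    raw = bytes(b for j in range(n) for b in (s[j] for s in streams))
    return [struct.unpack("<f", raw[k * WIDTH:(k + 1) * WIDTH])[0] for k in range(n)]
```

The same transform applies to 64-bit doubles with a width of 8. Because exponent bytes of similar values tend to match, each stream is more uniform than the interleaved original, which helps codecs like gzip or zstd.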
I think we need to vote on this.
The vote has not been started yet. I think it would be much harder to pass a vote covering both the stream split encoding and ZFP, so I would handle them separately.
While I find the stream split encoding simple and ready for implementation (after a successful vote), ZFP would require additional discussion. Parquet has not had any lossy features before. Even if the community wants to accept it, it remains an open question whether we would handle it like the other compression codecs.
I have not started a vote yet. I am currently considering a possible change to make the encoding adaptive, because it performs worse than plain encoding when the data is very repetitive. The initial results were somewhat of an outlier: the dataset I used ( https://userweb.cs.txstate.edu/~burtscher/research/datasets/ ) is not repetitive and is difficult to compress for general-purpose compressors like gzip, zstd, etc. Real-world data, however, would likely be a mixture of repetitive and non-repetitive time-series data. I'll investigate a bit more and initiate a vote soon. I apologize for the inconvenience and overhead.
As for the ZFP stuff, this was my mistake. I will work on it on a separate branch.
We could maybe have a vote on this. I will open a VOTE thread.
@zivanfi @gszadovszky would you be able to review the proposed encoding and the achieved results? The vote has been started on the parquet mailing list and there are two +1s. Results from the encoding: https://github.com/martinradev/arrow-fp-compression-bench
For posterity, here is the archive of the VOTE thread for this feature: https://lists.apache.org/thread/xs5qt2odm299pxgqb22mty2csc1so5yr |
As discussed in the Jira issue, adding an encoding for floating-point data would greatly benefit both compression ratio and speed.