[C++][Parquet] Benchmark and maybe Override DeltaBitPackEncoder Defaults #34536
Comments
cc @rok
Benchmarking with overridden defaults makes a lot of sense, yes! Do you think different bitwidth ranges (of random data) have different optimal settings? If they have a strong influence, we might want to document that.
As the arrow-rs issue says, arrow-rs uses DeltaBinaryPacked as the default encoding for integers, and it found that, for uniformly distributed numbers, a wider bitwidth should be much better. I'll benchmark delta binary packed for different sizes and different input distributions on x86_64 and NEON machines, and find out whether we should make it larger. To be honest, the best approach would be adaptive encoding, but I'm not very familiar with encoding algorithms. By the way, tustvoid mentions that http://arxiv.org/pdf/1209.2137v5.pdf explains why we shouldn't encode deltas this way...
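To make the block-size trade-off discussed here concrete, the following is a minimal C++ sketch that estimates how many bytes the block section of a DELTA_BINARY_PACKED page would occupy for a given block size and mini-block count. It follows the layout described in the Parquet format specification (per block: a zigzag varint min delta, one bit-width byte per mini block, then the bit-packed mini blocks), but it is not Arrow's encoder. `EstimateDeltaBlocksBytes` and its helpers are hypothetical names introduced only for illustration, and the page header is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to represent v (0 when v == 0).
inline int BitWidth(uint64_t v) {
  int bits = 0;
  while (v != 0) { ++bits; v >>= 1; }
  return bits;
}

// Length in bytes of v encoded as a ULEB128 varint.
inline size_t VarintLen(uint64_t v) {
  size_t n = 1;
  while (v >= 0x80) { v >>= 7; ++n; }
  return n;
}

// Zigzag mapping so that small negative numbers become small unsigned ones.
inline uint64_t ZigZag(int64_t v) {
  return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
}

// Hypothetical helper: estimated bytes for the blocks only (no page header).
size_t EstimateDeltaBlocksBytes(const std::vector<int64_t>& values,
                                size_t block_size, size_t mini_blocks) {
  assert(mini_blocks > 0 && block_size % mini_blocks == 0);
  const size_t mini_size = block_size / mini_blocks;
  assert(mini_size % 32 == 0);  // the format requires a multiple of 32
  if (values.size() < 2) return 0;  // the first value lives in the header

  // Deltas are computed with wrapping (unsigned) arithmetic.
  std::vector<int64_t> deltas(values.size() - 1);
  for (size_t i = 0; i + 1 < values.size(); ++i) {
    deltas[i] = static_cast<int64_t>(static_cast<uint64_t>(values[i + 1]) -
                                     static_cast<uint64_t>(values[i]));
  }

  size_t total = 0;
  for (size_t start = 0; start < deltas.size(); start += block_size) {
    const size_t end = std::min(start + block_size, deltas.size());
    const int64_t min_delta =
        *std::min_element(deltas.begin() + start, deltas.begin() + end);
    // Min delta varint plus one bit-width byte per mini block.
    total += VarintLen(ZigZag(min_delta)) + mini_blocks;
    for (size_t ms = start; ms < end; ms += mini_size) {
      const size_t me = std::min(ms + mini_size, end);
      uint64_t max_rel = 0;
      for (size_t i = ms; i < me; ++i) {
        max_rel = std::max(max_rel, static_cast<uint64_t>(deltas[i]) -
                                        static_cast<uint64_t>(min_delta));
      }
      // A partially filled mini block is padded to its full size.
      total += mini_size * static_cast<size_t>(BitWidth(max_rel)) / 8;
    }
  }
  return total;
}
```

Under this sketch, a single large delta among 128 otherwise equal deltas costs 32 data bytes with 32-value mini blocks (block size 128, 4 mini blocks) but 64 data bytes with 64-value mini blocks (block size 256, 4 mini blocks), while a constant-stride sequence pays only the per-block overhead, which the larger block halves. Which effect dominates depends on the data, so an estimate like this complements rather than replaces real benchmarks.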
This is interesting. We need to make these parameters configurable via ... By the way, I really doubt we can reach a clear answer to this question in the end. The best achievable encoding ratio is bounded by the entropy of the data, which requires precise knowledge of the probability distribution of all input words. Since the result is highly dependent on the data distribution, we need to define and prepare some datasets with distinct distributions and patterns.
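As a rough illustration of the entropy argument, here is a minimal C++ sketch that computes the empirical (zeroth-order) Shannon entropy of a sequence, in bits per value. Applied to the deltas of a dataset, it gives a reference point to compare against the bits per value an encoder actually achieves. `EmpiricalEntropyBits` is a hypothetical helper name, and treating values as independent symbols is an assumption: correlated data can be compressed below this figure by a model that exploits the correlation.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical helper: empirical Shannon entropy in bits per symbol,
// treating every distinct value as one symbol of a memoryless source.
double EmpiricalEntropyBits(const std::vector<int64_t>& symbols) {
  if (symbols.empty()) return 0.0;

  // Count how often each distinct value occurs.
  std::unordered_map<int64_t, size_t> counts;
  for (int64_t s : symbols) ++counts[s];

  const double n = static_cast<double>(symbols.size());
  double entropy = 0.0;
  for (const auto& kv : counts) {
    const double p = static_cast<double>(kv.second) / n;
    entropy -= p * std::log2(p);  // each symbol contributes -p * log2(p)
  }
  return entropy;
}
```

Four equally likely values give 2 bits per value, and a constant sequence gives 0, which matches the intuition that constant-stride data packs down to almost nothing under delta encoding.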
On my M1 macOS machine: Current (BlockSize: 128, BlockCnt: 4)
After (BlockSize: 256, BlockCnt: 4)
On my PC (Ryzen 3800X, AVX2 enabled): Before:
After:
…oder (#34632)
### Rationale for this change
Change default DeltaBitPackEncoder BlockSize from 32 to 64.
### What changes are included in this PR?
Tiny block size change, and a trivial optimization.
### Are these changes tested?
No
### Are there any user-facing changes?
No
* Closes: #34536
Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
Describe the enhancement requested
Parquet C++ DELTA_BINARY_PACKED uses:
As mentioned in apache/arrow-rs#2282, we may benchmark it and tune it for different bitwidths.
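A benchmark across bitwidths needs reproducible input. The following is a minimal C++ sketch of a generator for uniformly distributed values of a chosen bitwidth, using only the standard library. `MakeUniformValues` is a hypothetical helper, not part of Arrow's benchmark utilities, and the exact numbers produced for a given seed may differ between standard library implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: n values drawn uniformly from [0, 2^bit_width).
std::vector<int64_t> MakeUniformValues(size_t n, int bit_width, uint64_t seed) {
  assert(bit_width >= 1 && bit_width <= 63);
  const uint64_t max_value = (uint64_t{1} << bit_width) - 1;

  std::mt19937_64 rng(seed);  // fixed seed keeps runs comparable
  std::uniform_int_distribution<uint64_t> dist(0, max_value);

  std::vector<int64_t> out(n);
  for (auto& v : out) v = static_cast<int64_t>(dist(rng));
  return out;
}
```

Calling it with several bitwidths (for example 8, 16, 32) and feeding each result to the encoder under test would show whether the best block size depends on the value range.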
Component(s)
C++, Parquet