[C++][Parquet] Add benchmarks for DELTA_BINARY_PACKED #14951

Closed
pitrou opened this issue Dec 14, 2022 · 7 comments · Fixed by #15140

Comments

@pitrou
Member

pitrou commented Dec 14, 2022

Describe the enhancement requested

Now that we support the DELTA_BINARY_PACKED encoding for both reading and writing, we can add some benchmarks for it.

Component(s)

C++, Parquet

@mapleFU
Member

mapleFU commented Dec 28, 2022

Should it just follow src/parquet/encoding_benchmark.cc? I'd like to give it a try.

@rok
Member

rok commented Dec 28, 2022

That seems like the right place @mapleFU. I think you can go ahead and open a PR! :)
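
For reference, here is a minimal sketch (not the merged implementation) of what such a case could look like next to the existing benchmarks in src/parquet/encoding_benchmark.cc, using Google Benchmark and the parquet::MakeTypedEncoder factory. The helper and benchmark names are illustrative, and it assumes the factory accepts DELTA_BINARY_PACKED without a column descriptor:

```cpp
// Sketch of a DELTA_BINARY_PACKED encode benchmark; names are illustrative.
#include <cstdint>
#include <random>
#include <vector>

#include "benchmark/benchmark.h"
#include "parquet/encoding.h"

namespace {

std::vector<int32_t> MakeRandomInt32(int64_t n, int32_t max_value) {
  std::mt19937 rng(42);  // fixed seed so runs are comparable
  std::uniform_int_distribution<int32_t> dist(0, max_value);
  std::vector<int32_t> values(n);
  for (auto& v : values) v = dist(rng);
  return values;
}

void BM_DeltaBinaryPackedEncodeInt32(benchmark::State& state) {
  const auto values = MakeRandomInt32(state.range(0), /*max_value=*/1 << 20);
  for (auto _ : state) {
    // Assumes MakeTypedEncoder supports DELTA_BINARY_PACKED without a descriptor.
    auto encoder = parquet::MakeTypedEncoder<parquet::Int32Type>(
        parquet::Encoding::DELTA_BINARY_PACKED);
    encoder->Put(values.data(), static_cast<int>(values.size()));
    auto buffer = encoder->FlushValues();
    benchmark::DoNotOptimize(buffer);
  }
  state.SetItemsProcessed(state.iterations() * values.size());
}
BENCHMARK(BM_DeltaBinaryPackedEncodeInt32)->Range(1024, 65536);

}  // namespace
```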

@mapleFU
Member

mapleFU commented Dec 31, 2022

By the way, should we generate the input randomly? It seems that if we use incrementing or fixed numbers, the DELTA_BINARY_PACKED data will compress very well, which is kind of unfair.

@pitrou
Member Author

pitrou commented Dec 31, 2022

By the way, should we generate the input randomly? It seems that if we use incrementing or fixed numbers, the DELTA_BINARY_PACKED data will compress very well, which is kind of unfair.

Use random data, but with two different magnitudes (narrow or wide), so as to exercise both the easily compressible and poorly compressible cases.
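
As an illustration of the two-magnitude idea (the names, sizes and ranges below are made up, not taken from the eventual PR): the same generator can produce a narrow range, whose small deltas bit-pack tightly, and a wide range, whose deltas need close to the full 32 bits.

```cpp
// Narrow vs. wide random input for the benchmark; purely illustrative.
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

std::vector<int32_t> GenerateRandomInt32(int64_t n, int32_t min_value,
                                         int32_t max_value) {
  std::mt19937 rng(/*seed=*/1337);  // fixed seed for reproducible benchmarks
  std::uniform_int_distribution<int32_t> dist(min_value, max_value);
  std::vector<int32_t> out(n);
  for (auto& v : out) v = dist(rng);
  return out;
}

// Narrow: values in [0, 16), so consecutive deltas fit in a few bits each.
const auto narrow = GenerateRandomInt32(4096, 0, 15);
// Wide: values spanning the full int32 range, so deltas are large.
const auto wide = GenerateRandomInt32(4096, std::numeric_limits<int32_t>::min(),
                                      std::numeric_limits<int32_t>::max());
```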

@mapleFU
Member

mapleFU commented Dec 31, 2022

By the way, what are the differences between column_io_benchmark, column_reader_benchmark and encoding_benchmark? And do we have any benchmark for compression + encoding? Some encodings don't benefit much from the encoding step alone, but greatly improve performance or save space when encoding is combined with compression, like byte_stream_split.

@pitrou
Member Author

pitrou commented Jan 3, 2023

And do we have any benchmark for compression + encoding? Some encodings don't benefit much from the encoding step alone, but greatly improve performance or save space when encoding is combined with compression, like byte_stream_split.

I'm not convinced it's useful to add compression here. We're not writing the compression routines ourselves.

@mapleFU
Member

mapleFU commented Jan 3, 2023

I'm not convinced it's useful to add compression here. We're not writing the compression routines ourselves.

Yes, you're right. But I think if compression is not tested, we will only benchmark part of the Parquet read path. For example, a benchmark might only tell us how much slower a decoder is than PlainDecoder in the case where CPU prefetching works well.
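
For context, here is a rough sketch of what a combined encode + compress measurement could look like; this is not what the issue ended up doing, and the codec choice (ZSTD) and helper name are arbitrary:

```cpp
// Sketch only: measure the size after DELTA_BINARY_PACKED encoding + compression.
#include <cstdint>
#include <memory>
#include <vector>

#include "arrow/buffer.h"
#include "arrow/result.h"
#include "arrow/util/compression.h"
#include "parquet/encoding.h"

arrow::Result<int64_t> EncodeThenCompress(const std::vector<int32_t>& values) {
  // 1. Encode with DELTA_BINARY_PACKED.
  auto encoder = parquet::MakeTypedEncoder<parquet::Int32Type>(
      parquet::Encoding::DELTA_BINARY_PACKED);
  encoder->Put(values.data(), static_cast<int>(values.size()));
  std::shared_ptr<arrow::Buffer> encoded = encoder->FlushValues();

  // 2. Compress the encoded bytes with a generic codec (ZSTD chosen arbitrarily).
  ARROW_ASSIGN_OR_RAISE(auto codec,
                        arrow::util::Codec::Create(arrow::Compression::ZSTD));
  const int64_t max_len = codec->MaxCompressedLen(encoded->size(), encoded->data());
  std::vector<uint8_t> out(max_len);
  // Returns the compressed size, i.e. roughly what would hit the disk.
  return codec->Compress(encoded->size(), encoded->data(), max_len, out.data());
}
```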

pitrou added a commit that referenced this issue Jan 3, 2023
…ing (#15140)

This patch adds benchmarks for DELTA_BINARY_PACKED. Unlike PLAIN, it has to consider both the case where the data can be well compressed and the case where it cannot.
* Closes: #14951

Lead-authored-by: mwish <anmmscs_maple@qq.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 11.0.0 milestone Jan 3, 2023
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this issue Jan 5, 2023
… encoding (apache#15140)

This patch adds benchmarks for DELTA_BINARY_PACKED. Unlike PLAIN, it has to consider both the case where the data can be well compressed and the case where it cannot.
* Closes: apache#14951

Lead-authored-by: mwish <anmmscs_maple@qq.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue Jan 9, 2023
… encoding (apache#15140)

This patch adds benchmarks for DELTA_BINARY_PACKED. Unlike PLAIN, it has to consider both the case where the data can be well compressed and the case where it cannot.
* Closes: apache#14951

Lead-authored-by: mwish <anmmscs_maple@qq.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>