GH-36297: [C++][Parquet] Benchmark for non-binary dict encoding #36298

Merged

mapleFU
Member

@mapleFU mapleFU commented Jun 26, 2023

Rationale for this change

Add benchmark for non-binary dict encoding

What changes are included in this PR?

Add benchmark BM_DictEncodingInt64

Are these changes tested?

Not needed.

Are there any user-facing changes?

no
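
For readers unfamiliar with what is being benchmarked, dictionary encoding of an integer column can be sketched roughly as follows. This is a minimal illustration, not the Parquet C++ encoder: `DictEncoded`, `DictEncodeInt64`, and `DictDecodeInt64` are hypothetical names, and the real encoder additionally writes the indices with the RLE/bit-packing hybrid encoding, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Result of dictionary-encoding a column: the distinct values in first-seen
// order, plus one index into that dictionary per input value.
struct DictEncoded {
  std::vector<int64_t> dictionary;
  std::vector<int32_t> indices;
};

// Hypothetical helper (not part of Parquet C++), for illustration only.
DictEncoded DictEncodeInt64(const std::vector<int64_t>& values) {
  DictEncoded out;
  out.indices.reserve(values.size());
  std::unordered_map<int64_t, int32_t> memo;  // value -> dictionary index
  for (int64_t v : values) {
    // emplace() leaves the map unchanged if the value is already known.
    auto res = memo.emplace(v, static_cast<int32_t>(out.dictionary.size()));
    if (res.second) out.dictionary.push_back(v);  // first occurrence
    out.indices.push_back(res.first->second);
  }
  return out;
}

// Inverse operation: look each index up in the dictionary.
std::vector<int64_t> DictDecodeInt64(const DictEncoded& enc) {
  std::vector<int64_t> values;
  values.reserve(enc.indices.size());
  for (int32_t idx : enc.indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < enc.dictionary.size());
    values.push_back(enc.dictionary[static_cast<std::size_t>(idx)]);
  }
  return values;
}
```

The hash lookup per value is the part an encoding benchmark mostly exercises, which is one plausible reason encoding is expected to be slower than decoding.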

@mapleFU mapleFU requested a review from wjones127 as a code owner June 26, 2023 09:10
@github-actions

⚠️ GitHub issue #36297 has been automatically assigned in GitHub to PR creator.

@mapleFU
Member Author

mapleFU commented Jun 26, 2023

@pitrou Would you mind taking a look? Or are there already some benchmarks for this?

@pitrou
Member

pitrou commented Jun 26, 2023

Why not, but is this useful? Is dictionary encoding often used with integers?

@mapleFU
Member Author

mapleFU commented Jun 26, 2023

Hmm, sometimes we use dictionary encoding with integers, and I found that there is no benchmark for it, so I added this one.

@pitrou
Member

pitrou commented Jun 26, 2023

Are there decoding benchmarks already?

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Jun 26, 2023
@mapleFU
Member Author

mapleFU commented Jun 26, 2023

@pitrou It seems we already have some decoding benchmarks, but they only test one value; see DecodeDict. I've unified EncodeDict so that it works the same way as DecodeDict.

@mapleFU mapleFU force-pushed the parquet/optimize-dict-encoding-and-benchmark branch from 19bfd91 to ac35a89 Compare June 26, 2023 15:46
Member

@pitrou pitrou left a comment

Thanks @mapleFU ! A couple other suggestions.

cpp/src/parquet/encoding_benchmark.cc
encoder->FlushValues();
}

state.SetBytesProcessed(state.iterations() * state.range(0) * sizeof(T));
Member

Can you also add a SetItemsProcessed here and possibly in other benchmark functions?

Member

And by the way:

Suggested change:
- state.SetBytesProcessed(state.iterations() * state.range(0) * sizeof(T));
+ state.SetBytesProcessed(state.iterations() * num_values * sizeof(T));

Member

You addressed only one comment here.

Member Author

Sorry for missing the message. I've fixed it and pasted the results.

@mapleFU
Member Author

mapleFU commented Jun 27, 2023

Comment fixed

@mapleFU
Member Author

mapleFU commented Jun 27, 2023

M1 macOS, clang++ 13.0, Release (O2):

BM_DictDecodingInt64_repeats/1024                                     1016 ns         1013 ns       694975 bytes_per_second=7.53056G/s items_per_second=1010.73M/s
BM_DictDecodingInt64_repeats/4096                                     1373 ns         1369 ns       480367 bytes_per_second=22.2907G/s items_per_second=2.99181G/s
BM_DictDecodingInt64_repeats/32768                                    4692 ns         4672 ns       141777 bytes_per_second=52.2591G/s items_per_second=7.0141G/s
BM_DictDecodingInt64_repeats/65536                                    8837 ns         8828 ns        71320 bytes_per_second=55.3124G/s items_per_second=7.42391G/s
BM_DictEncodingInt64_repeats/1024                                     6120 ns         5971 ns       117212 bytes_per_second=1.27764G/s items_per_second=171.482M/s
BM_DictEncodingInt64_repeats/4096                                    23239 ns        23204 ns        30078 bytes_per_second=1.31518G/s items_per_second=176.52M/s
BM_DictEncodingInt64_repeats/32768                                  192336 ns       185895 ns         3687 bytes_per_second=1.31332G/s items_per_second=176.271M/s
BM_DictEncodingInt64_repeats/65536                                  386575 ns       371757 ns         1870 bytes_per_second=1.31344G/s items_per_second=176.287M/s
BM_DictDecodingInt64_literals/1024                                    2027 ns         1862 ns       376684 bytes_per_second=4.09711G/s items_per_second=549.904M/s
BM_DictDecodingInt64_literals/4096                                    5082 ns         4902 ns       143489 bytes_per_second=6.22589G/s items_per_second=835.625M/s
BM_DictDecodingInt64_literals/32768                                  68385 ns        59901 ns        12076 bytes_per_second=4.07575G/s items_per_second=547.038M/s
BM_DictDecodingInt64_literals/65536                                 114003 ns       111464 ns         6023 bytes_per_second=4.38064G/s items_per_second=587.959M/s
BM_DictEncodingInt64_literals/1024                                    7405 ns         7394 ns        94886 bytes_per_second=1056.53M/s items_per_second=138.482M/s
BM_DictEncodingInt64_literals/4096                                   30762 ns        30462 ns        23066 bytes_per_second=1025.87M/s items_per_second=134.463M/s
BM_DictEncodingInt64_literals/32768                                 267431 ns       267176 ns         2494 bytes_per_second=935.713M/s items_per_second=122.646M/s
BM_DictEncodingInt64_literals/65536                                 581665 ns       580486 ns         1139 bytes_per_second=861.347M/s items_per_second=112.898M/s
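
As a sanity check on how the two counters relate, the rates can be recomputed from the CPU time column. The sketch below assumes each rate is total work divided by CPU seconds, with bytes equal to items times `sizeof(int64_t)`; `Throughput` and `ComputeThroughput` are hypothetical names used only for illustration. The reported figures are consistent with `bytes_per_second` using 1024-based prefixes and `items_per_second` using 1000-based ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rates derived from a benchmark run: total work divided by CPU seconds.
struct Throughput {
  double items_per_second;
  double bytes_per_second;
};

// Hypothetical helper, for illustration only. `cpu_ns_per_iteration` is the
// value from the CPU column of the report, `num_values` the benchmark argument.
Throughput ComputeThroughput(int64_t iterations, int64_t num_values,
                             std::size_t value_size,
                             double cpu_ns_per_iteration) {
  assert(iterations > 0 && cpu_ns_per_iteration > 0.0);
  const double total_seconds =
      static_cast<double>(iterations) * cpu_ns_per_iteration * 1e-9;
  const double total_items =
      static_cast<double>(iterations) * static_cast<double>(num_values);
  const double total_bytes = total_items * static_cast<double>(value_size);
  return {total_items / total_seconds, total_bytes / total_seconds};
}

// Unit conversions matching the report: G for bytes is 2^30, M for items is 10^6.
double ToGiB(double bytes) { return bytes / (1024.0 * 1024.0 * 1024.0); }
double ToMillions(double items) { return items / 1e6; }
```

For `BM_DictEncodingInt64_repeats/1024` (1024 values, 5971 ns CPU time per iteration), this gives roughly 171.5M items/s and 1.278 GiB/s, in line with the reported 171.482M/s and 1.27764G/s; the small difference comes from the CPU time being rounded to whole nanoseconds in the report.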

@mapleFU mapleFU force-pushed the parquet/optimize-dict-encoding-and-benchmark branch from df7c6ab to f233f75 Compare June 27, 2023 16:50
@pitrou pitrou merged commit 4198aac into apache:main Jun 27, 2023
32 of 33 checks passed
@pitrou pitrou removed the awaiting committer review label Jun 27, 2023
@conbench-apache-arrow

Conbench analyzed the 6 benchmark runs on commit 4198aacf.

There were 8 benchmark results indicating a performance regression:

The full Conbench report has more details.

Successfully merging this pull request may close these issues.

[C++][Parquet] Benchmark Dict Encoding for non-binary types
2 participants