ARROW-11877: [C++] Add microbenchmark for SimplifyWithGuarantee #9638

Closed

lidavidm commented Mar 5, 2021

This adds a microbenchmark for SimplifyWithGuarantee. Because it is used to evaluate partition expressions against the filter, it can account for a significant share of the time spent reading a dataset, especially a large one. The benchmark was used to help investigate ARROW-11781.
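
As a rough illustration of the operation being measured, here is a toy model of simplifying a filter against a partition guarantee. It is a sketch only: `EqualityPredicate`, `Outcome`, and `SimplifyToy` are hypothetical names, not Arrow's API, and the real SimplifyWithGuarantee handles far more than a single equality predicate.

```cpp
#include <string>

// Toy stand-in for a predicate such as `part == 3`.
// Hypothetical type for illustration; this is NOT Arrow's Expression.
struct EqualityPredicate {
  std::string field;
  int value;
};

// Possible results of simplifying a filter against a guarantee.
enum class Outcome {
  kAlwaysTrue,   // the guarantee implies the filter: every row passes
  kAlwaysFalse,  // the guarantee contradicts the filter: skip the partition
  kUnchanged     // the guarantee says nothing about the filter's field
};

// Simplify `filter` given that `guarantee` holds for every row
// in a partition. Only equality on a single field is modeled.
Outcome SimplifyToy(const EqualityPredicate& filter,
                    const EqualityPredicate& guarantee) {
  if (filter.field != guarantee.field) return Outcome::kUnchanged;
  return filter.value == guarantee.value ? Outcome::kAlwaysTrue
                                         : Outcome::kAlwaysFalse;
}
```

With a guarantee of `part == 3`, the filter `part == 3` yields `kAlwaysTrue` and the filter `part == 5` yields `kAlwaysFalse`. This check runs once per partition, which is why its cost can add up on datasets with many partitions.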

Two different filters are tested: one is fully simplified, and one has had casts inserted (which will happen if you Bind() against a schema with different types).
Two different partition expressions are tested: one is fully simplified, and one compares against dictionary-encoded values (which will happen by default if you infer the schema for a Hive-partitioned dataset, for example).

All four combinations are additionally tested both when the filter matches the partition expression and when it does not.
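
The resulting 2 x 2 x 2 case matrix can be sketched as follows. The naming scheme is taken from the benchmark output reported in this PR; reading `lhs` as the filter, `guarantee` as the partition expression, and `negative`/`positive` as the non-matching/matching cases is an assumption based on the description above. `BenchmarkCaseNames` is a hypothetical helper, not part of the patch.

```cpp
#include <string>
#include <vector>

// Enumerate the eight benchmark case names: match outcome x
// partition-expression kind x filter kind.
// Hypothetical helper for illustration only.
std::vector<std::string> BenchmarkCaseNames() {
  const std::vector<std::string> outcomes = {"negative", "positive"};
  const std::vector<std::string> guarantees = {"simple", "dictionary"};
  const std::vector<std::string> filters = {"simple", "cast"};

  std::vector<std::string> names;
  for (const auto& outcome : outcomes) {
    for (const auto& guarantee : guarantees) {
      for (const auto& filter : filters) {
        names.push_back(outcome + "_lhs_" + filter + "_guarantee_" + guarantee);
      }
    }
  }
  return names;
}
```

The loop order is chosen so the names come out in the same order as the rows of the reported results.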

lidavidm commented Mar 5, 2021

Results

--------------------------------------------------------------------------------------------------------------
Benchmark                                                                    Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------
SimplifyQueryWithGuarantee/negative_lhs_simple_guarantee_simple           7729 ns         7729 ns        88034
SimplifyQueryWithGuarantee/negative_lhs_cast_guarantee_simple            10769 ns        10769 ns        64463
SimplifyQueryWithGuarantee/negative_lhs_simple_guarantee_dictionary      17703 ns        17703 ns        39026
SimplifyQueryWithGuarantee/negative_lhs_cast_guarantee_dictionary        21208 ns        21207 ns        32716
SimplifyQueryWithGuarantee/positive_lhs_simple_guarantee_simple           7689 ns         7689 ns        88028
SimplifyQueryWithGuarantee/positive_lhs_cast_guarantee_simple            10793 ns        10793 ns        63819
SimplifyQueryWithGuarantee/positive_lhs_simple_guarantee_dictionary      18147 ns        18146 ns        39027
SimplifyQueryWithGuarantee/positive_lhs_cast_guarantee_dictionary        21193 ns        21193 ns        33022
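
To make the relative costs easier to read, the following sketch computes each negative case's slowdown against the simple/simple baseline, using the times from the table above. `Result`, `NegativeResults`, and `Slowdown` are hypothetical helpers for illustration, and the `SimplifyQueryWithGuarantee/negative_` prefix is dropped from the names for brevity.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One row of the results table (hypothetical helper type).
struct Result {
  std::string name;
  double time_ns;
};

// Times copied from the "negative" rows of the table above.
std::vector<Result> NegativeResults() {
  return {
      {"lhs_simple_guarantee_simple", 7729.0},
      {"lhs_cast_guarantee_simple", 10769.0},
      {"lhs_simple_guarantee_dictionary", 17703.0},
      {"lhs_cast_guarantee_dictionary", 21208.0},
  };
}

// How many times slower `other` is than `baseline`.
double Slowdown(const Result& baseline, const Result& other) {
  return other.time_ns / baseline.time_ns;
}

// Print every case relative to the first (simple/simple) row.
void PrintSlowdowns(const std::vector<Result>& results) {
  for (const auto& r : results) {
    std::printf("%-34s %.2fx\n", r.name.c_str(), Slowdown(results[0], r));
  }
}
```

On these numbers, inserted casts cost roughly 1.4x the baseline, dictionary-encoded guarantees roughly 2.3x, and both together roughly 2.7x. The positive cases in the table are within a few percent of the negative ones.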

cpp/src/arrow/dataset/expression_benchmark.cc: 7 review threads, all outdated and resolved

bkietz left a comment

LGTM, thanks for doing this!

bkietz closed this in 2ace1e3 on Mar 10, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Closes apache#9638 from lidavidm/arrow-11877

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021