ARROW-17305: [C++] Avoid spending time in popcount in BitmapAnd benchmark #13794

Merged
merged 1 commit into from
Aug 5, 2022

pitrou
Member

@pitrou pitrou commented Aug 4, 2022

This was artificially limiting the reported performance of BitmapAnd.

Before:

--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BenchmarkBitmapAnd/32768/0        1708 ns         1708 ns       408579 bytes_per_second=17.8726G/s
BenchmarkBitmapAnd/131072/0       6968 ns         6965 ns       102223 bytes_per_second=17.5262G/s
BenchmarkBitmapAnd/32768/1        3982 ns         3981 ns       175136 bytes_per_second=7.66574G/s
BenchmarkBitmapAnd/131072/1      15574 ns        15569 ns        44988 bytes_per_second=7.8404G/s
BenchmarkBitmapAnd/32768/2        3999 ns         3998 ns       175021 bytes_per_second=7.63248G/s
BenchmarkBitmapAnd/131072/2      15589 ns        15585 ns        44844 bytes_per_second=7.83234G/s

After:

--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BenchmarkBitmapAnd/32768/0         732 ns          732 ns       967465 bytes_per_second=41.6736G/s
BenchmarkBitmapAnd/131072/0       3105 ns         3105 ns       229726 bytes_per_second=39.3198G/s
BenchmarkBitmapAnd/32768/1        2913 ns         2913 ns       240233 bytes_per_second=10.4774G/s
BenchmarkBitmapAnd/131072/1      11528 ns        11526 ns        60865 bytes_per_second=10.5912G/s
BenchmarkBitmapAnd/32768/2        2924 ns         2924 ns       236873 bytes_per_second=10.4378G/s
BenchmarkBitmapAnd/131072/2      11552 ns        11550 ns        60619 bytes_per_second=10.5691G/s

(I didn't check, but the compiler here probably auto-vectorizes the aligned code path)
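For readers unfamiliar with the pitfall, here is a minimal sketch of how a benchmark body can end up timing more than the operation it names. It is not the diff from this PR and not Arrow's implementation: `AndBytes`, `CountSetBits`, `KeepAlive`, `BodyWithPopcount` and `BodyWithBarrier` are hypothetical names chosen for illustration. The assumption is that a popcount over the output was used to keep the result alive, and that a cheap compiler barrier can play the same role without a second pass over the data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte-aligned bitmap AND (not Arrow's code).
// A plain loop like this is a typical candidate for auto-vectorization.
void AndBytes(const std::uint8_t* left, const std::uint8_t* right,
              std::uint8_t* out, std::size_t num_bytes) {
  for (std::size_t i = 0; i < num_bytes; ++i) {
    out[i] = left[i] & right[i];
  }
}

// Portable popcount over a buffer; deliberately simple.
std::size_t CountSetBits(const std::uint8_t* data, std::size_t num_bytes) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < num_bytes; ++i) {
    // Clear the lowest set bit until the byte is zero.
    for (std::uint8_t b = data[i]; b != 0; b &= static_cast<std::uint8_t>(b - 1)) {
      ++count;
    }
  }
  return count;
}

// Cheap sink: makes the buffer observable to the optimizer without reading it.
inline void KeepAlive(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
  asm volatile("" : : "r"(p) : "memory");
#else
  static const void* volatile sink;
  sink = p;
#endif
}

// Variant A: the timed body also pays for a full popcount pass over the output.
std::size_t BodyWithPopcount(const std::vector<std::uint8_t>& left,
                             const std::vector<std::uint8_t>& right,
                             std::vector<std::uint8_t>* out) {
  AndBytes(left.data(), right.data(), out->data(), out->size());
  return CountSetBits(out->data(), out->size());
}

// Variant B: the timed body pays only for the AND itself.
void BodyWithBarrier(const std::vector<std::uint8_t>& left,
                     const std::vector<std::uint8_t>& right,
                     std::vector<std::uint8_t>* out) {
  AndBytes(left.data(), right.data(), out->data(), out->size());
  KeepAlive(out->data());
}
```

Both variants produce the same output buffer; only the work inside the timed region differs. In a real Google Benchmark loop the usual tools for variant B would be `benchmark::DoNotOptimize` and `benchmark::ClobberMemory`.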

@pitrou pitrou requested a review from westonpace August 4, 2022 09:26
github-actions bot commented Aug 4, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@pitrou pitrou force-pushed the ARROW-17305-bitmap-and-popcount branch from 8d94169 to f2f03a1 Compare August 4, 2022 15:03
Member

@westonpace westonpace left a comment

+1. I don't really know the details of what must be done to prevent the compiler from optimizing away a computation, but while the new numbers are fast, I'm pretty sure they are also realistic. I get similar numbers on my system and I think it works out to ~8 bytes/cycle.

@westonpace
Member

Yep, if I set ARROW_SIMD_LEVEL=NONE then I get slower performance (it's actually pretty similar to your "before" numbers), so there must be some kind of auto-vectorization going on.

@cyb70289 cyb70289 merged commit 56e6caf into apache:master Aug 5, 2022
@cyb70289
Contributor

cyb70289 commented Aug 5, 2022

CI error not related

@ursabot

ursabot commented Aug 5, 2022

Benchmark runs are scheduled for baseline = 81ded07 and contender = 56e6caf. 56e6caf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.27% ⬆️0.65%] test-mac-arm
[Finished ⬇️0.54% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.43% ⬆️1.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 56e6caf0 ec2-t3-xlarge-us-east-2
[Failed] 56e6caf0 test-mac-arm
[Finished] 56e6caf0 ursa-i9-9960x
[Finished] 56e6caf0 ursa-thinkcentre-m75q
[Failed] 81ded071 ec2-t3-xlarge-us-east-2
[Finished] 81ded071 test-mac-arm
[Finished] 81ded071 ursa-i9-9960x
[Finished] 81ded071 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Aug 5, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@pitrou pitrou deleted the ARROW-17305-bitmap-and-popcount branch August 5, 2022 06:00