
Reenable popcnt for MSVC, improve perf by leveraging ILP #38

Merged: 1 commit into boostorg:develop from Lastique:reenable_popcount on Apr 26, 2019

Conversation

@Lastique (Member) commented Apr 13, 2019

On MSVC and compatible compilers popcnt is now enabled if it is known at
compile time that the target CPU supports popcnt instruction. Unfortunately,
MSVC does not have a predefined macro specifically for popcnt, so we have
to test for AVX instead. __POPCNT__ is still tested in case clang or the user
defines it.

For MSVC, the 64-bit popcount is now also implemented on 32-bit targets by
issuing two 32-bit popcnt instructions. For gcc and compatible compilers,
a 16-bit popcount implementation is provided, which is better than the generic
popcount implementation, but still slower than the byte-wise do_count.

The do_count algorithm implementations have been improved by leveraging
instruction level parallelism better, which gives 0 to 27% improvement and
no regressions.

Also made code formatting more consistent and reduced code duplication between
do_count implementations.

This fixes #36

@jeking3 (Collaborator) commented Apr 23, 2019

I know about this; it's blocked on refreshing the CI to clear out the build failure(s).

@jeking3 (Collaborator) commented Apr 23, 2019

@Lastique please rebase on develop to clear out the build issues. Thanks.

@Lastique Lastique force-pushed the Lastique:reenable_popcount branch from cf1a64b to c46d454 Apr 23, 2019

@codecov commented Apr 23, 2019

Codecov Report

Merging #38 into develop will increase coverage by 1.9%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           develop      #38     +/-   ##
==========================================
+ Coverage    77.15%   79.05%   +1.9%     
==========================================
  Files            5        5             
  Lines          604      616     +12     
  Branches       211      204      -7     
==========================================
+ Hits           466      487     +21     
  Misses          24       24             
+ Partials       114      105      -9
Impacted Files Coverage Δ
...ude/boost/dynamic_bitset/detail/dynamic_bitset.hpp 97.22% <100%> (+3.67%) ⬆️
include/boost/dynamic_bitset/dynamic_bitset.hpp 77.93% <0%> (+1.69%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c747bec...b29b053.

@jeking3 (Collaborator) commented Apr 23, 2019

At this point I think the failures are legit and not a product of CI.

@Lastique (Member, Author) commented Apr 23, 2019

What failures?

@jeking3 (Collaborator) commented Apr 23, 2019

Click through to the Appveyor build.

@Lastique (Member, Author) commented Apr 23, 2019

I'll take a look later, thanks.

@Lastique Lastique force-pushed the Lastique:reenable_popcount branch from c46d454 to f3b2be9 Apr 25, 2019

Reenabled popcnt for MSVC, improved perf by leveraging ILP.

@Lastique Lastique force-pushed the Lastique:reenable_popcount branch from f3b2be9 to b29b053 Apr 25, 2019

@Lastique (Member, Author) commented Apr 26, 2019

The tests have passed.

while (length >= 2u) {
    num1 += popcount<ValueType>(*first);
    ++first;
    num2 += popcount<ValueType>(*first);

@jeking3 (Collaborator) commented Apr 26, 2019
It's interesting this improves performance. Simply by assigning the result to two variables (i.e. registers) that allows them to execute independently even though they are both tied to the advancement of the iterator? I would have expected to need some sort of hint or intrinsic for that. Is it the compiler or the CPU doing this?

@jeking3 (Collaborator) commented Apr 26, 2019

I guess on Intel architecture it's CPU based, after some reading.

@Lastique (Member, Author) commented Apr 26, 2019

The iterator is most likely advanced only once per loop iteration. One of the two iterator dereference operations translates to a memory access with a static offset.

The important optimization here is breaking the single dependency chain on the accumulator into two, which allows the CPU to execute the popcnt and accumulate instructions of the two chains in parallel. Instruction-level parallelism is a common feature of modern CPUs that support out-of-order execution. In theory, a compiler could generate equivalent code, but I haven't seen a compiler do this.

@jeking3 (Collaborator) commented Apr 26, 2019

Potentially any loop unroll without side-effects would be eligible for this sort of thing?

@Lastique (Member, Author) commented Apr 26, 2019

Yes, most of the time. You may not get any speedup in some cases, e.g. if the algorithm is memory-bound or the loop body becomes too large for the CPU's out-of-order window. The latter case is somewhat mitigated by instruction reordering done by the compiler or by hand.

@jeking3 (Collaborator) commented Apr 26, 2019

Sounds like an interesting compiler option if you ask me. :) Thanks for the explanation.

@jeking3 jeking3 merged commit 83bdf5a into boostorg:develop Apr 26, 2019

4 checks passed

codecov/patch: 100% of diff hit (target 77.15%)
codecov/project: 79.05% (+1.9%) compared to 8b87ff2
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@Lastique Lastique deleted the Lastique:reenable_popcount branch Apr 26, 2019

std::size_t num = 0u;
while (value) {
    num += count_table<>::table[value & ((1u<<table_width) - 1)];
    value >>= table_width;

@jeking3 (Collaborator) commented Apr 26, 2019

Coverity flagged this issue:

CID 338011 (#1 of 1): Bad bit shift operation (BAD_SHIFT)
2. large_shift: In expression value >>= 8U, right shifting value by more than 7 bits always yields zero. The shift amount is 8.

I believe this is intentional in this case, but I want to double-check with you.

@Lastique (Member, Author) commented Apr 26, 2019
  1. I believe the original code was the same; the change was only in the indentation.
  2. The intention of the code is to eventually shift off all set bits, leaving value zero. If value is narrower than int, it is promoted to int before the shift operation, so the shift by 8 bits never exceeds the width of the left-hand operand. I believe the warning is a false positive.

@jeking3 (Collaborator) commented Apr 26, 2019

I'll mark it as such, thanks. There's another interesting one in the issues which looks like a real bug.
