SIMD popcount

Sample programs for my article http://0x80.pl/articles/sse-popcount.html

Paper

Daniel Lemire, Nathan Kurz and I published an article: "Faster Population Counts Using AVX2 Instructions".

Introduction

The subdirectory original contains the legacy code from 2008, which is 32-bit and GCC-centric. The root directory contains newer C++11 code, written with intrinsics and tested on 64-bit machines.

There are two programs:

  • verify --- tests whether all non-lookup implementations count bits properly;
  • speed --- benchmarks different implementations of the popcount procedure; run the program without arguments to print help listing all options.
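
The kind of check that verify performs can be sketched as follows. This is a minimal illustration and not the repository's verify.cpp: the names popcnt_reference, popcnt_candidate and verify, the function-pointer signature, and the pseudo-random test buffer are all assumptions made for the example. The candidate is compared against a slow bit-by-bit reference on every buffer length up to a limit, so tail handling is exercised too.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical signature shared by all implementations in this sketch.
using popcnt_fn = std::uint64_t (*)(const std::uint8_t*, std::size_t);

// Reference: count set bits one at a time (slow, but easy to trust).
std::uint64_t popcnt_reference(const std::uint8_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; i++) {
        for (unsigned v = data[i]; v != 0; v >>= 1) {
            total += v & 1u;
        }
    }
    return total;
}

// Example candidate (hypothetical): clear the lowest set bit until none remain.
std::uint64_t popcnt_candidate(const std::uint8_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; i++) {
        for (unsigned v = data[i]; v != 0; v &= v - 1) {
            total++;
        }
    }
    return total;
}

// Returns true if `candidate` agrees with the reference for every
// prefix length 0..max_len of a pseudo-random buffer.
bool verify(popcnt_fn candidate, std::size_t max_len) {
    std::vector<std::uint8_t> buf(max_len);
    std::uint32_t state = 0x12345678u;
    for (auto& b : buf) {
        state = state * 1664525u + 1013904223u;  // simple LCG, fixed seed
        b = static_cast<std::uint8_t>(state >> 24);
    }
    for (std::size_t len = 0; len <= max_len; len++) {
        if (candidate(buf.data(), len) != popcnt_reference(buf.data(), len)) {
            return false;
        }
    }
    return true;
}
```

Calling verify(popcnt_candidate, 257) should return true; a fixed seed keeps any failure reproducible.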

There are several targets:

  • default --- builtin functions, SSE and popcnt instructions;
  • AVX2 --- all above plus AVX2 implementations;
  • AVX512BW --- all above plus experimental AVX512BW code (requires a software emulator);
  • AVX512 VPOPCNT --- all above plus experimental AVX512 VPOPCNT code (should compile with a very recent GCC; the software emulator doesn't support this extension yet);
  • arm --- builtin and ARM Neon implementations.

Type make help to find out details. To run the default target benchmark simply type make.
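
As a rough illustration of how the targets above relate to compiler flags, the sketch below maps macros that GCC and Clang predefine (for example __AVX2__ when building with -mavx2) to the target names from the list. It is an assumption-laden example: the function compiled_target is hypothetical, and the repository's Makefile and config.h may select code paths in a different way.

```cpp
#include <string>

// Hypothetical helper: report which target from the list above the
// current translation unit was compiled for, based only on macros that
// GCC/Clang predefine for the enabled instruction sets.
std::string compiled_target() {
#if defined(__AVX512VPOPCNTDQ__)
    return "AVX512 VPOPCNT";   // e.g. -mavx512vpopcntdq
#elif defined(__AVX512BW__)
    return "AVX512BW";         // e.g. -mavx512bw
#elif defined(__AVX2__)
    return "AVX2";             // e.g. -mavx2
#elif defined(__ARM_NEON) || defined(__aarch64__)
    return "arm";
#else
    return "default";          // no wider extension was enabled
#endif
}
```

Printing compiled_target() at program start is one way to confirm that a binary was built with the intended flags.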

Available implementations

======================================  ==================================================================
procedure                               description
======================================  ==================================================================
lookup-8                                lookup in std::uint8_t[256] LUT
lookup-64                               lookup in std::uint64_t[256] LUT
bit-parallel                            naive bit-parallel method
bit-parallel-optimized                  slightly improved bit-parallel method
bit-parallel-mul                        bit-parallel with fewer instructions
bit-parallel32                          naive bit-parallel method (32-bit)
bit-parallel-optimized32                slightly improved bit-parallel method (32-bit)
harley-seal                             Harley-Seal popcount (4th iteration)
sse-bit-parallel                        SSE implementation of bit-parallel-optimized (unrolled)
sse-bit-parallel-original               SSE implementation of bit-parallel-optimized
sse-bit-parallel-better                 SSE implementation of bit-parallel with fewer instructions
sse-harley-seal                         SSE implementation of Harley-Seal
sse-lookup                              SSSE3 variant using pshufb instruction (unrolled)
sse-lookup-original                     SSSE3 variant using pshufb instruction
avx2-lookup                             AVX2 variant using pshufb instruction (unrolled)
avx2-lookup-original                    AVX2 variant using pshufb instruction
avx2-harley-seal                        AVX2 implementation of Harley-Seal
cpu                                     CPU instruction popcnt (64-bit variant)
sse-cpu                                 load data with SSE, then count bits using popcnt
avx2-cpu                                load data with AVX2, then count bits using popcnt
avx512-harley-seal                      AVX512 implementation of Harley-Seal
avx512-vpopcnt                          AVX512 VPOPCNT
builtin-popcnt                          builtin for popcnt
builtin-popcnt32                        builtin for popcnt (32-bit variant)
builtin-popcnt-unrolled                 unrolled builtin-popcnt
builtin-popcnt-unrolled32               unrolled builtin-popcnt32
builtin-popcnt-unrolled-errata          unrolled builtin-popcnt avoiding false-dependency
builtin-popcnt-unrolled-errata-manual   unrolled builtin-popcnt avoiding false-dependency (assembly code)
builtin-popcnt-movdq                    builtin-popcnt where data is loaded via SSE registers
builtin-popcnt-movdq-unrolled           builtin-popcnt-movdq unrolled
builtin-popcnt-movdq-unrolled_manual    builtin-popcnt-movdq unrolled (assembly code)
neon-vcnt                               ARM Neon using VCNT
neon-HS                                 Harley-Seal using Neon VCNT
aarch64-cnt                             ARMv8 Neon using CNT
======================================  ==================================================================
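
To make a few of the scalar entries above more concrete, here is a portable sketch of three of the ideas: a byte-wise table lookup (as in lookup-8), a bit-parallel count finished with a multiplication (as in bit-parallel-mul), and a reduced Harley-Seal loop built on a carry-save adder. These are textbook formulations written for illustration, not the code from this repository; the function names and signatures are assumptions, and the Harley-Seal loop here stops at the "fours" level with four words per iteration, which is shallower than the implementations listed in the table.

```cpp
#include <cstddef>
#include <cstdint>

// lookup-8 style: one table lookup per input byte.
// The table is filled on first use (not thread-safe; fine for a sketch).
std::uint64_t popcnt_lookup8(const std::uint8_t* data, std::size_t n) {
    static std::uint8_t lut[256];  // zero-initialized, so lut[0] == 0
    static bool ready = false;
    if (!ready) {
        for (int i = 1; i < 256; i++) {
            lut[i] = static_cast<std::uint8_t>((i & 1) + lut[i / 2]);
        }
        ready = true;
    }
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; i++) {
        total += lut[data[i]];
    }
    return total;
}

// bit-parallel-mul style: sum bits in 2-, 4- and 8-bit fields, then add
// the eight byte counters with a single multiplication.
std::uint64_t popcnt64_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0full;
    return (x * 0x0101010101010101ull) >> 56;
}

// Carry-save adder: for every bit position, a + b + c == 2*high + low.
void csa(std::uint64_t& high, std::uint64_t& low,
         std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const std::uint64_t u = a ^ b;
    high = (a & b) | (u & c);
    low = u ^ c;
}

// Reduced Harley-Seal loop: four words per iteration are folded into
// `ones` and `twos`; only the resulting `fours` word is bit-counted.
std::uint64_t popcnt_harley_seal(const std::uint64_t* words, std::size_t n) {
    std::uint64_t ones = 0, twos = 0, fours_total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        std::uint64_t twos_a, twos_b, fours;
        csa(twos_a, ones, ones, words[i], words[i + 1]);
        csa(twos_b, ones, ones, words[i + 2], words[i + 3]);
        csa(fours, twos, twos, twos_a, twos_b);
        fours_total += popcnt64_mul(fours);
    }
    std::uint64_t total =
        4 * fours_total + 2 * popcnt64_mul(twos) + popcnt64_mul(ones);
    for (; i < n; i++) {  // leftover words are counted directly
        total += popcnt64_mul(words[i]);
    }
    return total;
}
```

The point of the carry-save adder is that the expensive per-word bit count runs once per four input words instead of four times; deeper variants (eights, sixteens) push that ratio further, and the SIMD versions apply the same scheme to whole vector registers.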

Performance results

The subdirectory results contains performance results from various computers. If you can, please contribute.

Acknowledgments

  • Kim Walisch (@kimwalisch) wrote the scalar Harley-Seal implementation.
  • Simon Lindholm (@simonlindholm) added unrolled versions of procedures.
  • Dan Luu (@danluu) agreed to include his procedures (builtin-*) in this project. More details are in Dan's article "Hand coded assembly beats intrinsics in speed and simplicity".

See also

  • libpopcnt --- a library by Kim Walisch that uses methods from our paper.