SIMD popcount

Sample programs for my article http://0x80.pl/articles/sse-popcount.html

Paper

Daniel Lemire, Nathan Kurz and I published an article Faster Population Counts using AVX2 Instructions.

Introduction

Subdirectory original contains code from 2008 --- it is 32-bit and GCC-centric. The root directory contains fresh C++11 code, written with intrinsics and tested on 64-bit machines.

There are two programs:

verify --- it tests if all non-lookup implementations counts bits properly;
speed --- benchmarks different implementations of popcount procedure; please read help to find all options (run the program without arguments).

There are several targets:

default --- builtin functions, SSE and popcnt instructions;
AVX2 --- all above plus AVX2 implementations;
AVX512BW --- all above plus experimental AVX512BW code;
AVX512VBMI --- all above plus experimental AVX512VBMI code;
AVX512 VPOPCNT --- all above plus experimental AVX512 VPOPCNT code (should be compilable with very recent GCC, software emulator doesn't support this extension yet);
arm --- builtin and ARM Neon implementations.

Type make help to find out details. To run the default target benchmark simply type make.

Available implementations

procedure	description
lookup-8	lookup in std::uint8_t[256] LUT
lookup-64	lookup in std::uint64_t[256] LUT
bit-parallel	naive bit parallel method
bit-parallel-optimized	a bit better bit parallel
bit-parallel-optimized2	better utilization of 2- and 4-bit subwords
bit-parallel-mul	bit-parallel with fewer instructions
bit-parallel32	naive bit parallel method (32 bit)
bit-parallel-optimized32	a bit better bit parallel (32 bit)
harley-seal	Harley-Seal popcount (4th iteration)
sse-bit-parallel	SSE implementation of bit-parallel-optimized (unrolled)
sse-bit-parallel-original	SSE implementation of bit-parallel-optimized
sse-bit-parallel-better	SSE implementation of bit-parallel with fewer instructions
sse-harley-seal	SSE implementation of Harley-Seal
sse-lookup	SSSE3 variant using pshufb instruction (unrolled)
sse-lookup-original	SSSE3 variant using pshufb instruction
avx2-lookup	AVX2 variant using pshufb instruction (unrolled)
avx2-lookup-original	AVX2 variant using pshufb instruction
avx2-harley-seal	AVX2 implementation of Harley-Seal
cpu	CPU instruction popcnt (64-bit variant)
sse-cpu	load data with SSE, then count bits using popcnt
avx2-cpu	load data with AVX2, then count bits using popcnt
avx512-harley-seal	AVX512 implementation of Harley-Seal
avx512bw-shuf	AVX512BW implementation uses shuffle instruction
avx512vbmi-shuf	AVX512VBMI implementation uses shuffle instruction
avx512-vpopcnt	AVX512 VPOPCNT
builtin-popcnt	builtin for popcnt
builtin-popcnt32	builtin for popcnt (32-bit variant)
builtin-popcnt-unrolled	unrolled builtin-popcnt
builtin-popcnt-unrolled32	unrolled builtin-popcnt32
builtin-popcnt-unrolled-errata	unrolled builtin-popcnt avoiding false-dependency
builtin-popcnt-unrolled-errata-manual	unrolled builtin-popcnt avoiding false-dependency (asembly code)
builtin-popcnt-movdq	builtin-popcnt where data is loaded via SSE registers
builtin-popcnt-movdq-unrolled	builtin-popcnt-movdq unrolled
builtin-popcnt-movdq-unrolled_manual	builtin-popcnt-movdq unrolled (assembly code)
neon-vcnt	ARM Neon using VCNT
neon-HS	Harley-Seal using Neon VCNT
aarch64-cnt	ARMv8 Neon using CNT

Performance results

The subdirectory results contains performance results from various computers. If you can, please contribute.

Acknowledgments

Kim Walisch (@kimwalisch) wrote Harley-Seal scalar implementation.
Simon Lindholm (@simonlindholm) added unrolled versions of procedures.
Dan Luu (@danluu) agreed to include his procedures (builint-*) into this project. More details in Dan's article Hand coded assembly beats intrinsics in speed and simplicity

Name		Name	Last commit message	Last commit date
Latest commit History 141 Commits
original		original
results		results
scripts		scripts
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
Makefile		Makefile
README.rst		README.rst
config.h		config.h
function_registry.cpp		function_registry.cpp
popcnt-aarch64.cpp		popcnt-aarch64.cpp
popcnt-all.cpp		popcnt-all.cpp
popcnt-avx2-cpu.cpp		popcnt-avx2-cpu.cpp
popcnt-avx2-harley-seal.cpp		popcnt-avx2-harley-seal.cpp
popcnt-avx2-lookup.cpp		popcnt-avx2-lookup.cpp
popcnt-avx512-harley-seal.cpp		popcnt-avx512-harley-seal.cpp
popcnt-avx512-vpopcnt.cpp		popcnt-avx512-vpopcnt.cpp
popcnt-avx512bw-lookup.cpp		popcnt-avx512bw-lookup.cpp
popcnt-avx512vbmi-lookup.cpp		popcnt-avx512vbmi-lookup.cpp
popcnt-bit-parallel-scalar.cpp		popcnt-bit-parallel-scalar.cpp
popcnt-bit-parallel-scalar32.cpp		popcnt-bit-parallel-scalar32.cpp
popcnt-builtin.cpp		popcnt-builtin.cpp
popcnt-cpu.cpp		popcnt-cpu.cpp
popcnt-harley-seal.cpp		popcnt-harley-seal.cpp
popcnt-lookup.cpp		popcnt-lookup.cpp
popcnt-neon.cpp		popcnt-neon.cpp
popcnt-rvv.cpp		popcnt-rvv.cpp
popcnt-sse-bit-parallel-better.cpp		popcnt-sse-bit-parallel-better.cpp
popcnt-sse-bit-parallel.cpp		popcnt-sse-bit-parallel.cpp
popcnt-sse-cpu.cpp		popcnt-sse-cpu.cpp
popcnt-sse-harley-seal.cpp		popcnt-sse-harley-seal.cpp
popcnt-sse-lookup.cpp		popcnt-sse-lookup.cpp
speed.cpp		speed.cpp
sse_operators.cpp		sse_operators.cpp
verify.cpp		verify.cpp

License

WojciechMula/sse-popcount

Folders and files

Latest commit

History

Repository files navigation

SIMD popcount

Paper

Introduction

Available implementations

Performance results

Acknowledgments

See also

About

Topics

Resources

License

Stars

Watchers

Forks

Languages