
Population count comparison for Haswell Core i7-4770 CPU @ 3.40GHz

Generated on: 2016-03-26

Specification

CPU: Haswell Core i7-4770 CPU @ 3.40GHz

Compiler: GCC 5.3.0 (Ubuntu)

Instruction set: AVX2

Number of runs: 5

All times are given in seconds.

Procedures

procedure description
lookup-8 lookup in std::uint8_t[256] LUT
lookup-64 lookup in std::uint64_t[256] LUT
bit-parallel naive bit-parallel method
bit-parallel-optimized slightly improved bit-parallel method
bit-parallel-mul bit-parallel with fewer instructions
harley-seal Harley-Seal popcount (4th iteration)
sse-bit-parallel SSE implementation of bit-parallel-optimized (unrolled)
sse-bit-parallel-original SSE implementation of bit-parallel-optimized
sse-bit-parallel-better SSE implementation of bit-parallel with fewer instructions
sse-harley-seal SSE implementation of Harley-Seal
sse-lookup SSSE3 variant using pshufb instruction (unrolled)
sse-lookup-original SSSE3 variant using pshufb instruction
avx2-lookup AVX2 variant using pshufb instruction (unrolled)
avx2-lookup-original AVX2 variant using pshufb instruction
avx2-harley-seal AVX2 implementation of Harley-Seal
cpu CPU instruction popcnt (64-bit variant)
sse-cpu load data with SSE, then count bits using popcnt
avx2-cpu load data with AVX2, then count bits using popcnt
builtin-popcnt builtin for popcnt
builtin-popcnt32 builtin for popcnt (32-bit variant)
builtin-popcnt-unrolled unrolled builtin-popcnt
builtin-popcnt-unrolled32 unrolled builtin-popcnt32
builtin-popcnt-unrolled-errata unrolled builtin-popcnt avoiding false-dependency
builtin-popcnt-unrolled-errata-manual unrolled builtin-popcnt avoiding false-dependency (assembly code)
builtin-popcnt-movdq builtin-popcnt where data is loaded via SSE registers
builtin-popcnt-movdq-unrolled builtin-popcnt-movdq unrolled
builtin-popcnt-movdq-unrolled_manual builtin-popcnt-movdq unrolled (assembly code)
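
For orientation, the scalar procedures listed above can be sketched roughly as follows. This is a minimal illustration, not the benchmarked source: all function names are hypothetical, the multiply-based reduction is only one common way to get a bit-parallel count with fewer instructions, and the Harley-Seal variant shown uses a single carry-save level rather than the "4th iteration" named in the table. `__builtin_popcountll` is the GCC/Clang builtin.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// lookup-8 style: one table lookup per input byte.
// Assumption: the table is built lazily here; a benchmark would likely precompute it.
inline const std::array<std::uint8_t, 256>& popcount_lut() {
    static const std::array<std::uint8_t, 256> lut = [] {
        std::array<std::uint8_t, 256> t{};  // t[0] == 0
        for (int i = 1; i < 256; ++i) {
            t[i] = static_cast<std::uint8_t>((i & 1) + t[i / 2]);
        }
        return t;
    }();
    return lut;
}

std::uint64_t popcnt_lookup8(const std::uint8_t* data, std::size_t n) {
    const auto& lut = popcount_lut();
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += lut[data[i]];
    return total;
}

// Bit-parallel count of one 64-bit word; the final multiply sums the
// eight per-byte counts into the top byte.
std::uint64_t popcnt_word_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

// builtin-popcnt style: let the compiler emit popcnt where available.
std::uint64_t popcnt_word_builtin(std::uint64_t x) {
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
}

// Carry-save adder: per bit position, a + b + c == 2*high + low.
inline void csa(std::uint64_t& high, std::uint64_t& low,
                std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const std::uint64_t u = a ^ b;
    high = (a & b) | (u & c);
    low = u ^ c;
}

// Simplified Harley-Seal: fold two words per step into `ones`,
// and run a full popcount only on the carried-out `twos` word.
std::uint64_t popcnt_harley_seal_simple(const std::uint64_t* data, std::size_t n) {
    std::uint64_t ones = 0, twos = 0, twos_count = 0;
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        csa(twos, ones, ones, data[i], data[i + 1]);
        twos_count += popcnt_word_mul(twos);
    }
    std::uint64_t total = 2 * twos_count + popcnt_word_mul(ones);
    for (; i < n; ++i) total += popcnt_word_mul(data[i]);  // odd tail word
    return total;
}
```

The simplified Harley-Seal loop performs one word-level popcount per two input words; deeper carry-save trees reduce that ratio further, which is presumably why the Harley-Seal rows gain the most as the input grows.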

Running time

procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 1.20459 1.10942 1.06966 1.11342 1.69944 1.66395 1.64353 1.63281
lookup-64 1.17685 1.09910 1.06269 1.08992 1.67641 1.63699 1.61699 1.60908
bit-parallel 1.32661 1.12067 1.05220 1.02585 1.62042 1.60970 1.60452 1.60648
bit-parallel-optimized 1.03180 0.82544 0.73700 0.69278 1.07308 1.05540 1.04655 1.05100
bit-parallel-mul 0.85492 0.72226 0.65594 0.62647 1.06513 1.02557 1.00872 0.99841
harley-seal 1.03180 0.81070 0.50116 0.39429 0.54653 0.50303 0.48200 0.47094
sse-bit-parallel 1.79140 1.69979 0.95726 0.63293 0.73317 0.60194 0.51672 0.49033
sse-bit-parallel-original 1.41642 0.82879 0.56012 0.42746 0.58246 0.57800 0.54149 0.53427
sse-bit-parallel-better 2.61391 2.03126 1.09129 0.66172 0.71430 0.56356 0.49055 0.45673
sse-harley-seal 1.12661 0.71832 0.50877 0.22741 0.28450 0.24347 0.22265 0.21202
sse-lookup 0.53064 0.33902 0.21373 0.18056 0.26689 0.25495 0.24996 0.24830
sse-lookup-original 1.04292 0.63382 0.41272 0.30954 0.41567 0.41024 0.37114 0.35682
avx2-lookup 0.50116 0.30954 0.20636 0.13575 0.17204 0.14997 0.13613 0.13426
avx2-lookup-original 1.78312 0.93599 0.50484 0.36600 0.31277 0.24320 0.20630 0.19625
avx2-harley-seal 1.04442 0.61678 0.38303 0.26795 0.19295 0.14620 0.12328 0.11145
cpu 0.29480 0.22110 0.16214 0.14003 0.20636 0.19751 0.20783 0.19857
sse-cpu 2.18153 0.26444 0.20636 0.17741 0.26346 0.25142 0.26614 0.25413
avx2-cpu 2.00157 1.85354 0.25814 0.21528 0.30251 0.28702 0.27929 0.27857
builtin-popcnt 0.29480 0.26532 0.25058 0.24321 0.38956 0.33049 0.30460 0.29243
builtin-popcnt32 0.46953 0.44765 0.62961 0.50848 0.79063 0.75806 0.72629 0.72995
builtin-popcnt-unrolled 0.26532 0.17688 0.14003 0.12163 0.19162 0.19015 0.21131 0.19759
builtin-popcnt-unrolled32 0.44381 0.37015 0.33226 0.31322 0.48938 0.48954 0.46232 0.44888
builtin-popcnt-unrolled-errata 0.29480 0.20636 0.16214 0.14003 0.20858 0.19923 0.20037 0.19425
builtin-popcnt-unrolled-errata-manual 0.32775 0.23250 0.17305 0.14913 0.22391 0.22265 0.21923 0.20444
builtin-popcnt-movdq 0.20636 0.17988 0.19162 0.18425 0.29000 0.31599 0.30364 0.29296
builtin-popcnt-movdq-unrolled 0.32428 0.25638 0.19162 0.17164 0.25556 0.24600 0.24859 0.23759
builtin-popcnt-movdq-unrolled_manual 0.36384 0.26613 0.20974 0.18593 0.27761 0.27070 0.25343 0.24468

Input size 32B

procedure time [s] relative time (less is better)
lookup-8 1.20459 ███████████████████████
lookup-64 1.17685 ██████████████████████▌
bit-parallel 1.32661 █████████████████████████▍
bit-parallel-optimized 1.03180 ███████████████████▋
bit-parallel-mul 0.85492 ████████████████▎
harley-seal 1.03180 ███████████████████▋
sse-bit-parallel 1.79140 ██████████████████████████████████▎
sse-bit-parallel-original 1.41642 ███████████████████████████
sse-bit-parallel-better 2.61391 ██████████████████████████████████████████████████
sse-harley-seal 1.12661 █████████████████████▌
sse-lookup 0.53064 ██████████▏
sse-lookup-original 1.04292 ███████████████████▉
avx2-lookup 0.50116 █████████▌
avx2-lookup-original 1.78312 ██████████████████████████████████
avx2-harley-seal 1.04442 ███████████████████▉
cpu 0.29480 █████▋
sse-cpu 2.18153 █████████████████████████████████████████▋
avx2-cpu 2.00157 ██████████████████████████████████████▎
builtin-popcnt 0.29480 █████▋
builtin-popcnt32 0.46953 ████████▉
builtin-popcnt-unrolled 0.26532 █████
builtin-popcnt-unrolled32 0.44381 ████████▍
builtin-popcnt-unrolled-errata 0.29480 █████▋
builtin-popcnt-unrolled-errata-manual 0.32775 ██████▎
builtin-popcnt-movdq 0.20636 ███▉
builtin-popcnt-movdq-unrolled 0.32428 ██████▏
builtin-popcnt-movdq-unrolled_manual 0.36384 ██████▉
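
The bars in these charts appear to be scaled so that the slowest procedure for a given input size spans 50 full blocks, with partial blocks drawn in eighths. That scaling is inferred from the data shown here, not stated by the report, and the helper name below is hypothetical. A minimal sketch of the inferred rule:

```cpp
#include <cmath>

// Hypothetical helper: bar length in eighths of a character cell,
// assuming the slowest entry maps to `max_blocks` full blocks and
// partial blocks are truncated (not rounded) to the nearest eighth.
int bar_eighths(double seconds, double slowest_seconds, int max_blocks = 50) {
    return static_cast<int>(std::floor(seconds / slowest_seconds * max_blocks * 8));
}
```

For the 32 B chart the slowest entry is sse-bit-parallel-better at 2.61391 s. With that maximum, lookup-64 (1.17685 s) yields 180 eighths, i.e. 22 full blocks plus a half block, which agrees with its bar above.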

Input size 64B

procedure time [s] relative time (less is better)
lookup-8 1.10942 ███████████████████████████▎
lookup-64 1.09910 ███████████████████████████
bit-parallel 1.12067 ███████████████████████████▌
bit-parallel-optimized 0.82544 ████████████████████▎
bit-parallel-mul 0.72226 █████████████████▊
harley-seal 0.81070 ███████████████████▉
sse-bit-parallel 1.69979 █████████████████████████████████████████▊
sse-bit-parallel-original 0.82879 ████████████████████▍
sse-bit-parallel-better 2.03126 ██████████████████████████████████████████████████
sse-harley-seal 0.71832 █████████████████▋
sse-lookup 0.33902 ████████▎
sse-lookup-original 0.63382 ███████████████▌
avx2-lookup 0.30954 ███████▌
avx2-lookup-original 0.93599 ███████████████████████
avx2-harley-seal 0.61678 ███████████████▏
cpu 0.22110 █████▍
sse-cpu 0.26444 ██████▌
avx2-cpu 1.85354 █████████████████████████████████████████████▋
builtin-popcnt 0.26532 ██████▌
builtin-popcnt32 0.44765 ███████████
builtin-popcnt-unrolled 0.17688 ████▎
builtin-popcnt-unrolled32 0.37015 █████████
builtin-popcnt-unrolled-errata 0.20636 █████
builtin-popcnt-unrolled-errata-manual 0.23250 █████▋
builtin-popcnt-movdq 0.17988 ████▍
builtin-popcnt-movdq-unrolled 0.25638 ██████▎
builtin-popcnt-movdq-unrolled_manual 0.26613 ██████▌

Input size 128B

procedure time [s] relative time (less is better)
lookup-8 1.06966 █████████████████████████████████████████████████
lookup-64 1.06269 ████████████████████████████████████████████████▋
bit-parallel 1.05220 ████████████████████████████████████████████████▏
bit-parallel-optimized 0.73700 █████████████████████████████████▊
bit-parallel-mul 0.65594 ██████████████████████████████
harley-seal 0.50116 ██████████████████████▉
sse-bit-parallel 0.95726 ███████████████████████████████████████████▊
sse-bit-parallel-original 0.56012 █████████████████████████▋
sse-bit-parallel-better 1.09129 ██████████████████████████████████████████████████
sse-harley-seal 0.50877 ███████████████████████▎
sse-lookup 0.21373 █████████▊
sse-lookup-original 0.41272 ██████████████████▉
avx2-lookup 0.20636 █████████▍
avx2-lookup-original 0.50484 ███████████████████████▏
avx2-harley-seal 0.38303 █████████████████▌
cpu 0.16214 ███████▍
sse-cpu 0.20636 █████████▍
avx2-cpu 0.25814 ███████████▊
builtin-popcnt 0.25058 ███████████▍
builtin-popcnt32 0.62961 ████████████████████████████▊
builtin-popcnt-unrolled 0.14003 ██████▍
builtin-popcnt-unrolled32 0.33226 ███████████████▏
builtin-popcnt-unrolled-errata 0.16214 ███████▍
builtin-popcnt-unrolled-errata-manual 0.17305 ███████▉
builtin-popcnt-movdq 0.19162 ████████▊
builtin-popcnt-movdq-unrolled 0.19162 ████████▊
builtin-popcnt-movdq-unrolled_manual 0.20974 █████████▌

Input size 256B

procedure time [s] relative time (less is better)
lookup-8 1.11342 ██████████████████████████████████████████████████
lookup-64 1.08992 ████████████████████████████████████████████████▉
bit-parallel 1.02585 ██████████████████████████████████████████████
bit-parallel-optimized 0.69278 ███████████████████████████████
bit-parallel-mul 0.62647 ████████████████████████████▏
harley-seal 0.39429 █████████████████▋
sse-bit-parallel 0.63293 ████████████████████████████▍
sse-bit-parallel-original 0.42746 ███████████████████▏
sse-bit-parallel-better 0.66172 █████████████████████████████▋
sse-harley-seal 0.22741 ██████████▏
sse-lookup 0.18056 ████████
sse-lookup-original 0.30954 █████████████▉
avx2-lookup 0.13575 ██████
avx2-lookup-original 0.36600 ████████████████▍
avx2-harley-seal 0.26795 ████████████
cpu 0.14003 ██████▎
sse-cpu 0.17741 ███████▉
avx2-cpu 0.21528 █████████▋
builtin-popcnt 0.24321 ██████████▉
builtin-popcnt32 0.50848 ██████████████████████▊
builtin-popcnt-unrolled 0.12163 █████▍
builtin-popcnt-unrolled32 0.31322 ██████████████
builtin-popcnt-unrolled-errata 0.14003 ██████▎
builtin-popcnt-unrolled-errata-manual 0.14913 ██████▋
builtin-popcnt-movdq 0.18425 ████████▎
builtin-popcnt-movdq-unrolled 0.17164 ███████▋
builtin-popcnt-movdq-unrolled_manual 0.18593 ████████▎

Input size 512B

procedure time [s] relative time (less is better)
lookup-8 1.69944 ██████████████████████████████████████████████████
lookup-64 1.67641 █████████████████████████████████████████████████▎
bit-parallel 1.62042 ███████████████████████████████████████████████▋
bit-parallel-optimized 1.07308 ███████████████████████████████▌
bit-parallel-mul 1.06513 ███████████████████████████████▎
harley-seal 0.54653 ████████████████
sse-bit-parallel 0.73317 █████████████████████▌
sse-bit-parallel-original 0.58246 █████████████████▏
sse-bit-parallel-better 0.71430 █████████████████████
sse-harley-seal 0.28450 ████████▎
sse-lookup 0.26689 ███████▊
sse-lookup-original 0.41567 ████████████▏
avx2-lookup 0.17204 █████
avx2-lookup-original 0.31277 █████████▏
avx2-harley-seal 0.19295 █████▋
cpu 0.20636 ██████
sse-cpu 0.26346 ███████▊
avx2-cpu 0.30251 ████████▉
builtin-popcnt 0.38956 ███████████▍
builtin-popcnt32 0.79063 ███████████████████████▎
builtin-popcnt-unrolled 0.19162 █████▋
builtin-popcnt-unrolled32 0.48938 ██████████████▍
builtin-popcnt-unrolled-errata 0.20858 ██████▏
builtin-popcnt-unrolled-errata-manual 0.22391 ██████▌
builtin-popcnt-movdq 0.29000 ████████▌
builtin-popcnt-movdq-unrolled 0.25556 ███████▌
builtin-popcnt-movdq-unrolled_manual 0.27761 ████████▏

Input size 1024B

procedure time [s] relative time (less is better)
lookup-8 1.66395 ██████████████████████████████████████████████████
lookup-64 1.63699 █████████████████████████████████████████████████▏
bit-parallel 1.60970 ████████████████████████████████████████████████▎
bit-parallel-optimized 1.05540 ███████████████████████████████▋
bit-parallel-mul 1.02557 ██████████████████████████████▊
harley-seal 0.50303 ███████████████
sse-bit-parallel 0.60194 ██████████████████
sse-bit-parallel-original 0.57800 █████████████████▎
sse-bit-parallel-better 0.56356 ████████████████▉
sse-harley-seal 0.24347 ███████▎
sse-lookup 0.25495 ███████▋
sse-lookup-original 0.41024 ████████████▎
avx2-lookup 0.14997 ████▌
avx2-lookup-original 0.24320 ███████▎
avx2-harley-seal 0.14620 ████▍
cpu 0.19751 █████▉
sse-cpu 0.25142 ███████▌
avx2-cpu 0.28702 ████████▌
builtin-popcnt 0.33049 █████████▉
builtin-popcnt32 0.75806 ██████████████████████▊
builtin-popcnt-unrolled 0.19015 █████▋
builtin-popcnt-unrolled32 0.48954 ██████████████▋
builtin-popcnt-unrolled-errata 0.19923 █████▉
builtin-popcnt-unrolled-errata-manual 0.22265 ██████▋
builtin-popcnt-movdq 0.31599 █████████▍
builtin-popcnt-movdq-unrolled 0.24600 ███████▍
builtin-popcnt-movdq-unrolled_manual 0.27070 ████████▏

Input size 2048B

procedure time [s] relative time (less is better)
lookup-8 1.64353 ██████████████████████████████████████████████████
lookup-64 1.61699 █████████████████████████████████████████████████▏
bit-parallel 1.60452 ████████████████████████████████████████████████▊
bit-parallel-optimized 1.04655 ███████████████████████████████▊
bit-parallel-mul 1.00872 ██████████████████████████████▋
harley-seal 0.48200 ██████████████▋
sse-bit-parallel 0.51672 ███████████████▋
sse-bit-parallel-original 0.54149 ████████████████▍
sse-bit-parallel-better 0.49055 ██████████████▉
sse-harley-seal 0.22265 ██████▊
sse-lookup 0.24996 ███████▌
sse-lookup-original 0.37114 ███████████▎
avx2-lookup 0.13613 ████▏
avx2-lookup-original 0.20630 ██████▎
avx2-harley-seal 0.12328 ███▊
cpu 0.20783 ██████▎
sse-cpu 0.26614 ████████
avx2-cpu 0.27929 ████████▍
builtin-popcnt 0.30460 █████████▎
builtin-popcnt32 0.72629 ██████████████████████
builtin-popcnt-unrolled 0.21131 ██████▍
builtin-popcnt-unrolled32 0.46232 ██████████████
builtin-popcnt-unrolled-errata 0.20037 ██████
builtin-popcnt-unrolled-errata-manual 0.21923 ██████▋
builtin-popcnt-movdq 0.30364 █████████▏
builtin-popcnt-movdq-unrolled 0.24859 ███████▌
builtin-popcnt-movdq-unrolled_manual 0.25343 ███████▋

Input size 4096B

procedure time [s] relative time (less is better)
lookup-8 1.63281 ██████████████████████████████████████████████████
lookup-64 1.60908 █████████████████████████████████████████████████▎
bit-parallel 1.60648 █████████████████████████████████████████████████▏
bit-parallel-optimized 1.05100 ████████████████████████████████▏
bit-parallel-mul 0.99841 ██████████████████████████████▌
harley-seal 0.47094 ██████████████▍
sse-bit-parallel 0.49033 ███████████████
sse-bit-parallel-original 0.53427 ████████████████▎
sse-bit-parallel-better 0.45673 █████████████▉
sse-harley-seal 0.21202 ██████▍
sse-lookup 0.24830 ███████▌
sse-lookup-original 0.35682 ██████████▉
avx2-lookup 0.13426 ████
avx2-lookup-original 0.19625 ██████
avx2-harley-seal 0.11145 ███▍
cpu 0.19857 ██████
sse-cpu 0.25413 ███████▊
avx2-cpu 0.27857 ████████▌
builtin-popcnt 0.29243 ████████▉
builtin-popcnt32 0.72995 ██████████████████████▎
builtin-popcnt-unrolled 0.19759 ██████
builtin-popcnt-unrolled32 0.44888 █████████████▋
builtin-popcnt-unrolled-errata 0.19425 █████▉
builtin-popcnt-unrolled-errata-manual 0.20444 ██████▎
builtin-popcnt-movdq 0.29296 ████████▉
builtin-popcnt-movdq-unrolled 0.23759 ███████▎
builtin-popcnt-movdq-unrolled_manual 0.24468 ███████▍

Speedup

procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
lookup-64 1.02 1.01 1.01 1.02 1.01 1.02 1.02 1.01
bit-parallel 0.91 0.99 1.02 1.09 1.05 1.03 1.02 1.02
bit-parallel-optimized 1.17 1.34 1.45 1.61 1.58 1.58 1.57 1.55
bit-parallel-mul 1.41 1.54 1.63 1.78 1.60 1.62 1.63 1.64
harley-seal 1.17 1.37 2.13 2.82 3.11 3.31 3.41 3.47
sse-bit-parallel 0.67 0.65 1.12 1.76 2.32 2.76 3.18 3.33
sse-bit-parallel-original 0.85 1.34 1.91 2.60 2.92 2.88 3.04 3.06
sse-bit-parallel-better 0.46 0.55 0.98 1.68 2.38 2.95 3.35 3.58
sse-harley-seal 1.07 1.54 2.10 4.90 5.97 6.83 7.38 7.70
sse-lookup 2.27 3.27 5.00 6.17 6.37 6.53 6.58 6.58
sse-lookup-original 1.16 1.75 2.59 3.60 4.09 4.06 4.43 4.58
avx2-lookup 2.40 3.58 5.18 8.20 9.88 11.10 12.07 12.16
avx2-lookup-original 0.68 1.19 2.12 3.04 5.43 6.84 7.97 8.32
avx2-harley-seal 1.15 1.80 2.79 4.16 8.81 11.38 13.33 14.65
cpu 4.09 5.02 6.60 7.95 8.24 8.42 7.91 8.22
sse-cpu 0.55 4.20 5.18 6.28 6.45 6.62 6.18 6.43
avx2-cpu 0.60 0.60 4.14 5.17 5.62 5.80 5.88 5.86
builtin-popcnt 4.09 4.18 4.27 4.58 4.36 5.03 5.40 5.58
builtin-popcnt32 2.57 2.48 1.70 2.19 2.15 2.20 2.26 2.24
builtin-popcnt-unrolled 4.54 6.27 7.64 9.15 8.87 8.75 7.78 8.26
builtin-popcnt-unrolled32 2.71 3.00 3.22 3.55 3.47 3.40 3.55 3.64
builtin-popcnt-unrolled-errata 4.09 5.38 6.60 7.95 8.15 8.35 8.20 8.41
builtin-popcnt-unrolled-errata-manual 3.68 4.77 6.18 7.47 7.59 7.47 7.50 7.99
builtin-popcnt-movdq 5.84 6.17 5.58 6.04 5.86 5.27 5.41 5.57
builtin-popcnt-movdq-unrolled 3.71 4.33 5.58 6.49 6.65 6.76 6.61 6.87
builtin-popcnt-movdq-unrolled_manual 3.31 4.17 5.10 5.99 6.12 6.15 6.49 6.67
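
The speedup values are consistent with dividing the lookup-8 running time by the running time of each procedure at the same input size, rounded to two decimals. The report does not spell this out, so treat the following as a sketch of the apparent calculation with hypothetical function names:

```cpp
#include <cmath>

// Speedup relative to a baseline: values above 1.0 mean faster than the baseline.
double speedup(double baseline_seconds, double procedure_seconds) {
    return baseline_seconds / procedure_seconds;
}

// Round to two decimal places, as the table appears to do.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For example, at 32 B lookup-8 takes 1.20459 s and avx2-lookup takes 0.50116 s, giving 1.20459 / 0.50116 ≈ 2.40, the value listed for avx2-lookup in the first column.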

CSV file

Download haswell-i7-4770-gcc5.3.0-avx2.csv