Population count comparison for Core i5 M540 @ 2.53GHz

Generated on: 2016-03-26

Specification

CPU: Core i5 M540 @ 2.53GHz

Compiler: GCC 4.9.2 (Debian 4.9.2-10)

Instruction set: SSE

Number of runs: 5

All times are given in seconds.

Procedures

procedure description
lookup-8 lookup in std::uint8_t[256] LUT
lookup-64 lookup in std::uint64_t[256] LUT
bit-parallel naive bit-parallel method
bit-parallel-optimized slightly optimized bit-parallel method
bit-parallel-mul bit-parallel with fewer instructions
harley-seal Harley-Seal popcount (4th iteration)
sse-bit-parallel SSE implementation of bit-parallel-optimized (unrolled)
sse-bit-parallel-original SSE implementation of bit-parallel-optimized
sse-bit-parallel-better SSE implementation of bit-parallel with fewer instructions
sse-harley-seal SSE implementation of Harley-Seal
sse-lookup SSSE3 variant using pshufb instruction (unrolled)
sse-lookup-original SSSE3 variant using pshufb instruction
cpu CPU instruction popcnt (64-bit variant)
sse-cpu load data with SSE, then count bits using popcnt
builtin-popcnt builtin for popcnt
builtin-popcnt32 builtin for popcnt (32-bit variant)
builtin-popcnt-unrolled unrolled builtin-popcnt
builtin-popcnt-unrolled32 unrolled builtin-popcnt32
builtin-popcnt-unrolled-errata unrolled builtin-popcnt avoiding false-dependency
builtin-popcnt-unrolled-errata-manual unrolled builtin-popcnt avoiding false-dependency (assembly code)
builtin-popcnt-movdq builtin-popcnt where data is loaded via SSE registers
builtin-popcnt-movdq-unrolled builtin-popcnt-movdq unrolled
builtin-popcnt-movdq-unrolled_manual builtin-popcnt-movdq unrolled (assembly code)
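
To make the descriptions above more concrete, here is a minimal sketch of three of the scalar approaches: a byte lookup table (in the spirit of lookup-8), a bit-parallel count that finishes with a multiplication (in the spirit of bit-parallel-mul), and the compiler builtin (in the spirit of builtin-popcnt). The function names and the `ByteLut` helper are hypothetical and are not taken from the benchmarked sources, whose loops, unrolling and table setup may differ. `__builtin_popcountll` is a GCC/Clang builtin; it is typically compiled to the popcnt instruction only when the target allows it (for example with `-mpopcnt`).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// lookup-8 style: one table lookup per input byte.
// Assumption: the table is filled on first use; the real code may hard-code it.
struct ByteLut {
    std::uint8_t bits[256];
    ByteLut() {
        bits[0] = 0;
        for (int i = 1; i < 256; ++i)
            bits[i] = static_cast<std::uint8_t>((i & 1) + bits[i >> 1]);
    }
};

std::uint64_t popcount_lookup8(const std::uint8_t* data, std::size_t n) {
    static const ByteLut lut;
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += lut.bits[data[i]];
    return total;
}

// Bit-parallel count of one 64-bit word: sum adjacent 1-bit, 2-bit and 4-bit
// fields, then add up the eight byte counters with a single multiplication.
std::uint64_t popcount_word_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

std::uint64_t popcount_bit_parallel_mul(const std::uint8_t* data, std::size_t n) {
    std::uint64_t total = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, data + i, sizeof w);  // safe unaligned load
        total += popcount_word_mul(w);
    }
    for (; i < n; ++i)                        // leftover bytes
        total += popcount_word_mul(data[i]);
    return total;
}

// builtin style: let the compiler choose the implementation (GCC/Clang only).
std::uint64_t popcount_builtin(const std::uint8_t* data, std::size_t n) {
    std::uint64_t total = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, data + i, sizeof w);
        total += static_cast<std::uint64_t>(__builtin_popcountll(w));
    }
    for (; i < n; ++i)                        // leftover bytes
        total += static_cast<std::uint64_t>(__builtin_popcount(data[i]));
    return total;
}
```

All three functions take a byte buffer and its length and should return the same count for the same input, which makes them easy to cross-check against each other before timing them.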

Running time

procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 2.30682 2.19949 2.15913 2.12274 3.40312 3.37968 3.36830 3.36327
lookup-64 2.30520 2.19460 2.15665 2.12098 3.39359 3.37388 3.36426 3.35894
bit-parallel 2.13734 1.99894 1.94310 1.90146 3.00652 2.98865 2.98472 2.97496
bit-parallel-optimized 1.37613 1.23683 1.16308 1.13311 1.79759 1.76732 1.75583 1.75727
bit-parallel-mul 1.27090 1.18854 1.14315 1.12855 1.77835 1.78635 1.76195 1.74768
harley-seal 1.48025 1.29651 0.79610 0.63324 0.90016 0.85752 0.83508 0.82458
sse-bit-parallel 2.69392 2.39464 1.41028 0.95430 1.16728 0.99899 0.91217 0.86662
sse-bit-parallel-original 2.36376 1.44147 1.03644 0.84391 1.17614 1.19961 1.14922 1.10145
sse-bit-parallel-better 2.72236 2.40755 1.36711 0.91569 1.10007 0.92206 0.83866 0.78519
sse-harley-seal 1.97896 1.27526 0.90073 0.43162 0.56059 0.47240 0.43234 0.41591
sse-lookup 0.75539 0.54069 0.35012 0.31090 0.46971 0.45597 0.44554 0.44298
sse-lookup-original 1.80431 1.12350 0.77243 0.59811 0.82280 0.89240 0.81698 0.75707
cpu 0.49250 0.37698 0.32133 0.29132 0.44240 0.43117 0.42524 0.34746
sse-cpu 2.46184 0.45881 0.37843 0.33627 0.51150 0.48877 0.48091 0.47667
builtin-popcnt 0.49263 0.44286 0.46112 0.42607 0.65241 0.66710 0.64707 0.63798
builtin-popcnt32 0.85318 0.81902 0.80633 0.79473 1.32992 1.30050 1.27972 1.26759
builtin-popcnt-unrolled 0.49270 0.37709 0.32117 0.29111 0.44242 0.43123 0.42522 0.28776
builtin-popcnt-unrolled32 0.91986 0.78649 0.72367 0.68804 1.07508 1.06164 0.74972 0.71987
builtin-popcnt-unrolled-errata 0.59095 0.44246 0.37058 0.33233 0.50194 0.48662 0.47922 0.28935
builtin-popcnt-unrolled-errata-manual 0.62275 0.45880 0.37864 0.33610 0.50461 0.48852 0.48039 0.29710
builtin-popcnt-movdq 0.39942 0.37706 0.35306 0.34013 0.53455 0.52919 0.45700 0.44145
builtin-popcnt-movdq-unrolled 0.52712 0.39337 0.33021 0.29545 0.44566 0.43309 0.42648 0.38394
builtin-popcnt-movdq-unrolled_manual 0.60784 0.46728 0.39911 0.36264 0.55238 0.53818 0.53127 0.39940

Input size 32B

procedure time [s] relative time (less is better)
lookup-8 2.30682 ██████████████████████████████████████████▎
lookup-64 2.30520 ██████████████████████████████████████████▎
bit-parallel 2.13734 ███████████████████████████████████████▎
bit-parallel-optimized 1.37613 █████████████████████████▎
bit-parallel-mul 1.27090 ███████████████████████▎
harley-seal 1.48025 ███████████████████████████▏
sse-bit-parallel 2.69392 █████████████████████████████████████████████████▍
sse-bit-parallel-original 2.36376 ███████████████████████████████████████████▍
sse-bit-parallel-better 2.72236 ██████████████████████████████████████████████████
sse-harley-seal 1.97896 ████████████████████████████████████▎
sse-lookup 0.75539 █████████████▊
sse-lookup-original 1.80431 █████████████████████████████████▏
cpu 0.49250 █████████
sse-cpu 2.46184 █████████████████████████████████████████████▏
builtin-popcnt 0.49263 █████████
builtin-popcnt32 0.85318 ███████████████▋
builtin-popcnt-unrolled 0.49270 █████████
builtin-popcnt-unrolled32 0.91986 ████████████████▉
builtin-popcnt-unrolled-errata 0.59095 ██████████▊
builtin-popcnt-unrolled-errata-manual 0.62275 ███████████▍
builtin-popcnt-movdq 0.39942 ███████▎
builtin-popcnt-movdq-unrolled 0.52712 █████████▋
builtin-popcnt-movdq-unrolled_manual 0.60784 ███████████▏

Input size 64B

procedure time [s] relative time (less is better)
lookup-8 2.19949 █████████████████████████████████████████████▋
lookup-64 2.19460 █████████████████████████████████████████████▌
bit-parallel 1.99894 █████████████████████████████████████████▌
bit-parallel-optimized 1.23683 █████████████████████████▋
bit-parallel-mul 1.18854 ████████████████████████▋
harley-seal 1.29651 ██████████████████████████▉
sse-bit-parallel 2.39464 █████████████████████████████████████████████████▋
sse-bit-parallel-original 1.44147 █████████████████████████████▉
sse-bit-parallel-better 2.40755 ██████████████████████████████████████████████████
sse-harley-seal 1.27526 ██████████████████████████▍
sse-lookup 0.54069 ███████████▏
sse-lookup-original 1.12350 ███████████████████████▎
cpu 0.37698 ███████▊
sse-cpu 0.45881 █████████▌
builtin-popcnt 0.44286 █████████▏
builtin-popcnt32 0.81902 █████████████████
builtin-popcnt-unrolled 0.37709 ███████▊
builtin-popcnt-unrolled32 0.78649 ████████████████▎
builtin-popcnt-unrolled-errata 0.44246 █████████▏
builtin-popcnt-unrolled-errata-manual 0.45880 █████████▌
builtin-popcnt-movdq 0.37706 ███████▊
builtin-popcnt-movdq-unrolled 0.39337 ████████▏
builtin-popcnt-movdq-unrolled_manual 0.46728 █████████▋

Input size 128B

procedure time [s] relative time (less is better)
lookup-8 2.15913 ██████████████████████████████████████████████████
lookup-64 2.15665 █████████████████████████████████████████████████▉
bit-parallel 1.94310 ████████████████████████████████████████████▉
bit-parallel-optimized 1.16308 ██████████████████████████▉
bit-parallel-mul 1.14315 ██████████████████████████▍
harley-seal 0.79610 ██████████████████▍
sse-bit-parallel 1.41028 ████████████████████████████████▋
sse-bit-parallel-original 1.03644 ████████████████████████
sse-bit-parallel-better 1.36711 ███████████████████████████████▋
sse-harley-seal 0.90073 ████████████████████▊
sse-lookup 0.35012 ████████
sse-lookup-original 0.77243 █████████████████▉
cpu 0.32133 ███████▍
sse-cpu 0.37843 ████████▊
builtin-popcnt 0.46112 ██████████▋
builtin-popcnt32 0.80633 ██████████████████▋
builtin-popcnt-unrolled 0.32117 ███████▍
builtin-popcnt-unrolled32 0.72367 ████████████████▊
builtin-popcnt-unrolled-errata 0.37058 ████████▌
builtin-popcnt-unrolled-errata-manual 0.37864 ████████▊
builtin-popcnt-movdq 0.35306 ████████▏
builtin-popcnt-movdq-unrolled 0.33021 ███████▋
builtin-popcnt-movdq-unrolled_manual 0.39911 █████████▏

Input size 256B

procedure time [s] relative time (less is better)
lookup-8 2.12274 ██████████████████████████████████████████████████
lookup-64 2.12098 █████████████████████████████████████████████████▉
bit-parallel 1.90146 ████████████████████████████████████████████▊
bit-parallel-optimized 1.13311 ██████████████████████████▋
bit-parallel-mul 1.12855 ██████████████████████████▌
harley-seal 0.63324 ██████████████▉
sse-bit-parallel 0.95430 ██████████████████████▍
sse-bit-parallel-original 0.84391 ███████████████████▉
sse-bit-parallel-better 0.91569 █████████████████████▌
sse-harley-seal 0.43162 ██████████▏
sse-lookup 0.31090 ███████▎
sse-lookup-original 0.59811 ██████████████
cpu 0.29132 ██████▊
sse-cpu 0.33627 ███████▉
builtin-popcnt 0.42607 ██████████
builtin-popcnt32 0.79473 ██████████████████▋
builtin-popcnt-unrolled 0.29111 ██████▊
builtin-popcnt-unrolled32 0.68804 ████████████████▏
builtin-popcnt-unrolled-errata 0.33233 ███████▊
builtin-popcnt-unrolled-errata-manual 0.33610 ███████▉
builtin-popcnt-movdq 0.34013 ████████
builtin-popcnt-movdq-unrolled 0.29545 ██████▉
builtin-popcnt-movdq-unrolled_manual 0.36264 ████████▌

Input size 512B

procedure time [s] relative time (less is better)
lookup-8 3.40312 ██████████████████████████████████████████████████
lookup-64 3.39359 █████████████████████████████████████████████████▊
bit-parallel 3.00652 ████████████████████████████████████████████▏
bit-parallel-optimized 1.79759 ██████████████████████████▍
bit-parallel-mul 1.77835 ██████████████████████████▏
harley-seal 0.90016 █████████████▏
sse-bit-parallel 1.16728 █████████████████▏
sse-bit-parallel-original 1.17614 █████████████████▎
sse-bit-parallel-better 1.10007 ████████████████▏
sse-harley-seal 0.56059 ████████▏
sse-lookup 0.46971 ██████▉
sse-lookup-original 0.82280 ████████████
cpu 0.44240 ██████▍
sse-cpu 0.51150 ███████▌
builtin-popcnt 0.65241 █████████▌
builtin-popcnt32 1.32992 ███████████████████▌
builtin-popcnt-unrolled 0.44242 ██████▌
builtin-popcnt-unrolled32 1.07508 ███████████████▊
builtin-popcnt-unrolled-errata 0.50194 ███████▎
builtin-popcnt-unrolled-errata-manual 0.50461 ███████▍
builtin-popcnt-movdq 0.53455 ███████▊
builtin-popcnt-movdq-unrolled 0.44566 ██████▌
builtin-popcnt-movdq-unrolled_manual 0.55238 ████████

Input size 1024B

procedure time [s] relative time (less is better)
lookup-8 3.37968 ██████████████████████████████████████████████████
lookup-64 3.37388 █████████████████████████████████████████████████▉
bit-parallel 2.98865 ████████████████████████████████████████████▏
bit-parallel-optimized 1.76732 ██████████████████████████▏
bit-parallel-mul 1.78635 ██████████████████████████▍
harley-seal 0.85752 ████████████▋
sse-bit-parallel 0.99899 ██████████████▊
sse-bit-parallel-original 1.19961 █████████████████▋
sse-bit-parallel-better 0.92206 █████████████▋
sse-harley-seal 0.47240 ██████▉
sse-lookup 0.45597 ██████▋
sse-lookup-original 0.89240 █████████████▏
cpu 0.43117 ██████▍
sse-cpu 0.48877 ███████▏
builtin-popcnt 0.66710 █████████▊
builtin-popcnt32 1.30050 ███████████████████▏
builtin-popcnt-unrolled 0.43123 ██████▍
builtin-popcnt-unrolled32 1.06164 ███████████████▋
builtin-popcnt-unrolled-errata 0.48662 ███████▏
builtin-popcnt-unrolled-errata-manual 0.48852 ███████▏
builtin-popcnt-movdq 0.52919 ███████▊
builtin-popcnt-movdq-unrolled 0.43309 ██████▍
builtin-popcnt-movdq-unrolled_manual 0.53818 ███████▉

Input size 2048B

procedure time [s] relative time (less is better)
lookup-8 3.36830 ██████████████████████████████████████████████████
lookup-64 3.36426 █████████████████████████████████████████████████▉
bit-parallel 2.98472 ████████████████████████████████████████████▎
bit-parallel-optimized 1.75583 ██████████████████████████
bit-parallel-mul 1.76195 ██████████████████████████▏
harley-seal 0.83508 ████████████▍
sse-bit-parallel 0.91217 █████████████▌
sse-bit-parallel-original 1.14922 █████████████████
sse-bit-parallel-better 0.83866 ████████████▍
sse-harley-seal 0.43234 ██████▍
sse-lookup 0.44554 ██████▌
sse-lookup-original 0.81698 ████████████▏
cpu 0.42524 ██████▎
sse-cpu 0.48091 ███████▏
builtin-popcnt 0.64707 █████████▌
builtin-popcnt32 1.27972 ██████████████████▉
builtin-popcnt-unrolled 0.42522 ██████▎
builtin-popcnt-unrolled32 0.74972 ███████████▏
builtin-popcnt-unrolled-errata 0.47922 ███████
builtin-popcnt-unrolled-errata-manual 0.48039 ███████▏
builtin-popcnt-movdq 0.45700 ██████▊
builtin-popcnt-movdq-unrolled 0.42648 ██████▎
builtin-popcnt-movdq-unrolled_manual 0.53127 ███████▉

Input size 4096B

procedure time [s] relative time (less is better)
lookup-8 3.36327 ██████████████████████████████████████████████████
lookup-64 3.35894 █████████████████████████████████████████████████▉
bit-parallel 2.97496 ████████████████████████████████████████████▏
bit-parallel-optimized 1.75727 ██████████████████████████
bit-parallel-mul 1.74768 █████████████████████████▉
harley-seal 0.82458 ████████████▎
sse-bit-parallel 0.86662 ████████████▉
sse-bit-parallel-original 1.10145 ████████████████▎
sse-bit-parallel-better 0.78519 ███████████▋
sse-harley-seal 0.41591 ██████▏
sse-lookup 0.44298 ██████▌
sse-lookup-original 0.75707 ███████████▎
cpu 0.34746 █████▏
sse-cpu 0.47667 ███████
builtin-popcnt 0.63798 █████████▍
builtin-popcnt32 1.26759 ██████████████████▊
builtin-popcnt-unrolled 0.28776 ████▎
builtin-popcnt-unrolled32 0.71987 ██████████▋
builtin-popcnt-unrolled-errata 0.28935 ████▎
builtin-popcnt-unrolled-errata-manual 0.29710 ████▍
builtin-popcnt-movdq 0.44145 ██████▌
builtin-popcnt-movdq-unrolled 0.38394 █████▋
builtin-popcnt-movdq-unrolled_manual 0.39940 █████▉

Speedup (relative to lookup-8)

procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
lookup-64 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
bit-parallel 1.08 1.10 1.11 1.12 1.13 1.13 1.13 1.13
bit-parallel-optimized 1.68 1.78 1.86 1.87 1.89 1.91 1.92 1.91
bit-parallel-mul 1.82 1.85 1.89 1.88 1.91 1.89 1.91 1.92
harley-seal 1.56 1.70 2.71 3.35 3.78 3.94 4.03 4.08
sse-bit-parallel 0.86 0.92 1.53 2.22 2.92 3.38 3.69 3.88
sse-bit-parallel-original 0.98 1.53 2.08 2.52 2.89 2.82 2.93 3.05
sse-bit-parallel-better 0.85 0.91 1.58 2.32 3.09 3.67 4.02 4.28
sse-harley-seal 1.17 1.72 2.40 4.92 6.07 7.15 7.79 8.09
sse-lookup 3.05 4.07 6.17 6.83 7.25 7.41 7.56 7.59
sse-lookup-original 1.28 1.96 2.80 3.55 4.14 3.79 4.12 4.44
cpu 4.68 5.83 6.72 7.29 7.69 7.84 7.92 9.68
sse-cpu 0.94 4.79 5.71 6.31 6.65 6.91 7.00 7.06
builtin-popcnt 4.68 4.97 4.68 4.98 5.22 5.07 5.21 5.27
builtin-popcnt32 2.70 2.69 2.68 2.67 2.56 2.60 2.63 2.65
builtin-popcnt-unrolled 4.68 5.83 6.72 7.29 7.69 7.84 7.92 11.69
builtin-popcnt-unrolled32 2.51 2.80 2.98 3.09 3.17 3.18 4.49 4.67
builtin-popcnt-unrolled-errata 3.90 4.97 5.83 6.39 6.78 6.95 7.03 11.62
builtin-popcnt-unrolled-errata-manual 3.70 4.79 5.70 6.32 6.74 6.92 7.01 11.32
builtin-popcnt-movdq 5.78 5.83 6.12 6.24 6.37 6.39 7.37 7.62
builtin-popcnt-movdq-unrolled 4.38 5.59 6.54 7.18 7.64 7.80 7.90 8.76
builtin-popcnt-movdq-unrolled_manual 3.80 4.71 5.41 5.85 6.16 6.28 6.34 8.42
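
As a worked check of the table above, dividing the lookup-8 time by a procedure's time at the same input size reproduces the listed values. For example, at 32 B the cpu procedure gives 2.30682 / 0.49250 ≈ 4.68, and at 4096 B builtin-popcnt-unrolled gives 3.36327 / 0.28776 ≈ 11.69. The sketch below spells out that arithmetic; the helper names `speedup` and `round2` are hypothetical and are not part of the benchmark scripts.

```cpp
#include <cmath>

// Speedup of a procedure relative to a baseline: baseline time / procedure time.
double speedup(double baseline_seconds, double procedure_seconds) {
    return baseline_seconds / procedure_seconds;
}

// Round to two decimals, the precision used in the speedup table.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For instance, `round2(speedup(2.30682, 0.75539))` corresponds to the sse-lookup entry in the 32 B column.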

CSV file

Download westmere-m540-gcc4.9.2-sse.csv