Population count comparison for Skylake Core i7-6700 CPU @ 3.40GHz

Generated on: 2016-03-26

Specification

CPU: Skylake Core i7-6700 CPU @ 3.40GHz

Compiler: clang 3.8.0-svn257311-1~exp1 (Ubuntu)

Instruction set: AVX2

Number of runs: 5

All times are given in seconds.

Procedures

procedure description
lookup-8 lookup in std::uint8_t[256] LUT
lookup-64 lookup in std::uint64_t[256] LUT
bit-parallel naive bit-parallel method
bit-parallel-optimized slightly optimized bit-parallel method
bit-parallel-mul bit-parallel with fewer instructions
harley-seal Harley-Seal popcount (4th iteration)
sse-bit-parallel SSE implementation of bit-parallel-optimized (unrolled)
sse-bit-parallel-original SSE implementation of bit-parallel-optimized
sse-bit-parallel-better SSE implementation of bit-parallel with fewer instructions
sse-harley-seal SSE implementation of Harley-Seal
sse-lookup SSSE3 variant using pshufb instruction (unrolled)
sse-lookup-original SSSE3 variant using pshufb instruction
avx2-lookup AVX2 variant using pshufb instruction (unrolled)
avx2-lookup-original AVX2 variant using pshufb instruction
avx2-harley-seal AVX2 implementation of Harley-Seal
cpu CPU instruction popcnt (64-bit variant)
sse-cpu load data with SSE, then count bits using popcnt
avx2-cpu load data with AVX2, then count bits using popcnt
builtin-popcnt builtin for popcnt
builtin-popcnt32 builtin for popcnt (32-bit variant)
builtin-popcnt-unrolled unrolled builtin-popcnt
builtin-popcnt-unrolled32 unrolled builtin-popcnt32
builtin-popcnt-unrolled-errata unrolled builtin-popcnt avoiding false-dependency
builtin-popcnt-unrolled-errata-manual unrolled builtin-popcnt avoiding false-dependency (assembly code)
builtin-popcnt-movdq builtin-popcnt where data is loaded via SSE registers
builtin-popcnt-movdq-unrolled builtin-popcnt-movdq unrolled
builtin-popcnt-movdq-unrolled_manual builtin-popcnt-movdq unrolled (assembly code)
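
To make two of the scalar entries above more concrete, here is a minimal sketch of a byte-table lookup (in the spirit of lookup-8) and of a bit-parallel count that finishes with a multiplication (in the spirit of bit-parallel-mul). This is not the benchmarked source code: the function names (make_lut8, popcnt_lookup8, popcnt_word_mul, popcnt_bit_parallel_mul) are hypothetical, and the tail handling is an assumption made so the example works for any input length.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: builds a 256-entry table with the bit count of every byte value.
inline std::array<std::uint8_t, 256> make_lut8() {
    std::array<std::uint8_t, 256> lut{};  // lut[0] is already 0
    for (int i = 1; i < 256; ++i) {
        // bits(i) = lowest bit of i + bits(i / 2); entry i / 2 is filled in before entry i.
        lut[i] = static_cast<std::uint8_t>((i & 1) + lut[i / 2]);
    }
    return lut;
}

// Table lookup: one memory access per input byte.
inline std::uint64_t popcnt_lookup8(const std::uint8_t* data, std::size_t n) {
    static const std::array<std::uint8_t, 256> lut = make_lut8();
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += lut[data[i]];
    }
    return total;
}

// Bit-parallel count of one 64-bit word; the final multiplication sums the eight byte counters.
inline std::uint64_t popcnt_word_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                            // 2-bit counters
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);  // 4-bit counters
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit counters
    return (x * 0x0101010101010101ULL) >> 56;                              // sum lands in the top byte
}

// Applies popcnt_word_mul to a byte buffer, eight bytes at a time.
inline std::uint64_t popcnt_bit_parallel_mul(const std::uint8_t* data, std::size_t n) {
    std::uint64_t total = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, 8);  // memcpy avoids unaligned-access issues
        total += popcnt_word_mul(word);
    }
    if (i < n) {
        std::uint64_t word = 0;           // zero padding does not change the count
        std::memcpy(&word, data + i, n - i);
        total += popcnt_word_mul(word);
    }
    return total;
}
```

Both functions should agree on any buffer; for example, 35 bytes of 0xFF give 280 set bits either way.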

Running time

procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 1.05663 0.98891 1.15859 1.06665 1.63234 1.59694 1.57876 1.56954
lookup-64 1.03518 0.95493 1.06125 0.97176 1.47745 1.44230 1.42358 1.41498
bit-parallel 1.15605 1.07004 1.03226 1.01322 1.60619 1.63027 1.59463 1.60044
bit-parallel-optimized 0.82633 0.71118 0.65648 0.63280 0.99373 1.02343 0.97960 0.98773
bit-parallel-mul 0.67734 0.60370 0.58192 0.66570 0.98413 0.94520 0.92402 0.91287
harley-seal 1.32884 0.93001 0.45719 0.35052 0.47554 0.43295 0.41168 0.40105
sse-bit-parallel 1.11677 0.99346 0.64180 0.45853 0.58837 0.51578 0.47959 0.46139
sse-bit-parallel-original 0.85411 0.59378 0.44258 0.36574 0.55395 0.53442 0.52128 0.51927
sse-bit-parallel-better 1.11683 1.02102 0.62236 0.43211 0.54241 0.46772 0.43047 0.41185
sse-harley-seal 0.91623 0.59038 0.42968 0.21537 0.26776 0.22766 0.20777 0.19780
sse-lookup 0.37931 0.25738 0.17610 0.15084 0.23039 0.22113 0.21674 0.21457
sse-lookup-original 0.77295 0.51175 0.34690 0.25996 0.39734 0.33790 0.35048 0.36659
avx2-lookup 0.32512 0.20320 0.14225 0.09702 0.13477 0.12895 0.12574 0.12404
avx2-lookup-original 1.12234 0.69069 0.39373 0.23961 0.26304 0.23145 0.20544 0.19114
avx2-harley-seal 0.87193 0.50367 0.32202 0.22904 0.18195 0.13830 0.11638 0.10531
cpu 0.21674 0.17465 0.12403 0.13021 0.23828 0.32969 0.33816 0.34263
sse-cpu 1.14629 0.18965 0.16283 0.15636 0.25773 0.26459 0.27349 0.27067
avx2-cpu 1.11650 0.99273 0.16197 0.16020 0.26421 0.27079 0.27794 0.28788
builtin-popcnt 0.21333 0.18147 0.11514 0.09631 0.14393 0.13328 0.13867 0.13432
builtin-popcnt32 0.65022 0.46059 0.44703 0.44026 0.69899 0.70140 0.69760 0.69559
builtin-popcnt-unrolled 0.21674 0.21674 0.21674 0.21674 0.34678 0.35753 0.35204 0.34947
builtin-popcnt-unrolled32 0.43348 0.43348 0.43348 0.43348 0.71508 0.70412 0.69895 0.69621
builtin-popcnt-unrolled-errata 0.32512 0.32511 0.32511 0.32511 0.52018 0.52051 0.52018 0.52026
builtin-popcnt-unrolled-errata-manual 0.27092 0.17610 0.13706 0.12090 0.17459 0.21746 0.19642 0.18500
builtin-popcnt-movdq 0.20790 0.18105 0.16867 0.16783 0.26803 0.28308 0.28513 0.28493
builtin-popcnt-movdq-unrolled 0.29802 0.20244 0.16924 0.15661 0.24175 0.25688 0.24022 0.23233
builtin-popcnt-movdq-unrolled_manual 0.29802 0.20352 0.16917 0.15727 0.24246 0.25278 0.23827 0.23136
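
In the table above, harley-seal overtakes the other scalar procedures once the input reaches 128 B (for example 0.35052 s against 0.66570 s for bit-parallel-mul at 256 B). The sketch below illustrates the general Harley-Seal idea with carry-save adders: eight words are compressed per iteration, so only one word per block needs a full bit count. It is an illustration only; it is not claimed to match the benchmark's "4th iteration" variant, and the names csa, popcnt_word and popcnt_harley_seal are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Bit-parallel count of a single 64-bit word (used for the few words that need a full count).
inline std::uint64_t popcnt_word(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

// Carry-save adder: adds three bit vectors per bit position, l = sum bit, h = carry bit.
inline void csa(std::uint64_t& h, std::uint64_t& l,
                std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const std::uint64_t u = a ^ b;
    h = (a & b) | (u & c);
    l = u ^ c;
}

// Counts set bits in n 64-bit words, processing blocks of eight words.
inline std::uint64_t popcnt_harley_seal(const std::uint64_t* d, std::size_t n) {
    std::uint64_t total = 0;
    std::uint64_t ones = 0, twos = 0, fours = 0;  // running partial sums with weights 1, 2, 4
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t twosA, twosB, foursA, foursB, eights;
        csa(twosA, ones, ones, d[i + 0], d[i + 1]);
        csa(twosB, ones, ones, d[i + 2], d[i + 3]);
        csa(foursA, twos, twos, twosA, twosB);
        csa(twosA, ones, ones, d[i + 4], d[i + 5]);
        csa(twosB, ones, ones, d[i + 6], d[i + 7]);
        csa(foursB, twos, twos, twosA, twosB);
        csa(eights, fours, fours, foursA, foursB);
        total += popcnt_word(eights);  // one full count per eight input words
    }
    // Each bit in eights/fours/twos/ones stands for 8/4/2/1 input bits.
    total = 8 * total + 4 * popcnt_word(fours) + 2 * popcnt_word(twos) + popcnt_word(ones);
    for (; i < n; ++i) {
        total += popcnt_word(d[i]);    // leftover words that do not fill a block
    }
    return total;
}
```

The fixed work per block is one plausible reason why a scheme like this only pays off on longer inputs, which would be consistent with harley-seal being the slowest entry at 32 B.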

Input size 32B

procedure time [s] relative time (less is better)
lookup-8 1.05663 ███████████████████████████████████████▊
lookup-64 1.03518 ██████████████████████████████████████▉
bit-parallel 1.15605 ███████████████████████████████████████████▍
bit-parallel-optimized 0.82633 ███████████████████████████████
bit-parallel-mul 0.67734 █████████████████████████▍
harley-seal 1.32884 ██████████████████████████████████████████████████
sse-bit-parallel 1.11677 ██████████████████████████████████████████
sse-bit-parallel-original 0.85411 ████████████████████████████████▏
sse-bit-parallel-better 1.11683 ██████████████████████████████████████████
sse-harley-seal 0.91623 ██████████████████████████████████▍
sse-lookup 0.37931 ██████████████▎
sse-lookup-original 0.77295 █████████████████████████████
avx2-lookup 0.32512 ████████████▏
avx2-lookup-original 1.12234 ██████████████████████████████████████████▏
avx2-harley-seal 0.87193 ████████████████████████████████▊
cpu 0.21674 ████████▏
sse-cpu 1.14629 ███████████████████████████████████████████▏
avx2-cpu 1.11650 ██████████████████████████████████████████
builtin-popcnt 0.21333 ████████
builtin-popcnt32 0.65022 ████████████████████████▍
builtin-popcnt-unrolled 0.21674 ████████▏
builtin-popcnt-unrolled32 0.43348 ████████████████▎
builtin-popcnt-unrolled-errata 0.32512 ████████████▏
builtin-popcnt-unrolled-errata-manual 0.27092 ██████████▏
builtin-popcnt-movdq 0.20790 ███████▊
builtin-popcnt-movdq-unrolled 0.29802 ███████████▏
builtin-popcnt-movdq-unrolled_manual 0.29802 ███████████▏
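
The bars in these per-size tables appear to be scaled so that the slowest procedure for a given input size gets 50 full blocks, with the remainder drawn as a partial block in steps of one eighth, rounded down. That reading is inferred from the numbers, not from the report generator itself; the sketch below reproduces it under that assumption, and render_bar is a hypothetical name.

```cpp
#include <cmath>
#include <string>

// Renders a UTF-8 bar for `time`, scaled so that `max_time` maps to `width` full blocks.
// Assumption: partial blocks use eighths and are rounded down.
inline std::string render_bar(double time, double max_time, int width = 50) {
    const long eighths = static_cast<long>(std::floor(time / max_time * width * 8));
    std::string bar;
    for (long i = 0; i < eighths / 8; ++i) {
        bar += "\xE2\x96\x88";  // U+2588 FULL BLOCK
    }
    const long rest = eighths % 8;
    if (rest > 0) {
        // U+258F (one eighth) down to U+2589 (seven eighths): last UTF-8 byte is 0x90 - rest.
        bar += "\xE2\x96";
        bar += static_cast<char>(0x90 - rest);
    }
    return bar;
}
```

With the 32 B figures, render_bar(1.05663, 1.32884) yields 39 full blocks followed by a three-quarter block, which matches the lookup-8 row.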

Input size 64B

procedure time [s] relative time (less is better)
lookup-8 0.98891 ██████████████████████████████████████████████▏
lookup-64 0.95493 ████████████████████████████████████████████▌
bit-parallel 1.07004 ██████████████████████████████████████████████████
bit-parallel-optimized 0.71118 █████████████████████████████████▏
bit-parallel-mul 0.60370 ████████████████████████████▏
harley-seal 0.93001 ███████████████████████████████████████████▍
sse-bit-parallel 0.99346 ██████████████████████████████████████████████▍
sse-bit-parallel-original 0.59378 ███████████████████████████▋
sse-bit-parallel-better 1.02102 ███████████████████████████████████████████████▋
sse-harley-seal 0.59038 ███████████████████████████▌
sse-lookup 0.25738 ████████████
sse-lookup-original 0.51175 ███████████████████████▉
avx2-lookup 0.20320 █████████▍
avx2-lookup-original 0.69069 ████████████████████████████████▎
avx2-harley-seal 0.50367 ███████████████████████▌
cpu 0.17465 ████████▏
sse-cpu 0.18965 ████████▊
avx2-cpu 0.99273 ██████████████████████████████████████████████▍
builtin-popcnt 0.18147 ████████▍
builtin-popcnt32 0.46059 █████████████████████▌
builtin-popcnt-unrolled 0.21674 ██████████▏
builtin-popcnt-unrolled32 0.43348 ████████████████████▎
builtin-popcnt-unrolled-errata 0.32511 ███████████████▏
builtin-popcnt-unrolled-errata-manual 0.17610 ████████▏
builtin-popcnt-movdq 0.18105 ████████▍
builtin-popcnt-movdq-unrolled 0.20244 █████████▍
builtin-popcnt-movdq-unrolled_manual 0.20352 █████████▌

Input size 128B

procedure time [s] relative time (less is better)
lookup-8 1.15859 ██████████████████████████████████████████████████
lookup-64 1.06125 █████████████████████████████████████████████▊
bit-parallel 1.03226 ████████████████████████████████████████████▌
bit-parallel-optimized 0.65648 ████████████████████████████▎
bit-parallel-mul 0.58192 █████████████████████████
harley-seal 0.45719 ███████████████████▋
sse-bit-parallel 0.64180 ███████████████████████████▋
sse-bit-parallel-original 0.44258 ███████████████████
sse-bit-parallel-better 0.62236 ██████████████████████████▊
sse-harley-seal 0.42968 ██████████████████▌
sse-lookup 0.17610 ███████▌
sse-lookup-original 0.34690 ██████████████▉
avx2-lookup 0.14225 ██████▏
avx2-lookup-original 0.39373 ████████████████▉
avx2-harley-seal 0.32202 █████████████▉
cpu 0.12403 █████▎
sse-cpu 0.16283 ███████
avx2-cpu 0.16197 ██████▉
builtin-popcnt 0.11514 ████▉
builtin-popcnt32 0.44703 ███████████████████▎
builtin-popcnt-unrolled 0.21674 █████████▎
builtin-popcnt-unrolled32 0.43348 ██████████████████▋
builtin-popcnt-unrolled-errata 0.32511 ██████████████
builtin-popcnt-unrolled-errata-manual 0.13706 █████▉
builtin-popcnt-movdq 0.16867 ███████▎
builtin-popcnt-movdq-unrolled 0.16924 ███████▎
builtin-popcnt-movdq-unrolled_manual 0.16917 ███████▎

Input size 256B

procedure time [s] relative time (less is better)
lookup-8 1.06665 ██████████████████████████████████████████████████
lookup-64 0.97176 █████████████████████████████████████████████▌
bit-parallel 1.01322 ███████████████████████████████████████████████▍
bit-parallel-optimized 0.63280 █████████████████████████████▋
bit-parallel-mul 0.66570 ███████████████████████████████▏
harley-seal 0.35052 ████████████████▍
sse-bit-parallel 0.45853 █████████████████████▍
sse-bit-parallel-original 0.36574 █████████████████▏
sse-bit-parallel-better 0.43211 ████████████████████▎
sse-harley-seal 0.21537 ██████████
sse-lookup 0.15084 ███████
sse-lookup-original 0.25996 ████████████▏
avx2-lookup 0.09702 ████▌
avx2-lookup-original 0.23961 ███████████▏
avx2-harley-seal 0.22904 ██████████▋
cpu 0.13021 ██████
sse-cpu 0.15636 ███████▎
avx2-cpu 0.16020 ███████▌
builtin-popcnt 0.09631 ████▌
builtin-popcnt32 0.44026 ████████████████████▋
builtin-popcnt-unrolled 0.21674 ██████████▏
builtin-popcnt-unrolled32 0.43348 ████████████████████▎
builtin-popcnt-unrolled-errata 0.32511 ███████████████▏
builtin-popcnt-unrolled-errata-manual 0.12090 █████▋
builtin-popcnt-movdq 0.16783 ███████▊
builtin-popcnt-movdq-unrolled 0.15661 ███████▎
builtin-popcnt-movdq-unrolled_manual 0.15727 ███████▎

Input size 512B

procedure time [s] relative time (less is better)
lookup-8 1.63234 ██████████████████████████████████████████████████
lookup-64 1.47745 █████████████████████████████████████████████▎
bit-parallel 1.60619 █████████████████████████████████████████████████▏
bit-parallel-optimized 0.99373 ██████████████████████████████▍
bit-parallel-mul 0.98413 ██████████████████████████████▏
harley-seal 0.47554 ██████████████▌
sse-bit-parallel 0.58837 ██████████████████
sse-bit-parallel-original 0.55395 ████████████████▉
sse-bit-parallel-better 0.54241 ████████████████▌
sse-harley-seal 0.26776 ████████▏
sse-lookup 0.23039 ███████
sse-lookup-original 0.39734 ████████████▏
avx2-lookup 0.13477 ████▏
avx2-lookup-original 0.26304 ████████
avx2-harley-seal 0.18195 █████▌
cpu 0.23828 ███████▎
sse-cpu 0.25773 ███████▉
avx2-cpu 0.26421 ████████
builtin-popcnt 0.14393 ████▍
builtin-popcnt32 0.69899 █████████████████████▍
builtin-popcnt-unrolled 0.34678 ██████████▌
builtin-popcnt-unrolled32 0.71508 █████████████████████▉
builtin-popcnt-unrolled-errata 0.52018 ███████████████▉
builtin-popcnt-unrolled-errata-manual 0.17459 █████▎
builtin-popcnt-movdq 0.26803 ████████▏
builtin-popcnt-movdq-unrolled 0.24175 ███████▍
builtin-popcnt-movdq-unrolled_manual 0.24246 ███████▍

Input size 1024B

procedure time [s] relative time (less is better)
lookup-8 1.59694 ████████████████████████████████████████████████▉
lookup-64 1.44230 ████████████████████████████████████████████▏
bit-parallel 1.63027 ██████████████████████████████████████████████████
bit-parallel-optimized 1.02343 ███████████████████████████████▍
bit-parallel-mul 0.94520 ████████████████████████████▉
harley-seal 0.43295 █████████████▎
sse-bit-parallel 0.51578 ███████████████▊
sse-bit-parallel-original 0.53442 ████████████████▍
sse-bit-parallel-better 0.46772 ██████████████▎
sse-harley-seal 0.22766 ██████▉
sse-lookup 0.22113 ██████▊
sse-lookup-original 0.33790 ██████████▎
avx2-lookup 0.12895 ███▉
avx2-lookup-original 0.23145 ███████
avx2-harley-seal 0.13830 ████▏
cpu 0.32969 ██████████
sse-cpu 0.26459 ████████
avx2-cpu 0.27079 ████████▎
builtin-popcnt 0.13328 ████
builtin-popcnt32 0.70140 █████████████████████▌
builtin-popcnt-unrolled 0.35753 ██████████▉
builtin-popcnt-unrolled32 0.70412 █████████████████████▌
builtin-popcnt-unrolled-errata 0.52051 ███████████████▉
builtin-popcnt-unrolled-errata-manual 0.21746 ██████▋
builtin-popcnt-movdq 0.28308 ████████▋
builtin-popcnt-movdq-unrolled 0.25688 ███████▉
builtin-popcnt-movdq-unrolled_manual 0.25278 ███████▊

Input size 2048B

procedure time [s] relative time (less is better)
lookup-8 1.57876 █████████████████████████████████████████████████▌
lookup-64 1.42358 ████████████████████████████████████████████▋
bit-parallel 1.59463 ██████████████████████████████████████████████████
bit-parallel-optimized 0.97960 ██████████████████████████████▋
bit-parallel-mul 0.92402 ████████████████████████████▉
harley-seal 0.41168 ████████████▉
sse-bit-parallel 0.47959 ███████████████
sse-bit-parallel-original 0.52128 ████████████████▎
sse-bit-parallel-better 0.43047 █████████████▍
sse-harley-seal 0.20777 ██████▌
sse-lookup 0.21674 ██████▊
sse-lookup-original 0.35048 ██████████▉
avx2-lookup 0.12574 ███▉
avx2-lookup-original 0.20544 ██████▍
avx2-harley-seal 0.11638 ███▋
cpu 0.33816 ██████████▌
sse-cpu 0.27349 ████████▌
avx2-cpu 0.27794 ████████▋
builtin-popcnt 0.13867 ████▎
builtin-popcnt32 0.69760 █████████████████████▊
builtin-popcnt-unrolled 0.35204 ███████████
builtin-popcnt-unrolled32 0.69895 █████████████████████▉
builtin-popcnt-unrolled-errata 0.52018 ████████████████▎
builtin-popcnt-unrolled-errata-manual 0.19642 ██████▏
builtin-popcnt-movdq 0.28513 ████████▉
builtin-popcnt-movdq-unrolled 0.24022 ███████▌
builtin-popcnt-movdq-unrolled_manual 0.23827 ███████▍

Input size 4096B

procedure time [s] relative time (less is better)
lookup-8 1.56954 █████████████████████████████████████████████████
lookup-64 1.41498 ████████████████████████████████████████████▏
bit-parallel 1.60044 ██████████████████████████████████████████████████
bit-parallel-optimized 0.98773 ██████████████████████████████▊
bit-parallel-mul 0.91287 ████████████████████████████▌
harley-seal 0.40105 ████████████▌
sse-bit-parallel 0.46139 ██████████████▍
sse-bit-parallel-original 0.51927 ████████████████▏
sse-bit-parallel-better 0.41185 ████████████▊
sse-harley-seal 0.19780 ██████▏
sse-lookup 0.21457 ██████▋
sse-lookup-original 0.36659 ███████████▍
avx2-lookup 0.12404 ███▉
avx2-lookup-original 0.19114 █████▉
avx2-harley-seal 0.10531 ███▎
cpu 0.34263 ██████████▋
sse-cpu 0.27067 ████████▍
avx2-cpu 0.28788 ████████▉
builtin-popcnt 0.13432 ████▏
builtin-popcnt32 0.69559 █████████████████████▋
builtin-popcnt-unrolled 0.34947 ██████████▉
builtin-popcnt-unrolled32 0.69621 █████████████████████▊
builtin-popcnt-unrolled-errata 0.52026 ████████████████▎
builtin-popcnt-unrolled-errata-manual 0.18500 █████▊
builtin-popcnt-movdq 0.28493 ████████▉
builtin-popcnt-movdq-unrolled 0.23233 ███████▎
builtin-popcnt-movdq-unrolled_manual 0.23136 ███████▏

Speedup

procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
lookup-64 1.02 1.04 1.09 1.10 1.10 1.11 1.11 1.11
bit-parallel 0.91 0.92 1.12 1.05 1.02 0.98 0.99 0.98
bit-parallel-optimized 1.28 1.39 1.76 1.69 1.64 1.56 1.61 1.59
bit-parallel-mul 1.56 1.64 1.99 1.60 1.66 1.69 1.71 1.72
harley-seal 0.80 1.06 2.53 3.04 3.43 3.69 3.83 3.91
sse-bit-parallel 0.95 1.00 1.81 2.33 2.77 3.10 3.29 3.40
sse-bit-parallel-original 1.24 1.67 2.62 2.92 2.95 2.99 3.03 3.02
sse-bit-parallel-better 0.95 0.97 1.86 2.47 3.01 3.41 3.67 3.81
sse-harley-seal 1.15 1.68 2.70 4.95 6.10 7.01 7.60 7.93
sse-lookup 2.79 3.84 6.58 7.07 7.09 7.22 7.28 7.31
sse-lookup-original 1.37 1.93 3.34 4.10 4.11 4.73 4.50 4.28
avx2-lookup 3.25 4.87 8.14 10.99 12.11 12.38 12.56 12.65
avx2-lookup-original 0.94 1.43 2.94 4.45 6.21 6.90 7.68 8.21
avx2-harley-seal 1.21 1.96 3.60 4.66 8.97 11.55 13.57 14.90
cpu 4.88 5.66 9.34 8.19 6.85 4.84 4.67 4.58
sse-cpu 0.92 5.21 7.12 6.82 6.33 6.04 5.77 5.80
avx2-cpu 0.95 1.00 7.15 6.66 6.18 5.90 5.68 5.45
builtin-popcnt 4.95 5.45 10.06 11.08 11.34 11.98 11.38 11.68
builtin-popcnt32 1.63 2.15 2.59 2.42 2.34 2.28 2.26 2.26
builtin-popcnt-unrolled 4.88 4.56 5.35 4.92 4.71 4.47 4.48 4.49
builtin-popcnt-unrolled32 2.44 2.28 2.67 2.46 2.28 2.27 2.26 2.25
builtin-popcnt-unrolled-errata 3.25 3.04 3.56 3.28 3.14 3.07 3.04 3.02
builtin-popcnt-unrolled-errata-manual 3.90 5.62 8.45 8.82 9.35 7.34 8.04 8.48
builtin-popcnt-movdq 5.08 5.46 6.87 6.36 6.09 5.64 5.54 5.51
builtin-popcnt-movdq-unrolled 3.55 4.89 6.85 6.81 6.75 6.22 6.57 6.76
builtin-popcnt-movdq-unrolled_manual 3.55 4.86 6.85 6.78 6.73 6.32 6.63 6.78
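
The speedup values are consistent with dividing the lookup-8 time by each procedure's time for the same input size, rounded to two decimals. A small sketch of that arithmetic, with the hypothetical names speedup and round2:

```cpp
#include <cmath>

// Speedup of a procedure relative to a baseline: values above 1 mean faster than the baseline.
inline double speedup(double baseline_time, double procedure_time) {
    return baseline_time / procedure_time;
}

// Rounds to two decimals, the precision used in the table.
inline double round2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For example, avx2-lookup at 4096 B gives 1.56954 / 0.12404, which rounds to 12.65, and lookup-64 at 32 B gives 1.05663 / 1.03518, which rounds to 1.02.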

CSV file

Download skylake-i7-6700-clang3.8.0-avx2.csv