Population count comparison for Skylake Core i7-6700 CPU @ 3.40GHz

Generated on: 2016-03-26

Specification

CPU: Skylake Core i7-6700 CPU @ 3.40GHz

Compiler: GCC 5.3.0 (Ubuntu)

Instruction set: AVX2

Number of runs: 5

All times are given in seconds.

Procedures

======================================  ==================================================================
procedure                               description
======================================  ==================================================================
lookup-8                                lookup in std::uint8_t[256] LUT
lookup-64                               lookup in std::uint64_t[256] LUT
bit-parallel                            naive bit-parallel method
bit-parallel-optimized                  slightly improved bit-parallel method
bit-parallel-mul                        bit-parallel with fewer instructions
harley-seal                             Harley-Seal popcount (4th iteration)
sse-bit-parallel                        SSE implementation of bit-parallel-optimized (unrolled)
sse-bit-parallel-original               SSE implementation of bit-parallel-optimized
sse-bit-parallel-better                 SSE implementation of bit-parallel with fewer instructions
sse-harley-seal                         SSE implementation of Harley-Seal
sse-lookup                              SSSE3 variant using pshufb instruction (unrolled)
sse-lookup-original                     SSSE3 variant using pshufb instruction
avx2-lookup                             AVX2 variant using pshufb instruction (unrolled)
avx2-lookup-original                    AVX2 variant using pshufb instruction
avx2-harley-seal                        AVX2 implementation of Harley-Seal
cpu                                     CPU instruction popcnt (64-bit variant)
sse-cpu                                 load data with SSE, then count bits using popcnt
avx2-cpu                                load data with AVX2, then count bits using popcnt
builtin-popcnt                          builtin for popcnt
builtin-popcnt32                        builtin for popcnt (32-bit variant)
builtin-popcnt-unrolled                 unrolled builtin-popcnt
builtin-popcnt-unrolled32               unrolled builtin-popcnt32
builtin-popcnt-unrolled-errata          unrolled builtin-popcnt avoiding the false dependency
builtin-popcnt-unrolled-errata-manual   unrolled builtin-popcnt avoiding the false dependency (assembly code)
builtin-popcnt-movdq                    builtin-popcnt where data is loaded via SSE registers
builtin-popcnt-movdq-unrolled           builtin-popcnt-movdq unrolled
builtin-popcnt-movdq-unrolled_manual    builtin-popcnt-movdq unrolled (assembly code)
======================================  ==================================================================
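
For orientation, the scalar families named above can be sketched roughly as follows. This is a minimal illustration and not the benchmarked code: the function names (``make_lut``, ``popcnt_lookup8``, ``popcnt_word_mul``, ``popcnt_builtin``) are hypothetical, the table is built at start-up rather than hard-coded, and ``__builtin_popcountll`` assumes GCC or Clang.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// lookup-8 style: one 8-bit table entry per possible byte value.
// Hypothetical helper; builds the table from popcount(i) = (i & 1) + popcount(i >> 1).
inline std::array<std::uint8_t, 256> make_lut() {
    std::array<std::uint8_t, 256> lut{};  // lut[0] == 0
    for (int i = 1; i < 256; ++i)
        lut[i] = static_cast<std::uint8_t>((i & 1) + lut[i >> 1]);
    return lut;
}

// Sum the table entries for every byte of the input buffer.
inline std::uint64_t popcnt_lookup8(const std::uint8_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    static const std::array<std::uint8_t, 256> lut = make_lut();
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += lut[data[i]];
    return total;
}

// bit-parallel style on one 64-bit word: add neighbouring 1-, 2- and 4-bit
// fields in parallel, then sum the eight byte counters with a single
// multiplication (the idea behind the "fewer instructions" variant).
inline std::uint64_t popcnt_word_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

// builtin style: GCC/Clang intrinsic; it compiles to the popcnt instruction
// only when the target supports it (for example with -mpopcnt).
inline int popcnt_builtin(std::uint64_t x) {
    return __builtin_popcountll(x);
}
```

All three return the same count for the same input; they differ only in how much work is done per byte or per word, which is what the timings below compare.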

Running time

procedure 32B 64B 128B 256B 512B 1024B 2048B 4096B
lookup-8 1.02956 0.94836 1.04600 0.95508 1.46021 1.42425 1.40627 1.39751
lookup-64 1.00704 0.94362 1.04179 0.96031 1.46638 1.43074 1.41285 1.40219
bit-parallel 1.05678 0.95295 0.90993 0.88909 1.40587 1.39754 1.39337 1.40798
bit-parallel-optimized 0.81278 0.69739 0.63874 0.61126 0.95469 0.94345 0.93783 0.94590
bit-parallel-mul 0.65022 0.58277 0.55384 0.63993 0.94622 0.90929 0.89067 0.88138
harley-seal 0.81279 0.66382 0.43349 0.33865 0.46870 0.43078 0.41181 0.40233
sse-bit-parallel 1.96422 1.57389 0.92645 0.59693 0.69071 0.51317 0.46554 0.44174
sse-bit-parallel-original 0.98698 0.64160 0.46789 0.37692 0.53546 0.50186 0.54272 0.53910
sse-bit-parallel-better 2.37451 1.77951 0.91255 0.55164 0.62721 0.52551 0.45176 0.41155
sse-harley-seal 0.94107 0.59645 0.43005 0.20380 0.25607 0.22175 0.20421 0.19545
sse-lookup 0.40702 0.29802 0.18291 0.15747 0.23972 0.23061 0.22648 0.22486
sse-lookup-original 0.86734 0.52831 0.35898 0.27432 0.37118 0.38857 0.39088 0.38906
avx2-lookup 0.43350 0.26391 0.17816 0.11536 0.14846 0.13170 0.12303 0.11898
avx2-lookup-original 1.64500 0.99497 0.55478 0.34966 0.40783 0.27806 0.23484 0.21153
avx2-harley-seal 0.89417 0.51328 0.32495 0.22880 0.17255 0.13258 0.11234 0.10230
cpu 0.24383 0.18810 0.13546 0.12192 0.30881 0.25016 0.21192 0.19269
sse-cpu 1.74296 0.22689 0.18052 0.16467 0.24506 0.23503 0.25034 0.23566
avx2-cpu 2.25318 1.70364 0.22751 0.19394 0.28513 0.26503 0.25543 0.25294
builtin-popcnt 0.18965 0.16911 0.15518 0.16960 0.31822 0.28013 0.26618 0.26028
builtin-popcnt32 0.39193 0.39391 0.56698 0.47326 0.70798 0.68983 0.67679 0.67369
builtin-popcnt-unrolled 0.21674 0.14901 0.12192 0.11176 0.21719 0.18749 0.17844 0.17597
builtin-popcnt-unrolled32 0.36604 0.30837 0.29083 0.27588 0.46756 0.42812 0.41192 0.40445
builtin-popcnt-unrolled-errata 0.24383 0.17611 0.13547 0.11515 0.17610 0.18231 0.17940 0.17644
builtin-popcnt-unrolled-errata-manual 0.28447 0.19284 0.14273 0.11902 0.27409 0.23065 0.19764 0.18555
builtin-popcnt-movdq 0.18978 0.17245 0.17213 0.17727 0.32302 0.30900 0.31118 0.30717
builtin-popcnt-movdq-unrolled 0.27933 0.22039 0.17509 0.16159 0.26084 0.25552 0.24025 0.23251
builtin-popcnt-movdq-unrolled_manual 0.31021 0.22818 0.18489 0.16560 0.24995 0.24937 0.23789 0.23281

Input size 32B

procedure time [s] relative time (less is better)
lookup-8 1.02956 █████████████████████▋
lookup-64 1.00704 █████████████████████▏
bit-parallel 1.05678 ██████████████████████▎
bit-parallel-optimized 0.81278 █████████████████
bit-parallel-mul 0.65022 █████████████▋
harley-seal 0.81279 █████████████████
sse-bit-parallel 1.96422 █████████████████████████████████████████▎
sse-bit-parallel-original 0.98698 ████████████████████▊
sse-bit-parallel-better 2.37451 ██████████████████████████████████████████████████
sse-harley-seal 0.94107 ███████████████████▊
sse-lookup 0.40702 ████████▌
sse-lookup-original 0.86734 ██████████████████▎
avx2-lookup 0.43350 █████████▏
avx2-lookup-original 1.64500 ██████████████████████████████████▋
avx2-harley-seal 0.89417 ██████████████████▊
cpu 0.24383 █████▏
sse-cpu 1.74296 ████████████████████████████████████▋
avx2-cpu 2.25318 ███████████████████████████████████████████████▍
builtin-popcnt 0.18965 ███▉
builtin-popcnt32 0.39193 ████████▎
builtin-popcnt-unrolled 0.21674 ████▌
builtin-popcnt-unrolled32 0.36604 ███████▋
builtin-popcnt-unrolled-errata 0.24383 █████▏
builtin-popcnt-unrolled-errata-manual 0.28447 █████▉
builtin-popcnt-movdq 0.18978 ███▉
builtin-popcnt-movdq-unrolled 0.27933 █████▉
builtin-popcnt-movdq-unrolled_manual 0.31021 ██████▌
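
The bar lengths in these charts appear to follow a simple rule: the slowest procedure for a given input size gets 50 full blocks, and every other time is scaled against it and truncated to eighths of a block. That rule is inferred from the numbers above rather than stated by the report, so the sketch below, with its hypothetical names ``bar_eighths`` and ``render_bar`` and its default width of 50, should be read as an assumption.

```cpp
#include <cassert>
#include <string>

// Bar length in eighths of a character cell, assuming the slowest
// procedure spans `width` full blocks (width = 50 is an inferred value).
inline int bar_eighths(double seconds, double slowest_seconds, int width = 50) {
    assert(seconds >= 0.0 && slowest_seconds > 0.0 && seconds <= slowest_seconds);
    // Truncation (not rounding) reproduces the bars shown in the charts.
    return static_cast<int>(seconds / slowest_seconds * width * 8);
}

// Render the bar with Unicode block elements: full blocks first,
// then at most one partial block for the remaining eighths.
inline std::string render_bar(double seconds, double slowest_seconds, int width = 50) {
    static const char* const partial[8] = {"", "▏", "▎", "▍", "▌", "▋", "▊", "▉"};
    const int eighths = bar_eighths(seconds, slowest_seconds, width);
    std::string bar;
    for (int i = 0; i < eighths / 8; ++i)
        bar += "█";
    bar += partial[eighths % 8];
    return bar;
}
```

For the 32B chart, ``render_bar(0.18965, 2.37451)`` yields three full blocks followed by a seven-eighths block, matching the builtin-popcnt row.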

Input size 64B

procedure time [s] relative time (less is better)
lookup-8 0.94836 ██████████████████████████▋
lookup-64 0.94362 ██████████████████████████▌
bit-parallel 0.95295 ██████████████████████████▊
bit-parallel-optimized 0.69739 ███████████████████▌
bit-parallel-mul 0.58277 ████████████████▎
harley-seal 0.66382 ██████████████████▋
sse-bit-parallel 1.57389 ████████████████████████████████████████████▏
sse-bit-parallel-original 0.64160 ██████████████████
sse-bit-parallel-better 1.77951 ██████████████████████████████████████████████████
sse-harley-seal 0.59645 ████████████████▊
sse-lookup 0.29802 ████████▎
sse-lookup-original 0.52831 ██████████████▊
avx2-lookup 0.26391 ███████▍
avx2-lookup-original 0.99497 ███████████████████████████▉
avx2-harley-seal 0.51328 ██████████████▍
cpu 0.18810 █████▎
sse-cpu 0.22689 ██████▍
avx2-cpu 1.70364 ███████████████████████████████████████████████▊
builtin-popcnt 0.16911 ████▊
builtin-popcnt32 0.39391 ███████████
builtin-popcnt-unrolled 0.14901 ████▏
builtin-popcnt-unrolled32 0.30837 ████████▋
builtin-popcnt-unrolled-errata 0.17611 ████▉
builtin-popcnt-unrolled-errata-manual 0.19284 █████▍
builtin-popcnt-movdq 0.17245 ████▊
builtin-popcnt-movdq-unrolled 0.22039 ██████▏
builtin-popcnt-movdq-unrolled_manual 0.22818 ██████▍

Input size 128B

procedure time [s] relative time (less is better)
lookup-8 1.04600 ██████████████████████████████████████████████████
lookup-64 1.04179 █████████████████████████████████████████████████▊
bit-parallel 0.90993 ███████████████████████████████████████████▍
bit-parallel-optimized 0.63874 ██████████████████████████████▌
bit-parallel-mul 0.55384 ██████████████████████████▍
harley-seal 0.43349 ████████████████████▋
sse-bit-parallel 0.92645 ████████████████████████████████████████████▎
sse-bit-parallel-original 0.46789 ██████████████████████▎
sse-bit-parallel-better 0.91255 ███████████████████████████████████████████▌
sse-harley-seal 0.43005 ████████████████████▌
sse-lookup 0.18291 ████████▋
sse-lookup-original 0.35898 █████████████████▏
avx2-lookup 0.17816 ████████▌
avx2-lookup-original 0.55478 ██████████████████████████▌
avx2-harley-seal 0.32495 ███████████████▌
cpu 0.13546 ██████▍
sse-cpu 0.18052 ████████▋
avx2-cpu 0.22751 ██████████▉
builtin-popcnt 0.15518 ███████▍
builtin-popcnt32 0.56698 ███████████████████████████
builtin-popcnt-unrolled 0.12192 █████▊
builtin-popcnt-unrolled32 0.29083 █████████████▉
builtin-popcnt-unrolled-errata 0.13547 ██████▍
builtin-popcnt-unrolled-errata-manual 0.14273 ██████▊
builtin-popcnt-movdq 0.17213 ████████▏
builtin-popcnt-movdq-unrolled 0.17509 ████████▎
builtin-popcnt-movdq-unrolled_manual 0.18489 ████████▊

Input size 256B

procedure time [s] relative time (less is better)
lookup-8 0.95508 █████████████████████████████████████████████████▋
lookup-64 0.96031 ██████████████████████████████████████████████████
bit-parallel 0.88909 ██████████████████████████████████████████████▎
bit-parallel-optimized 0.61126 ███████████████████████████████▊
bit-parallel-mul 0.63993 █████████████████████████████████▎
harley-seal 0.33865 █████████████████▋
sse-bit-parallel 0.59693 ███████████████████████████████
sse-bit-parallel-original 0.37692 ███████████████████▋
sse-bit-parallel-better 0.55164 ████████████████████████████▋
sse-harley-seal 0.20380 ██████████▌
sse-lookup 0.15747 ████████▏
sse-lookup-original 0.27432 ██████████████▎
avx2-lookup 0.11536 ██████
avx2-lookup-original 0.34966 ██████████████████▏
avx2-harley-seal 0.22880 ███████████▉
cpu 0.12192 ██████▎
sse-cpu 0.16467 ████████▌
avx2-cpu 0.19394 ██████████
builtin-popcnt 0.16960 ████████▊
builtin-popcnt32 0.47326 ████████████████████████▋
builtin-popcnt-unrolled 0.11176 █████▊
builtin-popcnt-unrolled32 0.27588 ██████████████▎
builtin-popcnt-unrolled-errata 0.11515 █████▉
builtin-popcnt-unrolled-errata-manual 0.11902 ██████▏
builtin-popcnt-movdq 0.17727 █████████▏
builtin-popcnt-movdq-unrolled 0.16159 ████████▍
builtin-popcnt-movdq-unrolled_manual 0.16560 ████████▌

Input size 512B

procedure time [s] relative time (less is better)
lookup-8 1.46021 █████████████████████████████████████████████████▊
lookup-64 1.46638 ██████████████████████████████████████████████████
bit-parallel 1.40587 ███████████████████████████████████████████████▉
bit-parallel-optimized 0.95469 ████████████████████████████████▌
bit-parallel-mul 0.94622 ████████████████████████████████▎
harley-seal 0.46870 ███████████████▉
sse-bit-parallel 0.69071 ███████████████████████▌
sse-bit-parallel-original 0.53546 ██████████████████▎
sse-bit-parallel-better 0.62721 █████████████████████▍
sse-harley-seal 0.25607 ████████▋
sse-lookup 0.23972 ████████▏
sse-lookup-original 0.37118 ████████████▋
avx2-lookup 0.14846 █████
avx2-lookup-original 0.40783 █████████████▉
avx2-harley-seal 0.17255 █████▉
cpu 0.30881 ██████████▌
sse-cpu 0.24506 ████████▎
avx2-cpu 0.28513 █████████▋
builtin-popcnt 0.31822 ██████████▊
builtin-popcnt32 0.70798 ████████████████████████▏
builtin-popcnt-unrolled 0.21719 ███████▍
builtin-popcnt-unrolled32 0.46756 ███████████████▉
builtin-popcnt-unrolled-errata 0.17610 ██████
builtin-popcnt-unrolled-errata-manual 0.27409 █████████▎
builtin-popcnt-movdq 0.32302 ███████████
builtin-popcnt-movdq-unrolled 0.26084 ████████▉
builtin-popcnt-movdq-unrolled_manual 0.24995 ████████▌

Input size 1024B

procedure time [s] relative time (less is better)
lookup-8 1.42425 █████████████████████████████████████████████████▊
lookup-64 1.43074 ██████████████████████████████████████████████████
bit-parallel 1.39754 ████████████████████████████████████████████████▊
bit-parallel-optimized 0.94345 ████████████████████████████████▉
bit-parallel-mul 0.90929 ███████████████████████████████▊
harley-seal 0.43078 ███████████████
sse-bit-parallel 0.51317 █████████████████▉
sse-bit-parallel-original 0.50186 █████████████████▌
sse-bit-parallel-better 0.52551 ██████████████████▎
sse-harley-seal 0.22175 ███████▋
sse-lookup 0.23061 ████████
sse-lookup-original 0.38857 █████████████▌
avx2-lookup 0.13170 ████▌
avx2-lookup-original 0.27806 █████████▋
avx2-harley-seal 0.13258 ████▋
cpu 0.25016 ████████▋
sse-cpu 0.23503 ████████▏
avx2-cpu 0.26503 █████████▎
builtin-popcnt 0.28013 █████████▊
builtin-popcnt32 0.68983 ████████████████████████
builtin-popcnt-unrolled 0.18749 ██████▌
builtin-popcnt-unrolled32 0.42812 ██████████████▉
builtin-popcnt-unrolled-errata 0.18231 ██████▎
builtin-popcnt-unrolled-errata-manual 0.23065 ████████
builtin-popcnt-movdq 0.30900 ██████████▊
builtin-popcnt-movdq-unrolled 0.25552 ████████▉
builtin-popcnt-movdq-unrolled_manual 0.24937 ████████▋

Input size 2048B

procedure time [s] relative time (less is better)
lookup-8 1.40627 █████████████████████████████████████████████████▊
lookup-64 1.41285 ██████████████████████████████████████████████████
bit-parallel 1.39337 █████████████████████████████████████████████████▎
bit-parallel-optimized 0.93783 █████████████████████████████████▏
bit-parallel-mul 0.89067 ███████████████████████████████▌
harley-seal 0.41181 ██████████████▌
sse-bit-parallel 0.46554 ████████████████▍
sse-bit-parallel-original 0.54272 ███████████████████▏
sse-bit-parallel-better 0.45176 ███████████████▉
sse-harley-seal 0.20421 ███████▏
sse-lookup 0.22648 ████████
sse-lookup-original 0.39088 █████████████▊
avx2-lookup 0.12303 ████▎
avx2-lookup-original 0.23484 ████████▎
avx2-harley-seal 0.11234 ███▉
cpu 0.21192 ███████▍
sse-cpu 0.25034 ████████▊
avx2-cpu 0.25543 █████████
builtin-popcnt 0.26618 █████████▍
builtin-popcnt32 0.67679 ███████████████████████▉
builtin-popcnt-unrolled 0.17844 ██████▎
builtin-popcnt-unrolled32 0.41192 ██████████████▌
builtin-popcnt-unrolled-errata 0.17940 ██████▎
builtin-popcnt-unrolled-errata-manual 0.19764 ██████▉
builtin-popcnt-movdq 0.31118 ███████████
builtin-popcnt-movdq-unrolled 0.24025 ████████▌
builtin-popcnt-movdq-unrolled_manual 0.23789 ████████▍

Input size 4096B

procedure time [s] relative time (less is better)
lookup-8 1.39751 █████████████████████████████████████████████████▋
lookup-64 1.40219 █████████████████████████████████████████████████▊
bit-parallel 1.40798 ██████████████████████████████████████████████████
bit-parallel-optimized 0.94590 █████████████████████████████████▌
bit-parallel-mul 0.88138 ███████████████████████████████▎
harley-seal 0.40233 ██████████████▎
sse-bit-parallel 0.44174 ███████████████▋
sse-bit-parallel-original 0.53910 ███████████████████▏
sse-bit-parallel-better 0.41155 ██████████████▌
sse-harley-seal 0.19545 ██████▉
sse-lookup 0.22486 ███████▉
sse-lookup-original 0.38906 █████████████▊
avx2-lookup 0.11898 ████▏
avx2-lookup-original 0.21153 ███████▌
avx2-harley-seal 0.10230 ███▋
cpu 0.19269 ██████▊
sse-cpu 0.23566 ████████▎
avx2-cpu 0.25294 ████████▉
builtin-popcnt 0.26028 █████████▏
builtin-popcnt32 0.67369 ███████████████████████▉
builtin-popcnt-unrolled 0.17597 ██████▏
builtin-popcnt-unrolled32 0.40445 ██████████████▎
builtin-popcnt-unrolled-errata 0.17644 ██████▎
builtin-popcnt-unrolled-errata-manual 0.18555 ██████▌
builtin-popcnt-movdq 0.30717 ██████████▉
builtin-popcnt-movdq-unrolled 0.23251 ████████▎
builtin-popcnt-movdq-unrolled_manual 0.23281 ████████▎

Speedup

procedure 32B 64B 128B 256B 512B 1024B 2048B 4096B
lookup-8 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
lookup-64 1.02 1.01 1.00 0.99 1.00 1.00 1.00 1.00
bit-parallel 0.97 1.00 1.15 1.07 1.04 1.02 1.01 0.99
bit-parallel-optimized 1.27 1.36 1.64 1.56 1.53 1.51 1.50 1.48
bit-parallel-mul 1.58 1.63 1.89 1.49 1.54 1.57 1.58 1.59
harley-seal 1.27 1.43 2.41 2.82 3.12 3.31 3.41 3.47
sse-bit-parallel 0.52 0.60 1.13 1.60 2.11 2.78 3.02 3.16
sse-bit-parallel-original 1.04 1.48 2.24 2.53 2.73 2.84 2.59 2.59
sse-bit-parallel-better 0.43 0.53 1.15 1.73 2.33 2.71 3.11 3.40
sse-harley-seal 1.09 1.59 2.43 4.69 5.70 6.42 6.89 7.15
sse-lookup 2.53 3.18 5.72 6.07 6.09 6.18 6.21 6.22
sse-lookup-original 1.19 1.80 2.91 3.48 3.93 3.67 3.60 3.59
avx2-lookup 2.38 3.59 5.87 8.28 9.84 10.81 11.43 11.75
avx2-lookup-original 0.63 0.95 1.89 2.73 3.58 5.12 5.99 6.61
avx2-harley-seal 1.15 1.85 3.22 4.17 8.46 10.74 12.52 13.66
cpu 4.22 5.04 7.72 7.83 4.73 5.69 6.64 7.25
sse-cpu 0.59 4.18 5.79 5.80 5.96 6.06 5.62 5.93
avx2-cpu 0.46 0.56 4.60 4.92 5.12 5.37 5.51 5.53
builtin-popcnt 5.43 5.61 6.74 5.63 4.59 5.08 5.28 5.37
builtin-popcnt32 2.63 2.41 1.84 2.02 2.06 2.06 2.08 2.07
builtin-popcnt-unrolled 4.75 6.36 8.58 8.55 6.72 7.60 7.88 7.94
builtin-popcnt-unrolled32 2.81 3.08 3.60 3.46 3.12 3.33 3.41 3.46
builtin-popcnt-unrolled-errata 4.22 5.38 7.72 8.29 8.29 7.81 7.84 7.92
builtin-popcnt-unrolled-errata-manual 3.62 4.92 7.33 8.02 5.33 6.17 7.12 7.53
builtin-popcnt-movdq 5.43 5.50 6.08 5.39 4.52 4.61 4.52 4.55
builtin-popcnt-movdq-unrolled 3.69 4.30 5.97 5.91 5.60 5.57 5.85 6.01
builtin-popcnt-movdq-unrolled_manual 3.32 4.16 5.66 5.77 5.84 5.71 5.91 6.00
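
The speedup values are consistent with dividing the lookup-8 time by the procedure's time for the same input size, which is why the lookup-8 row is 1.00 throughout. A small sketch of that arithmetic, with the hypothetical helper names ``speedup`` and ``round2`` and the baseline choice treated as an inference from the tables:

```cpp
#include <cassert>
#include <cmath>

// Speedup relative to a baseline: values above 1 mean the procedure is
// faster than the baseline, values below 1 mean it is slower.
inline double speedup(double baseline_seconds, double procedure_seconds) {
    assert(procedure_seconds > 0.0);
    return baseline_seconds / procedure_seconds;
}

// Round to two decimal places, as the table does.
inline double round2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For example, sse-lookup at 32B gives 1.02956 / 0.40702, which is about 2.53, and avx2-harley-seal at 4096B gives 1.39751 / 0.10230, which is about 13.66.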

CSV file

Download skylake-i7-6700-gcc5.3.0-avx2.csv