Population count comparison for Skylake Core i7-6700 CPU @ 3.40GHz
Generated on: 2016-03-26
CPU: Skylake Core i7-6700 CPU @ 3.40GHz
Compiler: GCC 5.3.0 (Ubuntu)
Instruction set: AVX2
Number of runs: 5
All times are given in seconds.

| procedure | description |
|---|---|
| lookup-8 | lookup in std::uint8_t[256] LUT |
| lookup-64 | lookup in std::uint64_t[256] LUT |
| bit-parallel | naive bit-parallel method |
| bit-parallel-optimized | slightly improved bit-parallel method |
| bit-parallel-mul | bit-parallel with fewer instructions |
| harley-seal | Harley-Seal popcount (4th iteration) |
| sse-bit-parallel | SSE implementation of bit-parallel-optimized (unrolled) |
| sse-bit-parallel-original | SSE implementation of bit-parallel-optimized |
| sse-bit-parallel-better | SSE implementation of bit-parallel with fewer instructions |
| sse-harley-seal | SSE implementation of Harley-Seal |
| sse-lookup | SSSE3 variant using pshufb instruction (unrolled) |
| sse-lookup-original | SSSE3 variant using pshufb instruction |
| avx2-lookup | AVX2 variant using pshufb instruction (unrolled) |
| avx2-lookup-original | AVX2 variant using pshufb instruction |
| avx2-harley-seal | AVX2 implementation of Harley-Seal |
| cpu | CPU instruction popcnt (64-bit variant) |
| sse-cpu | load data with SSE, then count bits using popcnt |
| avx2-cpu | load data with AVX2, then count bits using popcnt |
| builtin-popcnt | builtin for popcnt |
| builtin-popcnt32 | builtin for popcnt (32-bit variant) |
| builtin-popcnt-unrolled | unrolled builtin-popcnt |
| builtin-popcnt-unrolled32 | unrolled builtin-popcnt32 |
| builtin-popcnt-unrolled-errata | unrolled builtin-popcnt avoiding false-dependency |
| builtin-popcnt-unrolled-errata-manual | unrolled builtin-popcnt avoiding false-dependency (assembly code) |
| builtin-popcnt-movdq | builtin-popcnt where data is loaded via SSE registers |
| builtin-popcnt-movdq-unrolled | builtin-popcnt-movdq unrolled |
| builtin-popcnt-movdq-unrolled_manual | builtin-popcnt-movdq unrolled (assembly code) |
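
For orientation, the scalar ideas behind three of the listed procedures (lookup-8, bit-parallel-mul and builtin-popcnt) can be sketched as below. This is a minimal sketch of the textbook formulations, not the benchmarked source: the function names are hypothetical, the benchmarked code may differ in detail, and `__builtin_popcountll` assumes GCC or Clang.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Build a 256-entry table where lut[b] is the number of set bits in byte b.
// (Hypothetical helper; the benchmark's table may be defined differently.)
static std::array<std::uint8_t, 256> make_lut() {
    std::array<std::uint8_t, 256> lut{};
    for (int i = 1; i < 256; ++i) {
        lut[i] = static_cast<std::uint8_t>((i & 1) + lut[i / 2]);
    }
    return lut;
}

// "lookup-8" idea: one table lookup per input byte.
std::uint64_t popcount_lookup8(const std::uint8_t* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    static const std::array<std::uint8_t, 256> lut = make_lut();
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < size; ++i) {
        total += lut[data[i]];
    }
    return total;
}

// "bit-parallel-mul" idea for one 64-bit word: sum bits in 2-, 4- and 8-bit
// fields, then add the eight byte counts with a single multiplication.
std::uint64_t popcount_bit_parallel_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

// "builtin-popcnt" idea: let the compiler emit the popcount for one word.
std::uint64_t popcount_builtin(std::uint64_t x) {
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
}
```

With GCC, the builtin only compiles to the hardware popcnt instruction when the target enables it (for example with `-mpopcnt` or `-march=native`); otherwise a software fallback is used.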

| procedure | 32 B | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B |
|---|---|---|---|---|---|---|---|---|
| lookup-8 | 1.02956 | 0.94836 | 1.04600 | 0.95508 | 1.46021 | 1.42425 | 1.40627 | 1.39751 |
| lookup-64 | 1.00704 | 0.94362 | 1.04179 | 0.96031 | 1.46638 | 1.43074 | 1.41285 | 1.40219 |
| bit-parallel | 1.05678 | 0.95295 | 0.90993 | 0.88909 | 1.40587 | 1.39754 | 1.39337 | 1.40798 |
| bit-parallel-optimized | 0.81278 | 0.69739 | 0.63874 | 0.61126 | 0.95469 | 0.94345 | 0.93783 | 0.94590 |
| bit-parallel-mul | 0.65022 | 0.58277 | 0.55384 | 0.63993 | 0.94622 | 0.90929 | 0.89067 | 0.88138 |
| harley-seal | 0.81279 | 0.66382 | 0.43349 | 0.33865 | 0.46870 | 0.43078 | 0.41181 | 0.40233 |
| sse-bit-parallel | 1.96422 | 1.57389 | 0.92645 | 0.59693 | 0.69071 | 0.51317 | 0.46554 | 0.44174 |
| sse-bit-parallel-original | 0.98698 | 0.64160 | 0.46789 | 0.37692 | 0.53546 | 0.50186 | 0.54272 | 0.53910 |
| sse-bit-parallel-better | 2.37451 | 1.77951 | 0.91255 | 0.55164 | 0.62721 | 0.52551 | 0.45176 | 0.41155 |
| sse-harley-seal | 0.94107 | 0.59645 | 0.43005 | 0.20380 | 0.25607 | 0.22175 | 0.20421 | 0.19545 |
| sse-lookup | 0.40702 | 0.29802 | 0.18291 | 0.15747 | 0.23972 | 0.23061 | 0.22648 | 0.22486 |
| sse-lookup-original | 0.86734 | 0.52831 | 0.35898 | 0.27432 | 0.37118 | 0.38857 | 0.39088 | 0.38906 |
| avx2-lookup | 0.43350 | 0.26391 | 0.17816 | 0.11536 | 0.14846 | 0.13170 | 0.12303 | 0.11898 |
| avx2-lookup-original | 1.64500 | 0.99497 | 0.55478 | 0.34966 | 0.40783 | 0.27806 | 0.23484 | 0.21153 |
| avx2-harley-seal | 0.89417 | 0.51328 | 0.32495 | 0.22880 | 0.17255 | 0.13258 | 0.11234 | 0.10230 |
| cpu | 0.24383 | 0.18810 | 0.13546 | 0.12192 | 0.30881 | 0.25016 | 0.21192 | 0.19269 |
| sse-cpu | 1.74296 | 0.22689 | 0.18052 | 0.16467 | 0.24506 | 0.23503 | 0.25034 | 0.23566 |
| avx2-cpu | 2.25318 | 1.70364 | 0.22751 | 0.19394 | 0.28513 | 0.26503 | 0.25543 | 0.25294 |
| builtin-popcnt | 0.18965 | 0.16911 | 0.15518 | 0.16960 | 0.31822 | 0.28013 | 0.26618 | 0.26028 |
| builtin-popcnt32 | 0.39193 | 0.39391 | 0.56698 | 0.47326 | 0.70798 | 0.68983 | 0.67679 | 0.67369 |
| builtin-popcnt-unrolled | 0.21674 | 0.14901 | 0.12192 | 0.11176 | 0.21719 | 0.18749 | 0.17844 | 0.17597 |
| builtin-popcnt-unrolled32 | 0.36604 | 0.30837 | 0.29083 | 0.27588 | 0.46756 | 0.42812 | 0.41192 | 0.40445 |
| builtin-popcnt-unrolled-errata | 0.24383 | 0.17611 | 0.13547 | 0.11515 | 0.17610 | 0.18231 | 0.17940 | 0.17644 |
| builtin-popcnt-unrolled-errata-manual | 0.28447 | 0.19284 | 0.14273 | 0.11902 | 0.27409 | 0.23065 | 0.19764 | 0.18555 |
| builtin-popcnt-movdq | 0.18978 | 0.17245 | 0.17213 | 0.17727 | 0.32302 | 0.30900 | 0.31118 | 0.30717 |
| builtin-popcnt-movdq-unrolled | 0.27933 | 0.22039 | 0.17509 | 0.16159 | 0.26084 | 0.25552 | 0.24025 | 0.23251 |
| builtin-popcnt-movdq-unrolled_manual | 0.31021 | 0.22818 | 0.18489 | 0.16560 | 0.24995 | 0.24937 | 0.23789 | 0.23281 |

Input size: 32 B

| procedure | time [s] | relative time (less is better) |
|---|---|---|
| lookup-8 | 1.02956 | █████████████████████▋ |
| lookup-64 | 1.00704 | █████████████████████▏ |
| bit-parallel | 1.05678 | ██████████████████████▎ |
| bit-parallel-optimized | 0.81278 | █████████████████ |
| bit-parallel-mul | 0.65022 | █████████████▋ |
| harley-seal | 0.81279 | █████████████████ |
| sse-bit-parallel | 1.96422 | █████████████████████████████████████████▎ |
| sse-bit-parallel-original | 0.98698 | ████████████████████▊ |
| sse-bit-parallel-better | 2.37451 | ██████████████████████████████████████████████████ |
| sse-harley-seal | 0.94107 | ███████████████████▊ |
| sse-lookup | 0.40702 | ████████▌ |
| sse-lookup-original | 0.86734 | ██████████████████▎ |
| avx2-lookup | 0.43350 | █████████▏ |
| avx2-lookup-original | 1.64500 | ██████████████████████████████████▋ |
| avx2-harley-seal | 0.89417 | ██████████████████▊ |
| cpu | 0.24383 | █████▏ |
| sse-cpu | 1.74296 | ████████████████████████████████████▋ |
| avx2-cpu | 2.25318 | ███████████████████████████████████████████████▍ |
| builtin-popcnt | 0.18965 | ███▉ |
| builtin-popcnt32 | 0.39193 | ████████▎ |
| builtin-popcnt-unrolled | 0.21674 | ████▌ |
| builtin-popcnt-unrolled32 | 0.36604 | ███████▋ |
| builtin-popcnt-unrolled-errata | 0.24383 | █████▏ |
| builtin-popcnt-unrolled-errata-manual | 0.28447 | █████▉ |
| builtin-popcnt-movdq | 0.18978 | ███▉ |
| builtin-popcnt-movdq-unrolled | 0.27933 | █████▉ |
| builtin-popcnt-movdq-unrolled_manual | 0.31021 | ██████▌ |

Input size: 64 B

| procedure | time [s] | relative time (less is better) |
|---|---|---|
| lookup-8 | 0.94836 | ██████████████████████████▋ |
| lookup-64 | 0.94362 | ██████████████████████████▌ |
| bit-parallel | 0.95295 | ██████████████████████████▊ |
| bit-parallel-optimized | 0.69739 | ███████████████████▌ |
| bit-parallel-mul | 0.58277 | ████████████████▎ |
| harley-seal | 0.66382 | ██████████████████▋ |
| sse-bit-parallel | 1.57389 | ████████████████████████████████████████████▏ |
| sse-bit-parallel-original | 0.64160 | ██████████████████ |
| sse-bit-parallel-better | 1.77951 | ██████████████████████████████████████████████████ |
| sse-harley-seal | 0.59645 | ████████████████▊ |
| sse-lookup | 0.29802 | ████████▎ |
| sse-lookup-original | 0.52831 | ██████████████▊ |
| avx2-lookup | 0.26391 | ███████▍ |
| avx2-lookup-original | 0.99497 | ███████████████████████████▉ |
| avx2-harley-seal | 0.51328 | ██████████████▍ |
| cpu | 0.18810 | █████▎ |
| sse-cpu | 0.22689 | ██████▍ |
| avx2-cpu | 1.70364 | ███████████████████████████████████████████████▊ |
| builtin-popcnt | 0.16911 | ████▊ |
| builtin-popcnt32 | 0.39391 | ███████████ |
| builtin-popcnt-unrolled | 0.14901 | ████▏ |
| builtin-popcnt-unrolled32 | 0.30837 | ████████▋ |
| builtin-popcnt-unrolled-errata | 0.17611 | ████▉ |
| builtin-popcnt-unrolled-errata-manual | 0.19284 | █████▍ |
| builtin-popcnt-movdq | 0.17245 | ████▊ |
| builtin-popcnt-movdq-unrolled | 0.22039 | ██████▏ |
| builtin-popcnt-movdq-unrolled_manual | 0.22818 | ██████▍ |

Input size: 128 B

| procedure | time [s] | relative time (less is better) |
|---|---|---|
| lookup-8 | 1.04600 | ██████████████████████████████████████████████████ |
| lookup-64 | 1.04179 | █████████████████████████████████████████████████▊ |
| bit-parallel | 0.90993 | ███████████████████████████████████████████▍ |
| bit-parallel-optimized | 0.63874 | ██████████████████████████████▌ |
| bit-parallel-mul | 0.55384 | ██████████████████████████▍ |
| harley-seal | 0.43349 | ████████████████████▋ |
| sse-bit-parallel | 0.92645 | ████████████████████████████████████████████▎ |
| sse-bit-parallel-original | 0.46789 | ██████████████████████▎ |
| sse-bit-parallel-better | 0.91255 | ███████████████████████████████████████████▌ |
| sse-harley-seal | 0.43005 | ████████████████████▌ |
| sse-lookup | 0.18291 | ████████▋ |
| sse-lookup-original | 0.35898 | █████████████████▏ |
| avx2-lookup | 0.17816 | ████████▌ |
| avx2-lookup-original | 0.55478 | ██████████████████████████▌ |
| avx2-harley-seal | 0.32495 | ███████████████▌ |
| cpu | 0.13546 | ██████▍ |
| sse-cpu | 0.18052 | ████████▋ |
| avx2-cpu | 0.22751 | ██████████▉ |
| builtin-popcnt | 0.15518 | ███████▍ |
| builtin-popcnt32 | 0.56698 | ███████████████████████████ |
| builtin-popcnt-unrolled | 0.12192 | █████▊ |
| builtin-popcnt-unrolled32 | 0.29083 | █████████████▉ |
| builtin-popcnt-unrolled-errata | 0.13547 | ██████▍ |
| builtin-popcnt-unrolled-errata-manual | 0.14273 | ██████▊ |
| builtin-popcnt-movdq | 0.17213 | ████████▏ |
| builtin-popcnt-movdq-unrolled | 0.17509 | ████████▎ |
| builtin-popcnt-movdq-unrolled_manual | 0.18489 | ████████▊ |

Input size: 256 B

| procedure | time [s] | relative time (less is better) |
|---|---|---|
| lookup-8 | 0.95508 | █████████████████████████████████████████████████▋ |
| lookup-64 | 0.96031 | ██████████████████████████████████████████████████ |
| bit-parallel | 0.88909 | ██████████████████████████████████████████████▎ |
| bit-parallel-optimized | 0.61126 | ███████████████████████████████▊ |
| bit-parallel-mul | 0.63993 | █████████████████████████████████▎ |
| harley-seal | 0.33865 | █████████████████▋ |
| sse-bit-parallel | 0.59693 | ███████████████████████████████ |
| sse-bit-parallel-original | 0.37692 | ███████████████████▋ |
| sse-bit-parallel-better | 0.55164 | ████████████████████████████▋ |
| sse-harley-seal | 0.20380 | ██████████▌ |
| sse-lookup | 0.15747 | ████████▏ |
| sse-lookup-original | 0.27432 | ██████████████▎ |
| avx2-lookup | 0.11536 | ██████ |
| avx2-lookup-original | 0.34966 | ██████████████████▏ |
| avx2-harley-seal | 0.22880 | ███████████▉ |
| cpu | 0.12192 | ██████▎ |
| sse-cpu | 0.16467 | ████████▌ |
| avx2-cpu | 0.19394 | ██████████ |
| builtin-popcnt | 0.16960 | ████████▊ |
| builtin-popcnt32 | 0.47326 | ████████████████████████▋ |
| builtin-popcnt-unrolled | 0.11176 | █████▊ |
| builtin-popcnt-unrolled32 | 0.27588 | ██████████████▎ |
| builtin-popcnt-unrolled-errata | 0.11515 | █████▉ |
| builtin-popcnt-unrolled-errata-manual | 0.11902 | ██████▏ |
| builtin-popcnt-movdq | 0.17727 | █████████▏ |
| builtin-popcnt-movdq-unrolled | 0.16159 | ████████▍ |
| builtin-popcnt-movdq-unrolled_manual | 0.16560 | ████████▌ |

Input size: 512 B

| procedure | time [s] | relative time (less is better) |
|---|---|---|
| lookup-8 | 1.46021 | █████████████████████████████████████████████████▊ |
| lookup-64 | 1.46638 | ██████████████████████████████████████████████████ |
| bit-parallel | 1.40587 | ███████████████████████████████████████████████▉ |
| bit-parallel-optimized | 0.95469 | ████████████████████████████████▌ |
| bit-parallel-mul | 0.94622 | ████████████████████████████████▎ |
| harley-seal | 0.46870 | ███████████████▉ |
| sse-bit-parallel | 0.69071 | ███████████████████████▌ |
| sse-bit-parallel-original | 0.53546 | ██████████████████▎ |
| sse-bit-parallel-better | 0.62721 | █████████████████████▍ |
| sse-harley-seal | 0.25607 | ████████▋ |
| sse-lookup | 0.23972 | ████████▏ |
| sse-lookup-original | 0.37118 | ████████████▋ |
| avx2-lookup | 0.14846 | █████ |
| avx2-lookup-original | 0.40783 | █████████████▉ |
| avx2-harley-seal | 0.17255 | █████▉ |
| cpu | 0.30881 | ██████████▌ |
| sse-cpu | 0.24506 | ████████▎ |
| avx2-cpu | 0.28513 | █████████▋ |
| builtin-popcnt | 0.31822 | ██████████▊ |
| builtin-popcnt32 | 0.70798 | ████████████████████████▏ |
| builtin-popcnt-unrolled | 0.21719 | ███████▍ |
| builtin-popcnt-unrolled32 | 0.46756 | ███████████████▉ |
| builtin-popcnt-unrolled-errata | 0.17610 | ██████ |
| builtin-popcnt-unrolled-errata-manual | 0.27409 | █████████▎ |
| builtin-popcnt-movdq | 0.32302 | ███████████ |
| builtin-popcnt-movdq-unrolled | 0.26084 | ████████▉ |
| builtin-popcnt-movdq-unrolled_manual | 0.24995 | ████████▌ |

Input size: 1024 B

| procedure | time [s] | relative time (less is better) |
|---|---|---|
| lookup-8 | 1.42425 | █████████████████████████████████████████████████▊ |
| lookup-64 | 1.43074 | ██████████████████████████████████████████████████ |
| bit-parallel | 1.39754 | ████████████████████████████████████████████████▊ |
| bit-parallel-optimized | 0.94345 | ████████████████████████████████▉ |
| bit-parallel-mul | 0.90929 | ███████████████████████████████▊ |
| harley-seal | 0.43078 | ███████████████ |
| sse-bit-parallel | 0.51317 | █████████████████▉ |
| sse-bit-parallel-original | 0.50186 | █████████████████▌ |
| sse-bit-parallel-better | 0.52551 | ██████████████████▎ |
| sse-harley-seal | 0.22175 | ███████▋ |
| sse-lookup | 0.23061 | ████████ |
| sse-lookup-original | 0.38857 | █████████████▌ |
| avx2-lookup | 0.13170 | ████▌ |
| avx2-lookup-original | 0.27806 | █████████▋ |
| avx2-harley-seal | 0.13258 | ████▋ |
| cpu | 0.25016 | ████████▋ |
| sse-cpu | 0.23503 | ████████▏ |
| avx2-cpu | 0.26503 | █████████▎ |
| builtin-popcnt | 0.28013 | █████████▊ |
| builtin-popcnt32 | 0.68983 | ████████████████████████ |
| builtin-popcnt-unrolled | 0.18749 | ██████▌ |
| builtin-popcnt-unrolled32 | 0.42812 | ██████████████▉ |
| builtin-popcnt-unrolled-errata | 0.18231 | ██████▎ |
| builtin-popcnt-unrolled-errata-manual | 0.23065 | ████████ |
| builtin-popcnt-movdq | 0.30900 | ██████████▊ |
| builtin-popcnt-movdq-unrolled | 0.25552 | ████████▉ |
| builtin-popcnt-movdq-unrolled_manual | 0.24937 | ████████▋ |

Input size: 2048 B

| procedure | time [s] | relative time (less is better) |
|---|---|---|
| lookup-8 | 1.40627 | █████████████████████████████████████████████████▊ |
| lookup-64 | 1.41285 | ██████████████████████████████████████████████████ |
| bit-parallel | 1.39337 | █████████████████████████████████████████████████▎ |
| bit-parallel-optimized | 0.93783 | █████████████████████████████████▏ |
| bit-parallel-mul | 0.89067 | ███████████████████████████████▌ |
| harley-seal | 0.41181 | ██████████████▌ |
| sse-bit-parallel | 0.46554 | ████████████████▍ |
| sse-bit-parallel-original | 0.54272 | ███████████████████▏ |
| sse-bit-parallel-better | 0.45176 | ███████████████▉ |
| sse-harley-seal | 0.20421 | ███████▏ |
| sse-lookup | 0.22648 | ████████ |
| sse-lookup-original | 0.39088 | █████████████▊ |
| avx2-lookup | 0.12303 | ████▎ |
| avx2-lookup-original | 0.23484 | ████████▎ |
| avx2-harley-seal | 0.11234 | ███▉ |
| cpu | 0.21192 | ███████▍ |
| sse-cpu | 0.25034 | ████████▊ |
| avx2-cpu | 0.25543 | █████████ |
| builtin-popcnt | 0.26618 | █████████▍ |
| builtin-popcnt32 | 0.67679 | ███████████████████████▉ |
| builtin-popcnt-unrolled | 0.17844 | ██████▎ |
| builtin-popcnt-unrolled32 | 0.41192 | ██████████████▌ |
| builtin-popcnt-unrolled-errata | 0.17940 | ██████▎ |
| builtin-popcnt-unrolled-errata-manual | 0.19764 | ██████▉ |
| builtin-popcnt-movdq | 0.31118 | ███████████ |
| builtin-popcnt-movdq-unrolled | 0.24025 | ████████▌ |
| builtin-popcnt-movdq-unrolled_manual | 0.23789 | ████████▍ |

Input size: 4096 B

| procedure | time [s] | relative time (less is better) |
|---|---|---|
| lookup-8 | 1.39751 | █████████████████████████████████████████████████▋ |
| lookup-64 | 1.40219 | █████████████████████████████████████████████████▊ |
| bit-parallel | 1.40798 | ██████████████████████████████████████████████████ |
| bit-parallel-optimized | 0.94590 | █████████████████████████████████▌ |
| bit-parallel-mul | 0.88138 | ███████████████████████████████▎ |
| harley-seal | 0.40233 | ██████████████▎ |
| sse-bit-parallel | 0.44174 | ███████████████▋ |
| sse-bit-parallel-original | 0.53910 | ███████████████████▏ |
| sse-bit-parallel-better | 0.41155 | ██████████████▌ |
| sse-harley-seal | 0.19545 | ██████▉ |
| sse-lookup | 0.22486 | ███████▉ |
| sse-lookup-original | 0.38906 | █████████████▊ |
| avx2-lookup | 0.11898 | ████▏ |
| avx2-lookup-original | 0.21153 | ███████▌ |
| avx2-harley-seal | 0.10230 | ███▋ |
| cpu | 0.19269 | ██████▊ |
| sse-cpu | 0.23566 | ████████▎ |
| avx2-cpu | 0.25294 | ████████▉ |
| builtin-popcnt | 0.26028 | █████████▏ |
| builtin-popcnt32 | 0.67369 | ███████████████████████▉ |
| builtin-popcnt-unrolled | 0.17597 | ██████▏ |
| builtin-popcnt-unrolled32 | 0.40445 | ██████████████▎ |
| builtin-popcnt-unrolled-errata | 0.17644 | ██████▎ |
| builtin-popcnt-unrolled-errata-manual | 0.18555 | ██████▌ |
| builtin-popcnt-movdq | 0.30717 | ██████████▉ |
| builtin-popcnt-movdq-unrolled | 0.23251 | ████████▎ |
| builtin-popcnt-movdq-unrolled_manual | 0.23281 | ████████▎ |

Speedup relative to lookup-8 (higher is better)

| procedure | 32 B | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B |
|---|---|---|---|---|---|---|---|---|
| lookup-8 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| lookup-64 | 1.02 | 1.01 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
| bit-parallel | 0.97 | 1.00 | 1.15 | 1.07 | 1.04 | 1.02 | 1.01 | 0.99 |
| bit-parallel-optimized | 1.27 | 1.36 | 1.64 | 1.56 | 1.53 | 1.51 | 1.50 | 1.48 |
| bit-parallel-mul | 1.58 | 1.63 | 1.89 | 1.49 | 1.54 | 1.57 | 1.58 | 1.59 |
| harley-seal | 1.27 | 1.43 | 2.41 | 2.82 | 3.12 | 3.31 | 3.41 | 3.47 |
| sse-bit-parallel | 0.52 | 0.60 | 1.13 | 1.60 | 2.11 | 2.78 | 3.02 | 3.16 |
| sse-bit-parallel-original | 1.04 | 1.48 | 2.24 | 2.53 | 2.73 | 2.84 | 2.59 | 2.59 |
| sse-bit-parallel-better | 0.43 | 0.53 | 1.15 | 1.73 | 2.33 | 2.71 | 3.11 | 3.40 |
| sse-harley-seal | 1.09 | 1.59 | 2.43 | 4.69 | 5.70 | 6.42 | 6.89 | 7.15 |
| sse-lookup | 2.53 | 3.18 | 5.72 | 6.07 | 6.09 | 6.18 | 6.21 | 6.22 |
| sse-lookup-original | 1.19 | 1.80 | 2.91 | 3.48 | 3.93 | 3.67 | 3.60 | 3.59 |
| avx2-lookup | 2.38 | 3.59 | 5.87 | 8.28 | 9.84 | 10.81 | 11.43 | 11.75 |
| avx2-lookup-original | 0.63 | 0.95 | 1.89 | 2.73 | 3.58 | 5.12 | 5.99 | 6.61 |
| avx2-harley-seal | 1.15 | 1.85 | 3.22 | 4.17 | 8.46 | 10.74 | 12.52 | 13.66 |
| cpu | 4.22 | 5.04 | 7.72 | 7.83 | 4.73 | 5.69 | 6.64 | 7.25 |
| sse-cpu | 0.59 | 4.18 | 5.79 | 5.80 | 5.96 | 6.06 | 5.62 | 5.93 |
| avx2-cpu | 0.46 | 0.56 | 4.60 | 4.92 | 5.12 | 5.37 | 5.51 | 5.53 |
| builtin-popcnt | 5.43 | 5.61 | 6.74 | 5.63 | 4.59 | 5.08 | 5.28 | 5.37 |
| builtin-popcnt32 | 2.63 | 2.41 | 1.84 | 2.02 | 2.06 | 2.06 | 2.08 | 2.07 |
| builtin-popcnt-unrolled | 4.75 | 6.36 | 8.58 | 8.55 | 6.72 | 7.60 | 7.88 | 7.94 |
| builtin-popcnt-unrolled32 | 2.81 | 3.08 | 3.60 | 3.46 | 3.12 | 3.33 | 3.41 | 3.46 |
| builtin-popcnt-unrolled-errata | 4.22 | 5.38 | 7.72 | 8.29 | 8.29 | 7.81 | 7.84 | 7.92 |
| builtin-popcnt-unrolled-errata-manual | 3.62 | 4.92 | 7.33 | 8.02 | 5.33 | 6.17 | 7.12 | 7.53 |
| builtin-popcnt-movdq | 5.43 | 5.50 | 6.08 | 5.39 | 4.52 | 4.61 | 4.52 | 4.55 |
| builtin-popcnt-movdq-unrolled | 3.69 | 4.30 | 5.97 | 5.91 | 5.60 | 5.57 | 5.85 | 6.01 |
| builtin-popcnt-movdq-unrolled_manual | 3.32 | 4.16 | 5.66 | 5.77 | 5.84 | 5.71 | 5.91 | 6.00 |
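
The values in the last table are consistent with dividing the lookup-8 time by each procedure's time at the same input size and rounding to two decimals; for example, avx2-harley-seal at 4096 B gives 1.39751 / 0.10230 ≈ 13.66. A minimal sketch of that calculation follows. The helper names are hypothetical, and the rounding rule is an assumption inferred from the published numbers, not taken from the benchmark scripts.

```cpp
#include <cassert>
#include <cmath>

// Ratio of the baseline (lookup-8) time to a procedure's time,
// both measured at the same input size. Values above 1 mean "faster".
double speedup(double baseline_seconds, double procedure_seconds) {
    assert(procedure_seconds > 0.0);
    return baseline_seconds / procedure_seconds;
}

// Round to two decimal places, matching the precision shown in the table.
double round2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For instance, `round2(speedup(1.02956, 1.00704))` reproduces the 1.02 listed for lookup-64 at 32 B.
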
Download skylake-i7-6700-gcc5.3.0-avx2.csv