Population count comparison for Skylake Core i7-6700 CPU @ 3.40GHz
Generated on: 2016-03-26
CPU: Skylake Core i7-6700 CPU @ 3.40GHz
Compiler: clang 3.8.0-svn257311-1~exp1 (Ubuntu)
Instruction set: AVX2
Number of runs: 5
All times are given in seconds.
| procedure | description |
|---|---|
| lookup-8 | lookup in std::uint8_t[256] LUT |
| lookup-64 | lookup in std::uint64_t[256] LUT |
| bit-parallel | naive bit-parallel method |
| bit-parallel-optimized | a slightly better bit-parallel method |
| bit-parallel-mul | bit-parallel with fewer instructions |
| harley-seal | Harley-Seal popcount (4th iteration) |
| sse-bit-parallel | SSE implementation of bit-parallel-optimized (unrolled) |
| sse-bit-parallel-original | SSE implementation of bit-parallel-optimized |
| sse-bit-parallel-better | SSE implementation of bit-parallel with fewer instructions |
| sse-harley-seal | SSE implementation of Harley-Seal |
| sse-lookup | SSSE3 variant using pshufb instruction (unrolled) |
| sse-lookup-original | SSSE3 variant using pshufb instruction |
| avx2-lookup | AVX2 variant using pshufb instruction (unrolled) |
| avx2-lookup-original | AVX2 variant using pshufb instruction |
| avx2-harley-seal | AVX2 implementation of Harley-Seal |
| cpu | CPU instruction popcnt (64-bit variant) |
| sse-cpu | load data with SSE, then count bits using popcnt |
| avx2-cpu | load data with AVX2, then count bits using popcnt |
| builtin-popcnt | builtin for popcnt |
| builtin-popcnt32 | builtin for popcnt (32-bit variant) |
| builtin-popcnt-unrolled | unrolled builtin-popcnt |
| builtin-popcnt-unrolled32 | unrolled builtin-popcnt32 |
| builtin-popcnt-unrolled-errata | unrolled builtin-popcnt avoiding false dependency |
| builtin-popcnt-unrolled-errata-manual | unrolled builtin-popcnt avoiding false dependency (assembly code) |
| builtin-popcnt-movdq | builtin-popcnt where data is loaded via SSE registers |
| builtin-popcnt-movdq-unrolled | builtin-popcnt-movdq unrolled |
| builtin-popcnt-movdq-unrolled_manual | builtin-popcnt-movdq unrolled (assembly code) |
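For orientation, here is a minimal portable sketch of three of the scalar procedures listed above: a byte lookup table (as in lookup-8), the naive bit-parallel reduction, and the multiplication-based variant (as in bit-parallel-mul). It is not the benchmarked code. The function names, the lazily built table, and the `popcount_words` helper are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// lookup-8 style: one table entry per byte value.
// Hypothetical helper: the table is built lazily on first use.
inline const std::uint8_t* popcount_lut() {
    static std::uint8_t lut[256];  // zero-initialized, so lut[0] == 0
    static bool ready = false;
    if (!ready) {
        for (int i = 1; i < 256; ++i)
            lut[i] = static_cast<std::uint8_t>((i & 1) + lut[i / 2]);
        ready = true;
    }
    return lut;
}

inline std::uint64_t popcount_lookup8(const std::uint8_t* data, std::size_t n) {
    const std::uint8_t* lut = popcount_lut();
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += lut[data[i]];
    return total;
}

// Naive bit-parallel: add neighbouring 1-, 2-, 4-, 8-, 16-, 32-bit fields.
inline std::uint64_t popcount_word_bit_parallel(std::uint64_t x) {
    x = (x & 0x5555555555555555ULL) + ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x & 0x0f0f0f0f0f0f0f0fULL) + ((x >> 4) & 0x0f0f0f0f0f0f0f0fULL);
    x = (x & 0x00ff00ff00ff00ffULL) + ((x >> 8) & 0x00ff00ff00ff00ffULL);
    x = (x & 0x0000ffff0000ffffULL) + ((x >> 16) & 0x0000ffff0000ffffULL);
    x = (x & 0x00000000ffffffffULL) + (x >> 32);
    return x;
}

// Fewer instructions: reduce to per-byte counts, then let one
// multiplication add all eight bytes into the top byte.
inline std::uint64_t popcount_word_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

// Hypothetical driver: applies a per-word popcount to a byte buffer,
// zero-padding the final partial word.
template <typename F>
std::uint64_t popcount_words(const std::uint8_t* data, std::size_t n, F per_word) {
    std::uint64_t total = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, data + i, 8);  // safe unaligned load
        total += per_word(w);
    }
    if (i < n) {
        std::uint64_t tail = 0;
        std::memcpy(&tail, data + i, n - i);
        total += per_word(tail);
    }
    return total;
}
```

All three paths should return the same count for any buffer, for example `popcount_lookup8(buf, n)` and `popcount_words(buf, n, popcount_word_mul)`. The SSE, AVX2 and popcnt-based procedures need intrinsics or inline assembly and are not sketched here.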
| procedure | 32 B | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B |
|---|---|---|---|---|---|---|---|---|
| lookup-8 | 1.05663 | 0.98891 | 1.15859 | 1.06665 | 1.63234 | 1.59694 | 1.57876 | 1.56954 |
| lookup-64 | 1.03518 | 0.95493 | 1.06125 | 0.97176 | 1.47745 | 1.44230 | 1.42358 | 1.41498 |
| bit-parallel | 1.15605 | 1.07004 | 1.03226 | 1.01322 | 1.60619 | 1.63027 | 1.59463 | 1.60044 |
| bit-parallel-optimized | 0.82633 | 0.71118 | 0.65648 | 0.63280 | 0.99373 | 1.02343 | 0.97960 | 0.98773 |
| bit-parallel-mul | 0.67734 | 0.60370 | 0.58192 | 0.66570 | 0.98413 | 0.94520 | 0.92402 | 0.91287 |
| harley-seal | 1.32884 | 0.93001 | 0.45719 | 0.35052 | 0.47554 | 0.43295 | 0.41168 | 0.40105 |
| sse-bit-parallel | 1.11677 | 0.99346 | 0.64180 | 0.45853 | 0.58837 | 0.51578 | 0.47959 | 0.46139 |
| sse-bit-parallel-original | 0.85411 | 0.59378 | 0.44258 | 0.36574 | 0.55395 | 0.53442 | 0.52128 | 0.51927 |
| sse-bit-parallel-better | 1.11683 | 1.02102 | 0.62236 | 0.43211 | 0.54241 | 0.46772 | 0.43047 | 0.41185 |
| sse-harley-seal | 0.91623 | 0.59038 | 0.42968 | 0.21537 | 0.26776 | 0.22766 | 0.20777 | 0.19780 |
| sse-lookup | 0.37931 | 0.25738 | 0.17610 | 0.15084 | 0.23039 | 0.22113 | 0.21674 | 0.21457 |
| sse-lookup-original | 0.77295 | 0.51175 | 0.34690 | 0.25996 | 0.39734 | 0.33790 | 0.35048 | 0.36659 |
| avx2-lookup | 0.32512 | 0.20320 | 0.14225 | 0.09702 | 0.13477 | 0.12895 | 0.12574 | 0.12404 |
| avx2-lookup-original | 1.12234 | 0.69069 | 0.39373 | 0.23961 | 0.26304 | 0.23145 | 0.20544 | 0.19114 |
| avx2-harley-seal | 0.87193 | 0.50367 | 0.32202 | 0.22904 | 0.18195 | 0.13830 | 0.11638 | 0.10531 |
| cpu | 0.21674 | 0.17465 | 0.12403 | 0.13021 | 0.23828 | 0.32969 | 0.33816 | 0.34263 |
| sse-cpu | 1.14629 | 0.18965 | 0.16283 | 0.15636 | 0.25773 | 0.26459 | 0.27349 | 0.27067 |
| avx2-cpu | 1.11650 | 0.99273 | 0.16197 | 0.16020 | 0.26421 | 0.27079 | 0.27794 | 0.28788 |
| builtin-popcnt | 0.21333 | 0.18147 | 0.11514 | 0.09631 | 0.14393 | 0.13328 | 0.13867 | 0.13432 |
| builtin-popcnt32 | 0.65022 | 0.46059 | 0.44703 | 0.44026 | 0.69899 | 0.70140 | 0.69760 | 0.69559 |
| builtin-popcnt-unrolled | 0.21674 | 0.21674 | 0.21674 | 0.21674 | 0.34678 | 0.35753 | 0.35204 | 0.34947 |
| builtin-popcnt-unrolled32 | 0.43348 | 0.43348 | 0.43348 | 0.43348 | 0.71508 | 0.70412 | 0.69895 | 0.69621 |
| builtin-popcnt-unrolled-errata | 0.32512 | 0.32511 | 0.32511 | 0.32511 | 0.52018 | 0.52051 | 0.52018 | 0.52026 |
| builtin-popcnt-unrolled-errata-manual | 0.27092 | 0.17610 | 0.13706 | 0.12090 | 0.17459 | 0.21746 | 0.19642 | 0.18500 |
| builtin-popcnt-movdq | 0.20790 | 0.18105 | 0.16867 | 0.16783 | 0.26803 | 0.28308 | 0.28513 | 0.28493 |
| builtin-popcnt-movdq-unrolled | 0.29802 | 0.20244 | 0.16924 | 0.15661 | 0.24175 | 0.25688 | 0.24022 | 0.23233 |
| builtin-popcnt-movdq-unrolled_manual | 0.29802 | 0.20352 | 0.16917 | 0.15727 | 0.24246 | 0.25278 | 0.23827 | 0.23136 |
[Bar chart, 32 B input: time [s] per procedure with relative time bars (less is better); the times match the 32 B column of the timing table.]
[Bar chart, 64 B input: time [s] per procedure with relative time bars (less is better); the times match the 64 B column of the timing table.]
[Bar chart, 128 B input: time [s] per procedure with relative time bars (less is better); the times match the 128 B column of the timing table.]
[Bar chart, 256 B input: time [s] per procedure with relative time bars (less is better); the times match the 256 B column of the timing table.]
[Bar chart, 512 B input: time [s] per procedure with relative time bars (less is better); the times match the 512 B column of the timing table.]
[Bar chart, 1024 B input: time [s] per procedure with relative time bars (less is better); the times match the 1024 B column of the timing table.]
[Bar chart, 2048 B input: time [s] per procedure with relative time bars (less is better); the times match the 2048 B column of the timing table.]
[Bar chart, 4096 B input: time [s] per procedure with relative time bars (less is better); the times match the 4096 B column of the timing table.]
Speedup relative to lookup-8 (higher is better)

| procedure | 32 B | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B |
|---|---|---|---|---|---|---|---|---|
| lookup-8 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| lookup-64 | 1.02 | 1.04 | 1.09 | 1.10 | 1.10 | 1.11 | 1.11 | 1.11 |
| bit-parallel | 0.91 | 0.92 | 1.12 | 1.05 | 1.02 | 0.98 | 0.99 | 0.98 |
| bit-parallel-optimized | 1.28 | 1.39 | 1.76 | 1.69 | 1.64 | 1.56 | 1.61 | 1.59 |
| bit-parallel-mul | 1.56 | 1.64 | 1.99 | 1.60 | 1.66 | 1.69 | 1.71 | 1.72 |
| harley-seal | 0.80 | 1.06 | 2.53 | 3.04 | 3.43 | 3.69 | 3.83 | 3.91 |
| sse-bit-parallel | 0.95 | 1.00 | 1.81 | 2.33 | 2.77 | 3.10 | 3.29 | 3.40 |
| sse-bit-parallel-original | 1.24 | 1.67 | 2.62 | 2.92 | 2.95 | 2.99 | 3.03 | 3.02 |
| sse-bit-parallel-better | 0.95 | 0.97 | 1.86 | 2.47 | 3.01 | 3.41 | 3.67 | 3.81 |
| sse-harley-seal | 1.15 | 1.68 | 2.70 | 4.95 | 6.10 | 7.01 | 7.60 | 7.93 |
| sse-lookup | 2.79 | 3.84 | 6.58 | 7.07 | 7.09 | 7.22 | 7.28 | 7.31 |
| sse-lookup-original | 1.37 | 1.93 | 3.34 | 4.10 | 4.11 | 4.73 | 4.50 | 4.28 |
| avx2-lookup | 3.25 | 4.87 | 8.14 | 10.99 | 12.11 | 12.38 | 12.56 | 12.65 |
| avx2-lookup-original | 0.94 | 1.43 | 2.94 | 4.45 | 6.21 | 6.90 | 7.68 | 8.21 |
| avx2-harley-seal | 1.21 | 1.96 | 3.60 | 4.66 | 8.97 | 11.55 | 13.57 | 14.90 |
| cpu | 4.88 | 5.66 | 9.34 | 8.19 | 6.85 | 4.84 | 4.67 | 4.58 |
| sse-cpu | 0.92 | 5.21 | 7.12 | 6.82 | 6.33 | 6.04 | 5.77 | 5.80 |
| avx2-cpu | 0.95 | 1.00 | 7.15 | 6.66 | 6.18 | 5.90 | 5.68 | 5.45 |
| builtin-popcnt | 4.95 | 5.45 | 10.06 | 11.08 | 11.34 | 11.98 | 11.38 | 11.68 |
| builtin-popcnt32 | 1.63 | 2.15 | 2.59 | 2.42 | 2.34 | 2.28 | 2.26 | 2.26 |
| builtin-popcnt-unrolled | 4.88 | 4.56 | 5.35 | 4.92 | 4.71 | 4.47 | 4.48 | 4.49 |
| builtin-popcnt-unrolled32 | 2.44 | 2.28 | 2.67 | 2.46 | 2.28 | 2.27 | 2.26 | 2.25 |
| builtin-popcnt-unrolled-errata | 3.25 | 3.04 | 3.56 | 3.28 | 3.14 | 3.07 | 3.04 | 3.02 |
| builtin-popcnt-unrolled-errata-manual | 3.90 | 5.62 | 8.45 | 8.82 | 9.35 | 7.34 | 8.04 | 8.48 |
| builtin-popcnt-movdq | 5.08 | 5.46 | 6.87 | 6.36 | 6.09 | 5.64 | 5.54 | 5.51 |
| builtin-popcnt-movdq-unrolled | 3.55 | 4.89 | 6.85 | 6.81 | 6.75 | 6.22 | 6.57 | 6.76 |
| builtin-popcnt-movdq-unrolled_manual | 3.55 | 4.86 | 6.85 | 6.78 | 6.73 | 6.32 | 6.63 | 6.78 |
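The ratios above appear to be the lookup-8 time divided by the procedure's time for the same input size, rounded to two decimals. A small sketch of that arithmetic follows. The helper names `speedup` and `round2` are hypothetical, and the rounding rule is an assumption that happens to reproduce the entries checked.

```cpp
#include <cmath>

// Speedup of a procedure relative to a baseline:
// baseline time divided by the procedure's time (both in seconds).
inline double speedup(double baseline_seconds, double procedure_seconds) {
    return baseline_seconds / procedure_seconds;
}

// Round to two decimal places (assumed to match the table's precision).
inline double round2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For example, `round2(speedup(1.05663, 0.21674))` gives 4.88, the 32 B entry for cpu, and `round2(speedup(1.56954, 0.10531))` gives 14.90, the 4096 B entry for avx2-harley-seal.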
Download skylake-i7-6700-clang3.8.0-avx2.csv