Add AVX2 int8 summation (maddubs variant) #8

Merged
WojciechMula merged 2 commits into WojciechMula:master from mayeut:sse-sumbytes-int8-maddubs on Feb 2, 2019

mayeut commented Feb 1, 2019

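This maddubs variant is much faster on my Haswell CPU: about 0.023 cycle/op (best), nearly 3x faster than the existing AVX2 sadbw variants (0.065 to 0.073 cycle/op).
I expect it to do better on Skylake as well, given how the latency and throughput of sadbw and maddubs changed between the two architectures, but that remains to be confirmed.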

Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz (Haswell)
Apple LLVM version 10.0.0 (clang-1000.11.45.5)

element count 4096
rdtsc_overhead set to 24
scalar                        	:     0.138 cycle/op (best)    0.162 cycle/op (avg)
scalar (C++)                  	:     0.122 cycle/op (best)    0.140 cycle/op (avg)
SSE                           	:     0.305 cycle/op (best)    0.350 cycle/op (avg)
SSE (v2)                      	:     0.305 cycle/op (best)    0.326 cycle/op (avg)
SSE (sadbw)                   	:     0.138 cycle/op (best)    0.154 cycle/op (avg)
SSE (sadbw, unrolled)         	:     0.153 cycle/op (best)    0.156 cycle/op (avg)
AVX2                          	:     0.166 cycle/op (best)    0.183 cycle/op (avg)
AVX2 (v2)                     	:     0.153 cycle/op (best)    0.159 cycle/op (avg)
AVX2 (sadbw)                  	:     0.073 cycle/op (best)    0.077 cycle/op (avg)
AVX2 (sadbw, unrolled)        	:     0.066 cycle/op (best)    0.083 cycle/op (avg)
AVX2 (maddubs)                	:     0.024 cycle/op (best)    0.032 cycle/op (avg)
element count 16384
scalar                        	:     0.124 cycle/op (best)    0.140 cycle/op (avg)
scalar (C++)                  	:     0.124 cycle/op (best)    0.142 cycle/op (avg)
SSE                           	:     0.304 cycle/op (best)    0.338 cycle/op (avg)
SSE (v2)                      	:     0.304 cycle/op (best)    0.334 cycle/op (avg)
SSE (sadbw)                   	:     0.136 cycle/op (best)    0.154 cycle/op (avg)
SSE (sadbw, unrolled)         	:     0.136 cycle/op (best)    0.144 cycle/op (avg)
AVX2                          	:     0.160 cycle/op (best)    0.175 cycle/op (avg)
AVX2 (v2)                     	:     0.152 cycle/op (best)    0.169 cycle/op (avg)
AVX2 (sadbw)                  	:     0.070 cycle/op (best)    0.075 cycle/op (avg)
AVX2 (sadbw, unrolled)        	:     0.065 cycle/op (best)    0.077 cycle/op (avg)
AVX2 (maddubs)                	:     0.023 cycle/op (best)    0.026 cycle/op (avg)
element count 32768
scalar                        	:     0.122 cycle/op (best)    0.141 cycle/op (avg)
scalar (C++)                  	:     0.122 cycle/op (best)    0.139 cycle/op (avg)
SSE                           	:     0.304 cycle/op (best)    0.340 cycle/op (avg)
SSE (v2)                      	:     0.304 cycle/op (best)    0.334 cycle/op (avg)
SSE (sadbw)                   	:     0.135 cycle/op (best)    0.178 cycle/op (avg)
SSE (sadbw, unrolled)         	:     0.136 cycle/op (best)    0.171 cycle/op (avg)
AVX2                          	:     0.160 cycle/op (best)    0.179 cycle/op (avg)
AVX2 (v2)                     	:     0.152 cycle/op (best)    0.165 cycle/op (avg)
AVX2 (sadbw)                  	:     0.070 cycle/op (best)    0.081 cycle/op (avg)
AVX2 (sadbw, unrolled)        	:     0.066 cycle/op (best)    0.072 cycle/op (avg)
AVX2 (maddubs)                	:     0.023 cycle/op (best)    0.025 cycle/op (avg)
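
For readers who have not seen the trick, here is a minimal sketch of how a maddubs-based int8 summation can work. It is not necessarily the code in this PR. `_mm256_maddubs_epi16` multiplies unsigned bytes by signed bytes and adds adjacent pairs into int16, so multiplying by a vector of ones yields pairwise sums of the signed input; `_mm256_madd_epi16` with ones then widens those to int32. The function names (`sum_int8_maddubs`, `sum_int8`, `sum_int8_scalar`) are hypothetical, and the sketch assumes GCC or Clang (for `__attribute__((target))` and `__builtin_cpu_supports`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#endif

// Reference implementation: plain scalar sum of signed bytes.
int32_t sum_int8_scalar(const int8_t* data, size_t size) {
    int32_t sum = 0;
    for (size_t i = 0; i < size; i++) sum += data[i];
    return sum;
}

#ifdef SKETCH_HAVE_X86
// Hypothetical maddubs-based variant; compiled for AVX2 regardless of -m flags.
__attribute__((target("avx2")))
int32_t sum_int8_maddubs(const int8_t* data, size_t size) {
    const __m256i ones8  = _mm256_set1_epi8(1);
    const __m256i ones16 = _mm256_set1_epi16(1);
    __m256i acc = _mm256_setzero_si256();  // 8 x int32 partial sums

    size_t i = 0;
    for (; i + 32 <= size; i += 32) {
        const __m256i v =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        // 1 (unsigned) * v (signed), adjacent pairs added: 16 x int16.
        // Each pair sum lies in [-256, 254], so the saturation never triggers.
        const __m256i t16 = _mm256_maddubs_epi16(ones8, v);
        // Multiply by 1 and add adjacent pairs again: 8 x int32.
        const __m256i t32 = _mm256_madd_epi16(t16, ones16);
        acc = _mm256_add_epi32(acc, t32);
    }

    // Horizontal sum of the 8 int32 lanes.
    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int32_t sum = 0;
    for (int k = 0; k < 8; k++) sum += lanes[k];

    // Handle the tail (fewer than 32 remaining bytes) with scalar code.
    return sum + sum_int8_scalar(data + i, size - i);
}
#endif

// Picks the AVX2 path at run time when the CPU supports it.
int32_t sum_int8(const int8_t* data, size_t size) {
    assert(data != nullptr || size == 0);
#ifdef SKETCH_HAVE_X86
    if (__builtin_cpu_supports("avx2")) return sum_int8_maddubs(data, size);
#endif
    return sum_int8_scalar(data, size);
}
```

Calling `sum_int8(buffer, n)` on any `int8_t` buffer should return the same value as `sum_int8_scalar(buffer, n)`. The loop body is only two multiply-add instructions plus one add per 32 bytes, which is a plausible reason for the low cycle/op figures above, though the sketch itself has not been benchmarked.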

mayeut added some commits Feb 1, 2019

Add AVX2 int8 summation
Add AVX2 int8 summation sadbw variant
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz (Haswell)
Apple LLVM version 10.0.0 (clang-1000.11.45.5)

element count 4096
rdtsc_overhead set to 24
scalar                        	:     0.137 cycle/op (best)    0.146 cycle/op (avg)
scalar (C++)                  	:     0.126 cycle/op (best)    0.142 cycle/op (avg)
SSE                           	:     0.305 cycle/op (best)    0.314 cycle/op (avg)
SSE (v2)                      	:     0.303 cycle/op (best)    0.327 cycle/op (avg)
SSE (sadbw)                   	:     0.137 cycle/op (best)    0.155 cycle/op (avg)
SSE (sadbw, unrolled)         	:     0.143 cycle/op (best)    0.154 cycle/op (avg)
AVX2                          	:     0.176 cycle/op (best)    0.195 cycle/op (avg)
AVX2 (v2)                     	:     0.153 cycle/op (best)    0.163 cycle/op (avg)
AVX2 (sadbw)                  	:     0.071 cycle/op (best)    0.075 cycle/op (avg)
AVX2 (sadbw, unrolled)        	:     0.068 cycle/op (best)    0.072 cycle/op (avg)
AVX2 (sadbw, variant)         	:     0.043 cycle/op (best)    0.046 cycle/op (avg)
AVX2 (maddubs)                	:     0.023 cycle/op (best)    0.028 cycle/op (avg)
element count 16384
scalar                        	:     0.118 cycle/op (best)    0.136 cycle/op (avg)
scalar (C++)                  	:     0.121 cycle/op (best)    0.129 cycle/op (avg)
SSE                           	:     0.295 cycle/op (best)    0.336 cycle/op (avg)
SSE (v2)                      	:     0.304 cycle/op (best)    0.322 cycle/op (avg)
SSE (sadbw)                   	:     0.130 cycle/op (best)    0.154 cycle/op (avg)
SSE (sadbw, unrolled)         	:     0.131 cycle/op (best)    0.164 cycle/op (avg)
AVX2                          	:     0.160 cycle/op (best)    0.169 cycle/op (avg)
AVX2 (v2)                     	:     0.152 cycle/op (best)    0.163 cycle/op (avg)
AVX2 (sadbw)                  	:     0.068 cycle/op (best)    0.077 cycle/op (avg)
AVX2 (sadbw, unrolled)        	:     0.062 cycle/op (best)    0.067 cycle/op (avg)
AVX2 (sadbw, variant)         	:     0.041 cycle/op (best)    0.046 cycle/op (avg)
AVX2 (maddubs)                	:     0.023 cycle/op (best)    0.025 cycle/op (avg)
element count 32768
scalar                        	:     0.124 cycle/op (best)    0.138 cycle/op (avg)
scalar (C++)                  	:     0.123 cycle/op (best)    0.142 cycle/op (avg)
SSE                           	:     0.304 cycle/op (best)    0.337 cycle/op (avg)
SSE (v2)                      	:     0.304 cycle/op (best)    0.325 cycle/op (avg)
SSE (sadbw)                   	:     0.136 cycle/op (best)    0.148 cycle/op (avg)
SSE (sadbw, unrolled)         	:     0.136 cycle/op (best)    0.152 cycle/op (avg)
AVX2                          	:     0.160 cycle/op (best)    0.170 cycle/op (avg)
AVX2 (v2)                     	:     0.152 cycle/op (best)    0.162 cycle/op (avg)
AVX2 (sadbw)                  	:     0.071 cycle/op (best)    0.073 cycle/op (avg)
AVX2 (sadbw, unrolled)        	:     0.064 cycle/op (best)    0.067 cycle/op (avg)
AVX2 (sadbw, variant)         	:     0.040 cycle/op (best)    0.041 cycle/op (avg)
AVX2 (maddubs)                	:     0.023 cycle/op (best)    0.025 cycle/op (avg)

mayeut commented Feb 2, 2019

I pushed a second commit with a sadbw variant that might outperform maddubs on Skylake. It does not on Haswell (0.040 to 0.043 vs. 0.023 cycle/op, best).
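
As background, a common way to sum signed bytes with sadbw is sketched below. It shows the general technique only, not the specific variant pushed in the second commit. sadbw sums absolute differences of unsigned bytes, so the signed input is first biased by 128 (an XOR with 0x80), summed against zero, and the bias is subtracted at the end. The function names are hypothetical, and the sketch assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#endif

// Reference implementation: plain scalar sum of signed bytes.
int32_t sum_int8_scalar(const int8_t* data, size_t size) {
    int32_t sum = 0;
    for (size_t i = 0; i < size; i++) sum += data[i];
    return sum;
}

#ifdef SKETCH_HAVE_X86
// Hypothetical sadbw-based variant using the bias-by-128 trick.
__attribute__((target("avx2")))
int32_t sum_int8_sadbw(const int8_t* data, size_t size) {
    const __m256i bias = _mm256_set1_epi8(-128);  // bit pattern 0x80 per byte
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;  // 4 x int64 partial sums of biased bytes

    size_t i = 0;
    for (; i + 32 <= size; i += 32) {
        const __m256i v =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        // XOR with 0x80 maps a signed byte x to the unsigned value x + 128.
        const __m256i u = _mm256_xor_si256(v, bias);
        // |u - 0| summed over each group of 8 bytes into a 64-bit lane.
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(u, zero));
    }

    alignas(32) int64_t lanes[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    const int64_t biased = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Remove the bias: i bytes were processed, each shifted up by 128.
    const int32_t sum =
        static_cast<int32_t>(biased - 128 * static_cast<int64_t>(i));

    // Handle the tail (fewer than 32 remaining bytes) with scalar code.
    return sum + sum_int8_scalar(data + i, size - i);
}
#endif

// Picks the AVX2 path at run time when the CPU supports it.
int32_t sum_int8_via_sadbw(const int8_t* data, size_t size) {
    assert(data != nullptr || size == 0);
#ifdef SKETCH_HAVE_X86
    if (__builtin_cpu_supports("avx2")) return sum_int8_sadbw(data, size);
#endif
    return sum_int8_scalar(data, size);
}
```

Compared with the maddubs approach, this needs an extra XOR per vector and a final bias correction, but it accumulates in 64-bit lanes, so overflow of the partial sums is not a practical concern.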


WojciechMula commented Feb 2, 2019

Hi, thanks a lot. I didn't even consider the maddubs instruction :) Love it.

Will update the article soon.

@WojciechMula WojciechMula reopened this Feb 2, 2019

@WojciechMula WojciechMula merged commit 0f17148 into WojciechMula:master Feb 2, 2019


WojciechMula commented Feb 2, 2019

The button "Comment and close" is bigger than "Merge" :)

@mayeut mayeut deleted the mayeut:sse-sumbytes-int8-maddubs branch Feb 3, 2019

WojciechMula added a commit that referenced this pull request Feb 3, 2019
