AVX2 core for OGR-NG #12

craig-johnston · 2017-04-28T08:18:01Z

I added an AVX2 implementation, based on my SSE2 implementation, for 64-bit clients. There is a minor but reliable improvement when running on a Kaby Lake processor. I expect that the AVX2 core will scale better than the SSE2 implementation as Intel releases new processors and improves the performance of AVX2.

I have only tested compilation using VS2015, but have updated the Linux Makefile (hopefully there will be no problem).

I have assumed that if AVX2 is available, it is preferable to use this core. It would be worth checking that assumption on some of the older architectures that support AVX2 (e.g. Haswell).
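
The feature-based assumption above can be sketched roughly as follows. This is only an illustration, not the client's detection code: `cpu_has_avx2` and `pick_core_by_feature` are hypothetical names, the GCC/Clang builtin `__builtin_cpu_supports` stands in for whatever CPUID logic the client really uses, and the core numbers 2 and 4 are simply the ones printed in the benchmark log for `cj-asm-sse2` and `cj-asm-avx2`.

```c
#include <stdbool.h>

/* Core numbers as printed by the benchmark log (#2 = cj-asm-sse2,
   #4 = cj-asm-avx2). The enum itself is hypothetical. */
enum ogr_core { CORE_CJ_ASM_SSE2 = 2, CORE_CJ_ASM_AVX2 = 4 };

/* Returns true when the running CPU reports AVX2 support.
   Assumption: GCC or Clang targeting x86; any other toolchain or
   architecture conservatively reports false. */
static bool cpu_has_avx2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

/* The assumption stated in the description: prefer the AVX2 core
   whenever the feature is present, otherwise stay on SSE2. */
static enum ogr_core pick_core_by_feature(bool has_avx2)
{
    return has_avx2 ? CORE_CJ_ASM_AVX2 : CORE_CJ_ASM_SSE2;
}
```

Calling `pick_core_by_feature(cpu_has_avx2())` gives the feature-only choice. Whether that choice is actually the fastest on a given microarchitecture is the open question raised above.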

This includes the VS2015 and Kaby Lake branches I've sent pull requests for.

Let me know if there are changes you'd like me to make.

Craig.

Benchmark results on an Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz:

[Apr 28 08:04:07 UTC] Automatic processor type detection found
                      an Intel Core iX-7xxx (Kaby Lake) processor.
[Apr 28 08:04:07 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Apr 28 08:04:27 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)
                      0.00:00:17.06 [57,077,774 nodes/sec]
[Apr 28 08:04:27 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Apr 28 08:04:46 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)
                      0.00:00:16.92 [63,158,500 nodes/sec]
[Apr 28 08:04:46 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Apr 28 08:05:06 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)
                      0.00:00:17.07 [82,854,657 nodes/sec]
[Apr 28 08:05:06 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Apr 28 08:05:25 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)
                      0.00:00:16.92 [81,372,634 nodes/sec]
[Apr 28 08:05:25 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Apr 28 08:05:44 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)
                      0.00:00:16.92 [83,726,597 nodes/sec]
[Apr 28 08:05:44 UTC] OGR-NG benchmark summary :
                      Default core : #4 (cj-asm-avx2) 83,726,597 nodes/sec
                      Fastest core : #4 (cj-asm-avx2) 83,726,597 nodes/sec

ertyu · 2017-04-28T23:07:38Z

Tested on Haswell:

[Apr 28 23:04:31 UTC] Automatic processor type detection found
                      an Intel Core iX-4xxx (Haswell) processor.
[Apr 28 23:04:31 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Apr 28 23:04:51 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)             
                      0.00:00:17.06 [41,607,741 nodes/sec]
[Apr 28 23:04:51 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Apr 28 23:05:09 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)           
                      0.00:00:16.12 [49,491,628 nodes/sec]
[Apr 28 23:05:09 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Apr 28 23:05:28 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)              
                      0.00:00:16.15 [73,969,422 nodes/sec]
[Apr 28 23:05:28 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Apr 28 23:05:47 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)        
                      0.00:00:16.15 [71,995,392 nodes/sec]
[Apr 28 23:05:47 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Apr 28 23:06:05 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)              
                      0.00:00:16.17 [70,541,764 nodes/sec]

craig-johnston · 2017-04-29T02:07:36Z

I just went to change the code to force selection of SSE2 for Haswell, but it appears that's already in place.

I also noticed that the code puts Skylake into the same architecture group as Haswell, so it will also use SSE2. I expect Skylake will perform similarly to Kaby Lake, as they are supposed to share the same microarchitecture and so should have similar instruction costs. If someone shows that the core selection for Skylake should be AVX2, we can change its grouping.
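
To make the grouping concrete, here is a minimal sketch of a per-microarchitecture default table. The enum tags and `default_ogrng_core` are hypothetical and do not match the client's real identifiers; only the mapping (Haswell and Skylake on SSE2, Kaby Lake on AVX2) and the core numbers come from this thread.

```c
#include <assert.h>

/* Hypothetical microarchitecture tags, for illustration only. */
enum uarch {
    UARCH_HASWELL,
    UARCH_SKYLAKE,
    UARCH_KABY_LAKE,
    UARCH_COUNT
};

/* Core numbers as printed by the benchmark logs in this thread. */
enum { CORE_CJ_ASM_SSE2 = 2, CORE_CJ_ASM_AVX2 = 4 };

/* Default OGR-NG core per microarchitecture, as described above:
   Skylake currently shares Haswell's group and therefore gets SSE2. */
static int default_ogrng_core(enum uarch arch)
{
    static const int table[UARCH_COUNT] = {
        [UARCH_HASWELL]   = CORE_CJ_ASM_SSE2,
        [UARCH_SKYLAKE]   = CORE_CJ_ASM_SSE2,
        [UARCH_KABY_LAKE] = CORE_CJ_ASM_AVX2,
    };
    assert(arch >= 0 && arch < UARCH_COUNT);
    return table[arch];
}
```

With a table like this, changing Skylake's grouping would be a one-line change to its entry.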

ChrisLundquist · 2017-06-01T20:00:05Z

Running on:
i7-6770HQ CPU
(http://www.intel.com/content/www/us/en/nuc/nuc-kit-nuc6i7kyk-features-configurations.html)

Compiled with:

gcc version 6.3.0 20170406 (Ubuntu 6.3.0-12ubuntu2)

They are neck-and-neck on my chip. SSE2 wins some, AVX2 wins others.
The AVX2 core uses slightly more power.

$ ./dnetc --bench ogr-ng 

dnetc v2.9112-521-CFR-16021318 for Linux (Linux 4.10.0-21-generic).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://bugs.distributed.net/

[Jun 01 19:47:04 UTC] Automatic processor type detection found
                      an Intel Core iX-6xxx (Skylake) processor.
[Jun 01 19:47:04 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Jun 01 19:47:23 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)                                                                                                                                         
                      0.00:00:16.30 [31,132,697 nodes/sec]
[Jun 01 19:47:23 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Jun 01 19:47:42 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)                                                                                                                                       
                      0.00:00:17.07 [56,059,433 nodes/sec]
[Jun 01 19:47:42 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Jun 01 19:48:03 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)                                                                                                                                          
                      0.00:00:17.01 [73,908,297 nodes/sec]
[Jun 01 19:48:03 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Jun 01 19:48:22 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)                                                                                                                                    
                      0.00:00:17.05 [72,040,032 nodes/sec]
[Jun 01 19:48:22 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Jun 01 19:48:41 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)                                                                                                                                          
                      0.00:00:17.07 [73,903,698 nodes/sec]
[Jun 01 19:48:41 UTC] OGR-NG benchmark summary :
                      Default core : #2 (cj-asm-sse2) 73,908,297 nodes/sec
                      Fastest core : #2 (cj-asm-sse2) 73,908,297 nodes/sec
[Jun 01 19:48:41 UTC] Compare and share your rates in the speeds database at
                      http://www.distributed.net/speed/
                      (benchmark rates are for a single processor core)

A sample where avx2 wins:

$ ./dnetc --bench ogr-ng 
dnetc v2.9112-521-CFR-16021318 for Linux (Linux 4.10.0-21-generic).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://bugs.distributed.net/

[Jun 01 19:49:46 UTC] Automatic processor type detection found
                      an Intel Core iX-6xxx (Skylake) processor.
[Jun 01 19:49:46 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Jun 01 19:50:04 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)                                                                                                                                         
                      0.00:00:16.30 [30,174,905 nodes/sec]
[Jun 01 19:50:04 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Jun 01 19:50:24 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)                                                                                                                                       
                      0.00:00:17.03 [56,804,160 nodes/sec]
[Jun 01 19:50:24 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Jun 01 19:50:43 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)                                                                                                                                          
                      0.00:00:17.08 [74,686,723 nodes/sec]
[Jun 01 19:50:43 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Jun 01 19:51:03 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)                                                                                                                                    
                      0.00:00:16.14 [73,399,657 nodes/sec]
[Jun 01 19:51:03 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Jun 01 19:51:22 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)                                                                                                                                          
                      0.00:00:17.12 [75,213,877 nodes/sec]
[Jun 01 19:51:22 UTC] OGR-NG benchmark summary :
                      Default core : #2 (cj-asm-sse2) 74,686,723 nodes/sec
                      Fastest core : #4 (cj-asm-avx2) 75,213,877 nodes/sec
[Jun 01 19:51:22 UTC] Core #4 is marginally faster than the default core.
                      Testing variability might lead to pick one or the other.
[Jun 01 19:51:22 UTC] Compare and share your rates in the speeds database at
                      http://www.distributed.net/speed/
                      (benchmark rates are for a single processor core)

Power sample for sse2:

$ ./rapl-read 
	Package 0
		package-0	: 14.659936J
		core	: 12.615751J
		uncore	: 0.000000J
		dram	: 1.002134J

Power sample for sse2-lzcnt:

	Package 0
		package-0	: 14.329919J
		core	: 12.295684J
		uncore	: 0.000000J
		dram	: 1.002561J

Power sample for avx2:

$ ./rapl-read 
	Package 0
		package-0	: 15.075339J
		core	: 13.029385J
		uncore	: 0.000000J
		dram	: 1.002439J
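
One way to relate the power samples to the benchmark rates is energy per node, relative to SSE2. The sketch below rests on assumptions that are not stated in the samples: that every `rapl-read` sample covers the same wall-clock interval, and that each core was running at its benchmarked rate during the sample. `relative_energy_per_node` is a hypothetical helper.

```c
#include <assert.h>

/* Energy per node of core A relative to core B.
   Assumption: joules_a and joules_b were measured over the same
   wall-clock interval, so the interval cancels out of the ratio.
   A result above 1.0 means core A spends more energy per node. */
static double relative_energy_per_node(double joules_a, double rate_a,
                                       double joules_b, double rate_b)
{
    assert(joules_b > 0.0 && rate_a > 0.0 && rate_b > 0.0);
    return (joules_a / joules_b) / (rate_a / rate_b);
}
```

Using the package readings (15.075339 J for avx2, 14.659936 J for sse2) with the rates from the two Skylake runs above gives a ratio of roughly 1.02 to 1.03. Under those assumptions, AVX2 would cost about 2-3% more energy per node on this chip.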

ChrisLundquist · 2017-06-01T20:20:24Z

FWIW Broadwell behaves like Haswell (no surprise there)
Using the same binary as above.

Running on a nuc5i5 (i7-5557U CPU @ 3.10GHz)

[Jun 01 20:15:20 UTC] Automatic processor type detection did not
                      recognize the processor (tag: "100063D4")
[Jun 01 20:15:20 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Jun 01 20:15:39 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)                                                                                                                                         
                      0.00:00:16.96 [29,903,650 nodes/sec]
[Jun 01 20:15:39 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Jun 01 20:15:58 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)                                                                                                                                       
                      0.00:00:17.05 [53,626,066 nodes/sec]
[Jun 01 20:15:58 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Jun 01 20:16:18 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)                                                                                                                                          
                      0.00:00:17.01 [78,280,448 nodes/sec]
[Jun 01 20:16:18 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Jun 01 20:16:37 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)                                                                                                                                    
                      0.00:00:16.98 [77,036,275 nodes/sec]
[Jun 01 20:16:37 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Jun 01 20:16:57 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)                                                                                                                                          
                      0.00:00:17.00 [75,841,827 nodes/sec]
[Jun 01 20:16:57 UTC] OGR-NG benchmark summary :
                      Default core : #4 (cj-asm-avx2) 75,841,827 nodes/sec
                      Fastest core : #2 (cj-asm-sse2) 78,280,448 nodes/sec

ChrisLundquist · 2017-06-01T20:22:16Z

I almost forgot, I had to make a minor change along the lines of yours to get this to compile on the above platform:

$ git diff
diff --git a/common/client.h b/common/client.h
index d68a3020..5e3a90e4 100644
--- a/common/client.h
+++ b/common/client.h
@@ -24,7 +24,7 @@
 #define BUFFER_DEFAULT_OUT_BASENAME "buff-out"
 #define MINCLIENTOPTSTRLEN   64 /* no asciiz var is smaller than this */
 #define NO_OUTBUFFER_THRESHOLDS /* no longer have outthresholds */
-#define DEFAULT_EXITFLAGFILENAME "exitdnet"EXTN_SEP"now"
+#define DEFAULT_EXITFLAGFILENAME "exitdnet" EXTN_SEP "now"
 
 // ------------------

craig-johnston · 2017-06-05T01:52:48Z

Thanks for the benchmark reports. It seems that the AVX2 core is only worthwhile on Kaby Lake and (probably) later.
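
For a side-by-side view, the AVX2 and SSE2 rates reported in this thread can be put in one table and compared. The struct and function names below are hypothetical; the numbers are copied from the logs above.

```c
#include <assert.h>
#include <stddef.h>

/* One row per benchmark run posted in this thread. */
struct bench_row {
    const char *cpu;
    long sse2_nodes_per_sec;   /* core #2, cj-asm-sse2 */
    long avx2_nodes_per_sec;   /* core #4, cj-asm-avx2 */
};

static const struct bench_row rows[] = {
    { "Kaby Lake (i5-7600)",       82854657L, 83726597L },
    { "Haswell",                   73969422L, 70541764L },
    { "Skylake (i7-6770HQ) run 1", 73908297L, 73903698L },
    { "Skylake (i7-6770HQ) run 2", 74686723L, 75213877L },
    { "Broadwell (i7-5557U)",      78280448L, 75841827L },
};

/* Percentage by which the AVX2 core beats (positive) or trails
   (negative) the SSE2 core in a given run. */
static double avx2_gain_percent(const struct bench_row *r)
{
    assert(r != NULL && r->sse2_nodes_per_sec > 0);
    return ((double)r->avx2_nodes_per_sec - (double)r->sse2_nodes_per_sec)
           / (double)r->sse2_nodes_per_sec * 100.0;
}
```

Looping over `rows` and printing `avx2_gain_percent` gives roughly +1.1% on Kaby Lake, about -4.6% on Haswell, about -3.1% on Broadwell, and well under 1% either way on Skylake.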

ChrisLundquist · 2017-06-17T00:18:50Z

@craig-johnston Did you investigate xsave and xrstor as alternatives to the sections like:

        ; is simpler to manage. So save maximum amount required to work with all OS'es.
        push    rsi     ; Windows
        push    rdi     ; Windows
        push    rbx
        push    rbp
        push    r12
        push    r13
        push    r14
        push    r15
        ; According to x64 ABI, stack must be aligned by 16 before call =>
        ; it'll be xxxxxxx8 after call. We've pushed EVEN number of registers above =>
        ; stack is still at xxxxxxx8. Subtracting ***8 will make it aligned to 16,
        ; so we can save XMM registers (required for Windows only, but see above).
        sub     rsp, 0xA8
        movdqa  [rsp+0x00], xmm6
        movdqa  [rsp+0x10], xmm7
        movdqa  [rsp+0x20], xmm8
        movdqa  [rsp+0x30], xmm9
        movdqa  [rsp+0x40], xmm10
        movdqa  [rsp+0x50], xmm11
        movdqa  [rsp+0x60], xmm12
        movdqa  [rsp+0x70], xmm13
        movdqa  [rsp+0x80], xmm14
        movdqa  [rsp+0x90], xmm15

http://www.agner.org/optimize/instruction_tables.pdf
It says xsave has a 107-cycle latency, which doesn't seem particularly fast, but some other sources say it stores the state in a more compact format, which pushes fewer bytes?
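
As a quick check of the arithmetic in the quoted prologue, the sketch below recomputes the bytes it writes and the resulting stack alignment. The helper names are hypothetical. The XSAVE figure is the documented minimum size of an XSAVE area (512-byte legacy region plus 64-byte header); it is included only for scale and is not a measurement.

```c
#include <assert.h>
#include <stdint.h>

enum {
    PUSHED_GPRS = 8,      /* rsi, rdi, rbx, rbp, r12, r13, r14, r15 */
    GPR_BYTES   = 8,
    SAVED_XMMS  = 10,     /* xmm6 .. xmm15 */
    XMM_BYTES   = 16,
    FRAME_BYTES = 0xA8,   /* sub rsp, 0xA8 */
    /* Minimum XSAVE area: legacy region + header, before any
       extended state such as the upper YMM halves. */
    XSAVE_MIN_BYTES = 512 + 64
};

/* Total stack bytes consumed by the quoted prologue. */
static unsigned prologue_bytes(void)
{
    return PUSHED_GPRS * GPR_BYTES + FRAME_BYTES;
}

/* Bytes actually needed for the movdqa saves. */
static unsigned xmm_save_bytes(void)
{
    return SAVED_XMMS * XMM_BYTES;
}

/* Stack pointer after the prologue. At entry, rsp is 8 past a
   16-byte boundary because the call pushed a return address. */
static uint64_t rsp_after_prologue(uint64_t rsp_at_entry)
{
    assert(rsp_at_entry % 16 == 8);
    return rsp_at_entry - prologue_bytes();
}
```

The prologue uses 232 bytes in total (64 for the pushes, 0xA8 for the frame, of which 160 hold XMM registers), and for any entry value of `rsp` that is 8 modulo 16 it ends on a 16-byte boundary, as `movdqa` requires. That total is well below the 576-byte minimum of an XSAVE area.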

craig-johnston · 2017-06-17T02:13:18Z

I didn't. The code is the same generic function prologue used for most of the OGR-NG asm cores. It's not in the hot path, so the performance has very little impact when compared to the body, which will loop many thousands of times for each call to the function.
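
A rough way to see why the prologue barely matters: its share of the cycles in one call shrinks with the number of loop iterations. The sketch below uses illustrative inputs only. The 107 cycles is the xsave figure quoted above, while the iteration count and the per-iteration cost are made-up placeholders, not measurements of this core.

```c
#include <assert.h>

/* Fraction of one call's cycles spent in the prologue, given a
   prologue cost and a loop body that runs `loop_iterations` times. */
static double prologue_share(double prologue_cycles,
                             double loop_iterations,
                             double cycles_per_iteration)
{
    double body_cycles = loop_iterations * cycles_per_iteration;
    assert(prologue_cycles >= 0.0 && body_cycles > 0.0);
    return prologue_cycles / (prologue_cycles + body_cycles);
}
```

For example, `prologue_share(107.0, 10000.0, 40.0)` is below 0.1%. Even a prologue as slow as the quoted xsave latency would be lost in the noise next to the loop body under those placeholder numbers.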

craig-johnston added 10 commits March 19, 2017 15:09

Makefile now defines snprintf=_snprintf for VS2013 and earlier

1e8acc7

Fix concatenating of strings and defines, they were being confused with string postfixes.

31fda34

Added Kaby Lake detection

bc58cdc

White space

b6ece0b

Best core selection for OGR-NG on Kaby Lake is cj-asm-sse2

812124d

Initial AVX2 implementation

8daec6c

OGRNG Alignment changed to 32 Core added for 64 bit builds and selected as default for CPUs supporting AVX2

Add AVX2 core to configure

92bd53d

Add AVX2 core to configure

9e40dfd

Merge branch 'ogrng-avx2' of https://github.com/craig-johnston/dnetc-client-base into ogrng-avx2

6f0305b

Improved documentation on AVX2 core

71b9147
