AVX2 core for OGR-NG #12

Open
wants to merge 10 commits into
base: master

craig-johnston
Contributor

I added an AVX2 implementation based on my SSE2 implementation for 64-bit clients. On a Kaby Lake processor it gives a small but reliable improvement (roughly 1% over the SSE2 core in the benchmark below). I expect that the AVX2 core will scale better than the SSE2 implementation as Intel releases new processors and improves the performance of AVX2.

I have only tested compilation with VS2015. I have also updated the Linux Makefile, but that change is untested (hopefully there will be no problem).

I have assumed that if AVX2 is available, it is preferable to use this core. It would be worth checking that assumption on some of the older architectures that support AVX2 (e.g. Haswell).
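For reference, "AVX2 is available" involves more than one CPUID bit: the OS must also have enabled saving of the YMM registers. Below is a minimal sketch of that check in C, assuming the caller has already read the raw CPUID and XGETBV values. The function `avx2_usable` and the macro names are illustrative only and are not taken from the client source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as documented for CPUID and XCR0. */
#define CPUID1_ECX_OSXSAVE (1u << 27) /* OS has enabled XSAVE/XGETBV */
#define CPUID1_ECX_AVX     (1u << 28) /* CPU supports AVX            */
#define CPUID7_EBX_AVX2    (1u << 5)  /* CPU supports AVX2           */
#define XCR0_SSE_STATE     (1u << 1)  /* OS saves XMM registers      */
#define XCR0_AVX_STATE     (1u << 2)  /* OS saves upper YMM halves   */

/*
 * Hypothetical helper: decide whether AVX2 code may be executed.
 *   leaf1_ecx : ECX from CPUID leaf 1
 *   leaf7_ebx : EBX from CPUID leaf 7, subleaf 0 (pass 0 if leaf 7 is absent)
 *   xcr0      : low 32 bits of XGETBV with ECX = 0 (pass 0 if OSXSAVE is clear,
 *               because XGETBV must not be executed in that case)
 */
static bool avx2_usable(uint32_t leaf1_ecx, uint32_t leaf7_ebx, uint32_t xcr0)
{
    const uint32_t need_ecx  = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX;
    const uint32_t need_xcr0 = XCR0_SSE_STATE | XCR0_AVX_STATE;

    if ((leaf1_ecx & need_ecx) != need_ecx)
        return false;   /* no AVX, or the OS has not enabled XSAVE */
    if ((xcr0 & need_xcr0) != need_xcr0)
        return false;   /* OS does not preserve XMM/YMM state */
    return (leaf7_ebx & CPUID7_EBX_AVX2) != 0;
}
```

Whether the AVX2 core is also the *faster* choice on a given processor is a separate question that only benchmarks can answer.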

This includes the VS2015 and Kaby Lake branches I've sent pull requests for.

Let me know if there are changes you'd like me to make.

Craig.

Benchmark results on an Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz:

[Apr 28 08:04:07 UTC] Automatic processor type detection found
                      an Intel Core iX-7xxx (Kaby Lake) processor.
[Apr 28 08:04:07 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Apr 28 08:04:27 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)
                      0.00:00:17.06 [57,077,774 nodes/sec]
[Apr 28 08:04:27 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Apr 28 08:04:46 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)
                      0.00:00:16.92 [63,158,500 nodes/sec]
[Apr 28 08:04:46 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Apr 28 08:05:06 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)
                      0.00:00:17.07 [82,854,657 nodes/sec]
[Apr 28 08:05:06 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Apr 28 08:05:25 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)
                      0.00:00:16.92 [81,372,634 nodes/sec]
[Apr 28 08:05:25 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Apr 28 08:05:44 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)
                      0.00:00:16.92 [83,726,597 nodes/sec]
[Apr 28 08:05:44 UTC] OGR-NG benchmark summary :
                      Default core : #4 (cj-asm-avx2) 83,726,597 nodes/sec
                      Fastest core : #4 (cj-asm-avx2) 83,726,597 nodes/sec

@ertyu
Member

ertyu commented Apr 28, 2017

Tested on Haswell:

[Apr 28 23:04:31 UTC] Automatic processor type detection found
                      an Intel Core iX-4xxx (Haswell) processor.
[Apr 28 23:04:31 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Apr 28 23:04:51 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)             
                      0.00:00:17.06 [41,607,741 nodes/sec]
[Apr 28 23:04:51 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Apr 28 23:05:09 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)           
                      0.00:00:16.12 [49,491,628 nodes/sec]
[Apr 28 23:05:09 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Apr 28 23:05:28 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)              
                      0.00:00:16.15 [73,969,422 nodes/sec]
[Apr 28 23:05:28 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Apr 28 23:05:47 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)        
                      0.00:00:16.15 [71,995,392 nodes/sec]
[Apr 28 23:05:47 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Apr 28 23:06:05 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)              
                      0.00:00:16.17 [70,541,764 nodes/sec]

@craig-johnston
Contributor Author

I just went to change the code to force selection of SSE2 for Haswell, but it appears that's already in place.

I also noticed that the code puts Skylake into the same architecture group as Haswell, so it will also use SSE2. I expect Skylake to perform similarly to Kaby Lake, as they are supposed to share the same microarchitecture and so should have similar instruction costs. If someone shows that the core selection for Skylake should be AVX2, we can change its grouping.
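To make the selection rule concrete, here is a minimal sketch in C of the behaviour described in this thread. The enum and function names are hypothetical and do not match the client's real identifiers; only the core numbers are taken from the benchmark logs.

```c
#include <stdbool.h>

/* Core numbers as they appear in the benchmark logs of this thread. */
enum ogr_core {
    CORE_FLEGE64        = 0,
    CORE_ASM_GENERIC    = 1,
    CORE_ASM_SSE2       = 2,
    CORE_ASM_SSE2_LZCNT = 3,
    CORE_ASM_AVX2       = 4
};

/* Hypothetical architecture groups, for illustration only. */
enum cpu_group {
    GROUP_UNKNOWN,    /* processor tag not recognized            */
    GROUP_HASWELL,    /* Haswell, and currently Skylake as well  */
    GROUP_KABY_LAKE
};

/* Pick the default 64-bit OGR-NG core for a CPU group. */
static enum ogr_core select_default_core(enum cpu_group group, bool has_avx2)
{
    switch (group) {
    case GROUP_HASWELL:
        /* Forced to SSE2: it benchmarked faster than AVX2 on Haswell. */
        return CORE_ASM_SSE2;
    case GROUP_KABY_LAKE:
    case GROUP_UNKNOWN:
    default:
        /* Assumption stated in the PR: prefer AVX2 whenever it is available. */
        return has_avx2 ? CORE_ASM_AVX2 : CORE_ASM_SSE2;
    }
}
```

Under this sketch, moving Skylake to AVX2 would only mean assigning it to a different group; no core code would change.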

@ChrisLundquist

Running on:
i7-6770HQ CPU
(http://www.intel.com/content/www/us/en/nuc/nuc-kit-nuc6i7kyk-features-configurations.html)

Compiled with:

gcc version 6.3.0 20170406 (Ubuntu 6.3.0-12ubuntu2) 

They are neck-and-neck on my chip. SSE2 wins some, AVX2 wins others.
The AVX2 core uses slightly more power.

$ ./dnetc --bench ogr-ng 

dnetc v2.9112-521-CFR-16021318 for Linux (Linux 4.10.0-21-generic).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://bugs.distributed.net/

[Jun 01 19:47:04 UTC] Automatic processor type detection found
                      an Intel Core iX-6xxx (Skylake) processor.
[Jun 01 19:47:04 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Jun 01 19:47:23 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)                                                                                                                                         
                      0.00:00:16.30 [31,132,697 nodes/sec]
[Jun 01 19:47:23 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Jun 01 19:47:42 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)                                                                                                                                       
                      0.00:00:17.07 [56,059,433 nodes/sec]
[Jun 01 19:47:42 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Jun 01 19:48:03 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)                                                                                                                                          
                      0.00:00:17.01 [73,908,297 nodes/sec]
[Jun 01 19:48:03 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Jun 01 19:48:22 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)                                                                                                                                    
                      0.00:00:17.05 [72,040,032 nodes/sec]
[Jun 01 19:48:22 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Jun 01 19:48:41 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)                                                                                                                                          
                      0.00:00:17.07 [73,903,698 nodes/sec]
[Jun 01 19:48:41 UTC] OGR-NG benchmark summary :
                      Default core : #2 (cj-asm-sse2) 73,908,297 nodes/sec
                      Fastest core : #2 (cj-asm-sse2) 73,908,297 nodes/sec
[Jun 01 19:48:41 UTC] Compare and share your rates in the speeds database at
                      http://www.distributed.net/speed/
                      (benchmark rates are for a single processor core)

A sample where avx2 wins:

$ ./dnetc --bench ogr-ng 
dnetc v2.9112-521-CFR-16021318 for Linux (Linux 4.10.0-21-generic).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://bugs.distributed.net/

[Jun 01 19:49:46 UTC] Automatic processor type detection found
                      an Intel Core iX-6xxx (Skylake) processor.
[Jun 01 19:49:46 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Jun 01 19:50:04 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)                                                                                                                                         
                      0.00:00:16.30 [30,174,905 nodes/sec]
[Jun 01 19:50:04 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Jun 01 19:50:24 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)                                                                                                                                       
                      0.00:00:17.03 [56,804,160 nodes/sec]
[Jun 01 19:50:24 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Jun 01 19:50:43 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)                                                                                                                                          
                      0.00:00:17.08 [74,686,723 nodes/sec]
[Jun 01 19:50:43 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Jun 01 19:51:03 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)                                                                                                                                    
                      0.00:00:16.14 [73,399,657 nodes/sec]
[Jun 01 19:51:03 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Jun 01 19:51:22 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)                                                                                                                                          
                      0.00:00:17.12 [75,213,877 nodes/sec]
[Jun 01 19:51:22 UTC] OGR-NG benchmark summary :
                      Default core : #2 (cj-asm-sse2) 74,686,723 nodes/sec
                      Fastest core : #4 (cj-asm-avx2) 75,213,877 nodes/sec
[Jun 01 19:51:22 UTC] Core #4 is marginally faster than the default core.
                      Testing variability might lead to pick one or the other.
[Jun 01 19:51:22 UTC] Compare and share your rates in the speeds database at
                      http://www.distributed.net/speed/
                      (benchmark rates are for a single processor core)
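To put "neck-and-neck" into numbers, the two runs above can be compared as a relative difference. This is a minimal sketch in C; the 1% noise threshold is a hypothetical value chosen for illustration and is not the threshold the client uses for its "marginally faster" message.

```c
#include <stdbool.h>

/* HYPOTHETICAL cut-off: differences smaller than this are treated as noise. */
#define NOISE_THRESHOLD_PERCENT 1.0

/* Relative speed-up of `candidate` over `baseline`, in percent. */
static double speedup_percent(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

/* True when the two rates differ by less than the assumed noise threshold. */
static bool within_noise(double baseline, double candidate)
{
    double d = speedup_percent(baseline, candidate);
    return d > -NOISE_THRESHOLD_PERCENT && d < NOISE_THRESHOLD_PERCENT;
}
```

With the SSE2 rate as baseline, the first run gives `speedup_percent(73908297.0, 73903698.0)`, about -0.006%, and the second gives `speedup_percent(74686723.0, 75213877.0)`, about +0.7%. Both fall inside the assumed threshold, so neither run shows a clear winner on this chip.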

Power sample for sse2:

$ ./rapl-read 
	Package 0
		package-0	: 14.659936J
		core	: 12.615751J
		uncore	: 0.000000J
		dram	: 1.002134J

Power sample for sse2-lzcnt:

	Package 0
		package-0	: 14.329919J
		core	: 12.295684J
		uncore	: 0.000000J
		dram	: 1.002561J

Power sample for avx2:

$ ./rapl-read 
	Package 0
		package-0	: 15.075339J
		core	: 13.029385J
		uncore	: 0.000000J
		dram	: 1.002439J
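The package readings differ by about 2.8% between sse2 and avx2. A rough way to combine this with the benchmark rates is throughput per unit of energy. The sketch below is in C and rests on two assumptions that the thread does not confirm: that each RAPL sample covers the same length of time, and that each was taken while the corresponding core was running. The function name is illustrative.

```c
/*
 * Ratio of energy efficiency (nodes per joule) of core A relative to core B.
 * ASSUMPTION: both energy readings cover sampling windows of equal length,
 * so the unknown window duration cancels out of the ratio.
 * A result below 1.0 means core A does less work per joule than core B.
 */
static double relative_efficiency(double rate_a, double joules_a,
                                  double rate_b, double joules_b)
{
    double nodes_per_joule_a = rate_a / joules_a;
    double nodes_per_joule_b = rate_b / joules_b;
    return nodes_per_joule_a / nodes_per_joule_b;
}
```

Using the package figures above with the rates from either Skylake run, for example `relative_efficiency(75213877.0, 15.075339, 74686723.0, 14.659936)`, the ratio comes out at roughly 0.97 to 0.98. If the assumptions hold, the AVX2 core does slightly less work per joule than the SSE2 core on this chip.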

@ChrisLundquist

FWIW, Broadwell behaves like Haswell (no surprise there).
This uses the same binary as above.

Running on a nuc5i5 (i7-5557U CPU @ 3.10GHz)

[Jun 01 20:15:20 UTC] Automatic processor type detection did not
                      recognize the processor (tag: "100063D4")
[Jun 01 20:15:20 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Jun 01 20:15:39 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)                                                                                                                                         
                      0.00:00:16.96 [29,903,650 nodes/sec]
[Jun 01 20:15:39 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Jun 01 20:15:58 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)                                                                                                                                       
                      0.00:00:17.05 [53,626,066 nodes/sec]
[Jun 01 20:15:58 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Jun 01 20:16:18 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)                                                                                                                                          
                      0.00:00:17.01 [78,280,448 nodes/sec]
[Jun 01 20:16:18 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Jun 01 20:16:37 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)                                                                                                                                    
                      0.00:00:16.98 [77,036,275 nodes/sec]
[Jun 01 20:16:37 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Jun 01 20:16:57 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)                                                                                                                                          
                      0.00:00:17.00 [75,841,827 nodes/sec]
[Jun 01 20:16:57 UTC] OGR-NG benchmark summary :
                      Default core : #4 (cj-asm-avx2) 75,841,827 nodes/sec
                      Fastest core : #2 (cj-asm-sse2) 78,280,448 nodes/sec

@ChrisLundquist

I almost forgot, I had to make a minor change along the lines of yours to get this to compile on the above platform:

$ git diff
diff --git a/common/client.h b/common/client.h
index d68a3020..5e3a90e4 100644
--- a/common/client.h
+++ b/common/client.h
@@ -24,7 +24,7 @@
 #define BUFFER_DEFAULT_OUT_BASENAME "buff-out"
 #define MINCLIENTOPTSTRLEN   64 /* no asciiz var is smaller than this */
 #define NO_OUTBUFFER_THRESHOLDS /* no longer have outthresholds */
-#define DEFAULT_EXITFLAGFILENAME "exitdnet"EXTN_SEP"now"
+#define DEFAULT_EXITFLAGFILENAME "exitdnet" EXTN_SEP "now"
 
 // ------------------
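Some background on why the spaces matter: since C++11, an identifier written directly after a string literal is lexed as a user-defined literal suffix, so `"exitdnet"EXTN_SEP` is no longer seen as a literal followed by a macro. That is most likely what a recent GCC rejects here. The sketch below shows the spaced form in C (where both forms compile). The value `"."` for `EXTN_SEP` is an assumption made for the example, since the client defines it per platform, and the accessor function is illustrative.

```c
/* ASSUMPTION: "." is used as the separator for this example only. */
#define EXTN_SEP "."

/*
 * The spaces keep EXTN_SEP a separate preprocessing token, so the macro is
 * expanded first and the three adjacent string literals are then concatenated.
 */
#define DEFAULT_EXITFLAGFILENAME "exitdnet" EXTN_SEP "now"

/* Illustrative accessor that returns the concatenated file name. */
static const char *default_exit_flag_filename(void)
{
    return DEFAULT_EXITFLAGFILENAME;
}
```

With the assumed separator, the macro expands to the single string `exitdnet.now`.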

@craig-johnston
Contributor Author

Thanks for the benchmark reports. It seems that the AVX2 core is only worthwhile on Kaby Lake and (probably) later.

@ChrisLundquist

@craig-johnston Did you investigate xsave and xrstor as alternatives to the sections like:

        ; is simpler to manage. So save maximum amount required to work with all OS'es.
        push    rsi     ; Windows
        push    rdi     ; Windows
        push    rbx
        push    rbp
        push    r12
        push    r13
        push    r14
        push    r15
        ; According to x64 ABI, stack must be aligned by 16 before call =>
        ; it'll be xxxxxxx8 after call. We've pushed EVEN number of registers above =>
        ; stack is still at xxxxxxx8. Subtracting ***8 will make it aligned to 16,
        ; so we can save XMM registers (required for Windows only, but see above).
        sub     rsp, 0xA8
        movdqa  [rsp+0x00], xmm6
        movdqa  [rsp+0x10], xmm7
        movdqa  [rsp+0x20], xmm8
        movdqa  [rsp+0x30], xmm9
        movdqa  [rsp+0x40], xmm10
        movdqa  [rsp+0x50], xmm11
        movdqa  [rsp+0x60], xmm12
        movdqa  [rsp+0x70], xmm13
        movdqa  [rsp+0x80], xmm14
        movdqa  [rsp+0x90], xmm15

http://www.agner.org/optimize/instruction_tables.pdf
It says xsave has a 107-cycle latency, which doesn't seem particularly fast, but some other sources say it stores the state in a more compact format, which writes fewer bytes?

@craig-johnston
Contributor Author

I didn't. The code is the same generic function prologue used for most of the OGR-NG asm cores. It's not in the hot path, so the performance has very little impact when compared to the body, which will loop many thousands of times for each call to the function.
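The alignment arithmetic described in the prologue comments quoted above can be checked with a short sketch in C. The constants mirror the quoted assembly; the function name and the entry address used in the example are illustrative.

```c
#include <stdint.h>

enum {
    PUSHED_GPRS = 8,    /* rsi, rdi, rbx, rbp, r12, r13, r14, r15 */
    SAVED_XMMS  = 10,   /* xmm6 .. xmm15                          */
    FRAME_BYTES = 0xA8  /* from "sub rsp, 0xA8"                   */
};

/*
 * Stack pointer after the quoted prologue, given rsp at function entry.
 * At entry rsp is 8 modulo 16, because the call pushed a return address
 * onto a stack that was 16-byte aligned.
 */
static uint64_t rsp_after_prologue(uint64_t rsp_at_entry)
{
    uint64_t rsp = rsp_at_entry;
    rsp -= PUSHED_GPRS * 8u;  /* eight 8-byte pushes: still 8 modulo 16       */
    rsp -= FRAME_BYTES;       /* 10 * 16 bytes of XMM slots + 8 bytes padding */
    return rsp;
}
```

For any entry value that is 8 modulo 16, such as `0x7fffffffe008`, the result is a multiple of 16, which is what `movdqa` requires. The save area itself is 160 bytes for the ten XMM registers. For comparison, an XSAVE area starts with a 512-byte legacy region followed by a 64-byte header, so it would not be smaller than this.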

3 participants