Add SSE4 optimized SHA256 #10821

Merged
merged 5 commits into from Jul 20, 2017

Owner

sipa commented Jul 14, 2017 edited

This adds an SSE4 assembly version of the SHA256 transform by Intel, uses it at run time if SSE4 instructions are available, and falls back to a C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with --enable-experimental-asm.

In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

This gives around a 50% speedup on the SHA256 benchmark for me.

It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

src/crypto/sha256.cpp
+
+#if defined(__x86_64__) || defined(__amd64__)
+ uint32_t eax, ebx, ecx, edx;
+ if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx >> 20) & 1) {
@laanwj

laanwj Jul 14, 2017 edited

Owner

I'd prefer to do this setup explicitly during initialization; this also avoids having to use an atomic pointer, which seems overkill (why would it ever change during runtime?) and may be inefficient on some platforms.
(also the detection might be more involved on some platforms, so it's better for clarity to drive it from an init function instead of magically at first call).

@theuni

theuni Jul 14, 2017

Member

We also have the option of using the ifunc attribute, supported on recent binutils with at least gcc and clang.

Though it's non-standard and afaik elf-specific, it's worth considering where possible.

@gmaxwell

gmaxwell Jul 16, 2017

Member

do we have constructors with hashing in them?

Member

luke-jr commented Jul 14, 2017

Even with inline assembly, there are build complications, unfortunately. The compile will fail if the target doesn't support it.

Owner

sipa commented Jul 14, 2017

@luke-jr There are system macros to test whether you're compiling for x86_64 or not.

Member

luke-jr commented Jul 14, 2017

You said almost every x86_64 CPU. Are we going to drop support for the outliers then?

One of the travis builds obviously has an issue with it too:
crypto/sha256_sse42.cpp:42:9: error: inline assembly requires more registers than available

Member

theuni commented Jul 14, 2017

The clang/osx build succeeds when -fomit-frame-pointer is used. I don't speak enough asm to know if a register can be freed up.

Member

gmaxwell commented Jul 14, 2017

> Even with inline assembly, there are build complications unfortunately. The compile will fail if the target doesn't support it.

No, it won't: these files are already compiled without -msse4.2. The only requirement is that the target is x86_64, which the build tests for.

Owner

sipa commented Jul 14, 2017

@luke-jr There is runtime detection to see if the CPU supports the extension. The only requirement is that the target is x86_64.
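
A minimal sketch of the kind of runtime check described here, assuming GCC or Clang and their `<cpuid.h>` helper. The function names `CpuidLeaf1EcxBit`, `HasSSE41`, and `HasSSE42` are hypothetical and are not taken from the patch. CPUID leaf 1 reports SSE4.1 in ECX bit 19 and SSE4.2 in ECX bit 20; on any other target the check is compiled out and reports false, so only the fallback would be used.

```cpp
#include <cassert>

#if defined(__x86_64__) || defined(__amd64__)
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction
#endif

// Hypothetical helper: reports whether CPUID leaf 1 sets the given ECX bit.
// Returns false on non-x86_64 targets, where the check is compiled out.
bool CpuidLeaf1EcxBit(unsigned bit) {
#if defined(__x86_64__) || defined(__amd64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 when the requested leaf is not supported.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return ((ecx >> bit) & 1u) != 0;
    }
#else
    (void)bit;  // unused on other architectures
#endif
    return false;
}

// ECX bit 19 advertises SSE4.1 and ECX bit 20 advertises SSE4.2.
bool HasSSE41() { return CpuidLeaf1EcxBit(19); }
bool HasSSE42() { return CpuidLeaf1EcxBit(20); }
```

Because the test happens at run time, the binary itself only has to target x86_64; a CPU without the extension simply never takes the assembly path.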

Member

jonasschnelli commented Jul 14, 2017 edited

Gitian OSX build is broken (https://bitcoin.jonasschnelli.ch/build/216):

Generated test/data/base58_keys_invalid.json.h
crypto/sha256_sse42.cpp:42:9: error: inline assembly requires more registers than available
        "shl    $0x6,%2;"
        ^
1 error generated.

No problem on Win/Linux.

Owner

sipa commented Jul 14, 2017 edited

@jonasschnelli @theuni figured it out - clang isn't compiling with -fomit-frame-pointer, and thus there is one fewer register available. Unfortunately, omitting the frame pointer still makes this code not work...

Owner

sipa commented Jul 14, 2017 edited

Updated the code to use one fewer register. The original YASM code used the dx register for two purposes, which I had separated out into two separate registers. They're merged now.

src/crypto/sha256_sse42.cpp
+; documentation and/or other materials provided with the
+; distribution.
+;
+; * Neither the name of the Intel Corporation nor the names of its
@TheBlueMatt

TheBlueMatt Jul 14, 2017

Contributor

We're gonna have to do something to meet this condition, though it doesn't appear we'd have to do much.

@gmaxwell

gmaxwell Jul 14, 2017 edited

Member

This is the standard three clause BSD license, it is GPL and whatnot compatible. The source code to Bitcoin, which contains this notice, is part of the "documentation and/or other materials" we provide.

@TheBlueMatt

TheBlueMatt Jul 14, 2017

Contributor

We ship sans-source all the time? I figured we'd just put a "contains software copyright Intel" in the --help output or a README somewhere.

sipa changed the title from Add SSE 4.2 optimized SHA256 to [WIP] Add SSE 4.2 optimized SHA256 Jul 15, 2017

Owner

sipa commented Jul 15, 2017

Marking as WIP, as this does not seem to produce correct hashes on OSX (cc @theuni).

Member

theuni commented Jul 15, 2017

I poked at this for hours and came up empty-handed. I'll wait for someone else to confirm my osx breakage isn't just local.

Member

theuni commented Jul 15, 2017

two more data points:

  1. @fanquake verified that this crashes on osx for him as well.

  2. I managed to reproduce a crash on Linux with an old clang (3.2), and it's even uglier, crashing gdb as well:

Starting program: /home/cory/dev/bitcoin2/src/bitcoind
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
/build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion `dwarf2_per_objfile->ranges.readin' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
/build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion `dwarf2_per_objfile->ranges.readin' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
0x000000000074c910 in sha256_sse42::Transform (
/build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion `dwarf2_per_objfile->ranges.readin' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
/build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion `dwarf2_per_objfile->ranges.readin' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
Segmentation fault (core dumped)

Member

theuni commented Jul 16, 2017 edited

Tested ACK 08b7438. Good on OSX now!

Edit: Though I'd prefer to have the cpu check done separately.

sipa changed the title from [WIP] Add SSE 4.2 optimized SHA256 to Add SSE 4.2 optimized SHA256 Jul 16, 2017

Owner

sipa commented Jul 16, 2017

Removing WIP tag, I believe we solved the OSX problem.

Contributor

fanquake commented Jul 16, 2017

Confirmed that this now runs on OSX.

Running src/bench/bench_bitcoin
master (5cfdda2)

SHA256,30,0.034190416336060,0.035426974296570,0.034737364451090,115983933,120179929,117843926
SHA256,30,0.033560991287231,0.037778496742249,0.035649696985881,113846584,128155476,120938933
SHA256,30,0.033833026885986,0.035175085067749,0.034680000940959,114771438,119322675,117649641
SHA256_32b,2,2.333264589309692,2.333264589309692,2.333264589309692,7915485729,7915485729,7915485729
SHA256_32b,2,2.289189100265503,2.289189100265503,2.289189100265503,7765884738,7765884738,7765884738
SHA256_32b,2,2.370669960975647,2.370669960975647,2.370669960975647,8042288399,8042288399,8042288399

master (5cfdda2) + this PR

SHA256,320,0.003191620111465,0.003264248371124,0.003223562240601,10826857,11073394,10935724
SHA256,352,0.003048248589039,0.003163591027260,0.003104761242867,10340442,10731709,10532673
SHA256,352,0.003055907785892,0.003142252564430,0.003093159334226,10366424,10659420,10493303
SHA256_32b,4,0.324660062789917,0.329437971115112,0.327049016952515,1101416645,1117628996,1109522820
SHA256_32b,4,0.327362537384033,0.329176425933838,0.328269481658936,1110585003,1116655624,1113620313
SHA256_32b,4,0.325733423233032,0.331611037254333,0.328672230243683,1105059350,1124999710,1115029530
Member

gmaxwell commented Jul 16, 2017

This should do something to print which implementation it's using, to help spot runtime auto-detection bugs.

Owner

sipa commented Jul 16, 2017

@gmaxwell Already done

Owner

sipa commented Jul 16, 2017

Added an extra commit that performs a self-test before selecting an optimized transform function.

Owner

sipa commented Jul 16, 2017

@fanquake Are you compiling with -O0 or something similar? This shouldn't give a 10x speedup for the SHA256 benchmark. More like a factor of 1.5x.

Member

jonasschnelli commented Jul 16, 2017

Tested ACK on my OSX box as well as on a Debian with Skylake

CPU OSX: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
CPU Debian: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz

Perf.-improvements: factor ~1.6.

---- DETAILS:

Exts OSX:

SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT
FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

Exts Debian:

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt

OSX Master

SHA1,576,0.001819185912609,0.002340316772461,0.001918800589111,5282892,6796157,5572193
SHA256,208,0.004609435796738,0.005578860640526,0.004965331691962,13385764,16201094,14419137
SHA256_32b,4,0.332423448562622,0.333897590637207,0.333160519599915,965338619,969639015,967488817
SHA512,352,0.002628467977047,0.003746151924133,0.002973860637708,7633100,10878782,8636088

OSX This PR

SHA1,576,0.001871295273304,0.002110570669174,0.001951401018434,5434265,6129157,5666867
SHA256,352,0.002895936369896,0.003193676471710,0.002999408678575,8409812,9274468,8710180
SHA256_32b,6,0.216736078262329,0.222404479980469,0.219772179921468,629402126,645863039,638218484
SHA512,352,0.002783536911011,0.003117501735687,0.002889553931626,8083365,9053381,8391166

Debian Master

SHA1,704,0.001483812928200,0.001535888761282,0.001513123512268,5341692,5529187,5447247
SHA256,256,0.003961175680161,0.004363536834717,0.004079484380782,14260373,15708338,14686141
SHA256_32b,4,0.281208992004395,0.283093929290771,0.282151460647583,1012352610,1019139204,1015745907
SHA512,416,0.002545125782490,0.002609595656395,0.002585454629018,9162493,9394502,9307633

Debian This PR

SHA1,704,0.001500129699707,0.001563936471939,0.001528463241729,5400623,5630186,5502462
SHA256,384,0.002633377909660,0.002746812999249,0.002677822485566,9480191,9888481,9640152
SHA256_32b,6,0.190533041954041,0.193126082420349,0.191900690396627,685917852,695253649,690841909
SHA512,384,0.002558782696724,0.002740010619164,0.002604349205891,9211619,9863917,9375649

(The non-SHA256 results are included for comparison.)
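
To make the "factor ~1.6" figure concrete, the speedups can be recomputed from the numbers above. This is only a sketch: it assumes the fourth numeric column of each benchmark line is the average time per iteration, and the names `Speedup`, `kOsxSha256`, and so on are hypothetical.

```cpp
#include <cassert>

// Speedup = average time on master / average time with this PR.
constexpr double Speedup(double master_avg, double pr_avg) {
    return master_avg / pr_avg;
}

// Averages copied from the SHA256 and SHA256_32b lines above.
const double kOsxSha256       = Speedup(0.004965331691962, 0.002999408678575);  // about 1.66
const double kOsxSha256_32b   = Speedup(0.333160519599915, 0.219772179921468);  // about 1.52
const double kDebianSha256    = Speedup(0.004079484380782, 0.002677822485566);  // about 1.52
const double kDebianSha256_32b = Speedup(0.282151460647583, 0.191900690396627); // about 1.47
```

Under that reading, the four ratios fall between roughly 1.47 and 1.66, which is in line with the expected factor of about 1.5.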

Owner

sipa commented Jul 16, 2017

Rebased, and moved the autodetection to an explicit SHA256AutoDetect() function that is called during initialization.

Owner

sipa commented Jul 16, 2017

Improved the self test (it now tests 0, 1, and 2-block transforms), and made it assert when the selftest fails rather than failing over to the standard implementation. This way, it won't hide problems.
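
A rough sketch of what such a self-test could look like. It is not the code from this PR: `ToyReference` and `ToyBroken` are toy stand-ins rather than SHA-256, and comparing a candidate against a reference implementation is an assumption made here for illustration. The point is only the shape of the check, which exercises 0, 1, and 2 blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Shape of a multi-block transform: update 8 state words from
// `blocks` consecutive 64-byte chunks.
typedef void (*TransformType)(uint32_t* state, const unsigned char* chunk, size_t blocks);

// Toy stand-in for the portable implementation. NOT SHA-256.
void ToyReference(uint32_t* s, const unsigned char* chunk, size_t blocks) {
    while (blocks--) {
        for (int i = 0; i < 64; ++i) s[i % 8] += chunk[i] + 1u;
        chunk += 64;
    }
}

// Deliberately broken candidate: it ignores every block after the first.
void ToyBroken(uint32_t* s, const unsigned char* chunk, size_t blocks) {
    if (blocks > 0) ToyReference(s, chunk, 1);
}

// Returns true if `candidate` matches `reference` for 0, 1 and 2 blocks.
bool SelfTest(TransformType candidate, TransformType reference) {
    unsigned char data[128];
    for (size_t i = 0; i < sizeof(data); ++i) {
        data[i] = static_cast<unsigned char>(i * 7 + 1);
    }
    for (size_t blocks = 0; blocks <= 2; ++blocks) {
        uint32_t expected[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        uint32_t actual[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        reference(expected, data, blocks);
        candidate(actual, data, blocks);
        if (std::memcmp(expected, actual, sizeof(expected)) != 0) return false;
    }
    return true;
}
```

A caller that wants failures to be loud, as described above, would wrap the call in `assert(SelfTest(candidate, reference))` before installing the candidate. The 2-block case matters because a bug in the loop over blocks is invisible to a single-block test, as `ToyBroken` shows.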

@gmaxwell

ACK

sipa changed the title from Add SSE 4.2 optimized SHA256 to Add SSE4 optimized SHA256 Jul 17, 2017

@gmaxwell

re-ACK

+ "movdqa %19,%%xmm10;"
+ "movdqa %20,%%xmm11;"
+
+ "Lloop0_%=:"
@theuni

theuni Jul 17, 2017

Member

It'd be helpful to add a little note about the 'L' prefix and what problem it solves. If nothing else, it may turn up as another useful google hit for someone in the future.
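
For readers unfamiliar with the label syntax, here is a small, self-contained sketch (unrelated to SHA256, with a hypothetical function name `SumDown`). In GCC extended asm, `%=` expands to a number that is unique to each instance of the asm statement, so the labels do not collide if the function is inlined more than once. My understanding, which this thread does not spell out, is that Mach-O toolchains treat labels starting with `L` as assembler-local, so they do not become symbols that split the function; treat that explanation as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Returns n + (n - 1) + ... + 1, using an inline-asm loop on x86_64
// and a plain C++ loop elsewhere.
inline uint64_t SumDown(uint64_t n) {
    uint64_t sum = 0;
#if defined(__x86_64__) || defined(__amd64__)
    __asm__ __volatile__(
        "test %1, %1;"      // skip the loop entirely when n == 0
        "jz Ldone_%=;"
        "Lloop_%=:"         // 'L' prefix plus %= gives a unique label
        "add %1, %0;"       // sum += n
        "dec %1;"           // --n, sets ZF when n reaches 0
        "jnz Lloop_%=;"
        "Ldone_%=:"
        : "+r"(sum), "+r"(n)
        :
        : "cc");
#else
    for (; n != 0; --n) sum += n;
#endif
    return sum;
}
```

Calling `SumDown` from several places in one translation unit is a quick way to confirm that the generated labels stay distinct.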

+ return true;
+}
+
+TransformType Transform = sha256::Transform;
@theuni

theuni Jul 17, 2017

Member

Like with the rand init, I think we'd save ourselves from future oopses by setting this to nullptr initially, and letting SHA256AutoDetect() set the fallback to sha256::Transform if necessary.

@laanwj

laanwj Jul 17, 2017 edited

Owner

I don't think that will work - there is some SHA256 work before main (IIRC to set up the chain parameters). Better if that uses the 'canonical' SHA256.

@theuni

theuni Jul 17, 2017

Member

Right, nevermind.

@sipa

sipa Jul 17, 2017

Owner

Indeed, that is the reason.
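
The reasoning in this sub-thread can be illustrated with a small sketch. Everything here is hypothetical: `PortableTransform` and `FastTransform` are toy stand-ins rather than SHA-256, `AutoDetect` takes the CPU capability as a parameter instead of querying it, and the returned names are assumptions. Because the pointer is initialized to the portable function, code that hashes during static initialization (before `main`) already has something valid to call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

typedef void (*TransformType)(uint32_t*, const unsigned char*, size_t);

// Toy stand-ins, NOT SHA-256.
void PortableTransform(uint32_t* s, const unsigned char* chunk, size_t blocks) {
    for (size_t i = 0; i < blocks * 64; ++i) s[i % 8] += chunk[i];
}
void FastTransform(uint32_t* s, const unsigned char* chunk, size_t blocks) {
    PortableTransform(s, chunk, blocks);  // pretend this is the assembly path
}

// Defaults to the portable version rather than nullptr, so callers that
// run before the explicit detection step still work.
TransformType Transform = PortableTransform;

// Hypothetical detection step, called explicitly during initialization.
std::string AutoDetect(bool cpu_has_sse4) {
    if (cpu_has_sse4) {
        Transform = FastTransform;
        return "sse4";
    }
    return "standard";
}

// A global whose initializer hashes before main() runs.
uint32_t HashOneBlockEarly() {
    uint32_t state[8] = {0};
    unsigned char block[64] = {1};  // block[0] == 1, remaining bytes are 0
    Transform(state, block, 1);
    return state[0];
}
const uint32_t g_early = HashOneBlockEarly();
```

With a `nullptr` default, the initializer of `g_early` would call through a null pointer; with the portable default it yields a valid result and `AutoDetect` can upgrade the pointer later.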

Owner

laanwj commented Jul 18, 2017

utACK, looks good to me now, but I still think it's too late for 0.15.
At least to enable it by default, I'm ok with an --enable-experimental-asm option, then enabling it by default after the 0.15 branch-off.

Owner

sipa commented Jul 18, 2017

@laanwj Added a --enable-experimental-asm configure option, disabled by default.

laanwj added this to the 0.15.0 milestone Jul 18, 2017

configure.ac
+AC_ARG_ENABLE([experimental-asm],
+ [AS_HELP_STRING([--enable-experimental-asm],
+ [Enable experimental assembly routines (default is no)])],
+ [experimental_asm=$withval],
@laanwj

laanwj Jul 18, 2017

Owner

This should be $enableval, not $withval

@sipa

sipa Jul 18, 2017

Owner

Strange, I tested this.

@laanwj

laanwj Jul 18, 2017

Owner

Me too, and it didn't work for me unless I changed it. Using $withval here most likely picks up the last --with check result (for qrencode, which wasn't installed in my case, so it always had no).

@sipa

sipa Jul 18, 2017

Owner

Fixed.

@@ -1162,6 +1172,7 @@ AM_CONDITIONAL([USE_LCOV],[test x$use_lcov = xyes])
AM_CONDITIONAL([GLIBC_BACK_COMPAT],[test x$use_glibc_compat = xyes])
AM_CONDITIONAL([HARDEN],[test x$use_hardening = xyes])
AM_CONDITIONAL([ENABLE_HWCRC32],[test x$enable_hwcrc32 = xyes])
+AM_CONDITIONAL([EXPERIMENTAL_ASM],[test x$experimental_asm = xyes])
@theuni

theuni Jul 18, 2017

Member

This is only needed if you intended to avoid compiling the _sse4.o variant altogether. AM_CONDITIONAL sets Makefile variables.

@theuni

theuni Jul 18, 2017

Member

On second thought, I'd actually prefer doing it that way in order to keep sha256_sse4.cpp completely generic. It was very helpful for me while testing to just throw together a quick test app using the .cpp directly.

The makefile change would become:

if EXPERIMENTAL_ASM
crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp
endif

Then obviously the guard isn't needed in the .cpp.

@sipa

sipa Jul 18, 2017

Owner

Fixed.

@sipa

Addressed comments, and tested.

src/crypto/sha256.cpp
@@ -11,11 +11,13 @@
#if defined(__x86_64__) || defined(__amd64__)
#include <cpuid.h>
@theuni

theuni Jul 18, 2017

Member

Nit: no need to risk including the not-guaranteed-to-exist header. Move the #if up a bit?

@sipa

sipa Jul 18, 2017

Owner

Fixed.

src/crypto/sha256_sse4.cpp
@@ -5,6 +5,8 @@
// This is a translation to GCC extended asm syntax from YASM code by Intel
// (available at the bottom of this file).
+#include "config/bitcoin-config.h"
+
@theuni

theuni Jul 18, 2017

Member

Not needed anymore :)

@sipa

sipa Jul 18, 2017

Owner

Fixed.

Member

theuni commented Jul 18, 2017

utACK modulo the small nits.

@gmaxwell

reACK

src/init.cpp
@@ -1161,6 +1161,7 @@ bool AppInitSanityChecks()
// ********************************************************* Step 4: sanity checks
// Initialize elliptic curve code
+ LogPrintf("Using the '%s' SHA256 implementation\n", SHA256AutoDetect());
@laanwj

laanwj Jul 20, 2017

Owner

Nit: Seems this is a log message with the side-effect of detecting the SHA256 implementation.
I'd prefer to assign the result explicitly, so that if someone happens to comment this out, or moves it to debug category, it won't just be skipped.

@sipa

sipa Jul 20, 2017

Owner

Fixed.
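
The concern raised in this nit can be shown with a small sketch. All names are hypothetical: `DetectImplementation` stands in for `SHA256AutoDetect()`, and `LOG_DISABLED` models a logging call that has been compiled out or commented out, so its arguments are never evaluated.

```cpp
#include <cassert>
#include <string>

int g_detect_calls = 0;

// Hypothetical stand-in for the detection function; counts its calls.
std::string DetectImplementation() {
    ++g_detect_calls;
    return "standard";
}

// Hypothetical logging macro that expands to nothing, so its arguments
// are never evaluated.
#define LOG_DISABLED(fmt, arg) do { } while (0)

// Fragile: detection only happens as a side effect of the log call.
void InitFragile() {
    LOG_DISABLED("Using the '%s' SHA256 implementation\n", DetectImplementation().c_str());
}

// Robust: detection runs unconditionally, and logging only reports it.
std::string InitRobust() {
    std::string impl = DetectImplementation();
    LOG_DISABLED("Using the '%s' SHA256 implementation\n", impl.c_str());
    return impl;
}
```

In the fragile version, silencing the log line also silently skips detection; assigning the result first keeps the two concerns separate.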

Member

theuni commented Jul 20, 2017

utACK 6b8d872, though I extensively tested earlier revisions.

@laanwj laanwj merged commit 6b8d872 into bitcoin:master Jul 20, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@laanwj laanwj added a commit that referenced this pull request Jul 20, 2017

@laanwj laanwj Merge #10821: Add SSE4 optimized SHA256
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)


Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
16240f4

TheBlueMatt referenced this pull request in bitcoinfibre/bitcoinfibre Jul 21, 2017

Closed

Build system do not check for yasm but requires it #4
