Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Add SSE4 optimized SHA256 #10821
Conversation
| + | ||
| +#if defined(__x86_64__) || defined(__amd64__) | ||
| + uint32_t eax, ebx, ecx, edx; | ||
| + if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx >> 20) & 1) { |
laanwj
Jul 14, 2017
•
Owner
I'd prefer to do this setup explicitly during initialization; this also avoids having to use an atomic pointer, which seems overkill (why would it ever change during runtime?) and may be inefficient on some platforms.
(also the detection might be more involved on some platforms, so it's better for clarity to drive it from an init function instead of magically at first call).
theuni
Jul 14, 2017
Member
We also have the option of using the ifunc attribute, supported on recent binutils with at least gcc and clang.
Though it's non-standard and afaik elf-specific, it's worth considering where possible.
|
Even with inline assembly, there are build complications unfortunately. The compile will fail if the target doesn't support it.. |
laanwj
added Validation Utils and libraries
labels
Jul 14, 2017
|
@luke-jr There are system macros to test whether you're compiling for x86_64 or not. |
|
You said almost every x86_64 CPU. Are we going to drop support for the outliers then? |
MeshCollider
commented
Jul 14, 2017
|
One of the travis builds obviously has an issue with it too: |
|
The clang/osx build succeeds when -fomit-frame-pointer is used. I don't speak enough asm to know if a register can be freed up. |
No it won't-- these files are compiled without -msse4.2 already. The only thing required is that its x86_64, which the build tests for. |
|
@luke-jr There is runtime detection to see if the CPU supports the extension. The only requirement is that the target is x86_64. |
|
Gitian OSX build is broken (https://bitcoin.jonasschnelli.ch/build/216):
No problem on Win/ |
|
@jonasschnelli @theuni figured it out - clang isn't compiling with |
|
Updated the code to use one fewer register. The original YASM code used the |
| +; documentation and/or other materials provided with the | ||
| +; distribution. | ||
| +; | ||
| +; * Neither the name of the Intel Corporation nor the names of its |
TheBlueMatt
Jul 14, 2017
Contributor
We're gonna have to do something to meet this condition, though it doesnt appear we'd have to do much.
gmaxwell
Jul 14, 2017
•
Member
This is the standard three clause BSD license, it is GPL and whatnot compatible. The source code to Bitcoin, which contains this notice, is part of the "documentation and/or other materials" we provide.
TheBlueMatt
Jul 14, 2017
Contributor
We ship sans-source all the time? I figured we'd just put a "contains softare copyright Intel" in the --help output or a README somewhere.
sipa
changed the title from
Add SSE 4.2 optimized SHA256 to [WIP] Add SSE 4.2 optimized SHA256
Jul 15, 2017
|
Marking as WIP, as this does not seem to produce correct hashes on OSX (cc @theuni). |
|
I poked at this for hours and came up empty-handed. I'll wait for someone else to confirm my osx breakage isn't just local. |
|
two more data points:
|
|
Tested ACK 08b7438. Good on OSX now! Edit: Though I'd prefer to have the cpu check done separately. |
sipa
changed the title from
[WIP] Add SSE 4.2 optimized SHA256 to Add SSE 4.2 optimized SHA256
Jul 16, 2017
|
Removing WIP tag, I believe we solved the OSX problem. |
|
Confirmed that this now runs on OSX. Running
master (5cfdda2) + this PR
|
|
This should do something to print what implementation its using to help spot runtime auto-detection bugs. |
|
@gmaxwell Already done |
|
Added an extra commit that performs a self-test before selecting an optimized transform function. |
|
@fanquake Are you compiling with -O0 or something similar? This shouldn't give a 10x speedup for the SHA256 benchmark. More like a factor 1.5x. |
|
Tested ACK on my OSX box as well as on a Debian with Skylake CPU OSX: Perf.-improvements: factor ~1.6. ---- DETAILS: Exts OSX:
Exts Debian:
OSX Master
OSX This PR
Debian Master
Debian This PR
(non 256 SHA's are for comp. reference). |
|
Rebased, and moved the autodetection to an explicit |
|
Improved the self test (it now tests 0, 1, and 2-block transforms), and made it assert when the selftest fails rather than failing over to the standard implementation. This way, it won't hide problems. |
sipa
changed the title from
Add SSE 4.2 optimized SHA256 to Add SSE4 optimized SHA256
Jul 17, 2017
| + "movdqa %19,%%xmm10;" | ||
| + "movdqa %20,%%xmm11;" | ||
| + | ||
| + "Lloop0_%=:" |
theuni
Jul 17, 2017
Member
It'd be helpful to add a little note about the 'L' prefix and what problem it solves. If nothing else, it may turn up as a another useful google hit for someone in the future.
| + return true; | ||
| +} | ||
| + | ||
| +TransformType Transform = sha256::Transform; |
theuni
Jul 17, 2017
Member
Like with the rand init, I think we'd save ourselves from future oopses by setting this to nullptr initially, and letting SHA256AutoDetect() set the fallback to sha256::Transform if necessary.
laanwj
Jul 17, 2017
•
Owner
I don't think that will work - there is some SHA256 work before main (IIRC to set up the chain parameteters). Better if that uses the 'canonical' SHA256.
|
utACK, looks good to me now, but I still think it's too late for 0.15. |
|
@laanwj Added a |
laanwj
added this to the
0.15.0
milestone
Jul 18, 2017
| +AC_ARG_ENABLE([experimental-asm], | ||
| + [AS_HELP_STRING([--enable-experimental-asm], | ||
| + [Enable experimental assembly routines (default is no)])], | ||
| + [experimental_asm=$withval], |
laanwj
Jul 18, 2017
Owner
Me too, and it didn't work for me unless I changed it. Using $withval here most llikelys pick up the last --with check result (for qrencode, which wasn't installed in my case, so it always had no)
| @@ -1162,6 +1172,7 @@ AM_CONDITIONAL([USE_LCOV],[test x$use_lcov = xyes]) | ||
| AM_CONDITIONAL([GLIBC_BACK_COMPAT],[test x$use_glibc_compat = xyes]) | ||
| AM_CONDITIONAL([HARDEN],[test x$use_hardening = xyes]) | ||
| AM_CONDITIONAL([ENABLE_HWCRC32],[test x$enable_hwcrc32 = xyes]) | ||
| +AM_CONDITIONAL([EXPERIMENTAL_ASM],[test x$experimental_asm = xyes]) |
theuni
Jul 18, 2017
Member
This is only needed if you intended to avoid compiling the _sse4.o variant altogether. AM_CONDITIONAL sets Makefile variables.
theuni
Jul 18, 2017
Member
On second thought, I'd actually prefer doing it that way in order to keep sha256_sse4.cpp completely generic. It was very helpful for me while testing to just throw together a quick test app using the .cpp directly.
The makefile change would become:
if EXPERIMENTAL_ASM
crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp
endif
Then obviously the guard isn't needed in the .cpp.
| +AC_ARG_ENABLE([experimental-asm], | ||
| + [AS_HELP_STRING([--enable-experimental-asm], | ||
| + [Enable experimental assembly routines (default is no)])], | ||
| + [experimental_asm=$withval], |
| @@ -1162,6 +1172,7 @@ AM_CONDITIONAL([USE_LCOV],[test x$use_lcov = xyes]) | ||
| AM_CONDITIONAL([GLIBC_BACK_COMPAT],[test x$use_glibc_compat = xyes]) | ||
| AM_CONDITIONAL([HARDEN],[test x$use_hardening = xyes]) | ||
| AM_CONDITIONAL([ENABLE_HWCRC32],[test x$enable_hwcrc32 = xyes]) | ||
| +AM_CONDITIONAL([EXPERIMENTAL_ASM],[test x$experimental_asm = xyes]) |
| @@ -11,11 +11,13 @@ | ||
| #if defined(__x86_64__) || defined(__amd64__) | ||
| #include <cpuid.h> |
theuni
Jul 18, 2017
Member
Nit: no need to risk including the not-guaranteed-to-exist header. Move the #if up a bit?
| @@ -5,6 +5,8 @@ | ||
| // This is a translation to GCC extended asm syntax from YASM code by Intel | ||
| // (available at the bottom of this file). | ||
| +#include "config/bitcoin-config.h" | ||
| + |
|
utACK modulo the small nits. |
| @@ -1161,6 +1161,7 @@ bool AppInitSanityChecks() | ||
| // ********************************************************* Step 4: sanity checks | ||
| // Initialize elliptic curve code | ||
| + LogPrintf("Using the '%s' SHA256 implementation\n", SHA256AutoDetect()); |
laanwj
Jul 20, 2017
Owner
Nit: Seems this is a log message with the side-effect of detecting the SHA256 implementation.
I'd prefer to assign the result explicitly, so that if someone happens to comment this out, or moves it to debug category, it won't just be skipped.
sipa
added some commits
Jul 14, 2017
|
utACK 6b8d872, though I extensively tested earlier revisions. |
laanwj
merged commit 6b8d872
into
bitcoin:master
Jul 20, 2017
1 check passed
laanwj
added a commit
that referenced
this pull request
Jul 20, 2017
|
|
laanwj |
16240f4
|
sipa commentedJul 14, 2017
•
edited
This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with
--enable-experimental-asm.In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.
This gives around a 50% speedup on the SHA256 benchmark for me.
It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.