Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SSE4 optimized SHA256 #10821

Merged
merged 5 commits into from Jul 20, 2017

Conversation

Projects
None yet
10 participants
@sipa
Copy link
Member

sipa commented Jul 14, 2017

This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with --enable-experimental-asm.

In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

This gives around a 50% speedup on the SHA256 benchmark for me.

It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

src/crypto/sha256.cpp Outdated

#if defined(__x86_64__) || defined(__amd64__)
uint32_t eax, ebx, ecx, edx;
if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx >> 20) & 1) {

This comment has been minimized.

@laanwj

laanwj Jul 14, 2017

Member

I'd prefer to do this setup explicitly during initialization; this also avoids having to use an atomic pointer, which seems overkill (why would it ever change during runtime?) and may be inefficient on some platforms.
(also the detection might be more involved on some platforms, so it's better for clarity to drive it from an init function instead of magically at first call).

This comment has been minimized.

@theuni

theuni Jul 14, 2017

Member

We also have the option of using the ifunc attribute, supported on recent binutils with at least gcc and clang.

Though it's non-standard and afaik elf-specific, it's worth considering where possible.

This comment has been minimized.

@gmaxwell

gmaxwell Jul 16, 2017

Member

do we have constructors with hashing in them?

This comment has been minimized.

@sipa

sipa Jul 16, 2017

Author Member

@laanwj Fixed.

@luke-jr

This comment has been minimized.

Copy link
Member

luke-jr commented Jul 14, 2017

Even with inline assembly, there are build complications unfortunately. The compile will fail if the target doesn't support it..

@sipa sipa force-pushed the sipa:20170713_shasse branch Jul 14, 2017

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jul 14, 2017

@luke-jr There are system macros to test whether you're compiling for x86_64 or not.

@luke-jr

This comment has been minimized.

Copy link
Member

luke-jr commented Jul 14, 2017

You said almost every x86_64 CPU. Are we going to drop support for the outliers then?

@MeshCollider

This comment has been minimized.

Copy link
Member

MeshCollider commented Jul 14, 2017

One of the travis builds obviously has an issue with it too:
crypto/sha256_sse42.cpp:42:9: error: inline assembly requires more registers than available

@theuni

This comment has been minimized.

Copy link
Member

theuni commented Jul 14, 2017

The clang/osx build succeeds when -fomit-frame-pointer is used. I don't speak enough asm to know if a register can be freed up.

@gmaxwell

This comment has been minimized.

Copy link
Member

gmaxwell commented Jul 14, 2017

Even with inline assembly, there are build complications unfortunately. The compile will fail if the target doesn't support it..

No it won't-- these files are compiled without -msse4.2 already. The only thing required is that its x86_64, which the build tests for.

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jul 14, 2017

@luke-jr There is runtime detection to see if the CPU supports the extension. The only requirement is that the target is x86_64.

@jonasschnelli

This comment has been minimized.

Copy link
Member

jonasschnelli commented Jul 14, 2017

Gitian OSX build is broken (https://bitcoin.jonasschnelli.ch/build/216):

Generated test/data/base58_keys_invalid.json.h
crypto/sha256_sse42.cpp:42:9: error: inline assembly requires more registers than available
        "shl    $0x6,%2;"
        ^
1 error generated.

No problem on Win/ OSX Linux

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jul 14, 2017

@jonasschnelli @theuni figured it out - clang isn't compiling with -fomit-frame-pointer, and thus there is one fewer register available. Unfortunately, omitting the frame pointer still makes this code not work...

@sipa sipa force-pushed the sipa:20170713_shasse branch Jul 14, 2017

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jul 14, 2017

Updated the code to use one fewer register. The original YASM code used the dx register for two purposes, which I had separated out into two separate registers. They're merged now.

@sipa sipa force-pushed the sipa:20170713_shasse branch 2 times, most recently Jul 14, 2017

src/crypto/sha256_sse42.cpp Outdated
; documentation and/or other materials provided with the
; distribution.
;
; * Neither the name of the Intel Corporation nor the names of its

This comment has been minimized.

@TheBlueMatt

TheBlueMatt Jul 14, 2017

Contributor

We're gonna have to do something to meet this condition, though it doesnt appear we'd have to do much.

This comment has been minimized.

@gmaxwell

gmaxwell Jul 14, 2017

Member

This is the standard three clause BSD license, it is GPL and whatnot compatible. The source code to Bitcoin, which contains this notice, is part of the "documentation and/or other materials" we provide.

This comment has been minimized.

@TheBlueMatt

TheBlueMatt Jul 14, 2017

Contributor

We ship sans-source all the time? I figured we'd just put a "contains softare copyright Intel" in the --help output or a README somewhere.

@sipa sipa force-pushed the sipa:20170713_shasse branch Jul 15, 2017

@sipa sipa changed the title Add SSE 4.2 optimized SHA256 [WIP] Add SSE 4.2 optimized SHA256 Jul 15, 2017

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jul 15, 2017

Marking as WIP, as this does not seem to produce correct hashes on OSX (cc @theuni).

@sipa sipa force-pushed the sipa:20170713_shasse branch Jul 15, 2017

@theuni

This comment has been minimized.

Copy link
Member

theuni commented Jul 15, 2017

I poked at this for hours and came up empty-handed. I'll wait for someone else to confirm my osx breakage isn't just local.

@theuni

This comment has been minimized.

Copy link
Member

theuni commented Jul 15, 2017

two more data points:

  1. @fanquake verified that this crashes on osx for him as well.

  2. I managed to reproduce a crash on Linux with an old clang (3.2), and it's even uglier, crashing gdb as well:

Starting program: /home/cory/dev/bitcoin2/src/bitcoind
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
/build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion dwarf2_per_objfile->ranges.readin' failed. A problem internal to GDB has been detected, further debugging may prove unreliable. Quit this debugging session? (y or n) n /build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion dwarf2_per_objfile->ranges.readin' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
0x000000000074c910 in sha256_sse42::Transform (
/build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion dwarf2_per_objfile->ranges.readin' failed. A problem internal to GDB has been detected, further debugging may prove unreliable. Quit this debugging session? (y or n) n /build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion dwarf2_per_objfile->ranges.readin' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
Segmentation fault (core dumped)

@sipa sipa force-pushed the sipa:20170713_shasse branch 3 times, most recently Jul 15, 2017

@theuni

This comment has been minimized.

Copy link
Member

theuni commented Jul 16, 2017

Tested ACK 08b7438f73236fc738fb655f766e77a81e6b7311. Good on OSX now!

Edit: Though I'd prefer to have the cpu check done separately.

@sipa sipa changed the title [WIP] Add SSE 4.2 optimized SHA256 Add SSE 4.2 optimized SHA256 Jul 16, 2017

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jul 16, 2017

Removing WIP tag, I believe we solved the OSX problem.

@sipa sipa force-pushed the sipa:20170713_shasse branch Jul 18, 2017

src/crypto/sha256.cpp Outdated
@@ -11,11 +11,13 @@

#if defined(__x86_64__) || defined(__amd64__)
#include <cpuid.h>

This comment has been minimized.

@theuni

theuni Jul 18, 2017

Member

Nit: no need to risk including the not-guaranteed-to-exist header. Move the #if up a bit?

This comment has been minimized.

@sipa

sipa Jul 18, 2017

Author Member

Fixed.

src/crypto/sha256_sse4.cpp Outdated
@@ -5,6 +5,8 @@
// This is a translation to GCC extended asm syntax from YASM code by Intel
// (available at the bottom of this file).

#include "config/bitcoin-config.h"

This comment has been minimized.

@theuni

theuni Jul 18, 2017

Member

Not needed anymore :)

This comment has been minimized.

@sipa

sipa Jul 18, 2017

Author Member

Fixed.

@theuni

This comment has been minimized.

Copy link
Member

theuni commented Jul 18, 2017

utACK modulo the small nits.

@sipa sipa force-pushed the sipa:20170713_shasse branch Jul 18, 2017

@gmaxwell
Copy link
Member

gmaxwell left a comment

reACK

src/init.cpp Outdated
@@ -1161,6 +1161,7 @@ bool AppInitSanityChecks()
// ********************************************************* Step 4: sanity checks

// Initialize elliptic curve code
LogPrintf("Using the '%s' SHA256 implementation\n", SHA256AutoDetect());

This comment has been minimized.

@laanwj

laanwj Jul 20, 2017

Member

Nit: Seems this is a log message with the side-effect of detecting the SHA256 implementation.
I'd prefer to assign the result explicitly, so that if someone happens to comment this out, or moves it to debug category, it won't just be skipped.

This comment has been minimized.

@sipa

sipa Jul 20, 2017

Author Member

Fixed.

@sipa sipa force-pushed the sipa:20170713_shasse branch to 6b8d872 Jul 20, 2017

@theuni

This comment has been minimized.

Copy link
Member

theuni commented Jul 20, 2017

utACK 6b8d872, though I extensively tested earlier revisions.

@laanwj laanwj merged commit 6b8d872 into bitcoin:master Jul 20, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

laanwj added a commit that referenced this pull request Jul 20, 2017

Merge #10821: Add SSE4 optimized SHA256
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c

@jnewbery jnewbery referenced this pull request Jul 31, 2017

Closed

TODO for release notes 0.15.0 #9889

12 of 12 tasks complete

sickpig referenced this pull request in sickpig/BitcoinUnlimited Sep 13, 2017

Merge #10821: Add SSE4 optimized SHA256
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c

gandrewstone referenced this pull request in gandrewstone/BitcoinUnlimited Sep 13, 2017

Merge pull request #4 from sickpig/port-sha256-sse4
Port of Core PR  #10821: Add SSE4 optimized SHA256

sickpig referenced this pull request in sickpig/BitcoinUnlimited Oct 13, 2017

Port Core PR #10821: Add SSE4 optimized SHA256
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c

gandrewstone referenced this pull request in BitcoinUnlimited/BitcoinUnlimited Oct 19, 2017

Merge pull request #781 from sickpig/sha256-port
Backport of Core #10821 and #11176
@Sjors

This comment has been minimized.

Copy link
Member

Sjors commented Oct 20, 2017

For future reference, as of #11176 this is now enabled by default.

laanwj added a commit that referenced this pull request Jul 9, 2018

Merge #13386: SHA256 implementations based on Intel SHA Extensions
66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on #13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to #13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-#10821): 6.1ms
  * Using SSE4 (master, #10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, #10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, #10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.