Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert the 1-way SSE4 SHA256 code from asm to intrinsics #13442

Open
wants to merge 3 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@sipa
Copy link
Member

sipa commented Jun 11, 2018

Currently, master contains 2 implementations of SHA256 for SSE4:

  • A generic one written using GCC inline assembly (converted from Intel NASM code), added in #10821.
  • A specialized double-SHA256 for 64-byte inputs written using intrinsics, added in #13191.

The advantage of the inline assembly is that its performance is not affected by compiler optimizations (and doesn't even need compiler support for SSE4). The downside is that it is an opaque, unreadable, non-reusable blob of code.

This patch converts the former also to intrinsics - making its operation more clear, while hopefully lending itself to being adaptable for other specialized implementations.

The resulting implementation is slightly faster on my system (i7-7820HQ) when compiled with GCC 7.3. Small variations in the code can affect the optimizer though, and have as much as a few % impact on speed.

@fanquake fanquake added the Validation label Jun 12, 2018

@theuni

This comment has been minimized.

Copy link
Member

theuni commented Jun 12, 2018

Nice!

@sipa See theuni@d79fb1d for clang compile fixes, and theuni@4ee6fbb for a change that may or may not be needed to avoid a performance hit on AMD.

@sipa sipa force-pushed the sipa:201806_sse4intrin branch from 86e04f0 Jun 12, 2018

src/crypto/sha256_sse41.cpp Outdated
Round(a, b, c, d, e, f, g, h, Ws[0]);
XTMP0 = _mm_alignr_epi8(X3, X2, 4);
XTMP0 = _mm_add_epi32(XTMP0, X0);
XTMP3 = XTMP2 = XTMP1 = _mm_alignr_epi8(X1, X0, 4);

This comment has been minimized.

Copy link
@theuni

theuni Jun 12, 2018

Member

Is there some voodoo here, why not just use XTMP3 below? Does this avoid a pipeline stall or something?

This comment has been minimized.

Copy link
@sipa

sipa Jun 12, 2018

Author Member

No idea. It's just a translation of the existing assembly code.

@sipa sipa force-pushed the sipa:201806_sse4intrin branch 3 times, most recently Jun 12, 2018

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jun 12, 2018

@theuni Included the clang compile fixes. I'm going to benchmark to see whether to include the other changes.

@sipa sipa force-pushed the sipa:201806_sse4intrin branch to 9fe51b4 Jun 12, 2018

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jun 14, 2018

It would be worthwhile to benchmark this on reasonably recent clang versions as well - the performance impact may be very different depending on how good the compiler is at ordering parallel instruction paths.

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jun 18, 2018

Some more benchmarks, comparing GCC 7.3 and clang 6.0, for the SHA256 benchmark (i7-7820HQ, fixed to 2.2 GHz).

  • GCC, master: 4.4 ms
  • GCC, this PR: 4.3 ms
  • clang, master: 4.4 ms
  • clang, this PR: 4.8 ms

Unfortunately, it seems that clang isn't as good in producing as performant code from intrinsics.

@theuni

This comment has been minimized.

Copy link
Member

theuni commented Jul 19, 2018

@sipa Mind rebasing? I'd like to add the lib-per-cpu changes on top of this.

@sipa sipa force-pushed the sipa:201806_sse4intrin branch from 9fe51b4 to 8655e78 Jul 19, 2018

@sipa

This comment has been minimized.

Copy link
Member Author

sipa commented Jul 19, 2018

Rebased, though I don't think this PR is acceptable until we have a way to avoid the performance loss in clang.

@DrahtBot

This comment has been minimized.

Copy link
Contributor

DrahtBot commented Jul 28, 2018

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #13789 (crypto/sha256: Use pragmas to enforce necessary intrinsics for GCC and Clang by luke-jr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

w = _mm_add_epi32(w, x0);
_mm_store_si128((__m128i*)w32, w);

Round(a, b, c, d, e, f, g, h, w32[0]);

This comment has been minimized.

Copy link
@practicalswift

practicalswift Oct 4, 2018

Member

Is the value read from w32[0] is guaranteed to be initialised here?

t2 = _mm_srli_epi32(t2, 7);
t1 = _mm_or_si128(_mm_slli_epi32(t1, 32 - 7), t2);

Round(h, a, b, c, d, e, f, g, w32[1]);

This comment has been minimized.

Copy link
@practicalswift

practicalswift Oct 4, 2018

Member

Same here and throughout this function :-)

@@ -615,12 +615,9 @@ std::string SHA256AutoDetect()
#endif

if (have_sse4) {

This comment has been minimized.

Copy link
@practicalswift

practicalswift Oct 4, 2018

Member

Move the if statement inside of the #if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL) to remove the possibility of an empty if statement.

@bitcoin bitcoin deleted a comment from STALININST Oct 4, 2018

@sipa sipa force-pushed the sipa:201806_sse4intrin branch from 8655e78 to 4a221ce Oct 12, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.