Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add POWER8 ASM for 4-way SHA256 #13203

Closed
wants to merge 2 commits into
base: master
from

Conversation

Projects
None yet
8 participants
@TheBlueMatt
Copy link
Contributor

TheBlueMatt commented May 9, 2018

Based on #13191, This speeds up 4-way SHA256 by about 3.75x over the C impl.

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2018-05-asm branch May 9, 2018

@laanwj

This comment has been minimized.

Copy link
Member

laanwj commented May 9, 2018

Cool!
Ping @luke-jr

@luke-jr
Copy link
Member

luke-jr left a comment

I'd be surprised if this works as-is; either way, some cleanup needed to avoid breaking non-POWER builds.

configure.ac Outdated
@@ -762,6 +762,29 @@ AC_LINK_IFELSE([AC_LANG_SOURCE([
)
LDFLAGS="$TEMP_LDFLAGS"

TEMP_LDFLAGS="$LDFLAGS"
LDFLAGS="$TEMP_LDFLAGS -mpower8-vector"

This comment has been minimized.

@luke-jr

luke-jr May 9, 2018

Member

Shouldn't this be CPPFLAGS or CXXFLAGS?

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

Agree, but this should be assigned to POWER_VSX_CXXFLAGS or so above, in an AX_CHECK_LINK_FLAG test.

src/Makefile.am Outdated
@@ -309,6 +310,16 @@ crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2
endif
crypto_libbitcoin_crypto_avx2_a_SOURCES = crypto/sha256_avx2.cpp

if HAVE_POWER8
crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector

This comment has been minimized.

@luke-jr

luke-jr May 9, 2018

Member

This is probably needed on CXXFLAGS, since the man page says it changes the code generated.

src/crypto/sha256.cpp Outdated
@@ -29,6 +29,15 @@ namespace sha256d64_avx2
void Transform_8way(unsigned char* out, const unsigned char* in);
}

#if defined(__linux__) && defined(HAVE_POWER8)
#include <sys/auxv.h>

This comment has been minimized.

@luke-jr

luke-jr May 9, 2018

Member

I don't think we need this here, and POWER vectoring stuff seems to screw with standard C++ stuff, so maybe not a good idea to have it.

This comment has been minimized.

@TheBlueMatt

TheBlueMatt May 9, 2018

Author Contributor

auxv isnt POWER- or vector-specific, its to get getauxval which is a libc function.

src/crypto/sha256.cpp Outdated
@@ -511,6 +520,12 @@ std::string SHA256AutoDetect()
ret = "sse4";
#endif
}
#elif (defined(__linux__)) && defined(HAVE_POWER8)
if (getauxval(AT_HWCAP2) & 0x02000000) {

This comment has been minimized.

@luke-jr

luke-jr May 9, 2018

Member

Can we move this check to a function in the power8-specific file?

This comment has been minimized.

@TheBlueMatt

TheBlueMatt May 9, 2018

Author Contributor

Its right below the x86-specific support check, and its a short file, it seems like overkill.

src/crypto/sha256_power8.cpp Outdated
{

typedef __vector uint32_t uint32x4_p8;
typedef __vector uint8_t uint8x16_p8;

This comment has been minimized.

@luke-jr

luke-jr May 9, 2018

Member

__vector is a non-standard GCC extension; any reason we can't use vector here directly?

src/crypto/sha256_power8.cpp Outdated
#include <config/bitcoin-config.h>
#endif

#ifdef HAVE_POWER8

This comment has been minimized.

@luke-jr

luke-jr May 9, 2018

Member

Since this file is only compiled with HAVE_POWER8, I think we don't need this here.

@laanwj laanwj added the Validation label May 9, 2018

@theuni
Copy link
Member

theuni left a comment

Concept ACK!

First set of build-system comments, there are few things to sort out. Will continue reviewing after the first round of cleanups.

src/Makefile.am Outdated
crypto_libbitcoin_crypto_avx2_a_CPPFLAGS = $(CPPFLAGS)
if ENABLE_AVX2
crypto_libbitcoin_crypto_avx2_a_CXXFLAGS += $(AVX2_CXXFLAGS)
crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

Can't compiler defines be used instead so that non-autoconf builds can compile with optims too?

Specifically for this one, rather than ifdef(ENABLE_AVX2), use ifdef(__AVX2__)

configure.ac Outdated
@@ -762,6 +762,29 @@ AC_LINK_IFELSE([AC_LANG_SOURCE([
)
LDFLAGS="$TEMP_LDFLAGS"

TEMP_LDFLAGS="$LDFLAGS"
LDFLAGS="$TEMP_LDFLAGS -mpower8-vector"

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

Agree, but this should be assigned to POWER_VSX_CXXFLAGS or so above, in an AX_CHECK_LINK_FLAG test.

configure.ac Outdated
@@ -327,6 +330,45 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
)
CXXFLAGS="$TEMP_CXXFLAGS"

AC_DEFINE(ENABLE_AVX2, 1, [Define this symbol to build code that uses AVX2 intrinsics])

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

Only define these in their respective check's success cases.

configure.ac Outdated
}
]])],
[
AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.])

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

Please move the defines/conditionals down with the other asm flags for consistency.

src/Makefile.am Outdated
@@ -289,6 +294,32 @@ if USE_ASM
crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp
endif

crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS)

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

These should all be $(AM_CXXFLAGS) $(PIE_FLAGS).

src/Makefile.am Outdated
@@ -289,6 +294,32 @@ if USE_ASM
crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp
endif

crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS)
crypto_libbitcoin_crypto_sse41_a_CPPFLAGS = $(CPPFLAGS)

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

Ditto for the CPPFLAGS.

src/Makefile.am Outdated
crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp

LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8)
EXTRA_LIBRARIES += $(LIBBITCOIN_CRYPTO_POWER8)

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

append to EXTRA_LIBRARIES outside of the if HAVE_POWER8. These libraries are only built if something depends on them.

src/Makefile.am Outdated
crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS)
crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp

LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8)

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

I'm not sure whether this works as intended or not, but we don't want to be doing it either way. Add LIBBITCOIN_CRYPTO_POWER8 to bitcoind_LDADD like the others, we can potentially aggregate them as a follow-up.

This comment has been minimized.

@TheBlueMatt

TheBlueMatt May 9, 2018

Author Contributor

Why not? Its shorthand for "crypto depends on this" instead of listing it as a dep in every target that may use it. Its also much less britle, which is important given we dont have a POWER travis target.

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

I agree with aggregating, but as a clean follow-up, or as part of #13191. I'd just prefer not to have one-off's. It's not clear, for example, if EXTRA_LIBRARIES is appended twice, as it depends on when LIBBITCOIN_CRYPTO is calculated.

@TheBlueMatt

This comment has been minimized.

Copy link
Contributor Author

TheBlueMatt commented May 9, 2018

@luke-jr No, it definitely works, but its apparently /only/ tested on POWER, not "not yet tested" :p.

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2018-05-asm branch 2 times, most recently May 9, 2018

configure.ac Outdated
]
)
AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8])
LDFLAGS="$TEMP_LDFLAGS"

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

This should reset CXXFLAGS, not LDFLAGS.

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2018-05-asm branch 2 times, most recently May 9, 2018

configure.ac Outdated
]])],
[
AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.])
AC_MSG_RESULT(yes)

This comment has been minimized.

@theuni

theuni May 9, 2018

Member
POWER8_ASM_CXXFLAGS="-mpower8-vector"
AC_MSG_RESULT(yes)

Then at the bottom, with the others:

AC_SUBST(POWER8_ASM_CXXFLAGS)

Then in the Makefile.am:

crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS) $(POWER8_ASM_CXXFLAGS)
configure.ac Outdated
AC_MSG_RESULT(no)
]
)
AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8])

This comment has been minimized.

@theuni

theuni May 9, 2018

Member

HAVE_POWER8 is a define that will be handled by autoheader, it can't be treated like a variable here. Need to set one separately.

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2018-05-asm branch May 10, 2018

@gmaxwell

This comment has been minimized.

Copy link
Member

gmaxwell commented May 15, 2018

spiffy! will test soon

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2018-05-asm branch Jun 4, 2018

@TheBlueMatt

This comment has been minimized.

Copy link
Contributor Author

TheBlueMatt commented Jun 4, 2018

Rebased.

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2018-05-asm branch 3 times, most recently Jun 11, 2018

TheBlueMatt added some commits Jun 11, 2018

Add POWER8 vector impl for 4-way SHA256
This speeds up 4-way SHA256 by about 3.75x over the C impl.
Make checkqueue bench a realistic result
Previously checkqueue bench would essentially just test cross-core
latency with more threads the more threads a host had. Not only did
this make the result worse on hosts with more CPU cores, but it
doesn't mimic how we use the checkqueue in real life, and made
bench_bitcoin almost unusably slow on high-core-count systems,
especially dual-CPU systems.

Instead, we use the same thread count that we use for the real
validation queue, with a 75-microsecond busy-wait on each thread,
which should at least vaguely mimic how we use the checkqueue.

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2018-05-asm branch to 3b402e0 Jun 11, 2018

@TheBlueMatt

This comment has been minimized.

Copy link
Contributor Author

TheBlueMatt commented Jun 11, 2018

Rebased. Should be ready for merge +/- some more review, though luckily the tests should mostly cover it.

@TheBlueMatt

This comment has been minimized.

Copy link
Contributor Author

TheBlueMatt commented Jun 11, 2018

Note that I also had to fix the checkqueue bench as otherwise bench_bitcoin is unsuably slow on multi-cpu (or many-core) systems.

@laanwj

This comment has been minimized.

Copy link
Member

laanwj commented Jun 14, 2018

I'm unable to test this at the moment. Can someone please post the benchmark results before/after this?

@DrahtBot

This comment has been minimized.

Copy link
Contributor

DrahtBot commented Jul 9, 2018

Needs rebase
// file COPYING or http://www.opensource.org/licenses/mit-license.php.
//
// This is a translation to GCC extended asm syntax from YASM code by Intel
// (available at the bottom of this file).

This comment has been minimized.

@luke-jr

luke-jr Jul 25, 2018

Member

I don't see any YASM code at the bottom; also, really from Intel?

This comment has been minimized.

@sipa

sipa Jul 25, 2018

Member

Looks like that was copied from the sse4 code.

#include <stdint.h>
]], [[
unsigned char src[16];
__builtin_crypto_vshasigmaw((__vector uint32_t)vec_vsx_ld(0, src), 1, 0xf);

This comment has been minimized.

@luke-jr

luke-jr Jul 25, 2018

Member

Better to use vector here. We don't use [the non-standard] __vector in the code itself.

@@ -314,6 +314,7 @@ fi
AX_CHECK_COMPILE_FLAG([-msse4.2],[[SSE42_CXXFLAGS="-msse4.2"]],,[[$CXXFLAG_WERROR]])
AX_CHECK_COMPILE_FLAG([-msse4.1],[[SSE41_CXXFLAGS="-msse4.1"]],,[[$CXXFLAG_WERROR]])
AX_CHECK_COMPILE_FLAG([-mavx -mavx2],[[AVX2_CXXFLAGS="-mavx -mavx2"]],,[[$CXXFLAG_WERROR]])
AX_CHECK_COMPILE_FLAG([-mpower8-vector],[[POWER8_CXXFLAGS="-mpower8-vector"]],,[[$CXXFLAG_WERROR]])

This comment has been minimized.

@luke-jr

luke-jr Jul 25, 2018

Member

I think -faltivec is needed too?

@gmaxwell

This comment has been minimized.

Copy link
Member

gmaxwell commented Aug 21, 2018

This needs a trivial rebase but with that works fine against master. ACK. Will post benchmarks after IBD finishes.

@gmaxwell

This comment has been minimized.

Copy link
Member

gmaxwell commented Aug 23, 2018

Before:
MerkleRoot, 5, 800, 21.6763, 0.00541869, 0.00541984, 0.00541886
After:
MerkleRoot, 5, 800, 6.32571, 0.00158138, 0.00158148, 0.00158142

On power9.

Interesting reindex speed was about the same for both.

@gmaxwell

This comment has been minimized.

Copy link
Member

gmaxwell commented Aug 23, 2018

@TheBlueMatt Rebase.

@TheBlueMatt

This comment has been minimized.

Copy link
Contributor Author

TheBlueMatt commented Aug 24, 2018

Hmm, well if it doesnt speed up rebase then there is something funky we should look into before merging new code, no?

@TheBlueMatt

This comment has been minimized.

Copy link
Contributor Author

TheBlueMatt commented Aug 24, 2018

Gonna go ahead and close it, if someone wants to adopt it and figure out why IBD didn't improve where the equivalent x86 change improved it significantly, they should adopt it.

@gmaxwell

This comment has been minimized.

Copy link
Member

gmaxwell commented Aug 24, 2018

The comparable change made a big impact on x86 many months ago-- who knows now... Synchronization primitives are more expensive on power, so it also wouldn't be surprising if our glib use of atomics all over the place weren't moving around the limiting factors in performance on power compared to x86.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.