Replace BLAKE2b with BLAKE3 #519

Merged
merged 6 commits into from
Jun 18, 2020

Contributor

erijo commented Jan 22, 2020

Description

Replace the BLAKE2b hash function with BLAKE3 (see #503). The BLAKE3 source files are added as third_party sources and built when building ccache.

I've compared the performance on my machine by running make CXX=/usr/bin/g++ performance.

On master (42939f0):

Without ccache:                             2.6040 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 2.7191 s (104.4190 %) (  0.9577 x)
With ccache, preprocessor mode, cache hit:  0.0892 s (  3.4245 %) ( 29.2010 x)
With ccache, direct mode, cache miss:       2.7271 s (104.7250 %) (  0.9549 x)
With ccache, direct mode, cache hit:        0.0156 s (  0.5986 %) (167.0574 x)
With ccache, depend mode, cache miss:       2.6532 s (101.8877 %) (  0.9815 x)
With ccache, depend mode, cache hit:        0.0157 s (  0.6035 %) (165.6877 x)

This PR:

Without ccache:                             2.5987 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 2.6948 s (103.7001 %) (  0.9643 x)
With ccache, preprocessor mode, cache hit:  0.0879 s (  3.3843 %) ( 29.5482 x)
With ccache, direct mode, cache miss:       2.6909 s (103.5476 %) (  0.9657 x)
With ccache, direct mode, cache hit:        0.0129 s (  0.4956 %) (201.7592 x)
With ccache, depend mode, cache miss:       2.6251 s (101.0160 %) (  0.9899 x)
With ccache, depend mode, cache hit:        0.0131 s (  0.5035 %) (198.6122 x)

This PR but all BLAKE3 sources built with -O3 instead of -O2:

Without ccache:                             2.6031 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 2.7248 s (104.6749 %) (  0.9553 x)
With ccache, preprocessor mode, cache hit:  0.0886 s (  3.4035 %) ( 29.3813 x)
With ccache, direct mode, cache miss:       2.7173 s (104.3844 %) (  0.9580 x)
With ccache, direct mode, cache hit:        0.0130 s (  0.4987 %) (200.5049 x)
With ccache, depend mode, cache miss:       2.6381 s (101.3422 %) (  0.9868 x)
With ccache, depend mode, cache hit:        0.0130 s (  0.4975 %) (200.9903 x)
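The percentage and speedup columns in the tables above appear to be derived from the "Without ccache" baseline of the same run. The following is a minimal sketch of that arithmetic, assuming that interpretation; the function name is hypothetical and not part of ccache. Because the printed times are rounded to four decimals, the recomputed values only approximately match the table.

```python
def relative_to_baseline(baseline_s, measured_s):
    """Return (percent of baseline time, speedup factor) for one table row."""
    percent = measured_s / baseline_s * 100.0
    speedup = baseline_s / measured_s
    return percent, speedup


# Timings copied from the "This PR" table above (already rounded there).
baseline = 2.5987    # Without ccache
direct_hit = 0.0129  # With ccache, direct mode, cache hit

percent, speedup = relative_to_baseline(baseline, direct_hit)
print(f"{percent:.4f} % of baseline, {speedup:.4f} x")
```

The same function applied to the master table lets the cache-hit rows of two runs be compared on equal footing, since each run has its own baseline.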

Contributor Author

erijo commented Jan 23, 2020

There seems to be a problem building the AVX512 version on Travis (perhaps GCC is too old?). I'll leave it for now, but I can have a look at it later if you're interested in merging this PR.

Contributor Author

erijo commented Jan 23, 2020

I did another test where I built KDAB/hotspot@a7c2ecb1 with make -j8. I measured the time it took to build using three different ccache versions. For each ccache version I ran the build three times: once with an empty cache directory and then twice with a populated cache directory.

ccache 3.7.7 from Debian

235,79s user 19,71s system 386% cpu 1:06,03 total (empty cache)
15,16s user 3,60s system 329% cpu 5,688 total
15,26s user 3,68s system 330% cpu 5,730 total

ccache master (42939f0)

228,53s user 19,61s system 385% cpu 1:04,44 total (empty cache)
15,79s user 3,82s system 335% cpu 5,846 total
15,51s user 3,78s system 333% cpu 5,782 total

This PR

235,94s user 19,62s system 385% cpu 1:06,30 total (empty cache)
14,93s user 3,73s system 335% cpu 5,567 total
15,11s user 3,54s system 336% cpu 5,543 total

Based on this limited testing, it seems that this version is faster than 3.7.7, which in turn is faster than current master.
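To make the comparison above easier to check, here is a small sketch that parses the "total" column and averages the two warm-cache runs for each version. It assumes the output is in zsh's `time` format with a comma as the decimal separator; the helper names are hypothetical.

```python
import re


def parse_total(text):
    """Parse a total such as '1:06,03' or '5,688' into seconds."""
    text = text.replace(",", ".")
    match = re.fullmatch(r"(?:(\d+):)?(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# Warm-cache totals copied from the three result blocks above.
warm_totals = {
    "ccache 3.7.7": ["5,688", "5,730"],
    "ccache master (42939f0)": ["5,846", "5,782"],
    "this PR": ["5,567", "5,543"],
}


def mean_warm_seconds(label):
    """Average the warm-cache totals for one ccache version."""
    values = [parse_total(t) for t in warm_totals[label]]
    return sum(values) / len(values)


for label in warm_totals:
    print(f"{label}: {mean_warm_seconds(label):.3f} s")
```

With only two warm runs per version, the differences are small enough that they should be read as indicative rather than conclusive.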

Contributor Author

erijo commented Feb 10, 2020

I took a look at the CI failure. The problem seems to be that GCC 5.4 generates vmovdqu %ymm17, (%r11), which, as far as I can tell, is an invalid instruction: it is VEX-encoded and can therefore only access registers ymm0-ymm15. To access ymm16-ymm31, the instruction has to be EVEX-encoded.

Adding -mavx512bw to the build flags makes GCC generate vmovdqu8 %ymm17, (%r11) instead, which compiles (the 8 suffix makes it an EVEX instruction), but I can't really judge the consequences of that change.
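The register-range rule described above can be expressed as a tiny check. This is only an illustration of the rule as stated in the comment (VEX reaches ymm0-ymm15, EVEX reaches ymm16-ymm31); the function name is hypothetical and it does not model the actual instruction encodings.

```python
import re

VEX_MAX_YMM = 15   # highest ymm register a VEX-encoded instruction can name
EVEX_MAX_YMM = 31  # highest ymm register an EVEX-encoded instruction can name


def needs_evex(register):
    """Return True if the ymm register can only be reached with EVEX."""
    match = re.fullmatch(r"%?ymm(\d+)", register)
    if match is None:
        raise ValueError(f"not a ymm register: {register!r}")
    index = int(match.group(1))
    if index > EVEX_MAX_YMM:
        raise ValueError(f"no such register: {register!r}")
    return index > VEX_MAX_YMM


for reg in ("%ymm15", "%ymm17"):
    print(reg, "needs EVEX" if needs_evex(reg) else "fits in VEX")
```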

Contributor Author

erijo commented Feb 13, 2020

I've now rebased this PR on the latest master and updated BLAKE3 to the first release of the C implementation. Following the discussion in #503, I've also reverted my digest size change, so the size is now back to 160 bits as it was before (in a separate commit to make it easy to go one way or the other).

With the latest BLAKE3 release all build errors are fixed and performance is a tiny bit better than before. On my computer I now get:

Without ccache:                             2.6827 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 2.7722 s (103.3368 %) (  0.9677 x)
With ccache, preprocessor mode, cache hit:  0.0887 s (  3.3047 %) ( 30.2603 x)
With ccache, direct mode, cache miss:       2.7757 s (103.4676 %) (  0.9665 x)
With ccache, direct mode, cache hit:        0.0123 s (  0.4595 %) (217.6293 x)
With ccache, depend mode, cache miss:       2.7122 s (101.1020 %) (  0.9891 x)
With ccache, depend mode, cache hit:        0.0124 s (  0.4609 %) (216.9495 x)
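For a sense of what the 160-bit digest size mentioned above means in practice, the sketch below produces a 20-byte digest. Python's standard library has no BLAKE3, so this uses `hashlib.blake2b` (the hash being replaced) purely as a stand-in; it is not how ccache computes its digests, and the function name is hypothetical.

```python
import hashlib

DIGEST_BITS = 160
DIGEST_BYTES = DIGEST_BITS // 8  # 20 bytes


def short_digest(data):
    """Return a 160-bit BLAKE2b digest of data as a hex string."""
    # BLAKE2b takes the digest size as a parameter of the hash itself.
    return hashlib.blake2b(data, digest_size=DIGEST_BYTES).hexdigest()


digest = short_digest(b"int main() { return 0; }\n")
print(len(digest) * 4, "bits:", digest)
```

A 160-bit digest is 40 hex characters, which keeps the names derived from it shorter than a full-width digest would.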

Contributor Author

erijo commented Feb 25, 2020

I've updated BLAKE3 to the latest release (BLAKE3-team/BLAKE3@c-0.2.2) and rebased on the latest ccache master.

Updated performance numbers now look like this:

Without ccache:                             3.1337 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 3.2211 s (102.7911 %) (  0.9728 x)
With ccache, preprocessor mode, cache hit:  0.0899 s (  2.8695 %) ( 34.8487 x)
With ccache, direct mode, cache miss:       3.3066 s (105.5195 %) (  0.9477 x)
With ccache, direct mode, cache hit:        0.0130 s (  0.4162 %) (240.2744 x)
With ccache, depend mode, cache miss:       3.3053 s (105.4788 %) (  0.9481 x)
With ccache, depend mode, cache hit:        0.0130 s (  0.4141 %) (241.4907 x)

Compared to master (e0f5063):

Without ccache:                             3.0477 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 3.1426 s (103.1121 %) (  0.9698 x)
With ccache, preprocessor mode, cache hit:  0.0898 s (  2.9453 %) ( 33.9520 x)
With ccache, direct mode, cache miss:       3.1478 s (103.2845 %) (  0.9682 x)
With ccache, direct mode, cache hit:        0.0165 s (  0.5406 %) (184.9961 x)
With ccache, depend mode, cache miss:       3.1039 s (101.8438 %) (  0.9819 x)
With ccache, depend mode, cache hit:        0.0165 s (  0.5405 %) (185.0014 x)

jrosdahl added the improvement label Jun 8, 2020
erijo force-pushed the blake3 branch 2 times, most recently from 35f1ecb to 3367633 June 17, 2020 20:17
Contributor Author

erijo commented Jun 17, 2020

I've rebased on top of the latest master, ported the build to CMake and updated to the latest BLAKE3.

Running ../misc/performance /usr/bin/g++ -std=c++11 -I../src -I. ../src/ccache.cpp in my build directory results in the following when using master (53f8eff):

Without ccache:                             1.2119 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 1.3758 s (113.5180 %) (  0.8809 x)
With ccache, preprocessor mode, cache hit:  0.1021 s (  8.4253 %) ( 11.8691 x)
With ccache, direct mode, cache miss:       1.4012 s (115.6167 %) (  0.8649 x)
With ccache, direct mode, cache hit:        0.0159 s (  1.3130 %) ( 76.1634 x)
With ccache, depend mode, cache miss:       1.4119 s (116.4988 %) (  0.8584 x)
With ccache, depend mode, cache hit:        0.0160 s (  1.3173 %) ( 75.9121 x)

Compared with this when using this PR:

Without ccache:                             1.2554 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 1.4155 s (112.7486 %) (  0.8869 x)
With ccache, preprocessor mode, cache hit:  0.0916 s (  7.2974 %) ( 13.7035 x)
With ccache, direct mode, cache miss:       1.3175 s (104.9410 %) (  0.9529 x)
With ccache, direct mode, cache hit:        0.0124 s (  0.9869 %) (101.3293 x)
With ccache, depend mode, cache miss:       1.3245 s (105.4975 %) (  0.9479 x)
With ccache, depend mode, cache hit:        0.0125 s (  0.9960 %) (100.3981 x)
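The two tables above can be compared row by row with a short script. This is a sketch that assumes the output format shown here (label, colon, time in seconds); the names are hypothetical. The two runs have slightly different "Without ccache" baselines, so the absolute changes are only a rough guide.

```python
import re

# Timings copied verbatim from the two tables above.
MASTER_REPORT = """\
Without ccache:                             1.2119 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 1.3758 s (113.5180 %) (  0.8809 x)
With ccache, preprocessor mode, cache hit:  0.1021 s (  8.4253 %) ( 11.8691 x)
With ccache, direct mode, cache miss:       1.4012 s (115.6167 %) (  0.8649 x)
With ccache, direct mode, cache hit:        0.0159 s (  1.3130 %) ( 76.1634 x)
With ccache, depend mode, cache miss:       1.4119 s (116.4988 %) (  0.8584 x)
With ccache, depend mode, cache hit:        0.0160 s (  1.3173 %) ( 75.9121 x)
"""

PR_REPORT = """\
Without ccache:                             1.2554 s (100.0000 %) (  1.0000 x)
With ccache, preprocessor mode, cache miss: 1.4155 s (112.7486 %) (  0.8869 x)
With ccache, preprocessor mode, cache hit:  0.0916 s (  7.2974 %) ( 13.7035 x)
With ccache, direct mode, cache miss:       1.3175 s (104.9410 %) (  0.9529 x)
With ccache, direct mode, cache hit:        0.0124 s (  0.9869 %) (101.3293 x)
With ccache, depend mode, cache miss:       1.3245 s (105.4975 %) (  0.9479 x)
With ccache, depend mode, cache hit:        0.0125 s (  0.9960 %) (100.3981 x)
"""

# One row: "<label>: <seconds> s ..."
ROW = re.compile(r"^(.+?):\s+(\d+\.\d+) s", re.MULTILINE)


def parse_report(report):
    """Map each row label to its time in seconds."""
    return {m.group(1): float(m.group(2)) for m in ROW.finditer(report)}


def percent_change(before, after):
    """Relative change in percent; negative means 'after' is faster."""
    return (after - before) / before * 100.0


master = parse_report(MASTER_REPORT)
pr = parse_report(PR_REPORT)

for label in master:
    print(f"{label}: {percent_change(master[label], pr[label]):+.1f} %")
```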

jrosdahl added this to the 4.0 milestone Jun 17, 2020
jrosdahl merged commit 2a0dd8e into ccache:master Jun 18, 2020
@jrosdahl
Member

Thanks!

erijo deleted the blake3 branch June 22, 2020 07:46