Unroll the ChaCha20 inner loop for performance #24946

Merged
merged 1 commit into from
May 9, 2022

Conversation

sipa
Member

@sipa sipa commented Apr 22, 2022

Unrolling the inner ChaCha20 loop gives a ~15% speedup for me in the CHACHA20_* benchmarks. It's a simple change, this performance helps with RNG generation, and will matter more for BIP324.

@kristapsk
Contributor

Concept ACK, I see ./src/bench/bench_bitcoin improvements with this change.

@instagibbs
Member

instagibbs commented Apr 22, 2022

Can you give commands to run just those benches for those willing to replicate?

@sipa
Member Author

sipa commented Apr 22, 2022

@instagibbs

./src/bench/bench_bitcoin -filter=.*CHACHA20_[1-9].*

@laanwj
Member

laanwj commented Apr 22, 2022

I'm somewhat surprised that unrolling a loop that is 10 times the same thing gives that much of a performance win on modern CPUs. But it's just a few ROL instructions, I guess, so the loop overhead easily dominates? Anyhow, concept ACK.

@instagibbs
Member

Getting a rough average of 15% speedup as well

@sipa
Member Author

sipa commented Apr 22, 2022

@laanwj It may also have to do with better register scheduling when unrolling (the same variable doesn't need to stay in the same register every iteration), though I haven't investigated what the difference in emitted asm is.

This change may be very compiler and platform dependent, so it may be good to know what its impact is with modern clang versions and/or on arm64 systems.
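
To make the discussion concrete, here is a minimal sketch of the two forms being compared: ten double rounds driven by a loop, and the same ten double rounds written out by hand. The names (`Rotl32`, `QuarterRound`, `DoubleRound`, `RoundsLooped`, `RoundsUnrolled`) and the `std::array` state are assumptions made for illustration; they are not taken from the PR, and the real code may be laid out quite differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Rotate a 32-bit word left by n bits (0 < n < 32).
inline uint32_t Rotl32(uint32_t v, int n) { return (v << n) | (v >> (32 - n)); }

// ChaCha quarter round: add, xor, rotate by 16, 12, 8 and 7 bits.
inline void QuarterRound(uint32_t& a, uint32_t& b, uint32_t& c, uint32_t& d)
{
    a += b; d = Rotl32(d ^ a, 16);
    c += d; b = Rotl32(b ^ c, 12);
    a += b; d = Rotl32(d ^ a, 8);
    c += d; b = Rotl32(b ^ c, 7);
}

using State = std::array<uint32_t, 16>;

// One double round: four column quarter rounds, then four diagonal ones.
inline void DoubleRound(State& x)
{
    QuarterRound(x[0], x[4], x[8], x[12]);
    QuarterRound(x[1], x[5], x[9], x[13]);
    QuarterRound(x[2], x[6], x[10], x[14]);
    QuarterRound(x[3], x[7], x[11], x[15]);
    QuarterRound(x[0], x[5], x[10], x[15]);
    QuarterRound(x[1], x[6], x[11], x[12]);
    QuarterRound(x[2], x[7], x[8], x[13]);
    QuarterRound(x[3], x[4], x[9], x[14]);
}

// Looped form: 10 double rounds, i.e. 20 rounds, with a loop counter.
void RoundsLooped(State& x)
{
    for (int i = 0; i < 10; ++i) DoubleRound(x);
}

// Hand-unrolled form: the same 10 double rounds with no counter or branch.
void RoundsUnrolled(State& x)
{
    DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x);
    DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x);
}
```

Both functions compute the same result; any difference in speed comes from what the compiler emits for each, which is why the measurements in this thread vary by compiler and platform.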

@jonatack
Member

Debian testing clang 15, normal (non-debug) build, fixed CPU speed, I'm not sure I'm seeing a difference. Trying again after optimizing and tuning further.

@laanwj
Member

laanwj commented Apr 22, 2022

Gcc 11.2.0, x86_64:

  • The function ChaCha20::Keystream grows in size from 992 bytes to 3840 (doesn't seem too bad, still fits in a page).
  • One iteration of the loop looks like:
 370:	41 01 ed             	add    %ebp,%r13d
 373:	41 01 db             	add    %ebx,%r11d
 376:	41 01 f2             	add    %esi,%r10d
 379:	44 31 e9             	xor    %r13d,%ecx
 37c:	44 31 da             	xor    %r11d,%edx
 37f:	44 31 d0             	xor    %r10d,%eax
 382:	c1 c1 10             	rol    $0x10,%ecx
 385:	c1 c2 10             	rol    $0x10,%edx
 388:	41 01 c9             	add    %ecx,%r9d
 38b:	01 d7                	add    %edx,%edi
 38d:	c1 c0 10             	rol    $0x10,%eax
 390:	44 31 cd             	xor    %r9d,%ebp
 393:	31 fb                	xor    %edi,%ebx
 395:	41 01 c4             	add    %eax,%r12d
 398:	c1 c5 0c             	rol    $0xc,%ebp
 39b:	c1 c3 0c             	rol    $0xc,%ebx
 39e:	44 31 e6             	xor    %r12d,%esi
 3a1:	41 01 ed             	add    %ebp,%r13d
 3a4:	41 01 db             	add    %ebx,%r11d
 3a7:	c1 c6 0c             	rol    $0xc,%esi
 3aa:	44 31 e9             	xor    %r13d,%ecx
 3ad:	44 31 da             	xor    %r11d,%edx
 3b0:	41 01 f2             	add    %esi,%r10d
 3b3:	c1 c1 08             	rol    $0x8,%ecx
 3b6:	c1 c2 08             	rol    $0x8,%edx
 3b9:	44 31 d0             	xor    %r10d,%eax
 3bc:	41 01 c9             	add    %ecx,%r9d
 3bf:	01 d7                	add    %edx,%edi
 3c1:	44 31 cd             	xor    %r9d,%ebp
 3c4:	31 fb                	xor    %edi,%ebx
 3c6:	89 7c 24 08          	mov    %edi,0x8(%rsp)
 3ca:	c1 c5 07             	rol    $0x7,%ebp
 3cd:	c1 c3 07             	rol    $0x7,%ebx
 3d0:	44 89 4c 24 04       	mov    %r9d,0x4(%rsp)
 3d5:	c1 c0 08             	rol    $0x8,%eax
 3d8:	45 01 f8             	add    %r15d,%r8d
 3db:	41 01 dd             	add    %ebx,%r13d
 3de:	45 31 c6             	xor    %r8d,%r14d
 3e1:	41 01 c4             	add    %eax,%r12d
 3e4:	44 89 f7             	mov    %r14d,%edi
 3e7:	44 8b 74 24 0c       	mov    0xc(%rsp),%r14d
 3ec:	44 31 e6             	xor    %r12d,%esi
 3ef:	c1 c7 10             	rol    $0x10,%edi
 3f2:	c1 c6 07             	rol    $0x7,%esi
 3f5:	41 01 fe             	add    %edi,%r14d
 3f8:	41 01 f3             	add    %esi,%r11d
 3fb:	45 31 f7             	xor    %r14d,%r15d
 3fe:	45 89 f1             	mov    %r14d,%r9d
 401:	44 31 d9             	xor    %r11d,%ecx
 404:	41 c1 c7 0c          	rol    $0xc,%r15d
 408:	c1 c1 10             	rol    $0x10,%ecx
 40b:	45 01 f8             	add    %r15d,%r8d
 40e:	44 31 c7             	xor    %r8d,%edi
 411:	c1 c7 08             	rol    $0x8,%edi
 414:	41 01 f9             	add    %edi,%r9d
 417:	44 31 ef             	xor    %r13d,%edi
 41a:	c1 c7 10             	rol    $0x10,%edi
 41d:	45 31 cf             	xor    %r9d,%r15d
 420:	41 01 c9             	add    %ecx,%r9d
 423:	41 01 fc             	add    %edi,%r12d
 426:	41 c1 c7 07          	rol    $0x7,%r15d
 42a:	44 31 e3             	xor    %r12d,%ebx
 42d:	c1 c3 0c             	rol    $0xc,%ebx
 430:	41 01 dd             	add    %ebx,%r13d
 433:	44 31 ef             	xor    %r13d,%edi
 436:	41 89 fe             	mov    %edi,%r14d
 439:	41 c1 c6 08          	rol    $0x8,%r14d
 43d:	45 01 f4             	add    %r14d,%r12d
 440:	44 31 e3             	xor    %r12d,%ebx
 443:	c1 c3 07             	rol    $0x7,%ebx
 446:	44 31 ce             	xor    %r9d,%esi
 449:	45 01 fa             	add    %r15d,%r10d
 44c:	41 01 e8             	add    %ebp,%r8d
 44f:	c1 c6 0c             	rol    $0xc,%esi
 452:	44 31 d2             	xor    %r10d,%edx
 455:	44 31 c0             	xor    %r8d,%eax
 458:	c1 c2 10             	rol    $0x10,%edx
 45b:	41 01 f3             	add    %esi,%r11d
 45e:	c1 c0 10             	rol    $0x10,%eax
 461:	44 31 d9             	xor    %r11d,%ecx
 464:	c1 c1 08             	rol    $0x8,%ecx
 467:	41 8d 3c 09          	lea    (%r9,%rcx,1),%edi
 46b:	44 8b 4c 24 04       	mov    0x4(%rsp),%r9d
 470:	31 fe                	xor    %edi,%esi
 472:	89 7c 24 0c          	mov    %edi,0xc(%rsp)
 476:	8b 7c 24 08          	mov    0x8(%rsp),%edi
 47a:	41 01 d1             	add    %edx,%r9d
 47d:	c1 c6 07             	rol    $0x7,%esi
 480:	01 c7                	add    %eax,%edi
 482:	45 31 cf             	xor    %r9d,%r15d
 485:	31 fd                	xor    %edi,%ebp
 487:	41 c1 c7 0c          	rol    $0xc,%r15d
 48b:	c1 c5 0c             	rol    $0xc,%ebp
 48e:	45 01 fa             	add    %r15d,%r10d
 491:	41 01 e8             	add    %ebp,%r8d
 494:	44 31 d2             	xor    %r10d,%edx
 497:	44 31 c0             	xor    %r8d,%eax
 49a:	c1 c2 08             	rol    $0x8,%edx
 49d:	c1 c0 08             	rol    $0x8,%eax
 4a0:	41 01 d1             	add    %edx,%r9d
 4a3:	01 c7                	add    %eax,%edi
 4a5:	45 31 cf             	xor    %r9d,%r15d
 4a8:	31 fd                	xor    %edi,%ebp
 4aa:	41 c1 c7 07          	rol    $0x7,%r15d
 4ae:	c1 c5 07             	rol    $0x7,%ebp
 4b1:	83 6c 24 10 01       	subl   $0x1,0x10(%rsp)
 4b6:	0f 85 b4 fe ff ff    	jne    370 <ChaCha20::Keystream(unsigned char*, unsigned long)+0x140>
  • The unrolling indeed causes different register allocation, as well as instructions from multiple iterations to be interspersed (maybe better for scheduling, maybe it's possible to combine?).
  • Benchmarks before on old AMD Phenom(tm) II X6 1075T:
|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.18 |      459,187,395.75 |    0.3% |      0.03 | `CHACHA20_1MB`
|                2.21 |      452,155,530.63 |    0.2% |      0.01 | `CHACHA20_256BYTES`
|                2.34 |      427,257,435.31 |    0.0% |      0.01 | `CHACHA20_64BYTES`
  • Benchmarks after on same (~12% speedup):
|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                1.91 |      523,324,820.67 |    0.4% |      0.02 | `CHACHA20_1MB`
|                1.94 |      516,638,576.63 |    0.0% |      0.01 | `CHACHA20_256BYTES`
|                2.22 |      451,258,216.13 |    4.6% |      0.01 | `CHACHA20_64BYTES`
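
As a quick check on the "~12%" figure, the sketch below computes the change in two common ways from the `ns/byte` column. The function names are hypothetical and only serve this illustration.

```cpp
#include <cassert>
#include <cmath>

// Fractional reduction in time per byte: (before - after) / before.
double TimeReduction(double before_ns, double after_ns)
{
    return (before_ns - after_ns) / before_ns;
}

// Fractional gain in throughput: before / after - 1.
double ThroughputGain(double before_ns, double after_ns)
{
    return before_ns / after_ns - 1.0;
}
```

For `CHACHA20_1MB` (2.18 ns/byte before, 1.91 ns/byte after), `TimeReduction` gives roughly 0.124 and `ThroughputGain` roughly 0.141, so the quoted ~12% matches the reduction in time per byte, while the throughput gain is about 14%.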

@jonatack
Member

jonatack commented Apr 22, 2022

Restarted and tuned (i7 6500U CPU @ 2.5 GHz) with pyperf system tune, non-debug build, seeing roughly a 3 to 4% improvement.

Linux 5.16.0-6-amd64 #1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29) x86_64 GNU/Linux.

Debian clang version 15.0.0-++20220422111431+ba46ae7bd853-1~exp1~20220422111525.449
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

master

|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                2.43 |      410,814,309.33 |    0.3% |           18.61 |            6.29 |  2.957 |           0.20 |    0.0% |      0.03 | `CHACHA20_1MB`
|                2.46 |      406,907,108.96 |    0.0% |           18.89 |            6.37 |  2.965 |           0.22 |    0.0% |      0.01 | `CHACHA20_256BYTES`
|                2.59 |      385,499,110.76 |    1.0% |           19.72 |            6.68 |  2.952 |           0.28 |    0.0% |      0.01 | `CHACHA20_64BYTES`

branch

|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                2.35 |      425,969,024.53 |    0.7% |           16.70 |            6.07 |  2.752 |           0.05 |    0.0% |      0.03 | `CHACHA20_1MB`
|                2.37 |      422,279,272.14 |    0.0% |           17.14 |            6.14 |  2.792 |           0.07 |    0.0% |      0.01 | `CHACHA20_256BYTES`
|                2.52 |      396,803,365.77 |    0.1% |           18.45 |            6.53 |  2.825 |           0.13 |    0.0% |      0.01 | `CHACHA20_64BYTES`

Edit: re-ran the bench a dozen times each to verify that these results are representative.

@ajtowns
Contributor

ajtowns commented Apr 23, 2022

I'm seeing much smaller improvements (0%-2.5% with gcc 11; 1.3%-7% with clang 13) on an old i7. (And very slightly worse performance compared to master with debug enabled)

Did you consider just changing the for() { ... } loop to REPEAT10( ... ) with #define REPEAT10(a) a a a a a a a a a a ?
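
A minimal sketch of the macro approach suggested above, using a small hypothetical mixing step (`MixLooped`, `MixRepeated`) in place of the real ChaCha20 round body. Only `REPEAT10` comes from the comment; everything else is an assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>

// The macro from the suggestion: paste the argument ten times.
#define REPEAT10(a) a a a a a a a a a a

// Rotate a 32-bit word left by n bits (0 < n < 32).
inline uint32_t Rotl32(uint32_t v, int n) { return (v << n) | (v >> (32 - n)); }

// Reference version: an ordinary counted loop.
uint32_t MixLooped(uint32_t a, uint32_t b)
{
    for (int i = 0; i < 10; ++i) {
        a += b;
        b = Rotl32(b ^ a, 7);
    }
    return a ^ b;
}

// Macro version: the preprocessor emits the body ten times, so the
// compiler sees straight-line code with no counter and no branch.
uint32_t MixRepeated(uint32_t a, uint32_t b)
{
    REPEAT10(
        a += b;
        b = Rotl32(b ^ a, 7);
    )
    return a ^ b;
}
```

The macro argument may span several lines. One caveat: a comma that is not enclosed in parentheses would split the argument in two, so the body should keep its commas inside calls such as `Rotl32(b ^ a, 7)`.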

@laanwj
Member

laanwj commented Apr 23, 2022

  • gcc 11.2.0, RISC-V 64-bit (SiFive Unmatched, 1.2 GHz): speedup is there, but much less pronounced (~5%):
| run    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |       bra/byte |   miss% |     total | benchmark
|:-------|--------------------:|--------------------:|--------:|----------------:|----------------:|---------------:|--------:|----------:|:----------
| Before |               22.29 |       44,862,631.89 |    0.8% |            0.00 |            0.00 |           0.00 |    0.0% |      0.26 | `CHACHA20_1MB`
| After  |               21.23 |       47,101,646.21 |    0.9% |            0.00 |            0.00 |           0.00 |    0.0% |      0.25 | `CHACHA20_1MB`
  • gcc 10.2.1, aarch64 (custom i.MX8MQ board, 1 GHz), ~8% speedup:
| run    |             ns/byte |              byte/s |    err% |        ins/byte |       bra/byte |   miss% |     total | benchmark
|:-------|--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
| Before |                6.04 |      165,526,246.91 |    0.1% |           16.84 |           0.16 |   11.8% |      0.07 | `CHACHA20_1MB`
| After  |                5.58 |      179,185,196.22 |    0.1% |           15.86 |           0.02 |    0.0% |      0.06 | `CHACHA20_1MB`

It's a nice speedup, and a simple change, tested ACK 4f3a189

> Did you consider just changing the for() { ... } loop to REPEAT10( ... ) with #define REPEAT10(a) a a a a a a a a a a ?

I like this idea: it is more elegant than copy/pasting, and it makes it immediately clear that every repetition is the same. I would guess the generated code is exactly the same.

@maflcko
Member

maflcko commented Apr 23, 2022

Not seeing a large difference on an i7. (Maybe a 1%-3% speedup?)

gcc-12 Before:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.23 |      447,617,214.06 |    0.2% |      0.03 | `CHACHA20_1MB`
|                2.26 |      441,653,947.12 |    0.1% |      0.01 | `CHACHA20_256BYTES`
|                2.50 |      399,993,391.82 |    6.1% |      0.01 | :wavy_dash: `CHACHA20_64BYTES` (Unstable with ~6,241.4 iters. Increase `minEpochIterations` to e.g. 62414)
|                7.03 |      142,173,319.29 |   10.1% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
|                3.26 |      307,218,931.17 |    1.7% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
|                8.83 |      113,259,198.67 |    1.3% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
|                4.28 |      233,685,573.34 |    0.4% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
|               15.78 |       63,391,055.77 |    0.6% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
|                7.71 |      129,684,901.52 |    0.4% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`

gcc-12 After:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.20 |      454,707,913.08 |    0.8% |      0.03 | `CHACHA20_1MB`
|                2.36 |      424,359,263.25 |    4.9% |      0.01 | `CHACHA20_256BYTES`
|                2.41 |      414,622,602.59 |    0.4% |      0.01 | `CHACHA20_64BYTES`
|                6.99 |      143,089,808.99 |    7.2% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
|                3.26 |      306,926,493.73 |    4.2% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
|                9.59 |      104,251,645.58 |    8.6% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~402.1 iters. Increase `minEpochIterations` to e.g. 4021)
|                4.33 |      230,986,007.33 |    0.6% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
|               16.23 |       61,602,235.65 |    1.7% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
|                9.63 |      103,830,365.13 |    9.9% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,639.9 iters. Increase `minEpochIterations` to e.g. 16399)

gcc-10 Before:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.26 |      442,527,877.02 |    0.5% |      0.03 | `CHACHA20_1MB`
|                2.30 |      435,535,172.72 |    1.9% |      0.01 | `CHACHA20_256BYTES`
|                2.39 |      418,262,709.74 |    0.4% |      0.01 | `CHACHA20_64BYTES`
|                6.93 |      144,210,951.65 |    5.9% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
|                3.16 |      316,109,217.24 |    4.8% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
|                8.43 |      118,625,079.49 |    0.3% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
|                4.18 |      239,143,934.28 |    0.2% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
|               16.05 |       62,308,156.96 |    5.2% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT` (Unstable with ~961.0 iters. Increase `minEpochIterations` to e.g. 9610)
|                7.63 |      131,070,821.81 |    0.1% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`

gcc-10 After:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.20 |      454,351,689.08 |    0.2% |      0.03 | `CHACHA20_1MB`
|                2.40 |      416,825,911.73 |    4.4% |      0.01 | `CHACHA20_256BYTES`
|                2.40 |      416,369,054.39 |    0.2% |      0.01 | `CHACHA20_64BYTES`
|                6.58 |      151,882,394.04 |   10.5% |      0.08 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
|                3.03 |      329,600,644.76 |    0.9% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
|                9.40 |      106,431,172.41 |   10.1% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~434.9 iters. Increase `minEpochIterations` to e.g. 4349)
|                4.30 |      232,776,146.25 |    0.2% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
|               16.17 |       61,831,918.45 |    1.3% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
|                8.83 |      113,301,205.50 |    5.8% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,771.5 iters. Increase `minEpochIterations` to e.g. 17715)

@martinus
Contributor

martinus commented Apr 23, 2022

I get the same 1-3% speedup on my i7. In my test, adding `#pragma GCC unroll 10` in front of the loop seems to produce exactly the same unrolled loop as the hand-coded version; this works for both GCC and clang.
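
A minimal sketch of the pragma variant described above, again with a hypothetical mixing step (`MixPlain`, `MixPragma`) standing in for the real round body. The pragma is only a hint: a compiler that does not recognise it will typically ignore it, possibly with a warning, and the loop stays correct either way.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit word left by n bits (0 < n < 32).
inline uint32_t Rotl32(uint32_t v, int n) { return (v << n) | (v >> (32 - n)); }

// Plain loop, left to the compiler's own unrolling heuristics.
uint32_t MixPlain(uint32_t a, uint32_t b)
{
    for (int i = 0; i < 10; ++i) {
        a += b;
        b = Rotl32(b ^ a, 7);
    }
    return a ^ b;
}

// Same loop with an unroll hint placed directly in front of it.
uint32_t MixPragma(uint32_t a, uint32_t b)
{
#pragma GCC unroll 10
    for (int i = 0; i < 10; ++i) {
        a += b;
        b = Rotl32(b ^ a, 7);
    }
    return a ^ b;
}
```

Whether the hint actually yields the same machine code as a hand-unrolled loop has to be checked per compiler, for example by comparing the disassembly of both builds.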

Side note 1: use e.g. ./src/bench/bench_bitcoin -filter="CHACHA20.*" -min_time=2000 to run each test for 2 seconds to get more stable results

Side note 2: No need to quote the result, it's markdown 🙂

My results on i7-8700, with clang 13.0.1:

master

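|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                1.91 |      523,770,793.06 |    0.1% |           18.52 |            6.08 |  3.043 |           0.20 |    0.0% |      1.09 | `CHACHA20_1MB`
|                1.94 |      515,227,758.97 |    0.3% |           18.79 |            6.16 |  3.048 |           0.22 |    0.0% |      1.10 | `CHACHA20_256BYTES`
|                2.02 |      494,527,885.82 |    0.2% |           19.61 |            6.44 |  3.046 |           0.28 |    0.0% |      1.10 | `CHACHA20_64BYTES`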

branch

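|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                1.83 |      547,223,233.51 |    0.0% |           17.08 |            5.83 |  2.931 |           0.05 |    0.0% |      1.07 | `CHACHA20_1MB`
|                1.87 |      535,851,391.81 |    0.1% |           17.51 |            5.95 |  2.942 |           0.07 |    0.0% |      1.10 | `CHACHA20_256BYTES`
|                1.98 |      504,774,917.46 |    0.0% |           18.81 |            6.32 |  2.977 |           0.13 |    0.0% |      1.10 | `CHACHA20_64BYTES`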

@Empact
Contributor

Empact commented Apr 23, 2022

+1 for #pragma unroll or similar

@laanwj
Member

laanwj commented Apr 27, 2022

So I think the conclusion here is that on i7 there's no (or not much) difference but on other platforms it varies. But it never becomes worse. I think a performance optimization like this is mostly interesting for slower CPUs with less effective branch prediction so that's OK with me.

@laanwj
Member

laanwj commented May 4, 2022

@sipa What are your thoughts on using #pragma unroll or a macro? Or do you prefer keeping it this way?

@sipa
Member Author

sipa commented May 4, 2022

@laanwj That won't work on every compiler.

I'd be ok with switching to a macro to do the 10x expansion.

@maflcko
Member

maflcko commented May 4, 2022

TIL that it is possible to pass multiple lines as an argument to a macro

@sipa
Member Author

sipa commented May 4, 2022

> TIL that it is possible to pass multiple lines as an argument to a macro

You clearly never saw the original serialization code this codebase had ;)

@sipa sipa force-pushed the 202204_unrollchacha branch from 4f3a189 to 266bf15 Compare May 4, 2022 18:44
@sipa
Member Author

sipa commented May 4, 2022

> I'd be ok with switching to a macro to do the 10x expansion.

Done, used @ajtowns's approach suggested above.

@sipa sipa force-pushed the 202204_unrollchacha branch from 266bf15 to 81c09ee Compare May 4, 2022 18:54
@martinus
Contributor

martinus commented May 5, 2022

tested ACK 81c09ee with clang++ 13.0.1, test CHACHA20_1MB:

  • 4.3% faster on i9-9960X
  • 4.5% faster on i9-9980HK
  • 4.4% faster on i7-8700

@DrahtBot
Contributor

DrahtBot commented May 8, 2022

Guix builds

| File | commit 4604508 (master) | commit 4fd4a5f (master and this pull)
|:-----|:------------------------|:-------------------------------------
| SHA256SUMS.part | 2cc4936b229484ae... | 72b43e559d44ffb7...
| *-aarch64-linux-gnu-debug.tar.gz | 9bddf61e8a0520d9... | deeb9f228bac481f...
| *-aarch64-linux-gnu.tar.gz | e9d04df826660c90... | 522251f6bae6c76e...
| *-arm-linux-gnueabihf-debug.tar.gz | 4beeedbf37980025... | 1670184e36cf8954...
| *-arm-linux-gnueabihf.tar.gz | 3d050332fd87e4a2... | d96deb05bc2a4c32...
| *-arm64-apple-darwin-unsigned.dmg | 4da5ad0e9772e4e6... | 76cc400c040540b0...
| *-arm64-apple-darwin-unsigned.tar.gz | dcfdc8f43192660a... | b2635ec3d4fae50f...
| *-arm64-apple-darwin.tar.gz | 4a5d70a227973d83... | d1fe669b0e4947c4...
| *-powerpc64-linux-gnu-debug.tar.gz | 53d2e35633a6a31c... | 49c537f1d856cbc4...
| *-powerpc64-linux-gnu.tar.gz | c375a065b6a1548d... | a86b04e408ec9570...
| *-powerpc64le-linux-gnu-debug.tar.gz | 4969102569ae2d44... | 8e8ba290e0752fbd...
| *-powerpc64le-linux-gnu.tar.gz | e4127eb1ced1baa7... | d79b1b65762f872b...
| *-riscv64-linux-gnu-debug.tar.gz | 2e0700a552bee2aa... | a1ecf6009d2ebb4e...
| *-riscv64-linux-gnu.tar.gz | 2e1a90e7096072c7... | df8f4d16400de367...
| *-win64-debug.zip | a42b196a2551ff55... | 4cecf16acf68c1af...
| *-win64-setup-unsigned.exe | ecea0e8c84dfa767... | cf8ed2f4015e5a1d...
| *-win64-unsigned.tar.gz | 76ac00a1fded7fab... | f00c3589f6bc5cff...
| *-win64.zip | c78256b254a7b0da... | 6818154f30241f4c...
| *-x86_64-apple-darwin-unsigned.dmg | 086d56f16fe9ec2d... | f6dbb479fce016e5...
| *-x86_64-apple-darwin-unsigned.tar.gz | a32dbe948312f693... | ce21801a74eb1971...
| *-x86_64-apple-darwin.tar.gz | 31010eb16dfe31bc... | c4ed0c95ff3d6e2d...
| *-x86_64-linux-gnu-debug.tar.gz | f9d542596820f03f... | 770b969a8724b3ea...
| *-x86_64-linux-gnu.tar.gz | 34f5361f2f5dc8ac... | cc157942e3794baf...
| *.tar.gz | 1eaba963a7fb2753... | be1c1b02a7e1bbd8...
| guix_build.log | 3690d59eb499477b... | e1f3ab033c1676fe...
| guix_build.log.diff | | e933d67b5ec7a7fc...

@maflcko
Member

maflcko commented May 9, 2022

  • A few percent faster on AMD EPYC as well with gcc-9/gcc-11.2/gcc-12.1/clang-14
  • Same on AMD EPYC with guix built bench
  • Same on Cortex-A72 with guix built bench

@maflcko
Member

maflcko commented May 9, 2022

ACK 81c09ee 🍟

Signature:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟
-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUgxzQv9FMC3MiK58jmwXRv26Mf41HrwpXJawhRSU/j+VM0Vq9JI6RlIkZ3E5Biy
EKOxtL9cMKv6cMOyE5bihZF3uIqnwJCMAx+8cb+/6RYm33UseEMHxX/S8T+Q8/vy
4r5BU/kisbX77yAjooN7Lr0/nKSv2E8APFjvcp7NIkWkx89W2zrk9z4eoFS5Dri/
yAbMpc95eTtu4gmsbjNNE73/Q1MsdfXiBgzwP8ToV/grzoZPpBTt7dsb1QRRjn1N
NAY/xG1p1kFo7ORbJ0ZHiKE4waat0Erqi8MX35f5mkMVa47X5VdDuP1FGn191f9K
oS6cfgSZr4d+SE3SFer56/3QOVToa06VmxjmKoRv0j12S7NVOxnjRNjwN6XkhgoK
wlpkNa3HxNxdMNmaUDqxXk5Z1zH5RCjZwiPQuMG5sExjemAAJXOFQ8WYnJFGp04R
dFlXeMTy2ZQWMWoEMhdJ2jCDjvggjMW8t51VA3+GQvr8ZZmN10dzXPA+Qi1c25es
QNkpUvPg
=2W4Z
-----END PGP SIGNATURE-----

@maflcko maflcko merged commit dab18f0 into bitcoin:master May 9, 2022
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request May 9, 2022
81c09ee Unroll the ChaCha20 inner loop for performance (Pieter Wuille)

Pull request description:

  Unrolling the inner ChaCha20 loop gives a ~15% speedup for me in the CHACHA20_* benchmarks. It's a simple change, this performance helps with RNG generation, and will matter more for BIP324.

ACKs for top commit:
  martinus:
    tested ACK  81c09ee with clang++ 13.0.1, test `CHACHA20_1MB`:
  MarcoFalke:
    ACK 81c09ee 🍟

Tree-SHA512: 108bd0ba573bb08de92d611e7be7c09a2c2700f9655f44129b87f9b71f7e101dfc6bd345783e7b4b9b40f0b003913cf59187f422da8cdb5b20887f7855b2611a
@bitcoin bitcoin locked and limited conversation to collaborators May 9, 2023
10 participants