wire: only borrow/return binaryFreeList buffers at the message level #2073

Roasbeef · 2023-12-16T00:42:16Z

NOTE: This is a rebased version of #1426

The original PR body follows:

This PR optimizes the wire package's serialization of small buffers by minimizing the number of borrow/return round trips during message serialization. Currently, the wire package uses a binaryFreeList from which 8-byte buffers are borrowed and returned for the purpose of serializing small integers and varints.

Problem

To understand the problem, consider calling WriteVarInt on a number greater than 0xfc (which requires writing the discriminant followed by a 2, 4, or 8 byte value).
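For reference, here is a minimal sketch of the varint layout being discussed. It is not the wire package's implementation; `encodeVarInt` is a hypothetical helper that only illustrates the discriminant-plus-value encoding described above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeVarInt is a hypothetical helper (not part of the wire package). It
// returns a single byte for values below 0xfd, otherwise a one-byte
// discriminant (0xfd, 0xfe, or 0xff) followed by a 2, 4, or 8 byte
// little-endian value.
func encodeVarInt(val uint64) []byte {
	var buf bytes.Buffer
	var scratch [8]byte

	switch {
	case val < 0xfd:
		buf.WriteByte(byte(val))
	case val <= 0xffff:
		buf.WriteByte(0xfd)
		binary.LittleEndian.PutUint16(scratch[:2], uint16(val))
		buf.Write(scratch[:2])
	case val <= 0xffffffff:
		buf.WriteByte(0xfe)
		binary.LittleEndian.PutUint32(scratch[:4], uint32(val))
		buf.Write(scratch[:4])
	default:
		buf.WriteByte(0xff)
		binary.LittleEndian.PutUint64(scratch[:], val)
		buf.Write(scratch[:])
	}
	return buf.Bytes()
}

func main() {
	// 20,000 is 0x4e20, so it is written as the 0xfd discriminant followed
	// by a little-endian uint16.
	fmt.Printf("%x\n", encodeVarInt(20000)) // prints "fd204e"
}
```

Every value above 0xfc therefore results in two separate writes, which is where the extra borrow/return pairs come from.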

For instance, writing 20,000 will invoke PutUint8 and then PutUint16. Expanding this out to examine the message passing, we see:

buffer <- freelist

PutUint8(buffer, discriminant)
w.Write(buffer)

freelist <- buffer

buffer <- freelist

PutUint16(buffer, value)
w.Write(buffer)

freelist <- buffer

Each <- requires a channel select, which more or less carries the performance cost of a mutex. This cost, in addition to the need to wake up other goroutines and switch execution, imparts a significant performance penalty. In the context of block serialization, several hundred thousand of these operations may be performed.
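To make the cost concrete, here is a rough sketch of a channel-backed free list and the per-write borrow/return pattern. The `freeList` type, the `chanOps` counter, and `writeVarInt3Naive` are hypothetical stand-ins for illustration only, not the actual binaryFreeList code.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// freeList is a hypothetical stand-in for binaryFreeList: a buffered channel
// holding reusable 8-byte scratch buffers.
type freeList chan []byte

// chanOps counts channel selects. It is not safe for concurrent use and
// exists only to make the cost visible in this sketch.
var chanOps int

// Borrow takes a buffer from the list, allocating if the list is empty.
func (l freeList) Borrow() []byte {
	chanOps++
	select {
	case buf := <-l:
		return buf[:8]
	default:
		return make([]byte, 8)
	}
}

// Return puts a buffer back, dropping it if the list is full.
func (l freeList) Return(buf []byte) {
	chanOps++
	select {
	case l <- buf:
	default:
	}
}

// PutUint8 borrows a buffer, writes one byte, and returns the buffer.
func (l freeList) PutUint8(w io.Writer, v uint8) error {
	buf := l.Borrow()[:1]
	defer l.Return(buf)
	buf[0] = v
	_, err := w.Write(buf)
	return err
}

// PutUint16 borrows a buffer, writes two bytes, and returns the buffer.
func (l freeList) PutUint16(w io.Writer, v uint16) error {
	buf := l.Borrow()[:2]
	defer l.Return(buf)
	binary.LittleEndian.PutUint16(buf, v)
	_, err := w.Write(buf)
	return err
}

// writeVarInt3Naive writes a 3-byte varint using one borrow/return pair per
// field, mirroring the message passing shown above.
func writeVarInt3Naive(l freeList, w io.Writer, val uint16) error {
	if err := l.PutUint8(w, 0xfd); err != nil {
		return err
	}
	return l.PutUint16(w, val)
}

// demo encodes 20,000 and reports the bytes written and the selects used.
func demo() (encoded []byte, ops int) {
	chanOps = 0
	l := make(freeList, 4)
	var w bytes.Buffer
	if err := writeVarInt3Naive(l, &w, 20000); err != nil {
		panic(err)
	}
	return w.Bytes(), chanOps
}

func main() {
	encoded, ops := demo()
	fmt.Printf("%x written with %d channel operations\n", encoded, ops)
}
```

In this sketch a single 3-byte varint costs four channel selects: two borrows and two returns.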

Solution

In our example above, we can improve this by using only two <- operations, one to borrow and one to return, like so:

buffer <- freelist

PutUint8(buffer, discriminant)
w.Write(buffer)

PutUint16(buffer, value)
w.Write(buffer)

freelist <- buffer

As expected, cutting the number of channel sends in half also cuts the latency roughly in half, which can be seen in the benchmarks below for the larger VarInts.
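A minimal sketch of the single-borrow variant follows. As before, `freeList`, `chanOps`, and `writeVarInt3` are hypothetical names used only for illustration; the real change lives in the wire package's Read/WriteVarInt.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// freeList is a hypothetical stand-in for binaryFreeList.
type freeList chan []byte

// chanOps counts channel selects (illustration only, not concurrency-safe).
var chanOps int

// Borrow takes a buffer from the list, allocating if the list is empty.
func (l freeList) Borrow() []byte {
	chanOps++
	select {
	case buf := <-l:
		return buf[:8]
	default:
		return make([]byte, 8)
	}
}

// Return puts a buffer back, dropping it if the list is full.
func (l freeList) Return(buf []byte) {
	chanOps++
	select {
	case l <- buf:
	default:
	}
}

// writeVarInt3 writes a 3-byte varint with one borrow and one return,
// reusing the same scratch buffer for the discriminant and the value.
func writeVarInt3(l freeList, w io.Writer, val uint16) error {
	buf := l.Borrow()
	defer l.Return(buf)

	buf[0] = 0xfd
	if _, err := w.Write(buf[:1]); err != nil {
		return err
	}

	binary.LittleEndian.PutUint16(buf[:2], val)
	_, err := w.Write(buf[:2])
	return err
}

// demo encodes 20,000 and reports the bytes written and the selects used.
func demo() (encoded []byte, ops int) {
	chanOps = 0
	l := make(freeList, 4)
	var w bytes.Buffer
	if err := writeVarInt3(l, &w, 20000); err != nil {
		panic(err)
	}
	return w.Bytes(), chanOps
}

func main() {
	encoded, ops := demo()
	fmt.Printf("%x written with %d channel operations\n", encoded, ops)
}
```

The bytes on the wire are unchanged; only the number of channel selects drops from four to two.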

The remainder of this PR is to propagate this pattern all the way up to the top level of messages in the wire package, such that deserializing a message only incurs one borrow and one return. Any subroutines are made to conditionally borrow from the binarySerializer if the invoker has not provided them with a buffer, and conditionally return if they indeed were required to borrow.
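The conditional borrow described here could look roughly like the sketch below. The names `writeUint32` and `writeMessage`, and the convention of passing a nil buffer to request a borrow, are assumptions made for illustration and may not match the signatures used in the PR.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
)

// freeList is a hypothetical stand-in for binaryFreeList.
type freeList chan []byte

// borrows counts Borrow calls (illustration only, not concurrency-safe).
var borrows int

// Borrow takes a buffer from the list, allocating if the list is empty.
func (l freeList) Borrow() []byte {
	borrows++
	select {
	case buf := <-l:
		return buf[:8]
	default:
		return make([]byte, 8)
	}
}

// Return puts a buffer back, dropping it if the list is full.
func (l freeList) Return(buf []byte) {
	select {
	case l <- buf:
	default:
	}
}

// writeUint32 uses buf as scratch space when the caller supplies one. If buf
// is nil (an assumed convention), it borrows a buffer and returns it on exit.
func writeUint32(l freeList, w io.Writer, buf []byte, v uint32) error {
	if buf == nil {
		buf = l.Borrow()
		defer l.Return(buf)
	}
	binary.LittleEndian.PutUint32(buf[:4], v)
	_, err := w.Write(buf[:4])
	return err
}

// writeMessage is a hypothetical top-level encoder: it borrows once and
// shares the buffer with every subroutine it calls.
func writeMessage(l freeList, w io.Writer, version, timestamp uint32) error {
	buf := l.Borrow()
	defer l.Return(buf)

	if err := writeUint32(l, w, buf, version); err != nil {
		return err
	}
	return writeUint32(l, w, buf, timestamp)
}

// demo reports how many borrows a top-level call and a standalone call need.
func demo() (topLevel, standalone int) {
	l := make(freeList, 1)

	borrows = 0
	if err := writeMessage(l, io.Discard, 1, 2); err != nil {
		panic(err)
	}
	topLevel = borrows

	borrows = 0
	if err := writeUint32(l, io.Discard, nil, 3); err != nil {
		panic(err)
	}
	standalone = borrows

	return topLevel, standalone
}

func main() {
	topLevel, standalone := demo()
	fmt.Println(topLevel, standalone) // prints "1 1"
}
```

Either way the subroutine stays usable on its own, while a message-level caller pays for only one borrow regardless of how many fields it writes.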

A good example of how these channel sends/receives can add up is in MsgTx serialization, which is now upwards of 80% faster as a result of these optimizations:

benchmark                         old ns/op     new ns/op     delta
BenchmarkSerializeTx-8            683           142           -79.21%
BenchmarkSerializeTxSmall-8       724           143           -80.25%
BenchmarkSerializeTxLarge-8       1476002       182111        -87.66%

Preliminary Benchmarks

benchmark                         old ns/op     new ns/op     delta
BenchmarkWriteVarInt1-8           71.9          74.6          +3.76%
BenchmarkWriteVarInt3-8           148           70.9          -52.09%
BenchmarkWriteVarInt5-8           149           71.0          -52.35%
BenchmarkWriteVarInt9-8           147           72.8          -50.48%
BenchmarkReadVarInt1-8            78.8          77.9          -1.14%
BenchmarkReadVarInt3-8            159           87.7          -44.84%
BenchmarkReadVarInt5-8            155           88.3          -43.03%
BenchmarkReadVarInt9-8            158           86.9          -45.00%
BenchmarkReadVarStr4-8            120           119           -0.83%
BenchmarkReadVarStr10-8           138           130           -5.80%
BenchmarkWriteVarStr4-8           101           105           +3.96%
BenchmarkWriteVarStr10-8          103           105           +1.94%
BenchmarkReadOutPoint-8           91.3          28.7          -68.57%
BenchmarkWriteOutPoint-8          78.9          10.2          -87.07%
BenchmarkReadTxOut-8              245           118           -51.84%
BenchmarkWriteTxOut-8             151           93.5          -38.08%
BenchmarkReadTxIn-8               338           139           -58.88%
BenchmarkWriteTxIn-8              238           31.0          -86.97%
BenchmarkDeserializeTxSmall-8     1119          586           -47.63%
BenchmarkDeserializeTxLarge-8     2476063       1275815       -48.47%
BenchmarkSerializeTx-8            683           142           -79.21%
BenchmarkSerializeTxSmall-8       724           143           -80.25%
BenchmarkSerializeTxLarge-8       1476002       182111        -87.66%
BenchmarkReadBlockHeader-8        406           69.1          -82.98%
BenchmarkWriteBlockHeader-8       431           21.1          -95.10%
BenchmarkDecodeGetHeaders-8       13537         12238         -9.60%
BenchmarkDecodeHeaders-8          1025275       236709        -76.91%
BenchmarkDecodeGetBlocks-8        13206         11684         -11.53%
BenchmarkDecodeAddr-8             337977        157519        -53.39%
BenchmarkDecodeInv-8              5990935       1898169       -68.32%
BenchmarkDecodeNotFound-8         6285831       1864701       -70.33%
BenchmarkDecodeMerkleBlock-8      4357          2606          -40.19%
BenchmarkTxHash-8                 1928          1222          -36.62%
BenchmarkDoubleHashB-8            993           969           -2.42%
BenchmarkDoubleHashH-8            932           1029          +10.41%

benchmark                         old allocs     new allocs     delta
BenchmarkWriteBlockHeader-8       4              0              -100.00%
// all others remain 0 alloc

benchmark                         old bytes     new bytes     delta
BenchmarkWriteBlockHeader-8       16            0             -100.00%
// all others remain unchanged

Notes

I'm still in the process of going through and adding benchmarks to top-level messages in order to gauge the overall performance benefit; expect more to be added at a later point.

There are a few remaining messages which have not yet been optimized, e.g. MsgAlert, MsgVersion, etc. I plan to add those as well but decided to start with the ones that were more performance critical.

coveralls · 2023-12-16T00:44:07Z

Pull Request Test Coverage Report for Build 7353202677

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.7%) to 56.754%

Totals
Change from base Build 7227596643:	0.7%
Covered Lines:	28907
Relevant Lines:	50934

💛 - Coveralls

Roasbeef · 2023-12-18T18:38:39Z

cc @ellemouton

Roasbeef · 2023-12-19T00:47:20Z

With this PR, combined with the rest on the v0.24 milestone, I was able to achieve a 12 hour sync over LAN!

guggero

Very nice changes and nicely split up into multiple commits.
I did as thorough a review as concentration allowed and didn't find anything wrong with the code.

The main thing I would change is to use a consistent pattern of:

buf := binarySerializer.Borrow()
defer binarySerializer.Return(buf)

everywhere to avoid a single mistake somewhere causing the buffers to all be occupied and becoming useless (which probably wouldn't show up in benchmark tests).
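To illustrate the concern, here is a small sketch of how an early return can drain the free list when the buffer is returned manually, and how defer avoids it. The `freeList` type, `failingWriter`, and both write functions are hypothetical and exist only for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// freeList is a hypothetical stand-in for binaryFreeList.
type freeList chan []byte

// Borrow takes a buffer from the list, allocating if the list is empty.
func (l freeList) Borrow() []byte {
	select {
	case buf := <-l:
		return buf[:8]
	default:
		return make([]byte, 8)
	}
}

// Return puts a buffer back, dropping it if the list is full.
func (l freeList) Return(buf []byte) {
	select {
	case l <- buf:
	default:
	}
}

// newFreeList returns a list preloaded with n buffers.
func newFreeList(n int) freeList {
	l := make(freeList, n)
	for i := 0; i < n; i++ {
		l <- make([]byte, 8)
	}
	return l
}

// failingWriter always fails, to exercise the error path.
type failingWriter struct{}

func (failingWriter) Write(p []byte) (int, error) {
	return 0, errors.New("write failed")
}

// writeLeaky returns the buffer manually and forgets to do so on error.
func writeLeaky(l freeList, w io.Writer, v uint32) error {
	buf := l.Borrow()
	binary.LittleEndian.PutUint32(buf[:4], v)
	if _, err := w.Write(buf[:4]); err != nil {
		return err // the borrowed buffer is never returned
	}
	l.Return(buf)
	return nil
}

// writeDeferred returns the buffer on every path via defer.
func writeDeferred(l freeList, w io.Writer, v uint32) error {
	buf := l.Borrow()
	defer l.Return(buf)
	binary.LittleEndian.PutUint32(buf[:4], v)
	_, err := w.Write(buf[:4])
	return err
}

// demo reports how many buffers remain pooled after two failed writes.
func demo() (leakyLeft, deferredLeft int) {
	leaky := newFreeList(2)
	_ = writeLeaky(leaky, failingWriter{}, 1)
	_ = writeLeaky(leaky, failingWriter{}, 2)

	deferred := newFreeList(2)
	_ = writeDeferred(deferred, failingWriter{}, 1)
	_ = writeDeferred(deferred, failingWriter{}, 2)

	return len(leaky), len(deferred)
}

func main() {
	leakyLeft, deferredLeft := demo()
	fmt.Println(leakyLeft, deferredLeft) // prints "0 2"
}
```

In this sketch an empty list makes every later Borrow fall back to allocating, so the pool stops providing any benefit, and a benchmark that never hits the error path would not show it.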

wire/bench_test.go

wire/common.go

wire/bench_test.go

wire/msgtx.go

wire/msggetblocks.go

wire/msgblock.go

cfromknecht added 30 commits December 15, 2023 16:35

wire/bench_test: report allocs in benchmarks

b434080

wire/bench: add witness block

a9edc32

wire/common: optimize Read/WriteVarInt

a371aeb

wire: introduce Read/WriteVarIntBuf to reuse buffers between invocations

6275db9

wire/msgtx: use Read/WriteVarIntBuf in tx serialization

e58aadc

wire/msgtx: reuse tx-level buffer for version and locktime

e12d32d

wire/common: add optimized Read/WriteVarBytesBuf

7951aa5

wire/msgtx: introduce optimized read/writeOutPointBuf

b171012

wire/msgtx: introduce optimized writeTxInBuf

d43d9d5

wire/msgtx: use writeTxInBuf in txn encoding

4829ff7

wire/msgtx: introduce optimized readScriptBuf

99f6488

wire/msgtx: introduce optimized readTxInBuf

6f4a7a1

wire/msgtx: use readTxInBuf in txn serialization

607eea1

wire/msgtx: introduce optimized WriteTxOutBuf

7c8844f

wire/msgtx: use WriteTxOutBuf in txn serialization

48d31e5

wire/msgtx: introduce optimized readTxOutBuf

aebc743

wire/msgtx: use readTxOutBuf in txn serialization

24d4217

wire/msgtx: introduce optimized writeTxWitnessBuf

3bfd0c6

wire/msgtx: use writeTxWitnessBuf in txn serialization

3a91303

wire/msgtx: use readScriptBuf in txn serialization

0cf8c19

wire/bench_test: introduce optimized readBlockHeaderBuf

aa769e3

wire/blockheader: introduce optimized writeBlockHeaderBuf

3cee06e

wire/invvect: add optimized readInvVectBuf and writeInvVectBuf

674c220

wire/msggetblocks: optimize by reusing small buffer

4ebc651

wire/msgblock: use only one small buffer per block encode/decode

ee1f807

wire/msgblock: optimize DeserializeTxLoc by reusing small buffers

d8e0817

wire/msggetheaders: optimize serialization by reusing small buffers

c0d35e6

wire/msgheaders: optimize serialization by reusing small buffers

83675cb

wire/msggetcfheaders: use single small buffer for encode/decode

d042fe0

wire/msgcfheaders: optimize encode/decode by using one small buffer

1c525db

cfromknecht added 21 commits December 15, 2023 16:37

wire/msggetcfcheckpt: optimize by removing read/writeElement

1990555

wire/msgcfcheckpt: optimize serialization by reusing small buffers

f37f475

wire/msginv: optimize by reusing small buffers

2383a04

wire/msggetdata: optimize serialization by reusing small buffers

d6594da

wiree/msggetcfilters: optimize serialization by reusing small buffers

ddeba60

wire/msgcfilter: optimize serialization by reusing small buffers

834febb

wire/msgnotfound: optimize serialization by reusing small buffers

efcf964

wire/invvect: remove unused readInvVect and writeInvVect

1cd5e02

wire/common: add optimized writeVarStrBuf an readVarStrBuf

57daac3

wire/msgreject: optimize serialization by reusing small buffers

dc4fbb0

wire/netaddress: add optimiezed read/writeNetAddressBuf

8bf07cc

wire/msgmerkleblock: optimize serialization by reusing small buffers

7207967

wire/msgping: remove usage for read/writeElement

3698f2d

wire/msgpong: remove usage of read/writeElement

80ae5d3

wire/msgtx: remove unused writeTxWitness

da89ed6

wire/msgtx: remove unused writeTxIn

f0184e5

wire/msgtx: remove unused readTxIn

e0fa866

wire/msgtx: remove unused readScript

4cc4f76

wire/msgtx: remove unused read/writeOutPoint

2e6eefc

wire/msgtx: use tx-level script slab

d7396dc

wire/msgblock+msgtx: user block-level script slab

8c4da83

Roasbeef mentioned this pull request Dec 16, 2023

wire: only borrow/return binaryFreeList buffers at the message level #1426

Closed

Roasbeef requested a review from guggero December 18, 2023 18:38

guggero reviewed Dec 20, 2023

wire: consistently use defer for returning scratch buffers

b0e9636

Roasbeef merged commit 16684f6 into btcsuite:master Dec 29, 2023
3 checks passed
