-
Notifications
You must be signed in to change notification settings - Fork 38.4k
streams: replace std::find with memchr (5x improvement)
#34044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34044. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please copy-paste |
|
Concept ACK 'memchr' seems like a better alternative here |
l0rinc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK
FindByte’s only production use seems to be scanning block files during -reindex to find the first byte of the network magic in a circular buffer.
Because of documented historical bugs the data may be jumbled a bit, it's why we need to be able to find the beginning.
It's not something we expect users to do often, but this should speed it up a tiny bit. Left a few suggestions to make it more readable as well to make it easier to sell :)
I would be curious whether a -reindex -stopafterblockimport speedup is measurable - since my understanding is that the magic is usually at the beginning anyway, so it's also possible this ends up slowing down the average case.
I haven't checked, but LoadExternalBlockFile bench should also exercise this change - and it puts the change into perspective.
4115c99 to
246086c
Compare
|
it doesn't help that you rebase on a Knots base - the first few commits are already merged on Core, please fix that. |
246086c to
b6f93de
Compare
before: | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 53.20 | 18,796,833.40 | 0.0% | 11.00 | `FindByte` | 22,499,431.11 | 44.45 | 0.2% | 10.90 | `LoadExternalBlockFile` after: | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 10.38 | 96,365,031.03 | 0.0% | 10.99 | `FindByte` | 22,128,903.67 | 45.19 | 0.3% | 10.96 | `LoadExternalBlockFile`
b6f93de to
830024e
Compare
I've ran a reindex benchmark, see updated PR description. I'm not sure if ~1.2% is an irrelevant gain, but I argue a 5-6x improvement on the |
|
I ran a reindex (without chainstate) 3 times for the whole mainchain - there is no measurable speedup here (it's even 1-2% slower than before, though that's most likely just noise): reindex | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)COMMITS="938d7aacabd0bb3784bb3e529b1ed06bb2891864 e24701fe5522ac9b0eaeacc67bd16e11555a6020"; \
DATA_DIR="$HOME/Library/Application Support/Bitcoin"; LOG_DIR="$HOME/bitcoin-reindex-logs"; \
mkdir -p "$LOG_DIR"; \
COMMA_COMMITS=${COMMITS// /,}; \
(echo ""; for c in $(echo $COMMITS); do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
(echo "" && echo "reindex | $(hostname) | $(uname -m) | $(sysctl -n machdep.cpu.brand_string) | $(nproc) cores | $(printf '%.1fGiB' "$(( $(sysctl -n hw.memsize)/1024/1024/1024 ))") RAM | SSD | $(sw_vers -productName) $(sw_vers -productVersion) $(sw_vers -buildVersion) | $(xcrun clang --version | head -1)"; echo "") && \
hyperfine \
--sort command \
--runs 3 \
--export-json "$LOG_DIR/reindex-$(echo "$COMMITS" | sed -E 's/([a-f0-9]{8})[a-f0-9]* ?/\1-/g;s/-$//')-appleclang.json" \
--parameter-list COMMIT "$COMMA_COMMITS" \
--prepare "killall -9 bitcoind 2>/dev/null || true; rm -f \"$DATA_DIR\"/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo && ninja -C build bitcoind -j2" \
--conclude "killall bitcoind 2>/dev/null || true; sleep 5; grep -q 'Stopping after block import' \"$DATA_DIR\"/debug.log || { echo 'debug.log assertions failed'; exit 1; }; \
cp \"$DATA_DIR\"/debug.log \"$LOG_DIR\"/reindex-{COMMIT}-\$(date +%s).log 2>/dev/null || true" \
"./build/bin/bitcoind -datadir=\"$DATA_DIR\" -reindex -stopafterblockimport -printtoconsole=0"
938d7aacab Merge bitcoin/bitcoin#33657: rest: allow reading partial block data from storage
e24701fe55 streams: replace std::find with memchrBenchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -reindex -stopafterblockimport -printtoconsole=0 (COMMIT = 938d7aacabd0bb3784bb3e529b1ed06bb2891864)
Time (mean ± σ): 17116.255 s ± 287.737 s [User: 18550.915 s, System: 3721.215 s]
Range (min … max): 16793.539 s … 17346.051 s 3 runs
Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -reindex -stopafterblockimport -printtoconsole=0 (COMMIT = e24701fe5522ac9b0eaeacc67bd16e11555a6020)
Time (mean ± σ): 17344.195 s ± 113.196 s [User: 18589.419 s, System: 3727.864 s]
Range (min … max): 17247.068 s … 17468.509 s 3 runsIt can still be a useful change, but I don't think it's fair to say it speeds up anything. |
|
I wouldn't expect any change, because unless you're starting from a pre-0.8 node block files, or badly corrupted ones, it'll always be the first byte that matches. |
|
It's also what I expected, as mentioned above, the current solution is worse for that case |
|
understood. |
|
I think it could be better to remove the recovery logic. An upgrade from 0.8 does not seem like a use-case that any user will ever need in the future. Also, I don't see what kind of corruption this could possibly be able to recover, given that it can't progress past the first corrupt block anyway? |
I will push a PR for that, let's see what others think |
Summary
This PR optimizes the
FindBytemethod by usingmemchrinstead ofstd::find. This takes advantage of the underlying optimizations that come withmemchr, primarily vectorized chunked reads. Whilestd::findis more standard and modern, it is suboptimal for iterating single bytes as they're iterated 1 by 1 instead of exploiting SIMD.One could argue that this is not a concern of Bitcoin Core but rather of libc++ mantainers, but since it shows 5x improvement in existing benchmarks, I think it's worth including.
Benchmarks
Details
taskset -c 1 ./bin/bench_bitcoin -filter="(FindByte|LoadExternalBlockFile)" --min-time=10000FindByteLoadExternalBlockFileFindByteLoadExternalBlockFileI've also ran a reindex benchmark up to block 300'000 and it shows a slight improvement of ~1.2%
Details