Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace non-standard CLZ builtins with c++20's bit_width #29057

Closed
wants to merge 3 commits into from

Conversation

theuni
Copy link
Member

@theuni theuni commented Dec 11, 2023

Split out of #28674

Note that we can't yet drop our configure checks because we pass the results on to minisketch. I've opened a PR for that upstream here: sipa/minisketch#80

fanquake suggested that we simply replace our CountBits call-sites with uses of std::bit_width directly and just drop the tests and fuzzers. I agree with that, but I wanted to allow our tests/fuzzers to run with this change first to help convince reviewers that std::bit_width is a drop-in replacement forCountBits.

I can either add a commit on top of this PR to do the switch after the c-i has run or do it as a follow-up PR, I have no preference.

I was curious to see what would happen under the hood with this change. Fortunately libc++ (as an example) does exactly what one would expect, using the built-ins: https://github.com/llvm/llvm-project/blob/main/libcxx/include/__bit/countl.h#L56

@DrahtBot
Copy link
Contributor

DrahtBot commented Dec 11, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK hebasto, fanquake
Stale ACK theStack

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #28674 ([POC] C++20 std::endian by fanquake)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Contributor

@theStack theStack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-review ACK 273e4dc

Unit tests succeeded locally ✔️

Seems reasonable to keep CountBits in a first step with its tests and fuzzers for reassurance and then replace all of its call-sites. Happy to re-review this PR or a follow-up, depending on where you decide to put the next commit.

@hebasto
Copy link
Member

hebasto commented Dec 12, 2023

Concept ACK.

@hebasto
Copy link
Member

hebasto commented Dec 12, 2023

... but I wanted to allow our tests/fuzzers to run with this change first to help convince reviewers that std::bit_width is a drop-in replacement forCountBits.

I'm convinced :)

@fanquake
Copy link
Member

Concept ACK. We can also check the benchmarks in https://corecheck.dev/bitcoin/bitcoin/pulls/29057 when they become available.

I can either add a commit on top of this PR to do the switch after the c-i has run or do it as a follow-up PR, I have no preference.

I think the changes here should be straightforward enough, that you could push another commit doing the cleanup.

@theuni
Copy link
Member Author

theuni commented Dec 12, 2023

Thanks @theStack and @hebasto for the quick review.

@fanquake sounds good, will push a commit to do the switch once we see some benchmarks.

@maflcko
Copy link
Member

maflcko commented Dec 13, 2023

I don't think the benchmarks will come. I guess the machine has been preempted by AWS? @aureleoules

@aureleoules
Copy link
Member

I don't think the benchmarks will come. I guess the machine has been preempted by AWS? @aureleoules

I've re-ran the job. I actually found a way to re-run preempted jobs automatically!
A lot of benchmark jobs failed yesterday because of #29061 including this pull because I rebase all pulls on master before running jobs and I don't properly handle failures 😬

@aureleoules
Copy link
Member

So the benchmark results are available now but they are pessimistic on a few benchs (WalletCreateTxUseOnlyPresetInputs, WalletAvailableCoins and PrevectorClearNontrivial).

I ran the benchmarks locally without cachegrind and there does not seem to be any variation between master and the pull.
But when I run them with cachegrind I notice the same variation as seen on corecheck.

This may be due to a limitation of cachegrind because it cannot emulate L2 cache and so results may be pessimist? I'm not an expert in this field though so I'll let you interpret the results😅

@theuni
Copy link
Member Author

theuni commented Dec 13, 2023

I can reproduce a slight slowdown here. I added a bench that demonstrates the difference, then the original commit, then the removal.

Both libc++ and libstdc++ implement this in terms of the same built-ins we were using before, so I find this surprising. I'd hate to find that the c++ism's have a cost.

Going to convert this to a draft while I look into it.

bit_width is a drop-in replacement with an exact meaning in c++, so there is
no need to continue testing/fuzzing/benchmarking.
@maflcko
Copy link
Member

maflcko commented Jan 5, 2024

I can reproduce a slight slowdown here.
Both libc++ and libstdc++ implement this in terms of the same built-ins we were using before, so I find this surprising. I'd hate to find that the c++ism's have a cost.

I ran the first commit via ./src/bench/bench_bitcoin --filter=BitWidth.* and I can see a slowdown of the current implementation (std is faster). Which is also confusing, given that it is the same code. Doubly confusing, because it is the opposite result of yours?

@theuni
Copy link
Member Author

theuni commented Jan 17, 2024

@maflcko I'm thinking we should just take the c++20 code and assume it will be more performant going forward. That makes sense to me anyway.

I'm going to go ahead and close this and include it in another PR which gathers the serialization c++20 changes. We can do another round of benches there.

@theuni theuni closed this Jan 17, 2024
fanquake added a commit that referenced this pull request Mar 1, 2024
86b7f28 serialization: use internal endian conversion functions (Cory Fields)
432b18c serialization: detect byteswap builtins without autoconf tests (Cory Fields)
297367b crypto: replace CountBits with std::bit_width (Cory Fields)
52f9bba crypto: replace non-standard CLZ builtins with c++20's bit_width (Cory Fields)

Pull request description:

  This replaces #28674, #29036, and #29057. Now ready for testing and review.

  Replaces platform-specific endian and byteswap functions. This is especially useful for kernel, as it means that our deep serialization code no longer requires bitcoin-config.h.

  I apologize for the size of the last commit, but it's hard to avoid making those changes at once.

  All platforms now use our internal functions rather than libc or platform-specific ones, with the exception of MSVC.

  Sadly, benchmarking showed that not all compilers are capable of detecting and optimizing byteswap functions, so compiler builtins are instead used where possible. However, they're now detected via macros rather than autoconf checks.

  This[ matches how libc++ implements std::byteswap for c++23](https://github.com/llvm/llvm-project/blob/main/libcxx/include/__bit/byteswap.h#L26).

  I suggest we move/rename `compat/endian.h`, but I left that out of this PR to avoid bikeshedding.

  #29057 pointed out some irregularities in benchmarks. After messing with various compilers and configs for a few weeks with these changes, I'm of the opinion that we can't win on every platform every time, so we should take the code that makes sense going forward. That said, if any real-world slowdowns are caused here, we should obviously investigate.

ACKs for top commit:
  maflcko:
    ACK 86b7f28 📘
  fanquake:
    ACK 86b7f28 - we can finish pruning out the __builtin_clz* checks/usage once the minisketch code has been updated. This is more good cleanup pre-CMake & for the kernal.

Tree-SHA512: 715a32ec190c70505ffbce70bfe81fc7b6aa33e376b60292e801f60cf17025aabfcab4e8c53ebb2e28ffc5cf4c20b74fe3dd8548371ad772085c13aec8b7970e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants