-
Notifications
You must be signed in to change notification settings - Fork 36.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MuHash3072 implementation #19055
Conversation
Reviewers: This PR is being discussed in the review club at https://bitcoincore.reviews/19055 (A |
Some history for the context: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK - this is so cool!
Regarding approaches (i.e. which hash function) I looked at your comparison in this doc, and have a question about the "rolling-ness" consideration and why it's important. It also seems that ECMH (which is noted as non-rolling) performed better and has a maintainability plus since it's part of Secp256k1.
From what I’ve gathered:
- incremental hashing is efficient with arbitrary additions/deletions from a large set, at any point in the data
- rolling hash functions have this same efficiency idea but are particularly optimized for adding/popping at the ends (i.e. like a "rolling" window)
(Please correct me if I'm mistaken here, since you've definitely spent more time with this. I also understand that they aren't mutually exclusive).
Since the UTXO set is not spent in any particular order and you can’t sort it ahead of time, I definitely see (theoretically as well as from the tangible performance wins) why an incremental hash is better, but I fail to understand why rolling is particularly important?
@gzhao408 I think there is a bit of a miscommunication. The term "rolling" here just means efficient addition as well as deletion from the set being hashed. There is no actual window, as sets are unordered. Perhaps the term was chosen badly (which would be my fault). All of LtHash, MuHash, and ECMH support incremental addition and deletion. The "non-rolling" mention in @fjahr's document refers to the fact that he hasn't implemented continuous UTXO set hashing with these functions, only a from scratch computation. It has nothing to do with what these functions can or can't do. The primary difference between MuHash and ECMH is caching:
|
AaaAHhhh, thank you @sipa for the clarification!
This makes a lot more sense; I had thought he meant the hash function itself is rolling/non-rolling. I didn't have much info but was super interested and went for a textbook definition 😅 now I feel embarrassed. This information is super helpful, thanks! |
Please also note that there is an implementation of ECMH that is an open pull request to Secp256k1 but it is not merged and has been stale for some time. So the work to get it merged is more or less the same and I don't think there is a big difference in maintainability between having it in core or secp. |
@fjahr I've written a Python implementation of MuHash3072, so it can be more easily tested: https://github.com/sipa/bitcoin/commits/202005_muhash_python (no tests are included, but I've verified it matches the C++ code in a few simple examples). |
Awesome, I have just pulled it in here and will build further tests on top of it. |
Rebased and added a6cf0df which allows the user to use the legacy hash with the use of a flag. This will definitely be squashed but I decided to keep it separate for the moment to make discussions easier in the review club tomorrow. I figured I would need to do something like this anyway in order go through a deprecation cycle with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK / partial ACK 4438aed. Code review. Did not yet compare the C++ MuHash implementation to the Python one or look for other implementations to verify with. Built, ran tests, ran benchmarks. Verified that minor mutations to muhash.cpp
duly broke the new unit test. Tested the new RPC gettxoutsetinfo
help and boolean. Some comments:
-
This PR essentially makes
gettxoutsetinfo
too slow to be useable in testnet and mainnet (it times out and raises after 15 minutes); for that reason, until the -coinstatsindex in [WIP] Index for UTXO Set Statistics #18000 is merged, the MuHash algorithm should be opt-in for rpcgettxoutsetinfo
and not the default -
What are your plans regarding ECMH?
-
it looks like
TruncatedSHA512Writer
could use unit tests (perhaps just sanity checks if not viable to use test vectors,CHashWriter
is also not tested directly but used by other unit tests: addrman, hash, sighash) -
a fuzz harness would be ideal
-
do you plan to add tests using the just-added Python MuHash3072 implementation?
-
06ae4ab commit message s/256/512/?
-
04d088d commit message could mention the renaming of
hash_serialized_2
toutxo_set_hash
and why it was changed (I didn't see any explanation anywhere) -
04d088d and a6cf0df ought to be squashed to one commit; keeping them separate unnecessarily complicates reviewing -- I squashed them locally in order to see the relevant diff, as ApplyStats() is weird enough to review already
-
at some point you'll want to add a release note
-
some nits below; feel free to ignore
tested the new benchmarks
$ src/bench/bench_bitcoin -filter=MuHash*.*
# Benchmark, evals, iterations, total, min, max, median
MuHash, 5, 5000, 0.287426, 7.45603e-06, 1.72302e-05, 1.08339e-05
MuHashAdd, 5, 5000, 0.243778, 7.93741e-06, 1.17234e-05, 9.61388e-06
MuHashDiv, 5, 100, 8.62069, 0.0146344, 0.0216452, 0.0169485
MuHashPrecompute, 5, 5000, 0.0557224, 1.55965e-06, 4.40565e-06, 1.65373e-06
$ src/bench/bench_bitcoin -filter=MuHash*.* -evals=10
# Benchmark, evals, iterations, total, min, max, median
MuHash, 10, 5000, 0.286697, 5.3029e-06, 6.24744e-06, 5.78305e-06
MuHashAdd, 10, 5000, 0.227103, 4.1665e-06, 5.39662e-06, 4.52542e-06
MuHashDiv, 10, 100, 12.7459, 0.0106584, 0.0182222, 0.0120859
MuHashPrecompute, 10, 5000, 0.096089, 1.67035e-06, 2.37696e-06, 1.89207e-06
$ src/bench/bench_bitcoin -filter=MuHash*.* -evals=50
# Benchmark, evals, iterations, total, min, max, median
MuHash, 50, 5000, 1.92384, 6.04054e-06, 1.34205e-05, 7.3487e-06
MuHashAdd, 50, 5000, 1.26626, 3.9807e-06, 5.9453e-06, 5.61147e-06
MuHashDiv, 50, 100, 59.2743, 0.00950589, 0.0151357, 0.011696
MuHashPrecompute, 50, 5000, 0.386586, 1.4394e-06, 1.66618e-06, 1.55641e-06
Note to reviewers: you'll need to build #18000 to build the index and test performance, as it contains the -coinstatsindex that this PR does not yet implement.
Also verified that, like your added test in $ bitcoin-cli -regtest gettxoutsetinfo
{
"height": 15599,
"bestblock": "6e535bc570f9be1b86d21711b23391e4e8f001682b5c6243883744879cdc4f84",
"transactions": 3216,
"txouts": 3219,
"bogosize": 234987,
"utxo_set_hash": "c4926481130f6689950e07e0e1f2060277d349b106a94b83145d577d3db4d225",
"disk_size": 226234,
"total_amount": 14949.99998350
}
$ bitcoin-cli -regtest gettxoutsetinfo true
{
"height": 15599,
"bestblock": "6e535bc570f9be1b86d21711b23391e4e8f001682b5c6243883744879cdc4f84",
"transactions": 3216,
"txouts": 3219,
"bogosize": 234987,
"utxo_set_hash": "1883e99e56927fd5e3b82ccb9e08acd1be7ed0256d4cafe142626ae2fb66ffb4",
"disk_size": 226234,
"total_amount": 14949.99998350
} |
nit: "separate" is misspelled in the commit title of cda20f3 It might be nice to split this in two PRs: One which adds MuHash, and one which updates Edited to add thought: The next PR could add rolling as well, so this way we don't have an interim state where |
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
Addressed the latest comments.
Done |
So good |
SERIALIZE_METHODS(MuHash3072, obj) | ||
{ | ||
READWRITE(obj.m_numerator); | ||
READWRITE(obj.m_denominator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really what we want?
or do we want it to normalize it (like Finalize does) and then just serialize the numerator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might be a good idea. If we want to use MuHash in combination with assumeutxo then the shorter the (marginally) better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that would be nicer and I had implemented it like that a few weeks ago but decided to kick it out again because code-wise it felt awkward to force normalization before every serialization. Maybe there is a better way to do it than what I tried code-wise. I don't think I found another example of where something comparable is done IIRC. Overall, normalizing is an expensive operation that would need to run quite for every serialization to result in a 50% space-saving on disk. However, for CoinstatsIndex I only save a single MuHash3072 to disk at any time, so at least for that, it would only save 384 bytes total. That's why I took the easy way and kept it as it is now.
I can try to revive my old branch and push it for the sake of discussion. But I am not sure why it makes a difference for assumeutxo, wouldn't we use the finalized 32 byte hash for that as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it depends, if we ever add a commitment to this in the coinbase or something like that then we might want to also save the MuHash in the undoBlocks.
But I think there's no point in discussing this now, so IMO you can leave it as-is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 9815332
9815332 test: Change MuHash Python implementation to match cpp version again (Fabian Jahr) 01297fb fuzz: Add MuHash consistency fuzz test (Fabian Jahr) b111410 test: Add MuHash3072 fuzz test (Fabian Jahr) c122527 bench: Add Muhash benchmarks (Fabian Jahr) 7b12422 test: Add MuHash3072 unit tests (Fabian Jahr) adc708c crypto: Add MuHash3072 implementation (Fabian Jahr) 0b4d290 crypto: Add Num3072 implementation (Fabian Jahr) 589f958 build: Check for 128 bit integer support (Fabian Jahr) Pull request description: This is the first split of bitcoin#18000 which implements the Muhash algorithm and uses it to calculate the UTXO set hash in `gettxoutsetinfo`. ACKs for top commit: laanwj: Code review ACK 9815332 Tree-SHA512: 4bc090738f0e3d80b74bdd8122e24a8ce80121120fd37c7e4335a73e7ba4fcd7643f2a2d559e2eebf54b8e3a3bd5f12cfb27ba61ded135fda210a07a233eae45
|
||
} // namespace | ||
|
||
/** Indicates wether d is larger than the modulus. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: wether -> whether
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Included a fix in #19145
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop based on epochIterations. That does not do what it is supposed to do, because epochIterations() is determined automatically from nanobench. Also, multiplication is not needed for the algorithm (as pointed out by a comment in bitcoin#19055), so it's better to remove this loop.
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop based on epochIterations. That does not do what it is supposed to do, because epochIterations() is determined automatically from nanobench. Also, multiplication is not needed for the algorithm (as pointed out by a comment in bitcoin#19055), so it's better to remove this loop.
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop based on epochIterations. That does not do what it is supposed to do, because epochIterations() is determined automatically from nanobench. Also, multiplication is not needed for the algorithm (as pointed out by a comment in bitcoin#19055), so it's better to remove this loop.
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop based on epochIterations. That does not do what it is supposed to do, because epochIterations() is determined automatically from nanobench. Also, multiplication is not needed for the algorithm (as pointed out by a comment in bitcoin#19055), so it's better to remove this loop.
Summary: PR description: > This is the second in a series of pull requests to implement an Index for UTXO set statistics. > > This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version. > test: Move modinv to util and add unit test bitcoin/bitcoin@ab30cec > test: Add Python MuHash3072 implementation to test framework bitcoin/bitcoin@b85543c > test: Add basic Python/C++ Muhash implementation parity unit test bitcoin/bitcoin@0e2b400 > test: Add chacha20 test vectors in muhash bitcoin/bitcoin@36ec980 This is a backport of [[bitcoin/bitcoin#19105 | core#19105]] Backport note: the C++ implementation will be merged much later ([[bitcoin/bitcoin#19055 | core#19055]]) Test Plan: `ninja check-functional` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D10166
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop based on epochIterations. That does not do what it is supposed to do, because epochIterations() is determined automatically from nanobench. Also, multiplication is not needed for the algorithm (as pointed out by a comment in bitcoin#19055), so it's better to remove this loop.
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop based on epochIterations. That does not do what it is supposed to do, because epochIterations() is determined automatically from nanobench. Also, multiplication is not needed for the algorithm (as pointed out by a comment in bitcoin#19055), so it's better to remove this loop.
Summary: General context (see core#18000): This is a step in the way of implementing an index of coin statistics with the goal of making the response time of the `gettxoutsetinfo` RPC call dramatically faster. Currently, this RPC is scanning the full UTXO set every time it is called which makes it hard to use for users that want to continually check the coin supply or compare UTXO set hashes between different nodes. It is especially challenging in periods of multiple quickly mined blocks, even relatively fast machines. > build: Check for 128 bit integer support bitcoin/bitcoin@2474645 > crypto: Add Num3072 implementation > > Num3072 is a specialized bignum implementation used in MuHash3072. bitcoin/bitcoin@0b4d290 > crypto: Add MuHash3072 implementation bitcoin/bitcoin@adc708c > test: Add MuHash3072 unit tests bitcoin/bitcoin@7b12422 > test: Change MuHash Python implementation to match cpp version again bitcoin/bitcoin@9815332 Co-authored-by: Pieter Wuille <pieter.wuille@gmail.com> Co-authored-by: Anthony Towns <aj@erisian.com.au> This is a partial backport of [[bitcoin/bitcoin#19055 | core#19055]] Test Plan: `ninja check check-functional` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Subscribers: Fabien Differential Revision: https://reviews.bitcoinabc.org/D11150
This is the first split of #18000 which implements the Muhash algorithm and uses it to calculate the UTXO set hash in
gettxoutsetinfo
.