Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MuHash3072 implementation #19055

Merged
merged 8 commits into from
Jan 7, 2021
Merged

Add MuHash3072 implementation #19055

merged 8 commits into from
Jan 7, 2021

Conversation

fjahr
Copy link
Contributor

@fjahr fjahr commented May 22, 2020

This is the first split of #18000 which implements the Muhash algorithm and uses it to calculate the UTXO set hash in gettxoutsetinfo.

@jonatack
Copy link
Member

Reviewers: This PR is being discussed in the review club at https://bitcoincore.reviews/19055

(A Review club tag for this PR would be helpful to indicate that review notes and discussion are available there.)

@jonasschnelli
Copy link
Contributor

Some history for the context:

Copy link
Member

@glozow glozow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK - this is so cool!

Regarding approaches (i.e. which hash function) I looked at your comparison in this doc, and have a question about the "rolling-ness" consideration and why it's important. It also seems that ECMH (which is noted as non-rolling) performed better and has a maintainability plus since it's part of Secp256k1.

From what I’ve gathered:

  • incremental hashing is efficient with arbitrary additions/deletions from a large set, at any point in the data
  • rolling hash functions have this same efficiency idea but are particularly optimized for adding/popping at the ends (i.e. like a "rolling" window)
    (Please correct me if I'm mistaken here, since you've definitely spent more time with this. I also understand that they aren't mutually exclusive).

Since the UTXO set is not spent in any particular order and you can’t sort it ahead of time, I definitely see (theoretically as well as from the tangible performance wins) why an incremental hash is better, but I fail to understand why rolling is particularly important?

@sipa
Copy link
Member

sipa commented May 26, 2020

@gzhao408 I think there is a bit of a miscommunication. The term "rolling" here just means efficient addition as well as deletion from the set being hashed. There is no actual window, as sets are unordered. Perhaps the term was chosen badly (which would be my fault).

All of LtHash, MuHash, and ECMH support incremental addition and deletion. The "non-rolling" mention in @fjahr's document refers to the fact that he hasn't implemented continuous UTXO set hashing with these functions, only a from scratch computation. It has nothing to do with what these functions can or can't do.

The primary difference between MuHash and ECMH is caching:

  • ECMH is more CPU time overall, but the time is mostly in computing the "effect" of a set of additions/deletions; applying that effect to the overall hash is extremely cheap. Furthermore, a 64-byte precomputed "patch" per set of additions/deletions can be created, which means a patch could be cached per-transaction, and then very cheaply applied when the transaction confirms. So it allows doing most of the work ahead of time (before the block arrives).

  • MuHash is cheaper overall, but the time is mostly in applying the changes to the overall hash. This means caching isn't very useful (and also the caches would be 768 bytes, which is pretty large). If the intent is computing things in the background, or at block connection time, then this is a better (and simpler) approach.

@glozow
Copy link
Member

glozow commented May 26, 2020

AaaAHhhh, thank you @sipa for the clarification!

The term "rolling" here just means efficient addition as well as deletion from the set being hashed.
The "non-rolling" mention in @fjahr's document refers to the fact that he hasn't implemented continuous UTXO set hashing with these functions

This makes a lot more sense; I had thought he meant the hash function itself is rolling/non-rolling. I didn't have much info but was super interested and went for a textbook definition 😅 now I feel embarrassed. This information is super helpful, thanks!

@fjahr
Copy link
Contributor Author

fjahr commented May 26, 2020

It also seems that ECMH (which is noted as non-rolling) performed better and has a maintainability plus since it's part of Secp256k1.

Please also note that there is an implementation of ECMH that is an open pull request to Secp256k1 but it is not merged and has been stale for some time. So the work to get it merged is more or less the same and I don't think there is a big difference in maintainability between having it in core or secp.

@sipa
Copy link
Member

sipa commented May 26, 2020

@fjahr I've written a Python implementation of MuHash3072, so it can be more easily tested: https://github.com/sipa/bitcoin/commits/202005_muhash_python (no tests are included, but I've verified it matches the C++ code in a few simple examples).

@fjahr
Copy link
Contributor Author

fjahr commented May 26, 2020

@fjahr I've written a Python implementation of MuHash3072, so it can be more easily tested: sipa/bitcoin@202005_muhash_python (commits) (no tests are included, but I've verified it matches the C++ code in a few simple examples).

Awesome, I have just pulled it in here and will build further tests on top of it.

@fjahr
Copy link
Contributor Author

fjahr commented May 26, 2020

Rebased and added a6cf0df which allows the user to use the legacy hash with the use of a flag. This will definitely be squashed but I decided to keep it separate for the moment to make discussions easier in the review club tomorrow. I figured I would need to do something like this anyway in order go through a deprecation cycle with gettxoutsetinfo, the question is if that is still needed with the flag. Also added 4438aed which is @sipa 's python implementation of Muhash, currently unused.

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK / partial ACK 4438aed. Code review. Did not yet compare the C++ MuHash implementation to the Python one or look for other implementations to verify with. Built, ran tests, ran benchmarks. Verified that minor mutations to muhash.cpp duly broke the new unit test. Tested the new RPC gettxoutsetinfo help and boolean. Some comments:

  • This PR essentially makes gettxoutsetinfo too slow to be useable in testnet and mainnet (it times out and raises after 15 minutes); for that reason, until the -coinstatsindex in [WIP] Index for UTXO Set Statistics #18000 is merged, the MuHash algorithm should be opt-in for rpc gettxoutsetinfo and not the default

  • What are your plans regarding ECMH?

  • it looks like TruncatedSHA512Writer could use unit tests (perhaps just sanity checks if not viable to use test vectors, CHashWriter is also not tested directly but used by other unit tests: addrman, hash, sighash)

  • a fuzz harness would be ideal

  • do you plan to add tests using the just-added Python MuHash3072 implementation?

  • 06ae4ab commit message s/256/512/?

  • 04d088d commit message could mention the renaming of hash_serialized_2 to utxo_set_hash and why it was changed (I didn't see any explanation anywhere)

  • 04d088d and a6cf0df ought to be squashed to one commit; keeping them separate unnecessarily complicates reviewing -- I squashed them locally in order to see the relevant diff, as ApplyStats() is weird enough to review already

  • at some point you'll want to add a release note

  • some nits below; feel free to ignore

tested the new benchmarks

$ src/bench/bench_bitcoin -filter=MuHash*.*
# Benchmark, evals, iterations, total, min, max, median
MuHash, 5, 5000, 0.287426, 7.45603e-06, 1.72302e-05, 1.08339e-05
MuHashAdd, 5, 5000, 0.243778, 7.93741e-06, 1.17234e-05, 9.61388e-06
MuHashDiv, 5, 100, 8.62069, 0.0146344, 0.0216452, 0.0169485
MuHashPrecompute, 5, 5000, 0.0557224, 1.55965e-06, 4.40565e-06, 1.65373e-06

$ src/bench/bench_bitcoin -filter=MuHash*.* -evals=10
# Benchmark, evals, iterations, total, min, max, median
MuHash, 10, 5000, 0.286697, 5.3029e-06, 6.24744e-06, 5.78305e-06
MuHashAdd, 10, 5000, 0.227103, 4.1665e-06, 5.39662e-06, 4.52542e-06
MuHashDiv, 10, 100, 12.7459, 0.0106584, 0.0182222, 0.0120859
MuHashPrecompute, 10, 5000, 0.096089, 1.67035e-06, 2.37696e-06, 1.89207e-06

$ src/bench/bench_bitcoin -filter=MuHash*.* -evals=50
# Benchmark, evals, iterations, total, min, max, median
MuHash, 50, 5000, 1.92384, 6.04054e-06, 1.34205e-05, 7.3487e-06
MuHashAdd, 50, 5000, 1.26626, 3.9807e-06, 5.9453e-06, 5.61147e-06
MuHashDiv, 50, 100, 59.2743, 0.00950589, 0.0151357, 0.011696
MuHashPrecompute, 50, 5000, 0.386586, 1.4394e-06, 1.66618e-06, 1.55641e-06

Note to reviewers: you'll need to build #18000 to build the index and test performance, as it contains the -coinstatsindex that this PR does not yet implement.

src/Makefile.am Show resolved Hide resolved
src/bench/crypto_hash.cpp Outdated Show resolved Hide resolved
src/bench/crypto_hash.cpp Outdated Show resolved Hide resolved
src/node/coinstats.cpp Outdated Show resolved Hide resolved
@jonatack
Copy link
Member

jonatack commented May 27, 2020

Also verified that, like your added test in rpc_blockchain.py, only utxo_set_hash varies between gettxoutsetinfo in -regtest with each algo.

$ bitcoin-cli -regtest gettxoutsetinfo
{
  "height": 15599,
  "bestblock": "6e535bc570f9be1b86d21711b23391e4e8f001682b5c6243883744879cdc4f84",
  "transactions": 3216,
  "txouts": 3219,
  "bogosize": 234987,
  "utxo_set_hash": "c4926481130f6689950e07e0e1f2060277d349b106a94b83145d577d3db4d225",
  "disk_size": 226234,
  "total_amount": 14949.99998350
}
$ bitcoin-cli -regtest gettxoutsetinfo true
{
  "height": 15599,
  "bestblock": "6e535bc570f9be1b86d21711b23391e4e8f001682b5c6243883744879cdc4f84",
  "transactions": 3216,
  "txouts": 3219,
  "bogosize": 234987,
  "utxo_set_hash": "1883e99e56927fd5e3b82ccb9e08acd1be7ed0256d4cafe142626ae2fb66ffb4",
  "disk_size": 226234,
  "total_amount": 14949.99998350
}

@narula
Copy link
Contributor

narula commented May 27, 2020

nit: "separate" is misspelled in the commit title of cda20f3

It might be nice to split this in two PRs: One which adds MuHash, and one which updates gettxoutsetinfo to support it. I would definitely at least rebase/squash to put all the MuHash additions first and then layer the functionality changes on top.

Edited to add thought: The next PR could add rolling as well, so this way we don't have an interim state where gettxoutsetinfo is even slower than it is now.

@DrahtBot
Copy link
Contributor

DrahtBot commented May 27, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

src/rpc/blockchain.cpp Outdated Show resolved Hide resolved
@fjahr
Copy link
Contributor Author

fjahr commented Dec 22, 2020

Addressed the latest comments.

Perhaps Num3072 can have SERIALIZE_METHODS (and a test), so that MuHash3072 can serialise its numerator and denominator without poking into Num3072 internals.

Done

@Limpisey168
Copy link

So good

Comment on lines +123 to +126
SERIALIZE_METHODS(MuHash3072, obj)
{
READWRITE(obj.m_numerator);
READWRITE(obj.m_denominator);
Copy link
Contributor

@elichai elichai Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really what we want?
or do we want it to normalize it (like Finalize does) and then just serialize the numerator?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be a good idea. If we want to use MuHash in combination with assumeutxo then the shorter the (marginally) better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that would be nicer and I had implemented it like that a few weeks ago but decided to kick it out again because code-wise it felt awkward to force normalization before every serialization. Maybe there is a better way to do it than what I tried code-wise. I don't think I found another example of where something comparable is done IIRC. Overall, normalizing is an expensive operation that would need to run quite for every serialization to result in a 50% space-saving on disk. However, for CoinstatsIndex I only save a single MuHash3072 to disk at any time, so at least for that, it would only save 384 bytes total. That's why I took the easy way and kept it as it is now.

I can try to revive my old branch and push it for the sake of discussion. But I am not sure why it makes a difference for assumeutxo, wouldn't we use the finalized 32 byte hash for that as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it depends, if we ever add a commitment to this in the coinbase or something like that then we might want to also save the MuHash in the undoBlocks.
But I think there's no point in discussing this now, so IMO you can leave it as-is

Copy link
Member

@laanwj laanwj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 9815332

@laanwj laanwj merged commit b6a71b8 into bitcoin:master Jan 7, 2021
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jan 7, 2021
9815332 test: Change MuHash Python implementation to match cpp version again (Fabian Jahr)
01297fb fuzz: Add MuHash consistency fuzz test (Fabian Jahr)
b111410 test: Add MuHash3072 fuzz test (Fabian Jahr)
c122527 bench: Add Muhash benchmarks (Fabian Jahr)
7b12422 test: Add MuHash3072 unit tests (Fabian Jahr)
adc708c crypto: Add MuHash3072 implementation (Fabian Jahr)
0b4d290 crypto: Add Num3072 implementation (Fabian Jahr)
589f958 build: Check for 128 bit integer support (Fabian Jahr)

Pull request description:

  This is the first split of bitcoin#18000 which implements the Muhash algorithm and uses it to calculate the UTXO set hash in `gettxoutsetinfo`.

ACKs for top commit:
  laanwj:
    Code review ACK 9815332

Tree-SHA512: 4bc090738f0e3d80b74bdd8122e24a8ce80121120fd37c7e4335a73e7ba4fcd7643f2a2d559e2eebf54b8e3a3bd5f12cfb27ba61ded135fda210a07a233eae45

} // namespace

/** Indicates wether d is larger than the modulus. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: wether -> whether

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Included a fix in #19145

@fjahr fjahr mentioned this pull request Jan 25, 2021
martinus added a commit to martinus/bitcoin that referenced this pull request Sep 18, 2021
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop based on epochIterations. That does not do what it is supposed to do, because epochIterations() is determined automatically from nanobench.

Also, multiplication is not needed for the algorithm (as pointed out by a comment in bitcoin#19055), so it's better to remove this loop.
martinus added a commit to martinus/bitcoin that referenced this pull request Sep 18, 2021
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop
based on epochIterations. That does not do what it is supposed to do,
because epochIterations() is determined automatically from nanobench.

Also, multiplication is not needed for the algorithm (as pointed out by
a comment in bitcoin#19055), so it's better to remove this loop.
martinus added a commit to martinus/bitcoin that referenced this pull request Sep 18, 2021
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop
based on epochIterations. That does not do what it is supposed to do,
because epochIterations() is determined automatically from nanobench.

Also, multiplication is not needed for the algorithm (as pointed out by
a comment in bitcoin#19055), so it's better to remove this loop.
martinus added a commit to martinus/bitcoin that referenced this pull request Sep 20, 2021
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop
based on epochIterations. That does not do what it is supposed to do,
because epochIterations() is determined automatically from nanobench.

Also, multiplication is not needed for the algorithm (as pointed out by
a comment in bitcoin#19055), so it's better to remove this loop.
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 21, 2021
Summary:
PR description:
> This is the second in a series of pull requests to implement an Index for UTXO set statistics.
>
> This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version.

> test: Move modinv to util and add unit test
bitcoin/bitcoin@ab30cec

> test: Add Python MuHash3072 implementation to test framework
bitcoin/bitcoin@b85543c

> test: Add basic Python/C++ Muhash implementation parity unit test
bitcoin/bitcoin@0e2b400

> test: Add chacha20 test vectors in muhash
bitcoin/bitcoin@36ec980

This is a backport of [[bitcoin/bitcoin#19105 | core#19105]]

Backport note: the C++ implementation will be merged much later ([[bitcoin/bitcoin#19055 | core#19055]])

Test Plan: `ninja check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D10166
martinus added a commit to martinus/bitcoin that referenced this pull request Sep 21, 2021
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop
based on epochIterations. That does not do what it is supposed to do,
because epochIterations() is determined automatically from nanobench.

Also, multiplication is not needed for the algorithm (as pointed out by
a comment in bitcoin#19055), so it's better to remove this loop.
rebroad pushed a commit to rebroad/bitcoin that referenced this pull request Oct 13, 2021
Introduced in bitcoin#19055, MuHashDiv benchmark used to multiply with a loop
based on epochIterations. That does not do what it is supposed to do,
because epochIterations() is determined automatically from nanobench.

Also, multiplication is not needed for the algorithm (as pointed out by
a comment in bitcoin#19055), so it's better to remove this loop.
kwvg added a commit to kwvg/dash that referenced this pull request Feb 26, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Feb 26, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Feb 26, 2022
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Mar 14, 2022
Summary:
General context (see core#18000):

This is a step in the way of implementing an index of coin statistics with the goal of making the response time of the `gettxoutsetinfo` RPC call dramatically faster. Currently, this RPC is scanning the full UTXO set every time it is called which makes it hard to use for users that want to continually check the coin supply or compare UTXO set hashes between different nodes. It is especially challenging in periods of multiple quickly mined blocks, even relatively fast machines.

> build: Check for 128 bit integer support
bitcoin/bitcoin@2474645

> crypto: Add Num3072 implementation
>
> Num3072 is a specialized bignum implementation used in MuHash3072.
bitcoin/bitcoin@0b4d290

> crypto: Add MuHash3072 implementation
bitcoin/bitcoin@adc708c

> test: Add MuHash3072 unit tests
bitcoin/bitcoin@7b12422

> test: Change MuHash Python implementation to match cpp version again
bitcoin/bitcoin@9815332

Co-authored-by: Pieter Wuille <pieter.wuille@gmail.com>
Co-authored-by: Anthony Towns <aj@erisian.com.au>

This is a partial backport of [[bitcoin/bitcoin#19055 | core#19055]]

Test Plan: `ninja check check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Subscribers: Fabien

Differential Revision: https://reviews.bitcoinabc.org/D11150
kwvg added a commit to kwvg/dash that referenced this pull request Apr 3, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Apr 20, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Apr 24, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Apr 27, 2022
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.