
Faster Input Deduplication Algorithm #14387

Closed
JeremyRubin wants to merge 8 commits from JeremyRubin:faster-dedup

Conversation

@JeremyRubin
Contributor

commented Oct 4, 2018

This PR introduces a faster input deduplication algorithm.

In the first commit, I introduce a worst case duplicate input benchmark.

The existing code achieves the following performance.

DuplicateInputs, 5, 10, 0.710982, 0.0140756, 0.0143986, 0.0141852

In the next commit, I introduce a probabilistic checker which is much faster. Its analysis and use are documented extensively in the code comments. It achieves the following performance on the same benchmark.

DuplicateInputs, 5, 10, 0.166576, 0.00329314, 0.0034048, 0.0033221

This is about 4.3X faster.

Future work can further improve this by moving the checks and hashing to a background thread.

This does introduce one behavior change in which DoS error gets reported for transactions that have both duplicate and null inputs; the first error found will be the one reported.
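
For orientation, here is a minimal self-contained sketch of this style of probabilistic duplicate check -- not the PR's actual code. The 2^21-bit table and 8 probes follow the parameter discussion later in this thread; the byte-mixing hash is a stand-in for the salted SipHash the PR uses, and Prevout is an illustrative alias rather than a Bitcoin Core type.

```cpp
// Illustrative sketch only: bloom-filter-style duplicate-input detection with
// an exact fallback when the filter reports a possible duplicate.
#include <array>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using Prevout = std::pair<std::array<uint8_t, 32>, uint32_t>; // (txid, output index)

bool HasDuplicateInputs(const std::vector<Prevout>& prevouts)
{
    constexpr uint64_t TABLE_BITS = uint64_t{1} << 21; // 2^21-bit table (256 KiB)
    constexpr int NUM_PROBES = 8;                      // 8 salted probes per input
    std::vector<uint64_t> table(TABLE_BITS / 64, 0);

    for (const Prevout& p : prevouts) {
        // Stand-in mix for the salted SipHash of the prevout.
        uint64_t h = 14695981039346656037ULL ^ p.second;
        for (uint8_t b : p.first) h = (h ^ b) * 1099511628211ULL;

        bool all_set = true;
        for (int i = 0; i < NUM_PROBES; ++i) {
            // Advance a stand-in generator to get the i-th probe position.
            h = h * 6364136223846793005ULL + 1442695040888963407ULL;
            const uint64_t bit = h % TABLE_BITS;
            const uint64_t mask = uint64_t{1} << (bit & 63);
            all_set = all_set && (table[bit >> 6] & mask) != 0;
            table[bit >> 6] |= mask;
        }
        if (all_set) {
            // Either a real duplicate or a false positive: run the exact (slow) check.
            std::set<Prevout> seen(prevouts.begin(), prevouts.end());
            return seen.size() != prevouts.size();
        }
    }
    return false;
}
```

A filter hit only triggers the exact fallback, so a false positive costs time rather than correctness.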

@fanquake fanquake added the Consensus label Oct 4, 2018

@JeremyRubin JeremyRubin closed this Oct 4, 2018

@JeremyRubin JeremyRubin reopened this Oct 4, 2018

src/bench/duplicate_inputs.cpp Outdated
block.vtx.push_back(MakeTransactionRef(std::move(naughtyTx)));

block.hashMerkleRoot = BlockMerkleRoot(block);

@MarcoFalke

MarcoFalke Oct 4, 2018

Member

nit: some trailing white space here?

Also, I removed all comments about the linter failure, because they are just distracting from the actual pull request.

@bitcoin bitcoin deleted a comment from JeremyRubin Oct 4, 2018

@bitcoin bitcoin deleted a comment from promag Oct 4, 2018

@bitcoin bitcoin deleted a comment from jamesob Oct 4, 2018

@instagibbs

Member

commented Oct 4, 2018

Can you post "typical" case benchmark comparisons?

@DrahtBot

Contributor

commented Oct 4, 2018

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #14400 (Add Benchmark to test input de-duplication worst case by JeremyRubin)
  • #14397 (Faster duplicate input check in CheckTransaction (alternative to #14387) by sipa)
  • #14074 (Use std::unordered_set instead of set in blockfilter interface by jimpo)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@practicalswift

Member

commented Oct 4, 2018

@JeremyRubin Impressive speedup! What is the risk-reward trade-off we're facing with this change? More specifically: what risks do you see associated with this change to consensus critical code?

Does the change in which DoS error gets reported for transactions that have both duplicate and null inputs have any consequences or impose any risks?

src/consensus/tx_verify.cpp Outdated
// can be expensive, and the table will operate acceptably for all of the
// transactions in a given block.
//
// Then, we iterate through the elements one by one, generating 8 salted

@leishman

leishman Oct 4, 2018

Contributor

Why 8?

@JeremyRubin

JeremyRubin Oct 4, 2018

Author Contributor

It's my favorite number!

It's to get the expected work below one for the worst case as shown in the analysis.

We could add a 9th hash (we have more bits in the hash computed) but every hash adds more memory accesses in our table.

We could also remove a hash or two and the EV would be less than 10, which is probably acceptable too. At about 4 or 5 hashes it blows up a bit more, to an "unacceptable" point (EV of 50 to 500 comparisons).

Solve for x such that:

Sum i = 0 to 24390 [ i*( i*x / 2**21)**x ] < 1

It's also possible to modify the algorithm such that if a false positive is hit, you do the current set-based algorithm up to the conflict. I'm not sure how to analyze that though, and the current code is sufficiently simple, with a low enough probability of an expensive scan, that we don't care that much.
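
As an illustration (not code from the PR), the following stand-alone sketch evaluates that sum for each candidate hash count, assuming the 24390-input worst case and the 2**21-bit table from the expression above:

```cpp
// Illustrative only: evaluate the expected-work sum above for candidate
// numbers of salted hashes x, assuming a 2^21-bit table and a worst case
// of 24390 inputs.
#include <cmath>
#include <cstdio>

int main()
{
    const double table_bits = std::pow(2.0, 21);
    const int worst_case_inputs = 24390;
    for (int x = 1; x <= 9; ++x) {
        double expected_work = 0.0;
        for (int i = 0; i < worst_case_inputs; ++i) {
            expected_work += i * std::pow(i * x / table_bits, x);
        }
        std::printf("hashes=%d expected work=%g\n", x, expected_work);
    }
    return 0;
}
```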

@JeremyRubin

JeremyRubin Oct 5, 2018

Author Contributor

More precisely, it's one of two choices (8 or 9 hashes) with 21 bits that gets the expected work below 1.

Extracting 9 hashes from three 64-bit integers is a bit more complex code-wise, but doable.

sorted((sum(i*(float(i)*x / 2.0**y)**x for i in xrange(24930)) if (x*y <= 192) else ('Inf', 0, 0), y, x) for y in xrange(1,22) for x in xrange(1,20))[:10]

[(0.10374566662377155, 21, 9), (0.4157347268068221, 21, 8), (1.9074647424172138, 21, 7), (10.226961125517702, 21, 6), (53.11778131137103, 20, 9), (65.8563129753341, 21, 5), (106.42809006254646, 20, 8), (244.15548702940336, 20, 7), (529.481130078109, 21, 4), (654.5255120331329, 20, 6)]

Another option would be to increase the number of hashes to 16 and then use a 20-bit table, requiring a 320-bit hash. This makes the expected work about 7 comparisons in the worst case, but makes the table half as large, which reduces the constant bloat.

@leishman

leishman Oct 5, 2018

Contributor

Ok cool. Thanks for the analysis. 8 seems like a pretty reasonable choice. I left a comment above, but was wondering how this bloom filter compares to native unordered set implementations in the stdlib.

@promag

Member

commented Oct 4, 2018

Anyone measured -reindex with and without this change?

src/consensus/tx_verify.cpp Outdated
// to check for collisions efficiently.
//
// This is faster than the naive construction, using a set, which
// requires more allocation and comparison of uint256s.

@leishman

leishman Oct 5, 2018

Contributor

How does your implementation compare to using std::unordered_set?

@JeremyRubin

JeremyRubin Oct 5, 2018

Author Contributor

See #14397 for a simpler version that just does the obvious thing.

For std::unordered_set, I'm clocking much worse performance for DeserializeAndCheckBlockTest and 2x worse performance for DuplicateInputs.
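
For comparison, a rough sketch of the std::unordered_set baseline under discussion might look like the following (illustrative only; Prevout and PrevoutHash are stand-ins, not Bitcoin Core's types or its SaltedOutpointHasher):

```cpp
// Illustrative sketch of the "naive" duplicate-input check using std::unordered_set.
#include <array>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

using Prevout = std::pair<std::array<uint8_t, 32>, uint32_t>; // (txid, output index)

struct PrevoutHash {
    size_t operator()(const Prevout& p) const
    {
        size_t h = p.second;
        for (uint8_t b : p.first) h = h * 131 + b; // simple byte mix for illustration
        return h;
    }
};

bool HasDuplicateInputsNaive(const std::vector<Prevout>& prevouts)
{
    std::unordered_set<Prevout, PrevoutHash> seen;
    seen.reserve(prevouts.size());
    for (const Prevout& p : prevouts) {
        if (!seen.insert(p).second) return true; // insert fails on a duplicate
    }
    return false;
}
```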

@sipa

sipa Oct 5, 2018

Member

@leishman I tried a few alternatives:

  • Master: 13.6 ms
  • #14397: 6.3 ms
  • Using a sorted vector with SipHash'ed prevouts: 3.7 ms
  • This PR: 2.7 ms

@JeremyRubin

JeremyRubin Oct 5, 2018

Author Contributor

I like sorting the siphash'd prevouts -- I'm guessing you then do the expensive check if that collides? Or are you tracking the pointers when you sort too?

@sipa

sipa Oct 5, 2018

Member

@JeremyRubin Yeah, just delay the expensive check until after the cheap check fails. I haven't PR'ed that because I only have a PoC, and I don't want to overload reviewers with a series of PRs without even knowing whether we want to increase complexity here at all.

@JeremyRubin

JeremyRubin Oct 5, 2018

Author Contributor

FYI, for the expensive check you can still do it in O(n) per colliding entry, which is cheaper than doing the full O(n log n) check, given that we don't expect colliding entries without a duplicate.
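
A sketch of the sorted-hash approach with the delayed exact check described above (illustrative only; CheapHash stands in for the SipHash of a prevout, and the fallback shown is the simple set-based check rather than the O(n)-per-colliding-entry refinement):

```cpp
// Illustrative sketch: sort 64-bit hashes of the prevouts, look for adjacent
// equal values, and only then fall back to an exact duplicate check.
#include <algorithm>
#include <array>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using Prevout = std::pair<std::array<uint8_t, 32>, uint32_t>; // (txid, output index)

static uint64_t CheapHash(const Prevout& p) // stand-in for SipHash(prevout)
{
    uint64_t h = p.second;
    for (uint8_t b : p.first) h = (h ^ b) * 1099511628211ULL;
    return h;
}

bool HasDuplicateInputsSorted(const std::vector<Prevout>& prevouts)
{
    std::vector<uint64_t> hashes;
    hashes.reserve(prevouts.size());
    for (const Prevout& p : prevouts) hashes.push_back(CheapHash(p));
    std::sort(hashes.begin(), hashes.end());

    // Adjacent equal hashes mean either a real duplicate or a 64-bit collision.
    if (std::adjacent_find(hashes.begin(), hashes.end()) == hashes.end()) return false;

    // Cheap check failed: fall back to an exact comparison of the prevouts themselves.
    std::set<Prevout> seen(prevouts.begin(), prevouts.end());
    return seen.size() != prevouts.size();
}
```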

@jnewbery

Member

commented Oct 5, 2018

My immediate reaction is that this seems very complex compared to a naive std::set comparison! This also pulls our SipHash implementation into consensus-critical code, which seems like a big price to pay for a performance win. I lean pretty strongly towards concept NACK.

@JeremyRubin - your PR description talks about what this PR does, but not why. This makes block propagation faster, but do we have an understanding of how much these milliseconds matter? Is there a way we can determine whether the increased complexity introduced is a reasonable price to pay for the performance win?

Also, +1 to @MarcoFalke's comment here: #14387 (comment). Can reviewers please not start nitting code before there's been a concept discussion.

@JeremyRubin

Contributor Author

commented Oct 5, 2018

The goal was to minimize the performance regression caused by the CVE fix; I understand this is sensitive code for that reason. This code is also theoretically useful in several other contexts because it is O(N). An adapted version (with different parameters) could be used to check for duplicate inputs across a large number of txns (e.g., in a mempool syncing context).

It's actually not thaaat complicated; it's basically just a bloom filter. The complexity is also mostly in the performance; the correctness is fairly easy to check.

I don't know if the performance win is worth it. I'll leave that for others to determine. Just putting it out there.

@practicalswift

Member

commented Oct 5, 2018

@JeremyRubin It is sufficiently complicated to introduce undefined behaviour in consensus-critical code without any of the reviewers noticing :-)

I'm afraid the code as it is currently formulated will trigger undefined behaviour due to shift exponents being too large.

@JeremyRubin

Contributor Author

commented Oct 5, 2018

@practicalswift I think I fixed that -- can you confirm? (And also a copy-paste error on which bit was being set :( )

@practicalswift

Member

commented Oct 5, 2018

@gmaxwell

Member

commented Oct 5, 2018

I feel like this is too much review work vs the gain.

@JeremyRubin JeremyRubin force-pushed the JeremyRubin:faster-dedup branch to 1f24773 Oct 6, 2018

@JeremyRubin

Contributor Author

commented Oct 6, 2018

@gmaxwell @sipa please re-review the updated version (should you have time). @practicalswift I think I have eliminated the UB; if not, please let me know.

In this version I have kept complexity limited in scope to validation.cpp.

Performance-wise, this version is actually a bit better in the worst case (DuplicateInputs) than using the filter per-transaction, and better than master in the average case (DeserializeAndCheckBlockTest).

The simpler put-everything-in-a-vector, sort, then find-duplicates algorithm could be used here too.

The major benefit of this approach (as amended) is that we detect duplicate inputs not only per transaction, but across the entire block at the same time. This guarantees we won't see an in-block double spend in ConnectBlock and CheckTxInputs. This might enable us to parallelize checking that inputs exist (against a cache that tracks the index at which each output created in that block was created).

@gmaxwell

Member

commented Oct 7, 2018

The major benefit of this approach (as amended) is that we detect duplicate inputs not only per transaction, but across the entire block at the same time.

One can't do that without losing the ability to cache the check as part of the wtxid validity caching, the way the simpler check can be.

I am failing to see the argument for a gain here that even justifies reviewing the change at all.

@JeremyRubin

Contributor Author

commented Oct 9, 2018

@gmaxwell That's not accurate -- by fixing the salt for the hash (which should be safe -- I will consider this more closely), you could store three uint64_t's per input in the wtxid validity cache and then re-insert those on the forward pass of the algorithm. To be clear, this just saves sip-hashing at the expense of memory, but you can just keep a table of precomputed sip-hashes for the inputs in the top of the mempool if it's actually an issue. Once you have the hashes, the check itself is very inexpensive... At the very least, such a cache could be configurable.

By checking (without inserting) that none of the outputs (not txids) created by a transaction are already in the table as we scan, we ensure that none of the inputs spent are created later in the block. This can also be done with an additional backwards pass using a new table that tracks only TXIDs and requires no hashing (for tx in txs.reverse(): insert_table(txid); for input in tx.inputs(): check_table(input.coutpoint.hash)). The overall expected extra work for either of these approaches is around 2x, which is reasonable with the current parameters. With this check completed, it would be possible to apply all the transactions in a block out of order. Without removing any checks or adding parallelization, this should make much of the code after CheckBlock (e.g., ConnectBlock) less fragile, because we would never reach it for a block whose transactions are out of dependency order (which would otherwise force us to abort partially applied transactions).
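
A simplified sketch of that backwards pass (illustrative types, with a std::set in place of the probabilistic table purely for clarity):

```cpp
// Illustrative sketch: detect any input that spends an output created later
// in (or at the same position of) the block, by scanning transactions in reverse.
#include <array>
#include <cstdint>
#include <set>
#include <vector>

using TxId = std::array<uint8_t, 32>;

struct TxIn { TxId prevout_txid; uint32_t prevout_n; };
struct Tx  { TxId txid; std::vector<TxIn> vin; };

bool HasOutOfOrderSpend(const std::vector<Tx>& block_txs)
{
    std::set<TxId> later_txids; // txids of the current and all later transactions
    for (auto it = block_txs.rbegin(); it != block_txs.rend(); ++it) {
        later_txids.insert(it->txid);
        for (const TxIn& in : it->vin) {
            if (later_txids.count(in.prevout_txid)) return true;
        }
    }
    return false;
}
```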

I wanted to get just the duplicate checking reviewed and accepted first, and then work on these other projects in the future.

@instagibbs with this current version, it seems to be slightly worse (1.5% median-to-median) on DeserializeAndCheckBlockTest. I don't know whether this is a representative sample of blocks, or whether it tells you anything about the larger performance of validating a block with respect to things like the number of transactions, caches, etc., so I'm hesitant to give too much weight to this one in any case. If you have ideas for how we can write a better benchmark to test this for future work, let's chat about it.

DeserializeAndCheckBlockTest, 5, 160, 12.2812, 0.0153002, 0.0154073, 0.0153456
DeserializeAndCheckBlockTest, 5, 160, 12.5319, 0.0153902, 0.0159464, 0.0155924

Many thanks to all the reviewers who have spent their time reviewing this PR so far; I appreciate that you took the time to review this contribution to Bitcoin.

@DrahtBot

Contributor

commented Nov 6, 2018

Needs rebase

MarcoFalke added a commit to MarcoFalke/bitcoin that referenced this pull request Nov 25, 2018

Merge bitcoin#14400: Add Benchmark to test input de-duplication worst case

e4eee7d Add Benchmark to test input de-duplication worst case (Jeremy Rubin)

Pull request description:

  Because there are now 2 PRs referencing this benchmark commit, we may as well add it independently, as it is worth landing the benchmark even if neither patch is accepted.

  bitcoin#14397
  bitcoin#14387

Tree-SHA512: 4d947323c02297b0d8f5871f9e7cc42488c0e1792a8b10dc174a25f4dd53da8146fd276949a5dbacf4083f0c6a7235cb6f21a8bc35caa499bc2508f8a048b987
@gmaxwell

Member

commented Nov 29, 2018

@jnewbery nothing in this PR is in the typical block propagation case anymore (not since we finally implemented the full BIP152 HB mode forwarding), so the belief that this speeds up block propagation is largely mistaken.

"Goal was to minimize the performance regression caused by the CVE fix." -- there wasn't one, or at least not an interesting one. The speedup to duplicate checking was later superseded by changes to allow blocks to be propagated without validating anything more than hash consistency.

@JeremyRubin

Contributor Author

commented Nov 29, 2018

@gmaxwell see #14837, which supersedes this PR.

In my opinion, you are incorrect that CheckBlock is not latency-sensitive. Certainly there is a large class of users for whom CheckBlock performance is critical (e.g., miners performing full validation before mining a new block, and miners calling testblockvalidity to get a new template).

This also has a non-negligible impact on benchmarks like DeserializeAndCheckBlockTest, which suggests to me that speeding up these checks is important for reindexing, bootstrap, and other activities.

@fanquake

Member

commented Nov 29, 2018

Closing in favour of #14837.

@fanquake fanquake closed this Nov 29, 2018

@gmaxwell

Member

commented Nov 29, 2018

In my opinion, you are incorrect that CheckBlock is not latency-sensitive. Certainly there is a large class of users for whom CheckBlock performance is critical (e.g., miners performing full validation before mining a new block, and miners calling testblockvalidity to get a new template).

When it was on the critical path of propagation, the small delays involved were cumulative across the whole network; as a result, even though any single one was not particularly interesting, the sum total could be. Without that effect, you only get a single one-shot delay. The effect of a one-shot half-millisecond delay on mining is negligible, and effort spent considering that optimization could be better spent on things like getting testblockvalidity out of the critical path -- which would be an order of magnitude larger speedup, likely simpler to review and verify, and would also further moot the newly proposed benefit.

This also has a non-negligible impact on benchmarks like DeserializeAndCheckBlockTest, which suggests to me that speeding up these checks is important for reindexing, bootstrap, and other activities.

That logic is spurious. Microbenchmarks are microbenchmarks. If it had an impact on reindexing/bootstrap it could be measured. If it did make a big impact there that would be an argument in favor of it, but unless prior profiling was erroneous, that isn't possible.

@practicalswift

Member

commented Nov 29, 2018

@JeremyRubin Regarding my comment about UB above: the problem was the following:

uint64_t bit1 = 1<<((std::get<0>(h)) & 63);

... is problematic due to ...

$ cling
[cling]$ #include <cstdint>
[cling]$ 1 << 63
warning: shift count >= width of type [-Wshift-count-overflow]
 1 << 63
   ^  ~~
(int) 0

... which can be contrasted with ...

$ cling
[cling]$ #include <cstdint>
[cling]$ static_cast<uint64_t>(1) << 63
(unsigned long) 9223372036854775808
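
A minimal, compilable illustration of the fix -- promote the left operand to 64 bits before shifting:

```cpp
// Minimal illustration of the shift fix: the left operand must be 64-bit.
#include <cstdint>
#include <cstdio>

int main()
{
    unsigned shift = 63;
    // uint64_t bad = 1 << shift;                     // UB: 1 is an int, and the
    //                                                // shift count reaches the width of int
    uint64_t good = static_cast<uint64_t>(1) << shift; // well-defined 64-bit shift
    std::printf("%llu\n", static_cast<unsigned long long>(good));
    return 0;
}
```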