Epoch Mempool #17268

Open
JeremyRubin wants to merge 33 commits into master from JeremyRubin:mempool-experiments-2

Conversation

@JeremyRubin
Contributor

JeremyRubin commented Oct 26, 2019

Hansel and Gretel
left bread crumb trails as they went
but birds ate them all.

This Pull Request improves the asymptotic complexity of many of the mempool's algorithms as well as makes them more performant in practice.

The mempool has many algorithms which depend on rather computationally expensive state tracking. That state tracking makes them inefficient: we insert into various sets that grow with the number of ancestors and descendants, or we iterate multiple times over data we've already seen.

To address these inefficiencies, we currently limit the maximum amount of data we iterate and reject transactions above those limits.

However, a rational miner/user will purchase faster hardware and raise these limits to be able to collect more fee revenue or process more transactions. Further, changes coming from around the ecosystem (lightning, OP_SECURETHEBAG) have critical use cases which benefit when the mempool has fewer limitations.

Rather than use expensive state tracking, we can do something simpler. Like Hansel and Gretel, we can leave breadcrumbs in the mempool as we traverse it, so we can know if we are going somewhere we've already been. Luckily, in bitcoind there are no birds!

These breadcrumbs are a uint64_t epoch per CTxMemPoolEntry (64 bits is enough that it will never overflow). The mempool also keeps a uint64_t counter.

Every time we begin a traversal of the mempool that needs state tracking, we increment the mempool's counter. Any CTxMemPoolEntry with an epoch less than the mempool's counter has not yet been touched in this traversal, and should be visited and have its epoch updated to the counter's value.

Given the one-to-one mapping between CTxMemPool entries and transactions, this is a safe way to cache data.
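
To make the mechanism concrete, here is a minimal sketch of the idea, with illustrative names (ExampleMemPool, m_epoch, BeginTraversal, Visit are stand-ins, not necessarily the identifiers used in this branch):

```cpp
// Sketch only: illustrative names, not the exact members in this PR.
#include <cstdint>

struct ExampleMemPoolEntry {
    // ... the usual entry data ...
    mutable uint64_t m_epoch{0}; // last traversal that touched this entry
};

struct ExampleMemPool {
    uint64_t m_epoch{0}; // bumped once per state-tracking traversal

    // Call once at the start of a traversal that needs de-duplication.
    void BeginTraversal() { ++m_epoch; }

    // Returns true the first time an entry is seen during this traversal,
    // false on every later visit; an O(1) check instead of a set lookup.
    bool Visit(const ExampleMemPoolEntry& entry) const {
        if (entry.m_epoch >= m_epoch) return false; // breadcrumb already here
        entry.m_epoch = m_epoch;                    // leave the breadcrumb
        return true;
    }
};
```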

Using these breadcrumbs allows us to replace many of the std::set accumulators with a std::vector, as we no longer rely on std::set for de-duplication.

We can also improve many of the related algorithms so they no longer make heavy use of heap/stack-based processing.
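
For instance, a descendant walk that used to accumulate into a std::set (and recurse) can become roughly the following. This is only a sketch reusing the helpers above, with a hypothetical GetChildren accessor; it is not the PR's actual CalculateDescendants:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical accessor: children of an entry in the mempool graph.
std::vector<const ExampleMemPoolEntry*> GetChildren(const ExampleMemPoolEntry& entry);

// Sketch: collect `root` and all of its descendants without a std::set.
std::vector<const ExampleMemPoolEntry*> CollectDescendants(
    ExampleMemPool& pool, const ExampleMemPoolEntry& root)
{
    pool.BeginTraversal();
    std::vector<const ExampleMemPoolEntry*> result; // doubles as the work queue
    if (pool.Visit(root)) result.push_back(&root);
    for (std::size_t i = 0; i < result.size(); ++i) { // iterative, no recursion
        for (const ExampleMemPoolEntry* child : GetChildren(*result[i])) {
            if (pool.Visit(*child)) result.push_back(child); // epoch de-duplicates
        }
    }
    return result;
}
```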

The overall change is not that big: 302 additions and 208 deletions. You could review it all at once, especially for a Concept ACK, but I've carefully broken it into small steps to permit careful analysis of each change. I'd recommend first giving the final state of the PR a read-over to get familiar with the end goal (it won't take long), and then, for a more thorough review, stepping through commit-by-commit to convince yourself of the correctness.

But is it good in practice? Yes! Basic benchmarks show it being 5% faster on the AssembleBlock benchmark and 14.5% faster on MempoolEviction, but those benchmarks aren't that tough. A tougher benchmark, described below, shows a >2x speedup.

PR #17292 adds a motivating benchmark for the Epoch Mempool:

./src/bench/bench_bitcoin --filter=ComplexMemPool --scaling=5

Before:

# Benchmark, evals, iterations, total, min, max, median
ComplexMemPool, 5, 5, 6.62715, 0.264409, 0.265441, 0.2654

After:

# Benchmark, evals, iterations, total, min, max, median
ComplexMemPool, 5, 5, 3.07116, 0.122246, 0.12351, 0.122783

This shows a > 2x speedup on this difficult benchmark (~2.15x across total, min, max, median).

Note: For scientific purity, I created it "single shot blind". That is, I did not modify the test at all after running it on the new branch (I had to run it on master to fix bugs / adjust the expected iters for a second parameter, and then adjust the internal params to get it to be about a second).

What's the catch? The specific downsides of this approach are twofold:

  1. We need to store an extra 8 bytes per CTxMemPoolEntry
  2. Some of the algorithms are a little bit less obvious & are "less safe" to call concurrently (e.g., from a recursive call, not necessarily in parallel)

But these specific downsides can be mitigated:

  1. The 8 bytes could be reduced to 4 (with code to reset all CTxMemPoolEntries near overflow), but then we would need to more carefully check that we won't overflow the epoch in a recursive call.
  2. The 8 bytes could be a 4-byte index into a table of 8 (or 4, or even fewer) byte epochs plus a back pointer, at the expense of indirection like vTxHashes. Then we would only ever need as many epochs as the things we walk per epoch.
  3. We can replace the epoch with an atomic, which would allow new parallel graph-traversal algorithms to build descendants more quickly (see the sketch after this list).
  4. Some algorithms benefit from recursive epoch use.
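
To illustrate point 3, a thread-safe "claim" on an atomic epoch could look roughly like this. It is only a sketch of the possible extension; nothing in this PR works this way today:

```cpp
#include <atomic>
#include <cstdint>

// Sketch: the first thread to move the entry's epoch forward "wins" the visit.
// Not implemented in this PR; shown only to illustrate the parallel extension.
bool TryClaimVisit(std::atomic<uint64_t>& entry_epoch, uint64_t traversal_epoch)
{
    uint64_t prev = entry_epoch.load();
    while (prev < traversal_epoch) {
        // On failure, prev is reloaded with the current value and we re-check.
        if (entry_epoch.compare_exchange_weak(prev, traversal_epoch)) {
            return true; // this thread gets to process the entry
        }
    }
    return false; // this traversal (or a later one) already marked the entry
}
```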

So overall, I felt the potential benefits outweigh the risks.

What's next?
I've submitted this a bit early for review to gain Concept ACKs on the approach, independent testing, and feedback on testing methodologies, as well as to collect more worst-case edge conditions to put into benchmarks. It's not quite a WIP -- the code passes tests and is "clean" -- but it should receive intense scrutiny before being accepted.

@JeremyRubin

Contributor Author

JeremyRubin commented Oct 26, 2019

Rebased 👍

@DrahtBot

Contributor

DrahtBot commented Oct 26, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #17566 (Switch to weight units for all feerates computation by darosior)
  • #16910 (wallet: reduce loading time by using unordered maps by achow101)
  • #16490 (rpc: Report reason for replaceable txpool transactions by MarcoFalke)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@JeremyRubin JeremyRubin force-pushed the JeremyRubin:mempool-experiments-2 branch from 99f03ec to 7f7f7a2 Oct 27, 2019
Contributor Author

JeremyRubin left a comment

Two cleanups to be addressed momentarily: one better use of the STL and one performance regression fix.

@JeremyRubin

Contributor Author

JeremyRubin commented Oct 28, 2019

Added some changes as per suggestion from @TheBlueMatt, which should make reviewing the final state easier.

The epochs are no longer stored on the stack (instead we reference them via the mempool itself), and we use a scoped guard to ensure that only one part of the code is using epochs at a time. This prevents a called function from creating a nested epoch guard (enforced via assertion). While there is occasionally a use for multiple outstanding epochs, none are implemented currently; if the need comes up in the future it can be special-cased.
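
Roughly, the guard looks something like the sketch below (illustrative only; the class in the branch differs in naming and detail):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: RAII guard that starts a fresh epoch and asserts that no other
// epoch-based traversal is already in progress, so a called function cannot
// silently clobber the caller's breadcrumbs.
class ExampleEpochGuard
{
    uint64_t& m_epoch;
    bool& m_guarded;

public:
    ExampleEpochGuard(uint64_t& epoch, bool& guarded) : m_epoch(epoch), m_guarded(guarded)
    {
        assert(!m_guarded); // nested guards are a programming error
        m_guarded = true;
        ++m_epoch;          // fresh epoch: nothing is marked as visited yet
    }
    ~ExampleEpochGuard() { m_guarded = false; }

    ExampleEpochGuard(const ExampleEpochGuard&) = delete;
    ExampleEpochGuard& operator=(const ExampleEpochGuard&) = delete;
};
```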

@JeremyRubin

Contributor Author

JeremyRubin commented Oct 28, 2019

I just opened up #17292, which adds a motivating benchmark for the Epoch Mempool.


@JeremyRubin

Contributor Author

JeremyRubin commented Oct 29, 2019

I ran the test against master, varying the number-of-transactions parameter. For this test, the 2x improvement seems to hold. Both branches appear fundamentally quadratic, but that may be a reflection of the test setup rather than the actual upper bound.

[benchmark plots: runtime vs. number of transactions, master vs. this branch]

@gmaxwell

Contributor

gmaxwell commented Oct 29, 2019

This seems like a moderately important area for optimization. One thing that might be worth keeping in mind: It would probably be useful to support a "fast but imprecise" accounting method for the first block template after a new block. This is really easy to do with the current code (just skip updating the costs while inserting transactions), not sure if your work would change that.

[Aside, usually when talking about improving asymptotic complexity people ignore constant factors. You might want to use different language when describing this work to avoid confusing people.]

In theory, I think that in the common case where confirmation includes an entire disjoint subgraph at once, it should be possible to make the update linear time (by basically noticing that the entire subgraph is removed, so any cost updates can be skipped). Interestingly, for attack resistance the worst case (permitted by policy) is what matters... but for revenue reasons the average case is more important.

I'm not sure, though, whether the greater wins come from speedups or from latency-hiding hacks. E.g., if one keeps a small, redundant, highest-fee-only mempool, perhaps without CPFP dependency tracking at all, and uses it while the real thing updates in the background... then it probably doesn't matter that much if the main mempool is taking tens of seconds.

@JeremyRubin

Contributor Author

JeremyRubin commented Oct 30, 2019

Thanks for the review @gmaxwell.

I think the "fast but imprecise" accounting method is not affected at all by this change, but someone who has implemented that should verify.

I've also used the terminology correctly re: asymptotics. Imagine a set of O(N) transactions which each have 1 output, and a single transaction which spends all of those outputs. Assembling the ancestors (e.g., in CalculateMemPoolAncestors) will iterate over O(N) inputs and make O(N log N) comparisons on set entries. In the new code I've written, you do O(N) work, as it's O(1) (rather than O(log N)) to query whether we've already included an entry.

Now, you're correct to note that in the benchmark I gave, it's only a constant factor. That is the purpose of the benchmark: a "believable" transaction workload. As noted above, the O(N^2) work is a result of the structure of that test specifically -- you insert N transactions into the mempool, and you need to percolate the updated package info up the tree on every insert, so you end up making N*N traversals. Thus, short of some sort of magic metadata structure, any implementation's running time on this workload is lower-bounded by Ω(N^2).

I can generate some micro benchmarks which show that Epoch Mempool beats out the status quo in asymptotes. I figured it's less interesting than showing it's better in the domain of believable workloads, because while it's easy to see that these algorithms win in the asymptotes, it's less clear that they win on realistic workloads.

Also, some of the "composed" algorithms have two parts -- one half is improved by this PR, and the other half isn't -- yet. For example in addUnchecked, we have two components: 1 builds setParentTransactions and the other calls UpdateParent on everything in setParentTransactions. Normally, building setParentTransactions is O(N log N), but with this update it becomes O(N). However, the internals of UpdateParent are an ordered set, so we gain back the log N factor. However future work can update those data structures to make UpdateParent O(1).

To your point about making subtree detaching O(N), this work actually already mostly does that. removeRecursive was previously O(N log N); with this PR it is now basically O(N). (We skip the metadata updating in removeRecursive's call to RemoveStaged with updateDescendants = false, but removeUnchecked still removes N items from a std::set, so it regains a log N factor; it's a conceptually easy follow-up PR to make txlinksMap a std::unordered_set.)

I haven't pushed the changes I was working on to switch the std::sets to std::unordered_sets because, while conceptually simple, they have a lot more edge cases around whether they will be better or worse (especially when it comes to memory). So the epochs seemed like the better change, as there's really no downside.

@JeremyRubin JeremyRubin force-pushed the JeremyRubin:mempool-experiments-2 branch from 2fb5de2 to 118d1e7 Oct 31, 2019
@DrahtBot DrahtBot removed the Needs rebase label Oct 31, 2019
@gmaxwell

Contributor

gmaxwell commented Oct 31, 2019

Ah! Point on asymptotics accepted. You might also want to make a benchmark that shows that case! I agree that a 'toy' case isn't a useful benchmark, but if you believe your algo is N log N in some case, it's a useful testing point to actually demonstrate that (e.g., to confirm that there isn't a hidden O(N) in an underlying data structure that breaks your performance).

@JeremyRubin JeremyRubin force-pushed the JeremyRubin:mempool-experiments-2 branch from aa62939 to 83a1316 Oct 31, 2019
MarcoFalke added a commit that referenced this pull request Nov 1, 2019
b0c774b Add new mempool benchmarks for a complex pool (Jeremy Rubin)

Pull request description:

  This PR is related to #17268.

  It adds a mempool stress test which makes a really big complicated tx graph, and then, similar to mempool_eviction test, trims the size.

  The test setup is to make 100 original transactions with Rand(10)+2 outputs each.

  Then, 800 times:

  we create a new transaction with Rand(10) + 1 parents that are randomly sampled from all existing transactions (with unspent outputs). From each such parent, we then select Rand(remaining outputs) + 1 outputs 50% of the time, or 1 output 50% of the time.

  Then, we trim the size to 3/4. Then we trim it to just a single transaction.

  This creates, hopefully, a big bundle of transactions with lots of complex structure, that should really put a strain on the mempool graph algorithms.

  This ends up testing both the descendant and ancestor tracking.

  I don't love that the test is "unstable". That is, in order to compare this test to another, you really can't modify any of the internal state because it will have a different order of invocations of the deterministic randomness. However, it certainly suffices for comparing branches.

Top commit has no ACKs.

Tree-SHA512: cabe96b849b9885878e20eec558915e921d49e6ed1e4b011b22ca191b4c99aa28930a8b963784c9adf78cc8b034a655513f7a0da865e280a1214ae15ebb1d574
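
For reference, a rough, self-contained sketch of the graph construction that description is getting at. This is not the bench's actual code: the RNG, the helper names, and the children's output counts are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Stand-in for the bench's deterministic Rand(n): uniform value in [0, n).
struct SketchRng {
    std::mt19937_64 gen{0xdeadbeef};
    uint64_t Rand(uint64_t n) { return n ? gen() % n : 0; }
};

struct SketchTx {
    std::vector<std::size_t> parents; // indices of parent transactions
    uint64_t unspent_outputs{0};      // outputs still available to spend
};

std::vector<SketchTx> BuildComplexGraph()
{
    SketchRng rng;
    std::vector<SketchTx> txs;
    // 100 original transactions with Rand(10) + 2 outputs each.
    for (int i = 0; i < 100; ++i) {
        SketchTx root;
        root.unspent_outputs = rng.Rand(10) + 2;
        txs.push_back(root);
    }
    // 800 times: a child with Rand(10) + 1 randomly sampled parents.
    for (int i = 0; i < 800; ++i) {
        SketchTx child;
        child.unspent_outputs = rng.Rand(10) + 2; // assumption: children also get outputs
        const uint64_t n_parents = rng.Rand(10) + 1;
        for (uint64_t p = 0; p < n_parents; ++p) {
            const std::size_t parent = static_cast<std::size_t>(rng.Rand(txs.size()));
            if (txs[parent].unspent_outputs == 0) continue; // simplification: skip spent parents
            // Spend Rand(remaining) + 1 outputs half the time, otherwise just one.
            const uint64_t spend = (rng.Rand(2) == 0)
                ? rng.Rand(txs[parent].unspent_outputs) + 1
                : 1;
            txs[parent].unspent_outputs -= spend;
            child.parents.push_back(parent);
        }
        txs.push_back(std::move(child));
    }
    return txs;
}
```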
sidhujag added a commit to syscoin/syscoin that referenced this pull request Nov 2, 2019
b0c774b Add new mempool benchmarks for a complex pool (Jeremy Rubin)

@JeremyRubin JeremyRubin force-pushed the JeremyRubin:mempool-experiments-2 branch from 83a1316 to 87bde63 Nov 4, 2019
JeremyRubin added 29 commits Nov 4, 2019
…inary_search on a std::set instead of std::count. std::binary_search has to use std::advance on the passed iterators which leads to search taking O(N log N) instead of O(log N).
updated dependents which required to use it generically.
@JeremyRubin JeremyRubin force-pushed the JeremyRubin:mempool-experiments-2 branch from d688bb9 to dcd6c66 Nov 25, 2019