Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
Epoch Mempool #17268
Hansel and Gretel
This Pull Request improves the asymptotic complexity of many of the mempool's algorithms as well as makes them more performant in practice.
In the mempool we have a lot of algorithms which depend on rather computationally expensive state tracking. The state tracking algorithms we use make a lot of algorithms inefficient because we're inserting into various sets that grow with the number of ancestors and descendants, or we may have to iterate multiple times of data we've already seen.
To address these inefficiencies, we can closely limit the maximum possible data we iterate and reject transactions above the limits.
However, a rational miner/user will purchase faster hardware and raise these limits to be able to collect more fee revenue or process more transactions. Further, changes coming from around the ecosystem (lightning, OP_SECURETHEBAG) have critical use cases which benefit when the mempool has fewer limitations.
Rather than use expensive state tracking, we can do something simpler. Like Hansel and Gretel, we can leave breadcrumbs in the mempool as we traverse it, so we can know if we are going somewhere we've already been. Luckily, in bitcoind there are no birds!
These breadcrumbs are a uint64_t epoch per CTxMemPoolEntry. (64 bits is enough that it will never overflow). The mempool also has a counter.
Every time we begin traversing the mempool, and we need some help tracking state, we increment the mempool's counter. Any CTxMemPoolEntry with an epoch less than the MemPools has not yet been touched, and should be traversed and have it's epoch set higher.
Given the one-to-one mapping between CTxMemPool entries and transactions, this is a safe way to cache data.
Using these bread-crumbs allows us to get rid of many of the std::set accumulators in favor of a std::vector as we no longer rely on the de-duplicative properties of std::set.
We can also improve many of the related algorithms to no longer make heavy use of heapstack based processing.
The overall changes are not that big. It's 302 additions and 208 deletions. You could review it all at once, especially for a Concept Ack. But I've carefully done them in small steps to permit careful analysis of the changes. So I'd recommend giving the final state of the PR a read over first to get familiar with the end-goal (it won't take long), and then step through commit-by-commit so you can be convinced of the correctness for more thorough code reviewers.
But is it good in practice? Yes! Basic benchmarks show it being 5% faster on the AssembleBlock benchmark, and 14.5% faster on MempoolEviction, but they aren't that tough. Tougher benchmarks, as described below, show a >2x speedup.
What's the catch? The specific downsides of this approach are twofold:
But for these two specific downsides:
So overall, I felt the potential benefits outweigh the risks.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Added some changes as per suggestion from @TheBlueMatt, which should make reviewing the final state easier.
The epochs are no longer stored on the stack (instead we reference them via the mempool itself) and we use a scoped stack guard in order to ensure that only one part of the code is using epochs at a time. This prevents a called function from creating an epoch guard (via assertion). While there is occasionally use for a multiple outstanding epochs, none are implemented currently. If it comes up in the future it can be special cased.
I just opened up #17292, which adds a motivating benchmark for the Epoch Mempool.
This shows a > 2x speedup on this difficult benchmark (~2.15x across total, min, max, median).
Note: For scientific purity, I created it "single shot blind". That is, I did not modify the test at all after running it on the new branch (I had to run it on master to fix bugs / adjust the expected iters for a second parameter, and then adjust the internal params to get it to be about a second).
This seems like a moderately important area for optimization. One thing that might be worth keeping in mind: It would probably be useful to support a "fast but imprecise" accounting method for the first block template after a new block. This is really easy to do with the current code (just skip updating the costs while inserting transactions), not sure if your work would change that.
[Aside, usually when talking about improving asymptotic complexity people ignore constant factors. You might want to use different language when describing this work to avoid confusing people.]
In theory, I think the common case where confirmation includes an entire disjoint subgraph at once it should be possible to make that linear time (by basically noticing that the entire subgraph is removed, so any cost updates can be skipped). Interestingly: for attack resistance the worst case (permitted by policy) is what matters... but for revenue reasons the average case is more important.
I'm not sure if there are greater wins though from speedups or from latency hiding hacks. e.g. if one keeps a small redundant highest fee only mempool perhaps without CPFP dependency tracking at all, and uses it while the real stuff is updating in the background... then it probably doesn't matter that much if the main mempool is taking tens of seconds.
Thanks for the review @gmaxwell.
I think the "fast but imprecise" accounting method is not affected at all by this change, but someone who has implemented that should verify.
I've also used the terminology correctly RE: asymptotes. Imagine a set of O(N) transactions which each have 1 output, and a single transaction which spends all outputs. Assembling the ancestors (e.g., in CalculateMemPoolAncestors) will iterate over O(N) inputs and make O(N log N) comparisons in set entries. In the new code I've written, you do O(N) work as it's O(1) (rather than O(log N)) to query if we've already included it.
Now, you're correct to note that in the benchmark I gave, it's only a constant factor. This is the purpose of the benchmark; a "believable" transaction workload. As noted above, the O(N^2) work manifests as a result of the structure of that test specifically -- you have N transactions you're inserting into the mempool, and you need to percolate the updated package info up the tree every insert. Thus, you end up making N*N traversals. Thus, short of some sort of magic metadata data structure, any implementation's improvements will manifest lower-bounded by W(N^2).
I can generate some micro benchmarks which show that Epoch Mempool beats out the status quo in asymptotes. I figured it's less interesting than showing it's better in the domain of believable workloads, because while it's easy to see that these algorithms win in the asymptotes, it's less clear that they win on realistic workloads.
Also, some of the "composed" algorithms have two parts -- one half is improved by this PR, and the other half isn't -- yet. For example in addUnchecked, we have two components: 1 builds setParentTransactions and the other calls UpdateParent on everything in setParentTransactions. Normally, building setParentTransactions is O(N log N), but with this update it becomes O(N). However, the internals of UpdateParent are an ordered set, so we gain back the log N factor. However future work can update those data structures to make UpdateParent O(1).
To your point about making subtree detaching O(N), this work actually already does that moslty. removeRecursive was previously O(N log N), with this PR it is now basically O(N) (we skip the metadata updating in removeRecursive's call to RemoveStaged with updateDescendants = false, but removeUnchecked removes n items from a std::set so it regains a log(n) factor, but it's a pretty conceptually easy follow up PR to make txlinksMap a std::unordered_set).
I haven't pushed the changes I was working on w.r.t. changing the std::sets to std::unordered_sets because while conceptually simple, they have a lot more edge case in if it will be better or worse (esp when it comes to memory). So the epoch's seemed like a better one as there's really no downside.
Ah! Point on asympotics accepted. You might also want to make a benchmark that shows that case! I agree that a 'toy' case isn't a useful benchmark, but if you believe your algo is nlogn in some case it's a useful testing point to actually demonstrate that (e.g. and confirm that there isn't a hidden O(N) in an underlying datastructure that breaks your performance).
b0c774b Add new mempool benchmarks for a complex pool (Jeremy Rubin) Pull request description: This PR is related to #17268. It adds a mempool stress test which makes a really big complicated tx graph, and then, similar to mempool_eviction test, trims the size. The test setup is to make 100 original transactions with Rand(10)+2 outputs each. Then, 800 times: we create a new transaction with Rand(10) + 1 parents that are randomly sampled from all existing transactions (with unspent outputs). From each such parent, we then select Rand(remaining outputs) +1 50% of the time, or 1 outputs 50% of the time. Then, we trim the size to 3/4. Then we trim it to just a single transaction. This creates, hopefully, a big bundle of transactions with lots of complex structure, that should really put a strain on the mempool graph algorithms. This ends up testing both the descendant and ancestor tracking. I don't love that the test is "unstable". That is, in order to compare this test to another, you really can't modify any of the internal state because it will have a different order of invocations of the deterministic randomness. However, it certainly suffices for comparing branches. Top commit has no ACKs. Tree-SHA512: cabe96b849b9885878e20eec558915e921d49e6ed1e4b011b22ca191b4c99aa28930a8b963784c9adf78cc8b034a655513f7a0da865e280a1214ae15ebb1d574
b0c774b Add new mempool benchmarks for a complex pool (Jeremy Rubin) Pull request description: This PR is related to bitcoin#17268. It adds a mempool stress test which makes a really big complicated tx graph, and then, similar to mempool_eviction test, trims the size. The test setup is to make 100 original transactions with Rand(10)+2 outputs each. Then, 800 times: we create a new transaction with Rand(10) + 1 parents that are randomly sampled from all existing transactions (with unspent outputs). From each such parent, we then select Rand(remaining outputs) +1 50% of the time, or 1 outputs 50% of the time. Then, we trim the size to 3/4. Then we trim it to just a single transaction. This creates, hopefully, a big bundle of transactions with lots of complex structure, that should really put a strain on the mempool graph algorithms. This ends up testing both the descendant and ancestor tracking. I don't love that the test is "unstable". That is, in order to compare this test to another, you really can't modify any of the internal state because it will have a different order of invocations of the deterministic randomness. However, it certainly suffices for comparing branches. Top commit has no ACKs. Tree-SHA512: cabe96b849b9885878e20eec558915e921d49e6ed1e4b011b22ca191b4c99aa28930a8b963784c9adf78cc8b034a655513f7a0da865e280a1214ae15ebb1d574