core, eth, trie: streaming GC for the trie cache #16810
This PR fine-tunes the initial trie pruning work introduced in Geth 1.8.0.
The current version of pruning works by accumulating trie writes in an intermediate in-memory database layer, which also tracks the reference relationships between nodes. This in-memory layer retains a window of 128 tries (to aid fast sync), and whenever a trie gets old enough, it is dereferenced and any dangling nodes are garbage collected. This process is repeated until either a memory limit or a time limit is reached, at which point an entire trie is flushed to disk. For more details, please see #15857.
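To make the reference-tracking scheme above concrete, here is a heavily simplified sketch. All names (`Database`, `cachedNode`, `reference`, `dereference`) are illustrative only, not Geth's actual trie-database API: nodes accumulate in an in-memory layer, each node counts its live parents, and dereferencing an old root recursively collects any nodes that became dangling.

```go
package main

import "fmt"

// cachedNode is a toy stand-in for a trie node held in the in-memory layer.
type cachedNode struct {
	blob     []byte   // encoded node payload (placeholder here)
	parents  int      // number of live references pointing at this node
	children []string // hashes of the child nodes this node references
}

// Database is a toy stand-in for the intermediate in-memory database layer.
type Database struct {
	nodes map[string]*cachedNode
}

func NewDatabase() *Database {
	return &Database{nodes: make(map[string]*cachedNode)}
}

// insert adds a node and bumps the refcount of each of its children.
func (db *Database) insert(hash string, blob []byte, children ...string) {
	db.nodes[hash] = &cachedNode{blob: blob, children: children}
	for _, c := range children {
		db.nodes[c].parents++
	}
}

// reference pins a trie root (e.g. when a new block is imported).
func (db *Database) reference(root string) { db.nodes[root].parents++ }

// dereference unpins a node; if its refcount drops to zero it is deleted
// and its children dereferenced in turn, collecting dangling subtrees.
func (db *Database) dereference(hash string) {
	node, ok := db.nodes[hash]
	if !ok {
		return
	}
	node.parents--
	if node.parents <= 0 {
		delete(db.nodes, hash)
		for _, c := range node.children {
			db.dereference(c)
		}
	}
}

func main() {
	db := NewDatabase()
	db.insert("leaf", []byte{0x01})
	db.insert("root", []byte{0x02}, "leaf")
	db.reference("root")
	fmt.Println("live nodes:", len(db.nodes)) // → live nodes: 2
	db.dereference("root")
	fmt.Println("live nodes:", len(db.nodes)) // → live nodes: 0
}
```

Dropping the reference on an old-enough root is all it takes for its no-longer-shared subtrees to vanish from memory, which is the behavior the window of 128 tries relies on.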
After running the current pruning code for a few months, we can observe a few rough edges that make it suboptimal. At chain head, processing blocks takes a significant amount of time, and the baked-in 5 minute flush window is too small to meaningfully collect enough garbage. This results in flushes like
Although the above is a theoretical solution to disk growth, it does not, alone, work in practice. Bumping the flush timeout from 5 minutes to 1 hour would, on mainnet, result in flushes every 10 minutes or so due to hitting the permitted memory limit (Geth by default runs with a 256MB cache limit (25% of
The second important observation we need to make with trie pruning is that most of the trie is junk; or rather, will become junk after some number of blocks. However, the longer a trie node goes without becoming junk, the higher the probability that it never will (never = a rarely active pathway). When the current trie pruning code reaches its memory limit, it blindly flushes nodes (a recent entire trie) to disk, most of which we know will end up as junk very fast. But there is no reason to flush an entire trie vs. random nodes... we just need to flush something. By tracking the age of nodes within the trie cache, we can free up memory by flushing old nodes to disk. This should significantly reduce disk junk, since a node only ends up on disk if it has been actively referenced for a very long time, making it very unlikely to become junk in the near future. Memory-cap-wise we still enforce the same limits; we just pick and choose what to write in a more meaningful way.
This "tracking by age" is a bit more involved, as a single timestamp field would make it hard to quickly find nodes to flush: we are constantly adding and removing nodes. The obvious data structure to handle this would be a heap, but that is still O(log n) complexity, where n is the number of live nodes (1M+ on mainnet). This PR instead implements age tracking with a doubly linked list representing the age order, with each item of the list also inserted into a map. The linked list permits O(1) iteration to find and flush the next node, and O(1) addition and removal of a node. The map ensures that when we delete a node, we can locate it in the linked list in O(1) too.
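A minimal sketch of that structure, not Geth's actual implementation: a doubly linked list (oldest at the front) paired with a map from node hash to list element, giving O(1) insertion, O(1) removal from anywhere in the list, and O(1) access to the oldest flush candidate.

```go
package main

import (
	"container/list"
	"fmt"
)

// flushList orders cached trie nodes by age for streaming flushes.
type flushList struct {
	order *list.List               // insertion (age) order, oldest first
	index map[string]*list.Element // node hash -> position in the list
}

func newFlushList() *flushList {
	return &flushList{order: list.New(), index: make(map[string]*list.Element)}
}

// insert appends a freshly cached node at the young end: O(1).
func (f *flushList) insert(hash string) {
	f.index[hash] = f.order.PushBack(hash)
}

// remove drops a garbage-collected node from anywhere in the list: O(1)
// thanks to the map lookup.
func (f *flushList) remove(hash string) {
	if elem, ok := f.index[hash]; ok {
		f.order.Remove(elem)
		delete(f.index, hash)
	}
}

// oldest returns the next flush candidate in O(1), or "" if empty.
func (f *flushList) oldest() string {
	if elem := f.order.Front(); elem != nil {
		return elem.Value.(string)
	}
	return ""
}

func main() {
	fl := newFlushList()
	fl.insert("a")
	fl.insert("b")
	fl.insert("c")
	fl.remove("b") // e.g. "b" was dereferenced and garbage collected
	fmt.Println(fl.oldest()) // → a
}
```

(The actual PR embeds the list links directly in the cached node structs rather than using `container/list`, but the complexity argument is the same.)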
The last piece of the puzzle is creating the flush-list in the first place, since the doubly linked list needs to retain the invariant that writing a node to disk requires all its children to already be present on disk. This invariant is already satisfied by the trie cache insertion order, since we always push a child into the cache first and the parent after (the parent needs the hash of the child). As such, if we create the flush-list in node insertion order, the flush order also retains the child-first, parent-after storage order. I.e. the complexity of insertion is also O(1).
Memory complexity wise, the cache still retains its current O(n) complexity, where previously it was
This PR should also fix #16674, at least for the general flushes. The slowdown will still be felt during the "hourly" snapshot flushes.
Stats at block 4.8M:
Stats at chain head (5.7M):
Changed the title from "[WIP] core, eth, trie: streaming GC for the trie cache" to "core, eth, trie: streaming GC for the trie cache" on May 29, 2018.
Generally looks good to me, with some questions/comments