core, eth, trie: streaming GC for the trie cache #16810

Merged
merged 2 commits into ethereum:master on Jun 4, 2018

Conversation

@karalabe
Member

karalabe commented May 25, 2018

This PR tries to fine-tune the initial trie pruning work introduced in Geth 1.8.0.

The current version of pruning works by accumulating trie writes in an intermediate in-memory database layer, which also tracks the reference relationships between nodes. This in-memory layer retains a window of 128 recent tries (to aid fast sync); whenever a trie gets old enough, it is dereferenced and any dangling nodes are garbage collected. This process repeats until either a memory limit or a time limit is reached, at which point an entire trie is flushed to disk. For more details, please see #15857.
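As a rough illustration only, here is a minimal sketch of that reference-counting idea, with hypothetical names and string hashes; it is a simplification, not the actual trie.Database API:

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a cached trie node.
type node struct {
	blob     []byte
	children []string // hashes of child nodes
	refs     int      // number of live parents/tries referencing this node
}

type memLayer struct {
	nodes map[string]*node
}

// dereference drops one reference from a node; if it becomes dangling,
// it is garbage collected together with any children that become
// dangling as a result.
func (m *memLayer) dereference(hash string) {
	n, ok := m.nodes[hash]
	if !ok {
		return // already on disk or collected
	}
	if n.refs--; n.refs > 0 {
		return // still referenced by a newer trie
	}
	delete(m.nodes, hash) // node turned into junk, collect it
	for _, child := range n.children {
		m.dereference(child) // cascade into the subtrie
	}
}

func main() {
	m := &memLayer{nodes: map[string]*node{
		"leaf": {refs: 1},
		"root": {children: []string{"leaf"}, refs: 1},
	}}
	m.dereference("root")     // the old trie ages out of the 128-trie window
	fmt.Println(len(m.nodes)) // 0: both nodes were dangling and got collected
}
```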

After running the current pruning code for a few months, we can observe a few rough edges that make it suboptimal. At the chain head, processing blocks takes a significant amount of time, and the baked-in 5 minute flush window is too small to meaningfully collect enough garbage. This results in flushes like `Persisted trie from memory database nodes=779937 size=224.01mB time=6.34419045s gcnodes=1631823 gcsize=701.68mB gctime=6.624533035s livenodes=301196 livesize=114.13mB`, where we push 225MB of data to disk out of 725MB total. This causes far too much database growth, which acts as a vicious cycle, growing the chain ever faster. The 5 minute window was meant as a sanity fallback against crashes, but it's just too wasteful. This PR bumps the sanity write window to 1 hour. Yes, that's a lot more painful if Geth crashes, but keeping the chain size under control is more important than catering to unstable environments.

Although the above is a theoretical solution to disk growth, it does not, alone, work in practice. Bumping the flush timeout from 5 minutes to 1 hour would, on mainnet, still result in flushes every 10 minutes or so due to hitting the permitted memory limit (Geth by default runs with a 256MB trie cache limit, 25% of --cache).

The second important observation about trie pruning is that most of the trie is junk, or rather will become junk after some number of blocks. However, the longer a trie node goes without becoming junk, the higher the probability that it never will (never = rarely active pathway). When the current trie pruning code reaches its memory limit, it blindly flushes nodes (an entire recent trie) to disk, most of which we know will end up as junk very fast. But there is no reason to flush an entire trie vs. arbitrary nodes; we just need to flush something. By tracking the age of nodes within the trie cache, we can free up memory by flushing the oldest nodes to disk. This should significantly reduce disk junk, since a node can only end up on disk if it has been actively referenced for a very long time, making it very unlikely to become junk in the near future. Memory-cap wise we still enforce the same limits, we just pick and choose what to write in a more meaningful way.

This "tracking by age" is a bit more involved, as a single timestamp field would make it hard to quickly find nodes to flush: we are constantly adding and removing nodes. The fastest data structure to handle this would be a heap, which is still log(n) complexity, where N is the number of live nodes (1M+ on mainnet). This PR implements this age tracking with a doubly linked list representing the age order, each item of the list being also inserted into a map. The linked list permits O(1) iteration complexity to find and flush the next node; and also O(1) complexity to add/remove a node. The map part ensures that when we're deleting a node, we can find it in the linked list in O(1) too.

The last piece of the puzzle is building the flush-list in the first place, since the doubly linked list needs to retain the invariant that writing a node to disk entails all its children already being present on disk. This invariant is already satisfied by the trie cache insertion order: we always push a child into the cache first and the parent after (since the parent needs the hash of the child). As such, if we create the flush-list in node insertion order, the flush order also retains the child-first, parent-after storage. I.e. the complexity of insertion is also O(1).
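Continuing the sketch above, insertion appends at the tail and capping pops from the head, so the on-disk order stays child-first (again, names and signatures are illustrative only):

```go
// insert appends a freshly written node at the tail of the flush-list.
// Children are always inserted before their parents (a parent embeds its
// children's hashes), so tail-insertion preserves child-first order.
func (c *trieCache) insert(hash string, blob []byte) {
	n := &cachedNode{blob: blob, flushPrev: c.newest}
	if c.oldest == "" {
		c.oldest = hash // very first node in the cache
	} else {
		c.nodes[c.newest].flushNext = hash
	}
	c.newest = hash
	c.nodes[hash] = n
}

// cap streams nodes to disk oldest-first until the cache fits the limit.
// Popping from the head can never write a parent before its children,
// since parents always sit closer to the tail than their children.
func (c *trieCache) cap(limit int, disk map[string][]byte) {
	size := 0
	for _, n := range c.nodes { // simplified; real code tracks this incrementally
		size += len(n.blob)
	}
	for size > limit && c.oldest != "" {
		old := c.nodes[c.oldest]
		disk[c.oldest] = old.blob // persist, then drop from the cache
		size -= len(old.blob)
		c.remove(c.oldest)
	}
}
```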

Memory-complexity wise the cache retains its current O(n) footprint: previously the total size was n * common.HashLength + SUM(blobs), and now it's n * 3 * common.HashLength + SUM(blobs).

This PR should also fix #16674, at least for the general flushes. The slowdown will still be felt during the "hourly" snapshot flushes.


Stats at block 4.8M:

|        | Import Time | Datadir | Disk Reads | Disk Writes |
|--------|-------------|---------|------------|-------------|
| master | 40h 25m     | 105.5GB | 10.75TB    | 9.3TB       |
| PR     | 34h 55m     | 74.5GB  | 6.16TB     | 5.6TB       |

Stats at chain head (5.7M):

|        | Import Time | Datadir | Disk Reads | Disk Writes |
|--------|-------------|---------|------------|-------------|
| master | 100h        | 267GB   | 27.4TB     | 20.7TB      |
| PR     | 86h         | 190GB   | 21.8TB     | 17.2TB      |

@karalabe karalabe requested a review from holiman as a code owner May 25, 2018

In trie/database.go:

```go
delete(db.nodes, db.oldest)
db.oldest = node.flushNext
db.nodesSize -= common.StorageSize(common.HashLength + len(node.blob))
```

@holiman

holiman May 28, 2018

Contributor

Earlier, you're using `common.StorageSize(3*common.HashLength + len(node.blob))` (3*), but here a different formula. Why?

@karalabe

karalabe May 29, 2018

Member

Damn it. Originally I tracked the size differently. Apparently I didn't update all the paths... so the cache is a bit f-ed now.

@karalabe

karalabe May 29, 2018

Member

Ah, no, it's fine. But I definitely need to document this.

The size of the cache is defined as nodesSize + flushlistSize. The nodesSize covers the map from hashes to blobs (nodesSize = SUM(common.Hash + len(blob))). The flush-list is a doubly linked list maintained inside the nodes map, so it adds a prevHash and a nextHash per node, i.e. SUM(2 * common.Hash).

The net total weight of a tracked node is thus 3*common.Hash + len(blob).

The reason I'm tracking the size of the nodes and the flush-list separately is to allow printing the correct GC amount when logging. The flush-list is just metadata: it is counted to limit memory usage, but it should not be counted when reporting the amount of data not yet written to disk.

I need to document this better.
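For illustration, a tiny sketch of that split accounting, with assumed names:

```go
const hashLen = 32 // common.HashLength

// accountInsert shows the bookkeeping on node insertion: payload and
// flush-list metadata are tracked separately. The memory cap checks
// *nodesSize + *flushlistSize; GC/flush logging reports *nodesSize only,
// since the list metadata never hits disk.
func accountInsert(blob []byte, nodesSize, flushlistSize *int) {
	*nodesSize += hashLen + len(blob) // map key + node blob
	*flushlistSize += 2 * hashLen     // flushPrev + flushNext hashes
	// net total weight of the node: 3*hashLen + len(blob)
}
```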

@karalabe karalabe changed the title from [WIP] core, eth, trie: streaming GC for the trie cache to core, eth, trie: streaming GC for the trie cache May 29, 2018

@karalabe karalabe requested a review from fjl May 30, 2018

@holiman

Generally looks good to me, with some questions/comments

```diff
@@ -47,7 +47,7 @@ var DefaultConfig = Config{
 	LightPeers:    100,
 	DatabaseCache: 768,
 	TrieCache:     256,
-	TrieTimeout:   5 * time.Minute,
+	TrieTimeout:   60 * time.Minute,
```

@holiman

holiman May 31, 2018

Contributor

In blockchain.go's NewBlockChain, some other (hardcoded) defaults are used in case cacheConfig is nil. Perhaps they should use this value instead?

@karalabe

karalabe May 31, 2018

Member

Hmm, I thought they would be used.

```go
	nodes, imgs = triedb.Size()
	limit       = common.StorageSize(bc.cacheConfig.TrieNodeLimit) * 1024 * 1024
)
if nodes > limit || imgs > 4*1024*1024 {
```

@holiman

holiman May 31, 2018

Contributor

What's the 4*1024*1024 limit for imgs, and why is that not a named parameter like the other settings?

@karalabe

karalabe May 31, 2018

Member

This is an arbitrary number really; it just needs to be small enough not to be bothersome, and large enough to dedup data.

```go
	limit = common.StorageSize(bc.cacheConfig.TrieNodeLimit) * 1024 * 1024
)
if nodes > limit || imgs > 4*1024*1024 {
	triedb.Cap(limit - ethdb.IdealBatchSize)
```

@holiman

holiman May 31, 2018

Contributor

This will tell the triedb to Cap at 256M - 100k. What's the reason to remove 100k from the limit?

@karalabe

karalabe May 31, 2018

Member

Hysteresis. If we cap at the limit, we will keep writing out 32-byte blobs. This way, when we go over the limit, we push out 100KB, and then have a bit of buffer to accumulate data before flushing again. At least that's the theory; in practice, it might be interesting to see how much we overflow the limit.
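A generic sketch of this pattern, with hypothetical helper names (only ethdb.IdealBatchSize is from the actual code):

```go
// maybeFlush triggers at the full limit but caps down a little further,
// so the following blocks have headroom to accumulate garbage instead of
// dribbling out tiny blobs on every block.
func maybeFlush(cacheSize, limit, headroom uint64, capTo func(target uint64)) {
	if cacheSize > limit {
		capTo(limit - headroom) // e.g. headroom = ethdb.IdealBatchSize (100KB)
	}
}
```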

```go
	}
	db.flushnodes += uint64(nodes - len(db.nodes))
	db.flushsize += storage - db.nodesSize
	db.flushtime += time.Since(start)
```

@holiman

holiman May 31, 2018

Contributor

How about exposing the flush stats as metrics? Wouldn't they be pretty useful to have on our graphs?

@karalabe

karalabe May 31, 2018

Member

Added them.
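For reference, one plausible shape for such meters using go-ethereum's metrics package; the metric names below are assumptions, not necessarily the ones the PR added:

```go
import "github.com/ethereum/go-ethereum/metrics"

var (
	// assumed names; registered on the default registry
	flushNodesMeter = metrics.NewRegisteredMeter("trie/memcache/flush/nodes", nil)
	flushSizeMeter  = metrics.NewRegisteredMeter("trie/memcache/flush/size", nil)
)

// After a flush pass, the deltas computed above would feed the meters:
//	flushNodesMeter.Mark(int64(flushedNodes))
//	flushSizeMeter.Mark(int64(flushedBytes))
```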

@karalabe karalabe added this to the 1.8.10 milestone May 31, 2018

```diff
@@ -47,7 +47,7 @@ var DefaultConfig = Config{
 	LightPeers:    100,
```

@jebek29

jebek29 May 31, 2018

```go
LightPeers:    100,
DatabaseCache: 768,
TrieCache:     256,
TrieTimeout:   60 * time.Minute,
```

@karalabe karalabe merged commit 143c434 into ethereum:master Jun 4, 2018

1 of 3 checks passed:

- continuous-integration/travis-ci/pr: The Travis CI build could not complete due to an error
- continuous-integration/appveyor/pr: AppVeyor build failed
- commit-message-check/gitcop: All commit messages are valid

@karalabe karalabe modified the milestones: 1.8.10, 1.8.11 Jun 11, 2018

jpeletier pushed a commit to epiclabs-io/go-ethereum that referenced this pull request Jun 14, 2018

core, eth, trie: streaming GC for the trie cache (#16810)
* core, eth, trie: streaming GC for the trie cache

* trie: track memcache statistics

kielbarry added a commit to kielbarry/go-ethereum that referenced this pull request Jul 9, 2018

core, eth, trie: streaming GC for the trie cache (#16810)
* core, eth, trie: streaming GC for the trie cache

* trie: track memcache statistics

firmianavan added a commit to firmianavan/go-ethereum that referenced this pull request Aug 28, 2018

core, eth, trie: streaming GC for the trie cache (#16810)
* core, eth, trie: streaming GC for the trie cache

* trie: track memcache statistics